open-clip

$ npx mdskill add mkurman/zorai/open-clip

Encode images and text into semantic embeddings for zero-shot classification.

  • Converts visual and textual inputs into high-dimensional feature vectors.
  • Depends on PyTorch and the open-clip-torch Python package.
  • Covers model variants that use multi-head attention pooling or the SigLIP loss.
  • Outputs softmax probabilities over the candidate text labels.

---
name: open-clip
description: "OpenCLIP: open-source implementation of CLIP with models trained on LAION-400M, LAION-5B, and DataComp. Supports multi-head attention pooling, SigLIP loss variants, and a wide model zoo (ViT, ConvNeXt, EVA). Community-driven."
tags: [open-clip, multimodal, image-text, laion, zero-shot, embeddings, zorai]
---
## Overview

OpenCLIP is an open-source reimplementation of CLIP, with models trained on LAION-400M, LAION-5B, and DataComp. Its model zoo goes beyond the original CLIP release with larger architectures and newer variants, including ViT-H/14, ConvNeXt, EVA-02, and SigLIP. The training code and model configurations are public, and the training pipeline can be customized.

## Installation

```bash
uv pip install open-clip-torch
```
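
Once the package is installed, `open_clip.list_pretrained()` returns the known `(model_name, pretrained_tag)` pairs, which helps when choosing a checkpoint. The following is a minimal sketch of filtering that list by architecture. The helper `tags_for` is a hypothetical name introduced here, and the guarded import with a one-entry fallback is only there so the sketch also runs where the package is missing.

```python
def tags_for(arch, pairs):
    """Return the sorted pretrained tags available for one architecture name."""
    return sorted(tag for name, tag in pairs if name == arch)

try:
    import open_clip
    # list_pretrained() yields (model_name, pretrained_tag) tuples
    pairs = open_clip.list_pretrained()
except ImportError:
    # Fallback so the sketch still runs without the package installed;
    # it lists only the pair used elsewhere in this document.
    pairs = [("ViT-H-14", "laion2b_s32b_b79k")]

print(tags_for("ViT-H-14", pairs))
```

Any pair from the list can be passed as the model name and `pretrained` argument of `open_clip.create_model_and_transforms`.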

## Encoding Images and Text

```python
import open_clip
import torch
from PIL import Image

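model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")
model.eval()  # inference mode: disables training-only behavior such as dropout
tokenizer = open_clip.get_tokenizer("ViT-H-14")

labels = ["a dog", "a cat", "a car"]
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # add a batch dimension
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scale the similarities before softmax; unscaled cosines give near-uniform probabilities
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

best = probs.argmax(dim=-1).item()
print(f"Predicted: {labels[best]} with {probs[0, best].item():.2%}")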
```
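
The scoring step is easier to follow without the model in the loop. Below is a dependency-free sketch of zero-shot scoring: L2-normalize both sides, take cosine similarities, scale them, and apply softmax. The 3-d vectors are made up and only stand in for real embeddings. The function names are hypothetical, and the default `scale=100.0` is an assumption mirroring the constant commonly used in CLIP-style examples.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Cosine similarity of one image to each text, scaled, then softmax."""
    img = l2_normalize(image_vec)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_vecs]
    return softmax([scale * s for s in sims])

# Made-up vectors standing in for real embeddings (illustrative only).
image_vec = [2.0, 0.1, 0.0]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 5.0]]

probs = zero_shot_probs(image_vec, text_vecs)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, [round(p, 4) for p in probs])
```

Because both sides are normalized, rescaling an embedding leaves the probabilities unchanged; only the direction of each vector matters. The scale factor controls how sharply the softmax favors the closest label.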

## References
- [OpenCLIP GitHub](https://github.com/mlfoundations/open_clip)
- [OpenCLIP paper](https://arxiv.org/abs/2211.04293)
