mlx-vlm
$ npx mdskill add TerminalSkills/skills/mlx-vlm

Runs vision-language models locally on Apple Silicon using MLX
- Enables inference and fine-tuning of vision models on macOS without GPU servers
- Supports models like Pixtral, Qwen2-VL, Phi-3-Vision, and Llama-3.2-Vision
- Processes images and text inputs using unified memory for efficient execution
- Runs entirely on-device, which suits batch image processing and fine-tuning on custom datasets
SKILL.md
.github/skills/mlx-vlm
---
name: mlx-vlm
description: >-
  Run Vision Language Models locally on Apple Silicon Macs using MLX. Use when:
  installing mlx-vlm, running VLM inference (image + text → response), fine-tuning
  vision models on custom datasets, batch processing images with local AI,
  comparing local VLM to cloud APIs (GPT-4V, Claude Vision), or working with
  LLaVA, Phi-3-Vision, Qwen2-VL, Pixtral, Llama-3.2-Vision on Mac.
license: Apache-2.0
compatibility: "macOS 14+, Apple Silicon, Python 3.10+"
metadata:
  author: terminal-skills
  version: "1.0.0"
  category: data-ai
  tags: ["mlx", "vision", "apple-silicon"]
---
# MLX-VLM — Vision Language Models on Apple Silicon
## Overview
mlx-vlm runs vision-language models natively on Apple Silicon using the MLX framework. It supports inference and fine-tuning with unified memory — no GPU server needed.
**Repo:** `Blaizzy/mlx-vlm`
**Requirements:** macOS 14+, Apple Silicon (M1/M2/M3/M4), Python 3.10+
## Installation
```bash
# Create virtual environment (recommended)
python3 -m venv ~/.venvs/mlx-vlm
source ~/.venvs/mlx-vlm/bin/activate
# Install
pip install mlx-vlm
```
For development:
```bash
git clone https://github.com/Blaizzy/mlx-vlm.git
cd mlx-vlm && pip install -e .
```
## Supported Models
| Model | HuggingFace ID | Best For |
|-------|---------------|----------|
| Pixtral | `mistral-community/pixtral-12b-240910` | General vision, multi-image |
| Qwen2-VL | `Qwen/Qwen2-VL-7B-Instruct` | OCR, document understanding |
| Phi-3.5-Vision | `microsoft/Phi-3.5-vision-instruct` | Lightweight, fast inference |
| LLaVA-1.6 | `llava-hf/llava-v1.6-mistral-7b-hf` | Conversation about images |
| Llama-3.2-Vision | `meta-llama/Llama-3.2-11B-Vision-Instruct` | Strong general reasoning |
## Inference
### CLI
```bash
# Single image analysis
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --image path/to/image.jpg \
  --prompt "Describe this image in detail" \
  --max-tokens 512

# Multi-image comparison
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --image img1.jpg img2.jpg \
  --prompt "Compare these two images"
```
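
When the CLI is driven from a script, building the argument list in code avoids shell-quoting problems with free-form prompts. The sketch below uses only the flags shown above; `build_generate_cmd` and `run_generate` are hypothetical helper names, not part of mlx-vlm.

```python
import subprocess
import sys


def build_generate_cmd(model, images, prompt, max_tokens=None):
    """Return the argv list for `python -m mlx_vlm.generate` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "mlx_vlm.generate",
        "--model", model,
        "--image", *images,   # one or more image paths
        "--prompt", prompt,   # passed as a single argument, no shell quoting needed
    ]
    if max_tokens is not None:
        cmd += ["--max-tokens", str(max_tokens)]
    return cmd


def run_generate(model, images, prompt, max_tokens=None):
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    cmd = build_generate_cmd(model, images, prompt, max_tokens)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Note that each call starts a new process and reloads the model, so for many images the Python API is likely the better fit.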
### Python API
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/pixtral-12b-240910-4bit"
model, processor = load(model_path)
images = ["product.jpg"]

# Format the prompt with the model's chat template; num_images adds
# one image placeholder per input image.
prompt = apply_chat_template(
    processor,
    model.config,
    "What objects are in this image?",
    num_images=len(images),
)

# Argument names have changed between mlx-vlm releases; check your installed version.
output = generate(
    model, processor, prompt, images,
    max_tokens=512,
    temperature=0.7,
)
print(output)
```
### Batch Processing
```python
import csv
import os

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("mlx-community/pixtral-12b-240910-4bit")
image_dir = "images/"
question = "Describe this product photo. Include: category, color, condition, key features."
results = []

# sorted() keeps the output order stable across runs
for filename in sorted(os.listdir(image_dir)):
    if not filename.lower().endswith((".jpg", ".png", ".webp")):
        continue
    path = os.path.join(image_dir, filename)
    prompt = apply_chat_template(processor, model.config, question, num_images=1)
    desc = generate(model, processor, prompt, [path], max_tokens=256)
    # Some releases return a result object with a .text field, others a plain str
    results.append({"file": filename, "description": getattr(desc, "text", desc)})

# Write one row per image
with open("descriptions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "description"])
    writer.writeheader()
    writer.writerows(results)
```
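
For long runs it can help to make the loop above resumable, so an interrupted job does not start over. The sketch below assumes the same `descriptions.csv` layout (`file`, `description`). `pending_images` and `run_batch` are hypothetical helpers, and `describe` is a caller-supplied function that stands in for the `apply_chat_template` plus `generate` calls.

```python
import csv
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".png", ".webp"}


def pending_images(image_dir, csv_path):
    """Return image paths in image_dir that are not yet listed in csv_path."""
    done = set()
    csv_file = Path(csv_path)
    if csv_file.exists():
        with csv_file.open(newline="") as f:
            done = {row["file"] for row in csv.DictReader(f)}
    return [
        p for p in sorted(Path(image_dir).iterdir())
        if p.suffix.lower() in IMAGE_SUFFIXES and p.name not in done
    ]


def run_batch(image_dir, csv_path, describe):
    """Append one CSV row per unprocessed image; returns how many were processed."""
    todo = pending_images(image_dir, csv_path)
    csv_file = Path(csv_path)
    write_header = not csv_file.exists() or csv_file.stat().st_size == 0
    with csv_file.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "description"])
        if write_header:
            writer.writeheader()
        for path in todo:
            writer.writerow({"file": path.name, "description": describe(str(path))})
            f.flush()  # keep finished rows if the run is interrupted
    return len(todo)
```

Calling `run_batch("images/", "descriptions.csv", describe)` a second time only processes files added since the previous run.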
## Fine-Tuning
### Prepare Dataset
Create a JSONL file with one example per line, each pairing an image path with a conversation:
```json
{"image": "train/001.jpg", "conversations": [{"role": "user", "content": "Classify this product"}, {"role": "assistant", "content": "Category: Electronics, Subcategory: Headphones, Condition: New"}]}
{"image": "train/002.jpg", "conversations": [{"role": "user", "content": "Classify this product"}, {"role": "assistant", "content": "Category: Clothing, Subcategory: T-Shirt, Condition: Used - Good"}]}
```
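
Files in this shape can be generated from an existing label table rather than written by hand. The sketch below assumes the labels are available as `(image_path, answer)` pairs and that every example uses the same question. `make_record`, `write_split`, and the 10% validation fraction are illustrative choices, not part of mlx-vlm.

```python
import json
import random
from pathlib import Path


def make_record(image_path, question, answer):
    """Build one example in the JSONL layout shown above."""
    return {
        "image": image_path,
        "conversations": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
    }


def write_split(pairs, out_dir, question, val_fraction=0.1, seed=0):
    """Shuffle (image_path, answer) pairs and write train.jsonl and val.jsonl."""
    records = [make_record(img, question, ans) for img, ans in pairs]
    random.Random(seed).shuffle(records)  # fixed seed keeps the split reproducible
    n_val = max(1, int(len(records) * val_fraction)) if len(records) > 1 else 0
    splits = {"val.jsonl": records[:n_val], "train.jsonl": records[n_val:]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with (out / name).open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(rows) for name, rows in splits.items()}
```

A fixed seed means rerunning the script on the same pairs yields the same train/validation split.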
### Run Fine-Tuning (LoRA)
```bash
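# Flag names differ between mlx-vlm releases; confirm them for your install with:
#   python -m mlx_vlm.lora --help
python -m mlx_vlm.lora \
  --model mlx-community/pixtral-12b-240910-4bit \
  --data ./dataset \
  --train-file train.jsonl \
  --valid-file val.jsonl \
  --num-layers 8 \
  --batch-size 1 \
  --epochs 3 \
  --lr 1e-5 \
  --adapter-path ./adapters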
```
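
A malformed line or a missing image can make a run fail late, so a quick pre-flight check on the JSONL files may save time. This sketch assumes the record layout from the Prepare Dataset section and that image paths are relative to the dataset directory. `validate_jsonl` is a hypothetical helper, not an mlx-vlm function.

```python
import json
from pathlib import Path


def validate_jsonl(jsonl_path, image_root):
    """Return (line_number, problem) tuples; an empty list means no issues found."""
    problems = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "record is not a JSON object"))
                continue
            image = record.get("image")
            if not image or not (Path(image_root) / image).is_file():
                problems.append((lineno, f"missing image: {image!r}"))
            turns = record.get("conversations") or []
            roles = [t.get("role") for t in turns if isinstance(t, dict)]
            if roles[:1] != ["user"] or "assistant" not in roles:
                problems.append((lineno, "expected a user turn and an assistant turn"))
    return problems
```

Running it on both `train.jsonl` and `val.jsonl` before training, and stopping if either list is non-empty, is usually enough.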
### Inference with Fine-Tuned Adapter
```bash
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --adapter-path ./adapters \
  --image test.jpg \
  --prompt "Classify this product"
```
## Cloud API Comparison
| Factor | mlx-vlm (Local) | Cloud APIs (GPT-4V, Claude) |
|--------|-----------------|---------------------------|
| Cost | $0 after hardware | $0.01-0.04 per image |
| Privacy | Data stays local | Data sent to provider |
| Speed | ~2-8s per image (M3 Max) | ~1-3s per image |
| Offline | Yes | No |
| Custom models | LoRA fine-tuning | Limited / expensive |
| Quality | Good (7-12B models) | Excellent (frontier models) |
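
The cost row can be turned into a rough break-even estimate: the number of images at which cumulative cloud spend matches the price of the hardware. The sketch below uses the per-image range from the table. The $3,000 hardware price is a hypothetical figure, and electricity and engineering time are ignored.

```python
from decimal import Decimal, ROUND_CEILING


def break_even_images(hardware_cost, cloud_price_per_image):
    """Number of images at which cumulative cloud spend reaches the hardware cost."""
    price = Decimal(str(cloud_price_per_image))
    if price <= 0:
        raise ValueError("cloud_price_per_image must be positive")
    ratio = Decimal(str(hardware_cost)) / price
    return int(ratio.to_integral_value(rounding=ROUND_CEILING))


# Hypothetical $3,000 machine against the per-image price range in the table
for price in ("0.01", "0.04"):
    print(price, break_even_images(3000, price))
```

Under these assumptions the break-even point falls between 75,000 and 300,000 images, so local inference mainly pays off for sustained, high-volume workloads or when privacy rules out a cloud provider.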
## Performance Tips
- Use 4-bit quantized models (`4bit` in the name) for a roughly 2-3x speedup with minimal quality loss
- M3 Max / M4 Pro machines with 36 GB or more of unified memory can run 12B models comfortably
- On M1/M2 machines with 16 GB, stick to 7B 4-bit models
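- `MLX_METAL_JIT` is an MLX build-time (CMake) option, not a runtime switch: it compiles Metal kernels on first use, so expect a slower first run rather than a speedup, and setting it as an environment variable with a prebuilt `pip` wheel should have no effect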
- Close memory-heavy apps before inference — unified memory is shared with system
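
The memory guidance above follows from simple arithmetic: weight storage is roughly parameter count times bits per weight. The sketch below is a lower bound only. It ignores quantization metadata, the KV cache, activations, and image features, all of which share the same unified memory. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


# Compare the model sizes mentioned above at common precisions
for params in (7, 12):
    for bits in (4, 8, 16):
        print(f"{params}B @ {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

A 12B model needs about 6 GB for 4-bit weights versus about 24 GB at 16-bit. Because the rest of the system also needs memory, practical limits are lower than these figures alone suggest.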