image-understander

Name: image-understander
Author: openakita/openakita

$npx mdskill add openakita/openakita/image-understander

Analyze images for text, objects, and visual questions using GPT-4 Vision.

Extracts text from screenshots and identifies objects in photos.
Depends on OpenAI GPT-4 Vision API for image processing.
Executes specific modes like OCR, description, or visual Q&A.
Returns structured JSON output with analysis results.

SKILL.md

.github/skills/image-understanderView on GitHub ↗

---
name: openakita/skills@image-understander
description: Analyze images using GPT-4 Vision for detailed description, OCR text extraction, object recognition, and visual Q&A. Use when the user needs to understand image content, extract text from screenshots, identify objects in photos, or ask questions about images via OpenAI GPT-4 Vision API.
license: MIT
metadata:
  author: openakita
  version: "1.0.0"
---

# 图片理解技能 (Image Understander)

## 📋 概述

一个基于 OpenAI GPT-4 Vision 的图片理解工具，支持图片描述、文字识别(OCR)、物体识别和图片问答。

## 🚀 功能

| 功能 | 命令 | 说明 |
|------|------|------|
| 图片描述 | `-m describe` | 详细描述图片内容 |
| 文字提取 | `-m ocr` | 提取图片中的所有文字 |
| 物体识别 | `-m objects` | 识别并列出图片中的物体 |
| 图片问答 | `-m qa` | 针对图片回答问题 |

## 📦 安装

```bash
# 安装依赖
pip install openai pillow requests
```

## 🔧 配置

### 方式一：环境变量
```bash
set OPENAI_API_KEY=sk-your-api-key-here
```

### 方式二：命令行传入
```bash
python scripts/main.py -i photo.jpg -a sk-your-key
```

## 📖 使用方法

### 基本使用
```bash
# 描述图片
python scripts/main.py -i photo.jpg -m describe

# 提取文字（OCR）
python scripts/main.py -i screenshot.png -m ocr

# 识别物体
python scripts/main.py -i photo.jpg -m objects

# 图片问答
python scripts/main.py -i photo.jpg -m qa -q "这个图片里有什么？"
```

### 完整参数
```bash
python scripts/main.py \
  --image PATH_TO_IMAGE \
  --mode describe|ocr|objects|qa \
  --api-key YOUR_API_KEY \
  --prompt "你的问题" \
  --output OUTPUT.json \
  --verbose
```

## 📁 输出示例

```json
{
  "mode": "describe",
  "image": "photo.jpg",
  "result": "A beautiful sunset over the ocean with orange and purple sky...",
  "objects": [],
  "text": ""
}
```

## ⚠️ 注意事项

- 需要 OpenAI API Key（支持 GPT-4 Vision）
- 支持的图片格式：PNG、JPG、GIF、BMP
- 图片大小建议小于 20MB

More from openakita/openakita

Skill	Description
add-memory	Record important information to long-term memory for learning user preferences, successful patterns, and error lessons. When you need to remember user preferences, save successful patterns, or record lessons from errors.
algorithmic-art	Creating algorithmic art using p5.js with seeded randomness and interactive parameter exploration. Use this when users request creating art using code, generative art, algorithmic art, flow fields, or particle systems. Create original algorithmic art rather than copying existing artists' work to avoid copyright violations.
apify-scraper	Web data extraction using 55+ Apify Actors for AI-driven scraping. Supports Instagram, Facebook, TikTok, YouTube, Google, and more. Auto-selects best Actor for the task. Structured output in JSON/CSV with rate limiting and ethical scraping guidelines.
baoyu-article-illustrator	Analyzes article structure, identifies positions requiring visual aids, generates illustrations with Type × Style two-dimension approach. Use when user asks to "illustrate article", "add images", "generate images for article", or "为文章配图".
baoyu-cover-image	Generates article cover images with 5 dimensions (type, palette, rendering, text, mood) combining 9 color palettes and 6 rendering styles. Supports cinematic (2.35:1), widescreen (16:9), and square (1:1) aspects. Use when user asks to "generate cover image", "create article cover", or "make cover".
baoyu-format-markdown	Formats plain text or markdown files with frontmatter, titles, summaries, headings, bold, lists, and code blocks. Use when user asks to "format markdown", "beautify article", "add formatting", or improve article layout. Outputs to {filename}-formatted.md.
baoyu-image-gen	Generate AI images using multiple providers (OpenAI DALL-E, Google Imagen, DashScope/Tongyi Wanxiang, Replicate). Supports various aspect ratios, quality presets, batch generation, and provider-specific prompt engineering techniques.
baoyu-slide-deck	Generates professional slide deck images from content. Creates outlines with style instructions, then generates individual slide images. Use when user asks to "create slides", "make a presentation", "generate deck", "slide deck", or "PPT".
baoyu-url-to-markdown	Fetch any URL and convert to markdown using Chrome CDP. Supports two modes - auto-capture on page load, or wait for user signal (for pages requiring login). Use when user wants to save a webpage as markdown.
bilibili-watcher	Extract subtitles and transcripts from Bilibili and YouTube videos. Use when the user wants to get subtitles from B站 (Bilibili) or YouTube, extract Chinese/Japanese video transcripts, watch member-only Bilibili content, or perform Q&A on video content. Supports dual-platform subtitle extraction with yt-dlp.