stepfun-asr
$
npx mdskill add daymade/claude-code-skills/stepfun-asrTranscribe long audio with StepFun's stepaudio-2.5-asr using the correct SSE endpoint
- Solves transcription issues caused by incorrect endpoints or model errors
- Uses StepFun's SSE endpoint /v1/audio/asr/sse with 32K context and 30-minute single-call limit
- Triggers on keywords like 阶跃 ASR, StepFun ASR, and 长音频转写
- Returns transcriptions for Chinese/English audio in one request without client-side chunking
SKILL.md
.github/skills/stepfun-asrView on GitHub ↗
---
name: stepfun-asr
description: Transcribe audio with StepFun's stepaudio-2.5-asr — an SSE endpoint (NOT /v1/audio/transcriptions) with 32K context, ~85-101x RTF on long audio, and a single-call ceiling around 30 minutes (no client-side chunking). Use when transcribing Chinese / English audio with StepFun, when long-form recordings (5-30 min) need to land in one request, when migrating from step-asr / step-asr-1.1, or when hitting the misleading `model stepaudio-2.5-asr not supported` error (which actually means wrong endpoint). Triggers on 阶跃 ASR, StepFun ASR, stepaudio-2.5-asr, 转录, 语音识别, 长音频转写, 语音转文字. For TTS with the sibling stepaudio-2.5-tts model, use the stepfun-tts skill instead.
---
# StepFun stepaudio-2.5-asr
Transcribe audio with StepFun's `stepaudio-2.5-asr` (released 2026-04, verified 2026-04-23). Long audio in one call, no chunking — but **only** if the request hits the right endpoint with the right body shape. The wrong endpoint returns an error that looks identical to "model doesn't exist", which is the #1 reason this skill exists.
> Companion: for TTS with `stepaudio-2.5-tts` (the sibling model), use the `stepfun-tts` skill — they share an API key but live on different endpoints with different body shapes.
## Why this skill exists — three traps that cost hours
1. **Wrong endpoint, wrong error**. `stepaudio-2.5-asr` does **not** live on `/v1/audio/transcriptions` (that endpoint serves the older `step-asr` family). It lives on `/v1/audio/asr/sse` — SSE streaming, JSON body, base64 audio. Sending it to the wrong endpoint returns `{"error":{"message":"model stepaudio-2.5-asr not supported"}}`, which is **identical in structure** to a genuinely nonexistent model name. People waste hours filing whitelist tickets.
2. **Plan key vs Normal key, silent failure**. StepFun's "Plan" subscription keys (cheap, text-only) cannot call audio endpoints, but the failure manifests as a 4xx with no auth-shaped error message. If your account has a Plan subscription, you need a separate "Normal" key from the same console.
3. **SSE error events are real**. Censorship can fire on the ASR side too (rarely). Don't assume only `transcript.text.delta` and `transcript.text.done` events arrive — handle `type: error` events in the stream or you'll silently drop them.
## Config and auth
API key resolves in this order (fail-fast, no defaults):
1. `$STEPFUN_API_KEY` environment variable
2. `${CLAUDE_PLUGIN_DATA}/config.json` with `{"api_key": "..."}` (cross-session persistence)
First-time setup:
```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}" && cat > "${CLAUDE_PLUGIN_DATA}/config.json" <<EOF
{"api_key": "<paste Normal key here>"}
EOF
```
If the user has not set a key, ask them to paste it — do not guess or use a placeholder. Get keys at https://platform.stepfun.com/ → API Keys. **Use a Normal key, not a Plan key.**
## Quick start — single file
```bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3
```
Output: plain text transcription on stdout.
For machine-readable output with usage / timing:
```bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3 --json
```
For non-Chinese audio:
```bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3 --language en
```
The script handles base64 encoding, the nested `{audio: {data, input: {transcription, format}}}` body, SSE parsing, and the misleading-endpoint pitfall. Prefer it over hand-rolled HTTP calls unless integrating into a larger pipeline.
## Decision table
| Scenario | Action |
|---|---|
| Short clip (< 5 min), Chinese or English, mp3/wav/ogg/opus | `python3 scripts/asr_transcribe.py audio.mp3` |
| Long audio (5-30 min) | Same script — 32K context handles it in a single call, no chunking needed |
| Audio > 30 min | Split with ffmpeg before sending; the API rejects oversized payloads |
| Need usage/billing data | Add `--json` to capture `usage.input_tokens` / `usage.total_tokens` from `transcript.text.done` |
| Highly repetitive content (same phrase 5+ times, > 90s) | Cross-validate with `step-asr-1.1` — see repetition hallucination in `references/known_issues.md` |
| Hit `model stepaudio-2.5-asr not supported` | Wrong endpoint. Switch from `/v1/audio/transcriptions` to `/v1/audio/asr/sse` |
| Hit silent 4xx auth failure | Verify your key is "Normal" not "Plan" — Plan keys cannot call audio endpoints |
| Need to write raw HTTP (no Python) | Read `references/api_reference.md` for exact JSON body and SSE event shapes |
## Supported audio formats
The script auto-detects from extension; pass `--format` to override:
| Extension | Format flag | Notes |
|---|---|---|
| `.mp3` | `mp3` | Most common, default |
| `.wav` | `wav` | Lossless |
| `.ogg` | `ogg` | OGG container |
| `.opus` | `ogg` | Opus codec in OGG container — pass through unchanged |
| `.pcm` | `pcm` | Raw PCM — also requires `format.rate`, `format.channel`, `format.bits` (see API reference) |
For mp4/m4a/webm/etc., transcode to one of the above first via ffmpeg. Production pipelines often pre-transcode everything to OGG/Opus 16kHz mono to minimize base64 payload size.
## Capacity and performance (verified 2026-04-23)
- **32K context window** — single-call upper limit, no chunking needed for ≤ 30 min audio
- **~85-101× RTF** on long audio (17.4 min audio → 10.4s wall clock)
- **~5.3× speedup vs step-asr-1.1** at the 100s+ length range
- **Only ~2× speedup** at the 5-15s range — the LLM spin-up cost dominates short clips. If your workload is many short clips, the migration ROI is modest
## Common error patterns
| Error response | Actual cause | Fix |
|---|---|---|
| `"model stepaudio-2.5-asr not supported"` on `/v1/audio/transcriptions` | Wrong endpoint | Switch to `/v1/audio/asr/sse` (script does this) |
| Silent 4xx with no auth message | Using a "Plan" key on audio endpoint | Get a "Normal" key from the StepFun console |
| ASR returns 3-4× expected character count | Repetition hallucination on highly-repetitive audio | Cross-validate with `step-asr-1.1`; see `references/known_issues.md` |
| `data: {"type":"error","message":"content blocked..."}` mid-stream | Censorship fired on user-uploaded content | Handle SSE `error` event explicitly; don't assume only `delta`/`done` arrive |
More edge cases in `references/known_issues.md`.
## Design invariants (do not break)
1. **Always pass through SSE** — don't try to buffer the response with a non-streaming client. The model emits `transcript.text.delta` for long audio; `transcript.text.done` carries the authoritative full text and `usage`. Reject the SSE format entirely and you'll get nothing.
2. **Take final text from `transcript.text.done.text`** — concatenated deltas can drift on edge cases. Deltas are for progressive UI; the `done` event is the source of truth.
3. **Handle `error` events in the stream** — don't treat the SSE stream as if only success events arrive. A blocked-content event mid-stream returns `type: error` with no `done` event.
4. **Fail-fast on missing API key** — never default to a placeholder or empty string. The script does this; preserve the behavior in any wrapper.
## When to read references
- `references/api_reference.md` — exact JSON request body, all fields, all SSE event types, response examples. Read when writing raw HTTP calls instead of using the bundled script.
- `references/known_issues.md` — repetition hallucination details, the wrong-endpoint diagnostic trail, Plan-vs-Normal key gotcha, ASR-side censorship handling, pricing opacity. Read when debugging anomalous output or evaluating whether to migrate from `step-asr-1.1`.
## Pricing (verified 2026-04-23, volatile)
`stepaudio-2.5-asr` is in invitation beta as of 2026-04-23 — no public per-minute rate. The `step-asr-1.1` baseline is 2.2 元/小时. The invitation PDF mentions "成本直降 80%" implying ~0.4 元/小时, but this is not yet on the pricing page. Re-verify at https://platform.stepfun.com/docs/zh/guides/pricing/details before quoting to stakeholders.
## Next Step
ASR transcripts almost always benefit from one of two downstream steps. After producing the transcript, suggest the natural next skill:
```
Transcription complete. The output is raw text from the model — common next steps:
Options:
A) transcript-fixer — clean up ASR errors (homophones, segmentation, filler words). Recommended if the recording is a real-world conversation, podcast, or interview rather than read-aloud text
B) meeting-minutes-taker — turn the transcript into structured minutes with decisions, action items, and speaker attribution. Recommended if the recording is a meeting
C) No thanks — the raw transcript is what I needed
```
Skip the suggestion when the user has already specified the downstream tool, or when the transcription was clearly a one-off lookup (e.g., "what does this 15-second clip say?").
More from daymade/claude-code-skills
- asr-transcribe-to-textTranscribes audio and video files to text using Qwen3-ASR. Supports two modes — local MLX inference on macOS Apple Silicon (no API key, 15-27x realtime) and remote API via vLLM/OpenAI-compatible endpoints. Auto-detects platform and recommends the best path. Triggers when the user wants to transcribe recordings, convert audio/video to text, do speech-to-text, or mentions ASR, Qwen ASR, 转录, 语音转文字, 录音转文字. Also triggers for meeting recordings, lectures, interviews, podcasts, screen recordings, or any audio/video file the user wants converted to text.
- auto-repo-setup|
- benchmark-due-diligence>
- bigdata-skill>-
- capture-screenProgrammatic screenshot capture on macOS. Find window IDs with Swift CGWindowListCopyWindowInfo, control application windows via AppleScript (zoom, scroll, select), and capture with screencapture. Use when automating screenshots, capturing application windows for documentation, or building multi-shot visual workflows.
- claude-code-history-files-finderFinds and recovers content from Claude Code session history files. This skill should be used when searching for deleted files, tracking changes across sessions, analyzing conversation history, or recovering code from previous Claude interactions. Triggers include mentions of "session history", "recover deleted", "find in history", "previous conversation", or ".claude/projects".
- claude-md-progressive-disclosurer|
- claude-skills-troubleshootingDiagnose and resolve Claude Code plugin and skill issues. This skill should be used when plugins are installed but not showing in available skills list, skills are not activating as expected, or when troubleshooting enabledPlugins configuration in settings.json. Triggers include "plugin not working", "skill not showing", "installed but disabled", or "enabledPlugins" issues.
- cli-demo-generatorGenerates professional animated CLI demos as GIFs using VHS terminal recordings. Handles tape file creation, self-bootstrapping demos with hidden setup, output noise filtering, post-processing speed-up, and frame-level verification. Use when users want to create terminal demos, record CLI workflows as GIFs, generate animated documentation, build demo tapes for README files, or need to showcase any command-line tool visually. Also triggers on "record terminal", "VHS tape", "demo GIF", "animate my CLI", or any request to visually demonstrate shell commands.
- cloudflare-troubleshootingInvestigate and resolve Cloudflare configuration issues using API-driven evidence gathering. Use when troubleshooting ERR_TOO_MANY_REDIRECTS, SSL errors, DNS issues, or any Cloudflare-related problems. Focus on systematic investigation using Cloudflare API to examine actual configuration rather than making assumptions.