stepfun-tts
$
npx mdskill add daymade/claude-code-skills/stepfun-ttsGenerate Chinese / Japanese speech with `stepaudio-2.5-tts` (released 2026-04, verified 2026-04-23). Contextual TTS — emotion and prosody go through natural-language description, not fixed labels.
SKILL.md
.github/skills/stepfun-ttsView on GitHub ↗
---
name: stepfun-tts
description: Generate Chinese / Japanese speech with StepFun's stepaudio-2.5-tts — Contextual TTS that replaces step-tts-2's `voice_label` with natural-language `instruction` (≤200 chars) plus inline `()` parentheses for句内 prosody. Use when the user wants emotional / prosody control over voice synthesis (whisper, pause, stress, mood pivot mid-sentence), batch-generates game / app voice lines, migrates from `step-tts-2` (the `voice_label → instruction` breaking change), or hits StepFun's stricter 2.5-era censorship (死/消失/political terms). Triggers on 阶跃 TTS, StepAudio 合成, 语音合成, 配音, 文本转语音, TTS 升级, 迁移 step-tts-2. For transcription with the sibling stepaudio-2.5-asr model, use the stepfun-asr skill instead.
---
# StepFun stepaudio-2.5-tts
Generate Chinese / Japanese speech with `stepaudio-2.5-tts` (released 2026-04, verified 2026-04-23). Contextual TTS — emotion and prosody go through natural-language description, not fixed labels.
> Companion: for transcription with `stepaudio-2.5-asr` (the sibling model), use the `stepfun-asr` skill — they share an API key but live on different endpoints with different body shapes.
**Why this skill exists** — StepAudio 2.5 has two non-obvious pitfalls that cost hours if you don't know them:
1. `stepaudio-2.5-tts` **rejects** `voice_label` (the step-tts-2 way). Emotion/prosody now goes through `instruction` (natural-language description, ≤200 chars) and inline `()` parentheses inside the text itself.
2. Censorship is stricter — anything containing 死 / 消失 / sensitive political terms returns `censorship_block`. Your rewrite options are in `references/migration_from_v2.md`.
## Config and auth
API key lives in `$STEPFUN_API_KEY` (preferred) or `${CLAUDE_PLUGIN_DATA}/config.json` (fallback for cross-session persistence). All bundled scripts try env first, then config.
First-time setup (one-liner):
```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}" && cat > "${CLAUDE_PLUGIN_DATA}/config.json" <<EOF
{"api_key": "<paste key here>"}
EOF
```
If the user hasn't set a key, ask them to paste it (don't guess / don't use a placeholder). StepFun API keys are available at https://platform.stepfun.com/ → API Keys. **Use a Normal key, not a Plan key** (Plan keys are restricted to text models and silently fail on audio endpoints).
## Common tasks — decision tree
| User wants... | Script | Key detail |
|---|---|---|
| Synthesize 1–500 char Chinese with emotion | `scripts/tts_generate.py` | Use `instruction` for mood, `()` for inline prosody |
| Synthesize long text (500–1000 char) | `scripts/tts_generate.py` | 1000 char is the hard cap; split at semantic boundaries above that |
| Batch-generate game/app voice lines | `scripts/tts_generate.py --batch <jsonl>` | Handle `censorship_block` fallback individually |
| A/B compare two TTS models | `scripts/ab_compare.sh` | Compares duration/size across two directories |
| Migrate from `step-tts-2` | see `references/migration_from_v2.md` | `voice_label.emotion` → `instruction` rewrite + censorship list |
## Starting points
- **Synthesize a single line**: Run `python3 scripts/tts_generate.py --text "你好" --out /tmp/hello.mp3 --instruction "温暖的希望感"`. For fine-grained control read the "Contextual TTS" section below.
- **A full migration** from `step-tts-2` → `stepaudio-2.5-tts`: read `references/migration_from_v2.md` end-to-end before touching code. It has the `INSTRUCTION_MAP`, the SKIP_CENSORED list pattern, and the output-directory-strategy for non-destructive A/B.
## Contextual TTS — beyond emotion labels
The headline feature of `stepaudio-2.5-tts` is that you stop mapping emotions to fixed tags and start describing what you want in natural language. Two layers:
**Global context (`instruction` parameter)** — sets the overall tone for the entire utterance. ≤200 chars. Think of it like giving stage direction to a voice actor.
```
instruction: "克制的悲伤,语气低沉柔弱,像快要消失一样"
```
**Inline context (`()` parentheses inside `input`)** —句内 directives. Parenthesised content is consumed as directions and is NOT read aloud. Use for precise control of pauses, breath, emphasis, or mid-sentence emotion shifts.
```
input: "(试探着问)你好吗?(开心地)太好了!(突然沉下来)不过...我快要消失了。"
```
Examples that worked in practice (from 2026-04-23 verification):
- `instruction: "活泼俏皮,像是在撒娇,带点嘴硬"` — visibly speeds up delivery vs neutral
- `instruction: "耳语声,气声很重,几乎听不清"` — produces audible whisper/breath
- `input: "你好(停顿一下)我是蕾格(轻声)今天(加重)的天气真不错。"` — inline directives all respected
**What `stepaudio-2.5-tts` will NOT accept** — `voice_label` parameter. Error: `voice_label is not supported for v2 models`. This is the #1 migration gotcha from step-tts-2.
## Common error patterns (real errors, real fixes)
| Error response | Actual cause | Fix |
|---|---|---|
| `"voice_label is not supported for v2 models"` | Sent `voice_label` to `stepaudio-2.5-tts` | Remove `voice_label`; put the same intent into `instruction` as natural language |
| `"The content you provided or machine outputted is blocked." type: censorship_block` | Sensitive word (死 / 消失 / etc.) | Rewrite the phrase OR fall back to `step-tts-2` for that specific line (mixed-model is fine) |
| Silent audio truncation (input > 1000 chars) | Hard cap exceeded | Split at semantic boundaries; don't truncate mid-sentence |
More in `references/known_issues.md`.
## When to read references
- `references/api_reference.md` — exact request/response JSON for `/v1/audio/speech`, all fields, error responses. Read when writing raw HTTP calls instead of using the bundled scripts.
- `references/migration_from_v2.md` — complete playbook for moving a step-tts-2 project to stepaudio-2.5-tts. Has the emotion→instruction rewrite table, the A/B directory strategy, decision checkpoints, and the 2026-04 speed/quality trade-off data (`stepaudio-2.5-tts` is ~20% slower than step-tts-2; audible prosody improvement). Read before any migration work.
- `references/known_issues.md` — censorship patterns, TTS duration inflation, v2-family parameter naming gotcha, 1000-char hard cap. Read when debugging anomalous output or evaluating whether to adopt.
## Design invariants (don't break these)
1. **Non-destructive A/B output** — when regenerating a corpus with a new model, write to a parallel directory (`voice/zh_v25/`), never overwrite the production corpus. The migration playbook shows why.
2. **Per-line censorship handling** — if 2/29 lines get `censorship_block`, don't fail the batch. Log the skipped IDs, continue. Mixed-model fallback (step-tts-2 for the skipped 2) is normal.
3. **Don't duplicate voice_label logic in new code** — any new TTS code targeting stepaudio-2.5-tts should only use `instruction` + inline `()`. Do not write a branch that conditionally emits `voice_label`.
## Pricing (verified 2026-04-23, volatile)
- `stepaudio-2.5-tts` contextual synthesis: ~5.8 元 / 万字符
- Zero-shot voice cloning: ~9.9 元 / 音色
Re-verify at https://platform.stepfun.com/docs/zh/guides/pricing/details before quoting to stakeholders.
More from daymade/claude-code-skills
- asr-transcribe-to-textTranscribes audio and video files to text using Qwen3-ASR. Supports two modes — local MLX inference on macOS Apple Silicon (no API key, 15-27x realtime) and remote API via vLLM/OpenAI-compatible endpoints. Auto-detects platform and recommends the best path. Triggers when the user wants to transcribe recordings, convert audio/video to text, do speech-to-text, or mentions ASR, Qwen ASR, 转录, 语音转文字, 录音转文字. Also triggers for meeting recordings, lectures, interviews, podcasts, screen recordings, or any audio/video file the user wants converted to text.
- auto-repo-setup|
- benchmark-due-diligence>
- bigdata-skill>-
- capture-screenProgrammatic screenshot capture on macOS. Find window IDs with Swift CGWindowListCopyWindowInfo, control application windows via AppleScript (zoom, scroll, select), and capture with screencapture. Use when automating screenshots, capturing application windows for documentation, or building multi-shot visual workflows.
- claude-code-history-files-finderFinds and recovers content from Claude Code session history files. This skill should be used when searching for deleted files, tracking changes across sessions, analyzing conversation history, or recovering code from previous Claude interactions. Triggers include mentions of "session history", "recover deleted", "find in history", "previous conversation", or ".claude/projects".
- claude-md-progressive-disclosurer|
- claude-skills-troubleshootingDiagnose and resolve Claude Code plugin and skill issues. This skill should be used when plugins are installed but not showing in available skills list, skills are not activating as expected, or when troubleshooting enabledPlugins configuration in settings.json. Triggers include "plugin not working", "skill not showing", "installed but disabled", or "enabledPlugins" issues.
- cli-demo-generatorGenerates professional animated CLI demos as GIFs using VHS terminal recordings. Handles tape file creation, self-bootstrapping demos with hidden setup, output noise filtering, post-processing speed-up, and frame-level verification. Use when users want to create terminal demos, record CLI workflows as GIFs, generate animated documentation, build demo tapes for README files, or need to showcase any command-line tool visually. Also triggers on "record terminal", "VHS tape", "demo GIF", "animate my CLI", or any request to visually demonstrate shell commands.
- cloudflare-troubleshootingInvestigate and resolve Cloudflare configuration issues using API-driven evidence gathering. Use when troubleshooting ERR_TOO_MANY_REDIRECTS, SSL errors, DNS issues, or any Cloudflare-related problems. Focus on systematic investigation using Cloudflare API to examine actual configuration rather than making assumptions.