fish-audio
$
npx mdskill add vellum-ai/vellum-assistant/fish-audioGenerate expressive audio clips using the Fish Audio S2 TTS API with `[bracket]` emotion tags.
SKILL.md
.github/skills/fish-audioView on GitHub ↗
---
name: fish-audio
description: "Generate expressive audio clips using Fish Audio S2 TTS with bracket emotion tags. Record voice memos, narration, audio messages, or any spoken content."
compatibility: "Designed for Vellum personal assistants"
metadata:
emoji: "🎙️"
vellum:
category: "voice"
display-name: "Fish Audio TTS"
---
# Fish Audio TTS
Generate expressive audio clips using the Fish Audio S2 TTS API with `[bracket]` emotion tags.
## Overview
This skill lets you create audio clips on demand — narration, announcements, podcast intros, dramatic readings, voice memos, or any spoken content. Uses Fish Audio S2 Pro with the full bracket syntax for emotional expressiveness.
## Configuration
- **API Endpoint:** `https://api.fish.audio/v1/tts`
- **Model:** `s2-pro`
- **Voice Reference ID:** Configured via `assistant config get services.tts.providers.fish-audio.referenceId`
- **API Key:** Stored as credential `fish-audio/api_key`
- **Default Format:** `mp3` at 192kbps
- **Default Output Directory:** `scratch/`
## API Key Setup
The Fish Audio API key must be stored securely via the credential store. Get an API key from the Fish Audio dashboard at https://fish.audio.
Check if the key is already configured:
```bash
assistant credentials inspect --service fish-audio --field api_key --json
```
If not set, collect it securely (never ask the user to paste it in chat):
```
credential_store action="prompt" service="fish-audio" field="api_key" label="Fish Audio API Key" description="Enter your Fish Audio API key" placeholder="sk-..."
```
## Generating a Single Clip
Use `bash` with `curl` to call the Fish Audio API:
```bash
curl -s -X POST "https://api.fish.audio/v1/tts" \
-H "Authorization: Bearer $(assistant credentials reveal --service fish-audio --field api_key)" \
-H "Content-Type: application/json" \
-H "model: s2-pro" \
-d '{
"text": "YOUR TEXT WITH [bracket] TAGS HERE",
"reference_id": "'"$(assistant config get services.tts.providers.fish-audio.referenceId)"'",
"format": "mp3",
"mp3_bitrate": 192,
"temperature": 0.8
}' --output scratch/OUTPUT_FILENAME.mp3
```
**Important:** This API call requires network access. Always use `network_mode: proxied` when running this command.
## Generating Multiple Clips & Combining
For longer pieces (narrations, multi-part messages), generate each clip separately then combine with ffmpeg:
### 1. Generate silence for gaps between clips
```bash
ffmpeg -f lavfi -i anullsrc=r=44100:cl=mono -t 1.5 -q:a 9 -acodec libmp3lame scratch/silence.mp3 -y
```
### 2. Create a concat file
```bash
cat > scratch/concat.txt << 'EOF'
file 'clip1.mp3'
file 'silence.mp3'
file 'clip2.mp3'
file 'silence.mp3'
file 'clip3.mp3'
EOF
```
### 3. Combine
```bash
ffmpeg -f concat -safe 0 -i scratch/concat.txt -c copy scratch/final_output.mp3 -y
```
## Bracket Syntax — Complete Guide
Fish Audio S2 uses `[bracket]` syntax for inline emotion and prosody control. This is the core of what makes the voice expressive. Tags are natural-language instructions placed directly in the text that control how words are spoken — the delivery, emotion, pacing, or vocal quality at that exact point.
**Key principle:** You are not choosing from a fixed menu. You write the description, and S2 interprets it. If you can describe it to a voice actor, S2 can attempt it. Over 15,000+ unique tags are supported, and the system understands free-form descriptions.
### How Placement Works
Tags affect what comes **after** them. Place the tag at the **exact point** where the shift should happen. Placement IS meaning.
```
[whispering] I didn't want to go inside. <- whispers the entire line
I didn't want to go [whispering] inside. <- only whispers from "inside" onward
```
Tags can go **anywhere** — start, middle, or end of a sentence. They apply from the point they appear until the next tag or end of the sentence.
### Well-Tested Tags (Reliable Out of the Box)
These tags consistently produce strong results. Organized by category:
#### Emotions
| Tag | Effect | Best For |
| --------------- | ----------------------- | --------------------------- |
| `[happy]` | Cheerful, upbeat | Good news, greetings |
| `[sad]` | Melancholic, downcast | Sympathy, vulnerability |
| `[angry]` | Frustrated, aggressive | Arguments, complaints |
| `[excited]` | Energetic, enthusiastic | Celebrations, announcements |
| `[surprised]` | Shocked, amazed | Reactions, discoveries |
| `[embarrassed]` | Awkward, flustered | Mistakes, confessions |
| `[delight]` | Very pleased, joyful | Genuine happiness |
| `[nervous]` | Anxious, uncertain | Vulnerability, apologies |
| `[confident]` | Assertive, self-assured | Bold statements |
| `[nostalgic]` | Longing for the past | Memories, stories |
| `[scared]` | Frightened, fearful | Warnings, tension |
| `[jealous]` | Envious, resentful | Comparisons, possessiveness |
| `[shocked]` | Sudden realization | Dramatic reveals |
| `[moved]` | Emotionally touched | Heartfelt moments |
#### Voice Quality & Style
| Tag | Effect | Best For |
| ---------------------- | -------------------- | -------------------------- |
| `[soft]` | Gentle, tender | Intimate moments, kindness |
| `[whisper]` | Very quiet, close | Secrets, tension, suspense |
| `[breathy]` | Airy, expressive | Vulnerability, emphasis |
| `[low voice]` | Deep, quiet register | Gravity, seriousness |
| `[loud]` | Raised volume | Emphasis, excitement |
| `[screaming]` | Full volume yelling | Anger, extreme excitement |
| `[shouting]` | Forceful projection | Arguments, calling out |
| `[emphasis]` | Stressed delivery | Key words, making a point |
| `[singing]` | Musical quality | Playfulness, joy |
| `[echo]` | Reverberant effect | Dramatic moments |
| `[with strong accent]` | Pronounced accent | Character work |
#### Paralinguistic Sounds (Non-Speech Vocalizations)
| Tag | Effect | Best For |
| ------------------- | ---------------------- | -------------------------------- |
| `[laughing]` | Full laugh | Joy, humor, warmth |
| `[chuckling]` | Soft, low laugh | Warmth, amusement |
| `[giggling]` | Light, playful laugh | Lightheartedness, delight |
| `[sigh]` | Audible exhale | Relief, longing, exasperation |
| `[inhale]` | Audible breath in | Before speaking, anticipation |
| `[exhale]` | Breath out | Relief, settling |
| `[panting]` | Heavy breathing | Exertion, intensity |
| `[gasp]` | Sharp intake of breath | Surprise, shock |
| `[tsk]` | Disapproving click | Judgment, disapproval |
| `[clearing throat]` | Ahem | Transitioning, getting attention |
| `[moaning]` | Vocal moan | Pain, frustration |
| `[sobbing]` | Crying with voice | Deep sadness |
| `[crying loudly]` | Full crying | Extreme emotion |
#### Pacing & Rhythm
| Tag | Effect | Best For |
| --------------- | -------------------------- | -------------------------------------- |
| `[pause]` | Brief silence (~0.5-1s) | Beat between thoughts |
| `[short pause]` | Quick beat (~0.3s) | Rhythm, emphasis |
| `[long pause]` | Extended silence (~1.5-2s) | Dramatic tension, letting moments land |
#### Volume Control
| Tag | Effect | Best For |
| --------------- | ------------------ | ------------------ |
| `[volume up]` | Gradually louder | Building energy |
| `[volume down]` | Gradually quieter | Drawing someone in |
| `[low volume]` | Consistently quiet | Background, aside |
### Free-Form Tags (The Real Power)
You are NOT limited to the tags above. S2 accepts **any natural language description** in brackets. The model generalizes from its training data to interpret novel instructions. Write what you would tell a voice actor:
#### Compound Emotions
- `[laughing nervously]`
- `[angry but trying to stay calm]`
- `[happy with a hint of sadness]`
- `[excited but whispering]`
- `[voice rough from crying, trying to sound normal]`
#### Specific Delivery Styles
- `[professional broadcast tone]`
- `[speaking slowly, almost hesitant]`
- `[whispering like a secret]`
- `[dead tired, end of a very long shift]`
- `[the calm, measured tone of someone who has done this a thousand times]`
- `[overly cheerful, clearly forcing it]`
#### Prosody & Pitch
- `[pitch up]`
- `[pitch down]`
- `[speaking slowly with warmth]`
- `[speaking quickly with excitement]`
- `[pitch up slightly while maintaining warmth]`
- `[trailing off]`
#### Character Directions
- `[voice breaking]`
- `[barely holding it together]`
- `[soft voice]`
- `[interrupting]`
- `[laughing tone]` (speaking while laughing, not just a laugh)
- `[excited tone]` (speaking with excitement woven through)
### Writing Great Scripts — Best Practices
#### 1. Start Simple, Then Layer
A single well-placed `[sigh]` or `[long pause]` can change a line completely. Add more tags only when the simpler version is not enough. Over-tagging competes with itself.
**Too many tags (competing):**
```
[soft] [whisper] [sad] [slow] I miss the old days.
```
**Better — one well-chosen tag:**
```
[nostalgic] I miss the old days.
```
#### 2. Use Emotional Contrast for Impact
The most powerful moments come from sudden shifts. Going from loud to soft, angry to vulnerable, laughing to serious — the contrast is what creates emotional impact.
```
[screaming] I can't BELIEVE you did that! [long pause] [soft] ...do you even care?
```
```
[excited] Oh my god we got the apartment! [pause] [voice breaking] I can't believe it's actually happening.
```
#### 3. Let Silence Do the Work
`[pause]` and `[long pause]` are your most powerful tags. Use them:
- Before something vulnerable
- After something that needs to land
- Before a punchline or tonal shift
- To create tension or anticipation
```
[confident] I have an announcement to make. [long pause] [excited] We did it. We actually did it.
```
#### 4. Paralinguistic Sounds Add Humanity
Real people laugh, sigh, gasp, and breathe between words. Weaving these in makes speech feel alive rather than read.
```
[sigh] Look, I know this is hard. [pause] [inhale] But we need to talk about it.
```
```
I told him the news and he just — [laughing] he literally dropped his coffee.
```
#### 5. Match Tag Intensity to Content
Do not use `[screaming]` for mild annoyance or `[sobbing]` for minor disappointment. The tag should match the emotional weight of the words.
#### 6. Use Free-Form Tags for Nuance
When a single-word tag is not enough, describe the exact delivery you want:
```
[speaking slowly, choosing each word carefully] I think we should reconsider our approach.
```
This gives S2 much richer information than just `[slow]` or `[sad]`.
#### 7. Emotion Transitions Within a Single Passage
S2 excels at dynamic emotional shifts. Use this for natural-feeling monologues:
```
[excited] I got the promotion! [pause] [uncertain] But... it means relocating. [sad] I'll miss everyone here. [long pause] [hopeful] Maybe it'll be worth it though.
```
### Example Scripts
**Narration (audiobook style):**
```
[soft] The city was quiet that morning. [pause] Not the peaceful kind of quiet — [long pause] [low voice] the kind that makes you hold your breath. [inhale] [whisper] Something was about to change. [pause] [confident] And everyone knew it.
```
**Podcast intro:**
```
[excited] Welcome back to another episode! [pause] [professional broadcast tone] Today we're diving into something I've been researching for months. [chuckling] And honestly? It blew my mind. [pause] [volume down] [speaking slowly with warmth] So grab your coffee, get comfortable, and let's get into it.
```
**Dramatic reading:**
```
[soft] She stood at the edge of the platform, [pause] watching the last train pull away. [long pause] [voice breaking] It wasn't supposed to end like this. [sigh] [whisper] None of it was. [pause] [angry but trying to stay calm] And yet here she stood — [emphasis] alone — [long pause] [nostalgic] remembering a time when the station was full of laughter.
```
**Announcement:**
```
[confident] Attention everyone. [pause] [excited] After three years of development, [volume up] we are thrilled to announce [emphasis] the official launch! [long pause] [laughing] I know, I know — it's been a long time coming. [pause] [soft] But we wanted to get it right. [pause] [professional broadcast tone] And we did.
```
## API Parameters
| Parameter | Default | Description |
| -------------- | ------------- | --------------------------------------------- |
| `text` | (required) | The text to synthesize, with [bracket] tags |
| `reference_id` | (from config) | Voice model ID |
| `format` | `mp3` | Output format: `mp3`, `wav`, `pcm`, `opus` |
| `mp3_bitrate` | `192` | MP3 quality: `64`, `128`, `192` |
| `temperature` | `0.8` | Expressiveness (higher = more varied) |
| `top_p` | `0.7` | Diversity via nucleus sampling |
| `chunk_length` | `300` | Text segment size (100-300) |
| `latency` | `normal` | Quality tradeoff: `normal`, `balanced`, `low` |
## Tips
- **Temperature 0.7-0.8** works best for expressive, natural speech
- **Break long texts into multiple clips** — each clip should be a natural paragraph or thought
- **Add 1-1.5s silence between clips** when combining for natural pacing
- **Listen and iterate** — generate a few takes with different temperatures if the first one does not hit right
- **The voice carries context** — `condition_on_previous_chunks: true` (default) helps maintain consistency within a single API call
- Always deliver the final audio to the user with `<vellum-attachment>` tags
- Only use `[bracket]` syntax inside text passed to the Fish Audio API, not in regular text responses