fish-audio

$ npx mdskill add vellum-ai/vellum-assistant/fish-audio

Generate expressive audio clips using the Fish Audio S2 TTS API with `[bracket]` emotion tags.

---
name: fish-audio
description: "Generate expressive audio clips using Fish Audio S2 TTS with bracket emotion tags. Record voice memos, narration, audio messages, or any spoken content."
compatibility: "Designed for Vellum personal assistants"
metadata:
  emoji: "🎙️"
  vellum:
    category: "voice"
    display-name: "Fish Audio TTS"
---

# Fish Audio TTS

Generate expressive audio clips using the Fish Audio S2 TTS API with `[bracket]` emotion tags.

## Overview

This skill lets you create audio clips on demand: narration, announcements, podcast intros, dramatic readings, voice memos, or any other spoken content. It uses Fish Audio S2 Pro with the full bracket syntax for emotional expressiveness.

## Configuration

- **API Endpoint:** `https://api.fish.audio/v1/tts`
- **Model:** `s2-pro`
- **Voice Reference ID:** Read with `assistant config get services.tts.providers.fish-audio.referenceId`
- **API Key:** Stored as credential `fish-audio/api_key`
- **Default Format:** `mp3` at 192 kbps
- **Default Output Directory:** `scratch/`

## API Key Setup

The Fish Audio API key must be stored securely via the credential store. Get an API key from the Fish Audio dashboard at https://fish.audio.

Check if the key is already configured:

```bash
assistant credentials inspect --service fish-audio --field api_key --json
```

If not set, collect it securely (never ask the user to paste it in chat):

```
credential_store action="prompt" service="fish-audio" field="api_key" label="Fish Audio API Key" description="Enter your Fish Audio API key" placeholder="sk-..."
```

## Generating a Single Clip

Use `bash` with `curl` to call the Fish Audio API:

```bash
curl -s -X POST "https://api.fish.audio/v1/tts" \
  -H "Authorization: Bearer $(assistant credentials reveal --service fish-audio --field api_key)" \
  -H "Content-Type: application/json" \
  -H "model: s2-pro" \
  -d '{
    "text": "YOUR TEXT WITH [bracket] TAGS HERE",
    "reference_id": "'"$(assistant config get services.tts.providers.fish-audio.referenceId)"'",
    "format": "mp3",
    "mp3_bitrate": 192,
    "temperature": 0.8
  }' --output scratch/OUTPUT_FILENAME.mp3
```

**Important:** This API call requires network access. Always use `network_mode: proxied` when running this command.
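The inline JSON in the command above is wrapped in single quotes, so text containing an apostrophe (`didn't`) or a double quote would break either the shell quoting or the JSON. The following is a minimal sketch of one way around that, not part of the skill itself: `json_escape`, `build_tts_payload`, and `generate_clip` are hypothetical helper names, and the escaping covers only backslash, double quote, newline, carriage return, and tab. It also adds curl's `--fail` flag so that an HTTP error yields a non-zero exit status instead of an error response saved under an `.mp3` name.

```bash
#!/usr/bin/env bash

# Escape a string for use inside a JSON string literal.
# Handles backslash, double quote, newline, carriage return, and tab only.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# Build the request body with the same fields as the curl example above.
build_tts_payload() {
  local text=$1 reference_id=$2
  printf '{"text":"%s","reference_id":"%s","format":"mp3","mp3_bitrate":192,"temperature":0.8}' \
    "$(json_escape "$text")" "$(json_escape "$reference_id")"
}

# Hypothetical wrapper around the documented request.
# Usage: generate_clip "TEXT WITH [bracket] TAGS" scratch/OUTPUT_FILENAME.mp3
generate_clip() {
  local text=$1 out=$2 api_key reference_id
  api_key=$(assistant credentials reveal --service fish-audio --field api_key) || return 1
  reference_id=$(assistant config get services.tts.providers.fish-audio.referenceId) || return 1
  curl -s --fail -X POST "https://api.fish.audio/v1/tts" \
    -H "Authorization: Bearer ${api_key}" \
    -H "Content-Type: application/json" \
    -H "model: s2-pro" \
    -d "$(build_tts_payload "$text" "$reference_id")" \
    --output "$out"
}
```

With these helpers, a call such as `generate_clip "[soft] I didn't expect this." scratch/memo.mp3` sends the apostrophe through intact. The network requirement noted above still applies.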

## Generating Multiple Clips & Combining

For longer pieces (narrations, multi-part messages), generate each clip separately, then combine the clips with ffmpeg in three steps.
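One way to prepare the per-clip text is to split the source on blank lines so that each paragraph becomes its own request. The sketch below assumes the script lives in a plain text file with paragraphs separated by blank lines; `split_paragraphs` and the `partNN.txt` naming are hypothetical choices for illustration.

```bash
#!/usr/bin/env bash

# Split a text file into one file per paragraph (blank-line separated).
# Usage: split_paragraphs script.txt scratch/parts
# Writes scratch/parts/part01.txt, part02.txt, ... and prints each path.
split_paragraphs() {
  local input=$1 outdir=$2
  mkdir -p "$outdir"
  # An empty RS puts awk in paragraph mode: records are separated by blank lines.
  awk -v RS= -v dir="$outdir" '{
    f = sprintf("%s/part%02d.txt", dir, NR)
    print > f
    close(f)
    print f
  }' "$input"
}
```

Each resulting file can then be sent as the `text` of a separate request, which keeps every clip to a single paragraph or thought.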

### 1. Generate silence for gaps between clips

```bash
ffmpeg -f lavfi -i anullsrc=r=44100:cl=mono -t 1.5 -q:a 9 -acodec libmp3lame scratch/silence.mp3 -y
```

### 2. Create a concat file

```bash
cat > scratch/concat.txt << 'EOF'
file 'clip1.mp3'
file 'silence.mp3'
file 'clip2.mp3'
file 'silence.mp3'
file 'clip3.mp3'
EOF
```

### 3. Combine

```bash
ffmpeg -f concat -safe 0 -i scratch/concat.txt -c copy scratch/final_output.mp3 -y
```
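When the number of clips varies, the concat file from step 2 can be generated rather than typed by hand. This is a small sketch under the assumption that file names contain no single quotes; `write_concat_list` is a hypothetical helper name.

```bash
#!/usr/bin/env bash

# Write an ffmpeg concat list that places a silence file between clips.
# Usage: write_concat_list scratch/concat.txt silence.mp3 clip1.mp3 clip2.mp3 ...
# Paths are written as given; ffmpeg resolves relative paths against
# the directory that contains the list file.
# Assumes file names contain no single quotes.
write_concat_list() {
  local list=$1 silence=$2 first=1 clip
  shift 2
  : > "$list"   # create or truncate the list file
  for clip in "$@"; do
    if [ "$first" -eq 0 ]; then
      printf "file '%s'\n" "$silence" >> "$list"
    fi
    printf "file '%s'\n" "$clip" >> "$list"
    first=0
  done
}
```

Calling `write_concat_list scratch/concat.txt silence.mp3 clip1.mp3 clip2.mp3 clip3.mp3` produces the same five-line list shown in step 2, ready for the combine command in step 3.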

## Bracket Syntax — Complete Guide

Fish Audio S2 uses `[bracket]` syntax for inline emotion and prosody control. This is the core of what makes the voice expressive. Tags are natural-language instructions placed directly in the text that control how words are spoken — the delivery, emotion, pacing, or vocal quality at that exact point.

**Key principle:** You are not choosing from a fixed menu. You write the description, and S2 interprets it. If you can describe it to a voice actor, S2 can attempt it. More than 15,000 unique tags are supported, and the system understands free-form descriptions.

### How Placement Works

Tags affect what comes **after** them. Place the tag at the **exact point** where the shift should happen. Placement IS meaning.

```
[whispering] I didn't want to go inside.     <- whispers the entire line
I didn't want to go [whispering] inside.     <- only whispers from "inside" onward
```

Tags can go **anywhere** — start, middle, or end of a sentence. They apply from the point they appear until the next tag or end of the sentence.
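The placement rule can be made concrete with a helper that prints which text each tag governs. This only illustrates the rule as described here and is not how S2 parses its input: `show_tag_spans` and `trim` are hypothetical names, and the sketch ignores sentence boundaries, treating a tag as active until the next tag appears.

```bash
#!/usr/bin/env bash

# Strip leading and trailing whitespace from a string.
trim() {
  local s=$1
  s=${s#"${s%%[![:space:]]*}"}
  s=${s%"${s##*[![:space:]]}"}
  printf '%s' "$s"
}

# Print "TAG -> text" for each stretch of text, using the tag that precedes it.
# Text before the first tag is reported under "(no tag)".
show_tag_spans() {
  local rest=$1 current="(no tag)" before
  # Group 1: text before the next tag, group 2: tag body, group 3: remainder.
  local re='^([^][]*)\[([^][]*)\](.*)$'
  while [[ $rest =~ $re ]]; do
    before=$(trim "${BASH_REMATCH[1]}")
    [[ -n $before ]] && printf '%s -> %s\n' "$current" "$before"
    current="[${BASH_REMATCH[2]}]"
    rest=${BASH_REMATCH[3]}
  done
  rest=$(trim "$rest")
  [[ -n $rest ]] && printf '%s -> %s\n' "$current" "$rest"
  return 0
}
```

Running it on the second example line reports `I didn't want to go` under `(no tag)` and `inside.` under `[whispering]`, matching the behavior described above.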

### Well-Tested Tags (Reliable Out of the Box)

These tags consistently produce strong results. Organized by category:

#### Emotions

| Tag             | Effect                  | Best For                    |
| --------------- | ----------------------- | --------------------------- |
| `[happy]`       | Cheerful, upbeat        | Good news, greetings        |
| `[sad]`         | Melancholic, downcast   | Sympathy, vulnerability     |
| `[angry]`       | Frustrated, aggressive  | Arguments, complaints       |
| `[excited]`     | Energetic, enthusiastic | Celebrations, announcements |
| `[surprised]`   | Shocked, amazed         | Reactions, discoveries      |
| `[embarrassed]` | Awkward, flustered      | Mistakes, confessions       |
| `[delight]`     | Very pleased, joyful    | Genuine happiness           |
| `[nervous]`     | Anxious, uncertain      | Vulnerability, apologies    |
| `[confident]`   | Assertive, self-assured | Bold statements             |
| `[nostalgic]`   | Longing for the past    | Memories, stories           |
| `[scared]`      | Frightened, fearful     | Warnings, tension           |
| `[jealous]`     | Envious, resentful      | Comparisons, possessiveness |
| `[shocked]`     | Sudden realization      | Dramatic reveals            |
| `[moved]`       | Emotionally touched     | Heartfelt moments           |

#### Voice Quality & Style

| Tag                    | Effect               | Best For                   |
| ---------------------- | -------------------- | -------------------------- |
| `[soft]`               | Gentle, tender       | Intimate moments, kindness |
| `[whisper]`            | Very quiet, close    | Secrets, tension, suspense |
| `[breathy]`            | Airy, expressive     | Vulnerability, emphasis    |
| `[low voice]`          | Deep, quiet register | Gravity, seriousness       |
| `[loud]`               | Raised volume        | Emphasis, excitement       |
| `[screaming]`          | Full volume yelling  | Anger, extreme excitement  |
| `[shouting]`           | Forceful projection  | Arguments, calling out     |
| `[emphasis]`           | Stressed delivery    | Key words, making a point  |
| `[singing]`            | Musical quality      | Playfulness, joy           |
| `[echo]`               | Reverberant effect   | Dramatic moments           |
| `[with strong accent]` | Pronounced accent    | Character work             |

#### Paralinguistic Sounds (Non-Speech Vocalizations)

| Tag                 | Effect                 | Best For                         |
| ------------------- | ---------------------- | -------------------------------- |
| `[laughing]`        | Full laugh             | Joy, humor, warmth               |
| `[chuckling]`       | Soft, low laugh        | Warmth, amusement                |
| `[giggling]`        | Light, playful laugh   | Lightheartedness, delight        |
| `[sigh]`            | Audible exhale         | Relief, longing, exasperation    |
| `[inhale]`          | Audible breath in      | Before speaking, anticipation    |
| `[exhale]`          | Breath out             | Relief, settling                 |
| `[panting]`         | Heavy breathing        | Exertion, intensity              |
| `[gasp]`            | Sharp intake of breath | Surprise, shock                  |
| `[tsk]`             | Disapproving click     | Judgment, disapproval            |
| `[clearing throat]` | Ahem                   | Transitioning, getting attention |
| `[moaning]`         | Vocal moan             | Pain, frustration                |
| `[sobbing]`         | Crying with voice      | Deep sadness                     |
| `[crying loudly]`   | Full crying            | Extreme emotion                  |

#### Pacing & Rhythm

| Tag             | Effect                     | Best For                               |
| --------------- | -------------------------- | -------------------------------------- |
| `[pause]`       | Brief silence (~0.5-1s)    | Beat between thoughts                  |
| `[short pause]` | Quick beat (~0.3s)         | Rhythm, emphasis                       |
| `[long pause]`  | Extended silence (~1.5-2s) | Dramatic tension, letting moments land |

#### Volume Control

| Tag             | Effect             | Best For           |
| --------------- | ------------------ | ------------------ |
| `[volume up]`   | Gradually louder   | Building energy    |
| `[volume down]` | Gradually quieter  | Drawing someone in |
| `[low volume]`  | Consistently quiet | Background, aside  |

### Free-Form Tags (The Real Power)

You are NOT limited to the tags above. S2 accepts **any natural language description** in brackets. The model generalizes from its training data to interpret novel instructions. Write what you would tell a voice actor:

#### Compound Emotions

- `[laughing nervously]`
- `[angry but trying to stay calm]`
- `[happy with a hint of sadness]`
- `[excited but whispering]`
- `[voice rough from crying, trying to sound normal]`

#### Specific Delivery Styles

- `[professional broadcast tone]`
- `[speaking slowly, almost hesitant]`
- `[whispering like a secret]`
- `[dead tired, end of a very long shift]`
- `[the calm, measured tone of someone who has done this a thousand times]`
- `[overly cheerful, clearly forcing it]`

#### Prosody & Pitch

- `[pitch up]`
- `[pitch down]`
- `[speaking slowly with warmth]`
- `[speaking quickly with excitement]`
- `[pitch up slightly while maintaining warmth]`
- `[trailing off]`

#### Character Directions

- `[voice breaking]`
- `[barely holding it together]`
- `[soft voice]`
- `[interrupting]`
- `[laughing tone]` (speaking while laughing, not just a laugh)
- `[excited tone]` (speaking with excitement woven through)

### Writing Great Scripts — Best Practices

#### 1. Start Simple, Then Layer

A single well-placed `[sigh]` or `[long pause]` can change a line completely. Add more tags only when the simpler version is not enough. Stacked tags compete with one another.

**Too many tags (competing):**

```
[soft] [whisper] [sad] [slow] I miss the old days.
```

**Better — one well-chosen tag:**

```
[nostalgic] I miss the old days.
```
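A quick check can flag over-tagged lines before they are sent. The sketch below is a heuristic, not a rule from Fish Audio: `count_tags` and `has_stacked_tags` are hypothetical helpers, and the default threshold of four adjacent tags is an arbitrary choice that happens to match the "too many tags" example.

```bash
#!/usr/bin/env bash

# Count bracket tags in a piece of script text.
count_tags() {
  printf '%s\n' "$1" | grep -o '\[[^][]*\]' | wc -l | tr -d ' '
}

# Succeed (exit 0) when the text contains a run of at least N adjacent tags.
# N defaults to 4, an arbitrary threshold chosen for this sketch.
has_stacked_tags() {
  local text=$1 n=${2:-4}
  grep -Eq "(\[[^][]*\][[:space:]]*){${n},}" <<< "$text"
}
```

For example, `has_stacked_tags "$line" && echo "consider trimming tags"` would print a warning for the four-tag line and stay silent for the `[nostalgic]` version.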

#### 2. Use Emotional Contrast for Impact

The most powerful moments come from sudden shifts. Going from loud to soft, angry to vulnerable, laughing to serious — the contrast is what creates emotional impact.

```
[screaming] I can't BELIEVE you did that! [long pause] [soft] ...do you even care?
```

```
[excited] Oh my god we got the apartment! [pause] [voice breaking] I can't believe it's actually happening.
```

#### 3. Let Silence Do the Work

`[pause]` and `[long pause]` are your most powerful tags. Use them:

- Before something vulnerable
- After something that needs to land
- Before a punchline or tonal shift
- To create tension or anticipation

```
[confident] I have an announcement to make. [long pause] [excited] We did it. We actually did it.
```

#### 4. Paralinguistic Sounds Add Humanity

Real people laugh, sigh, gasp, and breathe between words. Weaving these in makes speech feel alive rather than read.

```
[sigh] Look, I know this is hard. [pause] [inhale] But we need to talk about it.
```

```
I told him the news and he just — [laughing] he literally dropped his coffee.
```

#### 5. Match Tag Intensity to Content

Do not use `[screaming]` for mild annoyance or `[sobbing]` for minor disappointment. The tag should match the emotional weight of the words.

#### 6. Use Free-Form Tags for Nuance

When a single-word tag is not enough, describe the exact delivery you want:

```
[speaking slowly, choosing each word carefully] I think we should reconsider our approach.
```

This gives S2 much richer information than just `[slow]` or `[sad]`.

#### 7. Emotion Transitions Within a Single Passage

S2 excels at dynamic emotional shifts. Use this for natural-feeling monologues:

```
[excited] I got the promotion! [pause] [uncertain] But... it means relocating. [sad] I'll miss everyone here. [long pause] [hopeful] Maybe it'll be worth it though.
```

### Example Scripts

**Narration (audiobook style):**

```
[soft] The city was quiet that morning. [pause] Not the peaceful kind of quiet — [long pause] [low voice] the kind that makes you hold your breath. [inhale] [whisper] Something was about to change. [pause] [confident] And everyone knew it.
```

**Podcast intro:**

```
[excited] Welcome back to another episode! [pause] [professional broadcast tone] Today we're diving into something I've been researching for months. [chuckling] And honestly? It blew my mind. [pause] [volume down] [speaking slowly with warmth] So grab your coffee, get comfortable, and let's get into it.
```

**Dramatic reading:**

```
[soft] She stood at the edge of the platform, [pause] watching the last train pull away. [long pause] [voice breaking] It wasn't supposed to end like this. [sigh] [whisper] None of it was. [pause] [angry but trying to stay calm] And yet here she stood — [emphasis] alone — [long pause] [nostalgic] remembering a time when the station was full of laughter.
```

**Announcement:**

```
[confident] Attention everyone. [pause] [excited] After three years of development, [volume up] we are thrilled to announce [emphasis] the official launch! [long pause] [laughing] I know, I know — it's been a long time coming. [pause] [soft] But we wanted to get it right. [pause] [professional broadcast tone] And we did.
```

## API Parameters

| Parameter      | Default       | Description                                   |
| -------------- | ------------- | --------------------------------------------- |
| `text`         | (required)    | The text to synthesize, with [bracket] tags   |
| `reference_id` | (from config) | Voice model ID                                |
| `format`       | `mp3`         | Output format: `mp3`, `wav`, `pcm`, `opus`    |
| `mp3_bitrate`  | `192`         | MP3 bitrate in kbps: `64`, `128`, `192`       |
| `temperature`  | `0.8`         | Expressiveness (higher = more varied)         |
| `top_p`        | `0.7`         | Diversity via nucleus sampling                |
| `chunk_length` | `300`         | Text segment size (100-300)                   |
| `latency`      | `normal`      | Quality tradeoff: `normal`, `balanced`, `low` |
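The allowed values in the table can be checked locally before a request is made. This is a sketch that mirrors only the values listed above (it does not query the API), and `validate_tts_params` is a hypothetical helper name.

```bash
#!/usr/bin/env bash

# Check request settings against the values listed in the parameter table.
# Usage: validate_tts_params FORMAT MP3_BITRATE CHUNK_LENGTH
# Prints a message to stderr and returns 1 on the first invalid value.
validate_tts_params() {
  local format=$1 bitrate=$2 chunk_length=$3
  case $format in
    mp3|wav|pcm|opus) ;;
    *) echo "unsupported format: $format" >&2; return 1 ;;
  esac
  # The bitrate only matters when the output format is mp3.
  if [ "$format" = "mp3" ]; then
    case $bitrate in
      64|128|192) ;;
      *) echo "unsupported mp3_bitrate: $bitrate" >&2; return 1 ;;
    esac
  fi
  case $chunk_length in
    ''|*[!0-9]*) echo "chunk_length must be an integer" >&2; return 1 ;;
  esac
  if [ "$chunk_length" -lt 100 ] || [ "$chunk_length" -gt 300 ]; then
    echo "chunk_length out of range (100-300): $chunk_length" >&2
    return 1
  fi
  return 0
}
```

A call such as `validate_tts_params mp3 192 300 || exit 1` at the top of a generation script catches typos before any network request is attempted.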

## Tips

- **Temperature 0.7-0.8** works best for expressive, natural speech
- **Break long texts into multiple clips** — each clip should be a natural paragraph or thought
- **Add 1-1.5s silence between clips** when combining for natural pacing
- **Listen and iterate** — generate a few takes with different temperatures if the first one does not hit right
- **The voice carries context** — `condition_on_previous_chunks: true` (default) helps maintain consistency within a single API call
- Always deliver the final audio to the user with `<vellum-attachment>` tags
- Only use `[bracket]` syntax inside text passed to the Fish Audio API, not in regular text responses
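Because bracket tags belong only in the text sent to the API, a plain transcript for the regular text response can be derived from the same script by stripping the tags. The following is a minimal sketch; `strip_tags` is a hypothetical helper, and it assumes tags never contain nested brackets.

```bash
#!/usr/bin/env bash

# Remove [bracket] tags and the whitespace that follows them,
# then trim trailing whitespace on each line.
strip_tags() {
  printf '%s\n' "$1" | sed -E 's/\[[^][]*\][[:space:]]*//g; s/[[:space:]]+$//'
}
```

For instance, passing `[sigh] Look, I know this is hard. [pause] [inhale] But we need to talk about it.` yields the untagged sentence pair, which can accompany the audio attachment.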
