video-audio-design

Name: video-audio-design
Author: mkurman/zorai
$npx mdskill add mkurman/zorai/video-audio-design
Generate professional video audio with layered narration, SFX, and ducking.
Creates polished soundtracks for programmatic video content.
Integrates ElevenLabs TTS, FFmpeg, and Remotion tools.
Prioritizes narration duration to drive scene timing.
Delivers mixed audio layers with smooth volume transitions.
SKILL.md
.github/skills/video-audio-designView on GitHub ↗
---
name: video-audio-design
version: 0.1.0
description: >
  Use this skill when adding audio to programmatic videos - generating narration
  with ElevenLabs TTS, sourcing royalty-free background music, creating SFX with
  FFmpeg, implementing audio ducking, or mixing multiple audio layers in Remotion.
  Triggers on ElevenLabs, text-to-speech, voice generation, background music,
  sound effects, audio mixing, and volume ducking.
tags: [elevenlabs, tts, audio-design, sfx, background-music, audio-mixing, experimental-design]
category: video
recommended_skills: [remotion-video, video-creator, video-scriptwriting]
platforms:
  - claude-code
  - gemini-cli
  - openai-codex
license: MIT
maintainers:
  - github: maddhruv
---

## Key principles

1. **Layered audio architecture** - Every video has three audio layers:
   narration on top (loudest), SFX in the middle (accent volume), and
   background music at the base (lowest).

2. **Narration drives timing** - Generate narration first, measure its
   duration, then set scene timing to match. Never fit narration into
   arbitrary scene lengths.

3. **Duck music during speech** - Background music must drop 50-60% when
   narration plays. Use smooth ramps (10-15 frames) to avoid jarring jumps.

4. **SFX as accents, not distractions** - Keep SFX short (under 0.5s),
   subtle in volume, and relevant to on-screen action.

5. **Test audio in context** - Always preview the full mix with all layers
   together. Listen for muddy speech, volume spikes, or dead silence.

---

## Core concepts

### 3-layer audio architecture

| Layer | Role | Base Volume | During Narration |
|---|---|---|---|
| Narration | Conveys information, drives pacing | 0.8-1.0 | N/A (top layer) |
| SFX | Accents transitions and actions | 0.3-0.5 | 0.3-0.5 (unchanged) |
| Background Music | Sets emotional tone, fills silence | 0.3-0.5 | 0.15-0.25 (ducked) |

### ElevenLabs API model

ElevenLabs provides neural TTS via a REST API. The core flow:
1. Pick a voice (pre-made or cloned) - each has a `voice_id`
2. Send text + voice settings to `/v1/text-to-speech/{voice_id}`
3. Receive raw audio bytes (mp3 by default)
4. Write to file and measure duration for scene timing

Voice settings:

| Setting | Range | Low | High | Recommended |
|---|---|---|---|---|
| stability | 0-1 | More expressive, variable | More consistent, monotone | 0.4-0.6 |
| similarity_boost | 0-1 | More creative | Closer to original voice | 0.6-0.8 |
| style | 0-1 | Neutral delivery | Exaggerated style | 0.3-0.6 |

### Audio ducking concept

Audio ducking reduces background music volume when narration starts and
restores it when narration ends. In Remotion, use `interpolate()`:

```
Music volume:  0.4 ---\              /--- 0.4
                       \            /
               0.15     \__________/
                     narration start → end
```

Ramps should take 10-15 frames (~0.3-0.5s at 30fps).

### Frame-based audio sync in Remotion

- `useCurrentFrame()` returns the current frame number
- `interpolate()` maps frame ranges to value ranges (e.g., volume)
- `<Sequence from={frame}>` places audio at a specific frame
- `<Audio volume={fn}>` accepts a static number or a per-frame function

Convert seconds to frames: `frames = seconds * fps`.

---

## Common tasks

### 1. Set up ElevenLabs API key and generate narration

```typescript
import fs from 'fs';

const ELEVENLABS_API_URL = 'https://api.elevenlabs.io/v1';

async function generateNarration(
  text: string,
  voiceId: string,
  outputPath: string
): Promise<void> {
  const response = await fetch(
    `${ELEVENLABS_API_URL}/text-to-speech/${voiceId}`,
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'xi-api-key': process.env.ELEVEN_LABS_API_KEY!,
      },
      body: JSON.stringify({
        text,
        model_id: 'eleven_multilingual_v2',
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.75,
          style: 0.5,
          use_speaker_boost: true,
        },
      }),
    }
  );

  if (!response.ok) {
    const error = await response.text();
    throw new Error(`ElevenLabs API error ${response.status}: ${error}`);
  }

  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(outputPath, buffer);
}
```

### 2. Select and configure voice settings

Voice selection questions: gender, age range, accent, energy level, warmth.

```typescript
interface VoiceSettings {
  stability: number;
  similarity_boost: number;
  style: number;
  use_speaker_boost: boolean;
}

const presets: Record<string, VoiceSettings> = {
  explainer: { stability: 0.6, similarity_boost: 0.75, style: 0.4, use_speaker_boost: true },
  promo: { stability: 0.3, similarity_boost: 0.7, style: 0.7, use_speaker_boost: true },
  tutorial: { stability: 0.7, similarity_boost: 0.8, style: 0.2, use_speaker_boost: false },
};
```

### 3. Generate narration per scene from a script

```typescript
import { execSync } from 'child_process';
import path from 'path';

interface Scene { id: string; narrationText: string; }
interface SceneWithAudio extends Scene {
  audioPath: string;
  durationMs: number;
  durationFrames: number;
}

function getAudioDurationMs(filePath: string): number {
  const output = execSync(
    `ffprobe -v error -show_entries format=duration -of csv=p=0 "${filePath}"`
  ).toString().trim();
  return Math.round(parseFloat(output) * 1000);
}

async function generateSceneNarrations(
  scenes: Scene[], voiceId: string, outputDir: string, fps: number
): Promise<SceneWithAudio[]> {
  const results: SceneWithAudio[] = [];
  for (const scene of scenes) {
    const audioPath = path.join(outputDir, `${scene.id}.mp3`);
    await generateNarration(scene.narrationText, voiceId, audioPath);
    const durationMs = getAudioDurationMs(audioPath);
    results.push({
      ...scene, audioPath, durationMs,
      durationFrames: Math.ceil((durationMs / 1000) * fps),
    });
  }
  return results;
}
```

### 4. Source background music

Royalty-free music sources:
- **Pixabay Audio**: https://pixabay.com/music/ (free, no attribution)
- **Freesound**: https://freesound.org/ (CC0/CC-BY)
- **YouTube Audio Library**: download from YouTube Studio
- **Local files**: place in `public/audio/` for Remotion's `staticFile()`

### 5. Generate SFX with FFmpeg

```bash
# Click sound - short sine burst
ffmpeg -f lavfi -i "sine=frequency=800:duration=0.05" \
  -af "afade=t=out:st=0.02:d=0.03" click.wav

# Keyboard typing - filtered noise burst
ffmpeg -f lavfi -i "anoisesrc=d=0.08:c=white:a=0.3" \
  -af "highpass=f=2000,lowpass=f=8000,afade=t=out:st=0.04:d=0.04" type.wav

# Whoosh - frequency sweep
ffmpeg -f lavfi -i "sine=frequency=200:duration=0.4" \
  -af "vibrato=f=8:d=0.5,afade=t=in:d=0.1,afade=t=out:st=0.2:d=0.2,lowpass=f=1000" \
  whoosh.wav

# Ding/chime - bell synthesis
ffmpeg -f lavfi -i "sine=frequency=1200:duration=0.6" \
  -af "afade=t=out:st=0.1:d=0.5,aecho=0.8:0.88:40:0.4" ding.wav

# Pop - impulse
ffmpeg -f lavfi -i "sine=frequency=400:duration=0.08" \
  -af "afade=t=out:st=0.02:d=0.06,lowpass=f=600" pop.wav

# Transition swoosh
ffmpeg -f lavfi -i "sine=frequency=300:duration=0.3" \
  -af "vibrato=f=12:d=0.8,afade=t=in:d=0.05,afade=t=out:st=0.15:d=0.15,bandpass=f=500:w=400" \
  swoosh.wav
```

### 6. Implement audio ducking in Remotion

```tsx
import React from 'react';
import { Audio, useCurrentFrame, interpolate, Sequence } from 'remotion';

const AudioMixer: React.FC<{
  narrationSrc: string;
  musicSrc: string;
  narrationStart: number;
  narrationDuration: number;
}> = ({ narrationSrc, musicSrc, narrationStart, narrationDuration }) => {
  const frame = useCurrentFrame();

  const duckRampFrames = 10;
  const musicVolume = interpolate(
    frame,
    [
      narrationStart - duckRampFrames,
      narrationStart,
      narrationStart + narrationDuration,
      narrationStart + narrationDuration + duckRampFrames,
    ],
    [0.4, 0.15, 0.15, 0.4],
    { extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
  );

  return (
    <>
      <Audio src={musicSrc} volume={musicVolume} />
      <Sequence from={narrationStart} durationInFrames={narrationDuration}>
        <Audio src={narrationSrc} volume={0.9} />
      </Sequence>
    </>
  );
};

export default AudioMixer;
```

### 7. Mix 3 audio layers in a Remotion composition

```tsx
import React from 'react';
import { Audio, Sequence, useCurrentFrame, interpolate } from 'remotion';

interface NarrationSegment { src: string; startFrame: number; durationFrames: number; }
interface SfxEvent { src: string; frame: number; }

const FullAudioMix: React.FC<{
  narrations: NarrationSegment[];
  sfxEvents: SfxEvent[];
  musicSrc: string;
}> = ({ narrations, sfxEvents, musicSrc }) => {
  const frame = useCurrentFrame();
  const duckRamp = 10;

  let musicVolume = 0.4;
  for (const seg of narrations) {
    const duck = interpolate(
      frame,
      [seg.startFrame - duckRamp, seg.startFrame,
       seg.startFrame + seg.durationFrames, seg.startFrame + seg.durationFrames + duckRamp],
      [1, 0.375, 0.375, 1],
      { extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
    );
    musicVolume = musicVolume * duck;
  }

  return (
    <>
      <Audio src={musicSrc} volume={musicVolume} loop />
      {sfxEvents.map((sfx, i) => (
        <Sequence key={i} from={sfx.frame} durationInFrames={30}>
          <Audio src={sfx.src} volume={0.4} />
        </Sequence>
      ))}
      {narrations.map((seg, i) => (
        <Sequence key={i} from={seg.startFrame} durationInFrames={seg.durationFrames}>
          <Audio src={seg.src} volume={0.9} />
        </Sequence>
      ))}
    </>
  );
};

export default FullAudioMix;
```

### 8. Use alternative TTS providers

**OpenAI TTS** - good quality, simple API, six built-in voices:

```typescript
import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI();

async function generateWithOpenAI(
  text: string,
  outputPath: string,
  voice: 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer' = 'alloy'
): Promise<void> {
  const mp3 = await openai.audio.speech.create({
    model: 'tts-1-hd',
    voice,
    input: text,
  });
  const buffer = Buffer.from(await mp3.arrayBuffer());
  fs.writeFileSync(outputPath, buffer);
}
```

**Edge TTS** - free, many voices, uses Microsoft Edge's TTS service:

```bash
pip install edge-tts
edge-tts --voice en-US-AriaNeural --text "Hello world" --write-media output.mp3
edge-tts --list-voices
```

---

## Anti-patterns / common mistakes

| Mistake | Why it is wrong | What to do instead |
|---|---|---|
| Music same volume during narration | Speech becomes unintelligible | Implement audio ducking - drop music 50-60% during speech |
| Hardcoding ElevenLabs API key | Key leaks into version control | Use environment variables: `process.env.ELEVEN_LABS_API_KEY` |
| Using TTS without measuring duration | Scene timing wrong, narration cut off | Measure audio duration with ffprobe after generation |
| SFX louder than narration | Distracts from content | SFX at 0.3-0.5, narration at 0.8-1.0 |
| No fade on music start/end | Abrupt start/stop sounds like a bug | Add 0.5-1s fade-in at start and fade-out at end |
| Using low-quality TTS model | Robotic voice undermines quality | Use eleven_multilingual_v2 or tts-1-hd |
| Ignoring audio file format | Some formats add silence padding | Use MP3 for narration, WAV for SFX |

---

## Gotchas

1. **ElevenLabs rate limits and character quotas** - The free tier has a monthly character limit. Cache generated audio aggressively and only regenerate when text changes. Use a hash of the text as the cache key.

2. **MP3 encoder padding adds silence** - MP3 files often have 20-50ms of silence at the start. Trim with `ffmpeg -af silenceremove=1:0:-50dB` or account for the offset in frame timing.

3. **Remotion Audio volume is per-component, not global** - Two `<Audio>` components at volume 1.0 can clip. Keep total volume across simultaneous layers under 1.0.

4. **FFmpeg SFX sound different across systems** - Always specify `-ar 44100 -sample_fmt s16` for consistent output across machines.

5. **Voice consistency across scenes** - ElevenLabs can produce different tones for the same settings with varying text. Use stability >= 0.5 for multi-scene narration.

---

## References

For detailed patterns on specific audio sub-domains, read the relevant file
from the `references/` folder:

- `references/elevenlabs-api.md` - advanced ElevenLabs API patterns including
  voice cloning, streaming TTS, websocket API, pronunciation dictionaries,
  and quota management
- `references/audio-mixing-patterns.md` - advanced mixing patterns including
  multi-segment ducking, crossfades between scenes, volume automation curves,
  and mastering the final mix
- `references/sfx-generation.md` - comprehensive SFX generation with FFmpeg
  including complex synthesis, layering multiple generators, and building a
  reusable SFX library

Only load a references file if the current task requires it - they are
long and will consume context.

---

## Companion check

> On first activation of this skill in a conversation: check which companion skills are installed by running `ls ~/.claude/skills/ ~/.agent/skills/ ~/.agents/skills/ .claude/skills/ .agent/skills/ .agents/skills/ 2>/dev/null`. Compare the results against the `recommended_skills` field in this file's frontmatter. For any that are missing, mention them once and offer to install:
> ```
> npx skills add AbsolutelySkilled/AbsolutelySkilled --skill <name>
> ```
> Skip entirely if `recommended_skills` is empty or all companions are already installed.