gemini-tts
$ npx mdskill add sonichi/sutando/gemini-tts
Synthesizes text to speech using Google Gemini TTS models
- Converts written text into natural-sounding audio for narration and voiceovers
- Uses Google Gemini TTS models including flash, pro, and lite variants
- Interprets inline style tags like [whispers] and [excitedly] for expressive output
- Returns synthesized audio as MP3 files for immediate use in videos or audio projects
SKILL.md (.github/skills/gemini-tts)
---
name: gemini-tts
description: "Render text to mp3 via Google Gemini Flash TTS. Free-tier eligible (1500 req/day). Use for video narration, demo voiceovers, audio notes. Parallels openai-tts; default for make-viral-video."
user-invocable: true
---
# Gemini TTS
Synthesize speech via Google's `gemini-2.5-flash-preview-tts` by default; set the `GEMINI_TTS_MODEL` env var to select another model such as `gemini-2.5-pro-tts` or `gemini-2.5-flash-lite-preview-tts`. Reads `GEMINI_API_KEY` from `.env`.
This is one-shot, non-realtime synthesis, distinct from voice-agent's bidirectional Gemini Live audio. It uses the same model family behind a different surface: POST text, get audio bytes back, no streaming.
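The "POST text, get audio bytes back" shape can be sketched as below. This is not the contents of `scripts/synthesize.sh`, which this document does not show. The helper names `build_tts_payload` and `request_tts` are hypothetical, and the endpoint and field names are assumed from Google's public `generateContent` REST documentation.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the skill's synthesize.sh.

# Build the JSON request body for a single-voice TTS call.
build_tts_payload() {
  local text="$1" voice="${2:-Aoede}"
  # Minimal JSON escaping: backslashes first, then double quotes.
  # Newlines and other control characters are NOT handled here.
  text=${text//\\/\\\\}
  text=${text//\"/\\\"}
  printf '{"contents":[{"parts":[{"text":"%s"}]}],' "$text"
  printf '"generationConfig":{"responseModalities":["AUDIO"],'
  printf '"speechConfig":{"voiceConfig":{"prebuiltVoiceConfig":{"voiceName":"%s"}}}}}\n' "$voice"
}

# POST the payload and print the raw JSON response (assumed endpoint shape).
request_tts() {
  local model="${GEMINI_TTS_MODEL:-gemini-2.5-flash-preview-tts}"
  build_tts_payload "$1" "${2:-Aoede}" |
    curl -sS -X POST \
      -H "x-goog-api-key: ${GEMINI_API_KEY:?GEMINI_API_KEY is not set}" \
      -H "Content-Type: application/json" \
      --data-binary @- \
      "https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent"
}
```

Under these assumptions the response is JSON carrying base64-encoded audio, so a real script still has to decode it and transcode to mp3. How the skill does that step is not described here.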
**Usage**: `/gemini-tts [text]`
ARGUMENTS: $ARGUMENTS
## Voices
`Aoede` (default — alto, neutral), `Charon` (baritone, news-anchor), `Kore` (mid, expressive), `Puck` (high, conversational). Per Lucy's 2026-05-09 testing: Aoede is the closest match to OpenAI's `sage`.
## Audio tags for expression
Inline bracket tags like `[whispers]`, `[excitedly]`, `[slowly]` are interpreted as stylistic direction, not spoken literally. This was verified empirically against `gemini-2.5-flash-preview-tts` (per PR #646 comment): `[whispers] hello` produced 1.05s of audio and `hello` alone produced 1.01s. Had the tag been read aloud, the tagged clip would be noticeably longer rather than within 0.04s of the untagged one.
```bash
bash "$SKILL_DIR/scripts/synthesize.sh" -- "[whispers] Pull request 691 has landed."
```
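Since the tags are direction rather than spoken words, a caption or transcript built from the same script may want them removed. A minimal sketch follows, using a hypothetical helper `strip_audio_tags` that is not part of the skill. It assumes tags never contain a closing bracket.

```bash
#!/usr/bin/env bash
# Hypothetical helper: drop [bracketed] style tags and the whitespace after them.
strip_audio_tags() {
  printf '%s\n' "$1" | sed -E 's/\[[^]]*\][[:space:]]*//g'
}

strip_audio_tags "[whispers] Pull request 691 has landed."
# prints: Pull request 691 has landed.
```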
## Model selection
Default: `gemini-2.5-flash-preview-tts` (free tier, 1500 req/day, $0 within quota).
Override via `GEMINI_TTS_MODEL` env var:
- `gemini-2.5-pro-tts` — paid, higher fidelity
- `gemini-2.5-flash-lite-preview-tts` — preview, faster
- `gemini-3.1-flash-tts-preview` — preview
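The override can be expressed with bash's default-value expansion. This is a sketch of the pattern, not a copy of the script; `resolve_tts_model` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: use GEMINI_TTS_MODEL if set and non-empty,
# otherwise fall back to the documented default model.
resolve_tts_model() {
  printf '%s\n' "${GEMINI_TTS_MODEL:-gemini-2.5-flash-preview-tts}"
}
```

The `:-` form treats an empty `GEMINI_TTS_MODEL` the same as an unset one, so a blank entry in `.env` still yields the default.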
## Examples
```bash
bash "$SKILL_DIR/scripts/synthesize.sh" -- "Hello, this is Sutando."
bash "$SKILL_DIR/scripts/synthesize.sh" --voice Charon --out /tmp/intro.mp3 -- "Hi."
GEMINI_TTS_MODEL=gemini-2.5-pro-tts bash "$SKILL_DIR/scripts/synthesize.sh" -- "High-fidelity narration."
```
Default output path: `results/gemini-tts-{epoch}.mp3`.
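A path of that shape could be generated as follows, assuming `{epoch}` means Unix seconds. `default_out_path` is a hypothetical helper, and the script's real logic may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build results/gemini-tts-<unix-seconds>.mp3.
# The optional first argument overrides the output directory.
default_out_path() {
  local dir="${1:-results}"
  printf '%s/gemini-tts-%s.mp3\n' "$dir" "$(date +%s)"
}
```

Two calls within the same second produce the same path, so back-to-back runs are safer with an explicit `--out`.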
## Cost
Free tier: $0 within the 1500 req/day quota. At our cadence (a few demos a day), usage stays well inside that quota.
Paid (Flash): $0.50 / 1M input tokens + $10.00 / 1M output tokens.
Compared to OpenAI TTS (`gpt-4o-mini-tts`) at ~$0.02 per 60s, Gemini Flash is effectively free for typical demo workloads.
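For paid usage, the Flash rates above reduce to simple arithmetic. The sketch below uses a hypothetical helper `estimate_flash_cost`. The token counts are inputs you must supply, since this document does not say how many tokens a given clip consumes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate paid Flash cost in USD.
# Usage: estimate_flash_cost <input_tokens> <output_tokens>
# Rates from this document: $0.50 per 1M input tokens, $10.00 per 1M output tokens.
estimate_flash_cost() {
  awk -v in_tok="$1" -v out_tok="$2" 'BEGIN {
    printf "%.4f\n", in_tok * 0.50 / 1000000 + out_tok * 10.00 / 1000000
  }'
}

estimate_flash_cost 1000000 1000000   # prints 10.5000
```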
## When to fall back to openai-tts
The `make-viral-video` skill auto-falls-back to OpenAI TTS when:
- Gemini API returns 4xx/5xx
- Gemini quota hit (429)
- `GEMINI_API_KEY` missing
- `TTS_PROVIDER=OPENAI` env override set
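The four conditions above can be folded into one predicate. This is an illustrative sketch, not the actual `make-viral-video` code; `should_fall_back` is a hypothetical name, and it assumes the caller passes the HTTP status of the Gemini request.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 (true) when the caller should use OpenAI TTS.
# Usage: should_fall_back <http_status>
should_fall_back() {
  local status="${1:-0}"
  # No Gemini key available.
  [ -z "${GEMINI_API_KEY:-}" ] && return 0
  # Explicit provider override.
  [ "${TTS_PROVIDER:-}" = "OPENAI" ] && return 0
  # Any 4xx/5xx response, which includes the 429 quota error.
  [ "$status" -ge 400 ] && return 0
  return 1
}
```

A 429 is already covered by the 4xx check, so the quota case needs no separate branch in this sketch.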
## If Invoked As A Slash Command
If ARGUMENTS is empty, ask the user for the text. Otherwise:
```bash
bash "$SKILL_DIR/scripts/synthesize.sh" -- "$ARGUMENTS"
```