vibe-voice
$
npx mdskill add TerminalSkills/skills/vibe-voiceGenerate realistic speech and transcribe audio instantly.
- Enables voice assistants and real-time chat applications.
- Depends on Microsoft's open-source VibeVoice models.
- Selects model size based on required latency or quality.
- Delivers audio output or text transcripts directly.
SKILL.md
.github/skills/vibe-voiceView on GitHub ↗
---
name: vibe-voice
description: >-
Build voice AI applications using Microsoft's VibeVoice — open-source frontier voice
synthesis, recognition, and real-time conversation. Use when: building voice assistants,
adding TTS/STT to applications, creating real-time voice chat, voice cloning.
license: MIT
compatibility: "Python 3.10+"
metadata:
author: terminal-skills
version: "1.0.0"
category: data-ai
tags:
- voice
- tts
- stt
- speech
- real-time
---
# VibeVoice
Open-source frontier voice AI from Microsoft. Family of models for text-to-speech (TTS), automatic speech recognition (ASR), and real-time streaming synthesis.
GitHub: [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)
## Overview
VibeVoice is Microsoft's open-source voice AI platform offering three model sizes: ASR (7B) for transcription with speaker diarization, TTS (1.5B) for multi-speaker long-form synthesis, and Realtime (0.5B) for low-latency streaming. It supports 50+ languages and runs on consumer GPUs.
## Instructions
### Installation
```bash
pip install vibevoice
# Or clone for development
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -e .
```
### Hardware Requirements
- **ASR (7B)**: ~16GB VRAM (GPU recommended)
- **TTS (1.5B)**: ~6GB VRAM
- **Realtime (0.5B)**: ~2GB VRAM (runs on consumer GPUs)
### Text-to-Speech (TTS)
```python
from vibevoice import VibeVoiceTTS
model = VibeVoiceTTS.from_pretrained("microsoft/VibeVoice-1.5B")
# Single speaker
audio = model.synthesize(
text="Hello, welcome to the future of voice AI.",
speaker="default"
)
audio.save("output.wav")
```
### Multi-Speaker Conversation
Generate podcast-style audio with up to 4 distinct speakers:
```python
conversation = [
{"speaker": "host", "text": "Welcome to the show! Today we're discussing AI."},
{"speaker": "guest1", "text": "Thanks for having me. I'm excited to dive in."},
{"speaker": "host", "text": "Let's start with the biggest trends you're seeing."},
{"speaker": "guest2", "text": "I think voice AI is the most underrated development."},
]
audio = model.synthesize_conversation(conversation)
audio.save("podcast.wav") # Up to 90 minutes in a single pass
```
### Real-Time Streaming TTS
```python
from vibevoice import VibeVoiceRealtime
model = VibeVoiceRealtime.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")
for audio_chunk in model.stream("This is being generated in real time."):
play_audio(audio_chunk)
```
### Automatic Speech Recognition (ASR)
```python
from vibevoice import VibeVoiceASR
model = VibeVoiceASR.from_pretrained("microsoft/VibeVoice-ASR")
# Basic transcription
result = model.transcribe("meeting_recording.wav")
print(result.text)
# Rich transcription with diarization
result = model.transcribe("meeting.wav", diarize=True, timestamps=True)
for segment in result.segments:
print(f"[{segment.start:.1f}s - {segment.end:.1f}s] "
f"Speaker {segment.speaker}: {segment.text}")
```
### Custom Hotwords
```python
result = model.transcribe(
"medical_consultation.wav",
hotwords=["Lisinopril", "Metformin", "HbA1c", "systolic"]
)
```
## Examples
### Example 1: Build a Voice Assistant
```python
from vibevoice import VibeVoiceASR, VibeVoiceRealtime
asr = VibeVoiceASR.from_pretrained("microsoft/VibeVoice-ASR")
tts = VibeVoiceRealtime.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")
# Listen → Transcribe → Respond → Speak
user_text = asr.transcribe("user_input.wav").text
response = generate_response(user_text) # Your LLM call
for chunk in tts.stream(response):
play_audio(chunk)
```
### Example 2: Meeting Transcription with Speaker Labels
```python
result = asr.transcribe("team_standup.wav", diarize=True, timestamps=True)
for segment in result.segments:
print(f"[{segment.start:.1f}s] Speaker {segment.speaker}: {segment.text}")
# Output:
# [0.0s] Speaker 1: Let's review the Q3 numbers.
# [3.5s] Speaker 2: Revenue is up 15% from last quarter.
# [8.4s] Speaker 1: That's great. What about customer acquisition?
```
## Guidelines
- Use the Realtime (0.5B) model for voice assistants where latency matters
- The ASR model handles up to 60 minutes of audio in a single pass
- TTS supports up to 90 minutes of multi-speaker audio generation
- Use custom hotwords for domain-specific terms (medical, legal, technical)
- For production, consider vLLM for faster ASR inference
- Multilingual voices are experimental — test quality before deploying
- 7.5 Hz frame rate enables efficient long-sequence processing
## Resources
- [Project Page](https://microsoft.github.io/VibeVoice)
- [HuggingFace Collection](https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f)
- [TTS Paper](https://arxiv.org/pdf/2508.19205)
- [ASR Paper](https://arxiv.org/pdf/2601.18184)
- [Colab: Streaming TTS](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/VibeVoice_colab.ipynb)
More from TerminalSkills/skills