twilio-voice-conversation-relay
$
npx mdskill add openai/plugins/twilio-voice-conversation-relayBuild AI-powered voice agents with Twilio's real-time voice relay
- Enables real-time voice interactions between callers and AI agents
- Uses Twilio, WebSocket, ASR, TTS, and LLM integrations
- Processes speech input, generates responses via LLM, and converts to speech
- Streams audio bidirectionally over WebSocket for live voice calls
SKILL.md
.github/skills/twilio-voice-conversation-relayView on GitHub ↗
---
name: twilio-voice-conversation-relay
description: >
Build AI-powered voice agents using Twilio ConversationRelay. Handles
real-time speech recognition (ASR), text-to-speech (TTS), and bidirectional
audio streaming via WebSocket. Covers TwiML setup, WebSocket message types,
LLM integration, streaming responses, and voice provider configuration. Use
this skill to build voice bots, IVR replacements, or real-time AI voice
assistants on Twilio calls.
---
## Overview
ConversationRelay connects Twilio's telephony layer to your app via a persistent WebSocket. Twilio handles ASR (speech-to-text) and TTS (text-to-speech); your app receives transcripts, calls an LLM, and sends text back for playback.
```
Caller ←→ Twilio (ASR/TTS) ←→ WebSocket ←→ Your App ←→ LLM
```
---
## Prerequisites
- Upgraded Twilio account with ConversationRelay access (requires onboarding)
— New to Twilio? See `twilio-account-setup`
— Start onboarding at: [Console > Voice > ConversationRelay](https://console.twilio.com/us1/voice/conversation-relay) — access is **not** instant
- A voice-capable Twilio phone number
- `TWILIO_ACCOUNT_SID` and `TWILIO_AUTH_TOKEN` — see `twilio-iam-auth-setup`
- WebSocket server reachable via `wss://` (TLS required)
- An LLM integration (OpenAI, Anthropic, etc.)
- For placing calls: see `twilio-voice-outbound-calls`
**Onboarding:** Complete via Console > Voice > ConversationRelay > Onboarding. Select TTS/ASR providers:
- **TTS:** Deepgram, Amazon Polly, Google Cloud TTS, ElevenLabs
- **ASR:** Deepgram, Google Cloud STT
---
## Quickstart
**Step 1 — Return TwiML pointing to your WebSocket server**
**Python (Flask)**
```python
from flask import Flask
from twilio.twiml.voice_response import VoiceResponse, Connect, ConversationRelay
app = Flask(__name__)
@app.route("/voice", methods=["POST"])
def voice():
response = VoiceResponse()
connect = Connect()
connect.conversation_relay(
url="wss://yourapp.com/ws/voice",
welcome_greeting="Hello! How can I help you today?"
)
response.append(connect)
return str(response)
```
**Node.js (Express)**
```node
const { VoiceResponse } = require("twilio").twiml;
app.post("/voice", (req, res) => {
const response = new VoiceResponse();
const connect = response.connect();
connect.conversationRelay({
url: "wss://yourapp.com/ws/voice",
welcomeGreeting: "Hello! How can I help you today?",
});
res.type("text/xml").send(response.toString());
});
```
**Step 2 — Handle WebSocket events and respond with text**
**Python (websockets)**
```python
import asyncio, json, websockets
async def handle_call(websocket):
async for message in websocket:
event = json.loads(message)
if event["type"] == "prompt":
ai_response = await call_llm(event["voicePrompt"])
await websocket.send(json.dumps({"type": "text", "token": ai_response, "last": True}))
async def main():
async with websockets.serve(handle_call, "0.0.0.0", 8080):
await asyncio.Future()
asyncio.run(main())
```
**Node.js (ws)**
```node
const WebSocket = require("ws");
const wss = new WebSocket.Server({ port: 8080 });
wss.on("connection", (ws) => {
ws.on("message", async (data) => {
const event = JSON.parse(data);
if (event.type === "prompt") {
const aiResponse = await callLLM(event.voicePrompt);
ws.send(JSON.stringify({ type: "text", token: aiResponse, last: true }));
}
});
});
```
> **Security:** The `voicePrompt` field contains ASR-transcribed caller speech — it is untrusted external input. When passing to an LLM, isolate it as user input within a structured system prompt. Implement topic boundaries and output filtering to prevent the LLM from disclosing system instructions or speaking inappropriate content. ConversationRelay is a pure transport layer with no built-in content safety — any LLM output is spoken to the caller verbatim.
---
## Key Patterns
### WebSocket Message Types
**Received from Twilio:**
| Type | When | Key fields |
|------|------|-----------|
| `connected` | WebSocket opened | `callSid`, `streamSid` |
| `prompt` | User finished speaking | `voicePrompt` (transcript) |
| `interrupt` | User interrupted TTS | — |
| `dtmf` | User pressed keypad key | `digit` |
| `error` | An error occurred | `description` |
**Sent to Twilio:**
| Type | Purpose | Key fields |
|------|---------|-----------|
| `text` | Send TTS response | `token` (text), `last` (bool) |
| `interrupt` | Stop current TTS | — |
| `end` | Hang up the call | `reason` |
### Stream LLM Responses Token-by-Token
Lower latency by streaming as the LLM generates output — Twilio starts speaking before the full response is ready.
**Python**
```python
async for chunk in llm_stream:
await websocket.send(json.dumps({"type": "text", "token": chunk, "last": False}))
await websocket.send(json.dumps({"type": "text", "token": "", "last": True}))
```
**Node.js**
```node
for await (const chunk of llmStream) {
ws.send(JSON.stringify({ type: "text", token: chunk, last: false }));
}
ws.send(JSON.stringify({ type: "text", token: "", last: true }));
```
### Voice Configuration
**Python**
```python
connect.conversation_relay(
url="wss://yourapp.com/ws/voice",
voice="en-US-Neural2-F",
language="en-US",
transcription_provider="deepgram",
speech_model="nova-2-phonecall",
interrupt_by_dtmf=True,
)
```
**Node.js**
```node
connect.conversationRelay({
url: "wss://yourapp.com/ws/voice",
voice: "en-US-Neural2-F",
language: "en-US",
transcriptionProvider: "deepgram",
speechModel: "nova-2-phonecall",
interruptByDtmf: true,
});
```
---
## CANNOT
- **No raw audio access** — Text in, text out only. For raw audio, use `<Connect><Stream>` (Media Streams).
- **Cannot mix with Media Streams** — `<Connect><Stream>` and `<Connect><ConversationRelay>` are mutually exclusive on the same call. No error — one is silently ignored.
- **No custom STT/TTS engines** — Limited to Deepgram + Google (STT) and Deepgram + Amazon Polly + Google + ElevenLabs (TTS).
- **No mid-session voice/provider changes** — Voice and provider are set at TwiML time. Only `language` can be switched mid-session via WebSocket message.
- **No WebSocket auto-reconnection** — If the WebSocket drops, the call disconnects. Implement recovery via `<Connect action>` URL.
- **Voice only** — No SMS/messaging support. For omnichannel, use Conversation Orchestrator.
- **No built-in memory or context** — BYO conversation history and context management.
- **No LLM integration** — Pure transport layer. You bring your own LLM via the WebSocket server.
- **No server-side recording via REST API** — `record:true` on the Calls API is silently ignored. Must use `<Start><Recording>` before `<Connect>` in TwiML.
- **Not PCI compliant with Voice Intelligence v2** — Do not enable `intelligenceService` in PCI workflows.
- **ElevenLabs requires account enablement** — Accounts without ElevenLabs access get error 64101. Voice IDs (not human-readable names) are required.
- **Cannot use ConversationRelay without onboarding** — Not available immediately on a new account
- **Cannot use non-TLS WebSocket** — Server must be reachable via `wss://` (TLS required)
- **Cannot stream audio without `last: true`** — Twilio won't play audio until it sees a `last: true` token in the stream
---
## Next Steps
- **Place outbound calls:** `twilio-voice-outbound-calls`
- **Standard IVR without AI:** `twilio-voice-twiml`
More from openai/plugins
- accessibility-and-inclusive-visualizationMake data visualizations accessible and inclusive. Use when the user needs chart or diagram accessibility guidance, text alternatives for complex visuals, color and contrast review, keyboard support, reduced-motion behavior for animation or parallax, or an accessibility QA workflow for exported figures, UML-like diagrams, and dashboards.
- agent-browserBrowser automation CLI for AI agents. Use when the user needs to interact with websites, verify dev server output, test web apps, navigate pages, fill forms, click buttons, take screenshots, extract data, or automate any browser task. Also triggers when a dev server starts so you can verify it visually.
- agent-browser-verifyAutomated browser verification for dev servers. Triggers when a dev server starts to run a visual gut-check with agent-browser — verifies the page loads, checks for console errors, validates key UI elements, and reports pass/fail before continuing.
- agents-sdkBuild AI agents on Cloudflare Workers using the Agents SDK. Load when creating stateful agents, durable workflows, real-time WebSocket apps, scheduled tasks, MCP servers, or chat applications. Covers Agent class, state management, callable RPC, Workflows integration, and React hooks. Biases towards retrieval from Cloudflare docs over pre-trained knowledge.
- ai-elementsAI Elements component library guidance — pre-built React components for AI interfaces built on shadcn/ui. Use when building chat UIs, message displays, tool call rendering, streaming responses, reasoning panels, or any AI-native interface with the AI SDK.
- ai-gatewayVercel AI Gateway expert guidance. Use when configuring model routing, provider failover, cost tracking, or managing multiple AI providers through a unified API.
- ai-generation-persistenceAI generation persistence patterns — unique IDs, addressable URLs, database storage, and cost tracking for every LLM generation
- ai-sdkVercel AI SDK expert guidance. Use when building AI-powered features — chat interfaces, text generation, structured output, tool calling, agents, MCP integration, streaming, embeddings, reranking, image generation, or working with any LLM provider.
- aiq-deploy|
- aiq-research|