twilio-reliability-patterns

Name: twilio-reliability-patterns
Author: openai/plugins
$ npx mdskill add openai/plugins/twilio-reliability-patterns
Implements reliability patterns for high-volume Twilio integrations
Handles 429 errors, retries, and throughput limits in production Twilio apps
Uses Twilio's API for messaging, calls, and status callback management
Applies exponential backoff with jitter and per-number rate limiting
Provides fallback chains and thin-receiver pattern for resilient delivery
SKILL.md
.github/skills/twilio-reliability-patterns
---
name: twilio-reliability-patterns
description: >
  Handle rate limits, retries, and failures when building on Twilio at
  scale. Covers 429 exponential backoff with jitter, per-number throughput
  limits, StatusCallback resilience, thin-receiver pattern, and fallback
  chains. Use this skill whenever sending messages or making calls at
  volume, or when building production-grade Twilio integrations.
---

## Overview

Twilio enforces per-resource rate limits. At scale, 429 errors are expected behavior — not bugs. This skill teaches the patterns that prevent production failures: exponential backoff, throughput management, and resilient callback handling.

Twilio's 429 concurrency errors are not well documented, so do not rely on published limits alone: implement exponential backoff with ±10% jitter.

---

## Prerequisites

- A working Twilio integration (any product)
- Understanding of your expected volume (messages/sec, calls/sec)
- StatusCallback URLs configured — see `twilio-messaging-services`, `twilio-sms-send-message`

---

## Key Patterns

### 1. Exponential Backoff with Jitter

When you receive a 429 (Too Many Requests), wait and retry. Naive fixed-interval retry creates thundering herds. Use exponential backoff with randomized jitter.

**Python**
```python
import random
import time

from twilio.base.exceptions import TwilioRestException

def send_with_backoff(client, to, body, messaging_service_sid, max_retries=5):
    """Send one message, retrying with exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                to=to,
                body=body,
                messaging_service_sid=messaging_service_sid,
                status_callback="https://yourapp.com/status",
            )
        except TwilioRestException as e:
            if e.status != 429:
                raise  # Non-429 errors: don't retry, investigate
            # Exponential backoff: 100ms, 200ms, 400ms, 800ms, 1600ms
            base_delay = 0.1 * (2 ** attempt)
            # Add ±10% jitter to prevent thundering herd
            jitter = base_delay * 0.1 * (2 * random.random() - 1)
            delay = min(base_delay + jitter, 30)  # cap at 30 seconds
            time.sleep(delay)
    raise RuntimeError(f"Failed after {max_retries} attempts, all rate limited (429)")
```

**Node.js**
```node
async function sendWithBackoff(client, to, body, messagingServiceSid, maxRetries = 5) {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
            return await client.messages.create({
                to,
                body,
                messagingServiceSid,
                statusCallback: "https://yourapp.com/status",
            });
        } catch (err) {
            if (err.status === 429) {
                // Exponential backoff: 100ms, 200ms, 400ms, 800ms, 1600ms
                const baseDelay = 100 * Math.pow(2, attempt);
                // Add ±10% jitter
                const jitter = baseDelay * 0.1 * (2 * Math.random() - 1);
                const delay = Math.min(baseDelay + jitter, 30000); // cap at 30s
                await new Promise(r => setTimeout(r, delay));
            } else {
                throw err; // Non-429: don't retry
            }
        }
    }
    throw new Error(`Failed after ${maxRetries} retries`);
}
```

**Parameters:**
- Initial delay: 100ms
- Multiplier: 2x per attempt
- Jitter: ±10% of base delay (randomized)
- Max delay: 30 seconds
- Max retries: 5 (base delay reaches 1.6 seconds on the fifth attempt, about 3.1 seconds of waiting in total)
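
The schedule these parameters produce is easy to check by hand. The following is a minimal sketch, not part of the Twilio SDK: `backoff_delay` is a hypothetical helper, and the `rand` argument exists only so the jitter can be replaced in tests.

```python
import random

def backoff_delay(attempt, initial=0.1, multiplier=2.0, jitter=0.1, cap=30.0,
                  rand=random.random):
    """Seconds to wait after failed attempt number `attempt` (0-based)."""
    base = initial * (multiplier ** attempt)
    # rand() is in [0, 1), so the offset stays within ±jitter * base.
    offset = base * jitter * (2 * rand() - 1)
    return min(base + offset, cap)

# With jitter disabled, five attempts give base delays of
# 0.1, 0.2, 0.4, 0.8 and 1.6 seconds (3.1 seconds in total).
base_schedule = [backoff_delay(a, jitter=0.0) for a in range(5)]
```

With only five attempts the 30-second cap is never reached; it matters only if `max_retries` is raised to nine or more.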

### 2. Per-Number Throughput Limits

These limits are not prominently documented:

| Number type | SMS throughput | Voice throughput | Notes |
|-------------|---------------|-----------------|-------|
| Local (long code) | ~1 SMS/sec | 1 concurrent call | Lowest cost, lowest throughput |
| Toll-free | ~3 SMS/sec | — | Faster verification (3-5 days) |
| Short code | 10-100 SMS/sec | — | Highest throughput, 8-12 week provisioning, expensive |
| Messaging Service (pool) | Sum of all numbers in pool | — | Multiply throughput by adding numbers |

**Throughput opacity:** Sending velocity and queue depth are opaque — there is no dashboard showing messages per second. Use Messaging Services to multiply throughput by pooling numbers. A pool of 10 long codes = ~10 SMS/sec.
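
Because there is no dashboard for sending velocity, it can help to estimate a pool's rate from its composition before a send. This is a rough sketch that uses the approximate per-number figures from the table above; the function names are hypothetical, and the short code rate takes the low end of the 10-100 range as a conservative assumption.

```python
# Approximate SMS/sec per number, taken from the table above.
# Assumption: short codes use the low end of their 10-100 range.
SMS_RATE_PER_NUMBER = {"long_code": 1, "toll_free": 3, "short_code": 10}

def estimate_pool_rate(pool):
    """Sum the approximate SMS/sec for a pool given as {number_type: count}."""
    return sum(SMS_RATE_PER_NUMBER[kind] * count for kind, count in pool.items())

def estimate_send_seconds(recipient_count, pool):
    """Rough lower bound on how long a bulk send takes at the pool's rate."""
    return recipient_count / estimate_pool_rate(pool)

# Example: 6,000 recipients through a pool of 10 long codes
# needs at least 600 seconds (10 minutes).
seconds_needed = estimate_send_seconds(6000, {"long_code": 10})
```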

### 3. Bulk Send Pattern

For sending to large lists, use a rate-limited dispatch loop:

**Python**
```python
import asyncio
from collections import deque

async def bulk_send(client, recipients, body, messaging_service_sid, rate_per_second=10):
    """Send to a list of recipients with rate limiting and backoff."""
    queue = deque(recipients)
    results = []
    
    while queue:
        batch = []
        for _ in range(min(rate_per_second, len(queue))):
            batch.append(queue.popleft())
        
        for recipient in batch:
            try:
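                # send_with_backoff blocks (HTTP call plus time.sleep), so run it
                # in a worker thread to keep the event loop responsive
                msg = await asyncio.to_thread(
                    send_with_backoff, client, recipient, body, messaging_service_sid
                )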
                results.append({"to": recipient, "sid": msg.sid, "status": "sent"})
            except Exception as e:
                results.append({"to": recipient, "error": str(e), "status": "failed"})
        
        if queue:  # Don't sleep after last batch
            await asyncio.sleep(1)  # 1 second between batches
    
    return results
```

**Key:** Set `rate_per_second` based on your number pool size, not your desired speed. Sending faster than your pool supports just generates 429s.

> **Compliance:** Before bulk sending, verify recipient consent (opt-in records), respect quiet hours, and implement maximum batch size limits. Monitor for anomalous send patterns that could indicate abuse.
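
A pre-send filter can enforce these checks before any recipient reaches the dispatch loop. The sketch below is illustrative only: the function names are hypothetical, the 21:00 to 08:00 quiet window and the batch cap of 1000 are assumed policy values rather than Twilio or regulatory requirements, and `local_time` is assumed to already be in the recipients' time zone.

```python
from datetime import time

# Assumed policy values; replace them with your own compliance rules.
QUIET_START = time(21, 0)
QUIET_END = time(8, 0)

def in_quiet_hours(local_time, start=QUIET_START, end=QUIET_END):
    """True if local_time falls inside the quiet window."""
    if start <= end:
        return start <= local_time < end
    # The window wraps past midnight (for example 21:00 to 08:00).
    return local_time >= start or local_time < end

def eligible_recipients(recipients, opted_in, local_time, max_batch=1000):
    """Keep only opted-in recipients, outside quiet hours, capped at max_batch."""
    if in_quiet_hours(local_time):
        return []
    allowed = [r for r in recipients if r in opted_in]
    return allowed[:max_batch]
```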

### 4. StatusCallback Resilience

At scale, StatusCallbacks create their own load problem.

**The math:** 50 concurrent calls × 6 status events per call = 300 webhook invocations, and they arrive in bursts when many calls start or end together. Twilio Functions allow 30 concurrent executions per service, so slow handlers quickly exhaust that limit.
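
How many handlers run at the same time depends on both the arrival rate and how long each handler takes (Little's law: concurrency ≈ arrival rate × handling time). A small worked sketch follows; the worst-case burst of 300 events in one second and the two handler latencies are illustrative assumptions, not measured values.

```python
def required_concurrency(events_per_second, handler_seconds):
    """Average number of handlers in flight (Little's law)."""
    return events_per_second * handler_seconds

# Worst case: all 300 events land within the same second.
# A handler that writes to the database inline and takes 0.5 s
# needs about 150 concurrent executions, far above a limit of 30.
slow = required_concurrency(300, 0.5)

# A thin receiver that only enqueues and returns in about 20 ms
# needs about 6 concurrent executions for the same burst.
thin = required_concurrency(300, 0.02)
```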

**Thin-receiver pattern** — receive, queue, respond immediately:

**Node.js (Express)**
```node
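const express = require("express");
const { Queue, Worker } = require("bullmq");

// Assumption: Redis runs locally; point this at your own instance
const connection = { host: "localhost", port: 6379 };
const statusQueue = new Queue("twilio-status", { connection });

const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded bodies

// Thin receiver: accept callback, queue it, respond 200 immediately
app.post("/status", async (req, res) => {
    await statusQueue.add("status-event", {
        callSid: req.body.CallSid,
        callStatus: req.body.CallStatus,
        timestamp: Date.now(),
    });
    res.sendStatus(200);  // Respond fast; Twilio will retry on timeout
});

// Process asynchronously
const worker = new Worker("twilio-status", async (job) => {
    const { callSid, callStatus } = job.data;
    await updateDatabase(callSid, callStatus); // your own persistence function
}, { connection });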
```

**Python (Flask + Celery)**
```python
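from celery import Celery
from flask import Flask, request

app = Flask(__name__)
# Assumption: Redis as the Celery broker; substitute your own broker URL
celery = Celery("status_tasks", broker="redis://localhost:6379/0")

@celery.task
def process_status(call_sid, call_status):
    update_database(call_sid, call_status)  # your own persistence function

@app.route("/status", methods=["POST"])
def status_callback():
    # Queue for async processing
    process_status.delay(
        call_sid=request.form["CallSid"],
        call_status=request.form["CallStatus"],
    )
    return "", 200  # Respond FAST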
```

**Idempotency key:** Use `{CallSid}-{CallStatus}` as a composite key. Twilio retries on timeout, which can cause duplicate callbacks. Deduplicate before processing.
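
The deduplication step can be as small as a keyed lookup with expiry. The sketch below is a single-process illustration: `SeenEvents` is a hypothetical class, the one-hour expiry is an assumed value, and with several workers the same idea would need a shared store instead of a local dictionary.

```python
import time

class SeenEvents:
    """In-memory record of callbacks already processed, with expiry."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # composite key -> time first seen

    def first_time(self, call_sid, call_status):
        """Return True only the first time this (sid, status) pair is seen."""
        now = self._clock()
        # Drop entries older than the TTL so the dictionary stays bounded.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        key = f"{call_sid}-{call_status}"
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

A worker would call `first_time(call_sid, call_status)` before touching the database and skip the job when it returns `False`.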

### 5. Fallback Chains

When delivery on one channel fails, escalate to the next:

**Python**
```python
from xml.sax.saxutils import escape

def send_with_fallback(client, to, message, messaging_service_sid):
    """Try SMS → Voice → Email fallback chain."""

    # Try SMS first
    try:
        msg = client.messages.create(
            to=to, body=message, messaging_service_sid=messaging_service_sid,
            status_callback="https://yourapp.com/status"
        )
        # A successful create only means Twilio accepted the message.
        # Delivery is confirmed later via StatusCallback; escalate to voice
        # from that handler if the status comes back undelivered or failed.
        return {"channel": "sms", "sid": msg.sid}
    except Exception:
        pass  # SMS failed, try voice

    # Fallback to voice
    try:
        call = client.calls.create(
            to=to, from_="+15551234567",
            # Escape the text so characters like & or < cannot break the TwiML
            twiml=f"<Response><Say>{escape(message)}</Say></Response>",
            status_callback="https://yourapp.com/call-status"
        )
        return {"channel": "voice", "sid": call.sid}
    except Exception:
        pass  # Voice failed, try email

    # Last resort: email
    # Use SendGrid — see twilio-sendgrid-email
    return {"channel": "email", "status": "queued"}
```
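
An API call that succeeds does not guarantee delivery, so part of the escalation decision has to live in the StatusCallback handler. The following is a minimal sketch of that decision as a pure function; `next_channel` and the two constants are hypothetical names, and which statuses count as failures is an assumption to adjust for your use case.

```python
FALLBACK_ORDER = ["sms", "voice", "email"]

# Statuses treated as terminal failures. "failed" and "undelivered" are
# message statuses; "busy", "no-answer" and "canceled" are call statuses.
FAILURE_STATUSES = {"failed", "undelivered", "busy", "no-answer", "canceled"}

def next_channel(current_channel, status):
    """Return the channel to escalate to, or None if no escalation applies."""
    if status not in FAILURE_STATUSES:
        return None
    position = FALLBACK_ORDER.index(current_channel)
    if position + 1 >= len(FALLBACK_ORDER):
        return None  # already on the last channel
    return FALLBACK_ORDER[position + 1]
```

Keeping the decision free of I/O makes it easy to test and lets the queue worker that processes callbacks trigger the actual send.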

### 6. Voice Concurrency Limits

| Resource | Default limit | Notes |
|----------|--------------|-------|
| Concurrent calls per account | 1 (trial) / variable (paid) | Request increase via support |
| Calls per second (CPS) | 1 CPS (default) | Increase via support for outbound campaigns |
| Conference participants | 250 per conference | |
| Twilio Functions concurrent | 30 per service | Use thin-receiver pattern above |

For outbound campaigns, request CPS increase before launch — not during.
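
The effect of the CPS limit on a campaign is simple arithmetic, and worth running before asking for an increase. A small sketch with hypothetical function names and illustrative numbers:

```python
def call_start_offsets(call_count, cps=1.0):
    """Seconds after campaign start at which each call should be created."""
    interval = 1.0 / cps
    return [i * interval for i in range(call_count)]

def campaign_launch_seconds(call_count, cps=1.0):
    """Time needed just to start every call at the given CPS."""
    return call_count / cps

# 3,600 calls at the default 1 CPS take an hour to launch;
# at 10 CPS the same campaign launches in six minutes.
default_launch = campaign_launch_seconds(3600, cps=1)
raised_launch = campaign_launch_seconds(3600, cps=10)
```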

### 7. Webhook Timeout Handling

Twilio expects a response within **15 seconds** for both voice and messaging webhooks. If your endpoint doesn't respond in time:
- Voice: Twilio hangs up or falls back to `voiceFallbackUrl`
- Messaging: Twilio retries the callback

**Always configure fallback URLs:**
```python
# On phone number configuration
number = client.incoming_phone_numbers(phone_sid).update(
    voice_url="https://yourapp.com/voice",
    voice_fallback_url="https://yourapp.com/voice-fallback",  # backup endpoint
    sms_url="https://yourapp.com/sms",
    sms_fallback_url="https://yourapp.com/sms-fallback"
)
```

---

## Monitoring Checklist

Set up these alerts before going to production:

| Metric | Alert threshold | How to track |
|--------|----------------|-------------|
| 429 error rate | > 5% of requests | Count 429s in your backoff handler |
| Delivery failure rate | > 2% of messages | StatusCallback `failed`/`undelivered` events |
| Webhook response time | > 5 seconds p95 | Your APM tool (DataDog, New Relic) |
| Queue depth | Growing over 5 minutes | Your message queue metrics |
| Concurrent calls | > 80% of limit | Twilio Usage API or Event Streams |

Twilio's built-in alerting systems are under-used — end-users often discover issues before developers do. Configure StatusCallbacks + Event Streams for delivery failure alerts on every integration.
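
The 429 error rate in the table can be tracked inside the backoff handler with a sliding window. This is a minimal in-process sketch: `ErrorRateMonitor` is a hypothetical class, the 5% threshold comes from the table above, and the five-minute window is an assumed value. A production setup would more likely export counters to an APM tool.

```python
import time
from collections import deque

class ErrorRateMonitor:
    """Share of requests that returned 429 within a sliding time window."""

    def __init__(self, window_seconds=300, threshold=0.05, clock=time.monotonic):
        self._window = window_seconds
        self._threshold = threshold
        self._clock = clock
        self._events = deque()  # (timestamp, was_429)

    def _trim(self, now):
        # Discard events that have aged out of the window.
        while self._events and now - self._events[0][0] > self._window:
            self._events.popleft()

    def record(self, was_429):
        now = self._clock()
        self._events.append((now, was_429))
        self._trim(now)

    def rate(self):
        self._trim(self._clock())
        if not self._events:
            return 0.0
        return sum(1 for _, hit in self._events if hit) / len(self._events)

    def should_alert(self):
        return self.rate() > self._threshold
```

Call `record(True)` in the 429 branch of the retry loop and `record(False)` after a successful send, then check `should_alert()` on a timer.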

---

## CANNOT

- **Cannot avoid 429 errors on any Twilio API** — Backoff patterns apply to all APIs (Messaging, Voice, Verify, Lookup)
- **Cannot increase per-number throughput** — Add more numbers via Messaging Services instead
- **Cannot configure StatusCallback retry behavior** — Twilio retries on timeout automatically; not configurable
- **Cannot exceed Twilio Functions limits** — 30 concurrent executions/service, 10-second timeout, 256 MB memory
- **Cannot use a native Twilio rate limiting API** — You must implement rate limiting in your application
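
Since rate limiting has to live in the application, a token bucket is one common way to build it. The sketch below names that technique explicitly; it is not a Twilio feature, `TokenBucket` is a hypothetical class, and it is not thread-safe as written.

```python
import time

class TokenBucket:
    """Allow up to `rate_per_second` sends on average, with a bounded burst."""

    def __init__(self, rate_per_second, capacity=None, clock=time.monotonic):
        self.rate = float(rate_per_second)
        # Burst size defaults to one second's worth of tokens.
        self.capacity = float(capacity if capacity is not None else rate_per_second)
        self._clock = clock
        self._tokens = self.capacity
        self._last = clock()

    def try_acquire(self):
        """Take one token if available; return False if the caller must wait."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Set `rate_per_second` from the pool's estimated throughput and call `try_acquire()` before each send, sleeping briefly whenever it returns `False`.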

---

## Next Steps

- **Messaging at scale:** `twilio-messaging-services`
- **Monitor delivery:** `twilio-sms-send-message` (StatusCallbacks)
- **Debug failures:** `twilio-debugging-observability`
- **Compliance for bulk sends:** `twilio-compliance-traffic`