slack-voice-interface

$ npx mdskill add automateyournetwork/netclaw/slack-voice-interface

Respond to Slack voice clips with text and MP3 voice replies using edge-tts

  • Enables voice interaction in Slack by replying to voice messages with audio
  • Uses OpenClaw transcription and edge-tts for text-to-speech conversion
  • Processes transcribed text with NetClaw tools like pyATS, NetBox, and ServiceNow
  • Posts text response and uploads generated MP3 file to Slack thread
---
name: slack-voice-interface
description: "Respond to Slack voice clips with both text and an MP3 voice reply using edge-tts. Voice IN is already handled by OpenClaw transcription. Use when a user sends a voice message in Slack, you need to reply with audio, or you want to generate a spoken MP3 response."
license: Apache-2.0
user-invocable: true
metadata:
  { "openclaw": { "requires": { "bins": ["python3"], "env": ["TTS_MCP_SCRIPT", "MCP_CALL"] } } }
---

# Slack Voice Interface

## How It Works

```
User sends voice clip in Slack
    |
    v
OpenClaw transcribes automatically (built-in)
    |
    v
NetClaw processes with full skill set
(pyATS, NetBox, ServiceNow, all 40 MCP servers)
    |
    v
python3 $MCP_CALL "python3 -u $TTS_MCP_SCRIPT" text_to_speech → MP3 file
    |
    v
Upload MP3 to Slack thread + post text response
```

## Voice Response Workflow

### Step 1: Process the question

Treat the transcribed voice message identically to a typed text message.
Use the full NetClaw skill set — pyATS, NetBox, ServiceNow, etc.

### Step 2: Generate voice response

After composing your text response, call `text_to_speech`:

```bash
python3 $MCP_CALL "python3 -u $TTS_MCP_SCRIPT" text_to_speech '{"text":"R1 has 3 OSPF neighbors, all in FULL state on Area 0...","voice":"en-US-GuyNeural"}'
```

This returns JSON with an `output_path` to the generated MP3 file.
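
As a rough sketch of how a wrapper might drive this call from Python, the snippet below builds the same command from the `MCP_CALL` and `TTS_MCP_SCRIPT` environment variables and pulls `output_path` out of the reply. The helper names `build_tts_command`, `parse_output_path`, and `synthesize` are hypothetical, and the code assumes the JSON result arrives on stdout as a single object; check what your `$MCP_CALL` script actually prints before relying on that.

```python
import json
import os
import subprocess


def build_tts_command(text, voice="en-US-GuyNeural", env=None):
    """Return an argv list equivalent to the shell command above."""
    env = os.environ if env is None else env
    # json.dumps handles quotes and apostrophes in the response text.
    payload = json.dumps({"text": text, "voice": voice})
    return [
        "python3",
        env["MCP_CALL"],
        f"python3 -u {env['TTS_MCP_SCRIPT']}",
        "text_to_speech",
        payload,
    ]


def parse_output_path(stdout):
    """Extract output_path, ASSUMING stdout holds exactly one JSON object."""
    result = json.loads(stdout)
    return result["output_path"]


def synthesize(text, voice="en-US-GuyNeural", timeout=30):
    """Run the TTS tool and return the path of the generated MP3."""
    proc = subprocess.run(
        build_tts_command(text, voice),
        capture_output=True, text=True, check=True, timeout=timeout,
    )
    return parse_output_path(proc.stdout)
```

Passing the payload as a separate argv element, rather than interpolating it into a shell string, sidesteps quoting problems when the response text contains an apostrophe such as "R1's".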

To list available voices:

```bash
python3 $MCP_CALL "python3 -u $TTS_MCP_SCRIPT" list_voices '{"language":"en"}'
```

### Step 3: Deliver both text and voice

Post the text response in the Slack thread AND upload the MP3 file:

> :loud_speaker: **Voice Response**
> [MP3 audio file attached]
>
> R1 has 3 OSPF neighbors, all in FULL state on Area 0:
> - 2.2.2.2 (R2) via Gi1 — FULL/DR
> - 3.3.3.3 (R3) via Gi2 — FULL/BDR

**Always deliver text AND voice.** Text is primary (searchable, accessible).
Voice is supplementary.
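
The delivery order can be sketched as follows. This is an illustration only: the source does not say how NetClaw talks to Slack, so the code assumes a client object that behaves like `slack_sdk.WebClient` (its `chat_postMessage` and `files_upload_v2` methods), and `deliver_reply` is a hypothetical helper name.

```python
def deliver_reply(client, channel, thread_ts, text, mp3_path=None):
    """Post the text reply in the thread, then attach the MP3 if one exists.

    `client` is ASSUMED to behave like slack_sdk.WebClient.
    Returns True when an audio file was uploaded, False for text-only.
    """
    # Text is primary, so it always goes out first.
    client.chat_postMessage(channel=channel, thread_ts=thread_ts, text=text)
    if mp3_path is None:
        return False
    client.files_upload_v2(
        channel=channel,
        thread_ts=thread_ts,
        file=mp3_path,
        title="Voice Response",
        initial_comment=":loud_speaker: *Voice Response*",
    )
    return True
```

Posting the text before the upload means the searchable answer is already in the thread even if the file upload is slow or fails.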

## Voice Selection

| Voice | Description |
|-------|-------------|
| en-US-GuyNeural | Professional male — **default** |
| en-US-JennyNeural | Professional female |
| en-US-AriaNeural | Conversational female |
| en-GB-RyanNeural | British male |

Users can request a voice change:
- "Switch to a female voice" → use en-US-JennyNeural
- "Use a British accent" → use en-GB-RyanNeural

Call `list_voices` to see all 300+ available voices.
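
One possible way to map such requests onto the table above is a simple keyword lookup. The keywords, the `pick_voice` name, and the matching strategy are all illustrative assumptions; in practice the agent may interpret the request however it sees fit.

```python
DEFAULT_VOICE = "en-US-GuyNeural"

# Keyword -> voice, using only voices from the table above.
# "female" must be checked before "male", since "female" contains "male".
VOICE_KEYWORDS = [
    ("british", "en-GB-RyanNeural"),
    ("conversational", "en-US-AriaNeural"),
    ("female", "en-US-JennyNeural"),
    ("male", "en-US-GuyNeural"),
]


def pick_voice(request, current=DEFAULT_VOICE):
    """Return the voice a request asks for, or keep the current one."""
    lowered = request.lower()
    for keyword, voice in VOICE_KEYWORDS:
        if keyword in lowered:
            return voice
    return current
```

A message that mentions no voice preference leaves the current voice unchanged, so ordinary questions do not reset an earlier choice.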

## Performance

| Phase | Latency |
|-------|---------|
| edge-tts synthesis | 1-2 seconds |
| Slack MP3 upload | < 1 second |

Synthesis and upload together add under 3 seconds to the response time, based on the figures above.

## Fallback

If TTS fails, deliver the text response immediately. Do not block on voice.
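
A minimal sketch of that fallback, assuming the three steps are available as callables (`post_text`, `synthesize`, and `upload_audio` are hypothetical stand-ins, not names from NetClaw):

```python
import logging

log = logging.getLogger("voice")


def respond(text, post_text, synthesize, upload_audio):
    """Send text first, then attempt voice; a TTS failure never blocks text.

    Returns True if the voice reply was delivered, False if text-only.
    """
    post_text(text)
    try:
        mp3_path = synthesize(text)
        upload_audio(mp3_path)
    except Exception as exc:  # broad on purpose: voice is supplementary
        log.warning("TTS failed, text-only reply sent: %s", exc)
        return False
    return True
```

Because the text is posted before synthesis starts, a slow or broken TTS step can only delay or drop the audio, never the answer itself.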

## Tips for Voice Responses

- **Keep it concise** — under 100 words works best for spoken delivery
- **Avoid tables** — describe data conversationally for voice
- **Don't spell out abbreviations** — write "OSPF", not "O-S-P-F" (edge-tts handles this)
- **Use natural phrasing** — the text will be read aloud, so write for the ear
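
The length guideline could be enforced with a small helper like the one below. It is a sketch, not part of the skill: `spoken_version` is a hypothetical name, the 100-word budget comes from the first tip, and the sentence splitting is a simple heuristic.

```python
import re


def spoken_version(text, max_words=100):
    """Keep whole sentences until the word budget would be exceeded.

    The first sentence is always kept, even if it alone is over budget.
    """
    # Split after ., ! or ? followed by whitespace; dots inside
    # addresses like 2.2.2.2 are not followed by whitespace, so they stay.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, count = [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if kept and count + n > max_words:
            break
        kept.append(sentence)
        count += n
    return " ".join(kept)
```

The full text still goes to the Slack thread; only the string passed to `text_to_speech` would be shortened this way.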

## GAIT Integration

Record voice interactions in the GAIT audit trail:

```
Input: Voice clip from @user (transcript: "What are your interfaces?")
Action: Queried R1 interfaces via pyATS
Output: 4 interfaces found — text + voice response delivered to Slack
```