An end‑to‑end, production‑ready guide for building a low‑latency, natural, and secure voice assistant on Exotel AgentStream using speech‑to‑speech realtime models.
Repos & Docs
Why Speech‑to‑Speech (S2S)
Traditional voice bots chain ASR → LLM → TTS. This works, but:
- Latency creeps up (2 conversions + buffering).
- Prosody (tone, rhythm) gets flattened in text.
- Complexity rises (multiple vendors, codecs, failure points).
Realtime S2S collapses the chain, enabling:
- Lower end‑to‑end latency and faster first‑audio‑out.
- Smooth barge‑in (interrupt mid‑utterance, respond instantly).
- Natural prosody that sounds human, not robotic.
- Simpler stack, fewer moving parts.
Pair this with Exotel AgentStream (telco‑grade WebSocket media over PSTN/SIP/WebRTC) and you get human-like conversations on real phone calls.
Architecture (Inbound & Outbound)
Inbound (bidirectional Voicebot)
Path: Caller → Exotel Number → Call Flow → Voicebot Applet (opens WSS) → Your Bot → Realtime LLM → Audio back to caller → (optional) next applet.
Outbound (bidirectional Voicebot)
Path: Scheduler/API → Make‑a‑Call API → Call Flow with Voicebot → Your WSS → Bot ↔ Realtime LLM → Audio back to callee.
Sequence flow:
1. Caller/Callee ──► Exotel AgentStream (Voicebot) [PSTN/SIP/WebRTC call]
2. Exotel AgentStream ──► Bot Server (WSS) [Open WebSocket (bidirectional)]
3. Bot Server ──► Realtime LLM [Realtime WS (audio in)]
4. Realtime LLM ──► Bot Server [Streaming audio deltas]
5. Bot Server ──► Exotel AgentStream [Base64 audio frames (PCM/PCMU)]
6. Exotel AgentStream ──► Caller/Callee [Play audio in call]
Note: Closing the WSS advances the flow (transfer, webhook, etc.)
Streaming contract (Voicebot Applet)
- Events → your bot: connected, start (with stream_sid, call_sid, custom_parameters), media (base64 PCM), dtmf, mark/clear, stop.
- Audio in (Exotel → bot): base64 Linear PCM, 16‑bit, 8/16/24 kHz, mono (LE).
- Audio out (bot → Exotel): base64 PCM or PCMU in ~100 ms frames (for PCM16 mono: ≈1.6 KB at 8 kHz, ≈3.2 KB at 16 kHz, ≈4.8 KB at 24 kHz), multiples of 320 bytes, ≤ 100 KB per frame.
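The frame arithmetic in the contract above can be checked with a short sketch. It assumes PCM16 mono (2 bytes per sample) and reads "100 KB" as 100 × 1024 bytes; the function name and constants are illustrative and not part of any Exotel SDK.

```python
BYTES_PER_SAMPLE = 2          # 16-bit linear PCM, mono
FRAME_MULTIPLE = 320          # outbound frames must be a multiple of 320 bytes
MAX_FRAME_BYTES = 100 * 1024  # assumption: "100 KB" read as 100 * 1024 bytes


def frame_bytes(sample_rate_hz: int, frame_ms: int = 100) -> int:
    """Return a PCM16 frame size in bytes, rounded down to a 320-byte multiple."""
    raw = sample_rate_hz * BYTES_PER_SAMPLE * frame_ms // 1000
    aligned = (raw // FRAME_MULTIPLE) * FRAME_MULTIPLE
    if aligned == 0:
        # Very short frames still need at least one 320-byte unit.
        aligned = FRAME_MULTIPLE
    if aligned > MAX_FRAME_BYTES:
        raise ValueError("frame exceeds the 100 KB per-frame limit")
    return aligned


if __name__ == "__main__":
    for rate in (8000, 16000, 24000):
        print(rate, "Hz ->", frame_bytes(rate), "bytes per 100 ms frame")
```

At 100 ms this gives 1600, 3200, and 4800 bytes for 8, 16, and 24 kHz respectively, all of which are already multiples of 320.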
Quickstart (TL;DR)
- Clone enterprise framework → configure .env (keys, model/voice, port).
- Run bot server → expose via ngrok → copy WSS URL.
- Make sure you have an Exotel account with KYC verified and a Virtual Number available.
- In Exotel Call Flow, drop Voicebot Applet, paste WSS (or dynamic HTTPS).
- Call inbound number or hit Make‑a‑Call API to test outbound.
- Watch logs for connected/start/media/dtmf/clear/stop.
- Ship with security, chunking discipline, and fallback to agents.
Set Up the Repos (Deep Dive) — Configure, Modify, and Test
Enterprise framework (recommended)
Production‑oriented scaffolding: multi‑rate audio (8k/16k/24k), adaptive chunking, containerization, metrics/logging hooks, and example skills.
# Clone and prepare
git clone https://github.com/exotel/Agent-Stream.git
cd Agent-Stream
git checkout enterprise-framework-contribution
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # set keys and runtime config (see below)
Environment configuration
Open env.example in the enterprise framework repo and set values exactly as documented (API keys, model/voice, ports, audio options). Do not rename variables. Refer to the branch README for the authoritative list and defaults.
Run an example (paths may vary; check repo examples/)
python3 examples/realtime_voicebot.py
Then expose locally and copy the WSS URL:
ngrok http <PORT_FROM_REPO_README_OR_ENV>
# Use the resulting wss://<subdomain>... path exposed by the sample
# (append ?sample-rate=8000|16000|24000 if desired)
What you can modify next
- Persona & guardrails: edit SYSTEM_INSTRUCTIONS.
- Audio trade‑offs: pick pcmu at 8 kHz; switch to pcm for higher fidelity.
- Barge‑in: keep chunk_ms ~100–200 (per repo defaults); rely on server VAD in the Realtime session.
- Routing: switch the applet to a dynamic HTTPS resolver that returns { "url": "wss://…" } per tenant/campaign.
- Observability: wire the built‑in metrics/logging sinks (request IDs, stream_sid, first‑audio‑out latency).
How to test (enterprise path)
- Inbound: bind Voicebot to a number → call → verify connected/start/media events and audible replies.
- Outbound: use Make‑a‑Call (see below) → confirm the call lands in the Voicebot applet.
- DTMF: press digits → see dtmf events.
- Sample‑rate: try ?sample-rate=8000 vs 16000 → listen for quality/latency deltas.
- Failure: kill LLM WS → bot should close WS → flow advances to next applet (e.g., passthru>>connect).
Minimal sample (fastest POC)
Tight, readable bridge for first‑call success and framing validation.
git clone https://github.com/exotel/Agent-Stream.git
cd Agent-Stream/python
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # add keys and server port
python3 openai_realtime_sales_bot.py
Quick edits to refine the POC
- Replace the default system prompt with your domain‑specific instructions.
- Choose return audio: pcmu for PSTN efficiency at 8 kHz; pcm if you favor fidelity.
- Keep responses short (1–2 sentences) to reduce perceived latency.
- Add ?sample-rate=8000 to the applet endpoint during early tests.
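If the model emits PCM16 but you return pcmu, each 16-bit sample has to be companded to 8-bit G.711 μ-law. The standard-library audioop module was removed in Python 3.13, so the sketch below shows the usual encoder in pure Python. The function names are illustrative, and the repo may handle this conversion differently.

```python
import struct

BIAS = 0x84   # standard mu-law bias (132)
CLIP = 32635  # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as one G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit, from bit 14 down.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_pcmu(pcm: bytes) -> bytes:
    """Convert little-endian PCM16 mono audio to PCMU (one byte per sample)."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return bytes(linear_to_ulaw(s) for s in samples)
```

The output is half the size of the input, which is the bandwidth saving pcmu buys at 8 kHz. For high call volumes a vectorised or native implementation would likely be preferable.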
Validate quickly, then graduate to enterprise framework for:
- Metrics (latency, jitter, silence %, response rate)
- Containers/CI (Dockerfile, Procfile, K8s manifests)
- Multi‑rate handling and chunk boundaries enforcement
- Security hooks (Basic Auth/IP allowlisting at proxy/app)
Configure the Voicebot Applet (Bidirectional)
- Call Flow → App Bazaar: drag Voicebot Applet into your flow.
- Endpoint:
- Static: paste wss://your-domain/agent-stream.
- Dynamic: paste an HTTPS URL that returns { "url": "wss://…" }.
- (Optional) Recording: enable for QA/compliance.
- Save and attach to number (inbound) or note App/Flow ID (outbound).
Event samples
{"event":"connected"}
{"event":"start","start":{"stream_sid":"XP123","call_sid":"CA456","custom_parameters":"lead=42&loc=blr"}}
{"event":"media","media":{"payload":"<base64 PCM16 8k mono>"}}
{"event":"dtmf","dtmf":{"digit":"5","onEnd":true}}
{"event":"stop","stop":{"reason":"hangup"}}
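A minimal sketch of how a bot might dispatch these events, based only on the field names in the samples above. The `session` dictionary and the `handle_event` function are hypothetical helpers, not part of the repo, and real payloads may carry additional fields.

```python
import base64
import json
from urllib.parse import parse_qs


def handle_event(raw: str, session: dict) -> str:
    """Update `session` from one Exotel event message and return the event name."""
    msg = json.loads(raw)
    event = msg.get("event", "")
    if event == "start":
        start = msg.get("start", {})
        session["stream_sid"] = start.get("stream_sid")
        session["call_sid"] = start.get("call_sid")
        custom = start.get("custom_parameters") or {}
        if isinstance(custom, str):
            # The sample above shows a query-string style value.
            custom = {k: v[0] for k, v in parse_qs(custom).items()}
        session["params"] = dict(custom)
    elif event == "media":
        # Payload is base64-encoded linear PCM from the caller.
        pcm = base64.b64decode(msg["media"]["payload"])
        session.setdefault("audio_in", bytearray()).extend(pcm)
    elif event == "dtmf":
        session.setdefault("digits", []).append(msg["dtmf"]["digit"])
    elif event == "stop":
        session["stopped"] = True
    return event
```

In a real bridge the media branch would forward audio to the realtime model instead of buffering it; buffering is used here only to keep the example self-contained.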
Send audio back
{ "event": "media", "streamSid": "XP123", "media": { "payload": "<base64 PCM/PCMU>" } }
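The outbound message shape above can be produced with a small framing helper. This is a sketch that assumes PCM16 output (so zero bytes are silence) and a default frame of 3200 bytes, which is 100 ms at 16 kHz; `media_messages` is an illustrative name, not a repo function.

```python
import base64
import json
from typing import Iterator

FRAME_MULTIPLE = 320  # frames must be a multiple of 320 bytes


def media_messages(pcm: bytes, stream_sid: str, frame_size: int = 3200) -> Iterator[str]:
    """Split PCM16 audio into aligned frames and yield JSON media messages."""
    if frame_size <= 0 or frame_size % FRAME_MULTIPLE:
        raise ValueError("frame_size must be a positive multiple of 320 bytes")
    for offset in range(0, len(pcm), frame_size):
        chunk = pcm[offset : offset + frame_size]
        remainder = len(chunk) % FRAME_MULTIPLE
        if remainder:
            # Pad the last frame with PCM16 silence to keep 320-byte alignment.
            chunk += b"\x00" * (FRAME_MULTIPLE - remainder)
        yield json.dumps(
            {
                "event": "media",
                "streamSid": stream_sid,
                "media": {"payload": base64.b64encode(chunk).decode("ascii")},
            }
        )
```

If you return PCMU instead, pad with 0xFF rather than zero bytes, since that is the μ-law code for silence.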
Custom parameters & dynamic URL rules
- Max 3 custom parameters.
- The total length of the query string (text after ?) must be ≤ 256 characters.
- Dynamic HTTP(S) resolver must return a valid ws(s) URL.
- Example: ws://127.0.0.1:5001/media?param1=value1&param2=value2&param3=value3
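These limits are easy to enforce before a URL ever reaches Exotel. The sketch below builds the stream URL and the JSON body a dynamic resolver would return; `build_stream_url` and `resolver_body` are hypothetical helpers, and the limits are taken directly from the rules above.

```python
import json
from urllib.parse import urlencode

MAX_PARAMS = 3         # at most 3 custom parameters
MAX_QUERY_CHARS = 256  # total query-string length limit


def build_stream_url(base_url: str, params: dict) -> str:
    """Return a ws(s) URL with custom parameters, enforcing the documented limits."""
    if not base_url.startswith(("ws://", "wss://")):
        raise ValueError("endpoint must be a ws:// or wss:// URL")
    if len(params) > MAX_PARAMS:
        raise ValueError(f"at most {MAX_PARAMS} custom parameters are allowed")
    query = urlencode(params)
    if len(query) > MAX_QUERY_CHARS:
        raise ValueError(f"query string exceeds {MAX_QUERY_CHARS} characters")
    return f"{base_url}?{query}" if query else base_url


def resolver_body(url: str) -> str:
    """Return the JSON body a dynamic HTTPS resolver responds with."""
    return json.dumps({"url": url})
```

For example, `resolver_body(build_stream_url("wss://your-domain/agent-stream", {"lead": "42"}))` yields a body of the form `{"url": "wss://your-domain/agent-stream?lead=42"}`.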
Deployment: sample rate parameters
When deploying, ensure your WSS endpoints support the following sample rate parameters. Defining the query param ?sample-rate=xxxx in the Voicebot applet allows Exotel to negotiate and handle the defined sample rate end‑to‑end:
wss://your-domain.com/?sample-rate=8000    # Standard PSTN
wss://your-domain.com/?sample-rate=16000   # Enhanced quality
wss://your-domain.com/?sample-rate=24000   # HD quality
Your bot should read the sample-rate query parameter at connect time and set/convert its audio pipeline accordingly (e.g., forward to the LLM at 8 kHz or transcode as needed). Ensure returned audio matches the negotiated rate.
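Reading the parameter at connect time could look like the sketch below. It assumes your WebSocket server exposes the request path (including the query string) for each connection, and it falls back to 8000 Hz when the parameter is absent; both the fallback and the function name are assumptions rather than documented Exotel behaviour.

```python
from urllib.parse import parse_qs, urlsplit

SUPPORTED_RATES = (8000, 16000, 24000)
DEFAULT_RATE = 8000  # assumption: treat a missing parameter as standard PSTN


def negotiated_sample_rate(request_path: str) -> int:
    """Extract and validate ?sample-rate=... from the WebSocket request path."""
    query = parse_qs(urlsplit(request_path).query)
    raw = query.get("sample-rate", [str(DEFAULT_RATE)])[0]
    try:
        rate = int(raw)
    except ValueError:
        raise ValueError(f"sample-rate is not an integer: {raw!r}") from None
    if rate not in SUPPORTED_RATES:
        raise ValueError(f"unsupported sample-rate: {rate}")
    return rate
```

Rejecting unsupported values early keeps a misconfigured applet from silently producing audio at the wrong speed.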
Outbound Calling (Make‑a‑Call API)
Originate calls that land inside your Voicebot flow.
Outgoing call to connect a number to a Call Flow (ExoML start_voice)
This API first calls the From number, and once they pick up, it connects them to the call flow (applet) you specify—e.g., your landing app (which contains the Voicebot Applet).
Endpoint
POST https://<your_api_key>:<your_api_token>@<subdomain>/v1/Accounts/<your_sid>/Calls/connect
- Prefer JSON? append .json → …/Calls/connect.json
- Subdomain by region:
- Singapore: api.exotel.com
- Mumbai: api.in.exotel.com
- <your_api_key>, <your_api_token>, <your_sid> are in API Settings on your Exotel dashboard.
Required parameters
- From — The phone number to be called first (E.164 recommended; landlines should be prefixed with STD, e.g., 0XXXXXX2400).
- CallerId — Your ExoPhone (outbound caller ID).
- Url — http://my.exotel.com/{your_sid}/exoml/start_voice/{app_id} where {app_id} is the flow (app) id to connect to after From answers.
Optional parameters
- CallType — trans for transactional calls.
- TimeLimit — Max call duration in seconds (≤ 14400 i.e., 4 hours).
- TimeOut — Ring timeout (seconds) for the legs.
- StatusCallback — On completion, Exotel POSTs: CallSid, Status (completed|failed|busy|no-answer), RecordingUrl (if enabled), DateUpdated.
- CustomField — Any value you want echoed back as a parameter to Passthru/Greetings GET later in the flow.
Example request
curl -X POST \
  https://<your_api_key>:<your_api_token>@<subdomain>/v1/Accounts/<your_sid>/Calls/connect \
  -d "From=XXXXX30240" \
  -d "CallerId=0XXXXXX4890" \
  -d "Url=http://my.exotel.com/{your_sid}/exoml/start_voice/926"
HTTP response
- 200 OK means the request was accepted, not that the call was answered. Track outcomes via StatusCallback or GET Call Details.
- The response body includes a Sid (unique call id). Log it for troubleshooting.
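The same request can be assembled in Python with only the standard library. This sketch builds the request without sending it and passes the credentials as an Authorization: Basic header, which is what curl derives from the user:password@host URL form. The function name and argument names are illustrative.

```python
import base64
import urllib.request
from urllib.parse import urlencode


def build_connect_request(
    api_key: str,
    api_token: str,
    sid: str,
    subdomain: str,
    from_number: str,
    caller_id: str,
    app_id: str,
) -> urllib.request.Request:
    """Build (but do not send) a Make-a-Call request that lands in a call flow."""
    endpoint = f"https://{subdomain}/v1/Accounts/{sid}/Calls/connect.json"
    flow_url = f"http://my.exotel.com/{sid}/exoml/start_voice/{app_id}"
    body = urlencode(
        {"From": from_number, "CallerId": caller_id, "Url": flow_url}
    ).encode("ascii")
    credentials = base64.b64encode(f"{api_key}:{api_token}".encode()).decode("ascii")
    request = urllib.request.Request(endpoint, data=body, method="POST")
    request.add_header("Authorization", f"Basic {credentials}")
    return request


# To place the call, pass the request to urllib.request.urlopen(...) and
# log the Sid from the response body for troubleshooting.
```

Keeping construction separate from sending makes it straightforward to unit-test campaign code without placing real calls.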
Routing to Agent or Contact Center After the Voicebot Applet
When the WebSocket connection closes—either because your bot disconnects once the interaction is complete or due to a network‑level termination—Exotel automatically advances to the next applet in the Call Flow. There is no explicit Stop event your bot needs to send to Exotel; instead, close the WebSocket to end the session. Exotel will internally emit Stop and continue the flow.
Passthru behaviour (AgentStream‑aware)
- HTTP method: Passthru makes an HTTP GET to your URL with URL‑encoded query parameters.
- Sync vs Async: Toggle Make Passthru Async on the applet.
- Sync → use HTTP codes to branch the flow (200 OK → Path A, 302 Found → Path B). Caller waits for your response.
- Async → fire‑and‑forget logging/analytics; does not block the caller.
- Payload: Standard passthru params plus streaming fields (e.g., Stream[StreamSID], Stream[Status], Stream[Duration], Stream[RecordingUrl], Stream[StreamUrl], Stream[DisconnectedBy], Stream[DetailedStatus], Stream[Error]).
- Sometimes all stream fields arrive as a single JSON string under Stream=…; deserialize accordingly.
- DisconnectedBy values are normalised to user, bot, NA.
- Recording URL appears when Record is enabled in the applet.
- Throttling: On concurrency breaches you may see Stream[Status]=failed with Stream[DetailedStatus]=Streaming_call_throttled—implement fallback.
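Because the stream fields can arrive either as bracketed keys or as one JSON string, a Passthru handler benefits from normalising both shapes. The sketch below does that for a raw query string; the helper names are hypothetical and the field names are the ones listed above.

```python
import json
from urllib.parse import parse_qsl


def parse_stream_fields(query_string: str) -> dict:
    """Collect Stream[...] fields from a Passthru query string into a flat dict."""
    fields = {}
    for key, value in parse_qsl(query_string, keep_blank_values=True):
        if key == "Stream":
            # Variant where all stream fields arrive as a single JSON string.
            decoded = json.loads(value)
            if isinstance(decoded, dict):
                fields.update(decoded)
        elif key.startswith("Stream[") and key.endswith("]"):
            fields[key[len("Stream["):-1]] = value
    return fields


def is_throttled(fields: dict) -> bool:
    """Return True when the stream was rejected for exceeding concurrency."""
    return (
        fields.get("Status") == "failed"
        and fields.get("DetailedStatus") == "Streaming_call_throttled"
    )
```

A throttled result is a natural trigger for the fallback path, for example routing straight to an agent queue.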
Place Passthru right after Voicebot
This lets you:
- Fetch session metadata
- Log streaming stats (StreamSID, duration, RecordingUrl)
- Detect disconnects reliably
- Read escalation flags (your custom logic) and route next
Common routing scenarios
- Connect Applet → route to Exotel agent/queue
- SIP Connect via vSIP Trunk → route to enterprise contact center
- Hangup Applet → gracefully end the call
Example
Caller: “Talk to human.” → Bot finishes → WS closes → Exotel emits Stop → Passthru GET → your endpoint returns the status code you mapped to escalation (e.g., 200 OK) → SwitchCase → vSIP Trunk via Connect
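The branching in this example could be decided by a small function inside a synchronous Passthru handler. The mapping below (200 for escalation, 302 for ending the call) and the decision rules are assumptions chosen for illustration; wire the codes to whichever paths your own flow uses.

```python
ESCALATE = 200  # assumption: Path A in the flow leads to Connect / vSIP
FINISH = 302    # assumption: Path B ends the call (e.g., Hangup applet)


def passthru_status(stream: dict, escalate_flag: bool) -> int:
    """Pick the HTTP status a sync Passthru endpoint returns to branch the flow."""
    if stream.get("DisconnectedBy") == "user":
        # The caller already hung up, so there is nobody to hand over.
        return FINISH
    if stream.get("Status") == "failed":
        # Streaming failed (e.g., throttled): fall back to a human agent.
        return ESCALATE
    return ESCALATE if escalate_flag else FINISH
```

Here `stream` is the flat dictionary of Stream fields from the Passthru request and `escalate_flag` is whatever your bot recorded when the caller asked for a human.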
Best practices
- Keep Passthru immediately after Voicebot/Stream applet
- Use Clear/Mark events for context and observability
- Monitor Active Streams and enforce concurrency limits
- Decide routing based on DisconnectedBy/Status and your flags
- Follow WS timeouts/handshake guidelines
- Keep custom params concise and secure
- Close the WebSocket from the bot to end the session cleanly
Secure Your Bot (Auth Options)
- AgentStream supports Basic Auth on both WSS and HTTPS dynamic endpoints.
- You can additionally enable IP allowlisting to restrict access to Exotel egress IPs and your network.
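On the bot side, validating the Basic Auth header during the WebSocket or HTTPS handshake might look like this. It is a sketch that assumes a single shared username and password and that your server framework hands you the raw Authorization header; `basic_auth_ok` is an illustrative name.

```python
import base64
import binascii
import hmac
from typing import Optional


def basic_auth_ok(header: Optional[str], user: str, password: str) -> bool:
    """Return True when an Authorization header carries the expected credentials."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header[len("Basic "):], validate=True)
    except binascii.Error:
        return False
    expected = f"{user}:{password}".encode("utf-8")
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(decoded, expected)
```

Reject the handshake with HTTP 401 when this returns False, and keep the credentials in environment variables rather than in source.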
Reference Implementation (repo‑only)
Use the exact code in the repos and keep your blog snippets synchronized with those files.
Main repo (quick POC)
- Path: python/openai_realtime_sales_bot.py
- How to start (from repo README):
python3 openai_realtime_sales_bot.py
Enterprise framework (production)
- Core engine: src/core/realtime_bot.py
- Example bots: src/examples/sales_bot.py, src/examples/support_bot.py, src/examples/qualification_bot.py, src/examples/collection_bot.py
- Environment template: env.example
Audio configuration
Use the variables in env.example of the enterprise framework (supported sample rates, chunk sizing/buffers). Follow the branch README for current names and defaults.
Run & expose
Use the exact run command from the enterprise framework README (script or server entry), then:
ngrok http <PORT_FROM_REPO_README_OR_ENV>
# For example, with the server on port 5000 (run each command in its own terminal):
uvicorn app:app --host 0.0.0.0 --port 5000
ngrok http 5000
# Paste the resulting wss://.../agent-stream in the Voicebot Applet.
# To negotiate audio, append ?sample-rate=8000|16000|24000.
Natural Conversation Tuning
- Barge‑in handling: Keep responses short (1–2 sentences). With server VAD enabled, ensure the bot stops playback immediately if the caller interrupts.
- Turn timing: Target outbound audio chunks of 100–200 ms (per repo defaults) for smooth half‑duplex flow. Tune downwards only if required by bandwidth constraints.
- Prosody & persona design: Craft concise system prompts to control tone, persona, and verbosity. Avoid overly long instructions that lead to rambling output.
- Language mixing (code‑switching): Configure the model to allow natural shifts (e.g., Hindi↔English) without forcing translation. Validate this in real test calls.
- Error recovery & silence detection: Detect long silences (>3–4 seconds) or low confidence responses; gracefully apologize, clarify, or escalate to an agent.
- Context retention: Use mark and clear events for contextual checkpoints. Reset state when moving to a new topic to avoid confusion.
- Testing & tuning loop: Continuously record and review sessions (if policy allows). Adjust system instructions, frame sizing, and escalation thresholds based on observed behavior.
- Framework choice: Use the enterprise framework from the repo for production (multi‑rate audio, metrics, containerization). Use the minimal sample only for POC.
- Industry‑specific design: For BFSI, enforce stricter compliance and escalation; for Healthcare, ensure HIPAA/PII masking; for Retail/E‑com, focus on COD/returns workflows; for Automotive, emphasize reminders and service journeys.
- LLM tuning: Iterate on SYSTEM_INSTRUCTIONS and voice settings to balance personality vs efficiency. For domain tasks, feed domain lexicons and few‑shot examples. Consider guardrails to avoid hallucinations.
- Observability & feedback: Integrate Active Streams and Passthru logs with dashboards. Use conversation transcripts (when compliant) for supervised fine‑tuning.
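The long-silence check from the list above can be approximated on inbound PCM16 frames with a simple energy measure. This is a sketch only: the RMS threshold of 500 is an arbitrary starting point that needs tuning on real calls, the class name is hypothetical, and server VAD in the realtime session may make it unnecessary.

```python
import math
import struct


def rms_pcm16(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian PCM16 mono frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class SilenceTracker:
    """Track consecutive quiet audio and flag when it exceeds a time limit."""

    def __init__(self, sample_rate_hz: int, threshold_rms: float = 500.0, limit_s: float = 3.0):
        self.threshold_rms = threshold_rms  # assumption: tune per deployment
        self.limit_samples = int(limit_s * sample_rate_hz)
        self.silent_samples = 0

    def feed(self, frame: bytes) -> bool:
        """Consume one frame; return True once silence has reached the limit."""
        if rms_pcm16(frame) < self.threshold_rms:
            self.silent_samples += len(frame) // 2
        else:
            self.silent_samples = 0
        return self.silent_samples >= self.limit_samples
```

When `feed` returns True, the bot can prompt the caller, repeat the question, or close the WebSocket to escalate.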
Launch Checklist
- Sign up with Exotel
- KYC complete; Voicebot Applet enabled; correct region.
- Static/Dynamic endpoint configured; TLS valid.
- .env: model/voice set; nearest hosting selected.
- Chunking discipline: ~100 ms (or 60 ms) frames, multiples of 320 bytes, ≤100 KB.
- Security: Basic Auth on WSS/HTTPS + IP allowlist.
- Fallback: If WS/LLM fails → next applet (passthru/connect).
- Observability: Log stream_sid, latency, gaps; set alerts.
- Passthru after Voicebot: Ensure proper routing/escalation handling.
- StatusCallback/Call Detail API: Integrate for outbound monitoring.
- Compliance: Recording/PII masking per policy; region‑based data residency.
- Scalability: Validate Active Streams concurrency and resource auto‑scaling.
Use Cases
- Sales & Marketing: lead qualification, personalized pitches, bookings, feedback collection, event registrations.
- Customer Support: NL‑IVR replacement, intelligent FAQs, order status/returns, troubleshooting, multilingual assistance.
- Collections & Reminders: EMI reminders, PTP capture, payment nudges, subscription renewals, overdue follow‑ups.
- Banking & Financial Services (BFSI): policy renewals, premium explainer, lead conversion, basic KYC assist (non‑sensitive), account activation.
- Healthcare & Healthtech: scheduling/rescheduling, pre‑visit guidance, post‑visit follow‑ups, vaccination reminders, medicine refills.
- Logistics & E‑commerce: COD confirmation, address validation, rescheduling, delivery instructions, return initiation.
- Travel & Mobility: itinerary updates, fare/tariff queries, rescheduling, booking confirmations, cancellation handling.
- Automotive & Consumer Services: service bookings, AMC/warranty renewals, recalls, roadside assistance, upsell of accessories.
Always include agent handover on request or low confidence.
Troubleshooting
- No audio back: Ensure base64 PCM/PCMU; obey frame size & 320‑byte multiples; confirm LLM emits deltas.
- Choppy audio: Frames too small/large; coalesce to ~40–60 ms; avoid >100 KB.
- Latency spikes: Shorter responses; lighter logging; nearest region; prefer PCMU @ 8 kHz.
- Handshake fails: TLS/host mismatch; malformed dynamic JSON; missing url; Basic Auth mismatch.
- DTMF missing: Ensure Voicebot (bidi) is used; log all events.
- Passthru not invoked: Confirm the Voicebot applet is immediately followed by a Passthru; check that WS closes cleanly.
- Custom params lost: Keep ≤3 params, query string ≤256 chars; validate they appear in start.custom_parameters.
- Recording URL missing: Enable recording at the applet level; Passthru will only include Stream[RecordingUrl] when enabled.
- Stream throttled: Watch for Stream[DetailedStatus]=Streaming_call_throttled; use Active Streams API to scale or failover.
- Multi‑rate issues: Ensure your bot honors the negotiated ?sample-rate (8k/16k/24k); transcode if your model requires a fixed rate.
- Outbound call fails: Verify From is a valid, reachable number and CallerId is one of your ExoPhones; confirm Url points to a valid start_voice applet.
FAQ
- Can I use non‑OpenAI realtime models?
Yes: any WebSocket‑based audio in/out model (e.g., custom ASR‑TTS bridges, ElevenLabs, Vapi).
- Do I need ASR/TTS separately?
Not with S2S. You can optionally log transcripts asynchronously for QA or analytics.
- Which audio format should I return?
PCMU @ 8 kHz for PSTN efficiency; PCM for higher fidelity (16/24 kHz). Always respect frame sizing (≈40–100 ms, multiples of 320 bytes).
- How do I pass per‑call context?
Use the dynamic endpoint: return { "url": "wss://.../agent-stream?lead=987&locale=en-IN" }. Your bot will read these via start.custom_parameters.
- How do outbound campaigns work?
Use the ExoML start_voice URL (see cURL example). Add retries, scheduling, and ensure your flow contains the Voicebot Applet.
- Can I test locally?
Yes. Use ngrok to expose your bot server. Point the Voicebot Applet to the ngrok URL.
- What about security?
Enable Basic Auth or IP allowlisting on your WSS/HTTPS endpoint.
- How do I monitor quality?
Use Passthru + Active Streams API to log duration, disconnect reason, errors, and recording URLs.