Build a Real-Time Speech-to-Speech AI Voice Assistant on Exotel AgentStream (Bidirectional) with OpenAI Realtime & Python

An end‑to‑end, production‑ready guide for building a low‑latency, natural, and secure voice assistant on Exotel AgentStream using speech‑to‑speech realtime models.

Repos & Docs

Why Speech‑to‑Speech (S2S)

Traditional voice bots chain ASR → LLM → TTS. This works, but:

Latency creeps up (2 conversions + buffering).
Prosody (tone, rhythm) gets flattened in text.
Complexity rises (multiple vendors, codecs, failure points).

Realtime S2S collapses the chain, enabling:

Lower end‑to‑end latency and faster first‑audio‑out.
Smooth barge‑in (interrupt mid‑utterance, respond instantly).
Natural prosody that sounds human, not robotic.
Simpler stack, fewer moving parts.

Pair this with Exotel AgentStream (telco‑grade WebSocket media over PSTN/SIP/WebRTC) and you get human-like conversations on real phone calls.

Architecture (Inbound & Outbound)

Inbound (bidirectional Voicebot)

Path: Caller → Exotel Number → Call Flow → Voicebot Applet (opens WSS) → Your Bot → Realtime LLM → Audio back to caller → (optional) next applet.

Outbound (bidirectional Voicebot)

Path: Scheduler/API → Make‑a‑Call API → Call Flow with Voicebot → Your WSS → Bot ↔ Realtime LLM → Audio back to callee.

SEQUENCE FLOW:

1. Caller/Callee ──────────────► Exotel AgentStream (Voicebot)
   [PSTN/SIP/WebRTC call]

2. Exotel AgentStream ─────────► Bot Server (WSS)
   [Open WebSocket (bidirectional)]

3. Bot Server ────────────────► Realtime LLM
   [Realtime WS (audio in)]

4. Realtime LLM ──────────────► Bot Server
   [Streaming audio deltas]

5. Bot Server ────────────────► Exotel AgentStream
   [Base64 audio frames (PCM/PCMU)]

6. Exotel AgentStream ────────► Caller/Callee
   [Play audio in call]

NOTE: Closing WSS advances the flow (transfer, webhook, etc.)

Streaming contract (Voicebot Applet)

Events → your bot: connected, start (with stream_sid, call_sid, custom_parameters), media (base64 PCM), dtmf, mark/clear, stop.
Audio in (Exotel → bot): base64 Linear PCM, 16‑bit, 8/16/24 kHz, mono (LE).
Audio out (bot → Exotel): base64 PCM or PCMU in ~100 ms frames (for PCM16 mono that is ≈1.6 KB at 8 kHz, ≈3.2 KB at 16 kHz, and ≈4.8 KB at 24 kHz), in multiples of 320 bytes, ≤ 100 KB per frame.
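
The framing rules above can be sketched as a small chunker. This is a minimal sketch, assuming 16-bit mono PCM; `frame_bytes` and `split_frames` are hypothetical helper names and are not taken from the Exotel repos.

```python
def frame_bytes(sample_rate_hz: int, frame_ms: int = 100) -> int:
    """Bytes in one PCM16 mono frame, rounded down to a multiple of 320."""
    raw = sample_rate_hz * 2 * frame_ms // 1000  # 2 bytes per 16-bit sample
    return max(320, raw - raw % 320)


def split_frames(pcm: bytes, sample_rate_hz: int, frame_ms: int = 100):
    """Yield frames of the target size; pad the last one with silence
    so every frame stays a multiple of 320 bytes."""
    size = frame_bytes(sample_rate_hz, frame_ms)
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]
        remainder = len(chunk) % 320
        if remainder:
            chunk += b"\x00" * (320 - remainder)
        yield chunk
```

Padding the tail with zero bytes (digital silence for PCM16) is one design choice; buffering the remainder until more audio arrives is another.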

Multi‑tenant routing: configure the Voicebot Applet to call your HTTPS resolver, which returns { "url": "wss://…" } per call.
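
A resolver of this kind could look roughly like the sketch below, built on Python's standard `http.server`. The `tenant` query parameter, the `bots.example.com` hostnames, and port 8080 are all assumptions for illustration; check the Exotel documentation for what the applet actually sends to your resolver.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical tenant -> WSS endpoint table; replace with your own lookup.
TENANT_ENDPOINTS = {
    "acme": "wss://bots.example.com/agent-stream?sample-rate=8000",
}
DEFAULT_ENDPOINT = "wss://bots.example.com/agent-stream"


def resolve_body(query: str) -> bytes:
    """Build the JSON body {"url": "wss://..."} for one call."""
    tenant = parse_qs(query).get("tenant", [""])[0]
    url = TENANT_ENDPOINTS.get(tenant, DEFAULT_ENDPOINT)
    return json.dumps({"url": url}).encode("utf-8")


class ResolverHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = resolve_body(urlparse(self.path).query)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the resolver; put TLS termination (HTTPS) in front of it."""
    HTTPServer(("0.0.0.0", port), ResolverHandler).serve_forever()
```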

Quickstart (TL;DR)

Clone enterprise framework → configure .env (keys, model/voice, port).
Run bot server → expose via ngrok → copy WSS URL.
Make sure you have an Exotel account with KYC verified and an available virtual number.
In Exotel Call Flow, drop Voicebot Applet, paste WSS (or dynamic HTTPS).
Call inbound number or hit Make‑a‑Call API to test outbound.
Watch logs for connected/start/media/dtmf/clear/stop.
Ship with security, chunking discipline, and fallback to agents.

Set Up the Repos (Deep Dive) — Configure, Modify, and Test

Enterprise framework (recommended)

Production‑oriented scaffolding: multi‑rate audio (8k/16k/24k), adaptive chunking, containerization, metrics/logging hooks, and example skills.

# Clone and prepare
git clone https://github.com/exotel/Agent-Stream.git
cd Agent-Stream
git checkout enterprise-framework-contribution
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # set keys and runtime config (see below)

Environment configuration

Open env.example in the enterprise framework repo and set values exactly as documented (API keys, model/voice, ports, audio options). Do not rename variables. Refer to the branch README for the authoritative list and defaults.

Run an example (paths may vary; check repo examples/)

python3 examples/realtime_voicebot.py

Then expose locally and copy the WSS URL:

ngrok http <PORT_FROM_REPO_README_OR_ENV>

# Use the resulting wss://<subdomain>... path exposed by the sample 
# (append ?sample-rate=8000|16000|24000 if desired)

Tip: Drive the negotiated sample rate from Exotel by adding ?sample-rate=8000|16000|24000 in the Voicebot applet endpoint; your server should read it and set/convert its audio path accordingly.

What you can modify next

Persona & guardrails: edit SYSTEM_INSTRUCTIONS.
Audio trade‑offs: pick pcmu at 8 kHz; switch to pcm for higher fidelity.
Barge‑in: keep chunk_ms ~100–200 (per repo defaults); rely on server VAD in the Realtime session.
Routing: switch the applet to a dynamic HTTPS resolver that returns { "url": "wss://…" } per tenant/campaign.
Observability: wire the built‑in metrics/logging sinks (request IDs, stream_sid, first‑audio‑out latency).
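
If you choose pcmu for the return path, the model's PCM16 output has to be converted to G.711 μ-law first. The sketch below is a pure-Python encoder using the standard bias-and-segment algorithm; it is an illustration, not the repo's implementation, and `pcm16_to_pcmu` is a hypothetical name. (The stdlib `audioop` module that used to offer this conversion was removed in Python 3.13.)

```python
import struct

_BIAS = 0x84   # standard mu-law bias (132)
_CLIP = 32635  # largest magnitude before the bias would overflow


def _encode_sample(sample: int) -> int:
    """Encode one signed 16-bit sample as one mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), _CLIP) + _BIAS
    # Find the segment (exponent): position of the highest set bit.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_pcmu(pcm: bytes) -> bytes:
    """Convert little-endian PCM16 mono audio to mu-law (one byte per sample)."""
    if len(pcm) % 2:
        raise ValueError("PCM16 input must contain whole 16-bit samples")
    return bytes(_encode_sample(s) for (s,) in struct.iter_unpack("<h", pcm))
```

The output is half the size of the input, which is where the bandwidth saving at 8 kHz comes from.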

How to test (enterprise path)

Inbound: bind Voicebot to a number → call → verify connected/start/media events and audible replies.
Outbound: use Make‑a‑Call (see below) → confirm the call lands in the Voicebot applet.
DTMF: press digits → see dtmf events.
Sample‑rate: try ?sample-rate=8000 vs 16000 → listen for quality/latency deltas.
Failure: kill LLM WS → bot should close WS → flow advances to next applet (e.g., passthru>>connect).

Minimal sample (fastest POC)

Tight, readable bridge for first‑call success and framing validation.

git clone https://github.com/exotel/Agent-Stream.git
cd Agent-Stream/python
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # add keys and server port
python3 openai_realtime_sales_bot.py

Quick edits to refine the POC

Replace the default system prompt with your domain‑specific instructions.
Choose return audio: pcmu for PSTN efficiency at 8 kHz; pcm if you favor fidelity.
Keep responses short (1–2 sentences) to reduce perceived latency.
Add ?sample-rate=8000 to the applet endpoint during early tests.

Validate quickly, then graduate to enterprise framework for:

Metrics (latency, jitter, silence %, response rate)
Containers/CI (Dockerfile, Procfile, K8s manifests)
Multi‑rate handling and chunk boundaries enforcement
Security hooks (Basic Auth/IP allowlisting at proxy/app)

Validate audio framing quickly with the minimal sample, then move to the enterprise framework for production features (metrics, containers, multi‑rate).

Configure the Voicebot Applet (Bidirectional)

Call Flow → App Bazaar: drag Voicebot Applet into your flow.
Endpoint:
- Static: paste wss://your-domain/agent-stream.
- Dynamic: paste an HTTPS URL that returns { "url": "wss://…" }.
(Optional) Recording: enable for QA/compliance.
Save and attach to number (inbound) or note App/Flow ID (outbound).

Event samples

{"event":"connected"}
{"event":"start","start":{"stream_sid":"XP123","call_sid":"CA456","custom_parameters":"lead=42&loc=blr"}}
{"event":"media","media":{"payload":"<base64 PCM16 8k mono>"}}
{"event":"dtmf","dtmf":{"digit":"5","onEnd":true}}
{"event":"stop","stop":{"reason":"hangup"}}
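
A handler for these events might be sketched as below. It assumes the field names shown in the samples above and that `custom_parameters` arrives as a URL-encoded string; `handle_event` and the `session` dict layout are hypothetical, so verify the payload shapes against live traffic before relying on them.

```python
import base64
import json
from urllib.parse import parse_qs


def handle_event(raw: str, session: dict) -> str:
    """Update per-call session state from one JSON text frame."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "start":
        start = msg.get("start", {})
        session["stream_sid"] = start.get("stream_sid")
        session["call_sid"] = start.get("call_sid")
        params = parse_qs(start.get("custom_parameters") or "")
        session["params"] = {key: values[0] for key, values in params.items()}
    elif event == "media":
        # Decode caller audio; a real bot would forward it to the model.
        pcm = base64.b64decode(msg["media"]["payload"])
        session.setdefault("audio", bytearray()).extend(pcm)
    elif event == "dtmf":
        session.setdefault("digits", []).append(msg["dtmf"]["digit"])
    elif event == "stop":
        session["stopped"] = msg.get("stop", {}).get("reason", "unknown")
    return event
```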

Send audio back

{
  "event": "media",
  "streamSid": "XP123",
  "media": { "payload": "<base64 PCM/PCMU>" }
}
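
Building that message in Python could look like the following sketch. It mirrors the JSON shape shown above and enforces the 320-byte and 100 KB rules from the streaming contract; `media_message` is a hypothetical helper, and treating 100 KB as 100,000 bytes is a conservative assumption.

```python
import base64
import json

MAX_FRAME_BYTES = 100_000  # assumption: read "100 KB" conservatively


def media_message(stream_sid: str, audio: bytes) -> str:
    """Wrap one audio frame in the outbound media event."""
    if not audio or len(audio) % 320:
        raise ValueError("frame must be a non-empty multiple of 320 bytes")
    if len(audio) > MAX_FRAME_BYTES:
        raise ValueError("frame exceeds the per-frame size limit")
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })
```

The returned string is what you would pass to your WebSocket library's send call.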

Custom parameters & dynamic URL rules

Max 3 custom parameters.
The total length of the query string (text after ?) must be ≤ 256 characters.
Dynamic HTTP(S) resolver must return a valid ws(s) URL.
Example: ws://127.0.0.1:5001/media?param1=value1&param2=value2&param3=value3
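
A resolver can check these rules before returning a URL. The sketch below is one way to do it; `check_endpoint` is a hypothetical name, and it counts every query parameter toward the limit of three, which is an assumption (whether `sample-rate` counts is not stated here).

```python
from urllib.parse import parse_qsl, urlparse

MAX_PARAMS = 3
MAX_QUERY_CHARS = 256


def check_endpoint(url: str) -> list:
    """Return a list of rule violations; an empty list means the URL passes."""
    parsed = urlparse(url)
    problems = []
    if parsed.scheme not in ("ws", "wss"):
        problems.append("scheme must be ws or wss")
    if len(parsed.query) > MAX_QUERY_CHARS:
        problems.append("query string longer than 256 characters")
    if len(parse_qsl(parsed.query, keep_blank_values=True)) > MAX_PARAMS:
        problems.append("more than 3 parameters")
    return problems
```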

Deployment: sample rate parameters

When deploying, ensure your WSS endpoints support the following sample rate parameters. Defining the query param ?sample-rate=xxxx in the Voicebot applet allows Exotel to negotiate and handle the defined sample rate end‑to‑end:

wss://your-domain.com/?sample-rate=8000 # Standard PSTN
wss://your-domain.com/?sample-rate=16000 # Enhanced quality
wss://your-domain.com/?sample-rate=24000 # HD quality

Your bot should read the sample-rate query parameter at connect time and set/convert its audio pipeline accordingly (e.g., forward to the LLM at 8 kHz or transcode as needed). Ensure returned audio matches the negotiated rate.
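
Reading the parameter at connect time can be sketched as follows. The helper name `negotiated_sample_rate` and the fallback to 8000 Hz when the parameter is missing or unsupported are assumptions; pass in whatever request path your WebSocket server exposes for the handshake.

```python
from urllib.parse import parse_qs, urlparse

SUPPORTED_RATES = (8000, 16000, 24000)


def negotiated_sample_rate(request_path: str, default: int = 8000) -> int:
    """Extract ?sample-rate=... from the handshake path, with a safe fallback."""
    values = parse_qs(urlparse(request_path).query).get("sample-rate")
    if not values:
        return default
    try:
        rate = int(values[0])
    except ValueError:
        return default
    return rate if rate in SUPPORTED_RATES else default
```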

Outbound Calling (Make‑a‑Call API)

Originate calls that land inside your Voicebot flow.

Outgoing call to connect a number to a Call Flow (ExoML start_voice)

This API first calls the From number, and once they pick up, it connects them to the call flow (applet) you specify—e.g., your landing app (which contains the Voicebot Applet).

Endpoint

POST https://<your_api_key>:<your_api_token>@<subdomain>/v1/Accounts/<your_sid>/Calls/connect

Prefer JSON? Append .json → …/Calls/connect.json
Subdomain by region:
- Singapore: api.exotel.com
- Mumbai: api.in.exotel.com
<your_api_key>, <your_api_token>, <your_sid> are in API Settings on your Exotel dashboard.

Required parameters

From — The phone number to be called first (E.164 recommended; landlines should be prefixed with STD, e.g., 0XXXXXX2400).
CallerId — Your ExoPhone (outbound caller ID).
Url — http://my.exotel.com/{your_sid}/exoml/start_voice/{app_id} where {app_id} is the flow (app) id to connect to after From answers.

Optional parameters

CallType — trans for transactional calls.
TimeLimit — Max call duration in seconds (≤ 14400 i.e., 4 hours).
TimeOut — Ring timeout (seconds) for the legs.
StatusCallback — On completion, Exotel POSTs: CallSid, Status (completed|failed|busy|no-answer), RecordingUrl (if enabled), DateUpdated.
CustomField — Any value you want echoed back as a parameter to Passthru/Greetings GET later in the flow.

Example request

curl -X POST \
https://<your_api_key>:<your_api_token>@<subdomain>/v1/Accounts/<your_sid>/Calls/connect \
-d "From=XXXXX30240" \
-d "CallerId=0XXXXXX4890" \
-d "Url=http://my.exotel.com/{your_sid}/exoml/start_voice/926"

HTTP response

200 OK means the request was accepted, not that the call was answered. Track outcomes via StatusCallback or GET Call Details.
The response body includes a Sid (unique call id). Log it for troubleshooting.
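
The same request can be prepared from Python with only the standard library. This is a sketch under assumptions: `build_connect_request` is a hypothetical helper, it uses the `.json` variant of the endpoint, and it sends the key and token as an HTTP Basic `Authorization` header, which is the header form of the `key:token@host` URL used in the cURL example.

```python
import base64
import urllib.parse
import urllib.request


def build_connect_request(api_key, api_token, sid, subdomain,
                          from_number, caller_id, app_id):
    """Prepare (but do not send) the Calls/connect POST request."""
    url = f"https://{subdomain}/v1/Accounts/{sid}/Calls/connect.json"
    form = urllib.parse.urlencode({
        "From": from_number,
        "CallerId": caller_id,
        "Url": f"http://my.exotel.com/{sid}/exoml/start_voice/{app_id}",
    }).encode("ascii")
    credentials = base64.b64encode(f"{api_key}:{api_token}".encode()).decode("ascii")
    request = urllib.request.Request(url, data=form, method="POST")
    request.add_header("Authorization", f"Basic {credentials}")
    return request


# To place the call (network access required):
#   with urllib.request.urlopen(request, timeout=10) as response:
#       body = response.read()  # log the Sid from this body
```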

For Voicebot use, ensure the flow referenced by AppId or ExoML Url contains the Voicebot Applet with your WSS endpoint (static or dynamic). Expose your bot publicly and, if needed, append ?sample-rate=8000|16000|24000 in the applet endpoint.

Routing to Agent or Contact Center After the Voicebot Applet

When the WebSocket connection closes—either because your bot disconnects once the interaction is complete or due to a network‑level termination—Exotel automatically advances to the next applet in the Call Flow. There is no explicit Stop event your bot needs to send to Exotel; instead, close the WebSocket to end the session. Exotel will internally emit Stop and continue the flow.

Passthru behaviour (AgentStream‑aware)

HTTP method: Passthru makes an HTTP GET to your URL with URL‑encoded query parameters.
Sync vs Async: Toggle Make Passthru Async on the applet.
- Sync → use HTTP codes to branch the flow (200 OK → Path A, 302 Found → Path B). Caller waits for your response.
- Async → fire‑and‑forget logging/analytics; does not block the caller.
Payload: Standard passthru params plus streaming fields
(e.g., Stream[StreamSID], Stream[Status], Stream[Duration], Stream[RecordingUrl], Stream[StreamUrl], Stream[DisconnectedBy], Stream[DetailedStatus], Stream[Error]).
- Sometimes all stream fields arrive as a single JSON string under Stream=…—deserialize accordingly.
DisconnectedBy values are normalised to user, bot, NA.
Recording URL appears when Record is enabled in the applet.
Throttling: On concurrency breaches you may see Stream[Status]=failed with Stream[DetailedStatus]=Streaming_call_throttled—implement fallback.
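
Parsing the streaming fields from a Passthru GET could be sketched like this. It handles both shapes described above: bracketed `Stream[...]` keys and a single JSON string under `Stream`. The function names are hypothetical, and the exact field set should be confirmed against your own Passthru logs.

```python
import json
from urllib.parse import parse_qs


def parse_stream_fields(query_string: str) -> dict:
    """Collect Stream[...] fields from a Passthru query string."""
    params = parse_qs(query_string, keep_blank_values=True)
    fields = {}
    for key, values in params.items():
        if key.startswith("Stream[") and key.endswith("]"):
            fields[key[len("Stream["):-1]] = values[0]
    if "Stream" in params:
        # Alternate shape: all fields serialized as one JSON string.
        try:
            blob = json.loads(params["Stream"][0])
        except json.JSONDecodeError:
            blob = {}
        if isinstance(blob, dict):
            for key, value in blob.items():
                fields.setdefault(key, value)
    return fields


def is_throttled(fields: dict) -> bool:
    """True when the stream was rejected for exceeding concurrency."""
    return (fields.get("Status") == "failed"
            and fields.get("DetailedStatus") == "Streaming_call_throttled")
```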

Place Passthru right after Voicebot

This lets you:

Fetch session metadata
Log streaming stats (StreamSID, duration, RecordingUrl)
Detect disconnects reliably
Read escalation flags (your custom logic) and route next

Common routing scenarios

Connect Applet → route to Exotel agent/queue
SIP Connect via vSIP Trunk → route to enterprise contact center
Hangup Applet → gracefully end the call

Example
Caller: “Talk to human.” → Bot finishes → WS closes → Exotel emits Stop → Passthru GET → your endpoint responds 200 OK to signal escalation (custom logic) → SwitchCase → vSIP Trunk via Connect
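
For a synchronous Passthru, the branching decision reduces to choosing an HTTP status code. The sketch below assumes the flow wires 200 to the agent branch and 302 to the other branch; that mapping, the `escalate_requested` flag, and the function name are illustrative assumptions, since the actual wiring is whatever you configure in the Call Flow.

```python
ESCALATE_STATUS = 200  # assumption: flow routes 200 to the agent/Connect branch
CONTINUE_STATUS = 302  # assumption: flow routes 302 to the non-escalation branch


def passthru_status(stream_fields: dict, escalate_requested: bool) -> int:
    """Pick the HTTP status the sync Passthru endpoint should return."""
    stream_failed = stream_fields.get("Status") == "failed"
    if escalate_requested or stream_failed:
        return ESCALATE_STATUS
    return CONTINUE_STATUS
```

Treating a failed stream as an escalation keeps callers from hitting a dead end when the bot could not run.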

Best practices

Keep Passthru immediately after Voicebot/Stream applet
Use Clear/Mark events for context and observability
Monitor Active Streams and enforce concurrency limits
Decide routing based on DisconnectedBy/Status and your flags
Follow WS timeouts/handshake guidelines
Keep custom params concise and secure
Close the WebSocket from the bot to end the session cleanly

Secure Your Bot (Auth Options)

AgentStream supports Basic Auth on both WSS and HTTPS dynamic endpoints.
You can additionally enable IP allowlisting to restrict access to Exotel egress IPs and your network.
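
On the bot side, checking the Basic Auth header during the handshake can be sketched as below. The function name and the idea of passing in the raw `Authorization` header value are assumptions; how you read handshake headers depends on your WebSocket server library.

```python
import base64
import hmac


def basic_auth_ok(header_value, expected_user: str, expected_password: str) -> bool:
    """Validate an 'Authorization: Basic ...' header value."""
    if not header_value or not header_value.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header_value[6:], validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # Constant-time comparisons avoid leaking how much of a secret matched.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok
```

Reject the handshake (for example with HTTP 401) when this returns False, before any audio is exchanged.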

Reference Implementation (repo‑only)

Use the exact code in the repos and keep your blog snippets synchronized with those files.

Main repo (quick POC)

Path: python/openai_realtime_sales_bot.py
How to start (from repo README):

python3 openai_realtime_sales_bot.py

Enterprise framework (production)

Core engine: src/core/realtime_bot.py
Example bots: src/examples/sales_bot.py, src/examples/support_bot.py, src/examples/qualification_bot.py, src/examples/collection_bot.py
Environment template: env.example

Audio configuration

Use the variables in env.example of the enterprise framework (supported sample rates, chunk sizing/buffers). Follow the branch README for current names and defaults.

Run & expose

Use the exact run command from the enterprise framework README (script or server entry), then:

ngrok http <PORT_FROM_REPO_README_OR_ENV>

# Paste the resulting wss://.../agent-stream in the Voicebot Applet.
# To negotiate audio, append ?sample-rate=8000|16000|24000.

For example, if the server entry point is a uvicorn app listening on port 5000:

uvicorn app:app --host 0.0.0.0 --port 5000
ngrok http 5000

Natural Conversation Tuning

Barge‑in handling: Keep responses short (1–2 sentences). With server VAD enabled, ensure the bot stops playback immediately if the caller interrupts.
Turn timing: Target outbound audio chunks of 100–200 ms (per repo defaults) for smooth half‑duplex flow. Tune downwards only if required by bandwidth constraints.
Prosody & persona design: Craft concise system prompts to control tone, persona, and verbosity. Avoid overly long instructions that lead to rambling output.
Language mixing (code‑switching): Configure the model to allow natural shifts (e.g., Hindi↔English) without forcing translation. Validate this in real test calls.
Error recovery & silence detection: Detect long silences (>3–4 seconds) or low confidence responses; gracefully apologize, clarify, or escalate to an agent.
Context retention: Use mark and clear events for contextual checkpoints. Reset state when moving to a new topic to avoid confusion.
Testing & tuning loop: Continuously record and review sessions (if policy allows). Adjust system instructions, frame sizing, and escalation thresholds based on observed behavior.
Framework choice: Use the enterprise framework from the repo for production (multi‑rate audio, metrics, containerization). Use the minimal sample only for POC.
Industry‑specific design: For BFSI, enforce stricter compliance and escalation; for Healthcare, ensure HIPAA/PII masking; for Retail/E‑com, focus on COD/returns workflows; for Automotive, emphasize reminders and service journeys.
LLM tuning: Iterate on SYSTEM_INSTRUCTIONS and voice settings to balance personality vs efficiency. For domain tasks, feed domain lexicons and few‑shot examples. Consider guardrails to avoid hallucinations.
Observability & feedback: Integrate Active Streams and Passthru logs with dashboards. Use conversation transcripts (when compliant) for supervised fine‑tuning.
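
The silence check mentioned above can be approximated with a simple energy measure over inbound frames. This is a rough sketch, not the repo's logic: the RMS threshold of 500 and the 3-second limit are assumed starting points to tune on real calls, and `SilenceTimer` is a hypothetical class. Server-side VAD in the Realtime session remains the primary turn detector.

```python
import math
import struct


def rms(pcm: bytes) -> float:
    """Root-mean-square level of little-endian PCM16 mono audio."""
    count = len(pcm) // 2
    if count == 0:
        return 0.0
    total = sum(s * s for (s,) in struct.iter_unpack("<h", pcm[:count * 2]))
    return math.sqrt(total / count)


class SilenceTimer:
    """Tracks how long the caller has been continuously quiet."""

    def __init__(self, sample_rate_hz: int, threshold: float = 500.0,
                 limit_s: float = 3.0):
        self.threshold = threshold
        self.limit_samples = int(limit_s * sample_rate_hz)
        self.silent_samples = 0  # counted in samples to avoid float drift

    def feed(self, pcm: bytes) -> bool:
        """Feed one inbound frame; returns True once the limit is reached."""
        if rms(pcm) < self.threshold:
            self.silent_samples += len(pcm) // 2
        else:
            self.silent_samples = 0
        return self.silent_samples >= self.limit_samples
```

When `feed` returns True, the bot can prompt the caller, clarify, or close the WebSocket to hand over to an agent.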

Launch Checklist

Sign up with Exotel
KYC complete; Voicebot Applet enabled; correct region.
Static/Dynamic endpoint configured; TLS valid.
.env model/voice set; nearest hosting selected.
Chunking discipline: ~100 ms (or 60 ms) frames, multiples of 320 bytes, ≤100 KB.
Security: Basic Auth on WSS/HTTPS + IP allowlist.
Fallback: If WS/LLM fails → next applet (passthru/connect).
Observability: Log stream_sid, latency, gaps; set alerts.
Passthru after Voicebot: Ensure proper routing/escalation handling.
StatusCallback/Call Detail API: Integrate for outbound monitoring.
Compliance: Recording/PII masking per policy; region‑based data residency.
Scalability: Validate Active Streams concurrency and resource auto‑scaling.

Use Cases

Sales & Marketing: lead qualification, personalized pitches, bookings, feedback collection, event registrations.
Customer Support: NL‑IVR replacement, intelligent FAQs, order status/returns, troubleshooting, multilingual assistance.
Collections & Reminders: EMI reminders, PTP capture, payment nudges, subscription renewals, overdue follow‑ups.
Banking & Financial Services (BFSI): policy renewals, premium explainer, lead conversion, basic KYC assist (non‑sensitive), account activation.
Healthcare & Healthtech: scheduling/rescheduling, pre‑visit guidance, post‑visit follow‑ups, vaccination reminders, medicine refills.
Logistics & E‑commerce: COD confirmation, address validation, rescheduling, delivery instructions, return initiation.
Travel & Mobility: itinerary updates, fare/tariff queries, rescheduling, booking confirmations, cancellation handling.
Automotive & Consumer Services: service bookings, AMC/warranty renewals, recalls, roadside assistance, upsell of accessories.
Always include agent handover on request or low confidence.

Troubleshooting

No audio back: Ensure base64 PCM/PCMU; obey frame size & 320‑byte multiples; confirm LLM emits deltas.
Choppy audio: Frames too small/large; coalesce to ~40–60 ms; avoid >100 KB.
Latency spikes: Shorter responses; lighter logging; nearest region; prefer PCMU @ 8 kHz.
Handshake fails: TLS/host mismatch; malformed dynamic JSON; missing url; Basic Auth mismatch.
DTMF missing: Ensure Voicebot (bidi) is used; log all events.
Passthru not invoked: Confirm the Voicebot applet is immediately followed by a Passthru; check that WS closes cleanly.
Custom params lost: Keep ≤3 params, query string ≤256 chars; validate they appear in start.custom_parameters.
Recording URL missing: Enable recording at the applet level; Passthru will only include Stream[RecordingUrl] when enabled.
Stream throttled: Watch for Stream[DetailedStatus]=Streaming_call_throttled; use Active Streams API to scale or failover.
Multi‑rate issues: Ensure your bot honors the negotiated ?sample-rate (8k/16k/24k); transcode if your model requires a fixed rate.
Outbound call fails: Verify From is a valid, reachable number and CallerId is your ExoPhone; confirm Url points to a valid start_voice applet.

FAQ

Can I use non‑OpenAI realtime models?
Yes — any WebSocket‑based audio in/out model (e.g., custom ASR‑TTS bridges, ElevenLabs, Vapi).
Do I need ASR/TTS separately?
Not with S2S. You can optionally log transcripts asynchronously for QA or analytics.
Which audio format should I return?
PCMU @ 8 kHz for PSTN efficiency; PCM for higher fidelity (16/24 kHz). Always respect frame sizing (≈40–100 ms, multiples of 320 bytes).
How do I pass per‑call context?
Use the dynamic endpoint: return { "url": "wss://.../agent-stream?lead=987&locale=en-IN" }. Your bot will read these via start.custom_parameters.
How do outbound campaigns work?
Use the ExoML start_voice URL (see cURL example). Add retries, scheduling, and ensure your flow contains the Voicebot Applet.
Can I test locally?
Yes. Use ngrok to expose your bot server. Point the Voicebot Applet to the ngrok URL.
What about security?
Enable Basic Auth or IP allowlisting on your WSS/HTTPS endpoint.
How do I monitor quality?
Use Passthru + Active Streams API to log duration, disconnect reason, errors, and recording URLs.

Saurabh Sharma

Saurabh Sharma is a Product Manager at Exotel, driving the development of voice and AI products including CPaaS APIs, Voice Streaming (AgentStream), Virtual SIP (vSIP), Digital Voice SDKs, Conversational AI (ExoBots), and the LeadX platform. With over a decade of experience, he specializes in building developer-first, enterprise-ready communication infrastructure that blends telephony and AI to deliver automation, scalability, and compliance. Passionate about simplifying complexity, Saurabh focuses on API-first platforms, AI-powered engagement, and product strategy that enable enterprises across BFSI, Automobile, Logistics, and EdTech to scale faster and deliver exceptional customer experiences.
