When someone speaks to an AI on a phone call and it feels smooth, fast, and natural, the credit usually goes to the AI:
“The bot understood me.”
“It replied instantly.”
“It sounded human.”
But that experience is not created by the LLM alone. It is created by an invisible infrastructure layer that most developers never see — the layer that takes messy, unpredictable telephony and turns it into something an AI can actually work with.
This blog is about that layer.
- Not prompts.
- Not models.
- Not voices.
But the telephony + streaming system that makes Voice AI work outside demos, on real phone calls, at real scale.
What Happens Before Your AI Hears “Hello”
When a user dials a number and says “Hello”, your AI does not hear that word immediately.
Before the audio reaches your bot, Exotel has already done a large amount of work — in milliseconds.
Let’s slow it down.
Step 1: The Call Enters Telephony Reality
Caller → Mobile Network → Carrier → SIP → Exotel
- A SIP INVITE reaches Exotel’s edge
- Call metadata (From, To, region) is parsed
- Routing logic is applied
- A media path is prepared
- Codec negotiation happens (typically G.711 μ-law)
- Ringing and answer events are handled
- RTP audio starts flowing
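For illustration only, a simplified SIP INVITE for this step might look like the following (every number, host, and tag here is made up; the SDP lines at the bottom are what drive the G.711 μ-law codec negotiation mentioned above):

```
INVITE sip:+911234567890@exotel.example SIP/2.0
Via: SIP/2.0/UDP carrier.example;branch=z9hG4bK776
From: <sip:+919876543210@carrier.example>;tag=1928
To: <sip:+911234567890@exotel.example>
Call-ID: a84b4c76e66710
CSeq: 1 INVITE
Content-Type: application/sdp

v=0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
```

The `PCMU/8000` line is the offer to use G.711 μ-law at 8 kHz; once both sides agree, RTP audio starts flowing on the negotiated port.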
At this point:
- There is no AI
- There is no streaming
- This is pure telephony: stateful and timing-sensitive
Step 2: Turning RTP Audio Into Streams
Once the call is answered, Exotel activates the streaming layer.
This is where AgentStream comes in.
AgentStream is Exotel’s unified streaming layer, built by combining:
- the Stream Applet (unidirectional streaming)
- the Voicebot Applet (bidirectional streaming)
Under the hood:
RTP → Media Engine → PCM → WebSocket
- RTP packets are received (20 ms frames)
- Jitter is absorbed
- μ-law audio is decoded to PCM
- Audio is framed for streaming
- A WebSocket connection is opened to your endpoint
- A connected → start event sequence is sent
Still no AI logic — just making audio consumable.
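The μ-law-to-PCM step is worth seeing concretely. This is a standard G.711 μ-law expansion sketched in JavaScript; Exotel's media engine performs this conversion for you, so the code is purely illustrative:

```javascript
// Standard G.711 μ-law expansion: one encoded byte → one signed 16-bit sample.
function muLawToPcm(muLawByte) {
  const b = ~muLawByte & 0xff;          // μ-law bytes are stored inverted
  const sign = b & 0x80;
  const exponent = (b >> 4) & 0x07;     // 3-bit segment number
  const mantissa = b & 0x0f;            // 4-bit step within the segment
  const sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -sample : sample;
}

// A full 20 ms telephony frame is just 160 of these bytes decoded in sequence.
function decodeFrame(muLawBuffer) {
  const pcm = Buffer.alloc(muLawBuffer.length * 2);
  for (let i = 0; i < muLawBuffer.length; i++) {
    pcm.writeInt16LE(muLawToPcm(muLawBuffer[i]), i * 2);
  }
  return pcm;
}
```

Note the size doubling: 160 μ-law bytes in, 320 PCM bytes out, every 20 ms.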
Step 3: Audio Reaches Your Bot
Your bot does not receive raw buffers or low-level callbacks.
It receives structured WebSocket JSON events:
- connected
- start
- media
- dtmf
- mark
- clear
- stop
A realistic handling loop looks like this:
ws.on("message", (msg) => {
  const event = JSON.parse(msg.toString());
  if (event.event === "media") {
    const pcm = Buffer.from(event.media.payload, "base64");
    processAudio(pcm);
  } else if (event.event === "dtmf") {
    handleDTMF(event.dtmf);
  } else if (event.event === "stop") {
    cleanup();
  }
});
This is the actual contract developers build against.
No SIP parsing.
No RTP handling.
No codec negotiation.
Step 4: Your Bot Responds
To speak back to the caller, your bot sends audio in the same format:
- 16-bit PCM
- 8 kHz
- mono
- base64-encoded
- chunked correctly
ws.send(JSON.stringify({
  event: "media",
  stream_sid,
  media: {
    payload: base64Pcm,
    chunk,
    timestamp
  }
}));
Exotel injects this audio back into the live call.
That’s the full loop.
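"Chunked correctly" above usually means fixed 20 ms frames: 320 bytes at 8 kHz, 16-bit mono. A minimal sketch of slicing outbound PCM into base64 media payloads, assuming that frame size:

```javascript
// 20 ms of 8 kHz, 16-bit mono PCM = 8000 * 2 * 0.02 = 320 bytes per chunk.
const BYTES_PER_CHUNK = 320;

// Slice an arbitrary PCM buffer into base64-encoded chunks ready to be
// wrapped in "media" events. The final chunk may be shorter.
function toMediaPayloads(pcmBuffer) {
  const payloads = [];
  for (let off = 0; off < pcmBuffer.length; off += BYTES_PER_CHUNK) {
    payloads.push(
      pcmBuffer.subarray(off, off + BYTES_PER_CHUNK).toString("base64")
    );
  }
  return payloads;
}
```

Sending chunks at roughly the cadence they represent (one every 20 ms) keeps playback smooth on the caller's side.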
What Exotel Handles (So You Don’t Have To)
This is where the invisible layer earns its keep.
Today, Exotel reliably handles:
- SIP signaling
- Carrier interop
- RTP reception
- Jitter buffering
- μ-law ↔ PCM conversion
- WebSocket lifecycle
- Event ordering
- Chunk timing constraints
- Call disconnect detection
- Recording (when enabled)
- Scaling to thousands of concurrent calls
Your bot only focuses on:
- STT
- LLM logic
- TTS
- Business rules
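Wired together, the bot side reduces to a short loop. The stt, llm, and tts helpers below are trivial stand-ins (real deployments plug in their own vendors); only the event framing around them comes from the streaming contract:

```javascript
// Stand-in pipeline stages; swap in real STT/LLM/TTS providers.
const stt = async (pcm) => "hello";                 // speech → text (stub)
const llm = async (text) => `You said: ${text}`;    // reasoning (stub)
const tts = async (text) => Buffer.alloc(320);      // text → 20 ms PCM (stub)

// One caller utterance in, one bot reply out.
async function respond(pcmChunks, send, streamSid) {
  const text = await stt(Buffer.concat(pcmChunks));
  const reply = await llm(text);
  const pcm = await tts(reply);
  send(JSON.stringify({
    event: "media",
    stream_sid: streamSid,
    media: { payload: pcm.toString("base64") }
  }));
}
```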
Audio Reality (No Marketing Version)
Telephony audio is constrained:
- 8 kHz
- mono
- compressed
- noisy
- packet-loss prone
Exotel does not magically transform PSTN audio into studio-quality sound today.
What it does provide is:
- Stable, predictable PCM frames
- Correct timing and chunking
- Low jitter delivery
- Consistent behavior under load
This is what allows AI pipelines to behave deterministically in production.
Barge-In: What Works Today
When a user interrupts a bot mid-sentence, Exotel supports:
- Voice activity detection
- clear events to stop queued audio
- Bidirectional control via Voicebot Applet
- Bot-initiated interruption handling
This enables basic, production-safe barge-in.
It is not magic.
It is reliable.
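Putting the clear event to work is straightforward: the moment voice activity is detected while bot audio is queued, flush it. The payload shape below mirrors the media event shown earlier; treat the exact clear schema as an assumption and confirm it against the AgentStream docs:

```javascript
// On barge-in: tell the platform to drop any bot audio still queued for
// playback, then stop feeding new chunks until the caller finishes speaking.
function handleBargeIn(send, streamSid, state) {
  if (!state.botSpeaking) return false;       // nothing to interrupt
  send(JSON.stringify({ event: "clear", stream_sid: streamSid }));
  state.botSpeaking = false;                  // pause outbound audio
  return true;
}
```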
Contact Centers: Where StreamKit Fits
Many enterprises already run:
- Genesys
- Avaya
- Ameyo
- Asterisk
- Custom SBCs
These systems speak SIP, not WebSockets.
StreamKit bridges that gap:
SIP Contact Center → StreamKit → AgentStream → Your Bot
By adding a streaming header, enterprises can:
- Stream live calls to AI
- Run agent assist
- Capture real-time transcription
- Build hybrid bot-agent workflows
Without replacing their contact center.
Enterprises looking to upgrade to a modern cloud solution can explore Exotel’s cloud contact center for seamless integration.
Outbound AI Calls (The Most Common Use Case)
The most common Voice AI deployment today is outbound.
The flow is simple:
Your Backend → Connect API → Voicebot Applet → AgentStream → Bot
The moment the user answers, streaming begins.
This powers:
- Lead qualification
- Collections
- Renewal reminders
- NPS surveys
- Customer verification
- Appointment scheduling
At scale.
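Triggering such a call from your backend can look like the sketch below. The endpoint path and parameter names are assumptions modeled on Exotel's public call APIs; verify them against the current API reference before use:

```javascript
// Build the outbound-call request (returned as data rather than sent over
// the network, so the shape is easy to inspect and test).
function buildConnectRequest({ sid, token, from, to, callerId, flowUrl }) {
  return {
    method: "POST",
    url: `https://api.exotel.com/v1/Accounts/${sid}/Calls/connect.json`,
    auth: `${sid}:${token}`,                  // HTTP basic auth
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({
      From: from,                             // the customer being dialed
      CallerId: callerId,                     // the number shown to the customer
      To: to,
      Url: flowUrl                            // flow containing the Voicebot Applet
    }).toString()
  };
}
```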
Enhance your automation with Exotel’s robust auto dialer to streamline high-volume outbound calling.
Debugging & Observability (What Exists Today)
Today, developers have access to:
- Call SIDs and leg identifiers
- Webhook lifecycle events
- Streaming events (start, media, stop)
- Disconnect reasons
- Recording URLs (if enabled)
This allows teams to build internal dashboards and trace failures end-to-end.
Reference Implementations
Exotel provides working reference implementations:
- AgentStream Core: https://github.com/exotel/Agent-Stream
- Minimal Echo Bot: https://github.com/exotel/Agent-Stream-echobot
- Streaming Utilities: https://github.com/exotel/voice-streaming
These demonstrate:
- event handling
- base64 PCM handling
- bidirectional streaming
- correct chunk behaviour
Why This Layer Exists
Exotel did not start with AI.
We started with calls:
- IVRs
- contact centers
- cloud telephony
- billions of real phone calls
When AI entered voice, the pattern was clear:
Great intelligence. Fragile infrastructure.
Voice AI fails not because models are weak but because telephony is unforgiving.
So we built the layer that absorbs reality.
What’s Next
To keep this blog honest, here's what is not fully available today but is actively in development:
Coming Next
- Advanced noise suppression
- Improved echo cancellation
- Richer stream-level observability
- More granular audio health metrics
- Tighter AI-aware streaming controls
These will be covered in a future blog.
Final Thought
When Voice AI works smoothly on a real phone call, it’s not luck.
It’s infrastructure.
Exotel’s Voice for AI stack exists so developers can build real AI conversations on real phone calls, without fighting SIP, RTP, codecs, or telco edge cases.
You focus on intelligence.
We handle the invisible layer.
That’s the job.