When someone speaks to an AI on a phone call and it feels smooth, fast, and natural, the credit usually goes to the AI:
“The bot understood me.”
“It replied instantly.”
“It sounded human.”
But that experience is not created by the LLM alone. It is created by an invisible infrastructure layer that most developers never see — the layer that takes messy, unpredictable telephony and turns it into something an AI can actually work with.
This blog is about that layer.
- Not prompts.
- Not models.
- Not voices.
But the telephony + streaming system that makes Voice AI work outside demos, on real phone calls, at real scale.
What Happens Before Your AI Hears “Hello”
When a user dials a number and says “Hello”, your AI does not hear that word immediately.
Before the audio reaches your bot, Exotel has already done a large amount of work — in milliseconds.
Let’s slow it down.
Step 1: The Call Enters Telephony Reality
Caller → Mobile Network → Carrier → SIP → Exotel
- A SIP INVITE reaches Exotel’s edge
- Call metadata (From, To, region) is parsed
- Routing logic is applied
- A media path is prepared
- Codec negotiation happens (typically G.711 μ-law)
- Ringing and answer events are handled
- RTP audio starts flowing
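For illustration only, a simplified SIP INVITE for this step might look like the following (every number, host, and tag here is made up; the SDP lines at the bottom are what drive the G.711 μ-law codec negotiation mentioned above):

```
INVITE sip:+911234567890@exotel.example SIP/2.0
Via: SIP/2.0/UDP carrier.example;branch=z9hG4bK776
From: <sip:+919876543210@carrier.example>;tag=1928
To: <sip:+911234567890@exotel.example>
Call-ID: a84b4c76e66710
CSeq: 1 INVITE
Content-Type: application/sdp

v=0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
```

The `PCMU/8000` line is the offer to use G.711 μ-law at 8 kHz; once both sides agree, RTP audio starts flowing on the negotiated port.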
At this point:
- There is no AI
- There is no streaming
- This is pure telephony: stateful and timing-sensitive
Step 2: Turning RTP Audio Into Streams
Once the call is answered, Exotel activates the streaming layer.
This is where AgentStream comes in.
AgentStream is Exotel’s unified streaming layer, built by combining:
- the Stream Applet (unidirectional streaming)
- the Voicebot Applet (bidirectional streaming)
Under the hood:
RTP → Media Engine → PCM → WebSocket
- RTP packets are received (20 ms frames)
- Jitter is absorbed
- μ-law audio is decoded to PCM
- Audio is framed for streaming
- A WebSocket connection is opened to your endpoint
- A connected → start event sequence is sent
Still no AI logic — just making audio consumable.
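The μ-law-to-PCM step is worth seeing concretely. This is a standard G.711 μ-law expansion sketched in JavaScript; Exotel's media engine performs this conversion for you, so the code is purely illustrative:

```javascript
// Standard G.711 μ-law expansion: one encoded byte → one signed 16-bit sample.
function muLawToPcm(muLawByte) {
  const b = ~muLawByte & 0xff;          // μ-law bytes are stored inverted
  const sign = b & 0x80;
  const exponent = (b >> 4) & 0x07;     // 3-bit segment number
  const mantissa = b & 0x0f;            // 4-bit step within the segment
  const sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -sample : sample;
}

// A full 20 ms telephony frame is just 160 of these bytes decoded in sequence.
function decodeFrame(muLawBuffer) {
  const pcm = Buffer.alloc(muLawBuffer.length * 2);
  for (let i = 0; i < muLawBuffer.length; i++) {
    pcm.writeInt16LE(muLawToPcm(muLawBuffer[i]), i * 2);
  }
  return pcm;
}
```

Note the size doubling: 160 μ-law bytes in, 320 PCM bytes out, every 20 ms.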
Step 3: Audio Reaches Your Bot
Your bot does not receive raw buffers or low-level callbacks.
It receives structured WebSocket JSON events:
- connected
- start
- media
- dtmf
- mark
- clear
- stop
A realistic handling loop looks like this:
ws.on("message", (msg) => {
  const event = JSON.parse(msg.toString());
  if (event.event === "media") {
    const pcm = Buffer.from(event.media.payload, "base64");
    processAudio(pcm);
  } else if (event.event === "dtmf") {
    handleDTMF(event.dtmf);
  } else if (event.event === "stop") {
    cleanup();
  }
});
This is the actual contract developers build against.
No SIP parsing.
No RTP handling.
No codec negotiation.
Step 4: Your Bot Responds
To speak back to the caller, your bot sends audio in the same format:
- 16-bit PCM
- 8 kHz
- mono
- base64-encoded
- chunked correctly
ws.send(JSON.stringify({
  event: "media",
  stream_sid,
  media: {
    payload: base64Pcm,
    chunk,
    timestamp
  }
}));
Exotel injects this audio back into the live call.
That’s the full loop.
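"Chunked correctly" above usually means fixed 20 ms frames: 320 bytes at 8 kHz, 16-bit mono. A minimal sketch of slicing outbound PCM into base64 media payloads, assuming that frame size:

```javascript
// 20 ms of 8 kHz, 16-bit mono PCM = 8000 * 2 * 0.02 = 320 bytes per chunk.
const BYTES_PER_CHUNK = 320;

// Slice an arbitrary PCM buffer into base64-encoded chunks ready to be
// wrapped in "media" events. The final chunk may be shorter.
function toMediaPayloads(pcmBuffer) {
  const payloads = [];
  for (let off = 0; off < pcmBuffer.length; off += BYTES_PER_CHUNK) {
    payloads.push(
      pcmBuffer.subarray(off, off + BYTES_PER_CHUNK).toString("base64")
    );
  }
  return payloads;
}
```

Sending chunks at roughly the cadence they represent (one every 20 ms) keeps playback smooth on the caller's side.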
What Exotel Handles (So You Don’t Have To)
This is where the invisible layer earns its keep.
Today, Exotel reliably handles:
- SIP signaling
- Carrier interop
- RTP reception
- Jitter buffering
- μ-law ↔ PCM conversion
- WebSocket lifecycle
- Event ordering
- Chunk timing constraints
- Call disconnect detection
- Recording (when enabled)
- Scaling to thousands of concurrent calls
Your bot only focuses on:
- STT
- LLM logic
- TTS
- Business rules
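Wired together, the bot side reduces to a short loop. The stt, llm, and tts helpers below are trivial stand-ins (real deployments plug in their own vendors); only the event framing around them comes from the streaming contract:

```javascript
// Stand-in pipeline stages; swap in real STT/LLM/TTS providers.
const stt = async (pcm) => "hello";                 // speech → text (stub)
const llm = async (text) => `You said: ${text}`;    // reasoning (stub)
const tts = async (text) => Buffer.alloc(320);      // text → 20 ms PCM (stub)

// One caller utterance in, one bot reply out.
async function respond(pcmChunks, send, streamSid) {
  const text = await stt(Buffer.concat(pcmChunks));
  const reply = await llm(text);
  const pcm = await tts(reply);
  send(JSON.stringify({
    event: "media",
    stream_sid: streamSid,
    media: { payload: pcm.toString("base64") }
  }));
}
```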
Audio Reality (No Marketing Version)
Telephony audio is constrained:
- 8 kHz
- mono
- compressed
- noisy
- packet-loss prone
Exotel does not magically transform PSTN audio into studio-quality sound today.
What it does provide is:
- Stable, predictable PCM frames
- Correct timing and chunking
- Low jitter delivery
- Consistent behavior under load
This is what allows AI pipelines to behave deterministically in production.
Barge-In: What Works Today
When a user interrupts a bot mid-sentence, Exotel supports:
- Voice activity detection
- clear events to stop queued audio
- Bidirectional control via Voicebot Applet
- Bot-initiated interruption handling
This enables basic, production-safe barge-in.
It is not magic.
It is reliable.
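Putting the clear event to work is straightforward: the moment voice activity is detected while bot audio is queued, flush it. The payload shape below mirrors the media event shown earlier; treat the exact clear schema as an assumption and confirm it against the AgentStream docs:

```javascript
// On barge-in: tell the platform to drop any bot audio still queued for
// playback, then stop feeding new chunks until the caller finishes speaking.
function handleBargeIn(send, streamSid, state) {
  if (!state.botSpeaking) return false;       // nothing to interrupt
  send(JSON.stringify({ event: "clear", stream_sid: streamSid }));
  state.botSpeaking = false;                  // pause outbound audio
  return true;
}
```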
Contact Centers: Where StreamKit Fits
Many enterprises already run:
- Genesys
- Avaya
- Ameyo
- Asterisk
- Custom SBCs
These systems speak SIP, not WebSockets.
StreamKit bridges that gap:
SIP Contact Center → StreamKit → AgentStream → Your Bot
By adding a streaming header, enterprises can:
- Stream live calls to AI
- Run agent assist
- Capture real-time transcription
- Build hybrid bot-agent workflows
Without replacing their contact center.
Enterprises looking to upgrade to a modern cloud solution can explore Exotel’s cloud contact center for seamless integration.
Outbound AI Calls (The Most Common Use Case)
The most common Voice AI deployment today is outbound.
The flow is simple:
Your Backend → Connect API → Voicebot Applet → AgentStream → Bot
The moment the user answers, streaming begins.
This powers:
- Lead qualification
- Collections
- Renewal reminders
- NPS surveys
- Customer verification
- Appointment scheduling
At scale.
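Triggering such a call from your backend can look like the sketch below. The endpoint path and parameter names are assumptions modeled on Exotel's public call APIs; verify them against the current API reference before use:

```javascript
// Build the outbound-call request (returned as data rather than sent over
// the network, so the shape is easy to inspect and test).
function buildConnectRequest({ sid, token, from, to, callerId, flowUrl }) {
  return {
    method: "POST",
    url: `https://api.exotel.com/v1/Accounts/${sid}/Calls/connect.json`,
    auth: `${sid}:${token}`,                  // HTTP basic auth
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({
      From: from,                             // the customer being dialed
      CallerId: callerId,                     // the number shown to the customer
      To: to,
      Url: flowUrl                            // flow containing the Voicebot Applet
    }).toString()
  };
}
```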
Enhance your automation with Exotel’s robust auto dialer to streamline high-volume outbound calling.
Debugging & Observability (What Exists Today)
Today, developers have access to:
- Call SIDs and leg identifiers
- Webhook lifecycle events
- Streaming events (start, media, stop)
- Disconnect reasons
- Recording URLs (if enabled)
This allows teams to build internal dashboards and trace failures end-to-end.
Reference Implementations
Exotel provides working reference implementations:
- AgentStream Core: https://github.com/exotel/Agent-Stream
- Minimal Echo Bot: https://github.com/exotel/Agent-Stream-echobot
- Streaming Utilities: https://github.com/exotel/voice-streaming
These demonstrate:
- event handling
- base64 PCM handling
- bidirectional streaming
- correct chunk behaviour
Why This Layer Exists
Exotel did not start with AI.
We started with calls:
- IVRs
- contact centers
- cloud telephony
- billions of real phone calls
When AI entered voice, the pattern was clear:
Great intelligence. Fragile infrastructure.
Voice AI fails not because models are weak but because telephony is unforgiving.
So we built the layer that absorbs reality.
What’s Next
To keep this blog honest, here's what is not fully available today but is actively in development:
Coming Next
- Advanced noise suppression
- Improved echo cancellation
- Richer stream-level observability
- More granular audio health metrics
- Tighter AI-aware streaming controls
These will be covered in a future blog.
Final Thought
When Voice AI works smoothly on a real phone call, it’s not luck.
It’s infrastructure.
Exotel’s Voice for AI stack exists so developers can build real AI conversations on real phone calls, without fighting SIP, RTP, codecs, or telco edge cases.
You focus on intelligence.
We handle the invisible layer.
That’s the job.