A technical deep-dive into WebSockets, VSIP, stereo streaming, and active-active failover—for the engineering teams building India’s next generation of Voice AI.
In Part 1 of this series, we made the business case: infrastructure reliability is what closes enterprise deals for Voice AI companies in India. Sarvam, Skit, and SquadStack don’t just sell their LLMs—they sell the promise that the conversation will never drop.
This post is for the engineers who have to deliver on that promise.
If you’re building real-time voice AI—whether it’s a multilingual support agent, an outbound collections bot, or an AI receptionist—you already know that your LLM is only one piece of a much larger latency puzzle. The full chain looks something like this:
The Voice AI Latency Chain
Caller speaks → Network transport → ASR (Speech-to-Text) → LLM inference → TTS (Text-to-Speech) → Network transport → Caller hears response
Every millisecond you burn on network transport and audio processing is a millisecond stolen from your LLM’s thinking time. And in a live voice conversation, the total round-trip budget is brutally tight: anything over 800ms starts feeling unnatural. Cross 1.5 seconds, and the caller talks over the bot. Hit 2 seconds, and they hang up.
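To make that budget concrete, here is a back-of-the-envelope allocation in Python. The per-stage numbers are illustrative assumptions, not measurements from any particular stack:

```python
# Rough round-trip latency budget for one voice AI turn (numbers illustrative).
BUDGET_MS = 800  # beyond this, the pause starts to feel unnatural

stages = {
    "network_transport_in": 25,   # caller -> telephony -> AI engine
    "asr_streaming": 150,         # speech-to-text (streaming, partial results)
    "tts_first_byte": 120,        # text-to-speech time-to-first-audio
    "network_transport_out": 25,  # AI engine -> telephony -> caller
}

overhead = sum(stages.values())
llm_budget = BUDGET_MS - overhead
print(f"Non-LLM overhead: {overhead} ms")            # 320 ms
print(f"Remaining LLM inference budget: {llm_budget} ms")  # 480 ms
```

Every millisecond shaved off transport or audio processing in this model goes straight into `llm_budget`.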
This is the architecture story of how Exotel’s CPaaS stack is engineered to give your AI engine the maximum possible latency budget—by compressing everything else in the chain to near-zero.
1. The Low-Latency Streaming Pipeline: WebSockets & VSIP
Why Traditional SIP Falls Short for AI
SIP (Session Initiation Protocol) has been the backbone of VoIP for decades, and it works well for human-to-human conversations. But when you’re piping audio into an AI inference engine, SIP introduces unnecessary overhead:
- Multiple intermediary hops: Traditional SIP architectures route audio through media gateways, transcoders, and session border controllers—each adding latency.
- Codec negotiation overhead: SIP’s offer/answer model for codec negotiation adds setup time that’s irrelevant when your ASR engine has fixed input requirements.
- Store-and-forward patterns: Many SIP implementations buffer audio packets before forwarding, introducing jitter that’s acceptable for humans but disastrous for real-time AI processing.
For Vipatra’s engineering team, this was the first bottleneck they hit when building their voice AI platform. Their target was sub-300ms round-trip for natural conversation flow, and the traditional SIP path was consuming nearly half that budget before audio even reached their ASR engine.
Exotel’s Approach: WebSocket + VSIP Streaming
Exotel’s streaming architecture bypasses the traditional SIP media path entirely for AI workloads. Instead, it leverages two complementary protocols:
- WebSockets provide a persistent, full-duplex connection between Exotel’s media servers and your AI engine. Unlike HTTP request-response patterns, WebSockets maintain an open channel—audio chunks flow the instant they arrive at Exotel’s switches, with no handshake overhead per packet.
- VSIP (Voice Stream Integration Protocol) is Exotel’s purpose-built protocol layer that sits on top of WebSockets, optimized specifically for voice AI workloads.
VSIP handles:
- Raw audio chunk delivery: Audio is streamed in small chunks (typically 20–40ms frames) directly to your engine the moment it hits Exotel’s infrastructure. Your LLM can begin “thinking” before the caller finishes their sentence.
- Packet prioritization: VSIP pipes are optimized specifically for voice traffic, with QoS (Quality of Service) policies that ensure minimal jitter even during network congestion.
- Barge-in detection: When a caller interrupts the bot mid-sentence (a critical capability for natural conversation), VSIP signals your engine to stop TTS playback and switch to listening mode instantly—no waiting for the current audio buffer to drain.
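A consumer of this stream might look like the sketch below. The frame schema, event names, and endpoint are assumptions for illustration (the real VSIP contract lives in Exotel's API documentation); the common pattern is that binary WebSocket frames carry raw audio and text frames carry control events such as barge-in:

```python
# Sketch of consuming a VSIP-over-WebSocket stream. The frame schema, event
# names, and endpoint URL are illustrative assumptions, not the documented API.
import asyncio
import json

SAMPLE_RATE = 8000       # narrowband telephony PCM
BYTES_PER_SAMPLE = 2     # 16-bit samples
FRAME_MS = 20            # 20 ms frames, the lower end of the 20-40 ms range
FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 320 bytes

def handle_frame(message, state):
    """Dispatch one message: binary frames are audio, text frames are control."""
    if isinstance(message, bytes):
        state["audio_bytes"] += len(message)  # hand the chunk to your ASR here
        return "audio"
    event = json.loads(message)
    if event.get("event") == "barge_in":      # caller interrupted the bot
        state["tts_playing"] = False          # stop TTS playback immediately
        return "barge_in"
    return event.get("event", "unknown")

async def consume(url):
    import websockets  # third-party package: pip install websockets
    state = {"audio_bytes": 0, "tts_playing": True}
    async with websockets.connect(url) as ws:
        async for message in ws:  # str for text frames, bytes for binary frames
            handle_frame(message, state)

# Entry point (hypothetical URL): asyncio.run(consume("wss://example.invalid/vsip"))
```

Note that barge-in is handled out-of-band from the audio path: the control frame flips playback state the moment it arrives, without waiting for queued audio to drain.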
Latency Comparison: Traditional SIP vs. Exotel WebSocket/VSIP
| Metric | Traditional SIP Path | Exotel WebSocket/VSIP |
|---|---|---|
| Audio delivery to AI engine | 80–150ms (via media gateway) | 10–25ms (direct stream) |
| Intermediary hops | 3–5 (SBC, gateway, transcoder) | 1 (Exotel switch → your engine) |
| Barge-in response time | 300–500ms | < 50ms |
| Codec flexibility | Negotiated per-session | Fixed, pre-configured |
| Connection model | New session per call | Persistent WebSocket pool |
The net result: Exotel’s streaming pipeline typically reclaims 70–120ms of latency budget that would otherwise be consumed by network transport—milliseconds that your LLM can use for generating better, more contextual responses.
2. Audio Fidelity: The Mono vs. Stereo Edge
The Mono Problem
Most telecom infrastructure delivers audio as a single mono channel—the caller’s voice and the bot’s voice mixed together into one audio stream. For human listeners, this is fine. For ASR engines, it’s a nightmare.
When your Speech-to-Text engine receives a mono stream, it has to solve two problems simultaneously:
- Speaker diarization: Figuring out which audio segments belong to the caller vs. the bot—a computationally expensive process that introduces its own latency and error rate.
- Echo suppression: The bot’s own TTS output bleeds into the caller’s audio. Software-based echo cancellation helps, but it’s never perfect—especially with Indian telecom infrastructure where echo characteristics vary by carrier and region.
The result: your ASR engine wastes compute cycles on speaker separation, introduces transcription errors from imperfect echo cancellation, and adds 50–100ms of processing latency—before it even begins actual speech recognition.
Exotel’s Stereo Streaming Architecture
Exotel provides dual-channel stereo audio where the caller and the bot occupy separate, isolated tracks:
- Channel 1 (Left): Caller’s voice only—clean, isolated, with no bot audio contamination.
- Channel 2 (Right): Bot’s TTS output only—useful for logging, quality monitoring, and conversation analytics.
This architectural separation solves the echo problem at the infrastructure level rather than in software: the bot's TTS output never mixes into the caller's channel in the first place. Your ASR engine receives a clean, single-speaker audio stream and can skip diarization entirely, focusing 100% of its compute on accurate transcription.
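As a minimal illustration of what "separate, isolated tracks" means at the byte level, here is a sketch that splits interleaved 16-bit stereo PCM into the two channels. The little-endian s16 interleaved layout is an assumption; check the actual format of your recording or stream:

```python
# Minimal sketch: split interleaved 16-bit stereo PCM into caller and bot tracks.
# Assumes little-endian s16 interleaved frames (L, R, L, R, ...), with the
# channel mapping from the text: left = caller, right = bot.

def split_stereo(pcm: bytes) -> tuple:
    """Return (caller_channel, bot_channel) from interleaved 16-bit stereo PCM."""
    caller = bytearray()  # channel 1 (left): caller's voice only
    bot = bytearray()     # channel 2 (right): bot's TTS output only
    for i in range(0, len(pcm) - 3, 4):  # 4 bytes per stereo frame (2 per sample)
        caller += pcm[i:i + 2]
        bot += pcm[i + 2:i + 4]
    return bytes(caller), bytes(bot)
```

Only the caller channel needs to reach the ASR engine; the bot channel can be archived for QA and analytics.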
Real-World Impact: Fundamento’s Experience
Fundamento’s engineering team was processing complex financial conversations—loan terms, EMI calculations, account numbers—where transcription accuracy is non-negotiable. A misheard number can mean a wrong account lookup, a failed payment, or a compliance violation.
After switching from mono to Exotel’s stereo streaming, Fundamento observed measurable improvements in their ASR pipeline: reduced word error rate on numeric sequences, elimination of “ghost transcriptions” caused by echo bleed, and a meaningful reduction in ASR processing time per utterance—freeing up compute headroom for their LLM inference layer.
When to Use Stereo vs. Mono
Stereo isn’t always necessary. Here’s a practical decision framework:
| Use Case | Recommended | Why |
|---|---|---|
| Real-time AI conversations (support, collections, sales) | Stereo | ASR accuracy and latency are critical |
| Call recording for post-call analytics | Stereo | Separate channels enable precise speaker attribution |
| Simple IVR (press 1 for X, press 2 for Y) | Mono | No real-time ASR; DTMF detection is channel-agnostic |
| Outbound notification calls (one-way TTS) | Mono | No caller speech to transcribe |
3. High Availability by Design: Active-Active Architecture
Active-Active vs. Active-Passive: Why the Distinction Matters
Most CPaaS platforms implement High Availability as an Active-Passive configuration: one primary system handles all traffic, and a standby system takes over if the primary fails. The problem for voice AI is the failover gap—the time between the primary going down and the standby becoming fully active. In a typical Active-Passive setup, this gap ranges from 5–30 seconds.
Thirty seconds of dead air on a live voice call isn’t a “brief interruption.” It’s a dropped call. The customer hangs up, the collection attempt fails, and the enterprise client’s confidence in your platform takes a hit.
Exotel implements Active-Active architecture:
- Both paths are live simultaneously. Traffic is load-balanced across multiple active nodes. There is no “standby” waiting to be activated—every node is serving calls at all times.
- Stateful session management. Call state is replicated across active nodes. If one node fails, in-flight calls are seamlessly continued on another node without the caller experiencing any interruption.
- Zero-gap failover. Because there’s no cold-start delay for a standby system, failover is measured in milliseconds—fast enough that a live voice conversation continues without a perceptible glitch.
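The difference from Active-Passive can be captured in a toy model: every node serves live traffic, call state is replicated cluster-wide, and a failed node's in-flight calls simply continue on a surviving node. This is an illustration of the concept only, not Exotel's internals:

```python
# Toy model of active-active failover (illustrative, not Exotel's implementation).

class Cluster:
    def __init__(self, nodes):
        self.live = set(nodes)
        self.sessions = {}          # call_id -> node currently serving the call
        self.replicated_state = {}  # call_id -> call state, visible to ALL nodes

    def place_call(self, call_id, state):
        # Round-robin placement: no standby, every node takes live traffic.
        node = sorted(self.live)[len(self.sessions) % len(self.live)]
        self.sessions[call_id] = node
        self.replicated_state[call_id] = state
        return node

    def node_down(self, node):
        # No cold start: surviving nodes already hold the replicated call state,
        # so reassignment is just a pointer update, not a standby warm-up.
        self.live.discard(node)
        for call_id, serving in self.sessions.items():
            if serving == node:
                self.sessions[call_id] = sorted(self.live)[0]
```

The key property: `replicated_state` survives the failure, so the reassigned node resumes the conversation mid-turn instead of starting a new session.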
Multi-Operator, Multi-Region Redundancy
Active-Active architecture solves for node-level failures. But what about carrier-level or region-level outages? Exotel’s redundancy extends across multiple dimensions:
- Multi-operator: Call traffic is distributed across multiple telecom operators. If Operator A experiences congestion in a specific telecom circle, traffic automatically shifts to Operator B. The routing logic evaluates real-time carrier health metrics—ASR (Answer Seizure Ratio), call setup time, packet loss—and makes routing decisions per-call.
- Multi-region: Infrastructure spans multiple data center regions across India. A catastrophic failure in one region (power grid failure, natural disaster, major network event) triggers automatic rerouting through geographically distant DCs.
- Multi-path: For each call, multiple potential paths exist simultaneously. The platform selects the optimal path based on current conditions and can switch mid-call if conditions change—transparently to both the caller and the AI engine.
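A sketch of what per-call, health-based operator selection could look like. The metric names follow the text above, but the scoring weights and carrier data are invented for illustration; the actual routing engine is proprietary:

```python
# Illustrative per-call operator selection from real-time health metrics.
# Scoring weights are made up for the sketch, not taken from any real system.

def pick_operator(operators):
    """Choose the operator with the best composite health score."""
    def score(op):
        return (op["asr"] * 100            # Answer Seizure Ratio: higher is better
                - op["setup_ms"] * 0.1     # call setup time: lower is better
                - op["packet_loss"] * 500)  # packet loss fraction: lower is better
    return max(operators, key=score)

carriers = [
    {"name": "operator_a", "asr": 0.62, "setup_ms": 900, "packet_loss": 0.02},
    {"name": "operator_b", "asr": 0.58, "setup_ms": 450, "packet_loss": 0.005},
]
best = pick_operator(carriers)  # operator_b wins despite the lower answer ratio
```

The point of the sketch: a congested circle drags down an operator's composite score in real time, and the next call quietly routes around it.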
For Sarvam’s national-scale deployments, this architecture means their voice AI platform delivers consistent performance across every Indian telecom circle—regardless of which carrier is having a bad day in which region.
4. High-Throughput Traffic Management: CPS Control at the VN Level
The GPU Bottleneck Problem
Here’s a scenario that every Voice AI platform engineer will recognize: your outbound dialer is configured to push 200 CPS. Your GPU cluster can handle real-time inference for 150 concurrent sessions. At 200 CPS with an average call duration of 45 seconds, you’ll blow past 150 concurrent sessions in under a second—and every call beyond that either queues (adding latency) or fails (dropping the call).
The root cause isn’t insufficient GPU capacity. It’s insufficient traffic control between the telephony layer and the inference layer.
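The mismatch is just Little's Law: steady-state concurrency equals arrival rate times average call duration. Applying it to the numbers above:

```python
# Little's Law (L = lambda x W) applied to the scenario above.
cps = 200                # calls per second the dialer pushes
avg_duration_s = 45      # average call duration
gpu_capacity = 150       # concurrent sessions the GPU cluster can serve

steady_state_concurrency = cps * avg_duration_s  # concurrency the dialer implies
seconds_to_saturation = gpu_capacity / cps       # time until the GPUs are full

print(steady_state_concurrency)  # 9000: 60x what the cluster can actually serve
print(seconds_to_saturation)     # 0.75: capacity is exhausted in under a second
```

Working the formula backwards gives the fix: with 150 sessions of capacity and 45-second calls, the sustainable rate is roughly 150 / 45 ≈ 3.3 CPS, which is exactly the kind of limit the controls below let you enforce.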
Exotel’s VN-Level CPS Controls
Exotel provides granular traffic management at the Virtual Number (VN) level, giving you precise control over call throughput:
- Per-VN CPS rate limiting: Set maximum CPS rates for each virtual number independently. If your inbound support line can handle 50 CPS but your outbound collections campaign can push 500 CPS, each gets its own rate configuration.
- Dynamic CPS adjustment: Adjust CPS limits in real-time based on your infrastructure’s current capacity. Running a GPU upgrade over the weekend? Throttle CPS temporarily. Scaling up for a campaign launch? Increase limits without redeploying.
- Overflow management: When incoming call volume exceeds CPS limits, Exotel’s platform can queue, redirect to overflow numbers, or return appropriate SIP responses—configurable per use case.
- CPS burst handling: Support for controlled burst capacity above steady-state CPS limits, with configurable burst windows and cooldown periods.
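The semantics described above—a steady-state rate, a burst ceiling, and a cooldown as capacity refills—map naturally onto a token bucket. This is an illustrative model of the behavior, not Exotel's implementation:

```python
# Sketch of per-VN CPS limiting as a token bucket (an illustrative model of the
# behavior described above, not Exotel's actual rate limiter).

class CpsLimiter:
    def __init__(self, cps, burst):
        self.rate = cps            # steady-state calls per second
        self.capacity = burst      # burst ceiling above steady state
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        """Return True if a call may be placed at time `now` (seconds)."""
        # Refill tokens at the steady-state rate, capped at the burst ceiling.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # overflow path: queue, redirect, or return a SIP response
```

A limiter built like this absorbs a campaign-launch burst up to `burst` calls, then settles back to `cps` as tokens refill.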
Practical Example: Matching CPS to GPU Capacity
Scenario: SquadStack’s outbound collections platform, targeting 5,000 concurrent calls during peak hours.
GPU capacity: Cluster supports real-time inference for 5,200 concurrent sessions (with 4% headroom).
Average call duration: 55 seconds.
Required CPS: ~91 CPS sustained (5,000 / 55) to maintain 5,000 concurrent calls at steady state.
Exotel CPS config: VN-level CPS set to 95 with burst capacity to 120 for campaign ramp-up periods.
Safety valve: If GPU utilization exceeds 90%, the platform signals Exotel to temporarily throttle CPS to 70—preventing inference degradation without dropping in-flight calls.
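The sizing arithmetic from this scenario, expressed as a quick sanity check (all numbers taken from the scenario above):

```python
# Sanity-checking the SquadStack-style sizing (numbers from the scenario above).
target_concurrent = 5000   # peak concurrent calls
avg_duration_s = 55        # average call duration
gpu_capacity = 5200        # concurrent inference sessions the cluster supports

required_cps = target_concurrent / avg_duration_s   # ~90.9, so configure 95
headroom = (gpu_capacity - target_concurrent) / target_concurrent  # 4%
fits = required_cps * avg_duration_s <= gpu_capacity  # steady state fits capacity

print(round(required_cps, 1))  # 90.9
```

Note the asymmetry in the safety valve: burst capacity (120) briefly exceeds the sustainable rate during ramp-up, while the throttle floor (70) sits well below it, letting in-flight calls drain before the GPUs degrade.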
5. The ULVNO Advantage: Deep Telecom Grid Integration
What ULVNO Status Means Technically
Exotel’s Unified License Virtual Network Operator (ULVNO) status provides direct interconnection with Indian telecom infrastructure. In practice, this means:
- Fewer network hops: Calls route directly between Exotel’s infrastructure and the PSTN, without passing through third-party intermediaries. Fewer hops = lower latency + fewer points of failure.
- Any-series DID provisioning: Direct access to the full range of DID number series from Indian telecom operators. This isn’t just a business convenience—different number series have measurably different pickup rates in different regions.
- Carrier-grade signaling: Direct SS7/SIGTRAN integration with telecom switches provides call setup times and reliability metrics that wrapper-based platforms cannot match.
- Regulatory alignment: ULVNO compliance ensures that all call routing, recording, and number management adheres to TRAI and DoT regulations by default—eliminating the compliance overhead that AI companies would otherwise need to manage themselves.
15 Years of Infrastructure Hardening
Exotel’s telecom grid integration isn’t a recent addition. It’s the product of 15 years of continuous engineering:
- Direct relationships with every major Indian telecom operator.
- Battle-tested routing logic refined across billions of calls.
- Deep understanding of regional telecom characteristics: latency profiles, congestion patterns, and failure modes vary significantly across Indian telecom circles.
- A proprietary routing engine that incorporates real-time carrier health data for per-call path optimization.
For engineering teams at companies like Vipatra and Fundamento, this means one critical thing: when your code sends a trigger to initiate or receive a call, the connection happens reliably, consistently, and with the lowest possible latency. You don’t need to build telecom expertise in-house—Exotel’s 15 years of infrastructure work becomes an extension of your own stack.
6. Putting It All Together: The Full Architecture
Here’s how all five layers work together in a typical Voice AI deployment on Exotel:
- Caller dials in (or receives an outbound call) through Exotel’s ULVNO-connected PSTN infrastructure. Call setup happens with carrier-grade speed and reliability.
- Audio streams via WebSocket/VSIP directly to your AI engine. Raw audio chunks arrive within 10–25ms of being spoken, with no intermediary gateway delays.
- Stereo channels deliver the caller’s voice on a clean, isolated track. Your ASR engine processes a single-speaker stream—no diarization, no echo cancellation overhead.
- Your LLM generates a response with the maximum possible latency budget, because the transport layer has consumed only a fraction of the total round-trip time.
- TTS audio streams back through the same WebSocket/VSIP channel, with barge-in detection ready to interrupt playback the moment the caller speaks again.
- Throughout the call, Active-Active HA monitors carrier health, VN-level CPS controls protect your GPU cluster, and multi-operator redundancy ensures that a telco hiccup anywhere in the chain doesn’t reach the caller’s ear.
This is what we mean when we say Exotel is the “nervous system” connecting the brain (your LLM) to the world (the Indian caller). Every layer is engineered to maximize two things: the time your AI has to think, and the reliability with which its response reaches the caller.
BRING YOUR STACK DIAGRAM. WE’LL MAP THE INTEGRATION.
Exotel’s engineering team works directly with Voice AI companies to architect low-latency, high-availability pipelines tailored to your specific inference stack and scale requirements.
→ Request a Technical Architecture Review with Exotel’s Engineering Team
THIS IS PART 2 OF A 3-PART SERIES
- Part 1: Why 50% of India’s Voice AI Runs on One Infrastructure Partner
- Up Next → Part 3: From 100 to 10,000 Concurrent Calls: The Operational Playbook




