
Why latency matters

Human conversation is turn‑based: we expect responses within a few hundred milliseconds. Contact‑centre benchmarks show that latencies above 800 ms make callers feel the conversation has stalled and often lead to hang‑ups, while the ideal response time is in the 300–500 ms range[1]. When voice agents take longer than one second to respond, contact centres report that customers hang up 40% more frequently[2].

  • Response time must be under ~800 ms to maintain conversational flow[3].
  • 500 ms or less is preferred to avoid listener anxiety[4].
  • Each component of the pipeline (telephony, speech‑to‑text, language model, text‑to‑speech and network) adds to the total delay[5], so optimisation must be holistic.

Exotel’s infrastructure can help meet these targets. The Calls/connect API on Exotel’s telephony platform is relatively quick, with an average latency of ~700 ms and a 95th percentile latency of ~1.1 s[6]. More importantly, Exotel’s AgentStream real‑time streaming API delivers sub‑20 ms media latency, far below the ~150 ms industry average[7]. On the media plane, Exotel’s voice‑streaming platform guarantees <250 ms end‑to‑end media latency[8]. Combining these low‑latency primitives with an optimized AI pipeline makes sub‑800 ms response times achievable.

Understanding the latency budget

A conversational voice agent typically involves several stages. Rough per‑component latencies in 2026 are:

| Component | Typical latency | Notes |
| --- | --- | --- |
| Telephony transport | <300 ms | Exotel's voice‑streaming platform guarantees <250 ms end‑to‑end media latency[8]; network jitter and PSTN conversion add a few tens of milliseconds. Minimising network hops and colocating compute near telephony PoPs further reduces delay. |
| Speech‑to‑text (ASR) | 150 ms TTFT (US), 250–350 ms (global) | Figures for Deepgram in the US and similar providers[9]. Use streaming ASR, send audio in small chunks, and run multiple STT models in parallel if needed. |
| Language model inference | 200–350 ms | For fast models such as GPT‑4 Turbo or Gemini Flash[10]. The LLM should begin generating tokens as soon as enough context is available; keep prompts short to improve time to first token. |
| Text‑to‑speech (TTS) | 75–200 ms | ElevenLabs' Flash v2.5 produces audio in ~75 ms[11]; OpenAI's TTS averages about 200 ms[12]. Choose low‑latency TTS for highly interactive use cases. |
| Network & application overhead | 50–150 ms | For WebSocket or WebRTC connections. Use Exotel's WebSocket streaming to keep this overhead minimal; avoid multiple external API hops. |

These numbers suggest that an optimised pipeline can deliver an average response of roughly 775 ms (telephony <300 ms + STT 150 ms + LLM 200 ms + TTS 75 ms + overhead 50 ms), comfortably under the 800 ms target. Hitting this budget requires careful engineering.
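The arithmetic in this budget can be sanity‑checked with a small helper. The component names and millisecond figures below are the planning numbers from the table, not measurements:

```python
# Illustrative latency budget (milliseconds) from the table above.
# These are planning figures, not live measurements.
BUDGET_MS = {
    "telephony": 300,   # AgentStream media transport (upper bound)
    "stt": 150,         # streaming ASR time-to-first-transcript (US figure)
    "llm": 200,         # fast LLM time-to-first-token (lower end)
    "tts": 75,          # low-latency TTS first-audio
    "overhead": 50,     # WebSocket/application overhead (lower end)
}

def total_latency_ms(budget: dict) -> int:
    """Sum per-component latencies to estimate end-to-end response time."""
    return sum(budget.values())

def within_target(budget: dict, target_ms: int = 800) -> bool:
    """True if the pipeline fits inside the conversational latency target."""
    return total_latency_ms(budget) < target_ms
```

Running `total_latency_ms(BUDGET_MS)` gives 775 ms, so this configuration fits the sub‑800 ms target with only ~25 ms of headroom; any component regressing will break the budget.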

Best practices for low‑latency voice agents on Exotel

1. Use Exotel AgentStream for media transport

Exotel’s AgentStream provides a telco‑grade WebSocket interface that streams audio between your bot and the caller. Unlike REST‑based APIs that wait for call setup, AgentStream opens a full‑duplex channel where the call audio is delivered in <20 ms frames[7]. This eliminates the time wasted in sequential API calls and allows your application to react to incoming speech immediately. Traditional Connect API responses can have P95 latencies above 1 s[6], so use AgentStream for latency‑sensitive use cases.

Tip: Exotel recommends sending audio frames back in ~100 ms chunks and keeping them below 100 KB[13]. This ensures smooth playback and reduces buffering delay.
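As a rough sketch of what ~100 ms framing looks like for 16‑bit mono PCM: the chunk size and 100 KB cap come from the tip above, while the byte arithmetic is standard PCM, not an Exotel API.

```python
def frame_pcm(pcm: bytes, sample_rate: int = 8000, chunk_ms: int = 100) -> list[bytes]:
    """Split raw 16-bit mono PCM into ~chunk_ms frames for WebSocket send.

    At 8 kHz / 16-bit mono, 100 ms of audio is 8000 * 2 * 0.1 = 1600 bytes,
    comfortably under the 100 KB per-message guideline.
    """
    bytes_per_chunk = int(sample_rate * 2 * chunk_ms / 1000)  # 2 bytes/sample
    return [pcm[i:i + bytes_per_chunk] for i in range(0, len(pcm), bytes_per_chunk)]
```

Sending fixed, small frames like this keeps the jitter buffer on the telephony side shallow, which is what translates into lower playback delay for the caller.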

2. Minimise speech‑to‑text latency

  • Choose a fast ASR provider. Deepgram offers ~150 ms TTFT in the US and ~250–350 ms globally[9]. Gladia or local providers may offer similar performance. Avoid models like Whisper that have TTFT >500 ms[14].
  • Stream audio continuously. Do not wait for complete sentences; send 100–200 ms slices to the ASR. Configure Exotel’s Voicebot applet with ?sample-rate=8000|16000 to optimise audio bandwidth[13].
  • Run ASR and LLM in parallel. Many developers pipeline STT → LLM → TTS sequentially, which adds 200–400 ms. Exotel’s voice streaming supports starting language‑model inference as soon as the first transcripts arrive. Use concurrency to minimise idle time.
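One way to sketch that overlap, assuming async STT and LLM clients (the `stt` and `llm` callables below are stand‑ins for real providers such as Deepgram and OpenAI, not actual SDK calls): transcription and prompt construction run concurrently via a queue, and the LLM call fires the moment the utterance ends instead of after a sequential STT pass.

```python
import asyncio

async def run_pipeline(audio_chunks, stt, llm):
    """Overlap streaming STT with prompt construction: transcript fragments
    are consumed as they arrive, so the LLM request is ready to fire the
    instant the utterance completes."""
    transcripts: asyncio.Queue = asyncio.Queue()

    async def transcribe():
        for chunk in audio_chunks:          # audio streamed from AgentStream
            await transcripts.put(await stt(chunk))
        await transcripts.put(None)         # end-of-utterance sentinel

    async def respond():
        partial = []
        while (fragment := await transcripts.get()) is not None:
            partial.append(fragment)        # build the prompt incrementally
        return await llm(" ".join(partial)) # call the LLM immediately at end
    # Run both stages concurrently; respond() starts before transcribe() ends.
    _, reply = await asyncio.gather(transcribe(), respond())
    return reply
```

In production you would also trigger the LLM on an end‑of‑speech signal from the ASR rather than on stream exhaustion, but the concurrency structure is the same.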

3. Optimise language‑model inference

  • Use smaller, faster models. GPT‑4 Turbo and Gemini Flash produce first tokens in roughly 200–350 ms[10]. For simple use cases, even smaller models like GPT‑3.5 or open‑source Llama 3 can cut latency further.
  • Keep prompts concise. Long prompts slow generation; provide only necessary instructions and context. Use structured system messages and avoid large RAG documents.
  • Pre‑compute static replies. For menu prompts or greetings, pre‑generate audio and play it directly via Exotel’s Passthru/Play Audio applets. This avoids LLM and TTS entirely.
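A minimal sketch of the pre‑computed‑reply idea: cache synthesized audio for fixed intents and fall back to live TTS only for dynamic responses. The cache keys and the `synthesize` callable are hypothetical, not an Exotel or TTS‑provider API.

```python
# Hypothetical cache of pre-generated audio for fixed prompts (greetings,
# menu prompts). In practice these would be synthesized once at deploy time.
PREGENERATED_AUDIO = {
    "greeting": b"<pcm for 'Hi, how can I help you today?'>",
    "hold": b"<pcm for 'One moment please.'>",
}

def audio_for(intent: str, text: str, synthesize) -> bytes:
    """Serve cached audio for static intents; pay the 75-200 ms live-TTS
    cost (and any LLM cost) only when the response is actually dynamic."""
    cached = PREGENERATED_AUDIO.get(intent)
    return cached if cached is not None else synthesize(text)
```

For greetings this removes the LLM and TTS stages entirely, so the only remaining latency is the telephony transport itself.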

4. Use ultra‑low‑latency TTS

  • Select a high‑speed TTS model. ElevenLabs’ Flash v2.5 synthesises speech in ~75 ms[11]. OpenAI’s TTS, at around 200 ms[12], is still acceptable for most applications, but should be paired with a faster STT/LLM.
  • Stream audio as soon as available. Many TTS APIs return audio in streaming chunks. Start sending audio back to the caller as soon as the first bytes arrive; do not wait for the entire sentence.
  • Use appropriate sample rates. Exotel AgentStream supports 8, 16 and 24 kHz. Lower sample rates reduce payload size and speed up transmission; higher rates improve quality but add to latency[15].
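A sketch of chunk‑by‑chunk forwarding: each TTS chunk is wrapped and yielded the moment it arrives, rather than after the full utterance is synthesized. The JSON shape (event name, stream ID, base64 payload) mirrors common telephony streaming APIs and is an assumption here, not the exact AgentStream schema; check Exotel's docs for the real field names.

```python
import base64
import json

def media_messages(tts_chunks, stream_sid: str):
    """Yield one WebSocket message per TTS audio chunk as it is produced,
    so the caller starts hearing the response before synthesis finishes.
    `tts_chunks` stands in for any streaming TTS response (an iterator of
    raw audio byte chunks)."""
    for chunk in tts_chunks:
        yield json.dumps({
            "event": "media",
            "stream_sid": stream_sid,
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Because this is a generator, the first message is available as soon as the TTS engine emits its first bytes, which is exactly the "do not wait for the entire sentence" behaviour described above.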

5. Design for interruption and barge‑in

Users often speak over the voice agent. Your system should handle barge‑ins to maintain conversational flow. Exotel’s Voicebot applet signals DTMF events and stream lifecycle events such as start, media, and stop[13]. Implement logic that cancels pending TTS when new speech arrives and resets the pipeline quickly.
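The cancellation logic can be sketched as follows, assuming an asyncio‑based bot server. The `media` event name comes from the Voicebot applet lifecycle above; the task wiring is illustrative, not Exotel SDK code.

```python
import asyncio

class BargeInController:
    """Cancel the in-flight TTS playback task when new caller speech
    arrives, so the agent stops talking instead of speaking over the user."""

    def __init__(self):
        self.playback: asyncio.Task | None = None

    def start_playback(self, coro) -> asyncio.Task:
        """Run TTS playback as a cancellable background task."""
        self.playback = asyncio.ensure_future(coro)
        return self.playback

    def on_event(self, event: dict) -> None:
        """Inbound audio while we are still speaking means barge-in:
        cancel the pending playback and let the pipeline reset."""
        if event.get("event") == "media" and self.playback and not self.playback.done():
            self.playback.cancel()
```

A real implementation would also flush any audio already queued on the telephony side; cancelling only the local task leaves buffered frames still playing out.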

6. Host your AI close to Exotel PoPs

Network latency can quickly erode the gains from fast models. For best performance, host your bot servers in the same geographical region as Exotel’s Points‑of‑Presence (PoPs): India (Mumbai, Bangalore), Southeast Asia, or the GCC. Telnyx data shows that each network hop adds 20–50 ms[16]; colocated infrastructure eliminates these hops.

7. Instrument and measure end‑to‑end latency

Latency optimisation is an ongoing process. Capture and analyse metrics such as:

  • Time to first audio: difference between final speech input and first audio output; should be <500 ms.
  • Speech‑to‑text TTFT: measure per‑chunk transcription latency to detect slow segments.
  • LLM response time: track first token generation to avoid prompt‑construction bottlenecks.
  • TTS synthesis time: measure per‑sentence and per‑chunk latency.
  • Network and telephony jitter: monitor jitter, packet loss and round‑trip time.

Exotel’s AgentStream events include timestamps (stream_sid, connected, start)[13], which you can log to compute these metrics. Use monitoring tools like Prometheus and Grafana or voice‑AI‑specific observability platforms. When latency spikes occur, shorten responses, reduce TTS quality (8 kHz), or switch to simpler models[15].
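A minimal sketch of computing the first metric from logged event timestamps; the event names below are illustrative placeholders to map onto whichever AgentStream events (start, media, stop) you actually log:

```python
def time_to_first_audio(events) -> float:
    """Compute time-to-first-audio in milliseconds from a log of
    (event_name, unix_timestamp) pairs: the gap between the caller's
    last inbound media frame and the first outbound audio frame we sent."""
    last_user_media = max(ts for name, ts in events if name == "inbound_media")
    first_bot_audio = min(ts for name, ts in events if name == "outbound_media")
    return (first_bot_audio - last_user_media) * 1000.0
```

Emitting this value as a histogram metric (e.g. via a Prometheus client) lets you alert on the <500 ms target at the 95th percentile rather than the average, which is where callers actually feel the slow turns.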

Example architecture achieving <800 ms latency

Below is a reference architecture that combines Exotel AgentStream with state‑of‑the‑art AI components. The pipeline is designed to overlap steps and minimise waiting:

  1. Inbound call arrives via Exotel AgentStream. The Voicebot applet initiates a WebSocket connection to your bot server with event metadata and audio frames[13].
  2. Bot server forwards audio frames to the STT engine (e.g., Deepgram). As soon as the first transcript chunks are available (~150 ms[9]), it begins constructing an LLM prompt.
  3. In parallel, the bot server sends the partial transcript to a language model (e.g., GPT‑4 Turbo). The LLM starts generating tokens within ~200–350 ms[10].
  4. When a full sentence or response is ready, the bot server streams text to a TTS engine like ElevenLabs Flash. TTS produces the first 200 ms of audio in ~75 ms[11].
  5. As soon as the first audio bytes arrive, the bot server sends them back over the Exotel WebSocket. The caller hears the agent’s response within roughly 775 ms (300 ms telephony + 150 ms STT + 200 ms LLM + 75 ms TTS + ~50 ms network overhead), meeting the sub‑800 ms requirement.
  6. If the caller interrupts, the bot server detects new audio frames and cancels the ongoing TTS stream, ensuring natural barge‑in.

This architecture uses concurrency and streaming to overlap pipeline steps. Exotel’s low‑latency media transport makes it possible to achieve these targets; with slower telephony APIs, the budget would be consumed before the AI even starts.

Conclusion

Meeting the <800 ms latency benchmark is essential for conversational voice agents; delays longer than this lead to unnatural pauses, caller frustration, and increased hang‑ups[1]. By leveraging Exotel’s AgentStream, choosing fast STT/LLM/TTS models, streaming audio in small chunks, colocating compute near telephony infrastructure and instrumenting your pipeline, you can achieve sub‑800 ms end‑to‑end response times. Continuous monitoring and optimisation will keep your voice agent responsive and human‑like as traffic scales and models evolve.

 

[1][2][3][4][5][16] Voice AI agents compared on latency: performance benchmark

https://telnyx.com/resources/voice-ai-agents-compared-latency

[6] Exotel Docs

https://docs.exotel.com/voice-apis/why-does-the-response-time-of-connect-api-vary

[7] Exotel introduces real-time voice streaming API for AI developers

https://www.techcircle.in/2025/08/12/exotel-introduces-real-time-voice-streaming-api-for-ai-developers

[8] Voice Bots | Exotel

https://exotel.com/voice-streaming-for-developers/

[9][14] Speech-to-Text in Conversational AI | Comparing Deepgram, Gladia, Whisper and More

https://zoice.ai/blog/speech-to-text-in-conversational-ai/

[10] How do you optimize latency for Conversational AI?

https://elevenlabs.io/blog/how-do-you-optimize-latency-for-conversational-ai

[11][12] ElevenLabs vs OpenAI TTS: Which One’s Right for You? – Vapi AI Blog

https://vapi.ai/blog/elevenlabs-vs-openai

[13][15] Build a Real-Time Speech-to-Speech AI Voice Assistant on Exotel AgentStream (Bidirectional) with OpenAI Realtime & Python | Exotel

https://exotel.com/blog/build-a-real-time-speech-to-speech-ai-voice-assistant-on-exotel-agentstream-bidirectional-with-openai-realtime-python/

Shiva is Head of Digital Marketing & Developer Network at Exotel, a growing community of builders working with voice, messaging, and AI-powered communication APIs. He has spent 13+ years helping B2B SaaS companies grow through data-driven marketing, and today he's equally focused on helping developers discover, adopt, and get more out of Exotel's platform. He writes about developer ecosystems, voice AI trends, and what it takes to build great CX infrastructure.