Voice AI Agents: Building Conversational Interfaces with Real-Time Speech
Marcus Rivera
Full-stack developer and agent builder. Covers coding assistants and dev tools.
Building Voice-Based AI Agents: A Practical Engineering Guide
Voice is the most natural interface humans have. We speak at roughly 150 words per minute versus typing at 40, and we do it hands-free. For AI agents, voice unlocks entirely new interaction surfaces — phone lines, smart speakers, car dashboards, wearables — that text-based chat can't reach.
But building a reliable voice agent is substantially harder than building a text chatbot. You're chaining together multiple ML systems in real time, each with its own failure modes, and the latency budget is brutally tight. Users expect sub-second responses in conversation. Every millisecond of delay between a user finishing their sentence and the agent beginning to speak feels wrong.
This guide covers the full stack: speech-to-text, text-to-speech, real-time orchestration, and integration with telephony and smart devices. I'll focus on tools and approaches that actually work in production.
The Voice Agent Pipeline
Every voice agent follows the same fundamental pipeline:
Audio In → VAD → STT → LLM → TTS → Audio Out
- Voice Activity Detection (VAD): Detect when the user starts and stops speaking
- Speech-to-Text (STT): Convert audio to a text transcript
- Language Model: Process the transcript and generate a response
- Text-to-Speech (TTS): Convert the response text back to audio
Each stage adds latency. The total pipeline latency — from the moment the user stops talking to the moment they hear the first audio of the response — is what determines whether your agent feels responsive or sluggish. A good target is under 800ms. Anything over 2 seconds and users start talking over your agent or hanging up.
Speech-to-Text: Getting the Words Right
STT is the first and arguably most critical component. If your agent mishears the user, everything downstream fails.
The Major Options
| Provider | Streaming | Latency | Accuracy | Cost (per min) | Best For |
|---|---|---|---|---|---|
| Deepgram Nova-2 | ✅ | ~200ms | Excellent | $0.0043 | Production voice agents |
| OpenAI Whisper (API) | ❌ (batch) | ~1-3s | Excellent | $0.006 | Offline processing |
| Whisper (local) | Via faster-whisper | ~300ms | Excellent | Free (compute) | Self-hosted |
| Google Cloud STT v2 | ✅ | ~300ms | Good | $0.016 | GCP shops |
| AssemblyAI | ✅ | ~300ms | Good | $0.015 | Transcription + features |
| Azure Speech | ✅ | ~300ms | Good | $0.016 | Enterprise/Microsoft |
Deepgram Nova-2 is the current go-to for real-time voice agents. It's fast, accurate, and purpose-built for streaming. Their WebSocket API returns interim results as the user speaks, which lets you start processing before the sentence is complete.
OpenAI's Whisper produces excellent transcripts but the API doesn't support streaming — you have to send complete audio chunks, which adds latency. The open-source model can be run locally with faster-whisper for streaming, but you're managing GPU infrastructure.
Streaming STT with Deepgram
Here's a practical example of streaming audio to Deepgram and getting real-time transcripts:
```python
import asyncio
import json

import websockets

async def stream_to_deepgram(audio_stream):
    """Stream audio to Deepgram and yield transcript segments."""
    DEEPGRAM_API_KEY = "your-key-here"
    url = (
        "wss://api.deepgram.com/v1/listen?"
        "model=nova-2&"
        "language=en-US&"
        "encoding=linear16&"
        "sample_rate=16000&"
        "channels=1&"
        "interim_results=true&"
        "endpointing=300&"        # 300ms silence = end of utterance
        "utterance_end_ms=1000&"  # emit UtteranceEnd after 1s of silence
        "vad_events=true"         # get VAD events for turn detection
    )
    headers = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}

    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:

        async def send_audio():
            async for chunk in audio_stream:
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receive_transcripts():
            async for message in ws:
                data = json.loads(message)
                if data.get("type") == "Results":
                    transcript = data["channel"]["alternatives"][0]
                    if transcript["transcript"]:
                        is_final = data["is_final"]
                        yield {
                            "text": transcript["transcript"],
                            "confidence": transcript["confidence"],
                            "is_final": is_final,
                            "speech_final": data.get("speech_final", False),
                        }
                elif data.get("type") == "UtteranceEnd":
                    # User has stopped speaking — time to generate response
                    yield {"type": "utterance_end"}

        # Run sender and receiver concurrently
        send_task = asyncio.create_task(send_audio())
        async for transcript in receive_transcripts():
            yield transcript
        await send_task
```
Key STT Considerations
Endpointing is the parameter that matters most for voice agents. It controls how long the system waits after the user stops talking before considering the utterance complete. Too short (100ms) and you'll cut off users who pause mid-sentence. Too long (2000ms) and your agent feels sluggish. 300-500ms is the sweet spot for conversational agents.
Interim results let you start processing before the user finishes speaking. Some architectures use speculative generation — they start the LLM call with partial transcripts and refine as more text arrives. This is complex but can cut perceived latency significantly.
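One way to sketch the speculative approach: launch the LLM call on an interim transcript, and keep the result only if the final transcript matches what you speculated on. This is a minimal illustration with a stand-in `llm_call` coroutine (hypothetical), not a production scheduler:

```python
import asyncio

class SpeculativeGenerator:
    """Start LLM generation on interim transcripts; restart if the final
    transcript differs from what we speculated on."""

    def __init__(self, llm_call):
        self.llm_call = llm_call  # any coroutine: prompt -> response text
        self.task = None
        self.speculated_on = None

    def on_interim(self, transcript: str):
        # (Re)launch only when the interim text actually changed
        if transcript != self.speculated_on:
            if self.task and not self.task.done():
                self.task.cancel()
            self.speculated_on = transcript
            self.task = asyncio.ensure_future(self.llm_call(transcript))

    async def on_final(self, transcript: str) -> str:
        if transcript == self.speculated_on and self.task:
            return await self.task  # speculation paid off: latency saved
        if self.task and not self.task.done():
            self.task.cancel()      # speculation was wrong: discard it
        return await self.llm_call(transcript)

async def demo_speculation():
    async def fake_llm(prompt):
        await asyncio.sleep(0.05)   # stand-in for network latency
        return f"reply to: {prompt}"

    gen = SpeculativeGenerator(fake_llm)
    gen.on_interim("book a table")  # speculate before the user finishes
    return await gen.on_final("book a table")

print(asyncio.run(demo_speculation()))
```

In the happy path the LLM call is already in flight when the final transcript lands, so its latency overlaps with the user's last few hundred milliseconds of speech.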
Hotwords and context improve accuracy for domain-specific terms. Both Deepgram and Google let you supply a vocabulary list or context phrases. If your agent handles restaurant reservations, boosting terms like "reservation," "party size," and specific cuisine names helps enormously.
```python
# Deepgram: boost domain-specific terms (URL-encode multi-word phrases)
url += "&keywords=reservation&keywords=party%20size&keywords=cancellation"
```
Text-to-Speech: Sounding Natural
TTS quality has improved dramatically. Modern neural TTS models produce speech that's nearly indistinguishable from human recordings — but not all use cases need that fidelity.
The TTS Landscape
| Provider | Streaming | Quality | Voices | Cost (per 1M chars) | Latency |
|---|---|---|---|---|---|
| ElevenLabs | ✅ | Best | Clone any voice | $30-330 | ~300ms |
| OpenAI TTS | ❌ (batch) | Excellent | 6 fixed voices | $15 | ~500ms |
| Google Cloud TTS | ✅ (via SSML) | Good | 400+ | $4-16 | ~200ms |
| Azure Neural TTS | ✅ | Good | 400+ | $16 | ~200ms |
| Cartesia | ✅ | Excellent | Many | $15 | ~150ms |
| PlayHT | ✅ | Good | Clone capable | $25-45 | ~300ms |
ElevenLabs set the quality bar and remains the best for voice cloning and emotional range. Their streaming API returns audio chunks as they're generated, so you can start playback while the rest of the sentence is still being synthesized.
Cartesia is a strong newer entrant focused specifically on real-time applications. Their Sonic model achieves sub-150ms time-to-first-audio, which is critical for voice agents.
OpenAI's TTS produces excellent audio but doesn't stream — you get the full audio file back, adding latency. Fine for short responses but problematic for longer agent replies.
Streaming TTS Implementation
Here's how to stream TTS from ElevenLabs and play it back in real time:
```python
import asyncio
import base64
import json

import websockets

class StreamingTTS:
    def __init__(self, api_key, voice_id="21m00Tcm4TlvDq8ikWAM"):
        self.api_key = api_key
        self.voice_id = voice_id
        self.audio_queue = asyncio.Queue()

    async def synthesize_stream(self, text_stream):
        """Convert a stream of text tokens to audio chunks."""
        url = (
            f"wss://api.elevenlabs.io/v1/text-to-speech/"
            f"{self.voice_id}/stream-input?"
            "model_id=eleven_turbo_v2_5&"
            "output_format=ulaw_8000"  # telephony-compatible format
        )
        async with websockets.connect(url) as ws:
            # Send BOS (beginning of stream) with voice settings
            await ws.send(json.dumps({
                "text": " ",
                "voice_settings": {
                    "stability": 0.5,
                    "similarity_boost": 0.8,
                    "use_speaker_boost": False,
                },
                "xi_api_key": self.api_key,
            }))

            # Stream text tokens as they arrive from the LLM
            async def send_text():
                buffer = ""
                async for token in text_stream:
                    buffer += token
                    # Send complete sentences for natural prosody
                    if any(p in buffer for p in ".!?"):
                        await ws.send(json.dumps({
                            "text": buffer,
                            "flush": True,
                        }))
                        buffer = ""
                # Send remaining text
                if buffer:
                    await ws.send(json.dumps({"text": buffer, "flush": True}))
                # Signal end of input
                await ws.send(json.dumps({"text": ""}))

            async def receive_audio():
                while True:
                    try:
                        msg = await asyncio.wait_for(ws.recv(), timeout=5.0)
                    except asyncio.TimeoutError:
                        break
                    data = json.loads(msg)
                    if data.get("audio"):
                        audio_chunk = base64.b64decode(data["audio"])
                        await self.audio_queue.put(audio_chunk)
                    if data.get("isFinal"):
                        break
                await self.audio_queue.put(None)  # sentinel: stream finished

            await asyncio.gather(send_text(), receive_audio())
```
Chunking Strategy for Natural Speech
The biggest TTS mistake I see is sending the entire LLM response as one block. Neural TTS models apply prosody (rhythm, emphasis, intonation) based on sentence and paragraph structure. Sending one massive paragraph produces flat, monotone output.
Instead, chunk by sentence. Feed complete sentences to the TTS model so it can apply appropriate prosody. The code above demonstrates this — it buffers until it hits punctuation, then flushes.
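A standalone version of that buffering strategy might look like the following: a small generator that groups streamed tokens into complete sentences. The regex split is a heuristic, not a full sentence segmenter (it will mis-split abbreviations like "Dr."):

```python
import re

def sentence_chunks(token_stream):
    """Group a stream of text tokens into complete sentences so the TTS
    model can apply natural prosody."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Emit every complete sentence currently sitting in the buffer
        while True:
            match = re.search(r"[.!?](\s+|$)", buffer)
            if not match:
                break
            end = match.end()
            sentence = buffer[:end].strip()
            buffer = buffer[end:]
            if sentence:
                yield sentence
    if buffer.strip():
        yield buffer.strip()  # flush any trailing fragment

tokens = ["Your table", " is booked.", " See you", " at 7!", " Bye"]
print(list(sentence_chunks(tokens)))
# → ['Your table is booked.', 'See you at 7!', 'Bye']
```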
For even better results, use SSML (Speech Synthesis Markup Language) to control pacing:
```xml
<speak>
  Your reservation is confirmed for <say-as interpret-as="date">2024-12-15</say-as>
  at <say-as interpret-as="time">7:30 PM</say-as>.
  <break time="300ms"/>
  Is there anything else I can help with?
</speak>
```
Real-Time Conversation Orchestration
This is where the complexity lives. You need to coordinate STT, LLM, and TTS in a tight loop while managing turn-taking, interruptions, and barge-in.
Latency Budget Breakdown
For a responsive voice agent, here's a realistic latency budget:
```
User stops speaking
  → Endpointing detection:      300ms
  → STT final result:           100ms
  → LLM first token:            200-400ms
  → TTS first audio chunk:      150-300ms
  → Network + playback buffer:   50ms
  ─────────────────────────────────
  Total:                        800-1150ms
```
That's tight but achievable. The key optimization is streaming everything — don't wait for the complete LLM response before starting TTS.
The Pipecat Framework
Pipecat is an open-source framework specifically designed for real-time voice AI agents. It handles the plumbing of connecting STT, LLM, and TTS services with proper streaming and interruption support.
```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer

async def create_voice_agent(room_url, token):
    """Create a real-time voice agent using Pipecat."""
    # Transport layer (WebRTC via Daily)
    transport = DailyTransport(
        room_url,
        token,
        "Voice Agent",
        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    # STT: Deepgram with streaming
    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        model="nova-2",
        language="en-US",
    )

    # LLM: OpenAI GPT-4o
    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="gpt-4o",
        system_prompt=(
            "You are a helpful voice assistant. Keep responses concise and "
            "conversational. Avoid lists and formatting since this is a "
            "voice conversation. Aim for 1-3 sentences per response."
        ),
    )

    # TTS: ElevenLabs
    tts = ElevenLabsTTSService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id="21m00Tcm4TlvDq8ikWAM",
    )

    # Pipeline: Audio In → STT → LLM → TTS → Audio Out
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        tts,
        transport.output(),
    ])

    task = PipelineTask(pipeline)
    runner = PipelineRunner()
    await runner.run(task)
```
Pipecat handles interruption automatically — if the user starts speaking while the agent is talking, it stops TTS playback, processes the new input, and responds to the interruption. This barge-in capability is essential for natural conversation.
Building It From Scratch
If you need more control or can't use Pipecat, here's the core orchestration logic:
```python
import asyncio

class VoiceAgent:
    def __init__(self, stt, llm, tts):
        self.stt = stt
        self.llm = llm
        self.tts = tts
        self.conversation_history = []
        self.is_speaking = False
        self.current_tts_task = None

    async def handle_utterance(self, transcript: str):
        """Process a complete user utterance."""
        # Cancel any ongoing TTS (barge-in)
        if self.current_tts_task and not self.current_tts_task.done():
            self.current_tts_task.cancel()
            self.tts.stop_playback()

        self.conversation_history.append({
            "role": "user",
            "content": transcript,
        })

        # Start LLM generation (streaming)
        llm_stream = self.llm.stream_chat(self.conversation_history)

        # Collect the full response for history while streaming to TTS
        full_response = ""

        async def text_collector():
            nonlocal full_response
            async for token in llm_stream:
                full_response += token
                yield token

        # Stream LLM output directly to TTS
        self.current_tts_task = asyncio.create_task(
            self.tts.synthesize_and_play(text_collector())
        )
        try:
            await self.current_tts_task
        except asyncio.CancelledError:
            pass  # interrupted by barge-in

        self.conversation_history.append({
            "role": "assistant",
            "content": full_response,
        })

    async def run(self, audio_input_stream):
        """Main event loop."""
        async for event in self.stt.stream(audio_input_stream):
            if event.get("speech_final"):
                await self.handle_utterance(event["text"])
            elif event.get("type") == "utterance_end":
                # Handle silence — maybe prompt the user
                pass
```
Handling Interruptions Gracefully
Interruption handling is what separates good voice agents from bad ones. Here are the patterns that work:
Hard stop: Immediately stop TTS and process the new input. Use this for clear interruptions where the user starts a new sentence.
Soft stop: Continue generating audio but don't play it. If the user's interruption turns out to be a brief interjection ("uh-huh," "right"), resume playback. This prevents the agent from stopping every time the user makes a verbal acknowledgment.
Collapsible context: When interrupted mid-response, add the partial response to the conversation history with a note that it was interrupted. The LLM can then naturally continue or adjust based on the interruption.
```python
async def handle_interruption(self, new_transcript):
    # Save the partial response so the LLM knows what was already said
    partial_response = self.tts.get_generated_text_so_far()
    self.conversation_history.append({
        "role": "assistant",
        "content": partial_response + " [interrupted]",
    })
    # Process the new user input
    await self.handle_utterance(new_transcript)
```
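The soft-stop pattern needs a way to decide whether an interruption is a backchannel or a real interjection. A simple heuristic sketch (the word list is illustrative; tune it from your own call transcripts):

```python
# Words/phrases that usually signal acknowledgment rather than interruption
BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "right", "ok", "okay", "sure", "got it"}

def is_backchannel(transcript: str) -> bool:
    """Heuristic: short utterances made entirely of acknowledgment words
    are backchannels, so keep speaking instead of hard-stopping."""
    cleaned = transcript.lower().strip(" .!?,")
    if cleaned in BACKCHANNELS:
        return True
    words = cleaned.split()
    return 0 < len(words) <= 2 and all(w in BACKCHANNELS for w in words)

print(is_backchannel("Uh-huh."))     # acknowledgment: keep playing TTS
print(is_backchannel("wait, stop"))  # real interruption: hard stop
```

In the orchestrator, check `is_backchannel` on the interim transcript before cancelling the TTS task; only escalate to a hard stop when the check fails.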
Integration with Phone Systems
Phone integration is where voice agents deliver the most business value — customer service, appointment scheduling, lead qualification. But telephony has its own quirks.
The Telephony Stack
```
Phone Network (PSTN)
        ↓
SIP Trunk (Twilio / Telnyx / Vonage)
        ↓
Media Server (your app or TwiML)
        ↓
Your Voice Agent
```
Phone audio comes as µ-law encoded 8kHz mono — much lower quality than web audio. This affects STT accuracy. You'll want to either:
- Use an STT model tuned for telephony audio (e.g. Deepgram's phonecall model)
- Upsample to 16kHz before processing
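The decode-and-upsample step can be done without dependencies. The classic route was the stdlib audioop module, but it was removed in Python 3.13, so here is a pure-Python sketch of standard G.711 µ-law decoding plus a naive 2x linear-interpolation upsample:

```python
import struct

def ulaw_to_pcm16(data: bytes) -> bytes:
    """Decode G.711 µ-law bytes to 16-bit little-endian PCM."""
    out = bytearray()
    for b in data:
        u = ~b & 0xFF                 # µ-law stores bits inverted
        t = ((u & 0x0F) << 3) + 0x84  # mantissa plus bias
        t <<= (u & 0x70) >> 4         # apply the exponent segment
        sample = (0x84 - t) if (u & 0x80) else (t - 0x84)
        out += struct.pack("<h", sample)
    return bytes(out)

def upsample_2x(pcm: bytes) -> bytes:
    """Naive 8kHz to 16kHz upsample by linear interpolation."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    out = []
    for i, s in enumerate(samples):
        out.append(s)
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) // 2)    # midpoint between neighbors
    return struct.pack(f"<{len(out)}h", *out)

# A 20ms frame of Twilio silence (0xFF) decodes to zeros
silence = ulaw_to_pcm16(b"\xff" * 160)   # 160 samples at 8kHz
print(len(upsample_2x(silence)))         # 640 bytes = 20ms at 16kHz
```

Per-sample Python loops are fine for 8kHz phone audio; swap in NumPy if profiling says otherwise.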
Twilio Integration
Twilio is the most common telephony platform. Here's how to connect a voice agent to a phone number:
```python
import asyncio
import base64
import json

from fastapi import FastAPI, WebSocket
from fastapi.responses import Response

app = FastAPI()

# Handle incoming Twilio calls
@app.post("/voice/incoming")
async def handle_incoming_call():
    """Return TwiML to connect the call to our WebSocket media stream."""
    twiml = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-server.com/media-stream">
      <Parameter name="caller" value="{{From}}" />
    </Stream>
  </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")

@app.websocket("/media-stream")
async def handle_media_stream(ws: WebSocket):
    """Handle Twilio's bi-directional media stream."""
    await ws.accept()
    stream_sid = None
    agent = VoiceAgent(stt=..., llm=..., tts=...)

    async def send_audio_to_caller():
        """Send synthesized audio back to the caller via Twilio."""
        async for chunk in agent.audio_output():
            # Twilio expects µ-law 8kHz, base64-encoded
            payload = base64.b64encode(chunk).decode()
            await ws.send_text(json.dumps({
                "event": "media",
                "streamSid": stream_sid,
                "media": {"payload": payload},
            }))

    async for message in ws.iter_text():
        data = json.loads(message)
        if data["event"] == "start":
            stream_sid = data["start"]["streamSid"]
            # Initialize agent with caller context
            caller = data["start"].get("customParameters", {}).get("caller")
            await agent.initialize(caller_id=caller)
            # Start the agent's audio output loop once the stream is live
            asyncio.create_task(send_audio_to_caller())
        elif data["event"] == "media":
            # Decode µ-law audio from Twilio
            audio_chunk = base64.b64decode(data["media"]["payload"])
            await agent.feed_audio(audio_chunk)
        elif data["event"] == "stop":
            await agent.cleanup()
            break
```
Vapi and Bland.ai: Managed Voice Agent Platforms
If you don't want to manage the telephony plumbing yourself, Vapi and Bland.ai are purpose-built platforms for phone-based AI agents. They handle the SIP integration, audio streaming, and provide a clean API:
```python
import requests

# Create an outbound phone call with a Vapi assistant
response = requests.post(
    "https://api.vapi.ai/call/phone",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "assistant": {
            "model": {
                "provider": "openai",
                "model": "gpt-4o",
                "systemMessage": "You are a helpful customer service agent...",
            },
            "voice": {
                "provider": "11labs",
                "voiceId": "21m00Tcm4TlvDq8ikWAM",
            },
            "firstMessage": "Hello, how can I help you today?",
            "interruptionThreshold": 0.5,
        },
        "phoneNumberId": "your-vapi-phone-number-id",
        "customer": {
            "number": "+1234567890",
        },
    },
)
```
These platforms are excellent for getting started quickly and for use cases where you don't need deep control over the audio pipeline. The trade-off is less control over latency optimization and higher per-minute costs ($0.05-0.15/min).
Integration with Smart Devices
Alexa Skills
Alexa's voice interaction model is different from phone/web agents. It's turn-based with explicit wake words — the user says "Alexa, ask [your skill] to..." and then speaks. There's no continuous conversation by default (though you can enable it with shouldEndSession: false).
```python
# Alexa Skill handler using the ask-sdk
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.skill_builder import SkillBuilder

class LaunchRequestHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return handler_input.request_envelope.request.object_type == "LaunchRequest"

    def handle(self, handler_input):
        # Launch your voice agent for this session
        session_attr = handler_input.attributes_manager.session_attributes
        session_attr["conversation_history"] = []
        return (
            handler_input.response_builder
            .speak("Hi! I'm your AI assistant. How can I help?")
            .ask("What would you like to know?")  # keeps the session open
            .response
        )

class IntentRequestHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return handler_input.request_envelope.request.object_type == "IntentRequest"

    def handle(self, handler_input):
        # Alexa has already done STT — you get the text directly
        user_text = handler_input.request_envelope.request.intent.slots["query"].value
        session_attr = handler_input.attributes_manager.session_attributes
        history = session_attr.get("conversation_history", [])
        history.append({"role": "user", "content": user_text})

        # Call your LLM
        response_text = call_llm(history)
        history.append({"role": "assistant", "content": response_text})

        return (
            handler_input.response_builder
            .speak(response_text)
            .ask("Anything else?")  # keep session open for follow-up
            .response
        )
```
The key limitation with Alexa: you don't control STT or TTS. Amazon handles both, which means you can't use custom voices or tune STT for your domain. The advantage is that it's free to the end user and works on every Echo device.
Google Assistant / Google Home
Google's Actions Builder provides a similar turn-based model. The newer approach uses conversational webhooks where Google sends the recognized text to your endpoint and expects a response in their specified format.
Home Assistant + Local LLM
For privacy-focused or offline smart home agents, Home Assistant's Assist pipeline runs everything locally:
```yaml
# configuration.yaml
assist_pipeline:
  - name: "Living Room Agent"
    stt_engine: whisper
    tts_engine: piper
    conversation_agent: homeassistant  # or a custom Ollama endpoint
```
Piper is a fast local TTS engine that runs on a Raspberry Pi. Combined with a local Whisper model for STT and a small LLM via Ollama, you get a fully offline voice agent. Quality won't match cloud services, but latency can actually be better (no network round trips) and there are zero privacy concerns.
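Wiring a local Ollama model in as the conversation brain is a plain HTTP call. A sketch against Ollama's /api/chat endpoint, assuming Ollama is running locally with the model already pulled (the model name `llama3.2` is an assumption; use whatever you have):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_request(history, model="llama3.2"):
    """Build the JSON body for Ollama's /api/chat endpoint.
    stream=False asks for one complete JSON response."""
    return {"model": model, "messages": history, "stream": False}

def local_chat(history):
    """Send conversation history to a local Ollama server and return
    the assistant's reply."""
    body = json.dumps(build_request(history)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

For voice use you'd set `stream: True` and feed the token stream into the sentence-chunking TTS path, exactly as with a cloud LLM.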
Production Considerations
Cost Modeling
Voice agents get expensive fast. Here's a realistic cost breakdown for a 3-minute phone call:
| Component | Usage | Unit Cost | Cost |
|---|---|---|---|
| Twilio telephony | 3 min | $0.014/min | $0.042 |
| Deepgram STT | 3 min | $0.0043/min | $0.013 |
| GPT-4o (est. 1000 tokens) | ~1000 tokens | $0.005/1K | $0.005 |
| ElevenLabs TTS | ~500 chars | $0.30/1K chars | $0.15 |
| Total per call | | | $0.21 |
TTS is often the dominant cost. For cost-sensitive applications, consider:
- Google Cloud TTS ($0.004/1K chars — 75x cheaper than ElevenLabs)
- Azure Neural TTS (similar pricing to Google)
- Piper (free, local, good-enough quality for many use cases)
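Since the unit costs multiply out per call, a tiny cost model makes provider comparisons concrete. This sketch uses the rates from the table above as defaults; every rate is a parameter you should replace with your own negotiated pricing:

```python
def call_cost(minutes, tts_chars, llm_tokens,
              telephony_per_min=0.014,   # Twilio
              stt_per_min=0.0043,        # Deepgram Nova-2
              llm_per_1k_tokens=0.005,   # GPT-4o (blended estimate)
              tts_per_1k_chars=0.30):    # ElevenLabs
    """Estimate the cost of a single call from per-unit rates."""
    return round(
        minutes * (telephony_per_min + stt_per_min)
        + llm_tokens / 1000 * llm_per_1k_tokens
        + tts_chars / 1000 * tts_per_1k_chars,
        3,
    )

print(call_cost(3, 500, 1000))                         # the 3-minute call above
print(call_cost(3, 500, 1000, tts_per_1k_chars=0.004)) # same call on Google TTS
```

Multiply by expected call volume before committing to a stack; at 10,000 calls a month the TTS choice alone can swing the bill by four figures.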
Error Handling
Every component in the pipeline can fail. Plan for it:
```python
# RateLimitError / Timeout / TTSError stand in for whatever your LLM and
# TTS SDKs actually raise — e.g. openai.RateLimitError, openai.APITimeoutError.
class ResilientVoiceAgent:
    async def process_utterance(self, transcript: str):
        try:
            response = await self.llm.generate(transcript)
        except (RateLimitError, Timeout):
            response = "I'm having trouble thinking right now. Could you repeat that?"
        try:
            audio = await self.tts.synthesize(response)
        except TTSError:
            # Fall back to a different TTS provider
            audio = await self.fallback_tts.synthesize(response)
        return audio
```
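A more general form of the fallback idea is to try providers in order with a per-attempt timeout. A minimal sketch (timeouts and error handling are deliberately simple):

```python
import asyncio

async def with_fallbacks(coro_factories, timeout=2.0):
    """Try each provider in order, moving on when one raises or exceeds
    the per-attempt timeout. Factories are zero-arg callables returning
    a fresh coroutine, so each attempt starts clean."""
    last_err = None
    for factory in coro_factories:
        try:
            return await asyncio.wait_for(factory(), timeout)
        except Exception as err:  # includes asyncio.TimeoutError
            last_err = err
    raise RuntimeError(f"all providers failed: {last_err!r}")

async def demo_fallback():
    async def primary():
        raise ConnectionError("primary TTS down")

    async def backup():
        return b"audio-bytes"

    return await with_fallbacks([primary, backup])

print(asyncio.run(demo_fallback()))
```

The factory indirection matters: a coroutine object can only be awaited once, so retries need a fresh one per attempt.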
Monitoring
Track these metrics in production:
- End-to-end latency: Time from user silence to first audio byte. Alert if p95 exceeds 1.5s.
- STT word error rate (WER): Sample calls and manually verify transcripts. Target < 10%.
- Interruption rate: How often users barge in. High rates (>30%) suggest the agent is too verbose.
- Call completion rate: What percentage of calls reach a natural conclusion vs. the user hanging up.
- Hallucination detection: Monitor for the LLM generating phone numbers, addresses, or other factual claims that could be wrong.
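For the latency metric, a rolling-window p95 check is enough to drive an alert. A sketch (window size and threshold are arbitrary starting points; the nearest-rank percentile is approximate for small windows):

```python
class LatencyTracker:
    """Rolling window of end-to-end latencies with a p95 alert check."""

    def __init__(self, window=500, p95_alert_ms=1500):
        self.window = window
        self.p95_alert_ms = p95_alert_ms
        self.samples = []

    def record(self, latency_ms):
        self.samples.append(latency_ms)
        if len(self.samples) > self.window:
            self.samples.pop(0)  # drop the oldest sample

    def p95(self):
        ordered = sorted(self.samples)
        idx = max(0, int(len(ordered) * 0.95) - 1)  # nearest-rank p95
        return ordered[idx]

    def should_alert(self):
        return bool(self.samples) and self.p95() > self.p95_alert_ms

tracker = LatencyTracker()
for ms in 90 * [800] + 10 * [2000]:  # 10% of turns are slow
    tracker.record(ms)
print(tracker.p95(), tracker.should_alert())  # → 2000 True
```

Record the timestamp when the STT endpointing fires and when the first TTS byte is written to the transport; their difference is the number users actually feel.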
The Architecture That Actually Works
After building several production voice agents, here's the architecture I'd recommend:
```
┌─────────────────────────────────────────────────┐
│                  Client Layer                   │
│  Phone (Twilio) │ Web (WebRTC) │ Device (Alexa) │
└──────────────────────┬──────────────────────────┘
                       │ WebSocket / SIP
┌──────────────────────▼──────────────────────────┐
│        Media Router (LiveKit / Daily)           │
│   Handles WebRTC, audio mixing, recording       │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│           Voice Agent Orchestrator              │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌──────────────┐   │
│  │ VAD │→│ STT │→│ LLM │→│     TTS       │   │
│  └─────┘  └─────┘  └─────┘  └──────────────┘   │
│         Pipecat or custom pipeline              │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│              Application Layer                  │
│ Conversation state, tool calls, knowledge base  │
└─────────────────────────────────────────────────┘
```
Use LiveKit or Daily for the media layer — they handle WebRTC complexity, provide reliable audio transport, and work across web and telephony. Use Pipecat for orchestration unless you have very specific requirements that demand a custom pipeline.
Start with the best components for each layer, measure latency and quality, then optimize the bottlenecks. In my experience, the LLM is usually the latency bottleneck — using GPT-4o-mini or Claude Haiku instead of GPT-4 can cut response latency in half with minimal quality loss for most conversational use cases.
Voice agents are hard to get right, but when they work, they feel like magic. The technology is mature enough now that a small team can build a production-quality agent in weeks, not months. Start simple, measure everything, and iterate on the experience.