Voice AI Agents: Building Conversational Interfaces with Real-Time Speech
Marcus Rivera
Full-stack developer and agent builder. Covers coding assistants and dev tools.
Building Voice-Based AI Agents: A Practical Engineering Guide
Voice is the most natural interface humans have. We speak at roughly 150 words per minute versus typing at 40, and we do it hands-free. For AI agents, voice unlocks entirely new interaction surfaces — phone lines, smart speakers, car dashboards, wearables — that text-based chat can't reach.
But building a reliable voice agent is substantially harder than building a text chatbot. You're chaining together multiple ML systems in real time, each with its own failure modes, and the latency budget is brutally tight. Users expect sub-second responses in conversation. Every millisecond of delay between a user finishing their sentence and the agent beginning to speak feels wrong.
This guide covers the full stack: speech-to-text, text-to-speech, real-time orchestration, and integration with telephony and smart devices. I'll focus on tools and approaches that actually work in production.
The Voice Agent Pipeline
Every voice agent follows the same fundamental pipeline:
Audio In → VAD → STT → LLM → TTS → Audio Out
- Voice Activity Detection (VAD): Detect when the user starts and stops speaking
- Speech-to-Text (STT): Convert audio to a text transcript
- Language Model: Process the transcript and generate a response
- Text-to-Speech (TTS): Convert the response text back to audio
Each stage adds latency. The total pipeline latency — from the moment the user stops talking to the moment they hear the first audio of the response — is what determines whether your agent feels responsive or sluggish. A good target is under 800ms. Anything over 2 seconds and users start talking over your agent or hanging up.
Speech-to-Text: Getting the Words Right
STT is the first and arguably most critical component. If your agent mishears the user, everything downstream fails.
The Major Options
| Provider | Streaming | Latency | Accuracy | Cost (per min) | Best For |
|---|---|---|---|---|---|
| Deepgram Nova-2 | ✅ | ~200ms | Excellent | $0.0043 | Production voice agents |
| OpenAI Whisper (API) | ❌ (batch) | ~1-3s | Excellent | $0.006 | Offline processing |
| Whisper (local) | Via faster-whisper | ~300ms | Excellent | Free (compute) | Self-hosted |
| Google Cloud STT v2 | ✅ | ~300ms | Good | $0.016 | GCP shops |
| AssemblyAI | ✅ | ~300ms | Good | $0.015 | Transcription + features |
| Azure Speech | ✅ | ~300ms | Good | $0.016 | Enterprise/Microsoft |
Deepgram Nova-2 is the current go-to for real-time voice agents. It's fast, accurate, and purpose-built for streaming. Their WebSocket API returns interim results as the user speaks, which lets you start processing before the sentence is complete.
OpenAI's Whisper produces excellent transcripts but the API doesn't support streaming — you have to send complete audio chunks, which adds latency. The open-source model can be run locally with faster-whisper for streaming, but you're managing GPU infrastructure.
Streaming STT with Deepgram
Here's a practical example of streaming audio to Deepgram and getting real-time transcripts:
```python
import asyncio
import json

import websockets

async def stream_to_deepgram(audio_stream):
    """Stream audio to Deepgram and yield transcript segments."""
    DEEPGRAM_API_KEY = "your-key-here"
    url = (
        "wss://api.deepgram.com/v1/listen?"
        "model=nova-2&"
        "language=en-US&"
        "encoding=linear16&"
        "sample_rate=16000&"
        "channels=1&"
        "interim_results=true&"
        "endpointing=300&"        # 300ms silence = end of utterance
        "utterance_end_ms=1000&"  # emit UtteranceEnd after 1s of silence
        "vad_events=true"         # get VAD events for turn detection
    )
    headers = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}

    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:

        async def send_audio():
            async for chunk in audio_stream:
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receive_transcripts():
            async for message in ws:
                data = json.loads(message)
                if data.get("type") == "Results":
                    transcript = data["channel"]["alternatives"][0]
                    if transcript["transcript"]:
                        is_final = data["is_final"]
                        yield {
                            "text": transcript["transcript"],
                            "confidence": transcript["confidence"],
                            "is_final": is_final,
                            "speech_final": data.get("speech_final", False),
                        }
                elif data.get("type") == "UtteranceEnd":
                    # User has stopped speaking — time to generate response
                    yield {"type": "utterance_end"}

        # Run sender and receiver concurrently
        send_task = asyncio.create_task(send_audio())
        async for transcript in receive_transcripts():
            yield transcript
        await send_task
```
Key STT Considerations
Endpointing is the parameter that matters most for voice agents. It controls how long the system waits after the user stops talking before considering the utterance complete. Too short (100ms) and you'll cut off users who pause mid-sentence. Too long (2000ms) and your agent feels sluggish. 300-500ms is the sweet spot for conversational agents.
Interim results let you start processing before the user finishes speaking. Some architectures use speculative generation — they start the LLM call with partial transcripts and refine as more text arrives. This is complex but can cut perceived latency significantly.
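One way to sketch the speculative approach: launch the LLM call on an interim transcript, and keep the result only if the final transcript matches what you speculated on. This is a minimal illustration with a stand-in `llm_call` coroutine (hypothetical), not a production scheduler:

```python
import asyncio

class SpeculativeGenerator:
    """Start LLM generation on interim transcripts; restart if the final
    transcript differs from what we speculated on."""

    def __init__(self, llm_call):
        self.llm_call = llm_call  # any coroutine: prompt -> response text
        self.task = None
        self.speculated_on = None

    def on_interim(self, transcript: str):
        # (Re)launch only when the interim text actually changed
        if transcript != self.speculated_on:
            if self.task and not self.task.done():
                self.task.cancel()
            self.speculated_on = transcript
            self.task = asyncio.ensure_future(self.llm_call(transcript))

    async def on_final(self, transcript: str) -> str:
        if transcript == self.speculated_on and self.task:
            return await self.task  # speculation paid off: latency saved
        if self.task and not self.task.done():
            self.task.cancel()      # speculation was wrong: discard it
        return await self.llm_call(transcript)

async def demo_speculation():
    async def fake_llm(prompt):
        await asyncio.sleep(0.05)   # stand-in for network latency
        return f"reply to: {prompt}"

    gen = SpeculativeGenerator(fake_llm)
    gen.on_interim("book a table")  # speculate before the user finishes
    return await gen.on_final("book a table")

print(asyncio.run(demo_speculation()))
```

In the happy path the LLM call is already in flight when the final transcript lands, so its latency overlaps with the user's last few hundred milliseconds of speech.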
Hotwords and context improve accuracy for domain-specific terms. Both Deepgram and Google let you supply a vocabulary list or context phrases. If your agent handles restaurant reservations, boosting terms like "reservation," "party size," and specific cuisine names helps enormously.
```python
# Deepgram: boost domain-specific terms (URL-encode multi-word phrases)
url += "&keywords=reservation&keywords=party%20size&keywords=cancellation"
```
Text-to-Speech: Sounding Natural
TTS quality has improved dramatically. Modern neural TTS models produce speech that's nearly indistinguishable from human recordings — but not all use cases need that fidelity.
The TTS Landscape
| Provider | Streaming | Quality | Voices | Cost (per 1M chars) | Latency |
|---|---|---|---|---|---|
| ElevenLabs | ✅ | Best | Clone any voice | $30-330 | ~300ms |
| OpenAI TTS | ❌ (batch) | Excellent | 6 fixed voices | $15 | ~500ms |
| Google Cloud TTS | ✅ (via SSML) | Good | 400+ | $4-16 | ~200ms |
| Azure Neural TTS | ✅ | Good | 400+ | $16 | ~200ms |
| Cartesia | ✅ | Excellent | Many | $15 | ~150ms |
| PlayHT | ✅ | Good | Clone capable | $25-45 | ~300ms |
ElevenLabs set the quality bar and remains the best for voice cloning and emotional range. Their streaming API returns audio chunks as they're generated, so you can start playback while the rest of the sentence is still being synthesized.
Cartesia is a strong newer entrant focused specifically on real-time applications. Their Sonic model achieves sub-150ms time-to-first-audio, which is critical for voice agents.
OpenAI's TTS produces excellent audio but doesn't stream — you get the full audio file back, adding latency. Fine for short responses but problematic for longer agent replies.
Streaming TTS Implementation
Here's how to stream TTS from ElevenLabs and play it back in real time:
```python
import asyncio
import base64
import json

import websockets

class StreamingTTS:
    def __init__(self, api_key, voice_id="21m00Tcm4TlvDq8ikWAM"):
        self.api_key = api_key
        self.voice_id = voice_id
        self.audio_queue = asyncio.Queue()

    async def synthesize_stream(self, text_stream):
        """Convert a stream of text tokens to audio chunks."""
        url = (
            f"wss://api.elevenlabs.io/v1/text-to-speech/"
            f"{self.voice_id}/stream-input?"
            "model_id=eleven_turbo_v2_5&"
            "output_format=ulaw_8000"  # telephony-compatible format
        )
        async with websockets.connect(url) as ws:
            # Send BOS (beginning of stream) with voice settings
            await ws.send(json.dumps({
                "text": " ",
                "voice_settings": {
                    "stability": 0.5,
                    "similarity_boost": 0.8,
                    "use_speaker_boost": False,
                },
                "xi_api_key": self.api_key,
            }))

            # Stream text tokens as they arrive from the LLM
            async def send_text():
                buffer = ""
                async for token in text_stream:
                    buffer += token
                    # Send complete sentences for natural prosody
                    if any(p in buffer for p in ".!?"):
                        await ws.send(json.dumps({
                            "text": buffer,
                            "flush": True,
                        }))
                        buffer = ""
                # Send remaining text
                if buffer:
                    await ws.send(json.dumps({"text": buffer, "flush": True}))
                # Signal end of input
                await ws.send(json.dumps({"text": ""}))

            async def receive_audio():
                while True:
                    try:
                        msg = await asyncio.wait_for(ws.recv(), timeout=5.0)
                    except asyncio.TimeoutError:
                        break
                    data = json.loads(msg)
                    if data.get("audio"):
                        audio_chunk = base64.b64decode(data["audio"])
                        await self.audio_queue.put(audio_chunk)
                    if data.get("isFinal"):
                        break
                await self.audio_queue.put(None)  # sentinel: stream finished

            await asyncio.gather(send_text(), receive_audio())
```
Chunking Strategy for Natural Speech
The biggest TTS mistake I see is sending the entire LLM response as one block. Neural TTS models apply prosody (rhythm, emphasis, intonation) based on sentence and paragraph structure. Sending one massive paragraph produces flat, monotone output.
Instead, chunk by sentence. Feed complete sentences to the TTS model so it can apply appropriate prosody. The code above demonstrates this — it buffers until it hits punctuation, then flushes.
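A standalone version of that buffering strategy might look like the following: a small generator that groups streamed tokens into complete sentences. The regex split is a heuristic, not a full sentence segmenter (it will mis-split abbreviations like "Dr."):

```python
import re

def sentence_chunks(token_stream):
    """Group a stream of text tokens into complete sentences so the TTS
    model can apply natural prosody."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Emit every complete sentence currently sitting in the buffer
        while True:
            match = re.search(r"[.!?](\s+|$)", buffer)
            if not match:
                break
            end = match.end()
            sentence = buffer[:end].strip()
            buffer = buffer[end:]
            if sentence:
                yield sentence
    if buffer.strip():
        yield buffer.strip()  # flush any trailing fragment

tokens = ["Your table", " is booked.", " See you", " at 7!", " Bye"]
print(list(sentence_chunks(tokens)))
# → ['Your table is booked.', 'See you at 7!', 'Bye']
```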
For even better results, use SSML (Speech Synthesis Markup Language) to control pacing:
```xml
<speak>
  Your reservation is confirmed for <say-as interpret-as="date">2024-12-15</say-as>
  at <say-as interpret-as="time">7:30 PM</say-as>.
  <break time="300ms"/>
  Is there anything else I can help with?
</speak>
```
Real-Time Conversation Orchestration
This is where the complexity lives. You need to coordinate STT, LLM, and TTS in a tight loop while managing turn-taking, interruptions, and barge-in.
Latency Budget Breakdown
For a responsive voice agent, here's a realistic latency budget:
```
User stops speaking
  → Endpointing detection:      300ms
  → STT final result:           100ms
  → LLM first token:            200-400ms
  → TTS first audio chunk:      150-300ms
  → Network + playback buffer:   50ms
  ─────────────────────────────────
  Total:                        800-1150ms
```
That's tight but achievable. The key optimization is streaming everything — don't wait for the complete LLM response before starting TTS.
The Pipecat Framework
Pipecat is an open-source framework specifically designed for real-time voice AI agents. It handles the plumbing of connecting STT, LLM, and TTS services with proper streaming and interruption support.
```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer

async def create_voice_agent(room_url, token):
    """Create a real-time voice agent using Pipecat."""
    # Transport layer (WebRTC via Daily)
    transport = DailyTransport(
        room_url,
        token,
        "Voice Agent",
        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    # STT: Deepgram with streaming
    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        model="nova-2",
        language="en-US",
    )

    # LLM: OpenAI GPT-4o
    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="gpt-4o",
        system_prompt=(
            "You are a helpful voice assistant. Keep responses concise and "
            "conversational. Avoid lists and formatting since this is a "
            "voice conversation. Aim for 1-3 sentences per response."
        ),
    )

    # TTS: ElevenLabs
    tts = ElevenLabsTTSService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id="21m00Tcm4TlvDq8ikWAM",
    )

    # Pipeline: Audio In → STT → LLM → TTS → Audio Out
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        tts,
        transport.output(),
    ])

    task = PipelineTask(pipeline)
    runner = PipelineRunner()
    await runner.run(task)
```
Pipecat handles interruption automatically — if the user starts speaking while the agent is talking, it stops TTS playback, processes the new input, and responds to the interruption. This barge-in capability is essential for natural conversation.
Building It From Scratch
If you need more control or can't use Pipecat, here's the core orchestration logic:
```python
import asyncio

class VoiceAgent:
    def __init__(self, stt, llm, tts):
        self.stt = stt
        self.llm = llm
        self.tts = tts
        self.conversation_history = []
        self.is_speaking = False
        self.current_tts_task = None

    async def handle_utterance(self, transcript: str):
        """Process a complete user utterance."""
        # Cancel any ongoing TTS (barge-in)
        if self.current_tts_task and not self.current_tts_task.done():
            self.current_tts_task.cancel()
            self.tts.stop_playback()

        self.conversation_history.append({
            "role": "user",
            "content": transcript,
        })

        # Start LLM generation (streaming)
        llm_stream = self.llm.stream_chat(self.conversation_history)

        # Collect the full response for history while streaming to TTS
        full_response = ""

        async def text_collector():
            nonlocal full_response
            async for token in llm_stream:
                full_response += token
                yield token

        # Stream LLM output directly to TTS
        self.current_tts_task = asyncio.create_task(
            self.tts.synthesize_and_play(text_collector())
        )
        try:
            await self.current_tts_task
        except asyncio.CancelledError:
            pass  # interrupted by barge-in

        self.conversation_history.append({
            "role": "assistant",
            "content": full_response,
        })

    async def run(self, audio_input_stream):
        """Main event loop."""
        async for event in self.stt.stream(audio_input_stream):
            if event.get("speech_final"):
                await self.handle_utterance(event["text"])
            elif event.get("type") == "utterance_end":
                # Handle silence — maybe prompt the user
                pass
```
Handling Interruptions Gracefully
Interruption handling is what separates good voice agents from bad ones. Here are the patterns that work:
Hard stop: Immediately stop TTS and process the new input. Use this for clear interruptions where the user starts a new sentence.
Soft stop: Continue generating audio but don't play it. If the user's interruption turns out to be a brief interjection ("uh-huh," "right"), resume playback. This prevents the agent from stopping every time the user makes a verbal acknowledgment.
Collapsible context: When interrupted mid-response, add the partial response to the conversation history with a note that it was interrupted. The LLM can then naturally continue or adjust based on the interruption.
```python
async def handle_interruption(self, new_transcript):
    # Save the partial response so the LLM knows what was already said
    partial_response = self.tts.get_generated_text_so_far()
    self.conversation_history.append({
        "role": "assistant",
        "content": partial_response + " [interrupted]",
    })
    # Process the new user input
    await self.handle_utterance(new_transcript)
```
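The soft-stop pattern needs a way to decide whether an interruption is a backchannel or a real interjection. A simple heuristic sketch (the word list is illustrative; tune it from your own call transcripts):

```python
# Words/phrases that usually signal acknowledgment rather than interruption
BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "right", "ok", "okay", "sure", "got it"}

def is_backchannel(transcript: str) -> bool:
    """Heuristic: short utterances made entirely of acknowledgment words
    are backchannels, so keep speaking instead of hard-stopping."""
    cleaned = transcript.lower().strip(" .!?,")
    if cleaned in BACKCHANNELS:
        return True
    words = cleaned.split()
    return 0 < len(words) <= 2 and all(w in BACKCHANNELS for w in words)

print(is_backchannel("Uh-huh."))     # acknowledgment: keep playing TTS
print(is_backchannel("wait, stop"))  # real interruption: hard stop
```

In the orchestrator, check `is_backchannel` on the interim transcript before cancelling the TTS task; only escalate to a hard stop when the check fails.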
Integration with Phone Systems
Phone integration is where voice agents deliver the most business value — customer service, appointment scheduling, lead qualification. But telephony has its own quirks.
The Telephony Stack
```
Phone Network (PSTN)
        ↓
SIP Trunk (Twilio / Telnyx / Vonage)
        ↓
Media Server (your app or TwiML)
        ↓
Your Voice Agent
```
Phone audio comes as µ-law encoded 8kHz mono — much lower quality than web audio. This affects STT accuracy. You'll want to either:
- Use an STT model tuned for telephony audio (e.g. Deepgram's phonecall model)
- Upsample to 16kHz before processing
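The decode-and-upsample step can be done without dependencies. The classic route was the stdlib audioop module, but it was removed in Python 3.13, so here is a pure-Python sketch of standard G.711 µ-law decoding plus a naive 2x linear-interpolation upsample:

```python
import struct

def ulaw_to_pcm16(data: bytes) -> bytes:
    """Decode G.711 µ-law bytes to 16-bit little-endian PCM."""
    out = bytearray()
    for b in data:
        u = ~b & 0xFF                 # µ-law stores bits inverted
        t = ((u & 0x0F) << 3) + 0x84  # mantissa plus bias
        t <<= (u & 0x70) >> 4         # apply the exponent segment
        sample = (0x84 - t) if (u & 0x80) else (t - 0x84)
        out += struct.pack("<h", sample)
    return bytes(out)

def upsample_2x(pcm: bytes) -> bytes:
    """Naive 8kHz to 16kHz upsample by linear interpolation."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    out = []
    for i, s in enumerate(samples):
        out.append(s)
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) // 2)    # midpoint between neighbors
    return struct.pack(f"<{len(out)}h", *out)

# A 20ms frame of Twilio silence (0xFF) decodes to zeros
silence = ulaw_to_pcm16(b"\xff" * 160)   # 160 samples at 8kHz
print(len(upsample_2x(silence)))         # 640 bytes = 20ms at 16kHz
```

Per-sample Python loops are fine for 8kHz phone audio; swap in NumPy if profiling says otherwise.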
Twilio Integration
Twilio is the most common telephony platform. Here's how to connect a voice agent to a phone number:
```python
import asyncio
import base64
import json

from fastapi import FastAPI, WebSocket
from fastapi.responses import Response

app = FastAPI()

# Handle incoming Twilio calls
@app.post("/voice/incoming")
async def handle_incoming_call():
    """Return TwiML to connect the call to our WebSocket media stream."""
    twiml = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-server.com/media-stream">
      <Parameter name="caller" value="{{From}}" />
    </Stream>
  </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")

@app.websocket("/media-stream")
async def handle_media_stream(ws: WebSocket):
    """Handle Twilio's bi-directional media stream."""
    await ws.accept()
    stream_sid = None
    agent = VoiceAgent(stt=..., llm=..., tts=...)

    async def send_audio_to_caller():
        """Send synthesized audio back to the caller via Twilio."""
        async for chunk in agent.audio_output():
            # Twilio expects µ-law 8kHz, base64-encoded
            payload = base64.b64encode(chunk).decode()
            await ws.send_text(json.dumps({
                "event": "media",
                "streamSid": stream_sid,
                "media": {"payload": payload},
            }))

    async for message in ws.iter_text():
        data = json.loads(message)
        if data["event"] == "start":
            stream_sid = data["start"]["streamSid"]
            # Initialize agent with caller context
            caller = data["start"].get("customParameters", {}).get("caller")
            await agent.initialize(caller_id=caller)
            # Start the agent's audio output loop once the stream is live
            asyncio.create_task(send_audio_to_caller())
        elif data["event"] == "media":
            # Decode µ-law audio from Twilio
            audio_chunk = base64.b64decode(data["media"]["payload"])
            await agent.feed_audio(audio_chunk)
        elif data["event"] == "stop":
            await agent.cleanup()
            break
```
Vapi and Bland.ai: Managed Voice Agent Platforms
If you don't want to manage the telephony plumbing yourself, Vapi and Bland.ai are purpose-built platforms for phone-based AI agents. They handle the SIP integration, audio streaming, and provide a clean API:
```python
import requests

# Create an outbound phone call with a Vapi assistant
response = requests.post(
    "https://api.vapi.ai/call/phone",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "assistant": {
            "model": {
                "provider": "openai",
                "model": "gpt-4o",
                "systemMessage": "You are a helpful customer service agent...",
            },
            "voice": {
                "provider": "11labs",
                "voiceId": "21m00Tcm4TlvDq8ikWAM",
            },
            "firstMessage": "Hello, how can I help you today?",
            "interruptionThreshold": 0.5,
        },
        "phoneNumberId": "your-vapi-phone-number-id",
        "customer": {
            "number": "+1234567890",
        },
    },
)
```
These platforms are excellent for getting started quickly and for use cases where you don't need deep control over the audio pipeline. The trade-off is less control over latency optimization and higher per-minute costs ($0.05-0.15/min).
Integration with Smart Devices
Alexa Skills
Alexa's voice interaction model is different from phone/web agents. It's turn-based with explicit wake words — the user says "Alexa, ask [your skill] to..." and then speaks. There's no continuous conversation by default (though you can enable it with shouldEndSession: false).
```python
# Alexa Skill handler using the ask-sdk
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.skill_builder import SkillBuilder

class LaunchRequestHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return handler_input.request_envelope.request.object_type == "LaunchRequest"

    def handle(self, handler_input):
        # Launch your voice agent for this session
        session_attr = handler_input.attributes_manager.session_attributes
        session_attr["conversation_history"] = []
        return (
            handler_input.response_builder
            .speak("Hi! I'm your AI assistant. How can I help?")
            .ask("What would you like to know?")  # keeps the session open
            .response
        )

class IntentRequestHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return handler_input.request_envelope.request.object_type == "IntentRequest"

    def handle(self, handler_input):
        # Alexa has already done STT — you get the text directly
        user_text = handler_input.request_envelope.request.intent.slots["query"].value
        session_attr = handler_input.attributes_manager.session_attributes
        history = session_attr.get("conversation_history", [])
        history.append({"role": "user", "content": user_text})

        # Call your LLM
        response_text = call_llm(history)
        history.append({"role": "assistant", "content": response_text})

        return (
            handler_input.response_builder
            .speak(response_text)
            .ask("Anything else?")  # keep session open for follow-up
            .response
        )
```
The key limitation with Alexa: you don't control STT or TTS. Amazon handles both, which means you can't use custom voices or tune STT for your domain. The advantage is that it's free to the end user and works on every Echo device.
Google Assistant / Google Home
Google's Actions Builder provides a similar turn-based model. The newer approach uses conversational webhooks where Google sends the recognized text to your endpoint and expects a response in their specified format.
Home Assistant + Local LLM
For privacy-focused or offline smart home agents, Home Assistant's Assist pipeline runs everything locally:
```yaml
# configuration.yaml
assist_pipeline:
  - name: "Living Room Agent"
    stt_engine: whisper
    tts_engine: piper
    conversation_agent: homeassistant  # or a custom Ollama endpoint
```
Piper is a fast local TTS engine that runs on a Raspberry Pi. Combined with a local Whisper model for STT and a small LLM via Ollama, you get a fully offline voice agent. Quality won't match cloud services, but latency can actually be better (no network round trips) and there are zero privacy concerns.
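Wiring a local Ollama model in as the conversation brain is a plain HTTP call. A sketch against Ollama's /api/chat endpoint, assuming Ollama is running locally with the model already pulled (the model name `llama3.2` is an assumption; use whatever you have):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_request(history, model="llama3.2"):
    """Build the JSON body for Ollama's /api/chat endpoint.
    stream=False asks for one complete JSON response."""
    return {"model": model, "messages": history, "stream": False}

def local_chat(history):
    """Send conversation history to a local Ollama server and return
    the assistant's reply."""
    body = json.dumps(build_request(history)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

For voice use you'd set `stream: True` and feed the token stream into the sentence-chunking TTS path, exactly as with a cloud LLM.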
Production Considerations
Cost Modeling
Voice agents get expensive fast. Here's a realistic cost breakdown for a 3-minute phone call:
| Component | Usage | Unit Cost | Cost |
|---|---|---|---|
| Twilio telephony | 3 min | $0.014/min | $0.042 |
| Deepgram STT | 3 min | $0.0043/min | $0.013 |
| GPT-4o (est. 1000 tokens) | ~1000 tokens | $0.005/1K | $0.005 |
| ElevenLabs TTS | ~500 chars | $0.30/1K chars | $0.15 |
| Total per call | | | $0.21 |
TTS is often the dominant cost. For cost-sensitive applications, consider:
- Google Cloud TTS ($0.004/1K chars — 75x cheaper than ElevenLabs)
- Azure Neural TTS (similar pricing to Google)
- Piper (free, local, good-enough quality for many use cases)
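Since the unit costs multiply out per call, a tiny cost model makes provider comparisons concrete. This sketch uses the rates from the table above as defaults; every rate is a parameter you should replace with your own negotiated pricing:

```python
def call_cost(minutes, tts_chars, llm_tokens,
              telephony_per_min=0.014,   # Twilio
              stt_per_min=0.0043,        # Deepgram Nova-2
              llm_per_1k_tokens=0.005,   # GPT-4o (blended estimate)
              tts_per_1k_chars=0.30):    # ElevenLabs
    """Estimate the cost of a single call from per-unit rates."""
    return round(
        minutes * (telephony_per_min + stt_per_min)
        + llm_tokens / 1000 * llm_per_1k_tokens
        + tts_chars / 1000 * tts_per_1k_chars,
        3,
    )

print(call_cost(3, 500, 1000))                         # the 3-minute call above
print(call_cost(3, 500, 1000, tts_per_1k_chars=0.004)) # same call on Google TTS
```

Multiply by expected call volume before committing to a stack; at 10,000 calls a month the TTS choice alone can swing the bill by four figures.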
Error Handling
Every component in the pipeline can fail. Plan for it:
```python
# RateLimitError / Timeout / TTSError stand in for whatever your LLM and
# TTS SDKs actually raise — e.g. openai.RateLimitError, openai.APITimeoutError.
class ResilientVoiceAgent:
    async def process_utterance(self, transcript: str):
        try:
            response = await self.llm.generate(transcript)
        except (RateLimitError, Timeout):
            response = "I'm having trouble thinking right now. Could you repeat that?"
        try:
            audio = await self.tts.synthesize(response)
        except TTSError:
            # Fall back to a different TTS provider
            audio = await self.fallback_tts.synthesize(response)
        return audio
```
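A more general form of the fallback idea is to try providers in order with a per-attempt timeout. A minimal sketch (timeouts and error handling are deliberately simple):

```python
import asyncio

async def with_fallbacks(coro_factories, timeout=2.0):
    """Try each provider in order, moving on when one raises or exceeds
    the per-attempt timeout. Factories are zero-arg callables returning
    a fresh coroutine, so each attempt starts clean."""
    last_err = None
    for factory in coro_factories:
        try:
            return await asyncio.wait_for(factory(), timeout)
        except Exception as err:  # includes asyncio.TimeoutError
            last_err = err
    raise RuntimeError(f"all providers failed: {last_err!r}")

async def demo_fallback():
    async def primary():
        raise ConnectionError("primary TTS down")

    async def backup():
        return b"audio-bytes"

    return await with_fallbacks([primary, backup])

print(asyncio.run(demo_fallback()))
```

The factory indirection matters: a coroutine object can only be awaited once, so retries need a fresh one per attempt.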
Monitoring
Track these metrics in production:
- End-to-end latency: Time from user silence to first audio byte. Alert if p95 exceeds 1.5s.
- STT word error rate (WER): Sample calls and manually verify transcripts. Target < 10%.
- Interruption rate: How often users barge in. High rates (>30%) suggest the agent is too verbose.
- Call completion rate: What percentage of calls reach a natural conclusion vs. the user hanging up.
- Hallucination detection: Monitor for the LLM generating phone numbers, addresses, or other factual claims that could be wrong.
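For the latency metric, a rolling-window p95 check is enough to drive an alert. A sketch (window size and threshold are arbitrary starting points; the nearest-rank percentile is approximate for small windows):

```python
class LatencyTracker:
    """Rolling window of end-to-end latencies with a p95 alert check."""

    def __init__(self, window=500, p95_alert_ms=1500):
        self.window = window
        self.p95_alert_ms = p95_alert_ms
        self.samples = []

    def record(self, latency_ms):
        self.samples.append(latency_ms)
        if len(self.samples) > self.window:
            self.samples.pop(0)  # drop the oldest sample

    def p95(self):
        ordered = sorted(self.samples)
        idx = max(0, int(len(ordered) * 0.95) - 1)  # nearest-rank p95
        return ordered[idx]

    def should_alert(self):
        return bool(self.samples) and self.p95() > self.p95_alert_ms

tracker = LatencyTracker()
for ms in 90 * [800] + 10 * [2000]:  # 10% of turns are slow
    tracker.record(ms)
print(tracker.p95(), tracker.should_alert())  # → 2000 True
```

Record the timestamp when the STT endpointing fires and when the first TTS byte is written to the transport; their difference is the number users actually feel.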
The Architecture That Actually Works
After building several production voice agents, here's the architecture I'd recommend:
```
┌─────────────────────────────────────────────────┐
│                  Client Layer                   │
│  Phone (Twilio) │ Web (WebRTC) │ Device (Alexa) │
└──────────────────────┬──────────────────────────┘
                       │ WebSocket / SIP
┌──────────────────────▼──────────────────────────┐
│        Media Router (LiveKit / Daily)           │
│   Handles WebRTC, audio mixing, recording       │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│           Voice Agent Orchestrator              │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌──────────────┐   │
│  │ VAD │→│ STT │→│ LLM │→│     TTS       │   │
│  └─────┘  └─────┘  └─────┘  └──────────────┘   │
│         Pipecat or custom pipeline              │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│              Application Layer                  │
│ Conversation state, tool calls, knowledge base  │
└─────────────────────────────────────────────────┘
```
Use LiveKit or Daily for the media layer — they handle WebRTC complexity, provide reliable audio transport, and work across web and telephony. Use Pipecat for orchestration unless you have very specific requirements that demand a custom pipeline.
Start with the best components for each layer, measure latency and quality, then optimize the bottlenecks. In my experience, the LLM is usually the latency bottleneck — using GPT-4o-mini or Claude Haiku instead of GPT-4 can cut response latency in half with minimal quality loss for most conversational use cases.
Voice agents are hard to get right, but when they work, they feel like magic. The technology is mature enough now that a small team can build a production-quality agent in weeks, not months. Start simple, measure everything, and iterate on the experience.