Conversational Agents

Building a Customer Support Agent That Resolves 70% of Tickets

Emma Liu

Tech journalist covering the AI agent ecosystem and startups.

March 13, 2026 · 17 min read

Most customer support agents fail not because the LLM is bad, but because the architecture around it is naive. A chatbot that can only answer questions from a static FAQ page will hallucinate, frustrate users, and generate tickets that are harder to resolve than if the customer had never contacted support at all.

This tutorial walks through building a support agent that actually works in production — one that knows when it doesn't know, integrates with your existing systems, escalates intelligently, and gets better over time. We'll use real tools, real code, and real tradeoffs.


The Architecture That Actually Works

Before touching any code, let's establish the system design. A production support agent isn't a single prompt — it's a pipeline.

┌─────────────────────────────────────────────────────────────────┐
│                        Customer Message                         │
└──────────────────────────┬──────────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                    Intent Classification                         │
│            (deterministic routing, not LLM guessing)             │
└──────────┬───────────────┬──────────────────┬───────────────────┘
           ▼               ▼                  ▼
   ┌──────────────┐ ┌─────────────┐  ┌──────────────────┐
   │  Knowledge   │ │  Account /  │  │  Escalation      │
   │  Retrieval   │ │  Order Tool │  │  to Human        │
   │  (RAG)       │ │  Calls      │  │                  │
   └──────┬───────┘ └──────┬──────┘  └──────────────────┘
          ▼                ▼
   ┌─────────────────────────────┐
   │     Response Generation     │
   │   (constrained, grounded)   │
   └──────────────┬──────────────┘
                  ▼
   ┌─────────────────────────────┐
   │   Quality Gate / Guardrails │
   └──────────────┬──────────────┘
                  ▼
           Response to User

Each layer is independently testable and replaceable. That's the key insight — don't build a monolith.


Part 1: Knowledge Base Setup

Choosing Your RAG Stack

The knowledge base is where most of your accuracy comes from. Here's what I've seen work and what doesn't.

What works:

  • Chunking by semantic section (not arbitrary token counts)
  • Metadata filtering (product version, customer tier, region)
  • Hybrid search (vector + keyword) for technical content

What doesn't work:

  • Dumping your entire docs site into one vector store
  • Using a single embedding model for all content types
  • Ignoring document freshness — outdated answers are worse than no answer

Let's build this properly with a concrete example. We'll use LangChain for orchestration, Qdrant for vector storage (it supports hybrid search natively), and Cohere's embed-v3 for embeddings (a multilingual variant is available if you need one).

Document Processing Pipeline

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_cohere import CohereEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain_qdrant import FastEmbedSparse, QdrantVectorStore
from qdrant_client import QdrantClient, models
import hashlib

def build_knowledge_base(docs_path: str, collection_name: str):
    """Build a knowledge base with proper chunking and metadata."""
    
    # Load documents - use appropriate loaders per file type
    loader = DirectoryLoader(
        docs_path,
        glob="**/*.md",
        show_progress=True,
        use_multithreading=True,
    )
    documents = loader.load()
    
    # Semantic-aware chunking
    # Critical: chunk by headers for docs, not by arbitrary token count
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=100,
        separators=["\n## ", "\n### ", "\n\n", "\n", " "],
    )
    chunks = splitter.split_documents(documents)
    
    # Enrich metadata - this is what enables filtering later
    for chunk in chunks:
        # Generate a stable ID for deduplication
        chunk.metadata["content_hash"] = hashlib.md5(
            chunk.page_content.encode()
        ).hexdigest()
        
        # Extract product area from the file path
        # e.g., docs/billing/refunds.md → "billing"
        path_parts = chunk.metadata.get("source", "").split("/")
        chunk.metadata["product_area"] = (
            path_parts[-2] if len(path_parts) > 1 else "general"
        )
        
        # Tag document type (classify_doc_type is a small helper you define,
        # e.g. keyword rules mapping content to "how-to" / "policy" / "faq")
        chunk.metadata["doc_type"] = classify_doc_type(chunk.page_content)
    
    # Set up Qdrant with hybrid search
    client = QdrantClient(host="localhost", port=6333)
    
    # Create collection with both dense and sparse vectors
    client.create_collection(
        collection_name=collection_name,
        vectors_config={
            "dense": models.VectorParams(
                size=1024,  # Cohere embed-v3 dimension
                distance=models.Distance.COSINE,
            )
        },
        sparse_vectors_config={
            "sparse": models.SparseVectorParams(
                index=models.SparseIndexParams(on_disk=False)
            )
        },
    )
    
    return QdrantVectorStore.from_documents(
        documents=chunks,
        embedding=CohereEmbeddings(model="embed-english-v3.0"),
        # Hybrid mode requires a sparse embedding alongside the dense one
        sparse_embedding=FastEmbedSparse(model_name="Qdrant/bm25"),
        collection_name=collection_name,
        retrieval_mode="hybrid",  # This is the key
        vector_name="dense",
        sparse_vector_name="sparse",
    )

Why Hybrid Search Matters

Pure vector search struggles with exact terminology. When a customer asks about "error code 4217," vector similarity will return semantically similar content — maybe error handling docs in general. But the customer needs the specific error code page.

Hybrid search combines dense vectors (semantic understanding) with sparse vectors (BM25-style keyword matching). In benchmarks on support content, I've seen hybrid search improve retrieval accuracy by 15-25% over pure vector search, especially for:

  • Error codes and specific identifiers
  • Product names and SKUs
  • Policy references (e.g., "30-day return policy")

# Retrieval with metadata filtering and reranking
from langchain_cohere import CohereRerank
from qdrant_client import models

def retrieve_context(
    query: str,
    vector_store: QdrantVectorStore,
    product_area: str | None = None,
    top_k: int = 10,
    rerank_top_n: int = 3,
) -> list[str]:
    """Retrieve and rerank knowledge base context."""
    
    # Build filter if product area is known
    search_kwargs = {"k": top_k}
    if product_area:
        search_kwargs["filter"] = models.Filter(
            must=[
                models.FieldCondition(
                    key="metadata.product_area",
                    match=models.MatchValue(value=product_area),
                )
            ]
        )
    
    # Retrieve candidates
    retriever = vector_store.as_retriever(search_kwargs=search_kwargs)
    candidates = retriever.invoke(query)
    
    # Rerank with Cohere — this is where the magic happens
    # Reranking is cheap and dramatically improves precision
    reranker = CohereRerank(model="rerank-english-v3.0", top_n=rerank_top_n)
    reranked = reranker.rerank(
        query=query,
        documents=[c.page_content for c in candidates],
        top_n=rerank_top_n,
    )
    
    return [candidates[r["index"]].page_content for r in reranked]

The reranking step is non-negotiable. Retrieval is recall-oriented (find everything that might be relevant). Reranking is precision-oriented (rank the most relevant first). Skipping it means your LLM gets noisy context and produces worse answers.
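To make the recall-vs-precision split tangible, here's a toy version of the two stages. The keyword-overlap scorer is a deliberately crude stand-in for a real cross-encoder reranker like Cohere's — the point is the shape of the pipeline, not the scoring function:

```python
def keyword_overlap_score(query: str, doc: str) -> float:
    """Toy precision score: fraction of query terms present in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    # Retrieval already cast a wide net (recall); reranking reorders
    # the candidates so the most relevant land on top (precision).
    scored = sorted(
        candidates,
        key=lambda d: keyword_overlap_score(query, d),
        reverse=True,
    )
    return scored[:top_n]
```

With this split, "error code 4217" content beats generic error-handling docs because the exact identifier matches, which is precisely what pure vector similarity tends to miss.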


Part 2: Tool Integration

A support agent that can only answer questions from documentation is useful but limited. The real value comes from integrating with your actual systems — checking order status, issuing refunds, updating accounts.

Tool Design Principles

  1. Tools should be atomic and predictable. One tool = one action. Don't build check_order_and_update_shipping — build get_order and update_shipping_address separately.
  2. Tools should return structured data, not prose. The LLM will generate the prose.
  3. Tools should fail gracefully. Every tool needs error handling that the agent can reason about.

Building the Tool Layer

from langchain_core.tools import tool
from pydantic import BaseModel, Field
from typing import Optional
import httpx

# Define strict input schemas - this prevents the LLM from
# hallucinating parameter names
class OrderLookupInput(BaseModel):
    order_id: str = Field(description="The order ID, e.g., ORD-12345")
    email: Optional[str] = Field(
        default=None,
        description="Customer email for verification"
    )

@tool("lookup_order", args_schema=OrderLookupInput)
async def lookup_order(order_id: str, email: str = None) -> dict:
    """Look up order details including status, tracking, and items.
    Use this when a customer asks about their order, shipping status,
    or wants to make changes to an existing order."""
    
    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(
                f"https://api.internal.example.com/orders/{order_id}",
                # get_service_token() is your own service-auth helper
                headers={"Authorization": f"Bearer {get_service_token()}"},
                timeout=5.0,
            )
            
            if response.status_code == 404:
                return {
                    "error": "order_not_found",
                    "message": f"No order found with ID {order_id}. "
                               "Please verify the order ID.",
                }
            
            response.raise_for_status()
            order = response.json()
            
            # Verify email if provided (prevents social engineering)
            if email and order["customer_email"].lower() != email.lower():
                return {
                    "error": "verification_failed",
                    "message": "The email provided doesn't match this order.",
                }
            
            # Return structured data, not a formatted string
            return {
                "order_id": order["id"],
                "status": order["status"],
                "items": order["items"],
                "shipping": {
                    "carrier": order["shipping"]["carrier"],
                    "tracking_number": order["shipping"]["tracking"],
                    "estimated_delivery": order["shipping"]["eta"],
                },
                "total": order["total"],
                "can_cancel": order["status"] in ["pending", "processing"],
                "can_return": order["status"] == "delivered",
            }
            
        except httpx.TimeoutException:
            return {
                "error": "service_unavailable",
                "message": "Order system is temporarily unavailable.",
            }
        except httpx.HTTPError:
            # Catch-all for transport/HTTP errors so the agent receives a
            # structured failure it can reason about (principle 3)
            return {
                "error": "lookup_failed",
                "message": "Couldn't retrieve this order. Please try again shortly.",
            }


class RefundInput(BaseModel):
    order_id: str = Field(description="The order ID to refund")
    reason: str = Field(description="Reason for the refund")
    amount: Optional[float] = Field(
        default=None,
        description="Partial refund amount. None means full refund."
    )
    line_items: Optional[list[str]] = Field(
        default=None,
        description="Specific line item IDs to refund. None means all."
    )

@tool("process_refund", args_schema=RefundInput)
async def process_refund(
    order_id: str,
    reason: str,
    amount: Optional[float] = None,
    line_items: Optional[list[str]] = None,
) -> dict:
    """Process a refund for an order. Can handle full or partial refunds.
    IMPORTANT: Only use this tool AFTER confirming the refund details
    with the customer. Never process a refund without explicit confirmation."""
    
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.internal.example.com/refunds",
            json={
                "order_id": order_id,
                "reason": reason,
                "amount": amount,
                "line_items": line_items,
            },
            headers={"Authorization": f"Bearer {get_service_token()}"},
            timeout=10.0,
        )
        
        if response.status_code == 422:
            return {
                "error": "refund_not_allowed",
                "message": response.json()["detail"],
            }
        
        response.raise_for_status()
        refund = response.json()
        
        return {
            "refund_id": refund["id"],
            "amount": refund["amount"],
            "status": refund["status"],
            "estimated_processing_days": refund["processing_days"],
        }

The System Prompt That Ties It Together

Tools alone aren't enough. The system prompt needs to establish clear behavioral boundaries:

SUPPORT_AGENT_SYSTEM_PROMPT = """You are a customer support agent for ExampleCo.

## Core Rules
1. NEVER fabricate information. If you don't know, say so.
2. NEVER process refunds or account changes without explicit customer confirmation.
3. ALWAYS verify customer identity before accessing account details.
4. If a customer is frustrated, acknowledge their frustration before solving the problem.
5. Keep responses concise. Customers want answers, not essays.

## Knowledge
You have access to our knowledge base covering products, policies, and procedures.
Only answer based on retrieved context. If the context doesn't contain the answer,
say: "I don't have that information in my system. Let me connect you with someone
who can help."

## Tools
You have access to tools for looking up orders, processing refunds, and managing
accounts. Always explain what you're doing before using a tool that modifies data.

## Escalation
Escalate to a human agent when:
- The customer explicitly requests a human
- The issue involves legal matters, fraud, or account security
- You've failed to resolve the issue after 2 attempts
- The customer has been waiting more than 5 minutes in this conversation
- The issue requires access to systems you don't have tools for

When escalating, summarize the issue and what you've already tried.
"""

Part 3: Escalation Logic

This is where most support agents fail. Bad escalation means either (a) the bot keeps the customer trapped in a loop when it can't help, or (b) it gives up and escalates too early, defeating the purpose of automation.

Multi-Signal Escalation Detection

Don't rely on a single signal. Combine multiple indicators:

from dataclasses import dataclass, field
from enum import Enum
from langchain_openai import ChatOpenAI

class EscalationSignal(Enum):
    EXPLICIT_REQUEST = "explicit_request"       # "let me talk to a human"
    NEGATIVE_SENTIMENT = "negative_sentiment"    # frustration detected
    REPETITION_LOOP = "repetition_loop"          # same issue, multiple attempts
    OUT_OF_SCOPE = "out_of_scope"               # no tools or knowledge available
    HIGH_STAKES = "high_stakes"                  # fraud, legal, safety
    TIMEOUT = "timeout"                          # conversation too long

@dataclass
class EscalationState:
    attempts: int = 0
    sentiment_scores: list[float] = field(default_factory=list)
    tools_used: list[str] = field(default_factory=list)
    knowledge_retrieved: bool = False
    explicit_request: bool = False
    high_stakes_detected: bool = False
    
    def should_escalate(self) -> tuple[bool, str | None]:
        """Determine if we should escalate and why."""
        
        # Hard rules — always escalate
        if self.explicit_request:
            return True, EscalationSignal.EXPLICIT_REQUEST.value
        
        if self.high_stakes_detected:
            return True, EscalationSignal.HIGH_STAKES.value
        
        # Soft rules — escalate based on patterns
        if self.attempts >= 3:
            return True, EscalationSignal.REPETITION_LOOP.value
        
        if (
            len(self.sentiment_scores) >= 2
            and all(s < 0.3 for s in self.sentiment_scores[-2:])
        ):
            return True, EscalationSignal.NEGATIVE_SENTIMENT.value
        
        if not self.knowledge_retrieved and not self.tools_used:
            return True, EscalationSignal.OUT_OF_SCOPE.value
        
        return False, None


async def detect_escalation_signals(
    message: str,
    conversation_history: list[dict],
    state: EscalationState,
) -> EscalationState:
    """Update escalation state based on the latest message."""
    
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    
    # Check for explicit escalation request
    # Use a small, fast model for classification tasks
    classification = await llm.ainvoke([
        {"role": "system", "content": """Classify this customer message. 
         Respond with ONLY one of:
         - ESCALATE: customer explicitly wants a human agent
         - CONTINUE: customer is asking a question or describing an issue
         - HIGH_STAKES: involves fraud, legal action, safety concern, or data breach
         
         Examples of ESCALATE: "speak to a manager", "I want a real person", 
         "transfer me to an agent"
         """},
        {"role": "user", "content": message},
    ])
    
    result = classification.content.strip().upper()
    if result == "ESCALATE":
        state.explicit_request = True
    elif result == "HIGH_STAKES":
        state.high_stakes_detected = True
    
    # Sentiment analysis — lightweight, runs on every message
    sentiment = await llm.ainvoke([
        {"role": "system", "content": """Rate the sentiment of this message 
         from 0.0 (extremely negative/frustrated) to 1.0 (very positive/happy).
         Respond with ONLY a number."""},
        {"role": "user", "content": message},
    ])
    
    try:
        state.sentiment_scores.append(float(sentiment.content.strip()))
    except ValueError:
        pass  # Don't crash on malformed LLM output
    
    return state

The Escalation Handoff

When you escalate, don't just dump the customer into a queue. Provide context:

async def escalate_to_human(
    conversation_id: str,
    state: EscalationState,
    reason: str,
    conversation_history: list[dict],
) -> dict:
    """Escalate to a human agent with full context."""
    
    # Generate a summary for the human agent
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    
    summary = await llm.ainvoke([
        {"role": "system", "content": """Summarize this support conversation 
         for a human agent taking over. Include:
         1. The customer's issue in one sentence
         2. What the bot has already tried
         3. Key details (order numbers, account info, etc.)
         4. The customer's emotional state
         Keep it under 150 words."""},
        {"role": "user", "content": str(conversation_history)},
    ])
    
    # Create handoff ticket in your helpdesk
    async with httpx.AsyncClient() as client:
        ticket = await client.post(
            "https://api.internal.example.com/tickets",
            json={
                "conversation_id": conversation_id,
                "priority": "high" if state.high_stakes_detected else "normal",
                "escalation_reason": reason,
                "bot_summary": summary.content,
                "full_history": conversation_history,
                "customer_sentiment_trend": state.sentiment_scores,
                "attempted_resolutions": state.attempts,
            },
            headers={"Authorization": f"Bearer {get_service_token()}"},
        )
    
    return {
        "message": "I'm connecting you with a support specialist who can help "
                   "with this. They'll have the full context of our conversation, "
                   "so you won't need to repeat yourself.",
        "ticket_id": ticket.json()["id"],
        "estimated_wait": ticket.json()["estimated_wait_minutes"],
    }

Part 4: Quality Metrics and Monitoring

You can't improve what you don't measure. Here are the metrics that actually matter for support agents.

The Metrics That Matter

| Metric | What It Measures | Target | How to Measure |
|---|---|---|---|
| Resolution Rate | % of conversations resolved without human escalation | 60-75% | Track escalation outcomes |
| Accuracy Rate | % of answers that are factually correct | >95% | LLM-as-judge + human sampling |
| Customer Satisfaction (CSAT) | Post-interaction survey score | >4.0/5.0 | Automated survey after resolution |
| Escalation Quality | Did escalation include proper context? | >90% | Audit escalation handoffs |
| Hallucination Rate | % of responses containing fabricated info | <2% | RAG faithfulness checks |
| Average Handle Time | Time from first message to resolution | <3 min | Timestamp tracking |
| Tool Success Rate | % of tool calls that return valid data | >98% | Error logging |
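Most of these metrics reduce to simple aggregations over your interaction logs. A hedged sketch — the log schema here (`escalated`, `quality`, `csat` keys) is hypothetical, so adapt it to whatever your logging layer actually emits:

```python
def support_metrics(conversations: list[dict]) -> dict:
    """Compute headline support-agent metrics from logged conversations.

    Assumed (illustrative) log schema per conversation:
      escalated: bool, quality: {"faithfulness": float}, csat: float | None
    """
    total = len(conversations)
    if total == 0:
        return {}
    resolved = sum(1 for c in conversations if not c["escalated"])
    # Count a response as hallucinated if it failed the faithfulness bar
    hallucinated = sum(
        1 for c in conversations
        if c.get("quality", {}).get("faithfulness", 1.0) < 0.8
    )
    csat = [c["csat"] for c in conversations if c.get("csat") is not None]
    return {
        "resolution_rate": resolved / total,
        "hallucination_rate": hallucinated / total,
        "avg_csat": sum(csat) / len(csat) if csat else None,
    }
```

Run this nightly over the previous day's logs and you have the top row of the dashboard with no extra infrastructure.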

Implementing a Quality Gate

Every response should pass through a quality check before reaching the customer:

import json

from pydantic import BaseModel

class QualityCheckResult(BaseModel):
    passed: bool
    faithfulness_score: float      # 0-1: how grounded in retrieved context
    relevance_score: float         # 0-1: how relevant to the question
    safety_score: float            # 0-1: no harmful content
    issues: list[str]              # specific problems found

async def quality_gate(
    question: str,
    response: str,
    retrieved_context: list[str],
    llm: ChatOpenAI,
) -> QualityCheckResult:
    """Check response quality before sending to customer."""
    
    context_text = "\n---\n".join(retrieved_context)
    
    evaluation = await llm.ainvoke([
        {"role": "system", "content": """You are a quality assurance evaluator 
         for customer support responses. Evaluate the response against these criteria:
         
         1. FAITHFULNESS (0-1): Is every claim in the response supported by the 
            provided context? Score 0 if any claim is fabricated.
         2. RELEVANCE (0-1): Does the response directly address the customer's 
            question? Score 0 if it's off-topic.
         3. SAFETY (0-1): Is the response safe? Score 0 if it contains harmful 
            advice, exposes internal systems, or shares other customers' data.
         
         Respond in this exact JSON format:
         {
           "faithfulness": 0.0-1.0,
           "relevance": 0.0-1.0,
           "safety": 0.0-1.0,
           "issues": ["list of specific problems"]
         }"""},
        {"role": "user", "content": f"""Customer question: {question}
         
         Retrieved context:
         {context_text}
         
         Agent response:
         {response}"""},
    ])
    
    result = json.loads(evaluation.content)
    
    return QualityCheckResult(
        passed=(
            result["faithfulness"] >= 0.8
            and result["relevance"] >= 0.7
            and result["safety"] >= 0.95
        ),
        faithfulness_score=result["faithfulness"],
        relevance_score=result["relevance"],
        safety_score=result["safety"],
        issues=result["issues"],
    )

The Feedback Loop

Metrics are useless without a feedback loop. Here's a practical system:

async def handle_with_quality_loop(
    message: str,
    conversation_id: str,
    state: ConversationState,  # your app's container for history, stores, LLM
) -> str:
    """Full support pipeline with quality checking."""
    """Full support pipeline with quality checking."""
    
    # 1. Check escalation signals
    state.escalation = await detect_escalation_signals(
        message, state.history, state.escalation
    )
    should_esc, reason = state.escalation.should_escalate()
    if should_esc:
        handoff = await escalate_to_human(
            conversation_id, state.escalation, reason, state.history
        )
        return handoff["message"]
    
    # 2. Retrieve context
    context = retrieve_context(
        message, state.vector_store, product_area=state.detected_product_area
    )
    if context:
        state.escalation.knowledge_retrieved = True
    
    # 3. Generate response
    response = await generate_response(message, context, state)
    
    # 4. Quality gate
    quality = await quality_gate(message, response, context, state.llm)
    
    if not quality.passed:
        # Log the failure for analysis
        await log_quality_failure(
            conversation_id=conversation_id,
            question=message,
            response=response,
            quality_result=quality,
        )
        
        # If faithfulness failed, we might be hallucinating
        if quality.faithfulness_score < 0.5:
            return ("I want to make sure I give you accurate information. "
                    "Let me connect you with a specialist who can help.")
        
        # If relevance failed, try rephrasing the query
        if quality.relevance_score < 0.5:
            state.escalation.attempts += 1
            return ("I may have misunderstood your question. "
                    "Could you tell me more about what you're looking for?")
    
    # 5. Log success for monitoring
    await log_interaction(
        conversation_id=conversation_id,
        message=message,
        response=response,
        context_used=context,
        quality=quality,
    )
    
    return response

Dashboard Metrics to Track

Set up a real-time dashboard (we use Grafana, but anything works):

┌─────────────────────────────────────────────────────────┐
│  Support Agent Dashboard                                │
├──────────────┬──────────────┬───────────────────────────┤
│  Resolution  │  CSAT Score  │  Active Conversations     │
│    68.3%     │    4.2/5.0   │         142               │
│  ▲ +2.1%     │  ▲ +0.1      │                           │
├──────────────┴──────────────┴───────────────────────────┤
│  Escalation Reasons (last 24h)                          │
│  ████████████████░░░░  Explicit request    42%          │
│  ██████████░░░░░░░░░░  Repetition loop     28%          │
│  █████░░░░░░░░░░░░░░░  Out of scope        15%          │
│  ████░░░░░░░░░░░░░░░░  Negative sentiment  10%          │
│  ██░░░░░░░░░░░░░░░░░░  High stakes          5%          │
├─────────────────────────────────────────────────────────┤
│  Hallucination Rate: 1.2%  │  Tool Success Rate: 98.7% │
│  Avg Handle Time: 2m 34s   │  Knowledge Coverage: 84%  │
└─────────────────────────────────────────────────────────┘

Putting It All Together

Here's the main orchestration loop:

from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

def create_support_agent(vector_store, tools):
    """Create the full support agent."""
    
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0.1,  # Low but not zero — some variation feels natural
        streaming=True,    # Important for UX — show responses as they generate
    )
    
    prompt = ChatPromptTemplate.from_messages([
        ("system", SUPPORT_AGENT_SYSTEM_PROMPT),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ])
    
    agent = create_openai_tools_agent(llm, tools, prompt)
    
    return AgentExecutor(
        agent=agent,
        tools=tools,
        max_iterations=5,       # Prevent infinite tool loops
        handle_parsing_errors=True,
        return_intermediate_steps=True,  # For debugging and monitoring
    )

What to Expect

With this architecture, here's realistic performance after tuning:

  • Week 1-2: 40-50% resolution rate. Lots of edge cases. Focus on knowledge base gaps.
  • Week 3-4: 55-65% resolution rate. Tool integrations stabilize. Escalation logic improves.
  • Month 2+: 65-75% resolution rate. Feedback loop catches remaining issues.

The 75% ceiling is real. Beyond that, you're dealing with complex, multi-step issues that genuinely need human judgment. Don't chase 90% automation — it'll make the experience worse for everyone.


Honest Limitations

I'd be doing you a disservice if I didn't mention what's hard:

  1. Latency. RAG retrieval + quality gate + tool calls can add 3-8 seconds per response. Streaming helps perception, but the pipeline is inherently slower than a simple chatbot. Budget for this.

  2. Cost. A quality-checked, tool-using agent costs roughly $0.02-0.08 per conversation with GPT-4o. At scale (100k conversations/month), that's $2,000-8,000. Use GPT-4o-mini for classification and sentiment, reserve GPT-4o for response generation.

  3. Knowledge base maintenance. Your docs will go stale. Build automated pipelines to flag outdated content. The best RAG system in the world can't help if your return policy page still says "30 days" when you changed it to "14 days" last month.

  4. Edge cases. Customers will say things like "my order is broken" (product issue? shipping damage? metaphorical complaint about your company?). Plan for ambiguity — don't assume the LLM will always classify correctly.

  5. Evaluation is expensive. The LLM-as-judge quality gate adds cost and latency. For high-volume deployments, run it on a sample (10-20%) rather than every response.
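Point 3 above — flagging stale content — doesn't need anything fancy: a scheduled job comparing last-modified dates against a freshness budget goes a long way. A sketch (the `.md` filter and 90-day budget are illustrative defaults):

```python
import os
import time

def find_stale_docs(docs_path: str, max_age_days: int = 90) -> list[str]:
    """Return doc files whose last modification exceeds the freshness budget.

    Uses filesystem mtime as the staleness signal; if your docs live in git,
    the last-commit timestamp per file is a more reliable source.
    """
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for root, _dirs, files in os.walk(docs_path):
        for name in files:
            if not name.endswith(".md"):
                continue
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Pipe the output into your ticketing system or a Slack channel weekly, and stale pages become someone's explicit to-do instead of a silent accuracy leak.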


Summary

A high-performing support agent is a system, not a prompt. The key components:

  • Knowledge base with hybrid search, reranking, and metadata filtering
  • Tool layer with atomic operations, strict schemas, and graceful error handling
  • Escalation logic using multiple signals, not just keyword matching
  • Quality gates that catch hallucinations and irrelevant responses before they reach customers
  • Feedback loops that turn every conversation into training data

Build each layer independently. Test each layer independently. Ship incrementally — a bot that handles "where's my order?" perfectly is more valuable than one that handles everything poorly.

Keywords

AI agent · conversational-agents