Building a Research Agent with LangGraph: From Question to Report
Emma Liu
Tech journalist covering the AI agent ecosystem and startups.
What We're Actually Building
A research agent that takes a natural language question, decomposes it into sub-queries, searches the web and academic databases in parallel, synthesizes findings, and produces a structured report with inline citations. Not a toy demo — something that handles multi-hop research with proper source tracking.
The agent uses a directed graph with conditional routing, parallel execution branches, and a shared state that accumulates sources throughout the process. Here's the architecture:
        ┌─────────────┐
        │  Decompose  │
        │    Query    │
        └──────┬──────┘
               │
      ┌────────┴────────┐
┌─────┴─────┐     ┌─────┴─────┐
│    Web    │     │   Paper   │
│  Search   │     │  Search   │
└─────┬─────┘     └─────┬─────┘
      │                 │
      └────────┬────────┘
        ┌──────┴──────┐
        │  Synthesize │
        └──────┬──────┘
               │
        ┌──────┴──────┐
        │  Generate   │
        │   Report    │
        └─────────────┘
Dependencies and Setup
pip install langgraph langchain-openai langchain-community tavily-python arxiv
You'll need API keys for:
- OpenAI (or any LangChain-compatible LLM) — GPT-4o works best for synthesis tasks
- Tavily — purpose-built search API for AI agents (free tier: 1000 queries/month)
- ArXiv — free, no key needed, but rate-limited
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["TAVILY_API_KEY"] = "tvly-..."
Defining the Agent State
The state is the backbone of any LangGraph agent. Every node reads from it and writes to it. Getting this right determines whether your agent works or turns into spaghetti.
import operator
from typing import Annotated, TypedDict, Literal
# langchain_core.pydantic_v1 is deprecated; recent LangChain accepts plain Pydantic v2 models
from pydantic import BaseModel, Field
class SubQuery(BaseModel):
    """A decomposed sub-question with its rationale."""
    question: str = Field(description="The specific sub-question")
    rationale: str = Field(description="Why this sub-question matters")
    search_type: Literal["web", "paper", "both"] = Field(
        description="Which search strategy to use"
    )

class Source(BaseModel):
    """A tracked source with content and metadata."""
    url: str = ""
    title: str = ""
    content: str = ""
    source_type: Literal["web", "paper"] = "web"
    relevance_score: float = 0.0
    citation_key: str = ""  # e.g., "Smith2024" or "[1]"

class Finding(BaseModel):
    """A synthesized finding from one or more sources."""
    claim: str
    evidence: list[str]
    source_keys: list[str]
    confidence: Literal["high", "medium", "low"]

class ResearchState(TypedDict):
    # Input
    research_question: str

    # Decomposition
    sub_queries: list[SubQuery]

    # Accumulated sources — using Annotated with operator.add
    # so multiple nodes can append without overwriting each other
    web_sources: Annotated[list[Source], operator.add]
    paper_sources: Annotated[list[Source], operator.add]

    # Synthesis
    findings: list[Finding]
    contradictions: list[str]
    gaps: list[str]

    # Output
    report: str
    bibliography: list[Source]

    # Control flow
    iteration: int
    max_iterations: int
The Annotated[list[Source], operator.add] pattern is critical. It tells LangGraph that when multiple nodes write to web_sources concurrently, their outputs should be appended rather than one overwriting the other. This is what makes parallel search branches work.
Node 1: Query Decomposition
The first node breaks a research question into targeted sub-queries. This matters more than people think — a question like "What are the impacts of large language models on software engineering?" has at least four distinct angles that need separate searches.
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
llm = ChatOpenAI(model="gpt-4o", temperature=0)
def decompose_query(state: ResearchState) -> dict:
    """Break the research question into sub-queries for targeted search."""
    question = state["research_question"]
    decomposition_prompt = f"""You are a research planning expert. Given the following research question,
decompose it into 3-6 specific sub-questions that, when answered together,
will provide a comprehensive response.
Research Question: {question}
For each sub-question, specify:
1. The precise question to search for
2. Why it's necessary for answering the main question
3. Whether it's best answered by web search (recent news, practical info,
industry data), academic papers (peer-reviewed research, methodologies,
theoretical frameworks), or both
Focus on sub-questions that are:
- Specific enough to get useful search results
- Different enough to cover distinct aspects
- Ordered from foundational to advanced"""
    # Structured output bound to a single SubQuery would return one object,
    # so wrap the schema in a list container and bind that instead
    class SubQueryList(BaseModel):
        queries: list[SubQuery]

    list_structured = llm.with_structured_output(SubQueryList)
    result = list_structured.invoke([
        SystemMessage(content="You decompose research questions into searchable sub-queries."),
        HumanMessage(content=decomposition_prompt)
    ])
    return {
        "sub_queries": result.queries,
        "iteration": state.get("iteration", 0) + 1
    }
Honest caveat: LLM-based query decomposition is probabilistic. Run the same question twice and you'll get slightly different sub-queries. For production use, add a validation step that checks whether the sub-queries collectively cover the original question. I've seen cases where the LLM fixates on one aspect and ignores others — especially with broad questions.
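A minimal sketch of such a validation node, assuming a hypothetical CoverageCheck schema and reusing the llm defined above; it asks whether the planned sub-queries cover the original question and records anything missing as gaps for a later iteration to target:
from pydantic import BaseModel

class CoverageCheck(BaseModel):
    """Hypothetical schema: does the decomposition cover the question?"""
    covered: bool
    missing_aspects: list[str]

def validate_decomposition(state: ResearchState) -> dict:
    """Ask the model whether the sub-queries collectively span the original question."""
    checker = llm.with_structured_output(CoverageCheck)
    sub_qs = "\n".join(f"- {sq.question}" for sq in state["sub_queries"])
    check = checker.invoke(
        f"Research question: {state['research_question']}\n"
        f"Planned sub-queries:\n{sub_qs}\n"
        "Do these sub-queries collectively cover the question? List any aspects they miss."
    )
    # Treat uncovered aspects as gaps so the iterative loop can pick them up
    return {"gaps": check.missing_aspects} if not check.covered else {}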
Node 2: Web Search
from tavily import TavilyClient
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
def web_search(state: ResearchState) -> dict:
    """Search the web for each sub-query marked for web search."""
    sources = []
    # Web and paper search run in parallel, so both branches start from an
    # empty state; a shared numeric counter would produce colliding keys.
    # Prefix web citations with "W" to keep them unambiguous.
    citation_counter = 0
    for sq in state["sub_queries"]:
        if sq.search_type not in ("web", "both"):
            continue
        try:
            # Tavily returns structured results with content snippets
            results = tavily.search(
                query=sq.question,
                search_depth="advanced",  # Gets more content per result
                max_results=5,
                include_raw_content=False,
            )
            for result in results.get("results", []):
                citation_counter += 1
                source = Source(
                    url=result.get("url", ""),
                    title=result.get("title", ""),
                    content=result.get("content", ""),
                    source_type="web",
                    relevance_score=result.get("score", 0.0),
                    citation_key=f"[W{citation_counter}]"
                )
                sources.append(source)
        except Exception as e:
            # Don't let one failed search kill the whole agent
            print(f"Search failed for '{sq.question}': {e}")
            continue
    return {"web_sources": sources}
Why Tavily over SerpAPI or Google? Tavily returns extracted content snippets, not just links and descriptions. For an agent, you need the actual text to synthesize — not a list of URLs that would require a second scraping step. That said, Tavily's content extraction isn't perfect. For production, you'd want to add a content extraction step using something like trafilatura or readability-lxml on the raw pages.
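If you add that step, the extraction itself is only a few lines. A sketch using trafilatura's fetch_url and extract functions to pull the main text of a page (the fetch_full_text helper name is mine):
import trafilatura

def fetch_full_text(url: str) -> str:
    """Download a page and extract the main article text, stripping boilerplate."""
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        return ""
    return trafilatura.extract(downloaded) or ""

# e.g., enrich a web source after the Tavily pass:
# source.content = fetch_full_text(source.url) or source.content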
Node 3: Academic Paper Search
import arxiv
def paper_search(state: ResearchState) -> dict:
    """Search arXiv for academic papers related to sub-queries."""
    sources = []
    # Prefixed keys ("P") keep paper citations distinct from the parallel
    # web-search branch, which numbers its own sources independently.
    citation_counter = 0
    client = arxiv.Client()
    for sq in state["sub_queries"]:
        if sq.search_type not in ("paper", "both"):
            continue
        try:
            search = arxiv.Search(
                query=sq.question,
                max_results=5,
                sort_by=arxiv.SortCriterion.Relevance,
                sort_order=arxiv.SortOrder.Descending,
            )
            for result in client.results(search):
                citation_counter += 1
                # Extract the abstract as content
                # For full paper analysis, you'd download the PDF
                # and extract text — but that's a separate pipeline
                content = f"""Title: {result.title}
Authors: {', '.join(a.name for a in result.authors)}
Published: {result.published.strftime('%Y-%m-%d')}
Abstract: {result.summary}
Categories: {', '.join(result.categories)}"""
                source = Source(
                    url=result.entry_id,
                    title=result.title,
                    content=content,
                    source_type="paper",
                    relevance_score=0.8,  # arXiv doesn't provide relevance scores
                    citation_key=f"[P{citation_counter}]"
                )
                sources.append(source)
        except Exception as e:
            print(f"Paper search failed for '{sq.question}': {e}")
            continue
    return {"paper_sources": sources}
Limitation worth noting: ArXiv covers computer science, physics, mathematics, and related fields well, but it's sparse for social sciences, medicine, or business. For a general-purpose research agent, you'd want to integrate Semantic Scholar's API (which has 200M+ papers across disciplines) or PubMed for biomedical research. The Semantic Scholar API is free and has better relevance ranking:
# Alternative: Semantic Scholar (pip install semanticscholar)
from semanticscholar import SemanticScholar
sch = SemanticScholar()
results = sch.search_paper(
    query=sq.question,
    limit=5,
    fields=["title", "abstract", "url", "year", "authors", "citationCount"]
)
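To feed those results into the same pipeline, you would map them onto the Source model. A sketch meant to slot into the paper_search loop, assuming each returned paper exposes the requested title, abstract, and url fields:
from itertools import islice

# Inside the sub-query loop, after sch.search_paper(...)
for paper in islice(results, 5):
    citation_counter += 1
    sources.append(Source(
        url=paper.url or "",
        title=paper.title or "",
        content=paper.abstract or "",
        source_type="paper",
        relevance_score=0.8,                    # no per-result score exposed here
        citation_key=f"[P{citation_counter}]",  # keep the paper-branch key prefix
    ))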
Node 4: Synthesis
This is where the agent earns its keep. Raw search results are useless — the synthesis node needs to extract claims, identify supporting evidence, detect contradictions, and assess confidence.
class SynthesisOutput(BaseModel):
    findings: list[Finding]
    contradictions: list[str] = Field(
        description="Claims where sources disagree"
    )
    gaps: list[str] = Field(
        description="Aspects of the question not covered by available sources"
    )

def synthesize(state: ResearchState) -> dict:
    """Analyze all sources and extract structured findings."""
    # Format all sources for the prompt
    all_sources = state["web_sources"] + state["paper_sources"]
    sources_text = ""
    for s in all_sources:
        sources_text += f"\n--- {s.citation_key} ({s.source_type}) ---\n"
        sources_text += f"Title: {s.title}\n"
        sources_text += f"URL: {s.url}\n"
        sources_text += f"Content: {s.content[:2000]}\n"  # Truncate for context window

    synthesis_prompt = f"""You are a research analyst. Given the following research question
and collected sources, produce a structured synthesis.
Research Question: {state['research_question']}
Sources:
{sources_text}
Instructions:
1. Extract 5-10 key findings that address the research question
2. For each finding, cite the specific sources that support it using their citation keys
3. Assess confidence based on:
- Number of corroborating sources
- Source quality (peer-reviewed > industry report > blog post)
- Recency of information
4. Identify any contradictions between sources
5. Note gaps — aspects of the question that the sources don't adequately address
Be rigorous. If sources disagree, present both sides. Don't manufacture
consensus that doesn't exist."""

    synthesis_llm = llm.with_structured_output(SynthesisOutput)
    result = synthesis_llm.invoke([
        SystemMessage(content="You synthesize research findings with rigorous citation practices."),
        HumanMessage(content=synthesis_prompt)
    ])
    return {
        "findings": result.findings,
        "contradictions": result.contradictions,
        "gaps": result.gaps,
    }
The [:2000] truncation on content is a pragmatic compromise. With 20-30 sources, you're looking at 40-60K characters of source text. Even with GPT-4o's 128K context window, you'll hit diminishing returns and increased hallucination with that much input. For a production system, you'd want a two-pass approach: first filter for relevance, then do deep synthesis on the top sources.
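A sketch of that first pass, using a cheaper model to grade relevance before the expensive synthesis call; the RelevanceGrade schema and filter_sources helper are illustrative, not part of the pipeline above:
from pydantic import BaseModel, Field

class RelevanceGrade(BaseModel):
    """Hypothetical schema for the cheap first-pass relevance check."""
    relevant: bool
    score: float = Field(description="0-1 relevance to the research question")

def filter_sources(question: str, sources: list[Source], keep: int = 12) -> list[Source]:
    """First pass: grade each source cheaply, keep only the best for deep synthesis."""
    grader = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(RelevanceGrade)
    graded = []
    for s in sources:
        grade = grader.invoke(
            f"Question: {question}\n\nSource excerpt:\n{s.content[:800]}\n\n"
            "Is this source relevant to the question, and how strongly?"
        )
        if grade.relevant:
            graded.append((grade.score, s))
    graded.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in graded[:keep]]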
Node 5: Report Generation
def generate_report(state: ResearchState) -> dict:
    """Generate a structured research report with citations."""
    all_sources = state["web_sources"] + state["paper_sources"]

    # Build bibliography: only include sources actually cited in findings
    cited_keys = set()
    for finding in state["findings"]:
        cited_keys.update(finding.source_keys)
    bibliography = [s for s in all_sources if s.citation_key in cited_keys]

    findings_text = ""
    for i, f in enumerate(state["findings"], 1):
        citations = ", ".join(f.source_keys)
        findings_text += f"\n{i}. {f.claim}\n"
        findings_text += f"   Evidence: {'; '.join(f.evidence)}\n"
        findings_text += f"   Sources: {citations}\n"
        findings_text += f"   Confidence: {f.confidence}\n"

    report_prompt = f"""You are writing a research report. Using the synthesized findings below,
produce a comprehensive, well-structured report.
Research Question: {state['research_question']}
Findings:
{findings_text}
Contradictions Found: {state['contradictions']}
Gaps in Research: {state['gaps']}
Report Structure:
1. **Executive Summary** (2-3 sentences answering the research question)
2. **Introduction** (context and scope)
3. **Key Findings** (organized thematically, not just a list)
4. **Analysis & Discussion** (connections between findings, implications)
5. **Contradictions & Limitations** (intellectual honesty)
6. **Conclusion** (direct answer with nuance)
7. **Bibliography** (formatted reference list)
Rules:
- Use the sources' citation keys (e.g., [W1], [P2]) inline throughout the text
- Every factual claim MUST have a citation
- When findings conflict, present the disagreement explicitly
- Write for a technical audience but explain jargon on first use
- Keep the report between 1500-3000 words"""

    report = llm.invoke([
        SystemMessage(content="You write rigorous research reports with proper citations."),
        HumanMessage(content=report_prompt)
    ])

    # Format bibliography
    bib_text = "\n\n## Bibliography\n\n"
    for source in bibliography:
        if source.source_type == "paper":
            bib_text += f"{source.citation_key} {source.title}. Available at: {source.url}\n\n"
        else:
            bib_text += f"{source.citation_key} \"{source.title}.\" Available at: {source.url}\n\n"

    return {
        "report": report.content + bib_text,
        "bibliography": bibliography,
    }
Assembling the Graph
Now we wire everything together. The key decision: web search and paper search run in parallel (fan-out), then synthesis waits for both to complete (fan-in).
from langgraph.graph import StateGraph, START, END
# Initialize the graph
graph = StateGraph(ResearchState)
# Add nodes
graph.add_node("decompose", decompose_query)
graph.add_node("web_search", web_search)
graph.add_node("paper_search", paper_search)
graph.add_node("synthesize", synthesize)
graph.add_node("generate_report", generate_report)
# Define edges
graph.add_edge(START, "decompose")
# Fan-out: after decomposition, search both sources in parallel
graph.add_edge("decompose", "web_search")
graph.add_edge("decompose", "paper_search")
# Fan-in: both search branches feed into synthesis
# LangGraph handles the merge automatically thanks to
# the Annotated[list, operator.add] reducer on state
graph.add_edge("web_search", "synthesize")
graph.add_edge("paper_search", "synthesize")
# Final steps
graph.add_edge("synthesize", "generate_report")
graph.add_edge("generate_report", END)
# Compile
research_agent = graph.compile()
How the fan-in works: When web_search and paper_search both complete, LangGraph merges their state updates. Because web_sources and paper_sources use the operator.add reducer, the lists get concatenated. The synthesize node then sees both sets of sources.
Running the Agent
# Optional: visualize the graph
from IPython.display import Image, display
display(Image(research_agent.get_graph().draw_mermaid_png()))
# Run it
result = research_agent.invoke({
    "research_question": "What are the current approaches to reducing "
                         "hallucinations in large language models, and "
                         "how effective are they?",
    "iteration": 0,
    "max_iterations": 1,
})
print(result["report"])
For streaming (useful for long-running research):
# Stream node outputs as they complete
for event in research_agent.stream({
    "research_question": "How do mixture-of-experts architectures "
                         "compare to dense transformers for code generation?",
    "iteration": 0,
    "max_iterations": 1,
}):
    for node_name, output in event.items():
        print(f"=== {node_name} completed ===")
        if "web_sources" in output:
            print(f"  Found {len(output['web_sources'])} web sources")
        if "paper_sources" in output:
            print(f"  Found {len(output['paper_sources'])} paper sources")
        if "findings" in output:
            print(f"  Extracted {len(output['findings'])} findings")
Adding Conditional Routing: Iterative Research
The basic graph does a single pass. Real research is iterative — you discover gaps, then search to fill them. Here's how to add a loop:
def should_continue_research(state: ResearchState) -> Literal["decompose", "generate_report"]:
    """Decide whether to do another search iteration or generate the report."""
    if state["iteration"] >= state["max_iterations"]:
        return "generate_report"
    # If synthesis found significant gaps, do another round
    if len(state.get("gaps", [])) >= 2:
        return "decompose"
    return "generate_report"

def refine_queries(state: ResearchState) -> dict:
    """Generate new sub-queries targeting identified gaps."""
    gap_prompt = f"""The previous research round found these gaps:
{chr(10).join('- ' + g for g in state['gaps'])}
Original question: {state['research_question']}
Generate 2-3 specific search queries to fill these gaps.
Only use 'both' search_type for queries that need both web and paper sources."""

    class SubQueryList(BaseModel):
        queries: list[SubQuery]

    list_structured = llm.with_structured_output(SubQueryList)
    result = list_structured.invoke([
        SystemMessage(content="Generate targeted queries to fill research gaps."),
        HumanMessage(content=gap_prompt)
    ])
    return {
        "sub_queries": result.queries,
        "iteration": state["iteration"] + 1,
    }
# Rebuild graph with loop
graph = StateGraph(ResearchState)
graph.add_node("decompose", decompose_query)
graph.add_node("refine_queries", refine_queries)
graph.add_node("web_search", web_search)
graph.add_node("paper_search", paper_search)
graph.add_node("synthesize", synthesize)
graph.add_node("generate_report", generate_report)
graph.add_edge(START, "decompose")
graph.add_edge("decompose", "web_search")
graph.add_edge("decompose", "paper_search")
graph.add_edge("web_search", "synthesize")
graph.add_edge("paper_search", "synthesize")
# Conditional: loop back or finish
graph.add_conditional_edges(
    "synthesize",
    should_continue_research,
    {
        "decompose": "refine_queries",  # Refine, then search again
        "generate_report": "generate_report",
    }
)
graph.add_edge("refine_queries", "web_search")
graph.add_edge("refine_queries", "paper_search")
graph.add_edge("generate_report", END)
research_agent = graph.compile()
Set max_iterations to 2-3 for most use cases. Beyond that, you're burning tokens on diminishing returns.
Adding Checkpointing for Resilience
Research agents hit API rate limits, timeouts, and network errors. LangGraph's checkpointing lets you resume from the last successful node:
from langgraph.checkpoint.memory import MemorySaver
# For development — in production, use SQLite or Postgres checkpointer
checkpointer = MemorySaver()
research_agent = graph.compile(checkpointer=checkpointer)
# Run with a thread ID for checkpointing
config = {"configurable": {"thread_id": "research-session-1"}}
result = research_agent.invoke({
    "research_question": "What is the current state of neuromorphic computing?",
    "iteration": 0,
    "max_iterations": 2,
}, config=config)
# If it fails partway through, you can resume:
# research_agent.invoke(None, config=config) # Picks up from last checkpoint
For production, swap MemorySaver for SqliteSaver or the Postgres checkpointer:
pip install langgraph-checkpoint-sqlite
from langgraph.checkpoint.sqlite import SqliteSaver

# Depending on your langgraph-checkpoint-sqlite version, from_conn_string
# may be a context manager; if so, use it in a `with` block instead
checkpointer = SqliteSaver.from_conn_string("research_checkpoints.db")
Practical Tips and Gotchas
Token budget management. A single research run with 30 sources and iterative search can easily consume 50-100K tokens. Track usage:
# Add to your LLM initialization
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    max_tokens=4096,  # Cap output length per node
)
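Capping output length doesn't tell you what a run actually consumed. For OpenAI models, one option is LangChain's get_openai_callback context manager wrapped around the whole invocation:
from langchain_community.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = research_agent.invoke({
        "research_question": "What is the current state of neuromorphic computing?",
        "iteration": 0,
        "max_iterations": 1,
    })

# Totals accumulated across every LLM call in the run
print(f"Prompt tokens:     {cb.prompt_tokens}")
print(f"Completion tokens: {cb.completion_tokens}")
print(f"Estimated cost:    ${cb.total_cost:.4f}")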
Citation verification. The LLM will occasionally cite sources that don't actually support the claim. For production, add a verification node that checks citations against source content:
def verify_citations(state: ResearchState) -> dict:
    """Spot-check that cited sources actually contain the claimed evidence."""
    all_sources = {s.citation_key: s for s in state["web_sources"] + state["paper_sources"]}
    verified_findings = []
    for finding in state["findings"]:
        verified_keys = []
        for key in finding.source_keys:
            source = all_sources.get(key)
            # _claim_in_source is left to you; see the sketch below
            if source and _claim_in_source(finding.claim, source.content):
                verified_keys.append(key)
        if verified_keys:  # Keep finding only if at least one source checks out
            finding.source_keys = verified_keys
            verified_findings.append(finding)
    return {"findings": verified_findings}
Rate limiting. ArXiv rate-limits at roughly 1 request per 3 seconds. Add delays:
import time

def paper_search(state: ResearchState) -> dict:
    sources = []
    for sq in state["sub_queries"]:
        # ... search logic ...
        time.sleep(3)  # Respect arXiv rate limits
    return {"paper_sources": sources}
What This Agent Doesn't Do Well
I want to be direct about limitations:
No PDF parsing. The arXiv integration only reads abstracts. Full paper analysis requires a PDF extraction pipeline (PyMuPDF, pdfplumber) and a chunking strategy. This is a significant gap for academic research.
No web page scraping. Tavily gives snippets, not full pages. For deep analysis, integrate a scraping tool or use Tavily's
include_raw_content=True(which is slower and costs more).Citation quality is mediocre. The LLM will sometimes cite sources tangentially. The verification node helps but doesn't eliminate the problem.
No fact-checking across sources. The synthesis node identifies contradictions in the prompt, but it's still trusting the LLM's reading comprehension. It won't catch subtle misrepresentations.
Context window pressure. With 30+ sources, you're pushing the limits of what fits in a single synthesis prompt. A map-reduce pattern (synthesize per-topic, then combine) scales better.
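For what it's worth, that map-reduce variant doesn't need much new machinery. A rough sketch that reuses synthesize() on chunks of sources and then asks the model to merge the partial findings; the chunking scheme and merge prompt are illustrative, not tuned:
def map_reduce_synthesize(state: ResearchState, chunk_size: int = 10) -> dict:
    """Sketch: synthesize sources in chunks, then merge the partial findings."""
    all_sources = state["web_sources"] + state["paper_sources"]
    partial_findings: list[Finding] = []
    for i in range(0, len(all_sources), chunk_size):
        chunk = all_sources[i:i + chunk_size]
        # Reuse the existing synthesize() prompt on just this chunk
        chunk_state = {**state, "web_sources": chunk, "paper_sources": []}
        partial_findings.extend(synthesize(chunk_state)["findings"])

    # Reduce step: de-duplicate and reconcile the partial findings in one pass
    reducer = llm.with_structured_output(SynthesisOutput)
    merged = reducer.invoke(
        f"Research question: {state['research_question']}\n"
        "Merge and de-duplicate these findings, keeping citation keys intact:\n"
        f"{[f.model_dump() for f in partial_findings]}"
    )
    return {
        "findings": merged.findings,
        "contradictions": merged.contradictions,
        "gaps": merged.gaps,
    }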
For a production research agent, you'd want to address each of these. But as a starting framework, this gives you a working pipeline with proper state management, parallel execution, and iterative refinement — the hard architectural problems that LangGraph is actually designed to solve.