
Building a Knowledge Graph with LangChain and CrewAI


Nina Kowalski

May 31, 2026 · 8 min read


Overview and Audience

This guide shows how to combine LangChain’s LLM‑centric tooling with CrewAI’s multi‑agent orchestration to construct a knowledge graph from unstructured text. The target audience includes data engineers, researchers, and developers who need to extract structured entities and relations from documents, then store them in a graph database for downstream tasks such as question answering, recommendation, or semantic search. Familiarity with Python, basic NLP concepts, and either a local Neo4j instance or an in‑memory graph library (e.g., NetworkX) is assumed.

Core Features and Capabilities

  • Document ingestion – LangChain provides loaders for PDFs, HTML, markdown, and plain text, plus splitters that respect semantic boundaries.
  • Entity and relation extraction – Using LLMs via LangChain’s LLMChain or StructuredOutputParser, you can prompt the model to emit JSON‑style triples (subject, predicate, object).
  • Agent‑based validation – CrewAI lets you define specialized agents (e.g., Extractor, Validator, Merger) that communicate through shared memory and iterate on results.
  • Graph construction – Extracted triples are fed into NetworkX or Neo4j, enabling visualisation, querying with Cypher, or exporting to RDF.
  • Iterative refinement – Agents can re‑run extraction on low‑confidence spans, request human‑in‑the‑loop feedback, or merge duplicate entities via similarity scoring.

These capabilities let you move from raw corpora to a queryable knowledge graph without writing custom parsers for each domain.

Architecture and How It Works

The pipeline consists of three layers:

  1. Ingestion Layer – LangChain’s DocumentLoader and TextSplitter turn source files into Document objects with metadata.
  2. Extraction Layer – A CrewAI crew runs three agents in sequence:
    • Extractor: Sends each chunk to an LLM with a prompt that returns a list of triples.
    • Validator: Checks each triple against a knowledge base (e.g., Wikidata via SPARQL) or runs a consistency check (no self‑loops, valid types).
    • Merger: Uses fuzzy string matching (e.g., rapidfuzz) to unify entities and aggregates duplicate predicates.
  3. Storage Layer – Validated triples are inserted into a graph. For quick prototyping, NetworkX is sufficient; for production, Neo4j offers ACID transactions and Cypher queries.
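
The Validator's consistency check can be sketched in plain Python before any knowledge-base lookup is added. This is a minimal illustration, assuming triples are dicts with the keys "subject", "predicate", and "object"; the function name `is_consistent` and the `ALLOWED_PREDICATES` set are hypothetical and not part of LangChain or CrewAI.

```python
REQUIRED_KEYS = ("subject", "predicate", "object")

# Hypothetical whitelist; replace with the relation types your domain allows.
ALLOWED_PREDICATES = {"treats", "causes", "located_in"}


def is_consistent(triple, allowed_predicates=ALLOWED_PREDICATES):
    """Return True if the triple is well formed, is not a self-loop, and uses a known predicate."""
    if not isinstance(triple, dict):
        return False
    values = [triple.get(key) for key in REQUIRED_KEYS]
    # Every field must be a non-empty string.
    if not all(isinstance(v, str) and v.strip() for v in values):
        return False
    subject, predicate, obj = (v.strip() for v in values)
    # Reject self-loops, ignoring case differences.
    if subject.lower() == obj.lower():
        return False
    return predicate.lower() in allowed_predicates


candidates = [
    {"subject": "Aspirin", "predicate": "treats", "object": "Headache"},
    {"subject": "Aspirin", "predicate": "treats", "object": "aspirin"},  # self-loop
    {"subject": "Paris", "predicate": "likes", "object": "France"},      # unknown predicate
    {"subject": "Paris", "object": "France"},                            # missing key
]
valid = [t for t in candidates if is_consistent(t)]
print(valid)
```

Keeping this check deterministic means the cheap structural rejections happen without spending LLM tokens; a Wikidata or gazetteer lookup can then run only on the triples that survive.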

Data flows through the crew in order: the Extractor's output becomes context for the Validator, which filters triples and adjusts confidence scores, and the Merger consolidates what remains before it is written to the graph. With CrewAI's default sequential process this is a single pass. Repeating it for a fixed number of iterations, or until the set of triples stops changing, takes an outer loop in your own code.
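
One way to read "until convergence" is a fixed-point loop: repeat the validate-and-merge pass until it returns exactly what it was given, with an iteration cap as a safety net. The sketch below assumes triples are represented as (subject, predicate, object) tuples so they can live in a set; `run_until_stable` and `toy_refine` are hypothetical names, and `toy_refine` merely stands in for a real crew run.

```python
def run_until_stable(triples, refine, max_iterations=5):
    """Apply `refine` until its output equals its input or the iteration cap is hit.

    `triples` is a set of (subject, predicate, object) tuples and `refine`
    maps such a set to a new set. Returns the final set and the number of passes run.
    """
    current = set(triples)
    for iteration in range(1, max_iterations + 1):
        refined = set(refine(current))
        if refined == current:
            return current, iteration
        current = refined
    return current, max_iterations


# Stand-in for one validate-and-merge pass (hypothetical):
# normalise whitespace and case, then drop self-loops.
def toy_refine(triples):
    cleaned = {(s.strip().lower(), p.strip().lower(), o.strip().lower()) for s, p, o in triples}
    return {(s, p, o) for s, p, o in cleaned if s != o}


start = {
    ("Aspirin", "treats", "Headache"),
    ("aspirin ", "treats", "headache"),
    ("Rome", "is", "rome"),
}
final, passes = run_until_stable(start, toy_refine)
print(final, passes)
```

The cap matters in practice because an LLM-backed pass is not guaranteed to be idempotent, so the set may never settle exactly.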

Example Prompt for Extraction

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import OpenAI

triple_prompt = PromptTemplate(
    input_variables=["chunk"],
    template="""
You are an information extraction assistant. Given the text below, extract all factual statements as triples in JSON format.
Each triple must have keys "subject", "predicate", "object". Return a JSON list.

Text:
{chunk}

Triples:
"""
)

llm = OpenAI(temperature=0)
extractor_chain = LLMChain(llm=llm, prompt=triple_prompt)

The chain’s output is parsed with json.loads and passed to the Validator agent.
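
Completions often arrive wrapped in extra prose, so calling json.loads on the raw string can fail even when the triples themselves are fine. The following is a tolerant parser sketched with the standard library only; `parse_triples` is a hypothetical helper, not a LangChain function, and it assumes the model was asked for a single JSON list.

```python
import json
import re

REQUIRED_KEYS = {"subject", "predicate", "object"}


def parse_triples(raw: str) -> list[dict]:
    """Pull the JSON list out of an LLM completion and keep only well-formed triples."""
    # Take everything from the first "[" to the last "]", ignoring surrounding prose.
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Drop items that are not dicts or that lack one of the three keys.
    return [
        item for item in data
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)
    ]


completion = (
    "Sure! Here are the triples:\n"
    '[{"subject": "Marie Curie", "predicate": "discovered", "object": "polonium"},\n'
    ' {"subject": "Marie Curie", "predicate": "born_in"}]\n'
    "Let me know if you need anything else."
)
parsed = parse_triples(completion)
print(parsed)
```

Returning an empty list on failure keeps the pipeline moving, though logging the rejected completion is worth adding so that prompt problems stay visible.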

Real-World Use Cases

  • Scientific literature mining – Extract gene‑disease interactions from PubMed abstracts to build a biomedical knowledge graph for hypothesis generation.
  • Enterprise knowledge base – Convert internal wikis and PDF manuals into a graph that powers an internal search engine, allowing employees to ask natural‑language questions and receive sourced answers.
  • Media monitoring – Process news articles to capture events (who did what, where, when) and feed a temporal graph for trend analysis.
  • E‑commerce product graph – Pull product attributes from vendor catalogs and reviews to enable faceted search and recommendation.

In each case, the combination of LangChain’s flexible document handling and CrewAI’s agent collaboration reduces the need for hand‑crafted rules while keeping the extraction process auditable.

Strengths and Limitations

Strengths

  • Modularity: Swap LLMs (OpenAI, Anthropic, local Hugging Face models) without changing the agent logic.
  • Observability: CrewAI’s built‑in logging and memory inspection let you trace why a triple was accepted or rejected.
  • Scalability: The extraction layer can be parallelised across CPU cores or GPUs by batching chunks.
  • Graph‑first output: The triples are ready for any graph store, avoiding an intermediate relational schema.
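
As a rough illustration of the scalability point, hosted LLM calls are network-bound, so a thread pool is often enough to process several chunks at once. This sketch uses only the standard library; `extract_all` and `fake_extract` are hypothetical names, and `fake_extract` stands in for a real call to the extraction chain.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_all(chunks, extract, max_workers=4):
    """Run `extract` over every chunk concurrently and flatten the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `chunks`, so results stay aligned with their sources.
        per_chunk = list(pool.map(extract, chunks))
    return [triple for triples in per_chunk for triple in triples]


# Stand-in extractor (hypothetical): a real one would call the LLM chain.
def fake_extract(chunk):
    subject, predicate, obj = chunk.split("|")
    return [{"subject": subject, "predicate": predicate, "object": obj}]


sample_chunks = ["Neo4j|is_a|graph database", "NetworkX|is_a|Python library"]
triples = extract_all(sample_chunks, fake_extract)
print(triples)
```

Keep `max_workers` below your provider's rate limit; for locally hosted models, batching inputs on the GPU is usually the better lever than threads.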

Limitations

  • LLM hallucination: Even with validation, false triples can slip through; domain‑specific gazetteers or rule‑based filters improve precision.
  • Cost: Running large LLMs over millions of chunks can be expensive; consider using smaller models for early filtering.
  • Agent overhead: CrewAI adds latency due to inter‑agent communication; for simple pipelines a single LangChain chain may be faster.
  • Graph size: In‑memory NetworkX graphs become unwieldy beyond a few million edges; Neo4j or a purpose‑built triple store is required for larger scales.
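
The gazetteer filter mentioned under hallucination can be as simple as a membership test on both ends of a triple. The sketch below assumes the gazetteer is a plain set of accepted entity names; `filter_by_gazetteer` and the sample vocabulary are hypothetical.

```python
def filter_by_gazetteer(triples, gazetteer):
    """Split triples into those whose subject and object are both known, and the rest."""
    known = {name.casefold() for name in gazetteer}
    accepted, rejected = [], []
    for triple in triples:
        in_vocab = (
            triple["subject"].casefold() in known
            and triple["object"].casefold() in known
        )
        (accepted if in_vocab else rejected).append(triple)
    return accepted, rejected


# Hypothetical gazetteer; in practice this would come from a curated domain vocabulary.
gazetteer = {"BRCA1", "breast cancer", "TP53"}
triples = [
    {"subject": "BRCA1", "predicate": "associated_with", "object": "Breast cancer"},
    {"subject": "BRCA1", "predicate": "associated_with", "object": "moon phases"},
]
accepted, rejected = filter_by_gazetteer(triples, gazetteer)
print(len(accepted), len(rejected))
```

Returning the rejected triples instead of discarding them leaves room for human review, since an unknown entity may be a genuine new finding and not a hallucination.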

Getting Started Guide

Prerequisites

  • Python 3.10+
  • Access to an LLM API (OpenAI key shown; replace with your provider)
  • Neo4j Desktop or Docker container (optional, for persistent storage)

Installation

pip install langchain crewai openai neo4j networkx rapidfuzz

Step‑by‑Step Example

  1. Load and split a sample text
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("sample.txt")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print(f"{len(chunks)} chunks ready for extraction")
  2. Define the Extractor, Validator, and Merger agents
from crewai import Agent, Task, Crew
from langchain.chains import LLMChain
from langchain.llms import OpenAI  # legacy import path, matching the prompt example above
from langchain.prompts import PromptTemplate
import json

# Extractor
llm = OpenAI(temperature=0)
extract_prompt = PromptTemplate(
    input_variables=["chunk"],
    template='Extract triples from the text as a JSON list of objects with keys "subject", "predicate", "object".\n\nText:\n{chunk}'
)
extract_chain = LLMChain(llm=llm, prompt=extract_prompt)

def extract_triples(chunk: str):
    raw = extract_chain.run(chunk=chunk)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []

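# Validator (self-loop and duplicate check)
seen_triples = set()

def validate_triple(triple):
    """Reject malformed triples, self-loops, and exact duplicates."""
    try:
        subj, pred, obj = triple["subject"], triple["predicate"], triple["object"]
    except (KeyError, TypeError):
        return False
    if subj == obj:
        return False
    key = (subj, pred, obj)
    if key in seen_triples:
        return False
    seen_triples.add(key)
    # optional: check against a gazetteer
    return True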
# Merger (fuzzy merge)
from rapidfuzz import fuzz

def merge_entities(triples, threshold=85):
    """Map each entity name to the first previously seen name it fuzzily matches."""
    canonical_names = []

    def find_or_add(name):
        # find an existing similar entity, or register this one as canonical
        for known in canonical_names:
            if fuzz.ratio(name, known) >= threshold:
                return known
        canonical_names.append(name)
        return name

    merged = []
    for t in triples:
        # copy so the caller's triples are not modified in place
        merged.append({**t, "subject": find_or_add(t["subject"]), "object": find_or_add(t["object"])})
    return merged

# Wrap as CrewAI agents
extractor_agent = Agent(
    role="Extractor",
    goal="Produce candidate triples from text chunks",
    backstory="Uses an LLM to pull subject‑predicate‑object statements.",
    verbose=True,
    allow_delegation=False,
)

validator_agent = Agent(
    role="Validator",
    goal="Filter out invalid or duplicate triples",
    backstory="Applies simple rules and gazetteer checks.",
    verbose=True,
    allow_delegation=False,
)

merger_agent = Agent(
    role="Merger",
    goal="Unify similar entities and consolidate predicates",
    backstory="Employs fuzzy matching to reduce noise.",
    verbose=True,
    allow_delegation=False,
)

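# Tasks. CrewAI tasks are natural-language instructions executed by the agent's LLM;
# the Python helpers above are applied to the crew's output afterwards.
# extract_triples remains available as a single-chain alternative when a crew is unnecessary.
extract_task = Task(
    description=(
        "Extract every factual statement from the text below as triples.\n\n"
        "Text:\n{chunk}"
    ),
    expected_output='A JSON list of objects with keys "subject", "predicate", "object".',
    agent=extractor_agent,
)

validate_task = Task(
    description="Review the extracted triples and drop any that the text does not support.",
    expected_output="The filtered JSON list of triples, with no extra commentary.",
    agent=validator_agent,
)

merge_task = Task(
    description="Unify entity names that refer to the same thing and remove duplicate triples.",
    expected_output="The final JSON list of triples, with no extra commentary.",
    agent=merger_agent,
)

crew = Crew(
    agents=[extractor_agent, validator_agent, merger_agent],
    tasks=[extract_task, validate_task, merge_task],
    verbose=True,
)

# Process all chunks
all_triples = []
for chunk in chunks:
    # {chunk} in the task description is filled in from `inputs`
    result = crew.kickoff(inputs={"chunk": chunk.page_content})
    # kickoff returns the final task's output; recent CrewAI releases expose its text as `.raw`
    raw = getattr(result, "raw", str(result))
    try:
        triples = json.loads(raw)
    except json.JSONDecodeError:
        continue  # skip chunks whose final answer is not valid JSON
    if not isinstance(triples, list):
        continue
    all_triples.extend(
        t for t in triples
        if isinstance(t, dict)
        and all(k in t for k in ("subject", "predicate", "object"))
        and validate_triple(t)
    )

# Deterministic clean-up across all chunks
all_triples = merge_entities(all_triples)
print(f"Collected {len(all_triples)} triples after merging")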
  3. Store in Neo4j
from neo4j import GraphDatabase

def insert_triples(triples, uri="bolt://localhost:7687", user="neo4j", password="test"):
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        for t in triples:
            session.run(
                """
                MERGE (s:Entity {name: $subj})
                MERGE (o:Entity {name: $obj})
                MERGE (s)-[r:REL {type: $pred}]->(o)
                SET r.created = timestamp()
                """,
                subj=t["subject"], obj=t["object"], pred=t["predicate"]
            )
    driver.close()

insert_triples(all_triples)
  4. Query the graph
from neo4j import GraphDatabase

def query_related(entity, uri="bolt://localhost:7687", user="neo4j", password="test"):
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            res = session.run(
                """
                MATCH (e:Entity {name: $ent})-[r:REL]->(n:Entity)
                RETURN r.type AS predicate, n.name AS object
                UNION
                MATCH (n:Entity)-[r:REL]->(e:Entity {name: $ent})
                RETURN r.type AS predicate, n.name AS object
                """,
                ent=entity
            )
            # materialise the rows while the session is still open
            return [{"predicate": r["predicate"], "object": r["object"]} for r in res]
    finally:
        # runs even though the function returns inside the try block
        driver.close()

print(query_related("COVID-19"))

Run the script (python build_kg.py) to see triples appear in Neo4j Browser. Adjust the prompts, validation rules, or merger threshold to fit your domain.
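
For quick prototyping without a Neo4j instance, the same store-and-query steps can be approximated in memory with NetworkX. This is a sketch under the assumption that triples are dicts with the three keys used above; `build_graph` and `related` are hypothetical helper names, and the sample triples are illustrative only.

```python
import networkx as nx


def build_graph(triples):
    """Load triples into a directed multigraph, one edge per (subject, predicate, object)."""
    graph = nx.MultiDiGraph()
    for t in triples:
        # Using the predicate as the edge key makes repeated triples idempotent,
        # much like MERGE in the Cypher version.
        graph.add_edge(t["subject"], t["object"], key=t["predicate"], type=t["predicate"])
    return graph


def related(graph, entity):
    """Return outgoing and incoming relations for an entity."""
    if entity not in graph:
        return []
    outgoing = [{"predicate": key, "object": v} for _, v, key in graph.out_edges(entity, keys=True)]
    incoming = [{"predicate": key, "object": u} for u, _, key in graph.in_edges(entity, keys=True)]
    return outgoing + incoming


sample = [
    {"subject": "SARS-CoV-2", "predicate": "causes", "object": "COVID-19"},
    {"subject": "COVID-19", "predicate": "affects", "object": "lungs"},
    {"subject": "SARS-CoV-2", "predicate": "causes", "object": "COVID-19"},  # duplicate
]
graph = build_graph(sample)
neighbours = related(graph, "COVID-19")
print(graph.number_of_edges(), neighbours)
```

A multigraph is used because two entities can be linked by more than one predicate; a plain DiGraph would keep only one edge per pair.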

Comparison with Alternatives

| Feature | LangChain + CrewAI | LlamaIndex + AutoGen | Hugging Face smolagents | Pure Neo4j + spaCy |
| --- | --- | --- | --- | --- |
| LLM flexibility | High (any LangChain‑compatible LLM) | Medium (LlamaIndex wrappers) | High (HF pipeline) | Low (rule‑based) |
| Multi‑agent orchestration | Built‑in (CrewAI) | Requires custom AutoGen setup | Minimal (single‑agent) | None |
| Document loading | Rich loader ecosystem | Good (LlamaIndex readers) | Basic | Manual |
| Graph output | Direct triple emission | Requires post‑processing | Same as LangChain | Native Cypher |
| Production scaling | Needs external graph store | Same | Same | Strong (Neo4j native) |
| Learning curve | Moderate (two frameworks) | Moderate | Low (HF) | High (Cypher + NLP) |

LangChain + CrewAI shines when you need a declarative, LLM‑driven extraction pipeline that can be inspected and iterated via agents. For pure graph construction without LLMs, a spaCy‑Neo4j combo is faster but less adaptable to new relation types. LlamaIndex + AutoGen offers similar agent capabilities but ties you to LlamaIndex’s data connectors.

Final Thoughts

Combining LangChain’s document‑centric tooling with CrewAI’s agent collaboration gives you a pragmatic way to turn raw text into a structured knowledge graph. The approach is transparent: you can inspect each agent’s output, tweak prompts, and replace components without rewriting the whole system. While LLM costs and hallucination remain concerns, the modular design lets you add validation layers, human‑in‑the‑loop checks, or hybrid rule‑based filters as needed. For teams that already use LangChain for LLM applications, adding CrewAI is a low‑friction step toward more complex, multi‑step knowledge‑extraction workflows.


Feel free to adapt the example code to your own data sources, LLM provider, or graph database. The patterns shown here scale from a few hundred documents to corpora of millions when paired with batching and asynchronous execution.

Keywords

LangChain, CrewAI, knowledge graph, entity extraction, multi-agent systems, Neo4j, LLM pipeline
