3. Retrieval‑Augmented Generation (RAG)
Li Wei
Overview of RAG
Generative AI models such as ChatGPT and GLM have shown impressive performance in tasks like text generation and text-to-image generation. However, they have inherent limitations: they can hallucinate, their outputs lack interpretability, they struggle with domain-specific knowledge, and they know nothing about information that appeared after they were trained. There are two main approaches to overcoming these constraints and boosting model capabilities:
- Fine‑tuning – updating the model itself.
- Enabling interaction with the external world – letting the model acquire knowledge in various forms and ways.
Fine‑tuning can indeed help a model truly “learn” private knowledge, but it brings several problems:
- Because generative models rely on internal knowledge (weights), they cannot completely eliminate hallucinations. In high‑stakes scenarios that demand strict accuracy, this is unacceptable, as users cannot easily tell from the surface whether the model is fabricating information.
- In real-world settings, data is generated continuously and concepts evolve rapidly (e.g., policy interpretations, indicator adjustments). Fine-tuning is far from trivial: it requires data preparation, compute resources, and training time, and its effectiveness is not guaranteed. Continuously retraining on fresh data is unrealistic; even a monthly update cycle would be an optimistic target.
A promising alternative is RAG (Retrieval‑Augmented Generation), which lets generative models interact with external knowledge sources. RAG works like a search engine: it finds the most relevant knowledge or dialogue history for a user query, combines it with the original question to create a rich prompt, and guides the model to generate an accurate answer. Essentially, it leverages in‑context learning.
In the LLM space, building a minimum viable product is relatively easy, but achieving production-grade performance and reliability is challenging, especially for a high-performance RAG system. RAG is already widely used for enterprise private-knowledge Q&A, and many popular “chat-to-PDF” and “chat-to-doc” applications are built on it.
Core RAG Terminology
(the original list was omitted in the source)
Basic RAG Workflow
RAG consists of five fundamental stages: preparing knowledge documents, encoding them with an embedding model, storing the vectors in a vector database, retrieving against the user query, and generating the answer. Each stage is described in detail below.
Preparing Knowledge Documents
The first step in building an efficient RAG system is to gather the knowledge documents. In practice, sources can be diverse: Word files, TXT, CSV tables, Excel spreadsheets, PDFs, images, videos, and more.
Thus, the initial phase requires specialized document-parsing tools or multimodal techniques to convert these heterogeneous resources into plain text that a large language model can understand. For PDFs, use a PDF text-extraction tool; for images and videos, apply OCR to recognize and extract any embedded text.
Embedding Model
In a RAG system, the retrieval stage relies on vector similarity rather than keyword matching. The knowledge base is split in advance into many “chunks,” each encoded as a vector; when a user asks a question, the query is encoded into a vector with the same embedding model. Retrieval then amounts to finding the knowledge chunks that are closest to the query in semantic space.
“Closeness” can be interpreted in two ways:
- Semantic direction – are the vectors pointing in the same direction?
- Geometric distance – how far apart are the points in the vector space?
Because of these two perspectives, RAG systems often employ both cosine similarity and Euclidean distance as similarity measures.
- Cosine similarity focuses on the angle between vectors; even if their magnitudes differ, a small angle indicates strong semantic relatedness.
- Euclidean distance treats vectors as points in a high‑dimensional space and measures the straight‑line distance; a smaller distance means the vectors are numerically closer.
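The two measures above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the three example vectors are made up for the demonstration and do not come from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative vectors (assumed values, not real embeddings)
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # same direction as query, twice the magnitude
different = [3.0, 2.0, 1.0]       # nearby point, but pointing elsewhere

for name, vec in [("same_direction", same_direction), ("different", different)]:
    print(name, cosine_similarity(query, vec), euclidean_distance(query, vec))
```

Here `same_direction` wins on cosine similarity while `different` wins on Euclidean distance, which shows that the two measures can rank candidates differently. When all vectors are normalized to unit length, the two measures produce the same ranking.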
Example: Word2Vec
Word2Vec, developed by Google, illustrates one way to build an embedding model. It offers two training strategies: Continuous Bag‑of‑Words (CBOW) and Skip‑gram. Below we focus on CBOW.
- One‑Hot Encoding – each context word (e.g., “the”, “cat”, “on”, “the”, “mat”) is turned into a high‑dimensional sparse vector with a single 1 at the index of the word.
- Word‑Vector Generation – each one‑hot vector is multiplied by a weight matrix W of size V × N (V = vocabulary size, N = embedding dimension), which simply selects the corresponding row of W as the word's dense vector. The dense vectors are then averaged to produce a context vector e (1 × N).
- Output Layer – a second matrix W′ (N × V) linearly transforms e, followed by a Softmax that yields a probability distribution over the vocabulary, indicating the likelihood of each word being the target.
- Parameter Update – the difference between the predicted distribution and the true target word drives gradient descent updates to W and W′, minimizing loss.
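The four CBOW steps above can be sketched as a toy training loop in plain Python. This is a simplified illustration under stated assumptions, not the original Word2Vec code: the sentence "the cat sat on the mat" with target word "sat", the embedding dimension of 3, the learning rate, and the random seed are all chosen for the example, and `W_out` stands in for W′.

```python
import math
import random

random.seed(0)  # assumed seed, only to make the toy run repeatable

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3  # vocabulary size, embedding dimension (assumed)

# W is V x N (input embeddings); W_out is N x V (plays the role of W')
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(context_ids):
    # One-hot times W just selects a row of W; average the selected rows
    e = [sum(W[i][d] for i in context_ids) / len(context_ids) for d in range(N)]
    scores = [sum(e[d] * W_out[d][j] for d in range(N)) for j in range(V)]
    return e, softmax(scores)

def train_step(context_ids, target_id, lr=0.1):
    e, probs = forward(context_ids)
    loss = -math.log(probs[target_id])  # cross-entropy for the true target
    # Gradient of the loss w.r.t. the scores: predicted minus one-hot target
    err = [p - (1.0 if j == target_id else 0.0) for j, p in enumerate(probs)]
    # Gradient w.r.t. e, computed before W_out is modified
    grad_e = [sum(W_out[d][j] * err[j] for j in range(V)) for d in range(N)]
    for d in range(N):
        for j in range(V):
            W_out[d][j] -= lr * e[d] * err[j]
    for i in context_ids:  # each context occurrence receives 1/C of the gradient
        for d in range(N):
            W[i][d] -= lr * grad_e[d] / len(context_ids)
    return loss

context = [word_to_idx[w] for w in ["the", "cat", "on", "the", "mat"]]
target = word_to_idx["sat"]
losses = [train_step(context, target) for _ in range(50)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, the rows of `W` are the learned word vectors. A real implementation trains on a large corpus and avoids the full Softmax over the vocabulary for efficiency.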
More advanced models such as BERT and the GPT series use deeper neural architectures to capture richer semantic relationships. Embeddings are therefore a cornerstone of NLP, enabling efficient representation of complex word relationships and dramatically improving retrieval accuracy.
Vector Database
A vector database’s primary role is to store and retrieve vector‑formatted data efficiently. These vectors can represent text, images, audio, or any modality after feature extraction.
When using RAG, the raw data is first transformed into vectors by an embedding model. All resulting vectors are then persisted in a purpose‑built vector store, which offers several advantages:
- Fast similarity search – optimized algorithms quickly locate the nearest vectors even in massive collections.
- Scalability – designed for high‑dimensional data, they can handle millions or billions of records.
- Flexibility & extensibility – new data sources or indexing strategies can be added without sacrificing performance.
Query Retrieval
After the preparation steps, we can process user queries. The query is embedded into a vector, and the system searches the vector database for semantically similar knowledge chunks or past dialogue turns, returning the most relevant results.
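The store-then-search flow can be illustrated with a tiny in-memory stand-in for a vector database. Everything here is hypothetical and simplified: `toy_embed` is a word-count vector used in place of a real embedding model, `InMemoryVectorStore` is an invented class, and the search is a brute-force scan, whereas real vector databases typically rely on approximate nearest-neighbor indexes.

```python
import math
from collections import Counter

def toy_embed(text, vocabulary):
    """Stand-in for an embedding model: a word-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class InMemoryVectorStore:
    """Hypothetical minimal store: keeps (chunk, vector) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, chunk, vector):
        self.items.append((chunk, vector))

    def search(self, query_vector, k=2):
        # Brute force: score every stored vector, then keep the k best
        scored = [(cosine(query_vector, vec), chunk) for chunk, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

chunks = [
    "rag retrieves relevant chunks before generation",
    "fine-tuning updates the model weights",
    "vector databases store embeddings for similarity search",
]
vocabulary = sorted({w for c in chunks for w in c.split()})

store = InMemoryVectorStore()
for chunk in chunks:
    store.add(chunk, toy_embed(chunk, vocabulary))

query = "how do vector databases store embeddings"
results = store.search(toy_embed(query, vocabulary), k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

The important property is that the query and the chunks are embedded by the same function, so their vectors live in the same space and can be compared directly.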
Answer Generation
Finally, the original user question and the retrieved information are combined into a prompt template, fed to a large language model, and the model’s output is returned as the answer.
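Prompt assembly is mostly string formatting. The sketch below shows one possible template; its wording and the `build_prompt` helper are assumptions made for illustration, and the call to the language model itself is deliberately left out because it depends on the provider.

```python
# Assumed template wording; adjust to the model and use case
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context is insufficient, say that you do not know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, retrieved_chunks):
    """Number each retrieved chunk and insert it into the template."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "What does a vector database store?",
    [
        "vector databases store embeddings for similarity search",
        "rag retrieves relevant chunks before generation",
    ],
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports each statement, which helps with later fact-checking.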
RAG Evaluation Metrics
Evaluating a Retrieval‑Augmented Generation system requires assessing both the retrieval and the generation stages. If you prefer a high‑level view, the core metrics below give a quick sense of performance.
Retrieval‑Stage Core Metrics
Precision – proportion of retrieved documents that are relevant.
Precision = # relevant retrieved / # retrieved
Example: 7 relevant out of 10 retrieved → Precision = 0.7. High precision means the system returns accurate results.
Recall – proportion of all relevant documents that were retrieved.
Recall = # relevant retrieved / # all relevant
Example: 12 of 20 relevant documents retrieved → Recall = 0.6. High recall indicates good coverage.
F1‑Score – harmonic mean of precision and recall, balancing the two.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Useful when both accuracy and completeness matter.
NDCG (Normalized Discounted Cumulative Gain) – evaluates ranking quality, giving higher weight to items near the top of the list.
NDCG@k = DCG@k / IDCG@k
Critical for assessing Top‑K results.
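The retrieval metrics can be computed with a few short functions. This is a sketch under assumptions: the document IDs are made up, DCG uses the common rel / log2(rank + 1) form, and the ideal ranking (IDCG) is taken over the relevance labels of the retrieved list only, whereas a full evaluation would consider all relevant documents in the corpus.

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: lower-ranked hits count less."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the same labels in ideal (sorted) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

# Hypothetical query: 10 retrieved documents, 7 of them relevant,
# out of 20 relevant documents overall
retrieved = [f"d{i}" for i in range(1, 11)]
relevant = [f"d{i}" for i in range(1, 8)] + [f"x{i}" for i in range(13)]
precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(precision, recall, round(f1, 4))

# Relevance labels in ranked order (1 = relevant, 0 = not relevant)
print(ndcg_at_k([1, 0, 1], k=3))
```

Precision, recall, and F1 ignore ordering, so two rankings with the same hits score identically; NDCG is the metric that rewards placing relevant chunks first.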
Generation‑Stage Core Metrics
- BLEU – n‑gram overlap between generated text and reference answers; fast to compute but ignores semantics.
- ROUGE – n‑gram recall‑oriented metric; variants include ROUGE‑N, ROUGE‑L (longest common subsequence), and ROUGE‑W (weighted LCS). Commonly used for summarization and QA.
- BERTScore – computes similarity using BERT embeddings, capturing semantics at higher cost.
- Factual Accuracy – checks whether the generated content aligns with retrieved documents, external knowledge bases, or human fact‑checkers. This is a key quality indicator for RAG.
- Relevance – measures how well the answer addresses the user query, using semantic similarity, keyword matching, or human evaluation.
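To make the n-gram overlap idea behind BLEU and ROUGE-N concrete, the sketch below computes clipped n-gram precision (the building block of BLEU) and n-gram recall (the core of ROUGE-N) for a single reference. It is a deliberately reduced version: full BLEU also combines several n-gram orders and applies a brevity penalty, and the whitespace tokenization and example sentences are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of n-gram overlap with clipped counts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # '&' keeps the minimum count per n-gram
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

unigram_p, unigram_r = ngram_overlap(candidate, reference, n=1)
bigram_p, bigram_r = ngram_overlap(candidate, reference, n=2)
print(unigram_p, unigram_r)
print(bigram_p, bigram_r)
```

The candidate changes a single word, yet the bigram scores drop noticeably more than the unigram scores. That sensitivity to surface form, regardless of meaning, is why embedding-based metrics such as BERTScore are used alongside n-gram metrics.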
Optimizing RAG
The basic workflow described earlier is just a starting point; each component offers substantial room for improvement. Below are 12 concrete optimization strategies (implemented in the LangChain and LlamaIndex frameworks; refer to their official docs for details).
Data Cleaning
A high‑performance RAG system depends on accurate, clean source knowledge. Steps include:
- Basic text cleaning – normalize formatting, strip special characters, remove duplicates.
- Entity resolution – unify terminology (e.g., map “LLM”, “large language model”, “big model” to a single canonical term).
- Document segmentation – ensure topics are grouped logically; if a human can’t quickly identify the right document, the retrieval system won’t either.
- Data augmentation – add synonyms, paraphrases, or translations to enrich the corpus.
- User‑feedback loop – continuously update the database with real‑world feedback and flag misinformation.
- Time‑sensitive handling – retire or refresh stale documents for rapidly changing subjects.
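The first two cleaning steps (basic text cleaning and entity resolution) followed by deduplication might look like the sketch below. The `CANONICAL_TERMS` table and the sample documents are hypothetical; a real pipeline would maintain a much larger, domain-specific mapping.

```python
import re

# Hypothetical mapping from term variants to one canonical term
CANONICAL_TERMS = {
    "large language model": "LLM",
    "big model": "LLM",
}

def clean_text(text):
    """Collapse whitespace, then replace known variants with the canonical term."""
    text = re.sub(r"\s+", " ", text).strip()
    for variant, canonical in CANONICAL_TERMS.items():
        pattern = rf"\b{re.escape(variant)}\b"
        text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text

def deduplicate(docs):
    """Drop exact duplicates (case-insensitive), keeping the first occurrence."""
    seen, unique = set(), []
    for doc in docs:
        key = doc.lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

raw_docs = [
    "The  large language model\tcan hallucinate.",
    "The big model can hallucinate.",
    "RAG adds retrieval.",
]
cleaned = deduplicate([clean_text(d) for d in raw_docs])
print(cleaned)
```

Running entity resolution before deduplication matters here: the first two documents only become identical once both variants have been mapped to the same term.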
Chunking
RAG requires splitting documents into chunks before embedding. The goal is to keep semantic coherence while minimizing noise. Overly large chunks dilute relevance; overly small chunks lose context.
Chunking Strategies
- Fixed‑size chunks – simple character or token limits with a small overlap to avoid cutting sentences. Low cost but may split mid‑sentence.
- Content‑aware chunks – split on punctuation, paragraphs, or use NLP sentence tokenizers (NLTK, spaCy). Better semantic integrity but can produce uneven sizes.
- Recursive chunking – a common engineering approach: start with a coarse split (e.g., the paragraph delimiter \n\n); if a piece is still too big, split on line breaks (\n); continue with spaces or periods until the size fits. This yields finer granularity where needed while keeping larger blocks for sparse sections.
- Multi‑scale chunking – generate several chunk sizes for the same document and store all in the vector DB, preserving hierarchical links. Retrieval can then choose the appropriate granularity, at the cost of higher storage and indexing overhead.
- Special‑structure chunking – for Markdown, LaTeX, or source code, use dedicated splitters that respect the original structure (LangChain provides built‑in splitters for these formats).
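Recursive chunking, as described in the list above, can be sketched as a short recursive function. This is a simplified illustration and not the LangChain implementation: the function name, the separator order, the character-based size limit, and the absence of overlap are all assumptions made to keep the example short.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_len characters,
    trying coarse separators first and finer ones only where needed."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # Still too big: retry this piece with the next, finer separator
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks

document = (
    "Intro paragraph.\n\n"
    "Line one of a long section\nLine two of a long section\n\n"
    "End."
)
pieces = recursive_split(document, max_len=30)
for p in pieces:
    print(repr(p))
```

The short first and last paragraphs stay whole, while the long middle paragraph is the only part that falls back to the finer line-break separator.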
Choosing Chunk Size
There is no universal “right” size; it balances semantic completeness, retrieval precision, and context utilization.
- Respect the model’s context window – a chunk should occupy no more than ~10‑20 % of the window, leaving room for the user query and multiple retrieved chunks.
- Prioritize semantic units over “filling the window.”
- Avoid extremes – too small → fragmented info; too large → diluted similarity scores.
- Adjust granularity by document type – technical docs can have larger chunks (concept or section), code snippets need finer splits.
- Account for Top‑K retrieval – if you plan to feed, say, 5 chunks, compute: (context window – query tokens) ÷ Top‑K ≈ max chunk size.
- Overlap – typically 10‑30 % of the chunk size; too little leads to boundary loss, too much inflates storage/computation.
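The Top-K budget and the overlap guideline reduce to simple arithmetic. The sketch below works through one example; the 8,192-token window, the 192-token query, and the 20 % overlap are assumed values, and `reserved_tokens` is an extra parameter (not part of the formula above) for setting aside room for the prompt template and the answer.

```python
def max_chunk_tokens(context_window, query_tokens, top_k, reserved_tokens=0):
    """(context window - query tokens) / Top-K, optionally minus a reserve."""
    return (context_window - query_tokens - reserved_tokens) // top_k

def overlap_tokens(chunk_size, ratio=0.2):
    """Overlap as a fraction of the chunk size (10-30 % is typical)."""
    return round(chunk_size * ratio)

def fixed_size_chunks(tokens, size, overlap):
    """Sliding window over a token list; consecutive chunks share `overlap` tokens."""
    step = size - overlap
    last_start = max(len(tokens) - overlap, 1)  # avoid a final chunk that adds nothing new
    return [tokens[i:i + size] for i in range(0, last_start, step)]

budget = max_chunk_tokens(context_window=8192, query_tokens=192, top_k=5)
overlap = overlap_tokens(budget, ratio=0.2)
print(budget, overlap)

# Tiny demonstration with integers standing in for tokens
demo = fixed_size_chunks(list(range(10)), size=4, overlap=1)
print(demo)
```

With these numbers the budget works out to 1,600 tokens per chunk, just under 20 % of the window, so it is consistent with the 10-20 % guideline. Treat the result as an upper bound: semantic units should still decide where chunks actually end.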
Empirically validate chunk size with:
- Recall/Hit‑rate comparisons across sizes.
- Human evaluation for “answer completeness” vs. “off‑topic”.
- A/B tests on a fixed dataset.
Embedding Model Selection
Different embedding models yield different retrieval performance. Classic Word2Vec produces static word vectors that do not change with context, which is problematic for polysemy (e.g., the Chinese word “光盘”, which can mean “optical disc” or, colloquially, “clean plate,” i.e., finishing all the food on one's plate). Modern contextual models (BERT, Sentence-Transformers, OpenAI embeddings) generate dynamic embeddings that adapt to the surrounding words and handle ambiguity much better. Choose a model that balances:
- Domain relevance – fine‑tuned or domain‑specific embeddings often outperform generic ones.
- Latency & cost – larger models are slower and pricier.
- Dimensionality – higher dimensions may improve accuracy but increase storage/computation.
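The storage side of the dimensionality trade-off is easy to estimate. The sketch below assumes vectors stored as 32-bit floats (4 bytes per value) and ignores index structures and metadata; the collection size and the three dimension counts are illustrative values, not recommendations.

```python
def raw_vector_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 values by default."""
    return num_vectors * dimensions * bytes_per_value

num_chunks = 1_000_000  # assumed collection size
for dims in (384, 768, 1536):
    gigabytes = raw_vector_storage_bytes(num_chunks, dims) / 1e9
    print(f"{dims} dimensions: {gigabytes:.2f} GB")
```

Storage grows linearly with the dimension, and so does the cost of each brute-force similarity comparison, so a higher-dimensional model has to earn its cost through measurably better retrieval quality.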
(The remainder of the original content was truncated.)
Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.