Implementing RAG Systems from A to Z: Practical Guide from Embedding to Retriever
This article focuses on common stumbling points when building RAG systems. It covers problem areas encountered in real-world scenarios, from selecting embedding models and chunk strategies to hybrid search and re-ranking.
Why RAG?

The biggest limitation of LLMs is that they know nothing beyond their training data; by themselves, they cannot access internal documents or proprietary company data. Fine-tuning can inject new knowledge, but it is expensive and must be repeated every time the data changes.
Retrieval-Augmented Generation (RAG) offers a different solution. Instead of changing model weights, it fetches relevant documents from external knowledge repositories in real-time and appends them to prompts. The LLM then generates responses based on this extended context. Although it sounds simple, implementing it involves several nuanced considerations.
Structural Overview
A RAG system has four main components:
Knowledge Base: This stores and preprocesses documents for reference. Original documents are chunked into manageable sizes, embedded into vectors using an embedding model, and stored in a vector database.
Embedding Model: This transforms text into numerical vectors such that semantically similar texts have vectors close in the embedding space. This is crucial for effective retrieval.
Retriever: Converts user queries into vectors, then searches the repository to find the top-k most similar chunks.
Generator: Combines the retrieved chunks with the original question to create a prompt, then feeds it into an LLM to generate the final answer.
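The four components above can be sketched end to end in a few lines. This is a toy illustration, not a production pattern: a bag-of-words counter stands in for a real embedding model, and the "generator" only assembles the prompt that would be sent to an LLM. All function names here are illustrative.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" standing in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1) Knowledge base: chunked documents, embedded and stored.
chunks = [
    "RAG fetches relevant documents at query time.",
    "Fine-tuning changes model weights and is expensive.",
]
index = [(c, embed(c)) for c in chunks]

# 2-3) Retriever: embed the query, take the top-k most similar chunks.
def retrieve(query: str, k: int = 1) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# 4) Generator: build the prompt that would go to the LLM.
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How does RAG get documents?"))
```

In a real system, `embed` would call an embedding model, `index` would live in a vector database, and `build_prompt`'s output would be passed to an LLM.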
Common Practical Challenges
Choosing the Embedding Model
Starting with a general-purpose model like text-embedding-ada-002 is a reasonable approach. However, if working with Korean documents or specialized domains such as finance or legal texts, using multilingual or domain-specific models can significantly improve performance.
Model swapping costs should also be considered. Changing the embedding model requires re-embedding the entire dataset.
Chunking Strategy
Chunk size greatly impacts system performance. Too small, and the context gets truncated, making it hard for the LLM to understand; too large, and search precision drops, plus the context window fills up faster.
A common starting point is around 512 tokens with approximately 100 tokens of overlap. Cutting along paragraph or section boundaries often outperforms fixed-size chunks.
Hierarchical splitting can improve results further: embed and search coarse, section-level chunks first, then split the matched sections into smaller pieces for fine-grained retrieval within them.
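The fixed-size-with-overlap strategy above is straightforward to implement. A minimal sketch follows; whitespace tokens stand in for real tokenizer tokens, so in practice you would count tokens with the tokenizer of your embedding model instead.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 100) -> list[list[str]]:
    """Split a token list into fixed-size chunks with a fixed overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reached the end
            break
    return chunks

# 1,000 tokens -> three chunks; consecutive chunks share 100 tokens.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, size=512, overlap=100)
```

Boundary-aware splitting (on paragraphs or sections) replaces the fixed `step` with cuts at structural markers, but the overlap idea carries over unchanged.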
Re-ranking
Searching based solely on vector similarity can sometimes return irrelevant documents at the top. A solution involves a two-stage process: first retrieving top N candidates (e.g., 50–100), then reranking these with a cross-encoder model for the final top-k.
Cross-encoders evaluate the relevance of the query-document pair directly, providing higher accuracy than bi-encoders. While slower, they are used after initial filtering for more precise ranking.
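The two-stage flow can be sketched as below. Both scoring functions are toy stand-ins labeled as such: `fast_score` plays the cheap bi-encoder role and `cross_score` the slower, more accurate cross-encoder role; in practice they would be replaced by real model calls.

```python
def fast_score(query: str, doc: str) -> float:
    # Stand-in for a bi-encoder similarity: token-set Jaccard overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def cross_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder, which scores the query-document
    # pair jointly; mimicked here with an exact-phrase bonus.
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return fast_score(query, doc) + bonus

def retrieve_and_rerank(query: str, docs: list[str], n: int = 50, k: int = 3) -> list[str]:
    # Stage 1: cheap retrieval of the top-N candidates.
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:n]
    # Stage 2: precise (expensive) reranking of only those candidates.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:k]

docs = [
    "hybrid search combines vector and keyword retrieval",
    "reranking with a cross-encoder improves precision",
    "chunk size affects retrieval quality",
]
top = retrieve_and_rerank("cross-encoder improves precision", docs, n=3, k=1)
```

The point of the structure is cost control: the expensive scorer only ever sees N documents, not the whole corpus.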
Hybrid Search
Combining vector search with keyword-based methods like BM25 can be effective. Vector search captures semantic similarity, but keyword search is better for exact matches of proper nouns or code identifiers.
A common way to merge the two result lists is Reciprocal Rank Fusion (RRF); per-retriever weights can then be tuned to the domain.
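RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in, optionally weighted per retriever. A minimal sketch (the document IDs are made up for illustration):

```python
def rrf(rankings, k=60, weights=None):
    """Reciprocal Rank Fusion: score(d) = sum_i w_i / (k + rank_i(d))."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by vector search
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # ranked by BM25
fused = rrf([vector_hits, bm25_hits])
```

Here `doc_b` wins the fused ranking because it places highly in both lists, even though neither retriever ranked it first overall; the constant k = 60 is the value commonly used in practice.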
Common Bottlenecks
Incorrect search results lead to wrong answers. No matter how advanced the LLM, if the retriever pulls irrelevant documents, the output suffers. Prioritizing search quality evaluation is crucial.
Multi-hop queries are a limitation of basic RAG setups. Questions that link multiple documents, such as "differences between last year's and this year's policies," require more complex setups, such as agent architectures built with frameworks like LangGraph.
Maintaining an up-to-date knowledge base is often underestimated. Environments with frequent document updates need well-designed pipelines for continuous updates.
Frequently Asked Questions
Is fine-tuning embedding models necessary? Not always. If working in specialized domains with many technical terms or in languages where default models underperform, domain-specific fine-tuning can help. Often, starting with a general-purpose model and improving as needed is advisable.
Is a vector DB indispensable? For prototypes, simple solutions like FAISS with file storage are sufficient. For production with millions of vectors, dedicated solutions like Qdrant, Weaviate, or Pinecone are recommended.
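For a sense of scale: at prototype size, exhaustive ("flat") search is just one matrix-vector product, which is essentially what a flat index does under the hood. A small numpy sketch with random stand-in vectors (the dimensions and corpus size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64)).astype("float32")
# Normalize rows so inner product equals cosine similarity.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query_vec: np.ndarray, top_k: int = 5) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    sims = vectors @ q               # one matrix-vector product over the corpus
    return np.argsort(-sims)[:top_k] # indices of the most similar vectors

ids = search(vectors[42])
```

Searching with a stored vector as the query returns that vector itself first (cosine 1.0). Dedicated vector databases earn their keep when exhaustive scans like this become too slow, at millions of vectors and up.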
How can you verify that generated answers actually reflect the retrieved documents? Add a separate validation step that checks groundedness between the answer and its source documents, typically by using an LLM as a judge at the end of the pipeline.