Retrieval-Augmented Generation (RAG) has become the standard pattern for building LLM applications that need accurate, up-to-date information. Instead of relying solely on the model’s training data, RAG retrieves relevant context before generating responses.
This post walks through production-ready RAG architecture patterns: the basic pipeline, query transformation, reranking, contextual compression, multi-index retrieval, and evaluation.
Why RAG
The Problem
llm_limitations:
  knowledge_cutoff:
    - Training data has an end date
    - Can't know recent information
    - Stale facts and figures
  hallucination:
    - Generates plausible but false info
    - Confident about incorrect statements
    - No way to verify internally
  no_private_data:
    - Only knows public training data
    - Can't access your documents
    - Can't use proprietary information
RAG Solution
rag_approach:
  retrieval:
    - Find relevant documents
    - Based on query similarity
    - From your data sources
  augmentation:
    - Add retrieved context to prompt
    - Ground the response in real data
    - Provide source attribution
  generation:
    - LLM generates using context
    - Answers based on your data
    - Can cite sources
Basic RAG Architecture
Components
┌──────────────────────────────────────────────────────────────┐
│                         RAG Pipeline                          │
├──────────────────────────────────────────────────────────────┤
│                                                                │
│  Query ──► Embed ──► Retrieve ──► Augment ──► Generate        │
│              │          │            │           │            │
│         ┌────▼────┐  ┌───▼───┐   ┌───▼───┐   ┌────▼───┐       │
│         │Embedding│  │Vector │   │Prompt │   │  LLM   │       │
│         │  Model  │  │  DB   │   │Builder│   │        │       │
│         └─────────┘  └───────┘   └───────┘   └────────┘       │
│                                                                │
└──────────────────────────────────────────────────────────────┘
Basic Implementation
class BasicRAG:
    def __init__(self, embedding_model, vector_db, llm):
        self.embedding_model = embedding_model
        self.vector_db = vector_db
        self.llm = llm

    def query(self, question, k=5):
        # 1. Embed the query
        query_embedding = self.embedding_model.encode(question)

        # 2. Retrieve relevant documents
        results = self.vector_db.query(query_embedding, top_k=k)

        # 3. Build context from retrieved documents
        context = "\n\n".join([r.text for r in results])

        # 4. Augment prompt with context
        prompt = f"""Answer the question based on the following context.
If the answer isn't in the context, say "I don't have information about that."

Context:
{context}

Question: {question}

Answer:"""

        # 5. Generate response
        response = self.llm.generate(prompt)

        return {
            "answer": response,
            "sources": [r.metadata for r in results]
        }
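Wiring the class up is mostly a matter of choosing concrete components. A minimal sketch of what instantiation might look like, where SentenceTransformer is the real sentence-transformers class but VectorDBClient and LLMClient are hypothetical adapters standing in for whatever vector store and model API you use:

from sentence_transformers import SentenceTransformer

# Hypothetical adapters: wrap your vector store and LLM API so they expose the
# query(embedding, top_k) and generate(prompt) methods BasicRAG expects.
vector_db = VectorDBClient(collection="docs")
llm = LLMClient(model="your-chat-model")

# Real library: sentence-transformers; encode(text) returns a dense vector
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

rag = BasicRAG(embedding_model, vector_db, llm)
result = rag.query("What is our refund policy?", k=5)
print(result["answer"])
print(result["sources"])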
Advanced Patterns
Query Transformation
class QueryTransformer:
    """Transform queries for better retrieval."""

    def __init__(self, llm):
        self.llm = llm

    def expand_query(self, query):
        """Generate multiple search queries."""
        prompt = f"""Generate 3 different search queries to find information for this question.
Return only the queries, one per line.

Question: {query}

Search queries:"""
        response = self.llm.generate(prompt)
        queries = [q.strip() for q in response.split('\n') if q.strip()]
        return queries

    def hypothetical_answer(self, query):
        """Generate hypothetical answer for better embedding (HyDE)."""
        prompt = f"""Write a short paragraph that would be a perfect answer to this question.
Don't worry about accuracy, just match the expected writing style and content.

Question: {query}

Hypothetical answer:"""
        return self.llm.generate(prompt)
class AdvancedRAG:
    def query(self, question, k=5):
        # Method 1: Query expansion
        expanded_queries = self.query_transformer.expand_query(question)

        all_results = []
        for q in expanded_queries:
            embedding = self.embedding_model.encode(q)
            results = self.vector_db.query(embedding, top_k=k)
            all_results.extend(results)

        # Deduplicate and rank
        unique_results = self.deduplicate_and_rank(all_results)

        # Method 2: HyDE
        hypothetical = self.query_transformer.hypothetical_answer(question)
        hyde_embedding = self.embedding_model.encode(hypothetical)
        hyde_results = self.vector_db.query(hyde_embedding, top_k=k)

        # Combine results
        combined = self.merge_results(unique_results, hyde_results)

        return self.generate_with_context(question, combined[:k])
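deduplicate_and_rank and merge_results are referenced above but not defined. One possible sketch, assuming each retrieved result carries a stable .id and a similarity .score (both assumptions about your vector DB client), and using reciprocal rank fusion to merge the two lists, which is a common choice but not the only one:

# Methods intended to live on AdvancedRAG
def deduplicate_and_rank(self, results):
    # Keep the best-scoring copy of each document, then sort by score
    best = {}
    for r in results:
        if r.id not in best or r.score > best[r.id].score:
            best[r.id] = r
    return sorted(best.values(), key=lambda r: r.score, reverse=True)

def merge_results(self, list_a, list_b, k=60):
    # Reciprocal rank fusion: each document scores 1 / (k + rank) per list
    fused = {}
    for ranked_list in (list_a, list_b):
        for rank, r in enumerate(ranked_list, start=1):
            entry = fused.setdefault(r.id, [r, 0.0])
            entry[1] += 1.0 / (k + rank)
    merged = sorted(fused.values(), key=lambda x: x[1], reverse=True)
    return [r for r, _ in merged]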
Reranking
from sentence_transformers import CrossEncoder

class RerankedRAG:
    def __init__(self, embedding_model, vector_db, llm):
        self.embedding_model = embedding_model
        self.vector_db = vector_db
        self.llm = llm
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def query(self, question, initial_k=20, final_k=5):
        # Initial retrieval (get more candidates)
        query_embedding = self.embedding_model.encode(question)
        candidates = self.vector_db.query(query_embedding, top_k=initial_k)

        # Rerank with cross-encoder
        pairs = [[question, c.text] for c in candidates]
        scores = self.reranker.predict(pairs)

        # Sort by reranking score
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )

        # Return top results
        top_results = [r[0] for r in ranked[:final_k]]
        return self.generate_with_context(question, top_results)
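Several classes in this post call generate_with_context without showing it. A minimal sketch that mirrors the prompt from BasicRAG, assuming result objects with .text and .metadata and an llm client on self (CompressedRAG below passes dicts instead, so it would adapt the attribute access accordingly):

# Shared final step, intended as a method on the RAG classes above
def generate_with_context(self, question, results):
    context = "\n\n".join(r.text for r in results)
    prompt = f"""Answer the question based on the following context.
If the answer isn't in the context, say "I don't have information about that."

Context:
{context}

Question: {question}

Answer:"""
    return {
        "answer": self.llm.generate(prompt),
        "sources": [r.metadata for r in results],
    }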
Contextual Compression
class ContextCompressor:
    """Extract only relevant parts from retrieved documents."""

    def __init__(self, llm):
        self.llm = llm

    def compress(self, question, document):
        prompt = f"""Extract only the sentences from this document that are relevant to answering the question.
If nothing is relevant, respond with "NOT_RELEVANT".

Question: {question}

Document:
{document}

Relevant sentences:"""
        response = self.llm.generate(prompt)
        if "NOT_RELEVANT" in response:
            return None
        return response

class CompressedRAG:
    def query(self, question, k=10):
        # Retrieve more documents
        results = self.retrieve(question, k=k)

        # Compress each document
        compressed = []
        for doc in results:
            relevant = self.compressor.compress(question, doc.text)
            if relevant:
                compressed.append({
                    "text": relevant,
                    "source": doc.metadata
                })

        # Use compressed context (fewer tokens)
        return self.generate_with_context(question, compressed[:5])
Multi-Index RAG
class MultiIndexRAG:
    """Query multiple specialized indexes."""

    def __init__(self):
        self.indexes = {
            "documentation": DocumentationIndex(),
            "code": CodeIndex(),
            "support_tickets": SupportIndex(),
        }
        self.router = QueryRouter()

    def query(self, question, k=5):
        # Route query to appropriate indexes
        relevant_indexes = self.router.route(question)

        all_results = []
        for index_name in relevant_indexes:
            index = self.indexes[index_name]
            results = index.query(question, k=k)
            for r in results:
                r.source_index = index_name
            all_results.extend(results)

        # Rank across all indexes
        ranked = self.cross_index_ranking(all_results)
        return self.generate_with_context(question, ranked[:k])

class QueryRouter:
    def route(self, question):
        """Determine which indexes to query."""
        # Simple keyword-based routing
        if "code" in question.lower() or "function" in question.lower():
            return ["code", "documentation"]
        elif "error" in question.lower() or "bug" in question.lower():
            return ["support_tickets", "documentation"]
        else:
            return ["documentation"]
Evaluation
RAG Metrics
rag_evaluation:
  retrieval_metrics:
    recall:
      what: Did we retrieve the relevant documents?
      measure: Relevant retrieved / Total relevant
    precision:
      what: Were retrieved documents relevant?
      measure: Relevant retrieved / Total retrieved
    mrr:
      what: Rank of first relevant document
      measure: 1 / rank of first relevant
  generation_metrics:
    faithfulness:
      what: Is the answer supported by context?
      measure: Claims supported by context / Total claims
    relevance:
      what: Does answer address the question?
      measure: Human evaluation or LLM judge
    citation_accuracy:
      what: Are sources correctly cited?
      measure: Verifiable citations / Total citations
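The retrieval metrics reduce to a few lines of code once you have labeled relevant document ids per test question. A sketch (the function name is just for illustration):

def retrieval_metrics(retrieved_ids, relevant_ids):
    """Compute recall, precision, and reciprocal rank for a single query."""
    relevant = set(relevant_ids)
    hits = [doc_id for doc_id in retrieved_ids if doc_id in relevant]

    recall = len(set(hits)) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0

    # Reciprocal rank: 1 / position of the first relevant document (0 if none)
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1.0 / rank
            break

    return {"recall": recall, "precision": precision, "reciprocal_rank": reciprocal_rank}

# MRR over a test set is the mean of the per-query reciprocal ranks.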
Evaluation Implementation
class RAGEvaluator:
    def evaluate(self, test_cases):
        results = []
        for case in test_cases:
            # Run RAG
            response = self.rag.query(case.question)

            # Evaluate retrieval
            retrieved_ids = [r.id for r in response.sources]
            retrieval_recall = len(
                set(retrieved_ids) & set(case.relevant_doc_ids)
            ) / len(case.relevant_doc_ids)

            # Evaluate faithfulness (using LLM)
            faithfulness = self.evaluate_faithfulness(
                response.answer,
                [r.text for r in response.sources]
            )

            # Evaluate relevance
            relevance = self.evaluate_relevance(
                case.question,
                response.answer
            )

            results.append({
                "question": case.question,
                "retrieval_recall": retrieval_recall,
                "faithfulness": faithfulness,
                "relevance": relevance
            })
        return results
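evaluate_faithfulness and evaluate_relevance are referenced above but not shown. A common approach is an LLM judge; a minimal sketch, assuming the evaluator also holds an LLM client (self.llm) and that the model returns a bare number as instructed:

# Judge methods intended to live on RAGEvaluator; self.llm is assumed here
def evaluate_faithfulness(self, answer, source_texts):
    """Ask an LLM judge whether the answer is supported by the retrieved context."""
    context = "\n\n".join(source_texts)
    prompt = f"""Rate from 0 to 1 how well the answer is supported by the context.
1 means every claim is supported, 0 means none are. Return only the number.

Context:
{context}

Answer:
{answer}

Score:"""
    return float(self.llm.generate(prompt).strip())

def evaluate_relevance(self, question, answer):
    """Ask an LLM judge whether the answer addresses the question."""
    prompt = f"""Rate from 0 to 1 how directly the answer addresses the question.
Return only the number.

Question: {question}

Answer: {answer}

Score:"""
    return float(self.llm.generate(prompt).strip())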
Key Takeaways
- RAG grounds LLMs in real data, reducing hallucination
- Basic RAG: embed → retrieve → augment → generate
- Query transformation improves retrieval (expansion, HyDE)
- Reranking with cross-encoders improves relevance
- Contextual compression reduces token usage
- Multi-index RAG handles diverse data sources
- Evaluate both retrieval and generation quality
- Chunk size and overlap significantly affect results (see the sketch after this list)
- Monitor and iterate based on real queries
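Chunking itself isn't covered above, so here is a minimal character-based sketch to make that takeaway concrete; the sizes are illustrative, and real pipelines often split on tokens, sentences, or document structure instead:

def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks before embedding and indexing."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks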
RAG is not one pattern—it’s a family of techniques. Choose based on your use case.