RAG Architecture Patterns for Production

April 17, 2023

Retrieval-Augmented Generation (RAG) has become the standard pattern for building LLM applications that need accurate, up-to-date information. Instead of relying solely on the model’s training data, RAG retrieves relevant context before generating responses.

Here are production-ready RAG architecture patterns.

Why RAG

The Problem

llm_limitations:
  knowledge_cutoff:
    - Training data has end date
    - Can't know recent information
    - Stale facts and figures

  hallucination:
    - Generates plausible but false info
    - Confident about incorrect statements
    - No way to verify internally

  no_private_data:
    - Only knows public training data
    - Can't access your documents
    - Can't use proprietary information

RAG Solution

rag_approach:
  retrieval:
    - Find relevant documents
    - Based on query similarity
    - From your data sources

  augmentation:
    - Add retrieved context to prompt
    - Ground the response in real data
    - Provide source attribution

  generation:
    - LLM generates using context
    - Answers based on your data
    - Can cite sources

Basic RAG Architecture

Components

┌──────────────────────────────────────────────────────────────┐
│                         RAG Pipeline                          │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  Query ──► Embed ──► Retrieve ──► Augment ──► Generate        │
│              │          │            │           │            │
│         ┌────▼────┐ ┌───▼───┐    ┌───▼───┐   ┌───▼────┐       │
│         │Embedding│ │Vector │    │Prompt │   │  LLM   │       │
│         │ Model   │ │  DB   │    │Builder│   │        │       │
│         └─────────┘ └───────┘    └───────┘   └────────┘       │
│                                                               │
└──────────────────────────────────────────────────────────────┘

Basic Implementation

class BasicRAG:
    def __init__(self, embedding_model, vector_db, llm):
        self.embedding_model = embedding_model
        self.vector_db = vector_db
        self.llm = llm

    def query(self, question, k=5):
        # 1. Embed the query
        query_embedding = self.embedding_model.encode(question)

        # 2. Retrieve relevant documents
        results = self.vector_db.query(query_embedding, top_k=k)

        # 3. Build context from retrieved documents
        context = "\n\n".join([r.text for r in results])

        # 4. Augment prompt with context
        prompt = f"""Answer the question based on the following context.
If the answer isn't in the context, say "I don't have information about that."

Context:
{context}

Question: {question}

Answer:"""

        # 5. Generate response
        response = self.llm.generate(prompt)

        return {
            "answer": response,
            "sources": [r.metadata for r in results]
        }
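
A minimal wiring sketch for the class above. The embedding model uses sentence-transformers; MyVectorStore and MyLLMClient are hypothetical stand-ins for whatever vector database and LLM clients you use, assumed to expose the query() and generate() methods that BasicRAG calls.

from sentence_transformers import SentenceTransformer

# Hypothetical adapters -- not a specific library's API
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
vector_db = MyVectorStore(collection="docs")   # assumed: .query(embedding, top_k) -> results with .text/.metadata
llm = MyLLMClient(model="gpt-3.5-turbo")       # assumed: .generate(prompt) -> str

rag = BasicRAG(embedding_model, vector_db, llm)
result = rag.query("What is our refund policy?", k=5)
print(result["answer"])
for source in result["sources"]:
    print("-", source)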

Advanced Patterns

Query Transformation

class QueryTransformer:
    """Transform queries for better retrieval."""

    def __init__(self, llm):
        self.llm = llm

    def expand_query(self, query):
        """Generate multiple search queries."""
        prompt = f"""Generate 3 different search queries to find information for this question.
Return only the queries, one per line.

Question: {query}

Search queries:"""

        response = self.llm.generate(prompt)
        queries = [q.strip() for q in response.split('\n') if q.strip()]
        return queries

    def hypothetical_answer(self, query):
        """Generate hypothetical answer for better embedding (HyDE)."""
        prompt = f"""Write a short paragraph that would be a perfect answer to this question.
Don't worry about accuracy, just match the expected writing style and content.

Question: {query}

Hypothetical answer:"""

        return self.llm.generate(prompt)

class AdvancedRAG:
    def __init__(self, embedding_model, vector_db, llm):
        self.embedding_model = embedding_model
        self.vector_db = vector_db
        self.llm = llm
        self.query_transformer = QueryTransformer(llm)

    def query(self, question, k=5):
        # Method 1: Query expansion
        expanded_queries = self.query_transformer.expand_query(question)
        all_results = []
        for q in expanded_queries:
            embedding = self.embedding_model.encode(q)
            results = self.vector_db.query(embedding, top_k=k)
            all_results.extend(results)

        # Deduplicate and rank
        unique_results = self.deduplicate_and_rank(all_results)

        # Method 2: HyDE
        hypothetical = self.query_transformer.hypothetical_answer(question)
        hyde_embedding = self.embedding_model.encode(hypothetical)
        hyde_results = self.vector_db.query(hyde_embedding, top_k=k)

        # Combine results
        combined = self.merge_results(unique_results, hyde_results)

        return self.generate_with_context(question, combined[:k])
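
AdvancedRAG calls deduplicate_and_rank and merge_results without defining them. A minimal sketch of both, written as standalone functions for brevity, assuming each retrieved result exposes a stable id and a similarity score attribute (both assumptions about the vector DB client):

from itertools import chain, zip_longest

def deduplicate_and_rank(results):
    """Keep the best-scoring copy of each document, then sort by score."""
    best = {}
    for r in results:
        if r.id not in best or r.score > best[r.id].score:
            best[r.id] = r
    return sorted(best.values(), key=lambda r: r.score, reverse=True)

def merge_results(primary, secondary):
    """Interleave two ranked lists, dropping duplicates by id."""
    seen, merged = set(), []
    for r in chain.from_iterable(zip_longest(primary, secondary)):
        if r is not None and r.id not in seen:
            seen.add(r.id)
            merged.append(r)
    return merged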

Reranking

from sentence_transformers import CrossEncoder

class RerankedRAG:
    def __init__(self, embedding_model, vector_db):
        self.embedding_model = embedding_model
        self.vector_db = vector_db
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def query(self, question, initial_k=20, final_k=5):
        # Initial retrieval (get more candidates)
        query_embedding = self.embedding_model.encode(question)
        candidates = self.vector_db.query(query_embedding, top_k=initial_k)

        # Rerank with cross-encoder
        pairs = [[question, c.text] for c in candidates]
        scores = self.reranker.predict(pairs)

        # Sort by reranking score
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )

        # Return top results
        top_results = [r[0] for r in ranked[:final_k]]
        return self.generate_with_context(question, top_results)

Contextual Compression

class ContextCompressor:
    """Extract only relevant parts from retrieved documents."""

    def __init__(self, llm):
        self.llm = llm

    def compress(self, question, document):
        prompt = f"""Extract only the sentences from this document that are relevant to answering the question.
If nothing is relevant, respond with "NOT_RELEVANT".

Question: {question}

Document:
{document}

Relevant sentences:"""

        response = self.llm.generate(prompt)
        if "NOT_RELEVANT" in response:
            return None
        return response

class CompressedRAG:
    def __init__(self, llm):
        self.compressor = ContextCompressor(llm)

    def query(self, question, k=10):
        # Retrieve more documents than the final context will use
        results = self.retrieve(question, k=k)

        # Compress each document
        compressed = []
        for doc in results:
            relevant = self.compressor.compress(question, doc.text)
            if relevant:
                compressed.append({
                    "text": relevant,
                    "source": doc.metadata
                })

        # Use compressed context (fewer tokens)
        return self.generate_with_context(question, compressed[:5])

Multi-Index RAG

class MultiIndexRAG:
    """Query multiple specialized indexes."""

    def __init__(self):
        self.indexes = {
            "documentation": DocumentationIndex(),
            "code": CodeIndex(),
            "support_tickets": SupportIndex(),
        }
        self.router = QueryRouter()

    def query(self, question, k=5):
        # Route query to appropriate indexes
        relevant_indexes = self.router.route(question)

        all_results = []
        for index_name in relevant_indexes:
            index = self.indexes[index_name]
            results = index.query(question, k=k)
            for r in results:
                r.source_index = index_name
            all_results.extend(results)

        # Rank across all indexes
        ranked = self.cross_index_ranking(all_results)

        return self.generate_with_context(question, ranked[:k])

class QueryRouter:
    def route(self, question):
        """Determine which indexes to query."""
        # Simple keyword-based routing
        if "code" in question.lower() or "function" in question.lower():
            return ["code", "documentation"]
        elif "error" in question.lower() or "bug" in question.lower():
            return ["support_tickets", "documentation"]
        else:
            return ["documentation"]

Evaluation

RAG Metrics

rag_evaluation:
  retrieval_metrics:
    recall:
      what: Did we retrieve the relevant documents?
      measure: Relevant retrieved / Total relevant

    precision:
      what: Were retrieved documents relevant?
      measure: Relevant retrieved / Total retrieved

    mrr:
      what: How high the first relevant document ranks
      measure: Mean of 1 / rank of first relevant document

  generation_metrics:
    faithfulness:
      what: Is the answer supported by the retrieved context?
      measure: Claims supported by context / Total claims in answer

    relevance:
      what: Does answer address the question?
      measure: Human evaluation or LLM judge

    citation_accuracy:
      what: Are sources correctly cited?
      measure: Verifiable citations / Total citations
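
The retrieval metrics above are simple to compute once each test query has labeled relevant documents. A minimal sketch for a single query (the function name and signature are illustrative); MRR is the mean of the reciprocal rank across all test queries.

def retrieval_metrics(retrieved_ids, relevant_ids):
    """Recall, precision, and reciprocal rank for one query."""
    relevant = set(relevant_ids)
    hits = [doc_id for doc_id in retrieved_ids if doc_id in relevant]

    recall = len(set(hits)) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0

    # Reciprocal rank of the first relevant document (1-based rank)
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1 / rank
            break

    return {"recall": recall, "precision": precision, "mrr": reciprocal_rank}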

Evaluation Implementation

class RAGEvaluator:
    def __init__(self, rag):
        self.rag = rag

    def evaluate(self, test_cases):
        results = []

        for case in test_cases:
            # Run RAG
            response = self.rag.query(case.question)

            # Evaluate retrieval
            retrieved_ids = [r.id for r in response.sources]
            retrieval_recall = len(
                set(retrieved_ids) & set(case.relevant_doc_ids)
            ) / len(case.relevant_doc_ids)

            # Evaluate faithfulness (using LLM)
            faithfulness = self.evaluate_faithfulness(
                response.answer,
                [r.text for r in response.sources]
            )

            # Evaluate relevance
            relevance = self.evaluate_relevance(
                case.question,
                response.answer
            )

            results.append({
                "question": case.question,
                "retrieval_recall": retrieval_recall,
                "faithfulness": faithfulness,
                "relevance": relevance
            })

        return results
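
The evaluator above leaves evaluate_faithfulness and evaluate_relevance undefined; a common approach is an LLM judge. A minimal sketch of the faithfulness check as a standalone function, assuming judge_llm is any client with the same generate() interface used throughout this post (evaluate_relevance follows the same pattern with a different prompt):

def evaluate_faithfulness(judge_llm, answer, context_texts):
    """Ask an LLM judge what fraction of the answer's claims the context supports."""
    context = "\n\n".join(context_texts)
    prompt = f"""List each factual claim in the answer, marking it SUPPORTED or UNSUPPORTED
based only on the context. End with a line "SCORE: <supported>/<total>".

Context:
{context}

Answer:
{answer}
"""
    response = judge_llm.generate(prompt)

    # Parse the final "SCORE: x/y" line; fall back to 0.0 if the judge misformats it
    for line in reversed(response.splitlines()):
        if line.strip().startswith("SCORE:"):
            supported, total = line.split(":", 1)[1].strip().split("/")
            return int(supported) / int(total) if int(total) else 0.0
    return 0.0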

Key Takeaways

RAG is not a single pattern; it’s a family of techniques. Start with the basic retrieve-augment-generate pipeline, then add query transformation, reranking, contextual compression, or multiple indexes where retrieval quality falls short, and validate each change with retrieval and generation metrics. Choose based on your use case.