LLM Integration Patterns for Applications

January 23, 2023

Large Language Models (LLMs) unlock capabilities that simply weren’t available before. But integrating them into applications isn’t as simple as calling an API: non-determinism, high latency, per-token cost, and quality concerns all demand specific patterns.

Here are integration patterns that work in production.

Understanding LLM Characteristics

Unique Properties

llm_characteristics:
  non_deterministic:
    - Same input can produce different outputs
    - Temperature controls variability
    - Affects testing and validation

  high_latency:
    - Seconds, not milliseconds
    - Streaming helps perception (sketch below)
    - Impacts UX design

  token_based_cost:
    - Pay per token (input + output)
    - Longer context = higher cost
    - Optimization matters

  context_limited:
    - Fixed context window
    - Must fit prompt + response
    - Requires summarization strategies

  hallucination_prone:
    - Confidently generates false information
    - Requires validation
    - Ground with real data
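
These properties shape every pattern below. Streaming deserves a concrete illustration: it doesn’t make generation faster, it just lets users watch tokens arrive instead of staring at a spinner. A minimal sketch using the OpenAI Python SDK of this era (the model choice is illustrative):

import openai

def stream_response(prompt):
    # Print tokens as they arrive; perceived latency drops even though
    # total generation time is unchanged
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in response:
        delta = chunk.choices[0].delta
        if "content" in delta:
            print(delta["content"], end="", flush=True)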

Pattern: Prompt Templates

Structured Prompting

from string import Template

SUMMARIZE_TEMPLATE = Template("""
You are a technical writer creating concise summaries.

Summarize the following document in exactly ${num_sentences} sentences.
Focus on: ${focus_areas}

Document:
${document}

Summary:
""")

def summarize(document, num_sentences=3, focus_areas="key findings and conclusions"):
    prompt = SUMMARIZE_TEMPLATE.substitute(
        document=document,
        num_sentences=num_sentences,
        focus_areas=focus_areas
    )
    return llm.generate(prompt)
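
The snippets in this post assume a thin llm client with a generate(prompt, temperature=...) method rather than any particular SDK. One possible minimal wrapper over the OpenAI chat API, shown purely for concreteness:

import openai

class SimpleLLM:
    """Minimal stand-in for the llm object used throughout this post."""

    def generate(self, prompt, temperature=0.7):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return response.choices[0].message.content

llm = SimpleLLM()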

Few-Shot Prompting

FEW_SHOT_TEMPLATE = """
Classify the following support ticket into a category.

Categories: billing, technical, account, other

Examples:
Ticket: "I can't log into my account"
Category: account

Ticket: "My invoice shows the wrong amount"
Category: billing

Ticket: "The API returns 500 errors"
Category: technical

Ticket: "${ticket_text}"
Category:"""

def classify_ticket(ticket_text):
    prompt = Template(FEW_SHOT_TEMPLATE).substitute(ticket_text=ticket_text)
    response = llm.generate(prompt, temperature=0)
    return response.strip().lower()
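
Even at temperature 0, nothing forces the model to stay inside the taxonomy, so validate the label before routing on it. A hypothetical guard:

VALID_CATEGORIES = {"billing", "technical", "account", "other"}

def classify_ticket_safe(ticket_text):
    category = classify_ticket(ticket_text)
    # Fall back to "other" if the model invents a label outside the set
    return category if category in VALID_CATEGORIES else "other"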

Pattern: Chain of Thought

Breaking Down Complex Tasks

ANALYSIS_CHAIN = """
Analyze this code change for potential issues.

Step 1: Understand what the code does
Step 2: Identify potential bugs
Step 3: Check for security issues
Step 4: Assess performance implications
Step 5: Provide recommendations

Code change:
${code_diff}

Let's analyze step by step:

Step 1 - What the code does:
"""

def analyze_code_change(code_diff):
    prompt = Template(ANALYSIS_CHAIN).substitute(code_diff=code_diff)
    # Using chain of thought improves reasoning quality
    return llm.generate(prompt, temperature=0.3)
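
Downstream code usually only needs the final step. If the model follows the numbered format, the recommendations can be pulled out with a simple (admittedly brittle) parse; extract_recommendations is a hypothetical helper:

import re

def extract_recommendations(analysis):
    # Grab everything after the "Step 5" header; fall back to the full
    # analysis if the model ignored the requested format
    match = re.search(r"Step 5[^\n]*\n(.*)", analysis, re.DOTALL)
    return match.group(1).strip() if match else analysis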

Pattern: Retrieval Augmented Generation (RAG)

Grounding with Real Data

from sentence_transformers import SentenceTransformer
import faiss

class RAGSystem:
    def __init__(self, documents):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.documents = documents
        self._build_index()

    def _build_index(self):
        # Embed every document once and store the vectors in a flat L2 index
        embeddings = self.encoder.encode(self.documents)
        self.index = faiss.IndexFlatL2(embeddings.shape[1])
        self.index.add(embeddings)

    def retrieve(self, query, k=3):
        # Embed the query and return the k nearest documents
        query_embedding = self.encoder.encode([query])
        distances, indices = self.index.search(query_embedding, k)
        return [self.documents[i] for i in indices[0]]

    def generate(self, query):
        # Retrieve relevant documents
        context_docs = self.retrieve(query)
        context = "\n\n".join(context_docs)

        prompt = f"""Answer the question based on the following context.
If the answer isn't in the context, say "I don't have information about that."

Context:
{context}

Question: {query}

Answer:"""

        return llm.generate(prompt, temperature=0)
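
Usage, with illustrative documents:

docs = [
    "Deploys go out every Tuesday after the 10am standup.",
    "The on-call rotation changes every Monday.",
    "Production access requires a hardware security key.",
]
rag = RAGSystem(docs)
print(rag.generate("When do deploys happen?"))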

RAG Architecture

rag_components:
  document_processing:
    - Chunking (overlap for context; sketch after this list)
    - Embedding generation
    - Metadata extraction
    - Index storage

  retrieval:
    - Query embedding
    - Similarity search
    - Re-ranking (optional)
    - Filtering (metadata)

  generation:
    - Context injection
    - Prompt construction
    - LLM call
    - Citation extraction
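
Chunking is the step most often glossed over. A minimal character-based sketch with overlap (the sizes are illustrative; production systems usually split on sentence or token boundaries):

def chunk_document(text, chunk_size=1000, overlap=200):
    # Overlapping windows ensure content near a boundary appears
    # intact in at least one chunk
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks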

Pattern: Function Calling

Structured Outputs

import json
import openai

FUNCTION_SCHEMA = {
    "name": "create_calendar_event",
    "description": "Create a calendar event from natural language",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Event title"},
            "start_time": {"type": "string", "description": "ISO datetime"},
            "duration_minutes": {"type": "integer", "description": "Duration"},
            "attendees": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["title", "start_time"]
    }
}

def parse_calendar_request(user_input):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_input}],
        functions=[FUNCTION_SCHEMA],
        function_call={"name": "create_calendar_event"}
    )

    function_call = response.choices[0].message.function_call
    return json.loads(function_call.arguments)

# Usage
event = parse_calendar_request("Schedule a team meeting next Tuesday at 2pm for 1 hour")
# Returns: {"title": "Team meeting", "start_time": "2023-01-31T14:00:00", "duration_minutes": 60}
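
The arguments field is model-generated JSON, so treat it as untrusted input: it can be malformed or missing required fields despite the schema. A hypothetical defensive wrapper:

def parse_calendar_request_safe(user_input):
    try:
        args = parse_calendar_request(user_input)
    except json.JSONDecodeError:
        return None  # Model produced malformed JSON
    required = FUNCTION_SCHEMA["parameters"]["required"]
    if not all(field in args for field in required):
        return None  # Schema marked these required, but the model skipped one
    return args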

Pattern: Agents

Tool-Using LLMs

class Agent:
    def __init__(self, tools):
        self.tools = {tool.name: tool for tool in tools}
        self.tool_descriptions = self._format_tools()

    def _format_tools(self):
        return "\n".join([
            f"- {name}: {tool.description}"
            for name, tool in self.tools.items()
        ])

    def run(self, task, max_iterations=5):
        context = []

        for _ in range(max_iterations):
            prompt = self._build_prompt(task, context)
            response = llm.generate(prompt)

            action = self._parse_action(response)

            if action["type"] == "finish":
                return action["result"]

            if action["type"] == "tool":
                tool_result = self.tools[action["tool"]].execute(action["input"])
                context.append({"action": action, "result": tool_result})

        return "Max iterations reached"

    def _build_prompt(self, task, context):
        return f"""You are an assistant with access to tools.

Available tools:
{self.tool_descriptions}

Task: {task}

Previous actions and results:
{self._format_context(context)}

Decide the next action. Format:
THOUGHT: <reasoning>
ACTION: <tool_name> or FINISH
INPUT: <tool input> or <final answer>
"""

Pattern: Caching Strategies

Multi-Level Caching

import hashlib

class LLMCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.local_cache = {}  # In-memory cache for hot paths

    def _semantic_key(self, prompt):
        """Generate an embedding-based key for semantic caching."""
        # Assumes a sentence-embedding encoder (as in the RAG example) and a
        # vector-search helper over Redis (e.g., backed by RediSearch)
        embedding = encoder.encode(prompt)
        # Find cached prompts whose embeddings are close to this one
        similar = self.redis.search_similar(embedding, threshold=0.95)
        return similar[0] if similar else None

    def get(self, prompt, exact=True):
        # Try exact match first (fast)
        exact_key = hashlib.sha256(prompt.encode()).hexdigest()
        if exact_key in self.local_cache:
            return self.local_cache[exact_key]

        cached = self.redis.get(f"llm:exact:{exact_key}")
        if cached:
            return cached

        # Try semantic match (slower but catches paraphrases)
        if not exact:
            semantic_match = self._semantic_key(prompt)
            if semantic_match:
                return semantic_match

        return None

    def set(self, prompt, response, ttl=3600):
        exact_key = hashlib.sha256(prompt.encode()).hexdigest()
        self.local_cache[exact_key] = response
        self.redis.setex(f"llm:exact:{exact_key}", ttl, response)
        # A full semantic cache would also index the prompt embedding here
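
Glue that ties the cache to generation, reusing the llm client from earlier:

def cached_generate(cache, prompt, **kwargs):
    # Exact-or-semantic lookup before paying for a model call
    cached = cache.get(prompt, exact=False)
    if cached is not None:
        return cached
    response = llm.generate(prompt, **kwargs)
    cache.set(prompt, response)
    return response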

Pattern: Fallback Chains

Graceful Degradation

from openai.error import APIError, RateLimitError

class LLMFallbackChain:
    def __init__(self):
        # Ordered from most capable to most available; the _call_* helpers
        # wrap each backend (one is sketched after the class)
        self.models = [
            ("gpt-4", self._call_gpt4),
            ("gpt-3.5-turbo", self._call_gpt35),
            ("local-model", self._call_local),
        ]

    def generate(self, prompt, **kwargs):
        last_error = None

        for model_name, model_fn in self.models:
            try:
                response = model_fn(prompt, **kwargs)
                return {"model": model_name, "content": response}
            except RateLimitError:
                continue  # Try next model
            except APIError as e:
                last_error = e
                continue
            except Exception as e:
                last_error = e
                break

        # All models failed
        return {"error": str(last_error), "fallback": self._static_fallback(prompt)}

    def _static_fallback(self, prompt):
        """Return a safe fallback response."""
        return "I'm unable to process this request right now. Please try again later."

Key Takeaways

LLM integration is a new discipline, and these patterns are its emerging best practices: template and version your prompts, ground generation in retrieved data, constrain outputs with schemas, cache aggressively, and plan a fallback for when the model fails.