Large Language Models (LLMs) unlock capabilities, such as summarization, classification, extraction, and free-form reasoning over natural language, that weren't practical before. But integrating them into applications isn't as simple as calling an API. Non-determinism, multi-second latency, per-token cost, and output-quality concerns all require specific patterns.
Here are integration patterns that work in production.
Understanding LLM Characteristics
Unique Properties
llm_characteristics:
  non_deterministic:
    - Same input can produce different outputs
    - Temperature controls variability
    - Affects testing and validation
  high_latency:
    - Seconds, not milliseconds
    - Streaming improves perceived responsiveness
    - Impacts UX design
  token_based_cost:
    - Pay per token (input + output)
    - Longer context = higher cost
    - Optimization matters (see the cost sketch below)
  context_limited:
    - Fixed context window
    - Prompt + response must fit within it
    - Requires summarization strategies
  hallucination_prone:
    - Confidently generates false information
    - Requires validation
    - Ground with real data
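Token-based cost is worth making concrete. Below is a minimal sketch of a cost estimator using the tiktoken tokenizer; the per-1K-token prices are placeholder values for illustration, not current list prices, so substitute your provider's actual rates.

import tiktoken

# Placeholder prices per 1K tokens (illustrative assumptions, not real rates)
PRICE_PER_1K = {"prompt": 0.0015, "completion": 0.002}

def estimate_cost(prompt, expected_completion_tokens, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    prompt_tokens = len(encoding.encode(prompt))
    return (
        prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
        + expected_completion_tokens / 1000 * PRICE_PER_1K["completion"]
    )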
Pattern: Prompt Templates
Structured Prompting
from string import Template

SUMMARIZE_TEMPLATE = Template("""
You are a technical writer creating concise summaries.

Summarize the following document in exactly ${num_sentences} sentences.
Focus on: ${focus_areas}

Document:
${document}

Summary:
""")

def summarize(document, num_sentences=3, focus_areas="key findings and conclusions"):
    prompt = SUMMARIZE_TEMPLATE.substitute(
        document=document,
        num_sentences=num_sentences,
        focus_areas=focus_areas,
    )
    return llm.generate(prompt)
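A quick usage sketch (the file name is illustrative, and llm is the placeholder client used throughout this article):

report = open("quarterly_report.txt").read()
print(summarize(report, num_sentences=2, focus_areas="revenue trends and risks"))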
Few-Shot Prompting
FEW_SHOT_TEMPLATE = """
Classify the following support ticket into a category.
Categories: billing, technical, account, other
Examples:
Ticket: "I can't log into my account"
Category: account
Ticket: "My invoice shows the wrong amount"
Category: billing
Ticket: "The API returns 500 errors"
Category: technical
Ticket: "${ticket_text}"
Category:"""
def classify_ticket(ticket_text):
    prompt = Template(FEW_SHOT_TEMPLATE).substitute(ticket_text=ticket_text)
    response = llm.generate(prompt, temperature=0)
    return response.strip().lower()
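Even at temperature 0, the model can occasionally return a label outside the expected set, so validate the output before using it. A minimal sketch (the wrapper name is mine):

VALID_CATEGORIES = {"billing", "technical", "account", "other"}

def classify_ticket_safe(ticket_text):
    category = classify_ticket(ticket_text)
    # Fall back to "other" if the model invents a label outside the allowed set
    return category if category in VALID_CATEGORIES else "other"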
Pattern: Chain of Thought
Breaking Down Complex Tasks
ANALYSIS_CHAIN = """
Analyze this code change for potential issues.
Step 1: Understand what the code does
Step 2: Identify potential bugs
Step 3: Check for security issues
Step 4: Assess performance implications
Step 5: Provide recommendations
Code change:
${code_diff}
Let's analyze step by step:
Step 1 - What the code does:
"""
def analyze_code_change(code_diff):
    prompt = Template(ANALYSIS_CHAIN).substitute(code_diff=code_diff)
    # Using chain of thought improves reasoning quality
    return llm.generate(prompt, temperature=0.3)
Pattern: Retrieval Augmented Generation (RAG)
Grounding with Real Data
from sentence_transformers import SentenceTransformer
import faiss
class RAGSystem:
    def __init__(self, documents):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.documents = documents
        self._build_index()

    def _build_index(self):
        embeddings = self.encoder.encode(self.documents)
        self.index = faiss.IndexFlatL2(embeddings.shape[1])
        self.index.add(embeddings)

    def retrieve(self, query, k=3):
        query_embedding = self.encoder.encode([query])
        distances, indices = self.index.search(query_embedding, k)
        return [self.documents[i] for i in indices[0]]

    def generate(self, query):
        # Retrieve relevant documents
        context_docs = self.retrieve(query)
        context = "\n\n".join(context_docs)

        prompt = f"""Answer the question based on the following context.
If the answer isn't in the context, say "I don't have information about that."

Context:
{context}

Question: {query}

Answer:"""
        return llm.generate(prompt, temperature=0)
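Usage is straightforward; the documents here are illustrative placeholders:

docs = [
    "Deploys run automatically when a PR is merged to main.",
    "Rollbacks are performed by redeploying the previous image tag.",
    "On-call engineers are paged for any failed production deploy.",
]
rag = RAGSystem(docs)
print(rag.generate("How do I roll back a bad deploy?"))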
RAG Architecture
rag_components:
  document_processing:
    - Chunking with overlap for context (see the sketch below)
    - Embedding generation
    - Metadata extraction
    - Index storage
  retrieval:
    - Query embedding
    - Similarity search
    - Re-ranking (optional)
    - Filtering (metadata)
  generation:
    - Context injection
    - Prompt construction
    - LLM call
    - Citation extraction
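The RAGSystem above indexes whole documents; in practice you chunk them first so retrieved passages fit the context window. A minimal sketch of word-based chunking with overlap (chunk sizes are illustrative; production systems often chunk by tokens or sentences instead):

def chunk_document(text, chunk_size=200, overlap=50):
    # Split into overlapping word windows so context isn't cut mid-thought
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks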
Pattern: Function Calling
Structured Outputs
import json
import openai  # pre-1.0 OpenAI SDK interface

FUNCTION_SCHEMA = {
    "name": "create_calendar_event",
    "description": "Create a calendar event from natural language",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Event title"},
            "start_time": {"type": "string", "description": "ISO datetime"},
            "duration_minutes": {"type": "integer", "description": "Duration"},
            "attendees": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["title", "start_time"]
    }
}

def parse_calendar_request(user_input):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_input}],
        functions=[FUNCTION_SCHEMA],
        function_call={"name": "create_calendar_event"}
    )
    function_call = response.choices[0].message.function_call
    return json.loads(function_call.arguments)

# Usage
event = parse_calendar_request("Schedule a team meeting next Tuesday at 2pm for 1 hour")
# Returns something like:
# {"title": "Team meeting", "start_time": "2023-01-31T14:00:00", "duration_minutes": 60}
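Function calling makes structured output far more reliable, but the model can still return arguments that violate the schema. One guard is to validate the parsed arguments, sketched here with the jsonschema package (my addition, not part of the example above):

from jsonschema import ValidationError, validate

def parse_calendar_request_validated(user_input):
    args = parse_calendar_request(user_input)
    try:
        # Reuse the function schema's parameter definition for validation
        validate(instance=args, schema=FUNCTION_SCHEMA["parameters"])
    except ValidationError as e:
        raise ValueError(f"Model returned invalid event arguments: {e.message}")
    return args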
Pattern: Agents
Tool-Using LLMs
class Agent:
    def __init__(self, tools):
        self.tools = {tool.name: tool for tool in tools}
        self.tool_descriptions = self._format_tools()

    def _format_tools(self):
        return "\n".join([
            f"- {name}: {tool.description}"
            for name, tool in self.tools.items()
        ])

    def run(self, task, max_iterations=5):
        context = []
        for _ in range(max_iterations):
            prompt = self._build_prompt(task, context)
            response = llm.generate(prompt)
            action = self._parse_action(response)

            if action["type"] == "finish":
                return action["result"]

            if action["type"] == "tool":
                tool_result = self.tools[action["tool"]].execute(action["input"])
                context.append({"action": action, "result": tool_result})

        return "Max iterations reached"

    def _build_prompt(self, task, context):
        return f"""You are an assistant with access to tools.

Available tools:
{self.tool_descriptions}

Task: {task}

Previous actions and results:
{self._format_context(context)}

Decide the next action. Format:
THOUGHT: <reasoning>
ACTION: <tool_name> or FINISH
INPUT: <tool input> or <final answer>
"""
Pattern: Caching Strategies
Multi-Level Caching
import hashlib

class LLMCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.local_cache = {}  # In-memory for hot paths

    def _semantic_key(self, prompt):
        """Find a cached response for a semantically similar prompt."""
        # `encoder` is a shared sentence-embedding model (e.g. the SentenceTransformer
        # from the RAG section); `search_similar` stands in for a vector-similarity
        # lookup such as a Redis vector index
        embedding = encoder.encode(prompt)
        similar = self.redis.search_similar(embedding, threshold=0.95)
        return similar[0] if similar else None

    def get(self, prompt, exact=True):
        # Try exact match first (fast)
        exact_key = hashlib.sha256(prompt.encode()).hexdigest()
        if exact_key in self.local_cache:
            return self.local_cache[exact_key]

        cached = self.redis.get(f"llm:exact:{exact_key}")
        if cached:
            return cached

        # Try semantic match (slower, but catches paraphrases)
        if not exact:
            semantic_match = self._semantic_key(prompt)
            if semantic_match:
                return semantic_match

        return None

    def set(self, prompt, response, ttl=3600):
        exact_key = hashlib.sha256(prompt.encode()).hexdigest()
        self.local_cache[exact_key] = response
        self.redis.setex(f"llm:exact:{exact_key}", ttl, response)
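Wiring the cache in front of the model is a cache-aside lookup. A minimal sketch, assuming a `redis_client` instance and the article's `llm` placeholder:

cache = LLMCache(redis_client)

def cached_generate(prompt, **kwargs):
    # Check local, exact, and semantic caches before paying for a model call
    cached = cache.get(prompt, exact=False)
    if cached is not None:
        return cached
    response = llm.generate(prompt, **kwargs)
    cache.set(prompt, response)
    return response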
Pattern: Fallback Chains
Graceful Degradation
from openai.error import APIError, RateLimitError  # pre-1.0 OpenAI SDK exceptions

class LLMFallbackChain:
    def __init__(self):
        self.models = [
            ("gpt-4", self._call_gpt4),
            ("gpt-3.5-turbo", self._call_gpt35),
            ("local-model", self._call_local),
        ]

    def generate(self, prompt, **kwargs):
        last_error = None

        for model_name, model_fn in self.models:
            try:
                response = model_fn(prompt, **kwargs)
                return {"model": model_name, "content": response}
            except RateLimitError:
                continue  # Try the next model
            except APIError as e:
                last_error = e
                continue
            except Exception as e:
                last_error = e
                break

        # All models failed
        return {"error": str(last_error), "fallback": self._static_fallback(prompt)}

    def _static_fallback(self, prompt):
        """Return a safe fallback response."""
        return "I'm unable to process this request right now. Please try again later."
Key Takeaways
- LLMs are non-deterministic—design for variability
- Use prompt templates for consistency and maintainability
- Chain of thought improves reasoning for complex tasks
- RAG grounds LLMs in real data, reducing hallucinations
- Function calling extracts structured data reliably
- Agents combine LLMs with tools for complex workflows
- Multi-level caching reduces costs and latency
- Fallback chains ensure availability
- Test with diverse inputs, not just happy paths
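The last point is worth making concrete. A sketch of a parametrized test for the ticket classifier (the cases are hypothetical; tests that hit a live model should tolerate variation or run against recorded responses):

import pytest

@pytest.mark.parametrize("ticket, expected", [
    ("I was charged twice this month", "billing"),
    ("Password reset email never arrives", "account"),
    ("The webhook times out after 30 seconds", "technical"),
    ("", "other"),  # empty input
    ("¿Pueden ayudarme con mi factura?", "billing"),  # non-English input
])
def test_classify_ticket_diverse_inputs(ticket, expected):
    assert classify_ticket_safe(ticket) == expected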
LLM integration is a new discipline. These patterns are emerging best practices.