LLM API calls are expensive and slow. Many applications make redundant calls with similar or identical prompts. Effective caching can reduce costs by 50-80% and turn multi-second API calls into near-instant cache lookups.
Here’s how to implement caching for LLM applications.
Why Cache LLM Responses
The Economics
```yaml
caching_economics:
  scenario: "Customer support chatbot"
  daily_queries: 10000
  average_cost: "$0.02 per query"
  daily_cost: "$200"
  with_caching:
    cache_hit_rate: "60%"
    cached_queries: 6000
    api_queries: 4000
    daily_cost: "$80"
    savings: "60%"
```
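The arithmetic is easy to sanity-check. A few lines of Python reproduce the numbers above (the figures are illustrative, not benchmarks):

```python
daily_queries = 10_000
cost_per_query = 0.02   # dollars
cache_hit_rate = 0.60

baseline_cost = daily_queries * cost_per_query                        # $200/day
cached_cost = daily_queries * (1 - cache_hit_rate) * cost_per_query   # $80/day
savings = 1 - cached_cost / baseline_cost                             # 0.60 -> 60%

print(f"baseline ${baseline_cost:.0f}/day, with cache ${cached_cost:.0f}/day, savings {savings:.0%}")
```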
Cache Types
```yaml
cache_types:
  exact_match:
    description: "Identical prompts return cached response"
    hit_rate: "10-30%"
    complexity: "Low"
  semantic_cache:
    description: "Similar prompts return cached response"
    hit_rate: "40-70%"
    complexity: "Medium"
  prefix_cache:
    description: "Cache common prompt prefixes"
    hit_rate: "Depends on structure"
    complexity: "Low"
```
Implementation Patterns
Exact Match Cache
```python
import hashlib
import json
from typing import Optional

from redis.asyncio import Redis  # async client; the sync redis.Redis cannot be awaited


class ExactMatchCache:
    def __init__(self, redis: Redis, ttl: int = 3600):
        self.redis = redis
        self.ttl = ttl

    def _hash_request(self, messages: list, model: str) -> str:
        # Stable key: the same messages + model always hash to the same digest
        content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    async def get(self, messages: list, model: str) -> Optional[str]:
        key = self._hash_request(messages, model)
        cached = await self.redis.get(f"llm:exact:{key}")
        if cached:
            return json.loads(cached)
        return None

    async def set(self, messages: list, model: str, response: str):
        key = self._hash_request(messages, model)
        await self.redis.setex(
            f"llm:exact:{key}",
            self.ttl,
            json.dumps(response),
        )
```
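To put the cache in front of an actual API call, check it before the request and populate it afterward. A minimal sketch using the OpenAI Python SDK; `cached_completion` is an illustrative wrapper, not part of the cache class:

```python
from openai import AsyncOpenAI

llm = AsyncOpenAI()

async def cached_completion(cache: ExactMatchCache, messages: list, model: str) -> str:
    cached = await cache.get(messages, model)
    if cached is not None:
        return cached  # cache hit: no API call, no cost

    completion = await llm.chat.completions.create(model=model, messages=messages)
    response = completion.choices[0].message.content

    await cache.set(messages, model, response)
    return response
```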
Semantic Cache
```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CacheResult:
    response: str
    similarity: float


class SemanticCache:
    def __init__(
        self,
        embedding_model,
        vector_store,
        similarity_threshold: float = 0.95,
    ):
        self.embedder = embedding_model
        self.store = vector_store
        self.threshold = similarity_threshold

    def _messages_to_text(self, messages: list) -> str:
        # Flatten the conversation into a single string for embedding
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

    async def get(self, messages: list) -> Optional[CacheResult]:
        query_text = self._messages_to_text(messages)
        embedding = await self.embedder.embed(query_text)

        results = await self.store.search(
            embedding=embedding,
            top_k=1,
            filter={"type": "llm_cache"},
        )

        # Only accept the nearest neighbor if it is similar enough
        if results and results[0].score >= self.threshold:
            return CacheResult(
                response=results[0].metadata["response"],
                similarity=results[0].score,
            )
        return None

    async def set(self, messages: list, response: str, ttl: int = 3600):
        query_text = self._messages_to_text(messages)
        embedding = await self.embedder.embed(query_text)

        await self.store.insert(
            embedding=embedding,
            metadata={
                "type": "llm_cache",
                "messages": messages,
                "response": response,
                "expires_at": datetime.utcnow() + timedelta(seconds=ttl),
            },
        )
```
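The two caches compose naturally: check the cheap exact-match cache first, fall back to the semantic cache, and only call the model on a full miss. A sketch assuming the classes above and a hypothetical `call_llm` coroutine:

```python
async def two_tier_lookup(exact: ExactMatchCache, semantic: SemanticCache,
                          messages: list, model: str) -> str:
    # 1. Cheapest check: identical prompt seen before?
    hit = await exact.get(messages, model)
    if hit is not None:
        return hit

    # 2. Semantically similar prompt seen before?
    semantic_hit = await semantic.get(messages)
    if semantic_hit is not None:
        return semantic_hit.response

    # 3. Full miss: call the model and populate both caches
    response = await call_llm(messages, model)  # hypothetical LLM call
    await exact.set(messages, model, response)
    await semantic.set(messages, response)
    return response
```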
Anthropic Prompt Caching
```python
# Anthropic's native prompt caching
import anthropic

client = anthropic.Anthropic()

user_query = "How do I upgrade my plan?"  # example user input

# Mark static content for caching
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with extensive knowledge...",
            "cache_control": {"type": "ephemeral"},  # cache this block
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)

# Subsequent calls with the same system prompt reuse the cached prefix,
# saving up to 90% on input tokens for the cached portion.
```
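The usage block on the response reports cache activity, which is an easy way to confirm the prefix is actually being reused (field names as of recent SDK versions; `getattr` guards against older ones):

```python
usage = response.usage
print("tokens written to cache:", getattr(usage, "cache_creation_input_tokens", 0))
print("tokens read from cache: ", getattr(usage, "cache_read_input_tokens", 0))
print("uncached input tokens:  ", usage.input_tokens)
```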
Cache Invalidation
```yaml
invalidation_strategies:
  time_based:
    approach: "TTL on cached entries"
    use_when: "Data freshness has time bounds"
  event_based:
    approach: "Invalidate on data changes"
    use_when: "Cache depends on mutable data"
  version_based:
    approach: "Include version in cache key"
    use_when: "Prompt or model changes"
```
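Version-based invalidation is often the cheapest to implement: fold a version identifier into the cache key so that bumping it silently retires all stale entries. A sketch that overrides `_hash_request` from the `ExactMatchCache` above (`PROMPT_VERSION` and the subclass name are illustrative):

```python
PROMPT_VERSION = "v3"  # bump whenever the prompt template or model changes

class VersionedExactMatchCache(ExactMatchCache):
    def _hash_request(self, messages: list, model: str) -> str:
        # Including the version means old entries simply stop matching
        content = json.dumps(
            {"messages": messages, "model": model, "version": PROMPT_VERSION},
            sort_keys=True,
        )
        return hashlib.sha256(content.encode()).hexdigest()
```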
Key Takeaways
- Caching can reduce LLM costs by 50-80% in typical applications
- Start with exact match caching (simple, effective)
- Add semantic caching for higher hit rates
- Use provider features like Anthropic prompt caching
- Cache invalidation strategy matters
- Monitor cache hit rates and optimize (a minimal counter is sketched below)
- Consider response freshness requirements
- Test that cached responses remain appropriate
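As a starting point for the monitoring bullet above, a minimal hit-rate counter (illustrative; any metrics library works just as well):

```python
class CacheMetrics:
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```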
Caching is low-hanging fruit. Implement it early.