Prompt Caching Strategies for LLM Applications

March 25, 2024

LLM API calls are expensive and slow. Many applications make redundant calls with similar or identical prompts. Effective caching can reduce costs by 50-80% and improve latency significantly.

Here’s how to implement caching for LLM applications.

Why Cache LLM Responses

The Economics

caching_economics:
  scenario: "Customer support chatbot"
  daily_queries: 10000
  average_cost: "$0.02 per query"
  daily_cost: "$200"

  with_caching:
    cache_hit_rate: "60%"
    cached_queries: 6000
    api_queries: 4000
    daily_cost: "$80"
    savings: "60%"
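
As a quick sanity check, here is the arithmetic behind those numbers (a minimal sketch; the query volume, per-query cost, and hit rate are the illustrative figures from the scenario above):

daily_queries = 10_000
cost_per_query = 0.02        # dollars per uncached API call
cache_hit_rate = 0.60

api_queries = int(daily_queries * (1 - cache_hit_rate))   # 4,000 calls still reach the API
daily_cost = api_queries * cost_per_query                 # 4,000 * $0.02 = $80
baseline_cost = daily_queries * cost_per_query            # $200 without caching
savings = 1 - daily_cost / baseline_cost                  # 0.60 -> 60%

print(f"Daily cost with caching: ${daily_cost:.0f} ({savings:.0%} savings)")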

Cache Types

cache_types:
  exact_match:
    description: "Identical prompts return cached response"
    hit_rate: "10-30%"
    complexity: "Low"

  semantic_cache:
    description: "Similar prompts return cached response"
    hit_rate: "40-70%"
    complexity: "Medium"

  prefix_cache:
    description: "Cache common prompt prefixes"
    hit_rate: "Depends on structure"
    complexity: "Low"

Implementation Patterns

Exact Match Cache

import hashlib
import json
from typing import Optional

from redis.asyncio import Redis  # async client, since get/set below are awaited

class ExactMatchCache:
    def __init__(self, redis: Redis, ttl: int = 3600):
        self.redis = redis
        self.ttl = ttl

    def _hash_request(self, messages: list, model: str) -> str:
        # Hash the full request so any change to messages or model produces a new key
        content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    async def get(self, messages: list, model: str) -> Optional[str]:
        key = self._hash_request(messages, model)
        cached = await self.redis.get(f"llm:exact:{key}")
        if cached:
            return json.loads(cached)
        return None

    async def set(self, messages: list, model: str, response: str):
        key = self._hash_request(messages, model)
        # SETEX stores the value with a TTL so stale entries expire automatically
        await self.redis.setex(
            f"llm:exact:{key}",
            self.ttl,
            json.dumps(response)
        )
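
Wiring this into a completion call is a standard cache-aside pattern: check the cache, fall back to the API on a miss, then store the result. A minimal sketch; call_model is a hypothetical placeholder for whatever provider call your application makes:

async def call_model(messages: list, model: str) -> str:
    # Placeholder: swap in your provider's API call (Anthropic, OpenAI, etc.)
    raise NotImplementedError

async def cached_completion(cache: ExactMatchCache, messages: list, model: str) -> str:
    # 1. Try the cache first
    cached = await cache.get(messages, model)
    if cached is not None:
        return cached

    # 2. Miss: call the model
    response = await call_model(messages=messages, model=model)

    # 3. Store the response for subsequent identical requests
    await cache.set(messages, model, response)
    return response

# Usage:
# cache = ExactMatchCache(Redis.from_url("redis://localhost:6379"), ttl=3600)
# answer = await cached_completion(cache, [{"role": "user", "content": "..."}], "claude-3-5-sonnet-20240620")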

Semantic Cache

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class CacheResult:
    response: str
    similarity: float

class SemanticCache:
    def __init__(
        self,
        embedding_model,          # any embedder exposing `await embed(text) -> vector`
        vector_store,             # any vector store exposing `search` and `insert`
        similarity_threshold: float = 0.95
    ):
        self.embedder = embedding_model
        self.store = vector_store
        self.threshold = similarity_threshold

    def _messages_to_text(self, messages: list) -> str:
        # Flatten the conversation into a single string for embedding
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

    async def get(self, messages: list) -> Optional[CacheResult]:
        query_text = self._messages_to_text(messages)
        embedding = await self.embedder.embed(query_text)

        # Nearest-neighbor lookup restricted to cache records
        results = await self.store.search(
            embedding=embedding,
            top_k=1,
            filter={"type": "llm_cache"}
        )

        # Only treat it as a hit if similarity clears the threshold
        if results and results[0].score >= self.threshold:
            return CacheResult(
                response=results[0].metadata["response"],
                similarity=results[0].score
            )
        return None

    async def set(self, messages: list, response: str, ttl: int = 3600):
        query_text = self._messages_to_text(messages)
        embedding = await self.embedder.embed(query_text)

        await self.store.insert(
            embedding=embedding,
            metadata={
                "type": "llm_cache",
                "messages": messages,
                "response": response,
                # Expiry is stored as metadata; the store (or a sweep job) enforces it
                "expires_at": datetime.utcnow() + timedelta(seconds=ttl)
            }
        )
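
In practice the two caches compose well: check the cheap exact-match cache first, fall back to the semantic cache, and only then call the model. A minimal sketch that reuses the classes above and the call_model placeholder from the exact-match example:

async def layered_completion(
    exact: ExactMatchCache,
    semantic: SemanticCache,
    messages: list,
    model: str
) -> str:
    # 1. Exact match: cheapest lookup, identical prompts only
    hit = await exact.get(messages, model)
    if hit is not None:
        return hit

    # 2. Semantic match: catches rephrasings of questions already answered
    semantic_hit = await semantic.get(messages)
    if semantic_hit is not None:
        return semantic_hit.response

    # 3. Full API call, then populate both caches
    response = await call_model(messages=messages, model=model)
    await exact.set(messages, model, response)
    await semantic.set(messages, response)
    return response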

Anthropic Prompt Caching

# Anthropic's native prompt caching
import anthropic

client = anthropic.Anthropic()

# Example query; in a real application this comes from the user
user_query = "How do I export my data?"

# Mark static content for caching
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with extensive knowledge...",
            "cache_control": {"type": "ephemeral"}  # Cache this
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)

# Subsequent calls with same system prompt use cached prefix
# Saves up to 90% on input tokens for cached portion
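
To confirm caching is actually kicking in, inspect the usage block on the response. A sketch, assuming the cache-related token counts Anthropic documents for prompt caching (cache_creation_input_tokens and cache_read_input_tokens):

# Assumed field names from Anthropic's prompt caching usage reporting
usage = response.usage
print(f"cache write tokens:    {usage.cache_creation_input_tokens}")
print(f"cache read tokens:     {usage.cache_read_input_tokens}")
print(f"uncached input tokens: {usage.input_tokens}")

# First call: expect cache_creation_input_tokens > 0 (prefix written to cache).
# Repeat calls with the same system prompt: expect cache_read_input_tokens > 0.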

Cache Invalidation

invalidation_strategies:
  time_based:
    approach: "TTL on cached entries"
    use_when: "Data freshness has time bounds"

  event_based:
    approach: "Invalidate on data changes"
    use_when: "Cache depends on mutable data"

  version_based:
    approach: "Include version in cache key"
    use_when: "Prompt or model changes"
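
Version-based invalidation is the simplest of the three to implement: fold a prompt/model version into the cache key, and bumping the version orphans every stale entry without any explicit deletion. A minimal sketch (the version constant is illustrative):

import hashlib
import json

PROMPT_VERSION = "2024-03-25-a"   # bump whenever the prompt template changes

def versioned_cache_key(messages: list, model: str, prompt_version: str = PROMPT_VERSION) -> str:
    # Including model and prompt version in the hashed payload means a new
    # model or a rewritten prompt automatically misses old cache entries.
    payload = json.dumps(
        {"messages": messages, "model": model, "prompt_version": prompt_version},
        sort_keys=True,
    )
    return f"llm:exact:{hashlib.sha256(payload.encode()).hexdigest()}"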

Key Takeaways

Caching is low-hanging fruit. An exact-match cache is a few dozen lines of code, a semantic cache pushes hit rates from roughly 10-30% to 40-70%, and provider-side prompt caching trims input-token costs on top of that. Implement it early, and pair every cache with an invalidation strategy so stale answers don't linger.