The AI conversation often focuses on the biggest models, but small models are having a moment. GPT-4o mini, Claude 3 Haiku, Llama 3 8B, and Phi-3 deliver impressive capability at a fraction of the cost and latency of frontier models. For many production use cases, small is the right choice.
Here’s when and how to use small models effectively.
The Small Model Landscape
Current Options
small_models_2024:
  cloud_apis:
    gpt4o_mini:
      provider: OpenAI
      cost: "$0.15 / $0.60 per 1M tokens (input / output)"
      speed: "Very fast"
      quality: "Surprisingly good"
    claude_3_haiku:
      provider: Anthropic
      cost: "$0.25 / $1.25 per 1M tokens (input / output)"
      speed: "Fastest Claude"
      quality: "Good for most tasks"
  open_source:
    llama_3_8b:
      size: "8B parameters"
      self_hosted: "4-8GB VRAM"
      quality: "Strong for size"
    phi_3:
      size: "3.8B parameters"
      self_hosted: "2-4GB VRAM"
      quality: "Impressive reasoning"
    mistral_7b:
      size: "7B parameters"
      self_hosted: "4-6GB VRAM"
      quality: "Great efficiency"
Cost Comparison
cost_analysis:
  example: "1 million requests, ~1K input + ~1K output tokens each"
  gpt4_turbo:
    cost: "$40,000"
    latency: "Slower"
  gpt4o_mini:
    cost: "$750"
    latency: "Fast"
    savings: "~98% vs. GPT-4 Turbo"
  self_hosted_llama:
    cost: "~$500 (compute)"
    latency: "Fast"
    note: "Plus infrastructure and operational overhead"
When to Use Small Models
Good Fit
small_model_use_cases:
  classification:
    examples: [sentiment, intent, category]
    why: "Well-defined output space"
  extraction:
    examples: [entities, structured data, key facts]
    why: "Focused task, clear format (see the sketch after this list)"
  simple_generation:
    examples: [summaries, short responses, formatting]
    why: "Limited creativity needed"
  preprocessing:
    examples: [routing, filtering, validation]
    why: "Speed matters, errors recoverable"
  high_volume:
    examples: [batch processing, real-time, edge]
    why: "Cost and latency constraints"
Poor Fit
avoid_small_models_for:
  complex_reasoning:
    - Multi-step logical problems
    - Novel problem solving
    - Expert-level analysis
  nuanced_generation:
    - Long-form creative writing
    - Highly technical content
    - Subtle tone requirements
  safety_critical:
    - Medical advice
    - Legal analysis
    - Financial recommendations
Optimization Techniques
Prompt Engineering for Small Models
# Small models need more explicit prompts
# Bad: Too vague for small model
bad_prompt = "Categorize this customer message."
# Good: Explicit instructions and format
good_prompt = """Categorize this customer message into exactly ONE category.
Categories:
- billing: Payment, charges, invoices
- technical: Bugs, errors, how-to questions
- account: Login, profile, settings
- other: Anything else
Message: {message}
Respond with only the category name, nothing else.
Category:"""
# Even better: Few-shot examples
few_shot_prompt = """Categorize customer messages.
Message: "I can't log into my account"
Category: account
Message: "Why was I charged twice?"
Category: billing
Message: "The app crashes when I upload photos"
Category: technical
Message: "{message}"
Category:"""
Task Decomposition
from dataclasses import dataclass

@dataclass
class Analysis:
    facts: str
    topics: str
    summary: str
    detailed: str

class SmallModelPipeline:
    """Break complex tasks into small-model-friendly steps."""

    def __init__(self, small_model, large_model):
        self.small = small_model  # e.g. GPT-4o mini, Haiku
        self.large = large_model  # e.g. GPT-4o, Sonnet

    async def process_document(self, document: str) -> Analysis:
        # Step 1: Small model extracts key facts
        facts = await self.small.generate(
            prompt=f"Extract key facts from this document:\n{document}\n\nFacts:"
        )

        # Step 2: Small model identifies topics
        topics = await self.small.generate(
            prompt=f"List the main topics in this document:\n{document}\n\nTopics:"
        )

        # Step 3: Small model generates summary
        summary = await self.small.generate(
            prompt=f"Summarize in 2-3 sentences:\n{document}\n\nSummary:"
        )

        # Step 4: Large model for complex analysis (only if needed)
        if self._needs_deep_analysis(document):
            analysis = await self.large.generate(
                prompt=f"""Analyze this document:

Facts: {facts}
Topics: {topics}
Summary: {summary}

Document: {document}

Provide detailed analysis:"""
            )
        else:
            analysis = summary

        return Analysis(
            facts=facts,
            topics=topics,
            summary=summary,
            detailed=analysis,
        )

    def _needs_deep_analysis(self, document: str) -> bool:
        # Placeholder heuristic; substitute domain-specific routing logic here
        return len(document) > 5000
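Wiring the pipeline up needs two clients that expose an async generate(prompt=...) method. A hypothetical wrapper, sketched against the openai Python SDK (v1+); nothing here is part of any official API:

import asyncio
from openai import AsyncOpenAI

class OpenAIChat:
    """Thin async wrapper exposing the generate(prompt=...) interface used above."""

    def __init__(self, model: str):
        self.client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set
        self.model = model

    async def generate(self, prompt: str) -> str:
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    pipeline = SmallModelPipeline(
        small_model=OpenAIChat("gpt-4o-mini"),
        large_model=OpenAIChat("gpt-4o"),
    )
    result = await pipeline.process_document(open("report.txt").read())  # placeholder file
    print(result.summary)

asyncio.run(main())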
Routing Pattern
class ModelRouter:
    """Route requests to the appropriate model size."""

    def __init__(self, small_model, medium_model, large_model):
        self.small_model = small_model
        self.medium_model = medium_model
        self.large_model = large_model

    async def route(self, request: Request) -> Response:
        complexity = await self._assess_complexity(request)
        if complexity == "simple":
            return await self.small_model.generate(request)
        elif complexity == "medium":
            return await self.medium_model.generate(request)
        else:
            return await self.large_model.generate(request)

    async def _assess_complexity(self, request: Request) -> str:
        # Use the small model itself to assess complexity: cheap and fast
        assessment = await self.small_model.generate(
            prompt=f"""Rate this request's complexity as one word: simple, medium, or complex.

Request: {request.content[:500]}

Complexity:"""
        )
        complexity = assessment.strip().lower()
        # If the label isn't one we recognize, escalate to the large model
        return complexity if complexity in {"simple", "medium", "complex"} else "complex"
Self-Hosting Considerations
When to Self-Host
self_hosting_decision:
  consider_self_hosting:
    - Data privacy requirements
    - Predictable high volume
    - Latency-sensitive applications
    - Cost optimization at scale
    - Offline/air-gapped needs
  stick_with_apis:
    - Variable volume
    - Limited ML ops expertise
    - Rapid iteration needed
    - Don't want the infrastructure burden
Deployment Options
self_hosting_options:
  ollama:
    ease: "Very easy"
    use_case: "Development, small scale"
    command: "ollama run llama3:8b"
  vllm:
    ease: "Moderate"
    use_case: "Production, high throughput"
    features: "Continuous batching, PagedAttention"
  tgi:
    ease: "Moderate"
    use_case: "Production, Hugging Face ecosystem"
    features: "Optimized inference, streaming"
Key Takeaways
- Small models handle most production tasks well
- 98% cost savings possible vs. large models
- Use explicit prompts with examples
- Decompose complex tasks into small-model steps
- Route based on task complexity
- Self-host for privacy or extreme scale
- Test quality carefully before switching
- Small models are getting better fast
- Default to small, escalate when needed
- The right model is the smallest that works
Small models are the pragmatic choice. Use them.