The AI conversation often focuses on the biggest models, but small models are having a moment. GPT-4o mini, Claude 3 Haiku, Llama 3 8B, and Phi-3 deliver impressive capability at a fraction of the cost and latency of frontier models. For many production use cases, small is the right choice.
Here’s when and how to use small models effectively.
The Small Model Landscape
Current Options
small_models_2024:
  cloud_apis:
    gpt4o_mini:
      provider: OpenAI
      cost: "$0.15 / $0.60 per 1M tokens (input / output)"
      speed: "Very fast"
      quality: "Surprisingly good"
    claude_3_haiku:
      provider: Anthropic
      cost: "$0.25 / $1.25 per 1M tokens (input / output)"
      speed: "Fastest Claude"
      quality: "Good for most tasks"
  open_source:
    llama_3_8b:
      size: "8B parameters"
      self_hosted: "4-8GB VRAM"
      quality: "Strong for size"
    phi_3:
      size: "3.8B parameters"
      self_hosted: "2-4GB VRAM"
      quality: "Impressive reasoning"
    mistral_7b:
      size: "7B parameters"
      self_hosted: "4-6GB VRAM"
      quality: "Great efficiency"
Cost Comparison
cost_analysis:
  example: "1 million requests, ~1K input + ~1K output tokens each"
  gpt4_turbo:
    cost: "$40,000"
    latency: "Slower"
  gpt4o_mini:
    cost: "$750"
    latency: "Fast"
    savings: "~98% vs. GPT-4 Turbo"
  self_hosted_llama:
    cost: "~$500 (compute)"
    latency: "Fast"
    note: "Plus infrastructure and operational overhead"
When to Use Small Models
Good Fit
small_model_use_cases:
  classification:
    examples: [sentiment, intent, category]
    why: "Well-defined output space"
  extraction:
    examples: [entities, structured data, key facts]
    why: "Focused task, clear format (see the sketch after this list)"
  simple_generation:
    examples: [summaries, short responses, formatting]
    why: "Limited creativity needed"
  preprocessing:
    examples: [routing, filtering, validation]
    why: "Speed matters, errors recoverable"
  high_volume:
    examples: [batch processing, real-time, edge]
    why: "Cost and latency constraints"
Poor Fit
avoid_small_models_for:
  complex_reasoning:
    - Multi-step logical problems
    - Novel problem solving
    - Expert-level analysis
  nuanced_generation:
    - Long-form creative writing
    - Highly technical content
    - Subtle tone requirements
  safety_critical:
    - Medical advice
    - Legal analysis
    - Financial recommendations
Optimization Techniques
Prompt Engineering for Small Models
# Small models need more explicit prompts
# Bad: Too vague for small model
bad_prompt = "Categorize this customer message."
# Good: Explicit instructions and format
good_prompt = """Categorize this customer message into exactly ONE category.
Categories:
- billing: Payment, charges, invoices
- technical: Bugs, errors, how-to questions
- account: Login, profile, settings
- other: Anything else
Message: {message}
Respond with only the category name, nothing else.
Category:"""
# Even better: Few-shot examples
few_shot_prompt = """Categorize customer messages.
Message: "I can't log into my account"
Category: account
Message: "Why was I charged twice?"
Category: billing
Message: "The app crashes when I upload photos"
Category: technical
Message: "{message}"
Category:"""
Task Decomposition
from dataclasses import dataclass

@dataclass
class Analysis:
    facts: str
    topics: str
    summary: str
    detailed: str

class SmallModelPipeline:
    """Break complex tasks into small-model-friendly steps."""

    def __init__(self, small_model, large_model):
        self.small = small_model  # e.g. GPT-4o mini, Haiku
        self.large = large_model  # e.g. GPT-4o, Sonnet

    async def process_document(self, document: str) -> Analysis:
        # Step 1: Small model extracts key facts
        facts = await self.small.generate(
            prompt=f"Extract key facts from this document:\n{document}\n\nFacts:"
        )

        # Step 2: Small model identifies topics
        topics = await self.small.generate(
            prompt=f"List the main topics in this document:\n{document}\n\nTopics:"
        )

        # Step 3: Small model generates summary
        summary = await self.small.generate(
            prompt=f"Summarize in 2-3 sentences:\n{document}\n\nSummary:"
        )

        # Step 4: Large model for complex analysis (only if needed)
        if self._needs_deep_analysis(document):
            analysis = await self.large.generate(
                prompt=f"""Analyze this document:

Facts: {facts}
Topics: {topics}
Summary: {summary}

Document: {document}

Provide detailed analysis:"""
            )
        else:
            analysis = summary

        return Analysis(
            facts=facts,
            topics=topics,
            summary=summary,
            detailed=analysis,
        )

    def _needs_deep_analysis(self, document: str) -> bool:
        # Placeholder heuristic; substitute domain-specific routing logic here
        return len(document) > 5000
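Wiring the pipeline up needs two clients that expose an async generate(prompt=...) method. A hypothetical wrapper, sketched against the openai Python SDK (v1+); nothing here is part of any official API:

import asyncio
from openai import AsyncOpenAI

class OpenAIChat:
    """Thin async wrapper exposing the generate(prompt=...) interface used above."""

    def __init__(self, model: str):
        self.client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set
        self.model = model

    async def generate(self, prompt: str) -> str:
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    pipeline = SmallModelPipeline(
        small_model=OpenAIChat("gpt-4o-mini"),
        large_model=OpenAIChat("gpt-4o"),
    )
    result = await pipeline.process_document(open("report.txt").read())  # placeholder file
    print(result.summary)

asyncio.run(main())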
Routing Pattern
class ModelRouter:
    """Route requests to the appropriate model size."""

    def __init__(self, small_model, medium_model, large_model):
        self.small_model = small_model
        self.medium_model = medium_model
        self.large_model = large_model

    async def route(self, request: Request) -> Response:
        complexity = await self._assess_complexity(request)
        if complexity == "simple":
            return await self.small_model.generate(request)
        elif complexity == "medium":
            return await self.medium_model.generate(request)
        else:
            return await self.large_model.generate(request)

    async def _assess_complexity(self, request: Request) -> str:
        # Use the small model itself to assess complexity: cheap and fast
        assessment = await self.small_model.generate(
            prompt=f"""Rate this request's complexity as one word: simple, medium, or complex.

Request: {request.content[:500]}

Complexity:"""
        )
        complexity = assessment.strip().lower()
        # If the label isn't one we recognize, escalate to the large model
        return complexity if complexity in {"simple", "medium", "complex"} else "complex"
Self-Hosting Considerations
When to Self-Host
self_hosting_decision:
  consider_self_hosting:
    - Data privacy requirements
    - Predictable high volume
    - Latency-sensitive applications
    - Cost optimization at scale
    - Offline/air-gapped needs
  stick_with_apis:
    - Variable volume
    - Limited ML ops expertise
    - Rapid iteration needed
    - Don't want the infrastructure burden
Deployment Options
self_hosting_options:
  ollama:
    ease: "Very easy"
    use_case: "Development, small scale"
    command: "ollama run llama3:8b"
  vllm:
    ease: "Moderate"
    use_case: "Production, high throughput"
    features: "Continuous batching, PagedAttention"
  tgi:
    ease: "Moderate"
    use_case: "Production, Hugging Face ecosystem"
    features: "Optimized inference, streaming"
Key Takeaways
- Small models handle most production tasks well
- 98% cost savings possible vs. large models
- Use explicit prompts with examples
- Decompose complex tasks into small-model steps
- Route based on task complexity
- Self-host for privacy or extreme scale
- Test quality carefully before switching
- Small models are getting better fast
- Default to small, escalate when needed
- The right model is the smallest that works
Small models are the pragmatic choice. Use them.