AI costs for similar tasks can range from cents to thousands of dollars, so understanding the real economics is essential for building sustainable AI applications. The benchmarks below come from production systems, not marketing materials, and are meant to inform your build decisions.
Cost Landscape Overview
API Pricing (October 2024)
```yaml
llm_api_pricing:
  openai:
    gpt4_turbo:
      input: "$10/1M tokens"
      output: "$30/1M tokens"
    gpt4o:
      input: "$5/1M tokens"
      output: "$15/1M tokens"
    gpt4o_mini:
      input: "$0.15/1M tokens"
      output: "$0.60/1M tokens"
  anthropic:
    claude_3_opus:
      input: "$15/1M tokens"
      output: "$75/1M tokens"
    claude_35_sonnet:
      input: "$3/1M tokens"
      output: "$15/1M tokens"
    claude_3_haiku:
      input: "$0.25/1M tokens"
      output: "$1.25/1M tokens"
  embeddings:
    openai_ada_002: "$0.10/1M tokens"
    openai_3_small: "$0.02/1M tokens"
    openai_3_large: "$0.13/1M tokens"
```
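All of the prices above reduce to one formula: cost = input_tokens × input_price/1M + output_tokens × output_price/1M. A minimal sketch of that calculation — the `PRICING` table mirrors the October 2024 numbers above, and `estimate_cost` is an illustrative helper, not a provider SDK call:

```python
# Per-1M-token prices in USD, mirroring the October 2024 table above.
PRICING = {
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-opus": {"input": 15.00, "output": 75.00},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request: token counts times per-1M prices."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

For example, `estimate_cost("gpt-4o", 7_000, 500)` returns 0.0425 — the summarization workload in the next section.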
Real Workload Examples
```yaml
workload_costs:
  chatbot_conversation:
    description: "10-turn conversation, ~500 tokens/turn"
    total_tokens: "~5,000 input + 2,500 output"
    gpt4_turbo:
      cost: "$0.125"
      per_1000_convos: "$125"
    claude_35_sonnet:
      cost: "$0.053"
      per_1000_convos: "$53"
    gpt4o_mini:
      cost: "$0.0023"
      per_1000_convos: "$2.25"
  document_summarization:
    description: "Summarize 10-page document (~5K words)"
    tokens: "~7,000 input, 500 output"
    gpt4o:
      cost: "$0.043"
      per_1000_docs: "$43"
    claude_35_sonnet:
      cost: "$0.029"
      per_1000_docs: "$29"
  rag_query:
    description: "RAG with 5 retrieved chunks"
    tokens: "~3,000 context + 500 response"
    gpt4o:
      cost: "$0.023"
      per_10000_queries: "$230"
    gpt4o_mini:
      cost: "$0.00075"
      per_10000_queries: "$7.50"
```
Cost Optimization Strategies
Model Selection Impact
```python
class CostOptimizedRouter:
    """Route queries to the cheapest model that can handle them."""

    def __init__(self, gpt4o_mini, claude_sonnet, claude_opus):
        # Model clients are injected; any client exposing an async
        # generate(query, context) method works here.
        self.gpt4o_mini = gpt4o_mini
        self.claude_sonnet = claude_sonnet
        self.claude_opus = claude_opus

    async def route_query(self, query: str, context: str) -> str:
        # Classify complexity first, using the cheapest model
        complexity = await self._classify_complexity(query)
        if complexity == "simple":
            # Mini model: ~97% cheaper than GPT-4o
            return await self.gpt4o_mini.generate(query, context)
        elif complexity == "medium":
            # Sonnet: ~80% cheaper than Opus
            return await self.claude_sonnet.generate(query, context)
        else:
            # Reserve the best (most expensive) model for complex queries;
            # unrecognized classifier output also lands here, failing safe
            return await self.claude_opus.generate(query, context)

    async def _classify_complexity(self, query: str) -> str:
        # Use the mini model for classification itself
        result = await self.gpt4o_mini.generate(
            f"Classify this query's complexity as one word "
            f"(simple/medium/complex):\n\n{query}",
            "",
        )
        return result.strip().lower()
```
Caching Impact
```yaml
caching_savings:
  scenario: "Support chatbot, 10K queries/day"
  without_caching:
    unique_queries: 10000
    cost_per_day: "$500"
  with_semantic_cache:
    cache_hit_rate: "60%"
    unique_queries: 4000
    cost_per_day: "$200"
    savings: "60%"
  implementation:
    - Hash exact queries for identical matches
    - Embed queries for semantic similarity
    - Cache responses for 24 hours
    - Invalidate on knowledge updates
```
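The four implementation steps above can be sketched as a two-tier cache: an exact-match layer keyed by a hash of the normalized query, plus a semantic layer that compares query embeddings by cosine similarity. A minimal in-memory sketch — the embedding function is injected (any callable returning a vector), and the 0.92 similarity threshold and 24-hour TTL are illustrative assumptions, not benchmarks:

```python
import hashlib
import math
import time

class SemanticCache:
    """Two-tier response cache: exact hash match, then embedding similarity."""

    def __init__(self, embed, threshold=0.92, ttl_seconds=24 * 3600):
        self.embed = embed          # callable: str -> list[float]
        self.threshold = threshold  # cosine-similarity cutoff (illustrative)
        self.ttl = ttl_seconds
        self.entries = {}           # key -> (expires_at, embedding, response)

    def _key(self, query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    @staticmethod
    def _cosine(a, b) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query: str):
        now = time.time()
        # Tier 1: exact match on the normalized query hash.
        hit = self.entries.get(self._key(query))
        if hit and hit[0] > now:
            return hit[2]
        # Tier 2: any cached embedding above the similarity threshold.
        # (A linear scan; production systems use a vector index instead.)
        vec = self.embed(query)
        for expires_at, emb, response in self.entries.values():
            if expires_at > now and self._cosine(vec, emb) >= self.threshold:
                return response
        return None

    def put(self, query: str, response: str):
        self.entries[self._key(query)] = (
            time.time() + self.ttl, self.embed(query), response
        )

    def invalidate(self):
        """Drop everything, e.g. after a knowledge-base update."""
        self.entries.clear()
```

Hits on either tier cost nothing; only misses reach the LLM, which is where the 60% saving in the table comes from.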
Prompt Optimization
```yaml
prompt_optimization:
  verbose_prompt:
    tokens: 500
    example: "Full instructions repeated every call"
  optimized_prompt:
    tokens: 100
    technique: "Move instructions to system prompt, reference by ID"
  savings_at_scale:
    queries_per_day: 100000
    token_savings: "400 tokens × 100K queries = 40M tokens/day"
    cost_savings_gpt4o: "$200/day"
```
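The savings arithmetic above is worth making explicit: 400 fewer input tokens per query, at 100K queries/day, is 40M input tokens/day, which at GPT-4o's $5/1M input price is $200/day. As a quick check:

```python
GPT4O_INPUT_PRICE = 5.00  # $ per 1M input tokens (October 2024)

tokens_saved_per_query = 500 - 100  # verbose prompt vs optimized prompt
queries_per_day = 100_000
tokens_saved = tokens_saved_per_query * queries_per_day       # 40M tokens/day
dollars_saved = tokens_saved / 1_000_000 * GPT4O_INPUT_PRICE  # $200/day
```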
Batch Processing
```python
# Individual API calls: one round trip of request overhead per item
async def process_individual(items):
    results = []
    for item in items:
        result = await llm.generate(item)  # full request overhead each time
        results.append(result)
    return results

# Client-side batching: amortize overhead across requests
async def process_batch(items, batch_size=20):
    """Batch requests for efficiency."""
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        batch_results = await llm.generate_batch(batch)
        results.extend(batch_results)
    return results

# OpenAI Batch API: 50% discount for asynchronous workloads
async def process_with_batch_api(items):
    """Use OpenAI's Batch API for async processing."""
    batch_input = create_batch_file(items)  # helper: upload a JSONL request file
    batch_job = await openai.batches.create(
        input_file_id=batch_input.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    # Results are available within 24 hours at 50% of the synchronous price
    return await wait_for_batch(batch_job.id)  # helper: poll until complete
```
Self-Hosting Economics
Cloud vs Self-Hosted
```yaml
self_hosting_analysis:
  scenario: "1M queries/month, ~1K tokens average"
  cloud_api:
    model: "GPT-4o mini"
    cost: "$750/month"
    maintenance: "None"
  self_hosted_gpu:
    model: "Llama 3 8B"
    hardware: "A10G GPU ($1.50/hr)"
    monthly_compute: "$1,080/month"
    maintenance: "Significant"
    note: "More expensive unless volume is very high"
  break_even:
    threshold: "~5M+ queries/month"
    consideration: "Factor in ML ops overhead"
```
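The break-even point can be estimated directly: divide total monthly self-hosting cost (compute plus operations) by the per-query API cost. A sketch using the table's numbers — the $2,500/month ops-overhead figure is an illustrative assumption, not a benchmark:

```python
def break_even_queries(monthly_compute: float, monthly_ops: float,
                       api_cost_per_query: float) -> float:
    """Monthly query volume at which self-hosting matches API spend."""
    return (monthly_compute + monthly_ops) / api_cost_per_query

# GPT-4o mini at ~1K tokens/query works out to roughly $0.00075/query
# ($750/month for 1M queries, per the table above).
API_COST_PER_QUERY = 0.00075

# Compute alone (A10G at ~$1,080/month): break-even near 1.4M queries/month.
compute_only = break_even_queries(1080, 0, API_COST_PER_QUERY)

# Add an assumed $2,500/month of ML ops time and the bar rises to ~4.8M,
# consistent with the ~5M+ threshold above.
with_ops = break_even_queries(1080, 2500, API_COST_PER_QUERY)
```

This is why "factor in ML ops overhead" matters: the GPU bill alone understates the break-even volume by a factor of three or more.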
When Self-Hosting Makes Sense
```yaml
self_hosting_considerations:
  good_fit:
    - Data privacy requirements
    - Predictable, very high volume
    - Edge/offline requirements
    - Specialized fine-tuned models
  poor_fit:
    - Variable volume
    - Need best quality
    - Limited ML ops resources
    - Fast iteration needed
```
Cost Monitoring
```python
class CostTracker:
    """Track AI costs in real time."""

    def __init__(self, pricing: dict, metrics, alert):
        # pricing maps model name -> per-1M-token prices,
        # e.g. {"gpt-4o": {"input": 5.00, "output": 15.00}}
        self.pricing = pricing
        self.metrics = metrics  # injected metrics client
        self.alert = alert      # injected async alert callable

    async def track_request(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        metadata: dict,
    ):
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        await self.metrics.record(
            "ai_cost",
            value=cost,
            tags={
                "model": model,
                "feature": metadata.get("feature"),
                "user_tier": metadata.get("user_tier"),
            },
        )
        # Alert on anomalies
        if await self._is_anomaly(cost, metadata):
            await self.alert(
                f"Unusual AI cost: ${cost:.4f} for {metadata}"
            )

    async def _is_anomaly(self, cost: float, metadata: dict) -> bool:
        # Simple static threshold; production systems compare against
        # a rolling baseline per feature and user tier
        return cost > 0.50

    def _calculate_cost(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
    ) -> float:
        prices = self.pricing[model]
        return (
            input_tokens * prices["input"] / 1_000_000
            + output_tokens * prices["output"] / 1_000_000
        )
```
Key Takeaways
- Model selection has 10-100x cost impact
- GPT-4o mini and Haiku are viable for most tasks
- Route by complexity—use expensive models sparingly
- Caching can reduce costs 50-80%
- Prompt optimization compounds at scale
- Batch APIs offer 50% discounts for async work
- Self-hosting rarely saves money until very high volume
- Monitor costs by feature and user segment
- Set budgets and alerts before launch
- Quality vs cost tradeoff is task-dependent
Know your costs. Optimize deliberately.