When LLM outputs aren’t quite right, you have two main options: improve your prompts or fine-tune the model. Both can improve results, but they have different costs, capabilities, and use cases.
Here’s how to decide between fine-tuning and prompting.
Understanding the Options
Prompting
prompting:
what: Crafting instructions to get desired outputs
changes: Only the input to the model
model: Uses base model as-is
techniques:
- Zero-shot: Just describe the task
- Few-shot: Provide examples in prompt
- Chain-of-thought: Ask for step-by-step reasoning
- System prompts: Set context and behavior
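As a rough sketch of how these techniques differ in practice, the snippet below assembles the same request three ways using the OpenAI Python SDK; the model name and example texts are illustrative placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()

# Zero-shot: just describe the task
zero_shot = [
    {"role": "user", "content": "Classify the sentiment of this review: 'The update broke my workflow.'"}
]

# Few-shot: show the pattern with worked examples before the real input
few_shot = [
    {"role": "user", "content": "Classify sentiment: 'Love the new dashboard!'"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Classify sentiment: 'The update broke my workflow.'"},
]

# System prompt + chain-of-thought: set behavior, then ask for reasoning
with_system = [
    {"role": "system", "content": "You are a careful support triage assistant."},
    {"role": "user", "content": "Think step by step: is this a bug report or a feature request? 'The update broke my workflow.'"},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=few_shot)
print(response.choices[0].message.content)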
Fine-Tuning
fine_tuning:
what: Training a model on specific examples
changes: The model's weights
model: Creates new model variant
types:
full_fine_tuning:
what: Update all model weights
cost: Expensive, requires GPU infrastructure
when: Significant behavior change needed
parameter_efficient:
what: Update small number of parameters (LoRA, QLoRA)
cost: Much cheaper, can run on smaller GPUs
when: Most fine-tuning use cases
instruction_tuning:
what: Fine-tune on instruction-response pairs
cost: Moderate
when: Improve instruction following
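For the parameter-efficient option, a minimal LoRA sketch might look like the following, assuming the Hugging Face transformers and peft libraries; the base checkpoint, rank, and target modules are illustrative choices rather than recommendations.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a base model (the checkpoint name here is only an example)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA: train small low-rank adapters instead of updating all weights
lora_config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights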
Decision Framework
Start with Prompting
try_prompting_first:
reasons:
- No training data needed
- Immediate iteration
- No infrastructure requirements
- Easy to experiment
prompting_techniques:
1_clear_instructions:
before: "Summarize this"
after: "Summarize this article in exactly 3 bullet points, each under 20 words"
2_few_shot_examples:
approach: Include 2-5 examples of desired output
benefit: Model learns pattern from examples
3_output_format:
approach: Specify exact format (JSON, markdown, etc.)
benefit: Consistent, parseable outputs
4_chain_of_thought:
approach: "Think step by step"
benefit: Better reasoning for complex tasks
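The sketch below combines clear instructions with an explicit output format, using the improved summarization prompt from above; the article text is a placeholder, and a real pipeline would validate the JSON parse (or use a JSON output mode) and retry on failure.
import json
from openai import OpenAI

client = OpenAI()

article = "..."  # the article text goes here

prompt = (
    "Summarize this article in exactly 3 bullet points, each under 20 words.\n"
    'Return JSON in the form {"bullets": ["...", "...", "..."]}.\n\n'
    f"Article:\n{article}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)

# Production code should validate this parse and retry on failure
summary = json.loads(response.choices[0].message.content)
print(summary["bullets"])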
Consider Fine-Tuning When
fine_tuning_indicators:
prompting_limits_reached:
- Tried multiple prompt variations
- Still inconsistent outputs
- Examples don't fit in context
specific_style_needed:
- Company voice/tone
- Domain-specific terminology
- Consistent formatting patterns
specialized_task:
- Niche domain knowledge
- Unusual output format
- Task not well-represented in training
efficiency_requirements:
- Shorter prompts = lower latency
- Reduce token costs
- Simpler production system
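To put a number on the efficiency argument, you can count the tokens a fine-tuned model lets you drop from each prompt. This sketch assumes the tiktoken library and uses placeholder prompt strings.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Placeholder prompts: the long one carries few-shot examples,
# the short one relies on a fine-tuned model instead
prompt_with_examples = "System prompt plus five few-shot examples plus the user question..."
prompt_fine_tuned = "Short system prompt plus the user question..."

saved_per_request = len(enc.encode(prompt_with_examples)) - len(enc.encode(prompt_fine_tuned))
print(f"Tokens saved per request: {saved_per_request}")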
Comparison
Trade-offs
comparison:
prompting:
pros:
- No training needed
- Instant iteration
- Works with any model
- No infrastructure required
cons:
- Uses context for examples
- Higher latency (longer prompts)
- May not achieve consistency
- Limited by model's base behavior
fine_tuning:
pros:
- Shorter prompts needed
- Consistent behavior
- Domain adaptation
- Can outperform prompting
cons:
- Requires training data
- Takes time to train
- Needs infrastructure
- Model management overhead
Cost Comparison
cost_analysis:
prompting:
upfront: None
per_request: Higher (more tokens in prompt)
iteration: Fast, free
fine_tuning:
upfront:
training: $5-500+ depending on model/data
data_preparation: Hours to days
per_request: Lower (shorter prompts)
iteration: Slow, costs money each time
break_even:
calculation: |
If fine-tuning saves 500 tokens per request,
and you make 100K requests/month,
savings = 50M tokens/month
At $0.002/1K tokens = $100/month savings
Fine-tuning cost of $50 pays back in 2 weeks
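The same arithmetic as a small helper, using the illustrative figures from the example above rather than real pricing.
def months_to_break_even(tokens_saved_per_request, requests_per_month,
                         price_per_1k_tokens, fine_tuning_cost):
    # Monthly savings from sending fewer prompt tokens per request
    monthly_savings = (tokens_saved_per_request * requests_per_month / 1000) * price_per_1k_tokens
    return fine_tuning_cost / monthly_savings

print(months_to_break_even(500, 100_000, 0.002, 50))  # 0.5 months, i.e. about 2 weeks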
Fine-Tuning in Practice
Data Preparation
# Fine-tuning data format for OpenAI
training_data = [
{
"messages": [
{"role": "system", "content": "You are a technical support agent."},
{"role": "user", "content": "My API returns 401 errors"},
{"role": "assistant", "content": "A 401 error indicates authentication failure. Please check:\n1. Your API key is valid\n2. The key has correct permissions\n3. The Authorization header is properly formatted"}
]
},
# More examples...
]
# Best practices for training data:
# - 50-100 examples minimum, 500+ for best results
# - Diverse examples covering edge cases
# - High-quality, reviewed by humans
# - Consistent format and style
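To feed these examples to the API, write them out as JSONL (one JSON object per line), which is the format the upload step below expects.
import json

with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")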
Fine-Tuning Process
from openai import OpenAI

client = OpenAI()

# Upload training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3
    }
)

# Monitor progress
status = client.fine_tuning.jobs.retrieve(job.id)
print(status.status)  # 'running', 'succeeded', etc.

# Use fine-tuned model (the model ID is only set once the job has succeeded)
response = client.chat.completions.create(
    model=status.fine_tuned_model,  # e.g. ft:gpt-3.5-turbo:my-org::abc123
    messages=[{"role": "user", "content": "My API returns 403 errors"}]
)
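Fine-tuning jobs run asynchronously, so in practice you poll until the job reaches a terminal state before calling the model. This helper continues from the job created above; the 60-second interval is an arbitrary choice.
import time

def wait_for_job(client, job_id, poll_seconds=60):
    # Block until the fine-tuning job reaches a terminal state
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)

finished = wait_for_job(client, job.id)
print(finished.status, finished.fine_tuned_model)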
Evaluation
from difflib import SequenceMatcher

def compute_similarity(a, b):
    # Simple string-overlap ratio; swap in an embedding-based or
    # LLM-graded scorer for more meaningful comparisons
    return SequenceMatcher(None, a, b).ratio()

def evaluate_model(model_id, test_cases):
    results = []
    for case in test_cases:
        response = client.chat.completions.create(  # client from the snippet above
            model=model_id,
            messages=case.messages
        )
        actual = response.choices[0].message.content
        # Compare to expected output
        similarity = compute_similarity(actual, case.expected_output)
        results.append({
            "input": case.messages[-1]["content"],
            "expected": case.expected_output,
            "actual": actual,
            "similarity": similarity
        })
    return {
        "average_similarity": sum(r["similarity"] for r in results) / len(results),
        "results": results
    }

# Compare base model vs fine-tuned
base_eval = evaluate_model("gpt-3.5-turbo", test_cases)
ft_eval = evaluate_model("ft:gpt-3.5-turbo:my-org::abc123", test_cases)
print(f"Base model: {base_eval['average_similarity']:.2f}")
print(f"Fine-tuned: {ft_eval['average_similarity']:.2f}")
Hybrid Approaches
RAG + Fine-Tuning
rag_plus_fine_tuning:
scenario: Customer support bot
approach:
rag: Retrieve relevant documentation
fine_tuning: Learn company voice and format
benefit:
- Accurate information (RAG)
- Consistent style (fine-tuning)
- Best of both worlds
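A minimal sketch of the hybrid pattern: retrieve context first, then generate the answer with the fine-tuned model. The retrieve_docs function is a hypothetical placeholder, and the model ID is the example from the fine-tuning section above.
from openai import OpenAI

client = OpenAI()

def answer(question, retrieve_docs):
    # Retrieval step (RAG): fetch relevant documentation for grounding
    docs = retrieve_docs(question)  # e.g. a vector-store lookup, not shown here
    context = "\n\n".join(docs)
    # Generation step: the fine-tuned model supplies the voice and formatting
    response = client.chat.completions.create(
        model="ft:gpt-3.5-turbo:my-org::abc123",  # hypothetical fine-tuned model ID
        messages=[
            {"role": "system", "content": "Answer using only the provided documentation."},
            {"role": "user", "content": f"Documentation:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content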
Progressive Enhancement
progressive_approach:
step_1:
method: Zero-shot prompting
effort: Low
result: Baseline performance
step_2:
method: Few-shot prompting
effort: Medium
result: Improved with examples
step_3:
method: Optimized system prompt
effort: Medium
result: Better consistency
step_4:
method: Fine-tuning
effort: High
result: Best performance
decision:
- Stop when quality is acceptable
- Only fine-tune if prompting insufficient
Key Takeaways
- Always try prompting first—it’s faster and cheaper
- Fine-tune when prompting hits limits after thorough testing
- Fine-tuning excels at style, format, and domain adaptation
- Prompting excels at flexibility and rapid iteration
- Calculate cost trade-offs: training vs. per-request savings
- Quality training data is crucial for fine-tuning success
- Evaluate objectively: compare base vs. fine-tuned performance
- Consider hybrid approaches: RAG + fine-tuning
- Fine-tuning is not magic—it learns from your examples
- Start simple, add complexity only when needed
The best approach depends on your specific use case. Let data guide your decision.