Fine-tuning lets you customize model behavior with your own data. It is powerful, but often unnecessary: better prompting, RAG, or a more capable base model frequently achieves the same result with less effort.
Here’s when fine-tuning makes sense and how to approach it.
When to Fine-Tune
Good Reasons
```yaml
fine_tuning_good_reasons:
  style_consistency:
    description: "Match specific writing style or tone"
    example: "Brand voice across all outputs"
    alternative: "Few-shot prompting often sufficient"
  domain_terminology:
    description: "Use specialized vocabulary correctly"
    example: "Medical, legal, technical domains"
    alternative: "RAG with domain documents"
  format_reliability:
    description: "Consistent output structure"
    example: "Always output specific JSON schema"
    alternative: "Structured output modes"
  efficiency_at_scale:
    description: "Reduce tokens for repeated patterns"
    example: "High-volume, narrow use case"
    when: "After exhausting prompting options"
```
Bad Reasons
```yaml
fine_tuning_bad_reasons:
  adding_knowledge:
    problem: "Fine-tuning doesn't reliably add facts"
    solution: "Use RAG instead"
  one_off_task:
    problem: "Effort not justified for a single use"
    solution: "Better prompting"
  prompt_is_too_long:
    problem: "Fine-tuning just to avoid a long context"
    solution: "Prompting is often still needed anyway; trim the context instead"
  following_hype:
    problem: "Fine-tuning because others do"
    solution: "Evaluate whether you actually need it"
```
The Decision Framework
```yaml
fine_tuning_decision:
  step_1:
    question: "Have you tried the best base model?"
    if_no: "Try GPT-4o, Claude 3.5, etc. first"
  step_2:
    question: "Have you optimized your prompt?"
    if_no: "Iterate on prompts with examples"
  step_3:
    question: "Would RAG solve the problem?"
    if_yes: "RAG is usually easier and more flexible"
  step_4:
    question: "Is this a high-volume, narrow use case?"
    if_yes: "Fine-tuning may be justified"
  step_5:
    question: "Do you have quality training data?"
    if_no: "Collect it first; fine-tuning quality depends on data quality"
```
Fine-Tuning Process
Data Preparation
```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    """A single chat-format training record."""
    messages: list[dict]


class FineTuningDataset:
    """Prepare data for fine-tuning."""

    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt

    def prepare_examples(
        self,
        raw_data: list[dict]
    ) -> list[TrainingExample]:
        examples = []
        for item in raw_data:
            # Format each record as a conversation
            example = TrainingExample(
                messages=[
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": item["input"]},
                    {"role": "assistant", "content": item["ideal_output"]},
                ]
            )
            # Keep only examples that pass validation
            if self._validate_example(example):
                examples.append(example)
        return examples

    def _validate_example(self, example: TrainingExample) -> bool:
        # Check quality criteria on the assistant's target output
        output = example.messages[-1]["content"]
        return 10 < len(output) < 2000 and self._is_high_quality(output)

    def _is_high_quality(self, output: str) -> bool:
        # Placeholder heuristic; substitute your own quality checks
        return not output.lower().startswith(("i can't", "as an ai"))
```
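The training file used in the next section is JSONL with one `{"messages": [...]}` object per line, which is the format OpenAI's chat fine-tuning expects. A small export helper under that assumption (`export_jsonl` is our own name, not a library function):

```python
import json

def export_jsonl(
    examples: list[TrainingExample],
    path: str = "training_data.jsonl",
) -> None:
    # One {"messages": [...]} object per line, as chat fine-tuning expects
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps({"messages": ex.messages}) + "\n")
```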
Training Configuration
```python
# OpenAI fine-tuning
import time

from openai import OpenAI

client = OpenAI()

# Upload the training file
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3}
)

# Poll until the job reaches a terminal state
while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {status.status}")
    if status.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(60)
```
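Once the job succeeds, the retrieved job object carries the new model's identifier in its `fine_tuned_model` field, and calling it looks like any other chat completion. The model id in the comment below is a placeholder:

```python
# Fetch the completed job and call the resulting model
job = client.fine_tuning.jobs.retrieve(job.id)
response = client.chat.completions.create(
    model=job.fine_tuned_model,  # e.g. "ft:gpt-4o-mini-2024-07-18:org::abc123"
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```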
Evaluation
```python
from dataclasses import dataclass


@dataclass
class TestCase:
    input: str
    expected: str


@dataclass
class EvaluationResult:
    fine_tuned_avg: float
    base_avg: float
    improvement: float


class FineTuneEvaluator:
    """Compare fine-tuned vs base model."""

    async def generate(self, model: str, prompt: str) -> str:
        raise NotImplementedError  # call your model here

    async def score(self, response: str, expected: str) -> float:
        raise NotImplementedError  # return a numeric score, e.g. in [0, 1]

    async def evaluate(
        self,
        test_set: list[TestCase],
        fine_tuned_model: str,
        base_model: str
    ) -> EvaluationResult:
        results = {"fine_tuned": [], "base": []}
        for case in test_set:
            # Score the fine-tuned model
            ft_response = await self.generate(fine_tuned_model, case.input)
            ft_score = await self.score(ft_response, case.expected)
            results["fine_tuned"].append(ft_score)

            # Score the base model on the same input
            base_response = await self.generate(base_model, case.input)
            base_score = await self.score(base_response, case.expected)
            results["base"].append(base_score)

        # Compute the averages once, then derive the improvement from them
        ft_avg = sum(results["fine_tuned"]) / len(test_set)
        base_avg = sum(results["base"]) / len(test_set)
        return EvaluationResult(
            fine_tuned_avg=ft_avg,
            base_avg=base_avg,
            improvement=ft_avg - base_avg,
        )
```
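A usage sketch, assuming `generate` and `score` have been filled in; `load_test_cases` is a hypothetical loader for your held-out set, and the fine-tuned model id is a placeholder:

```python
import asyncio

async def main() -> None:
    evaluator = FineTuneEvaluator()
    result = await evaluator.evaluate(
        test_set=load_test_cases(),  # hypothetical loader
        fine_tuned_model="ft:gpt-4o-mini-2024-07-18:org::abc123",  # placeholder
        base_model="gpt-4o-mini-2024-07-18",
    )
    print(f"Fine-tuned: {result.fine_tuned_avg:.2f}, base: {result.base_avg:.2f}")
    print(f"Improvement: {result.improvement:+.2f}")

asyncio.run(main())
```

If the improvement is marginal, the base model plus better prompting is usually the cheaper path.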
Key Takeaways
- Fine-tuning is a last resort, not a first option
- Try better models and prompting first
- RAG handles knowledge better than fine-tuning
- Fine-tuning excels at style and format
- Data quality determines output quality
- Always evaluate against base model
- Consider ongoing maintenance burden
- Start with a small dataset and iterate
Fine-tune when necessary, not because you can.