Fine-Tuning LLMs: When and Why

June 23, 2025

Fine-tuning lets you customize model behavior with your own data. It’s powerful—but often unnecessary. Better prompting, RAG, or using a more capable model frequently achieves the same result with less effort.

Here’s when fine-tuning makes sense and how to approach it.

When to Fine-Tune

Good Reasons

fine_tuning_good_reasons:
  style_consistency:
    description: "Match specific writing style or tone"
    example: "Brand voice across all outputs"
    alternative: "Few-shot prompting often sufficient (see sketch below)"

  domain_terminology:
    description: "Use specialized vocabulary correctly"
    example: "Medical, legal, technical domains"
    alternative: "RAG with domain documents"

  format_reliability:
    description: "Consistent output structure"
    example: "Always output specific JSON schema"
    alternative: "Structured output modes"

  efficiency_at_scale:
    description: "Reduce tokens for repeated patterns"
    example: "High-volume, narrow use case"
    when: "After exhausting prompting options"
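
Note how often the alternative column is just better prompting. As a reference point for the style-consistency case, here is a minimal few-shot prompting sketch; the model name and example pairs are placeholders you'd replace with your own:

from openai import OpenAI

client = OpenAI()

# Hypothetical brand-voice examples; in practice use 3-10 representative pairs
FEW_SHOT = [
    {"role": "user", "content": "Announce our new pricing page."},
    {"role": "assistant", "content": "Big news: pricing just got simpler. One page, no surprises."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any capable chat model works here
    messages=[
        {"role": "system", "content": "Write in our brand voice: short, warm, no jargon."},
        *FEW_SHOT,
        {"role": "user", "content": "Announce our new API rate limits."},
    ],
)
print(response.choices[0].message.content)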

Bad Reasons

fine_tuning_bad_reasons:
  adding_knowledge:
    problem: "Fine-tuning doesn't reliably add facts"
    solution: "Use RAG instead"

  one_off_task:
    problem: "Effort not justified for single use"
    solution: "Better prompting"

  prompt_is_too_long:
    problem: "Fine-tuning just to shorten a long prompt"
    solution: "The prompt is usually still needed anyway; optimize it instead"

  following_hype:
    problem: "Fine-tuning because others do"
    solution: "Evaluate if you actually need it"

The Decision Framework

fine_tuning_decision:
  step_1:
    question: "Have you tried the best base model?"
    if_no: "Try GPT-4o, Claude 3.5, etc. first"

  step_2:
    question: "Have you optimized your prompt?"
    if_no: "Iterate on prompts with examples"

  step_3:
    question: "Would RAG solve the problem?"
    if_yes: "RAG is usually easier and more flexible"

  step_4:
    question: "Is this high-volume, narrow use case?"
    if_yes: "Fine-tuning may be justified"

  step_5:
    question: "Do you have quality training data?"
    if_no: "Fine-tuning quality depends on data quality"

Fine-Tuning Process

Data Preparation

from dataclasses import dataclass


@dataclass
class TrainingExample:
    """A single chat-format training example."""
    messages: list[dict]


class FineTuningDataset:
    """Prepare data for fine-tuning."""

    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt

    def prepare_examples(
        self,
        raw_data: list[dict]
    ) -> list[TrainingExample]:
        examples = []

        for item in raw_data:
            # Format as conversation
            example = TrainingExample(
                messages=[
                    {
                        "role": "system",
                        "content": self.system_prompt
                    },
                    {
                        "role": "user",
                        "content": item["input"]
                    },
                    {
                        "role": "assistant",
                        "content": item["ideal_output"]
                    }
                ]
            )

            # Validate
            if self._validate_example(example):
                examples.append(example)

        return examples

    def _validate_example(self, example: TrainingExample) -> bool:
        # Basic quality gates: output must be non-trivial but bounded,
        # and pass whatever quality check you define
        output = example.messages[-1]["content"]
        return 10 < len(output) < 2000 and self._is_high_quality(output)

    def _is_high_quality(self, output: str) -> bool:
        # Placeholder: plug in your own heuristics or an LLM-based check
        return True
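
OpenAI's fine-tuning API expects chat-format examples as JSONL, one object per line. A small export step bridges the class above to the `training_data.jsonl` file used below:

import json

def write_jsonl(examples: list[TrainingExample], path: str = "training_data.jsonl") -> None:
    # Serialize prepared examples in the {"messages": [...]} format
    # the fine-tuning endpoint expects
    with open(path, "w") as f:
        for example in examples:
            f.write(json.dumps({"messages": example.messages}) + "\n")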

Training Configuration

# OpenAI fine-tuning
import time

from openai import OpenAI

client = OpenAI()

# Upload training file
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3
    }
)

# Monitor progress
while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {status.status}")
    if status.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(60)
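
Once the job succeeds, the job object carries the new model identifier; continuing the script above, you use it like any other model name:

# Available on the completed job, e.g. "ft:gpt-4o-mini-2024-07-18:org::abc123"
model_name = status.fine_tuned_model

response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)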

Evaluation

from dataclasses import dataclass


@dataclass
class TestCase:
    input: str
    expected: str


@dataclass
class EvaluationResult:
    fine_tuned_avg: float
    base_avg: float
    improvement: float


class FineTuneEvaluator:
    """Compare fine-tuned vs base model."""

    async def evaluate(
        self,
        test_set: list[TestCase],
        fine_tuned_model: str,
        base_model: str
    ) -> EvaluationResult:
        results = {"fine_tuned": [], "base": []}

        for case in test_set:
            # Test fine-tuned
            ft_response = await self.generate(
                fine_tuned_model,
                case.input
            )
            ft_score = await self.score(ft_response, case.expected)
            results["fine_tuned"].append(ft_score)

            # Test base
            base_response = await self.generate(
                base_model,
                case.input
            )
            base_score = await self.score(base_response, case.expected)
            results["base"].append(base_score)

        fine_tuned_avg = sum(results["fine_tuned"]) / len(test_set)
        base_avg = sum(results["base"]) / len(test_set)
        return EvaluationResult(
            fine_tuned_avg=fine_tuned_avg,
            base_avg=base_avg,
            improvement=fine_tuned_avg - base_avg,
        )
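
`generate` and `score` are left abstract above. A minimal sketch of both, to drop into the class: it assumes a `self.client` that is an `AsyncOpenAI` instance, and uses a crude exact-match metric; in practice you'd swap in a task-specific metric or an LLM judge.

    async def generate(self, model: str, prompt: str) -> str:
        # Thin wrapper over the chat API; self.client is assumed
        # to be an AsyncOpenAI instance
        response = await self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    async def score(self, response: str, expected: str) -> float:
        # Placeholder metric: 1.0 on normalized exact match, else 0.0
        return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0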

Key Takeaways

Fine-tune when necessary, not because you can.