LLM Testing Strategies That Work

August 19, 2024

LLM outputs are non-deterministic. Traditional exact-match unit tests break down the moment they touch model output. Yet you still need confidence that your AI application behaves correctly. LLM testing requires new strategies: probabilistic assertions, semantic evaluation, and regression detection.

Here are testing strategies that work for LLM applications.

Why LLM Testing Is Different

The Challenges

llm_testing_challenges:
  non_determinism:
    - Same input → different outputs
    - Temperature affects randomness
    - Model updates change behavior

  no_ground_truth:
    - Many valid answers
    - Quality is subjective
    - Context-dependent correctness

  evaluation_is_hard:
    - "Is this response good?" is complex
    - Human evaluation doesn't scale
    - Automated metrics imperfect
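
The non-determinism challenge is usually handled with probabilistic assertions: run the same check several times and require a minimum pass rate instead of a single pass. A minimal sketch, assuming a hypothetical generate_answer coroutine in the application under test:

import asyncio

async def assert_pass_rate(check, n_runs: int = 5, min_pass_rate: float = 0.8):
    """Run an async boolean check repeatedly and require a minimum pass rate."""
    results = await asyncio.gather(*(check() for _ in range(n_runs)))
    pass_rate = sum(results) / n_runs
    assert pass_rate >= min_pass_rate, (
        f"pass rate {pass_rate:.0%} below threshold {min_pass_rate:.0%}"
    )

# Usage: the answer should mention containers in most runs, not necessarily all
async def test_mentions_containers():
    async def check() -> bool:
        answer = await generate_answer("What is Docker?")  # hypothetical app call
        return "container" in answer.lower()

    await assert_pass_rate(check, n_runs=5, min_pass_rate=0.8)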

Testing Levels

llm_testing_pyramid:
  unit_tests:
    what: "Test individual components"
    examples:
      - Prompt template rendering
      - Input validation
      - Output parsing
    approach: "Traditional testing"

  integration_tests:
    what: "Test LLM interactions"
    examples:
      - API call handling
      - Error recovery
      - Timeout handling
    approach: "Mocked LLM responses"

  evaluation_tests:
    what: "Test output quality"
    examples:
      - Response accuracy
      - Format compliance
      - Safety checks
    approach: "LLM-as-judge, metrics"

  regression_tests:
    what: "Detect quality changes"
    examples:
      - Prompt changes
      - Model updates
      - System changes
    approach: "Golden datasets, comparisons"

Testing Strategies

Deterministic Component Testing

import pytest

# Test prompt templates (deterministic)
def test_prompt_template():
    template = PromptTemplate(
        template="Summarize this {doc_type}: {content}"
    )

    result = template.render(
        doc_type="article",
        content="Test content"
    )

    assert result == "Summarize this article: Test content"

# Test output parsing (deterministic)
def test_json_extraction():
    parser = JSONExtractor(schema=ContactSchema)

    raw_output = '{"name": "John", "email": "john@example.com"}'
    result = parser.parse(raw_output)

    assert result.name == "John"
    assert result.email == "john@example.com"

# Test validation (deterministic)
def test_input_validation():
    validator = InputValidator(max_length=1000)

    with pytest.raises(ValidationError):
        validator.validate("x" * 1001)

Mock-Based Integration Testing

import pytest
from unittest.mock import AsyncMock

@pytest.fixture
def mock_llm():
    llm = AsyncMock()
    llm.generate.return_value = GenerationResult(
        content="Mocked response",
        tokens_used=50
    )
    return llm

async def test_chat_flow(mock_llm):
    chat = ChatService(llm=mock_llm)

    response = await chat.send_message("Hello")

    assert response.content == "Mocked response"
    mock_llm.generate.assert_called_once()

async def test_retry_on_timeout(mock_llm):
    mock_llm.generate.side_effect = [
        TimeoutError(),
        GenerationResult(content="Success")
    ]

    chat = ChatService(llm=mock_llm, max_retries=3)
    response = await chat.send_message("Hello")

    assert response.content == "Success"
    assert mock_llm.generate.call_count == 2
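
ChatService itself isn't shown above; the retry test assumes it catches timeouts and calls the LLM again. A minimal sketch of the retry loop these tests imply, reusing the llm interface from the mocks:

class ChatService:
    """Thin wrapper around an LLM client with retry-on-timeout (sketch only)."""

    def __init__(self, llm, max_retries: int = 3):
        self.llm = llm
        self.max_retries = max_retries

    async def send_message(self, message: str):
        last_error = None
        for _ in range(self.max_retries):
            try:
                # One LLM call per attempt; backoff and jitter omitted for brevity
                return await self.llm.generate(message)
            except TimeoutError as exc:
                last_error = exc
        raise last_error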

LLM-as-Judge Evaluation

class LLMEvaluator:
    """Use an LLM to evaluate another LLM's output."""

    def __init__(self, judge_llm):
        self.judge = judge_llm

    async def evaluate_response(
        self,
        question: str,
        response: str,
        criteria: list[str]
    ) -> EvaluationResult:
        eval_prompt = f"""Evaluate this response on the given criteria.

Question: {question}
Response: {response}

Criteria to evaluate:
{chr(10).join(f'- {c}' for c in criteria)}

For each criterion, rate 1-5 and explain briefly.
Return as JSON: {{"criterion": {{"score": N, "reason": "..."}}, ...}}
"""

        result = await self.judge.generate(eval_prompt)
        return EvaluationResult.parse(result)

# Usage in tests
async def test_response_quality():
    evaluator = LLMEvaluator(judge_llm=get_eval_model())

    response = await my_app.generate_response(
        "Explain kubernetes to a beginner"
    )

    evaluation = await evaluator.evaluate_response(
        question="Explain kubernetes to a beginner",
        response=response,
        criteria=[
            "Accuracy: Information is correct",
            "Clarity: Easy to understand for beginners",
            "Completeness: Covers key concepts"
        ]
    )

    assert evaluation.scores["Accuracy"] >= 4
    assert evaluation.scores["Clarity"] >= 4

Assertion-Based Testing

class SemanticAssertions:
    """Semantic assertions for LLM outputs."""

    def __init__(self, llm):
        self.llm = llm

    async def assert_contains_concept(
        self,
        text: str,
        concept: str
    ) -> bool:
        result = await self.llm.generate(
            f"""Does this text discuss the concept "{concept}"?
Text: {text}
Answer only: yes or no"""
        )
        return result.strip().lower() == "yes"

    async def assert_tone(
        self,
        text: str,
        expected_tone: str
    ) -> bool:
        result = await self.llm.generate(
            f"""Is the tone of this text "{expected_tone}"?
Text: {text}
Answer only: yes or no"""
        )
        return result.strip().lower() == "yes"

# Usage
async def test_customer_response_tone():
    assertions = SemanticAssertions(eval_llm)
    response = await support_bot.respond("I'm frustrated with your service")

    assert await assertions.assert_tone(response, "empathetic and helpful")
    assert await assertions.assert_contains_concept(response, "apology")

Regression Testing with Golden Datasets

class RegressionSuite:
    """Detect quality regressions across versions."""

    def __init__(self, golden_dataset_path: str):
        self.golden_data = self._load_dataset(golden_dataset_path)
        self.evaluator = LLMEvaluator(get_eval_model())

    async def run_regression_tests(
        self,
        model_under_test
    ) -> RegressionReport:
        results = []

        for case in self.golden_data:
            response = await model_under_test.generate(case.input)

            # Compare to golden response
            similarity = await self._compute_similarity(
                response,
                case.golden_response
            )

            # Evaluate quality
            quality = await self.evaluator.evaluate_response(
                case.input,
                response,
                case.criteria
            )

            results.append(TestResult(
                case_id=case.id,
                similarity_score=similarity,
                quality_scores=quality.scores,
                passed=similarity > 0.8 and quality.average >= 4
            ))

        return RegressionReport(results=results)

# Golden dataset format
golden_dataset = [
    {
        "id": "summarization_001",
        "input": "Summarize this article: ...",
        "golden_response": "The article discusses...",
        "criteria": ["Accuracy", "Conciseness", "Completeness"]
    }
]
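
_compute_similarity is left undefined in the suite above; a common choice is cosine similarity over text embeddings. A sketch, assuming a hypothetical async embed() helper that returns an embedding vector for a string:

import math

async def _compute_similarity(self, response: str, golden_response: str) -> float:
    """Cosine similarity between two texts' embeddings (roughly 0-1 for typical models)."""
    vec_a = await embed(response)          # embed() is a placeholder for your embedding client
    vec_b = await embed(golden_response)

    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)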

Best Practices

llm_testing_best_practices:
  structure:
    - Separate deterministic from probabilistic tests
    - Run fast tests frequently, slow tests in CI
    - Maintain golden datasets

  evaluation:
    - Define clear quality criteria
    - Use multiple evaluation methods
    - Track metrics over time

  regression:
    - Test before model updates
    - Test after prompt changes
    - Version your prompts

  scale:
    - Automate what you can
    - Sample for expensive evaluations
    - Human review for edge cases
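
"Sample for expensive evaluations" in practice usually means judging a random subset of cases rather than all of them. A minimal sketch, assuming the golden_dataset list shown earlier:

import random

def sample_cases(cases: list, sample_size: int = 50, seed: int = 42) -> list:
    """Pick a reproducible random subset for LLM-as-judge evaluation.

    A fixed seed keeps the sample stable across runs so scores stay comparable.
    """
    if len(cases) <= sample_size:
        return cases
    return random.Random(seed).sample(cases, sample_size)

# Judge only the sample; run cheap format and deterministic checks on everything
eval_cases = sample_cases(golden_dataset, sample_size=50)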

Key Takeaways

Treat deterministic pieces (templates, parsers, validators) like any other code, mock the LLM for integration paths, lean on LLM-as-judge and semantic assertions for output quality, and catch regressions with golden datasets before they reach users. Test your LLMs. Your users will thank you.