LLM outputs are non-deterministic, so exact-match assertions break down and traditional unit tests only cover part of the system. Yet you still need confidence that your AI application behaves correctly. LLM testing requires new strategies: probabilistic assertions, semantic evaluation, and regression detection.
Here are testing strategies that work for LLM applications.
Why LLM Testing Is Different
The Challenges
llm_testing_challenges:
non_determinism:
- Same input → different outputs
- Temperature affects randomness
- Model updates change behavior
no_ground_truth:
- Many valid answers
- Quality is subjective
- Context-dependent correctness
evaluation_is_hard:
- "Is this response good?" is complex
- Human evaluation doesn't scale
- Automated metrics imperfect
Testing Levels
llm_testing_pyramid:
unit_tests:
what: "Test individual components"
examples:
- Prompt template rendering
- Input validation
- Output parsing
approach: "Traditional testing"
integration_tests:
what: "Test LLM interactions"
examples:
- API call handling
- Error recovery
- Timeout handling
approach: "Mocked LLM responses"
evaluation_tests:
what: "Test output quality"
examples:
- Response accuracy
- Format compliance
- Safety checks
approach: "LLM-as-judge, metrics"
regression_tests:
what: "Detect quality changes"
examples:
- Prompt changes
- Model updates
- System changes
approach: "Golden datasets, comparisons"
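One way to wire these levels into a test runner is with pytest markers, so the fast deterministic tests run on every commit while evaluation and regression suites run in CI or on a schedule. A minimal sketch; the marker names are illustrative, not a standard.
# pytest.ini (illustrative marker names)
# [pytest]
# markers =
#     evaluation: tests that call a real LLM (slow, costs money)
#     regression: golden-dataset comparisons

import pytest

@pytest.mark.evaluation
async def test_summary_quality():
    ...  # calls the real model, judged by an evaluator

# Run only the cheap tests locally:
#   pytest -m "not evaluation and not regression"
# Run everything in CI:
#   pytest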
Testing Strategies
Deterministic Component Testing
import pytest
# Test prompt templates (deterministic)
def test_prompt_template():
template = PromptTemplate(
template="Summarize this {doc_type}: {content}"
)
result = template.render(
doc_type="article",
content="Test content"
)
assert result == "Summarize this article: Test content"
# Test output parsing (deterministic)
def test_json_extraction():
parser = JSONExtractor(schema=ContactSchema)
raw_output = '{"name": "John", "email": "john@example.com"}'
result = parser.parse(raw_output)
assert result.name == "John"
assert result.email == "john@example.com"
# Test validation (deterministic)
def test_input_validation():
validator = InputValidator(max_length=1000)
with pytest.raises(ValidationError):
validator.validate("x" * 1001)
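For context, here is a minimal sketch of two of the application-side helpers these tests assume; the real `PromptTemplate` and `InputValidator` in your codebase (and the `JSONExtractor`/`ContactSchema` pair, omitted here) will likely look different, so treat the names and signatures as illustrative.
# Hypothetical application-side helpers assumed by the tests above.

class ValidationError(Exception):
    """Raised when input fails validation."""


class PromptTemplate:
    def __init__(self, template: str):
        self.template = template

    def render(self, **kwargs) -> str:
        # str.format fills {doc_type}, {content}, etc.
        return self.template.format(**kwargs)


class InputValidator:
    def __init__(self, max_length: int):
        self.max_length = max_length

    def validate(self, text: str) -> None:
        if len(text) > self.max_length:
            raise ValidationError(f"Input exceeds {self.max_length} characters")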
Mock-Based Integration Testing
from unittest.mock import AsyncMock
@pytest.fixture
def mock_llm():
llm = AsyncMock()
llm.generate.return_value = GenerationResult(
content="Mocked response",
tokens_used=50
)
return llm
async def test_chat_flow(mock_llm):
chat = ChatService(llm=mock_llm)
response = await chat.send_message("Hello")
assert response.content == "Mocked response"
mock_llm.generate.assert_called_once()
async def test_retry_on_timeout(mock_llm):
mock_llm.generate.side_effect = [
TimeoutError(),
GenerationResult(content="Success")
]
chat = ChatService(llm=mock_llm, max_retries=3)
response = await chat.send_message("Hello")
assert response.content == "Success"
assert mock_llm.generate.call_count == 2
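The mocks only prove something if the service under test actually has this retry behavior. A minimal `ChatService` consistent with these tests might look like the sketch below; the class and its retry loop are hypothetical, not a particular framework's API.
import asyncio
from dataclasses import dataclass


@dataclass
class ChatResponse:
    content: str


class ChatService:
    """Hypothetical service matching the integration tests above."""

    def __init__(self, llm, max_retries: int = 1, retry_delay: float = 0.0):
        self.llm = llm
        self.max_retries = max_retries
        self.retry_delay = retry_delay

    async def send_message(self, message: str) -> ChatResponse:
        last_error = None
        for _ in range(self.max_retries):
            try:
                result = await self.llm.generate(message)
                return ChatResponse(content=result.content)
            except TimeoutError as exc:
                last_error = exc
                await asyncio.sleep(self.retry_delay)
        raise last_error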
LLM-as-Judge Evaluation
class LLMEvaluator:
"""Use an LLM to evaluate another LLM's output."""
def __init__(self, judge_llm):
self.judge = judge_llm
async def evaluate_response(
self,
question: str,
response: str,
criteria: list[str]
) -> EvaluationResult:
eval_prompt = f"""Evaluate this response on the given criteria.
Question: {question}
Response: {response}
Criteria to evaluate:
{chr(10).join(f'- {c}' for c in criteria)}
For each criterion, rate 1-5 and explain briefly.
Return as JSON: {{"criterion": {{"score": N, "reason": "..."}}, ...}}
"""
result = await self.judge.generate(eval_prompt)
return EvaluationResult.parse(result)
# Usage in tests
async def test_response_quality():
evaluator = LLMEvaluator(judge_llm=get_eval_model())
response = await my_app.generate_response(
"Explain kubernetes to a beginner"
)
evaluation = await evaluator.evaluate_response(
question="Explain kubernetes to a beginner",
response=response,
criteria=[
"Accuracy: Information is correct",
"Clarity: Easy to understand for beginners",
"Completeness: Covers key concepts"
]
)
assert evaluation.scores["Accuracy"] >= 4
assert evaluation.scores["Clarity"] >= 4
Assertion-Based Testing
class SemanticAssertions:
"""Semantic assertions for LLM outputs."""
def __init__(self, llm):
self.llm = llm
async def assert_contains_concept(
self,
text: str,
concept: str
) -> bool:
result = await self.llm.generate(
f"""Does this text discuss the concept "{concept}"?
Text: {text}
Answer only: yes or no"""
)
return result.strip().lower() == "yes"
async def assert_tone(
self,
text: str,
expected_tone: str
) -> bool:
result = await self.llm.generate(
f"""Is the tone of this text "{expected_tone}"?
Text: {text}
Answer only: yes or no"""
)
return result.strip().lower() == "yes"
# Usage
async def test_customer_response_tone():
assertions = SemanticAssertions(eval_llm)
response = await support_bot.respond("I'm frustrated with your service")
assert await assertions.assert_tone(response, "empathetic and helpful")
assert await assertions.assert_contains_concept(response, "apology")
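A single sample can pass or fail by luck, which is where the probabilistic assertions mentioned at the start come in: sample the application several times and require a minimum pass rate instead of betting on one generation. A minimal sketch, reusing the `SemanticAssertions`, `eval_llm`, and `support_bot` objects from above:
async def pass_rate(check, n: int = 5) -> float:
    """Run an async boolean check n times and return the fraction that passed."""
    results = [await check() for _ in range(n)]
    return sum(results) / n


async def test_tone_pass_rate():
    assertions = SemanticAssertions(eval_llm)

    async def check() -> bool:
        response = await support_bot.respond("I'm frustrated with your service")
        return await assertions.assert_tone(response, "empathetic and helpful")

    # Require 4 of 5 samples to pass rather than relying on a single generation.
    assert await pass_rate(check, n=5) >= 0.8
Pick the threshold from the failure rate you can tolerate in production; 100% is rarely achievable for subjective criteria.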
Regression Testing with Golden Datasets
class RegressionSuite:
"""Detect quality regressions across versions."""
def __init__(self, golden_dataset_path: str):
self.golden_data = self._load_dataset(golden_dataset_path)
self.evaluator = LLMEvaluator(get_eval_model())
async def run_regression_tests(
self,
model_under_test
) -> RegressionReport:
results = []
for case in self.golden_data:
response = await model_under_test.generate(case.input)
# Compare to golden response
similarity = await self._compute_similarity(
response,
case.golden_response
)
# Evaluate quality
quality = await self.evaluator.evaluate_response(
case.input,
response,
case.criteria
)
results.append(TestResult(
case_id=case.id,
similarity_score=similarity,
quality_scores=quality.scores,
passed=similarity > 0.8 and quality.average >= 4
))
return RegressionReport(results=results)
# Golden dataset format
golden_dataset = [
{
"id": "summarization_001",
"input": "Summarize this article: ...",
"golden_response": "The article discusses...",
"criteria": ["Accuracy", "Conciseness", "Completeness"]
}
]
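The `_compute_similarity` helper is left open above; a common choice is cosine similarity over embeddings. The sketch below assumes a hypothetical async `embedder.embed(text) -> list[float]` client and is not tied to any particular provider.
import math


async def compute_similarity(embedder, response: str, golden: str) -> float:
    """Cosine similarity between two texts via a hypothetical embedding client."""
    a = await embedder.embed(response)
    b = await embedder.embed(golden)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
Embedding similarity catches wording drift but not subtle factual errors, which is why the suite also runs the LLM judge on every case before deciding pass or fail.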
Best Practices
llm_testing_best_practices:
structure:
- Separate deterministic from probabilistic tests
- Run fast tests frequently, slow tests in CI
- Maintain golden datasets
evaluation:
- Define clear quality criteria
- Use multiple evaluation methods
- Track metrics over time
regression:
- Test before model updates
- Test after prompt changes
- Version your prompts
scale:
- Automate what you can
- Sample for expensive evaluations (see the sketch after this list)
- Human review for edge cases
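For the sampling point above, one lightweight pattern is to judge a random subset of golden cases on every CI run and the full set before releases or model updates; the fraction and seed handling below are illustrative choices.
import random


def sample_cases(cases: list, fraction: float = 0.2, seed=None) -> list:
    """Pick a random subset of golden cases for expensive LLM-as-judge runs."""
    rng = random.Random(seed)
    k = max(1, int(len(cases) * fraction))
    return rng.sample(cases, k)


# Example: judge ~20% of the golden dataset per CI run.
subset = sample_cases(golden_dataset, fraction=0.2)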
Key Takeaways
- LLM testing requires different strategies than traditional software
- Test deterministic components with traditional unit tests
- Use mocks for integration testing without LLM calls
- LLM-as-judge enables scalable quality evaluation
- Semantic assertions check meaning, not exact text
- Golden datasets detect quality regressions
- Define clear evaluation criteria upfront
- Track metrics over time to catch drift
- Automate most testing, human review for ambiguous cases
- Testing is an investment: it prevents production surprises
Test your LLMs. Your users will thank you.