Structured Output Patterns for LLMs

April 29, 2024

Getting LLMs to produce structured output—JSON, specific formats, validated data—is crucial for programmatic use. But LLMs are text generators, not data structure generators. Reliable structured output requires specific patterns.

Here are patterns for getting structured output from LLMs.

The Challenge

Why Structured Output Is Hard

structured_output_challenges:
  llms_generate_text:
    - Not designed for structured data
    - May add explanatory text
    - Format can vary

  edge_cases:
    - Unexpected input causes format breaks
    - Long outputs may truncate
    - Models may refuse and return text

  validation:
    - Output may be syntactically valid but semantically wrong
    - Missing required fields
    - Wrong data types

Native JSON Mode

Using API JSON Mode

import json
import openai

# OpenAI JSON mode
response = openai.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Extract information as JSON."},
        {"role": "user", "content": f"Extract name and email from: {text}"}
    ],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)

# Anthropic/Claude
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"""Extract as JSON with fields: name, email

Text: {text}

JSON:"""
    }]
)

# Parse from response
content = response.content[0].text
data = json.loads(content)
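
Without a native JSON mode, the model may wrap the JSON in explanatory prose or markdown fences, so a bare json.loads can fail. A defensive parsing sketch (the helper name is my own):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Defensively parse JSON from LLM text that may include
    markdown fences or surrounding prose."""
    # Strip markdown code fences if present
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(cleaned[start:end + 1])
```

This recovers the common failure modes (fenced output, leading "Here is the JSON:") but is still a heuristic; validation, covered below, remains necessary.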

Prompt Patterns

Schema in Prompt

EXTRACTION_PROMPT = """
Extract information from the text as JSON.

Output Schema:
{
  "name": "string (full name)",
  "email": "string (email address) or null",
  "company": "string (company name) or null",
  "role": "string (job title) or null"
}

Rules:
- Use null for missing information
- Normalize email to lowercase
- Return ONLY valid JSON, no additional text

Text:
{text}

JSON:
"""

def extract_contact(text: str) -> dict:
    # str.replace instead of str.format: the schema block above contains
    # literal braces that format() would treat as placeholders
    prompt = EXTRACTION_PROMPT.replace("{text}", text)
    response = llm.generate(prompt, response_format={"type": "json_object"})
    return json.loads(response)

Few-Shot with Examples

FEW_SHOT_PROMPT = """
Convert the text to structured data.

Example 1:
Input: "John Smith is a software engineer at Acme Corp. Reach him at john@acme.com"
Output: {"name": "John Smith", "role": "software engineer", "company": "Acme Corp", "email": "john@acme.com"}

Example 2:
Input: "Contact our sales team for pricing"
Output: {"name": null, "role": null, "company": null, "email": null}

Now convert:
Input: "{text}"
Output:"""
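
The example outputs in this template contain literal braces, so filling it with str.format would fail; use str.replace instead. A minimal sketch with an abbreviated stand-in template:

```python
# Abbreviated stand-in for the few-shot template above
FEW_SHOT_PROMPT = """Convert the text to structured data.

Example:
Input: "Contact our sales team for pricing"
Output: {"name": null, "email": null}

Now convert:
Input: "{text}"
Output:"""

def build_few_shot_prompt(text: str) -> str:
    # str.replace instead of str.format: the example JSON contains
    # literal braces that format() would treat as placeholders
    return FEW_SHOT_PROMPT.replace("{text}", text)
```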

Validation Patterns

Schema Validation

# EmailStr requires the email-validator extra: pip install "pydantic[email]"
from pydantic import BaseModel, EmailStr, ValidationError, validator
from typing import Optional

class ContactInfo(BaseModel):
    name: str
    email: Optional[EmailStr] = None
    company: Optional[str] = None
    role: Optional[str] = None

    @validator('name')
    def name_not_empty(cls, v):
        if not v or not v.strip():
            raise ValueError('Name cannot be empty')
        return v.strip()

def extract_with_validation(text: str) -> ContactInfo:
    response = llm.generate(extraction_prompt(text))
    data = json.loads(response)
    return ContactInfo(**data)  # Raises on invalid data

Retry with Feedback

def extract_with_retry(text: str, max_attempts: int = 3) -> dict:
    schema = ContactInfo.schema()

    for attempt in range(max_attempts):
        try:
            response = llm.generate(extraction_prompt(text, schema))
            data = json.loads(response)
            validated = ContactInfo(**data)
            return validated.dict()
        except json.JSONDecodeError as e:
            if attempt == max_attempts - 1:
                raise
            # Retry with feedback
            text = f"{text}\n\nPrevious response was not valid JSON: {e}"
        except ValidationError as e:
            if attempt == max_attempts - 1:
                raise
            # Retry with validation feedback
            text = f"{text}\n\nPrevious response had validation errors: {e}"
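
When all retries are exhausted, callers still need something sensible. A minimal fallback wrapper (names are illustrative, not part of any library):

```python
import logging

logger = logging.getLogger(__name__)

def extract_or_default(text: str, extract_fn, default=None) -> dict:
    """Wrap an extraction function with fallback behavior:
    log the failure and return a safe default instead of raising."""
    try:
        return extract_fn(text)
    except Exception:
        logger.exception("extraction failed; returning default")
        return default if default is not None else {}
```

Logging the failures gives you the data to decide whether the prompt, the schema, or the model needs to change.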

Function Calling

OpenAI Function Calling

tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_contact",
            "description": "Extract contact information from text",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Full name"},
                    "email": {"type": "string", "description": "Email address"},
                    "company": {"type": "string", "description": "Company name"},
                    "role": {"type": "string", "description": "Job title"}
                },
                "required": ["name"]
            }
        }
    }
]

response = openai.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": f"Extract contact: {text}"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_contact"}}
)

# Arguments follow the tool schema, but can still be semantically wrong—validate before use
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
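
The tool schema constrains the shape of the arguments, not their semantics, so it pairs well with the Pydantic validation pattern above. A sketch combining the two (the ContactInfo model mirrors the earlier one, with a plain str email to stay self-contained):

```python
import json
from typing import Optional
from pydantic import BaseModel, ValidationError

class ContactInfo(BaseModel):
    name: str
    email: Optional[str] = None
    company: Optional[str] = None
    role: Optional[str] = None

def parse_tool_call(arguments_json: str) -> Optional[ContactInfo]:
    """Validate function-call arguments; return None on malformed
    JSON or schema violations so callers can decide how to recover."""
    try:
        return ContactInfo(**json.loads(arguments_json))
    except (json.JSONDecodeError, ValidationError):
        return None
```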

Complex Structures

Nested Objects

class Address(BaseModel):
    street: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None
    postal_code: Optional[str] = None

class Organization(BaseModel):
    name: str
    industry: Optional[str] = None
    size: Optional[str] = None
    addresses: list[Address] = []

NESTED_PROMPT = """
Extract organization information as JSON.

Schema:
{
  "name": "string (required)",
  "industry": "string or null",
  "size": "string (small/medium/large/enterprise) or null",
  "addresses": [
    {
      "street": "string or null",
      "city": "string or null",
      "state": "string or null",
      "country": "string or null",
      "postal_code": "string or null"
    }
  ]
}
"""
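
Pydantic coerces nested dicts into nested models automatically, so the validation pattern extends to nested schemas without extra code. A self-contained sketch with an illustrative payload:

```python
import json
from typing import Optional
from pydantic import BaseModel

class Address(BaseModel):
    street: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None

class Organization(BaseModel):
    name: str
    industry: Optional[str] = None
    addresses: list[Address] = []

# Illustrative LLM response matching the nested schema
raw = '{"name": "Acme", "industry": "manufacturing", "addresses": [{"city": "Springfield", "country": "US"}]}'
org = Organization(**json.loads(raw))
# org.addresses is now a list of validated Address objects
```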

Lists of Items

class Task(BaseModel):
    title: str
    priority: str
    due_date: Optional[str] = None

class TaskList(BaseModel):
    tasks: list[Task]

def extract_tasks(text: str) -> TaskList:
    prompt = f"""
Extract tasks from this text as JSON.

Output format:
{{"tasks": [{{"title": "string", "priority": "high/medium/low", "due_date": "YYYY-MM-DD or null"}}]}}

Text: {text}

JSON:"""

    response = llm.generate(prompt, response_format={"type": "json_object"})
    return TaskList(**json.loads(response))

Best Practices

structured_output_best_practices:
  prompting:
    - Include schema in prompt
    - Use few-shot examples
    - Specify "JSON only, no other text"
    - Use response_format when available

  validation:
    - Always validate output
    - Use Pydantic or similar
    - Handle validation errors gracefully

  reliability:
    - Implement retry with feedback
    - Have fallback behavior
    - Log failures for analysis

  performance:
    - Cache parsed results
    - Batch similar extractions
    - Use appropriate model (not always GPT-4)
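
The caching practice above can be sketched in a few lines: key results by a hash of the input so repeated texts skip a round trip to the LLM (a minimal in-memory version; production code would use a TTL or external store):

```python
import hashlib

_cache: dict[str, dict] = {}

def cached_extract(text: str, extract_fn) -> dict:
    """Memoize extraction results keyed by a hash of the input text,
    so repeated inputs skip a round trip to the LLM."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(text)
    return _cache[key]
```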

Key Takeaways

Structured output is solvable when you layer the right patterns: request JSON explicitly (native JSON mode or function calling where available), put the schema and examples in the prompt, validate every response with a schema library like Pydantic, and retry with feedback when validation fails.