Multi-Modal AI: Building Applications That See and Read

December 11, 2023

GPT-4V (Vision) adds image understanding to GPT-4’s capabilities. This isn’t just OCR—the model can reason about images, understand context, and combine visual and textual information. Multi-modal AI opens new application categories.

Here’s how to build applications that see and read.

Multi-Modal Capabilities

What GPT-4V Can Do

gpt4v_capabilities:
  visual_understanding:
    - Object and scene recognition
    - Text extraction (OCR)
    - Chart and diagram interpretation
    - Spatial relationships

  reasoning:
    - Answer questions about images
    - Compare multiple images
    - Combine image and text context
    - Follow complex visual instructions

  limitations:
    - May struggle with small text
    - Can miss fine details
    - Inconsistent with specific measurements
    - No video understanding (yet)

Basic Usage

Sending Images to GPT-4V

import openai
import base64

def encode_image(image_path: str) -> str:
    """Read a local image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def analyze_image(image_path: str, prompt: str) -> str:
    client = openai.OpenAI()

    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"  # or "low" for faster/cheaper
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )

    return response.choices[0].message.content

# Usage
result = analyze_image(
    "architecture_diagram.png",
    "Describe the system architecture shown in this diagram."
)
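
Base64 data URIs work for local files, but the same image_url field also accepts a plain HTTPS URL, which keeps the request payload small when the image is already hosted somewhere. A minimal sketch (the URL is a placeholder):

def analyze_remote_image(image_url: str, prompt: str) -> str:
    """Analyze an image by URL instead of uploading base64 data."""
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # The API fetches the image itself when given an HTTPS URL
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=1000
    )

    return response.choices[0].message.content

# Usage (placeholder URL)
result = analyze_remote_image(
    "https://example.com/product_photo.jpg",
    "What product is shown here, and what condition is it in?"
)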

Multiple Images

def compare_images(image_paths: list[str], prompt: str) -> str:
    client = openai.OpenAI()

    content = [{"type": "text", "text": prompt}]

    for path in image_paths:
        base64_image = encode_image(path)
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
        })

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )

    return response.choices[0].message.content

# Compare two UI designs
result = compare_images(
    ["design_v1.png", "design_v2.png"],
    "Compare these two UI designs. What are the key differences?"
)
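
Because content is an ordered list of parts, text and images can be interleaved. Labeling each image before it appears lets the model reference them unambiguously ("Image 1 is cleaner, but Image 2…"). A sketch building on compare_images:

def compare_labeled_images(image_paths: list[str], prompt: str) -> str:
    """Interleave text labels with images so the model can cite them by name."""
    client = openai.OpenAI()

    content = [{"type": "text", "text": prompt}]

    for i, path in enumerate(image_paths, start=1):
        # Each label precedes the image it describes
        content.append({"type": "text", "text": f"Image {i} ({path}):"})
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}
        })

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )

    return response.choices[0].message.content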

Application Patterns

Document Understanding

import json

class DocumentAnalyzer:
    """Analyze documents with mixed text and visual content."""

    def __init__(self):
        self.client = openai.OpenAI()

    def analyze_document(self, image_path: str) -> dict:
        base64_image = encode_image(image_path)

        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": """Analyze this document and extract:
1. Document type (invoice, receipt, report, etc.)
2. Key fields and their values
3. Any tables present
4. Important dates and amounts

Return as JSON."""
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                        }
                    ]
                }
            ],
            max_tokens=2000
        )

        # Assumes bare JSON output; see the hardened parser after the usage example
        return json.loads(response.choices[0].message.content)

# Process an invoice (GPT-4V accepts images, not PDFs, so render PDF pages to PNG first)
analyzer = DocumentAnalyzer()
invoice_data = analyzer.analyze_document("invoice.png")
# Returns: {"type": "invoice", "vendor": "...", "total": 1234.56, ...}
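
gpt-4-vision-preview doesn't support JSON mode, so responses sometimes arrive wrapped in markdown fences or with extra commentary. A defensive parser along these lines (a sketch, not an SDK feature) keeps json.loads from choking:

import json
import re

def parse_json_response(raw: str) -> dict:
    """Pull a JSON object out of a response that may include fences or prose."""
    # Strip ```json ... ``` fences if the model added them
    match = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if match:
        raw = match.group(1)

    # Fall back to the outermost braces if prose surrounds the object
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end != -1:
        raw = raw[start:end + 1]

    return json.loads(raw)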

Visual QA System

class VisualQA:
    """Answer questions about images using conversation context."""

    def __init__(self):
        self.client = openai.OpenAI()
        self.conversations = {}

    def ask(self, user_id: str, image_path: str, question: str) -> str:
        # Get or create conversation history
        if user_id not in self.conversations:
            self.conversations[user_id] = []

        messages = self.conversations[user_id].copy()

        # Add new question with image
        base64_image = encode_image(image_path)
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                }
            ]
        })

        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=messages,
            max_tokens=500
        )

        answer = response.choices[0].message.content

        # Keep only the text in history so past images don't inflate token usage
        self.conversations[user_id].append({
            "role": "user",
            "content": question
        })
        self.conversations[user_id].append({
            "role": "assistant",
            "content": answer
        })

        return answer
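
Usage follows the same pattern as the earlier helpers; the image path and questions are placeholders:

qa = VisualQA()

# The image is re-attached on every call; past turns are stored as text only
print(qa.ask("user-123", "sales_chart.png", "What trend does this chart show?"))
print(qa.ask("user-123", "sales_chart.png", "Which quarter had the steepest decline?"))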

UI Analysis

def analyze_ui(screenshot_path: str) -> str:
    """Analyze a UI screenshot for issues and improvements."""
    client = openai.OpenAI()

    base64_image = encode_image(screenshot_path)

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Analyze this UI screenshot:

1. Identify any usability issues
2. Check for accessibility concerns (contrast, text size)
3. Evaluate visual hierarchy
4. Suggest improvements

Be specific and actionable."""
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    }
                ]
            }
        ],
        max_tokens=1500
    )

    return response.choices[0].message.content
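
As with the other helpers, usage is a single call (the filename is a placeholder):

# Review a checkout screenshot for usability and accessibility issues
report = analyze_ui("checkout_page.png")
print(report)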

Cost and Performance

Optimization Strategies

cost_optimization:
  image_resolution:
    low_detail:
      tokens: ~85 per image (flat, independent of size)
      use_for: Simple images, thumbnails
    high_detail:
      tokens: ~85 base + 170 per 512px tile
      use_for: Detailed analysis, small text

  strategies:
    - Resize images before sending
    - Use low detail when sufficient
    - Batch related questions
    - Cache results for repeated analysis
A helper along these lines implements the first strategy:

import base64
import io

from PIL import Image

def optimize_image(image_path: str, max_size: int = 1024) -> str:
    """Resize and re-encode an image to cut token cost while keeping quality."""
    img = Image.open(image_path)

    # Downscale if the longest side exceeds max_size
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = tuple(int(dim * ratio) for dim in img.size)
        img = img.resize(new_size, Image.Resampling.LANCZOS)

    # JPEG has no alpha channel, so flatten transparent PNGs first
    if img.mode in ("RGBA", "P"):
        img = img.convert("RGB")

    # Re-encode as JPEG for a smaller payload
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode('utf-8')
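
The tile formula above can also be turned into a pre-flight cost estimate. This sketch follows OpenAI's documented high-detail rules at the time of writing (fit within a 2048x2048 square, scale the shortest side to 768, then count 512px tiles at 170 tokens each plus an 85-token base); whether very small images get upscaled is an assumption here, so treat the result as approximate:

import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate token cost of one image under the published tiling rules."""
    if detail == "low":
        return 85  # flat cost regardless of resolution

    # Scale down to fit within a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale

    # Scale down so the shortest side is at most 768px (assumed: no upscaling)
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale

    # 170 tokens per 512px tile, plus an 85-token base
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

# 1024x1024 at high detail -> 768x768 -> 2x2 tiles -> 765 tokens
print(estimate_image_tokens(1024, 1024))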

Use Cases

High-Value Applications

multimodal_use_cases:
  document_processing:
    - Invoice and receipt extraction
    - Form understanding
    - Contract analysis
    - ID verification

  visual_inspection:
    - Quality control
    - Damage assessment
    - Inventory verification
    - Safety compliance

  accessibility:
    - Image descriptions for blind users
    - Alt text generation (sketched below)
    - Visual content summarization

  creative:
    - Design feedback
    - Brand consistency checking
    - UI/UX review
    - Asset organization
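
The accessibility row is the easiest to prototype: alt text generation is a single call with a constrained prompt. A sketch reusing analyze_image from earlier (the 125-character limit is a common alt-text guideline, not an API requirement):

ALT_TEXT_PROMPT = (
    "Write concise alt text for this image for a screen reader: "
    "one sentence, under 125 characters, no 'image of' preamble."
)

def generate_alt_text(image_path: str) -> str:
    """Generate screen-reader alt text for an image."""
    return analyze_image(image_path, ALT_TEXT_PROMPT)

# Usage (placeholder filename)
alt = generate_alt_text("hero_banner.png")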

Key Takeaways

Vision + language enables new categories of applications: document pipelines that read invoices, QA tools that answer questions about screenshots, accessibility features that describe images. Pick one high-value use case and build it.