Multi-Modal AI: Building Applications That See and Read

December 11, 2023

GPT-4V (Vision) adds image understanding to GPT-4’s capabilities. This isn’t just OCR—the model can reason about images, understand context, and combine visual and textual information. Multi-modal AI opens new application categories.

Here’s how to build applications that see and read.

Multi-Modal Capabilities

What GPT-4V Can Do

gpt4v_capabilities:
  visual_understanding:
    - Object and scene recognition
    - Text extraction (OCR)
    - Chart and diagram interpretation
    - Spatial relationships

  reasoning:
    - Answer questions about images
    - Compare multiple images
    - Combine image and text context
    - Follow complex visual instructions

  limitations:
    - May struggle with small text
    - Can miss fine details
    - Inconsistent with specific measurements
    - No video understanding (yet)

Basic Usage

Sending Images to GPT-4V

import openai
import base64

def encode_image(image_path: str) -> str:
    """Read a local image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def analyze_image(image_path: str, prompt: str) -> str:
    client = openai.OpenAI()

    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"  # or "low" for faster/cheaper
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )

    return response.choices[0].message.content

# Usage
result = analyze_image(
    "architecture_diagram.png",
    "Describe the system architecture shown in this diagram."
)
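
Base64 data URIs work for local files, but the same image_url field also accepts a plain HTTPS URL, which keeps the request payload small when the image is already hosted somewhere. A minimal sketch (the URL is a placeholder):

def analyze_remote_image(image_url: str, prompt: str) -> str:
    """Analyze an image by URL instead of uploading base64 data."""
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # The API fetches the image itself when given an HTTPS URL
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=1000
    )

    return response.choices[0].message.content

# Usage (placeholder URL)
result = analyze_remote_image(
    "https://example.com/product_photo.jpg",
    "What product is shown here, and what condition is it in?"
)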

Multiple Images

def compare_images(image_paths: list[str], prompt: str) -> str:
    client = openai.OpenAI()

    content = [{"type": "text", "text": prompt}]

    for path in image_paths:
        base64_image = encode_image(path)
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
        })

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )

    return response.choices[0].message.content

# Compare two UI designs
result = compare_images(
    ["design_v1.png", "design_v2.png"],
    "Compare these two UI designs. What are the key differences?"
)
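
Because content is an ordered list of parts, text and images can be interleaved. Labeling each image before it appears lets the model reference them unambiguously ("Image 1 is cleaner, but Image 2…"). A sketch building on compare_images:

def compare_labeled_images(image_paths: list[str], prompt: str) -> str:
    """Interleave text labels with images so the model can cite them by name."""
    client = openai.OpenAI()

    content = [{"type": "text", "text": prompt}]

    for i, path in enumerate(image_paths, start=1):
        # Each label precedes the image it describes
        content.append({"type": "text", "text": f"Image {i} ({path}):"})
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}
        })

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )

    return response.choices[0].message.content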

Application Patterns

Document Understanding

import json

class DocumentAnalyzer:
    """Analyze documents with mixed text and visual content."""

    def __init__(self):
        self.client = openai.OpenAI()

    def analyze_document(self, image_path: str) -> dict:
        base64_image = encode_image(image_path)

        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": """Analyze this document and extract:
1. Document type (invoice, receipt, report, etc.)
2. Key fields and their values
3. Any tables present
4. Important dates and amounts

Return as JSON."""
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                        }
                    ]
                }
            ],
            max_tokens=2000
        )

        # Assumes bare JSON output; see the hardened parser after the usage example
        return json.loads(response.choices[0].message.content)

# Process an invoice (GPT-4V accepts images, not PDFs, so render PDF pages to PNG first)
analyzer = DocumentAnalyzer()
invoice_data = analyzer.analyze_document("invoice.png")
# Returns: {"type": "invoice", "vendor": "...", "total": 1234.56, ...}
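
gpt-4-vision-preview doesn't support JSON mode, so responses sometimes arrive wrapped in markdown fences or with extra commentary. A defensive parser along these lines (a sketch, not an SDK feature) keeps json.loads from choking:

import json
import re

def parse_json_response(raw: str) -> dict:
    """Pull a JSON object out of a response that may include fences or prose."""
    # Strip ```json ... ``` fences if the model added them
    match = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if match:
        raw = match.group(1)

    # Fall back to the outermost braces if prose surrounds the object
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end != -1:
        raw = raw[start:end + 1]

    return json.loads(raw)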

Visual QA System

class VisualQA:
    """Answer questions about images using conversation context."""

    def __init__(self):
        self.client = openai.OpenAI()
        self.conversations = {}

    def ask(self, user_id: str, image_path: str, question: str) -> str:
        # Get or create conversation history
        if user_id not in self.conversations:
            self.conversations[user_id] = []

        messages = self.conversations[user_id].copy()

        # Add new question with image
        base64_image = encode_image(image_path)
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                }
            ]
        })

        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=messages,
            max_tokens=500
        )

        answer = response.choices[0].message.content

        # Keep only the text in history so past images don't inflate token usage
        self.conversations[user_id].append({
            "role": "user",
            "content": question
        })
        self.conversations[user_id].append({
            "role": "assistant",
            "content": answer
        })

        return answer
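
Usage follows the same pattern as the earlier helpers; the image path and questions are placeholders:

qa = VisualQA()

# The image is re-attached on every call; past turns are stored as text only
print(qa.ask("user-123", "sales_chart.png", "What trend does this chart show?"))
print(qa.ask("user-123", "sales_chart.png", "Which quarter had the steepest decline?"))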

UI Analysis

def analyze_ui(screenshot_path: str) -> str:
    """Analyze a UI screenshot for issues and improvements."""
    client = openai.OpenAI()

    base64_image = encode_image(screenshot_path)

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Analyze this UI screenshot:

1. Identify any usability issues
2. Check for accessibility concerns (contrast, text size)
3. Evaluate visual hierarchy
4. Suggest improvements

Be specific and actionable."""
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    }
                ]
            }
        ],
        max_tokens=1500
    )

    return response.choices[0].message.content
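
As with the other helpers, usage is a single call (the filename is a placeholder):

# Review a checkout screenshot for usability and accessibility issues
report = analyze_ui("checkout_page.png")
print(report)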

Cost and Performance

Optimization Strategies

cost_optimization:
  image_resolution:
    low_detail:
      tokens: ~85 per image (flat, independent of size)
      use_for: Simple images, thumbnails
    high_detail:
      tokens: ~85 base + 170 per 512px tile
      use_for: Detailed analysis, small text

  strategies:
    - Resize images before sending
    - Use low detail when sufficient
    - Batch related questions
    - Cache results for repeated analysis
A helper along these lines implements the first strategy:

import base64
import io

from PIL import Image

def optimize_image(image_path: str, max_size: int = 1024) -> str:
    """Resize and re-encode an image to cut token cost while keeping quality."""
    img = Image.open(image_path)

    # Downscale if the longest side exceeds max_size
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = tuple(int(dim * ratio) for dim in img.size)
        img = img.resize(new_size, Image.Resampling.LANCZOS)

    # JPEG has no alpha channel, so flatten transparent PNGs first
    if img.mode in ("RGBA", "P"):
        img = img.convert("RGB")

    # Re-encode as JPEG for a smaller payload
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode('utf-8')
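
The tile formula above can also be turned into a pre-flight cost estimate. This sketch follows OpenAI's documented high-detail rules at the time of writing (fit within a 2048x2048 square, scale the shortest side to 768, then count 512px tiles at 170 tokens each plus an 85-token base); whether very small images get upscaled is an assumption here, so treat the result as approximate:

import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate token cost of one image under the published tiling rules."""
    if detail == "low":
        return 85  # flat cost regardless of resolution

    # Scale down to fit within a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale

    # Scale down so the shortest side is at most 768px (assumed: no upscaling)
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale

    # 170 tokens per 512px tile, plus an 85-token base
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

# 1024x1024 at high detail -> 768x768 -> 2x2 tiles -> 765 tokens
print(estimate_image_tokens(1024, 1024))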

Use Cases

High-Value Applications

multimodal_use_cases:
  document_processing:
    - Invoice and receipt extraction
    - Form understanding
    - Contract analysis
    - ID verification

  visual_inspection:
    - Quality control
    - Damage assessment
    - Inventory verification
    - Safety compliance

  accessibility:
    - Image descriptions for blind users
    - Alt text generation (sketched below)
    - Visual content summarization

  creative:
    - Design feedback
    - Brand consistency checking
    - UI/UX review
    - Asset organization
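
The accessibility row is the easiest to prototype: alt text generation is a single call with a constrained prompt. A sketch reusing analyze_image from earlier (the 125-character limit is a common alt-text guideline, not an API requirement):

ALT_TEXT_PROMPT = (
    "Write concise alt text for this image for a screen reader: "
    "one sentence, under 125 characters, no 'image of' preamble."
)

def generate_alt_text(image_path: str) -> str:
    """Generate screen-reader alt text for an image."""
    return analyze_image(image_path, ALT_TEXT_PROMPT)

# Usage (placeholder filename)
alt = generate_alt_text("hero_banner.png")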

Key Takeaways

Vision + language enables new categories of applications: document pipelines that read invoices, QA tools that answer questions about screenshots, accessibility features that describe images. Pick one high-value use case and build it.