GPT-4V (Vision) adds image understanding to GPT-4’s capabilities. This isn’t just OCR—the model can reason about images, understand context, and combine visual and textual information. Multi-modal AI opens new application categories.
Here’s how to build applications that see and read.
Multi-Modal Capabilities
What GPT-4V Can Do
gpt4v_capabilities:
  visual_understanding:
    - Object and scene recognition
    - Text extraction (OCR)
    - Chart and diagram interpretation
    - Spatial relationships

  reasoning:
    - Answer questions about images
    - Compare multiple images
    - Combine image and text context
    - Follow complex visual instructions

  limitations:
    - May struggle with small text
    - Can miss fine details
    - Inconsistent with specific measurements
    - No video understanding (yet)
Basic Usage
Sending Images to GPT-4V
import openai
import base64

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def analyze_image(image_path: str, prompt: str) -> str:
    client = openai.OpenAI()
    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            # The MIME type should match the actual file
                            # format (image/png for PNGs, image/jpeg for JPEGs)
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"  # or "low" for faster/cheaper
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Usage
result = analyze_image(
    "architecture_diagram.png",
    "Describe the system architecture shown in this diagram."
)
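Images can also be referenced by URL instead of base64, which skips the encoding step for web-hosted assets. The request shape is otherwise identical; here is a minimal sketch:

def analyze_image_url(image_url: str, prompt: str) -> str:
    """Same call as above, but with a web-hosted image URL instead of base64."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content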
Multiple Images
def compare_images(image_paths: list[str], prompt: str) -> str:
    client = openai.OpenAI()

    content = [{"type": "text", "text": prompt}]
    for path in image_paths:
        base64_image = encode_image(path)
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
        })

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )
    return response.choices[0].message.content

# Compare two UI designs
result = compare_images(
    ["design_v1.png", "design_v2.png"],
    "Compare these two UI designs. What are the key differences?"
)
Application Patterns
Document Understanding
import json

class DocumentAnalyzer:
    """Analyze documents with mixed text and visual content."""

    def __init__(self):
        self.client = openai.OpenAI()

    def analyze_document(self, image_path: str) -> dict:
        base64_image = encode_image(image_path)

        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": """Analyze this document and extract:
1. Document type (invoice, receipt, report, etc.)
2. Key fields and their values
3. Any tables present
4. Important dates and amounts

Return as JSON."""
                        },
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                        }
                    ]
                }
            ],
            max_tokens=2000
        )
        return json.loads(response.choices[0].message.content)

# Process an invoice. Note: the vision endpoint accepts image formats only,
# so PDFs must be rendered to images (e.g., one PNG per page) first.
analyzer = DocumentAnalyzer()
invoice_data = analyzer.analyze_document("invoice.png")
# Returns: {"type": "invoice", "vendor": "...", "total": 1234.56, ...}
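One practical caveat: vision models often wrap a JSON answer in a markdown code fence, which makes the bare json.loads call above throw. A small defensive parser helps; this is a sketch, and the fence regex is an assumption about typical output formatting:

import json
import re

def parse_model_json(raw: str) -> dict:
    """Best-effort JSON extraction from a model reply; the fence pattern
    is an assumption about how the model typically formats answers."""
    # Strip a ```json ... ``` code fence if present
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    if match:
        raw = match.group(1)
    return json.loads(raw)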
Visual QA System
class VisualQA:
    """Answer questions about images using conversation context."""

    def __init__(self):
        self.client = openai.OpenAI()
        self.conversations = {}

    def ask(self, user_id: str, image_path: str, question: str) -> str:
        # Get or create conversation history
        if user_id not in self.conversations:
            self.conversations[user_id] = []
        messages = self.conversations[user_id].copy()

        # Add new question with image
        base64_image = encode_image(image_path)
        messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                }
            ]
        })

        response = self.client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=messages,
            max_tokens=500
        )
        answer = response.choices[0].message.content

        # Store conversation (text only for history)
        self.conversations[user_id].append({
            "role": "user",
            "content": question
        })
        self.conversations[user_id].append({
            "role": "assistant",
            "content": answer
        })
        return answer
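Usage mirrors the single-shot helpers above; the file name and questions are hypothetical:

# Multi-turn session: the stored text history carries context between turns
qa = VisualQA()
print(qa.ask("user-1", "floor_plan.png", "How many rooms does this floor plan show?"))
print(qa.ask("user-1", "floor_plan.png", "Which room has the most windows?"))

Note the design trade-off: the class stores text-only history, so the image is re-attached on every call. That keeps the history small, but each turn pays the image token cost again.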
UI Analysis
def analyze_ui(screenshot_path: str) -> str:
    """Analyze a UI screenshot for issues and improvements."""
    client = openai.OpenAI()
    base64_image = encode_image(screenshot_path)

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Analyze this UI screenshot:
1. Identify any usability issues
2. Check for accessibility concerns (contrast, text size)
3. Evaluate visual hierarchy
4. Suggest improvements

Be specific and actionable."""
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    }
                ]
            }
        ],
        max_tokens=1500
    )
    return response.choices[0].message.content
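Following the pattern of the earlier helpers, usage is a single call (the screenshot name is hypothetical):

# Run the review against a checkout page screenshot
report = analyze_ui("checkout_page.png")
print(report)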
Cost and Performance
Optimization Strategies
cost_optimization:
  image_resolution:
    low_detail:
      tokens: ~85 per image
      use_for: Simple images, thumbnails
    high_detail:
      tokens: ~85 base + 170 per 512px tile
      use_for: Detailed analysis, small text

  strategies:
    - Resize images before sending
    - Use low detail when sufficient
    - Batch related questions
    - Cache results for repeated analysis
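The high-detail formula above can be turned into a quick pre-flight estimator. The sketch below follows OpenAI's published sizing rules at the time of writing (fit within 2048x2048, scale the shortest side down to 768, then count 512-pixel tiles); treat the constants as assumptions to verify against current pricing docs.

import math

def estimate_high_detail_tokens(width: int, height: int) -> int:
    """Rough token estimate for one high-detail image. Constants follow
    OpenAI's documented sizing rules at the time of writing; verify
    against current docs before relying on them."""
    # Scale to fit within 2048x2048
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is at most 768
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 170 tokens per 512px tile, plus the 85-token base
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(estimate_high_detail_tokens(1024, 1024))  # 765: four tiles plus the base

The simplest lever, though, is resizing images before upload: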
import io
import base64
from PIL import Image

def optimize_image(image_path: str, max_size: int = 1024) -> str:
    """Resize image to reduce tokens while maintaining quality."""
    img = Image.open(image_path)

    # Resize if larger than max_size
    if max(img.size) > max_size:
        ratio = max_size / max(img.size)
        new_size = tuple(int(dim * ratio) for dim in img.size)
        img = img.resize(new_size, Image.Resampling.LANCZOS)

    # JPEG has no alpha channel, so flatten RGBA/palette images first
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Re-encode as JPEG for a smaller payload
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode('utf-8')
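The cache-results strategy can be as simple as memoizing on a hash of the image bytes plus the prompt. A minimal in-memory sketch, reusing the analyze_image helper from earlier (a persistent store is the obvious next step):

import hashlib

_analysis_cache: dict[str, str] = {}

def analyze_image_cached(image_path: str, prompt: str) -> str:
    """Memoize analyze_image on (image bytes, prompt) so repeated
    questions about the same file skip the API call entirely."""
    with open(image_path, "rb") as f:
        key = hashlib.sha256(f.read() + prompt.encode()).hexdigest()
    if key not in _analysis_cache:
        _analysis_cache[key] = analyze_image(image_path, prompt)
    return _analysis_cache[key]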
Use Cases
High-Value Applications
multimodal_use_cases:
  document_processing:
    - Invoice and receipt extraction
    - Form understanding
    - Contract analysis
    - ID verification

  visual_inspection:
    - Quality control
    - Damage assessment
    - Inventory verification
    - Safety compliance

  accessibility:
    - Image descriptions for blind users
    - Alt text generation
    - Visual content summarization

  creative:
    - Design feedback
    - Brand consistency checking
    - UI/UX review
    - Asset organization
Key Takeaways
- GPT-4V understands images with reasoning, not just recognition
- Use base64 encoding or URLs to send images
- Detail level affects cost significantly (low vs. high)
- Combine text and images in single prompts for context
- Maintain conversation history for multi-turn visual QA
- Optimize image size to reduce costs
- High-value use cases: document processing, visual inspection, accessibility
- Limitations: small text, fine details, measurements
- Multi-modal is a new capability category—explore creatively
Vision + language enables new categories of applications. Build them.