Running LLMs locally has become practical with tools like Ollama and open-source models like Llama 2 and Mistral. Local LLMs enable faster development iteration, complete privacy, and zero API costs. For many use cases, they’re now good enough.
Here’s a practical guide to local LLM development.
Why Local LLMs
Benefits
local_llm_benefits:
  development_speed:
    - No rate limits
    - No API latency
    - Fast iteration
    - Offline development
  privacy:
    - Data never leaves machine
    - No third-party logging
    - Sensitive data safe
    - Compliance friendly
  cost:
    - Zero marginal cost
    - Unlimited testing
    - No surprise bills
    - Predictable expenses
  control:
    - Model choice
    - No terms of service changes
    - Reproducible results
    - Custom modifications
Trade-offs
local_llm_tradeoffs:
  capability:
    - Smaller than GPT-4/Claude 3
    - May not match frontier models on complex tasks
    - Context windows vary
  infrastructure:
    - Requires a GPU for reasonable speed
    - Uses system resources
    - Initial setup complexity
  maintenance:
    - Manual updates
    - No hosted improvements
    - You handle operations
Getting Started with Ollama
Installation
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Start the service
ollama serve
Running Models
# Download and run Mistral
ollama run mistral
# Run Llama 2
ollama run llama2
# Run with specific size
ollama run llama2:13b
# Run Code Llama for coding
ollama run codellama
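Before writing application code, it's worth confirming the server is up and seeing which models are already pulled. Here's a minimal check, assuming the default Ollama endpoint on localhost:11434 and its /api/tags listing endpoint:

import requests

def list_local_models(host: str = "http://localhost:11434") -> list[str]:
    """Return the names of models Ollama has already pulled."""
    response = requests.get(f"{host}/api/tags", timeout=5)
    response.raise_for_status()
    return [m["name"] for m in response.json().get("models", [])]

print(list_local_models())  # e.g. ['mistral:latest', 'llama2:13b']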
API Usage
import requests

def query_local_llm(prompt: str, model: str = "mistral") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# Usage
result = query_local_llm("Explain microservices in 3 sentences.")
print(result)
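For longer outputs, the same endpoint can stream tokens as they're generated. A sketch of the streaming variant, assuming /api/generate's default behavior of returning one JSON object per line:

import json
import requests

def stream_local_llm(prompt: str, model: str = "mistral") -> str:
    """Stream tokens from Ollama, printing as they arrive, and return the full text."""
    chunks = []
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=120,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            data = json.loads(line)
            token = data.get("response", "")
            chunks.append(token)
            print(token, end="", flush=True)
            if data.get("done"):
                break
    return "".join(chunks)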
OpenAI-Compatible API
# Ollama provides an OpenAI-compatible endpoint
import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Not used but required
)

response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "user", "content": "Explain RAG architecture."}
    ]
)
print(response.choices[0].message.content)
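Because the endpoint speaks the OpenAI protocol, system prompts, temperature, and streaming work the same way as against the hosted API. A short sketch reusing the client above (the model name and prompts are illustrative):

stream = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Compare REST and gRPC in two sentences."},
    ],
    temperature=0.2,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)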
Model Selection
Popular Options
local_models_2024:
  general_purpose:
    mistral_7b:
      strengths: Fast, good quality for size
      use_case: General tasks, chat
      requirements: 8GB RAM
    llama2_13b:
      strengths: Better reasoning than 7B
      use_case: More complex tasks
      requirements: 16GB RAM
    mixtral_8x7b:
      strengths: MoE architecture, strong quality
      use_case: Complex tasks, coding
      requirements: 32GB+ RAM
  coding:
    codellama:
      strengths: Code-focused training
      use_case: Code generation, review
      variants: 7B, 13B, 34B
    deepseek_coder:
      strengths: Strong code performance
      use_case: Code tasks
      requirements: Varies by size
  specialized:
    phi_2:
      strengths: Very small, surprisingly capable
      use_case: Resource-constrained environments
      requirements: 4GB RAM
Selection Criteria
model_selection_criteria:
  by_use_case:
    prototyping:
      recommendation: Mistral 7B
      reason: Fast, good enough quality
    production_testing:
      recommendation: Llama 2 13B or Mixtral
      reason: Closer approximation of production-model quality
    code_generation:
      recommendation: Code Llama or DeepSeek Coder
      reason: Code-specific training
  by_hardware:
    8gb_ram:
      options: [Mistral 7B, Phi-2]
      note: Quantized versions expand options
    16gb_ram:
      options: [Llama 2 13B, Code Llama 13B]
    32gb_plus:
      options: [Mixtral, larger models]
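As a rough illustration of the hardware criteria above, a helper can pick a sensible default model from total RAM. This is a sketch that assumes psutil is installed; the thresholds mirror the table and are approximate:

import psutil

def pick_default_model() -> str:
    """Rough model choice based on total system RAM (see table above)."""
    ram_gb = psutil.virtual_memory().total / 1024**3
    if ram_gb >= 32:
        return "mixtral"
    if ram_gb >= 16:
        return "llama2:13b"
    return "mistral"  # Phi-2 is another option on very small machines

print(pick_default_model())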
Development Workflow
Local-First Development
import os

import openai


class LLMClient:
    """Unified client that works with local and remote LLMs."""

    def __init__(self):
        self.use_local = os.getenv("USE_LOCAL_LLM", "true").lower() == "true"
        if self.use_local:
            # Ollama's OpenAI-compatible endpoint
            self.client = openai.OpenAI(
                base_url="http://localhost:11434/v1",
                api_key="ollama"
            )
            self.model = os.getenv("LOCAL_MODEL", "mistral")
        else:
            # Hosted API; reads OPENAI_API_KEY from the environment
            self.client = openai.OpenAI()
            self.model = os.getenv("REMOTE_MODEL", "gpt-4-turbo")

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        return response.choices[0].message.content

# Development: USE_LOCAL_LLM=true
# Production: USE_LOCAL_LLM=false
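Usage is then identical in both modes; for example, with the environment variables above set:

client = LLMClient()
summary = client.generate(
    "Summarize the benefits of local LLM development in two sentences.",
    temperature=0.3,  # kwargs pass straight through to chat.completions.create
)
print(summary)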
Testing with Local Models
# Fast iteration with local models
sample_text = """
Microservices structure an application as a set of small, independently
deployable services that communicate over the network.
"""

def test_prompt_variations():
    client = LLMClient()  # Uses the local model by default
    prompts = [
        "Summarize in 3 bullet points: {text}",
        "Key takeaways from this text: {text}",
        "TL;DR: {text}"
    ]
    for prompt_template in prompts:
        prompt = prompt_template.format(text=sample_text)
        result = client.generate(prompt)
        print(f"Template: {prompt_template[:30]}...")
        print(f"Result: {result[:100]}...")
        print("---")

# Run unlimited iterations with no cost
test_prompt_variations()
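Because local calls are free, the same checks can run in a normal test suite on every commit. A minimal smoke-test sketch (runnable with pytest; the assertions are deliberately loose since local models aren't deterministic):

def test_summary_prompt_returns_output():
    client = LLMClient()
    text = "Local LLMs remove rate limits and API costs during development."
    result = client.generate(f"Summarize in 3 bullet points: {text}")
    # Smoke checks only: don't assert exact wording against a local model
    assert result.strip()
    assert len(result) < 2000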
Production Considerations
When to Use Local vs. Remote
deployment_decision:
  use_local_in_production:
    - Privacy requirements
    - Air-gapped environments
    - Cost-sensitive applications
    - Acceptable quality trade-off
    - Predictable workloads
  use_remote_in_production:
    - Need frontier capabilities
    - Variable workloads
    - No GPU infrastructure
    - Quality is critical
Hybrid Approach
import openai


class HybridLLMClient:
    """Route to a local or remote model based on task complexity."""

    def __init__(self):
        # Local Ollama endpoint (OpenAI-compatible) and the hosted API
        self.local_client = openai.OpenAI(
            base_url="http://localhost:11434/v1", api_key="ollama"
        )
        self.remote_client = openai.OpenAI()
        self.local_model = "mistral"
        self.remote_model = "gpt-4-turbo"

    def generate(self, prompt: str, complexity: str = "auto") -> str:
        if complexity == "simple" or self._assess_complexity(prompt) < 0.5:
            # Use the fast, free local model
            return self._chat(self.local_client, self.local_model, prompt)
        # Use the more capable remote model
        return self._chat(self.remote_client, self.remote_model, prompt)

    def _chat(self, client: openai.OpenAI, model: str, prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def _assess_complexity(self, prompt: str) -> float:
        # Crude keyword heuristics; swap in a classifier for real routing
        indicators = [
            len(prompt) > 2000,
            "reasoning" in prompt.lower(),
            "complex" in prompt.lower(),
            "analyze" in prompt.lower(),
        ]
        return sum(indicators) / len(indicators)
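Routing stays invisible to callers; for example:

hybrid = HybridLLMClient()

# Stays local: explicitly marked simple
print(hybrid.generate("Suggest a better name for a variable called data2.", complexity="simple"))

# Routes remote: "analyze" and "complex" push the heuristic score to 0.5
print(hybrid.generate("Analyze the complex trade-offs of event sourcing for a payments system."))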
Key Takeaways
- Local LLMs are practical in 2024 with Ollama and open models
- Benefits: speed, privacy, cost, control
- Mistral 7B is excellent for development iteration
- Use OpenAI-compatible APIs for easy switching
- Build abstraction layer for local/remote switching
- Test with local, deploy with appropriate model
- Hybrid approaches optimize cost and quality
- Hardware requirements vary by model size
- Quality is good enough for many use cases
Local LLMs accelerate development. Use them.