Local LLMs for Development: Practical Guide

January 22, 2024

Running LLMs locally has become practical with tools like Ollama and open-weight models such as Llama 2 and Mistral. Local LLMs enable faster development iteration, complete privacy, and zero API costs. For many use cases, they’re now good enough.

Here’s a practical guide to local LLM development.

Why Local LLMs

Benefits

local_llm_benefits:
  development_speed:
    - No rate limits
    - No API latency
    - Fast iteration
    - Offline development

  privacy:
    - Data never leaves machine
    - No third-party logging
    - Sensitive data safe
    - Compliance friendly

  cost:
    - Zero marginal cost
    - Unlimited testing
    - No surprise bills
    - Predictable expenses

  control:
    - Model choice
    - No terms of service changes
    - Reproducible results
    - Custom modifications

Trade-offs

local_llm_tradeoffs:
  capability:
    - Smaller than hosted frontier models (GPT-4, Claude)
    - May not match frontier quality on complex tasks
    - Context windows vary

  infrastructure:
    - Requires GPU (for reasonable speed)
    - Uses system resources
    - Initial setup complexity

  maintenance:
    - Manual updates
    - No hosted improvements
    - You handle operations

Getting Started with Ollama

Installation

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Start the service
ollama serve
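
Once the service is running, you can sanity-check it by querying the /api/tags endpoint, which lists locally installed models; a minimal check in Python:

import requests

# Ask the local Ollama server which models are installed (fails if it isn't running)
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
print([m["name"] for m in resp.json()["models"]])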

Running Models

# Download and run Mistral
ollama run mistral

# Run Llama 2
ollama run llama2

# Run with specific size
ollama run llama2:13b

# Run Code Llama for coding
ollama run codellama

API Usage

import requests

def query_local_llm(prompt: str, model: str = "mistral") -> str:
    """Send a prompt to the local Ollama server and return the full response."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        },
        timeout=120,  # local generation can be slow, especially on CPU
    )
    response.raise_for_status()
    return response.json()["response"]

# Usage
result = query_local_llm("Explain microservices in 3 sentences.")
print(result)
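
The same endpoint can also stream: with "stream": true, Ollama returns newline-delimited JSON, one chunk per line, ending with a "done" object. A sketch of consuming that stream (stream_local_llm is an illustrative helper name, not part of any library):

import json

import requests

def stream_local_llm(prompt: str, model: str = "mistral"):
    """Yield response chunks from the local Ollama server as they arrive."""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk["response"]

# Usage: print tokens as they are generated
for token in stream_local_llm("Explain microservices in 3 sentences."):
    print(token, end="", flush=True)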

OpenAI-Compatible API

# Ollama provides an OpenAI-compatible endpoint at /v1
import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Not used but required
)

response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "user", "content": "Explain RAG architecture."}
    ]
)
print(response.choices[0].message.content)

Model Selection

local_models_2024:
  general_purpose:
    mistral_7b:
      strengths: Fast, good quality for size
      use_case: General tasks, chat
      requirements: 8GB RAM

    llama2_13b:
      strengths: Better reasoning than 7B
      use_case: More complex tasks
      requirements: 16GB RAM

    mixtral_8x7b:
      strengths: MoE architecture, strong quality
      use_case: Complex tasks, coding
      requirements: 32GB+ RAM

  coding:
    codellama:
      strengths: Code-focused training
      use_case: Code generation, review
      variants: 7B, 13B, 34B

    deepseek_coder:
      strengths: Strong code performance
      use_case: Code tasks
      requirements: Varies by size

  specialized:
    phi_2:
      strengths: Very small, surprisingly capable
      use_case: Resource-constrained environments
      requirements: 4GB RAM

Selection Criteria

model_selection_criteria:
  by_use_case:
    prototyping:
      recommendation: Mistral 7B
      reason: Fast, good enough quality

    production_testing:
      recommendation: Llama 2 13B or Mixtral
      reason: Closer approximation of production-model quality

    code_generation:
      recommendation: Code Llama or DeepSeek
      reason: Code-specific training

  by_hardware:
    8gb_ram:
      options: [Mistral 7B, Phi-2]
      note: Quantized versions expand options

    16gb_ram:
      options: [Llama 2 13B, Code Llama 13B]

    32gb_plus:
      options: [Mixtral, larger models]
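
These criteria are easy to encode so model choice stays out of application code. A hypothetical helper (pick_local_model, its RAM thresholds, and the Ollama tags it returns mirror the tables above; adjust them to your hardware):

def pick_local_model(ram_gb: int, task: str = "general") -> str:
    """Rough heuristic mapping available RAM and task type to an Ollama model tag."""
    if task == "code":
        return "codellama:13b" if ram_gb >= 16 else "codellama:7b"
    if ram_gb >= 32:
        return "mixtral"
    if ram_gb >= 16:
        return "llama2:13b"
    if ram_gb >= 8:
        return "mistral"
    return "phi"  # Phi-2 for resource-constrained machines

# Usage
print(pick_local_model(16, task="code"))  # codellama:13b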

Development Workflow

Local-First Development

import os

import openai

class LLMClient:
    """Unified client that works with local and remote LLMs."""

    def __init__(self):
        self.use_local = os.getenv("USE_LOCAL_LLM", "true").lower() == "true"

        if self.use_local:
            self.client = openai.OpenAI(
                base_url="http://localhost:11434/v1",
                api_key="ollama"
            )
            self.model = os.getenv("LOCAL_MODEL", "mistral")
        else:
            self.client = openai.OpenAI()
            self.model = os.getenv("REMOTE_MODEL", "gpt-4-turbo")

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        return response.choices[0].message.content

# Development: USE_LOCAL_LLM=true
# Production: USE_LOCAL_LLM=false
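
With this in place, calling code stays identical across environments; a quick usage sketch:

# Same call works against Ollama locally and the hosted API in production
client = LLMClient()
print(client.generate("List three benefits of local LLM development."))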

Testing with Local Models

# Fast iteration with local models
def test_prompt_variations(sample_text: str):
    client = LLMClient()  # Uses the local model by default

    prompts = [
        "Summarize in 3 bullet points: {text}",
        "Key takeaways from this text: {text}",
        "TL;DR: {text}",
    ]

    for prompt_template in prompts:
        prompt = prompt_template.format(text=sample_text)
        result = client.generate(prompt)
        print(f"Template: {prompt_template[:30]}...")
        print(f"Result: {result[:100]}...")
        print("---")

# Run unlimited iterations at no cost
test_prompt_variations("Microservices split an application into small, independently deployable services.")
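
Because local calls are free, it is also cheap to put prompts under automated test. A minimal pytest sketch, assuming the LLMClient above is importable (the assertion is illustrative; real prompt tests usually check structure rather than exact wording):

import pytest

@pytest.fixture(scope="module")
def client():
    return LLMClient()  # local model, so the suite costs nothing to run

def test_summary_is_bulleted(client):
    text = "Ollama serves local models over an HTTP API on port 11434."
    result = client.generate(f"Summarize in 3 bullet points: {text}")
    # Loose structural check: the output contains at least one bullet-like line
    assert any(line.strip().startswith(("-", "*", "•")) for line in result.splitlines())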

Production Considerations

When to Use Local vs. Remote

deployment_decision:
  use_local_in_production:
    - Privacy requirements
    - Air-gapped environments
    - Cost-sensitive applications
    - Acceptable quality trade-off
    - Predictable workloads

  use_remote_in_production:
    - Need frontier capabilities
    - Variable workloads
    - No GPU infrastructure
    - Quality is critical

Hybrid Approach

import os

import openai

class HybridLLMClient:
    """Route to local or remote based on task complexity."""

    def __init__(self):
        # Local: Ollama's OpenAI-compatible endpoint; remote: the hosted API
        self.local_client = openai.OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
        )
        self.local_model = os.getenv("LOCAL_MODEL", "mistral")
        self.remote_client = openai.OpenAI()
        self.remote_model = os.getenv("REMOTE_MODEL", "gpt-4-turbo")

    def generate(self, prompt: str, complexity: str = "auto") -> str:
        use_local = complexity == "simple" or (
            complexity == "auto" and self._assess_complexity(prompt) < 0.5
        )
        if use_local:
            # Fast, free local model
            return self._chat(self.local_client, self.local_model, prompt)
        # More capable remote model
        return self._chat(self.remote_client, self.remote_model, prompt)

    def _chat(self, client: openai.OpenAI, model: str, prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def _assess_complexity(self, prompt: str) -> float:
        # Simple heuristics; swap in a small classifier for better routing
        indicators = [
            len(prompt) > 2000,
            "reasoning" in prompt.lower(),
            "complex" in prompt.lower(),
            "analyze" in prompt.lower(),
        ]
        return sum(indicators) / len(indicators)
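
Usage might then look like this, with an explicit hint overriding the heuristic when the caller already knows the answer:

# Routing sketch: a short lookup stays local, a long analytical prompt goes remote
hybrid = HybridLLMClient()
print(hybrid.generate("What port does Ollama listen on?", complexity="simple"))
print(hybrid.generate("Analyze the trade-offs of moving our RAG pipeline fully on-premises."))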

Key Takeaways

Local LLMs accelerate development: no rate limits, no per-token costs, and nothing leaves your machine. Develop against a local model behind an OpenAI-compatible interface, swap in a hosted model where quality demands it, and route the rest locally. Use them.