Local LLMs for Development: Practical Guide

January 22, 2024

Running LLMs locally has become practical with tools like Ollama and open-weight models such as Llama 2 and Mistral. Local LLMs enable faster development iteration, complete privacy, and zero API costs. For many use cases, they’re now good enough.

Here’s a practical guide to local LLM development.

Why Local LLMs

Benefits

local_llm_benefits:
  development_speed:
    - No rate limits
    - No API latency
    - Fast iteration
    - Offline development

  privacy:
    - Data never leaves machine
    - No third-party logging
    - Sensitive data safe
    - Compliance friendly

  cost:
    - Zero marginal cost
    - Unlimited testing
    - No surprise bills
    - Predictable expenses

  control:
    - Model choice
    - No terms of service changes
    - Reproducible results
    - Custom modifications

Trade-offs

local_llm_tradeoffs:
  capability:
    - Smaller than hosted frontier models (GPT-4, Claude)
    - May not match frontier quality on complex tasks
    - Context windows vary

  infrastructure:
    - Requires GPU (for reasonable speed)
    - Uses system resources
    - Initial setup complexity

  maintenance:
    - Manual updates
    - No hosted improvements
    - You handle operations

Getting Started with Ollama

Installation

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Start the service
ollama serve
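
Once the service is running, you can sanity-check it by querying the /api/tags endpoint, which lists locally installed models; a minimal check in Python:

import requests

# Ask the local Ollama server which models are installed (fails if it isn't running)
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
print([m["name"] for m in resp.json()["models"]])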

Running Models

# Download and run Mistral
ollama run mistral

# Run Llama 2
ollama run llama2

# Run with specific size
ollama run llama2:13b

# Run Code Llama for coding
ollama run codellama

API Usage

import requests

def query_local_llm(prompt: str, model: str = "mistral") -> str:
    """Send a prompt to the local Ollama server and return the full response."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        },
        timeout=120,  # local generation can be slow, especially on CPU
    )
    response.raise_for_status()
    return response.json()["response"]

# Usage
result = query_local_llm("Explain microservices in 3 sentences.")
print(result)
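
The same endpoint can also stream: with "stream": true, Ollama returns newline-delimited JSON, one chunk per line, ending with a "done" object. A sketch of consuming that stream (stream_local_llm is an illustrative helper name, not part of any library):

import json

import requests

def stream_local_llm(prompt: str, model: str = "mistral"):
    """Yield response chunks from the local Ollama server as they arrive."""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk["response"]

# Usage: print tokens as they are generated
for token in stream_local_llm("Explain microservices in 3 sentences."):
    print(token, end="", flush=True)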

OpenAI-Compatible API

# Ollama provides an OpenAI-compatible endpoint at /v1
import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Not used but required
)

response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "user", "content": "Explain RAG architecture."}
    ]
)
print(response.choices[0].message.content)

Model Selection

local_models_2024:
  general_purpose:
    mistral_7b:
      strengths: Fast, good quality for size
      use_case: General tasks, chat
      requirements: 8GB RAM

    llama2_13b:
      strengths: Better reasoning than 7B
      use_case: More complex tasks
      requirements: 16GB RAM

    mixtral_8x7b:
      strengths: MoE architecture, strong quality
      use_case: Complex tasks, coding
      requirements: 32GB+ RAM

  coding:
    codellama:
      strengths: Code-focused training
      use_case: Code generation, review
      variants: 7B, 13B, 34B

    deepseek_coder:
      strengths: Strong code performance
      use_case: Code tasks
      requirements: Varies by size

  specialized:
    phi_2:
      strengths: Very small, surprisingly capable
      use_case: Resource-constrained environments
      requirements: 4GB RAM

Selection Criteria

model_selection_criteria:
  by_use_case:
    prototyping:
      recommendation: Mistral 7B
      reason: Fast, good enough quality

    production_testing:
      recommendation: Llama 2 13B or Mixtral
      reason: Closer approximation of production-model quality

    code_generation:
      recommendation: Code Llama or DeepSeek
      reason: Code-specific training

  by_hardware:
    8gb_ram:
      options: [Mistral 7B, Phi-2]
      note: Quantized versions expand options

    16gb_ram:
      options: [Llama 2 13B, Code Llama 13B]

    32gb_plus:
      options: [Mixtral, larger models]
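
These criteria are easy to encode so model choice stays out of application code. A hypothetical helper (pick_local_model, its RAM thresholds, and the Ollama tags it returns mirror the tables above; adjust them to your hardware):

def pick_local_model(ram_gb: int, task: str = "general") -> str:
    """Rough heuristic mapping available RAM and task type to an Ollama model tag."""
    if task == "code":
        return "codellama:13b" if ram_gb >= 16 else "codellama:7b"
    if ram_gb >= 32:
        return "mixtral"
    if ram_gb >= 16:
        return "llama2:13b"
    if ram_gb >= 8:
        return "mistral"
    return "phi"  # Phi-2 for resource-constrained machines

# Usage
print(pick_local_model(16, task="code"))  # codellama:13b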

Development Workflow

Local-First Development

import os

import openai

class LLMClient:
    """Unified client that works with local and remote LLMs."""

    def __init__(self):
        self.use_local = os.getenv("USE_LOCAL_LLM", "true").lower() == "true"

        if self.use_local:
            self.client = openai.OpenAI(
                base_url="http://localhost:11434/v1",
                api_key="ollama"
            )
            self.model = os.getenv("LOCAL_MODEL", "mistral")
        else:
            self.client = openai.OpenAI()
            self.model = os.getenv("REMOTE_MODEL", "gpt-4-turbo")

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            **kwargs
        )
        return response.choices[0].message.content

# Development: USE_LOCAL_LLM=true
# Production: USE_LOCAL_LLM=false
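
With this in place, calling code stays identical across environments; a quick usage sketch:

# Same call works against Ollama locally and the hosted API in production
client = LLMClient()
print(client.generate("List three benefits of local LLM development."))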

Testing with Local Models

# Fast iteration with local models
def test_prompt_variations(sample_text: str):
    client = LLMClient()  # Uses the local model by default

    prompts = [
        "Summarize in 3 bullet points: {text}",
        "Key takeaways from this text: {text}",
        "TL;DR: {text}",
    ]

    for prompt_template in prompts:
        prompt = prompt_template.format(text=sample_text)
        result = client.generate(prompt)
        print(f"Template: {prompt_template[:30]}...")
        print(f"Result: {result[:100]}...")
        print("---")

# Run unlimited iterations at no cost
test_prompt_variations("Microservices split an application into small, independently deployable services.")
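
Because local calls are free, it is also cheap to put prompts under automated test. A minimal pytest sketch, assuming the LLMClient above is importable (the assertion is illustrative; real prompt tests usually check structure rather than exact wording):

import pytest

@pytest.fixture(scope="module")
def client():
    return LLMClient()  # local model, so the suite costs nothing to run

def test_summary_is_bulleted(client):
    text = "Ollama serves local models over an HTTP API on port 11434."
    result = client.generate(f"Summarize in 3 bullet points: {text}")
    # Loose structural check: the output contains at least one bullet-like line
    assert any(line.strip().startswith(("-", "*", "•")) for line in result.splitlines())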

Production Considerations

When to Use Local vs. Remote

deployment_decision:
  use_local_in_production:
    - Privacy requirements
    - Air-gapped environments
    - Cost-sensitive applications
    - Acceptable quality trade-off
    - Predictable workloads

  use_remote_in_production:
    - Need frontier capabilities
    - Variable workloads
    - No GPU infrastructure
    - Quality is critical

Hybrid Approach

import os

import openai

class HybridLLMClient:
    """Route to local or remote based on task complexity."""

    def __init__(self):
        # Local: Ollama's OpenAI-compatible endpoint; remote: the hosted API
        self.local_client = openai.OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
        )
        self.local_model = os.getenv("LOCAL_MODEL", "mistral")
        self.remote_client = openai.OpenAI()
        self.remote_model = os.getenv("REMOTE_MODEL", "gpt-4-turbo")

    def generate(self, prompt: str, complexity: str = "auto") -> str:
        use_local = complexity == "simple" or (
            complexity == "auto" and self._assess_complexity(prompt) < 0.5
        )
        if use_local:
            # Fast, free local model
            return self._chat(self.local_client, self.local_model, prompt)
        # More capable remote model
        return self._chat(self.remote_client, self.remote_model, prompt)

    def _chat(self, client: openai.OpenAI, model: str, prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def _assess_complexity(self, prompt: str) -> float:
        # Simple heuristics; swap in a small classifier for better routing
        indicators = [
            len(prompt) > 2000,
            "reasoning" in prompt.lower(),
            "complex" in prompt.lower(),
            "analyze" in prompt.lower(),
        ]
        return sum(indicators) / len(indicators)
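
Usage might then look like this, with an explicit hint overriding the heuristic when the caller already knows the answer:

# Routing sketch: a short lookup stays local, a long analytical prompt goes remote
hybrid = HybridLLMClient()
print(hybrid.generate("What port does Ollama listen on?", complexity="simple"))
print(hybrid.generate("Analyze the trade-offs of moving our RAG pipeline fully on-premises."))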

Key Takeaways

Local LLMs accelerate development: no rate limits, no per-token costs, and nothing leaves your machine. Develop against a local model behind an OpenAI-compatible interface, swap in a hosted model where quality demands it, and route the rest locally. Use them.