Embeddings convert text, images, and other data into dense vector representations that capture semantic meaning. They’re the foundation of semantic search, RAG systems, and recommendation engines. Understanding embeddings helps you build better AI applications.
Here’s a deep dive into embedding models.
What Are Embeddings?
The Concept
embeddings:
  what: Dense vector representations of data
  purpose: Capture semantic meaning in numerical form
  key_property: Similar items have similar vectors
  example:
    text: "The cat sat on the mat"
    embedding: [0.23, -0.45, 0.12, ..., 0.67]  # 384-1536 dimensions
    similar_text: "A feline rested on the rug"  # Similar vector
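To make that property concrete, here is a minimal sketch (assuming the sentence-transformers library and the all-MiniLM-L6-v2 model used later in this post) that embeds the example sentences and checks that the paraphrase lands close by while unrelated text does not:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dimensional embeddings

a = model.encode("The cat sat on the mat")
b = model.encode("A feline rested on the rug")
c = model.encode("Quarterly revenue grew by 12%")

# Paraphrases score much higher than unrelated text
print(cosine_similarity([a], [b])[0][0])  # high: same meaning, different words
print(cosine_similarity([a], [c])[0][0])  # low: unrelated meaning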
Why They Work
embedding_properties:
  semantic_similarity:
    - "king" - "man" + "woman" ≈ "queen"
    - Similar concepts cluster together
    - Relationships preserved in vector space
  dimensionality:
    - Each dimension captures some aspect of meaning
    - Combined dimensions represent complex semantics
    - More dimensions = more nuance (usually)
  learned_representations:
    - Trained on large text corpora
    - Learn patterns from context
    - Transfer to new domains
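The king/man/woman analogy comes from word-level vector models rather than sentence encoders. A minimal sketch of the arithmetic, assuming the gensim library and its downloadable word2vec-google-news-300 vectors:

import gensim.downloader as api

# Pre-trained word vectors; the analogy property is a word2vec-style result
wv = api.load('word2vec-google-news-300')  # large one-time download

# king - man + woman ≈ queen
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# expected top result: 'queen'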
Embedding Model Types
General Purpose
general_purpose_models:
  openai_embeddings:
    model: text-embedding-ada-002
    dimensions: 1536
    strengths: High quality, easy to use
    weaknesses: API cost, vendor lock-in
    cost: $0.0001/1K tokens
  sentence_transformers:
    model: all-MiniLM-L6-v2
    dimensions: 384
    strengths: Fast, free, local
    weaknesses: Lower quality than larger models
  cohere_embed:
    model: embed-english-v3.0
    dimensions: 1024
    strengths: Good quality; multilingual variant available (embed-multilingual-v3.0)
    weaknesses: API cost
  voyage_ai:
    model: voyage-large-2
    dimensions: 1024
    strengths: High-quality retrieval
    weaknesses: Newer, smaller ecosystem
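For comparison with the local models used elsewhere in this post, here is a hedged sketch of calling a hosted embedding API (assuming the openai Python client v1+ and an OPENAI_API_KEY in the environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["Kubernetes deployment tutorial", "Docker container basics"]
)

embeddings = [item.embedding for item in response.data]  # 1536-dimensional vectors
print(len(embeddings), len(embeddings[0]))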
Specialized Models
specialized_models:
  code_embeddings:
    models: [CodeBERT, StarEncoder, Voyage-code]
    use_case: Code search, similarity
  multilingual:
    models: [multilingual-e5-large, paraphrase-multilingual]
    use_case: Cross-lingual search
  domain_specific:
    examples:
      - BioBERT (biomedical)
      - SciBERT (scientific)
      - FinBERT (financial)
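A sketch of cross-lingual search with one of the multilingual families above, assuming the sentence-transformers model paraphrase-multilingual-MiniLM-L12-v2:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# The same question in English and German should embed close together
en = model.encode("How do I reset my password?")
de = model.encode("Wie setze ich mein Passwort zurück?")
unrelated = model.encode("The weather is nice today")

print(cosine_similarity([en], [de])[0][0])         # high: same meaning across languages
print(cosine_similarity([en], [unrelated])[0][0])  # low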
Choosing a Model
Selection Criteria
selection_criteria:
  quality:
    measure: Performance on benchmarks (MTEB)
    consideration: Higher quality = better retrieval
  latency:
    measure: Time to embed
    consideration: Real-time vs batch processing
  dimensions:
    measure: Vector size
    consideration: Storage cost, search speed
  cost:
    measure: API pricing or compute
    consideration: Budget constraints
  deployment:
    options: [API, self-hosted]
    consideration: Data privacy, latency requirements
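The dimensions criterion translates directly into storage and memory. A quick back-of-the-envelope sketch for raw float32 vectors (index overhead not included):

def index_size_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for float32 embeddings, ignoring index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1024**3

# 10 million chunks at different embedding sizes
for dims in (384, 768, 1536):
    print(f"{dims} dims: {index_size_gb(10_000_000, dims):.1f} GB")
# 384 dims:  ~14.3 GB
# 768 dims:  ~28.6 GB
# 1536 dims: ~57.2 GB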
Benchmarking
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Evaluate a model on standard retrieval benchmarks
model = SentenceTransformer('all-MiniLM-L6-v2')
evaluation = MTEB(tasks=["ArguAna", "NFCorpus"])
results = evaluation.run(model)

# Compare multiple models on the same tasks
models_to_compare = [
    'all-MiniLM-L6-v2',
    'all-mpnet-base-v2',
    'intfloat/e5-large-v2'
]

for model_name in models_to_compare:
    model = SentenceTransformer(model_name)
    results = evaluation.run(model)
    print(f"{model_name}: {results}")
Practical Comparison
import time
import numpy as np
from sentence_transformers import SentenceTransformer

def benchmark_model(model_name, texts):
    model = SentenceTransformer(model_name)

    # Measure embedding time
    start = time.time()
    embeddings = model.encode(texts)
    elapsed = time.time() - start

    return {
        'model': model_name,
        'dimensions': embeddings.shape[1],
        'time_per_text': elapsed / len(texts),
        'memory_mb': embeddings.nbytes / 1024 / 1024
    }

texts = load_sample_texts(1000)  # placeholder: load ~1,000 representative texts from your corpus
results = []
for model_name in ['all-MiniLM-L6-v2', 'all-mpnet-base-v2', 'intfloat/e5-large-v2']:
    results.append(benchmark_model(model_name, texts))

# Example results (hardware-dependent):
# all-MiniLM-L6-v2:   384 dims, ~0.5ms/text
# all-mpnet-base-v2:  768 dims, ~1.2ms/text
# e5-large-v2:       1024 dims, ~2.5ms/text
Working with Embeddings
Basic Usage
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed a single text
text = "How do I deploy a Kubernetes cluster?"
embedding = model.encode(text)

# Embed a batch
texts = [
    "Kubernetes deployment tutorial",
    "Docker container basics",
    "Cloud infrastructure guide"
]
embeddings = model.encode(texts)

# Calculate cosine similarity between a query and the batch
query_embedding = model.encode("k8s setup guide")
similarities = cosine_similarity([query_embedding], embeddings)[0]

for text, score in zip(texts, similarities):
    print(f"{score:.3f}: {text}")
Optimization Techniques
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Batch processing for efficiency
def embed_large_dataset(texts, model, batch_size=64):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = model.encode(
            batch,
            show_progress_bar=False,
            convert_to_numpy=True
        )
        embeddings.extend(batch_embeddings)
    return np.array(embeddings)

# GPU acceleration
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')

# Quantization to save memory
embeddings = model.encode(texts)
quantized = quantize_embeddings(
    embeddings,
    precision='ubinary'  # packs float32 down to 1 bit per dimension (~32x smaller)
)
Caching Embeddings
import hashlib
import redis
import numpy as np

class EmbeddingCache:
    def __init__(self, model, redis_client):
        self.model = model
        self.redis = redis_client
        self.prefix = "emb:"

    def _cache_key(self, text):
        hash_val = hashlib.sha256(text.encode()).hexdigest()[:16]
        return f"{self.prefix}{hash_val}"

    def get_embedding(self, text):
        key = self._cache_key(text)

        # Try the cache first
        cached = self.redis.get(key)
        if cached:
            return np.frombuffer(cached, dtype=np.float32)

        # Compute and cache on a miss
        embedding = self.model.encode(text)
        self.redis.setex(
            key,
            86400,  # 24 hour TTL
            embedding.tobytes()
        )
        return embedding

    def get_embeddings(self, texts):
        # Check the cache for all texts in one round trip
        keys = [self._cache_key(t) for t in texts]
        cached = self.redis.mget(keys)

        results = []
        to_compute = []
        to_compute_indices = []

        for i, (text, cached_val) in enumerate(zip(texts, cached)):
            if cached_val:
                results.append(np.frombuffer(cached_val, dtype=np.float32))
            else:
                results.append(None)
                to_compute.append(text)
                to_compute_indices.append(i)

        # Batch-compute the misses
        if to_compute:
            new_embeddings = self.model.encode(to_compute)

            # Cache the new embeddings and fill in the results
            pipe = self.redis.pipeline()
            for idx, emb in zip(to_compute_indices, new_embeddings):
                results[idx] = emb
                key = self._cache_key(texts[idx])
                pipe.setex(key, 86400, emb.tobytes())
            pipe.execute()

        return np.array(results)
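A minimal usage sketch, assuming a Redis server on localhost and the same all-MiniLM-L6-v2 model:

model = SentenceTransformer('all-MiniLM-L6-v2')
cache = EmbeddingCache(model, redis.Redis(host='localhost', port=6379))

# First call computes and stores; repeated calls hit Redis instead of the model
emb = cache.get_embedding("How do I deploy a Kubernetes cluster?")
embs = cache.get_embeddings(["k8s setup guide", "Docker container basics"])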
Advanced Topics
Fine-Tuning Embeddings
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare training data (pairs with similarity scores)
train_examples = [
    InputExample(texts=["kubernetes deployment", "k8s deploy"], label=0.9),
    InputExample(texts=["kubernetes deployment", "python tutorial"], label=0.1),
    # ... more examples
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Define loss
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./fine-tuned-model'
)
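After training, the saved directory loads like any other model (a two-line sketch, assuming the output_path above):

fine_tuned = SentenceTransformer('./fine-tuned-model')
embedding = fine_tuned.encode("kubernetes deployment")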
Multi-Vector Representations
# ColBERT-style late interaction (sketch built on token-level embeddings)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class MultiVectorEmbedding:
    def __init__(self, model):
        self.model = model  # a SentenceTransformer instance

    def _encode_tokens(self, text):
        # output_value='token_embeddings' asks sentence-transformers for one vector per token
        token_embeddings = self.model.encode(text, output_value='token_embeddings')
        return token_embeddings.cpu().numpy()

    def encode_query(self, query):
        """Encode a query into multiple vectors (one per token)."""
        return self._encode_tokens(query)

    def encode_document(self, document):
        """Encode a document into multiple vectors."""
        return self._encode_tokens(document)

    def score(self, query_vectors, doc_vectors):
        """MaxSim scoring: each query token is matched to its most similar document token."""
        sims = cosine_similarity(query_vectors, doc_vectors)
        return sims.max(axis=1).sum()
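A short usage sketch, assuming the SentenceTransformer model from earlier; the higher MaxSim score should go to the document that covers more of the query's tokens:

mv = MultiVectorEmbedding(SentenceTransformer('all-MiniLM-L6-v2'))

query_vecs = mv.encode_query("kubernetes cluster deployment")
doc_a = mv.encode_document("A step-by-step guide to deploying a Kubernetes cluster")
doc_b = mv.encode_document("An introduction to Python list comprehensions")

print(mv.score(query_vecs, doc_a))  # higher: most query tokens find a close match
print(mv.score(query_vecs, doc_b))  # lower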
Key Takeaways
- Embeddings capture semantic meaning in vector space
- Similar concepts have similar vectors
- Choose models based on quality, speed, cost, and dimensions
- Benchmark on your specific use case
- Cache embeddings to reduce compute costs
- Batch process for efficiency
- Consider fine-tuning for domain-specific tasks
- Smaller models often sufficient for many use cases
- GPU acceleration significantly improves throughput
- Monitor embedding quality in production
Embeddings are the foundation of modern AI applications. Understand them deeply.