FAISS Index Types and HNSW Algorithm Explained
IndexFlatL2 - Euclidean Distance
For vectors: $$\mathbf{x} = (x_1, x_2, \ldots, x_d)$$ $$\mathbf{y} = (y_1, y_2, \ldots, y_d)$$
The squared Euclidean distance (L2 distance squared) is defined as:
$$D^2(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2 = \sum_{i=1}^{d} (x_i - y_i)^2$$
The Euclidean distance is:
$$D(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$
IndexFlatL2 computes this distance directly against every indexed vector; smaller distance values correspond to higher similarity.
Note: In practice, the square root is often omitted (only the squared distance is computed), since it is cheaper to evaluate and leaves the ranking unchanged; FAISS's IndexFlatL2 returns squared distances for this reason.
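Below is a minimal usage sketch of exact L2 search with FAISS; the dimension d = 64 and the random data are illustrative assumptions:
import faiss
import numpy as np

d = 64                                                # vector dimensionality (illustrative)
xb = np.random.random((1000, d)).astype('float32')    # database vectors
xq = np.random.random((5, d)).astype('float32')       # query vectors

index = faiss.IndexFlatL2(d)    # exact (brute-force) L2 index
index.add(xb)                   # store vectors; Flat indexes need no training
D, I = index.search(xq, 4)      # D: squared L2 distances, I: indices of the 4 nearest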
IndexFlatIP - Inner Product
IndexFlatIP uses the inner product to measure vector similarity and is commonly used to implement cosine similarity: when the vectors are normalized to unit length, the inner product is equivalent to cosine similarity.
$$S(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{d} x_i \cdot y_i$$
$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$
When $\|\mathbf{x}\| = \|\mathbf{y}\| = 1$:
$$\cos(\theta) = \mathbf{x} \cdot \mathbf{y}$$
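A minimal sketch of cosine-similarity search with IndexFlatIP; faiss.normalize_L2 normalizes rows in place, and the data shapes are illustrative assumptions:
import faiss
import numpy as np

d = 64
xb = np.random.random((1000, d)).astype('float32')
xq = np.random.random((5, d)).astype('float32')

faiss.normalize_L2(xb)          # unit-normalize database vectors in place
faiss.normalize_L2(xq)          # unit-normalize queries as well

index = faiss.IndexFlatIP(d)    # exact inner-product index
index.add(xb)
S, I = index.search(xq, 4)      # with normalized inputs, S holds cosine similarities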
HNSW Algorithm: Principles and Implementation
Core Principles
HNSW (Hierarchical Navigable Small World) is an efficient Approximate Nearest Neighbor (ANN) algorithm designed for fast similarity search in large-scale high-dimensional datasets. It combines concepts from Skip Lists and Navigable Small World (NSW) graphs, using a hierarchical structure to accelerate search operations.
Key characteristics:
- Multiple Layers: HNSW extends NSW with a multi-layer structure similar to skip lists
- Higher Layers: Contain fewer nodes with sparse connections, enabling rapid navigation
- Bottom Layer (Layer 0): Contains all data points with dense connections for precise search
- Search Process: Traverses from top to bottom, with coarse localization at higher layers and progressive refinement at lower layers to reduce computational cost
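In FAISS this structure is exposed as IndexHNSWFlat; a minimal sketch, with parameter values that are illustrative rather than tuned recommendations:
import faiss
import numpy as np

d = 64
xb = np.random.random((10000, d)).astype('float32')
xq = np.random.random((5, d)).astype('float32')

index = faiss.IndexHNSWFlat(d, 32)   # 32 = M, the number of connections per node
index.hnsw.efConstruction = 200      # candidate-list size while building the graph
index.add(xb)

index.hnsw.efSearch = 64             # candidate-list size at query time
D, I = index.search(xq, 10)          # approximate 10-nearest-neighbor search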
Implementation
Index Construction
Random Layer Assignment:
import random

def assign_layer(max_layers, decay_factor=0.5):
    # Each node is promoted to the next layer with probability
    # decay_factor, yielding a geometric layer distribution.
    layer = 0
    while random.random() < decay_factor and layer < max_layers:
        layer += 1
    return layer
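For reference, the original HNSW paper draws the level in closed form as $\lfloor -\ln(\mathrm{unif}(0,1)) \cdot m_L \rfloor$, which yields the same geometric distribution as the loop above; a sketch of the equivalence (the function name is mine):
import math
import random

def assign_layer_closed_form(max_layers, decay_factor=0.5):
    # m_L = 1 / ln(1/decay_factor), so that P(layer >= k) = decay_factor ** k,
    # matching the promotion loop above.
    m_l = 1.0 / math.log(1.0 / decay_factor)
    u = 1.0 - random.random()        # uniform in (0, 1], avoids log(0)
    return min(int(-math.log(u) * m_l), max_layers)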
Incremental NSW Construction:
import numpy as np

class NSWGraph:
    def __init__(self, M):
        self.graph = {}  # adjacency list: node id -> list of neighbor ids
        self.M = M       # max number of connections for a newly inserted node

    def insert(self, point, vectors):
        # Link the new point to its M nearest existing nodes. A brute-force
        # scan is used here for clarity; real NSW/HNSW locates these
        # neighbors with a greedy search over the graph instead.
        nearest = sorted(
            self.graph,
            key=lambda n: np.linalg.norm(vectors[point] - vectors[n])
        )[: self.M]
        self.graph[point] = list(nearest)
        for neighbor in nearest:
            self.graph[neighbor].append(point)  # keep edges bidirectional
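A brief usage sketch with random data (sizes and seed are arbitrary):
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.random((100, 8)).astype(np.float32)

layer0 = NSWGraph(M=5)
for node_id in range(len(vectors)):
    layer0.insert(node_id, vectors)   # build the graph one point at a time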
Query Search
Layer-wise Traversal:
def search_hnsw(query, entry_point, layers, vectors, k=10, ef=10):
    # layers[i] is the NSWGraph for layer i; uses the greedy_search
    # helper defined below.
    current = entry_point
    # Descend from the top layer, keeping one entry point per layer
    for layer in range(len(layers) - 1, 0, -1):
        candidates = greedy_search(layers[layer].graph, query, current, vectors, ef=1)
        current = candidates[0][0]   # candidates are (node, distance) pairs
    # Fine-grained search at the bottom layer with a wider candidate list
    results = greedy_search(layers[0].graph, query, current, vectors, ef=max(ef, k))
    return results[:k]
Greedy Search:
import heapq
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def greedy_search(graph, query, entry, vectors, ef=1):
    visited = set()
    # Min-heap keyed on distance: entries are (distance, node)
    candidates = [(euclidean_distance(query, vectors[entry]), entry)]
    results = []
    while candidates:
        dist, current = heapq.heappop(candidates)
        if current in visited:
            continue
        visited.add(current)
        results.append((current, dist))
        # Expand the unvisited neighbors of the closest unprocessed node
        for neighbor in graph[current]:
            if neighbor not in visited:
                neighbor_dist = euclidean_distance(query, vectors[neighbor])
                heapq.heappush(candidates, (neighbor_dist, neighbor))
    # Simplification: this exhausts the reachable component. A production
    # implementation stops once the nearest candidate is farther than the
    # worst of the current ef best results.
    return sorted(results, key=lambda x: x[1])[:ef]
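Putting the pieces together, a hedged end-to-end sketch built from the helpers above (entry-point handling is simplified; a full implementation tracks the global entry point as nodes are inserted):
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((200, 8)).astype(np.float32)

max_layers = 3
layers = [NSWGraph(M=5) for _ in range(max_layers + 1)]
for node_id in range(len(vectors)):
    top = assign_layer(max_layers)
    for layer in range(top + 1):      # a node appears on every layer from 0 up to top
        layers[layer].insert(node_id, vectors)

query = rng.random(8).astype(np.float32)
entry_point = next(iter(layers[-1].graph))   # any top-layer node; assumes it is non-empty
print(search_hnsw(query, entry_point, layers, vectors, k=5))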
Flat Index Characteristics
Exact Search:
- Computes the exact distance (Euclidean, inner product, etc.) between the query vector and every indexed vector
- Guarantees globally optimal results (the true nearest neighbors)
- Ideal for scenarios requiring maximum accuracy
Trade-offs:
- Higher computational cost compared to approximate methods
- Does not scale well to massive datasets, since query cost grows linearly with the number of indexed vectors
- Best suited for small to medium-sized collections where precision is critical
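To quantify the trade-off, a hedged sketch that measures the approximate index's recall against exact Flat results (data sizes and parameters are illustrative):
import faiss
import numpy as np

d, n, k = 64, 10000, 10
xb = np.random.random((n, d)).astype('float32')
xq = np.random.random((100, d)).astype('float32')

flat = faiss.IndexFlatL2(d)          # exact baseline
flat.add(xb)
hnsw = faiss.IndexHNSWFlat(d, 32)    # approximate index
hnsw.add(xb)

_, I_true = flat.search(xq, k)
_, I_approx = hnsw.search(xq, k)

# recall@k: average fraction of true nearest neighbors that HNSW retrieves
recall = np.mean([len(set(t) & set(a)) / k for t, a in zip(I_true, I_approx)])
print(f"recall@{k}: {recall:.3f}")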