FAISS Index Types and HNSW Algorithm Explained
IndexFlatL2 - Euclidean Distance
For vectors: $$\mathbf{x} = (x_1, x_2, \ldots, x_d)$$ $$\mathbf{y} = (y_1, y_2, \ldots, y_d)$$
The squared Euclidean distance (L2 distance squared) is defined as:
$$D^2(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2 = \sum_{i=1}^{d} (x_i - y_i)^2$$
The Euclidean distance is:
$$D(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$
IndexFlatL2 computes this distance directly against every indexed vector; smaller distance values correspond to higher similarity.
Note: In practice, the square root is often omitted (only the squared distance is computed), since it is cheaper to evaluate and leaves the ranking unchanged; FAISS's IndexFlatL2 returns squared distances for this reason.
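Below is a minimal usage sketch of exact L2 search with FAISS; the dimension d = 64 and the random data are illustrative assumptions:
import faiss
import numpy as np

d = 64                                                # vector dimensionality (illustrative)
xb = np.random.random((1000, d)).astype('float32')    # database vectors
xq = np.random.random((5, d)).astype('float32')       # query vectors

index = faiss.IndexFlatL2(d)    # exact (brute-force) L2 index
index.add(xb)                   # store vectors; Flat indexes need no training
D, I = index.search(xq, 4)      # D: squared L2 distances, I: indices of the 4 nearest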
IndexFlatIP - Inner Product
IndexFlatIP uses the inner product to measure vector similarity and is commonly used to implement cosine similarity: when the vectors are normalized to unit length, the inner product is equivalent to cosine similarity.
$$S(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{d} x_i \cdot y_i$$
$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$
When $\|\mathbf{x}\| = \|\mathbf{y}\| = 1$:
$$\cos(\theta) = \mathbf{x} \cdot \mathbf{y}$$
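A minimal sketch of cosine-similarity search with IndexFlatIP; faiss.normalize_L2 normalizes rows in place, and the data shapes are illustrative assumptions:
import faiss
import numpy as np

d = 64
xb = np.random.random((1000, d)).astype('float32')
xq = np.random.random((5, d)).astype('float32')

faiss.normalize_L2(xb)          # unit-normalize database vectors in place
faiss.normalize_L2(xq)          # unit-normalize queries as well

index = faiss.IndexFlatIP(d)    # exact inner-product index
index.add(xb)
S, I = index.search(xq, 4)      # with normalized inputs, S holds cosine similarities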
HNSW Algorithm: Principles and Implementation
Core Principles
HNSW (Hierarchical Navigable Small World) is an efficient Approximate Nearest Neighbor (ANN) algorithm designed for fast similarity search in large-scale high-dimensional datasets. It combines concepts from Skip Lists and Navigable Small World (NSW) graphs, using a hierarchical structure to accelerate search operations.
Key characteristics:
- Multiple Layers: HNSW extends NSW with a multi-layer structure similar to skip lists
- Higher Layers: Contain fewer nodes with sparse connections, enabling rapid navigation
- Bottom Layer (Layer 0): Contains all data points with dense connections for precise search
- Search Process: Traverses from top to bottom, with coarse localization at higher layers and progressive refinement at lower layers to reduce computational cost
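In FAISS this structure is exposed as IndexHNSWFlat; a minimal sketch, with parameter values that are illustrative rather than tuned recommendations:
import faiss
import numpy as np

d = 64
xb = np.random.random((10000, d)).astype('float32')
xq = np.random.random((5, d)).astype('float32')

index = faiss.IndexHNSWFlat(d, 32)   # 32 = M, the number of connections per node
index.hnsw.efConstruction = 200      # candidate-list size while building the graph
index.add(xb)

index.hnsw.efSearch = 64             # candidate-list size at query time
D, I = index.search(xq, 10)          # approximate 10-nearest-neighbor search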
Implementation
Index Construction
Random Layer Assignment:
import random

def assign_layer(max_layers, decay_factor=0.5):
    # Each node is promoted to the next layer with probability
    # decay_factor, yielding a geometric layer distribution.
    layer = 0
    while random.random() < decay_factor and layer < max_layers:
        layer += 1
    return layer
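For reference, the original HNSW paper draws the level in closed form as $\lfloor -\ln(\mathrm{unif}(0,1)) \cdot m_L \rfloor$, which yields the same geometric distribution as the loop above; a sketch of the equivalence (the function name is mine):
import math
import random

def assign_layer_closed_form(max_layers, decay_factor=0.5):
    # m_L = 1 / ln(1/decay_factor), so that P(layer >= k) = decay_factor ** k,
    # matching the promotion loop above.
    m_l = 1.0 / math.log(1.0 / decay_factor)
    u = 1.0 - random.random()        # uniform in (0, 1], avoids log(0)
    return min(int(-math.log(u) * m_l), max_layers)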
Incremental NSW Construction:
import numpy as np

class NSWGraph:
    def __init__(self, M):
        self.graph = {}  # adjacency list: node id -> list of neighbor ids
        self.M = M       # max number of connections for a newly inserted node

    def insert(self, point, vectors):
        # Link the new point to its M nearest existing nodes. A brute-force
        # scan is used here for clarity; real NSW/HNSW locates these
        # neighbors with a greedy search over the graph instead.
        nearest = sorted(
            self.graph,
            key=lambda n: np.linalg.norm(vectors[point] - vectors[n])
        )[: self.M]
        self.graph[point] = list(nearest)
        for neighbor in nearest:
            self.graph[neighbor].append(point)  # keep edges bidirectional
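A brief usage sketch with random data (sizes and seed are arbitrary):
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.random((100, 8)).astype(np.float32)

layer0 = NSWGraph(M=5)
for node_id in range(len(vectors)):
    layer0.insert(node_id, vectors)   # build the graph one point at a time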
Query Search
Layer-wise Traversal:
def search_hnsw(query, entry_point, layers, vectors, k=10, ef=10):
    # layers[i] is the NSWGraph for layer i; uses the greedy_search
    # helper defined below.
    current = entry_point
    # Descend from the top layer, keeping one entry point per layer
    for layer in range(len(layers) - 1, 0, -1):
        candidates = greedy_search(layers[layer].graph, query, current, vectors, ef=1)
        current = candidates[0][0]   # candidates are (node, distance) pairs
    # Fine-grained search at the bottom layer with a wider candidate list
    results = greedy_search(layers[0].graph, query, current, vectors, ef=max(ef, k))
    return results[:k]
Greedy Search:
import heapq
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def greedy_search(graph, query, entry, vectors, ef=1):
    visited = set()
    # Min-heap keyed on distance: entries are (distance, node)
    candidates = [(euclidean_distance(query, vectors[entry]), entry)]
    results = []
    while candidates:
        dist, current = heapq.heappop(candidates)
        if current in visited:
            continue
        visited.add(current)
        results.append((current, dist))
        # Expand the unvisited neighbors of the closest unprocessed node
        for neighbor in graph[current]:
            if neighbor not in visited:
                neighbor_dist = euclidean_distance(query, vectors[neighbor])
                heapq.heappush(candidates, (neighbor_dist, neighbor))
    # Simplification: this exhausts the reachable component. A production
    # implementation stops once the nearest candidate is farther than the
    # worst of the current ef best results.
    return sorted(results, key=lambda x: x[1])[:ef]
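Putting the pieces together, a hedged end-to-end sketch built from the helpers above (entry-point handling is simplified; a full implementation tracks the global entry point as nodes are inserted):
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((200, 8)).astype(np.float32)

max_layers = 3
layers = [NSWGraph(M=5) for _ in range(max_layers + 1)]
for node_id in range(len(vectors)):
    top = assign_layer(max_layers)
    for layer in range(top + 1):      # a node appears on every layer from 0 up to top
        layers[layer].insert(node_id, vectors)

query = rng.random(8).astype(np.float32)
entry_point = next(iter(layers[-1].graph))   # any top-layer node; assumes it is non-empty
print(search_hnsw(query, entry_point, layers, vectors, k=5))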
Flat Index Characteristics
Exact Search:
- Computes the exact distance (Euclidean, inner product, etc.) between the query vector and every indexed vector
- Guarantees globally optimal results (the true nearest neighbors)
- Ideal for scenarios requiring maximum accuracy
Trade-offs:
- Higher computational cost compared to approximate methods
- Does not scale well to massive datasets, since query cost grows linearly with the number of indexed vectors
- Best suited for small to medium-sized collections where precision is critical
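To quantify the trade-off, a hedged sketch that measures the approximate index's recall against exact Flat results (data sizes and parameters are illustrative):
import faiss
import numpy as np

d, n, k = 64, 10000, 10
xb = np.random.random((n, d)).astype('float32')
xq = np.random.random((100, d)).astype('float32')

flat = faiss.IndexFlatL2(d)          # exact baseline
flat.add(xb)
hnsw = faiss.IndexHNSWFlat(d, 32)    # approximate index
hnsw.add(xb)

_, I_true = flat.search(xq, k)
_, I_approx = hnsw.search(xq, k)

# recall@k: average fraction of true nearest neighbors that HNSW retrieves
recall = np.mean([len(set(t) & set(a)) / k for t, a in zip(I_true, I_approx)])
print(f"recall@{k}: {recall:.3f}")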