
FAISS Index Types and HNSW Algorithm Explained


IndexFlatL2 - Euclidean Distance

For vectors: $$\mathbf{x} = (x_1, x_2, ..., x_d)$$ $$\mathbf{y} = (y_1, y_2, ..., y_d)$$

The squared Euclidean distance (L2 distance squared) is defined as:

$$D(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2 = \sum_{i=1}^{d} (x_i - y_i)^2$$

The Euclidean distance is:

$$D(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$

IndexFlatL2 computes this distance directly; smaller distance values indicate higher similarity.

Note: In practice, the square root operation is sometimes omitted (calculating only squared distance) because the ranking results remain consistent and computation is more efficient.
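
For reference, here is a minimal FAISS sketch using IndexFlatL2 (the random float32 data is purely illustrative). Note that FAISS returns squared L2 distances for this index, matching the shortcut described above:

import numpy as np
import faiss

d = 64                                          # vector dimensionality
xb = np.random.rand(1000, d).astype("float32")  # database vectors
xq = np.random.rand(5, d).astype("float32")     # query vectors

index = faiss.IndexFlatL2(d)          # exact search over squared L2 distance
index.add(xb)
distances, ids = index.search(xq, 3)  # 3 nearest neighbors per query
print(distances[0], ids[0])           # distances are squared L2 values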

IndexFlatIP - Inner Product

IndexFlatIP uses the inner product to measure vector similarity and is commonly used to implement cosine similarity: when the vectors are normalized, the inner product is equivalent to cosine similarity.

$$S(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{d} x_i \cdot y_i$$

$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$

When $\|\mathbf{x}\| = \|\mathbf{y}\| = 1$:

$$\cos(\theta) = \mathbf{x} \cdot \mathbf{y}$$
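
A sketch of this normalized case with IndexFlatIP; faiss.normalize_L2 rescales rows to unit length in place, so the returned inner products are cosine similarities (the data is again illustrative):

import numpy as np
import faiss

d = 64
xb = np.random.rand(1000, d).astype("float32")
xq = np.random.rand(5, d).astype("float32")

faiss.normalize_L2(xb)  # in-place row-wise L2 normalization
faiss.normalize_L2(xq)

index = faiss.IndexFlatIP(d)
index.add(xb)
scores, ids = index.search(xq, 3)  # scores are cosine similarities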

HNSW Algorithm: Principles and Implementation

Core Principles

HNSW (Hierarchical Navigable Small World) is an efficient Approximate Nearest Neighbor (ANN) algorithm designed for fast similarity search in large-scale high-dimensional datasets. It combines concepts from Skip Lists and Navigable Small World (NSW) graphs, using a hierarchical structure to accelerate search operations.

Key characteristics:

  • Multiple Layers: HNSW extends NSW with a multi-layer structure similar to skip lists
  • Higher Layers: Contain fewer nodes with sparse connections, enabling rapid navigation
  • Bottom Layer (Layer 0): Contains all data points with dense connections for precise search
  • Search Process: Traverses from top to bottom—coarse localization at higher layers, progressive refinement downward to reduce computational cost
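
In FAISS this algorithm is available as IndexHNSWFlat. A minimal sketch of building and querying one (the M, efConstruction, and efSearch values are illustrative, not tuned):

import numpy as np
import faiss

d = 64
xb = np.random.rand(100_000, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)   # M = 32 links per node
index.hnsw.efConstruction = 200      # candidate list size while building
index.add(xb)

index.hnsw.efSearch = 64             # candidate list size while querying
distances, ids = index.search(xb[:5], 10)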

Implementation

Index Construction

Random Layer Assignment:

import random

def assign_layer(max_layers, decay_factor=0.5):
    # Flip a biased coin repeatedly; each success promotes the node one layer
    layer = 0
    while random.random() < decay_factor and layer < max_layers:
        layer += 1
    return layer
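
With decay_factor=0.5 this produces a geometric distribution: roughly half of the nodes stay at layer 0, a quarter reach layer 1, and so on. A quick empirical check:

from collections import Counter

counts = Counter(assign_layer(max_layers=4) for _ in range(100_000))
print(sorted(counts.items()))  # expect roughly 50%, 25%, 12.5%, ...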

Incremental NSW Construction:

class NSWGraph:
    def __init__(self, M):
        self.graph = {}  # node id -> list of connected node ids
        self.M = M       # number of connections made per insert

    def insert(self, point, vectors):
        # Connect the new point to its M nearest existing nodes
        # (brute force here for clarity; full NSW routes with a greedy
        # graph search instead; euclidean_distance is defined below)
        neighbors = sorted(
            self.graph,
            key=lambda n: euclidean_distance(vectors[point], vectors[n])
        )[:self.M]
        self.graph[point] = list(neighbors)
        for neighbor in neighbors:
            self.graph[neighbor].append(point)  # links are bidirectional

Query Search

Layer-wise Traversal:

def search_hnsw(query, entry_point, layers, vectors, ef=10, k=10):
    current = entry_point

    # Coarse routing: descend from the top layer, hopping to the
    # closest node found on each level
    for layer in range(len(layers) - 1, 0, -1):
        candidates = greedy_search(query, current, layers[layer].graph,
                                   vectors, ef=1)
        current = candidates[0][0]  # (node, distance) -> node id

    # Wider search at the bottom layer, which contains all points
    results = greedy_search(query, current, layers[0].graph, vectors, ef=ef)
    return results[:k]

Greedy Search:

import heapq
import numpy as np

def euclidean_distance(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def greedy_search(query, entry, graph, vectors, ef=1):
    visited = set()
    # Min-heap keyed on distance, so heappop yields the closest candidate
    candidates = [(euclidean_distance(query, vectors[entry]), entry)]
    results = []  # best (node, distance) pairs found so far, sorted

    while candidates:
        dist, current = heapq.heappop(candidates)
        if current in visited:
            continue
        # Stop once the closest remaining candidate is farther than
        # the worst of the ef best results found so far
        if len(results) >= ef and dist > results[-1][1]:
            break
        visited.add(current)
        results.append((current, dist))
        results.sort(key=lambda x: x[1])
        results = results[:ef]

        # Expand the current node's unvisited neighbors
        for neighbor in graph[current]:
            if neighbor not in visited:
                neighbor_dist = euclidean_distance(query, vectors[neighbor])
                heapq.heappush(candidates, (neighbor_dist, neighbor))

    return results
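
Tying the sketches together, a hypothetical end-to-end run: one NSWGraph per layer, each point inserted into every layer up to its assigned one, then a query. Node 0 is forced to the top layer so the entry point exists on every level:

import numpy as np

np.random.seed(0)
vectors = np.random.rand(200, 16).astype("float32")
max_layers = 3
layers = [NSWGraph(M=5) for _ in range(max_layers + 1)]

for i in range(len(vectors)):
    # Force node 0 to the top so it can serve as the global entry point
    top = max_layers if i == 0 else assign_layer(max_layers)
    for level in range(top + 1):
        layers[level].insert(i, vectors)

query = np.random.rand(16).astype("float32")
print(search_hnsw(query, entry_point=0, layers=layers,
                  vectors=vectors, ef=20, k=5))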

Flat Index Characteristics

Exact Search:

  • Computes the exact distance between the query vector and every indexed vector (Euclidean distance, inner product, etc.)
  • Guarantees globally optimal results (the true nearest neighbors)
  • Ideal for scenarios requiring maximum accuracy

Trade-offs:

  • Higher computational cost than approximate methods
  • Does not scale well to massive datasets
  • Best suited for small to medium-sized collections where precision is critical
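
To make the trade-off concrete, a sketch comparing exact and approximate results on random data (the recall figure will vary with the data and parameters):

import numpy as np
import faiss

d = 64
xb = np.random.rand(50_000, d).astype("float32")
xq = np.random.rand(100, d).astype("float32")

flat = faiss.IndexFlatL2(d)
flat.add(xb)
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

_, exact_ids = flat.search(xq, 10)
_, approx_ids = hnsw.search(xq, 10)

# recall@10: fraction of the true 10 nearest neighbors that HNSW found
recall = np.mean([len(set(a) & set(e)) / 10
                  for a, e in zip(approx_ids, exact_ids)])
print(f"recall@10: {recall:.2f}")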