Understanding K-Means Clustering: Algorithm and Implementation


Introduction to Clustering

Unsupervised learning encompasses algorithms that work with unlabeled training data, attempting to uncover hidden patterns and structures within datasets. Unlike supervised learning, where explicit labels guide the learning process, unsupervised methods derive insights solely from the inherent characteristics of the data.

Clustering represents one of the most common unsupervised learning techniques. Its primary objectives include data reduction (compressing large datasets while preserving essential information) and organizing data points into meaningful groups for further analysis or application.

The K-Means algorithm stands as one of the foundational clustering approaches due to its simplicity and effectiveness.

How K-Means Works

Consider a two-dimensional dataset with scattered data points. The algorithm's goal is to partition these points into distinct clusters, where points within the same cluster share similarities while points from different clusters are dissimilar.

The process begins by initializing K centroids, either randomly or using specific strategies. Each data point is then assigned to the nearest centroid based on a distance metric. After all points are assigned, new centroids are computed as the mean position of all points belonging to each cluster. This assignment and update cycle repeats iteratively until convergence criteria are met.
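Before looking at the full NumPy implementation later in this article, a minimal pure-Python sketch of this assign-and-update loop may help. It operates on a list of (x, y) tuples; initializing centroids by sampling K data points is just one possible strategy:

import math
import random

def kmeans_sketch(points, k, max_iterations=100):
    '''Minimal K-Means sketch over a list of (x, y) tuples.'''
    # Initialize centroids by sampling K distinct data points
    centroids = random.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(d) / len(cluster)
                                           for d in zip(*cluster)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters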

Using a two-dimensional example with sample coordinates:

     x            y
 1.658985     4.285136
-3.453687     3.424321
 4.838138    -1.151539
-5.379713    -3.362104
 0.972564     2.924086
-3.567919     1.531611
 0.450614    -3.302219
-3.487105    -1.724432
 2.668759     1.594842
-3.156485     3.191137
 3.165506    -3.999838
-2.786837    -3.099354
 4.208187     2.984927
-2.123337     2.943366
 0.704199    -0.479481
-0.392370    -3.963704
 2.831667     1.574018
-0.790153     3.343144
 2.943496    -3.357075

When visualized, these points might appear to form approximately four natural groupings. The challenge lies in having the algorithm discover these groupings automatically rather than relying on visual inspection.

Determining the Optimal Number of Clusters

A critical question in K-Means clustering is selecting an appropriate value for K. Since we typically cannot visually inspect high-dimensional data, we need a systematic approach.

One common method involves the elbow technique. We execute K-Means for a range of K values (starting from K=2) and calculate the total within-cluster sum of squares (often called distortion or error) for each. As K increases, the error naturally decreases because more clusters allow for tighter groupings.

However, adding more clusters does not always provide proportional benefits. The goal is to find the elbow point, where the error curve's decline flattens noticeably. This point represents a good balance between model complexity and clustering quality.

Distance Metrics

The choice of distance metric significantly impacts clustering results. Euclidean distance, the most common choice, measures the straight-line distance between two points in Euclidean space.

For two points A and B in n-dimensional space, the Euclidean distance formula is:

distance(A, B) = sqrt((A1 - B1)^2 + (A2 - B2)^2 + ... + (An - Bn)^2)

In two dimensions, this reduces to the familiar Pythagorean formula.
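For instance, applying the formula to the first two sample points from the table above:

import math

# First two sample points from the dataset above
a = (1.658985, 4.285136)
b = (-3.453687, 3.424321)

distance = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
print(distance)  # approximately 5.18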

Implementation

The following Python implementation uses NumPy for numerical operations. It follows the matrix-based style of older NumPy code (np.matrix is deprecated in recent NumPy releases, but the same logic carries over to plain arrays):

Data Loading

import numpy as np

def load_dataset(filename):
    '''
    Load dataset from file
    :param filename: path to the data file
    :return: numpy matrix containing the data
    '''
    data_points = []
    with open(filename, 'r') as file_handle:
        for line in file_handle:
            values = line.strip().split('\t')
            # Convert all values to float
            numeric_values = list(map(float, values))
            data_points.append(numeric_values)
    return np.matrix(data_points)

Random Centroid Initialization

def initialize_centroids(dataset, k):
    '''
    Generate K random centroids within the dataset boundaries
    Centroids are constrained to the min-max range of each dimension
    
    :param dataset: input data matrix
    :param k: number of centroids
    :return: matrix of K centroids
    '''
    num_samples, num_features = np.shape(dataset)
    centroids = np.mat(np.zeros((k, num_features)))
    
    for feature_idx in range(num_features):
        feature_col = dataset[:, feature_idx]
        min_val = float(feature_col.min())
        max_val = float(feature_col.max())
        range_val = max_val - min_val
        # Generate random values within [min, min+range]
        centroids[:, feature_idx] = min_val + range_val * np.random.rand(k, 1)
    
    return centroids
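Because the centroids are drawn at random within the data's bounding box, a centroid can occasionally end up with no points assigned to it. The clustering functions below guard against this case by checking len(cluster_points) > 0 before recomputing a centroid, leaving an empty cluster's centroid in place.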

Distance Calculation

def euclidean_distance(point_a, point_b):
    '''
    Calculate Euclidean distance between two points
    :param point_a: first point
    :param point_b: second point
    :return: Euclidean distance
    '''
    return np.sqrt(np.sum(np.power(point_a - point_b, 2)))

Cost Function for Evaluating Different K Values

def calculate_distortion(dataset, k, distance_func=euclidean_distance, 
                        init_func=initialize_centroids, max_iterations=300):
    '''
    Compute the total distortion for a given K value
    This simplified K-Means helps evaluate different K values
    
    :param dataset: input data matrix
    :param k: number of clusters
    :param distance_func: distance measurement function
    :param init_func: centroid initialization function
    :param max_iterations: maximum number of iterations
    :return: total distortion value
    '''
    num_samples, _ = np.shape(dataset)
    
    # Store cluster assignment and corresponding distances
    assignments = np.mat(np.zeros((num_samples, 2)))
    
    # Initialize centroids
    centroids = init_func(dataset, k)
    
    iteration = 0
    assignments_changed = True
    while assignments_changed and iteration < max_iterations:
        iteration += 1
        assignments_changed = False
        # Assign each point to nearest centroid
        for i in range(num_samples):
            min_distance = float('inf')
            closest_centroid = -1
            
            for j in range(k):
                current_distance = distance_func(centroids[j], dataset[i])
                if current_distance < min_distance:
                    min_distance = current_distance
                    closest_centroid = j
            
            if assignments[i, 0] != closest_centroid:
                assignments_changed = True
            assignments[i] = closest_centroid, min_distance ** 2
        
        # Update centroids
        for c in range(k):
            cluster_points = dataset[np.nonzero(assignments[:, 0].A == c)[0]]
            if len(cluster_points) > 0:
                centroids[c] = np.mean(cluster_points, axis=0)
    
    # Total distortion: sum of squared distances to assigned centroids
    return float(assignments[:, 1].sum())

Full K-Means Algorithm

def kmeans_clustering(dataset, k, distance_func=euclidean_distance,
                     init_func=initialize_centroids):
    '''
    Execute K-Means clustering algorithm
    Iteratively assigns points to clusters and updates centroids
    until cluster assignments stabilize
    
    :param dataset: input data matrix
    :param k: number of clusters
    :param distance_func: distance measurement function
    :param init_func: centroid initialization function
    :return: final centroids and cluster assignments
    '''
    num_samples, _ = np.shape(dataset)
    
    # Track cluster assignments and distances
    assignments = np.mat(np.zeros((num_samples, 2)))
    
    # Initialize centroids
    centroids = init_func(dataset, k)
    
    assignments_changed = True
    while assignments_changed:
        assignments_changed = False
        
        # Assign points to nearest centroid
        for i in range(num_samples):
            min_distance = float('inf')
            nearest_centroid = -1
            
            for j in range(k):
                dist = distance_func(centroids[j], dataset[i])
                if dist < min_distance:
                    min_distance = dist
                    nearest_centroid = j
            
            if assignments[i, 0] != nearest_centroid:
                assignments_changed = True
            assignments[i] = nearest_centroid, min_distance ** 2
        
        # Recalculate centroids
        for c in range(k):
            cluster_points = dataset[np.nonzero(assignments[:, 0].A == c)[0]]
            if len(cluster_points) > 0:
                centroids[c] = np.mean(cluster_points, axis=0)
    
    return centroids, assignments
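A minimal usage sketch tying the pieces together (the filename here is a placeholder; any tab-separated file of floating-point coordinates in the format shown earlier works):

if __name__ == '__main__':
    # 'data.txt' is a hypothetical path to a tab-separated data file
    dataset = load_dataset('data.txt')
    centroids, assignments = kmeans_clustering(dataset, k=4)

    print('Final centroids:')
    print(centroids)
    # Column 0 of assignments holds each point's cluster index,
    # column 1 the squared distance to its assigned centroid
    print('Total distortion:', float(assignments[:, 1].sum()))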

Visualization of Different K Values

Testing various K values reveals how clustering quality changes. When the results are plotted, cluster centroids can be drawn as red squares, with points colored by their assigned cluster.

When K=2, the data splits into two large groups. Increasing to K=3, K=4, and beyond progressively divides the data into smaller, more refined clusters. The choice of optimal K should balance between capturing meaningful subgroups and avoiding over-segmentation.

The elbow method applied to the distortion values across different K values provides guidance in this selection process.
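One way to sketch that process in code, reusing the calculate_distortion function defined above (matplotlib is assumed for plotting):

import matplotlib.pyplot as plt

def plot_elbow(dataset, k_values=range(2, 10)):
    '''Plot total distortion against K to look for an elbow.'''
    distortions = [calculate_distortion(dataset, k) for k in k_values]

    plt.plot(list(k_values), distortions, marker='o')
    plt.xlabel('Number of clusters K')
    plt.ylabel('Total within-cluster sum of squares')
    plt.title('Elbow method')
    plt.show()

Since the random initialization can settle into a poor local optimum, running calculate_distortion several times for each K and keeping the minimum distortion produces a more reliable curve.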
