Understanding K-Means Clustering: Algorithm and Implementation
Introduction to Clustering
Unsupervised learning encompasses algorithms that work with unlabeled training data, attempting to uncover hidden patterns and structures within datasets. Unlike supervised learning, where explicit labels guide the learning process, unsupervised methods derive insights solely from the inherent characteristics of the data.
Clustering represents one of the most common unsupervised learning techniques. Its primary objectives include data reduction (compressing large datasets while preserving essential information) and organizing data points into meaningful groups for further analysis or application.
The K-Means algorithm stands as one of the foundational clustering approaches due to its simplicity and effectiveness.
How K-Means Works
Consider a two-dimensional dataset with scattered data points. The algorithm's goal is to partition these points into distinct clusters, where points within the same cluster share similarities while points from different clusters are dissimilar.
The process begins by initializing K centroids, either randomly or using specific strategies. Each data point is then assigned to the nearest centroid based on a distance metric. After all points are assigned, new centroids are computed as the mean position of all points belonging to each cluster. This assignment and update cycle repeats iteratively until convergence criteria are met.
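To make this loop concrete before the step-by-step implementation later in this article, here is a minimal vectorized sketch of a single assignment-and-update cycle. It assumes plain NumPy arrays (the full implementation below uses matrices instead) and that no cluster ends up empty:

import numpy as np

def one_kmeans_iteration(data, centroids):
    # data: (n, d) array of points; centroids: (k, d) array
    # Squared Euclidean distance from every point to every centroid
    diffs = data[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)   # shape (n, k)
    labels = sq_dists.argmin(axis=1)      # index of nearest centroid per point
    # Recompute each centroid as the mean of its assigned points
    # (assumes every cluster keeps at least one point)
    new_centroids = np.array([data[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

Repeating this cycle until the labels stop changing is the whole algorithm.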
Using a two-dimensional example with sample coordinates:
x y
1.658985 4.285136
-3.453687 3.424321
4.838138 -1.151539
-5.379713 -3.362104
0.972564 2.924086
-3.567919 1.531611
0.450614 -3.302219
-3.487105 -1.724432
2.668759 1.594842
-3.156485 3.191137
3.165506 -3.999838
-2.786837 -3.099354
4.208187 2.984927
-2.123337 2.943366
0.704199 -0.479481
-0.392370 -3.963704
2.831667 1.574018
-0.790153 3.343144
2.943496 -3.357075
When visualized, these points might appear to form approximately four natural groupings. The challenge lies in having the algorithm discover these groupings automatically rather than relying on visual inspection.
Determining the Optimal Number of Clusters
A critical question in K-Means clustering is selecting an appropriate value for K. Since we typically cannot visually inspect high-dimensional data, we need a systematic approach.
One common method involves the elbow technique. We execute K-Means for a range of K values (starting from K=2) and calculate the total within-cluster sum of squares (often called distortion or error) for each. As K increases, the error naturally decreases because more clusters allow for tighter groupings.
However, adding more clusters does not always provide proportional benefits. The goal is to find the elbow point, where the curve of error versus K flattens noticeably. This point represents a good balance between model complexity and clustering quality.
Distance Metrics
The choice of distance metric significantly impacts clustering results. Euclidean distance, the most common choice, measures the straight-line distance between two points in Euclidean space.
For two points A and B in n-dimensional space, the Euclidean distance formula is:
distance(A, B) = sqrt(sum((Ai - Bi)^2))
In two dimensions, this simplifies to the familiar Pythagorean formula.
Implementation
The following Python implementation uses NumPy for numerical operations:
Data Loading
import numpy as np


def load_dataset(filename):
    '''
    Load dataset from file
    :param filename: path to the data file
    :return: numpy matrix containing the data
    '''
    data_points = []
    with open(filename, 'r') as file_handle:
        for line in file_handle:
            values = line.strip().split('\t')
            # Convert all values to float
            numeric_values = list(map(float, values))
            data_points.append(numeric_values)
    return np.mat(data_points)
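For instance, if the sample coordinates listed earlier are saved as tab-separated pairs in a file (the filename testSet.txt here is only a placeholder):

# Hypothetical usage; 'testSet.txt' stands in for wherever the
# tab-separated sample coordinates are stored.
dataset = load_dataset('testSet.txt')
print(np.shape(dataset))   # (19, 2) for the sample points above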
Random Centroid Initialization
def initialize_centroids(dataset, k):
    '''
    Generate K random centroids within the dataset boundaries
    Centroids are constrained to the min-max range of each dimension
    :param dataset: input data matrix
    :param k: number of centroids
    :return: matrix of K centroids
    '''
    num_samples, num_features = np.shape(dataset)
    centroids = np.mat(np.zeros((k, num_features)))
    for feature_idx in range(num_features):
        feature_col = dataset[:, feature_idx]
        min_val = float(feature_col.min())
        max_val = float(feature_col.max())
        range_val = max_val - min_val
        # Generate random values within [min, min + range]
        centroids[:, feature_idx] = min_val + range_val * np.random.rand(k, 1)
    return centroids
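A quick sanity check, assuming dataset was loaded as above: every generated centroid should fall inside the data's bounding box:

# Random centroids stay within each feature's min-max range
centroids = initialize_centroids(dataset, 4)
print(centroids)                        # 4 rows, one per centroid
print(dataset.min(0), dataset.max(0))   # per-feature bounds for comparison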
Distance Calculation
def euclidean_distance(point_a, point_b):
    '''
    Calculate Euclidean distance between two points
    :param point_a: first point
    :param point_b: second point
    :return: Euclidean distance
    '''
    return np.sqrt(np.sum(np.power(point_a - point_b, 2)))
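The classic 3-4-5 right triangle makes an easy correctness check:

a = np.mat([0.0, 0.0])
b = np.mat([3.0, 4.0])
print(euclidean_distance(a, b))   # 5.0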
Cost Function for Evaluating Different K Values
def calculate_distortion(dataset, k, distance_func=euclidean_distance,
                         init_func=initialize_centroids, max_iterations=300):
    '''
    Compute the total distortion for a given K value
    This simplified K-Means helps evaluate different K values
    :param dataset: input data matrix
    :param k: number of clusters
    :param distance_func: distance measurement function
    :param init_func: centroid initialization function
    :param max_iterations: maximum number of iterations
    :return: total distortion value
    '''
    num_samples, _ = np.shape(dataset)
    # Store cluster assignment and corresponding squared distance
    assignments = np.mat(np.zeros((num_samples, 2)))
    # Initialize centroids
    centroids = init_func(dataset, k)
    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        # Assign each point to its nearest centroid
        for i in range(num_samples):
            min_distance = float('inf')
            closest_centroid = -1
            for j in range(k):
                current_distance = distance_func(centroids[j], dataset[i])
                if current_distance < min_distance:
                    min_distance = current_distance
                    closest_centroid = j
            assignments[i, :] = closest_centroid, min_distance ** 2
        # Update each centroid to the mean of its assigned points
        for c in range(k):
            cluster_points = dataset[np.nonzero(assignments[:, 0].A == c)[0]]
            if len(cluster_points) > 0:
                centroids[c, :] = np.mean(cluster_points, axis=0)
    # Total distortion: sum of squared distances to assigned centroids
    return float(assignments[:, 1].sum())
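With calculate_distortion in hand, the elbow technique described earlier reduces to a simple loop. This sketch assumes dataset has already been loaded; because the centroids are initialized randomly, the values will vary from run to run, so taking the best of several runs per K is common practice:

# Evaluate total distortion for a range of K values (elbow method)
distortions = []
for k in range(2, 10):
    distortions.append(calculate_distortion(dataset, k))
    print(k, distortions[-1])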
Full K-Means Algorithm
def kmeans_clustering(dataset, k, distance_func=euclidean_distance,
                      init_func=initialize_centroids):
    '''
    Execute K-Means clustering algorithm
    Iteratively assigns points to clusters and updates centroids
    until cluster assignments stabilize
    :param dataset: input data matrix
    :param k: number of clusters
    :param distance_func: distance measurement function
    :param init_func: centroid initialization function
    :return: final centroids and cluster assignments
    '''
    num_samples, _ = np.shape(dataset)
    # Track cluster assignments and squared distances
    assignments = np.mat(np.zeros((num_samples, 2)))
    # Initialize centroids
    centroids = init_func(dataset, k)
    assignments_changed = True
    while assignments_changed:
        assignments_changed = False
        # Assign each point to its nearest centroid
        for i in range(num_samples):
            min_distance = float('inf')
            nearest_centroid = -1
            for j in range(k):
                dist = distance_func(centroids[j], dataset[i])
                if dist < min_distance:
                    min_distance = dist
                    nearest_centroid = j
            if assignments[i, 0] != nearest_centroid:
                assignments_changed = True
            assignments[i, :] = nearest_centroid, min_distance ** 2
        # Recalculate each centroid as the mean of its assigned points
        for c in range(k):
            cluster_points = dataset[np.nonzero(assignments[:, 0].A == c)[0]]
            if len(cluster_points) > 0:
                centroids[c, :] = np.mean(cluster_points, axis=0)
    return centroids, assignments
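Putting it together, a typical run on the sample data might look like this, again assuming dataset was loaded as shown earlier:

centroids, assignments = kmeans_clustering(dataset, 4)
print(centroids)         # final cluster centers, one row per cluster
print(assignments[:5])   # [cluster index, squared distance] for each point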
Visualization of Different K Values
Testing various K values reveals how clustering quality changes. The red squares indicate cluster centroids, with points colored by their assigned cluster.
When K=2, the data splits into two large groups. Increasing to K=3, K=4, and beyond progressively divides the data into smaller, more refined clusters. The choice of optimal K should balance between capturing meaningful subgroups and avoiding over-segmentation.
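Plots like those described here can be produced with matplotlib. The following is only a sketch, assuming the two-dimensional dataset and the kmeans_clustering output from above:

import matplotlib.pyplot as plt

def plot_clusters(dataset, centroids, assignments):
    # Color points by assigned cluster; mark centroids with red squares
    labels = assignments[:, 0].A.ravel()
    data = np.asarray(dataset)
    plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
    cents = np.asarray(centroids)
    plt.scatter(cents[:, 0], cents[:, 1], c='red', marker='s', s=100)
    plt.show()

plot_clusters(dataset, centroids, assignments)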
The elbow method applied to the distortion values across different K values provides guidance in this selection process.
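Plotting the distortions collected in the loop from the cost-function section makes the elbow visible; a minimal sketch, reusing that distortions list:

import matplotlib.pyplot as plt

# Plot distortion against K; the elbow is where the curve flattens
k_values = list(range(2, 10))
plt.plot(k_values, distortions, marker='o')
plt.xlabel('K (number of clusters)')
plt.ylabel('Total distortion')
plt.show()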