Unsupervised Data Mining: K-Means Clustering and Feature Scaling
Supervised vs. Unsupervised Learning
Classification models operate under supervised learning: they rely on labeled datasets to learn a mapping from inputs to known outcomes. Clustering algorithms, by contrast, fall under unsupervised learning: they extract hidden structure from data without any predefined labels.
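A minimal sketch of this difference at the API level, assuming scikit-learn and a made-up toy array: supervised estimators take labels in fit(), unsupervised ones do not.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])  # made-up toy data
y = np.array([0, 0, 1, 1])  # labels exist only in the supervised case

clf = LogisticRegression().fit(X, y)  # supervised: requires both X and y
km = KMeans(n_clusters=2, n_init='auto', random_state=0).fit(X)  # unsupervised: X only
print(clf.predict(X), km.labels_)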
Implementing K-Means Clustering
Generating synthetic data to visualize cluster distribution:
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate 600 two-dimensional samples drawn from 4 Gaussian blobs;
# targets holds the ground-truth blob labels and is used only for plotting
features, targets = make_blobs(n_samples=600, n_features=2, centers=4, random_state=42)

palette = ['#FF0000', '#FFC0CB', '#FFA500', '#808080']
fig, ax = plt.subplots()
for idx in range(4):
    subset = features[targets == idx]
    ax.scatter(subset[:, 0], subset[:, 1], c=palette[idx], s=10, label=f'Group {idx}')
ax.legend()
plt.show()
Fitting the K-Means model and extracting core attributes:
from sklearn.cluster import KMeans

# Initial guess for k; the silhouette analysis below refines this choice
k_value = 3
kmeans_model = KMeans(n_clusters=k_value, random_state=10, n_init='auto').fit(features)
# Retrieving cluster assignments
assignments = kmeans_model.labels_
# Alternative prediction method
predicted_groups = kmeans_model.predict(features)
# Extracting centroids
centers = kmeans_model.cluster_centers_
# Calculating Sum of Squared Errors (Inertia)
sse = kmeans_model.inertia_
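As a sanity check on these attributes, the sketch below (reusing the variables above) overlays the centroids on the cluster assignments and recomputes the inertia by hand; the equality check is expected to print True.
import numpy as np

# Overlay the learned centroids on the assigned clusters
fig, ax = plt.subplots()
ax.scatter(features[:, 0], features[:, 1], c=assignments, s=10)
ax.scatter(centers[:, 0], centers[:, 1], c='black', marker='x', s=100)
plt.show()

# Inertia is the sum of squared distances from each sample to its assigned centroid
manual_sse = ((features - centers[assignments]) ** 2).sum()
print(np.isclose(manual_sse, sse))  # expect True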
Model Evaluation: Silhouette Score
While increasing the number of clusters always drives the total Sum of Squared Errors down, excessive fragmentation produces meaningless partitions. The goal is high intra-cluster similarity and clear inter-cluster separation. The silhouette coefficient quantifies this balance: for each sample, s = (b - a) / max(a, b), where a is the mean distance to the other samples in its own cluster and b is the mean distance to the samples in the nearest neighboring cluster. Averaging s over all samples gives a score that can be compared across candidate values of k to identify the optimal cluster count.
import pandas as pd
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt

# Mean silhouette coefficient across all samples
avg_score = silhouette_score(features, assignments)

# Individual sample scores
individual_scores = silhouette_samples(features, assignments)

# Determining the optimal k by evaluating silhouette scores from 2 to 20
evaluation_scores = []
k_range = range(2, 21)
for k in k_range:
    temp_model = KMeans(n_clusters=k, random_state=10, n_init='auto').fit(features)
    evaluation_scores.append(silhouette_score(features, temp_model.labels_))

plt.plot(k_range, evaluation_scores)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')

# Identifying the peak score; a Series indexed by k lets idxmax() return k directly
scores_series = pd.Series(evaluation_scores, index=k_range)
optimal_k = scores_series.idxmax()
plt.axvline(optimal_k, linestyle=':', color='black')
plt.show()
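To ground the formula above, a minimal verification sketch, reusing features, assignments, and individual_scores from the code above. It recomputes the silhouette coefficient for a single sample by hand; scipy's cdist (scipy ships as a scikit-learn dependency) supplies the pairwise distances, and the sample index 0 is an arbitrary choice.
import numpy as np
from scipy.spatial.distance import cdist

i = 0  # arbitrary sample to verify
own = assignments[i]
dists = cdist(features[i:i + 1], features)[0]  # distances from sample i to all samples

# a: mean distance to the other members of sample i's own cluster
same = (assignments == own)
a = dists[same].sum() / (same.sum() - 1)  # the zero self-distance is excluded

# b: smallest mean distance to any other cluster
b = min(dists[assignments == c].mean() for c in set(assignments) if c != own)

s = (b - a) / max(a, b)
print(np.isclose(s, individual_scores[i]))  # expect True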
Row-Based Vector Normalization
Applying row-wise normalization to scale individual samples:
from sklearn.preprocessing import Normalizer

# data_frame is assumed to be a previously loaded pandas DataFrame whose
# numeric features start at the third column
feature_subset = data_frame.iloc[:, 2:]

# Scale each row (sample) to unit length; columns are left untouched
normalized_matrix = Normalizer().fit_transform(feature_subset)
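A quick check of what Normalizer actually does, on a made-up array rather than the DataFrame above: every output row should have an L2 norm of 1.
import numpy as np
from sklearn.preprocessing import Normalizer

demo = np.array([[3.0, 4.0], [1.0, 1.0]])  # made-up data
unit_rows = Normalizer().fit_transform(demo)
print(unit_rows)                           # [[0.6, 0.8], [0.7071..., 0.7071...]]
print(np.linalg.norm(unit_rows, axis=1))   # each row norm is 1.0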
Feature Scaling Techniques Comparison
- Standardization (StandardScaler): operates column-by-column, transforming each feature to zero mean and unit variance (Z-score). This puts features on a common scale and suits algorithms that assume roughly normally distributed inputs, such as linear and logistic regression.
- Min-Max Scaling (MinMaxScaler): operates column-by-column, rescaling each feature by its minimum and maximum into a fixed range (typically [0, 1]). This can accelerate model convergence and frequently serves as a preprocessing step for neural networks.
- Vector Normalization (Normalizer): operates row-by-row, converting each sample vector into a unit vector. This gives samples a uniform scale for dot products and kernel-based similarity measures, and is common in text classification, clustering, and regularized regression. A short comparison sketch follows this list.
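To make the column-wise versus row-wise distinction concrete, here is a small sketch applying all three transformers to the same made-up matrix:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])  # made-up data with very different column scales

print(StandardScaler().fit_transform(X))  # per column: zero mean, unit variance
print(MinMaxScaler().fit_transform(X))    # per column: rescaled into [0, 1]
print(Normalizer().fit_transform(X))      # per row: each sample scaled to unit length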