Fading Coder

One Final Commit for the Last Sprint


Unsupervised Data Mining: K-Means Clustering and Feature Scaling

Tech · May 10

Supervised vs. Unsupervised Learning

Classification models fall under supervised learning: they rely on labeled datasets to learn a mapping from inputs to known outcomes. Clustering algorithms, by contrast, belong to unsupervised learning: they uncover hidden structure in data without any predefined labels.

Implementing K-Means Clustering

Generating synthetic data to visualize cluster distribution:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

features, targets = make_blobs(n_samples=600, n_features=2, centers=4, random_state=42)
palette = ['#FF0000', '#FFC0CB', '#FFA500', '#808080']

fig, ax = plt.subplots()
for idx in range(4):
    subset = features[targets == idx]
    ax.scatter(subset[:, 0], subset[:, 1], c=palette[idx], s=10, label=f'Group {idx}')
ax.legend()
plt.show()

Fitting the K-Means model and extracting core attributes:

from sklearn.cluster import KMeans

k_value = 3
kmeans_model = KMeans(n_clusters=k_value, random_state=10, n_init='auto').fit(features)

# Retrieving cluster assignments
assignments = kmeans_model.labels_

# Alternative prediction method
predicted_groups = kmeans_model.predict(features)

# Extracting centroids
centers = kmeans_model.cluster_centers_

# Calculating Sum of Squared Errors (Inertia)
sse = kmeans_model.inertia_

Model Evaluation: Silhouette Score

While increasing the number of clusters naturally reduces the total Sum of Squared Errors, excessive fragmentation results in meaningless partitions. The goal is to maximize intra-cluster similarity and inter-cluster distinction. The silhouette coefficient quantifies this balance, enabling the identification of the optimal cluster count.

import pandas as pd
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt

# Mean silhouette coefficient across all samples
avg_score = silhouette_score(features, assignments)

# Individual sample scores
individual_scores = silhouette_samples(features, assignments)

# Determining the optimal k by evaluating silhouette scores from 2 to 20
evaluation_scores = []
k_range = range(2, 21)

for k in k_range:
    temp_model = KMeans(n_clusters=k, random_state=10, n_init='auto').fit(features)
    evaluation_scores.append(silhouette_score(features, temp_model.labels_))

plt.plot(k_range, evaluation_scores)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')

# Identifying the peak score
scores_df = pd.DataFrame(evaluation_scores, index=k_range)
optimal_k = scores_df[0].idxmax()
plt.axvline(optimal_k, linestyle=':', color='black')
plt.show()

Row-Based Vector Normalization

Applying row-wise normalization to scale individual samples:

from sklearn.preprocessing import Normalizer

# data_frame is assumed to be a pandas DataFrame holding the dataset,
# with the feature columns starting at index 2
feature_subset = data_frame.iloc[:, 2:]
normalized_matrix = Normalizer().fit_transform(feature_subset)
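A tiny check (mine, not the original post's) makes the row-wise behaviour visible: each sample is divided by its own L2 norm, so the row [3, 4] becomes [0.6, 0.8]:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

sample = np.array([[3.0, 4.0],
                   [1.0, 1.0]])
unit_rows = Normalizer().fit_transform(sample)

print(unit_rows[0])                       # [0.6 0.8]
print(np.linalg.norm(unit_rows, axis=1))  # every row has L2 norm 1.0
```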

Feature Scaling Techniques Comparison

  • Standardization (StandardScaler): Processes data column-by-column, transforming feature values to follow a standard normal distribution (Z-score). This aligns feature dimensions and is primarily utilized in algorithms assuming normal distribution, such as linear and logistic regression.
  • Min-Max Scaling (MinMaxScaler): Adjusts features based on minimum and maximum values, scaling the data to a specific range (typically [0, 1]). This accelerates model convergence and enhances precision, frequently serving as a preprocessing step for neural networks.
  • Vector Normalization (Normalizer): Operates row-by-row, converting each sample vector into a unit vector. This ensures a uniform standard during dot product computations or kernel-based similarity measurements. It is highly prevalent in text classification, clustering tasks, and regularized regressions to mitigate overfitting.
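The three scalers can be contrasted on a single toy matrix (the numbers are purely illustrative). Note that the first two operate per column, while Normalizer operates per row:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

data = np.array([[1.0, 200.0],
                 [2.0, 300.0],
                 [3.0, 400.0]])

standardized = StandardScaler().fit_transform(data)  # per column: mean 0, std 1
min_maxed = MinMaxScaler().fit_transform(data)       # per column: scaled to [0, 1]
unit_rows = Normalizer().fit_transform(data)         # per row: unit L2 norm
```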

