Implementing Machine Learning Classifiers using Scikit-Learn

Tech May 11 11

Visualizing Decision Boundaries

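Evaluating a classifier on a dense meshgrid over a two-dimensional feature space shows how it partitions that space. The following function predicts a class for every grid point, colors the resulting regions, and overlays the actual data points with their true labels. It only works for models trained on exactly two features.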

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

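def plot_decision_boundaries(estimator, feat_df, target_series):
    # Assumes feat_df has exactly two columns, the estimator was fit on those
    # two features, and class labels are the integers 0, 1, 2.
    grid_resolution = 500
    f1_min, f2_min = feat_df.min(axis=0)
    f1_max, f2_max = feat_df.max(axis=0)

    # Build a dense grid covering the observed range of both features
    xx_lin = np.linspace(f1_min, f1_max, grid_resolution)
    yy_lin = np.linspace(f2_min, f2_max, grid_resolution)
    xx_grid, yy_grid = np.meshgrid(xx_lin, yy_lin)

    # Predict a class for every grid point, then reshape back to the grid
    grid_points = np.c_[xx_grid.ravel(), yy_grid.ravel()]
    zz_pred = estimator.predict(grid_points)
    zz_grid = zz_pred.reshape(xx_grid.shape)

    # Light colors for predicted regions, dark colors for the data points
    light_cmap = mcolors.ListedColormap(['#77DD77', '#FF6961', '#AEC6CF'])
    dark_cmap = mcolors.ListedColormap(['green', 'red', 'blue'])

    plt.figure(figsize=(10, 6))
    # vmin/vmax pin label i to color i, even if a class is absent from the data
    plt.pcolormesh(xx_lin, yy_lin, zz_grid, cmap=light_cmap, vmin=0, vmax=2, shading='auto')

    plt.scatter(feat_df.iloc[:, 0], feat_df.iloc[:, 1], c=target_series, cmap=dark_cmap, vmin=0, vmax=2, edgecolor='black', s=40)
    plt.xlabel(feat_df.columns[0])
    plt.ylabel(feat_df.columns[1])

    # Empty scatter calls add one legend entry per class
    unique_classes = np.unique(target_series)
    for cls in unique_classes:
        plt.scatter([], [], color=dark_cmap(int(cls)), edgecolor='black', label=f'Class {cls}')
    plt.legend()
    plt.show()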
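
The model snippets in this article use X_train, X_test, y_train, y_test, X_train_df, and y_train_series without defining them. One possible setup is sketched here. It assumes the Iris dataset restricted to its first two features and a 70/30 split; the article does not state which dataset or split was actually used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: Iris with only its first two features, so models can be drawn in 2D
iris = load_iris(as_frame=True)
features = iris.data.iloc[:, :2]
labels = iris.target

# Stratified split keeps the class proportions equal in both parts
X_train_df, X_test_df, y_train_series, y_test_series = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

# Plain NumPy arrays for fitting and scoring; the DataFrame and Series
# versions are kept for plotting with axis labels
X_train, X_test = X_train_df.to_numpy(), X_test_df.to_numpy()
y_train, y_test = y_train_series.to_numpy(), y_test_series.to_numpy()
```

Fitting on NumPy arrays keeps the training input consistent with the plain grid array that the plotting function passes to predict.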

Naive Bayes

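Naive Bayes classifiers apply Bayes' theorem under the "naive" assumption that every pair of features is conditionally independent given the class. They compute the posterior probability of each class given the feature values and select the class with the maximum posterior. GaussianNB, used below, additionally assumes that each feature follows a normal distribution within each class.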

from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)
nb_accuracy = nb_model.score(X_test, y_test)
print(f"Accuracy: {nb_accuracy:.2%}")
plot_decision_boundaries(nb_model, X_train_df, y_train_series)
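
To make the maximum-posterior rule concrete, here is a minimal hand-rolled Gaussian Naive Bayes in NumPy. It is an illustrative sketch, not scikit-learn's implementation; the function name gaussian_nb_predict and the toy data are hypothetical.

```python
import numpy as np

def gaussian_nb_predict(X_fit, y_fit, X_new):
    """Predict labels with a hand-rolled Gaussian Naive Bayes (illustration only)."""
    classes = np.unique(y_fit)
    log_posteriors = []
    for cls in classes:
        X_cls = X_fit[y_fit == cls]
        mean = X_cls.mean(axis=0)
        var = X_cls.var(axis=0) + 1e-9  # small constant avoids division by zero
        log_prior = np.log(len(X_cls) / len(X_fit))
        # Independence assumption: per-feature log-likelihoods simply add up
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1
        )
        log_posteriors.append(log_prior + log_likelihood)
    # Choose the class with the largest (unnormalized) log-posterior per sample
    scores = np.column_stack(log_posteriors)
    return classes[np.argmax(scores, axis=1)]

# Two well-separated toy clusters centered near (0, 0) and (5, 5)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)
demo_pred = gaussian_nb_predict(X_demo, y_demo, np.array([[0.0, 0.0], [5.0, 5.0]]))
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied.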

Decision Trees

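The CART algorithm recursively partitions the feature space into subsets that are as pure as possible with respect to the target variable, choosing a feature and a threshold to split the data at each node. scikit-learn's DecisionTreeClassifier uses an optimized version of CART and measures purity with Gini impurity by default.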

from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_accuracy = dt_model.score(X_test, y_test)
print(f"Accuracy: {dt_accuracy:.2%}")
plot_decision_boundaries(dt_model, X_train_df, y_train_series)
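
The threshold search at a single node can be sketched as follows, assuming Gini impurity as the purity measure and a single numeric feature. The helper names gini and best_threshold are hypothetical, and a real tree repeats this search over every feature and recurses into both children.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def best_threshold(feature_values, labels):
    """Scan midpoints between sorted unique values for the lowest weighted Gini."""
    values = np.unique(feature_values)
    candidates = (values[:-1] + values[1:]) / 2
    best = (None, np.inf)
    for threshold in candidates:
        left = labels[feature_values <= threshold]
        right = labels[feature_values > threshold]
        # Weight each child's impurity by its share of the samples
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

toy_feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
toy_labels = np.array([0, 0, 0, 1, 1, 1])
split_threshold, split_impurity = best_threshold(toy_feature, toy_labels)
print(split_threshold, split_impurity)  # 6.5 0.0
```

On this toy data the midpoint 6.5 separates the two classes perfectly, so both children are pure and the weighted impurity drops from 0.5 to 0.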

Logistic Regression

Logistic Regression models the probability that an instance belongs to each class. During training, it finds the weights that maximize the likelihood of the observed labels (scikit-learn adds L2 regularization by default). For prediction, it computes the probability of every class and assigns the instance to the most probable one.

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(solver='saga', max_iter=1000)
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_accuracy = lr_model.score(X_test, y_test)
print(f"Accuracy: {lr_accuracy:.2%}")
plot_decision_boundaries(lr_model, X_train_df, y_train_series)
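
The prediction rule described above can be checked directly with predict_proba. This sketch assumes the Iris dataset with two features and the default solver, which may differ from the data and settings the article used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
X_two = X_iris[:, :2]  # assumption: two features, matching the 2D plots

demo_lr = LogisticRegression(max_iter=1000).fit(X_two, y_iris)

# One row per sample, one column per class; each row sums to 1
class_probs = demo_lr.predict_proba(X_two)

# Picking the most probable class reproduces what predict() returns
manual_labels = demo_lr.classes_[np.argmax(class_probs, axis=1)]
```

Indexing classes_ with the argmax matters when labels are not simply 0, 1, 2, since predict_proba orders its columns by classes_.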

Support Vector Machines

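In a perfectly separable binary classification scenario, an SVM identifies the hyperplane that maximizes the margin to the closest data points of each class; those closest points are the support vectors. Note that SVC() uses an RBF kernel by default, so the decision boundary it draws in the original feature space is generally curved rather than a straight line.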

from sklearn.svm import SVC

svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)
svm_accuracy = svm_model.score(X_test, y_test)
print(f"Accuracy: {svm_accuracy:.2%}")
plot_decision_boundaries(svm_model, X_train_df, y_train_series)
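
To see the maximum-margin hyperplane itself, a linear kernel is needed. The sketch below uses hypothetical toy points and a large C to approximate a hard margin, then reads the margin width 2 / ||w|| and the support vectors from the fitted model.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (values chosen only for illustration)
X_sep = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_sep = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a very large C approximates a hard-margin SVM
linear_svm = SVC(kernel='linear', C=1e6).fit(X_sep, y_sep)

w = linear_svm.coef_[0]                     # normal vector of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)      # distance between the two margin lines
support_points = linear_svm.support_vectors_
print(margin_width)
print(support_points)
```

Only the support vectors determine the hyperplane; removing any other training point would leave the fitted boundary unchanged.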

Ensemble Methods: Random Forest

Random Forest applies bagging to decision trees. It draws multiple bootstrap samples of the training data (random sampling with replacement), trains a decision tree on each sample while considering only a random subset of features at each split, and aggregates the trees' predictions to form a more robust classifier.

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = rf_model.score(X_test, y_test)
print(f"Accuracy: {rf_accuracy:.2%}")
plot_decision_boundaries(rf_model, X_train_df, y_train_series)
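
The bagging procedure can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data: it uses a hard majority vote, whereas scikit-learn's RandomForestClassifier averages the trees' predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (assumption, for illustration only)
X_bag, y_bag = make_classification(n_samples=300, n_features=4, random_state=0)
rng = np.random.default_rng(0)

trees = []
for seed in range(25):
    # Bootstrap sample: draw n row indices with replacement
    idx = rng.integers(0, len(X_bag), size=len(X_bag))
    # max_features='sqrt' mimics the per-split feature subsampling of a forest
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=seed)
    trees.append(tree.fit(X_bag[idx], y_bag[idx]))

# Majority vote: labels are 0/1, so the mean vote is thresholded at 0.5
votes = np.stack([tree.predict(X_bag) for tree in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (bagged_pred == y_bag).mean()
```

An odd number of trees avoids tied votes in the binary case. The accuracy here is measured on the training data, so it overstates how well the ensemble generalizes.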

Ensemble Methods: AdaBoost

AdaBoost trains weak classifiers sequentially. After each round, it increases the weights of misclassified instances so subsequent classifiers focus more on difficult cases. The final prediction is a weighted vote of all weak classifiers.

from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier()
ada_model.fit(X_train, y_train)
ada_predictions = ada_model.predict(X_test)
ada_accuracy = ada_model.score(X_test, y_test)
print(f"Accuracy: {ada_accuracy:.2%}")
plot_decision_boundaries(ada_model, X_train_df, y_train_series)
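
A single round of the weight update can be worked through numerically. The sketch follows the classic discrete AdaBoost formulas for binary classification; scikit-learn's implementation differs in its details, and the five-sample setup is hypothetical.

```python
import numpy as np

# Five equally weighted samples; assume the weak learner got only the last one wrong
sample_weights = np.full(5, 1 / 5)
misclassified = np.array([False, False, False, False, True])

# Weighted error of this round's weak learner
error = np.sum(sample_weights[misclassified])
# The learner's say in the final vote: lower error gives a larger alpha
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weights of mistakes, lower the rest, then renormalize to sum to 1
sample_weights = sample_weights * np.exp(np.where(misclassified, alpha, -alpha))
sample_weights = sample_weights / sample_weights.sum()
print(sample_weights)  # [0.125 0.125 0.125 0.125 0.5]
```

After the update the single misclassified sample carries half of the total weight, so the next weak classifier is strongly pushed to get it right.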

Ensemble Methods: Gradient Boosting (GBDT)

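Gradient Boosting builds an additive model in a forward stage-wise fashion. Each new decision tree is fit to the pseudo-residuals of the current ensemble, that is, the negative gradient of the loss function (for squared error these are the ordinary residuals). The final prediction is the sum of all trees' outputs, each scaled by a learning rate.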

from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
gb_accuracy = gb_model.score(X_test, y_test)
print(f"Accuracy: {gb_accuracy:.2%}")
plot_decision_boundaries(gb_model, X_train_df, y_train_series)
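
The residual-fitting loop is easiest to see on a regression problem with squared error, where the pseudo-residuals are just the ordinary residuals. The sketch below is a simplification under that assumption with synthetic data; GradientBoostingClassifier applies the same idea to the gradient of a classification loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1D regression data: a noisy sine wave (illustration only)
rng = np.random.default_rng(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_gb, y_gb.mean())   # stage 0: a constant model
mse_history = [np.mean((y_gb - prediction) ** 2)]

for _ in range(50):
    residuals = y_gb - prediction               # negative gradient of squared loss
    small_tree = DecisionTreeRegressor(max_depth=2).fit(X_gb, residuals)
    prediction += learning_rate * small_tree.predict(X_gb)  # add the new tree's output
    mse_history.append(np.mean((y_gb - prediction) ** 2))
```

The learning rate shrinks each tree's contribution, trading more boosting stages for a smoother fit.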

Advanced Boosting Frameworks

  • XGBoost: Extends traditional GBDT by using a second-order Taylor expansion of the loss function (both gradients and Hessians) together with explicit regularization of tree complexity, which typically speeds up convergence and reduces overfitting.
  • LightGBM: A gradient boosting framework optimized for training speed and low memory usage through histogram-based decision tree learning.
  • Stacking: Not a boosting method but a model fusion technique: multiple base models are trained, and a second-level meta-learner is trained on their combined predictions to produce the final output.
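
Of the three, stacking is available directly in scikit-learn as StackingClassifier. A minimal sketch follows, assuming synthetic data and an arbitrary choice of base models and meta-learner. XGBoost and LightGBM are separate packages and are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem (assumption, for illustration only)
X_stack, y_stack = make_classification(n_samples=400, n_features=6, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_stack, y_stack, test_size=0.25, random_state=0
)

stack_model = StackingClassifier(
    estimators=[
        ('nb', GaussianNB()),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base-model outputs
    cv=5,  # meta-learner is trained on cross-validated base predictions
)
stack_model.fit(Xs_train, ys_train)
stack_accuracy = stack_model.score(Xs_test, ys_test)
print(f"Accuracy: {stack_accuracy:.2%}")
```

Training the meta-learner on cross-validated predictions, rather than on predictions for data the base models have already seen, helps keep it from simply trusting an overfit base model.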
