Implementing Machine Learning Classifiers using Scikit-Learn
Visualizing Decision Boundaries
Generating a mesh grid over the feature space makes it possible to visualize how a classifier partitions that space. The following function evaluates the classifier on a dense grid of points and overlays the true data points.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
def plot_decision_boundaries(estimator, feat_df, target_series):
    """Plot a fitted classifier's decision regions over a two-feature DataFrame."""
    grid_resolution = 500
    # Feature bounds (assumes feat_df has exactly two columns).
    f1_min, f2_min = feat_df.min(axis=0)
    f1_max, f2_max = feat_df.max(axis=0)
    xx_lin = np.linspace(f1_min, f1_max, grid_resolution)
    yy_lin = np.linspace(f2_min, f2_max, grid_resolution)
    xx_grid, yy_grid = np.meshgrid(xx_lin, yy_lin)
    # Flatten the grid into an (n_points, 2) array and predict every cell.
    grid_points = np.c_[xx_grid.ravel(), yy_grid.ravel()]
    zz_pred = estimator.predict(grid_points)
    zz_grid = zz_pred.reshape(xx_grid.shape)
    # Light colors for the regions, darker shades of the same hues for the points
    # (three classes assumed by both colormaps).
    light_cmap = mcolors.ListedColormap(['#77DD77', '#FF6961', '#AEC6CF'])
    dark_cmap = mcolors.ListedColormap(['green', 'red', 'blue'])
    plt.figure(figsize=(10, 6))
    plt.pcolormesh(xx_lin, yy_lin, zz_grid, cmap=light_cmap, shading='auto')
    plt.scatter(feat_df.iloc[:, 0], feat_df.iloc[:, 1], c=target_series,
                cmap=dark_cmap, edgecolor='black', s=40)
    plt.xlabel(feat_df.columns[0])
    plt.ylabel(feat_df.columns[1])
    # Empty scatters create one labeled legend handle per class.
    unique_classes = np.unique(target_series)
    for idx, cls in enumerate(unique_classes):
        plt.scatter([], [], color=dark_cmap(idx), label=f'Class {cls}')
    plt.legend()
    plt.show()
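The snippets that follow assume a train/test split already exists as X_train, X_test, y_train, y_test, together with DataFrame and Series views (X_train_df, y_train_series) of two training features for plotting. Since the original data is not specified, here is a minimal setup sketch using scikit-learn's Iris dataset restricted to two features, purely as an assumption:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Hypothetical setup: two Iris features so the decision regions are plottable in 2-D.
iris = load_iris(as_frame=True)
X = iris.data[['petal length (cm)', 'petal width (cm)']]
y = iris.target
X_train_df, X_test_df, y_train_series, y_test_series = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# Plain arrays for fitting; the DataFrame/Series views feed the plotting helper.
X_train, X_test = X_train_df.to_numpy(), X_test_df.to_numpy()
y_train, y_test = y_train_series.to_numpy(), y_test_series.to_numpy()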
Naive Bayes
Naive Bayes classifiers rely on Bayes' theorem together with the "naive" assumption that every pair of features is conditionally independent given the class. They compute the posterior probability of each class given the feature values and select the class with the maximum posterior.
from sklearn.naive_bayes import GaussianNB
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)
nb_accuracy = nb_model.score(X_test, y_test)
print(f"Accuracy: {nb_accuracy:.2%}")
plot_decision_boundaries(nb_model, X_train_df, y_train_series)
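Since prediction is just an argmax over the posteriors, those posteriors can be inspected directly; a quick sketch, assuming the hypothetical setup above:
# Posterior P(class | x) for the first three test points; predict() takes the argmax.
posteriors = nb_model.predict_proba(X_test[:3])
print(np.round(posteriors, 3))
print("argmax:", posteriors.argmax(axis=1))
print("predict:", nb_model.predict(X_test[:3]))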
Decision Trees
The CART algorithm recursively partitions the feature space into subsets that are as pure as possible with respect to the target variable, using a single feature threshold to split the data at each node.
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_accuracy = dt_model.score(X_test, y_test)
print(f"Accuracy: {dt_accuracy:.2%}")
plot_decision_boundaries(dt_model, X_train_df, y_train_series)
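The learned thresholds make the recursive partitioning explicit and can be printed with export_text; the feature names below assume the two-feature setup sketched earlier:
from sklearn.tree import export_text
# Each internal node tests one feature against a threshold; leaves report the class.
print(export_text(dt_model, feature_names=list(X_train_df.columns)))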
Logistic Regression
Logistic Regression models the probability of an instance belonging to a specific class. During training, it finds the mapping parameters that maximize the likelihood of the observed data. For prediction, it computes the probabilities across all classes and assigns the instance to the class with the highest probability.
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(solver='saga', max_iter=1000)
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_accuracy = lr_model.score(X_test, y_test)
print(f"Accuracy: {lr_accuracy:.2%}")
plot_decision_boundaries(lr_model, X_train_df, y_train_series)
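To make the "highest probability wins" step concrete, the class probabilities can be reproduced from the fitted coefficients. The sketch below applies a multinomial softmax to the linear scores, which should match predict_proba when the model is fit in multinomial mode (the default for the saga solver on multiclass data in recent scikit-learn versions):
from scipy.special import softmax
# Linear scores z = X W^T + b, mapped through softmax to per-class probabilities.
scores = X_test @ lr_model.coef_.T + lr_model.intercept_
manual_proba = softmax(scores, axis=1)
print(np.allclose(manual_proba, lr_model.predict_proba(X_test)))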
Support Vector Machines
In a perfectly separable binary classification scenario, an SVM identifies the hyperplane that maximizes the margin between the closest data points of different classes, known as support vectors.
from sklearn.svm import SVC
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)
svm_accuracy = svm_model.score(X_test, y_test)
print(f"Accuracy: {svm_accuracy:.2%}")
plot_decision_boundaries(svm_model, X_train_df, y_train_series)
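After fitting, the support vectors, the training points that determine the boundary, are exposed as standard SVC attributes:
# Only these points influence the decision function; all others could be removed.
print("support vectors per class:", svm_model.n_support_)
print(svm_model.support_vectors_[:3])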
Ensemble Methods: Random Forest
Random Forest applies bagging: it draws multiple bootstrap samples of the training data (random sampling with replacement) and trains a decision tree on each. In addition, each tree considers only a random subset of the features at every split, which decorrelates the trees before their predictions are aggregated into a robust classifier.
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = rf_model.score(X_test, y_test)
print(f"Accuracy: {rf_accuracy:.2%}")
plot_decision_boundaries(rf_model, X_train_df, y_train_series)
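The aggregation step can be reproduced by hand: scikit-learn's RandomForestClassifier averages the per-tree class probabilities rather than taking a hard majority vote. A sketch, assuming the setup above:
# Average each tree's class-probability estimates, then take the argmax.
all_proba = np.stack([tree.predict_proba(X_test) for tree in rf_model.estimators_])
manual_pred = rf_model.classes_[all_proba.mean(axis=0).argmax(axis=1)]
print(np.array_equal(manual_pred, rf_model.predict(X_test)))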
Ensemble Methods: AdaBoost
AdaBoost trains weak classifiers sequentially. After each round, it increases the weights of misclassified instances so subsequent classifiers focus more on difficult cases. The final prediction is a weighted vote of all weak classifiers.
from sklearn.ensemble import AdaBoostClassifier
ada_model = AdaBoostClassifier()
ada_model.fit(X_train, y_train)
ada_predictions = ada_model.predict(X_test)
ada_accuracy = ada_model.score(X_test, y_test)
print(f"Accuracy: {ada_accuracy:.2%}")
plot_decision_boundaries(ada_model, X_train_df, y_train_series)
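Because boosting is sequential, the ensemble's test accuracy can be traced round by round with staged_predict (a standard AdaBoostClassifier method):
from sklearn.metrics import accuracy_score
# Accuracy of the partial ensemble after each boosting round.
staged_acc = [accuracy_score(y_test, pred)
              for pred in ada_model.staged_predict(X_test)]
print(staged_acc[:5], '...', staged_acc[-1])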
Ensemble Methods: Gradient Boosting (GBDT)
Gradient Boosting builds an additive model in a forward stage-wise fashion. Each new decision tree is fit to the negative gradient of the loss with respect to the current ensemble's predictions (the pseudo-residuals, which reduce to ordinary residuals for squared loss), and the trees' outputs are summed for the final prediction.
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
gb_accuracy = gb_model.score(X_test, y_test)
print(f"Accuracy: {gb_accuracy:.2%}")
plot_decision_boundaries(gb_model, X_train_df, y_train_series)
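The stage-wise construction is observable on the fitted model: the train_score_ attribute records the training loss after each stage, so it should decrease as trees are added:
# Training loss at the first few stages versus the final stage.
print(gb_model.train_score_[:5])
print(gb_model.train_score_[-1])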
Advanced Boosting Frameworks
- XGBoost: Improves upon traditional GBDT by using a second-order Taylor expansion of the loss function and explicit regularization of tree complexity, leading to faster convergence and often better accuracy.
- LightGBM: A highly efficient gradient boosting framework optimized for speed and lower memory usage through histogram-based decision tree learning.
- Stacking: A model fusion technique where multiple base models are trained and a second-level meta-learner is trained on their combined predictions to produce the final output; a minimal sketch follows this list.
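Of these, stacking is available directly in scikit-learn. A minimal sketch reusing classifiers from above as base learners, with logistic regression as the meta-learner (the particular base models are an arbitrary choice for illustration):
from sklearn.ensemble import StackingClassifier
# Cross-validated predictions of the base models become the meta-learner's features.
stack_model = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('svm', SVC(probability=True))],
    final_estimator=LogisticRegression(max_iter=1000))
stack_model.fit(X_train, y_train)
print(f"Accuracy: {stack_model.score(X_test, y_test):.2%}")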