Fading Coder

One Final Commit for the Last Sprint


Implementing Machine Learning Classifiers using Scikit-Learn

Tech · May 11

Visualizing Decision Boundaries

Generating a meshgrid over the feature space makes it possible to visualize how a classifier partitions the plane. The following function evaluates the classifier's predictions across a dense grid, colors each region by predicted class, and overlays the true data points.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

def plot_decision_boundaries(estimator, feat_df, target_series):
    """Plot the regions a fitted two-feature classifier assigns to each class."""
    grid_resolution = 500
    f1_min, f2_min = feat_df.min(axis=0)
    f1_max, f2_max = feat_df.max(axis=0)

    xx_lin = np.linspace(f1_min, f1_max, grid_resolution)
    yy_lin = np.linspace(f2_min, f2_max, grid_resolution)
    xx_grid, yy_grid = np.meshgrid(xx_lin, yy_lin)

    # Wrap the flattened grid in a DataFrame with the original column names,
    # so estimators fitted on a DataFrame do not warn about missing feature names.
    grid_points = pd.DataFrame(np.c_[xx_grid.ravel(), yy_grid.ravel()],
                               columns=feat_df.columns)
    zz_pred = estimator.predict(grid_points)
    zz_grid = zz_pred.reshape(xx_grid.shape)

    light_cmap = mcolors.ListedColormap(['#77DD77', '#FF6961', '#AEC6CF'])
    dark_cmap = mcolors.ListedColormap(['green', 'red', 'blue'])

    plt.figure(figsize=(10, 6))
    plt.pcolormesh(xx_lin, yy_lin, zz_grid, cmap=light_cmap, shading='auto')

    plt.scatter(feat_df.iloc[:, 0], feat_df.iloc[:, 1], c=target_series,
                cmap=dark_cmap, edgecolor='black', s=40)
    plt.xlabel(feat_df.columns[0])
    plt.ylabel(feat_df.columns[1])

    # Empty scatters exist only to give the legend one handle per class;
    # pass 'color=' (not 'c=') since each call uses a single RGBA color.
    for cls in np.unique(target_series):
        plt.scatter([], [], color=dark_cmap(cls), label=f'Class {cls}')
    plt.legend()
    plt.show()

Naive Bayes

Naive Bayes classifiers rely on Bayes' theorem with the "naive" assumption that every pair of features is conditionally independent given the class. They calculate the probability of each class given the feature values and select the class with the maximum posterior probability.

from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)
nb_accuracy = nb_model.score(X_test, y_test)
print(f"Accuracy: {nb_accuracy:.2%}")
plot_decision_boundaries(nb_model, X_train_df, y_train_series)
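To see the class-conditional Gaussians the model fits, you can inspect `GaussianNB`'s fitted attributes. A small sketch on a synthetic dataset (the `make_blobs` data is an assumption, not the article's dataset):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
nb = GaussianNB().fit(X, y)

# theta_ holds the per-class feature means used by the Gaussian likelihoods;
# each row matches the empirical mean of that class's training points.
print(nb.classes_)   # class labels in the order theta_ rows use
print(nb.theta_)     # shape (n_classes, n_features)
```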

Decision Trees

The CART algorithm recursively partitions the feature space into subsets that are as pure as possible regarding the target variable, using feature thresholds to split the data at each node.

from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_accuracy = dt_model.score(X_test, y_test)
print(f"Accuracy: {dt_accuracy:.2%}")
plot_decision_boundaries(dt_model, X_train_df, y_train_series)
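The "feature thresholds" CART learns can be printed directly with `sklearn.tree.export_text`. A sketch on synthetic data (the dataset and `max_depth=2` cap are illustrative assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Every internal node is a single "feature <= threshold" test;
# leaves report the class assigned to that region.
report = export_text(tree, feature_names=["feature_1", "feature_2"])
print(report)
```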

Logistic Regression

Logistic Regression models the probability of an instance belonging to a specific class. During training, it finds the mapping parameters that maximize the likelihood of the observed data. For prediction, it computes the probabilities across all classes and assigns the instance to the class with the highest probability.

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(solver='saga', max_iter=1000)
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_accuracy = lr_model.score(X_test, y_test)
print(f"Accuracy: {lr_accuracy:.2%}")
plot_decision_boundaries(lr_model, X_train_df, y_train_series)
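The "highest probability wins" rule described above can be checked directly with `predict_proba`. A minimal sketch, again on an assumed synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
lr = LogisticRegression(max_iter=1000).fit(X, y)

# One probability per class per row; rows sum to 1, and predict()
# simply returns the class whose column has the largest value.
proba = lr.predict_proba(X[:5])
print(proba)
```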

Support Vector Machines

In a perfectly separable binary classification scenario, an SVM identifies the hyperplane that maximizes the margin between the closest data points of different classes, known as support vectors.

from sklearn.svm import SVC

svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)
svm_accuracy = svm_model.score(X_test, y_test)
print(f"Accuracy: {svm_accuracy:.2%}")
plot_decision_boundaries(svm_model, X_train_df, y_train_series)
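The support vectors mentioned above are exposed by the fitted model: `support_` indexes the training points that determine the margin, and typically only a small fraction of the data qualifies. A sketch with an assumed linear kernel on synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
svm = SVC(kernel="linear").fit(X, y)

# Only the points on or inside the margin are kept as support vectors.
print(len(svm.support_))            # usually far fewer than 300
print(svm.support_vectors_.shape)   # the corresponding training rows
```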

Ensemble Methods: Random Forest

Random Forest combines bagging with random feature selection. It constructs multiple bootstrap samples of the training data (random sampling with replacement), trains a decision tree on each while considering only a random subset of features at every split, and aggregates the trees' predictions to form a robust classifier.

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = rf_model.score(X_test, y_test)
print(f"Accuracy: {rf_accuracy:.2%}")
plot_decision_boundaries(rf_model, X_train_df, y_train_series)
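A useful by-product of bagging: each tree's bootstrap sample leaves some rows out, so those "out-of-bag" rows provide a free validation estimate via `oob_score=True`. A sketch on an assumed synthetic dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Each tree sees a bootstrap sample; rows it never saw score it out-of-bag.
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=42).fit(X, y)
print(rf.oob_score_)       # accuracy estimated without a held-out set
print(len(rf.estimators_)) # the individual trees in the forest
```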

Ensemble Methods: AdaBoost

AdaBoost trains weak classifiers sequentially. After each round, it increases the weights of misclassified instances so subsequent classifiers focus more on difficult cases. The final prediction is a weighted vote of all weak classifiers.

from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier()
ada_model.fit(X_train, y_train)
ada_predictions = ada_model.predict(X_test)
ada_accuracy = ada_model.score(X_test, y_test)
print(f"Accuracy: {ada_accuracy:.2%}")
plot_decision_boundaries(ada_model, X_train_df, y_train_series)
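The "weighted vote" above is visible on the fitted model: `estimator_weights_` holds one weight per boosting round, and `estimators_` the weak learners themselves. A sketch on an assumed synthetic dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import AdaBoostClassifier

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
ada = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X, y)

# One slot per requested round; each weak learner's say in the final vote.
print(len(ada.estimators_))
print(ada.estimator_weights_[:5])
```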

Ensemble Methods: Gradient Boosting (GBDT)

Gradient Boosting builds an additive model in a forward stage-wise fashion. It trains each new decision tree on the residuals of the ensemble built so far (more precisely, on the negative gradients of the loss function), ultimately summing the trees' outputs for the final prediction.

from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
gb_accuracy = gb_model.score(X_test, y_test)
print(f"Accuracy: {gb_accuracy:.2%}")
plot_decision_boundaries(gb_model, X_train_df, y_train_series)
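The stage-wise construction can be watched directly with `staged_predict`, which yields the ensemble's prediction after each additional tree. A sketch on an assumed, deliberately noisier synthetic dataset so the improvement is visible:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# cluster_std=3.0 makes the classes overlap, so early stages are imperfect.
X, y = make_blobs(n_samples=300, centers=3, n_features=2,
                  cluster_std=3.0, random_state=42)
gb = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X, y)

# Training accuracy after 1 tree, 2 trees, ..., 50 trees.
staged_acc = [accuracy_score(y, pred) for pred in gb.staged_predict(X)]
print(staged_acc[0], staged_acc[-1])
```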

Advanced Boosting Frameworks

  • XGBoost: Improves upon traditional GBDT by utilizing a second-order Taylor expansion of the loss function, leading to faster convergence and better accuracy.
  • LightGBM: A highly efficient gradient boosting framework optimized for speed and lower memory usage through histogram-based decision tree learning.
  • Stacking: A model fusion technique where multiple base models are trained, and a second-level meta-learner is trained on their combined predictions to produce the final output.
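Of the three, stacking is available directly in scikit-learn via `StackingClassifier`. A minimal sketch on an assumed synthetic dataset, with arbitrarily chosen base models:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Base models produce out-of-fold predictions via internal cross-validation;
# the meta-learner is trained on those predictions, not on the raw features.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
print(stack.score(X, y))
```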
