Implementing Machine Learning Classifiers using Scikit-Learn
Visualizing Decision Boundaries
Generating a mesh grid over the feature space makes it possible to visualize how a classifier partitions that space. The following function evaluates the classifier on a dense grid of points and overlays the true data points.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
def plot_decision_boundaries(estimator, feat_df, target_series):
    """Plot a fitted classifier's decision regions over a two-feature DataFrame."""
    grid_resolution = 500
    # Feature bounds (assumes feat_df has exactly two columns).
    f1_min, f2_min = feat_df.min(axis=0)
    f1_max, f2_max = feat_df.max(axis=0)
    xx_lin = np.linspace(f1_min, f1_max, grid_resolution)
    yy_lin = np.linspace(f2_min, f2_max, grid_resolution)
    xx_grid, yy_grid = np.meshgrid(xx_lin, yy_lin)
    # Flatten the grid into an (n_points, 2) array and predict every cell.
    grid_points = np.c_[xx_grid.ravel(), yy_grid.ravel()]
    zz_pred = estimator.predict(grid_points)
    zz_grid = zz_pred.reshape(xx_grid.shape)
    # Light colors for the regions, darker shades of the same hues for the points
    # (three classes assumed by both colormaps).
    light_cmap = mcolors.ListedColormap(['#77DD77', '#FF6961', '#AEC6CF'])
    dark_cmap = mcolors.ListedColormap(['green', 'red', 'blue'])
    plt.figure(figsize=(10, 6))
    plt.pcolormesh(xx_lin, yy_lin, zz_grid, cmap=light_cmap, shading='auto')
    plt.scatter(feat_df.iloc[:, 0], feat_df.iloc[:, 1], c=target_series,
                cmap=dark_cmap, edgecolor='black', s=40)
    plt.xlabel(feat_df.columns[0])
    plt.ylabel(feat_df.columns[1])
    # Empty scatters create one labeled legend handle per class.
    unique_classes = np.unique(target_series)
    for idx, cls in enumerate(unique_classes):
        plt.scatter([], [], color=dark_cmap(idx), label=f'Class {cls}')
    plt.legend()
    plt.show()
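The snippets that follow assume a train/test split already exists as X_train, X_test, y_train, y_test, together with DataFrame and Series views (X_train_df, y_train_series) of two training features for plotting. Since the original data is not specified, here is a minimal setup sketch using scikit-learn's Iris dataset restricted to two features, purely as an assumption:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Hypothetical setup: two Iris features so the decision regions are plottable in 2-D.
iris = load_iris(as_frame=True)
X = iris.data[['petal length (cm)', 'petal width (cm)']]
y = iris.target
X_train_df, X_test_df, y_train_series, y_test_series = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# Plain arrays for fitting; the DataFrame/Series views feed the plotting helper.
X_train, X_test = X_train_df.to_numpy(), X_test_df.to_numpy()
y_train, y_test = y_train_series.to_numpy(), y_test_series.to_numpy()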
Naive Bayes
Naive Bayes classifiers rely on Bayes' theorem together with the "naive" assumption that every pair of features is conditionally independent given the class. They compute the posterior probability of each class given the feature values and select the class with the maximum posterior.
from sklearn.naive_bayes import GaussianNB
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)
nb_accuracy = nb_model.score(X_test, y_test)
print(f"Accuracy: {nb_accuracy:.2%}")
plot_decision_boundaries(nb_model, X_train_df, y_train_series)
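Since prediction is just an argmax over the posteriors, those posteriors can be inspected directly; a quick sketch, assuming the hypothetical setup above:
# Posterior P(class | x) for the first three test points; predict() takes the argmax.
posteriors = nb_model.predict_proba(X_test[:3])
print(np.round(posteriors, 3))
print("argmax:", posteriors.argmax(axis=1))
print("predict:", nb_model.predict(X_test[:3]))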
Decision Trees
The CART algorithm recursively partitions the feature space into subsets that are as pure as possible with respect to the target variable, using a single feature threshold to split the data at each node.
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
dt_accuracy = dt_model.score(X_test, y_test)
print(f"Accuracy: {dt_accuracy:.2%}")
plot_decision_boundaries(dt_model, X_train_df, y_train_series)
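The learned thresholds make the recursive partitioning explicit and can be printed with export_text; the feature names below assume the two-feature setup sketched earlier:
from sklearn.tree import export_text
# Each internal node tests one feature against a threshold; leaves report the class.
print(export_text(dt_model, feature_names=list(X_train_df.columns)))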
Logistic Regression
Logistic Regression models the probability of an instance belonging to a specific class. During training, it finds the mapping parameters that maximize the likelihood of the observed data. For prediction, it computes the probabilities across all classes and assigns the instance to the class with the highest probability.
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(solver='saga', max_iter=1000)
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_accuracy = lr_model.score(X_test, y_test)
print(f"Accuracy: {lr_accuracy:.2%}")
plot_decision_boundaries(lr_model, X_train_df, y_train_series)
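To make the "highest probability wins" step concrete, the class probabilities can be reproduced from the fitted coefficients. The sketch below applies a multinomial softmax to the linear scores, which should match predict_proba when the model is fit in multinomial mode (the default for the saga solver on multiclass data in recent scikit-learn versions):
from scipy.special import softmax
# Linear scores z = X W^T + b, mapped through softmax to per-class probabilities.
scores = X_test @ lr_model.coef_.T + lr_model.intercept_
manual_proba = softmax(scores, axis=1)
print(np.allclose(manual_proba, lr_model.predict_proba(X_test)))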
Support Vector Machines
In a perfectly separable binary classification scenario, an SVM identifies the hyperplane that maximizes the margin between the closest data points of different classes, known as support vectors.
from sklearn.svm import SVC
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)
svm_accuracy = svm_model.score(X_test, y_test)
print(f"Accuracy: {svm_accuracy:.2%}")
plot_decision_boundaries(svm_model, X_train_df, y_train_series)
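After fitting, the support vectors, the training points that determine the boundary, are exposed as standard SVC attributes:
# Only these points influence the decision function; all others could be removed.
print("support vectors per class:", svm_model.n_support_)
print(svm_model.support_vectors_[:3])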
Ensemble Methods: Random Forest
Random Forest applies bagging: it draws multiple bootstrap samples of the training data (random sampling with replacement) and trains a decision tree on each. In addition, each tree considers only a random subset of the features at every split, which decorrelates the trees before their predictions are aggregated into a robust classifier.
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_accuracy = rf_model.score(X_test, y_test)
print(f"Accuracy: {rf_accuracy:.2%}")
plot_decision_boundaries(rf_model, X_train_df, y_train_series)
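The aggregation step can be reproduced by hand: scikit-learn's RandomForestClassifier averages the per-tree class probabilities rather than taking a hard majority vote. A sketch, assuming the setup above:
# Average each tree's class-probability estimates, then take the argmax.
all_proba = np.stack([tree.predict_proba(X_test) for tree in rf_model.estimators_])
manual_pred = rf_model.classes_[all_proba.mean(axis=0).argmax(axis=1)]
print(np.array_equal(manual_pred, rf_model.predict(X_test)))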
Ensemble Methods: AdaBoost
AdaBoost trains weak classifiers sequentially. After each round, it increases the weights of misclassified instances so subsequent classifiers focus more on difficult cases. The final prediction is a weighted vote of all weak classifiers.
from sklearn.ensemble import AdaBoostClassifier
ada_model = AdaBoostClassifier()
ada_model.fit(X_train, y_train)
ada_predictions = ada_model.predict(X_test)
ada_accuracy = ada_model.score(X_test, y_test)
print(f"Accuracy: {ada_accuracy:.2%}")
plot_decision_boundaries(ada_model, X_train_df, y_train_series)
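Because boosting is sequential, the ensemble's test accuracy can be traced round by round with staged_predict (a standard AdaBoostClassifier method):
from sklearn.metrics import accuracy_score
# Accuracy of the partial ensemble after each boosting round.
staged_acc = [accuracy_score(y_test, pred)
              for pred in ada_model.staged_predict(X_test)]
print(staged_acc[:5], '...', staged_acc[-1])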
Ensemble Methods: Gradient Boosting (GBDT)
Gradient Boosting builds an additive model in a forward stage-wise fashion. Each new decision tree is fit to the negative gradient of the loss with respect to the current ensemble's predictions (the pseudo-residuals, which reduce to ordinary residuals for squared loss), and the trees' outputs are summed for the final prediction.
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
gb_accuracy = gb_model.score(X_test, y_test)
print(f"Accuracy: {gb_accuracy:.2%}")
plot_decision_boundaries(gb_model, X_train_df, y_train_series)
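The stage-wise construction is observable on the fitted model: the train_score_ attribute records the training loss after each stage, so it should decrease as trees are added:
# Training loss at the first few stages versus the final stage.
print(gb_model.train_score_[:5])
print(gb_model.train_score_[-1])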
Advanced Boosting Frameworks
- XGBoost: Improves upon traditional GBDT by using a second-order Taylor expansion of the loss function and explicit regularization of tree complexity, leading to faster convergence and often better accuracy.
- LightGBM: A highly efficient gradient boosting framework optimized for speed and lower memory usage through histogram-based decision tree learning.
- Stacking: A model fusion technique where multiple base models are trained and a second-level meta-learner is trained on their combined predictions to produce the final output; a minimal sketch follows this list.
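Of these, stacking is available directly in scikit-learn. A minimal sketch reusing classifiers from above as base learners, with logistic regression as the meta-learner (the particular base models are an arbitrary choice for illustration):
from sklearn.ensemble import StackingClassifier
# Cross-validated predictions of the base models become the meta-learner's features.
stack_model = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()), ('svm', SVC(probability=True))],
    final_estimator=LogisticRegression(max_iter=1000))
stack_model.fit(X_train, y_train)
print(f"Accuracy: {stack_model.score(X_test, y_test):.2%}")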