Scikit-learn Classification Algorithms: Implementation and Optimization
Supervised Learning Paradigms
Supervised learning algorithms map input features to known target variables. When the target consists of discrete categories, the task is defined as classification. When the target is continuous, it is a regression task. Unsupervised learning operates on data lacking target labels, aiming to discover inherent structures like clusters.
Dataset Management in Scikit-learn
Scikit-learn provides built-in utilities for acquiring benchmark datasets via sklearn.datasets. Small datasets ship with the library and are loaded with load_* functions, while larger ones are downloaded on demand with fetch_*. Both return a Bunch object, essentially an extended dictionary containing data (the feature matrix), target (the labels), and metadata such as feature_names.
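For example, the Bunch returned by load_iris exposes its contents as attributes:
from sklearn import datasets
iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 features
print(iris.target.shape)    # (150,): one label per sample
print(iris.feature_names)   # ['sepal length (cm)', 'sepal width (cm)', ...]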
Before training, datasets must be partitioned into distinct training and testing subsets. The train_test_split function handles this separation, ensuring the model is evaluated on unseen data.
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Loading Iris dataset
iris_data = datasets.load_iris()
features_train, features_test, target_train, target_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=42
)
Estimators and Transformers
Scikit-learn APIs follow a uniform paradigm. Transformers preprocess data: fit learns parameters (such as a column's mean and standard deviation) and transform applies them, with fit_transform combining the two steps. Estimators implement learning algorithms: every estimator exposes a fit method for training on data and a predict method for generating inferences.
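A minimal sketch of both interfaces, with toy arrays standing in for real data:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Transformer: fit learns each column's mean and std, transform applies them
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])
y = np.array([0, 0, 1, 1])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # fit_transform = fit followed by transform
# Estimator: fit trains the model, predict generates inferences
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_scaled, y)
predictions = model.predict(X_scaled)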
K-Nearest Neighbors (KNN)
KNN classifies a sample by the majority class among its k closest neighbors in feature space, with proximity typically measured by Euclidean distance. While intuitive, it is computationally expensive at prediction time (every training point must be compared) and requires feature scaling so that large-valued dimensions do not dominate the distance.
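For intuition, the Euclidean distance between two feature vectors is the square root of the summed squared coordinate differences:
import numpy as np
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
distance = np.sqrt(np.sum((a - b) ** 2))   # sqrt(9 + 16) = 5.0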
We demonstrate KNN on location check-in data, engineering temporal features and filtering out infrequent locations to reduce noise.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Load and filter spatial data (copy to avoid chained-assignment warnings)
location_df = pd.read_csv('FBlocation/train.csv')
filtered_df = location_df.query("x < 1.0 & y < 1.0").copy()
# Extract temporal features
timestamp_series = pd.to_datetime(filtered_df['time'], unit='s')
filtered_df['weekday'] = pd.DatetimeIndex(timestamp_series).weekday
processed_df = filtered_df.drop(['time'], axis=1)
# Filter out rare check-in places
place_frequency = processed_df.groupby('place_id').count()
frequent_places = place_frequency[place_frequency.row_id > 400].reset_index()
final_df = processed_df[processed_df['place_id'].isin(frequent_places.place_id)]
# Separate features and target (row_id is an identifier, not a predictive feature)
target_vals = final_df['place_id']
feature_vals = final_df.drop(['place_id', 'row_id'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(feature_vals, target_vals, test_size=0.25)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN classifier
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_scaled, y_train)
knn_accuracy = knn_model.score(X_test_scaled, y_test)  # renamed to avoid shadowing sklearn.metrics.accuracy_score
Naive Bayes Classification
Naive Bayes relies on Bayes' theorem, assuming that features are conditionally independent given the class. It computes the posterior probability of each class for the input features and predicts the most probable one. The Multinomial variant is well-suited for discrete count-like data such as text, frequently paired with TF-IDF vectorization. Laplace smoothing (controlled by the alpha parameter) prevents zero probabilities for vocabulary unseen in a class during training.
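For instance, with alpha = 1.0 a word that never appears in a class's training text still receives a small nonzero likelihood; a toy computation with invented counts:
# MultinomialNB smoothing: P(w|c) = (count(w,c) + alpha) / (count(c) + alpha * |V|)
alpha = 1.0
word_count = 0        # occurrences of the word within this class
class_total = 100     # total word occurrences in this class
vocab_size = 50       # distinct words in the training vocabulary
smoothed_p = (word_count + alpha) / (class_total + alpha * vocab_size)  # 1/150 ≈ 0.0067, not 0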
from sklearn import datasets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
# Fetch news groups dataset
news_corpus = datasets.fetch_20newsgroups(subset='all')
X_train, X_test, y_train, y_test = train_test_split(
    news_corpus.data, news_corpus.target, test_size=0.2
)
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Train Multinomial Naive Bayes
nb_classifier = MultinomialNB(alpha=1.0)
nb_classifier.fit(X_train_vec, y_train)
prediction_accuracy = nb_classifier.score(X_test_vec, y_test)
Classification Evaluation Metrics
Simple accuracy can be deceptive, particularly with imbalanced class distributions. A confusion matrix decomposes predictions into True Positives, True Negatives, False Positives, and False Negatives.
- Precision: The proportion of correctly predicted positive observations to the total predicted positives (measure of exactness).
- Recall: The proportion of correctly predicted positive observations to all actual positives (measure of completeness).
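In confusion-matrix terms these are simple ratios; a worked example with invented counts:
# 40 true positives, 10 false positives, 20 false negatives
TP, FP, FN = 40, 10, 20
precision = TP / (TP + FP)   # 40 / 50 = 0.80
recall = TP / (TP + FN)      # 40 / 60 ≈ 0.67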
The classification_report utility provides a per-class breakdown of precision, recall, F1 score, and support.
from sklearn.metrics import classification_report
# Generate predictions and evaluate
y_predictions = knn_model.predict(X_test_scaled)
report_output = classification_report(y_test, y_predictions)
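The raw counts behind these metrics are available from confusion_matrix (rows are true classes, columns are predictions):
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_predictions)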
Hyperparameter Tuning with Grid Search
Model performance heavily relies on hyperparameters (e.g., the 'k' value in KNN). GridSearchCV systematically evaluates predefined parameter combinations using cross-validation. It divides the training set into 'cv' folds, iterates through parameter grids, and identifies the optimal configuration.
from sklearn.model_selection import GridSearchCV
# Define parameter grid for KNN
param_grid = {'n_neighbors': [3, 5, 7, 9]}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid_search.fit(X_train_scaled, y_train)
optimal_k = grid_search.best_estimator_.n_neighbors
best_cv_score = grid_search.best_score_
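Because GridSearchCV refits the best configuration on the full training set by default (refit=True), the tuned model can be scored directly on the held-out data:
test_score = grid_search.score(X_test_scaled, y_test)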
Decision Trees
Decision trees partition the feature space into distinct regions based on feature thresholds, producing a flowchart-like model. Node splits are governed by metrics such as Information Gain (entropy) or Gini Impurity. They require minimal data preprocessing but are susceptible to overfitting if grown too deeply.
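For intuition, the Gini impurity of a two-class node with class proportions p and 1 - p is 1 - p^2 - (1 - p)^2; a quick computation:
p = 0.8
gini = 1 - p**2 - (1 - p)**2   # 1 - 0.64 - 0.04 = 0.32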
We apply a Decision Tree to the Titanic dataset, handling missing values and one-hot encoding categorical variables via dictionary vectorization.
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd
# Load Titanic data
titanic_df = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")
features = titanic_df[['age', 'sex', 'pclass']].copy()  # copy to allow safe column edits
target = titanic_df['survived']
# Impute missing ages with the column mean (avoid chained inplace fillna)
features['age'] = features['age'].fillna(features['age'].mean())
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
# Dictionary vectorization for categorical data
train_dicts = X_train.to_dict(orient='records')
test_dicts = X_test.to_dict(orient='records')
dict_vectorizer = DictVectorizer(sparse=False)
X_train_encoded = dict_vectorizer.fit_transform(train_dicts)
X_test_encoded = dict_vectorizer.transform(test_dicts)
# Train Decision Tree
tree_model = DecisionTreeClassifier(max_depth=5, criterion='gini')
tree_model.fit(X_train_encoded, y_train)
# Export tree visualization
export_graphviz(tree_model, out_file='tree.dot',
                feature_names=dict_vectorizer.get_feature_names_out())
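The single tree's held-out accuracy can be checked with score, and the exported .dot file rendered with the Graphviz command-line tool (assuming Graphviz is installed):
# Evaluate the tree on the encoded test set
tree_accuracy = tree_model.score(X_test_encoded, y_test)
# Render the visualization from a shell: dot -Tpng tree.dot -o tree.png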
Random Forests
Random Forests overcome decision tree overfitting through ensemble learning. Each tree is trained on a bootstrap sample of the training data and considers only a random subset of features at each split. The final classification aggregates the individual tree predictions by majority vote, substantially improving generalization and robustness.
from sklearn.ensemble import RandomForestClassifier
# Train Random Forest on the encoded Titanic features
forest_model = RandomForestClassifier(n_estimators=50, max_depth=5, bootstrap=True)
forest_model.fit(X_train_encoded, y_train)
forest_accuracy = forest_model.score(X_test_encoded, y_test)
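The forest's hyperparameters can be tuned with the same grid search shown earlier; a minimal sketch, with illustrative candidate values:
from sklearn.model_selection import GridSearchCV
forest_params = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 8]}
forest_search = GridSearchCV(RandomForestClassifier(), forest_params, cv=3)
forest_search.fit(X_train_encoded, y_train)
best_forest_params = forest_search.best_params_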