Practical Machine Learning Workflows with Scikit-Learn
Environment Setup
Install the core library along with numerical computing dependencies:
pip install scikit-learn numpy
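To confirm the installation succeeded, a quick import check can be run in the active environment (a minimal sketch, not part of the original workflow):

```python
# Verify that both packages import cleanly and report their versions.
import sklearn
import numpy

print(f"scikit-learn version: {sklearn.__version__}")
print(f"NumPy version: {numpy.__version__}")
```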
Data Acquisition and Inspection
Scikit-learn includes several curated datasets for rapid prototyping. The following example loads a multi-class classification dataset and inspects its dimensions:
from sklearn.datasets import load_wine
import numpy as np
wine_data = load_wine()
features = wine_data.data
labels = wine_data.target
print(f"Feature matrix dimensions: {features.shape}")
print(f"Target vector dimensions: {labels.shape}")
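Beyond raw dimensions, it is worth checking how the samples are distributed across classes, since a strong imbalance would influence later choices such as stratified splitting. A short sketch of that inspection:

```python
import numpy as np
from sklearn.datasets import load_wine

wine_data = load_wine()

# Count how many samples fall into each of the three wine cultivars.
classes, counts = np.unique(wine_data.target, return_counts=True)
for cls, count in zip(classes, counts):
    print(f"{wine_data.target_names[cls]}: {count} samples")
```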
Feature Scaling and Partitioning
Numerical features often require standardization (rescaling to zero mean and unit variance) to ensure stable convergence during training. After scaling, the data is divided into training and validation subsets:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
normalizer = StandardScaler()
# Note: fitting the scaler on the full dataset lets validation statistics
# influence the transform; in production, fit on the training split only.
scaled_features = normalizer.fit_transform(features)
X_train, X_val, y_train, y_val = train_test_split(
scaled_features, labels, test_size=0.25, random_state=7, stratify=labels
)
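A quick sanity check confirms what StandardScaler actually does: after the transform, every feature column has (approximately) zero mean and unit standard deviation. A minimal sketch:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

features = load_wine().data
scaled = StandardScaler().fit_transform(features)

# Each column should now center on 0 with a spread of 1.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```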
Supervised Learning Implementations
Regression Workflow
For continuous target prediction, linear models provide a reliable baseline. The diabetes dataset serves as a standard regression benchmark:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
diabetes = load_diabetes()
reg_X_train, reg_X_test, reg_y_train, reg_y_test = train_test_split(
diabetes.data, diabetes.target, test_size=0.2, random_state=10
)
lin_model = LinearRegression()
lin_model.fit(reg_X_train, reg_y_train)
reg_predictions = lin_model.predict(reg_X_test)
mse_val = mean_squared_error(reg_y_test, reg_predictions)
print(f"Regression MSE: {mse_val:.2f}")
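A raw MSE number is hard to interpret in isolation. One way to contextualize it is to compare against a trivial baseline that always predicts the training-set mean, which scikit-learn provides as DummyRegressor; a sketch of that comparison on the same split:

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=10
)

# Baseline that ignores the features and always predicts the mean target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))

model = LinearRegression().fit(X_train, y_train)
model_mse = mean_squared_error(y_test, model.predict(X_test))

print(f"Baseline MSE: {baseline_mse:.2f}")
print(f"Linear model MSE: {model_mse:.2f}")
```

A useful model should land well below the baseline; if it does not, the features carry little linear signal for the target.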
Classification Algorithms
Multiple classifiers can be instantiated and evaluated using a unified loop structure. This approach streamlines comparison across different algorithmic families:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
classifiers = {
"LogReg": LogisticRegression(max_iter=500, solver="lbfgs"),
"SVM": SVC(kernel="rbf", C=1.0),
"Tree": DecisionTreeClassifier(max_depth=4, random_state=7)
}
for name, model in classifiers.items():
model.fit(X_train, y_train)
preds = model.predict(X_val)
acc = accuracy_score(y_val, preds)
print(f"{name} Accuracy: {acc:.3f}")
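The same loop can make the comparison programmatic rather than visual by collecting each score and selecting the top performer. A self-contained sketch on the same wine split:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
X_train, X_val, y_train, y_val = train_test_split(
    X, wine.target, test_size=0.25, random_state=7, stratify=wine.target
)

classifiers = {
    "LogReg": LogisticRegression(max_iter=500),
    "SVM": SVC(kernel="rbf", C=1.0),
    "Tree": DecisionTreeClassifier(max_depth=4, random_state=7),
}

# Collect validation accuracy per model, then pick the best by score.
scores = {}
for name, model in classifiers.items():
    scores[name] = accuracy_score(y_val, model.fit(X_train, y_train).predict(X_val))

best_name = max(scores, key=scores.get)
print(f"Best model: {best_name} ({scores[best_name]:.3f})")
```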
Performance Measurement
Detailed metric reporting helps identify class-specific performance gaps, while regression tasks are summarized with variance-explained metrics. Note that preds below still holds the predictions of the last classifier fitted in the comparison loop above (the decision tree):
from sklearn.metrics import classification_report, r2_score
# Classification breakdown for the most recently fitted classifier
print(classification_report(y_val, preds, target_names=wine_data.target_names))
# Regression goodness-of-fit
r2_val = r2_score(reg_y_test, reg_predictions)
print(f"Coefficient of Determination (R²): {r2_val:.3f}")
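For multi-class problems, a confusion matrix complements the per-class report by showing exactly which classes are mistaken for which. A minimal sketch using a logistic regression on the same wine split (rows are true classes, columns are predictions):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
X_train, X_val, y_train, y_val = train_test_split(
    X, wine.target, test_size=0.25, random_state=7, stratify=wine.target
)

clf = LogisticRegression(max_iter=500).fit(X_train, y_train)

# Off-diagonal entries count misclassifications between specific class pairs.
cm = confusion_matrix(y_val, clf.predict(X_val))
print(cm)
```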
Hyperparameter Optimization and Validation
Systematic parameter search combined with k-fold cross-validation guards against overfitting to a single split and identifies strong model configurations:
from sklearn.model_selection import GridSearchCV, cross_val_score
svm_param_space = {
"C": [0.01, 0.1, 1.0, 5.0],
"gamma": ["scale", "auto"],
"kernel": ["linear", "rbf"]
}
search_agent = GridSearchCV(
estimator=SVC(),
param_grid=svm_param_space,
cv=4,
n_jobs=-1,
refit=True
)
search_agent.fit(X_train, y_train)
print(f"Optimal configuration: {search_agent.best_params_}")
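Beyond best_params_, the fitted search object exposes the mean cross-validated score of the winning configuration, and because refit=True it acts as a ready-to-use estimator trained on the full training set. A self-contained sketch on the wine data (a smaller grid is used here to keep the example quick):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
X_train, X_val, y_train, y_val = train_test_split(
    X, wine.target, test_size=0.25, random_state=7, stratify=wine.target
)

search = GridSearchCV(SVC(), {"C": [0.1, 1.0], "kernel": ["linear", "rbf"]}, cv=4)
search.fit(X_train, y_train)

print(f"Best mean CV accuracy: {search.best_score_:.3f}")
# With refit=True (the default), the search object predicts with the best model.
print(f"Held-out accuracy: {search.score(X_val, y_val):.3f}")
```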
# K-Fold validation baseline
base_model = LogisticRegression(max_iter=500)
fold_scores = cross_val_score(base_model, scaled_features, labels, cv=5, scoring="accuracy")
print(f"Validation folds: {fold_scores}")
print(f"Average CV accuracy: {np.mean(fold_scores):.3f}")
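One caveat: the snippet above scales the entire dataset before cross-validation, so each fold's scaler has already seen that fold's held-out samples. Wrapping the scaler and model in a Pipeline refits the scaler inside every fold and removes this leakage; a sketch:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()

# The pipeline fits StandardScaler on each training fold only, then applies
# the learned statistics to that fold's validation split.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
fold_scores = cross_val_score(pipeline, wine.data, wine.target, cv=5, scoring="accuracy")
print(f"Leakage-free CV accuracy: {np.mean(fold_scores):.3f}")
```

With StandardScaler the practical difference is usually small, but the pipeline pattern generalizes to any preprocessing step that learns from the data.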