Implementing Machine Learning with Random Forest Models in Python
Overview of Random Forest
Random forest is an ensemble learning method suited for both classification and regression tasks. It operates by constructing multiple decision trees during training and merging their outputs for more robust predictions.
Applicability
- Classification: Medical diagnosis, image categorization.
- Regression: Real estate price forecasting, stock trend estimation.
- Feature relevance ranking: Identifies influential variables via importance scores.
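The regression use case can be sketched with scikit-learn's RandomForestRegressor; the snippet below uses synthetic data from make_regression as a stand-in for price records (the data and parameter choices are illustrative, not from the original examples).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic regression data standing in for, e.g., price records
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7)

# Same ensemble idea as classification, but each tree predicts a
# continuous value and the forest averages them
reg = RandomForestRegressor(n_estimators=100, random_state=7)
reg.fit(X_train, y_train)
print(f"R^2 on held-out data: {r2_score(y_test, reg.predict(X_test)):.2f}")
```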
Core Mechanics
- Bootstrap aggregation: Generates varied subsets from the original dataset via random sampling with replacement.
- Tree construction: Trains a distinct decision tree on each subset; at every split, a random feature subset is evaluated rather than all features.
- Aggregation: Combines individual tree outcomes—majority vote for classification, mean value for regression.
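The three steps above can be sketched directly with individual decision trees. This is a toy illustration of the mechanics (the tree count and variable names are arbitrary choices here), not how scikit-learn implements the forest internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X, y = load_iris(return_X_y=True)

# Step 1: bootstrap aggregation - sample rows with replacement per tree.
# Step 2: tree construction - max_features='sqrt' evaluates a random
#         feature subset at every split, as the forest does.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=7)
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: aggregation - majority vote across the individual trees
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy of the toy ensemble:", (majority == y).mean())
```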
Strengths
- Reduced overfitting: Averaging across many trees enhances generalization.
- Parallelizable: Tree building steps are independent and can run concurrently.
- Scalable: Handles high-dimensional input and large volumes efficiently.
- Implicit feature selection: Frequently selected features imply higher predictive power.
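Because tree building is independent, scikit-learn exposes an n_jobs parameter that fits the trees across CPU cores; a minimal sketch (dataset sizes here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=7)

# n_jobs=-1 builds the trees on all available CPU cores in parallel;
# with the same seed, the fitted model matches a single-core fit
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=7)
clf.fit(X, y)
print("number of fitted trees:", len(clf.estimators_))
```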
Limitations
- Low interpretability: Complex ensemble structure obscures single-tree reasoning paths.
- Sensitivity to noise: Performance may degrade with heavily noisy inputs or outliers.
Python Implementation Using scikit-learn
Basic Classification Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Load sample data
dataset = load_iris()
features = dataset.data
labels = dataset.target
# Partition into training and evaluation sets
train_feat, test_feat, train_lbl, test_lbl = train_test_split(
    features, labels, test_size=0.3, random_state=7)
# Initialize classifier with 100 trees
forest_model = RandomForestClassifier(n_estimators=100, random_state=7)
# Fit model on training portion
forest_model.fit(train_feat, train_lbl)
# Generate predictions for test portion
pred_lbl = forest_model.predict(test_feat)
# Compute prediction accuracy
acc = accuracy_score(test_lbl, pred_lbl)
print(f"Model accuracy: {acc:.2f}")
# Display relative feature significance
sig_scores = forest_model.feature_importances_
print("Feature significance scores:", sig_scores)
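A single train/test split can give an optimistic accuracy estimate; a cross-validated check averages over several partitions. The sketch below re-creates the data so it runs on its own (cv=5 is an arbitrary choice here):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=7)

# Stratified 5-fold CV: fit and score on five different partitions
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```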
Extended Workflow with Visualization
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
import numpy as np
# Obtain class probability estimates
prob_matrix = forest_model.predict_proba(test_feat)
# Build confusion matrix
conf_mat = confusion_matrix(test_lbl, pred_lbl)
# Prepare binary label format for ROC analysis
bin_labels = label_binarize(test_lbl, classes=[0, 1, 2])
num_classes = bin_labels.shape[1]
# Compute ROC metrics per class
false_pos = {}
true_pos = {}
roc_area = {}
for idx in range(num_classes):
    false_pos[idx], true_pos[idx], _ = roc_curve(
        bin_labels[:, idx], prob_matrix[:, idx])
    roc_area[idx] = auc(false_pos[idx], true_pos[idx])
# Plot the confusion matrix computed above
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay(confusion_matrix=conf_mat).plot(ax=ax, cmap='Blues')
ax.set_title('Confusion Matrix')
plt.show()
# Plot ROC curves
fig, ax = plt.subplots(figsize=(8, 6))
for idx in range(num_classes):
    ax.plot(false_pos[idx], true_pos[idx],
            label=f'Category {idx} (AUC = {roc_area[idx]:.2f})')
ax.plot([0, 1], [0, 1], '--', color='gray')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves')
ax.legend(loc='lower right')
plt.show()
# Visualize feature importance
group_names = dataset.feature_names
imp_vals = forest_model.feature_importances_
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(group_names, imp_vals, color='lightblue')
ax.set_xlabel('Attributes')
ax.set_ylabel('Significance')
ax.set_title('Attribute Significance Ranking')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
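The per-class AUC loop above can be cross-checked with scikit-learn's roc_auc_score, which performs the one-vs-rest averaging internally. This sketch re-fits the same model so it runs standalone:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)
model = RandomForestClassifier(n_estimators=100, random_state=7).fit(X_tr, y_tr)

# multi_class='ovr' computes one-vs-rest AUC per class, then averages
macro_auc = roc_auc_score(y_te, model.predict_proba(X_te), multi_class='ovr')
print(f"macro-averaged AUC: {macro_auc:.2f}")
```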
Interpretation of Results
- Accuracy metric: Reflects proportion of correct predictions on unseen data; values near 1.0 suggest strong fit but may indicate overfitting on small or imbalanced datasets.
- Significance scores: Quantify each feature's average contribution to impurity reduction across splits; larger values mark stronger discriminative power. In the iris example, petal dimensions dominate classification decisions, whereas sepal measurements contribute less.
- Confusion matrix: Reveals per-class prediction errors; a perfect diagonal indicates flawless classification on the test set.
- ROC curves and AUC: Depict trade-offs between true positive and false positive rates across thresholds; AUC ≈ 1.0 signifies excellent separability per class.
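Impurity-based significance scores can be biased toward high-cardinality features; permutation importance, computed on held-out data, is a common cross-check. A sketch (the repeat count is an arbitrary choice here):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=7)
model = RandomForestClassifier(n_estimators=100, random_state=7).fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the accuracy drop
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=7)
for name, mean_drop in zip(data.feature_names, result.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```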