Implementing Machine Learning with Random Forest Models in Python

Overview of Random Forest

Random forest is an ensemble learning method suited for both classification and regression tasks. It operates by constructing multiple decision trees during training and merging their outputs for more robust predictions.

Applicability

  • Classification: Medical diagnosis, image categorization.
  • Regression: Real estate price forecasting, stock trend estimation (a brief regression sketch follows this list).
  • Feature relevance ranking: Identifies influential variables via importance scores.
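
To ground the regression use case, here is a minimal sketch using scikit-learn's RandomForestRegressor. The synthetic dataset and hyperparameters are illustrative assumptions, not part of the article's main example:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data standing in for, e.g., housing prices
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit a forest of 100 regression trees; predictions are averaged across trees
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)

print(f"Test MSE: {mean_squared_error(y_test, reg.predict(X_test)):.2f}")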

Core Mechanics

  1. Bootstrap aggregation: Generates varied subsets from the original dataset via random sampling with replacement.
  2. Tree construction: Trains a distinct decision tree on each subset; at every split, a random feature subset is evaluated rather than all features.
  3. Aggregation: Combines individual tree outcomes—majority vote for classification, mean value for regression (a minimal sketch of all three steps follows this list).
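
As a toy illustration of these three steps, the following NumPy-only sketch mimics, in simplified form, what the library performs internally. The dimensions and prediction values are made up for demonstration:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 150, 4  # iris-sized toy dimensions

# Step 1: bootstrap sample - draw row indices with replacement
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Step 2: at each split, consider only a random subset of features
# (sqrt(n_features) is a common default for classification)
max_features = int(np.sqrt(n_features))
feature_subset = rng.choice(n_features, size=max_features, replace=False)

# Step 3: aggregate - majority vote across trees (rows: trees, cols: samples)
tree_predictions = np.array([[0, 1, 1],
                             [0, 1, 2],
                             [0, 2, 1]])
ensemble_vote = np.apply_along_axis(
    lambda col: np.bincount(col).argmax(), 0, tree_predictions)
print(ensemble_vote)  # [0 1 1]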

Strengths

  • Reduced overfitting: Averaging across many trees enhances generalization.
  • Parallelizable: Tree building steps are independent and can run concurrently (see the n_jobs sketch after this list).
  • Scalable: Handles high-dimensional input and large volumes efficiently.
  • Implicit feature selection: Frequently selected features imply higher predictive power.
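
The parallelism point is easy to demonstrate: scikit-learn exposes it through the n_jobs parameter. A minimal sketch, also enabling out-of-bag scoring as a built-in generalization estimate (the dataset and settings here are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_jobs=-1 builds trees on all available CPU cores in parallel;
# oob_score=True evaluates each tree on the bootstrap samples it never saw
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1,
                             oob_score=True, random_state=0)
clf.fit(X, y)
print(f"Out-of-bag accuracy: {clf.oob_score_:.2f}")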

Limitations

  • Low interpretability: Complex ensemble structure obscures single-tree reasoning paths.
  • Sensitivity to noise: Performance may degrade with heavily noisy inputs or outliers (see the cross-validation sketch below).
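
One way to gauge how much noise affects a given dataset is to check the spread of cross-validated scores; a wide spread hints at instability. A sketch on synthetic data with deliberately flipped labels (all settings are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with 20% label noise to mimic a noisy real-world dataset
X, y = make_classification(n_samples=500, n_features=20,
                           flip_y=0.2, random_state=0)

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"Accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")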

Python Implementation Using scikit-learn

Basic Classification Example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt  # used by the visualization workflow below

# Load sample data
dataset = load_iris()
features = dataset.data
labels = dataset.target

# Partition into training and evaluation sets
train_feat, test_feat, train_lbl, test_lbl = train_test_split(
    features, labels, test_size=0.3, random_state=7)

# Initialize classifier with 100 trees
forest_model = RandomForestClassifier(n_estimators=100, random_state=7)

# Fit model on training portion
forest_model.fit(train_feat, train_lbl)

# Generate predictions for test portion
pred_lbl = forest_model.predict(test_feat)

# Compute prediction accuracy
acc = accuracy_score(test_lbl, pred_lbl)
print(f"Model accuracy: {acc:.2f}")

# Display relative feature significance
sig_scores = forest_model.feature_importances_
print("Feature significance scores:", sig_scores)

Extended Workflow with Visualization

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

# Obtain class probability estimates
prob_matrix = forest_model.predict_proba(test_feat)

# Build confusion matrix
conf_mat = confusion_matrix(test_lbl, pred_lbl)

# Prepare binary label format for ROC analysis
bin_labels = label_binarize(test_lbl, classes=[0, 1, 2])
num_classes = bin_labels.shape[1]

# Compute ROC metrics per class
false_pos = {}
true_pos = {}
roc_area = {}
for idx in range(num_classes):
    false_pos[idx], true_pos[idx], _ = roc_curve(
        bin_labels[:, idx], prob_matrix[:, idx])
    roc_area[idx] = auc(false_pos[idx], true_pos[idx])

# Plot the confusion matrix computed above
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay(confusion_matrix=conf_mat,
                       display_labels=dataset.target_names).plot(ax=ax, cmap='Blues')
ax.set_title('Confusion Matrix')
plt.show()

# Plot ROC curves
fig, ax = plt.subplots(figsize=(8, 6))
for idx in range(num_classes):
    ax.plot(false_pos[idx], true_pos[idx],
            label=f'Category {idx} (AUC = {roc_area[idx]:.2f})')
ax.plot([0, 1], [0, 1], '--', color='gray')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curves')
ax.legend(loc='lower right')
plt.show()

# Visualize feature importance
group_names = dataset.feature_names
imp_vals = forest_model.feature_importances_
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(group_names, imp_vals, color='lightblue')
ax.set_xlabel('Attributes')
ax.set_ylabel('Significance')
ax.set_title('Attribute Significance Ranking')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Interpretation of Results

  • Accuracy metric: Reflects the proportion of correct predictions on unseen data; values near 1.0 suggest a strong fit but may signal overfitting on small or imbalanced datasets (a per-class report sketch follows this list).
  • Significance scores: Quantify how much each feature reduces impurity across the ensemble's splits; larger values mark stronger discriminative power. In the iris example, petal dimensions dominate classification decisions, whereas sepal measurements contribute less.
  • Confusion matrix: Reveals per-class prediction errors; a perfect diagonal indicates flawless classification on the test set.
  • ROC curves and AUC: Depict trade-offs between true positive and false positive rates across thresholds; AUC ≈ 1.0 signifies excellent separability per class.
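
Accuracy alone can hide per-class behavior. As a complement to the metrics above, scikit-learn's classification_report summarizes precision, recall, and F1 for each class, building directly on the variables from the examples above:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(test_lbl, pred_lbl,
                            target_names=dataset.target_names))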
