A Comprehensive Guide to Machine Learning Model Evaluation Metrics

Tech May 18 3

Accuracy

Accuracy is the most straightforward evaluation metric, calculated as the proportion of correct predictions out of the total number of predictions. However, it has notable limitations:

When class distribution is imbalanced, accuracy can be misleading. For instance, if 99% of samples belong to the positive class, a model that simply predicts all samples as positive achieves 99% accuracy.
It provides a general overview and may not reflect performance on specific classes of interest. In applications like search, we might care more about the precision of retrieved items or the recall of relevant items.

The error rate, the proportion of incorrectly classified samples, is the complement of accuracy.

from sklearn.metrics import accuracy_score

# Example data
predictions = [0, 0, 1, 1]
actuals = [1, 0, 1, 0]

# Calculate accuracy
accuracy = accuracy_score(actuals, predictions)
print(f"Accuracy: {accuracy:.2f}")  # Output: 0.50
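The imbalance caveat above can be checked with a few lines of plain Python. This is a minimal sketch on synthetic labels (99 positives, 1 negative); the variable names are illustrative and not part of any library.

```python
# Synthetic, illustrative data: class 1 is the majority (99%), class 0 the minority
actuals = [1] * 99 + [0]
# A trivial "model" that always predicts the majority class
predictions = [1] * 100

correct = sum(p == a for p, a in zip(predictions, actuals))
accuracy = correct / len(actuals)
error_rate = 1 - accuracy  # complement of accuracy

# Recall on the minority class: share of actual 0s that were predicted as 0
minority_total = sum(a == 0 for a in actuals)
minority_found = sum(p == 0 and a == 0 for p, a in zip(predictions, actuals))
minority_recall = minority_found / minority_total

print(f"Accuracy: {accuracy:.2f}, error rate: {error_rate:.2f}")
print(f"Recall on minority class: {minority_recall:.2f}")
```

Accuracy looks excellent here even though the model never identifies a single minority-class sample, which is why per-class metrics are worth reporting alongside it.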

Precision, Recall, and F1 Score

Precision (also known as positive predictive value) measures the proportion of true positives among all positive predictions. Recall (also known as sensitivity or true positive rate) measures the proportion of true positives among all actual positives.

In ranking problems, Precision@N and Recall@N are commonly used, focusing on the top N results.
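As a rough sketch of how Precision@N and Recall@N might be computed for one query, assuming the ranked results and the set of relevant items are both available as plain Python collections (the function names and document ids below are hypothetical):

```python
from typing import List, Set

def precision_at_n(ranked_items: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n ranked items that are relevant."""
    top_n = ranked_items[:n]
    if not top_n:
        return 0.0
    hits = sum(1 for item in top_n if item in relevant)
    # Divides by the number of items actually returned; some definitions divide by n
    return hits / len(top_n)

def recall_at_n(ranked_items: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of all relevant items that appear in the top n."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:n] if item in relevant)
    return hits / len(relevant)

# Illustrative query: five ranked results, three relevant documents overall
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

p_at_3 = precision_at_n(ranked, relevant, 3)  # d3 and d1 are hits among 3 results
r_at_3 = recall_at_n(ranked, relevant, 3)     # 2 of the 3 relevant documents found
print(f"Precision@3: {p_at_3:.2f}, Recall@3: {r_at_3:.2f}")
```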

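Precision and recall are often in tension: lowering the decision threshold labels more samples as positive, which tends to raise recall but admits more false positives and so lowers precision. The trade-off exists because real data is noisy, so the scores of the two classes overlap and no single threshold separates them cleanly.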

These metrics are derived from the confusion matrix:

Actual	Predicted Positive	Predicted Negative
Positive	True Positive (TP)	False Negative (FN)
Negative	False Positive (FP)	True Negative (TN)

Precision (P) and Recall (R) are calculated from these counts as:

P = TP / (TP + FP)
R = TP / (TP + FN)

The F1 score provides a single metric that balances precision and recall, calculated as the harmonic mean of the two:

F1 = 2 · P · R / (P + R)

The F-beta score generalizes this by weighting recall β times as heavily as precision:

F_β = (1 + β²) · P · R / (β² · P + R)

β > 1: Emphasizes recall
β < 1: Emphasizes precision
β = 1: Equal weight (F1 score)
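The effect of β can be seen with a small sketch. It assumes the standard definition F_β = (1 + β²)·P·R / (β²·P + R); the helper name f_beta and the precision and recall values are illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score from precision and recall; returns 0.0 when both are zero."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator

# Illustrative model with low precision but perfect recall
precision, recall = 0.5, 1.0

f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)        # leans toward recall
f_half = f_beta(precision, recall, beta=0.5)    # leans toward precision

print(f"F0.5: {f_half:.3f}, F1: {f1:.3f}, F2: {f2:.3f}")
```

Because recall is the stronger of the two values in this example, the score rises as β grows and falls as β shrinks.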

For multi-class problems or multiple evaluations, macro and micro averaging can be used:

Macro: Calculate metrics for each class and average
Micro: Aggregate contributions of all classes to compute the average metric
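The difference between the two averaging schemes shows up most clearly when classes differ in size. The following sketch uses hypothetical per-class (TP, FP) counts for a large class A and a small class B, and computes precision both ways.

```python
from typing import Dict, Tuple

# Hypothetical per-class counts: class name -> (true positives, false positives)
counts: Dict[str, Tuple[int, int]] = {
    "A": (90, 10),  # large class, precision 0.9
    "B": (1, 9),    # small class, precision 0.1
}

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# Macro: compute precision per class, then take the unweighted mean
macro_precision = sum(precision(tp, fp) for tp, fp in counts.values()) / len(counts)

# Micro: pool the counts across classes, then compute precision once
total_tp = sum(tp for tp, _ in counts.values())
total_fp = sum(fp for _, fp in counts.values())
micro_precision = precision(total_tp, total_fp)

print(f"Macro precision: {macro_precision:.3f}")
print(f"Micro precision: {micro_precision:.3f}")
```

Macro averaging treats every class equally, so the weak small class pulls the result down; micro averaging is dominated by the class that contributes the most predictions.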

from typing import List, Tuple
import matplotlib.pyplot as plt

def compute_confusion_matrix(
    predictions: List[int], 
    actuals: List[int]
) -> Tuple[int, int, int, int]:
    """Calculate TP, FP, TN, FN from predictions and actuals."""
    length = len(predictions)
    assert length == len(actuals)
    tp, fp, fn, tn = 0, 0, 0, 0
    for i in range(length):
        if predictions[i] == actuals[i] == 1:
            tp += 1
        elif predictions[i] == actuals[i] == 0:
            tn += 1
        elif predictions[i] == 1 and actuals[i] == 0:
            fp += 1
        elif predictions[i] == 0 and actuals[i] == 1:
            fn += 1
    return tp, fp, tn, fn

def calculate_precision(tp: int, fp: int) -> float:
    """Calculate precision."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def calculate_recall(tp: int, fn: int) -> float:
    """Calculate recall."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def generate_pr_curve(
    predicted_probs: List[float], 
    actuals: List[int]
) -> Tuple[List[float], List[float]]:
    """Generate precision-recall curve points."""
    precisions = [1.0]
    recalls = [0.0]
    sorted_indices = sorted(range(len(predicted_probs)), key=lambda i: predicted_probs[i], reverse=True)
    
    for i in range(1, len(sorted_indices) + 1):
        threshold = predicted_probs[sorted_indices[i-1]]
        binary_preds = [1 if prob >= threshold else 0 for prob in predicted_probs]
        tp, fp, tn, fn = compute_confusion_matrix(binary_preds, actuals)
        precision = calculate_precision(tp, fp)
        recall = calculate_recall(tp, fn)
        precisions.append(precision)
        recalls.append(recall)
    
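    # No extra end point is added: the last threshold already reaches recall 1.0,
    # where precision equals the share of positives in the data rather than 0.
    return precisions, recalls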

# Example data
predicted_probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
                  0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]
actuals = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0]

precisions, recalls = generate_pr_curve(predicted_probs, actuals)

# Plot PR curve
plt.figure(figsize=(12, 5))
plt.plot(recalls, precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()

Root Mean Square Error (RMSE)

RMSE is a common metric for regression models, measuring the square root of the average squared differences between predicted and actual values. It's sensitive to outliers, as large errors are squared and thus have a disproportionate impact.

When outliers are problematic:

Remove outliers if they are noise
Re-model if outliers are valid samples
Consider alternative metrics like MAPE (Mean Absolute Percentage Error), which normalizes errors and is less affected by outliers
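To illustrate the outlier sensitivity, here is a minimal sketch comparing RMSE and MAPE on made-up values, once without and once with a single large miss. The helper names are illustrative, and this version of MAPE assumes no actual value is zero.

```python
import math
from typing import List

def rmse(actuals: List[float], predictions: List[float]) -> float:
    """Square root of the mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actuals: List[float], predictions: List[float]) -> float:
    """Mean absolute percentage error as a fraction; undefined if any actual is 0."""
    ratios = [abs((a - p) / a) for a, p in zip(actuals, predictions)]
    return sum(ratios) / len(ratios)

# Made-up data: small errors everywhere
clean_actuals = [100.0, 100.0, 100.0, 100.0]
clean_preds = [98.0, 102.0, 99.0, 101.0]

# Same idea, but the last sample is large and badly predicted
outlier_actuals = [100.0, 100.0, 100.0, 1000.0]
outlier_preds = [98.0, 102.0, 99.0, 500.0]

rmse_clean = rmse(clean_actuals, clean_preds)
rmse_outlier = rmse(outlier_actuals, outlier_preds)
mape_clean = mape(clean_actuals, clean_preds)
mape_outlier = mape(outlier_actuals, outlier_preds)

print(f"RMSE: {rmse_clean:.2f} -> {rmse_outlier:.2f}")
print(f"MAPE: {mape_clean:.3f} -> {mape_outlier:.3f}")
```

One bad prediction multiplies RMSE by more than a hundred here, while MAPE grows by roughly a factor of nine because each error is scaled by the size of its actual value.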

ROC and AUC

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. TPR is the same as recall, while FPR is the proportion of negatives incorrectly identified as positives.

def calculate_fpr(fp: int, tn: int) -> float:
    """Calculate false positive rate."""
    return fp / (fp + tn) if (fp + tn) > 0 else 0.0

def calculate_tpr(tp: int, fn: int) -> float:
    """Calculate true positive rate."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def generate_roc_curve(
    predicted_probs: List[float], 
    actuals: List[int]
) -> Tuple[List[float], List[float]]:
    """Generate ROC curve points."""
    fprs = [0.0]
    tprs = [0.0]
    sorted_indices = sorted(range(len(predicted_probs)), key=lambda i: predicted_probs[i], reverse=True)
    
    for i in range(1, len(sorted_indices) + 1):
        threshold = predicted_probs[sorted_indices[i-1]]
        binary_preds = [1 if prob >= threshold else 0 for prob in predicted_probs]
        tp, fp, tn, fn = compute_confusion_matrix(binary_preds, actuals)
        fpr = calculate_fpr(fp, tn)
        tpr = calculate_tpr(tp, fn)
        fprs.append(fpr)
        tprs.append(tpr)
    
    fprs.append(1.0)
    tprs.append(1.0)
    return fprs, tprs

# Generate and plot ROC curve
fprs, tprs = generate_roc_curve(predicted_probs, actuals)
plt.figure(figsize=(12, 5))
plt.plot(fprs, tprs)
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line for random model
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

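The Area Under the ROC Curve (AUC) summarizes the ROC curve in a single value. AUC ranges from 0 to 1: 0.5 corresponds to a random model and 1.0 to a perfect one, while a value below 0.5 means the model ranks worse than random (inverting its predictions would push it above 0.5). A higher AUC indicates that the model is better at ranking positive samples above negative ones.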
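The ranking interpretation suggests a direct way to compute AUC: count, over all positive-negative pairs, how often the positive sample gets the higher score. The sketch below does exactly that on toy data; the function name is illustrative, ties are counted as half a win, and the quadratic loop is only meant for small inputs.

```python
from typing import List

def auc_by_ranking(scores: List[float], labels: List[int]) -> float:
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy scores with one positive ranked below a negative
scores = [0.9, 0.8, 0.7, 0.6]
auc_mixed = auc_by_ranking(scores, [1, 0, 1, 0])    # 3 of 4 pairs ordered correctly
auc_perfect = auc_by_ranking(scores, [1, 1, 0, 0])  # every positive outranks every negative

print(f"Mixed ranking AUC: {auc_mixed:.2f}")
print(f"Perfect ranking AUC: {auc_perfect:.2f}")
```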

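Unlike PR curves, ROC curves are relatively insensitive to class imbalance: TPR and FPR are each computed within a single class, so the shape of the curve changes little when the ratio of positives to negatives shifts. This makes ROC curves suitable for comparing models on data with uneven or changing class ratios, although a PR curve can be more informative when performance on a rare positive class is the main concern.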

Kolmogorov-Smirnov (KS) Statistic

The KS statistic measures the maximum difference between two cumulative distribution functions. In model evaluation, it's used to assess the separation between positive and negative class distributions.

A high KS value indicates good model discrimination, meaning the model effectively separates positive and negative cases. However, extremely high values may indicate overfitting, especially in domains like credit risk where labels have inherent ambiguity.

from scipy import stats
import numpy as np

# Generate sample data (fixed seeds so the run is reproducible)
sample1 = stats.norm.rvs(size=200, loc=0., scale=1, random_state=0)
sample2 = stats.norm.rvs(size=300, loc=0.5, scale=1.5, random_state=1)

# Perform KS test
ks_result = stats.ks_2samp(sample1, sample2)
print(f"KS statistic: {ks_result.statistic:.3f}, p-value: {ks_result.pvalue:.3e}")

# Critical value at 0.05 significance level
critical_value = 1.358 * np.sqrt((200 + 300) / (200 * 300))
print(f"Critical value: {critical_value:.3f}")

# Plot distributions
plt.figure(figsize=(12, 5))
plt.hist(sample1, density=True, histtype='stepfilled', alpha=0.2, color='red', label='Sample 1')
plt.hist(sample2, density=True, histtype='stepfilled', alpha=0.2, color='blue', label='Sample 2')
plt.legend()
plt.title('Distribution Comparison')
plt.show()

In practice, KS values between 0.2 and 0.75 are often considered reasonable, with values outside this range potentially indicating underfitting or overfitting.
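For a binary classifier, the KS statistic is commonly computed as the largest gap between the cumulative true positive rate and false positive rate as the threshold sweeps from the highest score to the lowest. The following is a minimal sketch under that assumption; the function name and the toy scores are illustrative.

```python
from typing import List

def ks_statistic(scores: List[float], labels: List[int]) -> float:
    """Maximum |TPR - FPR| over all score thresholds (labels are 0 or 1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("KS needs at least one positive and one negative sample")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Consume all samples that share the same score before measuring the gap
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        best = max(best, abs(tp / n_pos - fp / n_neg))
    return best

# Toy model scores and true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

ks = ks_statistic(scores, labels)
print(f"KS: {ks:.3f}")
```

On the same data this should agree with a two-sample KS test applied to the scores of the positive class versus the scores of the negative class.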

Scorecard Models

Scorecard models are linear models (in credit scoring, usually logistic regression) designed for interpretability and stability:

High feature coverage and stability
Clear interpretability of feature weights
Weights can be set based on expert knowledge or feature discrimination power (e.g., KS/IV values)
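One way to picture the interpretability of a scorecard is as an additive lookup table of points. The sketch below is purely illustrative: the base score, features, bins, and point values are all hypothetical and not taken from any real scorecard.

```python
from typing import Dict

# Hypothetical scorecard: every value here is made up for illustration
BASE_SCORE = 600
POINTS: Dict[str, Dict[str, int]] = {
    "age_bin": {"young": -20, "middle": 10, "senior": 15},
    "income_bin": {"low": -30, "medium": 0, "high": 25},
}

def contributions(applicant: Dict[str, str]) -> Dict[str, int]:
    """Points contributed by each feature, which is what makes the score explainable."""
    return {feature: table[applicant[feature]] for feature, table in POINTS.items()}

def score(applicant: Dict[str, str]) -> int:
    """Total score: base score plus the points of each feature's bin."""
    return BASE_SCORE + sum(contributions(applicant).values())

applicant = {"age_bin": "middle", "income_bin": "high"}
total = score(applicant)
breakdown = contributions(applicant)

print(f"Score: {total}")
print(f"Breakdown: {breakdown}")
```

Because the total is a plain sum, each feature's effect on the final score can be read directly from the breakdown.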

6.1 Non-linear Feature Handling

Two common approaches:

WOE (Weight of Evidence): Transforms categorical features based on the log ratio of good to bad event distributions within each category.
Binning: Converts continuous features into discrete categories to capture non-linear relationships.
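The two approaches are often combined: a continuous feature is binned first, and each bin is then replaced by its WOE value. The following sketch assumes the convention label 0 = good and 1 = bad, uses made-up ages and labels, and picks arbitrary bin edges; real code would also need smoothing for bins that contain no goods or no bads.

```python
import math
from typing import Dict, List

def bin_age(age: int) -> str:
    """Illustrative binning of a continuous feature into three categories."""
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"

def woe_by_bin(bins: List[str], labels: List[int]) -> Dict[str, float]:
    """WOE per bin = ln(share of all goods in the bin / share of all bads in the bin)."""
    total_good = labels.count(0)
    total_bad = labels.count(1)
    result = {}
    for name in sorted(set(bins)):
        good = sum(1 for b, y in zip(bins, labels) if b == name and y == 0)
        bad = sum(1 for b, y in zip(bins, labels) if b == name and y == 1)
        # Assumes good > 0 and bad > 0 in every bin; otherwise the log is undefined
        result[name] = math.log((good / total_good) / (bad / total_bad))
    return result

# Made-up data: label 0 = good, 1 = bad
ages = [22, 25, 28, 35, 41, 47, 55, 62, 68, 33, 26, 58]
labels = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1]

bins = [bin_age(a) for a in ages]
woe = woe_by_bin(bins, labels)
for name, value in woe.items():
    print(f"{name}: WOE = {value:.3f}")
```

A positive WOE marks a bin where goods are over-represented, a negative value marks one where bads are, and a value near zero means the bin carries little information.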

6.2 Cross-feature Interactions

Interactions between features can be handled by:

Customer segmentation and separate modeling for each segment
Explicitly creating interaction terms
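As a small sketch of the second option, an interaction term can be added as the product of two existing features so that a linear model can assign it its own weight. The feature names and the helper add_interaction are hypothetical.

```python
from typing import Dict, List

def add_interaction(rows: List[Dict[str, float]], a: str, b: str) -> List[Dict[str, float]]:
    """Return copies of the rows with an extra feature equal to rows[a] * rows[b]."""
    name = f"{a}_x_{b}"
    return [{**row, name: row[a] * row[b]} for row in rows]

# Illustrative binary features
rows = [
    {"is_new_customer": 1.0, "high_utilization": 1.0},
    {"is_new_customer": 1.0, "high_utilization": 0.0},
    {"is_new_customer": 0.0, "high_utilization": 1.0},
]

augmented = add_interaction(rows, "is_new_customer", "high_utilization")
for row in augmented:
    print(row)
```

The new column is 1 only when both conditions hold, which lets the model treat that combination differently from the sum of the two individual effects.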

Tags: machine-learning

Copyright © fadingcoder.top
