A Comprehensive Guide to Machine Learning Model Evaluation Metrics
Accuracy
Accuracy is the most straightforward evaluation metric, calculated as the proportion of correct predictions out of the total number of predictions. However, it has notable limitations:
- When class distribution is imbalanced, accuracy can be misleading. For instance, if 99% of samples belong to the positive class, a model that simply predicts all samples as positive achieves 99% accuracy.
- It provides a general overview and may not reflect performance on specific classes of interest. In applications like search, we might care more about the precision of retrieved items or the recall of relevant items.
The error rate, the proportion of incorrectly classified samples, is the complement of accuracy.
from sklearn.metrics import accuracy_score
# Example data
predictions = [0, 0, 1, 1]
actuals = [1, 0, 1, 0]
# Calculate accuracy
accuracy = accuracy_score(actuals, predictions)
print(f"Accuracy: {accuracy:.2f}")  # Output: Accuracy: 0.50
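The imbalance pitfall described above can be reproduced in a few lines. This is a minimal sketch on made-up labels (990 positives, 10 negatives); the helper name `accuracy` and the variable names are illustrative and not part of any library.

```python
def accuracy(actuals, predictions):
    """Fraction of positions where the prediction equals the actual label."""
    correct = sum(1 for a, p in zip(actuals, predictions) if a == p)
    return correct / len(actuals)

# Hypothetical imbalanced dataset: 99% positive, 1% negative.
imbalanced_actuals = [1] * 990 + [0] * 10

# A "model" that ignores its input and always predicts the positive class.
always_positive = [1] * len(imbalanced_actuals)

overall = accuracy(imbalanced_actuals, always_positive)

# Restricting the same calculation to the negative class exposes the problem.
negative_pairs = [(a, p) for a, p in zip(imbalanced_actuals, always_positive) if a == 0]
negative_accuracy = sum(1 for a, p in negative_pairs if a == p) / len(negative_pairs)

print(f"Overall accuracy: {overall:.2f}")
print(f"Accuracy on negatives: {negative_accuracy:.2f}")
```

The overall figure looks excellent even though not a single negative sample is identified, which is why per-class metrics are worth checking alongside accuracy.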
Precision, Recall, and F1 Score
Precision (also known as positive predictive value) measures the proportion of true positives among all positive predictions. Recall (also known as sensitivity or true positive rate) measures the proportion of true positives among all actual positives.
In ranking problems, Precision@N and Recall@N are commonly used, focusing on the top N results.
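As a rough illustration of Precision@N and Recall@N, the sketch below assumes the results are already sorted by score and that relevance is given as 0/1 labels. The function name `precision_recall_at_n` and the sample ranking are hypothetical.

```python
def precision_recall_at_n(ranked_relevance, n, total_relevant=None):
    """Return (Precision@N, Recall@N) for 0/1 relevance labels sorted best-first.

    total_relevant defaults to the number of relevant items in the list.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if total_relevant is None:
        total_relevant = sum(ranked_relevance)
    hits = sum(ranked_relevance[:n])  # relevant items among the top n
    precision = hits / n
    recall = hits / total_relevant if total_relevant > 0 else 0.0
    return precision, recall

# Hypothetical ranking: 1 = relevant, 0 = not relevant, highest score first.
ranked = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
p_at_5, r_at_5 = precision_recall_at_n(ranked, 5)
print(f"Precision@5: {p_at_5:.2f}, Recall@5: {r_at_5:.2f}")
```

Here three of the top five results are relevant, out of four relevant items overall.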
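Precision and recall are often in tension: improving one tends to reduce the other. Lowering the decision threshold labels more samples as positive, which raises recall but usually admits more false positives and so lowers precision. The trade-off exists because real data is noisy: the scores of positive and negative samples overlap, so no threshold separates them perfectly.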
These metrics are derived from the confusion matrix:
| Actual | Predicted Positive | Predicted Negative |
|---|---|---|
| Positive | True Positive (TP) | False Negative (FN) |
| Negative | False Positive (FP) | True Negative (TN) |
Precision (P) and Recall (R) are calculated as:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
The F1 score provides a single metric that balances precision and recall, calculated as the harmonic mean of the two:

$$F_1 = \frac{2 P R}{P + R}$$
The F-beta score generalizes this, allowing different weights for precision and recall:

$$F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}$$
- β > 1: Emphasizes recall
- β < 1: Emphasizes precision
- β = 1: Equal weight (F1 score)
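The effect of β can be checked numerically. The sketch below assumes the standard definition F_β = (1 + β²)·P·R / (β²·P + R); the function name `f_beta` and the sample precision and recall values are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; returns 0.0 when precision and recall are both 0."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0

# Hypothetical model with lower precision than recall.
p, r = 0.5, 0.8
f1 = f_beta(p, r, 1.0)      # equal weight
f2 = f_beta(p, r, 2.0)      # emphasizes recall
f_half = f_beta(p, r, 0.5)  # emphasizes precision

print(f"F0.5: {f_half:.3f}, F1: {f1:.3f}, F2: {f2:.3f}")
```

Because recall is the stronger of the two values in this example, the score rises as β grows and falls as β shrinks.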
For multi-class problems or multiple evaluations, macro and micro averaging can be used:
- Macro: Calculate the metric for each class separately, then take the unweighted mean across classes
- Micro: Pool the TP, FP, and FN counts of all classes, then compute the metric once from the pooled counts
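The difference between the two averaging schemes is easiest to see on a small example. The following sketch computes macro and micro precision on made-up three-class labels using a one-vs-rest view of each class; all function names and data here are illustrative.

```python
def per_class_counts(actuals, predictions, labels):
    """Return {label: (tp, fp, fn)}, treating each class as one-vs-rest."""
    counts = {}
    for label in labels:
        pairs = list(zip(actuals, predictions))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts

def macro_precision(counts):
    """Unweighted mean of the per-class precisions."""
    per_class = [tp / (tp + fp) if (tp + fp) > 0 else 0.0
                 for tp, fp, _ in counts.values()]
    return sum(per_class) / len(per_class)

def micro_precision(counts):
    """Precision computed once from counts pooled over all classes."""
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    return total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0.0

# Hypothetical labels: class 0 is frequent, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 0]

counts = per_class_counts(y_true, y_pred, labels=[0, 1, 2])
macro = macro_precision(counts)
micro = micro_precision(counts)
print(f"Macro precision: {macro:.3f}, micro precision: {micro:.3f}")
```

Macro averaging gives the two rare classes as much weight as the frequent one, so it comes out lower here; micro averaging is dominated by the frequent class.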
from typing import List, Tuple

import matplotlib.pyplot as plt


def compute_confusion_matrix(
    predictions: List[int],
    actuals: List[int]
) -> Tuple[int, int, int, int]:
    """Calculate TP, FP, TN, FN from binary (0/1) predictions and actuals."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    tp, fp, tn, fn = 0, 0, 0, 0
    for predicted, actual in zip(predictions, actuals):
        if predicted == actual == 1:
            tp += 1
        elif predicted == actual == 0:
            tn += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
    return tp, fp, tn, fn


def calculate_precision(tp: int, fp: int) -> float:
    """Calculate precision; returns 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def calculate_recall(tp: int, fn: int) -> float:
    """Calculate recall; returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def generate_pr_curve(
    predicted_probs: List[float],
    actuals: List[int]
) -> Tuple[List[float], List[float]]:
    """Generate precision-recall curve points, using each score as a threshold."""
    # Conventional starting point: recall 0 with precision 1.
    precisions = [1.0]
    recalls = [0.0]
    # Visit scores from highest to lowest so that recall never decreases.
    sorted_indices = sorted(
        range(len(predicted_probs)),
        key=lambda i: predicted_probs[i],
        reverse=True,
    )
    for index in sorted_indices:
        threshold = predicted_probs[index]
        binary_preds = [1 if prob >= threshold else 0 for prob in predicted_probs]
        tp, fp, tn, fn = compute_confusion_matrix(binary_preds, actuals)
        precisions.append(calculate_precision(tp, fp))
        recalls.append(calculate_recall(tp, fn))
    # The lowest threshold already predicts every sample as positive, so the
    # curve ends there; no artificial (recall=1, precision=0) point is added.
    return precisions, recalls

# Example data
predicted_probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
                   0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1]
actuals = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
precisions, recalls = generate_pr_curve(predicted_probs, actuals)
# Plot PR curve
plt.figure(figsize=(12, 5))
plt.plot(recalls, precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
Root Mean Square Error (RMSE)
RMSE is a common metric for regression models, measuring the square root of the average squared differences between predicted and actual values. It's sensitive to outliers, as large errors are squared and thus have a disproportionate impact.
When outliers are problematic:
- Remove outliers if they are noise
- Re-model if outliers are valid samples
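- Consider alternative metrics like MAPE (Mean Absolute Percentage Error), which expresses each error as a fraction of the actual value and is therefore less dominated by a few large absolute errors. Note that MAPE is undefined when an actual value is zero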
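To make the outlier sensitivity concrete, the sketch below compares RMSE and MAPE on a small made-up dataset in which one large-valued sample carries a large absolute error. The function names `rmse` and `mape` and the data are illustrative.

```python
import math

def rmse(actuals, predictions):
    """Square root of the mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actuals, predictions):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actuals):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs((a - p) / a) for a, p in zip(actuals, predictions)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical data: the last sample is an outlier with a large absolute error.
y_true = [100.0, 110.0, 95.0, 105.0, 1000.0]
y_pred = [102.0, 108.0, 97.0, 103.0, 800.0]

rmse_clean, rmse_all = rmse(y_true[:4], y_pred[:4]), rmse(y_true, y_pred)
mape_clean, mape_all = mape(y_true[:4], y_pred[:4]), mape(y_true, y_pred)

print(f"RMSE without / with outlier: {rmse_clean:.2f} / {rmse_all:.2f}")
print(f"MAPE without / with outlier: {mape_clean:.2f}% / {mape_all:.2f}%")
```

Adding the single outlier inflates RMSE far more than it inflates MAPE, because the outlier's error is squared in one case and divided by a large actual value in the other.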
ROC and AUC
The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. TPR is the same as recall, while FPR is the proportion of negatives incorrectly identified as positives.
def calculate_fpr(fp: int, tn: int) -> float:
    """Calculate false positive rate."""
    return fp / (fp + tn) if (fp + tn) > 0 else 0.0


def calculate_tpr(tp: int, fn: int) -> float:
    """Calculate true positive rate (identical to recall)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def generate_roc_curve(
    predicted_probs: List[float],
    actuals: List[int]
) -> Tuple[List[float], List[float]]:
    """Generate ROC curve points, using each score as a threshold."""
    # The curve starts at (0, 0): nothing is predicted positive.
    fprs = [0.0]
    tprs = [0.0]
    # Visit scores from highest to lowest so both rates never decrease.
    sorted_indices = sorted(
        range(len(predicted_probs)),
        key=lambda i: predicted_probs[i],
        reverse=True,
    )
    for index in sorted_indices:
        threshold = predicted_probs[index]
        binary_preds = [1 if prob >= threshold else 0 for prob in predicted_probs]
        tp, fp, tn, fn = compute_confusion_matrix(binary_preds, actuals)
        fprs.append(calculate_fpr(fp, tn))
        tprs.append(calculate_tpr(tp, fn))
    # The curve ends at (1, 1): everything is predicted positive.
    fprs.append(1.0)
    tprs.append(1.0)
    return fprs, tprs

# Generate and plot ROC curve
fprs, tprs = generate_roc_curve(predicted_probs, actuals)
plt.figure(figsize=(12, 5))
plt.plot(fprs, tprs)
plt.plot([0, 1], [0, 1], 'k--') # Diagonal line for random model
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
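The Area Under the ROC Curve (AUC) provides a single value summarizing the ROC curve. AUC ranges from 0 to 1: a value of 0.5 corresponds to a random model, 1.0 to a perfect model, and a value below 0.5 means the model ranks samples worse than chance (inverting its predictions would help). A higher AUC indicates better model performance in ranking positive samples above negative ones.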
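The ranking interpretation of AUC can be computed directly: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below is a simple quadratic-time illustration on made-up scores; the function name `auc_by_pairs` is hypothetical.

```python
def auc_by_pairs(scores, actuals):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, a in zip(scores, actuals) if a == 1]
    neg = [s for s, a in zip(scores, actuals) if a == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical scores and 0/1 labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

auc = auc_by_pairs(scores, labels)
# Negating the scores reverses the ranking and mirrors the AUC around 0.5.
inverted_auc = auc_by_pairs([-s for s in scores], labels)
print(f"AUC: {auc:.3f}, AUC with inverted scores: {inverted_auc:.3f}")
```

Eight of the nine positive-negative pairs are ordered correctly here, and reversing the ranking yields the complementary value.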
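ROC curves are less sensitive to class imbalance than PR curves. TPR and FPR are each computed within a single class, so the shape of the curve changes little when the ratio of positive to negative samples shifts. This makes ROC curves suitable for comparing models when positive and negative samples are unevenly distributed. When positives are rare and are the main interest, the PR curve is often the more informative view.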
Kolmogorov-Smirnov (KS) Statistic
The KS statistic measures the maximum difference between two cumulative distribution functions. In model evaluation, it's used to assess the separation between positive and negative class distributions.
A high KS value indicates good model discrimination, meaning the model effectively separates positive and negative cases. However, extremely high values may indicate overfitting, especially in domains like credit risk where labels have inherent ambiguity.
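For a binary classifier, the two distributions being compared are the model scores of the positive samples and of the negative samples. The sketch below computes the KS statistic from their empirical cumulative distributions; it assumes 0/1 labels, and the function name `ks_statistic` and the sample data are illustrative.

```python
def ks_statistic(scores, actuals):
    """Maximum gap between the score CDFs of the positive and negative class."""
    n_pos = sum(1 for a in actuals if a == 1)
    n_neg = len(actuals) - n_pos  # assumes labels are strictly 0 or 1
    if n_pos == 0 or n_neg == 0:
        raise ValueError("KS needs at least one positive and one negative sample")
    best = 0.0
    for threshold in sorted(set(scores)):
        # Empirical CDF of each class evaluated at this score.
        cdf_pos = sum(1 for s, a in zip(scores, actuals)
                      if a == 1 and s <= threshold) / n_pos
        cdf_neg = sum(1 for s, a in zip(scores, actuals)
                      if a == 0 and s <= threshold) / n_neg
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Hypothetical scores and 0/1 labels.
model_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
model_labels = [1, 1, 0, 1, 0, 0]
ks = ks_statistic(model_scores, model_labels)

# Perfectly separated scores give the maximum value of 1.0.
ks_perfect = ks_statistic([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(f"KS: {ks:.3f}, KS for perfectly separated scores: {ks_perfect:.3f}")
```

The loop is quadratic in the number of samples, which is fine for illustration; a sorted single pass would be preferable on large datasets.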
from scipy import stats
import numpy as np
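# Generate two samples (fixed seeds so the run is reproducible).
# In model evaluation these would be the scores of the positive and negative class.
n1, n2 = 200, 300
sample1 = stats.norm.rvs(size=n1, loc=0., scale=1, random_state=0)
sample2 = stats.norm.rvs(size=n2, loc=0.5, scale=1.5, random_state=1)
# Perform the two-sample KS test
ks_result = stats.ks_2samp(sample1, sample2)
print(f"KS statistic: {ks_result.statistic:.3f}, p-value: {ks_result.pvalue:.3e}")
# Large-sample critical value at the 0.05 significance level (coefficient 1.358)
critical_value = 1.358 * np.sqrt((n1 + n2) / (n1 * n2))
print(f"Critical value: {critical_value:.3f}")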
# Plot distributions
plt.figure(figsize=(12, 5))
plt.hist(sample1, density=True, histtype='stepfilled', alpha=0.2, color='red', label='Sample 1')
plt.hist(sample2, density=True, histtype='stepfilled', alpha=0.2, color='blue', label='Sample 2')
plt.legend()
plt.title('Distribution Comparison')
plt.show()
As a rule of thumb, KS values between 0.2 and 0.75 are often considered reasonable. Values below this range suggest the model barely separates the classes (underfitting), while values above it may indicate overfitting.
Scorecard Models
Scorecard models are linear models (most commonly built on logistic regression) designed for interpretability and stability:
- High feature coverage and stability
- Clear interpretability of feature weights
- Weights can be set based on expert knowledge or feature discrimination power (e.g., KS/IV values)
6.1 Non-linear Feature Handling
Two common approaches:
- WOE (Weight of Evidence): Transforms categorical features based on the log ratio of good to bad event distributions within each category.
- Binning: Converts continuous features into discrete categories to capture non-linear relationships.
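The WOE transformation can be sketched as follows, assuming the common definition WOE = ln(share of all good events in the bin / share of all bad events in the bin), together with the Information Value (IV) that sums the weighted differences. The function name `woe_and_iv`, the feature, and the counts are hypothetical.

```python
import math

def woe_and_iv(bins):
    """bins maps a bin name to (good_count, bad_count); returns ({bin: woe}, iv)."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    iv = 0.0
    for name, (good, bad) in bins.items():
        if good == 0 or bad == 0:
            # The log ratio is undefined for an empty class.
            raise ValueError(f"bin {name!r} has an empty class; merge it or smooth the counts")
        good_share = good / total_good
        bad_share = bad / total_bad
        woe[name] = math.log(good_share / bad_share)
        iv += (good_share - bad_share) * woe[name]
    return woe, iv

# Hypothetical (good, bad) counts for an "age" feature after binning.
age_bins = {
    "18-25": (100, 50),
    "26-40": (400, 40),
    "41+": (500, 10),
}

woe, iv = woe_and_iv(age_bins)
for name, value in woe.items():
    print(f"{name}: WOE = {value:+.3f}")
print(f"IV = {iv:.3f}")
```

A positive WOE marks a bin where good events are over-represented, a negative WOE marks one where bad events are over-represented, and a bin with the same share of both gets a WOE of zero.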
6.2 Cross-feature Interactions
Interactions between features can be handled by:
- Customer segmentation and separate modeling for each segment
- Explicitly creating interaction terms