Interpretive Context

This section blends classical evaluation metrics with probabilistic theory, helping users understand both the foundations and the limitations of model performance metrics like AUC-ROC and AUC-PR.

Binary Classification Outputs

Let \(\hat{y} = f(x) \in [0, 1]\) be the probabilistic score assigned by the model for a sample \(x \in \mathbb{R}^d\), and \(y \in \{0, 1\}\) the ground-truth label.

At any threshold \(\tau\), we define the standard classification metrics:

\[\text{TPR}(\tau) = \frac{\text{TP}}{\text{TP} + \text{FN}}, \quad \text{FPR}(\tau) = \frac{\text{FP}}{\text{FP} + \text{TN}}\]
\[\text{Precision}(\tau) = \frac{\text{TP}}{\text{TP} + \text{FP}}, \quad \text{Recall}(\tau) = \frac{\text{TP}}{\text{TP} + \text{FN}}\]

These metrics form the foundation of ROC and Precision-Recall curves, which evaluate how performance shifts across different thresholds \(\tau \in [0, 1]\).

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve plots:

  • \(\text{TPR}(\tau)\) vs. \(\text{FPR}(\tau)\)

  • As threshold \(\tau\) varies from 1 to 0

The Area Under the ROC Curve (AUC-ROC) is defined as:

\[\text{AUC} = \int_0^1 \text{TPR}(F_0^{-1}(1 - u)) \, du\]

Where \(F_0\) is the CDF of scores from the negative class.

Probabilistic Interpretation

AUC can also be seen as the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample:

\[\text{AUC} = P(\hat{y}^+ > \hat{y}^-)\]

Proof (U-statistic representation):

\[\text{AUC} = \iint \mathbb{1}(\hat{y}_1 > \hat{y}_0) \, dF_1(\hat{y}_1) \, dF_0(\hat{y}_0)\]

Where \(F_1\) and \(F_0\) are the score distributions of the positive and negative classes. Because the double integral averages the indicator over independent draws \(\hat{y}_1 \sim F_1\) and \(\hat{y}_0 \sim F_0\), it equals \(P(\hat{y}^+ > \hat{y}^-)\); the empirical version is the normalized Mann-Whitney U statistic.
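A quick numerical check of this identity, sketched with scikit-learn (the synthetic labels and scores below are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic scores: positives tend to score higher than negatives.
y_true = np.concatenate([np.ones(500), np.zeros(500)])
y_score = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500)])

# Rank-based (U-statistic) estimate of P(score of a positive > score of a negative),
# counting ties as 1/2.
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc_rank = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(auc_rank, roc_auc_score(y_true, y_score))  # the two estimates coincide
```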

ROC Operating Points

While AUC-ROC summarizes overall discriminative ability across all possible thresholds, practical deployment requires selecting a specific operating point (threshold \(\tau\)) that converts probabilistic scores into binary predictions. This selection fundamentally trades off sensitivity (true positive rate) against specificity (true negative rate), and the optimal choice depends on the relative costs of different error types.

The Operating Point Selection Problem

Given a classifier that produces continuous scores \(s(x) \in \mathbb{R}\) or probabilities \(\hat{p}(x) \in [0,1]\), we define the decision rule:

\[\begin{split}\hat{y} = \begin{cases} 1 & \text{if } s(x) \geq \tau \\ 0 & \text{if } s(x) < \tau \end{cases}\end{split}\]

Each threshold \(\tau\) induces a specific point on the ROC curve with coordinates \((\text{FPR}(\tau), \text{TPR}(\tau))\). The challenge is to select \(\tau\) without knowledge of deployment costs or class distributions.

The Fundamental Tradeoff: Lowering \(\tau\) increases sensitivity (captures more true positives) but decreases specificity (increases false positives). Raising \(\tau\) has the opposite effect.

Youden’s J Statistic

Definition: Youden’s index \(J\) identifies the threshold that maximizes the vertical distance from the ROC curve to the chance diagonal:

\[J(\tau) = \text{TPR}(\tau) - \text{FPR}(\tau) = \text{Sensitivity}(\tau) + \text{Specificity}(\tau) - 1\]

The optimal threshold is:

\[\tau^*_J = \arg\max_\tau J(\tau)\]

Geometric Interpretation: The chance diagonal represents random guessing (\(\text{TPR} = \text{FPR}\)). Youden’s J measures how far above this baseline the classifier performs. Maximum \(J\) occurs where the tangent to the ROC curve is parallel to the diagonal (slope = 1).

Statistical Properties:

  • \(J \in [0, 1]\) where \(J=0\) indicates no improvement over chance and \(J=1\) indicates perfect separation

  • \(J = \text{TPR} + \text{TNR} - 1\) equivalently maximizes the sum of sensitivity and specificity

  • Under equal class prevalence and equal costs, this is the Bayes-optimal threshold

Derivation:

For a given threshold \(\tau\), the confusion matrix yields:

\[J(\tau) = \frac{\text{TP}}{\text{TP} + \text{FN}} - \frac{\text{FP}}{\text{FP} + \text{TN}}\]

This can be rewritten in terms of the score distributions \(f_1(s)\) (positive class) and \(f_0(s)\) (negative class):

\[J(\tau) = \int_\tau^\infty f_1(s)\,ds - \int_\tau^\infty f_0(s)\,ds\]

The maximum occurs where the derivative equals zero:

\[\frac{dJ}{d\tau} = -f_1(\tau) + f_0(\tau) = 0 \quad \Rightarrow \quad f_1(\tau^*_J) = f_0(\tau^*_J)\]

Practical Interpretation: The optimal threshold is where the score densities of the positive and negative classes intersect, assuming equal prior probabilities and equal misclassification costs.

When to Use Youden’s J:

  • Equal importance of sensitivity and specificity

  • Balanced or unknown class prevalence

  • No differential misclassification costs

  • Screening tests where both false positives and false negatives are problematic

Limitations:

  • Assumes equal costs: \(C(\text{FP}) = C(\text{FN})\)

  • Ignores base rates (class prevalence)

  • May not be optimal for imbalanced datasets

  • Does not account for downstream decision costs
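A minimal sketch of selecting \(\tau^*_J\) from an empirical ROC curve with scikit-learn (y_true and y_score are placeholders for validation labels and scores):

```python
import numpy as np
from sklearn.metrics import roc_curve

# fpr, tpr, thresholds are evaluated at every distinct score in y_score.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

j = tpr - fpr                      # Youden's J at each candidate threshold
best = np.argmax(j)
tau_j = thresholds[best]

print(f"tau*_J = {tau_j:.3f}, J = {j[best]:.3f}, "
      f"TPR = {tpr[best]:.3f}, FPR = {fpr[best]:.3f}")
```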

Closest to Top-Left

Definition: This method minimizes the Euclidean distance from the ROC point to the ideal classifier at coordinates (0, 1):

\[d(\tau) = \sqrt{(1 - \text{TPR}(\tau))^2 + \text{FPR}(\tau)^2}\]

The optimal threshold is:

\[\tau^*_{\text{TL}} = \arg\min_\tau d(\tau)\]

Geometric Interpretation: The point (0, 1) represents perfect classification: 100% sensitivity with 0% false positive rate. This criterion seeks the threshold that gets “as close as possible” to perfection in Euclidean space.

Alternative Formulation: Squaring the distance for computational convenience:

\[\tau^*_{\text{TL}} = \arg\min_\tau \left[(1 - \text{TPR}(\tau))^2 + \text{FPR}(\tau)^2\right]\]

Relationship to Youden’s J:

The closest-to-top-left criterion can be viewed as minimizing a weighted combination of false negative rate and false positive rate:

\[d^2(\tau) = \text{FNR}^2(\tau) + \text{FPR}^2(\tau)\]

This implicitly assumes squared error loss, whereas Youden’s J assumes linear loss.

Mathematical Properties:

  • Invariant to monotonic transformations of the score scale

  • Continuous and differentiable (when TPR and FPR are smooth)

  • Guaranteed to have a solution on compact ROC space

  • The optimal point lies on the convex hull of the ROC curve

When to Use Closest-to-Top-Left:

  • Desire to minimize “overall error” in both dimensions

  • Both error types equally costly, but penalized quadratically

  • Smooth, well-calibrated classifiers

  • Visual/geometric interpretation preferred

Comparison with Youden’s J:

For well-separated classes with smooth ROC curves, the two methods often yield similar thresholds. However, they can diverge when:

  • ROC curve has sharp corners or discontinuities

  • Class distributions have heavy tails

  • One error type dominates in frequency but not in cost
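The closest-to-top-left threshold can be computed from the same ROC arrays, which also makes the comparison with Youden's J direct (a sketch; y_true and y_score are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Euclidean distance from each ROC point to the ideal corner (0, 1).
dist = np.sqrt((1 - tpr) ** 2 + fpr ** 2)
tau_tl = thresholds[np.argmin(dist)]

# Youden's J threshold, for comparison.
tau_j = thresholds[np.argmax(tpr - fpr)]

print(f"closest-to-top-left: {tau_tl:.3f}, Youden's J: {tau_j:.3f}")
```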

Cost-Sensitive Extensions

Both methods can be extended to account for asymmetric costs. Define:

  • \(C_{\text{FP}}\): cost of a false positive

  • \(C_{\text{FN}}\): cost of a false negative

  • \(\pi_0, \pi_1\): prior probabilities of negative and positive classes

The expected cost at threshold \(\tau\) is:

\[\mathbb{E}[\text{Cost}] = \pi_1 C_{\text{FN}} \cdot \text{FNR}(\tau) + \pi_0 C_{\text{FP}} \cdot \text{FPR}(\tau)\]

Cost-Weighted Youden’s J:

\[J_{\text{cost}}(\tau) = w_1 \cdot \text{TPR}(\tau) - w_0 \cdot \text{FPR}(\tau)\]

where \(w_1 = \pi_1 C_{\text{FN}}\) and \(w_0 = \pi_0 C_{\text{FP}}\).

Cost-Weighted Distance:

\[d_{\text{cost}}(\tau) = \sqrt{w_1^2(1-\text{TPR}(\tau))^2 + w_0^2 \text{FPR}(\tau)^2}\]
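When cost and prevalence estimates are available, the expected-cost curve can be scanned directly over the ROC thresholds (a sketch; the cost values and the labels/scores are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_curve

C_FP, C_FN = 1.0, 5.0                    # placeholder costs of each error type
fpr, tpr, thresholds = roc_curve(y_true, y_score)

pi_1 = np.mean(y_true)                   # positive-class prevalence
pi_0 = 1 - pi_1
fnr = 1 - tpr

expected_cost = pi_1 * C_FN * fnr + pi_0 * C_FP * fpr
tau_cost = thresholds[np.argmin(expected_cost)]
print(f"cost-minimizing threshold: {tau_cost:.3f}")
```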

Practical Considerations

Important Caveats:

  1. Equal Cost Assumption: Both standard methods assume \(C_{\text{FP}} = C_{\text{FN}}\), which is rarely true in practice. Medical diagnostics, fraud detection, and legal applications all have strongly asymmetric costs.

  2. Prevalence Sensitivity: Neither method explicitly accounts for class imbalance. A threshold optimal on balanced validation data may perform poorly when deployed on imbalanced populations.

  3. Calibration Matters: These methods assume the classifier’s scores are well-calibrated. For poorly calibrated models, threshold selection may be unstable.

  4. Validation Set Dependence: The optimal threshold is computed on validation data and may not generalize if the test distribution differs (distribution shift, concept drift).

  5. Multi-Objective Constraints: Real applications often have multiple constraints (e.g., “achieve at least 90% sensitivity while maximizing specificity”). These require constrained optimization rather than simple threshold rules.

Recommended Workflow:

  1. Visualize the ROC curve and compute AUC

  2. Identify candidate thresholds using both Youden’s J and closest-to-top-left

  3. Evaluate both thresholds on a held-out test set

  4. If domain costs are known, compute expected cost for each threshold

  5. Consider sensitivity analysis: how does performance vary in a neighborhood around \(\tau^*\)?

  6. Document the chosen threshold and the rationale for deployment

Example Cost-Benefit Analysis:

In cancer screening:

  • \(C_{\text{FP}}\): Cost of unnecessary biopsy + patient anxiety ≈ $1,000

  • \(C_{\text{FN}}\): Cost of delayed treatment + mortality risk ≈ $100,000

  • \(\pi_1\): Cancer prevalence ≈ 0.01

Optimal threshold strongly favors sensitivity (low \(\tau\)) to minimize missed cancers, even at the cost of more false alarms.
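For a well-calibrated model, the standard cost-minimization rule (predict positive whenever the expected cost of a miss exceeds that of a false alarm, assuming correct classifications incur no cost) makes this concrete:

\[\tau^* = \frac{C_{\text{FP}}}{C_{\text{FP}} + C_{\text{FN}}} = \frac{1{,}000}{1{,}000 + 100{,}000} \approx 0.01\]

which is consistent with the qualitative conclusion above; the prevalence \(\pi_1\) enters through the calibrated posterior \(\hat{p}(x)\) itself rather than through the threshold.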

Warning

Deploying a classifier with an arbitrary threshold (e.g., 0.5) without validation is statistically unjustified. The threshold should always be selected based on validation performance and domain requirements, not convention.

Note

For a comprehensive treatment of cost-sensitive learning and threshold optimization under uncertainty, see Elkan (2001) and Hand (2009).

Precision-Recall Curve and AUC-PR

The Precision-Recall (PR) curve focuses solely on the positive class. It plots:

\[\text{Precision}(\tau) = \frac{\pi_1 \cdot \text{TPR}(\tau)}{P(\hat{y} \ge \tau)}, \quad \text{Recall}(\tau) = \text{TPR}(\tau)\]

Where \(\pi_1 = P(y = 1)\) is the class prevalence.

The AUC-PR is defined as:

\[\text{AUC-PR} = \int_0^1 \text{Precision}(r) \, dr\]

Unlike ROC, the PR curve is not invariant to class imbalance. The baseline for precision is simply the proportion of positive samples in the data: \(\pi_1\).

Average Precision

Average Precision (AP) provides an alternative summary of the PR curve:

\[\text{AP} = \sum_{k=1}^{n} P(k) \Delta R(k)\]

where \(P(k)\) is precision at threshold \(k\) and \(\Delta R(k)\) is the change in recall from threshold \(k-1\) to \(k.\)

Key Distinction: AP is a step-wise approximation of the area under the PR curve: each precision value is weighted by the corresponding increment in recall, whereas trapezoidal AUC-PR linearly interpolates between operating points and can therefore be slightly optimistic. Because precision is weighted by recall gains, AP emphasizes performance at the top of the ranking, which makes it particularly suitable for tasks such as information retrieval and recommendation systems.
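Both quantities are available in scikit-learn; the sketch below contrasts the step-wise AP with a trapezoidal area under the PR curve (y_true and y_score are placeholders):

```python
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

precision, recall, _ = precision_recall_curve(y_true, y_score)

ap = average_precision_score(y_true, y_score)   # step-wise sum: sum_k P(k) * dR(k)
auc_pr = auc(recall, precision)                 # trapezoidal area (can be slightly optimistic)

print(f"AP = {ap:.3f}, trapezoidal AUC-PR = {auc_pr:.3f}")
```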

Thresholding and Predictions

To convert scores into hard predictions, we apply a threshold \(\tau\):

\[\begin{split}\hat{y}_\tau = \begin{cases} 1 & \text{if } \hat{y} \ge \tau \\ 0 & \text{otherwise} \end{cases}\end{split}\]
Threshold-based metrics include:

Accuracy: \(\frac{\text{TP} + \text{TN}}{\text{Total}}\)

F1 Score: \(2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)

Summary Table: Theory Meets Interpretation

Metric | Mathematical Formulation | Key Property | Practical Caveat
AUC-ROC | \( P(\hat{y}^+ > \hat{y}^-) \) | Rank-based, threshold-free | Can be misleading with class imbalance
AUC-PR | \( \int_0^1 \text{Precision}(r)\,dr \) | Focused on positives | Sensitive to prevalence and score noise
Precision | \( \frac{\text{TP}}{\text{TP} + \text{FP}} \) | Measures correctness | Not monotonic across thresholds
Recall | \( \frac{\text{TP}}{\text{TP} + \text{FN}} \) | Measures completeness | Ignores false positives
F1 Score | Harmonic mean of precision and recall | Tradeoff-aware | Requires threshold, hides base rates

Interpretive Caveats

  • AUC-ROC can be overly optimistic when the negative class dominates.

  • AUC-PR gives more meaningful insight for imbalanced datasets, but is more volatile and harder to interpret.

  • Neither AUC metric defines an optimal threshold — for deployment, threshold tuning must be contextualized.

  • Calibration (a strictly monotonic rescaling of scores) does not change rank-based summaries such as AUC-ROC or AUC-PR, but it does change any metric evaluated at a fixed probability threshold (e.g., precision, recall, or F1 at \(\tau = 0.5\)).

  • Metric conflicts are common: one model may outperform in AUC but underperform in F1.

  • Fairness and subgroup analysis are essential: A model may perform well overall, yet exhibit bias in subgroup-specific metrics.

Threshold Selection Logic

When computing confusion matrices, the choice of classification threshold can significantly change the reported results. This section documents the threshold-resolution logic used by the show_confusion_matrix function.

1. If the custom_threshold parameter is passed, it takes absolute precedence and is used directly.

2. Otherwise, if model_threshold is set and the model exposes a threshold dictionary, the function retrieves the threshold using the score parameter:

  • If score is passed (e.g., "f1"), then model.threshold[score] is used.

  • If score is not passed, the function falls back to the first item in model.scoring (if available).

3. If neither a custom threshold nor a valid model threshold is available, the default value of 0.5 is used.
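This resolution order can be summarized in a short sketch; resolve_threshold below is an illustrative helper, not part of the library API, and the attribute names simply mirror the description above:

```python
def resolve_threshold(model, custom_threshold=None, model_threshold=None, score=None):
    """Illustrative sketch of the threshold-resolution order described above."""
    # 1. An explicit custom_threshold takes absolute precedence.
    if custom_threshold is not None:
        return custom_threshold

    # 2. Otherwise, try the model's stored threshold dictionary.
    thresholds = getattr(model, "threshold", None)
    if model_threshold and isinstance(thresholds, dict):
        if score is not None and score in thresholds:
            return thresholds[score]            # e.g. model.threshold["f1"]
        scoring = getattr(model, "scoring", None)
        if scoring and scoring[0] in thresholds:
            return thresholds[scoring[0]]       # first item in model.scoring

    # 3. Fall back to the conventional default.
    return 0.5
```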

Calibration Trade-offs

Calibration curves are powerful diagnostic tools for assessing how well a model’s predicted probabilities reflect actual outcomes. However, their interpretation—and the methods used to derive them—come with important caveats that users should keep in mind.

Calibration Methodology

The examples shown in this library are based on models calibrated using Platt Scaling, a post-processing technique that fits a sigmoid function to the model’s prediction scores. Platt Scaling assumes a parametric form:

\[P(y = 1 \mid f(x)) = \frac{1}{1 + \exp(A f(x) + B)}\]

where \(A\) and \(B\) are scalar parameters learned on a separate calibration dataset. This approach is computationally efficient and works well for models such as SVMs and logistic regression, whose raw scores behave approximately like (scaled) log-odds, so a sigmoid mapping to probabilities is a natural fit.

However, Platt Scaling may underperform when the relationship between raw scores and true probabilities is non-monotonic or highly irregular.
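In scikit-learn, Platt-style sigmoid calibration is available through CalibratedClassifierCV; a minimal sketch (the base estimator and the data splits X_train, y_train, X_valid are placeholders):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Fit sigmoid (Platt) calibration on top of a margin-based classifier,
# using internal cross-validation to hold out calibration data.
base = LinearSVC()
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

p_cal = calibrated.predict_proba(X_valid)[:, 1]   # calibrated probabilities
```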

Alternative Calibration Methods

An alternative to Platt Scaling is Isotonic Regression, a non-parametric method that fits a monotonically increasing function to the model’s prediction scores. It is particularly effective when the mapping between predicted probabilities and observed outcomes is complex or non-linear.

Mathematically, isotonic regression solves the following constrained optimization problem:

\[\min_{\hat{p}_1, \ldots, \hat{p}_N} \sum_{i=1}^{N} (y_i - \hat{p}_i)^2 \quad \text{subject to} \quad \hat{p}_1 \leq \hat{p}_2 \leq \cdots \leq \hat{p}_N\]

Here:

  • \(y_i \in \{0, 1\}\) are the true binary labels,

  • \(\hat{p}_i\) are the calibrated probabilities corresponding to the model’s scores,

  • and the constraint enforces monotonicity, preserving the order of the original prediction scores.

The solution is obtained using the Pool Adjacent Violators Algorithm (PAVA), an efficient method for enforcing monotonicity in a least-squares fit.

While Isotonic Regression is highly flexible and can model arbitrary step-like functions, this same flexibility increases the risk of overfitting, especially when the calibration dataset is small, imbalanced, or noisy. It may capture spurious fluctuations in the validation data rather than the true underlying relationship between scores and outcomes.
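A minimal sketch of isotonic calibration applied directly to held-out scores (scores_cal, y_cal, and scores_new are placeholders); scikit-learn's IsotonicRegression implements PAVA:

```python
from sklearn.isotonic import IsotonicRegression

# Fit a monotonically increasing map from raw scores to calibrated probabilities
# on a held-out calibration split.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores_cal, y_cal)

p_cal = iso.predict(scores_new)   # calibrated probabilities for new scores
```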

Warning

Overfitting with isotonic regression can lead to miscalibration in deployment, particularly if the validation set is not representative of the production environment.

Note

This library does not perform calibration internally. Instead, users are expected to calibrate models during training or preprocessing—e.g., using the model_tuner library [1] or any external tool. All calibration curve plots included here are illustrative and assume models have already been calibrated using Platt Scaling prior to visualization.

Dependence on Validation Data

All calibration techniques rely heavily on the quality of the validation data used to learn the mapping. If the validation set is not representative of the target population, the resulting calibration curve may be misleading. This concern is especially important when deploying models in real-world settings where data drift or population imbalance may occur.

Calibration Assessment

Perfect calibration requires:

\[P(Y=1 | \hat{p}=p) = p \quad \forall p \in [0,1]\]

In practice, calibration is assessed by binning predictions and comparing:

\[\text{Observed frequency in bin } b = \frac{\sum_{i \in b} y_i}{|b|}\]

to the mean predicted probability in that bin.
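scikit-learn's calibration_curve performs exactly this binning; a sketch (y_true and p_hat are placeholders for labels and predicted probabilities):

```python
from sklearn.calibration import calibration_curve

# Observed frequency vs. mean predicted probability in each of 10 equal-width bins.
prob_true, prob_pred = calibration_curve(y_true, p_hat, n_bins=10, strategy="uniform")

for observed, predicted in zip(prob_true, prob_pred):
    print(f"mean predicted = {predicted:.2f}, observed frequency = {observed:.2f}")
```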

Brier Score Decomposition

The Brier score can be decomposed into:

\[\text{BS} = \text{Reliability} - \text{Resolution} + \text{Uncertainty}\]

where:

  • Reliability: the average squared gap between mean predicted probability and observed frequency within each bin (lower is better; 0 indicates perfect calibration)

  • Resolution: how far the per-bin observed frequencies deviate from the overall base rate, i.e., how well the model separates positive from negative cases (higher is better)

  • Uncertainty: inherent unpredictability in the data (\(\bar{y}(1-\bar{y})\))

This decomposition reveals that a model can have a good Brier score through strong resolution even with poor calibration.
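A sketch of the binned (Murphy-style) decomposition; it reproduces the Brier score exactly only when predictions within a bin are treated as identical (y_true and p_hat are placeholders):

```python
import numpy as np

def brier_decomposition(y_true, p_hat, n_bins=10):
    """Binned reliability / resolution / uncertainty terms of the Brier score."""
    y_true, p_hat = np.asarray(y_true, float), np.asarray(p_hat, float)
    bins = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    base_rate = y_true.mean()
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()                      # fraction of samples in the bin
        o_b = y_true[mask].mean()            # observed frequency in the bin
        p_b = p_hat[mask].mean()             # mean predicted probability in the bin
        reliability += w * (p_b - o_b) ** 2
        resolution += w * (o_b - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty   # BS ≈ REL - RES + UNC
```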

Interpreting the Brier Score

The Brier Score, often reported alongside calibration curves, provides a quantitative measure of probabilistic accuracy. It is defined as:

\[\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2\]

where \(\hat{p}_i\) is the predicted probability and \(y_i\) is the actual class label. While a lower Brier Score generally indicates better performance, it conflates calibration (how close predicted probabilities are to actual outcomes) and refinement (how confidently predictions are made). Thus, the Brier Score should be interpreted in context and not relied upon in isolation.

Lift: Mathematical Definition

Lift at depth \(d\) is defined as:

\[\text{Lift}(d) = \frac{\text{TP}(d) / n(d)}{\text{TP}_{\text{total}} / N}\]

where:

  • \(\text{TP}(d)\) = number of true positives in top \(d\%\) of predictions

  • \(n(d)\) = number of observations in top \(d\%\)

  • \(\text{TP}_{\text{total}}\) = total true positives in dataset

  • \(N\) = total number of observations

A lift value of 2.0 at 10% depth (percentage of sample) means the model identifies twice as many positives in the top 10% compared to random selection.
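A sketch of the lift calculation at a given depth (y_true and y_score are placeholders for labels and scores):

```python
import numpy as np

def lift_at_depth(y_true, y_score, depth=0.10):
    """Lift in the top `depth` fraction of predictions vs. the overall positive rate."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n_top = max(1, int(np.ceil(depth * len(y_true))))
    order = np.argsort(-y_score)              # highest scores first
    top_rate = y_true[order[:n_top]].mean()   # TP(d) / n(d)
    overall_rate = y_true.mean()              # TP_total / N
    return top_rate / overall_rate
```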

Gain: Mathematical Definition

Cumulative gain at depth \(d\) is defined as:

\[\text{Gain}(d) = \frac{\text{TP}(d)}{\text{TP}_{\text{total}}} \times 100\%\]

where:

  • \(\text{TP}(d)\) = number of true positives in top \(d\%\) of predictions

  • \(\text{TP}_{\text{total}}\) = total true positives in dataset

The Gini coefficient, derived from the gain curve, is calculated as:

\[\text{Gini} = 2 \times \text{AUGC} - 1\]

where \(\text{AUGC}\) is the area under the gain curve. The Gini coefficient ranges from 0 (random model) to 1 (perfect model) and provides a single-number summary of model discrimination power.
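A sketch of the cumulative gain curve and the Gini coefficient derived from it (y_true and y_score are placeholders):

```python
import numpy as np

def gain_curve_and_gini(y_true, y_score):
    """Cumulative gain at each depth and the Gini coefficient 2*AUGC - 1."""
    y_true = np.asarray(y_true)[np.argsort(-np.asarray(y_score))]  # sort by score, descending
    n = len(y_true)
    depth = np.concatenate(([0.0], np.arange(1, n + 1) / n))           # fraction scored so far
    gain = np.concatenate(([0.0], np.cumsum(y_true) / y_true.sum()))   # TP(d) / TP_total
    augc = np.trapz(gain, depth)                                       # area under the gain curve
    return depth, gain, 2 * augc - 1
```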

Partial Dependence Foundations

Let \(\mathbf{X}\) represent the complete set of input features for a machine learning model, where \(\mathbf{X} = \{X_1, X_2, \dots, X_p\}\). Suppose we’re particularly interested in a subset of these features, denoted by \(\mathbf{X}_S\). The complementary set, \(\mathbf{X}_C\), contains all the features in \(\mathbf{X}\) that are not in \(\mathbf{X}_S\). Mathematically, this relationship is expressed as:

\[\mathbf{X}_C = \mathbf{X} \setminus \mathbf{X}_S\]

where \(\mathbf{X}_C\) is the set of features in \(\mathbf{X}\) after removing the features in \(\mathbf{X}_S\).

Partial Dependence Plots (PDPs) are used to illustrate the effect of the features in \(\mathbf{X}_S\) on the model’s predictions, while averaging out the influence of the features in \(\mathbf{X}_C\). This is mathematically defined as:

\[\begin{split}\begin{align*} \text{PD}_{\mathbf{X}_S}(x_S) &= \mathbb{E}_{\mathbf{X}_C} \left[ f(x_S, \mathbf{X}_C) \right] \\ &= \int f(x_S, x_C) \, p(x_C) \, dx_C \end{align*}\end{split}\]

where:

  • \(\mathbb{E}_{\mathbf{X}_C} \left[ \cdot \right]\) indicates that we are taking the expected value over the possible values of the features in the set \(\mathbf{X}_C\).

  • \(p(x_C)\) represents the probability density function of the features in \(\mathbf{X}_C\).

This operation effectively summarizes the model’s output over all potential values of the complementary features, providing a clear view of how the features in \(\mathbf{X}_S\) alone impact the model’s predictions.
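A sketch of the empirical version of this average for a single feature (model, X, and feature_idx are placeholders; scikit-learn's sklearn.inspection.partial_dependence offers an equivalent computation):

```python
import numpy as np

def partial_dependence_1d(model, X, feature_idx, grid=None):
    """Average model prediction as one feature is swept over a grid,
    with all other features held at their observed values."""
    X = np.asarray(X, dtype=float)
    if grid is None:
        grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), 50)
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value                   # fix x_S at the grid value
        # For classifiers, use model.predict_proba(X_mod)[:, 1] instead.
        pd_values.append(model.predict(X_mod).mean())   # average over X_C
    return grid, np.array(pd_values)
```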

2D Partial Dependence Plots

Consider a trained machine learning model \(f(\mathbf{X})\), where \(\mathbf{X} = (X_1, X_2, \dots, X_p)\) represents the vector of input features. The partial dependence of the predicted response \(\hat{y}\) on a single feature \(X_j\) is defined as:

\[\text{PD}(X_j) = \frac{1}{n} \sum_{i=1}^{n} f(X_j, \mathbf{X}_{C_i})\]

where:

  • \(X_j\) is the feature of interest.

  • \(\mathbf{X}_{C_i}\) represents the complement set of \(X_j\), meaning the remaining features in \(\mathbf{X}\) not included in \(X_j\) for the \(i\)-th instance.

  • \(n\) is the number of observations in the dataset.

For two features, \(X_j\) and \(X_k\), the partial dependence is given by:

\[\text{PD}(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} f(X_j, X_k, \mathbf{X}_{C_i})\]

This results in a 2D surface plot (or contour plot) that shows how the predicted outcome changes as the values of \(X_j\) and \(X_k\) vary, while the effects of the other features are averaged out.

  • Single Feature PDP: When plotting \(\text{PD}(X_j)\), the result is a 2D line plot showing the marginal effect of feature \(X_j\) on the predicted outcome, averaged over all possible values of the other features.

  • Two Features PDP: When plotting \(\text{PD}(X_j, X_k)\), the result is a 3D surface plot (or a contour plot) that shows the combined marginal effect of \(X_j\) and \(X_k\) on the predicted outcome. The surface represents the expected value of the prediction as \(X_j\) and \(X_k\) vary, while all other features are averaged out.
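The two-feature version follows the same pattern over a joint grid (a sketch; model and X are placeholders):

```python
import numpy as np

def partial_dependence_2d(model, X, j, k, n_grid=20):
    """Average prediction over a joint grid of features j and k."""
    X = np.asarray(X, dtype=float)
    grid_j = np.linspace(X[:, j].min(), X[:, j].max(), n_grid)
    grid_k = np.linspace(X[:, k].min(), X[:, k].max(), n_grid)
    pd_surface = np.empty((n_grid, n_grid))
    for a, vj in enumerate(grid_j):
        for b, vk in enumerate(grid_k):
            X_mod = X.copy()
            X_mod[:, j], X_mod[:, k] = vj, vk
            pd_surface[a, b] = model.predict(X_mod).mean()
    return grid_j, grid_k, pd_surface   # plot as a contour or 3D surface
```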

3D Partial Dependence Plots

For a more comprehensive analysis, especially when exploring interactions between two features, 3D Partial Dependence Plots are invaluable. The partial dependence function for two features in a 3D context is:

\[\text{PD}(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} f(X_j, X_k, \mathbf{X}_{C_i})\]

Here, the function \(f(X_j, X_k, \mathbf{X}_{C_i})\) is evaluated across a grid of values for \(X_j\) and \(X_k\). The resulting 3D surface plot represents how the model’s prediction changes over the joint range of these two features.

The 3D plot offers a more intuitive visualization of feature interactions compared to 2D contour plots, allowing for a better understanding of the combined effects of features on the model’s predictions. The surface plot is particularly useful when you need to capture complex relationships that might not be apparent in 2D.

  • Feature Interaction Visualization: The 3D PDP provides a comprehensive view of the interaction between two features. The resulting surface plot allows for the visualization of how the model’s output changes when the values of two features are varied simultaneously, making it easier to understand complex interactions.

  • Enhanced Interpretation: 3D PDPs offer enhanced interpretability in scenarios where feature interactions are not linear or where the effect of one feature depends on the value of another. The 3D visualization makes these dependencies more apparent.

Regression Outputs

Regression Performance Metrics

For a regression model with true values \(y_i\) and predictions \(\hat{y}_i\), where \(i = 1, \ldots, n\), the following metrics quantify prediction accuracy:

Mean Squared Error (MSE)

Measures the average squared difference between predictions and actual values:

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

MSE penalizes large errors more heavily due to squaring. It is in squared units of the response variable.

Root Mean Squared Error (RMSE)

The square root of MSE, returning error to the original scale:

\[\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\]

RMSE is interpretable in the same units as \(y\) and is more sensitive to outliers than MAE.

Mean Absolute Error (MAE)

Measures the average absolute difference between predictions and actual values:

\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\]

MAE is more robust to outliers than RMSE and penalizes all errors proportionally. It is also in the same units as \(y\).

Mean Absolute Percentage Error (MAPE)

Expresses error as a percentage of the actual values:

\[\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|\]

MAPE is scale-independent but undefined when \(y_i = 0\) and can be skewed by very small denominators.

Coefficient of Determination (R²)

Represents the proportion of variance in \(y\) explained by the model:

\[R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}\]

where \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\) is the mean of observed values.

Properties:

  • \(R^2 \in (-\infty, 1]\) in general; \(R^2 \in [0, 1]\) for models with an intercept

  • \(R^2 = 1\): perfect fit

  • \(R^2 = 0\): model performs no better than predicting the mean

  • \(R^2 < 0\): model performs worse than the mean baseline

Adjusted R² (R²_adj)

Adjusts \(R^2\) for the number of predictors, penalizing model complexity:

\[R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]

where \(p\) is the number of predictors (excluding the intercept).

Properties:

  • \(R^2_{\text{adj}} \leq R^2\) always

  • Unlike \(R^2\), adjusted \(R^2\) can decrease when adding uninformative predictors

  • More appropriate than \(R^2\) for comparing models with different numbers of features

  • Can be negative if the model fits very poorly

Relationship Between Metrics:

\[\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\text{SS}_{\text{res}} / n}\]
\[R^2 = 1 - \frac{n \cdot \text{MSE}}{\text{SS}_{\text{tot}}}\]
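A sketch computing these metrics with NumPy (y_true, y_pred, and the predictor count p are placeholders):

```python
import numpy as np

def regression_metrics(y_true, y_pred, p):
    """MSE, RMSE, MAE, MAPE, R^2, and adjusted R^2 for p predictors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(resid))
    mape = 100 * np.mean(np.abs(resid / y_true))     # undefined if any y_true == 0
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2, "R2_adj": r2_adj}
```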

Explained Variance

Measures the proportion of variance explained without squaring the correlation:

\[\text{Explained Var} = 1 - \frac{\text{Var}(y - \hat{y})}{\text{Var}(y)}\]

Unlike \(R^2\), explained variance doesn’t penalize systematic bias (constant offset).

Residual Diagnostics: Statistical Foundations

For a regression model \(y = f(\mathbf{X}) + \epsilon\), residuals are defined as:

\[e_i = y_i - \hat{y}_i\]

Standard OLS assumptions require:

  1. \(\mathbb{E}[e_i] = 0\) (zero mean)

  2. \(\text{Var}(e_i) = \sigma^2\) (homoscedasticity)

  3. \(e_i \sim \mathcal{N}(0, \sigma^2)\) (normality)

  4. \(\text{Cov}(e_i, e_j) = 0\) for \(i \neq j\) (independence)

Standardized Residuals

Residuals scaled by the estimated standard deviation:

\[r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}\]

where \(\hat{\sigma}^2 = \text{MSE}\) and \(h_{ii}\) is the leverage (see below). Standardized residuals with \(|r_i| > 2\) or \(|r_i| > 3\) may indicate outliers.

Heteroscedasticity Tests

Breusch-Pagan Test

Tests whether residual variance depends on predictors:

\[\text{LM} = n \cdot R^2_{\text{aux}}\]

where \(R^2_{\text{aux}}\) is from regressing \(e_i^2\) on \(\mathbf{X}\). Under \(H_0\) (homoscedasticity), \(\text{LM} \sim \chi^2_p\).

White Test

A more general test not assuming specific functional form:

\[\text{LM}_{\text{White}} = n \cdot R^2\]

from regressing \(e_i^2\) on all predictors, their squares, and cross-products.

Goldfeld-Quandt Test

Compares variance between subsamples (typically split at median):

\[F = \frac{s^2_{\text{high}}}{s^2_{\text{low}}} \sim F_{(n_h-k), (n_l-k)}\]

Spearman Rank Correlation

Tests monotonic relationship between \(|e_i|\) and \(\hat{y}_i\):

\[\rho_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\]

where \(d_i\) is the difference in ranks.
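A sketch of two of these checks using statsmodels and SciPy (resid, y_fitted, and the design matrix X_with_const, which must include a constant column, are placeholders):

```python
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan: regress squared residuals on the design matrix (with constant).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, exog_het=X_with_const)
print(f"Breusch-Pagan LM = {lm_stat:.2f}, p = {lm_pvalue:.3f}")

# Spearman rank correlation between |residuals| and fitted values.
rho, p_rho = spearmanr(np.abs(resid), y_fitted)
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.3f}")
```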

Leverage and Influence

Leverage measures how unusual an observation’s predictor values are:

\[h_{ii} = \mathbf{x}_i^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_i\]

Observations with \(h_{ii} > 2(p+1)/n\) or \(h_{ii} > 3(p+1)/n\), i.e., two to three times the average leverage, are considered high leverage.

Properties:

  • \(0 \leq h_{ii} \leq 1\)

  • \(\sum_{i=1}^{n} h_{ii} = p + 1\) (where \(p\) is the number of predictors)

  • High leverage points have unusual \(X\) values but may or may not be influential

Cook’s Distance measures overall influence:

\[D_i = \frac{\sum_{j=1}^{n}(\hat{y}_j - \hat{y}_{j(i)})^2}{p \cdot \text{MSE}}\]

where \(\hat{y}_{j(i)}\) excludes observation \(i\). Common thresholds: \(D_i > 1\) or \(D_i > 4/n\).

Alternative formulation:

\[D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}\]

This shows Cook’s distance combines both residual magnitude (\(r_i^2\)) and leverage (\(h_{ii}\)).
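With a statsmodels OLS fit, leverage, standardized residuals, and Cook's distance are all available from the influence object (a sketch; X and y are placeholders):

```python
import statsmodels.api as sm

# Ordinary least squares with an explicit intercept column.
model = sm.OLS(y, sm.add_constant(X)).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag               # h_ii
std_resid = influence.resid_studentized_internal   # r_i
cooks_d, _ = influence.cooks_distance              # D_i (and p-values)

high_leverage = leverage > 2 * (X.shape[1] + 1) / len(y)
influential = cooks_d > 4 / len(y)
```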

Normality Diagnostics

Jarque-Bera Test

Tests whether residuals follow a normal distribution:

\[JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)\]

where \(S\) is skewness and \(K\) is kurtosis of the residuals. Under \(H_0\) (normality), \(JB \sim \chi^2_2\).

Autocorrelation

Durbin-Watson Statistic

Tests for first-order autocorrelation in residuals:

\[DW = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}\]

Interpretation:

  • \(DW \approx 2\): no autocorrelation

  • \(DW < 2\): positive autocorrelation

  • \(DW > 2\): negative autocorrelation

  • Range: \(0 \leq DW \leq 4\)
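Both statistics are available in statsmodels; a sketch applied to a residual vector resid, assumed to be in time order for the Durbin-Watson test:

```python
from statsmodels.stats.stattools import durbin_watson, jarque_bera

jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(resid)
dw = durbin_watson(resid)

print(f"Jarque-Bera = {jb_stat:.2f} (p = {jb_pvalue:.3f}), "
      f"skew = {skew:.2f}, kurtosis = {kurtosis:.2f}")
print(f"Durbin-Watson = {dw:.2f}  (values near 2 indicate no first-order autocorrelation)")
```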