Interpretive Context
This section blends classical evaluation metrics with probabilistic theory, helping users understand both the foundations and the limitations of model performance metrics like AUC-ROC and AUC-PR.
Binary Classification Outputs
Let \(\hat{y} = f(x) \in [0, 1]\) be the probabilistic score assigned by the model for a sample \(x \in \mathbb{R}^d\), and \(y \in \{0, 1\}\) the ground-truth label.
At any threshold \(\tau\), we define the standard classification metrics:
\[
\text{TPR}(\tau) = \frac{\text{TP}(\tau)}{\text{TP}(\tau) + \text{FN}(\tau)}, \qquad
\text{FPR}(\tau) = \frac{\text{FP}(\tau)}{\text{FP}(\tau) + \text{TN}(\tau)}, \qquad
\text{Precision}(\tau) = \frac{\text{TP}(\tau)}{\text{TP}(\tau) + \text{FP}(\tau)}, \qquad
\text{Recall}(\tau) = \text{TPR}(\tau)
\]
These metrics form the foundation of ROC and Precision-Recall curves, which evaluate how performance shifts across different thresholds \(\tau \in [0, 1]\).
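As a concrete illustration, the following minimal NumPy sketch computes these threshold-dependent metrics directly from a confusion matrix. The helper `metrics_at_threshold` and the synthetic scores are illustrative only and are not part of this library.

```python
import numpy as np

def metrics_at_threshold(y_true, y_score, tau):
    """Confusion-matrix metrics for the decision rule y_hat = 1[score >= tau]."""
    y_pred = (y_score >= tau).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0       # recall / sensitivity
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"TPR": tpr, "FPR": fpr, "precision": precision, "recall": tpr}

# Synthetic labels and scores, purely for demonstration
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(y_true * 0.3 + rng.normal(0.35, 0.25, size=1000), 0, 1)
print(metrics_at_threshold(y_true, y_score, tau=0.5))
```

Sweeping `tau` over a grid and collecting these values traces out the ROC and PR curves discussed next.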
ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve plots \(\text{TPR}(\tau)\) against \(\text{FPR}(\tau)\) as the threshold \(\tau\) varies from 1 to 0.
The Area Under the ROC Curve (AUC-ROC) is defined as:
\[
\text{AUC-ROC} = \int_0^1 \text{TPR}(\tau)\, dF_0(\tau),
\]
where \(F_0\) is the CDF of scores from the negative class.
Probabilistic Interpretation
AUC can also be seen as the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample:
\[
\text{AUC-ROC} = P(\hat{y}^+ > \hat{y}^-).
\]
Proof (U-statistic representation): for continuous score distributions,
\[
P(\hat{y}^+ > \hat{y}^-) = \int_0^1 F_0(s)\, dF_1(s),
\]
which is estimated empirically by the Mann–Whitney U-statistic
\[
\widehat{\text{AUC}} = \frac{1}{n_1 n_0} \sum_{i:\, y_i = 1} \; \sum_{j:\, y_j = 0} \mathbf{1}\!\left[\hat{y}_i > \hat{y}_j\right],
\]
where \(F_1\) and \(F_0\) are the score distributions of the positive and negative classes, and \(n_1\), \(n_0\) are the numbers of positive and negative samples.
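The equivalence between the ranking probability and the area under the curve can be checked numerically. This is a small sketch on synthetic scores (not library code); it compares the explicit pairwise U-statistic with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
# Synthetic scores: positives drawn from a higher-mean distribution than negatives
pos = rng.normal(0.65, 0.15, size=300)   # scores for y = 1
neg = rng.normal(0.40, 0.15, size=700)   # scores for y = 0
y_true = np.concatenate([np.ones(300), np.zeros(700)])
y_score = np.concatenate([pos, neg])

# U-statistic: fraction of (positive, negative) pairs ranked correctly, ties counted as 1/2
diff = pos[:, None] - neg[None, :]
auc_u = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

print(f"U-statistic AUC : {auc_u:.4f}")
print(f"roc_auc_score   : {roc_auc_score(y_true, y_score):.4f}")  # should agree
```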
ROC Operating Points
While AUC-ROC summarizes overall discriminative ability across all possible thresholds, practical deployment requires selecting a specific operating point (threshold \(\tau\)) that converts probabilistic scores into binary predictions. This selection fundamentally trades off sensitivity (true positive rate) against specificity (true negative rate), and the optimal choice depends on the relative costs of different error types.
The Operating Point Selection Problem
Given a classifier that produces continuous scores \(s(x) \in \mathbb{R}\) or probabilities \(\hat{p}(x) \in [0,1]\), we define the decision rule:
\[
\hat{y}(x) = \mathbf{1}\!\left[\hat{p}(x) \geq \tau\right] =
\begin{cases} 1 & \text{if } \hat{p}(x) \geq \tau \\ 0 & \text{otherwise.} \end{cases}
\]
Each threshold \(\tau\) induces a specific point on the ROC curve with coordinates \((\text{FPR}(\tau), \text{TPR}(\tau))\). The challenge is to select \(\tau\) without knowledge of deployment costs or class distributions.
The Fundamental Tradeoff: Lowering \(\tau\) increases sensitivity (captures more true positives) but decreases specificity (increases false positives). Raising \(\tau\) has the opposite effect.
Youden’s J Statistic
Definition: Youden’s index \(J\) identifies the threshold that maximizes the vertical distance from the ROC curve to the chance diagonal:
\[
J(\tau) = \text{TPR}(\tau) - \text{FPR}(\tau).
\]
The optimal threshold is:
\[
\tau^*_J = \arg\max_{\tau} \; \bigl\{ \text{TPR}(\tau) - \text{FPR}(\tau) \bigr\}.
\]
Geometric Interpretation: The chance diagonal represents random guessing (\(\text{TPR} = \text{FPR}\)). Youden’s J measures how far above this baseline the classifier performs. Maximum \(J\) occurs where the tangent to the ROC curve is parallel to the diagonal (slope = 1).
Statistical Properties:
\(J \in [0, 1]\) where \(J=0\) indicates no improvement over chance and \(J=1\) indicates perfect separation
\(J = \text{TPR} + \text{TNR} - 1\) equivalently maximizes the sum of sensitivity and specificity
Under equal class prevalence and equal costs, this is the Bayes-optimal threshold
Derivation:
For a given threshold \(\tau\), the confusion matrix yields:
\[
J(\tau) = \text{TPR}(\tau) - \text{FPR}(\tau).
\]
This can be rewritten in terms of the score distributions \(f_1(s)\) (positive class) and \(f_0(s)\) (negative class):
\[
J(\tau) = \bigl[1 - F_1(\tau)\bigr] - \bigl[1 - F_0(\tau)\bigr] = F_0(\tau) - F_1(\tau).
\]
The maximum occurs where the derivative equals zero:
\[
\frac{dJ}{d\tau}\bigg|_{\tau^*} = f_0(\tau^*) - f_1(\tau^*) = 0
\quad \Longrightarrow \quad f_1(\tau^*) = f_0(\tau^*).
\]
Practical Interpretation: The optimal threshold is where the score densities of the positive and negative classes intersect, assuming equal prior probabilities and equal misclassification costs.
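In practice the Youden-optimal threshold is read directly off the empirical ROC curve. The sketch below uses scikit-learn's `roc_curve`; the helper name `youden_threshold` and the synthetic data are illustrative, not part of this library.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Threshold maximizing J(tau) = TPR(tau) - FPR(tau) on the empirical ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr
    best = np.argmax(j)
    return thresholds[best], j[best]

# Synthetic example
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 2000)
s = np.clip(0.5 * y + rng.normal(0.25, 0.2, 2000), 0, 1)
tau_star, j_star = youden_threshold(y, s)
print(f"Youden-optimal threshold: {tau_star:.3f} (J = {j_star:.3f})")
```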
When to Use Youden’s J:
Equal importance of sensitivity and specificity
Balanced or unknown class prevalence
No differential misclassification costs
Screening tests where both false positives and false negatives are problematic
Limitations:
Assumes equal costs: \(C(\text{FP}) = C(\text{FN})\)
Ignores base rates (class prevalence)
May not be optimal for imbalanced datasets
Does not account for downstream decision costs
Closest to Top-Left
Definition: This method minimizes the Euclidean distance from the ROC point to the ideal classifier at coordinates (0, 1):
\[
d(\tau) = \sqrt{\text{FPR}(\tau)^2 + \bigl(1 - \text{TPR}(\tau)\bigr)^2}.
\]
The optimal threshold is:
\[
\tau^*_d = \arg\min_{\tau} \; d(\tau).
\]
Geometric Interpretation: The point (0, 1) represents perfect classification: 100% sensitivity with 0% false positive rate. This criterion seeks the threshold that gets “as close as possible” to perfection in Euclidean space.
Alternative Formulation: Squaring the distance for computational convenience:
\[
d^2(\tau) = \text{FPR}(\tau)^2 + \bigl(1 - \text{TPR}(\tau)\bigr)^2.
\]
Relationship to Youden’s J:
The closest-to-top-left criterion can be viewed as minimizing a (quadratically weighted) combination of false negative rate and false positive rate:
\[
d^2(\tau) = \text{FPR}(\tau)^2 + \text{FNR}(\tau)^2, \qquad \text{FNR}(\tau) = 1 - \text{TPR}(\tau).
\]
This implicitly assumes squared error loss, whereas Youden’s J assumes linear loss.
Mathematical Properties:
Invariant to monotonic transformations of the score scale
Continuous and differentiable (when TPR and FPR are smooth)
Guaranteed to have a solution on compact ROC space
The optimal point lies on the convex hull of the ROC curve
When to Use Closest-to-Top-Left:
Desire to minimize “overall error” in both dimensions
Both error types equally costly, but penalized quadratically
Smooth, well-calibrated classifiers
Visual/geometric interpretation preferred
Comparison with Youden’s J:
For well-separated classes with smooth ROC curves, the two methods often yield similar thresholds. However, they can diverge when:
ROC curve has sharp corners or discontinuities
Class distributions have heavy tails
One error type dominates in frequency but not in cost
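A quick way to see whether the two criteria agree for a given model is to compute both from the same empirical ROC curve. This sketch uses synthetic data and an illustrative helper `closest_to_top_left`; neither is part of this library.

```python
import numpy as np
from sklearn.metrics import roc_curve

def closest_to_top_left(y_true, y_score):
    """Threshold minimizing the Euclidean distance to the ideal point (FPR, TPR) = (0, 1)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    dist = np.sqrt(fpr**2 + (1.0 - tpr)**2)
    best = np.argmin(dist)
    return thresholds[best], dist[best]

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 2000)
s = np.clip(0.5 * y + rng.normal(0.25, 0.2, 2000), 0, 1)

fpr, tpr, thr = roc_curve(y, s)
tau_j = thr[np.argmax(tpr - fpr)]        # Youden's J
tau_d, _ = closest_to_top_left(y, s)     # closest to (0, 1)
print(f"Youden threshold:            {tau_j:.3f}")
print(f"Closest-to-(0, 1) threshold: {tau_d:.3f}")
```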
Cost-Sensitive Extensions
Both methods can be extended to account for asymmetric costs. Define:
\(C_{\text{FP}}\): cost of a false positive
\(C_{\text{FN}}\): cost of a false negative
\(\pi_0, \pi_1\): prior probabilities of negative and positive classes
The expected cost at threshold \(\tau\) is:
\[
\mathbb{E}\bigl[\text{Cost}(\tau)\bigr] = \pi_1\, C_{\text{FN}}\, \bigl(1 - \text{TPR}(\tau)\bigr) + \pi_0\, C_{\text{FP}}\, \text{FPR}(\tau).
\]
Cost-Weighted Youden’s J:
\[
J_w(\tau) = w_1\, \text{TPR}(\tau) - w_0\, \text{FPR}(\tau),
\]
where \(w_1 = \pi_1 C_{\text{FN}}\) and \(w_0 = \pi_0 C_{\text{FP}}\). Since \(\mathbb{E}[\text{Cost}(\tau)] = w_1 - J_w(\tau)\), maximizing \(J_w\) is equivalent to minimizing the expected cost.
Cost-Weighted Distance:
\[
d_w(\tau) = \sqrt{w_0\, \text{FPR}(\tau)^2 + w_1\, \bigl(1 - \text{TPR}(\tau)\bigr)^2}.
\]
Practical Considerations
Important Caveats:
Equal Cost Assumption: Both standard methods assume \(C_{\text{FP}} = C_{\text{FN}}\), which is rarely true in practice. Medical diagnostics, fraud detection, and legal applications all have strongly asymmetric costs.
Prevalence Sensitivity: Neither method explicitly accounts for class imbalance. A threshold optimal on balanced validation data may perform poorly when deployed on imbalanced populations.
Calibration Matters: These methods assume the classifier’s scores are well-calibrated. For poorly calibrated models, threshold selection may be unstable.
Validation Set Dependence: The optimal threshold is computed on validation data and may not generalize if the test distribution differs (distribution shift, concept drift).
Multi-Objective Constraints: Real applications often have multiple constraints (e.g., “achieve at least 90% sensitivity while maximizing specificity”). These require constrained optimization rather than simple threshold rules.
Recommended Workflow:
Visualize the ROC curve and compute AUC
Identify candidate thresholds using both Youden’s J and closest-to-top-left
Evaluate both thresholds on a held-out test set
If domain costs are known, compute expected cost for each threshold
Consider sensitivity analysis: how does performance vary in a neighborhood around \(\tau^*\)?
Document the chosen threshold and the rationale for deployment
Example Cost-Benefit Analysis:
In cancer screening:
\(C_{\text{FP}}\): Cost of unnecessary biopsy + patient anxiety ≈ $1,000
\(C_{\text{FN}}\): Cost of delayed treatment + mortality risk ≈ $100,000
\(\pi_1\): Cancer prevalence ≈ 0.01
Optimal threshold strongly favors sensitivity (low \(\tau\)) to minimize missed cancers, even at the cost of more false alarms.
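The screening numbers above can be plugged directly into the expected-cost expression to locate the cost-optimal threshold. The following sketch uses synthetic scores at roughly 1% prevalence; the data and the cost figures are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative cost parameters from the screening example above
C_FP, C_FN, pi1 = 1_000, 100_000, 0.01
pi0 = 1 - pi1

rng = np.random.default_rng(7)
n = 50_000
y = (rng.random(n) < pi1).astype(int)                    # ~1% prevalence
s = np.clip(0.45 * y + rng.normal(0.3, 0.15, n), 0, 1)   # synthetic scores

fpr, tpr, thr = roc_curve(y, s)
# Expected cost per screened case: pi1 * C_FN * FNR + pi0 * C_FP * FPR
exp_cost = pi1 * C_FN * (1 - tpr) + pi0 * C_FP * fpr
best = np.argmin(exp_cost)

print(f"Cost-optimal threshold:            {thr[best]:.3f}")
print(f"Expected cost per case:            ${exp_cost[best]:.2f}")
print(f"Youden threshold (for comparison): {thr[np.argmax(tpr - fpr)]:.3f}")
```

With these costs and this prevalence, the cost-optimal threshold typically lands well below the Youden threshold, which is exactly the "favor sensitivity" behavior described above.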
Warning
Deploying a classifier with an arbitrary threshold (e.g., 0.5) without validation is statistically unjustified. The threshold should always be selected based on validation performance and domain requirements, not convention.
Precision-Recall Curve and AUC-PR
The Precision-Recall (PR) curve focuses solely on the positive class. It plots \(\text{Precision}(\tau)\) against \(\text{Recall}(\tau)\) as the threshold \(\tau\) varies:
\[
\text{Recall}(\tau) = \text{TPR}(\tau), \qquad
\text{Precision}(\tau) = \frac{\pi_1\, \text{TPR}(\tau)}{\pi_1\, \text{TPR}(\tau) + (1 - \pi_1)\, \text{FPR}(\tau)},
\]
where \(\pi_1 = P(y = 1)\) is the class prevalence.
The AUC-PR is defined as:
\[
\text{AUC-PR} = \int_0^1 \text{Precision}(r)\, dr,
\]
with precision expressed as a function of recall \(r\).
Unlike ROC, the PR curve is not invariant to class imbalance. The baseline for precision is simply the proportion of positive samples in the data: \(\pi_1\).
Average Precision
Average Precision (AP) provides an alternative summary of the PR curve:
\[
\text{AP} = \sum_{k} P(k)\, \Delta R(k),
\]
where \(P(k)\) is precision at threshold \(k\) and \(\Delta R(k)\) is the change in recall from threshold \(k-1\) to \(k\).
Key Distinction: Unlike AUC-PR, which treats all parts of the curve equally, AP weights precision values by the change in recall, emphasizing performance at higher precision levels. This makes AP particularly suitable for tasks where precision at the top of the ranking is most critical (e.g., information retrieval, recommendation systems).
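The two summaries can be compared directly with scikit-learn, where `auc` applied to the PR curve gives a trapezoidal AUC-PR and `average_precision_score` gives the step-wise AP. The synthetic data here is purely illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 2000)
s = np.clip(0.4 * y + rng.normal(0.3, 0.2, 2000), 0, 1)

precision, recall, _ = precision_recall_curve(y, s)
auc_pr = auc(recall, precision)        # trapezoidal area under the PR curve
ap = average_precision_score(y, s)     # step-wise sum of P(k) * delta R(k)

print(f"AUC-PR (trapezoidal) : {auc_pr:.4f}")
print(f"Average Precision    : {ap:.4f}")
print(f"Baseline (prevalence): {y.mean():.4f}")
```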
Thresholding and Predictions
To convert scores into hard predictions, we apply a threshold \(\tau\):
\[
\hat{y}_{\text{pred}} =
\begin{cases} 1 & \text{if } \hat{y} \geq \tau \\ 0 & \text{otherwise.} \end{cases}
\]
Summary Table: Theory Meets Interpretation
| Metric | Mathematical Formulation | Key Property | Practical Caveat |
|---|---|---|---|
| AUC-ROC | \( P(\hat{y}^+ > \hat{y}^-) \) | Rank-based, threshold-free | Can be misleading with class imbalance |
| AUC-PR | \( \int_0^1 \text{Precision}(r)\,dr \) | Focused on positives | Sensitive to prevalence and score noise |
| Precision | \( \frac{\text{TP}}{\text{TP} + \text{FP}} \) | Measures correctness | Not monotonic across thresholds |
| Recall | \( \frac{\text{TP}}{\text{TP} + \text{FN}} \) | Measures completeness | Ignores false positives |
| F1 Score | \( 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \) (harmonic mean) | Tradeoff-aware | Requires threshold, hides base rates |
Interpretive Caveats
AUC-ROC can be overly optimistic when the negative class dominates.
AUC-PR gives more meaningful insight for imbalanced datasets, but is more volatile and harder to interpret.
Neither AUC metric defines an optimal threshold — for deployment, threshold tuning must be contextualized.
Calibration affects the choice of threshold and the metrics computed at a fixed threshold, but not rank-based summaries such as AUC-ROC.
Metric conflicts are common: one model may outperform in AUC but underperform in F1.
Fairness and subgroup analysis are essential: A model may perform well overall, yet exhibit bias in subgroup-specific metrics.
Threshold Selection Logic
When computing confusion matrices, selecting the right classification threshold can significantly impact the output. The `show_confusion_matrix` function, documented in this section, resolves the threshold according to the following precedence rules:
1. If the `custom_threshold` parameter is passed, it takes absolute precedence and is used directly.
2. If `model_threshold` is set and the model contains a threshold dictionary, the function will try to retrieve the threshold using the `score` parameter:
   - If `score` is passed (e.g., `"f1"`), then `model.threshold[score]` is used.
   - If `score` is not passed, the function will look up the first item in `model.scoring` (if available).
3. If neither a custom threshold nor a valid model threshold is available, the default value of `0.5` is used.
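The precedence order above can be sketched in a few lines of Python. This is an illustrative sketch only, not the library's actual implementation; the attribute names (`model.threshold`, `model.scoring`) follow the description in this section.

```python
def resolve_threshold(model, custom_threshold=None, model_threshold=False, score=None):
    """Illustrative sketch of the documented precedence rules:
    custom_threshold > model.threshold[...] (when model_threshold is set) > 0.5."""
    # 1. An explicit custom_threshold always wins.
    if custom_threshold is not None:
        return custom_threshold

    # 2. Otherwise, try the model's stored threshold dictionary.
    if model_threshold and isinstance(getattr(model, "threshold", None), dict):
        if score is not None and score in model.threshold:
            return model.threshold[score]          # e.g. model.threshold["f1"]
        scoring = getattr(model, "scoring", None)
        if scoring and scoring[0] in model.threshold:
            return model.threshold[scoring[0]]     # first scoring entry

    # 3. Fall back to the conventional default.
    return 0.5
```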
Calibration Trade-offs
Calibration curves are powerful diagnostic tools for assessing how well a model’s predicted probabilities reflect actual outcomes. However, their interpretation—and the methods used to derive them—come with important caveats that users should keep in mind.
Calibration Methodology
The examples shown in this library are based on models calibrated using Platt Scaling, a post-processing technique that fits a sigmoid function to the model’s prediction scores. Platt Scaling assumes a parametric form:
\[
P(y = 1 \mid s) = \frac{1}{1 + \exp(A s + B)},
\]
where \(A\) and \(B\) are scalar parameters learned using a separate calibration dataset. This approach is computationally efficient and works well for models such as SVMs and Logistic Regression, where prediction scores are linearly separable or approximately log-odds in nature.
However, Platt Scaling may underperform when the relationship between raw scores and true probabilities is non-monotonic or highly irregular.
Alternative Calibration Methods
An alternative to Platt Scaling is Isotonic Regression, a non-parametric method that fits a monotonically increasing function to the model’s prediction scores. It is particularly effective when the mapping between predicted probabilities and observed outcomes is complex or non-linear.
Mathematically, isotonic regression solves the following constrained optimization problem:
\[
\min_{\hat{p}_1, \dots, \hat{p}_n} \; \sum_{i=1}^{n} \bigl(y_i - \hat{p}_i\bigr)^2
\quad \text{subject to} \quad \hat{p}_i \leq \hat{p}_j \;\; \text{whenever} \;\; s_i \leq s_j,
\]
Here:
\(y_i \in \{0, 1\}\) are the true binary labels,
\(\hat{p}_i\) are the calibrated probabilities corresponding to the model’s scores,
and the constraint enforces monotonicity, preserving the order of the original prediction scores.
The solution is obtained using the Pool Adjacent Violators Algorithm (PAVA), an efficient method for enforcing monotonicity in a least-squares fit.
While Isotonic Regression is highly flexible and can model arbitrary step-like functions, this same flexibility increases the risk of overfitting, especially when the calibration dataset is small, imbalanced, or noisy. It may capture spurious fluctuations in the validation data rather than the true underlying relationship between scores and outcomes.
Warning
Overfitting with isotonic regression can lead to miscalibration in deployment, particularly if the validation set is not representative of the production environment.
Note
This library does not perform calibration internally. Instead, users are
expected to calibrate models during training or preprocessing—e.g., using the
model_tuner library [1] or any external tool. All calibration curve plots
included here are illustrative and assume models have already been calibrated
using Platt Scaling prior to visualization.
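As one possible way to perform that external calibration step, the sketch below uses scikit-learn's `CalibratedClassifierCV` with both sigmoid (Platt) and isotonic methods; `model_tuner` or any other tool can be substituted. The base estimator, dataset, and split are illustrative only.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.3, random_state=0)

# LinearSVC produces decision scores, not probabilities, so calibration is required
platt = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)    # Platt scaling
iso = CalibratedClassifierCV(LinearSVC(), method="isotonic", cv=5)     # isotonic regression

platt.fit(X_train, y_train)
iso.fit(X_train, y_train)
p_platt = platt.predict_proba(X_eval)[:, 1]
p_iso = iso.predict_proba(X_eval)[:, 1]

# Binned reliability diagram inputs: observed frequency vs. mean predicted probability
frac_pos, mean_pred = calibration_curve(y_eval, p_platt, n_bins=10)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))
```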
Dependence on Validation Data
All calibration techniques rely heavily on the quality of the validation data used to learn the mapping. If the validation set is not representative of the target population, the resulting calibration curve may be misleading. This concern is especially important when deploying models in real-world settings where data drift or population imbalance may occur.
Calibration Assessment
Perfect calibration requires:
\[
P\bigl(y = 1 \mid \hat{p}(x) = p\bigr) = p \quad \text{for all } p \in [0, 1].
\]
In practice, calibration is assessed by binning predictions and comparing the observed frequency of positives in each bin \(B_k\),
\[
\bar{y}_k = \frac{1}{|B_k|} \sum_{i \in B_k} y_i,
\]
to the mean predicted probability in that bin.
Brier Score Decomposition
The Brier score can be decomposed into:
\[
\text{BS} = \underbrace{\frac{1}{n}\sum_{k=1}^{K} n_k \bigl(\bar{p}_k - \bar{y}_k\bigr)^2}_{\text{Reliability}}
\;-\; \underbrace{\frac{1}{n}\sum_{k=1}^{K} n_k \bigl(\bar{y}_k - \bar{y}\bigr)^2}_{\text{Resolution}}
\;+\; \underbrace{\bar{y}\,(1 - \bar{y})}_{\text{Uncertainty}},
\]
where predictions are grouped into \(K\) bins containing \(n_k\) observations, with mean predicted probability \(\bar{p}_k\) and observed positive frequency \(\bar{y}_k\) in bin \(k\), and:
Reliability: How close predicted probabilities are to observed frequencies
Resolution: How well the model separates positive from negative cases
Uncertainty: Inherent unpredictability in the data (\(\bar{y}(1-\bar{y})\))
This decomposition reveals that a model can have a good Brier score through strong resolution even with poor calibration.
Interpreting the Brier Score
The Brier Score, often reported alongside calibration curves, provides a quantitative measure of probabilistic accuracy. It is defined as:
\[
\text{BS} = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{p}_i - y_i\bigr)^2,
\]
where \(\hat{p}_i\) is the predicted probability and \(y_i\) is the actual class label. While a lower Brier Score generally indicates better performance, it conflates calibration (how close predicted probabilities are to actual outcomes) and refinement (how confidently predictions are made). Thus, the Brier Score should be interpreted in context and not relied upon in isolation.
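The binned decomposition above can be computed directly. In this sketch, `brier_decomposition` is an illustrative helper (not a library function), and the data are synthetic and well calibrated by construction; with finite bins the identity holds only approximately.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def brier_decomposition(y_true, p_pred, n_bins=10):
    """Murphy decomposition: BS ~= reliability - resolution + uncertainty."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    bins = np.clip((p_pred * n_bins).astype(int), 0, n_bins - 1)
    base_rate = y_true.mean()
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()                     # n_k / n
        p_bar = p_pred[mask].mean()         # mean predicted probability in the bin
        y_bar = y_true[mask].mean()         # observed frequency in the bin
        reliability += w * (p_bar - y_bar) ** 2
        resolution += w * (y_bar - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

rng = np.random.default_rng(5)
p = rng.uniform(0, 1, 5000)
y = (rng.random(5000) < p).astype(int)      # well calibrated by construction
rel, res, unc = brier_decomposition(y, p)
print(f"Brier score     : {brier_score_loss(y, p):.4f}")
print(f"REL - RES + UNC : {rel - res + unc:.4f}")   # approximately equal
```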
Lift: Mathematical Definition
Lift at depth \(d\) is defined as:
\[
\text{Lift}(d) = \frac{\text{TP}(d) \,/\, n(d)}{\text{TP}_{\text{total}} \,/\, N},
\]
where:
\(\text{TP}(d)\) = number of true positives in top \(d\%\) of predictions
\(n(d)\) = number of observations in top \(d\%\)
\(\text{TP}_{\text{total}}\) = total true positives in dataset
\(N\) = total number of observations
A lift value of 2.0 at 10% depth (percentage of sample) means the model identifies twice as many positives in the top 10% compared to random selection.
Gain: Mathematical Definition
Cumulative gain at depth \(d\) is defined as:
\[
\text{Gain}(d) = \frac{\text{TP}(d)}{\text{TP}_{\text{total}}},
\]
where:
\(\text{TP}(d)\) = number of true positives in top \(d\%\) of predictions
\(\text{TP}_{\text{total}}\) = total true positives in dataset
The Gini coefficient, derived from the gain curve, is calculated as:
\[
\text{Gini} = 2 \times \text{AUGC} - 1,
\]
where \(\text{AUGC}\) is the area under the gain curve. The Gini coefficient ranges from 0 (random model) to 1 (perfect model) and provides a single-number summary of model discrimination power.
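The following sketch computes lift, cumulative gain at a chosen depth, and a Gini-style summary following the \(2 \times \text{AUGC} - 1\) convention above. The helpers and the synthetic data are illustrative only.

```python
import numpy as np

def lift_and_gain(y_true, y_score, depth=0.10):
    """Lift and cumulative gain in the top `depth` fraction of ranked predictions."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score))         # rank by descending score
    n_top = max(1, int(round(depth * len(y_true))))
    tp_top = y_true[order][:n_top].sum()              # positives captured in top d%
    tp_total = y_true.sum()
    gain = tp_top / tp_total
    lift = (tp_top / n_top) / (tp_total / len(y_true))
    return lift, gain

def gini_from_gain(y_true, y_score):
    """Gini-style summary: 2 * (area under the cumulative gain curve) - 1."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score))
    cum_gain = np.cumsum(y_true[order]) / y_true.sum()
    augc = cum_gain.mean()                             # Riemann approximation of the area
    return 2 * augc - 1

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 5000)
s = np.clip(0.4 * y + rng.normal(0.3, 0.2, 5000), 0, 1)
lift10, gain10 = lift_and_gain(y, s, depth=0.10)
print(f"Lift@10%: {lift10:.2f}  Gain@10%: {gain10:.2f}")
print(f"Gini (from gain curve): {gini_from_gain(y, s):.3f}")
```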
Partial Dependence Foundations
Let \(\mathbf{X}\) represent the complete set of input features for a machine learning model, where \(\mathbf{X} = \{X_1, X_2, \dots, X_p\}\). Suppose we’re particularly interested in a subset of these features, denoted by \(\mathbf{X}_S\). The complementary set, \(\mathbf{X}_C\), contains all the features in \(\mathbf{X}\) that are not in \(\mathbf{X}_S\). Mathematically, this relationship is expressed as:
\[
\mathbf{X}_C = \mathbf{X} \setminus \mathbf{X}_S,
\]
where \(\mathbf{X}_C\) is the set of features in \(\mathbf{X}\) after removing the features in \(\mathbf{X}_S\).
Partial Dependence Plots (PDPs) are used to illustrate the effect of the features in \(\mathbf{X}_S\) on the model’s predictions, while averaging out the influence of the features in \(\mathbf{X}_C\). This is mathematically defined as:
\[
\text{PD}_{\mathbf{X}_S}(x_S) \;=\; \mathbb{E}_{\mathbf{X}_C}\bigl[f(x_S, \mathbf{X}_C)\bigr]
\;=\; \int f(x_S, x_C)\, p(x_C)\, dx_C,
\]
where:
\(\mathbb{E}_{\mathbf{X}_C} \left[ \cdot \right]\) indicates that we are taking the expected value over the possible values of the features in the set \(\mathbf{X}_C\).
\(p(x_C)\) represents the probability density function of the features in \(\mathbf{X}_C\).
This operation effectively summarizes the model’s output over all potential values of the complementary features, providing a clear view of how the features in \(\mathbf{X}_S\) alone impact the model’s predictions.
2D Partial Dependence Plots
Consider a trained machine learning model \(f(\mathbf{X})\), where \(\mathbf{X} = (X_1, X_2, \dots, X_p)\) represents the vector of input features. The partial dependence of the predicted response \(\hat{y}\) on a single feature \(X_j\) is defined as:
\[
\text{PD}(X_j) = \frac{1}{n} \sum_{i=1}^{n} f\bigl(X_j, \mathbf{X}_{C_i}\bigr),
\]
where:
\(X_j\) is the feature of interest.
\(\mathbf{X}_{C_i}\) represents the complement set of \(X_j\), meaning the remaining features in \(\mathbf{X}\) not included in \(X_j\) for the \(i\)-th instance.
\(n\) is the number of observations in the dataset.
For two features, \(X_j\) and \(X_k\), the partial dependence is given by:
\[
\text{PD}(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} f\bigl(X_j, X_k, \mathbf{X}_{C_i}\bigr).
\]
This results in a 2D surface plot (or contour plot) that shows how the predicted outcome changes as the values of \(X_j\) and \(X_k\) vary, while the effects of the other features are averaged out.
Single Feature PDP: When plotting \(\text{PD}(X_j)\), the result is a 2D line plot showing the marginal effect of feature \(X_j\) on the predicted outcome, averaged over all possible values of the other features.
Two Features PDP: When plotting \(\text{PD}(X_j, X_k)\), the result is a 3D surface plot (or a contour plot) that shows the combined marginal effect of \(X_j\) and \(X_k\) on the predicted outcome. The surface represents the expected value of the prediction as \(X_j\) and \(X_k\) vary, while all other features are averaged out.
3D Partial Dependence Plots
For a more comprehensive analysis, especially when exploring interactions between two features, 3D Partial Dependence Plots are invaluable. The partial dependence function for two features in a 3D context is:
\[
\text{PD}(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} f\bigl(X_j, X_k, \mathbf{X}_{C_i}\bigr).
\]
Here, the function \(f(X_j, X_k, \mathbf{X}_{C_i})\) is evaluated across a grid of values for \(X_j\) and \(X_k\). The resulting 3D surface plot represents how the model’s prediction changes over the joint range of these two features.
The 3D plot offers a more intuitive visualization of feature interactions compared to 2D contour plots, allowing for a better understanding of the combined effects of features on the model’s predictions. The surface plot is particularly useful when you need to capture complex relationships that might not be apparent in 2D.
Feature Interaction Visualization: The 3D PDP provides a comprehensive view of the interaction between two features. The resulting surface plot allows for the visualization of how the model’s output changes when the values of two features are varied simultaneously, making it easier to understand complex interactions.
Enhanced Interpretation: 3D PDPs offer enhanced interpretability in scenarios where feature interactions are not linear or where the effect of one feature depends on the value of another. The 3D visualization makes these dependencies more apparent.
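The averaging in the partial dependence formula can be implemented by brute force, which makes the definition concrete. The helper `partial_dependence_1d`, the dataset, and the model below are illustrative; any estimator exposing `.predict` would work.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

def partial_dependence_1d(model, X, j, grid):
    """Brute-force PD(X_j): average prediction with feature j clamped to each grid
    value while all other features keep their observed values."""
    X = np.asarray(X, dtype=float)
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, j] = v            # clamp feature j for every observation
        pd_values.append(model.predict(X_mod).mean())
    return np.asarray(pd_values)

X, y = make_friedman1(n_samples=500, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
print(partial_dependence_1d(model, X, j=0, grid=grid).round(2))
```

The 2D surface version follows the same pattern with two clamped features evaluated over a grid of value pairs.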
Regression Outputs
Regression Performance Metrics
For a regression model with true values \(y_i\) and predictions \(\hat{y}_i\), where \(i = 1, \ldots, n\), the following metrics quantify prediction accuracy:
Mean Squared Error (MSE)
Measures the average squared difference between predictions and actual values:
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2
\]
MSE penalizes large errors more heavily due to squaring. It is in squared units of the response variable.
Root Mean Squared Error (RMSE)
The square root of MSE, returning error to the original scale:
\[
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2}
\]
RMSE is interpretable in the same units as \(y\) and is more sensitive to outliers than MAE.
Mean Absolute Error (MAE)
Measures the average absolute difference between predictions and actual values:
\[
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \bigl| y_i - \hat{y}_i \bigr|
\]
MAE is more robust to outliers than RMSE and penalizes all errors proportionally. It is also in the same units as \(y\).
Mean Absolute Percentage Error (MAPE)
Expresses error as a percentage of the actual values:
\[
\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
\]
MAPE is scale-independent but undefined when \(y_i = 0\) and can be skewed by very small denominators.
Coefficient of Determination (R²)
Represents the proportion of variance in \(y\) explained by the model:
\[
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
where \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\) is the mean of observed values.
Properties:
\(R^2 \in (-\infty, 1]\) in general; \(R^2 \in [0, 1]\) for models with an intercept
\(R^2 = 1\): perfect fit
\(R^2 = 0\): model performs no better than predicting the mean
\(R^2 < 0\): model performs worse than the mean baseline
Adjusted R² (R²_adj)
Adjusts \(R^2\) for the number of predictors, penalizing model complexity:
\[
R^2_{\text{adj}} = 1 - \bigl(1 - R^2\bigr)\, \frac{n - 1}{n - p - 1}
\]
where \(p\) is the number of predictors (excluding the intercept).
Properties:
\(R^2_{\text{adj}} \leq R^2\) always
Unlike \(R^2\), adjusted \(R^2\) can decrease when adding uninformative predictors
More appropriate than \(R^2\) for comparing models with different numbers of features
Can be negative if the model fits very poorly
Relationship Between Metrics: for any set of residuals, \(\text{MAE} \leq \text{RMSE} = \sqrt{\text{MSE}}\), with equality only when all errors have the same magnitude; the gap between MAE and RMSE widens as the error distribution becomes more heavy-tailed.
Explained Variance
Measures the proportion of variance explained without squaring the correlation:
\[
\text{EV} = 1 - \frac{\text{Var}(y - \hat{y})}{\text{Var}(y)}
\]
Unlike \(R^2\), explained variance doesn’t penalize systematic bias (constant offset).
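The metrics above are available in scikit-learn, apart from adjusted \(R^2\), which is a one-line formula. This sketch uses synthetic predictions, and the choice of \(p = 3\) predictors is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score,
                             explained_variance_score)

rng = np.random.default_rng(0)
y_true = rng.normal(50, 10, 200)
y_pred = y_true + rng.normal(0, 5, 200)      # synthetic predictions

n, p = len(y_true), 3                         # assume the model used 3 predictors
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R^2
ev = explained_variance_score(y_true, y_pred)

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  MAPE={mape:.3f}")
print(f"R^2={r2:.3f}  adjusted R^2={r2_adj:.3f}  EV={ev:.3f}")
```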
Residual Diagnostics: Statistical Foundations
For a regression model \(y = f(\mathbf{X}) + \epsilon\), residuals are defined as:
\[
e_i = y_i - \hat{y}_i
\]
Standard OLS assumptions require:
\(\mathbb{E}[e_i] = 0\) (zero mean)
\(\text{Var}(e_i) = \sigma^2\) (homoscedasticity)
\(e_i \sim \mathcal{N}(0, \sigma^2)\) (normality)
\(\text{Cov}(e_i, e_j) = 0\) for \(i \neq j\) (independence)
Standardized Residuals
Residuals scaled by the estimated standard deviation:
\[
r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}
\]
where \(\hat{\sigma}^2 = \text{MSE}\) and \(h_{ii}\) is the leverage (see below). Standardized residuals with \(|r_i| > 2\) or \(|r_i| > 3\) may indicate outliers.
Heteroscedasticity Tests
Breusch-Pagan Test
Tests whether residual variance depends on predictors:
\[
\text{LM} = n \cdot R^2_{\text{aux}}
\]
where \(R^2_{\text{aux}}\) is from regressing \(e_i^2\) on \(\mathbf{X}\). Under \(H_0\) (homoscedasticity), \(\text{LM} \sim \chi^2_p\).
White Test
A more general test that does not assume a specific functional form:
\[
\text{LM} = n \cdot R^2_{\text{aux}}
\]
from regressing \(e_i^2\) on all predictors, their squares, and cross-products.
Goldfeld-Quandt Test
Compares variance between subsamples (typically split at the median):
\[
F = \frac{\hat{\sigma}^2_2}{\hat{\sigma}^2_1},
\]
where \(\hat{\sigma}^2_1\) and \(\hat{\sigma}^2_2\) are the residual variances from separate regressions on the two subsamples.
Spearman Rank Correlation
Tests for a monotonic relationship between \(|e_i|\) and \(\hat{y}_i\):
\[
\rho_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
\]
where \(d_i\) is the difference in ranks.
Leverage and Influence
Leverage measures how unusual an observation’s predictor values are:
\[
h_{ii} = \mathbf{x}_i^\top \bigl(\mathbf{X}^\top \mathbf{X}\bigr)^{-1} \mathbf{x}_i
\]
Observations with \(h_{ii} > 2p/n\) or \(h_{ii} > 3p/n\) are considered high leverage.
Properties:
\(0 \leq h_{ii} \leq 1\)
\(\sum_{i=1}^{n} h_{ii} = p + 1\) (where \(p\) is the number of predictors)
High leverage points have unusual \(X\) values but may or may not be influential
Cook’s Distance measures overall influence:
\[
D_i = \frac{\sum_{j=1}^{n} \bigl(\hat{y}_j - \hat{y}_{j(i)}\bigr)^2}{(p + 1)\, \hat{\sigma}^2},
\]
where \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) when observation \(i\) is excluded from the fit. Common thresholds: \(D_i > 1\) or \(D_i > 4/n\).
Alternative formulation:
\[
D_i = \frac{r_i^2}{p + 1} \cdot \frac{h_{ii}}{1 - h_{ii}}
\]
This shows Cook’s distance combines both residual magnitude (\(r_i^2\)) and leverage (\(h_{ii}\)).
Normality Diagnostics
Jarque-Bera Test
Tests whether residuals follow a normal distribution:
\[
JB = \frac{n}{6} \left( S^2 + \frac{(K - 3)^2}{4} \right)
\]
where \(S\) is skewness and \(K\) is kurtosis of the residuals. Under \(H_0\) (normality), \(JB \sim \chi^2_2\).
Autocorrelation
Durbin-Watson Statistic
Tests for first-order autocorrelation in residuals:
\[
DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}
\]
Interpretation:
\(DW \approx 2\): no autocorrelation
\(DW < 2\): positive autocorrelation
\(DW > 2\): negative autocorrelation
Range: \(0 \leq DW \leq 4\)
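Most of these diagnostics are available in statsmodels. The sketch below fits an OLS model on synthetic data and reports the Breusch-Pagan, Jarque-Bera, and Durbin-Watson statistics, plus leverage and Cook's distance; the data-generating process is an assumption for illustration only.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson, jarque_bera

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

X_const = sm.add_constant(X)                 # design matrix with intercept
ols = sm.OLS(y, X_const).fit()
resid = ols.resid

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X_const)
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(resid)
dw = durbin_watson(resid)

# Leverage and Cook's distance from the OLS influence object
influence = ols.get_influence()
leverage = influence.hat_matrix_diag
cooks_d, _ = influence.cooks_distance

print(f"Breusch-Pagan LM = {lm_stat:.2f} (p = {lm_pvalue:.3f})")
print(f"Jarque-Bera      = {jb_stat:.2f} (p = {jb_pvalue:.3f})")
print(f"Durbin-Watson    = {dw:.2f}")
print(f"Max leverage = {leverage.max():.3f}, max Cook's D = {cooks_d.max():.3f}")
```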
References
[1] Funnell, A., Shpaner, L., & Petousis, P. (2024). Model Tuner (Version 0.0.28b) [Software]. Zenodo. https://doi.org/10.5281/zenodo.12727322
[2] Elkan, C. (2001). The foundations of cost-sensitive learning. International Joint Conference on Artificial Intelligence, 973-978.
[3] Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103-123.