Interpretive Context
This section blends classical evaluation metrics with probabilistic theory, helping users understand both the foundations and the limitations of model performance metrics like AUC-ROC and AUC-PR.
Binary Classification Outputs
Let \(\hat{y} = f(x) \in [0, 1]\) be the probabilistic score assigned by the model for a sample \(x \in \mathbb{R}^d\), and \(y \in \{0, 1\}\) the ground-truth label.
At any threshold \(\tau\), we define the standard classification metrics:

\[\text{TPR}(\tau) = P(\hat{y} \geq \tau \mid y = 1), \qquad \text{FPR}(\tau) = P(\hat{y} \geq \tau \mid y = 0), \qquad \text{Precision}(\tau) = P(y = 1 \mid \hat{y} \geq \tau)\]
These metrics form the foundation of ROC and Precision-Recall curves, which evaluate how performance shifts across different thresholds \(\tau \in [0, 1]\).
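As a quick illustration, the snippet below computes TPR, FPR, and precision at a single threshold on a small synthetic example. It is a minimal sketch; the arrays and the threshold value are made up for demonstration.

```python
import numpy as np

# Synthetic labels and scores, purely for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.65])
tau = 0.5

# Hard predictions at threshold tau.
y_pred = (y_score >= tau).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

tpr = tp / (tp + fn)        # recall / sensitivity
fpr = fp / (fp + tn)        # fall-out
precision = tp / (tp + fp)  # positive predictive value

print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, Precision={precision:.2f}")
```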
ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve plots:
\(\text{TPR}(\tau)\) vs. \(\text{FPR}(\tau)\) as the threshold \(\tau\) varies from 1 to 0.
The Area Under the ROC Curve (AUC-ROC) is defined as:

\[\text{AUC-ROC} = \int_0^1 \text{TPR}\left(F_0^{-1}(1 - u)\right)\, du\]

where \(F_0\) is the CDF of scores from the negative class.
Probabilistic Interpretation
AUC can also be seen as the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample:

\[\text{AUC-ROC} = P\left(\hat{y}^+ > \hat{y}^-\right)\]

Proof (U-statistic representation):

\[P\left(\hat{y}^+ > \hat{y}^-\right) = \int_{-\infty}^{\infty} P\left(\hat{y}^+ > s\right)\, dF_0(s) = \int_{-\infty}^{\infty} \left(1 - F_1(s)\right)\, dF_0(s)\]

where \(F_1\) and \(F_0\) are the score distributions (CDFs) of the positive and negative classes. The empirical counterpart is the Mann–Whitney U statistic, \(\frac{1}{n_1 n_0} \sum_{i=1}^{n_1} \sum_{j=1}^{n_0} \mathbf{1}\left[\hat{y}_i^+ > \hat{y}_j^-\right]\), computed over all positive-negative pairs.
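This equivalence is easy to verify numerically. The sketch below compares a brute-force pairwise estimate of \(P(\hat{y}^+ > \hat{y}^-)\) with scikit-learn's roc_auc_score on synthetic scores; the data and distribution parameters are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Positive-class scores drawn from a slightly higher distribution than negatives.
pos = rng.normal(loc=1.0, scale=1.0, size=500)
neg = rng.normal(loc=0.0, scale=1.0, size=500)

y_true = np.concatenate([np.ones_like(pos), np.zeros_like(neg)])
y_score = np.concatenate([pos, neg])

# U-statistic: fraction of (positive, negative) pairs ranked correctly,
# counting ties as one half.
u_stat = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

print(f"U-statistic estimate : {u_stat:.4f}")
print(f"sklearn roc_auc_score: {roc_auc_score(y_true, y_score):.4f}")
```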
Precision-Recall Curve and AUC-PR
The Precision-Recall (PR) curve focuses solely on the positive class. It plots \(\text{Precision}(\tau)\) vs. \(\text{Recall}(\tau)\), where recall coincides with TPR and precision can be written as

\[\text{Precision}(\tau) = \frac{\pi_1\, \text{TPR}(\tau)}{\pi_1\, \text{TPR}(\tau) + (1 - \pi_1)\, \text{FPR}(\tau)}\]

where \(\pi_1 = P(y = 1)\) is the class prevalence.
The AUC-PR is defined as:

\[\text{AUC-PR} = \int_0^1 \text{Precision}(r)\, dr\]

where the integration variable \(r\) is recall.
Unlike ROC, the PR curve is not invariant to class imbalance. The baseline for precision is simply the proportion of positive samples in the data: \(\pi_1\).
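The sketch below, using scikit-learn on an imbalanced synthetic dataset, shows how the prevalence baseline \(\pi_1\) contextualizes AUC-PR; here AUC-PR is approximated by average precision, and the data, model, and split are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Roughly 10% positives, so the precision baseline pi_1 is about 0.1.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, scores)
ap = average_precision_score(y_te, scores)  # step-wise approximation of AUC-PR
baseline = y_te.mean()                      # prevalence pi_1

print(f"Average precision: {ap:.3f}  (prevalence baseline: {baseline:.3f})")
```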
Thresholding and Predictions
To convert scores into hard predictions, we apply a threshold \(\tau\):

\[\hat{y}_{\text{pred}} = \mathbf{1}\left[f(x) \geq \tau\right] = \begin{cases} 1 & \text{if } f(x) \geq \tau \\ 0 & \text{otherwise} \end{cases}\]
Summary Table: Theory Meets Interpretation
| Metric | Mathematical Formulation | Key Property | Practical Caveat |
|---|---|---|---|
| AUC-ROC | \( P(\hat{y}^+ > \hat{y}^-) \) | Rank-based, threshold-free | Can be misleading with class imbalance |
| AUC-PR | \( \int_0^1 \text{Precision}(r)\,dr \) | Focused on positives | Sensitive to prevalence and score noise |
| Precision | \( \frac{\text{TP}}{\text{TP} + \text{FP}} \) | Measures correctness | Not monotonic across thresholds |
| Recall | \( \frac{\text{TP}}{\text{TP} + \text{FN}} \) | Measures completeness | Ignores false positives |
| F1 Score | Harmonic mean of precision and recall | Tradeoff-aware | Requires a threshold; hides base rates |
Interpretive Caveats
AUC-ROC can be overly optimistic when the negative class dominates.
AUC-PR gives more meaningful insight for imbalanced datasets, but is more volatile and harder to interpret.
Neither AUC metric defines an optimal threshold — for deployment, threshold tuning must be contextualized.
Calibration rescales the scores and therefore changes the precision, recall, and F1 obtained at any fixed threshold, but it does not affect the rank-based AUC-ROC.
Metric conflicts are common: one model may outperform in AUC but underperform in F1.
Fairness and subgroup analysis are essential: A model may perform well overall, yet exhibit bias in subgroup-specific metrics.
Threshold Selection Logic
When computing confusion matrices, selecting the right classification threshold can significantly impact the output. The function show_confusion_matrix is documented in this section.
1. If the custom_threshold parameter is passed, it takes absolute precedence and is used directly.
2. If model_threshold is set and the model contains a threshold dictionary, the function will try to retrieve the threshold using the score parameter:
   - If score is passed (e.g., "f1"), then model.threshold[score] is used.
   - If score is not passed, the function will look up the first item in model.scoring (if available).
3. If neither a custom threshold nor a valid model threshold is available, the default value of 0.5 is used.
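The following is a minimal sketch of the precedence rules above, written as a standalone helper. It is illustrative only and does not reproduce the library's actual implementation; the model attributes threshold (a dict) and scoring (a list) are assumed exactly as described in the list.

```python
def resolve_threshold(model, custom_threshold=None, model_threshold=False, score=None):
    """Illustrative re-statement of the documented threshold precedence."""
    # 1. An explicit custom_threshold always wins.
    if custom_threshold is not None:
        return custom_threshold

    # 2. Otherwise, try to read a tuned threshold off the model.
    if model_threshold and getattr(model, "threshold", None):
        if score is not None:
            return model.threshold[score]       # e.g. model.threshold["f1"]
        scoring = getattr(model, "scoring", None)
        if scoring:
            return model.threshold[scoring[0]]  # first configured scoring metric

    # 3. Fall back to the conventional default.
    return 0.5
```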
Calibration Trade-offs
Calibration curves are powerful diagnostic tools for assessing how well a model’s predicted probabilities reflect actual outcomes. However, their interpretation—and the methods used to derive them—come with important caveats that users should keep in mind.
Calibration Methodology
The examples shown in this library are based on models calibrated using Platt Scaling, a post-processing technique that fits a sigmoid function to the model’s prediction scores. Platt Scaling assumes a parametric form:

\[P(y = 1 \mid \hat{y}) = \frac{1}{1 + \exp\left(A\hat{y} + B\right)}\]
where \(A\) and \(B\) are scalar parameters learned using a separate calibration dataset. This approach is computationally efficient and works well for models such as SVMs and Logistic Regression, where prediction scores are linearly separable or approximately log-odds in nature.
However, Platt Scaling may underperform when the relationship between raw scores and true probabilities is non-monotonic or highly irregular.
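As a point of reference, Platt Scaling is available in scikit-learn through CalibratedClassifierCV with method="sigmoid". The sketch below is illustrative only (synthetic data, arbitrary base model) and is not tied to this library's or model_tuner's API.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# method="sigmoid" is scikit-learn's implementation of Platt Scaling.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(X_tr, y_tr)

prob_pos = calibrated.predict_proba(X_val)[:, 1]
frac_positives, mean_predicted = calibration_curve(y_val, prob_pos, n_bins=10)
print(list(zip(mean_predicted.round(2), frac_positives.round(2))))
```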
Alternative Calibration Methods
An alternative to Platt Scaling is Isotonic Regression, a non-parametric method that fits a monotonically increasing function to the model’s prediction scores. It is particularly effective when the mapping between predicted probabilities and observed outcomes is complex or non-linear.
Mathematically, with samples indexed so that the prediction scores are in increasing order, isotonic regression solves the following constrained optimization problem:

\[\min_{\hat{p}_1, \dots, \hat{p}_n} \sum_{i=1}^{n} \left(y_i - \hat{p}_i\right)^2 \quad \text{subject to} \quad \hat{p}_1 \leq \hat{p}_2 \leq \dots \leq \hat{p}_n\]
Here:
\(y_i \in \{0, 1\}\) are the true binary labels,
\(\hat{p}_i\) are the calibrated probabilities corresponding to the model’s scores,
and the constraint enforces monotonicity, preserving the order of the original prediction scores.
The solution is obtained using the Pool Adjacent Violators Algorithm (PAVA), an efficient method for enforcing monotonicity in a least-squares fit.
While Isotonic Regression is highly flexible and can model arbitrary step-like functions, this same flexibility increases the risk of overfitting, especially when the calibration dataset is small, imbalanced, or noisy. It may capture spurious fluctuations in the validation data rather than the true underlying relationship between scores and outcomes.
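For comparison with the Platt Scaling sketch above, a minimal isotonic calibration example using scikit-learn's IsotonicRegression (which uses PAVA internally) is shown below. The scores and labels are synthetic and deliberately miscalibrated; this is a sketch, not this library's workflow.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw_scores = rng.uniform(size=1000)
# True probability of y=1 is score**2, so the raw scores are miscalibrated.
y_val = (rng.uniform(size=1000) < raw_scores**2).astype(int)

iso = IsotonicRegression(out_of_bounds="clip")  # monotone fit via PAVA
iso.fit(raw_scores, y_val)

calibrated = iso.predict(raw_scores)  # step-wise, order-preserving mapping
```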
Warning
Overfitting with isotonic regression can lead to miscalibration in deployment, particularly if the validation set is not representative of the production environment.
Note
This library does not perform calibration internally. Instead, users are expected to calibrate models during training or preprocessing, e.g., using the model_tuner library [1] or any external tool. All calibration curve plots included here are illustrative and assume models have already been calibrated using Platt Scaling prior to visualization.
Dependence on Validation Data
All calibration techniques rely heavily on the quality of the validation data used to learn the mapping. If the validation set is not representative of the target population, the resulting calibration curve may be misleading. This concern is especially important when deploying models in real-world settings where data drift or population imbalance may occur.
Interpreting the Brier Score
The Brier Score, often reported alongside calibration curves, provides a quantitative measure of probabilistic accuracy. It is defined as:

\[\text{Brier Score} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{p}_i - y_i\right)^2\]
where \(\hat{p}_i\) is the predicted probability and \(y_i\) is the actual class label. While a lower Brier Score generally indicates better performance, it conflates calibration (how close predicted probabilities are to actual outcomes) and refinement (how confidently predictions are made). Thus, the Brier Score should be interpreted in context and not relied upon in isolation.
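The score is straightforward to compute directly from the definition or via scikit-learn, as the short check below illustrates; the labels and probabilities are made up.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.9, 0.8, 0.3, 0.6])

manual = np.mean((y_prob - y_true) ** 2)        # definition above
print(manual, brier_score_loss(y_true, y_prob))  # both 0.062
```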
Partial Dependence Foundations
Let \(\mathbf{X}\) represent the complete set of input features for a machine learning model, where \(\mathbf{X} = \{X_1, X_2, \dots, X_p\}\). Suppose we’re particularly interested in a subset of these features, denoted by \(\mathbf{X}_S\). The complementary set, \(\mathbf{X}_C\), contains all the features in \(\mathbf{X}\) that are not in \(\mathbf{X}_S\). Mathematically, this relationship is expressed as:

\[\mathbf{X}_C = \mathbf{X} \setminus \mathbf{X}_S\]
where \(\mathbf{X}_C\) is the set of features in \(\mathbf{X}\) after removing the features in \(\mathbf{X}_S\).
Partial Dependence Plots (PDPs) are used to illustrate the effect of the features in \(\mathbf{X}_S\) on the model’s predictions, while averaging out the influence of the features in \(\mathbf{X}_C\). This is mathematically defined as:

\[\text{PD}_{\mathbf{X}_S}(x_S) = \mathbb{E}_{\mathbf{X}_C}\left[ f(x_S, \mathbf{X}_C) \right] = \int f(x_S, x_C)\, p(x_C)\, dx_C\]
where:
\(\mathbb{E}_{\mathbf{X}_C} \left[ \cdot \right]\) indicates that we are taking the expected value over the possible values of the features in the set \(\mathbf{X}_C\).
\(p(x_C)\) represents the probability density function of the features in \(\mathbf{X}_C\).
This operation effectively summarizes the model’s output over all potential values of the complementary features, providing a clear view of how the features in \(\mathbf{X}_S\) alone impact the model’s predictions.
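A minimal sketch of this averaging for a single feature of interest, assuming a fitted model with a predict method and a NumPy feature matrix X, might look like the following; the helper name and grid handling are illustrative, not part of this library.

```python
import numpy as np

def partial_dependence_1d(model, X, feature_idx, grid):
    """Average model predictions while the chosen feature is fixed at each grid value."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value                   # fix X_S at the grid value
        pd_values.append(model.predict(X_mod).mean())   # average over X_C
    return np.array(pd_values)
```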
2D Partial Dependence Plots
Consider a trained machine learning model \(f(\mathbf{X})\), where \(\mathbf{X} = (X_1, X_2, \dots, X_p)\) represents the vector of input features. The partial dependence of the predicted response \(\hat{y}\) on a single feature \(X_j\) is defined as:

\[\text{PD}(X_j) = \frac{1}{n} \sum_{i=1}^{n} f\left(X_j, \mathbf{X}_{C_i}\right)\]
where:
\(X_j\) is the feature of interest.
\(\mathbf{X}_{C_i}\) represents the complement set of \(X_j\), meaning the remaining features in \(\mathbf{X}\) not included in \(X_j\) for the \(i\)-th instance.
\(n\) is the number of observations in the dataset.
For two features, \(X_j\) and \(X_k\), the partial dependence is given by:

\[\text{PD}(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} f\left(X_j, X_k, \mathbf{X}_{C_i}\right)\]
This results in a 2D surface plot (or contour plot) that shows how the predicted outcome changes as the values of \(X_j\) and \(X_k\) vary, while the effects of the other features are averaged out.
Single Feature PDP: When plotting \(\text{PD}(X_j)\), the result is a 2D line plot showing the marginal effect of feature \(X_j\) on the predicted outcome, averaged over all possible values of the other features.
Two Features PDP: When plotting \(\text{PD}(X_j, X_k)\), the result is a 3D surface plot (or a contour plot) that shows the combined marginal effect of \(X_j\) and \(X_k\) on the predicted outcome. The surface represents the expected value of the prediction as \(X_j\) and \(X_k\) vary, while all other features are averaged out.
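Both plot types can be produced with scikit-learn's PartialDependenceDisplay, as in the sketch below; the dataset and model choice are illustrative assumptions, not recommendations from this library.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=1000, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# One single-feature PDP (line plot) and one two-feature PDP (contour plot).
PartialDependenceDisplay.from_estimator(model, X, features=[0, (0, 1)])
plt.show()
```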
3D Partial Dependence Plots
For a more comprehensive analysis, especially when exploring interactions between two features, 3D Partial Dependence Plots are invaluable. The partial dependence function for two features in a 3D context is:

\[\text{PD}(X_j, X_k) = \frac{1}{n} \sum_{i=1}^{n} f\left(X_j, X_k, \mathbf{X}_{C_i}\right)\]
Here, the function \(f(X_j, X_k, \mathbf{X}_{C_i})\) is evaluated across a grid of values for \(X_j\) and \(X_k\). The resulting 3D surface plot represents how the model’s prediction changes over the joint range of these two features.
The 3D plot offers a more intuitive visualization of feature interactions compared to 2D contour plots, allowing for a better understanding of the combined effects of features on the model’s predictions. The surface plot is particularly useful when you need to capture complex relationships that might not be apparent in 2D.
Feature Interaction Visualization: The 3D PDP provides a comprehensive view of the interaction between two features. The resulting surface plot allows for the visualization of how the model’s output changes when the values of two features are varied simultaneously, making it easier to understand complex interactions.
Enhanced Interpretation: 3D PDPs offer enhanced interpretability in scenarios where feature interactions are not linear or where the effect of one feature depends on the value of another. The 3D visualization makes these dependencies more apparent.
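A 3D surface of this kind can be assembled from scikit-learn's partial_dependence output together with Matplotlib. The sketch below assumes a recent scikit-learn release (where the evaluation grid is returned under the grid_values key) and uses the same synthetic data and model family as the previous sketch.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=1000, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Joint partial dependence of features 0 and 1 on a 25x25 grid.
pd_result = partial_dependence(model, X, features=[(0, 1)], grid_resolution=25)
XX, YY = np.meshgrid(pd_result["grid_values"][0], pd_result["grid_values"][1])
ZZ = pd_result["average"][0].T  # transpose so the surface aligns with the meshgrid

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(XX, YY, ZZ, cmap="viridis")
ax.set_xlabel("X_0")
ax.set_ylabel("X_1")
ax.set_zlabel("Partial dependence")
plt.show()
```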