Model Performance Summaries
Summarizes model performance metrics for classification and regression models.
- summarize_model_performance(model=None, X=None, y_prob=None, y_pred=None, y=None, model_type='classification', model_threshold=None, model_title=None, custom_threshold=None, score=None, return_df=False, overall_only=False, decimal_places=3, group_category=None, include_adjusted_r2=False)
- Parameters:
model (estimator, list, or None) – Trained model or list of trained models. If None, y_prob or y_pred must be provided.
X (array-like or None) – Feature matrix used for evaluation. Required if model is provided without precomputed predictions.
y_prob (array-like, list, or None) – Predicted probabilities for classification models. Can be provided instead of model and X.
y_pred (array-like, list, or None) – Predicted labels (classification) or continuous predictions (regression). Can be provided instead of model and X.
y (array-like) – True target values.
model_type (str, optional) – Specifies whether the model is for classification or regression. Must be either "classification" or "regression".
model_threshold (float, dict, or None, optional) – Classification decision thresholds. Can be a float or a dict keyed by model name. Ignored if custom_threshold is provided.
custom_threshold (float or None, optional) – Overrides all model thresholds with a fixed value. If set, excludes the "Model Threshold" row.
model_title (str, list, or None, optional) – Custom model names to display in output. Defaults to inferred names like Model_1, Model_2, etc.
score (str or None, optional) – Optional custom scoring metric for threshold resolution.
return_df (bool, optional) – If True, returns results as a pandas.DataFrame instead of printing.
overall_only (bool, optional) – For regression models, if True, returns only overall metrics (without coefficients or feature importances).
decimal_places (int, optional) – Number of decimal places for rounding metric values. Defaults to 3.
group_category (str, array-like, or None, optional) – Optional grouping variable for classification metrics. Can be a column name in X or an array matching the length of y.
include_adjusted_r2 (bool, optional) – For regression models, if True, computes and includes adjusted R-squared. Requires both model and X to be provided.
- Returns:
If return_df=True:
Classification (no groups): metrics as rows, models as columns.
Classification (grouped): metrics as rows, groups as columns.
Regression: rows for metrics, coefficients, and/or feature importances.
If return_df=False, prints a formatted performance summary using manual layout logic.
- Return type:
pandas.DataFrame or None
- Raises:
If model_type is invalid.
If overall_only=True is used for classification models.
If neither (model and X) nor (y_prob or y_pred) is provided.
Important
You can supply either model with X, or precomputed y_prob / y_pred directly.
When using precomputed predictions, the function bypasses model inference.
Group-level metrics are only available for classification tasks using group_category.
If custom_threshold is specified, it overrides all model thresholds.
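For illustration, a minimal, self-contained sketch of the precomputed-probability path described above (the dataset, classifier, and variable names here are purely for demonstration and are not part of the library):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from model_metrics import summarize_model_performance

# Small synthetic problem purely for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Precomputed probabilities passed via y_prob, so model and X can be omitted
summary = summarize_model_performance(
    y_prob=clf.predict_proba(X_demo)[:, 1],
    y=y_demo,
    model_title="Demo Logistic Regression",
    model_type="classification",
    return_df=True,
)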
Notes
- Classification Models:
Computes precision, recall, specificity, F1-score, AUC ROC, Brier score, and average precision.
Supports per-group metric computation when group_category is provided.
Grouped outputs automatically use group names as table headers and maintain metric order (with "Model Threshold" appearing last).
Works with multiple models, custom thresholds, or precomputed probabilities.
- Regression Models:
Computes MAE, MAPE, MSE, RMSE, Explained Variance, R², and optionally Adj. R² (if include_adjusted_r2=True).
Extracts coefficients, intercepts, and feature importances (if available).
- Preserves original manual formatting block:
Maintains right-aligned column layout and visual separators for readability.
Ensures coefficients and intercepts are displayed consistently across models.
Provides clear model breaks and retains console formatting identical to previous releases.
overall_only=True limits output to a single summary row per model.
- Output Behavior:
return_df=False: prints a fully formatted summary preserving manual alignment.
return_df=True: returns a structured DataFrame suitable for further analysis or visualization.
Metrics are rounded to the specified decimal_places for clarity.
The summarize_model_performance function provides a structured evaluation of classification and regression models, generating key performance metrics. For classification models, it computes precision, recall, specificity, F1-score, and AUC ROC. For regression models, it extracts coefficients and evaluates error metrics like MSE, RMSE, and R². The function allows specifying custom thresholds, metric rounding, and formatted display options.
Below are two examples demonstrating how to evaluate multiple models using summarize_model_performance. The function calculates and presents metrics for classification and regression models.
Binary Classification Models
This section introduces binary classification using two widely used machine learning models: Logistic Regression and Random Forest Classifier.
These examples show how to prepare and train models on a synthetic dataset, setting the
stage for evaluating their performance with model_metrics in subsequent sections.
Both models use a default classification threshold of 0.5, where predictions are
classified as positive (1) if the predicted probability exceeds 0.5, and negative (0)
otherwise.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Generate a synthetic dataset
X, y = make_classification(
n_samples=1000,
n_features=10,
random_state=42,
)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
)
# Train models
model1 = LogisticRegression(random_state=42).fit(X_train, y_train)
model2 = RandomForestClassifier(random_state=42).fit(X_train, y_train)
model_titles = ["Logistic Regression", "Random Forest"]
Binary Classification Example 1: Default Threshold
from model_metrics import summarize_model_performance
model_performance = summarize_model_performance(
model=[model1, model2],
model_title=model_titles,
X=X_test,
y=y_test,
model_type="classification",
return_df=True,
)
model_performance
Output
| Metrics | Logistic Regression | Random Forest |
|---|---|---|
| Precision/PPV | 0.867 | 0.914 |
| Average Precision | 0.937 | 0.968 |
| Sensitivity/Recall | 0.820 | 0.865 |
| Specificity | 0.843 | 0.899 |
| F1-Score | 0.843 | 0.889 |
| AUC ROC | 0.913 | 0.952 |
| Brier Score | 0.118 | 0.083 |
| Model Threshold | 0.500 | 0.500 |
Binary Classification Example 2: Custom Threshold
In this example, we revisit binary classification with the same two models, Logistic
Regression and Random Forest, but adjust the classification threshold (via the
custom_threshold argument) from the default 0.5 to 0.2. This change allows us to
explore how lowering the threshold impacts model performance, potentially increasing
sensitivity (recall) by classifying more instances as positive (1) at the expense of
precision.
from model_metrics import summarize_model_performance
model_performance = summarize_model_performance(
model=[model1, model2],
model_title=model_titles,
X=X_test,
y=y_test,
model_type="classification",
return_df=True,
custom_threshold=0.2,
)
model_performance
Output
| Metrics | Logistic Regression | Random Forest |
|---|---|---|
| Precision/PPV | 0.803 | 0.814 |
| Average Precision | 0.937 | 0.968 |
| Sensitivity/Recall | 0.919 | 0.946 |
| Specificity | 0.719 | 0.730 |
| F1-Score | 0.857 | 0.875 |
| AUC ROC | 0.913 | 0.952 |
| Brier Score | 0.118 | 0.083 |
| Model Threshold | 0.200 | 0.200 |
Binary Classification Example 3: Adult Income Data
In this third binary classification example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich combination of categorical and numerical features makes it particularly suitable for evaluating subgroup fairness and model performance across demographic segments.
In this example, we extend binary classification evaluation by introducing the
group_category parameter to assess how model performance varies across
different subpopulations. Specifically, we employ a Random Forest classifier
and incorporate the race column from the test set to form a combined dataset
that includes both predictive features and demographic information.
By passing this categorical variable to group_category, the function
computes and displays subgroup-level metrics side-by-side, including AUC,
precision, recall, and F1-score. This enables clear identification of potential
performance disparities across demographic groups (e.g., by race, gender, or
age category), offering valuable insights into fairness, equity, and subgroup
behavior within the model’s predictions.
from model_metrics import summarize_model_performance

# y_prob, model_titles, and model_thresholds come from the Adult Income
# workflow built with model_tuner (see the codebase linked in ROC AUC
# Example 7); y_prob[2] corresponds to the Random Forest model used here.
X_test_analysis = X_test.join(X["race"])

model_summary = summarize_model_performance(
    y_prob=y_prob[2],
    y=y_test,
    model_title=model_titles,
    model_threshold=model_thresholds,
    return_df=True,
    decimal_places=3,
    group_category=X_test_analysis["race"],
)
model_summary
Output
| Metrics | Amer-Indian-Eskimo | Asian-Pac-Islander | Black | Other | White |
|---|---|---|---|---|---|
| AUC ROC | 0.625 | 0.840 | 0.869 | 0.940 | 0.865 |
| Average Precision | 0.238 | 0.718 | 0.619 | 0.767 | 0.747 |
| Brier Score | 0.100 | 0.127 | 0.071 | 0.051 | 0.119 |
| F1-Score | 0.167 | 0.587 | 0.502 | 0.700 | 0.641 |
| Model Threshold | 0.300 | 0.300 | 0.300 | 0.300 | 0.300 |
| Precision/PPV | 0.125 | 0.533 | 0.475 | 0.700 | 0.646 |
| Sensitivity/Recall | 0.250 | 0.653 | 0.532 | 0.700 | 0.636 |
| Specificity | 0.831 | 0.808 | 0.923 | 0.962 | 0.880 |
Regression Models
In this section, we load the diabetes dataset [1] from scikit-learn, which includes
features like age and BMI, along with a target variable representing disease
progression. The data is then split with train_test_split into training and
testing sets using an 80/20 ratio to facilitate model assessment. We train a
Linear Regression model on unscaled data for a straightforward baseline, followed
by a Random Forest Regressor with 100 trees, also on unscaled data, to introduce a
more complex approach. Additionally, we train a Ridge Regression model using a
Pipeline that scales the features with StandardScaler before fitting,
incorporating regularization. These steps prepare the models for subsequent
evaluation and comparison using tools provided by the model_metrics library.
Models used in these regression examples:
Linear Regression: A foundational model trained on unscaled data, simple yet effective for baseline evaluation.
Ridge Regression: A regularized model with a Pipeline for scaling, perfect for testing stability and overfitting.
Random Forest Regressor: An ensemble of 100 trees on unscaled data, offering complexity for comparative analysis.
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
# Load dataset
diabetes = load_diabetes(as_frame=True)["frame"]
X = diabetes.drop(columns=["target"])
y = diabetes["target"]
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
)
# Train Linear Regression (on unscaled data)
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Train Random Forest Regressor (on unscaled data)
rf_model = RandomForestRegressor(
n_estimators=100,
random_state=42,
)
rf_model.fit(X_train, y_train)
# Train Ridge Regression (on scaled data)
ridge_model = Pipeline(
[
("scaler", StandardScaler()),
("estimator", Ridge(alpha=1.0)),
]
)
ridge_model.fit(X_train, y_train)
Regression Example 1: Linear, Ridge
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model],
model_title=["Linear Regression", "Ridge Regression"],
X=X_test,
y=y_test,
model_type="regression",
return_df=True,
decimal_places=2,
)
regression_metrics
The output below presents a detailed comparison of the performance and coefficients for two regression models: Linear Regression and Ridge Regression trained on the diabetes dataset. It includes overall metrics such as Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Explained Variance, and R² Score for each model, showing their predictive accuracy. Additionally, it lists the coefficients for each feature (e.g., age, bmi, s1–s6) in both models, highlighting how each variable contributes to the prediction.
Output
| Model | Metric | Variable | Coefficient | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 |
|---|---|---|---|---|---|---|---|---|---|
| Linear Regression | Overall Metrics | | | 42.79 | 37.5 | 2900.19 | 53.85 | 0.46 | 0.45 |
| Linear Regression | Coefficient | const | 151.35 | ||||||
| Linear Regression | Coefficient | age | 37.9 | ||||||
| Linear Regression | Coefficient | sex | -241.96 | ||||||
| Linear Regression | Coefficient | bmi | 542.43 | ||||||
| Linear Regression | Coefficient | bp | 347.7 | ||||||
| Linear Regression | Coefficient | s1 | -931.49 | ||||||
| Linear Regression | Coefficient | s2 | 518.06 | ||||||
| Linear Regression | Coefficient | s3 | 163.42 | ||||||
| Linear Regression | Coefficient | s4 | 275.32 | ||||||
| Linear Regression | Coefficient | s5 | 736.2 | ||||||
| Linear Regression | Coefficient | s6 | 48.67 | ||||||
| Ridge Regression | Overall Metrics | | | 42.81 | 37.45 | 2892.01 | 53.78 | 0.46 | 0.45 |
| Ridge Regression | Coefficient | const | 153.74 | ||||||
| Ridge Regression | Coefficient | age | 1.81 | ||||||
| Ridge Regression | Coefficient | sex | -11.45 | ||||||
| Ridge Regression | Coefficient | bmi | 25.73 | ||||||
| Ridge Regression | Coefficient | bp | 16.73 | ||||||
| Ridge Regression | Coefficient | s1 | -34.67 | ||||||
| Ridge Regression | Coefficient | s2 | 17.05 | ||||||
| Ridge Regression | Coefficient | s3 | 3.37 | ||||||
| Ridge Regression | Coefficient | s4 | 11.76 | ||||||
| Ridge Regression | Coefficient | s5 | 31.38 | ||||||
| Ridge Regression | Coefficient | s6 | 2.46 |
Regression Example 2: Linear, Ridge, RF (w/ Feature Importance)
In this Regression Example 2, we extend the analysis by introducing a Random Forest
Regressor alongside Linear Regression and Ridge Regression to demonstrate how a
model with feature importances, rather than coefficients, impacts evaluation outcomes.
The code uses the summarize_model_performance function from model_metrics to
assess all three models on the diabetes dataset’s test set, ensuring the Random Forest’s
feature importances are reflected in the results alongside the coefficient-based
results of the other models, as shown in the subsequent table.
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model, rf_model],
model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
X=X_test,
y=y_test,
model_type="regression",
return_df=True,
decimal_places=2,
)
regression_metrics
Output
| Model | Metric | Variable | Coefficient | Feat. Imp. | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Linear Regression | Overall Metrics | | | | 42.79 | 37.5 | 2900.19 | 53.85 | 0.46 | 0.45 |
| Linear Regression | Coefficient | const | 151.35 | |||||||
| Linear Regression | Coefficient | age | 37.90 | |||||||
| Linear Regression | Coefficient | sex | -241.96 | |||||||
| Linear Regression | Coefficient | bmi | 542.43 | |||||||
| Linear Regression | Coefficient | bp | 347.7 | |||||||
| Linear Regression | Coefficient | s1 | -931.49 | |||||||
| Linear Regression | Coefficient | s2 | 518.06 | |||||||
| Linear Regression | Coefficient | s3 | 163.42 | |||||||
| Linear Regression | Coefficient | s4 | 275.32 | |||||||
| Linear Regression | Coefficient | s5 | 736.2 | |||||||
| Linear Regression | Coefficient | s6 | 48.67 | |||||||
| Ridge Regression | Overall Metrics | | | | 42.81 | 37.45 | 2892.01 | 53.78 | 0.46 | 0.45 |
| Ridge Regression | Coefficient | const | 153.74 | |||||||
| Ridge Regression | Coefficient | age | 1.81 | |||||||
| Ridge Regression | Coefficient | sex | -11.45 | |||||||
| Ridge Regression | Coefficient | bmi | 25.73 | |||||||
| Ridge Regression | Coefficient | bp | 16.73 | |||||||
| Ridge Regression | Coefficient | s1 | -34.67 | |||||||
| Ridge Regression | Coefficient | s2 | 17.05 | |||||||
| Ridge Regression | Coefficient | s3 | 3.37 | |||||||
| Ridge Regression | Coefficient | s4 | 11.76 | |||||||
| Ridge Regression | Coefficient | s5 | 31.38 | |||||||
| Ridge Regression | Coefficient | s6 | 2.46 | |||||||
| Random Forest | Overall Metrics | | | | 44.05 | 40.01 | 2952.01 | 54.33 | 0.44 | 0.44 |
| Random Forest | Feat. Imp. | age | | 0.06 | | | | | | |
| Random Forest | Feat. Imp. | sex | | 0.01 | | | | | | |
| Random Forest | Feat. Imp. | bmi | | 0.36 | | | | | | |
| Random Forest | Feat. Imp. | bp | | 0.09 | | | | | | |
| Random Forest | Feat. Imp. | s1 | | 0.05 | | | | | | |
| Random Forest | Feat. Imp. | s2 | | 0.06 | | | | | | |
| Random Forest | Feat. Imp. | s3 | | 0.05 | | | | | | |
| Random Forest | Feat. Imp. | s4 | | 0.02 | | | | | | |
| Random Forest | Feat. Imp. | s5 | | 0.23 | | | | | | |
| Random Forest | Feat. Imp. | s6 | | 0.07 | | | | | | |
Regression Example 3: Adjusted R²
In some regression analyses, it is useful to report Adjusted R² in addition to standard error and variance metrics. Adjusted R² accounts for the number of predictors in the model and penalizes unnecessary complexity, making it more appropriate than R² when comparing models with different feature counts.
This example demonstrates how to include Adjusted R² in the output table by
setting include_adjusted_r2=True in the summarize_model_performance function.
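For reference, adjusted R² penalizes the ordinary R² by the number of predictors p relative to the sample size n. The following is a minimal sketch of that calculation (not the library's internal implementation), reusing the linear_model and test split defined above:
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Adjusted R^2 for the linear model on the diabetes test set (p = 10 features)
print(round(adjusted_r2(y_test, linear_model.predict(X_test), X_test.shape[1]), 2))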
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model, rf_model],
model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
X=X_test,
y=y_test,
model_type="regression",
include_adjusted_r2=True,
return_df=True,
decimal_places=2,
)
regression_metrics
The resulting table extends the standard regression metrics (MAE, MAPE, MSE, RMSE, Explained Variance, and R²) by adding an Adjusted R² column, enabling more informed model comparison when feature dimensionality differs.
Output
| Model | Metric | Variable | Coefficient | Feat. Imp. | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 | Adj. R^2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Linear Regression | Overall Metrics | | | | 42.79 | 37.5 | 2900.19 | 53.85 | 0.46 | 0.45 | 0.38 |
| Linear Regression | Coefficient | const | 151.35 | ||||||||
| Linear Regression | Coefficient | age | 37.90 | ||||||||
| Linear Regression | Coefficient | sex | -241.96 | ||||||||
| Linear Regression | Coefficient | bmi | 542.43 | ||||||||
| Linear Regression | Coefficient | bp | 347.7 | ||||||||
| Linear Regression | Coefficient | s1 | -931.49 | ||||||||
| Linear Regression | Coefficient | s2 | 518.06 | ||||||||
| Linear Regression | Coefficient | s3 | 163.42 | ||||||||
| Linear Regression | Coefficient | s4 | 275.32 | ||||||||
| Linear Regression | Coefficient | s5 | 736.2 | ||||||||
| Linear Regression | Coefficient | s6 | 48.67 | ||||||||
| Ridge Regression | Overall Metrics | | | | 42.81 | 37.45 | 2892.01 | 53.78 | 0.46 | 0.45 | 0.38 |
| Ridge Regression | Coefficient | const | 153.74 | ||||||||
| Ridge Regression | Coefficient | age | 1.81 | ||||||||
| Ridge Regression | Coefficient | sex | -11.45 | ||||||||
| Ridge Regression | Coefficient | bmi | 25.73 | ||||||||
| Ridge Regression | Coefficient | bp | 16.73 | ||||||||
| Ridge Regression | Coefficient | s1 | -34.67 | ||||||||
| Ridge Regression | Coefficient | s2 | 17.05 | ||||||||
| Ridge Regression | Coefficient | s3 | 3.37 | ||||||||
| Ridge Regression | Coefficient | s4 | 11.76 | ||||||||
| Ridge Regression | Coefficient | s5 | 31.38 | ||||||||
| Ridge Regression | Coefficient | s6 | 2.46 | ||||||||
| Random Forest | Overall Metrics | | | | 44.05 | 40.01 | 2952.01 | 54.33 | 0.44 | 0.44 | 0.37 |
| Random Forest | Feat. Imp. | age | | 0.06 | | | | | | | |
| Random Forest | Feat. Imp. | sex | | 0.01 | | | | | | | |
| Random Forest | Feat. Imp. | bmi | | 0.36 | | | | | | | |
| Random Forest | Feat. Imp. | bp | | 0.09 | | | | | | | |
| Random Forest | Feat. Imp. | s1 | | 0.05 | | | | | | | |
| Random Forest | Feat. Imp. | s2 | | 0.06 | | | | | | | |
| Random Forest | Feat. Imp. | s3 | | 0.05 | | | | | | | |
| Random Forest | Feat. Imp. | s4 | | 0.02 | | | | | | | |
| Random Forest | Feat. Imp. | s5 | | 0.23 | | | | | | | |
| Random Forest | Feat. Imp. | s6 | | 0.07 | | | | | | | |
Regression Example 4: Overall Results
In some scenarios, you may want to simplify the output by excluding variables,
coefficients, and feature importances from the model results. This example
demonstrates how to achieve that by setting overall_only=True in the
summarize_model_performance function, producing a concise table that
focuses on key metrics: model name, Mean Absolute Error (MAE),
Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE),
Root Mean Squared Error (RMSE), Explained Variance, and R² Score.
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model, rf_model],
model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
X=X_test,
y=y_test,
model_type="regression",
overall_only=True,
return_df=True,
decimal_places=2,
)
regression_metrics
Output
| Model | Metric | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 |
|---|---|---|---|---|---|---|---|
| Linear Regression | Overall Metrics | 42.79 | 37.50 | 2900.19 | 53.85 | 0.46 | 0.45 |
| Ridge Regression | Overall Metrics | 42.81 | 37.45 | 2892.01 | 53.78 | 0.46 | 0.45 |
| Random Forest | Overall Metrics | 44.05 | 40.01 | 2952.01 | 54.33 | 0.44 | 0.44 |
Lift Charts
This section illustrates how to assess and compare the ranking effectiveness of classification models using Lift Charts, a valuable tool for evaluating how well a model prioritizes positive instances relative to random chance. Leveraging the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset introduced in the Binary Classification Models section, we plot Lift curves to visualize their relative ability to surface high-value (positive) cases at the top of the prediction list.
A Lift Chart plots the ratio of actual positives identified by the model compared to what would be expected by random selection, across increasingly larger proportions of the sample sorted by predicted probability. The baseline (Lift = 1) represents random chance; curves that rise above this line demonstrate the model’s ability to “lift” positive outcomes toward the top ranks. This makes Lift Charts especially useful in applications like marketing, fraud detection, and risk stratification where targeting the top segment of predictions can yield outsized value. See the mathematical definition of Lift here.
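To make the definition concrete, here is a minimal NumPy sketch of cumulative lift computed from predicted probabilities under the standard definition above; it is an illustration, not the library's internal code:
import numpy as np

def cumulative_lift(y_true, y_prob):
    """Cumulative lift at each sample fraction, ranked by descending probability."""
    order = np.argsort(y_prob)[::-1]
    y_sorted = np.asarray(y_true)[order]
    k = np.arange(1, len(y_sorted) + 1)
    hit_rate_top_k = np.cumsum(y_sorted) / k    # positive rate within the top-k
    baseline_rate = y_sorted.mean()             # overall positive rate (random selection)
    return k / len(y_sorted), hit_rate_top_k / baseline_rate

# Example with the models trained above:
# fractions, lift = cumulative_lift(y_test, model2.predict_proba(X_test)[:, 1])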
The show_lift_chart function enables flexible creation of Lift Charts for one or more
models. It supports single-plot overlays, subplot layouts, and detailed customization of
labels, titles, and styling. Designed for both exploratory analysis and stakeholder
presentation, this utility helps users better understand model ranking performance
across the population.
- show_lift_chart(model, X, y, y_prob=None, xlabel='Percentage of Sample', ylabel='Lift', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, subplots=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, legend_loc='best')
- Parameters:
model (object or list[object]) – A trained model or a list of models. Each must implement predict_proba to estimate class probabilities. Can be omitted if y_prob is provided.
X (pd.DataFrame or np.ndarray) – Feature matrix used to generate predictions. Required if using model. Ignored if y_prob is provided.
y (pd.Series or np.ndarray) – True binary labels corresponding to the input samples.
y_prob (array-like or list[array-like], optional) – Predicted probabilities for classification models. Can be provided instead of model and X.
xlabel (str, optional) – Label for the x-axis. Defaults to "Percentage of Sample".
ylabel (str, optional) – Label for the y-axis. Defaults to "Lift".
model_title (str or list[str], optional) – Custom display names for the models. Can be a string or list of strings.
overlay (bool, optional) – If True, overlays all model lift curves into a single plot. Defaults to False.
title (str, optional) – Title for the plot or subplot. Set to "" to suppress the title. Defaults to None.
save_plot (bool, optional) – Whether to save the chart(s) to disk. Defaults to False.
image_path_png (str, optional) – Output path for saving PNG image(s).
image_path_svg (str, optional) – Output path for saving SVG image(s).
text_wrap (int, optional) – Maximum number of characters before wrapping titles. If None, no wrapping is applied.
curve_kwgs (dict[str, dict] or list[dict], optional) – Dictionary or list of dictionaries for customizing the lift curve(s) (e.g., color, linewidth).
linestyle_kwgs (dict, optional) – Styling for the baseline (random lift) reference line. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.
subplots (bool, optional) – Whether to show each model in a subplot grid. Cannot be combined with overlay=True.
n_cols (int, optional) – Number of columns in the subplot layout. Defaults to 2.
n_rows (int, optional) – Number of rows in the subplot layout. If None, automatically inferred.
figsize (tuple[int, int], optional) – Tuple specifying the size of the figure in inches. Defaults to (8, 6).
label_fontsize (int, optional) – Font size for x/y-axis labels and titles. Defaults to 12.
tick_fontsize (int, optional) – Font size for tick marks and legend text. Defaults to 10.
gridlines (bool, optional) – Whether to display gridlines in plots. Defaults to True.
legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like 'best', 'upper right', 'lower left', etc., or 'bottom' to place the legend below the plot. Defaults to 'best'.
- Returns:
None. Displays or saves lift charts for the specified classification models or probability inputs.
- Return type:
None
- Raises:
If overlay=True and subplots=True are both set.
Important
You can supply either model and X, or pass y_prob directly.
When using y_prob, the function bypasses model predictions and uses the provided probabilities for lift chart calculation.
Supports single-model or multiple-model workflows, with either model objects or arrays of pre-computed probabilities.
Notes
- What is a Lift Chart?
Lift quantifies how much better a model is at identifying positive cases compared to random selection.
The x-axis represents the proportion of the population (from highest to lowest predicted probability).
The y-axis shows the cumulative lift, calculated as the ratio of observed positives to expected positives under random selection.
- Interpreting Lift Curves:
A higher and steeper curve indicates a stronger model.
The horizontal dashed line at y = 1 is the baseline for random performance.
Curves that drop sharply or flatten may indicate poor ranking ability.
- Layout Options:
Use overlay=True to visualize all models on a single axis.
Use subplots=True for a side-by-side layout of lift charts.
If neither is set, each model gets its own full-sized chart.
- Customization:
Customize the appearance of each model's curve using curve_kwgs.
Modify the baseline reference line with linestyle_kwgs.
Control title wrapping and font sizes via text_wrap, label_fontsize, and tick_fontsize.
- Saving Plots:
If save_plot=True, figures are saved as <model_title>_lift.png/svg or overlay_lift.png/svg.
Lift Chart Example 1: Subplot Layout
In this first Lift Chart example, we evaluate and compare the ranking performance
of two classification models: Logistic Regression and Random Forest Classifier trained
on the synthetic dataset from the Binary Classification Models section. The chart displays Lift curves for both models in a
two-column subplot layout (n_cols=2, n_rows=1), enabling side-by-side comparison
of how effectively each model prioritizes positive cases.
Each plot shows the model's Lift across increasing portions of the test set, with a red dashed line at Lift = 1 indicating the baseline (random performance). Curves above this line reflect the model's ability to identify more positives than would be expected by chance. The Random Forest produces a steeper initial lift, demonstrating greater concentration of positive cases near the top-ranked predictions.
The show_lift_chart function allows for rich customization, including plot dimensions, axis font sizes, and curve styling. In this example, we set the line colors and widths for both models; plots can also be exported in PNG and SVG formats (via save_plot=True with image_path_png / image_path_svg) for reporting or documentation.
from model_metrics import show_lift_chart
show_lift_chart(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "red", "linestyle": "--"},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
subplots=True,
)
Output
Lift Chart Example 2: Overlay
This example overlays Lift curves from two classification models: Logistic Regression and Random Forest Classifier on a single plot for direct visual comparison. Both models were trained on the same synthetic dataset from the Binary Classification Models section, and their lift performance is evaluated on the shared test set.
The Lift curve shows how many more positive outcomes are captured by the model at each quantile compared to a random baseline. A horizontal dashed red line at Lift = 1 represents random selection; curves above this line indicate effective ranking of positive cases. Overlaying curves makes it easier to assess which model better concentrates true positives near the top of the prediction list.
Using the overlay=True option, the show_lift_chart function generates a clean,
unified plot. Each curve is styled with linewidth=2 for clarity, and all axis
elements and tick marks are sized for presentation-quality output. This layout
is particularly helpful for slide decks, performance reports, or model selection
discussions.
from model_metrics import show_lift_chart
show_lift_chart(
model=[model1, model2],
X=X_test,
y=y_test,
overlay=True,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "red", "linestyle": "--", "linewidth": 2},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
)
Output
Gain Charts
This section explores how to evaluate the cumulative performance of classification models in identifying positive outcomes using gain charts. These charts are especially effective at showing the model’s ability to concentrate the correct (positive) predictions in the top-ranked portion of the dataset. Using the same Logistic Regression and Random Forest Classifier models trained on the synthetic dataset introduced in the Binary Classification Models section, we demonstrate how to plot and compare Gain Curves across models.
A gain chart shows the cumulative percentage of actual positive cases captured as we move through the population sorted by predicted probability. Unlike the Lift Chart, which displays the ratio of model performance over baseline, the Gain Chart directly shows the percentage of positives captured, providing a more intuitive sense of how effective a model is at identifying positives early in the ranked list. See the mathematical definition of Gain here.
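As an illustration of that definition (not the library's internal code), the cumulative gain at each sample fraction can be computed directly from predicted probabilities:
import numpy as np

def cumulative_gain(y_true, y_prob):
    """Fraction of all positives captured within the top-k ranked samples."""
    order = np.argsort(y_prob)[::-1]    # sort by descending predicted probability
    y_sorted = np.asarray(y_true)[order]
    fractions = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
    return fractions, np.cumsum(y_sorted) / y_sorted.sum()

# Example with the models trained above:
# fractions, gain = cumulative_gain(y_test, model2.predict_proba(X_test)[:, 1])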
The show_gain_chart function supports single or multiple models, with options to
overlay all gain curves in a single plot or display them in a flexible subplot layout.
Labels, title wrapping, curve styles, and saving output images are all customizable,
making this function well-suited for both development analysis and final reporting.
- show_gain_chart(model, X, y, y_prob=None, xlabel='Percentage of Sample', ylabel='Cumulative Gain', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, subplots=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, legend_loc='best', show_gini=False, decimal_places=3)
- Parameters:
model (object or list[object]) – A trained classifier or list of classifiers. Each model must support predict_proba unless y_prob is supplied directly.
X (pd.DataFrame or np.ndarray) – The feature matrix used for prediction. Required if model is provided.
y (pd.Series or np.ndarray) – Ground truth binary labels.
y_prob (list, array, or None, optional) – Predicted probabilities for classification models. Can be provided instead of model and X.
xlabel (str, optional) – Label for the x-axis. Defaults to "Percentage of Sample".
ylabel (str, optional) – Label for the y-axis. Defaults to "Cumulative Gain".
model_title (str or list[str], optional) – Custom display names for each model. If None, defaults to sequential names.
overlay (bool, optional) – If True, overlay all models on a single axis. Mutually exclusive with subplots.
title (str, optional) – Plot or subplot title. Set to "" to suppress the title.
save_plot (bool, optional) – Whether to save the chart(s) to disk.
image_path_png (str, optional) – Output path for saving PNG image(s).
image_path_svg (str, optional) – Output path for saving SVG image(s).
text_wrap (int, optional) – Max characters before title wrapping. Set to None to disable.
curve_kwgs (dict[str, dict] or list[dict], optional) – Dict or list of kwargs per model to customize line style.
linestyle_kwgs (dict, optional) – Styling for the random baseline. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.
subplots (bool, optional) – Whether to render a subplot layout. Cannot be used with overlay.
n_cols (int, optional) – Columns in the subplot layout. Defaults to 2.
n_rows (int, optional) – Rows in the subplot layout. If None, inferred automatically.
figsize (tuple[int, int], optional) – Figure size (width, height) in inches.
label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for tick marks and legends.
gridlines (bool, optional) – Whether to show gridlines on the plots.
legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like 'best', 'upper right', 'lower left', etc., or 'bottom' to place the legend below the plot. Defaults to 'best'.
show_gini (bool, optional) – Whether to display the Gini coefficient in the legend. Defaults to False.
decimal_places (int, optional) – Number of decimal places for displaying the Gini coefficient. Defaults to 3.
- Returns:
None. Displays or saves Gain Charts for one or more models.
- Return type:
None
- Raises:
If overlay=True and subplots=True are both set.
Important
You can supply either model and X, or pass y_prob directly.
When using y_prob, the function bypasses model predictions and plots cumulative gains directly from the provided probabilities.
Supports single-model or multiple-model workflows, with either model objects or arrays of pre-computed probabilities.
Notes
- What is a Gain Chart?
Plots the cumulative percentage of positives captured vs. sample size.
The x-axis shows the fraction of the sample, ranked by predicted probability.
The y-axis shows what percentage of the total positives have been captured.
- Why use Gain Charts?
Gain Charts help answer: "If I contact the top X% of predictions, how many positives will I catch?"
Especially useful in marketing, lead scoring, risk management, and fraud detection.
- Reading Gain Curves:
Curves that rise steeply and plateau early indicate better model performance.
The dashed baseline (diagonal line) represents random selection.
- Layout Options:
Use overlay=True to combine all gain curves into a single plot.
Use subplots=True for a subplot layout per model.
If neither is set, plots will be rendered individually.
- Styling Options:
Customize individual model lines via curve_kwgs.
Modify the diagonal baseline line using linestyle_kwgs.
Adjust fonts and wrapping for presentation clarity.
- Saving Output:
Enable save_plot=True to save figures as PNG and/or SVG.
Files are named using the model title (e.g., Model_1_gain.png or overlay_gain.svg).
Gain Chart Example 1: Subplot Layout
In this first Gain Chart example, we compare the cumulative gain performance of two classification models: Logistic Regression and Random Forest Classifier trained on the synthetic dataset from the Binary Classification Models section. This visualization showcases their ability to identify positive instances across different percentiles of the ranked test data.
Each subplot presents the cumulative gain achieved as a function of the percentage of the sample, sorted by descending predicted probability. The grey dashed line represents the baseline (random gain). A model that identifies a high proportion of positive cases in the early part of the ranking will have a steeper and higher curve. In this example, the Random Forest model outpaces Logistic Regression, indicating better early identification of positives.
The show_gain_chart function allows flexible styling and layout control. This example uses a subplot
configuration, a wider figure size, and customized line widths and colors; the figure can also be saved
(via save_plot=True) for documentation or stakeholder presentations.
from model_metrics import show_gain_chart
show_gain_chart(
model=[model1, model2],
X=X_test,
y=y_test,
figsize=(12, 6),
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "grey", "linestyle": "--"},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
subplots=True,
)
Output
Gain Chart Example 2: Displaying Gini Coefficients
This example demonstrates how to include Gini coefficients directly in the gain chart legends using
the show_gini=True parameter. The Gini coefficient is a summary statistic derived from the area
under the gain curve (AUGC), calculated as 2 × AUGC - 1, and ranges from 0 to 1 where higher values
indicate better model discrimination.
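As a rough illustration of that relationship (using the trapezoidal rule for the area under the gain curve; this is a sketch, not necessarily the library's exact implementation):
import numpy as np

def gini_from_gain(y_true, y_prob):
    """Gini = 2 * AUGC - 1, with AUGC estimated by the trapezoidal rule."""
    order = np.argsort(y_prob)[::-1]
    y_sorted = np.asarray(y_true)[order]
    n = len(y_sorted)
    # Gain curve anchored at (0, 0)
    fractions = np.concatenate(([0.0], np.arange(1, n + 1) / n))
    gain = np.concatenate(([0.0], np.cumsum(y_sorted) / y_sorted.sum()))
    augc = np.sum((gain[1:] + gain[:-1]) / 2 * np.diff(fractions))
    return 2 * augc - 1

# gini_rf = gini_from_gain(y_test, model2.predict_proba(X_test)[:, 1])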
Both models, Logistic Regression and Random Forest Classifier, were trained on the
synthetic dataset from the Binary Classification Models section.
By enabling show_gini=True (and optionally setting decimal_places=3), each model’s legend
entry automatically displays its Gini coefficient, providing both visual and quantitative performance
comparison in a single view.
The Gini coefficient complements the visual gain curve by offering a single number that summarizes discriminative power. In this example, both the curve shape and the Gini value help identify which model better concentrates positive cases at the top of the predicted ranking. This is particularly useful in presentations, model selection discussions, and performance reporting where stakeholders need both graphical intuition and numeric metrics.
from model_metrics import show_gain_chart
show_gain_chart(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "red", "linestyle": "--"},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
subplots=True,
show_gini=True,
decimal_places=3,
)
Output
Gini coefficient for Logistic Regression: 0.374
Gini coefficient for Random Forest: 0.410
Gain Chart Example 3: Overlay
This example overlays Gain curves from two classification models: Logistic Regression and Random Forest Classifier on a single plot to enable direct visual comparison of their cumulative gain performance. Both models were trained on the same synthetic dataset from the Binary Classification Models section and evaluated on the same test set.
The Gain curve shows the cumulative proportion of true positives captured as you move through the population, ranked by predicted probability. A diagonal baseline line from (0, 0) to (1, 1) indicates the expected performance of a random model. Curves that rise above this line demonstrate superior model ability to concentrate positive cases near the top of the ranked list.
By setting overlay=True, the show_gain_chart function produces a single,
easy-to-read plot containing both models’ gain curves. Each curve is styled
with linewidth=2 for clear visibility. Overlay layouts are ideal for model
selection discussions, presentations, and performance dashboards.
from model_metrics import show_gain_chart
show_gain_chart(
model=[model1, model2],
X=X_test,
y=y_test,
overlay=True,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "red", "linestyle": "--", "linewidth": 2},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
)
Output
ROC AUC Curves
This section demonstrates how to evaluate the performance of binary classification models using ROC AUC curves, a key metric for assessing the trade-off between true positive and false positive rates. Using the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset from the previous (Binary Classification Models) section, we generate ROC curves to visualize their discriminatory power.
ROC AUC (Receiver Operating Characteristic Area Under the Curve) provides a
single scalar value representing a model’s ability to distinguish between
positive and negative classes, with a value of 1 indicating perfect classification
and 0.5 representing random guessing. The curves are plotted by varying the
classification threshold and calculating the true positive rate (sensitivity)
against the false positive rate (1-specificity). This makes ROC AUC particularly
useful for comparing models like Logistic Regression, which relies on linear
decision boundaries, and Random Forest Classifier, which leverages ensemble
decision trees, especially when class imbalances or threshold sensitivity are
concerns. The show_roc_curve function simplifies this process, enabling
users to visualize and compare these curves effectively, setting the stage for
detailed performance analysis in subsequent examples.
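For orientation, the quantities plotted by show_roc_curve can also be reproduced directly with scikit-learn; a minimal sketch using the models from the Binary Classification Models section:
from sklearn.metrics import roc_curve, roc_auc_score

# ROC points and AUC for the Logistic Regression model trained above
y_prob_lr = model1.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob_lr)
print(f"AUC (Logistic Regression): {roc_auc_score(y_test, y_prob_lr):.3f}")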
The show_roc_curve function provides a flexible and powerful way to visualize
the performance of binary classification models using Receiver Operating Characteristic
(ROC) curves. Whether you’re comparing multiple models, evaluating subgroup fairness,
or preparing publication-ready plots, this function allows full control over layout,
styling, and annotations. It supports single and multiple model inputs, optional overlay
or subplot layouts, and group-wise comparisons via a categorical feature. Additional options
allow custom axis labels, AUC precision, curve styling, and export to PNG/SVG.
Designed to be both user-friendly and highly configurable, show_roc_curve
is a practical tool for model evaluation and stakeholder communication.
- show_roc_curve(model=None, X=None, y_prob=None, y=None, xlabel='False Positive Rate', ylabel='True Positive Rate', model_title=None, decimal_places=2, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, subplots=False, n_rows=None, n_cols=2, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, group_category=None, delong=None, show_operating_point=False, operating_point_method='youden', operating_point_kwgs=None, legend_loc='lower right')
- Parameters:
model (estimator, list[estimator], or str) – A trained estimator, list of estimators, or placeholders (strings). If y_prob is provided directly, model may be None.
X (pd.DataFrame or np.ndarray) – Feature data for prediction. Required when model objects are provided and y_prob is not.
y_prob (array-like or list[array-like], optional) – Predicted probabilities for the positive class. Can be a single array or a list of arrays corresponding to multiple models.
y (pd.Series or np.ndarray) – True binary target labels for ROC evaluation.
xlabel (str, optional) – Label for the x-axis.
ylabel (str, optional) – Label for the y-axis.
model_title (str or list[str], optional) – Custom model title(s). Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.
decimal_places (int, optional) – Number of decimal places for rounding AUC values.
overlay (bool, optional) – Whether to overlay multiple models in a single plot. Cannot be used with subplots or group_category.
title (str, optional) – Title for the plot or subplots. If "", disables titles entirely.
save_plot (bool, optional) – Whether to save plots to disk.
image_path_png (str, optional) – Path to save PNG images.
image_path_svg (str, optional) – Path to save SVG images.
text_wrap (int, optional) – Maximum width before wrapping long titles.
curve_kwgs (list[dict] or dict[str, dict], optional) – Style parameters for ROC curves. Accepts a list of dicts or a nested dict keyed by model titles.
linestyle_kwgs (dict, optional) – Style dictionary for the random guess (diagonal) line.
subplots (bool, optional) – Whether to arrange plots in a grid layout. Cannot be used with overlay=True or group_category.
n_rows (int, optional) – Number of rows for the subplot grid. Calculated automatically if None.
n_cols (int, optional) – Number of columns for the subplot grid.
figsize (tuple, optional) – Figure size (width, height) in inches.
label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for tick labels and legend.
gridlines (bool, optional) – Whether to display gridlines.
group_category (array-like, optional) – Categorical variable to group ROC curves (e.g., by sex or race). Cannot be used with overlay or subplots.
delong (tuple or list[array-like], optional) – Tuple or list containing two predicted probability arrays for Hanley and McNeil's parametric AUC comparison. Cannot be used with group_category.
show_operating_point (bool, optional) – Whether to display an optimal operating point on the ROC curve.
operating_point_method (str, optional) – Method used to compute the operating point. Supported options are "youden" and "closest_topleft".
operating_point_kwgs (dict, optional) – Styling options for the operating point marker (passed to matplotlib.scatter).
legend_loc (str, optional) – Legend location. Standard matplotlib locations or "bottom" to place the legend below the plot.
- Returns:
None. Displays or saves ROC curve plots.
- Return type:
None
- Raises:
If both subplots=True and overlay=True are set.
If group_category is used with overlay or subplots.
If overlay=True is used with only one model.
If delong is used when group_category is provided.
If neither (model and X) nor y_prob is supplied.
If operating_point_method is not one of the supported options.
Important
You can provide either model with X, or directly pass y_prob.
overlay and subplots are mutually exclusive.
group_category groups ROC curves by unique category values.
The delong parameter performs a correlated ROC comparison using the Hanley and McNeil parametric approximation.
Notes
- Flexible Inputs
model and model_title can be scalars or lists.
Strings in model act as placeholders when using precomputed probabilities.
Curve styling is controlled via curve_kwgs and linestyle_kwgs.
- Operating Point Visualization
When show_operating_point=True, the optimal threshold is computed and displayed on the ROC curve.
The operating point is labeled in the legend with its threshold value.
- Group-wise ROC
When group_category is provided, ROC curves are computed per group.
Legends display AUC, total count, positive count, and negative count.
- Plot Modes
overlay=True: All ROC curves on a single plot.
subplots=True: Each model plotted in a grid layout.
Default behavior produces one plot per model.
- Saving Plots
If save_plot=True, figures are saved to the specified paths.
Filenames are generated automatically based on model name and plot mode.
The show_roc_curve function provides flexible and highly customizable
plotting of ROC curves for binary classification models. It supports overlays,
subplot layouts, and subgroup visualizations, while also allowing export options
and styling hooks for publication-ready output.
ROC AUC Example 1: Subplot Layout
In this first ROC AUC evaluation example, we plot the ROC curves for two
models: Logistic Regression and Random Forest Classifier, trained on the
synthetic dataset from the Binary Classification Models section. The curves are displayed side by side
using a subplot layout (n_cols=2, n_rows=1), with the Logistic Regression curve
in blue and the Random Forest curve in black for clear differentiation.
A red dashed line represents the random guessing baseline. This example
demonstrates how the show_roc_curve function enables straightforward
visualization of model performance, with options to customize colors,
add a grid, and save the plot for reporting purposes.
from model_metrics import show_roc_curve
show_roc_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
decimal_places=2,
n_cols=2,
n_rows=1,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
linestyle_kwgs={"color": "red", "linestyle": "--"},
subplots=True,
)
Output
ROC AUC Example 2: Overlay
In this second ROC AUC evaluation example, we focus on overlaying the results of
two models: Logistic Regression and Random Forest Classifier, trained on the
synthetic dataset from the Binary Classification Models section onto a single plot. Using the show_roc_curve
function with the overlay=True parameter, the ROC curves for both models are
displayed together, with Logistic Regression in blue and Random Forest in black,
both with a linewidth=2. A red dashed line serves as the random guessing
baseline, and the plot includes a custom title for clarity.
from model_metrics import show_roc_curve
show_roc_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
decimal_places=2,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
linestyle_kwgs={"color": "red", "linestyle": "--"},
title="ROC Curves: Logistic Regression and Random Forest",
overlay=True,
)
Output
ROC AUC Example 3: DeLong’s Test
In this third ROC AUC evaluation example, we demonstrate how to statistically
compare the performance of two correlated models using Hanley & McNeil’s
parametric AUC comparison (an approximation of DeLong’s test). We utilize the
Logistic Regression and Random Forest Classifier models trained on the
synthetic dataset from the Binary Classification Models section. By passing their predicted probabilities to the
delong parameter of the show_roc_curve function, we can assess whether
the difference in AUC between the two models is statistically significant. This
is particularly useful when models are evaluated on the same
dataset, as it accounts for the inherent correlation in their predictions.
from model_metrics import show_roc_curve
show_roc_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
decimal_places=2,
delong=[model1.predict_proba(X_test)[:, 1], model2.predict_proba(X_test)[:, 1]],
)
Output
AUC for Logistic Regression: 0.91
Hanley & McNeil AUC comparison (Approximation of DeLong's Test):
Logistic Regression AUC = 0.913
Random Forest AUC = 0.952
p-value = 0.0557
ROC AUC Example 4: Hanley & McNeil AUC Test
In this fourth ROC AUC evaluation example, we focus on the results of two models: Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section.
Performs a large-sample z-test for the difference between two correlated AUCs, based on Hanley & McNeil (1982).
- hanley_mcneil_auc_test(y_true, y_scores_1, y_scores_2, model_names=None, verbose=True, return_values=False, decimal_places=4)
- Parameters:
y_true (array-like) – True binary class labels.
y_scores_1 (array-like) – Predicted probabilities or decision scores from the first model.
y_scores_2 (array-like) – Predicted probabilities or decision scores from the second model.
model_names (list or tuple of str, optional) – Optional model names for printed output. Defaults to ("Model 1", "Model 2") if not provided.
verbose (bool, optional) – Whether to print the formatted AUC comparison and p-value summary. Defaults to True.
return_values (bool, optional) – Whether to return the numerical results (auc1, auc2, p_value) for programmatic access instead of just printing them. Defaults to False.
decimal_places (int, optional) – Number of decimal places for printed AUC and p-value. Defaults to 4.
- Returns:
Tuple of floats (auc1, auc2, p_value) if return_values=True. Otherwise, prints the results and returns None.
- Return type:
tuple or None
Important
This test compares two correlated ROC curves evaluated on the same set of samples.
It is a parametric approximation of DeLong’s nonparametric test, as described in [4].
The p-value tests the null hypothesis that the two AUCs are equal.
Notes
- Formula Overview:
The standard error (SE) is computed using Hanley & McNeil's approximation, based on the sample counts and the AUC of the first model.
The z-statistic is then computed as z = (auc1 - auc2) / SE.
The two-sided p-value is derived as p = 2 * (1 - norm.cdf(|z|)).
- Typical Use Case:
Use this function when comparing two models trained and tested on the same dataset to evaluate whether their ROC-AUCs differ significantly.
The function is particularly useful within pipelines or visualization utilities such as show_roc_curve() when the delong argument is provided.
- Integration Example:
This test can be used independently or embedded in a plotting function to provide AUC significance testing.
Example:
from model_metrics import hanley_mcneil_auc_test
# Compare two models' ROC-AUC scores
hanley_mcneil_auc_test(
y_test,
model1.predict_proba(X_test)[:, 1],
model2.predict_proba(X_test)[:, 1],
model_names=["Logistic Regression", "Random Forest"],
verbose=True,
decimal_places=6,
)
Output
Hanley & McNeil AUC Comparison (Approximation of DeLong's Test):
Logistic Regression AUC = 0.912643
Random Forest AUC = 0.951969
p-value = 0.054494
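For intuition, the sketch below walks through the arithmetic outlined in the Formula Overview above, assuming Hanley & McNeil's (1982) single-curve standard-error approximation for SE; it is an illustration, not the library's exact implementation.
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley & McNeil (1982) standard error of a single AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return np.sqrt(var)

# z-test for the difference between two AUCs evaluated on the same test set
y_prob_1 = model1.predict_proba(X_test)[:, 1]
y_prob_2 = model2.predict_proba(X_test)[:, 1]
auc1, auc2 = roc_auc_score(y_test, y_prob_1), roc_auc_score(y_test, y_prob_2)
n_pos, n_neg = int(np.sum(y_test == 1)), int(np.sum(y_test == 0))
se = hanley_mcneil_se(auc1, n_pos, n_neg)   # SE based on the first model's AUC
z = (auc1 - auc2) / se
p_value = 2 * (1 - norm.cdf(abs(z)))
print(f"AUC1 = {auc1:.4f}, AUC2 = {auc2:.4f}, p-value = {p_value:.4f}")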
ROC AUC Example 5: Operating Point Using Youden’s J
In this fifth ROC AUC evaluation example, we focus on the results of two models: Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section.
The objective of this example is to identify and visualize an optimal operating point on the ROC curve using Youden's J statistic, defined as J = sensitivity + specificity − 1 (equivalently, TPR − FPR).
This criterion selects the threshold that maximizes the vertical distance between the ROC curve and the random-guess diagonal, providing a balanced tradeoff between sensitivity and specificity. More information on this method can be found here.
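A minimal sketch of how such a threshold can be located from the ROC curve points (illustrative only; show_roc_curve computes and annotates this for you):
import numpy as np
from sklearn.metrics import roc_curve

# Youden's J: pick the threshold that maximizes TPR - FPR
y_prob_rf = model2.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob_rf)
best_idx = np.argmax(tpr - fpr)
print(f"Optimal threshold (Youden's J): {thresholds[best_idx]:.3f}")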
The show_roc_curve function supports this directly via the
show_operating_point and operating_point_method parameters.
In the example below, we compute the ROC curves for the Logistic Regression and Random Forest models and annotate the optimal operating point determined by Youden's J statistic.
from model_metrics import show_roc_curve
show_roc_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
decimal_places=2,
show_operating_point=True,
operating_point_method="youden",
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
linestyle_kwgs={"color": "red", "linestyle": "--"},
operating_point_kwgs={
"marker": "o",
"color": "red",
"s": 100,
},
)
When enabled, the operating point is plotted directly on the ROC curve and annotated in the legend with its corresponding decision threshold.
Output
ROC AUC Example 6: Closest to Top Left
In this sixth example, we demonstrate an alternative method for identifying an optimal operating point on the ROC curve using the closest-to-top-left criterion. Like Youden’s J statistic, this approach seeks a balanced threshold, but instead of maximizing the vertical distance from the diagonal, it minimizes the Euclidean distance to the ideal point (0, 1) in ROC space.
The closest-to-top-left method finds the threshold that minimizes the Euclidean distance sqrt((1 − TPR)² + FPR²) to the point (0, 1).
This geometric criterion is particularly useful when you want to prioritize proximity to perfect classification (top-left corner) rather than maximizing the difference between true positive and false positive rates.
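As with Youden's J, the selection can be sketched directly from the ROC curve points (illustrative, not the library's internal code):
import numpy as np
from sklearn.metrics import roc_curve

# Closest-to-top-left: minimize the distance to the ideal point (FPR=0, TPR=1)
y_prob_rf = model2.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob_rf)
distances = np.sqrt((1 - tpr) ** 2 + fpr ** 2)
best_idx = np.argmin(distances)
print(f"Optimal threshold (closest to top-left): {thresholds[best_idx]:.3f}")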
In this ROC AUC evaluation example, we focus on the results of two models: Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section.
The show_roc_curve function supports this method through the operating_point_method parameter
by setting it to "closest_topleft". In the example below, we compute the ROC curves for the
Logistic Regression and Random Forest classifiers and annotate the optimal operating point
using the closest-to-top-left criterion.
from model_metrics import show_roc_curve
show_roc_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
decimal_places=2,
show_operating_point=True,
subplots=True,
operating_point_method="closest_topleft",
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
linestyle_kwgs={"color": "red", "linestyle": "--"},
operating_point_kwgs={
"marker": "o",
"color": "red",
"s": 100,
},
)
ROC AUC Example 7: by Category
In this seventh ROC AUC evaluation example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner library [3].
Click here to view the corresponding codebase for this workflow.
The objective here is to assess ROC AUC scores not just overall, but across each category of a selected feature, such as occupation, education, marital-status, or race. This approach enables deeper insight into how performance varies by subgroup, which is particularly important for fairness, bias detection, and subgroup-level interpretability.
The show_roc_curve function supports this analysis through the
group_category parameter.
For example, by passing group_category=X_test_2["race"],
you can generate a separate ROC curve for each unique racial group in the dataset:
from model_metrics import show_roc_curve
show_roc_curve(
model=model_rf["model"].estimator,
X=X_test,
y=y_test,
model_title="Random Forest Classifier",
decimal_places=2,
group_category=X_test_2["race"],
)
Output
Precision-Recall Curves
This section demonstrates how to evaluate the performance of binary classification models using Precision-Recall (PR) curves, a critical visualization for understanding model behavior in the presence of class imbalance. Using the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset from the previous (Binary Classification Models) section, we generate PR curves to examine how well each model identifies true positives while limiting false positives.
Precision-Recall curves focus on the trade-off between precision (positive predictive value) and recall (sensitivity) across different classification thresholds. This is particularly important when the positive class is rare, as is common in fraud detection, disease diagnosis, or adverse event prediction, because ROC AUC can overstate performance under imbalance. Unlike the ROC curve, the PR curve is sensitive to the proportion of positive examples and gives a clearer picture of how well a model performs where it matters most: in identifying the positive class.
The area under the Precision-Recall curve, also known as Average Precision (AP), summarizes model performance across thresholds. A model that maintains high precision as recall increases is generally more desirable, especially in settings where false positives have a high cost. This makes the PR curve a complementary and sometimes more informative tool than ROC AUC in skewed classification scenarios.
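The quantities plotted and reported by show_pr_curve map onto standard scikit-learn calls. The sketch below is illustrative only; it assumes y_test and positive-class probabilities y_prob from the earlier examples, and shows how Average Precision, the trapezoidal area under the PR curve, and the chance-level baseline are typically obtained.
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# y_test and y_prob assumed, e.g. y_prob = model1.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
ap = average_precision_score(y_test, y_prob)   # step-wise summary of the PR curve
aucpr = auc(recall, precision)                 # trapezoidal area under the PR curve
baseline = float(np.mean(y_test))              # prevalence = precision of a random classifier
print(f"AP = {ap:.3f}, AUCPR = {aucpr:.3f}, baseline precision = {baseline:.3f}")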
- show_pr_curve(model=None, X=None, y=None, y_prob=None, xlabel='Recall', ylabel='Precision', model_title=None, decimal_places=2, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, subplots=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, group_category=None, legend_metric='ap', legend_loc='lower left')
- Parameters:
model (object or str or list[object or str]) – A trained model, a string placeholder, or a list containing models or strings to evaluate. If
y_probis supplied directly,modelmay beNone.X (pd.DataFrame or np.ndarray) – Feature matrix used for prediction. Required when
modelobjects are supplied andy_probis not.y (pd.Series or np.ndarray) – True binary labels for evaluation.
y_prob (array-like or list[array-like], optional) – Predicted probabilities for one or multiple models (positive class, shape
(n_samples,)or list of such arrays). When provided, the function bypasses model prediction and uses these probabilities directly.xlabel (str, optional) – Label for the x-axis. Defaults to
"Recall".ylabel (str, optional) – Label for the y-axis. Defaults to
"Precision".model_title (str or list[str], optional) – Custom title(s) for the model(s). Can be a string or list of strings. If
None, defaults to"Model 1","Model 2", etc.decimal_places (int, optional) – Number of decimal places for Average Precision (AP) or AUCPR values. Defaults to
2.overlay (bool, optional) – Whether to overlay multiple models on a single plot. Defaults to
False.title (str, optional) – Title for the plot (used in overlay mode or as global title). If
"", disables the title. Defaults toNone.save_plot (bool, optional) – Whether to save the plot(s) to file. Defaults to
False.image_path_png (str, optional) – File path to save the plot(s) as PNG.
image_path_svg (str, optional) – File path to save the plot(s) as SVG.
text_wrap (int, optional) – Maximum character width before wrapping plot titles. If
None, no wrapping is applied.curve_kwgs (list[dict] or dict[str, dict], optional) – Plot styling for PR curves. Accepts a list of dictionaries or a nested dictionary keyed by model_title.
subplots (bool, optional) – Whether to organize the PR plots in a subplot layout. Cannot be used with
overlay=Trueorgroup_category.n_cols (int, optional) – Number of columns in the subplot layout. Defaults to
2.n_rows (int, optional) – Number of rows in the subplot layout. If
None, calculated automatically based on number of models and columns.figsize (tuple, optional) – Size of the plot or subplots, in inches. Defaults to
(8, 6).label_fontsize (int, optional) – Font size for axis labels and titles. Defaults to
12.tick_fontsize (int, optional) – Font size for ticks and legend text. Defaults to
10.gridlines (bool, optional) – Whether to display gridlines on plots. Defaults to
True.group_category (array-like, optional) – Categorical array to group PR curves. Cannot be used with
subplots=Trueoroverlay=True.legend_metric (str, optional) – Metric to display in the legend. Either
"ap"(Average Precision) or"aucpr"(area under the PR curve). Defaults to"ap".legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like
'lower left','upper right', etc., or'bottom'to place legend below the plot. Defaults to'lower left'.
- Returns:
None.Displays or saves Precision-Recall curve plots for classification models.- Return type:
None- Raises:
If
subplots=Trueandoverlay=Trueare both set.If
group_categoryis used withsubplots=Trueoroverlay=True.If
overlay=Trueis used with only one model.If
legend_metricis not one of"ap"or"aucpr".If neither (
modelandX) nory_probis provided.
If
model_titleis not a string, list of strings, orNone.
Important
You can supply either
modelandXor passy_probdirectly.When using
y_prob, the function bypasses model predictions and computes PR curves and metrics directly from the provided probabilities.Supports single-model or multi-model workflows with either model objects or arrays of pre-computed probabilities (provide a list for multiple curves).
Notes
- Flexible Inputs:
modelandmodel_titlecan be individual items or lists. Strings passed inmodelare treated as placeholder names.Titles can be automatically inferred or explicitly passed using
model_title.
- Group-Wise PR:
If
group_categoryis passed, separate PR curves are plotted for each unique group.The legend will include group-specific Average Precision and class distribution (e.g.,
AP = 0.78, Count: 500, Pos: 120, Neg: 380).
- Average Precision vs. AUCPR:
By default, the legend shows Average Precision (AP), which summarizes the PR curve with greater emphasis on the performance at higher precision levels.
If the user passes
legend_metric="aucpr", the legend will instead display AUCPR (Area Under the Precision-Recall Curve), which gives equal weight to all parts of the curve.
- Plot Modes:
overlay=Trueoverlays all models in one figure.subplots=Truearranges individual PR plots in a subplot layout.If neither is set, separate full-size plots are shown for each model.
- Legend and Styling:
A random classifier baseline (constant precision) is plotted by default.
Customize PR curves with
curve_kwgs.Titles can be disabled with
title="".
- Saving Plots:
If
save_plot=True, plots are saved using the base filename format<model_name>_precision_recalloroverlay_pr_plot.
The show_pr_curve function provides flexible and highly customizable plotting
of Precision-Recall curves for binary classification models. It supports overlays,
subplot layouts, and subgroup visualizations, while also allowing export options and
styling hooks for publication-ready output.
Precision-Recall Example 1: Subplot Layout
In this first Precision-Recall evaluation example, we plot the PR curves for two
models: Logistic Regression and Random Forest Classifier, both trained on the
synthetic dataset from the Binary Classification Models section.
The curves are arranged side by side using a subplot layout (n_cols=2, n_rows=1),
with the Logistic Regression curve rendered in blue and the Random Forest curve
in black to distinguish between the models. A gray dashed line indicates the baseline
precision, equal to the prevalence of the positive class in the dataset.
This example illustrates how the show_pr_curve function makes it easy to
visualize and compare model performance when dealing with class imbalance. It
also demonstrates layout flexibility and customization options, including gridlines,
label styling, and export functionality, making it suitable for both exploratory
analysis and final reporting.
from model_metrics import show_pr_curve
show_pr_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
decimal_places=2,
n_cols=2,
n_rows=1,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
subplots=True,
)
Output
Precision-Recall Example 2: Overlay
In this second Precision-Recall evaluation example, we overlay the results of two
models, Logistic Regression and Random Forest Classifier, trained on the synthetic
dataset from the Binary Classification Models section, onto a single plot. Using the show_pr_curve
function with the overlay=True parameter, the Precision-Recall curves for
both models are displayed together, with Logistic Regression in blue and Random
Forest in black, both with a linewidth=2. The plot includes a custom title
for clarity.
from model_metrics import show_pr_curve
show_pr_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
decimal_places=2,
n_cols=2,
n_rows=1,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
title="ROC Curves: Logistic Regression and Random Forest",
overlay=True,
)
Output
Precision-Recall Example 3: Categorical
In this third Precision-Recall evaluation example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner library [3].
Click here to view the corresponding codebase for this workflow.
The objective here is to assess Precision-Recall performance, summarized by Average Precision, not just overall, but across each category of a selected feature, such as occupation, education, marital-status, or race. This approach enables deeper insight into how performance varies by subgroup, which is particularly important for fairness, bias detection, and subgroup-level interpretability.
The show_pr_curve function supports this analysis through the
group_category parameter.
For example, by passing group_category=X_test_2["race"],
you can generate a separate Precision-Recall curve for each unique racial group in the dataset:
from model_metrics import show_pr_curve
show_pr_curve(
model=model_rf["model"].estimator,
X=X_test,
y=y_test,
model_title="Random Forest Classifier",
decimal_places=2,
group_category=X_test_2["race"],
)
Output
Confusion Matrix Evaluation
This section introduces the show_confusion_matrix function, which provides a
flexible, styled interface for generating and visualizing confusion matrices
across one or more classification models. It supports advanced features like
threshold overrides, subgroup labeling, classification report display, and fully
customizable plot aesthetics including subplot layouts.
The confusion matrix is a fundamental diagnostic tool for classification models, displaying the counts of true positives, true negatives, false positives, and false negatives. This function goes beyond standard implementations by allowing for custom thresholds (globally or per model), label annotation (e.g., TP, FP, etc.), plot exporting, colorbar toggling, and subplot visualization.
This is especially useful when comparing multiple models side-by-side or needing publication-ready confusion matrices for stakeholders.
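To make the thresholding behavior concrete, the sketch below shows how a probability vector is converted to class labels at a custom cutoff and tallied into a confusion matrix with scikit-learn. It mirrors what the function is documented to do when custom_threshold is set, but it is not the library’s internal code; model1, X_test, and y_test are assumed from the earlier examples.
from sklearn.metrics import confusion_matrix

# y_test assumed; positive-class probabilities from a fitted model
y_prob = model1.predict_proba(X_test)[:, 1]
threshold = 0.37                               # example custom cutoff
y_pred = (y_prob >= threshold).astype(int)     # probabilities -> hard labels
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")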
- show_confusion_matrix(model=None, X=None, y_prob=None, y=None, model_title=None, title=None, model_threshold=None, custom_threshold=None, class_labels=None, cmap='Blues', save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, figsize=(8, 6), labels=True, label_fontsize=12, tick_fontsize=10, inner_fontsize=10, subplots=False, score=None, class_report=False, show_colorbar=False, **kwargs)
- Parameters:
model (object or str or list[object or str], optional) – A single model (object or string), or a list of models or string placeholders. Can be
Noneify_predory_probis provided directly.X (pd.DataFrame or np.ndarray, optional) – Feature matrix used for prediction when
modelis provided. Ignored ify_predory_probare passed directly.y (pd.Series or np.ndarray) – True target labels.
y_prob (array-like, optional) – Predicted probabilities (positive class). If provided, thresholds will be applied to convert probabilities into class predictions.
model_title (str or list[str], optional) – Custom title(s) for each model. Can be a string or list of strings. If
None, defaults to"Model 1","Model 2", etc.title (str, optional) – Title for each plot. If
"", no title is displayed. IfNone, a default title is shown.model_threshold (dict, optional) – Dictionary of thresholds keyed by model title. Used if
custom_thresholdis not set.custom_threshold (float, optional) – Global override threshold to apply across all models. If set, takes precedence over
model_threshold.class_labels (list[str], optional) – Custom labels for the classes in the matrix.
cmap (str, optional) – Colormap to use for the heatmap. Defaults to
"Blues".save_plot (bool, optional) – Whether to save the generated plot(s).
image_path_png (str, optional) – Path to save the PNG version of the image.
image_path_svg (str, optional) – Path to save the SVG version of the image.
text_wrap (int, optional) – Maximum width of plot titles before wrapping.
figsize (tuple[int, int], optional) – Figure size in inches. Defaults to
(8, 6).labels (bool, optional) – Whether to annotate matrix cells with
TP,FP,FN,TN.label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for axis ticks.
inner_fontsize (int, optional) – Font size for numbers and labels inside cells.
subplots (bool, optional) – Whether to display multiple models in a subplot layout.
score (str, optional) – Scoring metric to use when optimizing threshold (if applicable).
class_report (bool, optional) – If
True, prints a classification report below each matrix.show_colorbar (bool, optional) – Whether to display the colorbar on the confusion matrix heatmap. Defaults to
False.kwargs (dict, optional) – Additional keyword arguments for customization (e.g.,
show_colorbar,n_cols).
- Returns:
None. Displays confusion matrix plots (and optionally saves them).- Return type:
None- Raises:
TypeError – If
model_titleis not a string, a list of strings, orNone.
Important
You can supply either
modelandXor passy_predory_probdirectly.When using
y_prob, the function applies thresholds (custom_thresholdormodel_threshold) to compute predicted labels for the confusion matrix.When using
y_pred, results are calculated directly from the provided predictions without thresholding.Supports single-model or multiple-model workflows, with either model objects or arrays of pre-computed predictions/probabilities.
Notes
- Model Support:
Supports single or multiple classification models.
model_titlemay be inferred automatically or provided explicitly.
- Threshold Handling:
Use
model_thresholdto specify per-model thresholds.custom_thresholdoverrides all other thresholds.
- Plotting Modes:
subplots=Truearranges plots in subplots.Otherwise, plots are displayed one at a time.
- Labeling:
Set
labels=Falseto disable annotating cells withTP,FP,FN,TN.Always shows raw numeric values inside cells.
- Colorbar & Styling:
Toggle colorbar via
show_colorbar(passed viakwargs).Colormap and font sizes are fully configurable.
- Exporting Plots:
Plots can be saved as both PNG and SVG using the respective paths.
Saved filenames follow the pattern
confusion_matrix_<model_name>orgrid_confusion_matrix.
Confusion Matrix Example 1: Threshold=0.5
In this first confusion matrix evaluation example, we show the results of two models, Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section, displayed side by side in a single figure using a subplot layout.
from model_metrics import show_confusion_matrix
show_confusion_matrix(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
cmap="Blues",
text_wrap=40,
subplots=True,
n_cols=2,
n_rows=1,
figsize=(6, 6),
)
Output
Confusion Matrix for Logistic Regression:
Predicted 0 Predicted 1
Actual 0 75 14
Actual 1 20 91
Confusion Matrix for Random Forest:
Predicted 0 Predicted 1
Actual 0 80 9
Actual 1 15 96
Confusion Matrix Example 2: Classification Report
This second confusion matrix evaluation example is nearly identical to the first,
but it uses a different color map (cmap="viridis"), sets show_colorbar=True, and enables
class_report=True to print a classification report for each model in addition to the visual output.
from model_metrics import show_confusion_matrix
show_confusion_matrix(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
cmap="viridis",
text_wrap=40,
subplots=True,
n_cols=2,
n_rows=1,
figsize=(6, 6),
show_colorbar=True,
class_report=True
)
Output
Confusion Matrix for Logistic Regression:
Predicted 0 Predicted 1
Actual 0 76 18
Actual 1 13 93
Classification Report for Logistic Regression:
precision recall f1-score support
0 0.85 0.81 0.83 94
1 0.84 0.88 0.86 106
accuracy 0.84 200
macro avg 0.85 0.84 0.84 200
weighted avg 0.85 0.84 0.84 200
Confusion Matrix for Random Forest:
Predicted 0 Predicted 1
Actual 0 84 10
Actual 1 3 103
Classification Report for Random Forest:
precision recall f1-score support
0 0.97 0.89 0.93 94
1 0.91 0.97 0.94 106
accuracy 0.94 200
macro avg 0.94 0.93 0.93 200
weighted avg 0.94 0.94 0.93 200
Confusion Matrix Example 3: Threshold = 0.37
In this third confusion matrix evaluation example using the synthetic dataset
from the Binary Classification Models section, we apply
a custom classification threshold of 0.37 using the custom_threshold parameter.
This overrides the default threshold of 0.5 and enables us to inspect how the
confusion matrices shift when a more lenient decision boundary is applied. Refer
to the section on threshold selection logic
for caveats on choosing the right threshold.
This is especially useful in imbalanced classification problems or cost-sensitive environments where the trade-off between precision and recall must be adjusted. By lowering the threshold, we increase the number of positive predictions, which can improve recall but may come at the cost of more false positives.
The output matrices for both models, Logistic Regression and Random Forest, are shown side by side in a subplot layout for easy visual comparison.
from model_metrics import show_confusion_matrix
show_confusion_matrix(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_titles,
text_wrap=40,
subplots=True,
n_cols=2,
n_rows=1,
figsize=(6, 6),
custom_threshold=0.37,
)
Output
Calibration Curves
This section focuses on calibration curves, a diagnostic tool that compares predicted probabilities to actual outcomes, helping evaluate how well a model’s predicted confidence aligns with observed frequencies. Using models like Logistic Regression or Random Forest on the synthetic dataset from the previous (Binary Classification Models) section, we generate calibration curves to assess the reliability of model probabilities.
Calibration is especially important in domains where probability outputs inform downstream decisions, such as healthcare, finance, and risk management. A well-calibrated model not only predicts the correct class but also outputs meaningful probabilities; for example, when a model predicts a probability of 0.7, we expect roughly 70% of such cases to actually be positive.
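Underneath, a reliability curve of this kind comes down to binning predictions and comparing the mean predicted probability to the observed positive rate in each bin. The sketch below computes the same quantities with scikit-learn utilities, assuming y_test and positive-class probabilities y_prob; it is purely illustrative and independent of show_calibration_curve.
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# y_test and y_prob assumed, e.g. y_prob from a fitted classifier's predict_proba
frac_positives, mean_predicted = calibration_curve(y_test, y_prob, n_bins=10)
brier = brier_score_loss(y_test, y_prob)   # lower values indicate better calibration
for pred, obs in zip(mean_predicted, frac_positives):
    print(f"mean predicted = {pred:.2f} | observed positive fraction = {obs:.2f}")
print(f"Brier score = {brier:.3f}")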
The show_calibration_curve function simplifies this process by allowing users to
visualize calibration performance across models or subgroups. The plots show the
mean predicted probabilities against the actual observed fractions of positive
cases, with an optional reference line representing perfect calibration.
Additional features include support for overlay or subplot layouts, subgroup
analysis by categorical features, and optional Brier score display, a scalar
measure of calibration quality.
The function offers full control over styling, figure layout, axis labels, and output format, making it easy to generate both exploratory and publication-ready plots.
- show_calibration_curve(model=None, X=None, y_prob=None, y=None, xlabel='Mean Predicted Probability', ylabel='Fraction of Positives', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, subplots=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, bins=10, marker='o', show_brier_score=True, brier_decimals=3, gridlines=True, linestyle_kwgs=None, group_category=None, legend_loc='best', **kwargs)
- Parameters:
model (estimator or list, optional) – A trained classifier or a list of classifiers to evaluate. Can be
Noneify_probis provided directly.X (pd.DataFrame or np.ndarray, optional) – Feature matrix used for predictions when
modelis supplied. Ignored ify_probis passed directly.y_prob (array-like or list of array-like, optional) – Predicted probabilities for the positive class. Can be provided directly instead of
modelandX.y (pd.Series or np.ndarray) – True binary target values.
xlabel (str, optional) – X-axis label. Defaults to
"Mean Predicted Probability".ylabel (str, optional) – Y-axis label. Defaults to
"Fraction of Positives".model_title (str or list[str], optional) – Custom title(s) for the models.
overlay (bool, optional) – If
True, overlays multiple models on one plot.title (str, optional) – Title for the plot. Use
""to suppress.save_plot (bool, optional) – Whether to save the plot(s).
image_path_png (str, optional) – Directory path for PNG export.
image_path_svg (str, optional) – Directory path for SVG export.
text_wrap (int, optional) – Max characters before title text wraps.
curve_kwgs (list[dict] or dict[str, dict], optional) – Styling options for the calibration curves.
subplots (bool, optional) – Whether to arrange models in a subplot layout.
n_cols (int, optional) – Number of columns in the subplot layout. Defaults to
2.n_rows (int, optional) – Number of rows in the subplot layout. Auto-calculated if
None.figsize (tuple, optional) – Figure size in inches (width, height).
label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for ticks and legend entries.
bins (int, optional) – Number of bins used to compute calibration.
marker (str, optional) – Marker style for calibration points.
show_brier_score (bool, optional) – Whether to display Brier score in the legend.
brier_decimals (int, optional) – Number of decimal places to display for the Brier score in legend labels. Defaults to 3.
gridlines (bool, optional) – Whether to show gridlines on plots.
linestyle_kwgs (dict, optional) – Styling for the “perfectly calibrated” reference line.
group_category (array-like, optional) – Categorical variable used to create subgroup calibration plots.
legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like
'best','upper right','lower left', etc., or'bottom'to place legend below the plot. Defaults to'best'.kwargs (dict, optional) – Additional keyword arguments passed to the plot function.
- Returns:
None.Displays or saves calibration plots for classification models.- Return type:
None- Raises:
If
overlay=Trueandsubplots=Trueare both set.If
group_categoryis used withoverlayorsubplots.If
curve_kwgslist does not match number of models.
If
model_titleis not a string, list of strings, Series, orNone.
Important
You can supply either
modelandXor passy_probdirectly.When using
y_prob, the function bypasses model predictions and evaluates calibration curves directly from the provided probabilities.brier_decimalscontrols the numeric precision of the Brier score shown in legend entries whenshow_brier_score=True.Supports single-model or multiple-model workflows, including arrays of pre-computed probabilities.
Notes
- Calibration vs Discrimination:
Calibration evaluates how well predicted probabilities reflect observed outcomes, while ROC AUC measures a model’s ability to rank predictions.
- Flexible Plotting Modes:
overlay=Trueplots multiple models on one figure.subplots=Truearranges plots in a subplot layout.If neither is set, individual full-size plots are created.
- Group-Wise Analysis:
Passing
group_categoryplots separate calibration curves by subgroup (e.g., age, race).Each subgroup’s Brier score is shown when
show_brier_score=True.
- Customization:
Use
curve_kwgsandlinestyle_kwgsto control styling.Add markers, gridlines, and custom titles to suit report or presentation needs.
- Saving Outputs:
Set
save_plot=Trueand specifyimage_path_pngorimage_path_svgto export figures.Filenames are auto-generated based on model name and plot type.
Important
Calibration curves are a valuable diagnostic tool for assessing the alignment between predicted probabilities and actual outcomes. By plotting the fraction of positives against predicted probabilities, we can evaluate how well a model’s confidence scores correspond to observed reality. While these plots offer important insights, it’s equally important to understand the assumptions and limitations behind the calibration methods used.
Calibration Curve Example 1: Subplots
This example presents calibration curves for two classification models trained on the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner library [3].
Click here to view the corresponding codebase for this workflow.
The classification models are displayed side by side in a subplot layout. Each
subplot shows how well the predicted probabilities from a model align with the
actual observed outcomes. A diagonal dashed line representing perfect calibration
is included in both plots, and Brier scores are shown in the legend to quantify
each model’s calibration accuracy.
By setting subplots=True, the function automatically arranges the individual plots
based on the number of models and specified columns. This layout is ideal for
visually comparing calibration behavior across models without overlapping lines.
pipelines_or_models = [
model_lr["model"].estimator,
model_rf["model"].estimator,
model_dt["model"].estimator,
]
# Model titles
model_titles = [
"Logistic Regression",
"Random Forest Classifier",
"Decision Tree Classifier",
]
from model_metrics import show_calibration_curve
show_calibration_curve(
model=pipelines_or_models[:2],
X=X_test,
y=y_test,
model_title=model_titles[:2],
text_wrap=50,
bins=10,
show_brier_score=True,
subplots=True,
linestyle_kwgs={"color": "black"},
)
Output
Calibration Curve Example 2: Overlay
This example also uses the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner library [3].
Click here to view the corresponding codebase for this workflow.
This example demonstrates how to overlay calibration curves from multiple classification
models in a single plot. Overlaying allows for direct visual comparison of how predicted
probabilities from each model align with actual outcomes on the same axes.
The diagonal dashed line represents perfect calibration, and Brier scores are included in the legend for each model, providing a quantitative measure of calibration accuracy.
By setting overlay=True, the function combines all model curves into one figure,
making it easier to evaluate relative performance without splitting across subplots.
pipelines_or_models = [
model_lr["model"].estimator,
model_rf["model"].estimator,
model_dt["model"].estimator,
]
# Model titles
model_titles = [
"Logistic Regression",
"Random Forest Classifier",
"Decision Tree Classifier",
]
from model_metrics import show_calibration_curve
show_calibration_curve(
model=pipelines_or_models,
X=X_test,
y=y_test,
model_title=model_titles,
bins=10,
show_brier_score=True,
overlay=True,
linestyle_kwgs={"color": "black"},
)
Output
Calibration Curve Example 3: by Category
This example, too, uses the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner library [3].
Click here to view the corresponding codebase for this workflow.
This example shows how to visualize calibration curves separately for each
category within a given feature, in this case, the race column of the joined
test set using a single Random Forest classifier. Each plot represents the
calibration behavior of the model for a specific subgroup, allowing for detailed
insight into how predicted probabilities align with actual outcomes across
demographic categories.
This type of disaggregated visualization is especially useful for fairness
analysis and subgroup performance auditing. By passing group_category=X_test_2["race"],
the function automatically detects the unique values in that column and
generates a separate calibration curve for each.
The dashed diagonal reference line represents perfect calibration. Brier scores are included in each plot to provide a quantitative measure of calibration performance within the group.
Note
When using group_category, both overlay and subplots must be set to
False. This ensures each group receives its own standalone figure, avoiding
conflicting layout behavior.
from model_metrics import show_calibration_curve
show_calibration_curve(
model=model_rf["model"].estimator,
X=X_test,
y=y_test,
model_title="Random Forest Classifier",
bins=10,
show_brier_score=True,
linestyle_kwgs={"color": "black"},
curve_kwgs={title: {"linewidth": 2} for title in model_titles},
group_category=X_test_2["race"],
)
Output
Threshold Metric Curves
This section introduces a powerful utility for exploring how classification thresholds affect key performance metrics, including Precision, Recall, F1 Score, and Specificity. Rather than fixing a threshold (commonly at 0.5), this function allows users to visualize trade-offs across the full range of possible thresholds, making it especially useful when optimizing for use-case-specific goals such as maximizing recall or achieving a minimum precision.
Using the Random Forest Classifier model trained on the Adult Income dataset [2], this tool helps users answer practical questions like:
What threshold achieves at least 85% precision?
Where does F1 score peak for this model?
How does specificity behave as the threshold increases?
The plot_threshold_metrics function supports optional threshold lookups via
lookup_metric and lookup_value, which print the closest threshold that
meets your constraint. Plots can be customized with colors, gridlines, line styles,
wrapped titles, and export options.
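The lookup idea amounts to scanning the chosen metric curve for the value nearest your target. The sketch below shows one way to do this with scikit-learn for a precision target, assuming y_test and y_prob; the packaged lookup may resolve ties and edge cases differently.
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_test and y_prob assumed; find the threshold whose precision is closest to 0.85
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
target_precision = 0.85
# precision/recall have one more entry than thresholds; drop the final point
idx = np.argmin(np.abs(precision[:-1] - target_precision))
print(f"Closest threshold: {thresholds[idx]:.4f} "
      f"(precision={precision[idx]:.3f}, recall={recall[idx]:.3f})")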
- plot_threshold_metrics(model=None, X_test=None, y_test=None, y_prob=None, title=None, text_wrap=None, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True, baseline_thresh=True, curve_kwgs=None, baseline_kwgs=None, threshold_kwgs=None, lookup_kwgs=None, save_plot=False, image_path_png=None, image_path_svg=None, lookup_metric=None, lookup_value=None, decimal_places=4, model_threshold=None)
Plot Precision, Recall, F1 Score, and Specificity as functions of the decision threshold.
This utility evaluates threshold-dependent classification metrics across the full range of thresholds. It supports highlighting a 0.5 baseline, an explicit model threshold, and a threshold located via a user-specified target metric value.
- Parameters:
model (object, optional) – A trained classification estimator used to produce probabilities when
y_probis not provided. Must supportpredict_probaif used.X_test (pd.DataFrame or np.ndarray, optional) – Feature matrix for evaluation. Required when
modelis supplied.y_test (pd.Series or np.ndarray) – True binary labels. Required.
y_prob (array-like, optional) – Pre-computed predicted probabilities for the positive class. If provided,
modelandX_testare not required.title (str, optional) – Custom title for the plot. If
"", disables the title. IfNone, a default title is shown.text_wrap (int, optional) – Maximum title width before wrapping. If
None, no wrapping is applied.figsize (tuple, optional) – Figure size (width, height) in inches. Defaults to
(8, 6).label_fontsize (int, optional) – Font size for axis labels and title. Defaults to
12.tick_fontsize (int, optional) – Font size for tick labels. Defaults to
10.gridlines (bool, optional) – Whether to show gridlines. Defaults to
True.baseline_thresh (bool, optional) – If
True, adds a reference line at threshold = 0.5.curve_kwgs (dict, optional) – Styling options applied to all metric curves (e.g.,
{"linestyle": "-", "linewidth": 1}).baseline_kwgs (dict, optional) – Styling options for the baseline (0.5) threshold line (default: black dotted line).
threshold_kwgs (dict, optional) – Styling options for the model threshold line when
model_thresholdis provided (default: black dotted line).lookup_kwgs (dict, optional) – Styling options for the lookup threshold line when
lookup_metric/lookup_valueare provided (default: gray dashed line).save_plot (bool, optional) – Whether to save the figure to file.
image_path_png (str, optional) – File path to save PNG output (used when
save_plot=True).image_path_svg (str, optional) – File path to save SVG output (used when
save_plot=True).lookup_metric (str, optional) – Metric used to locate a threshold closest to
lookup_value. One of"precision","recall","f1", or"specificity".lookup_value (float, optional) – Target value for
lookup_metric. Must be provided together withlookup_metric.decimal_places (int, optional) – Number of decimal places for printed threshold output(s). Defaults to
4.model_threshold (float, optional) – A model-specific threshold to highlight (vertical line). Useful if the model does not use 0.5.
- Returns:
None.Displays (and optionally saves) the threshold metrics plot.- Return type:
None- Raises:
If
y_testis not provided.If neither (
modelandX_test) nory_probis provided.If only one of
lookup_metricorlookup_valueis provided.
Important
You can supply either
modelandX_testor passy_probdirectly.When using
y_prob, the function bypasses model inference and uses the provided probabilities.
Notes
- Metric Curves:
Plots include
Precision,Recall,F1 Score, andSpecificityover threshold values.Useful for analyzing how changing the threshold alters model behavior.
- Threshold Lookup:
Set
lookup_metricandlookup_valueto find the closest threshold that meets your constraint.Prints result to console and highlights the corresponding vertical line.
- Styling Options:
Customize plot curves with
curve_kwgs.Adjust baseline style (e.g., at threshold = 0.5) via
baseline_kwgs.
- Three optional vertical guides are supported:
Baseline at 0.5 (
baseline_thresh=True),model_threshold(e.g., a tuned decision threshold),A threshold found by targeting
lookup_metricatlookup_value.
- Exporting:
Use
save_plot=Truewithimage_path_pngand/orimage_path_svgto save outputs.
- Interactivity:
Ideal for presentations or dashboards where visualizing threshold sensitivity is crucial.
Particularly helpful for domains like healthcare, fraud detection, or content moderation, where the cost of false positives vs. false negatives must be carefully managed.
Threshold Curves Example 1: Threshold=0.5
This example demonstrates how to plot threshold-dependent classification metrics using a Random Forest Classifier trained on the adult income dataset [2].
The plot_threshold_metrics function visualizes how Precision, Recall, F1 Score,
and Specificity change as the decision threshold varies. In this configuration,
the baseline threshold line at 0.5 is enabled (baseline_thresh=True),
and the line styling is customized via curve_kwgs. Font sizes and wrapping options
are adjusted for improved clarity in presentation-ready plots.
from model_metrics import plot_threshold_metrics
plot_threshold_metrics(
model=model_rf["model"].estimator,
X_test=X_test,
y_test=y_test,
baseline_thresh=True,
baseline_kwgs={
"color": "purple",
"linestyle": "--",
"linewidth": 2,
},
curve_kwgs={
"linestyle": "-",
"linewidth": 2,
},
text_wrap=40,
)
Output
Threshold Curves Example 2: Targeted Metric Lookup
This example expands on threshold-based classification metric visualization using
a targeted lookup scenario. Suppose a clinical stakeholder or domain expert has
determined (based on prior research, cost-benefit considerations, or operational
constraints) that a precision of approximately 0.879 is ideal for downstream
decision-making (e.g., minimizing false positives in a healthcare setting).
The plot_threshold_metrics function accepts the optional arguments lookup_metric
and lookup_value to help identify the threshold that best aligns with this target.
When these are set, the function automatically locates and highlights the threshold
that most closely achieves the desired metric value, offering transparency and
guidance for threshold tuning.
from model_metrics import plot_threshold_metrics
plot_threshold_metrics(
model=model_rf["model"].estimator,
X_test=X_test,
y_test=y_test,
lookup_metric="precision",
lookup_value=0.879,
baseline_thresh=False,
lookup_kwgs={
"color": "red",
"linestyle": "--",
"linewidth": 2,
},
curve_kwgs={
"linestyle": "-",
"linewidth": 2,
},
text_wrap=40,
)
Output
In this example:
lookup_metric="precision"specifies that we are targeting the precision curve.lookup_value=0.879provides the desired value for that metric.The function will search for the closest possible precision value along the threshold range and display a vertical line at that corresponding threshold.
The threshold value is printed to the console and included in the legend (e.g., Best Threshold: 0.6757).
Threshold Curves Example 3: Model-Specific Threshold
In many production settings, a classifier is deployed with a tuned decision threshold different from the default 0.5 (e.g., to balance costs of false positives vs. false negatives).
This example shows how to explicitly pass a model’s chosen threshold to be drawn as a vertical guide on the plot using model_threshold=....
You can do this whether you’re providing a model/X pair or pre-computed probabilities via y_prob. Below we show the latter.
# Get predicted probabilities for Random Forest model
y_prob_rf = model_rf["model"].estimator.predict_proba(X_test)[:, 1]
# Retrieve model thresholds
model_thresholds = {
"Logistic Regression": next(iter(model_lr["model"].threshold.values())),
"Decision Tree Classifier": next(iter(model_dt["model"].threshold.values())),
"Random Forest Classifier": next(iter(model_rf["model"].threshold.values())),
}
from model_metrics import plot_threshold_metrics
# Example: Use precomputed probabilities but still highlight the model's tuned threshold.
plot_threshold_metrics(
y_prob=y_prob_rf,  # precomputed probabilities for the positive class (computed above)
y_test=y_test, # ground-truth labels
baseline_thresh=False, # hide the default 0.5 guide
model_threshold=model_thresholds["Random Forest Classifier"],
threshold_kwgs={ # styling for the model-threshold vertical line
"color": "blue",
"linestyle": "--",
"linewidth": 1,
},
curve_kwgs={ # styling for metric curves
"linestyle": "-",
"linewidth": 1.25,
},
text_wrap=40,
)
Output
Note
model_threshold draws a labeled vertical line (e.g., Model Threshold: 0.6757), making it clear where the production decision point lies.
This is independent of baseline_thresh; you can enable both if you want to compare the tuned threshold vs. the default 0.5.
If you prefer to compute probabilities on-the-fly, pass the model and test features instead of y_prob, as in the sketch below:
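For instance, reusing the objects from the example above, the call might look like this; only the probability source changes, the styling stays the same.
plot_threshold_metrics(
    model=model_rf["model"].estimator,   # fitted classifier with predict_proba
    X_test=X_test,
    y_test=y_test,
    baseline_thresh=False,
    model_threshold=model_thresholds["Random Forest Classifier"],
    threshold_kwgs={"color": "blue", "linestyle": "--", "linewidth": 1},
    curve_kwgs={"linestyle": "-", "linewidth": 1.25},
    text_wrap=40,
)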
Residual Diagnostics
Residual diagnostics are essential tools for evaluating regression model performance beyond standard metrics like R² or RMSE. By examining the patterns in residuals: the differences between observed and predicted values, we can identify violations of modeling assumptions, detect systematic errors, and uncover opportunities for model improvement.
The show_residual_diagnostics function provides comprehensive visualization of residual
patterns across multiple dimensions:
Residuals vs Fitted Values: Assess homoscedasticity (constant variance) and identify non-linear patterns
Residuals vs Predictors: Examine whether specific features are associated with systematic prediction errors
Q-Q Plots: Evaluate whether residuals follow a normal distribution
Histogram of Residuals: Visualize the distribution shape and identify outliers
Scale-Location Plots: Detect heteroscedasticity (non-constant variance)
Dataset for Regression Residuals Examples:
The examples in this section use the diabetes dataset from scikit-learn, which was introduced in the Regression Models section. This dataset contains baseline medical measurements and a quantitative measure of disease progression, making it ideal for demonstrating residual analysis techniques across different regression approaches.
What Good Residuals Look Like:
Randomly scattered around zero with no systematic patterns
Constant spread across the range of fitted values (homoscedasticity)
Approximately normally distributed (for inference and prediction intervals)
No strong correlations with individual predictor variables
What Bad Residuals Reveal:
Funnel shapes (heteroscedasticity): Variance increases/decreases with predicted values, suggesting transformations may be needed
Curved patterns: Non-linear relationships that the model hasn’t captured
Clusters or groups: Systematic differences across subpopulations that may require interaction terms or stratified models
Heavy tails or skewness: Outliers or violations of normality assumptions
Patterns vs predictors: Missing interaction effects or non-linear relationships with specific features
More information on residual diagnostics and interpretation can be found in the residual diagnostics section.
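Before turning to the plotting function, it can help to see the raw ingredients. The short sketch below computes residuals and standardized residuals from generic y_true / y_pred arrays (placeholder values, not tied to the examples that follow) and applies the usual |z| > 3 outlier rule of thumb.
import numpy as np

# Placeholder arrays; substitute your own observed targets and model predictions
y_true = np.array([151.0, 75.0, 141.0, 206.0, 135.0, 97.0])
y_pred = np.array([160.2, 88.1, 130.5, 190.7, 150.3, 101.4])

residuals = y_true - y_pred
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)
print("Mean residual:", round(residuals.mean(), 3))
print("Potential outliers (|z| > 3):", int((np.abs(standardized) > 3).sum()))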
- show_residual_diagnostics(model=None, X=None, y=None, y_pred=None, model_title=None, plot_type='all', figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, save_plot=False, image_path_png=None, image_path_svg=None, show_outliers=False, n_outliers=3, suptitle=None, suptitle_y=0.995, text_wrap=None, point_kwgs=None, group_kwgs=None, line_kwgs=None, show_lowess=False, lowess_kwgs=None, group_category=None, show_centroids=False, centroid_type='clusters', n_clusters=None, centroid_kwgs=None, legend_loc='best', legend_kwgs=None, n_cols=None, n_rows=None, heteroskedasticity_test=None, decimal_places=4, show_plots=True, show_diagnostics_table=False, return_diagnostics=False, histogram_type='frequency', kmeans_rstate=42)
- Parameters:
model (estimator or list of estimators, optional) – Trained regression model(s). If
None,y_predmust be provided.X (array-like, optional) – Feature matrix. Required if
modelis provided.y (array-like) – True target values.
y_pred (array-like or list, optional) – Predicted values. Can be provided instead of
modelandX.model_title (str or list[str], optional) – Custom name(s) for model(s). Defaults to
"Model 1","Model 2", etc.plot_type (str or list, optional) – Which diagnostic plot(s) to display. Options:
"all","fitted","qq","scale_location","leverage","influence","histogram","predictors". Can pass a list for specific plots.figsize (tuple, optional) – Figure size (width, height). Defaults vary by
plot_type.label_fontsize (int, optional) – Font size for axis labels and titles. Defaults to
12.tick_fontsize (int, optional) – Font size for tick labels. Defaults to
10.gridlines (bool, optional) – Whether to display grid lines. Defaults to
True.save_plot (bool, optional) – Whether to save the plot(s) to disk. Defaults to
False.image_path_png (str, optional) – Path to save PNG image.
image_path_svg (str, optional) – Path to save SVG image.
show_outliers (bool, optional) – Whether to label outlier points on plots. Defaults to
False.n_outliers (int, optional) – Number of most extreme outliers to label. Defaults to
3.suptitle (str, optional) – Custom title for the overall figure. If
None, uses default; if"", no suptitle is displayed.suptitle_y (float, optional) – Vertical position of the figure suptitle (0-1 range). Defaults to
0.995.text_wrap (int, optional) – Maximum width for wrapping titles.
point_kwgs (dict, optional) – Styling for scatter points (e.g.,
{'alpha': 0.6, 'color': 'blue'}).group_kwgs (dict, optional) – Styling for group scatter points when
group_categoryis provided. Can specify colors as a list for each group.line_kwgs (dict, optional) – Styling for reference lines (e.g.,
{'color': 'red', 'linestyle': '--'}).show_lowess (bool, optional) – Whether to show the LOWESS smoothing trend line on residual plots. Defaults to
False.lowess_kwgs (dict, optional) – Styling for LOWESS smoothing line (e.g.,
{'color': 'blue', 'linewidth': 2}).group_category (str or array-like, optional) – Categorical variable for grouping observations. Can be a column name in
Xor an array matchingyin length.show_centroids (bool, optional) – Whether to plot centroids for groups or clusters. Defaults to
False.centroid_type (str, optional) – Type of centroids to display. Options:
"clusters"(k-means clustering) or"groups"(category centroids). Defaults to"clusters".n_clusters (int, optional) – Number of clusters for k-means clustering when
centroid_type="clusters". Defaults to3if not specified.centroid_kwgs (dict, optional) – Styling for centroid markers (e.g.,
{'marker': 'X', 's': 50, 'c': ['red', 'blue']}).legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like
'best','upper right', etc. Defaults to'best'.legend_kwgs (dict, optional) – Control legend display for groups, centroids, clusters, and heteroskedasticity tests. Use keys
'groups','centroids','clusters','het_tests'with boolean values.n_cols (int, optional) – Number of columns for predictor plots layout. If
None, uses automatic layout.n_rows (int, optional) – Number of rows for predictor plots layout. If
None, automatically calculated.heteroskedasticity_test (str, optional) – Test for heteroskedasticity. Options:
"breusch_pagan","white","goldfeld_quandt","spearman","all", orNone. Defaults toNone.decimal_places (int, optional) – Number of decimal places for all numeric values in diagnostics. Defaults to
4.show_plots (bool, optional) – Whether to display diagnostic plots. Defaults to
True.show_diagnostics_table (bool, optional) – Whether to print a formatted table of diagnostic statistics. Defaults to
False.return_diagnostics (bool, optional) – If
True, return a dictionary containing diagnostic statistics. Defaults toFalse.histogram_type (str, optional) – Type of histogram to display. Options:
"frequency"(raw counts) or"density"(probability density with normal overlay). Defaults to"frequency".kmeans_rstate (int, optional) – Random state for reproducibility in k-means clustering. Defaults to
42.
- Returns:
None(displays plots) or dictionary of diagnostics ifreturn_diagnostics=True.- Return type:
Noneor dict- Raises:
If neither (
modelandX) nory_predis provided.If
plot_typeis not recognized.If both
group_categoryandn_clustersare specified.If
heteroskedasticity_testis not a valid test type.If
histogram_typeis not'frequency'or'density'.If
centroid_typeis not'clusters'or'groups'.If
centroid_type='groups'withoutgroup_categoryspecified.If
show_centroids=Trueandcentroid_type='clusters'butXcontains non-numeric columns (k-means clustering requires all numeric features).
Important
You can supply either
modelandXor passy_preddirectly.When using
y_pred, the function bypasses model predictions and uses the provided values for residual calculation.Supports single-model or multiple-model workflows.
Cannot specify both
group_category(user-defined groups) andn_clusters(automatic clustering) simultaneously.When
centroid_type='groups',group_categorymust be provided.
Notes
- Diagnostic Plot Types:
"all": Creates a comprehensive 2×3 grid with fitted, Q-Q, scale-location, leverage, influence, and histogram plots"fitted": Residuals vs fitted values (detects heteroscedasticity and non-linearity)"qq": Normal Q-Q plot (assesses normality assumption)"scale_location": Standardized residuals vs fitted (evaluates homoscedasticity)"leverage": Residuals vs leverage with Cook’s distance contours (identifies influential points)"influence": Influence plot with bubble sizes proportional to Cook’s distance"histogram": Distribution of residuals with optional normal overlay"predictors": Separate plots for residuals vs each predictor variable
- Heteroskedasticity Testing:
Optional parameter heteroskedasticity_test performs formal statistical tests for non-constant variance:
"breusch_pagan": Tests whether residual variance depends on predicted values
"white": A more general test that doesn’t assume a specific functional form
"goldfeld_quandt": Compares variance between two subsamples
"spearman": Spearman correlation between absolute residuals and fitted values
"all": Runs all available tests
Test results are displayed in plot legends and printed when show_diagnostics_table=True
If None (default), no tests are performed; only visual diagnostics are displayed
- Grouped Analysis:
Use group_category to stratify residuals by categorical variables (e.g., sex, age group, treatment arm)
Use n_clusters for automatic k-means clustering to identify data-driven residual patterns
Set show_centroids=True to overlay group/cluster centroids on plots
centroid_type controls centroid display: "clusters" (default) for k-means or "groups" for category centroids
kmeans_rstate ensures reproducible clustering (default: 42)
Use legend_kwgs to control which legend entries appear (groups, centroids, clusters, heteroskedasticity tests)
- Customization:
Control histogram display with histogram_type: "frequency" for raw counts or "density" for a normalized distribution with normal overlay
Enable LOWESS smoothing with show_lowess=True to visualize trends in residual plots
Customize point, line, LOWESS, group, and centroid styling via the respective *_kwgs parameters
Adjust the subplot layout for predictor plots using n_cols and n_rows
Set suptitle to customize or suppress the overall figure title, and use suptitle_y to adjust its vertical position
- Output Options:
Set show_plots=False and show_diagnostics_table=True to print only the diagnostic statistics table
Use return_diagnostics=True to programmatically access diagnostic quantities for custom analyses
Combine show_plots=True and show_diagnostics_table=True for comprehensive visual and quantitative assessment
When group_category is a column in X, it is automatically excluded from predictor plots to avoid redundancy
Residual Diagnostics Example 1: All Residual Diagnostics Plots
Using the diabetes dataset, this first example demonstrates a complete residual diagnostic analysis for a single
regression model using plot_type="all". This setting generates all available diagnostic
visualizations in a single comprehensive display:
Residuals vs Fitted Values: Detects non-linearity, heteroscedasticity, and outliers
Q-Q Plot: Assesses normality of residuals
Scale-Location Plot: Evaluates homoscedasticity (constant variance)
Residuals vs Leverage: Identifies influential observations
Histogram of Residuals: Shows the distribution shape
We evaluate a Random Forest model trained on the diabetes dataset. The n_clusters=3
parameter performs k-means clustering on the residuals to identify groups of observations
with similar prediction error patterns. Setting show_centroids=True overlays cluster
centers on the residual plots, styled with custom colors and markers via centroid_kwgs.
The kmeans_rstate=222 parameter controls the random seed for k-means clustering, ensuring
reproducible cluster assignments across repeated runs. By default, kmeans_rstate is set
to 42, making clustering deterministic unless explicitly changed. This is important because
k-means uses random initialization; different seeds can produce slightly different cluster
assignments, especially when clusters overlap or are of similar size. Setting a fixed seed
ensures that diagnostic plots remain consistent for documentation, presentations, and
collaborative analysis.
To formally test for heteroscedasticity, we enable heteroskedasticity_test="breusch_pagan".
This optional parameter runs the Breusch-Pagan test, which evaluates whether residual
variance is systematically related to predicted values. Test results, including the test
statistic, p-value, and interpretation, are printed to the console. A significant result
(p < 0.05) indicates heteroscedasticity, suggesting that predictions may be more reliable
for certain ranges of the response variable than others.
Additional customization options include:
n_cols=2: Arranges diagnostic plots in a 2-column grid layout; this is useful for wide-format displays, but it is recommended not to use this setting when displaying all plots together, to maintain clarity.
histogram_type="density": Displays residuals as a density plot rather than raw counts
decimal_places=2: Controls the precision of printed test statistics
tick_fontsize and label_fontsize: Adjust text sizing for readability
save_plot=True with image paths: Exports plots as PNG and SVG for reports
When return_diagnostics=True is set, the function also returns a diagnostics dictionary containing residuals, fitted values,
standardized residuals, and leverage statistics. This allows programmatic access to
diagnostic quantities for custom analyses or integration with resid_diagnostics_to_dataframe
to convert results into a pandas DataFrame for further exploration.
rf_pred = rf_model.predict(X_test)
from model_metrics import show_residual_diagnostics
show_residual_diagnostics(
y_pred=rf_pred,
model_title=["Random Forest"],
X=X_test,
y=y_test,
n_clusters=3,
n_cols=2,
plot_type="all",
show_centroids=True,
centroid_kwgs={"c": ["red", "blue", "green"], "marker": "X", "s": 50},
heteroskedasticity_test="breusch_pagan",
decimal_places=2,
histogram_type="density",
kmeans_rstate=222,
)
Output
Residual Diagnostics Example 2: Single Plot with LOWESS Smoothing
Using the diabetes dataset, this example demonstrates two key capabilities for focused residual analysis:
Selective plot generation: The plot_type parameter allows you to generate specific diagnostic plots rather than the full suite. Pass a single plot name as a string (e.g., "fitted") or a list of plot names for multiple specific plots (e.g., ["fitted", "qq", "histogram"]). This is useful when you need to examine particular model assumptions or create targeted visualizations for reports.
LOWESS trend detection: Setting show_lowess=True adds a locally weighted scatterplot smoothing (LOWESS) curve to residual plots. This non-parametric smoothing line reveals systematic patterns or trends in the residuals that might not be obvious from the scatter alone. If model assumptions hold, the LOWESS line should be roughly horizontal at y=0. Pronounced curves or trends indicate potential violations of linearity or suggest that the model is systematically over- or under-predicting in certain regions.
We focus on the Scale-Location plot (plot_type="scale_location"), which is particularly useful for detecting heteroscedasticity: the violation of the constant variance assumption. This plot displays the square root of standardized residuals against fitted values, making it easier to spot changes in residual spread across the prediction range. The LOWESS smoothing line, styled in orange via lowess_kwgs, helps identify whether variance increases, decreases, or remains stable as predictions change.
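For intuition, the sketch below builds a simplified Scale-Location plot by hand with matplotlib and the statsmodels LOWESS smoother. It standardizes residuals by their standard deviation rather than the leverage-adjusted standardization a full implementation would use, so it approximates rather than reproduces the function's plot; it assumes rf_pred and y_test from the earlier example.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Residuals and fitted values from the Random Forest example
residuals = np.asarray(y_test) - np.asarray(rf_pred)
fitted = np.asarray(rf_pred)

# Simplified standardization (a full implementation adjusts for leverage)
std_resid = residuals / residuals.std()
sqrt_abs_resid = np.sqrt(np.abs(std_resid))

# LOWESS trend of sqrt(|standardized residuals|) against fitted values;
# a roughly flat line indicates constant residual variance
smoothed = lowess(sqrt_abs_resid, fitted, frac=0.6)

plt.scatter(fitted, sqrt_abs_resid, alpha=0.6)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="orange", linewidth=2)
plt.xlabel("Fitted values")
plt.ylabel("sqrt(|standardized residuals|)")
plt.title("Scale-Location (manual sketch)")
plt.show()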
The heteroskedasticity_test="breusch_pagan" parameter formally tests for
heteroscedasticity. The Breusch-Pagan test evaluates whether residual variance
is systematically related to the predictors or fitted values. Test results appear
in the plot legend (if space permits) or can be displayed in a diagnostic table
using show_diagnostics_table=True. A significant result (p < 0.05) provides
statistical evidence of heteroscedasticity, which may require remedial measures
such as variance-stabilizing transformations, weighted least squares, or robust
standard errors.
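For context on those remedies, the sketch below shows one common follow-up outside of model_metrics: refitting a linear benchmark with heteroscedasticity-consistent (HC3) standard errors in statsmodels. The X_test/y_test names follow the earlier examples, and the choice of data split is purely illustrative.
import statsmodels.api as sm

# Linear benchmark with classical vs. robust (HC3) standard errors
exog = sm.add_constant(X_test)
classical_fit = sm.OLS(y_test, exog).fit()
robust_fit = sm.OLS(y_test, exog).fit(cov_type="HC3")

# Coefficients are identical; only standard errors and p-values change
print(classical_fit.bse)
print(robust_fit.bse)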
from model_metrics import show_residual_diagnostics
show_residual_diagnostics(
y_pred=rf_pred,
model_title="Random Forest",
X=X_test,
y=y_test,
plot_type="scale_location",
point_kwgs={"alpha": 0.9, "color": "blue", "edgecolor": "black", "s": 50},
show_lowess=True,
lowess_kwgs={"color": "red", "linewidth": 2},
heteroskedasticity_test="breusch_pagan",
figsize=(8, 6),
tick_fontsize=12,
label_fontsize=14,
)
Output
Residual Diagnostics Example 3: Diagnostics Table Only
Using the diabetes dataset, this example
demonstrates how to generate a comprehensive residual diagnostics summary table
without displaying plots. By setting show_plots=False and
show_diagnostics_table=True, the function outputs only a tabular summary of key
diagnostic statistics and heteroscedasticity test results.
The diagnostics table includes:
Residual statistics: Mean, standard deviation, min, max, and quartiles
Standardized residual metrics: Useful for identifying outliers (|z| > 3); see the sketch after this list
Heteroscedasticity test results: Included when heteroskedasticity_test is specified
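As noted in the list above, the outlier rule of thumb is easy to apply yourself; a minimal numpy sketch, using a simple standardization of the residuals (the diagnostics table applies its own standardization internally):
import numpy as np

# Simple standardization of residuals (illustrative only)
residuals = np.asarray(y_test_diabetes) - np.asarray(rf_pred)
std_resid = residuals / residuals.std()

# Observations with |z| > 3 are conventional outlier candidates
outlier_idx = np.where(np.abs(std_resid) > 3)[0]
print(f"{len(outlier_idx)} potential outliers at positions: {outlier_idx}")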
In this example, we set heteroskedasticity_test="all" to run all available tests:
Breusch-Pagan: Tests whether residual variance depends on predicted values
White: A more general test that doesn’t assume a specific functional form
Goldfeld-Quandt: Compares variance between two subsamples
Spearman: Tests the rank correlation between absolute residuals and fitted values
Each test returns a test statistic, p-value, and interpretation. The decimal_places=5
parameter ensures high precision in the printed output, which is useful for reporting
results in research papers or technical documentation.
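For reference, the remaining three tests can also be run directly with statsmodels and scipy, complementing the Breusch-Pagan sketch shown earlier; a minimal sketch using the residuals and feature matrix from this example, not the function's own implementation:
import numpy as np
import statsmodels.api as sm
from scipy.stats import spearmanr
from statsmodels.stats.diagnostic import het_white, het_goldfeldquandt

residuals = np.asarray(y_test_diabetes) - np.asarray(rf_pred)
exog = sm.add_constant(X_test_diabetes)

# White: auxiliary regression on predictors, their squares, and cross-products
white_stat, white_pval, _, _ = het_white(residuals, exog)

# Goldfeld-Quandt: compares residual variance between two subsamples
gq_stat, gq_pval, _ = het_goldfeldquandt(np.asarray(y_test_diabetes), exog)

# Spearman: rank correlation between absolute residuals and fitted values
spearman_stat, spearman_pval = spearmanr(np.abs(residuals), np.asarray(rf_pred))

print(f"White: p={white_pval:.5f}")
print(f"Goldfeld-Quandt: p={gq_pval:.5f}")
print(f"Spearman: p={spearman_pval:.5f}")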
The return_diagnostics=True parameter returns a dictionary containing all diagnostic
quantities (residuals, fitted values, standardized residuals, leverage, etc.) for
programmatic access or conversion to a DataFrame using resid_diagnostics_to_dataframe.
Note: You can also display both the table and plots simultaneously by setting
show_plots=True and show_diagnostics_table=True together. This provides a
comprehensive view combining visual diagnostics with quantitative summaries, ideal for
thorough model evaluation reports.
Additional parameters used:
plot_type="histogram": Specifies which plot type to generate (only relevant ifshow_plots=True)n_clusters=3andshow_centroids=True: Configures k-means clustering (applied to returned diagnostics)save_plot=True: Would save plots ifshow_plots=True
from model_metrics import show_residual_diagnostics
diagnostics = show_residual_diagnostics(
y_pred=rf_pred,
model_title=["Random Forest"],
X=X_test_diabetes,
y=y_test_diabetes,
n_clusters=3,
n_cols=2,
save_plot=True,
image_path_png=image_path_png,
image_path_svg=image_path_svg,
tick_fontsize=12,
label_fontsize=14,
plot_type="histogram",
show_centroids=True,
centroid_kwgs={"c": ["red", "blue", "green"], "marker": "X", "s": 50},
heteroskedasticity_test="all",
legend_loc="upper right",
show_diagnostics_table=True,
return_diagnostics=True,
show_plots=False,
decimal_places=5,
)
Output
============================================================
Residual Diagnostics: Random Forest
============================================================
Statistic Value
------------------------------------------------------------
N Observations 89
N Predictors 10
------------------------------------------------------------
R-squared 0.44282
Adjusted R-squared 0.37139
------------------------------------------------------------
RMSE 54.33241
MAE 44.05303
------------------------------------------------------------
Mean Residual -0.73056
Std Residual 54.32750
Min Residual -118.34000
Max Residual 153.37000
Jarque-Bera Test p=0.76159 (Normal)
Durbin-Watson 2.20903
------------------------------------------------------------
Mean Leverage 0.12360
Max Leverage 0.39654
Leverage Threshold (2p/n) 0.22472
High Leverage Points 7
------------------------------------------------------------
Heteroskedasticity Tests:
Breusch-Pagan p=0.03750 (Heteroskedastic)
White p=0.10104 (Homoskedastic)
Goldfeld-Quandt p=0.05058 (Homoskedastic)
Spearman Correlation p=0.09465 (Homoskedastic)
============================================================
Residual Diagnostics Example 4: Diagnostics to DataFrame
Building on Residual Diagnostics Example 3,
the diagnostics dictionary returned by show_residual_diagnostics() can be converted into
a pandas DataFrame for programmatic analysis, reporting, or integration into automated pipelines.
The resid_diagnostics_to_dataframe() helper function handles this conversion seamlessly,
properly flattening nested structures like heteroskedasticity test results. Unlike the
console table which only displays p-values and interpretations, the DataFrame provides
complete test results including both the test statistics and p-values: useful for creating
custom reports, academic papers, or detailed model documentation that requires full
statistical disclosure.
from model_metrics import show_residual_diagnostics, resid_diagnostics_to_dataframe
# Generate diagnostics (reusing the Example 3 setup; suppress plots and table)
diagnostics = show_residual_diagnostics(
    y_pred=rf_pred,
    model_title=["Random Forest"],
    X=X_test_diabetes,
    y=y_test_diabetes,
    n_clusters=3,
    heteroskedasticity_test="all",
    show_diagnostics_table=False,  # Suppress console table
    show_plots=False,  # Suppress plots to focus on data extraction
    return_diagnostics=True,  # Return the diagnostics dictionary
    decimal_places=2,
    kmeans_rstate=222,
)
# Convert to DataFrame
df = resid_diagnostics_to_dataframe(diagnostics)
print(df)
This produces a clean DataFrame with all diagnostic statistics:
Output
| Statistic | Value |
|---|---|
| model_name | Random Forest |
| n_observations | 89 |
| n_predictors | 10 |
| mean_residual | -0.730562 |
| std_residual | 54.327496 |
| min_residual | -118.34 |
| max_residual | 153.37 |
| mae | 44.053034 |
| rmse | 54.332408 |
| r2 | 0.44 |
| adj_r2 | 0.37 |
| jarque_bera_stat | 0.5447 |
| jarque_bera_pval | 0.761588 |
| durbin_watson | 2.209026 |
| max_leverage | 0.396543 |
| mean_leverage | 0.123596 |
| max_cooks_d | 0.207521 |
| leverage_threshold | 0.224719 |
| high_leverage_count | 7 |
| influential_points_05 | 0 |
| influential_points_10 | 0 |
| hetero_breusch_pagan_stat | 19.22 |
| hetero_breusch_pagan_pval | 0.04 |
| hetero_breusch_pagan_heteroskedastic | True |
| hetero_white_stat | 78.78 |
| hetero_white_pval | 0.1 |
| hetero_white_heteroskedastic | False |
| hetero_goldfeld_quandt_stat | 1.77 |
| hetero_goldfeld_quandt_pval | 0.05 |
| hetero_goldfeld_quandt_heteroskedastic | False |
| hetero_spearman_stat | 0.18 |
| hetero_spearman_pval | 0.09 |
| hetero_spearman_heteroskedastic | False |
Key Features:
Automatic flattening: Heteroskedasticity test results are expanded into separate rows (e.g., hetero_breusch_pagan_stat, hetero_breusch_pagan_pval)
Normality interpretation: Jarque-Bera results are split into statistic, p-value, and a boolean normality indicator
Ready for export: The DataFrame can be saved to CSV, Excel, or integrated into reporting workflows with df.to_csv('diagnostics.csv', index=False)
Programmatic access: Extract specific statistics with df.loc[df['Statistic'] == 'r2', 'Value'].iloc[0]; a filtering and export sketch follows this list
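As referenced above, a short pandas sketch for filtering the flattened test rows and exporting the full table (the Statistic and Value column names match the output shown above):
# Keep only the flattened heteroskedasticity test rows
hetero_rows = df[df["Statistic"].str.startswith("hetero_")]
print(hetero_rows)

# Pull a single value programmatically
r2 = df.loc[df["Statistic"] == "r2", "Value"].iloc[0]
print(f"R-squared: {r2}")

# Export the complete diagnostics table for reporting
df.to_csv("diagnostics.csv", index=False)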
Because heteroskedasticity_test="all" was requested here, the DataFrame includes rows for all four tests (Breusch-Pagan, White, Goldfeld-Quandt, and Spearman Correlation), each with its test statistic, p-value, and heteroskedasticity indicator.
Residual Diagnostics Example 5: Grouped Analysis with Customization
Using the diabetes dataset, this example demonstrates grouped residual analysis, showing how stratification by a categorical variable can reveal differential model performance across subpopulations. We use the Random Forest model with three key columns (age, BMI, and a derived sex category) to examine whether prediction errors vary systematically between male and female patients.
Understanding the residual diagnostics example 5 parameters
Core Data Parameters:
y_pred=rf_pred: Provides pre-computed predictions from the Random Forest model, bypassing the need to pass the model object itself
X=X_test_diab_copy[["age", "bmi", "sex_category"]]: Focuses the analysis on three columns (age, BMI, and the derived sex category) rather than all features
y=y_test_diabetes: True target values for residual calculation
model_title="Random Forest": Custom display name for the model
Grouping and Visualization Parameters:
plot_type="predictors": Creates separate residual plots for each predictor variable (age, BMI, sex_category), allowing examination of predictor-specific error patternsgroup_category="sex_category": Stratifies all visualizations by sex, color-coding points by Male/Female to reveal group-specific patternscentroid_type="groups": Instructs the function to compute centroids for each category ingroup_category(Male and Female) rather than using k-means clusteringshow_centroids=True: Overlays the mean residual position for each sex on every plot, making systematic bias immediately visible
Styling and Aesthetics:
group_kwgs: Controls the appearance of scatter points for each group:
"color": ["#1f77b4", "#ff7f0e"]: Custom hex colors for Male (blue) and Female (orange)
"alpha": 0.8: Semi-transparency to reveal overlapping points
"s": 60: Point size for readability
"edgecolors": "black": Black borders around points for definition
centroid_kwgs: Customizes the centroid markers to stand out:
"c": ["red", "blue"]: Distinct colors for Male and Female centroids
"marker": "X": X-shaped markers easily distinguishable from data points
"s": 50: Marker size balancing visibility with clarity
Statistical Testing:
heteroskedasticity_test="all": Runs all four heteroskedasticity tests (Breusch-Pagan, White, Goldfeld-Quandt, Spearman) on each predictor, with results displayed in plot legends. This comprehensive testing approach provides convergent evidence about whether variance differs systematically across groups or predicted values.
Layout and Display:
figsize=(12, 8): Larger figure accommodates the predictor subplots with clear legends
tick_fontsize=14, label_fontsize=16: Enhanced readability for presentation or publication
legend_loc="bottom": Places legends below the plots to avoid obscuring data points, especially useful with heteroskedasticity test results
suptitle="": Suppresses the overall figure title for a cleaner, more professional appearance
Additional Parameters:
decimal_places=2: Rounds all statistical test results to two decimal places for concise display
kmeans_rstate=222: Sets the random seed for reproducibility (relevant only if centroid_type="clusters" were used)
save_plot=True, image_path_png, image_path_svg: Exports high-quality figures for reports or publications
# Generate predictions from multiple models
linear_pred = linear_model.predict(X_test_diabetes)
rf_pred = rf_model.predict(X_test_diabetes)
ridge_pred = ridge_model.predict(X_test_diabetes)
# The 'sex' column is already categorical-like (coded as positive/negative values)
# Let's make it more interpretable
X_test_diab_copy = X_test_diabetes.copy()
X_test_diab_copy["sex_category"] = X_test_diab_copy["sex"].apply(
    lambda x: "Male" if x > 0 else "Female"
)
from model_metrics import show_residual_diagnostics
# Generate residual diagnostics stratified by sex
show_residual_diagnostics(
    y_pred=rf_pred,
    model_title="Random Forest",
    X=X_test_diab_copy[["age", "bmi", "sex_category"]],
    y=y_test_diabetes,
    plot_type="predictors",
    group_category="sex_category",
    show_centroids=True,
    centroid_type="groups",
    group_kwgs={"color": ["#1f77b4", "#ff7f0e"], "alpha": 0.8, "s": 60, "edgecolors": "black"},
    centroid_kwgs={"c": ["red", "blue"], "marker": "X", "s": 50},
    heteroskedasticity_test="all",
    figsize=(12, 8),
    tick_fontsize=14,
    label_fontsize=16,
    legend_loc="bottom",
    suptitle="",
    decimal_places=2,
    kmeans_rstate=222,
    save_plot=True,
    image_path_png=image_path_png,
    image_path_svg=image_path_svg,
)
Output
What the residual diagnostics example 5 analysis reveals
By examining residuals stratified by sex across age, BMI, and the sex category itself, we can identify:
Systematic Bias: Do centroids deviate from zero differently for males vs females? If the female centroid is consistently above zero while the male centroid is below, the model systematically under-predicts for females and over-predicts for males.
Differential Variance: Do residuals spread more widely for one group? The heteroskedasticity tests quantify whether prediction uncertainty differs between sexes.
Predictor-Specific Patterns: Does the relationship between residuals and a predictor (e.g., BMI) differ by sex? Diverging patterns suggest interaction effects that the model hasn’t captured.
Fairness Assessment: In healthcare applications, differential error patterns by sex could indicate the need for sex-specific calibration or additional interaction terms in the model.
Interpreting Centroids
The centroids represent the average (x, y) position of residuals for each group on each predictor plot:
X-coordinate: Mean value of the predictor for that group
Y-coordinate: Mean residual for that group
A centroid with y ≠ 0 indicates systematic over-prediction (y < 0) or under-prediction (y > 0) for that group. Centroids at different vertical positions reveal bias, while different horizontal positions reflect predictor distribution differences between groups.
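These centroid coordinates can be verified outside the plots with a pandas groupby; a minimal sketch assuming the X_test_diab_copy frame, rf_pred, and y_test_diabetes from the example above:
import numpy as np
import pandas as pd

# Residuals alongside the grouping column and one predictor (BMI)
centroid_df = pd.DataFrame(
    {
        "bmi": X_test_diab_copy["bmi"].to_numpy(),
        "sex_category": X_test_diab_copy["sex_category"].to_numpy(),
        "residual": np.asarray(y_test_diabetes) - np.asarray(rf_pred),
    }
)

# Mean predictor value (x) and mean residual (y) per group:
# the centroid coordinates on the BMI residual plot
print(centroid_df.groupby("sex_category")[["bmi", "residual"]].mean())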
Heteroskedasticity Test Interpretation
Each test evaluates whether residual variance is constant:
Breusch-Pagan: Tests if variance depends on fitted values or predictors
White: General test not assuming specific functional form
Goldfeld-Quandt: Compares variance between high and low predictor values
Spearman: Correlation between absolute residuals and fitted values
Statistical significance (typically p < 0.05) indicates heteroscedasticity, suggesting transformations or weighted regression may be needed.