Model Performance Summaries

Summarizes model performance metrics for classification and regression models.

summarize_model_performance(model=None, X=None, y_prob=None, y_pred=None, y=None, model_type='classification', model_threshold=None, model_title=None, custom_threshold=None, score=None, return_df=False, overall_only=False, decimal_places=3, group_category=None, include_adjusted_r2=False)
Parameters:
  • model (estimator, list, or None) – Trained model or list of trained models. If None, y_prob or y_pred must be provided.

  • X (array-like or None) – Feature matrix used for evaluation. Required if model is provided without precomputed predictions.

  • y_prob (array-like, list, or None) – Predicted probabilities for classification models. Can be provided instead of model and X.

  • y_pred (array-like, list, or None) – Predicted labels (classification) or continuous predictions (regression). Can be provided instead of model and X.

  • y (array-like) – True target values.

  • model_type (str, optional) – Specifies whether the model is for classification or regression. Must be either "classification" or "regression".

  • model_threshold (float, dict, or None, optional) – Classification decision thresholds. Can be a float or dict keyed by model name. Ignored if custom_threshold is provided.

  • custom_threshold (float or None, optional) – Overrides all model thresholds with a fixed value. If set, excludes the “Model Threshold” row.

  • model_title (str, list, or None, optional) – Custom model names to display in output. Defaults to inferred names like Model_1, Model_2, etc.

  • score (str or None, optional) – Optional custom scoring metric for threshold resolution.

  • return_df (bool, optional) – If True, returns results as a pandas.DataFrame instead of printing.

  • overall_only (bool, optional) – For regression models, if True, returns only overall metrics (without coefficients or feature importances).

  • decimal_places (int, optional) – Number of decimal places for rounding metric values. Defaults to 3.

  • group_category (str, array-like, or None, optional) – Optional grouping variable for classification metrics. Can be a column name in X or an array matching the length of y.

  • include_adjusted_r2 (bool, optional) – For regression models, if True, computes and includes adjusted R-squared. Requires both model and X to be provided.

Returns:

  • If return_df=True:

    • Classification (no groups): metrics as rows, models as columns.

    • Classification (grouped): metrics as rows, groups as columns.

    • Regression: rows for metrics, coefficients, and/or feature importances.

  • If return_df=False, prints a formatted performance summary using manual layout logic.

Return type:

pandas.DataFrame or None

Raises:

ValueError

  • If model_type is invalid.

  • If overall_only=True is used for classification models.

  • If neither (model and X) nor (y_prob or y_pred) are provided.

Important

  • You can supply either model with X or precomputed y_prob / y_pred directly (a minimal sketch follows this list).

  • When using precomputed predictions, the function bypasses model inference.

  • Group-level metrics are only available for classification tasks using group_category.

  • If custom_threshold is specified, it overrides all model thresholds.
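
For instance, here is a minimal sketch of the precomputed-probability path, assuming clf is any fitted binary classifier and X_test / y_test form a held-out split (illustrative names only):

from model_metrics import summarize_model_performance

# Probabilities for the positive class from any fitted classifier
y_prob = clf.predict_proba(X_test)[:, 1]

# No model or X is needed once probabilities are precomputed
summary = summarize_model_performance(
    y_prob=y_prob,
    y=y_test,
    model_type="classification",
    return_df=True,
)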

Notes

  • Classification Models:
    • Computes precision, recall, specificity, F1-score, AUC ROC, Brier score, and average precision.

    • Supports per-group metric computation when group_category is provided.

    • Grouped outputs automatically use group names as table headers and maintain metric order (with “Model Threshold” appearing last).

    • Works with multiple models, custom thresholds, or precomputed probabilities.

  • Regression Models:
    • Computes MAE, MAPE, MSE, RMSE, Explained Variance, R², and optionally Adj. R² (if include_adjusted_r2=True).

    • Extracts coefficients, intercepts, and feature importances (if available).

    • Preserves original manual formatting block:
      • Maintains right-aligned column layout and visual separators for readability.

      • Ensures coefficients and intercepts are displayed consistently across models.

      • Provides clear model breaks and retains console formatting identical to previous releases.

    • overall_only=True limits output to a single summary row per model.

  • Output Behavior:

    • return_df=False: prints a fully formatted summary preserving manual alignment.

    • return_df=True: returns a structured DataFrame suitable for further analysis or visualization.

    • Metrics are rounded to the specified decimal_places for clarity.

The summarize_model_performance function provides a structured evaluation of classification and regression models, generating key performance metrics. For classification models, it computes precision, recall, specificity, F1-score, and AUC ROC. For regression models, it extracts coefficients and evaluates error metrics like MSE, RMSE, and R². The function allows specifying custom thresholds, metric rounding, and formatted display options.

Below are several examples demonstrating how to evaluate multiple models using summarize_model_performance. The function calculates and presents metrics for both classification and regression models.

Binary Classification Models

This section introduces binary classification using two widely used machine learning models: Logistic Regression and Random Forest Classifier.

These examples demonstrate how model_metrics prepares and trains models on a synthetic dataset, setting the stage for evaluating their performance in subsequent sections. Both models use a default classification threshold of 0.5, where predictions are classified as positive (1) if the predicted probability exceeds 0.5, and negative (0) otherwise.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    random_state=42,
)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

# Train models
model1 = LogisticRegression(random_state=42).fit(X_train, y_train)
model2 = RandomForestClassifier(random_state=42).fit(X_train, y_train)

model_titles = ["Logistic Regression", "Random Forest"]

Binary Classification Example 1: Default Threshold

from model_metrics import summarize_model_performance

model_performance = summarize_model_performance(
    model=[model1, model2],
    model_title=model_titles,
    X=X_test,
    y=y_test,
    model_type="classification",
    return_df=True,
)

model_performance

Output

Metrics Logistic Regression Random Forest
Precision/PPV 0.867 0.914
Average Precision 0.937 0.968
Sensitivity/Recall 0.820 0.865
Specificity 0.843 0.899
F1-Score 0.843 0.889
AUC ROC 0.913 0.952
Brier Score 0.118 0.083
Model Threshold 0.500 0.500

Binary Classification Example 2: Custom Threshold

In this example, we revisit binary classification with the same two models: Logistic Regression and Random Forest, but adjust the classification threshold (custom_threshold input in this case) from the default 0.5 to 0.2. This change allows us to explore how lowering the threshold impacts model performance, potentially increasing sensitivity (recall) by classifying more instances as positive (1) at the expense of precision.

from model_metrics import summarize_model_performance

model_performance = summarize_model_performance(
    model=[model1, model2],
    model_title=model_titles,
    X=X_test,
    y=y_test,
    model_type="classification",
    return_df=True,
    custom_threshold=0.2,
)

model_performance

Output

Metrics Logistic Regression Random Forest
Precision/PPV 0.803 0.814
Average Precision 0.937 0.968
Sensitivity/Recall 0.919 0.946
Specificity 0.719 0.730
F1-Score 0.857 0.875
AUC ROC 0.913 0.952
Brier Score 0.118 0.083
Model Threshold 0.200 0.200

Binary Classification Example 3: Adult Income Data

In this third binary classification example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich combination of categorical and numerical features makes it particularly suitable for evaluating subgroup fairness and model performance across demographic segments.

In this example, we extend binary classification evaluation by introducing the group_category parameter to assess how model performance varies across different subpopulations. Specifically, we employ a Random Forest classifier and incorporate the race column from the test set to form a combined dataset that includes both predictive features and demographic information.

By passing this categorical variable to group_category, the function computes and displays subgroup-level metrics side-by-side, including AUC, precision, recall, and F1-score. This enables clear identification of potential performance disparities across demographic groups (e.g., by race, gender, or age category), offering valuable insights into fairness, equity, and subgroup behavior within the model’s predictions.

from model_metrics import summarize_model_performance

# NOTE: X, X_test, y_prob, model_titles, and model_thresholds in this example
# are produced by a separate Adult Income modeling workflow (see ROC AUC
# Example 7) and are not defined in the snippets above. The race column is
# appended here so that subgroup metrics can be computed.
X_test_analysis = X_test.join(X["race"])

model_summary = summarize_model_performance(
    y_prob=y_prob[2],
    y=y_test,
    model_title=model_titles,
    model_threshold=model_thresholds,
    return_df=True,
    decimal_places=3,
    group_category=X_test_analysis["race"],
)

model_summary

Output

Metrics Amer-Indian-Eskimo Asian-Pac-Islander Black Other White
AUC ROC 0.625 0.840 0.869 0.940 0.865
Average Precision 0.238 0.718 0.619 0.767 0.747
Brier Score 0.100 0.127 0.071 0.051 0.119
F1-Score 0.167 0.587 0.502 0.700 0.641
Model Threshold 0.300 0.300 0.300 0.300 0.300
Precision/PPV 0.125 0.533 0.475 0.700 0.646
Sensitivity/Recall 0.250 0.653 0.532 0.700 0.636
Specificity 0.831 0.808 0.923 0.962 0.880

Regression Models

In this section, we load the diabetes dataset [1] from scikit-learn, which includes features like age and BMI, along with a target variable representing disease progression. The data is then split with train_test_split into training and testing sets using an 80/20 ratio to facilitate model assessment. We train a Linear Regression model on unscaled data for a straightforward baseline, followed by a Random Forest Regressor with 100 trees, also on unscaled data, to introduce a more complex approach. Additionally, we train a Ridge Regression model using a Pipeline that scales the features with StandardScaler before fitting, incorporating regularization. These steps prepare the models for subsequent evaluation and comparison using tools provided by the model_metrics library.

Models used in these regression examples:

  • Linear Regression: A foundational model trained on unscaled data, simple yet effective for baseline evaluation.

  • Ridge Regression: A regularized model with a Pipeline for scaling, perfect for testing stability and overfitting.

  • Random Forest Regressor: An ensemble of 100 trees on unscaled data, offering complexity for comparative analysis.

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Load dataset
diabetes = load_diabetes(as_frame=True)["frame"]
X = diabetes.drop(columns=["target"])
y = diabetes["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

# Train Linear Regression (on unscaled data)
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Train Random Forest Regressor (on unscaled data)
rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
)
rf_model.fit(X_train, y_train)

# Train Ridge Regression (on scaled data)
ridge_model = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("estimator", Ridge(alpha=1.0)),
    ]
)
ridge_model.fit(X_train, y_train)

Regression Example 1: Linear, Ridge

from model_metrics import summarize_model_performance

regression_metrics = summarize_model_performance(
    model=[linear_model, ridge_model],
    model_title=["Linear Regression", "Ridge Regression"],
    X=X_test,
    y=y_test,
    model_type="regression",
    return_df=True,
    decimal_places=2,
)

regression_metrics

The output below presents a detailed comparison of the performance and coefficients for two regression models: Linear Regression and Ridge Regression trained on the diabetes dataset. It includes overall metrics such as Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Explained Variance, and R² Score for each model, showing their predictive accuracy. Additionally, it lists the coefficients for each feature (e.g., age, bmi, s1–s6) in both models, highlighting how each variable contributes to the prediction.

Output

Model Metric Variable Coefficient MAE MAPE MSE RMSE Expl. Var. R^2
Linear Regression Overall Metrics 42.79 37.5 2900.19 53.85 0.46 0.45
Linear Regression Coefficient const 151.35
Linear Regression Coefficient age 37.9
Linear Regression Coefficient sex -241.96
Linear Regression Coefficient bmi 542.43
Linear Regression Coefficient bp 347.7
Linear Regression Coefficient s1 -931.49
Linear Regression Coefficient s2 518.06
Linear Regression Coefficient s3 163.42
Linear Regression Coefficient s4 275.32
Linear Regression Coefficient s5 736.2
Linear Regression Coefficient s6 48.67
Ridge Regression Overall Metrics 42.81 37.45 2892.01 53.78 0.46 0.45
Ridge Regression Coefficient const 153.74
Ridge Regression Coefficient age 1.81
Ridge Regression Coefficient sex -11.45
Ridge Regression Coefficient bmi 25.73
Ridge Regression Coefficient bp 16.73
Ridge Regression Coefficient s1 -34.67
Ridge Regression Coefficient s2 17.05
Ridge Regression Coefficient s3 3.37
Ridge Regression Coefficient s4 11.76
Ridge Regression Coefficient s5 31.38
Ridge Regression Coefficient s6 2.46

Regression Example 2: Linear, Ridge, RF (w/ Feature Importance)

In this Regression Example 2, we extend the analysis by introducing a Random Forest Regressor alongside Linear Regression and Ridge Regression to demonstrate how a model with feature importances, rather than coefficients, impacts evaluation outcomes. The code uses the summarize_model_performance function from model_metrics to assess all three models on the diabetes dataset’s test set, ensuring the Random Forest’s feature importance-based predictions are reflected in the results while preserving the coefficient-based results of the other models, as shown in the subsequent table.

from model_metrics import summarize_model_performance

regression_metrics = summarize_model_performance(
    model=[linear_model, ridge_model, rf_model],
    model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
    X=X_test,
    y=y_test,
    model_type="regression",
    return_df=True,
    decimal_places=2,
)

regression_metrics

Output

Model Metric Variable Coefficient Feat. Imp. MAE MAPE MSE RMSE Expl. Var. R^2
Linear Regression Overall Metrics 42.79 37.5 2900.19 53.85 0.46 0.45
Linear Regression Coefficient const 151.35
Linear Regression Coefficient age 37.90
Linear Regression Coefficient sex -241.96
Linear Regression Coefficient bmi 542.43
Linear Regression Coefficient bp 347.7
Linear Regression Coefficient s1 -931.49
Linear Regression Coefficient s2 518.06
Linear Regression Coefficient s3 163.42
Linear Regression Coefficient s4 275.32
Linear Regression Coefficient s5 736.2
Linear Regression Coefficient s6 48.67
Ridge Regression Overall Metrics 42.81 37.45 2892.01 53.78 0.46 0.45
Ridge Regression Coefficient const 153.74
Ridge Regression Coefficient age 1.81
Ridge Regression Coefficient sex -11.45
Ridge Regression Coefficient bmi 25.73
Ridge Regression Coefficient bp 16.73
Ridge Regression Coefficient s1 -34.67
Ridge Regression Coefficient s2 17.05
Ridge Regression Coefficient s3 3.37
Ridge Regression Coefficient s4 11.76
Ridge Regression Coefficient s5 31.38
Ridge Regression Coefficient s6 2.46
Random Forest Overall Metrics 44.05 40.01 2952.01 54.33 0.44 0.44
Random Forest Feat. Imp. age 0.06
Random Forest Feat. Imp. sex 0.01
Random Forest Feat. Imp. bmi 0.36
Random Forest Feat. Imp. bp 0.09
Random Forest Feat. Imp. s1 0.05
Random Forest Feat. Imp. s2 0.06
Random Forest Feat. Imp. s3 0.05
Random Forest Feat. Imp. s4 0.02
Random Forest Feat. Imp. s5 0.23
Random Forest Feat. Imp. s6 0.07

Regression Example 3: Adjusted R²

In some regression analyses, it is useful to report Adjusted R² in addition to standard error and variance metrics. Adjusted R² accounts for the number of predictors in the model and penalizes unnecessary complexity, making it more appropriate than R² when comparing models with different feature counts.
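
For reference, the adjusted R² reported here presumably follows the standard definition, where n is the number of evaluation samples and p the number of predictors:

\[\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}\]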

This example demonstrates how to include Adjusted R² in the output table by setting include_adjusted_r2=True in the summarize_model_performance function.

from model_metrics import summarize_model_performance

regression_metrics = summarize_model_performance(
    model=[linear_model, ridge_model, rf_model],
    model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
    X=X_test,
    y=y_test,
    model_type="regression",
    include_adjusted_r2=True,
    return_df=True,
    decimal_places=2,
)

regression_metrics

The resulting table extends the standard regression metrics (MAE, MAPE, MSE, RMSE, Explained Variance, and R²) by adding an Adjusted R² column, enabling more informed model comparison when feature dimensionality differs.

Output

Model Metric Variable Coefficient Feat. Imp. MAE MAPE MSE RMSE Expl. Var. R^2 Adj. R^2
Linear Regression Overall Metrics 42.79 37.5 2900.19 53.85 0.46 0.45 0.38
Linear Regression Coefficient const 151.35
Linear Regression Coefficient age 37.90
Linear Regression Coefficient sex -241.96
Linear Regression Coefficient bmi 542.43
Linear Regression Coefficient bp 347.7
Linear Regression Coefficient s1 -931.49
Linear Regression Coefficient s2 518.06
Linear Regression Coefficient s3 163.42
Linear Regression Coefficient s4 275.32
Linear Regression Coefficient s5 736.2
Linear Regression Coefficient s6 48.67
Ridge Regression Overall Metrics 42.81 37.45 2892.01 53.78 0.46 0.45 0.38
Ridge Regression Coefficient const 153.74
Ridge Regression Coefficient age 1.81
Ridge Regression Coefficient sex -11.45
Ridge Regression Coefficient bmi 25.73
Ridge Regression Coefficient bp 16.73
Ridge Regression Coefficient s1 -34.67
Ridge Regression Coefficient s2 17.05
Ridge Regression Coefficient s3 3.37
Ridge Regression Coefficient s4 11.76
Ridge Regression Coefficient s5 31.38
Ridge Regression Coefficient s6 2.46
Random Forest Overall Metrics 44.05 40.01 2952.01 54.33 0.44 0.44 0.37
Random Forest Feat. Imp. age 0.06
Random Forest Feat. Imp. sex 0.01
Random Forest Feat. Imp. bmi 0.36
Random Forest Feat. Imp. bp 0.09
Random Forest Feat. Imp. s1 0.05
Random Forest Feat. Imp. s2 0.06
Random Forest Feat. Imp. s3 0.05
Random Forest Feat. Imp. s4 0.02
Random Forest Feat. Imp. s5 0.23
Random Forest Feat. Imp. s6 0.07

Regression Example 4: Overall Results

In some scenarios, you may want to simplify the output by excluding variables, coefficients, and feature importances from the model results. This example demonstrates how to achieve that by setting overall_only=True in the summarize_model_performance function, producing a concise table that focuses on key metrics: model name, Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Explained Variance, and R² Score.

from model_metrics import summarize_model_performance

regression_metrics = summarize_model_performance(
    model=[linear_model, ridge_model, rf_model],
    model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
    X=X_test,
    y=y_test,
    model_type="regression",
    overall_only=True,
    return_df=True,
    decimal_places=2,
)

regression_metrics

Output

Model Metric MAE MAPE MSE RMSE Expl. Var. R^2
Linear Regression Overall Metrics 42.79 37.50 2900.19 53.85 0.46 0.45
Ridge Regression Overall Metrics 42.81 37.45 2892.01 53.78 0.46 0.45
Random Forest Overall Metrics 44.05 40.01 2952.01 54.33 0.44 0.44

Lift Charts

This section illustrates how to assess and compare the ranking effectiveness of classification models using Lift Charts, a valuable tool for evaluating how well a model prioritizes positive instances relative to random chance. Leveraging the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset introduced in the Binary Classification Models section, we plot Lift curves to visualize their relative ability to surface high-value (positive) cases at the top of the prediction list.

A Lift Chart plots the ratio of actual positives identified by the model compared to what would be expected by random selection, across increasingly larger proportions of the sample sorted by predicted probability. The baseline (Lift = 1) represents random chance; curves that rise above this line demonstrate the model’s ability to “lift” positive outcomes toward the top ranks. This makes Lift Charts especially useful in applications like marketing, fraud detection, and risk stratification where targeting the top segment of predictions can yield outsized value. The mathematical definition of Lift is given below.
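
Formally, for the top fraction t of the sample (ranked by descending predicted probability), lift can be written as

\[\text{Lift}(t) = \frac{\text{TP}(t) / (t \cdot N)}{P / N} = \frac{\text{TP}(t)}{t \cdot P}\]

where TP(t) is the number of positives captured in the top t·N predictions, P is the total number of positives, and N is the sample size.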

The show_lift_chart function enables flexible creation of Lift Charts for one or more models. It supports single-plot overlays, subplot layouts, and detailed customization of labels, titles, and styling. Designed for both exploratory analysis and stakeholder presentation, this utility helps users better understand model ranking performance across the population.

show_lift_chart(model, X, y, y_prob=None, xlabel='Percentage of Sample', ylabel='Lift', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, subplots=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, legend_loc='best')
Parameters:
  • model (object or list[object]) – A trained model or a list of models. Each must implement predict_proba to estimate class probabilities. Can be omitted if y_prob is provided.

  • X (pd.DataFrame or np.ndarray) – Feature matrix used to generate predictions. Required if using model. Ignored if y_prob is provided.

  • y (pd.Series or np.ndarray) – True binary labels corresponding to the input samples.

  • y_prob (array-like or list[array-like], optional) – Predicted probabilities for classification models. Can be provided instead of model and X.

  • xlabel (str, optional) – Label for the x-axis. Defaults to "Percentage of Sample".

  • ylabel (str, optional) – Label for the y-axis. Defaults to "Lift".

  • model_title (str or list[str], optional) – Custom display names for the models. Can be a string or list of strings.

  • overlay (bool, optional) – If True, overlays all model lift curves into a single plot. Defaults to False.

  • title (str, optional) – Title for the plot or subplot. Set to "" to suppress the title. Defaults to None.

  • save_plot (bool, optional) – Whether to save the chart(s) to disk. Defaults to False.

  • image_path_png (str, optional) – Output path for saving PNG image(s).

  • image_path_svg (str, optional) – Output path for saving SVG image(s).

  • text_wrap (int, optional) – Maximum number of characters before wrapping titles. If None, no wrapping is applied.

  • curve_kwgs (dict[str, dict] or list[dict], optional) – Dictionary or list of dictionaries for customizing the lift curve(s) (e.g., color, linewidth).

  • linestyle_kwgs (dict, optional) – Styling for the baseline (random lift) reference line. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.

  • subplots (bool, optional) – Whether to show each model in a subplot grid. Cannot be combined with overlay=True.

  • n_cols (int, optional) – Number of columns in the subplot layout. Defaults to 2.

  • n_rows (int, optional) – Number of rows in the subplot layout. If None, automatically inferred.

  • figsize (tuple[int, int], optional) – Tuple specifying the size of the figure in inches. Defaults to (8, 6).

  • label_fontsize (int, optional) – Font size for x/y-axis labels and titles. Defaults to 12.

  • tick_fontsize (int, optional) – Font size for tick marks and legend text. Defaults to 10.

  • gridlines (bool, optional) – Whether to display gridlines in plots. Defaults to True.

  • legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like 'best', 'upper right', 'lower left', etc., or 'bottom' to place legend below the plot. Defaults to 'best'.

Returns:

None. Displays or saves lift charts for the specified classification models or probability inputs.

Return type:

None

Raises:

ValueError

  • If overlay=True and subplots=True are both set.

Important

  • You can supply either model and X or pass y_prob directly.

  • When using y_prob, the function bypasses model predictions and uses the provided probabilities for lift chart calculation.

  • Supports single-model or multiple-model workflows, with either model objects or arrays of pre-computed probabilities.

Notes

  • What is a Lift Chart?
    • Lift quantifies how much better a model is at identifying positive cases compared to random selection.

    • The x-axis represents the proportion of the population (from highest to lowest predicted probability).

    • The y-axis shows the cumulative lift, calculated as the ratio of observed positives to expected positives under random selection.

  • Interpreting Lift Curves:
    • A higher and steeper curve indicates a stronger model.

    • The horizontal dashed line at y = 1 is the baseline for random performance.

    • Curves that drop sharply or flatten may indicate poor ranking ability.

  • Layout Options:
    • Use overlay=True to visualize all models on a single axis.

    • Use subplots=True for a side-by-side layout of lift charts.

    • Neither set? Each model gets its own full-sized chart.

  • Customization:
    • Customize the appearance of each model’s curve using curve_kwgs.

    • Modify the baseline reference line with linestyle_kwgs.

    • Control title wrapping and font sizes via text_wrap, label_fontsize, and tick_fontsize.

  • Saving Plots:
    • If save_plot=True, figures are saved as <model_title>_lift.png/svg or overlay_lift.png/svg.

Lift Chart Example 1: Subplot Layout

In this first Lift Chart example, we evaluate and compare the ranking performance of two classification models, Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section. The chart displays Lift curves for both models in a two-column subplot layout (the default n_cols=2 with a single inferred row), enabling side-by-side comparison of how effectively each model prioritizes positive cases.

Each plot shows the model’s Lift across increasing portions of the test set, with a red dashed line at Lift = 1 indicating the baseline (random performance). Curves above this line reflect the model’s ability to identify more positives than would be expected by chance. The Random Forest produces a steeper initial lift, demonstrating greater concentration of positive cases near the top-ranked predictions.

The show_lift_chart function allows for rich customization, including plot dimensions, axis font sizes, and curve styling. In this example, we set the line colors and widths for both models; the plots can also be saved in PNG and SVG formats (via save_plot, image_path_png, and image_path_svg) for further reporting or documentation.

from model_metrics import show_lift_chart

show_lift_chart(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=["Logistic Regression", "Random Forest"],
    linestyle_kwgs={"color": "red", "linestyle": "--"},
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
    subplots=True,
)

Output

Lift Chart Example 2: Overlay

This example overlays the Lift curves of two classification models, Logistic Regression and Random Forest Classifier, on a single plot for direct visual comparison. Both models were trained on the same synthetic dataset from the Binary Classification Models section, and their lift performance is evaluated on the shared test set.

The Lift curve shows how many more positive outcomes are captured by the model at each quantile compared to a random baseline. A horizontal red dashed line at Lift = 1 represents random selection; curves above this line indicate effective ranking of positive cases. Overlaying curves makes it easier to assess which model better concentrates true positives near the top of the prediction list.

Using the overlay=True option, the show_lift_chart function generates a clean, unified plot. Each curve is styled with linewidth=2 for clarity, and all axis elements and tick marks are sized for presentation-quality output. This layout is particularly helpful for slide decks, performance reports, or model selection discussions.

from model_metrics import show_lift_chart

show_lift_chart(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    overlay=True,
    model_title=["Logistic Regression", "Random Forest"],
    linestyle_kwgs={"color": "red", "linestyle": "--", "linewidth": 2},
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
)

Output

Gain Charts

This section explores how to evaluate the cumulative performance of classification models in identifying positive outcomes using gain charts. These charts are especially effective at showing the model’s ability to concentrate the correct (positive) predictions in the top-ranked portion of the dataset. Using the same Logistic Regression and Random Forest Classifier models trained on the synthetic dataset introduced in the Binary Classification Models section, we demonstrate how to plot and compare Gain Curves across models.

A gain chart shows the cumulative percentage of actual positive cases captured as we move through the population sorted by predicted probability. Unlike the Lift Chart, which displays the ratio of model performance over baseline, the Gain Chart directly shows the percentage of positives captured, providing a more intuitive sense of how effective a model is at identifying positives early in the ranked list. The mathematical definition of Gain is given below.
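
In this notation, cumulative gain at the top fraction t of the ranked sample is simply the share of all positives captured so far:

\[\text{Gain}(t) = \frac{\text{TP}(t)}{P}\]

where TP(t) is the number of positives found among the top t·N ranked predictions and P is the total number of positives.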

The show_gain_chart function supports single or multiple models, with options to overlay all gain curves in a single plot or display them in a flexible subplot layout. Labels, title wrapping, curve styles, and saving output images are all customizable, making this function well-suited for both development analysis and final reporting.

show_gain_chart(model, X, y, y_prob=None, xlabel='Percentage of Sample', ylabel='Cumulative Gain', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, subplots=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, legend_loc='best', show_gini=False, decimal_places=3)
Parameters:
  • model (object or list[object]) – A trained classifier or list of classifiers. Each model must support predict_proba unless y_prob is supplied directly.

  • X (pd.DataFrame or np.ndarray) – The feature matrix used for prediction. Required if model is provided.

  • y (pd.Series or np.ndarray) – Ground truth binary labels.

  • y_prob (list, array, or None, optional) – Predicted probabilities for classification models. Can be provided instead of model and X.

  • xlabel (str, optional) – Label for the x-axis. Defaults to "Percentage of Sample".

  • ylabel (str, optional) – Label for the y-axis. Defaults to "Cumulative Gain".

  • model_title (str or list[str], optional) – Custom display names for each model. If None, defaults to sequential names.

  • overlay (bool, optional) – If True, overlay all models on a single axis. Mutually exclusive with subplots.

  • title (str, optional) – Plot or subplot title. Set to "" to suppress the title.

  • save_plot (bool, optional) – Whether to save the chart(s) to disk.

  • image_path_png (str, optional) – Output path for saving PNG image(s).

  • image_path_svg (str, optional) – Output path for saving SVG image(s).

  • text_wrap (int, optional) – Max characters before title wrapping. Set to None to disable.

  • curve_kwgs (dict[str, dict] or list[dict], optional) – Dict or list of kwargs per model to customize line style.

  • linestyle_kwgs (dict, optional) – Styling for the random baseline. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.

  • subplots (bool, optional) – Whether to render a subplot layout. Cannot be used with overlay.

  • n_cols (int, optional) – Columns in the subplot layout. Defaults to 2.

  • n_rows (int, optional) – Rows in the subplot layout. If None, inferred automatically.

  • figsize (tuple[int, int], optional) – Figure size (width, height) in inches.

  • label_fontsize (int, optional) – Font size for axis labels and titles.

  • tick_fontsize (int, optional) – Font size for tick marks and legends.

  • gridlines (bool, optional) – Whether to show gridlines on the plots.

  • legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like 'best', 'upper right', 'lower left', etc., or 'bottom' to place legend below the plot. Defaults to 'best'.

  • show_gini (bool, optional) – Whether to display the Gini coefficient in the legend. Defaults to False.

  • decimal_places (int, optional) – Number of decimal places for displaying the Gini coefficient. Defaults to 3.

Returns:

None. Displays or saves Gain Charts for one or more models.

Return type:

None

Raises:

ValueError

  • If overlay=True and subplots=True are both set.

Important

  • You can supply either model and X or pass y_prob directly.

  • When using y_prob, the function bypasses model predictions and plots cumulative gains directly from the provided probabilities.

  • Supports single-model or multiple-model workflows, with either model objects or arrays of pre-computed probabilities.

Notes

  • What is a Gain Chart?
    • Plots the cumulative percentage of positives captured vs. sample size.

    • The x-axis shows the fraction of the sample, ranked by predicted probability.

    • The y-axis shows what percentage of the total positives have been captured.

  • Why use Gain Charts?
    • Gain Charts help answer: “If I contact the top X% of predictions, how many positives will I catch?”

    • Especially useful in marketing, lead scoring, risk management, and fraud detection.

  • Reading Gain Curves:
    • Curves that rise steeply and plateau early indicate better model performance.

    • The dashed baseline (diagonal line) represents random selection.

  • Layout Options:
    • Use overlay=True to combine all gain curves into a single plot.

    • Use subplots=True for a subplot layout per model.

    • If neither is set, plots will be rendered individually.

  • Styling Options:
    • Customize individual model lines via curve_kwgs.

    • Modify the diagonal baseline line using linestyle_kwgs.

    • Adjust fonts and wrapping for presentation clarity.

  • Saving Output:
    • Enable save_plot=True to save figures as PNG and/or SVG.

    • Files are named using the model title (e.g., Model_1_gain.png or overlay_gain.svg).

Gain Chart Example 1: Subplot Layout

In this first Gain Chart example, we compare the cumulative gain performance of two classification models: Logistic Regression and Random Forest Classifier trained on the synthetic dataset from the Binary Classification Models section. This visualization showcases their ability to identify positive instances across different percentiles of the ranked test data.

Each subplot presents the cumulative gain achieved as a function of the percentage of the sample, sorted by descending predicted probability. The grey dashed line represents the baseline (random gain). A model that identifies a high proportion of positive cases in the early part of the ranking will have a steeper and higher curve. In this example, the Random Forest model outpaces Logistic Regression, indicating better early identification of positives.

The show_gain_chart function allows flexible styling and layout control. This example uses the default two-column subplot layout, a wider figsize, and customized line widths and colors; the figure can also be saved via save_plot for documentation or stakeholder presentations.

from model_metrics import show_gain_chart

show_gain_chart(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    figsize=(12, 6),
    model_title=["Logistic Regression", "Random Forest"],
    linestyle_kwgs={"color": "grey", "linestyle": "--"},
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
    subplots=True,
)

Output

Gain Chart Example 2: Displaying Gini Coefficients

This example demonstrates how to include Gini coefficients directly in the gain chart legends using the show_gini=True parameter. The Gini coefficient is a summary statistic derived from the area under the gain curve (AUGC), calculated as 2 × AUGC - 1, and ranges from 0 to 1 where higher values indicate better model discrimination.
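
As a rough illustration of that calculation (not the library’s internal implementation), a Gini value can be recovered from the gain curve with NumPy and scikit-learn:

import numpy as np
from sklearn.metrics import auc

def gini_from_gain(y_true, y_prob):
    # Rank samples by descending predicted probability
    order = np.argsort(-np.asarray(y_prob))
    y_sorted = np.asarray(y_true)[order]
    # Cumulative share of positives captured vs. fraction of sample contacted,
    # prepending (0, 0) so the area under the curve is well defined
    gains = np.concatenate([[0.0], np.cumsum(y_sorted) / y_sorted.sum()])
    frac = np.concatenate([[0.0], np.arange(1, len(y_sorted) + 1) / len(y_sorted)])
    augc = auc(frac, gains)  # trapezoidal area under the gain curve
    return 2 * augc - 1

Values computed this way should match the legend entries up to rounding, assuming the library constructs the gain curve in the same manner.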

Both models, Logistic Regression and Random Forest Classifier, were trained on the synthetic dataset from the Binary Classification Models section. By enabling show_gini=True (and optionally setting decimal_places=3), each model’s legend entry automatically displays its Gini coefficient, providing both visual and quantitative performance comparison in a single view.

The Gini coefficient complements the visual gain curve by offering a single number that summarizes discriminative power. In this example, both the curve shape and the Gini value help identify which model better concentrates positive cases at the top of the predicted ranking. This is particularly useful in presentations, model selection discussions, and performance reporting where stakeholders need both graphical intuition and numeric metrics.

from model_metrics import show_gain_chart

show_gain_chart(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=["Logistic Regression", "Random Forest"],
    linestyle_kwgs={"color": "red", "linestyle": "--"},
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
    subplots=True,
    show_gini=True,
    decimal_places=3,
)

Output

Gini coefficient for Logistic Regression: 0.374
Gini coefficient for Random Forest: 0.410

Gain Chart Example 3: Overlay

This example overlays the Gain curves of two classification models, Logistic Regression and Random Forest Classifier, on a single plot to enable direct visual comparison of their cumulative gain performance. Both models were trained on the same synthetic dataset from the Binary Classification Models section and evaluated on the same test set.

The Gain curve shows the cumulative proportion of true positives captured as you move through the population, ranked by predicted probability. A diagonal baseline line from (0, 0) to (1, 1) indicates the expected performance of a random model. Curves that rise above this line demonstrate superior model ability to concentrate positive cases near the top of the ranked list.

By setting overlay=True, the show_gain_chart function produces a single, easy-to-read plot containing both models’ gain curves. Each curve is styled with linewidth=2 for clear visibility. Overlay layouts are ideal for model selection discussions, presentations, and performance dashboards.

from model_metrics import show_gain_chart

show_gain_chart(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    overlay=True,
    model_title=["Logistic Regression", "Random Forest"],
    linestyle_kwgs={"color": "red", "linestyle": "--", "linewidth": 2},
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
)

Output

ROC AUC Curves

This section demonstrates how to evaluate the performance of binary classification models using ROC AUC curves, a key metric for assessing the trade-off between true positive and false positive rates. Using the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset from the previous (Binary Classification Models) section, we generate ROC curves to visualize their discriminatory power.

ROC AUC (Receiver Operating Characteristic Area Under the Curve) provides a single scalar value representing a model’s ability to distinguish between positive and negative classes, with a value of 1 indicating perfect classification and 0.5 representing random guessing. The curves are plotted by varying the classification threshold and calculating the true positive rate (sensitivity) against the false positive rate (1-specificity). This makes ROC AUC particularly useful for comparing models like Logistic Regression, which relies on linear decision boundaries, and Random Forest Classifier, which leverages ensemble decision trees, especially when class imbalances or threshold sensitivity are concerns. The show_roc_curve function simplifies this process, enabling users to visualize and compare these curves effectively, setting the stage for detailed performance analysis in subsequent examples.
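
For reference, the two rates traced by each ROC curve are

\[\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}\]

computed from the confusion matrix at every candidate threshold.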

The show_roc_curve function provides a flexible and powerful way to visualize the performance of binary classification models using Receiver Operating Characteristic (ROC) curves. Whether you’re comparing multiple models, evaluating subgroup fairness, or preparing publication-ready plots, this function allows full control over layout, styling, and annotations. It supports single and multiple model inputs, optional overlay or subplot layouts, and group-wise comparisons via a categorical feature. Additional options allow custom axis labels, AUC precision, curve styling, and export to PNG/SVG. Designed to be both user-friendly and highly configurable, show_roc_curve is a practical tool for model evaluation and stakeholder communication.

show_roc_curve(model=None, X=None, y_prob=None, y=None, xlabel='False Positive Rate', ylabel='True Positive Rate', model_title=None, decimal_places=2, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, subplots=False, n_rows=None, n_cols=2, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, group_category=None, delong=None, show_operating_point=False, operating_point_method='youden', operating_point_kwgs=None, legend_loc='lower right')
Parameters:
  • model (estimator, list[estimator], or str) – A trained estimator, list of estimators, or placeholders (strings). If y_prob is provided directly, model may be None.

  • X (pd.DataFrame or np.ndarray) – Feature data for prediction. Required when model objects are provided and y_prob is not.

  • y_prob (array-like or list[array-like], optional) – Predicted probabilities for the positive class. Can be a single array or a list of arrays corresponding to multiple models.

  • y (pd.Series or np.ndarray) – True binary target labels for ROC evaluation.

  • xlabel (str, optional) – Label for the x-axis.

  • ylabel (str, optional) – Label for the y-axis.

  • model_title (str or list[str], optional) – Custom model title(s). Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.

  • decimal_places (int, optional) – Number of decimal places for rounding AUC values.

  • overlay (bool, optional) – Whether to overlay multiple models in a single plot. Cannot be used with subplots or group_category.

  • title (str, optional) – Title for the plot or subplots. If "", disables titles entirely.

  • save_plot (bool, optional) – Whether to save plots to disk.

  • image_path_png (str, optional) – Path to save PNG images.

  • image_path_svg (str, optional) – Path to save SVG images.

  • text_wrap (int, optional) – Maximum width before wrapping long titles.

  • curve_kwgs (list[dict] or dict[str, dict], optional) – Style parameters for ROC curves. Accepts a list of dicts or a nested dict keyed by model titles.

  • linestyle_kwgs (dict, optional) – Style dictionary for the random guess (diagonal) line.

  • subplots (bool, optional) – Whether to arrange plots in a grid layout. Cannot be used with overlay=True or group_category.

  • n_rows (int, optional) – Number of rows for the subplot grid. Calculated automatically if None.

  • n_cols (int, optional) – Number of columns for the subplot grid.

  • figsize (tuple, optional) – Figure size (width, height) in inches.

  • label_fontsize (int, optional) – Font size for axis labels and titles.

  • tick_fontsize (int, optional) – Font size for tick labels and legend.

  • gridlines (bool, optional) – Whether to display gridlines.

  • group_category (array-like, optional) – Categorical variable to group ROC curves (e.g., by sex or race). Cannot be used with overlay or subplots.

  • delong (tuple or list[array-like], optional) – Tuple or list containing two predicted probability arrays for Hanley and McNeil’s parametric AUC comparison. Cannot be used with group_category.

  • show_operating_point (bool, optional) – Whether to display an optimal operating point on the ROC curve.

  • operating_point_method (str, optional) – Method used to compute the operating point. Supported options are "youden" and "closest_topleft".

  • operating_point_kwgs (dict, optional) – Styling options for the operating point marker (passed to matplotlib.scatter).

  • legend_loc (str, optional) – Legend location. Standard matplotlib locations or "bottom" to place the legend below the plot.

Returns:

None. Displays or saves ROC curve plots.

Return type:

None

Raises:

ValueError

  • If both subplots=True and overlay=True.

  • If group_category is used with overlay or subplots.

  • If overlay=True is used with only one model.

  • If delong is used when group_category is provided.

  • If neither (model and X) nor y_prob are supplied.

  • If operating_point_method is not one of the supported options.

Important

  • You can provide either model with X or directly pass y_prob.

  • overlay and subplots are mutually exclusive.

  • group_category groups ROC curves by unique category values.

  • The delong parameter performs a correlated ROC comparison using the Hanley and McNeil parametric approximation.

Notes

  • Flexible Inputs
    • model and model_title can be scalars or lists.

    • Strings in model act as placeholders when using precomputed probabilities.

    • Curve styling is controlled via curve_kwgs and linestyle_kwgs.

  • Operating Point Visualization
    • When show_operating_point=True, the optimal threshold is computed and displayed on the ROC curve.

    • The operating point is labeled in the legend with its threshold value.

  • Group-wise ROC
    • When group_category is provided, ROC curves are computed per group.

    • Legends display AUC, total count, positive count, and negative count.

  • Plot Modes
    • overlay=True: All ROC curves on a single plot.

    • subplots=True: Each model plotted in a grid layout.

    • Default behavior produces one plot per model.

  • Saving Plots
    • If save_plot=True, figures are saved to the specified paths.

    • Filenames are generated automatically based on model name and plot mode.

The show_roc_curve function provides flexible and highly customizable plotting of ROC curves for binary classification models. It supports overlays, subplot layouts, and subgroup visualizations, while also allowing export options and styling hooks for publication-ready output.

ROC AUC Example 1: Subplot Layout

In this first ROC AUC evaluation example, we plot the ROC curves for two models, Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section. The curves are displayed side by side using a subplot layout (n_cols=2, n_rows=1), with the Logistic Regression curve in blue and the Random Forest curve in black for clear differentiation. A red dashed line represents the random guessing baseline. This example demonstrates how the show_roc_curve function enables straightforward visualization of model performance, with options to customize colors, add a grid, and save the plot for reporting purposes.

from model_metrics import show_roc_curve

show_roc_curve(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    decimal_places=2,
    n_cols=2,
    n_rows=1,
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
    linestyle_kwgs={"color": "red", "linestyle": "--"},
    subplots=True,
)

Output

ROC AUC Example 2: Overlay

In this second ROC AUC evaluation example, we overlay the results of two models, Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section, onto a single plot. Using the show_roc_curve function with the overlay=True parameter, the ROC curves for both models are displayed together, with Logistic Regression in blue and Random Forest in black, both with a linewidth=2. A red dashed line serves as the random guessing baseline, and the plot includes a custom title for clarity.

from model_metrics import show_roc_curve

show_roc_curve(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    decimal_places=2,
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
    linestyle_kwgs={"color": "red", "linestyle": "--"},
    title="ROC Curves: Logistic Regression and Random Forest",
    overlay=True,
)

Output

ROC AUC Example 3: DeLong’s Test

In this third ROC AUC evaluation example, we demonstrate how to statistically compare the performance of two correlated models using Hanley & McNeil’s parametric AUC comparison (an approximation of DeLong’s test). We utilize the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset from the Binary Classification Models section. By passing their predicted probabilities to the delong parameter of the show_roc_curve function, we can assess whether the difference in AUC between the two models is statistically significant. This is particularly useful when models are evaluated on the same dataset, as it accounts for the inherent correlation in their predictions.

from model_metrics import show_roc_curve

show_roc_curve(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    decimal_places=2,
    delong=[model1.predict_proba(X_test)[:, 1], model2.predict_proba(X_test)[:, 1]],
)

Output

AUC for Logistic Regression: 0.91

Hanley & McNeil AUC comparison (Approximation of DeLong's Test):
  Logistic Regression AUC = 0.913
  Random Forest AUC = 0.952
  p-value = 0.0557

ROC AUC Example 4: Hanley & McNeil AUC Test

In this fourth ROC AUC evaluation example, we focus on the results of two models: Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section.

The hanley_mcneil_auc_test function performs a large-sample z-test for the difference between two correlated AUCs, based on Hanley & McNeil (1982).

hanley_mcneil_auc_test(y_true, y_scores_1, y_scores_2, model_names=None, verbose=True, return_values=False, decimal_places=4)
Parameters:
  • y_true (array-like) – True binary class labels.

  • y_scores_1 (array-like) – Predicted probabilities or decision scores from the first model.

  • y_scores_2 (array-like) – Predicted probabilities or decision scores from the second model.

  • model_names (list or tuple of str, optional) – Optional model names for printed output. Defaults to ("Model 1", "Model 2") if not provided.

  • verbose (bool, optional) – Whether to print the formatted AUC comparison and p-value summary. Defaults to True.

  • return_values (bool, optional) – Whether to return the numerical results (auc1, auc2, p_value) for programmatic access instead of just printing them. Defaults to False.

  • decimal_places (int, optional) – Number of decimal places for printed AUC and p-value. Defaults to 4.

Returns:

Tuple of floats (auc1, auc2, p_value) if return_values=True. Otherwise, prints the results and returns None.

Return type:

tuple or None

Important

  • This test compares two correlated ROC curves evaluated on the same set of samples.

  • It is a parametric approximation of DeLong’s nonparametric test, as described in [4].

  • The p-value tests the null hypothesis that the two AUCs are equal.

Notes

  • Formula Overview:
    • Standard error (SE) is computed using Hanley & McNeil’s approximation based on sample counts and the AUC of the first model (the classic 1982 formula is reproduced after these notes).

    • The z-statistic is then computed as: z = (auc1 - auc2) / SE

    • The two-sided p-value is derived as: p = 2 * (1 - norm.cdf(|z|))

  • Typical Use Case:
    • Use this function when comparing two models trained and tested on the same dataset to evaluate whether their ROC-AUCs differ significantly.

    • The function is particularly useful within pipelines or visualization utilities such as show_roc_curve() when the delong argument is provided.

  • Integration Example:
    • This test can be used independently or embedded in a plotting function to provide AUC significance testing.
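
For reference, the classic Hanley & McNeil (1982) standard error for a single AUC value A, given n_+ positive and n_- negative samples, is

\[SE(A) = \sqrt{\frac{A(1 - A) + (n_{+} - 1)(Q_1 - A^2) + (n_{-} - 1)(Q_2 - A^2)}{n_{+}\, n_{-}}}, \qquad Q_1 = \frac{A}{2 - A}, \qquad Q_2 = \frac{2A^2}{1 + A}\]

How the function pools this quantity across the two correlated AUCs may differ from this single-curve form.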

Example:

from model_metrics import hanley_mcneil_auc_test

# Compare two models' ROC-AUC scores
hanley_mcneil_auc_test(
    y_test,
    model1.predict_proba(X_test)[:, 1],
    model2.predict_proba(X_test)[:, 1],
    model_names=["Logistic Regression", "Random Forest"],
    verbose=True,
    decimal_places=6,
)

Output

Hanley & McNeil AUC Comparison (Approximation of DeLong's Test):
  Logistic Regression AUC = 0.912643
  Random Forest AUC = 0.951969
  p-value = 0.054494

ROC AUC Example 5: Operating Point Using Youden’s J

In this fifth ROC AUC evaluation example, we focus on the results of two models: Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section.

The objective of this example is to identify and visualize an optimal operating point on the ROC curve using Youden’s J statistic, defined as:

\[J = \text{TPR} - \text{FPR}\]

This criterion selects the threshold that maximizes the vertical distance between the ROC curve and the random-guess diagonal, providing a balanced tradeoff between sensitivity and specificity. More information on this method can be found here.
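
For intuition, a threshold of this kind can also be derived by hand with scikit-learn; the snippet below is purely illustrative and reuses model1 from the earlier classification setup:

import numpy as np
from sklearn.metrics import roc_curve

# Youden's J: pick the threshold that maximizes TPR - FPR
fpr, tpr, thresholds = roc_curve(y_test, model1.predict_proba(X_test)[:, 1])
youden_threshold = thresholds[np.argmax(tpr - fpr)]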

The show_roc_curve function supports this directly via the show_operating_point and operating_point_method parameters.

In the example below, we compute the ROC curves for the Logistic Regression and Random Forest models and annotate the optimal operating point determined by Youden’s J statistic.

from model_metrics import show_roc_curve

show_roc_curve(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    decimal_places=2,
    show_operating_point=True,
    operating_point_method="youden",
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
    linestyle_kwgs={"color": "red", "linestyle": "--"},
    operating_point_kwgs={
        "marker": "o",
        "color": "red",
        "s": 100,
    },
)

When enabled, the operating point is plotted directly on the ROC curve and annotated in the legend with its corresponding decision threshold.

Output

ROC AUC Example 6: Closest to Top Left

In this sixth example, we demonstrate an alternative method for identifying an optimal operating point on the ROC curve using the closest-to-top-left criterion. Like Youden’s J statistic, this approach seeks a balanced threshold, but instead of maximizing the vertical distance from the diagonal, it minimizes the Euclidean distance to the ideal point (0, 1) in ROC space.

The closest-to-top-left method finds the threshold that minimizes:

\[d = \sqrt{(1 - \text{TPR})^2 + \text{FPR}^2}\]

This geometric criterion is particularly useful when you want to prioritize proximity to perfect classification (top-left corner) rather than maximizing the difference between true positive and false positive rates.
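
For intuition, this amounts to selecting the threshold index that minimizes the distance d along the ROC curve. A minimal standalone sketch with scikit-learn, assuming model1, X_test, and y_test from the earlier examples (show_roc_curve handles this selection for you when operating_point_method="closest_topleft"):

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, model1.predict_proba(X_test)[:, 1])

# Euclidean distance from each ROC point to the ideal corner (FPR=0, TPR=1)
distances = np.sqrt((1 - tpr) ** 2 + fpr**2)
best_idx = np.argmin(distances)
print(f"Closest-to-top-left threshold: {thresholds[best_idx]:.4f} (d = {distances[best_idx]:.4f})")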

In this ROC AUC evaluation example, we focus on the results of two models: Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section.

The show_roc_curve function supports this method through the operating_point_method parameter by setting it to "closest_topleft". In the example below, we compute the ROC curves for the Logistic Regression and Random Forest classifiers and annotate the optimal operating point using the closest-to-top-left criterion.

from model_metrics import show_roc_curve

show_roc_curve(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    decimal_places=2,
    show_operating_point=True,
    subplots=True,
    operating_point_method="closest_topleft",
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
    linestyle_kwgs={"color": "red", "linestyle": "--"},
    operating_point_kwgs={
        "marker": "o",
        "color": "red",
        "s": 100,
    },
)

ROC AUC Example 7: by Category

In this seventh ROC AUC evaluation example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.

To build and evaluate our models, we use the model_tuner library [3]. Click here to view the corresponding codebase for this workflow.

The objective here is to assess ROC AUC scores not just overall, but across each category of a selected feature, such as occupation, education, marital-status, or race. This approach enables deeper insight into how performance varies by subgroup, which is particularly important for fairness, bias detection, and subgroup-level interpretability.

The show_roc_curve function supports this analysis through the group_category parameter.

For example, by passing group_category=X_test_2["race"], you can generate a separate ROC curve for each unique racial group in the dataset:

from model_metrics import show_roc_curve

show_roc_curve(
    model=model_rf["model"].estimator,
    X=X_test,
    y=y_test,
    model_title="Random Forest Classifier",
    decimal_places=2,
    group_category=X_test_2["race"],
)

Output

Precision-Recall Curves

This section demonstrates how to evaluate the performance of binary classification models using Precision-Recall (PR) curves, a critical visualization for understanding model behavior in the presence of class imbalance. Using the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset from the previous (Binary Classification Models) section, we generate PR curves to examine how well each model identifies true positives while limiting false positives.

Precision-Recall curves focus on the trade-off between precision (positive predictive value) and recall (sensitivity) across different classification thresholds. This is particularly important when the positive class is rare, as is common in fraud detection, disease diagnosis, or adverse event prediction, because ROC AUC can overstate performance under imbalance. Unlike the ROC curve, the PR curve is sensitive to the proportion of positive examples and gives a clearer picture of how well a model performs where it matters most: in identifying the positive class.

The area under the Precision-Recall curve, also known as Average Precision (AP), summarizes model performance across thresholds. A model that maintains high precision as recall increases is generally more desirable, especially in settings where false positives have a high cost. This makes the PR curve a complementary and sometimes more informative tool than ROC AUC in skewed classification scenarios.

show_pr_curve(model=None, X=None, y=None, y_prob=None, xlabel='Recall', ylabel='Precision', model_title=None, decimal_places=2, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, subplots=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, group_category=None, legend_metric='ap', legend_loc='lower left')
Parameters:
  • model (object or str or list[object or str]) – A trained model, a string placeholder, or a list containing models or strings to evaluate. If y_prob is supplied directly, model may be None.

  • X (pd.DataFrame or np.ndarray) – Feature matrix used for prediction. Required when model objects are supplied and y_prob is not.

  • y (pd.Series or np.ndarray) – True binary labels for evaluation.

  • y_prob (array-like or list[array-like], optional) – Predicted probabilities for one or multiple models (positive class, shape (n_samples,) or list of such arrays). When provided, the function bypasses model prediction and uses these probabilities directly.

  • xlabel (str, optional) – Label for the x-axis. Defaults to "Recall".

  • ylabel (str, optional) – Label for the y-axis. Defaults to "Precision".

  • model_title (str or list[str], optional) – Custom title(s) for the model(s). Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.

  • decimal_places (int, optional) – Number of decimal places for Average Precision (AP) or AUCPR values. Defaults to 2.

  • overlay (bool, optional) – Whether to overlay multiple models on a single plot. Defaults to False.

  • title (str, optional) – Title for the plot (used in overlay mode or as global title). If "", disables the title. Defaults to None.

  • save_plot (bool, optional) – Whether to save the plot(s) to file. Defaults to False.

  • image_path_png (str, optional) – File path to save the plot(s) as PNG.

  • image_path_svg (str, optional) – File path to save the plot(s) as SVG.

  • text_wrap (int, optional) – Maximum character width before wrapping plot titles. If None, no wrapping is applied.

  • curve_kwgs (list[dict] or dict[str, dict], optional) – Plot styling for PR curves. Accepts a list of dictionaries or a nested dictionary keyed by model_title.

  • subplots (bool, optional) – Whether to organize the PR plots in a subplot layout. Cannot be used with overlay=True or group_category.

  • n_cols (int, optional) – Number of columns in the subplot layout. Defaults to 2.

  • n_rows (int, optional) – Number of rows in the subplot layout. If None, calculated automatically based on number of models and columns.

  • figsize (tuple, optional) – Size of the plot or subplots, in inches. Defaults to (8, 6).

  • label_fontsize (int, optional) – Font size for axis labels and titles. Defaults to 12.

  • tick_fontsize (int, optional) – Font size for ticks and legend text. Defaults to 10.

  • gridlines (bool, optional) – Whether to display gridlines on plots. Defaults to True.

  • group_category (array-like, optional) – Categorical array to group PR curves. Cannot be used with subplots=True or overlay=True.

  • legend_metric (str, optional) – Metric to display in the legend. Either "ap" (Average Precision) or "aucpr" (area under the PR curve). Defaults to "ap".

  • legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like 'lower left', 'upper right', etc., or 'bottom' to place legend below the plot. Defaults to 'lower left'.

Returns:

None. Displays or saves Precision-Recall curve plots for classification models.

Return type:

None

Raises:
  • ValueError

    • If subplots=True and overlay=True are both set.

    • If group_category is used with subplots=True or overlay=True.

    • If overlay=True is used with only one model.

    • If legend_metric is not one of "ap" or "aucpr".

    • If neither (model and X) nor y_prob is provided.

  • TypeError

    • If model_title is not a string, list of strings, or None.

Important

  • You can supply either model and X or pass y_prob directly.

  • When using y_prob, the function bypasses model predictions and computes PR curves and metrics directly from the provided probabilities.

  • Supports single-model or multi-model workflows with either model objects or arrays of pre-computed probabilities (provide a list for multiple curves).

Notes

  • Flexible Inputs:
    • model and model_title can be individual items or lists. Strings passed in model are treated as placeholder names.

    • Titles can be automatically inferred or explicitly passed using model_title.

  • Group-Wise PR:
    • If group_category is passed, separate PR curves are plotted for each unique group.

    • The legend will include group-specific Average Precision and class distribution (e.g., AP = 0.78, Count: 500, Pos: 120, Neg: 380).

  • Average Precision vs. AUCPR:
    • By default, the legend shows Average Precision (AP), which summarizes the PR curve as a weighted mean of the precision at each threshold, weighted by the increase in recall from the previous threshold.

    • If the user passes legend_metric="aucpr", the legend will instead display AUCPR (Area Under the Precision-Recall Curve), which measures the total area under the curve (a short sketch contrasting the two appears just before the examples below).

  • Plot Modes:
    • overlay=True overlays all models in one figure.

    • subplots=True arranges individual PR plots in a subplot layout.

    • If neither is set, separate full-size plots are shown for each model.

  • Legend and Styling:
    • A random classifier baseline (constant precision) is plotted by default.

    • Customize PR curves with curve_kwgs.

    • Titles can be disabled with title="".

  • Saving Plots:
    • If save_plot=True, plots are saved using the base filename format <model_name>_precision_recall or overlay_pr_plot.

The show_pr_curve function provides flexible and highly customizable plotting of Precision-Recall curves for binary classification models. It supports overlays, subplot layouts, and subgroup visualizations, while also allowing export options and styling hooks for publication-ready output.
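
The distinction between the two legend metrics can be checked directly with scikit-learn. A minimal sketch, assuming model1, X_test, and y_test from the synthetic-data examples (this illustrates the two summary values, not the function's internals):

from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Probabilities for the positive class from one of the earlier models
y_prob_lr = model1.predict_proba(X_test)[:, 1]

precision, recall, _ = precision_recall_curve(y_test, y_prob_lr)

# AP: precision averaged over thresholds, weighted by recall increments
ap = average_precision_score(y_test, y_prob_lr)

# AUCPR: trapezoidal area under the PR curve; usually close to AP but not identical
aucpr = auc(recall, precision)

print(f"AP = {ap:.3f}, AUCPR = {aucpr:.3f}")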

Precision-Recall Example 1: Subplot Layout

In this first Precision-Recall evaluation example, we plot the PR curves for two models: Logistic Regression and Random Forest Classifier, both trained on the synthetic dataset from the Binary Classification Models section. The curves are arranged side by side using a subplot layout (n_cols=2, n_rows=1), with the Logistic Regression curve rendered in blue and the Random Forest curve in black to distinguish between models. A gray dashed line indicates the baseline precision, equal to the prevalence of the positive class in the dataset.

This example illustrates how the show_pr_curve function makes it easy to visualize and compare model performance when dealing with class imbalance. It also demonstrates layout flexibility and customization options, including gridlines, label styling, and export functionality, making it suitable for both exploratory analysis and final reporting.

from model_metrics import show_pr_curve

show_pr_curve(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    decimal_places=2,
    n_cols=2,
    n_rows=1,
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
    subplots=True,
)

Output

Precision-Recall Example 2: Overlay

In this second Precision-Recall evaluation example, we overlay the results of two models, Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section, onto a single plot. Using the show_pr_curve function with overlay=True, the Precision-Recall curves for both models are displayed together, with Logistic Regression in blue and Random Forest in black, both with linewidth=2. The plot includes a custom title for clarity.

from model_metrics import show_pr_curve

show_pr_curve(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    decimal_places=2,
    n_cols=2,
    n_rows=1,
    curve_kwgs={
        "Logistic Regression": {"color": "blue", "linewidth": 2},
        "Random Forest": {"color": "black", "linewidth": 2},
    },
    title="ROC Curves: Logistic Regression and Random Forest",
    overlay=True,
)

Output

Precision-Recall Example 3: Categorical

In this third Precision-Recall evaluation example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.

To build and evaluate our models, we use the model_tuner library [3]. Click here to view the corresponding codebase for this workflow.

The objective here is to assess Precision-Recall performance (Average Precision) not just overall, but across each category of a selected feature, such as occupation, education, marital-status, or race. This approach enables deeper insight into how performance varies by subgroup, which is particularly important for fairness, bias detection, and subgroup-level interpretability.

The show_pr_curve function supports this analysis through the group_category parameter.

For example, by passing group_category=X_test_2["race"], you can generate a separate Precision-Recall curve for each unique racial group in the dataset:

from model_metrics import show_pr_curve

show_pr_curve(
    model=model_rf["model"].estimator,
    X=X_test,
    y=y_test,
    model_title="Random Forest Classifier",
    decimal_places=2,
    group_category=X_test_2["race"],
)

Output

Confusion Matrix Evaluation

This section introduces the show_confusion_matrix function, which provides a flexible, styled interface for generating and visualizing confusion matrices across one or more classification models. It supports advanced features like threshold overrides, subgroup labeling, classification report display, and fully customizable plot aesthetics including subplot layouts.

The confusion matrix is a fundamental diagnostic tool for classification models, displaying the counts of true positives, true negatives, false positives, and false negatives. This function goes beyond standard implementations by allowing for custom thresholds (globally or per model), label annotation (e.g., TP, FP, etc.), plot exporting, colorbar toggling, and subplot visualization.

This is especially useful when comparing multiple models side by side or when producing publication-ready confusion matrices for stakeholders.

show_confusion_matrix(model=None, X=None, y_prob=None, y=None, model_title=None, title=None, model_threshold=None, custom_threshold=None, class_labels=None, cmap='Blues', save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, figsize=(8, 6), labels=True, label_fontsize=12, tick_fontsize=10, inner_fontsize=10, subplots=False, score=None, class_report=False, show_colorbar=False, **kwargs)
Parameters:
  • model (object or str or list[object or str], optional) – A single model (object or string), or a list of models or string placeholders. Can be None if y_pred or y_prob is provided directly.

  • X (pd.DataFrame or np.ndarray, optional) – Feature matrix used for prediction when model is provided. Ignored if y_pred or y_prob are passed directly.

  • y (pd.Series or np.ndarray) – True target labels.

  • y_prob (array-like, optional) – Predicted probabilities (positive class). If provided, thresholds will be applied to convert probabilities into class predictions.

  • model_title (str or list[str], optional) – Custom title(s) for each model. Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.

  • title (str, optional) – Title for each plot. If "", no title is displayed. If None, a default title is shown.

  • model_threshold (dict, optional) – Dictionary of thresholds keyed by model title. Used if custom_threshold is not set.

  • custom_threshold (float, optional) – Global override threshold to apply across all models. If set, takes precedence over model_threshold.

  • class_labels (list[str], optional) – Custom labels for the classes in the matrix.

  • cmap (str, optional) – Colormap to use for the heatmap. Defaults to "Blues".

  • save_plot (bool, optional) – Whether to save the generated plot(s).

  • image_path_png (str, optional) – Path to save the PNG version of the image.

  • image_path_svg (str, optional) – Path to save the SVG version of the image.

  • text_wrap (int, optional) – Maximum width of plot titles before wrapping.

  • figsize (tuple[int, int], optional) – Figure size in inches. Defaults to (8, 6).

  • labels (bool, optional) – Whether to annotate matrix cells with TP, FP, FN, TN.

  • label_fontsize (int, optional) – Font size for axis labels and titles.

  • tick_fontsize (int, optional) – Font size for axis ticks.

  • inner_fontsize (int, optional) – Font size for numbers and labels inside cells.

  • subplots (bool, optional) – Whether to display multiple models in a subplot layout.

  • score (str, optional) – Scoring metric to use when optimizing threshold (if applicable).

  • class_report (bool, optional) – If True, prints a classification report below each matrix.

  • show_colorbar (bool, optional) – Whether to display the colorbar on the confusion matrix heatmap. Defaults to False.

  • kwargs (dict, optional) – Additional keyword arguments for customization (e.g., show_colorbar, n_cols).

Returns:

None. Displays confusion matrix plots (and optionally saves them).

Return type:

None

Raises:

TypeError – If model_title is not a string, a list of strings, or None.

Important

  • You can supply either model and X or pass y_pred or y_prob directly.

  • When using y_prob, the function applies thresholds (custom_threshold or model_threshold) to compute predicted labels for the confusion matrix.

  • When using y_pred, results are calculated directly from the provided predictions without thresholding.

  • Supports single-model or multiple-model workflows, with either model objects or arrays of pre-computed predictions/probabilities.

Notes

  • Model Support:
    • Supports single or multiple classification models.

    • model_title may be inferred automatically or provided explicitly.

  • Threshold Handling:
    • Use model_threshold to specify per-model thresholds.

    • custom_threshold overrides all other thresholds.

  • Plotting Modes:
    • subplots=True arranges plots in subplots.

    • Otherwise, plots are displayed one at a time.

  • Labeling:
    • Set labels=False to disable annotating cells with TP, FP, FN, TN.

    • Always shows raw numeric values inside cells.

  • Colorbar & Styling:
    • Toggle colorbar via show_colorbar (passed via kwargs).

    • Colormap and font sizes are fully configurable.

  • Exporting Plots:
    • Plots can be saved as both PNG and SVG using the respective paths.

    • Saved filenames follow the pattern confusion_matrix_<model_name> or grid_confusion_matrix.
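
Conceptually, the threshold handling described above reduces to binarizing the predicted probabilities before tabulating the matrix. A minimal standalone sketch with scikit-learn, assuming model1, X_test, and y_test from the synthetic-data examples (show_confusion_matrix adds styling, labels, and reports on top of this):

from sklearn.metrics import confusion_matrix

# Predicted probabilities for the positive class
y_prob_lr = model1.predict_proba(X_test)[:, 1]

# Apply a decision threshold (0.5 here; custom_threshold or model_threshold
# would take its place inside show_confusion_matrix)
threshold = 0.5
y_pred_lr = (y_prob_lr >= threshold).astype(int)

print(confusion_matrix(y_test, y_pred_lr))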

Confusion Matrix Example 1: Threshold=0.5

In this first confusion matrix evaluation example, we show the results of two models, Logistic Regression and Random Forest Classifier, trained on the synthetic dataset from the Binary Classification Models section, displayed side by side in a subplot layout.

from model_metrics import show_confusion_matrix

show_confusion_matrix(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    cmap="Blues",
    text_wrap=40,
    subplots=True,
    n_cols=2,
    n_rows=1,
    figsize=(6, 6),
)

Output

Confusion Matrix for Logistic Regression:

        Predicted 0  Predicted 1
Actual 0           75           14
Actual 1           20           91

Confusion Matrix for Random Forest:

        Predicted 0  Predicted 1
Actual 0           80            9
Actual 1           15           96
Confusion Matrix Example 1

Confusion Matrix Example 2: Classification Report

This second confusion matrix evaluation example is nearly identical to the first, but uses a different color map (cmap="viridis"), sets show_colorbar=True, and class_report=True to print classification reports for each model in addition to the visual output.

from model_metrics import show_confusion_matrix

show_confusion_matrix(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    cmap="viridis",
    text_wrap=40,
    subplots=True,
    n_cols=2,
    n_rows=1,
    figsize=(6, 6),
    show_colorbar=True,
    class_report=True
)

Output

Confusion Matrix for Logistic Regression:

          Predicted 0  Predicted 1
Actual 0           76           18
Actual 1           13           93

Classification Report for Logistic Regression:

              precision    recall  f1-score   support

           0       0.85      0.81      0.83        94
           1       0.84      0.88      0.86       106

    accuracy                           0.84       200
   macro avg       0.85      0.84      0.84       200
weighted avg       0.85      0.84      0.84       200

Confusion Matrix for Random Forest:

           Predicted 0  Predicted 1
Actual 0            84           10
Actual 1             3          103

Classification Report for Random Forest:

              precision    recall  f1-score   support

           0       0.97      0.89      0.93        94
           1       0.91      0.97      0.94       106

    accuracy                           0.94       200
   macro avg       0.94      0.93      0.93       200
weighted avg       0.94      0.94      0.93       200
Confusion Matrix Example 2

Confusion Matrix Example 3: Threshold = 0.37

In this third confusion matrix evaluation example using the synthetic dataset from the Binary Classification Models section, we apply a custom classification threshold of 0.37 using the custom_threshold parameter. This overrides the default threshold of 0.5 and enables us to inspect how the confusion matrices shift when a more lenient decision boundary is applied. Refer to the section on threshold selection logic for caveats on choosing the right threshold.

This is especially useful in imbalanced classification problems or cost-sensitive environments where the trade-off between precision and recall must be adjusted. By lowering the threshold, we increase the number of positive predictions, which can improve recall but may come at the cost of more false positives.

The output matrices for both models: Logistic Regression and Random Forest are shown side by side in a subplot layout for easy visual comparison.

from model_metrics import show_confusion_matrix

show_confusion_matrix(
    model=[model1, model2],
    X=X_test,
    y=y_test,
    model_title=model_titles,
    text_wrap=40,
    subplots=True,
    n_cols=2,
    n_rows=1,
    figsize=(6, 6),
    custom_threshold=0.37,
)

Output

Calibration Curves

This section focuses on calibration curves, a diagnostic tool that compares predicted probabilities to actual outcomes, helping evaluate how well a model’s predicted confidence aligns with observed frequencies. Using models like Logistic Regression or Random Forest on the synthetic dataset from the previous (Binary Classification Models) section, we generate calibration curves to assess the reliability of model probabilities.

Calibration is especially important in domains where probability outputs inform downstream decisions, such as healthcare, finance, and risk management. A well-calibrated model not only predicts the correct class but also outputs meaningful probabilities; for example, when a model predicts a probability of 0.7, we expect roughly 70% of such predictions to correspond to positive outcomes.
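
That expectation can be checked directly with scikit-learn's calibration utilities. A minimal sketch, assuming model1, X_test, and y_test from the earlier synthetic-data examples (show_calibration_curve wraps this kind of computation with plotting, styling, and grouping):

from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

y_prob_lr = model1.predict_proba(X_test)[:, 1]

# Fraction of observed positives vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_test, y_prob_lr, n_bins=10)

# Brier score: mean squared difference between predicted probability and outcome
print(f"Brier score: {brier_score_loss(y_test, y_prob_lr):.3f}")
for p, f in zip(mean_pred, frac_pos):
    print(f"mean predicted = {p:.2f} -> observed fraction of positives = {f:.2f}")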

The show_calibration_curve function simplifies this process by allowing users to visualize calibration performance across models or subgroups. The plots show the mean predicted probabilities against the actual observed fractions of positive cases, with an optional reference line representing perfect calibration. Additional features include support for overlay or subplot layouts, subgroup analysis by categorical features, and optional Brier score display, a scalar measure of calibration quality.

The function offers full control over styling, figure layout, axis labels, and output format, making it easy to generate both exploratory and publication-ready plots.

show_calibration_curve(model=None, X=None, y_prob=None, y=None, xlabel='Mean Predicted Probability', ylabel='Fraction of Positives', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, subplots=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, bins=10, marker='o', show_brier_score=True, brier_decimals=3, gridlines=True, linestyle_kwgs=None, group_category=None, legend_loc='best', **kwargs)
Parameters:
  • model (estimator or list, optional) – A trained classifier or a list of classifiers to evaluate. Can be None if y_prob is provided directly.

  • X (pd.DataFrame or np.ndarray, optional) – Feature matrix used for predictions when model is supplied. Ignored if y_prob is passed directly.

  • y_prob (array-like or list of array-like, optional) – Predicted probabilities for the positive class. Can be provided directly instead of model and X.

  • y (pd.Series or np.ndarray) – True binary target values.

  • xlabel (str, optional) – X-axis label. Defaults to "Mean Predicted Probability".

  • ylabel (str, optional) – Y-axis label. Defaults to "Fraction of Positives".

  • model_title (str or list[str], optional) – Custom title(s) for the models.

  • overlay (bool, optional) – If True, overlays multiple models on one plot.

  • title (str, optional) – Title for the plot. Use "" to suppress.

  • save_plot (bool, optional) – Whether to save the plot(s).

  • image_path_png (str, optional) – Directory path for PNG export.

  • image_path_svg (str, optional) – Directory path for SVG export.

  • text_wrap (int, optional) – Max characters before title text wraps.

  • curve_kwgs (list[dict] or dict[str, dict], optional) – Styling options for the calibration curves.

  • subplots (bool, optional) – Whether to arrange models in a subplot layout.

  • n_cols (int, optional) – Number of columns in the subplot layout. Defaults to 2.

  • n_rows (int, optional) – Number of rows in the subplot layout. Auto-calculated if None.

  • figsize (tuple, optional) – Figure size in inches (width, height).

  • label_fontsize (int, optional) – Font size for axis labels and titles.

  • tick_fontsize (int, optional) – Font size for ticks and legend entries.

  • bins (int, optional) – Number of bins used to compute calibration.

  • marker (str, optional) – Marker style for calibration points.

  • show_brier_score (bool, optional) – Whether to display Brier score in the legend.

  • brier_decimals (int, optional) – Number of decimal places to display for the Brier score in legend labels. Defaults to 3.

  • gridlines (bool, optional) – Whether to show gridlines on plots.

  • linestyle_kwgs (dict, optional) – Styling for the “perfectly calibrated” reference line.

  • group_category (array-like, optional) – Categorical variable used to create subgroup calibration plots.

  • legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like 'best', 'upper right', 'lower left', etc., or 'bottom' to place legend below the plot. Defaults to 'best'.

  • kwargs (dict, optional) – Additional keyword arguments passed to the plot function.

Returns:

None. Displays or saves calibration plots for classification models.

Return type:

None

Raises:
  • ValueError

    • If overlay=True and subplots=True are both set.

    • If group_category is used with overlay or subplots.

    • If curve_kwgs list does not match number of models.

  • TypeError

    • If model_title is not a string, list of strings, Series, or None.

Important

  • You can supply either model and X or pass y_prob directly.

  • When using y_prob, the function bypasses model predictions and evaluates calibration curves directly from the provided probabilities.

  • brier_decimals controls the numeric precision of the Brier score shown in legend entries when show_brier_score=True.

  • Supports single-model or multiple-model workflows, including arrays of pre-computed probabilities.

Notes

  • Calibration vs Discrimination:
    • Calibration evaluates how well predicted probabilities reflect observed outcomes, while ROC AUC measures a model’s ability to rank predictions.

  • Flexible Plotting Modes:
    • overlay=True plots multiple models on one figure.

    • subplots=True arranges plots in a subplot layout.

    • If neither is set, individual full-size plots are created.

  • Group-Wise Analysis:
    • Passing group_category plots separate calibration curves by subgroup (e.g., age, race).

    • Each subgroup’s Brier score is shown when show_brier_score=True.

  • Customization:
    • Use curve_kwgs and linestyle_kwgs to control styling.

    • Add markers, gridlines, and custom titles to suit report or presentation needs.

  • Saving Outputs:
    • Set save_plot=True and specify image_path_png or image_path_svg to export figures.

    • Filenames are auto-generated based on model name and plot type.

Important

Calibration curves are a valuable diagnostic tool for assessing the alignment between predicted probabilities and actual outcomes. By plotting the fraction of positives against predicted probabilities, we can evaluate how well a model’s confidence scores correspond to observed reality. While these plots offer important insights, it’s equally important to understand the assumptions and limitations behind the calibration methods used.

Calibration Curve Example 1: Subplots

This example presents calibration curves for two classification models trained on the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.

To build and evaluate our models, we use the model_tuner library [3]. Click here to view the corresponding codebase for this workflow. The classification models are displayed side by side in a subplot layout. Each subplot shows how well the predicted probabilities from a model align with the actual observed outcomes. A diagonal dashed line representing perfect calibration is included in both plots, and Brier scores are shown in the legend to quantify each model’s calibration accuracy.

By setting subplots=True, the function automatically arranges the individual plots based on the number of models and specified columns. This layout is ideal for visually comparing calibration behavior across models without overlapping lines.

pipelines_or_models = [
    model_lr["model"].estimator,
    model_rf["model"].estimator,
    model_dt["model"].estimator,
]

# Model titles
model_titles = [
    "Logistic Regression",
    "Random Forest Classifier",
    "Decision Tree Classifier",
]

from model_metrics import show_calibration_curve

show_calibration_curve(
    model=pipelines_or_models[:2],
    X=X_test,
    y=y_test,
    model_title=model_titles[:2],
    text_wrap=50,
    bins=10,
    show_brier_score=True,
    subplots=True,
    linestyle_kwgs={"color": "black"},
)

Output

Calibration Curve Example 2: Overlay

This example also uses the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.

To build and evaluate our models, we use the model_tuner library [3]. Click here to view the corresponding codebase for this workflow. This example demonstrates how to overlay calibration curves from multiple classification models in a single plot. Overlaying allows for direct visual comparison of how predicted probabilities from each model align with actual outcomes on the same axes.

The diagonal dashed line represents perfect calibration, and Brier scores are included in the legend for each model, providing a quantitative measure of calibration accuracy.

By setting overlay=True, the function combines all model curves into one figure, making it easier to evaluate relative performance without splitting across subplots.

pipelines_or_models = [
    model_lr["model"].estimator,
    model_rf["model"].estimator,
    model_dt["model"].estimator,
]

# Model titles
model_titles = [
    "Logistic Regression",
    "Random Forest Classifier",
    "Decision Tree Classifier",
]


from model_metrics import show_calibration_curve

show_calibration_curve(
    model=pipelines_or_models,
    X=X_test,
    y=y_test,
    model_title=model_titles,
    bins=10,
    show_brier_score=True,
    overlay=True,
    linestyle_kwgs={"color": "black"},
)

Output

Calibration Curve Example 3: by Category

This example, too, uses the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.

To build and evaluate our models, we use the model_tuner library [3]. Click here to view the corresponding codebase for this workflow. This example shows how to visualize calibration curves separately for each category within a given feature (in this case, the race column of the joined test set), using a single Random Forest classifier. Each plot represents the calibration behavior of the model for a specific subgroup, allowing for detailed insight into how predicted probabilities align with actual outcomes across demographic categories.

This type of disaggregated visualization is especially useful for fairness analysis and subgroup performance auditing. By passing group_category=X_test_2["race"], the function automatically detects the unique values in that column and generates a separate calibration curve for each.

The dashed diagonal reference line represents perfect calibration. Brier scores are included in each plot to provide a quantitative measure of calibration performance within the group.

Note

When using group_category, both overlay and subplots must be set to False. This ensures each group receives its own standalone figure, avoiding conflicting layout behavior.

from model_metrics import show_calibration_curve

show_calibration_curve(
    model=model_rf["model"].estimator,
    X=X_test,
    y=y_test,
    model_title="Random Forest Classifier",
    bins=10,
    show_brier_score=True,
    linestyle_kwgs={"color": "black"},
    curve_kwgs={title: {"linewidth": 2} for title in model_titles},
    group_category=X_test_2["race"],
)

Output

Threshold Metric Curves

This section introduces a powerful utility for exploring how classification thresholds affect key performance metrics, including Precision, Recall, F1 Score, and Specificity. Rather than fixing a threshold (commonly at 0.5), this function allows users to visualize trade-offs across the full range of possible thresholds, making it especially useful when optimizing for use-case-specific goals such as maximizing recall or achieving a minimum precision.

Using a Random Forest Classifier trained on the Adult Income dataset [2], this tool helps users answer practical questions like:

  • What threshold achieves at least 85% precision?

  • Where does F1 score peak for this model?

  • How does specificity behave as the threshold increases?

The plot_threshold_metrics function supports optional threshold lookups via lookup_metric and lookup_value, which prints the closest threshold that meets your constraint. Plots can be customized with colors, gridlines, line styles, wrapped titles, and export options.
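
Conceptually, each curve comes from sweeping a decision threshold over the predicted probabilities and recomputing the metrics at every step. A minimal sketch of that sweep at a few thresholds, assuming model_rf, X_test, and y_test from the Adult Income examples (plot_threshold_metrics evaluates the full threshold range and handles the plotting):

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_prob_rf = model_rf["model"].estimator.predict_proba(X_test)[:, 1]

for t in (0.3, 0.5, 0.7):
    y_pred_t = (y_prob_rf >= t).astype(int)
    tn = np.sum((y_test == 0) & (y_pred_t == 0))
    fp = np.sum((y_test == 0) & (y_pred_t == 1))
    specificity = tn / (tn + fp)
    print(
        f"threshold={t:.2f}  "
        f"precision={precision_score(y_test, y_pred_t):.3f}  "
        f"recall={recall_score(y_test, y_pred_t):.3f}  "
        f"f1={f1_score(y_test, y_pred_t):.3f}  "
        f"specificity={specificity:.3f}"
    )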

plot_threshold_metrics(model=None, X_test=None, y_test=None, y_prob=None, title=None, text_wrap=None, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True, baseline_thresh=True, curve_kwgs=None, baseline_kwgs=None, threshold_kwgs=None, lookup_kwgs=None, save_plot=False, image_path_png=None, image_path_svg=None, lookup_metric=None, lookup_value=None, decimal_places=4, model_threshold=None)

Plot Precision, Recall, F1 Score, and Specificity as functions of the decision threshold.

This utility evaluates threshold-dependent classification metrics across the full range of thresholds. It supports highlighting a 0.5 baseline, an explicit model threshold, and a threshold located via a user-specified target metric value.

Parameters:
  • model (object, optional) – A trained classification estimator used to produce probabilities when y_prob is not provided. Must support predict_proba if used.

  • X_test (pd.DataFrame or np.ndarray, optional) – Feature matrix for evaluation. Required when model is supplied.

  • y_test (pd.Series or np.ndarray) – True binary labels. Required.

  • y_prob (array-like, optional) – Pre-computed predicted probabilities for the positive class. If provided, model and X_test are not required.

  • title (str, optional) – Custom title for the plot. If "", disables the title. If None, a default title is shown.

  • text_wrap (int, optional) – Maximum title width before wrapping. If None, no wrapping is applied.

  • figsize (tuple, optional) – Figure size (width, height) in inches. Defaults to (8, 6).

  • label_fontsize (int, optional) – Font size for axis labels and title. Defaults to 12.

  • tick_fontsize (int, optional) – Font size for tick labels. Defaults to 10.

  • gridlines (bool, optional) – Whether to show gridlines. Defaults to True.

  • baseline_thresh (bool, optional) – If True, adds a reference line at threshold = 0.5.

  • curve_kwgs (dict, optional) – Styling options applied to all metric curves (e.g., {"linestyle": "-", "linewidth": 1}).

  • baseline_kwgs (dict, optional) – Styling options for the baseline (0.5) threshold line (default: black dotted line).

  • threshold_kwgs (dict, optional) – Styling options for the model threshold line when model_threshold is provided (default: black dotted line).

  • lookup_kwgs (dict, optional) – Styling options for the lookup threshold line when lookup_metric/lookup_value are provided (default: gray dashed line).

  • save_plot (bool, optional) – Whether to save the figure to file.

  • image_path_png (str, optional) – File path to save PNG output (used when save_plot=True).

  • image_path_svg (str, optional) – File path to save SVG output (used when save_plot=True).

  • lookup_metric (str, optional) – Metric used to locate a threshold closest to lookup_value. One of "precision", "recall", "f1", or "specificity".

  • lookup_value (float, optional) – Target value for lookup_metric. Must be provided together with lookup_metric.

  • decimal_places (int, optional) – Number of decimal places for printed threshold output(s). Defaults to 4.

  • model_threshold (float, optional) – A model-specific threshold to highlight (vertical line). Useful if the model does not use 0.5.

Returns:

None. Displays (and optionally saves) the threshold metrics plot.

Return type:

None

Raises:

ValueError

  • If y_test is not provided.

  • If neither (model and X_test) nor y_prob is provided.

  • If only one of lookup_metric or lookup_value is provided.

Important

  • You can supply either model and X_test or pass y_prob directly.

  • When using y_prob, the function bypasses model inference and uses the provided probabilities.

Notes

  • Metric Curves:
    • Plots include Precision, Recall, F1 Score, and Specificity over threshold values.

    • Useful for analyzing how changing the threshold alters model behavior.

  • Threshold Lookup:
    • Set lookup_metric and lookup_value to find the closest threshold that meets your constraint.

    • Prints result to console and highlights the corresponding vertical line.

  • Styling Options:
    • Customize plot curves with curve_kwgs.

    • Adjust baseline style (e.g., at threshold = 0.5) via baseline_kwgs.

  • Three optional vertical guides are supported:
    1. Baseline at 0.5 (baseline_thresh=True),

    2. model_threshold (e.g., a tuned decision threshold),

    3. A threshold found by targeting lookup_metric at lookup_value.

  • Exporting:
    • Use save_plot=True with image_path_png and/or image_path_svg to save outputs.

  • Interactivity:
    • Ideal for presentations or dashboards where visualizing threshold sensitivity is crucial.

    • Particularly helpful for domains like healthcare, fraud detection, or content moderation, where the cost of false positives vs. false negatives must be carefully managed.

Threshold Curves Example 1: Threshold=0.5

This example demonstrates how to plot threshold-dependent classification metrics using a Random Forest Classifier trained on the Adult Income dataset [2].

The plot_threshold_metrics function visualizes how Precision, Recall, F1 Score, and Specificity change as the decision threshold varies. In this configuration, the baseline threshold line at 0.5 is enabled (baseline_thresh=True), and the line styling is customized via curve_kwgs. Font sizes and wrapping options are adjusted for improved clarity in presentation-ready plots.

from model_metrics import plot_threshold_metrics

plot_threshold_metrics(
    model=model_rf["model"].estimator,
    X_test=X_test,
    y_test=y_test,
    baseline_thresh=True,
    baseline_kwgs={
        "color": "purple",
        "linestyle": "--",
        "linewidth": 2,
    },
    curve_kwgs={
        "linestyle": "-",
        "linewidth": 2,
    },
    text_wrap=40,
)

Output

Threshold Curves Example 2: Targeted Metric Lookup

This example expands on threshold-based classification metric visualization using a targeted lookup scenario. Suppose a clinical stakeholder or domain expert has determined (based on prior research, cost-benefit considerations, or operational constraints) that a precision of approximately 0.879 is ideal for downstream decision-making (e.g., minimizing false positives in a healthcare setting).

The plot_threshold_metrics function accepts the optional arguments lookup_metric and lookup_value to help identify the threshold that best aligns with this target. When these are set, the function automatically locates and highlights the threshold that most closely achieves the desired metric value, offering transparency and guidance for threshold tuning.

from model_metrics import plot_threshold_metrics

plot_threshold_metrics(
    model=model_rf["model"].estimator,
    X_test=X_test,
    y_test=y_test,
    lookup_metric="precision",
    lookup_value=0.879,
    baseline_thresh=False,
    lookup_kwgs={
        "color": "red",
        "linestyle": "--",
        "linewidth": 2,
    },
    curve_kwgs={
        "linestyle": "-",
        "linewidth": 2,
    },
    text_wrap=40,
)

Output

Threshold Curves Example 2

In this example:

  • lookup_metric="precision" specifies that we are targeting the precision curve.

  • lookup_value=0.879 provides the desired value for that metric.

  • The function will search for the closest possible precision value along the threshold range and display a vertical line at that corresponding threshold.

  • The threshold value is printed to the console and included in the legend (e.g., Best Threshold: 0.6757).
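
Conceptually, the lookup reduces to finding the threshold whose metric value lies nearest the requested target. A minimal sketch for the precision case, assuming model_rf, X_test, and y_test as above (not necessarily the function's exact internals):

import numpy as np
from sklearn.metrics import precision_recall_curve

y_prob_rf = model_rf["model"].estimator.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_prob_rf)

# precision has one more element than thresholds; drop the final point
idx = np.argmin(np.abs(precision[:-1] - 0.879))
print(f"Closest threshold: {thresholds[idx]:.4f} (precision = {precision[idx]:.4f})")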

Threshold Curves Example 3: Model-Specific Threshold

In many production settings, a classifier is deployed with a tuned decision threshold different from the default 0.5 (e.g., to balance costs of false positives vs. false negatives). This example shows how to explicitly pass a model’s chosen threshold to be drawn as a vertical guide on the plot using model_threshold=.... You can do this whether you’re providing a model/X pair or pre-computed probabilities via y_prob. Below we show the latter.

# Get predicted probabilities for Random Forest model
y_prob_rf = model_rf["model"].estimator.predict_proba(X_test)[:, 1]

# Retrieve model thresholds
model_thresholds = {
    "Logistic Regression": next(iter(model_lr["model"].threshold.values())),
    "Decision Tree Classifier": next(iter(model_dt["model"].threshold.values())),
    "Random Forest Classifier": next(iter(model_rf["model"].threshold.values())),
}

from model_metrics import plot_threshold_metrics

# Example: Use precomputed probabilities but still highlight the model's tuned threshold.
plot_threshold_metrics(
    y_prob=y_prob_rf,      # precomputed probabilities for the positive class
    y_test=y_test,         # ground-truth labels
    baseline_thresh=False, # hide the default 0.5 guide
    model_threshold=model_thresholds["Random Forest Classifier"],
    threshold_kwgs={       # styling for the model-threshold vertical line
        "color": "blue",
        "linestyle": "--",
        "linewidth": 1,
    },
    curve_kwgs={           # styling for metric curves
        "linestyle": "-",
        "linewidth": 1.25,
    },
    text_wrap=40,
)

Output

Threshold Curves Example 3 (Model-Specific Threshold)

Note

  • model_threshold draws a labeled vertical line (e.g., Model Threshold: 0.6757), making it clear where the production decision point lies.

  • This is independent of baseline_thresh; you can enable both if you want to compare the tuned threshold vs. the default 0.5.

  • If you prefer to compute probabilities on-the-fly, pass the model and test features instead of y_prob:
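
A minimal sketch of that variant, reusing model_rf, X_test, y_test, and model_thresholds from the snippets above:

from model_metrics import plot_threshold_metrics

plot_threshold_metrics(
    model=model_rf["model"].estimator,  # probabilities computed on the fly
    X_test=X_test,
    y_test=y_test,
    baseline_thresh=False,
    model_threshold=model_thresholds["Random Forest Classifier"],
    threshold_kwgs={"color": "blue", "linestyle": "--", "linewidth": 1},
    curve_kwgs={"linestyle": "-", "linewidth": 1.25},
    text_wrap=40,
)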

Residual Diagnostics

Residual diagnostics are essential tools for evaluating regression model performance beyond standard metrics like R² or RMSE. By examining the patterns in residuals (the differences between observed and predicted values), we can identify violations of modeling assumptions, detect systematic errors, and uncover opportunities for model improvement.

The show_residual_diagnostics function provides comprehensive visualization of residual patterns across multiple dimensions:

  • Residuals vs Fitted Values: Assess homoscedasticity (constant variance) and identify non-linear patterns

  • Residuals vs Predictors: Examine whether specific features are associated with systematic prediction errors

  • Q-Q Plots: Evaluate whether residuals follow a normal distribution

  • Histogram of Residuals: Visualize the distribution shape and identify outliers

  • Scale-Location Plots: Detect heteroscedasticity (non-constant variance)

Dataset for Regression Residuals Examples:

The examples in this section use the diabetes dataset from scikit-learn, which was introduced in the Regression Models section. This dataset contains baseline medical measurements and a quantitative measure of disease progression, making it ideal for demonstrating residual analysis techniques across different regression approaches.

What Good Residuals Look Like:

  • Randomly scattered around zero with no systematic patterns

  • Constant spread across the range of fitted values (homoscedasticity)

  • Approximately normally distributed (for inference and prediction intervals)

  • No strong correlations with individual predictor variables

What Bad Residuals Reveal:

  • Funnel shapes (heteroscedasticity): Variance increases/decreases with predicted values, suggesting transformations may be needed

  • Curved patterns: Non-linear relationships that the model hasn’t captured

  • Clusters or groups: Systematic differences across subpopulations that may require interaction terms or stratified models

  • Heavy tails or skewness: Outliers or violations of normality assumptions

  • Patterns vs predictors: Missing interaction effects or non-linear relationships with specific features

More information on residual diagnostics and interpretation can be found in the residual diagnostics section.
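
As a quick orientation before the full API, residuals are simply observed minus predicted values, and a simple screen for heteroscedasticity is the Spearman correlation between their absolute values and the fitted values (one of the tests show_residual_diagnostics can run via heteroskedasticity_test="spearman"). A minimal sketch, assuming rf_model, X_test, and y_test from the Regression Models section:

import numpy as np
from scipy.stats import spearmanr

# Residuals: observed minus predicted
rf_pred = rf_model.predict(X_test)
residuals = y_test - rf_pred

# Spearman correlation between |residuals| and the fitted values;
# a small p-value suggests non-constant variance (heteroscedasticity)
rho, p_value = spearmanr(np.abs(residuals), rf_pred)
print(f"Spearman rho = {rho:.4f}, p-value = {p_value:.4f}")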

show_residual_diagnostics(model=None, X=None, y=None, y_pred=None, model_title=None, plot_type='all', figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True, save_plot=False, image_path_png=None, image_path_svg=None, show_outliers=False, n_outliers=3, suptitle=None, suptitle_y=0.995, text_wrap=None, point_kwgs=None, group_kwgs=None, line_kwgs=None, show_lowess=False, lowess_kwgs=None, group_category=None, show_centroids=False, centroid_type='clusters', n_clusters=None, centroid_kwgs=None, legend_loc='best', legend_kwgs=None, n_cols=None, n_rows=None, heteroskedasticity_test=None, decimal_places=4, show_plots=True, show_diagnostics_table=False, return_diagnostics=False, histogram_type='frequency', kmeans_rstate=42)
Parameters:
  • model (estimator or list of estimators, optional) – Trained regression model(s). If None, y_pred must be provided.

  • X (array-like, optional) – Feature matrix. Required if model is provided.

  • y (array-like) – True target values.

  • y_pred (array-like or list, optional) – Predicted values. Can be provided instead of model and X.

  • model_title (str or list[str], optional) – Custom name(s) for model(s). Defaults to "Model 1", "Model 2", etc.

  • plot_type (str or list, optional) – Which diagnostic plot(s) to display. Options: "all", "fitted", "qq", "scale_location", "leverage", "influence", "histogram", "predictors". Can pass a list for specific plots.

  • figsize (tuple, optional) – Figure size (width, height). Defaults vary by plot_type.

  • label_fontsize (int, optional) – Font size for axis labels and titles. Defaults to 12.

  • tick_fontsize (int, optional) – Font size for tick labels. Defaults to 10.

  • gridlines (bool, optional) – Whether to display grid lines. Defaults to True.

  • save_plot (bool, optional) – Whether to save the plot(s) to disk. Defaults to False.

  • image_path_png (str, optional) – Path to save PNG image.

  • image_path_svg (str, optional) – Path to save SVG image.

  • show_outliers (bool, optional) – Whether to label outlier points on plots. Defaults to False.

  • n_outliers (int, optional) – Number of most extreme outliers to label. Defaults to 3.

  • suptitle (str, optional) – Custom title for the overall figure. If None, uses default; if "", no suptitle is displayed.

  • suptitle_y (float, optional) – Vertical position of the figure suptitle (0-1 range). Defaults to 0.995.

  • text_wrap (int, optional) – Maximum width for wrapping titles.

  • point_kwgs (dict, optional) – Styling for scatter points (e.g., {'alpha': 0.6, 'color': 'blue'}).

  • group_kwgs (dict, optional) – Styling for group scatter points when group_category is provided. Can specify colors as a list for each group.

  • line_kwgs (dict, optional) – Styling for reference lines (e.g., {'color': 'red', 'linestyle': '--'}).

  • show_lowess (bool, optional) – Whether to show the LOWESS smoothing trend line on residual plots. Defaults to False.

  • lowess_kwgs (dict, optional) – Styling for LOWESS smoothing line (e.g., {'color': 'blue', 'linewidth': 2}).

  • group_category (str or array-like, optional) – Categorical variable for grouping observations. Can be a column name in X or an array matching y in length.

  • show_centroids (bool, optional) – Whether to plot centroids for groups or clusters. Defaults to False.

  • centroid_type (str, optional) – Type of centroids to display. Options: "clusters" (k-means clustering) or "groups" (category centroids). Defaults to "clusters".

  • n_clusters (int, optional) – Number of clusters for k-means clustering when centroid_type="clusters". Defaults to 3 if not specified.

  • centroid_kwgs (dict, optional) – Styling for centroid markers (e.g., {'marker': 'X', 's': 50, 'c': ['red', 'blue']}).

  • legend_loc (str, optional) – Location for the legend. Standard matplotlib locations like 'best', 'upper right', etc. Defaults to 'best'.

  • legend_kwgs (dict, optional) – Control legend display for groups, centroids, clusters, and heteroskedasticity tests. Use keys 'groups', 'centroids', 'clusters', 'het_tests' with boolean values.

  • n_cols (int, optional) – Number of columns for predictor plots layout. If None, uses automatic layout.

  • n_rows (int, optional) – Number of rows for predictor plots layout. If None, automatically calculated.

  • heteroskedasticity_test (str, optional) – Test for heteroskedasticity. Options: "breusch_pagan", "white", "goldfeld_quandt", "spearman", "all", or None. Defaults to None.

  • decimal_places (int, optional) – Number of decimal places for all numeric values in diagnostics. Defaults to 4.

  • show_plots (bool, optional) – Whether to display diagnostic plots. Defaults to True.

  • show_diagnostics_table (bool, optional) – Whether to print a formatted table of diagnostic statistics. Defaults to False.

  • return_diagnostics (bool, optional) – If True, return a dictionary containing diagnostic statistics. Defaults to False.

  • histogram_type (str, optional) – Type of histogram to display. Options: "frequency" (raw counts) or "density" (probability density with normal overlay). Defaults to "frequency".

  • kmeans_rstate (int, optional) – Random state for reproducibility in k-means clustering. Defaults to 42.

Returns:

None (displays plots) or dictionary of diagnostics if return_diagnostics=True.

Return type:

None or dict

Raises:

ValueError

  • If neither (model and X) nor y_pred is provided.

  • If plot_type is not recognized.

  • If both group_category and n_clusters are specified.

  • If heteroskedasticity_test is not a valid test type.

  • If histogram_type is not 'frequency' or 'density'.

  • If centroid_type is not 'clusters' or 'groups'.

  • If centroid_type='groups' without group_category specified.

  • If show_centroids=True and centroid_type='clusters' but X contains non-numeric columns (k-means clustering requires all numeric features).

Important

  • You can supply either model and X or pass y_pred directly.

  • When using y_pred, the function bypasses model predictions and uses the provided values for residual calculation.

  • Supports single-model or multiple-model workflows.

  • Cannot specify both group_category (user-defined groups) and n_clusters (automatic clustering) simultaneously.

  • When centroid_type='groups', group_category must be provided.

Notes

  • Diagnostic Plot Types:
    • "all": Creates a comprehensive 2×3 grid with fitted, Q-Q, scale-location, leverage, influence, and histogram plots

    • "fitted": Residuals vs fitted values (detects heteroscedasticity and non-linearity)

    • "qq": Normal Q-Q plot (assesses normality assumption)

    • "scale_location": Standardized residuals vs fitted (evaluates homoscedasticity)

    • "leverage": Residuals vs leverage with Cook’s distance contours (identifies influential points)

    • "influence": Influence plot with bubble sizes proportional to Cook’s distance

    • "histogram": Distribution of residuals with optional normal overlay

    • "predictors": Separate plots for residuals vs each predictor variable

  • Heteroskedasticity Testing:
    • Optional parameter heteroskedasticity_test performs formal statistical tests for non-constant variance

    • "breusch_pagan": Tests whether residual variance depends on predicted values

    • "white": More general test that doesn’t assume specific functional form

    • "goldfeld_quandt": Compares variance between two subsamples

    • "spearman": Spearman correlation between absolute residuals and fitted values

    • "all": Runs all available tests

    • Test results are displayed in plot legends and printed when show_diagnostics_table=True

    • If None (default), no tests are performed, only visual diagnostics are displayed

  • Grouped Analysis:
    • Use group_category to stratify residuals by categorical variables (e.g., sex, age group, treatment arm)

    • Use n_clusters for automatic k-means clustering to identify data-driven residual patterns

    • Set show_centroids=True to overlay group/cluster centroids on plots

    • centroid_type controls centroid display: "clusters" (default) for k-means or "groups" for category centroids

    • kmeans_rstate ensures reproducible clustering (default: 42)

    • Use legend_kwgs to control which legend entries appear (groups, centroids, clusters, heteroskedasticity tests)

  • Customization:
    • Control histogram display with histogram_type: "frequency" for raw counts or "density" for normalized distribution with normal overlay

    • Enable LOWESS smoothing with show_lowess=True to visualize trends in residual plots

    • Customize point, line, LOWESS, group, and centroid styling via respective *_kwgs parameters

    • Adjust subplot layout for predictor plots using n_cols and n_rows

    • Set suptitle to customize or suppress the overall figure title, and use suptitle_y to adjust vertical position

  • Output Options:
    • Set show_plots=False and show_diagnostics_table=True to print only the diagnostic statistics table

    • Use return_diagnostics=True to programmatically access diagnostic quantities for custom analyses

    • Combine show_plots=True and show_diagnostics_table=True for comprehensive visual and quantitative assessment

    • When group_category is a column in X, it is automatically excluded from predictor plots to avoid redundancy

Residual Diagnostics Example 1: All Residual Diagnostics Plots

Using the diabetes dataset, this first example demonstrates a complete residual diagnostic analysis for a single regression model using plot_type="all". This setting generates all available diagnostic visualizations in a single comprehensive display:

  1. Residuals vs Fitted Values: Detects non-linearity, heteroscedasticity, and outliers

  2. Q-Q Plot: Assesses normality of residuals

  3. Scale-Location Plot: Evaluates homoscedasticity (constant variance)

  4. Residuals vs Leverage: Identifies influential observations

  5. Influence Plot: Highlights influential observations via Cook's distance

  6. Histogram of Residuals: Shows the distribution shape

We evaluate a Random Forest model trained on the diabetes dataset. The n_clusters=3 parameter performs k-means clustering on the residuals to identify groups of observations with similar prediction error patterns. Setting show_centroids=True overlays cluster centers on the residual plots, styled with custom colors and markers via centroid_kwgs.

The kmeans_rstate=222 parameter controls the random seed for k-means clustering, ensuring reproducible cluster assignments across repeated runs. By default, kmeans_rstate is set to 42, making clustering deterministic unless explicitly changed. This is important because k-means uses random initialization; different seeds can produce slightly different cluster assignments, especially when clusters overlap or are of similar size. Setting a fixed seed ensures that diagnostic plots remain consistent for documentation, presentations, and collaborative analysis.
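
To illustrate the effect of the seed in isolation, here is a small standalone sketch; it assumes a scikit-learn-style KMeans, which is only an assumption about the underlying clustering implementation:

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for residuals (the library clusters the actual model residuals)
rng = np.random.default_rng(0)
residuals = rng.normal(size=(100, 1))

# Same random_state => identical cluster assignments on repeated fits
labels_a = KMeans(n_clusters=3, random_state=222, n_init=10).fit_predict(residuals)
labels_b = KMeans(n_clusters=3, random_state=222, n_init=10).fit_predict(residuals)
print((labels_a == labels_b).all())  # True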

To formally test for heteroscedasticity, we enable heteroskedasticity_test="breusch_pagan". This optional parameter runs the Breusch-Pagan test, which evaluates whether residual variance is systematically related to predicted values. Test results, including the test statistic, p-value, and interpretation, are printed to the console. A significant result (p < 0.05) indicates heteroscedasticity, suggesting that predictions may be more reliable for certain ranges of the response variable than others.
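
The test itself can be reproduced outside the plotting function; the sketch below uses statsmodels on synthetic data and is not necessarily how the library implements it internally:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Toy data where the error variance grows with the fitted values
rng = np.random.default_rng(0)
fitted = np.linspace(50, 250, 200)
residuals = rng.normal(scale=0.1 * fitted)

exog = sm.add_constant(fitted)  # test residual variance against fitted values
lm_stat, lm_pvalue, _, _ = het_breuschpagan(residuals, exog)
print(f"Breusch-Pagan LM statistic: {lm_stat:.2f}, p-value: {lm_pvalue:.4f}")
# A small p-value (< 0.05) indicates heteroscedasticity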

Additional customization options include:

  • n_cols=2: Arranges diagnostic plots in a two-column grid; useful for wide-format displays, although leaving the default layout in place is generally clearer when displaying all plots together.

  • histogram_type="density": Displays residuals as a density plot rather than raw counts

  • decimal_places=2: Controls precision of printed test statistics

  • tick_fontsize and label_fontsize: Adjust text sizing for readability

  • save_plot=True with image paths: Exports plots as PNG and SVG for reports

When return_diagnostics=True is set, the function also returns a diagnostics dictionary containing residuals, fitted values, standardized residuals, and leverage statistics. This allows programmatic access to diagnostic quantities for custom analyses or integration with resid_diagnostics_to_dataframe to convert results into a pandas DataFrame for further exploration.

from model_metrics import show_residual_diagnostics

# Pre-computed predictions from the trained Random Forest model
rf_pred = rf_model.predict(X_test)

show_residual_diagnostics(
    y_pred=rf_pred,
    model_title=["Random Forest"],
    X=X_test,
    y=y_test,
    n_clusters=3,
    n_cols=2,
    plot_type="all",
    show_centroids=True,
    centroid_kwgs={"c": ["red", "blue", "green"], "marker": "X", "s": 50},
    heteroskedasticity_test="breusch_pagan",
    decimal_places=2,
    histogram_type="density",
    kmeans_rstate=222,
)

Output

Residual Diagnostics Example 2: Single Plot with LOWESS Smoothing

Using the diabetes dataset, this example demonstrates two key capabilities for focused residual analysis:

  1. Selective plot generation: The plot_type parameter allows you to generate specific diagnostic plots rather than the full suite. Pass a single plot name as a string (e.g., "fitted") or a list of plot names for multiple specific plots (e.g., ["fitted", "qq", "histogram"]). This is useful when you need to examine particular model assumptions or create targeted visualizations for reports.

  2. LOWESS trend detection: Setting show_lowess=True adds a locally weighted scatterplot smoothing (LOWESS) curve to residual plots. This non-parametric smoothing line reveals systematic patterns or trends in the residuals that might not be obvious from the scatter alone. If model assumptions hold, the LOWESS line should be roughly horizontal at y=0. Pronounced curves or trends indicate potential violations of linearity or suggest that the model is systematically over- or under-predicting in certain regions.
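
To see what the smoother contributes in isolation, here is a minimal sketch using statsmodels' LOWESS on synthetic residuals; the library's internal smoother settings may differ:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic residuals with a hidden quadratic pattern
rng = np.random.default_rng(1)
fitted = np.linspace(50, 250, 150)
residuals = 0.002 * (fitted - 150) ** 2 - 10 + rng.normal(scale=15, size=fitted.size)

smoothed = lowess(residuals, fitted, frac=0.6)  # columns: sorted x, smoothed y
plt.scatter(fitted, residuals, alpha=0.6)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red", linewidth=2)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
# A curved LOWESS line like this signals unmodeled non-linearity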

We focus on the Scale-Location plot (plot_type="scale_location"), which is particularly useful for detecting heteroscedasticity: the violation of the constant variance assumption. This plot displays the square root of standardized residuals against fitted values, making it easier to spot changes in residual spread across the prediction range. The LOWESS smoothing line, styled in orange via lowess_kwgs, helps identify whether variance increases, decreases, or remains stable as predictions change.

The heteroskedasticity_test="breusch_pagan" parameter formally tests for heteroscedasticity. The Breusch-Pagan test evaluates whether residual variance is systematically related to the predictors or fitted values. Test results appear in the plot legend (if space permits) or can be displayed in a diagnostic table using show_diagnostics_table=True. A significant result (p < 0.05) provides statistical evidence of heteroscedasticity, which may require remedial measures such as variance-stabilizing transformations, weighted least squares, or robust standard errors.

from model_metrics import show_residual_diagnostics

show_residual_diagnostics(
    y_pred=rf_pred,
    model_title="Random Forest",
    X=X_test,
    y=y_test,
    plot_type="scale_location",
    point_kwgs={"alpha": 0.9, "color": "blue", "edgecolor": "black", "s": 50},
    show_lowess=True,
    lowess_kwgs={"color": "red", "linewidth": 2},
    heteroskedasticity_test="breusch_pagan",
    figsize=(8, 6),
    tick_fontsize=12,
    label_fontsize=14,
)

Output

Residual Diagnostics Example 3: Diagnostics Table Only

Using the diabetes dataset, this example demonstrates how to generate a comprehensive residual diagnostics summary table without displaying plots. By setting show_plots=False and show_diagnostics_table=True, the function outputs only a tabular summary of key diagnostic statistics and heteroscedasticity test results.

The diagnostics table includes:

  • Residual statistics: Mean, standard deviation, min, max, and quartiles

  • Standardized residual metrics: Useful for identifying outliers (\(|z| > 3\)); a quick manual version of this check is sketched just after this list

  • Heteroscedasticity test results: When heteroskedasticity_test is specified
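
As a rough manual version of the \(|z| > 3\) check (simplified; the library's standardized residuals may additionally account for leverage), reusing the predictions from the earlier examples:

import numpy as np

# Plain z-scores of the residuals; flag observations beyond |z| > 3
residuals = np.asarray(y_test) - np.asarray(rf_pred)
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
print(f"Potential outliers at positions: {np.where(np.abs(z) > 3)[0]}")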

In this example, we set heteroskedasticity_test="all" to run all available tests:

  • Breusch-Pagan: Tests whether residual variance depends on predicted values

  • White: A more general test that doesn’t assume a specific functional form

  • Goldfeld-Quandt: Compares variance between two subsamples

  • Spearman: Correlation between absolute residuals and fitted values

Each test returns a test statistic, p-value, and interpretation. The decimal_places=5 parameter ensures high precision in the printed output, which is useful for reporting results in research papers or technical documentation.

The return_diagnostics=True parameter returns a dictionary containing all diagnostic quantities (residuals, fitted values, standardized residuals, leverage, etc.) for programmatic access or conversion to a DataFrame using resid_diagnostics_to_dataframe.

Note: You can also display both the table and plots simultaneously by setting show_plots=True and show_diagnostics_table=True together. This provides a comprehensive view combining visual diagnostics with quantitative summaries, ideal for thorough model evaluation reports.

Additional parameters used:

  • plot_type="histogram": Specifies which plot type to generate (only relevant if show_plots=True)

  • n_clusters=3 and show_centroids=True: Configures k-means clustering (applied to returned diagnostics)

  • save_plot=True: Would save plots if show_plots=True

from model_metrics import show_residual_diagnostics

diagnostics = show_residual_diagnostics(
    y_pred=rf_pred,
    model_title=["Random Forest"],
    X=X_test_diabetes,
    y=y_test_diabetes,
    n_clusters=3,
    n_cols=2,
    save_plot=True,
    image_path_png=image_path_png,
    image_path_svg=image_path_svg,
    tick_fontsize=12,
    label_fontsize=14,
    plot_type="histogram",
    show_centroids=True,
    centroid_kwgs={"c": ["red", "blue", "green"], "marker": "X", "s": 50},
    heteroskedasticity_test="all",
    legend_loc="upper right",
    show_diagnostics_table=True,
    return_diagnostics=True,
    show_plots=False,
    decimal_places=5,
)

Output

============================================================
Residual Diagnostics: Random Forest
============================================================
Statistic                                     Value
------------------------------------------------------------
N Observations                                   89
N Predictors                                     10
------------------------------------------------------------
R-squared                                   0.44282
Adjusted R-squared                          0.37139
------------------------------------------------------------
RMSE                                       54.33241
MAE                                        44.05303
------------------------------------------------------------
Mean Residual                              -0.73056
Std Residual                               54.32750
Min Residual                             -118.34000
Max Residual                              153.37000
Jarque-Bera Test               p=0.76159 (Normal)
Durbin-Watson                               2.20903
------------------------------------------------------------
Mean Leverage                               0.12360
Max Leverage                                0.39654
Leverage Threshold (2p/n)                   0.22472
High Leverage Points                              7
------------------------------------------------------------
Heteroskedasticity Tests:
Breusch-Pagan                 p=0.03750 (Heteroskedastic)
White                         p=0.10104 (Homoskedastic)
Goldfeld-Quandt               p=0.05058 (Homoskedastic)
Spearman Correlation          p=0.09465 (Homoskedastic)
============================================================

Residual Diagnostics Example 4: Diagnostics to DataFrame

Building on the earlier examples, the diagnostics dictionary returned by show_residual_diagnostics() (with return_diagnostics=True) can be converted into a pandas DataFrame for programmatic analysis, reporting, or integration into automated pipelines. The resid_diagnostics_to_dataframe() helper function handles this conversion seamlessly, properly flattening nested structures like heteroskedasticity test results. Unlike the console table, which only displays p-values and interpretations, the DataFrame provides complete test results, including both the test statistics and p-values, which is useful for creating custom reports, academic papers, or detailed model documentation that requires full statistical disclosure.

from model_metrics import show_residual_diagnostics, resid_diagnostics_to_dataframe

# Generate the diagnostics dictionary (plots and console table suppressed)
diagnostics = show_residual_diagnostics(
    y_pred=rf_pred,
    model_title=["Random Forest"],
    X=X_test_diabetes,
    y=y_test_diabetes,
    n_clusters=3,
    heteroskedasticity_test="all",  # Run all four tests, as reflected in the output below
    show_diagnostics_table=False,  # Suppress console table
    show_plots=False,  # Suppress plots to focus on data extraction
    return_diagnostics=True,  # Return the dictionary for conversion below
    decimal_places=2,
    kmeans_rstate=222,
)

# Convert to DataFrame
df = resid_diagnostics_to_dataframe(diagnostics)
print(df)

This produces a clean DataFrame with all diagnostic statistics:

Output

Statistic                                          Value
model_name                                 Random Forest
n_observations                                        89
n_predictors                                          10
mean_residual                                  -0.730562
std_residual                                   54.327496
min_residual                                     -118.34
max_residual                                      153.37
mae                                            44.053034
rmse                                           54.332408
r2                                                  0.44
adj_r2                                              0.37
jarque_bera_stat                                  0.5447
jarque_bera_pval                                0.761588
durbin_watson                                   2.209026
max_leverage                                    0.396543
mean_leverage                                   0.123596
max_cooks_d                                     0.207521
leverage_threshold                              0.224719
high_leverage_count                                    7
influential_points_05                                  0
influential_points_10                                  0
hetero_breusch_pagan_stat                          19.22
hetero_breusch_pagan_pval                           0.04
hetero_breusch_pagan_heteroskedastic                True
hetero_white_stat                                  78.78
hetero_white_pval                                    0.1
hetero_white_heteroskedastic                       False
hetero_goldfeld_quandt_stat                         1.77
hetero_goldfeld_quandt_pval                         0.05
hetero_goldfeld_quandt_heteroskedastic             False
hetero_spearman_stat                                0.18
hetero_spearman_pval                                0.09
hetero_spearman_heteroskedastic                    False

Key Features:

  • Automatic flattening: Heteroskedasticity test results are expanded into separate rows (e.g., hetero_breusch_pagan_stat, hetero_breusch_pagan_pval)

  • Normality interpretation: Jarque-Bera results are split into statistic, p-value, and boolean normality indicator

  • Ready for export: DataFrame can be saved to CSV, Excel, or integrated into reporting workflows with df.to_csv('diagnostics.csv', index=False)

  • Programmatic access: Extract specific statistics with df.loc[df['Statistic'] == 'r2', 'Value'].iloc[0]

When multiple heteroskedasticity tests are requested using heteroskedasticity_test="all", the DataFrame will include rows for all four tests (Breusch-Pagan, White, Goldfeld-Quandt, and Spearman Correlation), each with their respective statistics, p-values, and heteroskedasticity indicators.
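
For example, to isolate just the heteroskedasticity rows from the DataFrame above (the column names follow the Statistic/Value layout shown in the output):

# Keep only the flattened heteroskedasticity test rows
hetero_rows = df[df["Statistic"].str.startswith("hetero_")]
print(hetero_rows)

# Export for reporting
hetero_rows.to_csv("heteroskedasticity_tests.csv", index=False)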

Residual Diagnostics Example 5: Grouped Analysis with Customization

Using the diabetes dataset, this example demonstrates the full power of grouped residual analysis, showcasing how stratification by categorical variables can reveal differential model performance across subpopulations. We use the Random Forest model, focusing on two key predictors (age and BMI) and stratifying by sex to examine whether prediction errors vary systematically between male and female patients.

Understanding the Residual Diagnostics Example 5 parameters

Core Data Parameters:

  • y_pred=rf_pred: Provides pre-computed predictions from the Random Forest model, bypassing the need to pass the model object itself

  • X=X_test_diab_copy[["age", "bmi", "sex_category"]]: Focuses the analysis on two specific predictors (age and BMI) plus the derived sex_category grouping column, rather than all features

  • y=y_test_diabetes: True target values for residual calculation

  • model_title="Random Forest": Custom display name for the model

Grouping and Visualization Parameters:

  • plot_type="predictors": Creates separate residual plots for each predictor variable (age and BMI), allowing examination of predictor-specific error patterns

  • group_category="sex_category": Stratifies all visualizations by sex, color-coding points by Male/Female to reveal group-specific patterns

  • centroid_type="groups": Instructs the function to compute centroids for each category in group_category (Male and Female) rather than using k-means clustering

  • show_centroids=True: Overlays the mean residual position for each sex on every plot, making systematic bias immediately visible

Styling and Aesthetics:

  • group_kwgs: Controls the appearance of scatter points for each group:

    • "color": ["#1f77b4", "#ff7f0e"]: Custom hex colors for Male (blue) and Female (orange)

    • "alpha": 0.8: Semi-transparency to reveal overlapping points

    • "s": 60: Point size for readability

    • "edgecolors": "black": Black borders around points for definition

  • centroid_kwgs: Customizes the centroid markers to stand out:

    • "c": ["red", "blue"]: Distinct colors for Male and Female centroids

    • "marker": "X": X-shaped markers easily distinguishable from data points

    • "s": 50: Marker size balancing visibility with clarity

Statistical Testing:

  • heteroskedasticity_test="all": Runs all four heteroskedasticity tests (Breusch-Pagan, White, Goldfeld-Quandt, Spearman) on each predictor, with results displayed in plot legends. This comprehensive testing approach provides convergent evidence about whether variance differs systematically across groups or predicted values.

Layout and Display:

  • figsize=(12, 8): Larger figure accommodates predictor subplots with clear legends

  • tick_fontsize=14, label_fontsize=16: Enhanced readability for presentation or publication

  • legend_loc="bottom": Places legends below plots to avoid obscuring data points, especially useful with heteroskedasticity test results

  • suptitle="": Suppresses the overall figure title for a cleaner, more professional appearance

Additional Parameters:

  • decimal_places=2: Rounds all statistical test results to two decimal places for concise display

  • kmeans_rstate=222: Sets random seed for reproducibility (relevant only if centroid_type="clusters" were used)

  • save_plot=True, image_path_png, image_path_svg: Exports high-quality figures for reports or publications

# Generate predictions from multiple models
linear_pred = linear_model.predict(X_test_diabetes)
rf_pred = rf_model.predict(X_test_diabetes)
ridge_pred = ridge_model.predict(X_test_diabetes)

# The 'sex' column is already categorical-like (coded as positive/negative values)
# Let's make it more interpretable
X_test_diab_copy = X_test_diabetes.copy()
X_test_diab_copy["sex_category"] = X_test_diab_copy["sex"].apply(
    lambda x: "Male" if x > 0 else "Female"
)

from model_metrics import show_residual_diagnostics

# Generate residual diagnostics stratified by sex
show_residual_diagnostics(
    y_pred=rf_pred,
    model_title="Random Forest",
    X=X_test_diab_copy[["age", "bmi", "sex_category"]],
    y=y_test_diabetes,
    plot_type="predictors",
    group_category="sex_category",
    show_centroids=True,
    centroid_type="groups",
    group_kwgs={
        "color": ["#1f77b4", "#ff7f0e"],
        "alpha": 0.8,
        "s": 60,
        "edgecolors": "black",
    },
    centroid_kwgs={"c": ["red", "blue"], "marker": "X", "s": 50},
    heteroskedasticity_test="all",
    figsize=(12, 8),
    tick_fontsize=14,
    label_fontsize=16,
    decimal_places=2,
    kmeans_rstate=222,
    suptitle="",
    legend_loc="bottom",
    save_plot=True,
    image_path_png=image_path_png,
    image_path_svg=image_path_svg,
)

Output

What the Residual Diagnostics Example 5 analysis reveals

By examining residuals stratified by sex across age and BMI, we can identify:

  1. Systematic Bias: Do centroids deviate from zero differently for males vs females? If the female centroid is consistently above zero while the male centroid is below, the model systematically under-predicts for females and over-predicts for males.

  2. Differential Variance: Do residuals spread more widely for one group? The heteroskedasticity tests quantify whether prediction uncertainty differs between sexes.

  3. Predictor-Specific Patterns: Does the relationship between residuals and a predictor (e.g., BMI) differ by sex? Diverging patterns suggest interaction effects that the model hasn’t captured.

  4. Fairness Assessment: In healthcare applications, differential error patterns by sex could indicate the need for sex-specific calibration or additional interaction terms in the model.

Interpreting Centroids

The centroids represent the average (x, y) position of residuals for each group on each predictor plot:

  • X-coordinate: Mean value of the predictor for that group

  • Y-coordinate: Mean residual for that group

A centroid with y ≠ 0 indicates systematic over-prediction (y < 0) or under-prediction (y > 0) for that group. Centroids at different vertical positions reveal bias, while different horizontal positions reflect predictor distribution differences between groups.
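
To make this concrete, the centroids can be reproduced by hand with a pandas groupby; this sketch reuses the objects prepared in Example 5:

import pandas as pd

# Mean predictor value and mean residual per group = the plotted centroid
df_resid = pd.DataFrame(
    {
        "bmi": X_test_diab_copy["bmi"],
        "residual": y_test_diabetes - rf_pred,
        "sex_category": X_test_diab_copy["sex_category"],
    }
)
print(df_resid.groupby("sex_category")[["bmi", "residual"]].mean())
# Each row gives the (x, y) centroid for that group on the BMI panel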

Heteroskedasticity Test Interpretation

Each test evaluates whether residual variance is constant:

  • Breusch-Pagan: Tests if variance depends on fitted values or predictors

  • White: General test not assuming specific functional form

  • Goldfeld-Quandt: Compares variance between high and low predictor values

  • Spearman: Correlation between absolute residuals and fitted values

Statistical significance (typically p < 0.05) indicates heteroscedasticity, suggesting transformations or weighted regression may be needed.

Residual Diagnostics Example 6: Multiple Models with Shared Axes

Using the diabetes dataset, this example demonstrates how to compare residual diagnostics across multiple models with standardized axis limits for direct visual comparison. By passing a list of predictions and model names, you can evaluate multiple models side-by-side using any diagnostic plot type.

Key features demonstrated:

Model Comparison Across Any Plot Type

While this example uses histograms (plot_type="histogram"), the same approach works for all diagnostic plot types:

  • Residuals vs Fitted (plot_type="fitted"): Compare linearity assumptions

  • Q-Q Plots (plot_type="qq"): Compare normality of residuals

  • Scale-Location (plot_type="scale_location"): Compare homoscedasticity

  • Leverage Plots (plot_type="leverage"): Compare influential observations

  • All plots (plot_type="all"): Generate all 6 diagnostic plots for each model

Simply change the plot_type parameter while keeping the same multi-model structure (e.g., y_pred=[ridge_pred, rf_pred]) to create comprehensive cross-model comparisons for any diagnostic.
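
For instance, a quick Q-Q comparison of the same two models might look like the following sketch, reusing the predictions and the sex_category feature copy prepared in Example 5 (the full histogram example appears at the end of this section):

from model_metrics import show_residual_diagnostics

# Same two-model structure, different diagnostic: compare normality of residuals
show_residual_diagnostics(
    y_pred=[ridge_pred, rf_pred],
    model_title=["Ridge Regression", "Random Forest"],
    X=X_test_diab_copy[["age", "bmi", "sex_category"]],
    y=y_test_diabetes,
    plot_type="qq",
    n_cols=2,
)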

Shared Axis Limits
  • xlim=(-175, 175): Standardizes the x-axis (residual values) across both models

  • ylim=(0, 10): Standardizes the y-axis (frequency counts) across both models

This ensures that visual differences in residual distributions reflect actual model performance rather than different axis scales.

Group-Based Analysis

The group_category="sex_category" parameter colors points by sex, with custom styling via group_kwgs. When combined with show_centroids=True and centroid_type="groups", group-specific centroids are displayed to reveal whether residual patterns differ across demographic groups.

Legend Positioning

Setting legend_loc="bottom" places legends below the x-axis with proper spacing. The function automatically adds vertical space to accommodate bottom legends when using default figure sizes.

Comprehensive Heteroscedasticity Testing

heteroskedasticity_test="all" runs all four available tests (Breusch-Pagan, White, Goldfeld-Quandt, Spearman) and displays results in the legend. This helps identify whether residual variance is constant across fitted values.

Title and Layout Customization
  • suptitle="": Suppresses the overall figure title for a cleaner look

  • text_wrap=35: Wraps subplot titles at 35 characters

  • n_cols=2: Arranges subplots in 2 columns for side-by-side comparison

When to Use Multi-Model Comparison

Multi-model comparison is particularly valuable when:

  • Comparing different algorithms (e.g., linear vs tree-based models)

  • Evaluating hyperparameter tuning results (e.g., different regularization strengths)

  • Assessing feature engineering impact (e.g., with vs without transformations)

  • Creating model selection documentation for reports or publications

When to Use Shared Axes

Shared axis limits (xlim and ylim) are recommended when models have similar scales. If models produce residuals on very different scales, omit these parameters to let each subplot use optimal ranges.
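
If you are unsure which limits to use, one option is to derive them from the pooled residuals before plotting; the short sketch below is a convenience calculation, not part of the library:

import numpy as np

# Pool residuals from both models and round the largest magnitude up to a
# multiple of 25 for a symmetric, tidy x-range
all_resid = np.concatenate([y_test_diabetes - ridge_pred, y_test_diabetes - rf_pred])
limit = int(np.ceil(np.abs(all_resid).max() / 25) * 25)
print(f"Suggested xlim: ({-limit}, {limit})")  # the example below uses (-175, 175)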

from model_metrics import show_residual_diagnostics

show_residual_diagnostics(
    y_pred=[ridge_pred, rf_pred],
    model_title=["Ridge Regression", "Random Forest"],
    X=X_test_diab_copy[["age", "bmi", "sex_category"]],
    y=y_test_diabetes,
    plot_type="histogram",  # Try "all", "fitted", "qq", etc.
    show_centroids=True,
    centroid_type="groups",
    group_category="sex_category",
    centroid_kwgs={"c": ["red", "blue"], "marker": "X", "s": 50},
    group_kwgs={
        "color": ["#1f77b4", "#ff7f0e"],  # Custom hex colors
        "alpha": 0.8,
        "s": 60,
        "edgecolors": "black",
    },
    heteroskedasticity_test="all",
    suptitle="",
    n_cols=2,
    text_wrap=35,
    legend_loc="bottom",
    xlim=(-175, 175),
    ylim=(0, 10),
)

Output

Residual Diagnostics Example - Model Comparison with Shared Axes