Model Performance Summaries
Summarizes model performance metrics for classification and regression models.
- summarize_model_performance(model, X, y, model_type='classification', model_threshold=None, model_title=None, custom_threshold=None, score=None, return_df=False, overall_only=False, decimal_places=3)
- Parameters:
model (object or list) – A trained model or a list of trained models.
X (pd.DataFrame) – Feature matrix used for evaluation.
y (pd.Series or np.array) – Target variable.
model_type (str, optional) – Specifies whether the model is for classification or regression. Must be either "classification" or "regression". Defaults to "classification".
model_threshold (dict, optional) – Threshold values for classification models. If provided, this dictionary specifies thresholds per model. Defaults to None.
model_title (str or list, optional) – Custom model names for display. If None, names are inferred from the models. Defaults to None.
custom_threshold (float, optional) – A fixed threshold for classification, overriding model_threshold. If set, the “Model Threshold” row is excluded. Defaults to None.
score (str, optional) – A custom scoring metric for classification models. Defaults to None.
return_df (bool, optional) – If True, returns a DataFrame instead of printing results. Defaults to False.
overall_only (bool, optional) – If True, returns only the “Overall Metrics” row, removing coefficient-related columns for regression models. Defaults to False.
decimal_places (int, optional) – Number of decimal places to round numerical metrics. Defaults to 3.
- Returns:
A DataFrame containing model performance metrics if return_df=True. Otherwise, the metrics are printed in a formatted table.
- Return type:
pd.DataFrame or None
- Raises:
ValueError – If model_type="classification" and overall_only=True.
ValueError – If model_type is not "classification" or "regression".
Notes
- Classification Models:
Computes precision, recall, specificity, AUC ROC, F1-score, Brier score, and other key metrics.
Requires models supporting predict_proba or decision_function.
If custom_threshold is set, it overrides model_threshold.
- Regression Models:
Computes MAE, MAPE, MSE, RMSE, Explained Variance, and R² Score.
Uses statsmodels.OLS to extract coefficients and p-values (see the sketch following these notes).
If overall_only=True, the DataFrame retains only overall performance metrics.
- All Models:
decimal_places controls how many decimal places are displayed in the results.
If return_df=False, the function prints the results in a formatted, readable table instead of returning a DataFrame.
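The regression notes above state that coefficients and p-values come from statsmodels.OLS. Here is a minimal sketch of that extraction outside the library, assuming a feature DataFrame X_train and target y_train such as the diabetes split used in the regression examples below:
```python
import statsmodels.api as sm

# Add the intercept ("const") term that appears in the coefficient tables
X_ols = sm.add_constant(X_train)
ols_fit = sm.OLS(y_train, X_ols).fit()

print(ols_fit.params.round(3))    # per-feature coefficients, including const
print(ols_fit.pvalues.round(3))   # per-feature p-values
```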
The summarize_model_performance function provides a structured evaluation of classification and regression models, generating key performance metrics. For classification models, it computes precision, recall, specificity, F1-score, and AUC ROC. For regression models, it extracts coefficients and evaluates error metrics like MSE, RMSE, and R². The function allows specifying custom thresholds, metric rounding, and formatted display options.
Below are two examples demonstrating how to evaluate multiple models using summarize_model_performance. The function calculates and presents metrics for classification and regression models.
Binary Classification Models
This section introduces binary classification using two widely used machine learning models: Logistic Regression and Random Forest Classifier.
These examples demonstrate how to prepare and train models on a synthetic dataset, setting the stage for evaluating their performance with model_metrics in subsequent sections.
Both models use a default classification threshold of 0.5, where predictions are
classified as positive (1) if the predicted probability exceeds 0.5, and negative (0)
otherwise.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Generate a synthetic dataset
X, y = make_classification(
n_samples=1000,
n_features=10,
random_state=42,
)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
)
# Train models
model1 = LogisticRegression().fit(X_train, y_train)
model2 = RandomForestClassifier().fit(X_train, y_train)
model_title = ["Logistic Regression", "Random Forest"]
Binary Classification Example 1
from model_metrics import summarize_model_performance
model_performance = summarize_model_performance(
model=[model1, model2],
model_title=model_title,
X=X_test,
y=y_test,
model_type="classification",
return_df=True,
)
model_performance
Output
Metrics | Logistic Regression | Random Forest |
---|---|---|
Precision/PPV | 0.867 | 0.912 |
Average Precision | 0.937 | 0.966 |
Sensitivity/Recall | 0.82 | 0.838 |
Specificity | 0.843 | 0.899 |
F1-Score | 0.843 | 0.873 |
AUC ROC | 0.913 | 0.95 |
Brier Score | 0.118 | 0.086 |
Model Threshold | 0.5 | 0.5 |
Binary Classification Example 2
In this example, we revisit binary classification with the same two models—Logistic Regression and Random Forest—but adjust the classification threshold (the custom_threshold argument in this case) from the default 0.5 to 0.2. This change allows us to explore how lowering the threshold impacts model performance, potentially increasing sensitivity (recall) by classifying more instances as positive (1) at the expense of precision.
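Mechanically, custom_threshold=0.2 corresponds to thresholding the predicted probabilities yourself. A minimal sketch of that idea, assuming the model1, X_test, and y_test objects defined above (the library's exact boundary convention may differ):
```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Probability of the positive class for each test sample
proba = model1.predict_proba(X_test)[:, 1]

for threshold in (0.5, 0.2):
    # The exact boundary convention (> vs >=) is an assumption here
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, preds):.3f}, "
          f"recall={recall_score(y_test, preds):.3f}")
```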
from model_metrics import summarize_model_performance
model_performance = summarize_model_performance(
model=[model1, model2],
model_title=model_title,
X=X_test,
y=y_test,
model_type="classification",
return_df=True,
custom_threshold=0.2,
)
model_performance
Output
Metrics | Logistic Regression | Random Forest |
---|---|---|
Precision/PPV | 0.803 | 0.831 |
Average Precision | 0.937 | 0.966 |
Sensitivity/Recall | 0.919 | 0.928 |
Specificity | 0.719 | 0.764 |
F1-Score | 0.857 | 0.877 |
AUC ROC | 0.913 | 0.949 |
Brier Score | 0.118 | 0.085 |
Model Threshold | 0.2 | 0.2 |
Regression Models
In this section, we load the diabetes dataset [1] from scikit-learn, which includes features like age and BMI, along with a target variable representing disease progression. The data is then split with train_test_split into training and testing sets using an 80/20 ratio to facilitate model assessment. We train a Linear Regression model on unscaled data for a straightforward baseline, followed by a Random Forest Regressor with 100 trees, also on unscaled data, to introduce a more complex approach. Additionally, we train a Ridge Regression model using a Pipeline that scales the features with StandardScaler before fitting, incorporating regularization. These steps prepare the models for subsequent evaluation and comparison using tools provided by the model_metrics library.
Models used in these regression examples:
Linear Regression: A foundational model trained on unscaled data, simple yet effective for baseline evaluation.
Ridge Regression: A regularized model with a Pipeline for scaling, well suited for testing stability and overfitting.
Random Forest Regressor: An ensemble of 100 trees on unscaled data, offering complexity for comparative analysis.
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
# Load dataset
diabetes = load_diabetes(as_frame=True)["frame"]
X = diabetes.drop(columns=["target"])
y = diabetes["target"]
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
)
# Train Linear Regression (on unscaled data)
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Train Random Forest Regressor (on unscaled data)
rf_model = RandomForestRegressor(
n_estimators=100,
random_state=42,
)
rf_model.fit(X_train, y_train)
# Train Ridge Regression (on scaled data)
ridge_model = Pipeline(
[
("scaler", StandardScaler()),
("estimator", Ridge(alpha=1.0)),
]
)
ridge_model.fit(X_train, y_train)
Regression Example 1
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model],
model_title=["Linear Regression", "Ridge Regression"],
X=X_test,
y=y_test,
model_type="regression",
return_df=True,
)
regression_metrics
The output below presents a detailed comparison of the performance and coefficients for two regression models—Linear Regression and Ridge Regression—trained on the diabetes dataset. It includes overall metrics such as Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Explained Variance, and R² Score for each model, showing their predictive accuracy. Additionally, it lists the coefficients for each feature (e.g., age, bmi, s1–s6) in both models, highlighting how each variable contributes to the prediction. This output serves as a foundation for evaluating and comparing the models’ effectiveness.
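For reference, the overall error metrics in this table can be cross-checked directly with scikit-learn. A minimal sketch, assuming the linear_model, X_test, and y_test objects defined above (whether MAPE is reported as a fraction or a percentage is an assumption here):
```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    explained_variance_score,
    r2_score,
)

y_pred = linear_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
# scikit-learn returns MAPE as a fraction; the table appears to report a percentage
mape = mean_absolute_percentage_error(y_test, y_pred) * 100
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
expl_var = explained_variance_score(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE={mae:.3f}, MAPE={mape:.1f}, MSE={mse:.3f}, RMSE={rmse:.3f}, "
      f"Expl. Var.={expl_var:.3f}, R^2={r2:.3f}")
```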
Output
Model | Metric | Variable | Coefficient | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 Score |
---|---|---|---|---|---|---|---|---|---|
Linear Regression | Overall Metrics | | | 42.794 | 37.5 | 2900.194 | 53.853 | 0.455 | 0.453 |
Linear Regression | Coefficient | const | 151.346 | ||||||
Linear Regression | Coefficient | age | 37.904 | ||||||
Linear Regression | Coefficient | sex | -241.964 | ||||||
Linear Regression | Coefficient | bmi | 542.429 | ||||||
Linear Regression | Coefficient | bp | 347.704 | ||||||
Linear Regression | Coefficient | s1 | -931.489 | ||||||
Linear Regression | Coefficient | s2 | 518.062 | ||||||
Linear Regression | Coefficient | s3 | 163.42 | ||||||
Linear Regression | Coefficient | s4 | 275.318 | ||||||
Linear Regression | Coefficient | s5 | 736.199 | ||||||
Linear Regression | Coefficient | s6 | 48.671 | ||||||
Ridge Regression | Overall Metrics | | | 42.812 | 37.448 | 2892.015 | 53.777 | 0.457 | 0.454 |
Ridge Regression | Coefficient | const | 153.737 | ||||||
Ridge Regression | Coefficient | age | 1.807 | ||||||
Ridge Regression | Coefficient | sex | -11.448 | ||||||
Ridge Regression | Coefficient | bmi | 25.733 | ||||||
Ridge Regression | Coefficient | bp | 16.734 | ||||||
Ridge Regression | Coefficient | s1 | -34.672 | ||||||
Ridge Regression | Coefficient | s2 | 17.053 | ||||||
Ridge Regression | Coefficient | s3 | 3.37 | ||||||
Ridge Regression | Coefficient | s4 | 11.764 | ||||||
Ridge Regression | Coefficient | s5 | 31.378 | ||||||
Ridge Regression | Coefficient | s6 | 2.458 |
Regression Example 2
In this Regression Example 2, we extend the analysis by introducing a Random Forest
Regressor alongside Linear Regression and Ridge Regression to demonstrate how a
model with feature importances, rather than coefficients, impacts evaluation outcomes.
The code uses the summarize_model_performance
function from model_metrics
to
assess all three models on the diabetes dataset’s test set, ensuring the Random Forest’s
feature importance-based predictions are reflected in the results while preserving
the coefficient-based results of the other models, as shown in the subsequent table.
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model, rf_model],
model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
X=X_test,
y=y_test,
model_type="regression",
return_df=True,
)
regression_metrics
Output
Model | Metric | Variable | Coefficient | Feat. Imp. | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 Score |
---|---|---|---|---|---|---|---|---|---|---|
Linear Regression | Overall Metrics | | | | 42.794 | 37.5 | 2900.194 | 53.853 | 0.455 | 0.453 |
Linear Regression | Coefficient | const | 151.346 | |||||||
Linear Regression | Coefficient | age | 37.904 | |||||||
Linear Regression | Coefficient | sex | -241.964 | |||||||
Linear Regression | Coefficient | bmi | 542.429 | |||||||
Linear Regression | Coefficient | bp | 347.704 | |||||||
Linear Regression | Coefficient | s1 | -931.489 | |||||||
Linear Regression | Coefficient | s2 | 518.062 | |||||||
Linear Regression | Coefficient | s3 | 163.42 | |||||||
Linear Regression | Coefficient | s4 | 275.318 | |||||||
Linear Regression | Coefficient | s5 | 736.199 | |||||||
Linear Regression | Coefficient | s6 | 48.671 | |||||||
Ridge Regression | Overall Metrics | | | | 42.812 | 37.448 | 2892.015 | 53.777 | 0.457 | 0.454 |
Ridge Regression | Coefficient | const | 153.737 | |||||||
Ridge Regression | Coefficient | age | 1.807 | |||||||
Ridge Regression | Coefficient | sex | -11.448 | |||||||
Ridge Regression | Coefficient | bmi | 25.733 | |||||||
Ridge Regression | Coefficient | bp | 16.734 | |||||||
Ridge Regression | Coefficient | s1 | -34.672 | |||||||
Ridge Regression | Coefficient | s2 | 17.053 | |||||||
Ridge Regression | Coefficient | s3 | 3.37 | |||||||
Ridge Regression | Coefficient | s4 | 11.764 | |||||||
Ridge Regression | Coefficient | s5 | 31.378 | |||||||
Ridge Regression | Coefficient | s6 | 2.458 | |||||||
Random Forest | Overall Metrics | | | | 44.053 | 40.005 | 2952.011 | 54.332 | 0.443 | 0.443 |
Random Forest | Feat. Imp. | age | | 0.059 | | | | | | |
Random Forest | Feat. Imp. | sex | | 0.01 | | | | | | |
Random Forest | Feat. Imp. | bmi | | 0.355 | | | | | | |
Random Forest | Feat. Imp. | bp | | 0.088 | | | | | | |
Random Forest | Feat. Imp. | s1 | | 0.053 | | | | | | |
Random Forest | Feat. Imp. | s2 | | 0.057 | | | | | | |
Random Forest | Feat. Imp. | s3 | | 0.051 | | | | | | |
Random Forest | Feat. Imp. | s4 | | 0.024 | | | | | | |
Random Forest | Feat. Imp. | s5 | | 0.231 | | | | | | |
Random Forest | Feat. Imp. | s6 | | 0.071 | | | | | | |
Regression Example 3
In some scenarios, you may want to simplify the output by excluding variables,
coefficients, and feature importances from the model results. This example
demonstrates how to achieve that by setting overall_only=True
in the
summarize_model_performance
function, producing a concise table that
focuses on key metrics: model name, Mean Absolute Error (MAE),
Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE),
Root Mean Squared Error (RMSE), Explained Variance, and R² Score.
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model, rf_model],
model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
X=X_test,
y=y_test,
model_type="regression",
overall_only=True,
return_df=True,
)
regression_metrics
Output
Model | Metric | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 Score |
---|---|---|---|---|---|---|---|
Linear Regression | Overall Metrics | 42.794 | 37.5 | 2900.194 | 53.853 | 0.455 | 0.453 |
Ridge Regression | Overall Metrics | 42.812 | 37.448 | 2892.015 | 53.777 | 0.457 | 0.454 |
Random Forest | Overall Metrics | 44.053 | 40.005 | 2952.011 | 54.332 | 0.443 | 0.443 |
Lift Charts
This section illustrates how to assess and compare the ranking effectiveness of classification models using Lift Charts, a valuable tool for evaluating how well a model prioritizes positive instances relative to random chance. Leveraging the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset introduced in the Binary Classification Models section, we plot Lift curves to visualize their relative ability to surface high-value (positive) cases at the top of the prediction list.
A Lift Chart plots the ratio of actual positives identified by the model compared to what would be expected by random selection, across increasingly larger proportions of the sample sorted by predicted probability. The baseline (Lift = 1) represents random chance; curves that rise above this line demonstrate the model’s ability to “lift” positive outcomes toward the top ranks. This makes Lift Charts especially useful in applications like marketing, fraud detection, and risk stratification—where targeting the top segment of predictions can yield outsized value.
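To make the definition concrete, here is a minimal sketch of how cumulative lift can be computed by hand with numpy, assuming the model1, X_test, and y_test objects from the classification examples (the plotting function computes this internally and may bin differently):
```python
import numpy as np

proba = model1.predict_proba(X_test)[:, 1]

# Rank samples from highest to lowest predicted probability
order = np.argsort(proba)[::-1]
y_sorted = np.asarray(y_test)[order]

n = len(y_sorted)
baseline_rate = y_sorted.mean()                      # positive rate under random selection

for q in (0.1, 0.25, 0.5, 1.0):
    k = int(n * q)                                   # top q fraction of the sample
    lift = y_sorted[:k].mean() / baseline_rate       # observed positive rate / baseline rate
    print(f"Top {q:.0%}: lift = {lift:.2f}")
```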
The show_lift_chart
function enables flexible creation of Lift Charts for one or more
models. It supports single-plot overlays, grid layouts, and detailed customization of
labels, titles, and styling. Designed for both exploratory analysis and stakeholder
presentation, this utility helps users better understand model ranking performance
across the population.
- show_lift_chart(model, X, y, xlabel='Percentage of Sample', ylabel='Lift', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, grid=False, n_rows=None, n_cols=2, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True)
- Parameters:
model (object or list[object]) – A trained model or a list of models. Each must implement predict_proba to estimate class probabilities.
X (pd.DataFrame or np.ndarray) – Feature matrix used to generate predictions.
y (pd.Series or np.ndarray) – True binary labels corresponding to the input samples.
xlabel (str, optional) – Label for the x-axis. Defaults to "Percentage of Sample".
ylabel (str, optional) – Label for the y-axis. Defaults to "Lift".
model_title (str or list[str], optional) – Custom display names for the models. Can be a string or list of strings.
overlay (bool, optional) – If True, overlays all model lift curves into a single plot. Defaults to False.
title (str, optional) – Title for the plot or grid. Set to "" to suppress the title. Defaults to None.
save_plot (bool, optional) – Whether to save the chart(s) to disk. Defaults to False.
image_path_png (str, optional) – Output path for saving PNG image(s).
image_path_svg (str, optional) – Output path for saving SVG image(s).
text_wrap (int, optional) – Maximum number of characters before wrapping titles. If None, no wrapping is applied.
curve_kwgs (dict[str, dict] or list[dict], optional) – Dictionary or list of dictionaries for customizing the lift curve(s) (e.g., color, linewidth).
linestyle_kwgs (dict, optional) – Styling for the baseline (random lift) reference line. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.
grid (bool, optional) – Whether to show each model in a subplot grid. Cannot be combined with overlay=True.
n_rows (int, optional) – Number of rows in the grid layout. If None, automatically inferred.
n_cols (int, optional) – Number of columns in the grid layout. Defaults to 2.
figsize (tuple[int, int], optional) – Tuple specifying the size of the figure in inches. Defaults to (8, 6).
label_fontsize (int, optional) – Font size for x/y-axis labels and titles. Defaults to 12.
tick_fontsize (int, optional) – Font size for tick marks and legend text. Defaults to 10.
gridlines (bool, optional) – Whether to display gridlines in plots. Defaults to True.
- Returns:
None. Displays or saves lift charts for the specified classification models.
- Return type:
None
- Raises:
If overlay=True and grid=True are both set.
Notes
- What is a Lift Chart?
Lift quantifies how much better a model is at identifying positive cases compared to random selection.
The x-axis represents the proportion of the population (from highest to lowest predicted probability).
The y-axis shows the cumulative lift, calculated as the ratio of observed positives to expected positives under random selection.
- Interpreting Lift Curves:
A higher and steeper curve indicates a stronger model.
The horizontal dashed line at y = 1 is the baseline for random performance.
Curves that drop sharply or flatten may indicate poor ranking ability.
- Layout Options:
Use overlay=True to visualize all models on a single axis.
Use grid=True for a side-by-side layout of lift charts.
If neither is set, each model gets its own full-sized chart.
- Customization:
Customize the appearance of each model’s curve using curve_kwgs.
Modify the baseline reference line with linestyle_kwgs.
Control title wrapping and font sizes via text_wrap, label_fontsize, and tick_fontsize.
- Saving Plots:
If save_plot=True, figures are saved as <model_title>_lift.png/svg or overlay_lift.png/svg.
Lift Chart Example 1 (Grid Layout)
In this first Lift Chart example, we evaluate and compare the ranking performance
of two classification models—Logistic Regression and Random Forest Classifier—trained
on the synthetic dataset from the Binary Classification Models section. The chart displays Lift curves for both models in a
two-column grid layout (n_cols=2, n_rows=1
), enabling side-by-side comparison
of how effectively each model prioritizes positive cases.
Each plot shows the model’s Lift across increasing portions of the test set, with a grey dashed line at Lift = 1 indicating the baseline (random performance). Curves above this line reflect the model’s ability to identify more positives than would be expected by chance. The Random Forest typically produces a steeper initial lift, demonstrating greater concentration of positive cases near the top-ranked predictions.
The show_lift_chart function allows for rich customization, including plot dimensions, axis font sizes, and curve styling. In this example, we set the line widths and colors for both models; the plots can also be saved in PNG and SVG formats for further reporting or documentation.
from model_metrics import show_lift_chart
show_lift_chart(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "grey", "linestyle": "--"},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
grid=True,
)
Output
Lift Chart Example 2 (Overlay)
This example overlays Lift curves from two classification models—Logistic Regression and Random Forest Classifier—on a single plot for direct visual comparison. Both models were trained on the same synthetic dataset from the Binary Classification Models section, and their lift performance is evaluated on the shared test set.
The Lift curve shows how many more positive outcomes are captured by the model at each quantile compared to a random baseline. A horizontal dashed black line at Lift = 1 represents random selection; curves above this line indicate effective ranking of positive cases. Overlaying curves makes it easier to assess which model better concentrates true positives near the top of the prediction list.
Using the overlay=True
option, the show_lift_chart
function generates a clean,
unified plot. Each curve is styled with linewidth=2
for clarity, and all axis
elements and tick marks are sized for presentation-quality output. This layout
is particularly helpful for slide decks, performance reports, or model selection
discussions.
from model_metrics import show_lift_chart
show_lift_chart(
model=[model1, model2],
X=X_test,
y=y_test,
overlay=True,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "black", "linestyle": "--", "linewidth": 2},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
)
Gain Charts
This section explores how to evaluate the cumulative performance of classification models in identifying positive outcomes using Gain Charts. These charts are especially effective at showing the model’s ability to concentrate the correct (positive) predictions in the top-ranked portion of the dataset. Using the same Logistic Regression and Random Forest Classifier models trained on the synthetic dataset introduced in the Binary Classification Models section, we demonstrate how to plot and compare Gain Curves across models.
A Gain Chart shows the cumulative percentage of actual positive cases captured as we move through the population sorted by predicted probability. Unlike the Lift Chart, which displays the ratio of model performance over baseline, the Gain Chart directly shows the percentage of positives captured—providing a more intuitive sense of how effective a model is at identifying positives early in the ranked list.
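For intuition, here is a minimal sketch of the cumulative gain calculation with numpy, assuming model1, X_test, and y_test from the classification examples (the library may bin the sample differently):
```python
import numpy as np

proba = model1.predict_proba(X_test)[:, 1]
order = np.argsort(proba)[::-1]                      # highest predicted probability first
y_sorted = np.asarray(y_test)[order]

n = len(y_sorted)
total_positives = y_sorted.sum()

for q in (0.1, 0.25, 0.5):
    k = int(n * q)
    gain = y_sorted[:k].sum() / total_positives      # share of all positives captured in the top q
    print(f"Top {q:.0%} of the sample captures {gain:.1%} of all positives")
```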
The show_gain_chart
function supports single or multiple models, with options to
overlay all gain curves in a single plot or display them in a flexible grid layout.
Labels, title wrapping, curve styles, and saving output images are all customizable,
making this function well-suited for both development analysis and final reporting.
- show_gain_chart(model, X, y, xlabel='Percentage of Sample', ylabel='Cumulative Gain', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, grid=False, n_rows=None, n_cols=2, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True)
- Parameters:
model (object or list[object]) – A trained classifier or list of classifiers. Each model must support predict_proba.
X (pd.DataFrame or np.ndarray) – The feature matrix used for prediction.
y (pd.Series or np.ndarray) – Ground truth binary labels.
xlabel (str, optional) – Label for the x-axis. Defaults to "Percentage of Sample".
ylabel (str, optional) – Label for the y-axis. Defaults to "Cumulative Gain".
model_title (str or list[str], optional) – Custom display names for each model.
overlay (bool, optional) – If True, overlay all models on a single axis. Mutually exclusive with grid.
title (str, optional) – Plot or grid title. Set to "" to suppress the title.
save_plot (bool, optional) – Whether to save the chart(s) to disk.
image_path_png (str, optional) – Output path for saving PNG image(s).
image_path_svg (str, optional) – Output path for saving SVG image(s).
text_wrap (int, optional) – Max characters before title wrapping. Set to None to disable.
curve_kwgs (dict[str, dict] or list[dict], optional) – Dict or list of kwargs per model to customize line style.
linestyle_kwgs (dict, optional) – Styling for the random baseline. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.
grid (bool, optional) – Whether to render a grid layout. Cannot be used with overlay.
n_rows (int, optional) – Rows in the grid layout. If None, inferred automatically.
n_cols (int, optional) – Columns in the grid layout. Defaults to 2.
figsize (tuple[int, int], optional) – Figure size (width, height) in inches.
label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for tick marks and legends.
gridlines (bool, optional) – Whether to show gridlines on the plots.
- Returns:
None. Displays or saves Gain Charts for one or more models.
- Return type:
None
- Raises:
If overlay=True and grid=True are both set.
Notes
- What is a Gain Chart?
Plots the cumulative percentage of positives captured vs. sample size.
The x-axis shows the fraction of the sample, ranked by predicted probability.
The y-axis shows what percentage of the total positives have been captured.
- Why use Gain Charts?
Gain Charts help answer: “If I contact the top X% of predictions, how many positives will I catch?”
Especially useful in marketing, lead scoring, risk management, and fraud detection.
- Reading Gain Curves:
Curves that rise steeply and plateau early indicate better model performance.
The dashed baseline (diagonal line) represents random selection.
- Layout Options:
Use overlay=True to combine all gain curves into a single plot.
Use grid=True for a subplot layout per model.
If neither is set, plots will be rendered individually.
- Styling Options:
Customize individual model lines via curve_kwgs.
Modify the diagonal baseline line using linestyle_kwgs.
Adjust fonts and wrapping for presentation clarity.
- Saving Output:
Enable save_plot=True to save figures as PNG and/or SVG.
Files are named using the model title (e.g., Model_1_gain.png or overlay_gain.svg).
Gain Chart Example 1 (Grid Layout)
In this first Gain Chart example, we compare the cumulative gain performance of two classification models— Logistic Regression and Random Forest Classifier—trained on the synthetic dataset from the Binary Classification Models section. This visualization showcases their ability to identify positive instances across different percentiles of the ranked test data.
Each subplot presents the cumulative gain achieved as a function of the percentage of the sample, sorted by descending predicted probability. The grey dashed line represents the baseline (random gain). A model that identifies a high proportion of positive cases in the early part of the ranking will have a steeper and higher curve. In this example, the Random Forest model typically outpaces Logistic Regression, indicating better early identification of positives.
The show_gain_chart function allows flexible styling and layout control. This example uses a grid configuration (n_cols=2, n_rows=1) with customized line widths and colors; the figure can also be saved for documentation or stakeholder presentations.
from model_metrics import show_gain_chart
show_gain_chart(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "grey", "linestyle": "--"},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
grid=True,
)
Output
Gain Chart Example 2 (Overlay)
This example overlays Gain curves from two classification models—Logistic Regression and Random Forest Classifier—on a single plot to enable direct visual comparison of their cumulative gain performance. Both models were trained on the same synthetic dataset from the Binary Classification Models section and evaluated on the same test set.
The Gain curve shows the cumulative proportion of true positives captured as you move through the population, ranked by predicted probability. A diagonal baseline line from (0, 0) to (1, 1) indicates the expected performance of a random model. Curves that rise above this line demonstrate superior model ability to concentrate positive cases near the top of the ranked list.
By setting overlay=True
, the show_gain_chart
function produces a single,
easy-to-read plot containing both models’ gain curves. Each curve is styled
with linewidth=2
for clear visibility. Overlay layouts are ideal for model
selection discussions, presentations, and performance dashboards.
from model_metrics import show_gain_chart
show_gain_chart(
model=[model1, model2],
X=X_test,
y=y_test,
overlay=True,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "black", "linestyle": "--", "linewidth": 2},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
)
ROC AUC Curves
This section demonstrates how to evaluate the performance of binary classification models using ROC AUC curves, a key metric for assessing the trade-off between true positive and false positive rates. Using the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset from the previous (Binary Classification Models) section, we generate ROC curves to visualize their discriminatory power.
ROC AUC (Receiver Operating Characteristic Area Under the Curve) provides a
single scalar value representing a model’s ability to distinguish between
positive and negative classes, with a value of 1 indicating perfect classification
and 0.5 representing random guessing. The curves are plotted by varying the
classification threshold and calculating the true positive rate (sensitivity)
against the false positive rate (1-specificity). This makes ROC AUC particularly
useful for comparing models like Logistic Regression, which relies on linear
decision boundaries, and Random Forest Classifier, which leverages ensemble
decision trees, especially when class imbalances or threshold sensitivity are
concerns. The show_roc_curve
function simplifies this process, enabling
users to visualize and compare these curves effectively, setting the stage for
detailed performance analysis in subsequent examples.
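These quantities can be reproduced directly with scikit-learn. A minimal sketch, assuming model1, X_test, and y_test from the classification examples:
```python
from sklearn.metrics import roc_curve, roc_auc_score

proba = model1.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)   # points traced out as the threshold varies
auc = roc_auc_score(y_test, proba)                # single-number summary of the curve

print(f"AUC ROC = {auc:.2f} across {len(thresholds)} candidate thresholds")
```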
The show_roc_curve
function provides a flexible and powerful way to visualize
the performance of binary classification models using Receiver Operating Characteristic
(ROC) curves. Whether you’re comparing multiple models, evaluating subgroup fairness,
or preparing publication-ready plots, this function allows full control over layout,
styling, and annotations. It supports single and multiple model inputs, optional overlay
or grid layouts, and group-wise comparisons via a categorical feature. Additional options
allow custom axis labels, AUC precision, curve styling, and export to PNG/SVG.
Designed to be both user-friendly and highly configurable, show_roc_curve
is a practical tool for model evaluation and stakeholder communication.
- show_roc_curve(model, X, y, xlabel='False Positive Rate', ylabel='True Positive Rate', model_title=None, decimal_places=2, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, grid=False, n_rows=None, n_cols=2, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True, group_category=None)
- Parameters:
model (object or str or list[object or str]) – A trained model, a string placeholder, or a list containing models or strings to evaluate.
X (pd.DataFrame or np.ndarray) – Feature matrix used for prediction.
y (pd.Series or np.ndarray) – True binary labels for evaluation.
xlabel (str, optional) – Label for the x-axis. Defaults to "False Positive Rate".
ylabel (str, optional) – Label for the y-axis. Defaults to "True Positive Rate".
model_title (str or list[str], optional) – Custom title(s) for the models. Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.
decimal_places (int, optional) – Number of decimal places for AUC values. Defaults to 2.
overlay (bool, optional) – Whether to overlay multiple models on a single plot. Defaults to False.
title (str, optional) – Title for the plot (used in overlay mode or as global title). If "", disables the title. Defaults to None.
save_plot (bool, optional) – Whether to save the plot(s) to file. Defaults to False.
image_path_png (str, optional) – File path to save the plot(s) as PNG.
image_path_svg (str, optional) – File path to save the plot(s) as SVG.
text_wrap (int, optional) – Maximum character width before wrapping plot titles. If None, no wrapping is applied.
curve_kwgs (list[dict] or dict[str, dict], optional) – Plot styling for ROC curves. Accepts a list of dictionaries or a nested dictionary keyed by model_title.
linestyle_kwgs (dict, optional) – Style for the random guess (diagonal) line. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.
grid (bool, optional) – Whether to organize the ROC plots in a subplot grid layout. Cannot be used with overlay=True or group_category.
n_rows (int, optional) – Number of rows in the grid layout. If None, calculated automatically based on the number of models and columns.
n_cols (int, optional) – Number of columns in the grid layout. Defaults to 2.
figsize (tuple, optional) – Size of the plot or grid of plots, in inches. Defaults to (8, 6).
label_fontsize (int, optional) – Font size for axis labels and titles. Defaults to 12.
tick_fontsize (int, optional) – Font size for ticks and legend text. Defaults to 10.
gridlines (bool, optional) – Whether to display grid lines on plots. Defaults to True.
group_category (array-like, optional) – Categorical array used to group ROC curves. Cannot be used with grid=True or overlay=True.
- Returns:
None. Displays or saves ROC curve plots for classification models.
- Return type:
None
- Raises:
If grid=True and overlay=True are both set.
If group_category is used with grid or overlay.
If overlay=True is used with only one model.
Notes
- Flexible Inputs:
model and model_title can be individual items or lists. Strings passed in model are treated as placeholder names.
Titles can be automatically inferred or explicitly passed using model_title.
- Group-Wise ROC:
If group_category is passed, separate ROC curves are plotted for each unique group.
The legend will include group-specific AUC and class distribution (e.g., AUC = 0.87, Count: 500, Pos: 120, Neg: 380).
- Plot Modes:
overlay=True overlays all models in one figure.
grid=True arranges individual ROC plots in a subplot layout.
If neither is set, separate full-size plots are shown for each model.
- Legend and Styling:
A random guess reference line (diagonal) is plotted by default.
Customize ROC curves with curve_kwgs and the diagonal line with linestyle_kwgs.
Titles can be disabled with title="".
- Saving Plots:
If save_plot=True, plots are saved using the base filename format <model_name>_roc_auc or overlay_roc_auc_plot.
The show_roc_curve
function provides flexible and highly customizable
plotting of ROC curves for binary classification models. It supports overlays,
grid layouts, and subgroup visualizations, while also allowing export options
and styling hooks for publication-ready output.
ROC AUC Example 1 (Grid Layout)
In this first ROC AUC evaluation example, we plot the ROC curves for two
models: Logistic Regression and Random Forest Classifier, trained on the
synthetic dataset from the Binary Classification Models section. The curves are displayed side by side
using a grid layout (n_cols=2, n_rows=1
), with the Logistic Regression curve
in blue and the Random Forest curve in green for clear differentiation.
A red dashed line represents the random guessing baseline. This example
demonstrates how the show_roc_curve
function enables straightforward
visualization of model performance, with options to customize colors,
add a grid, and save the plot for reporting purposes.
from model_metrics import show_roc_curve
show_roc_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
decimal_places=2,
n_cols=2,
n_rows=1,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "green", "linewidth": 2},
},
linestyle_kwgs={"color": "red", "linestyle": "--"},
grid=True,
)
Output
ROC AUC Example 2 (Overlay)
In this second ROC AUC evaluation example, we focus on overlaying the results of
two models—Logistic Regression and Random Forest Classifier—trained on the
synthetic dataset from the Binary Classification Models section onto a single plot. Using the show_roc_curve
function with the overlay=True
parameter, the ROC curves for both models are
displayed together, with Logistic Regression in blue and Random Forest in black,
both with a linewidth=2
. A red dashed line serves as the random guessing
baseline, and the plot includes a custom title for clarity.
from model_metrics import show_roc_curve
show_roc_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
decimal_places=2,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
linestyle_kwgs={"color": "red", "linestyle": "--"},
title="ROC Curves: Logistic Regression and Random Forest",
overlay=True,
)
Output
ROC AUC Example 3 (by Category)
In this third ROC AUC evaluation example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
The objective here is to assess ROC AUC scores not just overall, but across each category of a selected feature—such as occupation, education, marital-status, or race. This approach enables deeper insight into how performance varies by subgroup, which is particularly important for fairness, bias detection, and subgroup-level interpretability.
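To see what such a subgroup breakdown looks like numerically, here is a minimal sketch computing AUC per category with pandas and scikit-learn. The names fitted_model, X_test_2, and the "race" column follow the example below and are assumptions:
```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Assumed objects: fitted_model (trained classifier), X_test / y_test (evaluation data),
# and X_test_2["race"] -- a categorical column aligned row-for-row with X_test.
proba = fitted_model.predict_proba(X_test)[:, 1]
groups = pd.Series(X_test_2["race"]).reset_index(drop=True)
y_arr = np.asarray(y_test)

for group in groups.unique():
    mask = (groups == group).to_numpy()
    y_g, p_g = y_arr[mask], proba[mask]
    if len(np.unique(y_g)) < 2:
        print(f"{group}: AUC undefined (only one class present)")
        continue
    print(f"{group}: AUC = {roc_auc_score(y_g, p_g):.2f}, "
          f"Count: {mask.sum()}, Pos: {int(y_g.sum())}, Neg: {int((1 - y_g).sum())}")
```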
The show_roc_curve
function supports this analysis through the
group_category
parameter.
For example, by passing group_category=X_test_2["race"], you can generate a separate ROC curve for each unique racial group in the dataset:
from model_metrics import show_roc_curve
show_roc_curve(
model=model_dt["model"].estimator,
X=X_test,
y=y_test,
model_title="Decision Tree Classifier,
decimal_places=2,
group_category=X_test_2["race"],
)
Output
Precision-Recall Curves
This section demonstrates how to evaluate the performance of binary classification models using Precision-Recall (PR) curves, a critical visualization for understanding model behavior in the presence of class imbalance. Using the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset from the previous (Binary Classification Models) section, we generate PR curves to examine how well each model identifies true positives while limiting false positives.
Precision-Recall curves focus on the trade-off between precision (positive predictive value) and recall (sensitivity) across different classification thresholds. This is particularly important when the positive class is rare—as is common in fraud detection, disease diagnosis, or adverse event prediction—because ROC AUC can overstate performance under imbalance. Unlike the ROC curve, the PR curve is sensitive to the proportion of positive examples and gives a clearer picture of how well a model performs where it matters most: in identifying the positive class.
The area under the Precision-Recall curve, also known as Average Precision (AP), summarizes model performance across thresholds. A model that maintains high precision as recall increases is generally more desirable, especially in settings where false positives have a high cost. This makes the PR curve a complementary and sometimes more informative tool than ROC AUC in skewed classification scenarios.
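These quantities can be computed directly with scikit-learn. A minimal sketch, assuming model1, X_test, and y_test from the classification examples:
```python
from sklearn.metrics import precision_recall_curve, average_precision_score

proba = model1.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, proba)  # curve points across thresholds
ap = average_precision_score(y_test, proba)                            # area-style summary of the curve

print(f"Average Precision (AP) = {ap:.2f}")
```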
- show_pr_curve(model, X, y, xlabel='Recall', ylabel='Precision', model_title=None, decimal_places=2, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, grid=False, n_rows=None, n_cols=2, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True, group_category=None, legend_metric='ap')
- Parameters:
model (object or str or list[object or str]) – A trained model, a string placeholder, or a list containing models or strings to evaluate.
X (pd.DataFrame or np.ndarray) – Feature matrix used for prediction.
y (pd.Series or np.ndarray) – True binary labels for evaluation.
xlabel (str, optional) – Label for the x-axis. Defaults to "Recall".
ylabel (str, optional) – Label for the y-axis. Defaults to "Precision".
model_title (str or list[str], optional) – Custom title(s) for the model(s). Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.
decimal_places (int, optional) – Number of decimal places for Average Precision (AP) values. Defaults to 2.
overlay (bool, optional) – Whether to overlay multiple models on a single plot. Defaults to False.
title (str, optional) – Title for the plot (used in overlay mode or as global title). If "", disables the title. Defaults to None.
save_plot (bool, optional) – Whether to save the plot(s) to file. Defaults to False.
image_path_png (str, optional) – File path to save the plot(s) as PNG.
image_path_svg (str, optional) – File path to save the plot(s) as SVG.
text_wrap (int, optional) – Maximum character width before wrapping plot titles. If None, no wrapping is applied.
curve_kwgs (list[dict] or dict[str, dict], optional) – Plot styling for PR curves. Accepts a list of dictionaries or a nested dictionary keyed by model_title.
grid (bool, optional) – Whether to organize the PR plots in a subplot grid layout. Cannot be used with overlay=True or group_category.
n_rows (int, optional) – Number of rows in the grid layout. If None, calculated automatically based on the number of models and columns.
n_cols (int, optional) – Number of columns in the grid layout. Defaults to 2.
figsize (tuple, optional) – Size of the plot or grid of plots, in inches. Defaults to (8, 6).
label_fontsize (int, optional) – Font size for axis labels and titles. Defaults to 12.
tick_fontsize (int, optional) – Font size for ticks and legend text. Defaults to 10.
gridlines (bool, optional) – Whether to display grid lines on plots. Defaults to True.
group_category (array-like, optional) – Categorical array used to group PR curves. Cannot be used with grid=True or overlay=True.
legend_metric (str, optional) – Metric to display in the legend. Either "ap" (Average Precision) or "aucpr" (area under the PR curve). Defaults to "ap".
- Returns:
None. Displays or saves Precision-Recall curve plots for classification models.
- Return type:
None
- Raises:
If grid=True and overlay=True are both set.
If group_category is used with grid=True or overlay=True.
If overlay=True is used with only one model.
If legend_metric is not one of "ap" or "aucpr".
If model_title is not a string, list of strings, or None.
Notes
- Flexible Inputs:
model and model_title can be individual items or lists. Strings passed in model are treated as placeholder names.
Titles can be automatically inferred or explicitly passed using model_title.
- Group-Wise PR:
If group_category is passed, separate PR curves are plotted for each unique group.
The legend will include group-specific Average Precision and class distribution (e.g., AP = 0.78, Count: 500, Pos: 120, Neg: 380).
- Average Precision vs. AUCPR:
By default, the legend shows Average Precision (AP), which summarizes the PR curve with greater emphasis on performance at higher precision levels.
If the user passes legend_metric="aucpr", the legend will instead display AUCPR (Area Under the Precision-Recall Curve), which gives equal weight to all parts of the curve. (See the sketch after these notes for how the two values can differ numerically.)
- Plot Modes:
overlay=True overlays all models in one figure.
grid=True arranges individual PR plots in a subplot layout.
If neither is set, separate full-size plots are shown for each model.
- Legend and Styling:
A random classifier baseline (constant precision) is plotted by default.
Customize PR curves with curve_kwgs.
Titles can be disabled with title="".
- Saving Plots:
If save_plot=True, plots are saved using the base filename format <model_name>_precision_recall or overlay_pr_plot.
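A minimal sketch contrasting the two legend metrics with scikit-learn, assuming model1, X_test, and y_test from the classification examples:
```python
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

proba = model1.predict_proba(X_test)[:, 1]

ap = average_precision_score(y_test, proba)              # step-wise weighted mean of precision
precision, recall, _ = precision_recall_curve(y_test, proba)
aucpr = auc(recall, precision)                           # trapezoidal area under the PR curve

print(f"AP = {ap:.3f}, AUCPR = {aucpr:.3f}")             # usually close, but not identical
```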
The show_pr_curve
function provides flexible and highly customizable plotting
of Precision-Recall curves for binary classification models. It supports overlays,
grid layouts, and subgroup visualizations, while also allowing export options and
styling hooks for publication-ready output.
Precision-Recall Example 1 (Grid Layout)
In this first Precision-Recall evaluation example, we plot the PR curves for two
models: Logistic Regression and Random Forest Classifier, both trained on the
synthetic dataset from the Binary Classification Models section.
The curves are arranged side by side using a grid layout (n_cols=2, n_rows=1
),
with the Logistic Regression curve rendered in blue and the Random Forest curve
in green to distinguish between models. A gray dashed line indicates the baseline
precision, equal to the prevalence of the positive class in the dataset.
This example illustrates how the show_pr_curve
function makes it easy to
visualize and compare model performance when dealing with class imbalance. It
also demonstrates layout flexibility and customization options, including gridlines,
label styling, and export functionality—making it suitable for both exploratory
analysis and final reporting.
from model_metrics import show_pr_curve
show_pr_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=["Logistic Regression", "Random Forest"],
decimal_places=2,
grid=True,
n_cols=2,
n_rows=1,
curve_kwgs=[
{"color": "blue"},
{"color": "green"}
],
gridlines=True
)
Output
Precision-Recall Example 2 (Overlay)
In this second Precision-Recall evaluation example, we focus on overlaying the
results of two models—Logistic Regression and Random Forest Classifier—trained
on the synthetic dataset from the Binary Classification Models section onto a single plot. Using the show_pr_curve
function with the overlay=True
parameter, the Precision-Recall curves for
both models are displayed together, with Logistic Regression in blue and Random
Forest in black, both with a linewidth=2
. The plot includes a custom title
for clarity.
from model_metrics import show_pr_curve
show_pr_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
title="ROC Curves: Logistic Regression and Random Forest",
overlay=True,
)
Output
Precision-Recall Example 3 (Categorical)
In this third Precision-Recall evaluation example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
The objective here is to assess Precision-Recall performance not just overall, but across each category of a selected feature—such as occupation, education, marital-status, or race. This approach enables deeper insight into how performance varies by subgroup, which is particularly important for fairness, bias detection, and subgroup-level interpretability.
The show_pr_curve
function supports this analysis through the
group_category
parameter.
For example, by passing group_category=X_test_2["race"], you can generate a separate PR curve for each unique racial group in the dataset:
from model_metrics import show_pr_curve
show_pr_curve(
model=model_dt["model"].estimator,
X=X_test,
y=y_test,
model_title="Decision Tree Classifier,
group_category=X_test_2["race"],
legend_metric="aucpr",
)
Output
Confusion Matrix Evaluation
This section introduces the show_confusion_matrix
function, which provides a
flexible, styled interface for generating and visualizing confusion matrices
across one or more classification models. It supports advanced features like
threshold overrides, subgroup labeling, classification report display, and fully
customizable plot aesthetics including grid layouts.
The confusion matrix is a fundamental diagnostic tool for classification models, displaying the counts of true positives, true negatives, false positives, and false negatives. This function goes beyond standard implementations by allowing for custom thresholds (globally or per model), label annotation (e.g., TP, FP, etc.), plot exporting, colorbar toggling, and grid visualization.
This is especially useful when comparing multiple models side-by-side or needing publication-ready confusion matrices for stakeholders.
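For reference, the raw counts behind such a matrix can be obtained directly from scikit-learn. A minimal sketch, assuming model1, X_test, and y_test from the Binary Classification Models section:
```python
from sklearn.metrics import confusion_matrix

preds = model1.predict(X_test)                          # predict() uses the default 0.5 cutoff
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```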
- show_confusion_matrix(model, X, y, model_title=None, title=None, model_threshold=None, custom_threshold=None, class_labels=None, cmap='Blues', save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, figsize=(8, 6), labels=True, label_fontsize=12, tick_fontsize=10, inner_fontsize=10, grid=False, score=None, class_report=False, **kwargs)
- Parameters:
model (object or str or list[object or str]) – A single model (object or string), or a list of models or string placeholders.
X (pd.DataFrame or np.ndarray) – Feature matrix used for prediction.
y (pd.Series or np.ndarray) – True target labels.
model_title (str or list[str], optional) – Custom title(s) for each model. Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.
title (str, optional) – Title for each plot. If "", no title is displayed. If None, a default title is shown.
model_threshold (dict, optional) – Dictionary of thresholds keyed by model title. Used if custom_threshold is not set.
custom_threshold (float, optional) – Global override threshold to apply across all models.
class_labels (list[str], optional) – Custom labels for the classes in the matrix.
cmap (str, optional) – Colormap to use for the heatmap. Defaults to "Blues".
save_plot (bool, optional) – Whether to save the generated plot(s).
image_path_png (str, optional) – Path to save the PNG version of the image.
image_path_svg (str, optional) – Path to save the SVG version of the image.
text_wrap (int, optional) – Maximum width of plot titles before wrapping.
figsize (tuple[int, int], optional) – Figure size in inches. Defaults to (8, 6).
labels (bool, optional) – Whether to annotate matrix cells with TP, FP, FN, TN.
label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for axis ticks.
inner_fontsize (int, optional) – Font size for numbers and labels inside cells.
grid (bool, optional) – Whether to display multiple models in a grid layout.
score (str, optional) – Scoring metric to use when optimizing the threshold (if applicable).
class_report (bool, optional) – If True, prints a classification report below each matrix.
kwargs (dict, optional) – Additional keyword arguments for customization (e.g., show_colorbar, n_cols).
- Returns:
None. Displays confusion matrix plots (and optionally saves them).
- Return type:
None
- Raises:
TypeError – If model_title is not a string, a list of strings, or None.
Notes
- Model Support:
Supports single or multiple classification models.
model_title may be inferred automatically or provided explicitly.
- Threshold Handling:
Use model_threshold to specify per-model thresholds.
custom_threshold overrides all other thresholds.
- Plotting Modes:
grid=True arranges plots in subplots.
Otherwise, plots are displayed one at a time.
- Labeling:
Set labels=False to disable annotating cells with TP, FP, FN, TN.
Raw numeric values are always shown inside the cells.
- Colorbar & Styling:
Toggle the colorbar via show_colorbar (passed via kwargs).
Colormap and font sizes are fully configurable.
- Exporting Plots:
Plots can be saved as both PNG and SVG using the respective paths.
Saved filenames follow the pattern confusion_matrix_<model_name> or grid_confusion_matrix.
Confusion Matrix Example 1 (Threshold=0.5)
In this first confusion matrix evaluation example, we focus on showing the results of two models—Logistic Regression and Random Forest Classifier—trained on the synthetic dataset from the Binary Classification Models section onto a single plot.
from model_metrics import show_confusion_matrix
show_confusion_matrix(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
cmap="Blues",
text_wrap=20,
grid=True,
n_cols=2,
n_rows=1,
figsize=(6, 6),
)
Output
Confusion Matrix Example 2 (Classification Report)
This second confusion matrix evaluation example is nearly identical to the first,
but uses a different color map (cmap="viridis"
) and sets class_report=True
to print classification reports for each model in addition to the visual output.
from model_metrics import show_confusion_matrix
show_confusion_matrix(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
cmap="viridis",
text_wrap=20,
grid=True,
n_cols=2,
n_rows=1,
figsize=(6, 6),
class_report=True
)
Output
Confusion Matrix for Logistic Regression:
Predicted 0 Predicted 1
Actual 0 76 18
Actual 1 13 93
Classification Report for Logistic Regression:
precision recall f1-score support
0 0.85 0.81 0.83 94
1 0.84 0.88 0.86 106
accuracy 0.84 200
macro avg 0.85 0.84 0.84 200
weighted avg 0.85 0.84 0.84 200
Confusion Matrix for Random Forest:
Predicted 0 Predicted 1
Actual 0 84 10
Actual 1 3 103
Classification Report for Random Forest:
precision recall f1-score support
0 0.97 0.89 0.93 94
1 0.91 0.97 0.94 106
accuracy 0.94 200
macro avg 0.94 0.93 0.93 200
weighted avg 0.94 0.94 0.93 200
Confusion Matrix Example 3 (Threshold = 0.37)
In this third confusion matrix evaluation example using the synthetic dataset
from the Binary Classification Models section, we apply
a custom classification threshold of 0.37 using the custom_threshold
parameter.
This overrides the default threshold of 0.5 and enables us to inspect how the
confusion matrices shift when a more lenient decision boundary is applied. Refer
to the section on threshold selection logic
for caveats on choosing the right threshold.
This is especially useful in imbalanced classification problems or cost-sensitive environments where the trade-off between precision and recall must be adjusted. By lowering the threshold, we typically increase the number of positive predictions, which can improve recall but may come at the cost of more false positives.
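Here is a minimal sketch of what this override does under the hood, thresholding predicted probabilities directly, assuming model1, X_test, and y_test from earlier (the library's exact boundary convention may differ):
```python
import numpy as np
from sklearn.metrics import confusion_matrix

proba = model1.predict_proba(X_test)[:, 1]

for threshold in (0.5, 0.37):
    preds = (proba >= threshold).astype(int)   # boundary convention (> vs >=) is an assumption
    print(f"Threshold {threshold}:")
    print(confusion_matrix(y_test, preds), "\n")
```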
The output matrices for both models—Logistic Regression and Random Forest—are shown side by side in a grid layout for easy visual comparison.
from model_metrics import show_confusion_matrix
show_confusion_matrix(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
text_wrap=20,
grid=True,
n_cols=2,
n_rows=1,
figsize=(6, 6),
custom_threshold=0.37,
)
Output
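To make the threshold effect concrete, the following is a minimal sketch (independent of model_metrics) that applies the two cutoffs manually with scikit-learn and prints the resulting confusion-matrix counts. It assumes model1 and the test split from the Binary Classification Models section are still in scope.
from sklearn.metrics import confusion_matrix

# Positive-class probabilities from one of the trained models
proba = model1.predict_proba(X_test)[:, 1]

for threshold in (0.5, 0.37):
    preds = (proba >= threshold).astype(int)   # apply the decision boundary
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"threshold={threshold}: TP={tp}, FP={fp}, FN={fn}, TN={tn}")
Lowering the cutoff from 0.5 to 0.37 moves borderline cases into the positive class, which is exactly the recall/false-positive trade-off described above.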
Calibration Curves
This section focuses on calibration curves, a diagnostic tool that compares predicted probabilities to actual outcomes, helping evaluate how well a model’s predicted confidence aligns with observed frequencies. Using models like Logistic Regression or Random Forest on the synthetic dataset from the previous (Binary Classification Models) section, we generate calibration curves to assess the reliability of model probabilities.
Calibration is especially important in domains where probability outputs inform downstream decisions, such as healthcare, finance, and risk management. A well-calibrated model not only predicts the correct class but also outputs meaningful probabilities—for example, when a model predicts a 0.7 probability, we expect roughly 70% of such predictions to be correct.
The show_calibration_curve
function simplifies this process by allowing users to
visualize calibration performance across models or subgroups. The plots show the
mean predicted probabilities against the actual observed fractions of positive
cases, with an optional reference line representing perfect calibration.
Additional features include support for overlay or grid layouts, subgroup
analysis by categorical features, and optional Brier score display—a scalar
measure of calibration quality.
The function offers full control over styling, figure layout, axis labels, and output format, making it easy to generate both exploratory and publication-ready plots.
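For readers who want to see the quantities behind these plots, the short sketch below computes the binned calibration data and the Brier score directly with scikit-learn; show_calibration_curve presumably wraps this kind of computation together with the plotting (an assumption about its internals). A fitted binary classifier, here called clf as a placeholder, and a test split are assumed to be in scope.
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Positive-class probabilities from any fitted binary classifier
proba = clf.predict_proba(X_test)[:, 1]

# Fraction of actual positives vs. mean predicted probability, per bin
prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)

# Brier score: mean squared difference between probabilities and outcomes
print("Brier score:", brier_score_loss(y_test, proba))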
- show_calibration_curve(model, X, y, xlabel='Mean Predicted Probability', ylabel='Fraction of Positives', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, grid=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, bins=10, marker='o', show_brier_score=True, gridlines=True, linestyle_kwgs=None, group_category=None, **kwargs)
- Parameters:
model (estimator or list) – A trained classifier or a list of classifiers to evaluate.
X (pd.DataFrame or np.ndarray) – Feature matrix used for predictions.
y (pd.Series or np.ndarray) – True binary target values.
xlabel (str, optional) – X-axis label. Defaults to "Mean Predicted Probability".
ylabel (str, optional) – Y-axis label. Defaults to "Fraction of Positives".
model_title (str or list[str], optional) – Custom title(s) for the models.
overlay (bool, optional) – If True, overlays multiple models on one plot.
title (str, optional) – Title for the plot. Use "" to suppress.
save_plot (bool, optional) – Whether to save the plot(s).
image_path_png (str, optional) – Directory path for PNG export.
image_path_svg (str, optional) – Directory path for SVG export.
text_wrap (int, optional) – Max characters before title text wraps.
curve_kwgs (list[dict] or dict[str, dict], optional) – Styling options for the calibration curves.
grid (bool, optional) – Whether to arrange models in a subplot grid.
n_cols (int, optional) – Number of columns in the grid layout. Defaults to 2.
n_rows (int, optional) – Number of rows in the grid layout. Auto-calculated if None.
figsize (tuple, optional) – Figure size in inches (width, height).
label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for ticks and legend entries.
bins (int, optional) – Number of bins used to compute calibration.
marker (str, optional) – Marker style for calibration points.
show_brier_score (bool, optional) – Whether to display Brier score in the legend.
gridlines (bool, optional) – Whether to show gridlines on plots.
linestyle_kwgs (dict, optional) – Styling for the “perfectly calibrated” reference line.
group_category (array-like, optional) – Categorical variable used to create subgroup calibration plots.
- Returns:
None. Displays or saves calibration plots for classification models.
- Return type:
None
- Raises:
If overlay=True and grid=True are both set.
If group_category is used with overlay or grid.
If the curve_kwgs list does not match the number of models.
Notes
- Calibration vs Discrimination:
Calibration evaluates how well predicted probabilities reflect observed outcomes, while ROC AUC measures a model’s ability to rank predictions.
- Flexible Plotting Modes:
overlay=True plots multiple models on one figure. grid=True arranges plots in a grid layout. If neither is set, individual full-size plots are created.
- Group-Wise Analysis:
Passing group_category plots separate calibration curves by subgroup (e.g., age, race). Each subgroup’s Brier score is shown when show_brier_score=True.
- Customization:
Use curve_kwgs and linestyle_kwgs to control styling. Add markers, gridlines, and custom titles to suit report or presentation needs.
- Saving Outputs:
Set save_plot=True and specify image_path_png or image_path_svg to export figures. Filenames are auto-generated based on the model name and plot type.
Important
Calibration curves are a valuable diagnostic tool for assessing the alignment between predicted probabilities and actual outcomes. By plotting the fraction of positives against predicted probabilities, we can evaluate how well a model’s confidence scores correspond to observed reality. While these plots offer important insights, it’s equally important to understand the assumptions and limitations behind the calibration methods used.
Calibration Curve Example 1 (Grid-like)
This example presents calibration curves for two classification models trained on the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
The classification models are displayed side by side in a grid layout. Each
subplot shows how well the predicted probabilities from a model align with the
actual observed outcomes. A diagonal dashed line representing perfect calibration
is included in both plots, and Brier scores are shown in the legend to quantify
each model’s calibration accuracy.
By setting grid=True, the function automatically arranges the individual plots based on the number of models and specified columns. This layout is ideal for visually comparing calibration behavior across models without overlapping lines.
pipelines_or_models = [
model_lr["model"].estimator,
model_rf["model"].estimator,
model_dt["model"].estimator,
]
# Model titles
model_titles = [
"Logistic Regression",
"Random Forest Classifier",
"Decision Tree Classifier",
]
from model_metrics import show_calibration_curve
show_calibration_curve(
model=pipelines_or_models[:2],
X=X_test,
y=y_test,
model_title=model_titles[:2],
text_wrap=50,
bins=10,
show_brier_score=True,
grid=True,
linestyle_kwgs={"color": "black"},
)
Output
Calibration Curve Example 2 (Overlay)
This example also uses the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
This example demonstrates how to overlay calibration curves from multiple classification
models in a single plot. Overlaying allows for direct visual comparison of how predicted
probabilities from each model align with actual outcomes on the same axes.
The diagonal dashed line represents perfect calibration, and Brier scores are included in the legend for each model, providing a quantitative measure of calibration accuracy.
By setting overlay=True, the function combines all model curves into one figure, making it easier to evaluate relative performance without splitting across subplots.
pipelines_or_models = [
model_lr["model"].estimator,
model_rf["model"].estimator,
model_dt["model"].estimator,
]
# Model titles
model_titles = [
"Logistic Regression",
"Random Forest Classifier",
"Decision Tree Classifier",
]
from model_metrics import show_calibration_curve
show_calibration_curve(
model=pipelines_or_models,
X=X_test,
y=y_test,
model_title=model_titles,
bins=10,
show_brier_score=True,
overlay=True,
linestyle_kwgs={"color": "black"},
)
Output
Calibration Curve Example 3 (by Category)
This example, too, uses the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
This example shows how to visualize calibration curves separately for each
category within a given feature—in this case, the race column of the joined
test set—using a single Random Forest classifier. Each plot represents the
calibration behavior of the model for a specific subgroup, allowing for detailed
insight into how predicted probabilities align with actual outcomes across
demographic categories.
This type of disaggregated visualization is especially useful for fairness
analysis and subgroup performance auditing. By setting group_category="race", the function automatically detects unique values in the specified column and
generates a separate calibration curve for each.
The dashed diagonal reference line represents perfect calibration. Brier scores are included in each plot to provide a quantitative measure of calibration performance within the group.
Note
When using group_category, both overlay and grid must be set to False. This ensures each group receives its own standalone figure, avoiding conflicting layout behavior.
from model_metrics import show_calibration_curve
show_calibration_curve(
model=model_rf["model"].estimator,
X=X_test,
y=y_test,
model_title="Random Forest Classifier",
bins=10,
show_brier_score=True,
linestyle_kwgs={"color": "black"},
curve_kwgs={title: {"linewidth": 2} for title in model_titles},
group_category=X_test_2["race"],
)
Output
Threshold Metric Curves
This section introduces a powerful utility for exploring how classification thresholds affect key performance metrics, including Precision, Recall, F1 Score, and Specificity. Rather than fixing a threshold (commonly at 0.5), this function allows users to visualize trade-offs across the full range of possible thresholds, making it especially useful when optimizing for use-case-specific goals such as maximizing recall or achieving a minimum precision.
Using a Random Forest Classifier trained on the Adult Income dataset [2], this tool helps users answer practical questions like:
What threshold achieves at least 85% precision?
Where does F1 score peak for this model?
How does specificity behave as the threshold increases?
The plot_threshold_metrics function supports optional threshold lookups via lookup_metric and lookup_value; when both are supplied, the closest threshold meeting your constraint is printed. Plots can be customized with colors, gridlines, line styles, wrapped titles, and export options.
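The sketch below (independent of model_metrics) illustrates how such threshold curves can be computed by sweeping candidate thresholds and deriving each metric from the confusion matrix. It is a rough illustration of the idea, not the library’s implementation; a fitted classifier supporting predict_proba (here called clf, a placeholder) and a test split are assumed.
import numpy as np
from sklearn.metrics import confusion_matrix

proba = clf.predict_proba(X_test)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)

rows = []
for t in thresholds:
    preds = (proba >= t).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if preds collapse to one class
    tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    rows.append((t, precision, recall, f1, specificity))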
- plot_threshold_metrics(model, X_test, y_test, title=None, text_wrap=None, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True, baseline_thresh=True, curve_kwgs=None, baseline_kwgs=None, save_plot=False, image_path_png=None, image_path_svg=None, lookup_metric=None, lookup_value=None, decimal_places=4)
- Parameters:
model (object) – A trained classification model that supports
predict_proba
.X_test (pd.DataFrame or np.ndarray) – Feature matrix for evaluation.
y_test (pd.Series or np.ndarray) – True binary labels.
title (str, optional) – Custom title for the plot. If "", disables the title.
text_wrap (int, optional) – Maximum width of the title before wrapping. If None, no wrapping is applied.
figsize (tuple, optional) – Tuple representing the figure size in inches. Defaults to (8, 6).
label_fontsize (int, optional) – Font size for axis labels and title.
tick_fontsize (int, optional) – Font size for tick labels.
gridlines (bool, optional) – Whether to show grid lines. Defaults to True.
baseline_thresh (bool, optional) – If True, adds a dashed line at threshold = 0.5.
curve_kwgs (dict, optional) – Dictionary of styling options for metric curves (e.g., {"linewidth": 2}).
image_path_png (str, optional) – File path to save PNG output.
image_path_svg (str, optional) – File path to save SVG output.
lookup_metric (str, optional) – Metric to search for best threshold (“precision”, “recall”, “f1”, or “specificity”).
lookup_value (float, optional) – Desired value for the lookup metric.
decimal_places (int, optional) – Number of decimal places for printed threshold output.
- Returns:
None. Displays or saves the metric vs. threshold plot.
- Return type:
None
Notes
- Metric Curves:
Plots include Precision, Recall, F1 Score, and Specificity over threshold values. Useful for analyzing how changing the threshold alters model behavior.
- Threshold Lookup:
Set lookup_metric and lookup_value to find the closest threshold that meets your constraint. The result is printed to the console and the corresponding threshold is highlighted with a vertical line.
- Styling Options:
Customize plot curves with curve_kwgs. Adjust the baseline style (e.g., at threshold = 0.5) via baseline_kwgs.
- Exporting:
Use save_plot=True with image_path_png and/or image_path_svg to save outputs.
- Interactivity:
Ideal for presentations or dashboards where visualizing threshold sensitivity is crucial.
Particularly helpful for domains like healthcare, fraud detection, or content moderation, where the cost of false positives vs. false negatives must be carefully managed.
Threshold Curves Example 1 (Threshold=0.5)
This example demonstrates how to plot threshold-dependent classification metrics using a Random Forest Classifier trained on the adult income dataset [2].
The plot_threshold_metrics function visualizes how Precision, Recall, F1 Score, and Specificity change as the decision threshold varies. In this configuration, the baseline reference line at 0.5 is turned off (baseline_thresh=False), the line styling is customized via curve_kwgs, and the title wrapping is adjusted for improved clarity in presentation-ready plots.
from model_metrics import plot_threshold_metrics
plot_threshold_metrics(
model=model_rf["model"].estimator,
X_test=X_test,
y_test=y_test,
baseline_thresh=False,
curve_kwgs={
"linestyle": "-",
"linewidth": 2,
},
text_wrap=40,
)
Output
Threshold Curves Example 2 (Targeted Metric Lookup)
This example expands on threshold-based classification metric visualization using
a targeted lookup scenario. Suppose a clinical stakeholder or domain expert has
determined—based on prior research, cost-benefit considerations, or operational
constraints—that a precision of approximately 0.879
is ideal for downstream
decision-making (e.g., minimizing false positives in a healthcare setting).
The plot_threshold_metrics
function accepts the optional arguments lookup_metric
and lookup_value
to help identify the threshold that best aligns with this target.
When these are set, the function automatically locates and highlights the threshold
that most closely achieves the desired metric value, offering transparency and
guidance for threshold tuning.
from model_metrics import plot_threshold_metrics
plot_threshold_metrics(
model=model_rf["model"].estimator,
X_test=X_test,
y_test=y_test,
lookup_metric="precision",
lookup_value=0.879,
baseline_thresh=False,
lookup_kwgs={
"color": "red",
"linestyle": "--",
"linewidth": 2,
},
curve_kwgs={
"linestyle": "-",
"linewidth": 2,
},
text_wrap=40,
)
Output
In this example:
lookup_metric="precision"
specifies that we are targeting the precision curve.lookup_value=0.879
provides the desired value for that metric.The function will search for the closest possible precision value along the threshold range and display a vertical line at that corresponding threshold.
The threshold value is printed to the console and included in the legend (e.g., Best Threshold: 0.6757).
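As a rough illustration of this lookup step (not the library’s implementation), the sketch below reuses the rows table from the earlier threshold sketch and selects the threshold whose precision is closest to the requested 0.879.
import numpy as np

thresholds = np.array([r[0] for r in rows])
precisions = np.array([r[1] for r in rows])

target = 0.879
best_idx = np.argmin(np.abs(precisions - target))  # closest precision to the target
print(f"Best Threshold: {thresholds[best_idx]:.4f} "
      f"(precision={precisions[best_idx]:.3f})")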