Model Performance Summaries
Summarizes model performance metrics for classification and regression models.
- summarize_model_performance(model, X, y, model_type='classification', model_threshold=None, model_title=None, custom_threshold=None, score=None, return_df=False, overall_only=False, decimal_places=3)
- Parameters:
model (object or list) – A trained model or a list of trained models.
X (pd.DataFrame) – Feature matrix used for evaluation.
y (pd.Series or np.array) – Target variable.
model_type (str, optional) – Specifies whether the model is for classification or regression. Must be either "classification" or "regression". Defaults to "classification".
model_threshold (dict, optional) – Threshold values for classification models. If provided, this dictionary specifies thresholds per model. Defaults to None.
model_title (str or list, optional) – Custom model names for display. If None, names are inferred from the models. Defaults to None.
custom_threshold (float, optional) – A fixed threshold for classification, overriding model_threshold. If set, the “Model Threshold” row is excluded. Defaults to None.
score (str, optional) – A custom scoring metric for classification models. Defaults to None.
return_df (bool, optional) – If True, returns a DataFrame instead of printing results. Defaults to False.
overall_only (bool, optional) – If True, returns only the “Overall Metrics” row, removing coefficient-related columns for regression models. Defaults to False.
decimal_places (int, optional) – Number of decimal places to round numerical metrics. Defaults to 3.
- Returns:
A DataFrame containing model performance metrics if return_df=True. Otherwise, the metrics are printed in a formatted table.
- Return type:
pd.DataFrame or None
- Raises:
ValueError – If model_type="classification" and overall_only=True.
ValueError – If model_type is not "classification" or "regression".
Notes
- Classification Models:
Computes precision, recall, specificity, AUC ROC, F1-score, Brier score, and other key metrics.
Requires models supporting predict_proba or decision_function.
If custom_threshold is set, it overrides model_threshold.
- Regression Models:
Computes MAE, MAPE, MSE, RMSE, Explained Variance, and R² Score.
Uses statsmodels.OLS to extract coefficients and p-values (see the sketch following these notes).
If overall_only=True, the DataFrame retains only overall performance metrics.
- All Models:
decimal_places controls how many decimal places are displayed in the results.
If return_df=False, the function prints the results in a formatted, readable table instead of returning a DataFrame.
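The regression notes above state that coefficients and p-values come from statsmodels.OLS. Here is a minimal sketch of that extraction outside the library, assuming a feature DataFrame X_train and target y_train such as the diabetes split used in the regression examples below:
```python
import statsmodels.api as sm

# Add the intercept ("const") term that appears in the coefficient tables
X_ols = sm.add_constant(X_train)
ols_fit = sm.OLS(y_train, X_ols).fit()

print(ols_fit.params.round(3))    # per-feature coefficients, including const
print(ols_fit.pvalues.round(3))   # per-feature p-values
```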
The summarize_model_performance function provides a structured evaluation of classification and regression models, generating key performance metrics. For classification models, it computes precision, recall, specificity, F1-score, and AUC ROC. For regression models, it extracts coefficients and evaluates error metrics like MSE, RMSE, and R². The function allows specifying custom thresholds, metric rounding, and formatted display options.
Below are two examples demonstrating how to evaluate multiple models using summarize_model_performance. The function calculates and presents metrics for classification and regression models.
Binary Classification Models
This section introduces binary classification using two widely used machine learning models: Logistic Regression and Random Forest Classifier.
These examples demonstrate how to prepare and train models on a synthetic dataset, setting the stage for evaluating their performance with model_metrics in subsequent sections.
Both models use a default classification threshold of 0.5, where predictions are
classified as positive (1) if the predicted probability exceeds 0.5, and negative (0)
otherwise.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Generate a synthetic dataset
X, y = make_classification(
n_samples=1000,
n_features=10,
random_state=42,
)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
)
# Train models
model1 = LogisticRegression().fit(X_train, y_train)
model2 = RandomForestClassifier().fit(X_train, y_train)
model_title = ["Logistic Regression", "Random Forest"]
Binary Classification Example 1
from model_metrics import summarize_model_performance
model_performance = summarize_model_performance(
model=[model1, model2],
model_title=model_title,
X=X_test,
y=y_test,
model_type="classification",
return_df=True,
)
model_performance
Output
Metrics | Logistic Regression | Random Forest |
---|---|---|
Precision/PPV | 0.867 | 0.912 |
Average Precision | 0.937 | 0.966 |
Sensitivity/Recall | 0.82 | 0.838 |
Specificity | 0.843 | 0.899 |
F1-Score | 0.843 | 0.873 |
AUC ROC | 0.913 | 0.95 |
Brier Score | 0.118 | 0.086 |
Model Threshold | 0.5 | 0.5 |
Binary Classification Example 2
In this example, we revisit binary classification with the same two models—Logistic Regression and Random Forest—but adjust the classification threshold (the custom_threshold argument in this case) from the default 0.5 to 0.2. This change allows us to explore how lowering the threshold impacts model performance, potentially increasing sensitivity (recall) by classifying more instances as positive (1) at the expense of precision.
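Mechanically, custom_threshold=0.2 corresponds to thresholding the predicted probabilities yourself. A minimal sketch of that idea, assuming the model1, X_test, and y_test objects defined above (the library's exact boundary convention may differ):
```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Probability of the positive class for each test sample
proba = model1.predict_proba(X_test)[:, 1]

for threshold in (0.5, 0.2):
    # The exact boundary convention (> vs >=) is an assumption here
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, preds):.3f}, "
          f"recall={recall_score(y_test, preds):.3f}")
```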
from model_metrics import summarize_model_performance
model_performance = summarize_model_performance(
model=[model1, model2],
model_title=model_title,
X=X_test,
y=y_test,
model_type="classification",
return_df=True,
custom_threshold=0.2,
)
model_performance
Output
Metrics | Logistic Regression | Random Forest |
---|---|---|
Precision/PPV | 0.803 | 0.831 |
Average Precision | 0.937 | 0.966 |
Sensitivity/Recall | 0.919 | 0.928 |
Specificity | 0.719 | 0.764 |
F1-Score | 0.857 | 0.877 |
AUC ROC | 0.913 | 0.949 |
Brier Score | 0.118 | 0.085 |
Model Threshold | 0.2 | 0.2 |
Regression Models
In this section, we load the diabetes dataset [1] from scikit-learn, which includes features like age and BMI, along with a target variable representing disease progression. The data is then split with train_test_split into training and testing sets using an 80/20 ratio to facilitate model assessment. We train a Linear Regression model on unscaled data for a straightforward baseline, followed by a Random Forest Regressor with 100 trees, also on unscaled data, to introduce a more complex approach. Additionally, we train a Ridge Regression model using a Pipeline that scales the features with StandardScaler before fitting, incorporating regularization. These steps prepare the models for subsequent evaluation and comparison using tools provided by the model_metrics library.
Models used in these regression examples:
Linear Regression: A foundational model trained on unscaled data, simple yet effective for baseline evaluation.
Ridge Regression: A regularized model with a Pipeline for scaling, well suited for testing stability and overfitting.
Random Forest Regressor: An ensemble of 100 trees on unscaled data, offering complexity for comparative analysis.
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
# Load dataset
diabetes = load_diabetes(as_frame=True)["frame"]
X = diabetes.drop(columns=["target"])
y = diabetes["target"]
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
)
# Train Linear Regression (on unscaled data)
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Train Random Forest Regressor (on unscaled data)
rf_model = RandomForestRegressor(
n_estimators=100,
random_state=42,
)
rf_model.fit(X_train, y_train)
# Train Ridge Regression (on scaled data)
ridge_model = Pipeline(
[
("scaler", StandardScaler()),
("estimator", Ridge(alpha=1.0)),
]
)
ridge_model.fit(X_train, y_train)
Regression Example 1
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model],
model_title=["Linear Regression", "Ridge Regression"],
X=X_test,
y=y_test,
model_type="regression",
return_df=True,
)
regression_metrics
The output below presents a detailed comparison of the performance and coefficients for two regression models—Linear Regression and Ridge Regression—trained on the diabetes dataset. It includes overall metrics such as Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Explained Variance, and R² Score for each model, showing their predictive accuracy. Additionally, it lists the coefficients for each feature (e.g., age, bmi, s1–s6) in both models, highlighting how each variable contributes to the prediction. This output serves as a foundation for evaluating and comparing the models’ effectiveness.
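For reference, the overall error metrics in this table can be cross-checked directly with scikit-learn. A minimal sketch, assuming the linear_model, X_test, and y_test objects defined above (whether MAPE is reported as a fraction or a percentage is an assumption here):
```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    explained_variance_score,
    r2_score,
)

y_pred = linear_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
# scikit-learn returns MAPE as a fraction; the table appears to report a percentage
mape = mean_absolute_percentage_error(y_test, y_pred) * 100
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
expl_var = explained_variance_score(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE={mae:.3f}, MAPE={mape:.1f}, MSE={mse:.3f}, RMSE={rmse:.3f}, "
      f"Expl. Var.={expl_var:.3f}, R^2={r2:.3f}")
```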
Output
Model | Metric | Variable | Coefficient | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 Score |
---|---|---|---|---|---|---|---|---|---|
Linear Regression | Overall Metrics | | | 42.794 | 37.5 | 2900.194 | 53.853 | 0.455 | 0.453 |
Linear Regression | Coefficient | const | 151.346 | ||||||
Linear Regression | Coefficient | age | 37.904 | ||||||
Linear Regression | Coefficient | sex | -241.964 | ||||||
Linear Regression | Coefficient | bmi | 542.429 | ||||||
Linear Regression | Coefficient | bp | 347.704 | ||||||
Linear Regression | Coefficient | s1 | -931.489 | ||||||
Linear Regression | Coefficient | s2 | 518.062 | ||||||
Linear Regression | Coefficient | s3 | 163.42 | ||||||
Linear Regression | Coefficient | s4 | 275.318 | ||||||
Linear Regression | Coefficient | s5 | 736.199 | ||||||
Linear Regression | Coefficient | s6 | 48.671 | ||||||
Ridge Regression | Overall Metrics | | | 42.812 | 37.448 | 2892.015 | 53.777 | 0.457 | 0.454 |
Ridge Regression | Coefficient | const | 153.737 | ||||||
Ridge Regression | Coefficient | age | 1.807 | ||||||
Ridge Regression | Coefficient | sex | -11.448 | ||||||
Ridge Regression | Coefficient | bmi | 25.733 | ||||||
Ridge Regression | Coefficient | bp | 16.734 | ||||||
Ridge Regression | Coefficient | s1 | -34.672 | ||||||
Ridge Regression | Coefficient | s2 | 17.053 | ||||||
Ridge Regression | Coefficient | s3 | 3.37 | ||||||
Ridge Regression | Coefficient | s4 | 11.764 | ||||||
Ridge Regression | Coefficient | s5 | 31.378 | ||||||
Ridge Regression | Coefficient | s6 | 2.458 |
Regression Example 2
In this Regression Example 2, we extend the analysis by introducing a Random Forest
Regressor alongside Linear Regression and Ridge Regression to demonstrate how a
model with feature importances, rather than coefficients, impacts evaluation outcomes.
The code uses the summarize_model_performance
function from model_metrics
to
assess all three models on the diabetes dataset’s test set, ensuring the Random Forest’s
feature importance-based predictions are reflected in the results while preserving
the coefficient-based results of the other models, as shown in the subsequent table.
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model, rf_model],
model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
X=X_test,
y=y_test,
model_type="regression",
return_df=True,
)
regression_metrics
Output
Model | Metric | Variable | Coefficient | Feat. Imp. | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 Score |
---|---|---|---|---|---|---|---|---|---|---|
Linear Regression | Overall Metrics | | | | 42.794 | 37.5 | 2900.194 | 53.853 | 0.455 | 0.453 |
Linear Regression | Coefficient | const | 151.346 | |||||||
Linear Regression | Coefficient | age | 37.904 | |||||||
Linear Regression | Coefficient | sex | -241.964 | |||||||
Linear Regression | Coefficient | bmi | 542.429 | |||||||
Linear Regression | Coefficient | bp | 347.704 | |||||||
Linear Regression | Coefficient | s1 | -931.489 | |||||||
Linear Regression | Coefficient | s2 | 518.062 | |||||||
Linear Regression | Coefficient | s3 | 163.42 | |||||||
Linear Regression | Coefficient | s4 | 275.318 | |||||||
Linear Regression | Coefficient | s5 | 736.199 | |||||||
Linear Regression | Coefficient | s6 | 48.671 | |||||||
Ridge Regression | Overall Metrics | | | | 42.812 | 37.448 | 2892.015 | 53.777 | 0.457 | 0.454 |
Ridge Regression | Coefficient | const | 153.737 | |||||||
Ridge Regression | Coefficient | age | 1.807 | |||||||
Ridge Regression | Coefficient | sex | -11.448 | |||||||
Ridge Regression | Coefficient | bmi | 25.733 | |||||||
Ridge Regression | Coefficient | bp | 16.734 | |||||||
Ridge Regression | Coefficient | s1 | -34.672 | |||||||
Ridge Regression | Coefficient | s2 | 17.053 | |||||||
Ridge Regression | Coefficient | s3 | 3.37 | |||||||
Ridge Regression | Coefficient | s4 | 11.764 | |||||||
Ridge Regression | Coefficient | s5 | 31.378 | |||||||
Ridge Regression | Coefficient | s6 | 2.458 | |||||||
Random Forest | Overall Metrics | | | | 44.053 | 40.005 | 2952.011 | 54.332 | 0.443 | 0.443 |
Random Forest | Feat. Imp. | age | | 0.059 | | | | | | |
Random Forest | Feat. Imp. | sex | | 0.01 | | | | | | |
Random Forest | Feat. Imp. | bmi | | 0.355 | | | | | | |
Random Forest | Feat. Imp. | bp | | 0.088 | | | | | | |
Random Forest | Feat. Imp. | s1 | | 0.053 | | | | | | |
Random Forest | Feat. Imp. | s2 | | 0.057 | | | | | | |
Random Forest | Feat. Imp. | s3 | | 0.051 | | | | | | |
Random Forest | Feat. Imp. | s4 | | 0.024 | | | | | | |
Random Forest | Feat. Imp. | s5 | | 0.231 | | | | | | |
Random Forest | Feat. Imp. | s6 | | 0.071 | | | | | | |
Regression Example 3
In some scenarios, you may want to simplify the output by excluding variables,
coefficients, and feature importances from the model results. This example
demonstrates how to achieve that by setting overall_only=True
in the
summarize_model_performance
function, producing a concise table that
focuses on key metrics: model name, Mean Absolute Error (MAE),
Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE),
Root Mean Squared Error (RMSE), Explained Variance, and R² Score.
from model_metrics import summarize_model_performance
regression_metrics = summarize_model_performance(
model=[linear_model, ridge_model, rf_model],
model_title=["Linear Regression", "Ridge Regression", "Random Forest"],
X=X_test,
y=y_test,
model_type="regression",
overall_only=True,
return_df=True,
)
regression_metrics
Output
Model | Metric | MAE | MAPE | MSE | RMSE | Expl. Var. | R^2 Score |
---|---|---|---|---|---|---|---|
Linear Regression | Overall Metrics | 42.794 | 37.5 | 2900.194 | 53.853 | 0.455 | 0.453 |
Ridge Regression | Overall Metrics | 42.812 | 37.448 | 2892.015 | 53.777 | 0.457 | 0.454 |
Random Forest | Overall Metrics | 44.053 | 40.005 | 2952.011 | 54.332 | 0.443 | 0.443 |
Lift Charts
This section illustrates how to assess and compare the ranking effectiveness of classification models using Lift Charts, a valuable tool for evaluating how well a model prioritizes positive instances relative to random chance. Leveraging the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset introduced in the Binary Classification Models section, we plot Lift curves to visualize their relative ability to surface high-value (positive) cases at the top of the prediction list.
A Lift Chart plots the ratio of actual positives identified by the model compared to what would be expected by random selection, across increasingly larger proportions of the sample sorted by predicted probability. The baseline (Lift = 1) represents random chance; curves that rise above this line demonstrate the model’s ability to “lift” positive outcomes toward the top ranks. This makes Lift Charts especially useful in applications like marketing, fraud detection, and risk stratification—where targeting the top segment of predictions can yield outsized value.
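To make the definition concrete, here is a minimal sketch of how cumulative lift can be computed by hand with numpy, assuming the model1, X_test, and y_test objects from the classification examples (the plotting function computes this internally and may bin differently):
```python
import numpy as np

proba = model1.predict_proba(X_test)[:, 1]

# Rank samples from highest to lowest predicted probability
order = np.argsort(proba)[::-1]
y_sorted = np.asarray(y_test)[order]

n = len(y_sorted)
baseline_rate = y_sorted.mean()                      # positive rate under random selection

for q in (0.1, 0.25, 0.5, 1.0):
    k = int(n * q)                                   # top q fraction of the sample
    lift = y_sorted[:k].mean() / baseline_rate       # observed positive rate / baseline rate
    print(f"Top {q:.0%}: lift = {lift:.2f}")
```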
The show_lift_chart
function enables flexible creation of Lift Charts for one or more
models. It supports single-plot overlays, grid layouts, and detailed customization of
labels, titles, and styling. Designed for both exploratory analysis and stakeholder
presentation, this utility helps users better understand model ranking performance
across the population.
- show_lift_chart(model, X, y, xlabel='Percentage of Sample', ylabel='Lift', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, grid=False, n_rows=None, n_cols=2, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True)
- Parameters:
model (object or list[object]) – A trained model or a list of models. Each must implement predict_proba to estimate class probabilities.
X (pd.DataFrame or np.ndarray) – Feature matrix used to generate predictions.
y (pd.Series or np.ndarray) – True binary labels corresponding to the input samples.
xlabel (str, optional) – Label for the x-axis. Defaults to "Percentage of Sample".
ylabel (str, optional) – Label for the y-axis. Defaults to "Lift".
model_title (str or list[str], optional) – Custom display names for the models. Can be a string or list of strings.
overlay (bool, optional) – If True, overlays all model lift curves into a single plot. Defaults to False.
title (str, optional) – Title for the plot or grid. Set to "" to suppress the title. Defaults to None.
save_plot (bool, optional) – Whether to save the chart(s) to disk. Defaults to False.
image_path_png (str, optional) – Output path for saving PNG image(s).
image_path_svg (str, optional) – Output path for saving SVG image(s).
text_wrap (int, optional) – Maximum number of characters before wrapping titles. If None, no wrapping is applied.
curve_kwgs (dict[str, dict] or list[dict], optional) – Dictionary or list of dictionaries for customizing the lift curve(s) (e.g., color, linewidth).
linestyle_kwgs (dict, optional) – Styling for the baseline (random lift) reference line. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.
grid (bool, optional) – Whether to show each model in a subplot grid. Cannot be combined with overlay=True.
n_rows (int, optional) – Number of rows in the grid layout. If None, automatically inferred.
n_cols (int, optional) – Number of columns in the grid layout. Defaults to 2.
figsize (tuple[int, int], optional) – Tuple specifying the size of the figure in inches. Defaults to (8, 6).
label_fontsize (int, optional) – Font size for x/y-axis labels and titles. Defaults to 12.
tick_fontsize (int, optional) – Font size for tick marks and legend text. Defaults to 10.
gridlines (bool, optional) – Whether to display gridlines in plots. Defaults to True.
- Returns:
None. Displays or saves lift charts for the specified classification models.
- Return type:
None
- Raises:
If overlay=True and grid=True are both set.
Notes
- What is a Lift Chart?
Lift quantifies how much better a model is at identifying positive cases compared to random selection.
The x-axis represents the proportion of the population (from highest to lowest predicted probability).
The y-axis shows the cumulative lift, calculated as the ratio of observed positives to expected positives under random selection.
- Interpreting Lift Curves:
A higher and steeper curve indicates a stronger model.
The horizontal dashed line at y = 1 is the baseline for random performance.
Curves that drop sharply or flatten may indicate poor ranking ability.
- Layout Options:
Use overlay=True to visualize all models on a single axis.
Use grid=True for a side-by-side layout of lift charts.
If neither is set, each model gets its own full-sized chart.
- Customization:
Customize the appearance of each model’s curve using curve_kwgs.
Modify the baseline reference line with linestyle_kwgs.
Control title wrapping and font sizes via text_wrap, label_fontsize, and tick_fontsize.
- Saving Plots:
If save_plot=True, figures are saved as <model_title>_lift.png/svg or overlay_lift.png/svg.
Lift Chart Example 1 (Grid Layout)
In this first Lift Chart example, we evaluate and compare the ranking performance
of two classification models—Logistic Regression and Random Forest Classifier—trained
on the synthetic dataset from the Binary Classification Models section. The chart displays Lift curves for both models in a
two-column grid layout (n_cols=2, n_rows=1
), enabling side-by-side comparison
of how effectively each model prioritizes positive cases.
Each plot shows the model’s Lift across increasing portions of the test set, with a grey dashed line at Lift = 1 indicating the baseline (random performance). Curves above this line reflect the model’s ability to identify more positives than would be expected by chance. The Random Forest typically produces a steeper initial lift, demonstrating greater concentration of positive cases near the top-ranked predictions.
The show_lift_chart function allows for rich customization, including plot dimensions, axis font sizes, and curve styling. In this example, we set the line widths and colors for both models; the plots can also be saved in PNG and SVG formats for further reporting or documentation.
from model_metrics import show_lift_chart
show_lift_chart(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "grey", "linestyle": "--"},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
grid=True,
)
Output
Lift Chart Example 2 (Overlay)
This example overlays Lift curves from two classification models—Logistic Regression and Random Forest Classifier—on a single plot for direct visual comparison. Both models were trained on the same synthetic dataset from the Binary Classification Models section, and their lift performance is evaluated on the shared test set.
The Lift curve shows how many more positive outcomes are captured by the model at each quantile compared to a random baseline. A horizontal dashed black line at Lift = 1 represents random selection; curves above this line indicate effective ranking of positive cases. Overlaying curves makes it easier to assess which model better concentrates true positives near the top of the prediction list.
Using the overlay=True
option, the show_lift_chart
function generates a clean,
unified plot. Each curve is styled with linewidth=2
for clarity, and all axis
elements and tick marks are sized for presentation-quality output. This layout
is particularly helpful for slide decks, performance reports, or model selection
discussions.
from model_metrics import show_lift_chart
show_lift_chart(
model=[model1, model2],
X=X_test,
y=y_test,
overlay=True,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "black", "linestyle": "--", "linewidth": 2},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
)
Gain Charts
This section explores how to evaluate the cumulative performance of classification models in identifying positive outcomes using Gain Charts. These charts are especially effective at showing the model’s ability to concentrate the correct (positive) predictions in the top-ranked portion of the dataset. Using the same Logistic Regression and Random Forest Classifier models trained on the synthetic dataset introduced in the Binary Classification Models section, we demonstrate how to plot and compare Gain Curves across models.
A Gain Chart shows the cumulative percentage of actual positive cases captured as we move through the population sorted by predicted probability. Unlike the Lift Chart, which displays the ratio of model performance over baseline, the Gain Chart directly shows the percentage of positives captured—providing a more intuitive sense of how effective a model is at identifying positives early in the ranked list.
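For intuition, here is a minimal sketch of the cumulative gain calculation with numpy, assuming model1, X_test, and y_test from the classification examples (the library may bin the sample differently):
```python
import numpy as np

proba = model1.predict_proba(X_test)[:, 1]
order = np.argsort(proba)[::-1]                      # highest predicted probability first
y_sorted = np.asarray(y_test)[order]

n = len(y_sorted)
total_positives = y_sorted.sum()

for q in (0.1, 0.25, 0.5):
    k = int(n * q)
    gain = y_sorted[:k].sum() / total_positives      # share of all positives captured in the top q
    print(f"Top {q:.0%} of the sample captures {gain:.1%} of all positives")
```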
The show_gain_chart
function supports single or multiple models, with options to
overlay all gain curves in a single plot or display them in a flexible grid layout.
Labels, title wrapping, curve styles, and saving output images are all customizable,
making this function well-suited for both development analysis and final reporting.
- show_gain_chart(model, X, y, xlabel='Percentage of Sample', ylabel='Cumulative Gain', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, grid=False, n_rows=None, n_cols=2, figsize=None, label_fontsize=12, tick_fontsize=10, gridlines=True)
- Parameters:
model (object or list[object]) – A trained classifier or list of classifiers. Each model must support predict_proba.
X (pd.DataFrame or np.ndarray) – The feature matrix used for prediction.
y (pd.Series or np.ndarray) – Ground truth binary labels.
xlabel (str, optional) – Label for the x-axis. Defaults to "Percentage of Sample".
ylabel (str, optional) – Label for the y-axis. Defaults to "Cumulative Gain".
model_title (str or list[str], optional) – Custom display names for each model.
overlay (bool, optional) – If True, overlay all models on a single axis. Mutually exclusive with grid.
title (str, optional) – Plot or grid title. Set to "" to suppress the title.
save_plot (bool, optional) – Whether to save the chart(s) to disk.
image_path_png (str, optional) – Output path for saving PNG image(s).
image_path_svg (str, optional) – Output path for saving SVG image(s).
text_wrap (int, optional) – Max characters before title wrapping. Set to None to disable.
curve_kwgs (dict[str, dict] or list[dict], optional) – Dict or list of kwargs per model to customize line style.
linestyle_kwgs (dict, optional) – Styling for the random baseline. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.
grid (bool, optional) – Whether to render a grid layout. Cannot be used with overlay.
n_rows (int, optional) – Rows in the grid layout. If None, inferred automatically.
n_cols (int, optional) – Columns in the grid layout. Defaults to 2.
figsize (tuple[int, int], optional) – Figure size (width, height) in inches.
label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for tick marks and legends.
gridlines (bool, optional) – Whether to show gridlines on the plots.
- Returns:
None. Displays or saves Gain Charts for one or more models.
- Return type:
None
- Raises:
If overlay=True and grid=True are both set.
Notes
- What is a Gain Chart?
Plots the cumulative percentage of positives captured vs. sample size.
The x-axis shows the fraction of the sample, ranked by predicted probability.
The y-axis shows what percentage of the total positives have been captured.
- Why use Gain Charts?
Gain Charts help answer: “If I contact the top X% of predictions, how many positives will I catch?”
Especially useful in marketing, lead scoring, risk management, and fraud detection.
- Reading Gain Curves:
Curves that rise steeply and plateau early indicate better model performance.
The dashed baseline (diagonal line) represents random selection.
- Layout Options:
Use overlay=True to combine all gain curves into a single plot.
Use grid=True for a subplot layout per model.
If neither is set, plots will be rendered individually.
- Styling Options:
Customize individual model lines via curve_kwgs.
Modify the diagonal baseline line using linestyle_kwgs.
Adjust fonts and wrapping for presentation clarity.
- Saving Output:
Enable save_plot=True to save figures as PNG and/or SVG.
Files are named using the model title (e.g., Model_1_gain.png or overlay_gain.svg).
Gain Chart Example 1 (Grid Layout)
In this first Gain Chart example, we compare the cumulative gain performance of two classification models— Logistic Regression and Random Forest Classifier—trained on the synthetic dataset from the Binary Classification Models section. This visualization showcases their ability to identify positive instances across different percentiles of the ranked test data.
Each subplot presents the cumulative gain achieved as a function of the percentage of the sample, sorted by descending predicted probability. The grey dashed line represents the baseline (random gain). A model that identifies a high proportion of positive cases in the early part of the ranking will have a steeper and higher curve. In this example, the Random Forest model typically outpaces Logistic Regression, indicating better early identification of positives.
The show_gain_chart function allows flexible styling and layout control. This example uses a grid configuration (n_cols=2, n_rows=1) with customized line widths and colors; the figure can also be saved for documentation or stakeholder presentations.
from model_metrics import show_gain_chart
show_gain_chart(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "grey", "linestyle": "--"},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
grid=True,
)
Output
Gain Chart Example 2 (Overlay)
This example overlays Gain curves from two classification models—Logistic Regression and Random Forest Classifier—on a single plot to enable direct visual comparison of their cumulative gain performance. Both models were trained on the same synthetic dataset from the Binary Classification Models section and evaluated on the same test set.
The Gain curve shows the cumulative proportion of true positives captured as you move through the population, ranked by predicted probability. A diagonal baseline line from (0, 0) to (1, 1) indicates the expected performance of a random model. Curves that rise above this line demonstrate superior model ability to concentrate positive cases near the top of the ranked list.
By setting overlay=True
, the show_gain_chart
function produces a single,
easy-to-read plot containing both models’ gain curves. Each curve is styled
with linewidth=2
for clear visibility. Overlay layouts are ideal for model
selection discussions, presentations, and performance dashboards.
from model_metrics import show_gain_chart
show_gain_chart(
model=[model1, model2],
X=X_test,
y=y_test,
overlay=True,
model_title=["Logistic Regression", "Random Forest"],
linestyle_kwgs={"color": "black", "linestyle": "--", "linewidth": 2},
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
)
ROC AUC Curves
This section demonstrates how to evaluate the performance of binary classification models using ROC AUC curves, a key metric for assessing the trade-off between true positive and false positive rates. Using the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset from the previous (Binary Classification Models) section, we generate ROC curves to visualize their discriminatory power.
ROC AUC (Receiver Operating Characteristic Area Under the Curve) provides a
single scalar value representing a model’s ability to distinguish between
positive and negative classes, with a value of 1 indicating perfect classification
and 0.5 representing random guessing. The curves are plotted by varying the
classification threshold and calculating the true positive rate (sensitivity)
against the false positive rate (1-specificity). This makes ROC AUC particularly
useful for comparing models like Logistic Regression, which relies on linear
decision boundaries, and Random Forest Classifier, which leverages ensemble
decision trees, especially when class imbalances or threshold sensitivity are
concerns. The show_roc_curve
function simplifies this process, enabling
users to visualize and compare these curves effectively, setting the stage for
detailed performance analysis in subsequent examples.
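These quantities can be reproduced directly with scikit-learn. A minimal sketch, assuming model1, X_test, and y_test from the classification examples:
```python
from sklearn.metrics import roc_curve, roc_auc_score

proba = model1.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)   # points traced out as the threshold varies
auc = roc_auc_score(y_test, proba)                # single-number summary of the curve

print(f"AUC ROC = {auc:.2f} across {len(thresholds)} candidate thresholds")
```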
The show_roc_curve
function provides a flexible and powerful way to visualize
the performance of binary classification models using Receiver Operating Characteristic
(ROC) curves. Whether you’re comparing multiple models, evaluating subgroup fairness,
or preparing publication-ready plots, this function allows full control over layout,
styling, and annotations. It supports single and multiple model inputs, optional overlay
or grid layouts, and group-wise comparisons via a categorical feature. Additional options
allow custom axis labels, AUC precision, curve styling, and export to PNG/SVG.
Designed to be both user-friendly and highly configurable, show_roc_curve
is a practical tool for model evaluation and stakeholder communication.
- show_roc_curve(model, X, y, xlabel='False Positive Rate', ylabel='True Positive Rate', model_title=None, decimal_places=2, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, linestyle_kwgs=None, grid=False, n_rows=None, n_cols=2, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True, group_category=None)
- Parameters:
model (object or str or list[object or str]) – A trained model, a string placeholder, or a list containing models or strings to evaluate.
X (pd.DataFrame or np.ndarray) – Feature matrix used for prediction.
y (pd.Series or np.ndarray) – True binary labels for evaluation.
xlabel (str, optional) – Label for the x-axis. Defaults to "False Positive Rate".
ylabel (str, optional) – Label for the y-axis. Defaults to "True Positive Rate".
model_title (str or list[str], optional) – Custom title(s) for the models. Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.
decimal_places (int, optional) – Number of decimal places for AUC values. Defaults to 2.
overlay (bool, optional) – Whether to overlay multiple models on a single plot. Defaults to False.
title (str, optional) – Title for the plot (used in overlay mode or as global title). If "", disables the title. Defaults to None.
save_plot (bool, optional) – Whether to save the plot(s) to file. Defaults to False.
image_path_png (str, optional) – File path to save the plot(s) as PNG.
image_path_svg (str, optional) – File path to save the plot(s) as SVG.
text_wrap (int, optional) – Maximum character width before wrapping plot titles. If None, no wrapping is applied.
curve_kwgs (list[dict] or dict[str, dict], optional) – Plot styling for ROC curves. Accepts a list of dictionaries or a nested dictionary keyed by model_title.
linestyle_kwgs (dict, optional) – Style for the random guess (diagonal) line. Defaults to {"color": "gray", "linestyle": "--", "linewidth": 2}.
grid (bool, optional) – Whether to organize the ROC plots in a subplot grid layout. Cannot be used with overlay=True or group_category.
n_rows (int, optional) – Number of rows in the grid layout. If None, calculated automatically based on the number of models and columns.
n_cols (int, optional) – Number of columns in the grid layout. Defaults to 2.
figsize (tuple, optional) – Size of the plot or grid of plots, in inches. Defaults to (8, 6).
label_fontsize (int, optional) – Font size for axis labels and titles. Defaults to 12.
tick_fontsize (int, optional) – Font size for ticks and legend text. Defaults to 10.
gridlines (bool, optional) – Whether to display grid lines on plots. Defaults to True.
group_category (array-like, optional) – Categorical array used to group ROC curves. Cannot be used with grid=True or overlay=True.
- Returns:
None. Displays or saves ROC curve plots for classification models.
- Return type:
None
- Raises:
If grid=True and overlay=True are both set.
If group_category is used with grid or overlay.
If overlay=True is used with only one model.
Notes
- Flexible Inputs:
model and model_title can be individual items or lists. Strings passed in model are treated as placeholder names.
Titles can be automatically inferred or explicitly passed using model_title.
- Group-Wise ROC:
If group_category is passed, separate ROC curves are plotted for each unique group.
The legend will include group-specific AUC and class distribution (e.g., AUC = 0.87, Count: 500, Pos: 120, Neg: 380).
- Plot Modes:
overlay=True overlays all models in one figure.
grid=True arranges individual ROC plots in a subplot layout.
If neither is set, separate full-size plots are shown for each model.
- Legend and Styling:
A random guess reference line (diagonal) is plotted by default.
Customize ROC curves with curve_kwgs and the diagonal line with linestyle_kwgs.
Titles can be disabled with title="".
- Saving Plots:
If save_plot=True, plots are saved using the base filename format <model_name>_roc_auc or overlay_roc_auc_plot.
The show_roc_curve
function provides flexible and highly customizable
plotting of ROC curves for binary classification models. It supports overlays,
grid layouts, and subgroup visualizations, while also allowing export options
and styling hooks for publication-ready output.
ROC AUC Example 1 (Grid Layout)
In this first ROC AUC evaluation example, we plot the ROC curves for two
models: Logistic Regression and Random Forest Classifier, trained on the
synthetic dataset from the Binary Classification Models section. The curves are displayed side by side
using a grid layout (n_cols=2, n_rows=1
), with the Logistic Regression curve
in blue and the Random Forest curve in green for clear differentiation.
A red dashed line represents the random guessing baseline. This example
demonstrates how the show_roc_curve
function enables straightforward
visualization of model performance, with options to customize colors,
add a grid, and save the plot for reporting purposes.
from model_metrics import show_roc_curve
show_roc_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
decimal_places=2,
n_cols=2,
n_rows=1,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "green", "linewidth": 2},
},
linestyle_kwgs={"color": "red", "linestyle": "--"},
grid=True,
)
Output
ROC AUC Example 2 (Overlay)
In this second ROC AUC evaluation example, we focus on overlaying the results of
two models—Logistic Regression and Random Forest Classifier—trained on the
synthetic dataset from the Binary Classification Models section onto a single plot. Using the show_roc_curve
function with the overlay=True
parameter, the ROC curves for both models are
displayed together, with Logistic Regression in blue and Random Forest in black,
both with a linewidth=2
. A red dashed line serves as the random guessing
baseline, and the plot includes a custom title for clarity.
from model_metrics import show_roc_curve
show_roc_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
decimal_places=2,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
linestyle_kwgs={"color": "red", "linestyle": "--"},
title="ROC Curves: Logistic Regression and Random Forest",
overlay=True,
)
Output
ROC AUC Example 3 (by Category)
In this third ROC AUC evaluation example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
The objective here is to assess ROC AUC scores not just overall, but across each category of a selected feature—such as occupation, education, marital-status, or race. This approach enables deeper insight into how performance varies by subgroup, which is particularly important for fairness, bias detection, and subgroup-level interpretability.
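To see what such a subgroup breakdown looks like numerically, here is a minimal sketch computing AUC per category with pandas and scikit-learn. The names fitted_model, X_test_2, and the "race" column follow the example below and are assumptions:
```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Assumed objects: fitted_model (trained classifier), X_test / y_test (evaluation data),
# and X_test_2["race"] -- a categorical column aligned row-for-row with X_test.
proba = fitted_model.predict_proba(X_test)[:, 1]
groups = pd.Series(X_test_2["race"]).reset_index(drop=True)
y_arr = np.asarray(y_test)

for group in groups.unique():
    mask = (groups == group).to_numpy()
    y_g, p_g = y_arr[mask], proba[mask]
    if len(np.unique(y_g)) < 2:
        print(f"{group}: AUC undefined (only one class present)")
        continue
    print(f"{group}: AUC = {roc_auc_score(y_g, p_g):.2f}, "
          f"Count: {mask.sum()}, Pos: {int(y_g.sum())}, Neg: {int((1 - y_g).sum())}")
```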
The show_roc_curve
function supports this analysis through the
group_category
parameter.
For example, by passing group_category=X_test_2["race"], you can generate a separate ROC curve for each unique racial group in the dataset:
from model_metrics import show_roc_curve
show_roc_curve(
model=model_dt["model"].estimator,
X=X_test,
y=y_test,
model_title="Decision Tree Classifier,
decimal_places=2,
group_category=X_test_2["race"],
)
Output
Precision-Recall Curves
This section demonstrates how to evaluate the performance of binary classification models using Precision-Recall (PR) curves, a critical visualization for understanding model behavior in the presence of class imbalance. Using the Logistic Regression and Random Forest Classifier models trained on the synthetic dataset from the previous (Binary Classification Models) section, we generate PR curves to examine how well each model identifies true positives while limiting false positives.
Precision-Recall curves focus on the trade-off between precision (positive predictive value) and recall (sensitivity) across different classification thresholds. This is particularly important when the positive class is rare—as is common in fraud detection, disease diagnosis, or adverse event prediction—because ROC AUC can overstate performance under imbalance. Unlike the ROC curve, the PR curve is sensitive to the proportion of positive examples and gives a clearer picture of how well a model performs where it matters most: in identifying the positive class.
The area under the Precision-Recall curve, also known as Average Precision (AP), summarizes model performance across thresholds. A model that maintains high precision as recall increases is generally more desirable, especially in settings where false positives have a high cost. This makes the PR curve a complementary and sometimes more informative tool than ROC AUC in skewed classification scenarios.
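These quantities can be computed directly with scikit-learn. A minimal sketch, assuming model1, X_test, and y_test from the classification examples:
```python
from sklearn.metrics import precision_recall_curve, average_precision_score

proba = model1.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, proba)  # curve points across thresholds
ap = average_precision_score(y_test, proba)                            # area-style summary of the curve

print(f"Average Precision (AP) = {ap:.2f}")
```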
- show_pr_curve(model, X, y, xlabel='Recall', ylabel='Precision', model_title=None, decimal_places=2, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, grid=False, n_rows=None, n_cols=2, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True, group_category=None, legend_metric='ap')
- Parameters:
model (object or str or list[object or str]) – A trained model, a string placeholder, or a list containing models or strings to evaluate.
X (pd.DataFrame or np.ndarray) – Feature matrix used for prediction.
y (pd.Series or np.ndarray) – True binary labels for evaluation.
xlabel (str, optional) – Label for the x-axis. Defaults to "Recall".
ylabel (str, optional) – Label for the y-axis. Defaults to "Precision".
model_title (str or list[str], optional) – Custom title(s) for the model(s). Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.
decimal_places (int, optional) – Number of decimal places for Average Precision (AP) values. Defaults to 2.
overlay (bool, optional) – Whether to overlay multiple models on a single plot. Defaults to False.
title (str, optional) – Title for the plot (used in overlay mode or as global title). If "", disables the title. Defaults to None.
save_plot (bool, optional) – Whether to save the plot(s) to file. Defaults to False.
image_path_png (str, optional) – File path to save the plot(s) as PNG.
image_path_svg (str, optional) – File path to save the plot(s) as SVG.
text_wrap (int, optional) – Maximum character width before wrapping plot titles. If None, no wrapping is applied.
curve_kwgs (list[dict] or dict[str, dict], optional) – Plot styling for PR curves. Accepts a list of dictionaries or a nested dictionary keyed by model_title.
grid (bool, optional) – Whether to organize the PR plots in a subplot grid layout. Cannot be used with overlay=True or group_category.
n_rows (int, optional) – Number of rows in the grid layout. If None, calculated automatically based on the number of models and columns.
n_cols (int, optional) – Number of columns in the grid layout. Defaults to 2.
figsize (tuple, optional) – Size of the plot or grid of plots, in inches. Defaults to (8, 6).
label_fontsize (int, optional) – Font size for axis labels and titles. Defaults to 12.
tick_fontsize (int, optional) – Font size for ticks and legend text. Defaults to 10.
gridlines (bool, optional) – Whether to display grid lines on plots. Defaults to True.
group_category (array-like, optional) – Categorical array used to group PR curves. Cannot be used with grid=True or overlay=True.
legend_metric (str, optional) – Metric to display in the legend. Either "ap" (Average Precision) or "aucpr" (area under the PR curve). Defaults to "ap".
- Returns:
None. Displays or saves Precision-Recall curve plots for classification models.
- Return type:
None
- Raises:
If grid=True and overlay=True are both set.
If group_category is used with grid=True or overlay=True.
If overlay=True is used with only one model.
If legend_metric is not one of "ap" or "aucpr".
If model_title is not a string, list of strings, or None.
Notes
- Flexible Inputs:
model and model_title can be individual items or lists. Strings passed in model are treated as placeholder names.
Titles can be automatically inferred or explicitly passed using model_title.
- Group-Wise PR:
If group_category is passed, separate PR curves are plotted for each unique group.
The legend will include group-specific Average Precision and class distribution (e.g., AP = 0.78, Count: 500, Pos: 120, Neg: 380).
- Average Precision vs. AUCPR:
By default, the legend shows Average Precision (AP), which summarizes the PR curve with greater emphasis on performance at higher precision levels.
If the user passes legend_metric="aucpr", the legend will instead display AUCPR (Area Under the Precision-Recall Curve), which gives equal weight to all parts of the curve. (See the sketch after these notes for how the two values can differ numerically.)
- Plot Modes:
overlay=True overlays all models in one figure.
grid=True arranges individual PR plots in a subplot layout.
If neither is set, separate full-size plots are shown for each model.
- Legend and Styling:
A random classifier baseline (constant precision) is plotted by default.
Customize PR curves with curve_kwgs.
Titles can be disabled with title="".
- Saving Plots:
If save_plot=True, plots are saved using the base filename format <model_name>_precision_recall or overlay_pr_plot.
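A minimal sketch contrasting the two legend metrics with scikit-learn, assuming model1, X_test, and y_test from the classification examples:
```python
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

proba = model1.predict_proba(X_test)[:, 1]

ap = average_precision_score(y_test, proba)              # step-wise weighted mean of precision
precision, recall, _ = precision_recall_curve(y_test, proba)
aucpr = auc(recall, precision)                           # trapezoidal area under the PR curve

print(f"AP = {ap:.3f}, AUCPR = {aucpr:.3f}")             # usually close, but not identical
```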
The show_pr_curve
function provides flexible and highly customizable plotting
of Precision-Recall curves for binary classification models. It supports overlays,
grid layouts, and subgroup visualizations, while also allowing export options and
styling hooks for publication-ready output.
Precision-Recall Example 1 (Grid Layout)
In this first Precision-Recall evaluation example, we plot the PR curves for two
models: Logistic Regression and Random Forest Classifier, both trained on the
synthetic dataset from the Binary Classification Models section.
The curves are arranged side by side using a grid layout (n_cols=2, n_rows=1
),
with the Logistic Regression curve rendered in blue and the Random Forest curve
in green to distinguish between models. A gray dashed line indicates the baseline
precision, equal to the prevalence of the positive class in the dataset.
This example illustrates how the show_pr_curve
function makes it easy to
visualize and compare model performance when dealing with class imbalance. It
also demonstrates layout flexibility and customization options, including gridlines,
label styling, and export functionality—making it suitable for both exploratory
analysis and final reporting.
from model_metrics import show_pr_curve
show_pr_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=["Logistic Regression", "Random Forest"],
decimal_places=2,
grid=True,
n_cols=2,
n_rows=1,
curve_kwgs=[
{"color": "blue"},
{"color": "green"}
],
gridlines=True
)
Output
Precision-Recall Example 2 (Overlay)
In this second Precision-Recall evaluation example, we focus on overlaying the
results of two models—Logistic Regression and Random Forest Classifier—trained
on the synthetic dataset from the Binary Classification Models section onto a single plot. Using the show_pr_curve
function with the overlay=True
parameter, the Precision-Recall curves for
both models are displayed together, with Logistic Regression in blue and Random
Forest in black, both with a linewidth=2
. The plot includes a custom title
for clarity.
from model_metrics import show_pr_curve
show_pr_curve(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
curve_kwgs={
"Logistic Regression": {"color": "blue", "linewidth": 2},
"Random Forest": {"color": "black", "linewidth": 2},
},
title="ROC Curves: Logistic Regression and Random Forest",
overlay=True,
)
Output
Precision-Recall Example 3 (Categorical)
In this third Precision-Recall evaluation example, we utilize the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
The objective here is to assess Precision-Recall performance not just overall, but across each category of a selected feature—such as occupation, education, marital-status, or race. This approach enables deeper insight into how performance varies by subgroup, which is particularly important for fairness, bias detection, and subgroup-level interpretability.
The show_pr_curve
function supports this analysis through the
group_category
parameter.
For example, by passing group_category=X_test_2["race"], you can generate a separate PR curve for each unique racial group in the dataset:
from model_metrics import show_pr_curve
show_pr_curve(
model=model_dt["model"].estimator,
X=X_test,
y=y_test,
model_title="Decision Tree Classifier,
group_category=X_test_2["race"],
legend_metric="aucpr",
)
Output
Confusion Matrix Evaluation
This section introduces the show_confusion_matrix
function, which provides a
flexible, styled interface for generating and visualizing confusion matrices
across one or more classification models. It supports advanced features like
threshold overrides, subgroup labeling, classification report display, and fully
customizable plot aesthetics including grid layouts.
The confusion matrix is a fundamental diagnostic tool for classification models, displaying the counts of true positives, true negatives, false positives, and false negatives. This function goes beyond standard implementations by allowing for custom thresholds (globally or per model), label annotation (e.g., TP, FP, etc.), plot exporting, colorbar toggling, and grid visualization.
This is especially useful when comparing multiple models side-by-side or needing publication-ready confusion matrices for stakeholders.
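For reference, the raw counts behind such a matrix can be obtained directly from scikit-learn. A minimal sketch, assuming model1, X_test, and y_test from the Binary Classification Models section:
```python
from sklearn.metrics import confusion_matrix

preds = model1.predict(X_test)                          # predict() uses the default 0.5 cutoff
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```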
- show_confusion_matrix(model, X, y, model_title=None, title=None, model_threshold=None, custom_threshold=None, class_labels=None, cmap='Blues', save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, figsize=(8, 6), labels=True, label_fontsize=12, tick_fontsize=10, inner_fontsize=10, grid=False, score=None, class_report=False, **kwargs)
- Parameters:
model (object or str or list[object or str]) – A single model (object or string), or a list of models or string placeholders.
X (pd.DataFrame or np.ndarray) – Feature matrix used for prediction.
y (pd.Series or np.ndarray) – True target labels.
model_title (str or list[str], optional) – Custom title(s) for each model. Can be a string or list of strings. If None, defaults to "Model 1", "Model 2", etc.
title (str, optional) – Title for each plot. If "", no title is displayed. If None, a default title is shown.
model_threshold (dict, optional) – Dictionary of thresholds keyed by model title. Used if custom_threshold is not set.
custom_threshold (float, optional) – Global override threshold to apply across all models.
class_labels (list[str], optional) – Custom labels for the classes in the matrix.
cmap (str, optional) – Colormap to use for the heatmap. Defaults to "Blues".
save_plot (bool, optional) – Whether to save the generated plot(s).
image_path_png (str, optional) – Path to save the PNG version of the image.
image_path_svg (str, optional) – Path to save the SVG version of the image.
text_wrap (int, optional) – Maximum width of plot titles before wrapping.
figsize (tuple[int, int], optional) – Figure size in inches. Defaults to (8, 6).
labels (bool, optional) – Whether to annotate matrix cells with TP, FP, FN, TN.
label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for axis ticks.
inner_fontsize (int, optional) – Font size for numbers and labels inside cells.
grid (bool, optional) – Whether to display multiple models in a grid layout.
score (str, optional) – Scoring metric to use when optimizing the threshold (if applicable).
class_report (bool, optional) – If True, prints a classification report below each matrix.
kwargs (dict, optional) – Additional keyword arguments for customization (e.g., show_colorbar, n_cols).
- Returns:
None. Displays confusion matrix plots (and optionally saves them).
- Return type:
None
- Raises:
TypeError – If model_title is not a string, a list of strings, or None.
Notes
- Model Support:
Supports single or multiple classification models.
model_title may be inferred automatically or provided explicitly.
- Threshold Handling:
Use model_threshold to specify per-model thresholds.
custom_threshold overrides all other thresholds.
- Plotting Modes:
grid=True arranges plots in subplots.
Otherwise, plots are displayed one at a time.
- Labeling:
Set labels=False to disable annotating cells with TP, FP, FN, TN.
Raw numeric values are always shown inside the cells.
- Colorbar & Styling:
Toggle the colorbar via show_colorbar (passed via kwargs).
Colormap and font sizes are fully configurable.
- Exporting Plots:
Plots can be saved as both PNG and SVG using the respective paths.
Saved filenames follow the pattern confusion_matrix_<model_name> or grid_confusion_matrix.
Confusion Matrix Example 1 (Threshold=0.5)
In this first confusion matrix evaluation example, we focus on showing the results of two models—Logistic Regression and Random Forest Classifier—trained on the synthetic dataset from the Binary Classification Models section onto a single plot.
from model_metrics import show_confusion_matrix
show_confusion_matrix(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
cmap="Blues",
text_wrap=20,
grid=True,
n_cols=2,
n_rows=1,
figsize=(6, 6),
)
Output
Confusion Matrix Example 2 (Classification Report)
This second confusion matrix evaluation example is nearly identical to the first,
but uses a different color map (cmap="viridis"
) and sets class_report=True
to print classification reports for each model in addition to the visual output.
from model_metrics import show_confusion_matrix
show_confusion_matrix(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
cmap="viridis",
text_wrap=20,
grid=True,
n_cols=2,
n_rows=1,
figsize=(6, 6),
class_report=True
)
Output
Confusion Matrix for Logistic Regression:
Predicted 0 Predicted 1
Actual 0 76 18
Actual 1 13 93
Classification Report for Logistic Regression:
precision recall f1-score support
0 0.85 0.81 0.83 94
1 0.84 0.88 0.86 106
accuracy 0.84 200
macro avg 0.85 0.84 0.84 200
weighted avg 0.85 0.84 0.84 200
Confusion Matrix for Random Forest:
Predicted 0 Predicted 1
Actual 0 84 10
Actual 1 3 103
Classification Report for Random Forest:
precision recall f1-score support
0 0.97 0.89 0.93 94
1 0.91 0.97 0.94 106
accuracy 0.94 200
macro avg 0.94 0.93 0.93 200
weighted avg 0.94 0.94 0.93 200
Confusion Matrix Example 3 (Threshold = 0.37)
In this third confusion matrix evaluation example using the synthetic dataset
from the Binary Classification Models section, we apply
a custom classification threshold of 0.37 using the custom_threshold
parameter.
This overrides the default threshold of 0.5 and enables us to inspect how the
confusion matrices shift when a more lenient decision boundary is applied. Refer
to the section on threshold selection logic
for caveats on choosing the right threshold.
This is especially useful in imbalanced classification problems or cost-sensitive environments where the trade-off between precision and recall must be adjusted. By lowering the threshold, we typically increase the number of positive predictions, which can improve recall but may come at the cost of more false positives.
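Here is a minimal sketch of what this override does under the hood, thresholding predicted probabilities directly, assuming model1, X_test, and y_test from earlier (the library's exact boundary convention may differ):
```python
import numpy as np
from sklearn.metrics import confusion_matrix

proba = model1.predict_proba(X_test)[:, 1]

for threshold in (0.5, 0.37):
    preds = (proba >= threshold).astype(int)   # boundary convention (> vs >=) is an assumption
    print(f"Threshold {threshold}:")
    print(confusion_matrix(y_test, preds), "\n")
```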
The output matrices for both models—Logistic Regression and Random Forest—are shown side by side in a grid layout for easy visual comparison.
from model_metrics import show_confusion_matrix
show_confusion_matrix(
model=[model1, model2],
X=X_test,
y=y_test,
model_title=model_title,
text_wrap=20,
grid=True,
n_cols=2,
n_rows=1,
figsize=(6, 6),
custom_threshold=0.37,
)
Output
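To make the threshold effect concrete, the following is a minimal sketch (independent of model_metrics) that applies the two cutoffs manually with scikit-learn and prints the resulting confusion-matrix counts. It assumes model1 and the test split from the Binary Classification Models section are still in scope.
from sklearn.metrics import confusion_matrix

# Positive-class probabilities from one of the trained models
proba = model1.predict_proba(X_test)[:, 1]

for threshold in (0.5, 0.37):
    preds = (proba >= threshold).astype(int)   # apply the decision boundary
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"threshold={threshold}: TP={tp}, FP={fp}, FN={fn}, TN={tn}")
Lowering the cutoff from 0.5 to 0.37 moves borderline cases into the positive class, which is exactly the recall/false-positive trade-off described above.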
Calibration Curves
This section focuses on calibration curves, a diagnostic tool that compares predicted probabilities to actual outcomes, helping evaluate how well a model’s predicted confidence aligns with observed frequencies. Using models like Logistic Regression or Random Forest on the synthetic dataset from the previous (Binary Classification Models) section, we generate calibration curves to assess the reliability of model probabilities.
Calibration is especially important in domains where probability outputs inform downstream decisions, such as healthcare, finance, and risk management. A well-calibrated model not only predicts the correct class but also outputs meaningful probabilities—for example, when a model predicts a 0.7 probability, we expect roughly 70% of such predictions to be correct.
The show_calibration_curve
function simplifies this process by allowing users to
visualize calibration performance across models or subgroups. The plots show the
mean predicted probabilities against the actual observed fractions of positive
cases, with an optional reference line representing perfect calibration.
Additional features include support for overlay or grid layouts, subgroup
analysis by categorical features, and optional Brier score display—a scalar
measure of calibration quality.
The function offers full control over styling, figure layout, axis labels, and output format, making it easy to generate both exploratory and publication-ready plots.
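For readers who want to see the quantities behind these plots, the short sketch below computes the binned calibration data and the Brier score directly with scikit-learn; show_calibration_curve presumably wraps this kind of computation together with the plotting (an assumption about its internals). A fitted binary classifier, here called clf as a placeholder, and a test split are assumed to be in scope.
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Positive-class probabilities from any fitted binary classifier
proba = clf.predict_proba(X_test)[:, 1]

# Fraction of actual positives vs. mean predicted probability, per bin
prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)

# Brier score: mean squared difference between probabilities and outcomes
print("Brier score:", brier_score_loss(y_test, proba))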
- show_calibration_curve(model, X, y, xlabel='Mean Predicted Probability', ylabel='Fraction of Positives', model_title=None, overlay=False, title=None, save_plot=False, image_path_png=None, image_path_svg=None, text_wrap=None, curve_kwgs=None, grid=False, n_cols=2, n_rows=None, figsize=None, label_fontsize=12, tick_fontsize=10, bins=10, marker='o', show_brier_score=True, gridlines=True, linestyle_kwgs=None, group_category=None, **kwargs)
- Parameters:
model (estimator or list) – A trained classifier or a list of classifiers to evaluate.
X (pd.DataFrame or np.ndarray) – Feature matrix used for predictions.
y (pd.Series or np.ndarray) – True binary target values.
xlabel (str, optional) – X-axis label. Defaults to "Mean Predicted Probability".
ylabel (str, optional) – Y-axis label. Defaults to "Fraction of Positives".
model_title (str or list[str], optional) – Custom title(s) for the models.
overlay (bool, optional) – If True, overlays multiple models on one plot.
title (str, optional) – Title for the plot. Use "" to suppress.
save_plot (bool, optional) – Whether to save the plot(s).
image_path_png (str, optional) – Directory path for PNG export.
image_path_svg (str, optional) – Directory path for SVG export.
text_wrap (int, optional) – Max characters before title text wraps.
curve_kwgs (list[dict] or dict[str, dict], optional) – Styling options for the calibration curves.
grid (bool, optional) – Whether to arrange models in a subplot grid.
n_cols (int, optional) – Number of columns in the grid layout. Defaults to 2.
n_rows (int, optional) – Number of rows in the grid layout. Auto-calculated if None.
figsize (tuple, optional) – Figure size in inches (width, height).
label_fontsize (int, optional) – Font size for axis labels and titles.
tick_fontsize (int, optional) – Font size for ticks and legend entries.
bins (int, optional) – Number of bins used to compute calibration.
marker (str, optional) – Marker style for calibration points.
show_brier_score (bool, optional) – Whether to display Brier score in the legend.
gridlines (bool, optional) – Whether to show gridlines on plots.
linestyle_kwgs (dict, optional) – Styling for the “perfectly calibrated” reference line.
group_category (array-like, optional) – Categorical variable used to create subgroup calibration plots.
- Returns:
None. Displays or saves calibration plots for classification models.
- Return type:
None
- Raises:
If overlay=True and grid=True are both set.
If group_category is used with overlay or grid.
If the curve_kwgs list does not match the number of models.
Notes
- Calibration vs Discrimination:
Calibration evaluates how well predicted probabilities reflect observed outcomes, while ROC AUC measures a model’s ability to rank predictions.
- Flexible Plotting Modes:
overlay=True plots multiple models on one figure. grid=True arranges plots in a grid layout. If neither is set, individual full-size plots are created.
- Group-Wise Analysis:
Passing group_category plots separate calibration curves by subgroup (e.g., age, race). Each subgroup’s Brier score is shown when show_brier_score=True.
- Customization:
Use curve_kwgs and linestyle_kwgs to control styling. Add markers, gridlines, and custom titles to suit report or presentation needs.
- Saving Outputs:
Set save_plot=True and specify image_path_png or image_path_svg to export figures. Filenames are auto-generated based on the model name and plot type.
Important
Calibration curves are a valuable diagnostic tool for assessing the alignment between predicted probabilities and actual outcomes. By plotting the fraction of positives against predicted probabilities, we can evaluate how well a model’s confidence scores correspond to observed reality. While these plots offer important insights, it’s equally important to understand the assumptions and limitations behind the calibration methods used.
Calibration Curve Example 1 (Grid-like)
This example presents calibration curves for two classification models trained on the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
The classification models are displayed side by side in a grid layout. Each
subplot shows how well the predicted probabilities from a model align with the
actual observed outcomes. A diagonal dashed line representing perfect calibration
is included in both plots, and Brier scores are shown in the legend to quantify
each model’s calibration accuracy.
By setting grid=True, the function automatically arranges the individual plots based on the number of models and specified columns. This layout is ideal for visually comparing calibration behavior across models without overlapping lines.
pipelines_or_models = [
model_lr["model"].estimator,
model_rf["model"].estimator,
model_dt["model"].estimator,
]
# Model titles
model_titles = [
"Logistic Regression",
"Random Forest Classifier",
"Decision Tree Classifier",
]
from model_metrics import show_calibration_curve
show_calibration_curve(
model=pipelines_or_models[:2],
X=X_test,
y=y_test,
model_title=model_titles[:2],
text_wrap=50,
bins=10,
show_brier_score=True,
grid=True,
linestyle_kwgs={"color": "black"},
)
Output
Calibration Curve Example 2 (Overlay)
This example also uses the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
This example demonstrates how to overlay calibration curves from multiple classification
models in a single plot. Overlaying allows for direct visual comparison of how predicted
probabilities from each model align with actual outcomes on the same axes.
The diagonal dashed line represents perfect calibration, and Brier scores are included in the legend for each model, providing a quantitative measure of calibration accuracy.
By setting overlay=True, the function combines all model curves into one figure, making it easier to evaluate relative performance without splitting across subplots.
pipelines_or_models = [
model_lr["model"].estimator,
model_rf["model"].estimator,
model_dt["model"].estimator,
]
# Model titles
model_titles = [
"Logistic Regression",
"Random Forest Classifier",
"Decision Tree Classifier",
]
from model_metrics import show_calibration_curve
show_calibration_curve(
model=pipelines_or_models,
X=X_test,
y=y_test,
model_title=model_titles,
bins=10,
show_brier_score=True,
overlay=True,
linestyle_kwgs={"color": "black"},
)
Output
Calibration Curve Example 3 (by Category)
This example, too, uses the well-known Adult Income dataset [2], a widely used benchmark for binary classification tasks. Its rich mix of categorical and numerical features makes it particularly suitable for analyzing model performance across different subgroups.
To build and evaluate our models, we use the model_tuner
library [3].
Click here to view the corresponding codebase for this workflow.
This example shows how to visualize calibration curves separately for each
category within a given feature—in this case, the race column of the joined
test set—using a single Random Forest classifier. Each plot represents the
calibration behavior of the model for a specific subgroup, allowing for detailed
insight into how predicted probabilities align with actual outcomes across
demographic categories.
This type of disaggregated visualization is especially useful for fairness
analysis and subgroup performance auditing. By setting group_category="race", the function automatically detects unique values in the specified column and
generates a separate calibration curve for each.
The dashed diagonal reference line represents perfect calibration. Brier scores are included in each plot to provide a quantitative measure of calibration performance within the group.
Note
When using group_category, both overlay and grid must be set to False. This ensures each group receives its own standalone figure, avoiding conflicting layout behavior.
from model_metrics import show_calibration_curve
show_calibration_curve(
model=model_rf["model"].estimator,
X=X_test,
y=y_test,
model_title="Random Forest Classifier",
bins=10,
show_brier_score=True,
linestyle_kwgs={"color": "black"},
curve_kwgs={title: {"linewidth": 2} for title in model_titles},
group_category=X_test_2["race"],
)
Output
Threshold Metric Curves
This section introduces a powerful utility for exploring how classification thresholds affect key performance metrics, including Precision, Recall, F1 Score, and Specificity. Rather than fixing a threshold (commonly at 0.5), this function allows users to visualize trade-offs across the full range of possible thresholds, making it especially useful when optimizing for use-case-specific goals such as maximizing recall or achieving a minimum precision.
Using a Random Forest Classifier trained on the Adult Income dataset [2], this tool helps users answer practical questions like:
What threshold achieves at least 85% precision?
Where does F1 score peak for this model?
How does specificity behave as the threshold increases?
The plot_threshold_metrics function supports optional threshold lookups via lookup_metric and lookup_value; when both are supplied, the closest threshold meeting your constraint is printed. Plots can be customized with colors, gridlines, line styles, wrapped titles, and export options.
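The sketch below (independent of model_metrics) illustrates how such threshold curves can be computed by sweeping candidate thresholds and deriving each metric from the confusion matrix. It is a rough illustration of the idea, not the library’s implementation; a fitted classifier supporting predict_proba (here called clf, a placeholder) and a test split are assumed.
import numpy as np
from sklearn.metrics import confusion_matrix

proba = clf.predict_proba(X_test)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)

rows = []
for t in thresholds:
    preds = (proba >= t).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if preds collapse to one class
    tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    rows.append((t, precision, recall, f1, specificity))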
- plot_threshold_metrics(model, X_test, y_test, title=None, text_wrap=None, figsize=(8, 6), label_fontsize=12, tick_fontsize=10, gridlines=True, baseline_thresh=True, curve_kwgs=None, baseline_kwgs=None, save_plot=False, image_path_png=None, image_path_svg=None, lookup_metric=None, lookup_value=None, decimal_places=4)
- Parameters:
model (object) – A trained classification model that supports
predict_proba
.X_test (pd.DataFrame or np.ndarray) – Feature matrix for evaluation.
y_test (pd.Series or np.ndarray) – True binary labels.
title (str, optional) – Custom title for the plot. If "", disables the title.
text_wrap (int, optional) – Maximum width of the title before wrapping. If None, no wrapping is applied.
figsize (tuple, optional) – Tuple representing the figure size in inches. Defaults to (8, 6).
label_fontsize (int, optional) – Font size for axis labels and title.
tick_fontsize (int, optional) – Font size for tick labels.
gridlines (bool, optional) – Whether to show grid lines. Defaults to True.
baseline_thresh (bool, optional) – If True, adds a dashed line at threshold = 0.5.
curve_kwgs (dict, optional) – Dictionary of styling options for metric curves (e.g., {"linewidth": 2}).
image_path_png (str, optional) – File path to save PNG output.
image_path_svg (str, optional) – File path to save SVG output.
lookup_metric (str, optional) – Metric to search for best threshold (“precision”, “recall”, “f1”, or “specificity”).
lookup_value (float, optional) – Desired value for the lookup metric.
decimal_places (int, optional) – Number of decimal places for printed threshold output.
- Returns:
None. Displays or saves the metric vs. threshold plot.
- Return type:
None
Notes
- Metric Curves:
Plots include Precision, Recall, F1 Score, and Specificity over threshold values. Useful for analyzing how changing the threshold alters model behavior.
- Threshold Lookup:
Set lookup_metric and lookup_value to find the closest threshold that meets your constraint. The result is printed to the console and the corresponding threshold is highlighted with a vertical line.
- Styling Options:
Customize plot curves with curve_kwgs. Adjust the baseline style (e.g., at threshold = 0.5) via baseline_kwgs.
- Exporting:
Use save_plot=True with image_path_png and/or image_path_svg to save outputs.
- Interactivity:
Ideal for presentations or dashboards where visualizing threshold sensitivity is crucial.
Particularly helpful for domains like healthcare, fraud detection, or content moderation, where the cost of false positives vs. false negatives must be carefully managed.
Threshold Curves Example 1 (Threshold=0.5)
This example demonstrates how to plot threshold-dependent classification metrics using a Random Forest Classifier trained on the adult income dataset [2].
The plot_threshold_metrics function visualizes how Precision, Recall, F1 Score, and Specificity change as the decision threshold varies. In this configuration, the baseline reference line at 0.5 is turned off (baseline_thresh=False), the line styling is customized via curve_kwgs, and the title wrapping is adjusted for improved clarity in presentation-ready plots.
from model_metrics import plot_threshold_metrics
plot_threshold_metrics(
model=model_rf["model"].estimator,
X_test=X_test,
y_test=y_test,
baseline_thresh=False,
curve_kwgs={
"linestyle": "-",
"linewidth": 2,
},
text_wrap=40,
)
Output
Threshold Curves Example 2 (Targeted Metric Lookup)
This example expands on threshold-based classification metric visualization using
a targeted lookup scenario. Suppose a clinical stakeholder or domain expert has
determined—based on prior research, cost-benefit considerations, or operational
constraints—that a precision of approximately 0.879
is ideal for downstream
decision-making (e.g., minimizing false positives in a healthcare setting).
The plot_threshold_metrics
function accepts the optional arguments lookup_metric
and lookup_value
to help identify the threshold that best aligns with this target.
When these are set, the function automatically locates and highlights the threshold
that most closely achieves the desired metric value, offering transparency and
guidance for threshold tuning.
from model_metrics import plot_threshold_metrics
plot_threshold_metrics(
model=model_rf["model"].estimator,
X_test=X_test,
y_test=y_test,
lookup_metric="precision",
lookup_value=0.879,
baseline_thresh=False,
lookup_kwgs={
"color": "red",
"linestyle": "--",
"linewidth": 2,
},
curve_kwgs={
"linestyle": "-",
"linewidth": 2,
},
text_wrap=40,
)
Output
In this example:
lookup_metric="precision"
specifies that we are targeting the precision curve.lookup_value=0.879
provides the desired value for that metric.The function will search for the closest possible precision value along the threshold range and display a vertical line at that corresponding threshold.
The threshold value is printed to the console and included in the legend (e.g., Best Threshold: 0.6757).
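As a rough illustration of this lookup step (not the library’s implementation), the sketch below reuses the rows table from the earlier threshold sketch and selects the threshold whose precision is closest to the requested 0.879.
import numpy as np

thresholds = np.array([r[0] for r in rows])
precisions = np.array([r[1] for r in rows])

target = 0.879
best_idx = np.argmin(np.abs(precisions - target))  # closest precision to the target
print(f"Best Threshold: {thresholds[best_idx]:.4f} "
      f"(precision={precisions[best_idx]:.3f})")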