Changelog
Version 0.0.5a12
Three small but useful changes to the crosstab pair this round.
decimal_places parameter on overlap_crosstab
New keyword-only argument controlling how many decimal places the returned cells round to. Defaults to None (no rounding) for backward compatibility. Most useful with normalize=True to control proportion display:
ct = overlap_crosstab(
    y_true=y_test,
    y_prob_a=y_prob_lr,
    y_prob_b=y_prob_rf,
    normalize=True,
    decimal_places=2,
)
Rounding runs after normalize and mask_impossible, so NaN cells from mask_impossible=True are left untouched. Integer counts (default mode without normalize) round to themselves and stay visually unchanged.
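The ordering described above (normalize, then mask, then round) can be sketched in plain Python. This is an illustration only, not the library's code: `finalize_cells` is a hypothetical name, nested lists stand in for the DataFrame, and the sample counts are made up.

```python
import math

CATS = ["TP", "FP", "FN", "TN"]
POSITIVE = {"TP", "FN"}  # categories only actual positives can land in


def finalize_cells(counts, normalize=False, mask_impossible=False,
                   decimal_places=None):
    """Hypothetical helper: normalize, then mask, then round a 4x4 grid."""
    total = sum(sum(row) for row in counts)
    out = []
    for r, row_cat in enumerate(CATS):
        new_row = []
        for c, col_cat in enumerate(CATS):
            value = counts[r][c]
            if normalize and total:
                value = value / total
            # Impossible cell: the two axes disagree on the subpopulation.
            if mask_impossible and (row_cat in POSITIVE) != (col_cat in POSITIVE):
                value = math.nan
            # Rounding runs last and leaves NaN cells untouched.
            is_nan = isinstance(value, float) and math.isnan(value)
            if decimal_places is not None and not is_nan:
                value = round(value, decimal_places)
            new_row.append(value)
        out.append(new_row)
    return out


# Made-up counts; rows are model A categories, columns are model B categories.
counts = [
    [40, 0, 5, 0],
    [0, 10, 0, 3],
    [7, 0, 20, 0],
    [0, 2, 0, 113],
]
cells = finalize_cells(counts, normalize=True, mask_impossible=True,
                       decimal_places=2)
raw = finalize_cells(counts, decimal_places=2)  # integer counts stay as they are
```

With these inputs the (TP, TP) cell becomes 0.2, the (TP, FP) cell stays NaN, and the un-normalized call returns the original integers.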
Independent font-size control in the summary panel
The right-hand swap-summary panel of plot_overlap_crosstab now exposes two separate font-size knobs that were previously coupled:
summary_body_fontsize (new): controls the italic body lines beneath each colored headline. Defaults to None, which falls back to summary_fontsize - 1 (the original derived size, so the visual default is unchanged).
label_fontsize (existing, expanded): already controlled the matrix row and column headers; now also drives the bold colored summary headlines via label_fontsize + 5. Default label_fontsize=12 gives 17pt headlines, matching the previous summary_fontsize + 6 = 17 exactly.
The two knobs are independent. Bumping one does not move the other. Driven by the need for larger summary headlines on dense slides without inflating the body sentences.
plot_overlap_crosstab(
    y_true=y_test,
    model_a=lr, model_b=rf, X_a=X_test,
    label_fontsize=18,
    summary_body_fontsize=14,
)
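The sizing rules for the summary panel reduce to two small formulas. The sketch below restates them under the defaults given in this changelog; `summary_font_sizes` is a hypothetical helper name, not a function in the library.

```python
def summary_font_sizes(label_fontsize=12, summary_fontsize=11,
                       summary_body_fontsize=None):
    """Hypothetical helper restating the headline/body sizing rules."""
    headline = label_fontsize + 5
    # The body falls back to the original derived size when not set explicitly.
    if summary_body_fontsize is None:
        body = summary_fontsize - 1
    else:
        body = summary_body_fontsize
    return headline, body


print(summary_font_sizes())  # (17, 10)
print(summary_font_sizes(label_fontsize=18, summary_body_fontsize=14))  # (23, 14)
```

Changing `label_fontsize` moves only the headline size, and changing `summary_body_fontsize` moves only the body size, which is the independence the changelog describes.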
Whitespace removal above the matrix
_draw_crosstab_matrix had two layout issues that conspired to leave 30% or more empty space above the matrix in larger figure sizes:
ylim extended to n + 1.4 when the topmost content (the label_b axis label) only needed about n + 0.75. That left 0.65 units of empty data space inside the axes.
set_aspect("equal") used the default anchor "C", which centered the squared axes box in its subplot cell, sending leftover space upward when matplotlib's default top margin already biased the layout.
Fixes:
Tightened ax.set_ylim from (-0.3, n + 1.4) to (-0.3, n + 0.85).
Changed ax.set_aspect("equal") to ax.set_aspect("equal", anchor="N").
Both apply unconditionally to every render mode: table_only, full view, no-summary, no-legend, and combine_plots-nested.
Tests added
17 new tests across test_metrics_utils.py and test_model_evaluator_addtl.py pinning the changes above. Coverage breakdown:
overlap_crosstab decimal_places: 5 tests (rounds with normalize, None preserves precision, NaN cells survive, decimal_places=0 rounds to whole numbers, no-op on integer counts).
_draw_crosstab_summary font sizes: 6 tests (default 17pt headlines, explicit label_fontsize scales headlines, body defaults to fontsize-1, explicit body_fontsize wins, two knobs independent, backward-compat positional args).
plot_overlap_crosstab integration: 3 tests (summary_body_fontsize accepted and applied, label_fontsize flows into summary headlines).
_draw_crosstab_matrix layout regression: 3 tests (ylim tight, anchor is N, end-to-end public-function anchor check).
Version 0.0.5a11
A summary of the overlap_crosstab and plot_overlap_crosstab functions added to the overlap family on this iteration, plus the supporting helpers, the font-handling infrastructure, and small touch-ups to existing overlap functions in metrics_utils.py and model_evaluator.py.
What the module adds
Two sibling functions extending the overlap family to a fourth view of the same comparison. Where overlap_summary gives per-category marginal counts and plot_overlap_venns shows per-category region overlap, the crosstab pair shows the full joint distribution of (model_a category, model_b category) in one 4x4 frame.
overlap_crosstab returns a 4x4 pandas DataFrame indexed by {TP, FP, FN, TN} on both axes, counting every observation that lands in each (row category, column category) pair. The diagonal is the agreement count; the off-diagonal valid cells are the TP-swap and FP-swap pairs.
plot_overlap_crosstab renders the same crosstab as a styled matplotlib figure with three cell colorings (green for agreement, red for swaps, blue-gray for the eight structurally impossible cells), an optional right-hand summary panel surfacing the derived swap story, and an optional legend strip beneath the matrix.
Both share the same input-parameter block as the rest of the overlap family, so the same prediction inputs that drive a Venn or a summary table also drive the crosstab without modification.
Public functions
overlap_crosstab
overlap_crosstab(
y_true,
y_pred_a=None,
y_pred_b=None,
*,
y_prob_a=None,
y_prob_b=None,
model_a=None,
model_b=None,
X_a=None,
X_b=None,
threshold_a=None,
threshold_b=None,
score_a=None,
score_b=None,
label_a="Model A",
label_b="Model B",
normalize=False,
mask_impossible=False,
verbose=False,
)
Returns a 4x4 DataFrame indexed by {TP, FP, FN, TN} with the same four columns. verbose=True prints the cell-meaning legend plus the derived swap summary before returning. normalize=True divides every cell by the total observation count. mask_impossible=True sets the eight structurally impossible cells to NaN instead of leaving them at 0.
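To make the counting concrete, here is a rough pure-Python sketch of the joint tally for binary predictions. It is not the library's implementation: `categorize` and `crosstab_counts` are hypothetical names, a nested dict stands in for the DataFrame, and rows are assumed to be model A categories with columns as model B categories.

```python
from collections import Counter

CATS = ["TP", "FP", "FN", "TN"]


def categorize(y_true, y_pred):
    """Confusion-matrix category for a single observation."""
    if y_true == 1:
        return "TP" if y_pred == 1 else "FN"
    return "FP" if y_pred == 1 else "TN"


def crosstab_counts(y_true, y_pred_a, y_pred_b):
    """Hypothetical stand-in: count (model A category, model B category) pairs."""
    pairs = Counter(
        (categorize(t, a), categorize(t, b))
        for t, a, b in zip(y_true, y_pred_a, y_pred_b)
    )
    return {row: {col: pairs.get((row, col), 0) for col in CATS} for row in CATS}


# Toy data: three actual positives followed by three actual negatives.
y_true = [1, 1, 1, 0, 0, 0]
y_pred_a = [1, 0, 1, 0, 1, 0]
y_pred_b = [1, 1, 0, 0, 0, 1]
ct = crosstab_counts(y_true, y_pred_a, y_pred_b)
total = sum(sum(row.values()) for row in ct.values())
```

On this toy input every valid cell type appears once except (FN, FN) and (FP, FP), and cells that mix the positive and negative subpopulations, such as (TP, FP), stay at 0.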
plot_overlap_crosstab
plot_overlap_crosstab(
y_true,
y_pred_a=None,
y_pred_b=None,
*,
y_prob_a=None,
y_prob_b=None,
model_a=None,
model_b=None,
X_a=None,
X_b=None,
threshold_a=None,
threshold_b=None,
score_a=None,
score_b=None,
label_a="Model A",
label_b="Model B",
title=None,
table_only=False,
show_summary=True,
show_legend=True,
colors=None,
cell_fontsize=14,
label_fontsize=12,
title_fontsize=14,
summary_fontsize=11,
font=None,
figsize=None,
save_plot=False,
image_path_png=None,
image_path_svg=None,
image_filename=None,
ax=None,
)
Renders the crosstab as a matplotlib figure. Calls overlap_crosstab internally, then dispatches to three private helpers (_draw_crosstab_matrix, _draw_crosstab_summary, _draw_crosstab_legend) for the three regions of the figure.
Structural-impossibility insight
Of the 16 cells in the 4x4 grid, only 8 are reachable. Because y_true is fixed per observation:
actual positives can only be TP or FN for each model
actual negatives can only be TN or FP
So any cell mixing a positive-subpop category (TP, FN) on one axis with a negative-subpop category (FP, TN) on the other is structurally impossible. A single observation cannot be TP for one model and FP for the other, since y_true is either 1 or 0, not both.
The valid cells partition cleanly:
Diagonal (4 cells): agreement, both models put the observation in the same confusion-matrix category. (TP, TP), (FP, FP), (FN, FN), (TN, TN).
Off-diagonal, positive-subpop (2 cells): TP swap pair. (TP, FN) and (FN, TP). One model catches an actual positive the other misses.
Off-diagonal, negative-subpop (2 cells): FP swap pair. (FP, TN) and (TN, FP). One model false-alarms on an actual negative the other clears.
The structural-impossibility check is encoded in _draw_crosstab_matrix (cells where row category and column category are in different subpops get the impossibility color) and in overlap_crosstab’s mask_impossible flag (NaN those cells in the returned DataFrame instead of leaving them at 0).
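The partition of the 16 cells can be checked by enumeration. The sketch below is illustrative and uses a hypothetical `cell_kind` function; it applies the rule stated above, that a cell is impossible when its row and column categories belong to different subpopulations.

```python
from itertools import product

CATS = ["TP", "FP", "FN", "TN"]
POSITIVE = {"TP", "FN"}  # categories reachable only by actual positives


def cell_kind(row_cat, col_cat):
    """Classify one cell as 'agree', 'swap', or 'impossible'."""
    if (row_cat in POSITIVE) != (col_cat in POSITIVE):
        return "impossible"
    return "agree" if row_cat == col_cat else "swap"


kinds = {cell: cell_kind(*cell) for cell in product(CATS, repeat=2)}
tally = {kind: list(kinds.values()).count(kind)
         for kind in ("agree", "swap", "impossible")}
print(tally)  # {'agree': 4, 'swap': 4, 'impossible': 8}
```

The four swap cells are the TP swap pair, (TP, FN) and (FN, TP), plus the FP swap pair, (FP, TN) and (TN, FP).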
Prediction-input modes
Same three-way exclusivity as the rest of the family. For each side (a, b) supply exactly one of:
y_pred_* (binary predictions, used as is)
y_prob_* (positive-class probabilities, thresholded internally via threshold_*, default 0.5)
model_* + X_* (predictions via get_predictions; the model's stored .threshold attribute is looked up automatically, with threshold_* as an explicit override and score_* selecting a non-default score key from the model's .threshold dict)
Supplying more than one on the same side raises ValueError. The y_prob_* path defaults to 0.5 when no threshold_* is set, since there is no model object available on that path to source a stored threshold from.
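A minimal sketch of the per-side exclusivity check follows. `resolve_mode` is a hypothetical name, and raising when nothing is supplied is an assumption here; the changelog only states that supplying more than one mode raises ValueError.

```python
def resolve_mode(side, y_pred=None, y_prob=None, model=None):
    """Hypothetical check: exactly one input mode per side."""
    candidates = (("y_pred", y_pred), ("y_prob", y_prob), ("model", model))
    supplied = [name for name, value in candidates if value is not None]
    if len(supplied) != 1:
        # Assumption: zero supplied inputs is treated as an error too.
        raise ValueError(
            f"side {side!r}: supply exactly one of y_pred, y_prob, or model; "
            f"got {supplied if supplied else 'none'}"
        )
    return supplied[0]


print(resolve_mode("a", y_prob=[0.2, 0.9]))  # y_prob
```

Failing loudly on ambiguous input keeps the caller from silently comparing, say, hard predictions on one side against a differently thresholded probability vector they forgot to remove.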
Display controls (plot_overlap_crosstab)
title controls the figure title and its agreement-rate subtitle. None defaults to "{label_a} vs {label_b}: Prediction Overlap". Pass "" to suppress.
table_only is the one-flag shortcut for a bare matrix: suppresses the title, the side summary panel, and the bottom legend strip. Equivalent to passing title="", show_summary=False, show_legend=False.
show_summary toggles the right-hand swap-summary text block (shared TP / FN / TN / FP counts, TP-swap and FP-swap breakdowns).
show_legend toggles the color-legend strip beneath the matrix.
colors accepts a dict with keys "agree", "disagree", "impossible" to override the default cell colors. Unspecified keys keep their defaults (soft green, soft pink, soft blue-gray).
Font sizes are split across four parameters: cell_fontsize for the cell counts, label_fontsize for the row and column headers, title_fontsize for the title block, summary_fontsize for the side panel.
Default figsize behavior
figsize=None triggers a three-way default based on which display regions are active:
(5, 5) when table_only=True (bare square matrix)
(11, 6.5) when show_summary=True (matrix plus side panel)
(7, 6.5) otherwise (matrix plus legend, no side panel)
The matrix uses set_aspect("equal") so cells always render square. Passing an asymmetric figsize (e.g. (10, 4)) produces a figure of that exact size with whitespace on the wider dimension, since cells will not stretch to fill.
When the title is suppressed (table_only=True or title=""), tight_layout is invoked with rect=[0, 0, 1, 1] so the matrix uses the full figure height. When the title is active, rect=[0, 0, 1, 0.88] reserves the top 12% for the title and subtitle.
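The figsize default and the tight_layout rectangle are both simple functions of the display flags. The sketch below restates them; `default_figsize` and `layout_rect` are hypothetical helper names, and the precedence of table_only over show_summary is inferred from the order of the list above.

```python
def default_figsize(table_only=False, show_summary=True):
    """Hypothetical restatement of the three-way figsize default."""
    if table_only:
        return (5, 5)       # bare square matrix
    if show_summary:
        return (11, 6.5)    # matrix plus side panel
    return (7, 6.5)         # matrix plus legend, no side panel


def layout_rect(table_only=False, title=None):
    """Rect for tight_layout: full height unless a title is shown."""
    title_hidden = table_only or title == ""
    return [0, 0, 1, 1] if title_hidden else [0, 0, 1, 0.88]


print(default_figsize(show_summary=False))  # (7, 6.5)
print(layout_rect(title=""))                # [0, 0, 1, 1]
```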
Font handling
A new module-level _FONT_ALIASES dict in metrics_utils.py maps common cross-platform font names (Arial, Helvetica, Times New Roman, Consolas, Cascadia Code, Source Code Pro, Fira Code, JetBrains Mono, Menlo, Monaco, Georgia, Verdana, Calibri, Cambria, Tahoma, Garamond) to fallback chains ending in a font that is guaranteed installable (DejaVu Sans or DejaVu Sans Mono, both shipped with matplotlib).
_resolve_font_family(font) expands the input to a chain, filters it to the fonts actually findable on the system (via font_manager.findfont(name, fallback_to_default=False)), and returns the surviving list. Behavior summary:
font=None returns None (the caller leaves rcParams untouched).
font="Arial" (or any aliased name) always succeeds, because the alias chain ends in DejaVu Sans.
font="UnknownFontName" (not in the alias map, not installed) raises ValueError with a useful message listing the supported aliases and pointing at how to check what is installed.
font=42 raises TypeError.
In plot_overlap_crosstab, the resolved font is applied via matplotlib.rc_context({"font.family": resolved_font}), so the override scopes to the call only and does not modify the global rcParams.
The asymmetry between aliased and unaliased fonts is intentional. Aliased fonts have a curated fallback path that gracefully degrades to something installable on every platform, so library users do not need to install Microsoft fonts on Linux to use font="Arial". Unaliased fonts are taken at the user’s word and validated literally, so typos and missing fonts surface as errors rather than getting silently swapped to DejaVu Sans.
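The aliased-versus-unaliased asymmetry can be sketched without matplotlib by injecting the "is this font installed" check. Everything named here is hypothetical: the two alias chains are invented examples (only their DejaVu endings follow the description above), and `is_installed` stands in for the font_manager.findfont lookup the library is described as using.

```python
FONT_ALIASES = {  # hypothetical excerpt; the real chains may differ
    "Arial": ["Arial", "Helvetica", "DejaVu Sans"],
    "Consolas": ["Consolas", "Menlo", "DejaVu Sans Mono"],
}


def resolve_font_family(font, is_installed):
    """Sketch of alias expansion followed by an installed-font filter."""
    if font is None:
        return None  # caller leaves rcParams untouched
    if not isinstance(font, str):
        raise TypeError(f"font must be a str or None, got {type(font).__name__}")
    if font in FONT_ALIASES:
        # Aliased names degrade gracefully along the curated chain.
        return [name for name in FONT_ALIASES[font] if is_installed(name)]
    if not is_installed(font):
        # Unaliased names are validated literally so typos surface as errors.
        raise ValueError(
            f"font {font!r} is not installed and has no alias; "
            f"supported aliases: {sorted(FONT_ALIASES)}"
        )
    return [font]


# Simulate a machine that only has the fonts bundled with matplotlib.
installed = {"DejaVu Sans", "DejaVu Sans Mono"}
print(resolve_font_family("Arial", installed.__contains__))  # ['DejaVu Sans']
```

The returned list is the kind of value that can be passed as "font.family" inside a matplotlib rc_context, which is how the changelog says the override is scoped.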
Supporting helpers
_print_overlap_crosstab_legend(ct, label_a, label_b) (in metrics_utils.py) prints both the static cell-meaning legend (what each cell type represents, which cells are structurally impossible) and the dynamic swap summary derived from the raw integer counts (agreement rate, shared TP / FN / TN / FP, TP and FP swap breakdowns with net deltas).
_draw_crosstab_matrix(ax, ct, label_a, label_b, colors, cell_fontsize, label_fontsize) (in metrics_utils.py) renders the 4x4 grid: per-cell color via Rectangle patches, per-cell text via ax.text with the count formatted by :,, row and column headers, and the axis-level labels. Uses a plain hyphen rather than an em dash for the NaN cell text under mask_impossible=True.
_draw_crosstab_summary(ax, label_b, stats, fontsize) (in metrics_utils.py) renders the right-hand text block with four colored headlines (shared FN, shared TN, TP swap, FP swap) and their italic body lines, all relative to label_b since the swap framing reads naturally with the newer or candidate model in the _b slot.
_draw_crosstab_legend(ax, colors, fontsize) (in metrics_utils.py) renders the horizontal color-legend strip using matplotlib's Patch and ax.legend.
_resolve_font_family(font) (in metrics_utils.py) is the font alias expander documented above.
_FONT_ALIASES (module-level dict in metrics_utils.py) is the alias table.
Touch-ups to existing overlap functions
overlap_summary: restored the n_{label_a} and n_{label_b} columns (the per-model totals for the category, equal to both + {label}_only) after a brief detour to a 5-column partition-only form. The 7-column form was restored because the per-model totals are exactly what the Venn diagram shows beneath each circle, making the table-to-figure cross-reference direct.
_print_overlap_summary_legend: updated to document both the partition identity (both + {a}_only + {b}_only + outside == subpop) and the per-model decomposition (n_{label} == both + {label}_only) explicitly, since the difference between n_{label} (full venn circle) and {label}_only (exclusive crescent) is the most common reader-confusion point.
overlap_table: tightened the index parameter to raise TypeError with a helpful message when a scalar (e.g. a column name string) is passed instead of an array-like, since pd.Index(scalar) produces the cryptic Index(...) must be called with a collection error.
_draw_crosstab_matrix axis-label spacing: dropped the axis-level label_b y-position from n + 0.9 to n + 0.55 so the column-axis label sits at the same visible distance from the column headers as the row-axis label sits from the row headers.
_draw_crosstab_summary: applied the :, thousands separator to every number in the swap-summary block (the swap headlines and the body lines were originally missing the comma format, leaving numbers like 1249 in a sea of 1,249 and +1,079).
Design decisions baked in
Same input-parameter block as the rest of the overlap family. Swapping between plot_overlap_venns, overlap_summary, overlap_table, overlap_crosstab, and plot_overlap_crosstab for the same comparison is a one-character edit.
Three exclusive input modes per side, no implicit threshold lookup on the y_prob path. The y_prob path defaults to 0.5 because there is no model object to source a stored threshold from. Users who want tuned-threshold behavior with probabilities in hand pass threshold_* explicitly, or use the model path.
Strict ValueError on unfound unaliased fonts; graceful degradation for aliased fonts. Library users do not need to install Microsoft fonts on Linux to call font="Arial", but typos and unrecognized fonts loudly fail rather than silently rendering in DejaVu Sans.
Font override scoped via mpl.rc_context. The font parameter applies only to the current call and never leaks into the global rcParams, so other figures in the same notebook are unaffected.
table_only as the one-flag shortcut for bare matrices. Sets title="", show_summary=False, show_legend=False together, because those are the three knobs anyone toggles together when preparing a slide-ready figure.
aspect="equal" on the matrix. Cells stay square at every figure size, which is the right tradeoff for the symmetry-reading use case (the TP-swap and FP-swap pairs read off the diagonal best when cells are square).
n_{label} columns retained in overlap_summary despite redundancy with the partition columns. The per-model totals match exactly what the venn diagram shows beneath each circle, making the table-to-figure cross-reference direct.
Plot function calls data function internally rather than duplicating the counting logic. plot_overlap_crosstab invokes overlap_crosstab to build the DataFrame, then renders. Single source of truth for the counts; test the data function and the figure inherits correctness for free.
Swap summary framed relative to label_b. "Tab+Text catches 5 extra, misses 3" reads naturally when the newer or candidate model is in the _b slot. The narrative arrow points from the baseline (_a) to the candidate (_b).
Version 0.0.5a10
Three sibling functions for comparing two binary classifiers head to head:
plot_overlap_venns renders equal-area Venn diagrams across any subset of {"TP", "FP", "FN", "TN"} categories. Each panel shows the overlap between the two models' predictions within the relevant subpopulation (actual positives for TP/FN, actual negatives for FP/TN).
overlap_table returns a per-observation DataFrame classifying each row as TP/FP/FN/TN for both models, with an agreement flag. Useful for drilling into specific observations or merging with other patient-level data.
overlap_summary returns a four-row DataFrame indexed by {TP, FP, FN, TN} with per-category counts of the venn regions. Useful as a numeric companion to the figure.
All three share the same input-parameter block, so swapping between table, summary, and figure for the same comparison is a one-character edit.
Public functions
plot_overlap_venns
plot_overlap_venns(
y_true,
y_pred_a=None,
y_pred_b=None,
*,
y_prob_a=None,
y_prob_b=None,
model_a=None,
model_b=None,
X_a=None,
X_b=None,
threshold_a=None,
threshold_b=None,
score_a=None,
score_b=None,
categories=("FN", "TN"),
label_a="Model A",
label_b="Model B",
titles=None,
title_pad=None,
label_kwgs=None,
inner_fontsize=12,
outer_fontsize=12,
title_fontsize=11,
figsize=None,
ncols=None,
pad=1.08,
h_pad=None,
w_pad=None,
colors=None,
alpha=0.4,
save_plot=False,
image_path_png=None,
image_path_svg=None,
image_filename=None,
ax=None,
)
overlap_table
overlap_table(
y_true,
y_pred_a=None,
y_pred_b=None,
*,
y_prob_a=None,
y_prob_b=None,
model_a=None,
model_b=None,
X_a=None,
X_b=None,
threshold_a=None,
threshold_b=None,
score_a=None,
score_b=None,
label_a="Model A",
label_b="Model B",
index=None,
)
Returns a DataFrame with columns y_true, {label_a}_pred,
{label_b}_pred, {label_a}_category, {label_b}_category, agree.
overlap_summary
overlap_summary(
y_true,
y_pred_a=None,
y_pred_b=None,
*,
y_prob_a=None,
y_prob_b=None,
model_a=None,
model_b=None,
X_a=None,
X_b=None,
threshold_a=None,
threshold_b=None,
score_a=None,
score_b=None,
label_a="Model A",
label_b="Model B",
verbose=False,
)
Returns a four-row DataFrame indexed by {TP, FP, FN, TN} with seven
columns: n_{label_a}, n_{label_b}, both, {label_a}_only,
{label_b}_only, outside, subpop. Two identities hold every row:
partition: both + {a}_only + {b}_only + outside == subpop
per-model: n_{label} == both + {label}_only
verbose=True prints a human-readable column legend via
_print_overlap_summary_legend.
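The two identities can be verified on a toy example. The sketch below counts the Venn regions for one category in plain Python; `category_counts` is a hypothetical function, and the use of subpop_val and in_set_val as arguments borrows the key names listed for _VENN_CATEGORY_SPEC as an assumption about how a category is defined.

```python
def category_counts(y_true, y_pred_a, y_pred_b, subpop_val, in_set_val):
    """Hypothetical counter for one category's Venn regions."""
    both = a_only = b_only = outside = 0
    for t, a, b in zip(y_true, y_pred_a, y_pred_b):
        if t != subpop_val:
            continue  # observation is outside this category's subpopulation
        in_a, in_b = a == in_set_val, b == in_set_val
        if in_a and in_b:
            both += 1
        elif in_a:
            a_only += 1
        elif in_b:
            b_only += 1
        else:
            outside += 1
    return {
        "n_a": both + a_only, "n_b": both + b_only, "both": both,
        "a_only": a_only, "b_only": b_only, "outside": outside,
        "subpop": both + a_only + b_only + outside,
    }


y_true = [1, 1, 1, 1, 0, 0]
y_pred_a = [0, 0, 1, 1, 0, 1]
y_pred_b = [0, 1, 0, 1, 0, 0]
# FN: actual positives (subpop_val=1) that a model predicts negative (in_set_val=0).
fn = category_counts(y_true, y_pred_a, y_pred_b, subpop_val=1, in_set_val=0)
```

Here each of the four regions holds one observation, so n_a and n_b are both 2 while a_only and b_only are both 1, which is the circle-versus-crescent distinction the legend documents.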
Prediction-input modes
Each side (a, b) accepts exactly one of three input modes:
y_pred_* (binary predictions, used as is)
y_prob_* (positive-class probabilities, thresholded internally via threshold_*, default 0.5)
model_* + X_* (predictions generated via get_predictions; the model's stored .threshold attribute is looked up automatically, with threshold_* as an explicit override and score_* selecting a non-default key from the model's .threshold dict)
Supplying more than one of these on the same side raises ValueError.
When model objects without tuned thresholds are passed, the functions fall
back to a 0.5 default silently, so plain sklearn classifiers work without
extra setup. When comparing this view against the standalone confusion matrix
from show_confusion_matrix, pass the same threshold_* value to both
to keep the partition counts aligned.
label_kwgs (visibility toggles for plot_overlap_venns)
A single dict parameter controls all text decorations on each Venn panel. Six boolean keys, all default True:
show_title: heading line above the diagram
show_subtitle: auto stats line beneath the heading
show_set_labels: model names beneath each circle
show_set_totals: "FN total: X" line beneath model names
show_inner_count: the number inside each region
show_inner_role: the "both miss" / model-name role text inside each region
Pass only the keys you want to override. Common combinations:
# slide deck: title only, no chrome
label_kwgs={
    "show_subtitle": False,
    "show_set_totals": False,
    "show_inner_role": False,
}
# bare venn: just circles and numbers
label_kwgs={
    "show_title": False,
    "show_set_labels": False,
    "show_set_totals": False,
    "show_inner_role": False,
}
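"Pass only the keys you want to override" implies a defaults-plus-overrides merge. A minimal sketch of that pattern is below; `resolve_label_kwgs` is a hypothetical name, and how the library treats unknown keys is not covered here.

```python
LABEL_DEFAULTS = {
    "show_title": True,
    "show_subtitle": True,
    "show_set_labels": True,
    "show_set_totals": True,
    "show_inner_count": True,
    "show_inner_role": True,
}


def resolve_label_kwgs(label_kwgs=None):
    """Hypothetical merge: user overrides win, everything else stays True."""
    return {**LABEL_DEFAULTS, **(label_kwgs or {})}


slide = resolve_label_kwgs({"show_subtitle": False, "show_inner_role": False})
print(slide["show_title"], slide["show_subtitle"])  # True False
```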
Default figsize behavior
When figsize=None (the default) and the function creates its own figure,
_venn_default_figsize computes the panel size from title text length:
panel_w = max(5.5, max_chars * (title_fontsize * 0.009) + 0.6)
panel_h = 5.0
Width grows with title length but floors at 5.5 inches to avoid inter-column gaps. Height stays fixed at 5 inches to keep room for the title, the circles, and the bottom set labels without overflowing into the next row. Width and height are decoupled, so a long title widens panels without compressing them vertically.
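Plugging numbers into the formula shows where the 5.5-inch floor stops applying. The wrapper name `venn_panel_size` is hypothetical; the arithmetic is the formula quoted above.

```python
def venn_panel_size(max_chars, title_fontsize=11):
    """Per-panel (width, height) in inches from the longest title length."""
    panel_w = max(5.5, max_chars * (title_fontsize * 0.009) + 0.6)
    panel_h = 5.0
    return panel_w, panel_h


short_w, short_h = venn_panel_size(30)  # 30 * 0.099 + 0.6 is below the floor
long_w, long_h = venn_panel_size(80)    # roughly 8.52 inches wide
```

At the default title_fontsize of 11, each character adds about 0.099 inches, so titles need to exceed roughly 49 characters before the panel grows past the 5.5-inch floor.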
combine_plots integration
plot_overlap_venns accepts an ax= parameter. When supplied, the
function does not create its own figure. It releases the host axes and
carves a sub-gridspec from the host axes’ slot, then adds new sub-axes
inside that sub-gridspec for each category panel.
Recommended pattern for mixing into a full evaluation suite: add one
plot_overlap_venns entry to combine_plots’s plot_calls per
category, each with a single-element categories tuple. Nesting all four
inside a single panel works but produces visually cramped output when
the figure also contains other plots competing for width.
For per-row spacing requirements, combine_plots switches to
constrained_layout when hspace is passed. height_ratios=[1, 0.7, 1,
1, 1.4, 1.4] plus hspace=0.15 is a reasonable starting point for the
evaluation-suite plus four-venn-rows case.
Supporting helpers in metrics_utils.py
_VENN_CATEGORY_SPEC
Module-private dict mapping each category key to display metadata:
title, subpop_val, in_set_val, both_role, outside_label, subpop_name.
_venn_blend(c1, c2)
RGB midpoint of two matplotlib color specs (via to_rgb).
_venn_resolve_side(side, y_pred, model, X, *, y_true=None, y_prob=None,
threshold=None, score=None)
Three-way exclusivity check on (y_pred, y_prob, model). Returns the
integer 1-D prediction array for one side. Model path routes through
get_predictions for threshold-aware prediction.
_venn_category_counts(y_true, y_pred_a, y_pred_b, cat)
Returns (a_only, b_only, both, outside, n_sub) integer counts for one
category. Shared backend for plot_overlap_venns and overlap_summary.
_venn_default_figsize(counts_per_cat, categories, titles, label_kwgs,
title_fontsize, ncols, nrows)
Computes per-panel figsize from title text length. Reads show_title
and show_subtitle from label_kwgs to skip text estimation when titles
are hidden.
_draw_one_venn(ax, cat, counts, label_a, label_b,
inner_fontsize, outer_fontsize, title_fontsize,
colors, alpha, *,
title_override, title_pad, label_kwgs)
Renders one category's overlap Venn into the provided axes. Handles
set-label assembly, inner region text composition, color blending,
and title formatting per the label_kwgs toggles.
_overlap_table_categorize(yt, yp)
Per-observation TP/FN/FP/TN classification used by overlap_table.
_print_overlap_summary_legend(label_a, label_b)
Human-readable column legend for overlap_summary. Documents both the
partition identity (sum-to-subpop) and the per-model decomposition
(n_{label} = both + {label}_only) since the difference between
n_{label} (full venn circle) and {label}_only (exclusive crescent) is
the most common reader-confusion point.
Design decisions baked in
Threshold default uses the model's stored value, not 0.5. A venn comparison should show each model at its real operating point. Models without a .threshold attribute fall back to 0.5 silently.
Single threshold_* parameter per side, not the library's model_threshold plus custom_threshold pair. The pair only made sense for functions that needed to distinguish "use stored" from "override scalar". The venn API has one knob.
Three exclusive input modes per side (y_pred, y_prob, model). Forces an explicit choice rather than silent ambiguity if multiple are passed.
Pure-data companions return DataFrames, not styled objects. Preserves chaining and downstream .loc access. Legend printing is opt-in via verbose=True or a separate helper call.
n_{label} columns retained in overlap_summary despite redundancy with the partition columns. The per-model totals match exactly what the venn diagram shows beneath each circle, making the table-to-figure cross-reference direct.
Bare-string category accepted (e.g. categories="FN"). Avoids the footgun where iterating the string yields per-character category lookups.
Single function for the venn (not a singular/plural split). combine_plots integration uses plot_overlap_venns with a single-element categories tuple rather than introducing a separate plot_overlap_venn function.
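The bare-string footgun mentioned in the design decisions above is easy to reproduce: iterating "FN" yields "F" and "N". A sketch of the kind of normalization that avoids it follows; `normalize_categories` is a hypothetical name, and rejecting unknown keys with ValueError is an assumption.

```python
VALID_CATEGORIES = ("TP", "FP", "FN", "TN")


def normalize_categories(categories):
    """Hypothetical normalizer: wrap a bare string before iterating."""
    if isinstance(categories, str):
        categories = (categories,)
    categories = tuple(categories)
    unknown = [c for c in categories if c not in VALID_CATEGORIES]
    if unknown:
        # Assumption: unknown keys are rejected rather than ignored.
        raise ValueError(f"unknown categories: {unknown}")
    return categories


print(tuple("FN"))                 # ('F', 'N'), the per-character footgun
print(normalize_categories("FN"))  # ('FN',)
```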
Version 0.0.5a9
Important
Corrected package name in pyproject.toml from model_metrics_dev to model_metrics. The 0.0.5a8 release was inadvertently published under the wrong package name and could not be recalled in time; 0.0.5a9 supersedes it.
New Features
Added ax=None parameter to seven single-plot functions: show_roc_curve, show_pr_curve, show_confusion_matrix, show_calibration_curve, show_lift_chart, show_gain_chart, and plot_threshold_metrics. When a pre-created matplotlib.axes.Axes object is supplied, each function draws onto that axes and suppresses its internal plt.show() and save_plot_images calls. All existing call signatures remain fully backward-compatible; the default ax=None preserves prior behavior exactly.
Added combine_plots function that assembles multiple single-plot function calls into a shared subplot figure. Accepts a list of (func, kwargs) tuples, pre-allocates one axes per panel, and passes each axes to the corresponding function via the ax parameter. Supports configurable grid layout (n_cols, n_rows), custom figsize, suptitle, tight_layout, and full save_plot_images integration. Unused trailing panels are hidden automatically. Panels that raise exceptions render an inline error message rather than aborting the figure.
Added tick_fontsize parameter to combine_plots. Both label_fontsize and tick_fontsize are now automatically injected into each panel function via inspect.signature, giving uniform typography across the grid without per-panel repetition. Per-panel overrides in plot_calls always take precedence.
Added hspace and wspace parameters to combine_plots for explicit row and column spacing control. When either is supplied, constrained_layout is activated and spacing is applied via fig.get_layout_engine().set() after tight_layout, correctly handling panels with legend_loc="bottom" that place legends outside the axes bbox.
Added height_ratios parameter to combine_plots, passed through gridspec_kw, allowing individual rows to be resized independently. Useful when mixing plot types of different natural heights (e.g., confusion matrices alongside full-height curve panels).
Fixed the overlay path in show_roc_curve, show_pr_curve, show_lift_chart, show_gain_chart, show_calibration_curve, and plot_threshold_metrics. Previously, overlay=True called plt.figure() unconditionally, creating a standalone figure instead of drawing onto the supplied ax. All six functions now check ax is not None before creating a figure, correctly routing overlay draws into combine_plots panel grids.
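The signature-based injection of shared font sizes can be sketched with the standard library alone. This is an illustration of the general pattern, not combine_plots itself: `inject_shared_kwargs` and the two panel functions are hypothetical stand-ins.

```python
import inspect


def inject_shared_kwargs(func, panel_kwargs, shared):
    """Add shared kwargs the function accepts; per-panel values win."""
    accepted = inspect.signature(func).parameters
    merged = {key: value for key, value in shared.items() if key in accepted}
    merged.update(panel_kwargs)  # explicit per-panel overrides take precedence
    return merged


# Hypothetical panel functions standing in for the real plotting calls.
def fake_curve(ax=None, label_fontsize=10, tick_fontsize=8):
    return label_fontsize, tick_fontsize


def fake_matrix(ax=None, label_fontsize=10):
    return label_fontsize


shared = {"label_fontsize": 14, "tick_fontsize": 12}
curve_kwargs = inject_shared_kwargs(fake_curve, {"tick_fontsize": 9}, shared)
matrix_kwargs = inject_shared_kwargs(fake_matrix, {}, shared)
print(curve_kwargs)   # {'label_fontsize': 14, 'tick_fontsize': 9}
print(matrix_kwargs)  # {'label_fontsize': 14}
```

Checking the signature first means a panel function that lacks a tick_fontsize parameter never receives it, so the shared defaults cannot trigger an unexpected-keyword TypeError.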
Bug Fixes
Fixed combine_plots axes flattening logic. The previous if num_plots == 1 guard incorrectly wrapped a numpy ndarray of axes in a list when plt.subplots(1, N) was called with N > 1 and only one plot call was provided, causing an AttributeError: 'numpy.ndarray' object has no attribute 'plot'. The fix inspects the axes object directly using hasattr rather than relying on num_plots.
Fixed show_confusion_matrix axes reference conflict. The loop variable ax used in the subplots branch shadowed the user-supplied ax parameter. The user-supplied value is now stashed as _user_ax before the loop runs to prevent clobbering.
Fixed show_confusion_matrix ignoring model_threshold when passed as a list and X is provided. Previously the full list was forwarded to get_predictions which expects a scalar or dict, causing incorrect threshold application. The list is now indexed per model (model_threshold[idx]) in the X-provided branch.
Fixed show_confusion_matrix raising TypeError: 'float' object is not subscriptable when model_threshold was passed as a scalar float in the X is None branch. The branch now checks isinstance before indexing, correctly handling scalar, list, and dict threshold inputs.
Version 0.0.5a8
New Features
Added ax=None parameter to seven single-plot functions: show_roc_curve, show_pr_curve, show_confusion_matrix, show_calibration_curve, show_lift_chart, show_gain_chart, and plot_threshold_metrics. When a pre-created matplotlib.axes.Axes object is supplied, each function draws onto that axes and suppresses its internal plt.show() and save_plot_images calls. All existing call signatures remain fully backward-compatible; the default ax=None preserves prior behavior exactly.
Added combine_plots function that assembles multiple single-plot function calls into a shared subplot figure. Accepts a list of (func, kwargs) tuples, pre-allocates one axes per panel, and passes each axes to the corresponding function via the ax parameter. Supports configurable grid layout (n_cols, n_rows), custom figsize, suptitle, tight_layout, and full save_plot_images integration. Unused trailing panels are hidden automatically. Panels that raise exceptions render an inline error message rather than aborting the figure.
Added tick_fontsize parameter to combine_plots. Both label_fontsize and tick_fontsize are now automatically injected into each panel function via inspect.signature, giving uniform typography across the grid without per-panel repetition. Per-panel overrides in plot_calls always take precedence.
Added hspace and wspace parameters to combine_plots for explicit row and column spacing control. When either is supplied, constrained_layout is activated and spacing is applied via fig.get_layout_engine().set() after tight_layout, correctly handling panels with legend_loc="bottom" that place legends outside the axes bbox.
Added height_ratios parameter to combine_plots, passed through gridspec_kw, allowing individual rows to be resized independently. Useful when mixing plot types of different natural heights (e.g., confusion matrices alongside full-height curve panels).
Fixed the overlay path in show_roc_curve, show_pr_curve, show_lift_chart, show_gain_chart, show_calibration_curve, and plot_threshold_metrics. Previously, overlay=True called plt.figure() unconditionally, creating a standalone figure instead of drawing onto the supplied ax. All six functions now check ax is not None before creating a figure, correctly routing overlay draws into combine_plots panel grids.
Bug Fixes
Fixed combine_plots axes flattening logic. The previous if num_plots == 1 guard incorrectly wrapped a numpy ndarray of axes in a list when plt.subplots(1, N) was called with N > 1 and only one plot call was provided, causing an AttributeError: 'numpy.ndarray' object has no attribute 'plot'. The fix inspects the axes object directly using hasattr rather than relying on num_plots.
Fixed show_confusion_matrix axes reference conflict. The loop variable ax used in the subplots branch shadowed the user-supplied ax parameter. The user-supplied value is now stashed as _user_ax before the loop runs to prevent clobbering.
Fixed show_confusion_matrix ignoring model_threshold when passed as a list and X is provided. Previously the full list was forwarded to get_predictions which expects a scalar or dict, causing incorrect threshold application. The list is now indexed per model (model_threshold[idx]) in the X-provided branch.
Fixed show_confusion_matrix raising TypeError: 'float' object is not subscriptable when model_threshold was passed as a scalar float in the X is None branch. The branch now checks isinstance before indexing, correctly handling scalar, list, and dict threshold inputs.
Version 0.0.5a7
Summary
This version drops support for Python 3.7.4 and sets the minimum required Python version to 3.8. Python 3.7 reached end-of-life in June 2023 and is no longer supported by the library. Users on Python 3.7.x must upgrade before installing this version.
This version also delivers four major workstreams across the library: a
ground-up hardening of ModelCalculator, full multi-model support for
plot_threshold_metrics, a new image_filename parameter across all
eight plotting functions, and Python 3.8 compatibility restoration.
1. ModelCalculator
_extract_final_model
The original branching logic was ambiguous and failed silently for several real-world model wrapper patterns. The method now resolves wrappers in a strict, documented priority order:
Plain dict with a "model" key (e.g. model_tuner pkl format {"model": <Model>}), unwrapped first before any other check.
sklearn.pipeline.Pipeline via hasattr(model, "steps"), extracting the last step.
Objects with an estimator attribute (e.g. model_tuner Model objects wrapping a CalibratedClassifierCV).
Objects with a model attribute (e.g. custom wrapper classes).
Standalone sklearn-compatible objects with predict, predict_proba, or decision_function.
The dict unwrap path was entirely absent before this change, causing
AttributeError: 'dict' object has no attribute 'predict' when loading
pkl files saved in model_tuner format.
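The priority order above can be sketched roughly as follows. This is an illustration, not the method's real body: the stand-in classes are hypothetical, and the ValueError raised when nothing matches is an assumption.

```python
def extract_final_model(model):
    """Unwrap common wrapper patterns in a fixed priority order (sketch)."""
    # 1. A plain dict with a "model" key is unwrapped first, then re-examined.
    if isinstance(model, dict) and "model" in model:
        model = model["model"]
    # 2. Pipeline-like objects expose `steps`; take the last step's estimator.
    if hasattr(model, "steps"):
        return model.steps[-1][1]
    # 3. Wrappers exposing an `estimator` attribute.
    if hasattr(model, "estimator"):
        return model.estimator
    # 4. Wrappers exposing a `model` attribute.
    if hasattr(model, "model"):
        return model.model
    # 5. Standalone estimators with at least one prediction method.
    methods = ("predict", "predict_proba", "decision_function")
    if any(hasattr(model, name) for name in methods):
        return model
    # Assumption: unrecognised objects fail loudly rather than silently.
    raise ValueError(f"Cannot extract an estimator from {type(model).__name__}")


class FakeEstimator:  # hypothetical stand-in for a fitted estimator
    def predict(self, X):
        return [0 for _ in X]


class FakePipeline:  # hypothetical stand-in for sklearn.pipeline.Pipeline
    def __init__(self, final):
        self.steps = [("scaler", object()), ("clf", final)]


class FakeWrapper:  # hypothetical custom wrapper with a `model` attribute
    def __init__(self, inner):
        self.model = inner


est = FakeEstimator()
print(extract_final_model({"model": FakePipeline(est)}) is est)
```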
generate_predictions
The prediction block previously called model.predict() and
model.predict_proba() directly on the raw object retrieved from
model_dict, bypassing _extract_final_model. This caused the same
AttributeError on dict-wrapped models. The block now routes through
the already-resolved estimator variable for all prediction calls,
while retaining the model.threshold check on the original object since
that attribute lives on the model_tuner wrapper, not the inner estimator.
_add_metrics
Replaced type(y_test_m) == pd.DataFrame with
isinstance(y_test_m, pd.DataFrame) per Python best practices. Also
corrected squeeze(axis=0) to squeeze(axis=1), which is the correct
axis for collapsing a single-column DataFrame into a Series.
_get_shap_explainer (new helper)
Replaced the generic shap.Explainer auto-detection call, which fired
internal probe warnings on every invocation, with an explicit helper that
selects the correct explainer class based on model attributes:
shap.TreeExplainer for tree models (tree_ or estimators_).
shap.LinearExplainer for linear models (coef_).
shap.KernelExplainer via _make_predict_proba_wrapper for everything else.
A guard at the top raises ValueError immediately for models without
predict_proba so unsupported models fail with a clear message rather
than crashing inside SHAP internals.
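A rough sketch of the attribute-based selection described above. To stay self-contained it returns a label instead of constructing a SHAP explainer; the function name choose_explainer_kind and the error message wording are assumptions.

```python
from types import SimpleNamespace


def choose_explainer_kind(model) -> str:
    """Pick an explainer family from model attributes (illustrative only)."""
    # Guard first: models without predict_proba fail with a clear message.
    if not hasattr(model, "predict_proba"):
        raise ValueError(
            f"{type(model).__name__} has no predict_proba; SHAP is unsupported."
        )
    # Tree models expose tree_ (single tree) or estimators_ (ensembles).
    if hasattr(model, "tree_") or hasattr(model, "estimators_"):
        return "tree"
    # Linear models expose coef_.
    if hasattr(model, "coef_"):
        return "linear"
    # Everything else falls back to the kernel explainer.
    return "kernel"


# Hypothetical stand-ins that carry only the attributes being inspected.
forest = SimpleNamespace(predict_proba=lambda X: X, estimators_=[])
logit = SimpleNamespace(predict_proba=lambda X: X, coef_=[[0.1, 0.2]])
other = SimpleNamespace(predict_proba=lambda X: X)

print(choose_explainer_kind(forest), choose_explainer_kind(logit), choose_explainer_kind(other))
```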
_make_predict_proba_wrapper (new helper)
KernelExplainer internally converts DataFrames to numpy before calling
the model function, which caused StandardScaler (fitted with named
columns) to emit UserWarning: X does not have valid feature names on
every SHAP call. The new wrapper re-attaches the original column names
before passing the array to predict_proba, eliminating the warning at
the source rather than suppressing it.
_calculate_shap_values
Global SHAP previously iterated row-by-row via itertuples, calling
the explainer once per sample. This has been replaced with a single
batched explainer(X_transformed) call, which is orders of magnitude
faster on datasets of any meaningful size.
Fixed the multi-class SHAP averaging from
.mean(axis=0).mean(axis=1) to .mean(axis=2).mean(axis=0), which
is the correct reduction order for a
(n_samples, n_features, n_classes) tensor.
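As a quick shape check, assuming NumPy is available and using a made-up array in place of real SHAP output: averaging over the class axis and then the sample axis leaves one value per feature.

```python
import numpy as np

# Made-up values standing in for SHAP output with shape
# (n_samples, n_features, n_classes) = (4, 3, 2).
shap_values = np.arange(24, dtype=float).reshape(4, 3, 2)

# Average over classes (axis=2), then over samples (axis=0).
per_feature = shap_values.mean(axis=2).mean(axis=0)

print(per_feature.shape)
```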
Added pipeline unwrapping at the top of the method so the method handles dict/Pipeline-wrapped models passed directly rather than only pre-unwrapped estimators.
The include_contributions row-wise path now consistently returns top-N
{feature: shap_value} dicts sorted by absolute value, matching the
coefficient path. Previously it returned a flat dict over all features
regardless of top_n.
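The top-N selection can be pictured with a small sketch like the one below. The helper name and the sample feature names are hypothetical; only the sort-by-absolute-value and truncate-to-top_n behavior comes from the description above.

```python
from typing import Dict, Sequence


def top_n_contributions(
    feature_names: Sequence[str],
    shap_row: Sequence[float],
    top_n: int,
) -> Dict[str, float]:
    """Keep the top_n features of one row, ranked by absolute SHAP value."""
    ranked = sorted(
        zip(feature_names, shap_row),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    # dict preserves insertion order, so the ranking survives in the result.
    return dict(ranked[:top_n])


row = top_n_contributions(["age", "bmi", "bp"], [0.1, -0.7, 0.3], top_n=2)
print(row)
```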
_calculate_coefficients
Added a secondary Pipeline unwrap after _extract_final_model for
cases where the extracted model is itself a Pipeline (e.g. model_tuner
objects where .estimator is a CalibratedClassifierCV wrapping a
Pipeline). Without this, coef_ lookup failed on the Pipeline
object rather than its final step.
The include_contributions=False path now returns top-N feature-name
lists rather than dicts, making it symmetric with the SHAP path and
ensuring subset_results column content is consistent regardless of
which explainability method is used.
Note
This is a behavior change for any code consuming the default output of
_calculate_coefficients as dicts. Pass
include_contributions=True to restore the previous dict output.
2. plot_threshold_metrics
Multi-model support
The function previously accepted only a single model or y_prob array.
It now accepts lists of models, y_prob arrays, and thresholds and
supports three display modes:
Single: one plot per model, unchanged from previous behavior.
Overlay: all models on a single shared axes, with curve labels prefixed by model name for disambiguation.
Subplots: one subplot per model in an auto-sized grid.
New parameters: model_title, overlay, subplots, n_cols,
n_rows, suptitle, suptitle_y, and model_threshold (now
accepts a list).
Model title defaulting
When model_title=None and model objects are provided, titles now
default to the model class name via extract_model_name() rather than
generic “Model 1”, “Model 2” index labels. When only y_prob arrays
are provided the index fallback is retained since there is nothing to
extract a name from.
n_rows / n_cols auto-derivation
When n_rows is explicitly provided but n_cols is left at its
default of 2, n_cols is now automatically derived as
ceil(num_models / n_rows). Previously specifying n_rows=1 with 3
models still produced a 2-column grid because n_cols was never
recalculated.
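A small sketch of the derivation, under the assumption that "left at its default" is detected by comparing n_cols against 2. How the function actually detects this, and how it fills in n_rows when it is omitted, are not stated here, so both are illustrative guesses.

```python
import math
from typing import Optional, Tuple


def derive_grid(
    num_models: int,
    n_rows: Optional[int] = None,
    n_cols: int = 2,
) -> Tuple[int, int]:
    """Return (n_rows, n_cols) for a subplot grid (hypothetical helper)."""
    if n_rows is not None and n_cols == 2:
        # n_rows given, n_cols at its default: derive the column count.
        n_cols = math.ceil(num_models / n_rows)
    elif n_rows is None:
        # Assumption: with no n_rows, fill rows from the column count.
        n_rows = math.ceil(num_models / n_cols)
    return n_rows, n_cols


# Three models on one row now yield a 1 x 3 grid.
print(derive_grid(3, n_rows=1))
```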
y_test flattening
precision_recall_curve and roc_curve inside _plot_single now
receive np.asarray(y_test).ravel() to prevent
ValueError: Found input variables with inconsistent numbers of samples
when y_test is a single-column DataFrame loaded from parquet.
suptitle / title separation
suptitle controls the overall figure heading above all subplots.
title controls per-subplot headings. The two can be set
independently: passing title="" suppresses per-subplot titles while
still showing the suptitle, and vice versa. Previously there was no way
to have both levels of titling independently.
3. image_filename Save Integration
Updated save_plot_images
Added three new parameters: image_filename, fig, and dpi.
Saving is now triggered when either save_plot=True or
image_filename is provided, so callers no longer need to set
save_plot=True just to use a custom filename. When image_filename
is provided it takes precedence over the auto-generated filename. The
function now calls fig.savefig() targeting the correct figure object
rather than plt.savefig(), which targeted whatever the current active
figure happened to be at call time.
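The trigger and precedence rules can be summarized in a short sketch. The helper name resolve_save_filename and the auto_filename argument are hypothetical; the real save_plot_images also handles paths, formats, fig, and dpi, which are omitted here.

```python
from typing import Optional


def resolve_save_filename(
    save_plot: bool,
    image_filename: Optional[str],
    auto_filename: str,
) -> Optional[str]:
    """Return the filename to save under, or None when nothing is saved."""
    # Saving triggers when either save_plot is set or a filename is supplied.
    if not (save_plot or image_filename):
        return None
    # A caller-supplied filename takes precedence over the generated one.
    return image_filename if image_filename else auto_filename


print(resolve_save_filename(False, "roc_final", "roc_curve_overlay"))
print(resolve_save_filename(True, None, "roc_curve_overlay"))
```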
All eight plotting functions updated
image_filename=None added to the signature immediately after
image_path_svg and threaded through every save_plot_images call
site across show_confusion_matrix, show_roc_curve,
show_pr_curve, show_lift_chart, show_gain_chart,
show_calibration_curve, plot_threshold_metrics, and
show_residual_diagnostics (22 call sites in total).
The if save_plot: guards that previously wrapped save_plot_images
calls in show_calibration_curve (group path) and
show_residual_diagnostics (both save sites) have been removed since
save_plot_images now handles the trigger logic internally.
All eight function docstrings updated to document image_filename
immediately after image_path_svg.
4. Python 3.8 Compatibility and Version Floor Change
The minimum supported Python version has been raised from 3.7.4 to
3.8. This aligns with the broader Python ecosystem where 3.7 has been
end-of-life since June 2023 and several upstream dependencies no longer
ship 3.7-compatible wheels.
fastparquet replaced with pyarrow
fastparquet fails to build on Python 3.8 due to a Cython ndarray
type identifier incompatibility that affects all released versions. The
dependency has been replaced with pyarrow>=11.0.0,<=14.0.2, which is
the last release with official Python 3.8 wheels. pd.read_parquet()
uses pyarrow automatically so no code changes were required.
pip upgrade required for Python 3.8 environment
The venv_3_8 environment shipped with pip 19.2.3 (2019), which
cannot parse modern pyproject.toml files used by packages like
ninja and scipy. This caused cascading build failures for
scikit-learn and pyarrow. Running
pip install --upgrade pip before
pip install -r requirements.txt unblocks the full install.
typing imports added to plot_utils.py
Optional, List, Dict, Union, and Tuple were used in
type annotations but not imported. Python 3.10+ allows X | None union
syntax without importing from typing but Python 3.8 requires the
explicit import. Added
from typing import Optional, List, Dict, Union, Tuple to resolve
NameError: name 'Optional' is not defined on import.
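For illustration, an annotation written in the Python 3.8-compatible style with the explicit import. The function format_limits is a made-up example, not part of plot_utils.py.

```python
from typing import Optional, Tuple


def format_limits(xlim: Optional[Tuple[float, float]] = None) -> str:
    """Describe an axis range; Optional[...] replaces the 3.10+ `X | None` form."""
    if xlim is None:
        return "auto"
    return f"{xlim[0]} to {xlim[1]}"


print(format_limits())
print(format_limits((0.0, 1.0)))
```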
5. Test Suite
pytest collection conflict resolved
Both py_scripts/test_model_calculator.py and
unittests/test_model_calculator.py share the same basename, causing
pytest to fail at collection with an import file mismatch error.
Fixed by adding __init__.py to both py_scripts/ and
unittests/ and clearing stale __pycache__ artifacts.
test_model_calculator fixes
test_calculate_coefficients: updated assertion from
isinstance(row, dict) to isinstance(row, list) since the default
path now returns feature-name lists, not dicts.
test_extract_final_model_wrapped_model: resolved by adding the
hasattr(model, "model") branch to _extract_final_model. No test
change needed.
test_calculate_shap_unsupported_model: resolved by adding the
predict_proba guard at the top of _get_shap_explainer so
unsupported models raise ValueError before reaching
KernelExplainer.
test_calculate_shap_unexpected_shape and
test_rowwise_shap_output_unexpected_type: both monkeypatched
shap.Explainer which is no longer called. Updated to monkeypatch
ModelCalculator._get_shap_explainer instead.
test_model_evaluator fixes
test_plot_threshold_metrics_with_lookup and
test_plot_threshold_metrics_all_lookup_metrics: both asserted
"Best threshold" in captured.out. The print format now prefixes the
model name (e.g. "LogisticRegression -- best threshold for...").
Updated assertions to "best threshold" in captured.out which matches
both the old and new format.
Final result: 204 collected, 204 passed. Coverage: metrics_utils 79%,
model_calculator 87%, model_evaluator 83%,
partial_dependence 89%, plot_utils 70%.
6. Test and Usage Scripts
test_model_calculator.py (py_scripts)
Standalone .py equivalent of the test notebook with section headers,
coloured PASS/FAIL labels, and .to_string() DataFrame output for
clean terminal readability. Paths are anchored to __file__ via
SCRIPT_DIR and PROJECT_ROOT so the script runs correctly
regardless of which directory python is invoked from.
test_model_calculator.ipynb
Updated to use load_breast_cancer() (30 real named features) instead
of make_classification() (generic feature_0..9 labels) so SHAP
and coefficient outputs show meaningful feature names. Added
max_iter=10000 to LogisticRegression to suppress the convergence
warning that cluttered notebook output.
Version 0.0.5a6
Bug Fixes
Fixed interactive Plotly plot not rendering in Jupyter notebooks unless save_plots was set. Display and saving are now fully decoupled; the plot always renders regardless of save_plots.
Removed duplicate HTML save block that existed inside the static plot section.
New Features
Added x_label_map and y_label_map parameters for mapping raw axis values to human-readable tick labels; useful for encoded or numeric categorical features.
Added modebar_image_format parameter ("png", "svg", "jpeg", "webp") to control the download format of the Plotly modebar camera button. Defaults to "png".
Improvements
Docstring updated to document x_label_map, y_label_map, and modebar_image_format.
Raises section expanded to cover all ValueError conditions, including invalid save_plots, missing image paths, missing HTML paths, invalid plot_type, and invalid modebar_image_format.
Update to plot_3d_pdp docstring
Adds full categorical feature support to plot_3d_pdp while preserving backward compatibility with numeric grids. The function now renders categorical axes correctly in both Matplotlib and Plotly by mapping categories to numeric surface positions and overlaying labels. Custom label mapping is supported for cleaner presentation. Interactive hover and axis ticks now display true category names. HTML export logic was refactored to prevent duplicate writes and ensure reliable saving across plot modes.
Version 0.0.5a5
Update to plot_3d_pdp docstring
Adds full categorical feature support to plot_3d_pdp while preserving backward compatibility with numeric grids. The function now renders categorical axes correctly in both Matplotlib and Plotly by mapping categories to numeric surface positions and overlaying labels. Custom label mapping is supported for cleaner presentation. Interactive hover and axis ticks now display true category names. HTML export logic was refactored to prevent duplicate writes and ensure reliable saving across plot modes.
Version 0.0.5a4
Adds full categorical feature support to plot_3d_pdp while preserving backward compatibility with numeric grids. The function now renders categorical axes correctly in both Matplotlib and Plotly by mapping categories to numeric surface positions and overlaying labels. Custom label mapping is supported for cleaner presentation. Interactive hover and axis ticks now display true category names. HTML export logic was refactored to prevent duplicate writes and ensure reliable saving across plot modes.
Version 0.0.5a3
Major Features Added
Axis Limits: Added xlim and ylim parameters for standardizing axis ranges across multiple model comparisons
Bottom Legend Support: Automatic figure height adjustment when legend_loc="bottom" to prevent legend overlap with x-axis labels
Multi-Model Layout: Improved default layout for multiple models with plot_type="all" - now arranges as one row per model (6 columns × N rows) instead of mixed layout
Bug Fixes
Heteroskedasticity Tests: Fixed categorical variable handling in all tests (Breusch-Pagan, White, Goldfeld-Quandt) by encoding categorical columns before running tests
Legend Formatting:
Fixed legend duplication bug in Scale-Location plot (removed test name prepending since interpretations already contain test names)
Fixed histogram legend to use apply_legend() for consistent formatting
Fixed legend kwargs not being properly passed to Scale-Location plot
Group Category Handling: Fixed KeyError when group_category column not in X DataFrame by properly checking column existence before filtering predictor columns
Index Alignment: Fixed AssertionError when using external group_category array by ensuring Series index matches X.index
Python 3.8 Compatibility: Fixed LaTeX rendering error in Scale-Location y-axis label by replacing \text{} with \mathrm{} for matplotlib <3.3 compatibility
Enhancements
Scale-Location Y-Axis: Changed to LaTeX notation r"$\sqrt{|\text{Std. Residuals}|}$" for better readability
Histogram Overlay: Removed normal distribution overlay from histogram_type="frequency" for cleaner, simpler visualization (overlay still present for histogram_type="density")
Text Wrapping: Added text_wrap parameter support to all subplot titles (previously only worked for suptitles)
Helper Functions:
Created apply_axis_limits() helper in plot_utils.py
Enhanced apply_legend() to handle bottom legend resizing with flag-based prevention of multiple resizes
Refactoring: Refactored plot_threshold_metrics() to use apply_plot_title() and apply_legend() helpers for consistency
Version 0.0.5a2
Testing Improvements
Added 150+ comprehensive unit tests covering:
Edge cases and error handling
Parameter validation
Different input pathways (model vs y_prob vs y_pred)
Group category functionality
Styling and customization options
Integration scenarios
Code Quality
Refactored monolithic 3866-line model_metrics.py into modular components:
model_evaluator.py - Main plotting and evaluation functions
metrics_utils.py - Utility functions and calculations
plot_utils.py - Plotting helper functions
Improved code maintainability and organization
Enhanced error messages and validation
Operating point visualization with two methods:
operating_point_method='youden' - Youden’s J statistic
operating_point_method='closest_topleft' - Closest to top-left corner
DeLong test support for AUC comparison between models via delong parameter
Legend ordering - Proper organization: AUC curves → Random Guess → Operating Points
Custom operating point styling via operating_point_kwgs
Residual Diagnostics Expansion
New plot types:
'influence' - Influence plot with Cook’s distance bubbles
'predictors' - Individual residual plots for each predictor
Heteroskedasticity testing with multiple methods:
'breusch_pagan' - Breusch-Pagan test
'white' - White’s test
'goldfeld_quandt' - Goldfeld-Quandt test
'spearman' - Spearman rank correlation
'all' - Run all tests
LOWESS smoothing via show_lowess parameter
Centroid visualization with two modes:
User-defined groups via group_category
Automatic K-means clustering via n_clusters
Histogram types:
histogram_type='frequency' - Raw counts (default)
histogram_type='density' - Probability density with normal overlay
Diagnostics table - Comprehensive model diagnostics via show_diagnostics_table
Return diagnostics - Programmatic access via return_diagnostics=True
Group Category Support
All classification plots now support group_category parameter:
ROC curves with per-group AUC and counts
PR curves with per-group metrics
Calibration curves with per-group calibration
Residual diagnostics support group visualization with centroids
Summary performance supports grouped classification metrics
Gain Chart Enhancement
Gini coefficient calculation and display via show_gini parameter
Custom decimal places for Gini via decimal_places parameter
Legend Customization
Legend location now supports:
Standard matplotlib locations (‘best’, ‘upper right’, etc.)
'bottom'- Places legend below plot (perfect for group categories)
Automatic legend ordering for better readability
summarize_model_performance
Added include_adjusted_r2 for regression models
Added group_category for grouped classification metrics
Added overall_only for regression to show only aggregate metrics
Improved coefficient ordering (intercept first)
Better handling of feature importances for tree-based models
show_confusion_matrix
Added show_colorbar parameter (default: False)
Added labels parameter to toggle TN/FP/FN/TP labels
Improved font size controls (inner_fontsize, label_fontsize, tick_fontsize)
show_roc_curve
Added show_operating_point and operating_point_method
Added operating_point_kwgs for custom styling
Added delong parameter for AUC comparison
Added group_category for stratified analysis
Added legend_loc parameter
show_pr_curve
Added legend_metric parameter (‘ap’ or ‘aucpr’)
Added group_category for stratified analysis
Added legend_loc parameter
show_calibration_curve
Added show_brier_score parameter (default: True)
Added brier_decimals for formatting
Added group_category for stratified analysis
Added legend_loc parameter
show_gain_chart
Added show_gini parameter (default: False)
Added decimal_places for Gini formatting
plot_threshold_metrics
Added lookup_metric and lookup_value for threshold optimization
Added model_threshold to highlight specific thresholds
Added baseline_thresh to toggle baseline line
Added custom styling: curve_kwgs, baseline_kwgs, threshold_kwgs, lookup_kwgs
show_residual_diagnostics
Added plot_type options: ‘all’, ‘fitted’, ‘qq’, ‘scale_location’, ‘leverage’, ‘influence’, ‘histogram’, ‘predictors’
Added heteroskedasticity_test with multiple test options
Added show_lowess for trend lines
Added lowess_kwgs for LOWESS styling
Added group_category for stratified analysis
Added group_kwgs for custom group styling
Added show_centroids and centroid_kwgs
Added centroid_type (‘clusters’ or ‘groups’)
Added n_clusters for automatic clustering
Added histogram_type (‘frequency’ or ‘density’)
Added show_diagnostics_table and return_diagnostics
Added show_plots to disable plotting
Added show_outliers and n_outliers for labeling
Added legend_loc parameter
Added legend_kwgs to control legend display for groups, centroids, clusters, and het_tests
Added kmeans_rstate for reproducible clustering
Added n_cols and n_rows for custom subplot layouts
Added point_kwgs for scatter point styling (supports edgecolor, linewidth, etc.)
Bug Fixes
Fixed confusion matrix colorbar removal when show_colorbar=False
Fixed duplicate text handling in confusion matrix displays
Fixed legend placement for grouped visualizations
Fixed text wrapping for long titles
Fixed LOWESS exception handling (now fails gracefully)
Fixed feature importance display for tree-based models
Fixed coefficient ordering in regression output
Fixed empty metric columns in regression feature importance rows
Documentation Improvements
Comprehensive docstrings for all major functions
Parameter descriptions with examples
Error message improvements for better debugging
Type hints and validation error messages
Usage examples in docstrings
Testing
Test suite expanded from ~50 tests to 152 tests
Coverage increased from 50% to 86% on core modules
All edge cases and error conditions tested
Integration tests for real-world workflows
Parametrized tests for systematic coverage
Performance
No performance regressions
Modular code structure improves maintainability
Efficient calculation caching where applicable
Migration Guide
From 0.0.5a1 to 0.0.5a2:
No changes required - all existing code will work as before. New features are opt-in:
Version 0.0.5a1
Added
Operating Point Visualization for ROC Curves: Added show_operating_point parameter to display optimal classification thresholds on ROC curves with two methods:
youden: Youden’s J statistic (maximizes TPR - FPR)
closest_topleft: Point closest to top-left corner (minimizes distance to perfect classifier)
Configurable via operating_point_method and operating_point_kwgs parameters
Operating points display threshold values in legends and appear as markers on curves
Gini Coefficient for Gain Charts: Added automatic calculation and display of Gini coefficient in show_gain_chart()
Prints Gini coefficient for each model (default: 3 decimal places)
Displays in legend labels across all plot modes (overlay, subplots, single)
Configurable via show_gini and decimal_places parameters
Legend Location Control: Added legend_loc parameter to all plotting functions for flexible legend positioning
Supports standard matplotlib locations ('lower right', 'upper left', 'best', etc.)
Special ‘bottom’ option places legend below plot with proper spacing
Available in: show_roc_curve(), show_pr_curve(), show_calibration_curve(), show_lift_chart(), show_gain_chart()
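The two operating point methods listed above can be sketched in plain Python as below. The function names and the sample ROC points are made up for illustration, and the library's own selection code may differ in details such as tie-breaking.

```python
import math
from typing import Sequence


def youden_index(fpr: Sequence[float], tpr: Sequence[float]) -> int:
    """Index of the ROC point that maximizes Youden's J = TPR - FPR."""
    return max(range(len(fpr)), key=lambda i: tpr[i] - fpr[i])


def closest_topleft_index(fpr: Sequence[float], tpr: Sequence[float]) -> int:
    """Index of the ROC point nearest the perfect classifier at (0, 1)."""
    return min(range(len(fpr)), key=lambda i: math.hypot(fpr[i], 1.0 - tpr[i]))


# Made-up ROC points on which the two criteria disagree.
fpr = [0.0, 0.0, 0.2, 1.0]
tpr = [0.0, 0.62, 0.8, 1.0]
thresholds = [1.0, 0.8, 0.5, 0.0]

youden_threshold = thresholds[youden_index(fpr, tpr)]
topleft_threshold = thresholds[closest_topleft_index(fpr, tpr)]
print(youden_threshold, topleft_threshold)
```

On these points Youden's J favors the zero-FPR point, while the distance criterion favors the more balanced one, which is why the method is exposed as a choice.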
Improved
Legend Ordering for ROC Curves: Standardized legend entry order across all plot modes
Order: Model curves with AUC → Random Guess baseline → Operating points
Ensures consistent, intuitive legend presentation
Overlay Mode for ROC Curves: Enhanced operating point display in overlay plots
Combined AUC and operating point threshold in single legend entry
Format: “Model Name (AUC = 0.XX, Op = 0.XX)”
Operating point markers appear on curves without duplicate legend entries
Technical Details
Operating points calculated post-ROC curve generation using optimal threshold selection
Gini coefficient derived from area under gain curve: Gini = 2 × AUGC - 1
Legend positioning uses bbox_to_anchor for 'bottom' placement with dynamic spacing
All changes maintain backward compatibility with existing code
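A worked sketch of the Gini = 2 × AUGC - 1 relation, assuming the area under the gain curve is integrated with the trapezoidal rule over population share. The function name and the sample curve are illustrative; the integration method used in show_gain_chart() is not stated here.

```python
from typing import Sequence


def gini_from_gain_curve(
    population_share: Sequence[float],
    cumulative_gain: Sequence[float],
) -> float:
    """Gini = 2 * AUGC - 1, with AUGC computed by the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(population_share)):
        width = population_share[i] - population_share[i - 1]
        mean_height = (cumulative_gain[i] + cumulative_gain[i - 1]) / 2.0
        area += width * mean_height
    return 2.0 * area - 1.0


# A diagonal gain curve (random ranking) has AUGC = 0.5, so Gini = 0.
random_gini = gini_from_gain_curve([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
# A curve that captures 80% of positives in the top half of the population.
model_gini = gini_from_gain_curve([0.0, 0.5, 1.0], [0.0, 0.8, 1.0])
print(round(random_gini, 3), round(model_gini, 3))
```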
Version 0.0.4a10
Refactored and stabilized the summarize_model_performance function to improve consistency across classification and regression workflows while preserving the exact formatting logic for printed outputs and regression coefficient display.
Changes
Consolidated redundant metric computation into dedicated helper functions for classification and regression metrics.
Ensured regression coefficients, intercepts, and feature importances are retained and ordered correctly in the final DataFrame output.
Fixed grouped classification output so Model Threshold always appears last, and group headers correctly reflect category names.
Added conditional handling for grouped classification to prevent KeyError when the "Model" column is absent.
Preserved the original manual formatting block to maintain Leon’s custom printing logic for both classification and regression:
Right-aligned all table columns for readability.
Retained separator-based visual formatting and model-wise breaks.
Preserved coefficient and intercept reporting behavior exactly as before, ensuring regression results remain interpretable and consistent.
Impact
Classification and regression now produce stable, well-ordered, and readable summaries.
Grouped and non-grouped runs behave consistently without disrupting regression coefficient output.
Backward compatibility with previous console and DataFrame output formats maintained.
Version 0.0.4a9
This release introduces a new parameter, brier_decimals, to the show_calibration_curve() function, allowing users to control the number of decimal places displayed for the Brier score.
Changes Made
Added brier_decimals parameter (default: 3) next to show_brier_score.
Updated Brier score display logic to format using round(brier_score, brier_decimals).
Improved readability and precision consistency across calibration plots.
Impact
No breaking changes.
Users now have finer control over Brier score precision in calibration curve visualizations.
Quick Example
from model_metrics import show_calibration_curve
show_calibration_curve(model, X, y, show_brier_score=True, brier_decimals=4)
Version 0.0.4a8
Summary:
Updated hanley_mcneil_auc_test() function to perform a large-sample z-test for comparing correlated AUCs, based on Hanley & McNeil (1982), an analytical approximation of DeLong’s test.
Key Changes:
Implemented hanley_mcneil_auc_test() with parameters:
y_true, y_scores_1, y_scores_2 for AUC comparison.
Optional model_names, verbose, and return_values arguments for flexible use.
Added formatted, human-readable print output (when verbose=True).
Enabled optional programmatic access with return_values=True.
Adopted NumPy-style docstring for clarity and consistency.
Integrated helper into show_roc_curve() to enable AUC significance testing when the delong argument is provided.
Notes: This helper can also be used as a standalone function for independent AUC comparison between two models, outside of visualization workflows.
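For orientation, a sketch of the textbook Hanley & McNeil (1982) standard error and a large-sample z-test built on it. This is not the body of hanley_mcneil_auc_test(): the function names, the signature taking precomputed AUCs and class counts, and the correlation argument r (defaulting to 0, which treats the two AUCs as independent) are all assumptions for illustration.

```python
import math
from typing import Tuple


def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Standard error of an AUC per Hanley & McNeil (1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def auc_z_test(
    auc_1: float, auc_2: float, n_pos: int, n_neg: int, r: float = 0.0
) -> Tuple[float, float]:
    """Return (z, two-sided p) for the difference between two AUCs.

    r is the assumed correlation between the two AUC estimates; how the
    library estimates it is not covered here.
    """
    se_1 = hanley_mcneil_se(auc_1, n_pos, n_neg)
    se_2 = hanley_mcneil_se(auc_2, n_pos, n_neg)
    se_diff = math.sqrt(se_1 ** 2 + se_2 ** 2 - 2.0 * r * se_1 * se_2)
    z = (auc_1 - auc_2) / se_diff
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


z, p = auc_z_test(0.85, 0.80, n_pos=120, n_neg=380)
print(f"z = {z:.3f}, p = {p:.3f}")
```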
Version 0.0.4a7
DeLong’s test (Hanley & McNeil approximation)
Implemented a new helper function hanley_mcneil_auc_test() for approximate DeLong’s AUC comparison.
Integrated the helper inside show_roc_curve() to optionally print AUC differences and p-values between two models.
Added corresponding pytest coverage under test_show_roc_curve_with_delong().
Group category support
Added the group_category input to summarize_model_performance() to generate subgroup-level performance summaries.
Enables stratified metric reporting for fairness or demographic analysis.
Version 0.0.4a6
Reworded the print message inside plot_threshold_metrics() for clarity.
Old:
print(
f"Best threshold for {lookup_metric} = "
f"{round(lookup_value, decimal_places)} is: "
f"{round(best_threshold, decimal_places)}"
)
New:
print(
f"Best threshold for target {lookup_metric} of "
f"{round(lookup_value, decimal_places)} is "
f"{round(best_threshold, decimal_places)}"
)
This removes the equals sign and colon, and adds “target” for a smoother, more descriptive sentence.
Version 0.0.4a5
Added a minimal type check to ensure y_prob is always a list at the start of each affected function:
summarize_model_performance
show_calibration_curve
show_confusion_matrix
show_lift_chart
show_gain_chart
show_roc_curve
show_pr_curve
# Ensure y_prob is always a list of NumPy arrays
if isinstance(y_prob, np.ndarray):
y_prob = [y_prob]
This allows y_prob[0] indexing to work whether the caller provides a single
NumPy array or a list of arrays.
Updated unittests
Version 0.0.4a4
Corrected README to reflect the current version.
Previous release did not update the README properly because the file was not saved before publishing.
No functional changes to the library.
Version 0.0.4a3
Added missing scipy (>=1.8,<=1.14.0) requirement to the README.
Version 0.0.4a2
This version updates pyproject.toml and requirements.txt to restrict SciPy to >=1.8,<=1.14.0.
Prevents installation of scipy==1.14.1+, which removes _lazywhere and breaks statsmodels.
Keeps compatibility with model_tuner and Colab environments.
Bumps package version for release.
Updated scipy dependency to >=1.8,<=1.14.0
Synced requirements.txt with updated constraints
Version 0.0.4a1
Replaced the old grid parameter with subplots across plotting functions for consistency.
Standardized gridline handling by replacing unconditional plt.grid() calls with plt.grid(visible=gridlines)
Why
Aligns function signatures to use subplots consistently instead of grid.
Makes gridline visibility configurable through a single gridlines flag.
Cleaner charts when gridlines=False, no visual change when gridlines=True.
Version 0.0.4a
Summary
Added the ability to pass predicted probabilities (y_prob) directly into
the functions in model_evaluator.py as an alternative to supplying a fitted
model and feature matrix. This flexibility lets end users evaluate results in two ways:
Using a model object with X (current behavior)
Or passing y_prob directly (new option)
Details
Updated all relevant evaluator functions (summarize_model_performance, plot_threshold_metrics, etc.) to accept y_prob as input.
Added input validation: functions now check that either (model and X) or y_prob are provided, not both missing.
Preserved existing model-based workflows for backward compatibility.
Extended unit tests in unittests/ to cover the new probability-based path, including edge cases and validation errors.
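The input validation described in the list above might look roughly like this sketch. The helper name validate_inputs, the error message, and the choice to prefer y_prob when both paths are supplied are assumptions, not the library's documented behavior.

```python
from typing import Any, Optional


def validate_inputs(
    model: Optional[Any] = None,
    X: Optional[Any] = None,
    y_prob: Optional[Any] = None,
) -> str:
    """Return which evaluation path applies, or raise if neither is usable."""
    has_model_path = model is not None and X is not None
    has_prob_path = y_prob is not None
    if not (has_model_path or has_prob_path):
        raise ValueError("Provide either (model and X) or y_prob.")
    # Assumption: precomputed probabilities win when both are supplied.
    return "y_prob" if has_prob_path else "model"


print(validate_inputs(y_prob=[0.2, 0.9]))
print(validate_inputs(model=object(), X=[[1.0], [2.0]]))
```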
Why
End users sometimes already have predicted probabilities from external pipelines or pre-computed experiments. This change avoids forcing them to re-supply the model, streamlining the evaluation process.
Version 0.0.3a
Added "plotly>=5.18.0, <=5.24.1" in pyproject.toml, setup.py, README_min.md for partial_dependence.py functions
Version 0.0.2a
Add show_ks_curve function and enhance summarize_model_performance by @lshpaner in https://github.com/lshpaner/model_metrics/pull/1
Add plot_threshold_metrics Function by @lshpaner in https://github.com/lshpaner/model_metrics/pull/2
Add pr_feature_plot and Update roc_feature_plot for Enhanced Visualization by @lshpaner in https://github.com/lshpaner/model_metrics/pull/3
Reg table enhance by @lshpaner in https://github.com/lshpaner/model_metrics/pull/4
Rmvd (%) from MAPE header by @lshpaner in https://github.com/lshpaner/model_metrics/pull/5
Moved roc legend to lower right default by @lshpaner in https://github.com/lshpaner/model_metrics/pull/6
Allow Flexible Inputs and Save Behavior for show_roc_curve() by @lshpaner in https://github.com/lshpaner/model_metrics/pull/7
Prcurve calc tests by @lshpaner in https://github.com/lshpaner/model_metrics/pull/8
Removed unused imports and functions by @lshpaner in https://github.com/lshpaner/model_metrics/pull/9
changed saving nomenclature in show_confusion_matrix by @lshpaner in https://github.com/lshpaner/model_metrics/pull/10
Fix Calibration Curve Grid Plot Behavior and Update Model Nomenclature by @lshpaner in https://github.com/lshpaner/model_metrics/pull/11
Improved support for multiple models and group categories in calibration curve by @lshpaner in https://github.com/lshpaner/model_metrics/pull/13
Upd. plot_threshold_metrics w/ new lookup_kwgs and legend logic by @lshpaner in https://github.com/lshpaner/model_metrics/pull/14
Rmv. unused arguments by @lshpaner in https://github.com/lshpaner/model_metrics/pull/15
Move PDF-related Functions from eda_toolkit to model_metrics by @lshpaner in https://github.com/lshpaner/model_metrics/pull/16
Full Changelog: https://github.com/lshpaner/model_metrics/compare/0.0.1a…0.0.2a
Version 0.0.1a
Updated unit tests and README
Added statsmodels to library imports
Added coefficients and p-values to regression summary
Added regression capabilities to summarize_model_performance
Added lift and gains charts
Updated versions for earlier Python compatibility