Changelog

Version 0.0.5a12

Three small but useful changes to the crosstab pair this round.

decimal_places parameter on overlap_crosstab

New keyword-only argument controlling how many decimal places the returned cells round to. Defaults to None (no rounding) for backward compatibility. Most useful with normalize=True to control proportion display:

ct = overlap_crosstab(
    y_true=y_test,
    y_prob_a=y_prob_lr,
    y_prob_b=y_prob_rf,
    normalize=True,
    decimal_places=2,
)

Rounding runs after normalize and mask_impossible, so NaN cells from mask_impossible=True are left untouched. Integer counts (default mode without normalize) round to themselves and stay visually unchanged.
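The ordering above can be sketched in plain Python. This is an illustration only, not the library's code: round_cells is a hypothetical helper, and a nested list stands in for the returned DataFrame.

```python
import math

def round_cells(grid, decimal_places=None):
    """Round every non-NaN cell; NaN cells pass through (hypothetical helper)."""
    if decimal_places is None:
        # No rounding requested: return a copy at full precision.
        return [row[:] for row in grid]
    return [
        [
            cell
            if isinstance(cell, float) and math.isnan(cell)
            else round(cell, decimal_places)
            for cell in row
        ]
        for row in grid
    ]

nan = float("nan")
# A 2x2 slice of a crosstab that has already been normalized and masked.
proportions = [[0.61234, nan], [nan, 0.38766]]
rounded = round_cells(proportions, decimal_places=2)

# Integer counts round to themselves.
counts = round_cells([[120, 0], [0, 75]], decimal_places=2)
```

Because rounding is the last step, the masked cells are still NaN afterwards and the proportions keep full precision until they are displayed.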

Independent font-size control in the summary panel

The right-hand swap-summary panel of plot_overlap_crosstab now exposes two separate font-size knobs that were previously coupled:

  • summary_body_fontsize (new): controls the italic body lines beneath each colored headline. Defaults to None, which falls back to summary_fontsize - 1 (the original derived size, so the visual default is unchanged).

  • label_fontsize (existing, expanded): already controlled the matrix row and column headers; now also drives the bold colored summary headlines via label_fontsize + 5. Default label_fontsize=12 gives 17pt headlines, matching the previous summary_fontsize + 6 = 17 exactly.

The two knobs are independent: bumping one does not move the other. The split was driven by the need for larger summary headlines on dense slides without inflating the body sentences.

plot_overlap_crosstab(
    y_true=y_test,
    model_a=lr, model_b=rf, X_a=X_test,
    label_fontsize=18,
    summary_body_fontsize=14,
)
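The fallback rules described above reduce to a few lines of arithmetic. A minimal sketch, assuming a hypothetical helper name (resolve_summary_fontsizes is not part of the library):

```python
def resolve_summary_fontsizes(label_fontsize=12, summary_fontsize=11,
                              summary_body_fontsize=None):
    """Return (headline_size, body_size) per the rules described above."""
    # Headlines follow label_fontsize.
    headline = label_fontsize + 5
    # Body lines use the explicit value, else the original derived size.
    if summary_body_fontsize is None:
        body = summary_fontsize - 1
    else:
        body = summary_body_fontsize
    return headline, body

defaults = resolve_summary_fontsizes()
slide = resolve_summary_fontsizes(label_fontsize=18, summary_body_fontsize=14)
```

With the defaults this gives 17pt headlines and 10pt body lines; the slide settings from the example above give 23pt headlines and 14pt body lines.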

Whitespace removal above the matrix

_draw_crosstab_matrix had two layout issues that conspired to leave 30% or more empty space above the matrix in larger figure sizes:

  1. ylim extended to n + 1.4 when the topmost content (the label_b axis label) only needed about n + 0.75. That left 0.65 units of empty data space inside the axes.

  2. set_aspect("equal") used the default anchor "C", which centered the squared axes box in its subplot cell, sending leftover space upward when matplotlib’s default top margin already biased the layout.

Fixes:

  • Tightened ax.set_ylim from (-0.3, n + 1.4) to (-0.3, n + 0.85).

  • Changed ax.set_aspect("equal") to ax.set_aspect("equal", anchor="N").

Both apply unconditionally to every render mode: table_only, full view, no-summary, no-legend, and combine_plots-nested.
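As a rough check on the ylim part of the fix, the share of the y-range sitting above the topmost content can be worked out directly. The sketch below assumes n = 4 (the four categories) and takes the n + 0.75 content top from the description above; empty_fraction_above is a hypothetical helper.

```python
def empty_fraction_above(n, top_limit, content_top_offset=0.75, bottom=-0.3):
    """Fraction of the y-range that lies above the topmost content."""
    span = (n + top_limit) - bottom
    empty = (n + top_limit) - (n + content_top_offset)
    return empty / span

n = 4  # TP, FP, FN, TN
before = empty_fraction_above(n, top_limit=1.4)   # old ylim: (-0.3, n + 1.4)
after = empty_fraction_above(n, top_limit=0.85)   # new ylim: (-0.3, n + 0.85)
```

Under these assumptions the slack inside the axes drops from roughly 11% of the y-range to roughly 2%. The rest of the reported whitespace is attributed above to the anchor issue, which this arithmetic does not model.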

Tests added

17 new tests across test_metrics_utils.py and test_model_evaluator_addtl.py pinning the changes above. Coverage breakdown:

  • overlap_crosstab decimal_places: 5 tests (rounds with normalize, None preserves precision, NaN cells survive, decimal_places=0 rounds to whole numbers, no-op on integer counts).

  • _draw_crosstab_summary font sizes: 6 tests (default 17pt headlines, explicit label_fontsize scales headlines, body defaults to fontsize-1, explicit body_fontsize wins, two knobs independent, backward-compat positional args).

  • plot_overlap_crosstab integration: 3 tests (summary_body_fontsize accepted and applied, label_fontsize flows into summary headlines).

  • _draw_crosstab_matrix layout regression: 3 tests (ylim tight, anchor is N, end-to-end public-function anchor check).

Version 0.0.5a11

A summary of the overlap_crosstab and plot_overlap_crosstab functions added to the overlap family in this iteration, plus the supporting helpers, the font-handling infrastructure, and small touch-ups to existing overlap functions in metrics_utils.py and model_evaluator.py.

What the module adds

Two sibling functions extending the overlap family to a fourth view of the same comparison. Where overlap_summary gives per-category marginal counts and plot_overlap_venns shows per-category region overlap, the crosstab pair shows the full joint distribution of (model_a category, model_b category) in one 4x4 frame.

  • overlap_crosstab returns a 4x4 pandas DataFrame indexed by {TP, FP, FN, TN} on both axes, counting every observation that lands in each (row category, column category) pair. The diagonal is the agreement count; the off-diagonal valid cells are the TP-swap and FP-swap pairs.

  • plot_overlap_crosstab renders the same crosstab as a styled matplotlib figure with three cell colorings (green for agreement, red for swaps, blue-gray for the eight structurally impossible cells), an optional right-hand summary panel surfacing the derived swap story, and an optional legend strip beneath the matrix.

Both share the same input-parameter block as the rest of the overlap family, so the same prediction inputs that drive a Venn or a summary table also drive the crosstab without modification.

Public functions

overlap_crosstab

overlap_crosstab(
    y_true,
    y_pred_a=None,
    y_pred_b=None,
    *,
    y_prob_a=None,
    y_prob_b=None,
    model_a=None,
    model_b=None,
    X_a=None,
    X_b=None,
    threshold_a=None,
    threshold_b=None,
    score_a=None,
    score_b=None,
    label_a="Model A",
    label_b="Model B",
    normalize=False,
    mask_impossible=False,
    verbose=False,
)

Returns a 4x4 DataFrame indexed by {TP, FP, FN, TN} with the same four columns. verbose=True prints the cell-meaning legend plus the derived swap summary before returning. normalize=True divides every cell by the total observation count. mask_impossible=True sets the eight structurally impossible cells to NaN instead of leaving them at 0.
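The counting behind the table can be sketched without pandas. This is a simplified stand-in, not the library's implementation: it assumes rows hold the model A category and columns the model B category, and returns a nested dict rather than a DataFrame.

```python
from collections import Counter

CATEGORIES = ("TP", "FP", "FN", "TN")

def categorize(y_true, y_pred):
    """Confusion-matrix category of one observation for one model."""
    if y_true == 1:
        return "TP" if y_pred == 1 else "FN"
    return "FP" if y_pred == 1 else "TN"

def crosstab_counts(y_true, y_pred_a, y_pred_b):
    """Nested dict: ct[model_a_category][model_b_category] -> count."""
    pairs = Counter(
        (categorize(t, a), categorize(t, b))
        for t, a, b in zip(y_true, y_pred_a, y_pred_b)
    )
    return {r: {c: pairs.get((r, c), 0) for c in CATEGORIES} for r in CATEGORIES}

y_true   = [1, 1, 1, 0, 0, 0]
y_pred_a = [1, 0, 1, 0, 1, 0]
y_pred_b = [1, 1, 0, 0, 0, 0]

ct = crosstab_counts(y_true, y_pred_a, y_pred_b)

# normalize=True corresponds to dividing every cell by the observation count.
total = sum(sum(row.values()) for row in ct.values())
normalized = {r: {c: v / total for c, v in row.items()} for r, row in ct.items()}
```

Every observation lands in exactly one cell, so the cells of the count table sum to the number of observations and the normalized cells sum to 1.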

plot_overlap_crosstab

plot_overlap_crosstab(
    y_true,
    y_pred_a=None,
    y_pred_b=None,
    *,
    y_prob_a=None,
    y_prob_b=None,
    model_a=None,
    model_b=None,
    X_a=None,
    X_b=None,
    threshold_a=None,
    threshold_b=None,
    score_a=None,
    score_b=None,
    label_a="Model A",
    label_b="Model B",
    title=None,
    table_only=False,
    show_summary=True,
    show_legend=True,
    colors=None,
    cell_fontsize=14,
    label_fontsize=12,
    title_fontsize=14,
    summary_fontsize=11,
    font=None,
    figsize=None,
    save_plot=False,
    image_path_png=None,
    image_path_svg=None,
    image_filename=None,
    ax=None,
)

Renders the crosstab as a matplotlib figure. Calls overlap_crosstab internally, then dispatches to three private helpers (_draw_crosstab_matrix, _draw_crosstab_summary, _draw_crosstab_legend) for the three regions of the figure.

Structural-impossibility insight

Of the 16 cells in the 4x4 grid, only 8 are reachable. Because y_true is fixed per observation:

  • actual positives can only be TP or FN for each model

  • actual negatives can only be TN or FP

So any cell mixing a positive-subpop category (TP, FN) on one axis with a negative-subpop category (FP, TN) on the other is structurally impossible. A single observation cannot be TP for one model and FP for the other, since y_true is either 1 or 0, not both.

The valid cells partition cleanly:

  • Diagonal (4 cells): agreement, both models put the observation in the same confusion-matrix category. (TP, TP), (FP, FP), (FN, FN), (TN, TN).

  • Off-diagonal, positive-subpop (2 cells): TP swap pair. (TP, FN) and (FN, TP). One model catches an actual positive the other misses.

  • Off-diagonal, negative-subpop (2 cells): FP swap pair. (FP, TN) and (TN, FP). One model false-alarms on an actual negative the other clears.

The structural-impossibility check is encoded in _draw_crosstab_matrix (cells where row category and column category are in different subpops get the impossibility color) and in overlap_crosstab’s mask_impossible flag (NaN those cells in the returned DataFrame instead of leaving them at 0).
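The partition can be checked mechanically. A minimal sketch of the cell classification, using the same "agree" / "disagree" / "impossible" names as the colors dict; cell_kind is a hypothetical helper, not the library's function.

```python
CATEGORIES = ("TP", "FP", "FN", "TN")
POSITIVE_SUBPOP = {"TP", "FN"}  # reachable only when y_true == 1

def cell_kind(row_cat, col_cat):
    """Classify one (row category, column category) cell."""
    if row_cat == col_cat:
        return "agree"
    # Both categories must come from the same actual-label subpopulation.
    same_subpop = (row_cat in POSITIVE_SUBPOP) == (col_cat in POSITIVE_SUBPOP)
    return "disagree" if same_subpop else "impossible"

kinds = {(r, c): cell_kind(r, c) for r in CATEGORIES for c in CATEGORIES}
tally = {
    kind: sum(1 for v in kinds.values() if v == kind)
    for kind in ("agree", "disagree", "impossible")
}
```

The tally comes out to 4 agreement cells, 4 swap cells, and 8 impossible cells, matching the counts above.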

Prediction-input modes

Same three-way exclusivity as the rest of the family. For each side (a, b) supply exactly one of:

  1. y_pred_* (binary predictions, used as is)

  2. y_prob_* (positive-class probabilities, thresholded internally via threshold_*, default 0.5)

  3. model_* + X_* (predictions via get_predictions; the model’s stored .threshold attribute is looked up automatically, with threshold_* as an explicit override and score_* selecting a non-default score key from the model’s .threshold dict)

Supplying more than one on the same side raises ValueError. The y_prob_* path defaults to 0.5 when no threshold_* is set, since there is no model object available on that path to source a stored threshold from.
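A rough sketch of the per-side exclusivity check follows. It is not the library's resolver: resolve_side is a hypothetical name, the model path is left out because it depends on get_predictions, and two details are assumptions (supplying no input also raises ValueError, and a probability equal to the threshold counts as positive).

```python
def resolve_side(side, y_pred=None, y_prob=None, model=None, X=None, threshold=None):
    """Return binary predictions for one side, enforcing exactly one input mode."""
    supplied = [
        name
        for name, value in (("y_pred", y_pred), ("y_prob", y_prob), ("model", model))
        if value is not None
    ]
    if len(supplied) != 1:
        raise ValueError(
            f"side {side!r}: supply exactly one of y_pred, y_prob, or model + X; "
            f"got {supplied or 'none'}"
        )
    if y_pred is not None:
        return [int(p) for p in y_pred]  # used as is
    if y_prob is not None:
        # No model object on this path, so the default cut is 0.5.
        cut = 0.5 if threshold is None else threshold
        return [int(p >= cut) for p in y_prob]  # >= is an assumption
    # The model path would call the library's get_predictions; omitted here.
    raise NotImplementedError("model path is outside this sketch")

default_cut = resolve_side("a", y_prob=[0.2, 0.5, 0.9])
custom_cut = resolve_side("a", y_prob=[0.2, 0.5, 0.9], threshold=0.6)
```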

Display controls (plot_overlap_crosstab)

  • title controls the figure title and its agreement-rate subtitle. None defaults to "{label_a} vs {label_b}: Prediction Overlap". Pass "" to suppress.

  • table_only is the one-flag shortcut for a bare matrix: suppresses the title, the side summary panel, and the bottom legend strip. Equivalent to passing title="", show_summary=False, show_legend=False.

  • show_summary toggles the right-hand swap-summary text block (shared TP / FN / TN / FP counts, TP-swap and FP-swap breakdowns).

  • show_legend toggles the color-legend strip beneath the matrix.

  • colors accepts a dict with keys "agree", "disagree", "impossible" to override the default cell colors. Unspecified keys keep their defaults (soft green, soft pink, soft blue-gray).

  • Font sizes are split across four parameters: cell_fontsize for the cell counts, label_fontsize for the row and column headers, title_fontsize for the title block, summary_fontsize for the side panel.
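Two of the controls above are simple merges. The sketch below shows one way they could be resolved; the hex values are placeholders rather than the library's defaults, and both helper names are hypothetical.

```python
DEFAULT_COLORS = {  # placeholder hex values, not the library's actual defaults
    "agree": "#cdeccd",
    "disagree": "#f6cfd3",
    "impossible": "#dfe5ec",
}

def resolve_colors(colors=None):
    """Overlay user-supplied keys on the defaults; unspecified keys keep defaults."""
    return {**DEFAULT_COLORS, **(colors or {})}

def resolve_display(title=None, table_only=False, show_summary=True, show_legend=True):
    """table_only overrides the three individual display knobs."""
    if table_only:
        return "", False, False
    return title, show_summary, show_legend

palette = resolve_colors({"disagree": "#ff0000"})
bare = resolve_display(title="Ignored", table_only=True)
```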

Default figsize behavior

figsize=None triggers a three-way default based on which display regions are active:

  • (5, 5) when table_only=True (bare square matrix)

  • (11, 6.5) when show_summary=True (matrix plus side panel)

  • (7, 6.5) otherwise (matrix plus legend, no side panel)

The matrix uses set_aspect("equal") so cells always render square. Passing an asymmetric figsize (e.g. (10, 4)) produces a figure of that exact size with whitespace on the wider dimension, since cells will not stretch to fill.

When the title is suppressed (table_only=True or title=""), tight_layout is invoked with rect=[0, 0, 1, 1] so the matrix uses the full figure height. When the title is active, rect=[0, 0, 1, 0.88] reserves the top 12% for the title and subtitle.
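The default-size and layout rules above can be written as two small functions. A sketch with hypothetical helper names; the precedence (table_only is checked before show_summary) is inferred from show_summary defaulting to True.

```python
def default_figsize(table_only=False, show_summary=True):
    """Three-way default used when figsize=None."""
    if table_only:
        return (5, 5)       # bare square matrix
    if show_summary:
        return (11, 6.5)    # matrix plus side panel
    return (7, 6.5)         # matrix plus legend, no side panel

def layout_rect(title=None, table_only=False):
    """rect passed to tight_layout: full height unless a title is drawn."""
    title_hidden = table_only or title == ""
    return [0, 0, 1, 1] if title_hidden else [0, 0, 1, 0.88]
```

For example, default_figsize(show_summary=False) gives (7, 6.5), and layout_rect(title="") gives the full-height rect.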

Font handling

A new module-level _FONT_ALIASES dict in metrics_utils.py maps common cross-platform font names (Arial, Helvetica, Times New Roman, Consolas, Cascadia Code, Source Code Pro, Fira Code, JetBrains Mono, Menlo, Monaco, Georgia, Verdana, Calibri, Cambria, Tahoma, Garamond) to fallback chains ending in a font that is guaranteed to be available (DejaVu Sans or DejaVu Sans Mono, both shipped with matplotlib).

_resolve_font_family(font) expands the input to a chain, filters it to the fonts actually findable on the system (via font_manager.findfont(name, fallback_to_default=False)), and returns the surviving list. Behavior summary:

  • font=None returns None (the caller leaves rcParams untouched).

  • font="Arial" (or any aliased name) always succeeds, because the alias chain ends in DejaVu Sans.

  • font="UnknownFontName" (not in the alias map, not installed) raises ValueError with a useful message listing the supported aliases and pointing at how to check what is installed.

  • font=42 raises TypeError.

In plot_overlap_crosstab, the resolved font is applied via matplotlib.rc_context({"font.family": resolved_font}), so the override scopes to the call only and does not modify the global rcParams.

The asymmetry between aliased and unaliased fonts is intentional. Aliased fonts have a curated fallback path that gracefully degrades to something installable on every platform, so library users do not need to install Microsoft fonts on Linux to use font="Arial". Unaliased fonts are taken at the user’s word and validated literally, so typos and missing fonts surface as errors rather than getting silently swapped to DejaVu Sans.
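The aliased versus unaliased behavior can be sketched as follows. This is an approximation, not the library's _resolve_font_family: the alias chains are made up for illustration (only their DejaVu endings come from the text), and an is_installed callable stands in for the findfont lookup.

```python
FONT_ALIASES = {  # illustrative chains only; the real table is not reproduced here
    "Arial": ["Arial", "Helvetica", "DejaVu Sans"],
    "Consolas": ["Consolas", "Menlo", "DejaVu Sans Mono"],
}

def resolve_font_family(font, is_installed):
    """Expand font to a chain of installed fonts, or fail loudly."""
    if font is None:
        return None  # caller leaves rcParams untouched
    if not isinstance(font, str):
        raise TypeError(f"font must be a str or None, got {type(font).__name__}")
    if font in FONT_ALIASES:
        # Aliased: keep whichever members of the chain are installed.
        return [name for name in FONT_ALIASES[font] if is_installed(name)]
    if is_installed(font):
        return [font]
    # Unaliased and missing: surface the problem instead of falling back.
    raise ValueError(
        f"font {font!r} is not installed and has no alias; "
        f"supported aliases: {sorted(FONT_ALIASES)}"
    )

# Simulated Linux box: DejaVu fonts present, no Microsoft fonts.
installed = {"DejaVu Sans", "DejaVu Sans Mono", "Helvetica"}
chain = resolve_font_family("Arial", installed.__contains__)
```

On the simulated system, "Arial" degrades to Helvetica and then DejaVu Sans, while an unknown name raises ValueError.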

Supporting helpers

  • _print_overlap_crosstab_legend(ct, label_a, label_b) (in metrics_utils.py) prints both the static cell-meaning legend (what each cell type represents, which cells are structurally impossible) and the dynamic swap summary derived from the raw integer counts (agreement rate, shared TP / FN / TN / FP, TP and FP swap breakdowns with net deltas).

  • _draw_crosstab_matrix(ax, ct, label_a, label_b, colors, cell_fontsize, label_fontsize) (in metrics_utils.py) renders the 4x4 grid: per-cell color via Rectangle patches, per-cell text via ax.text with the count formatted using the ":," thousands-separator format spec, row and column headers, and the axis-level labels. Uses a plain hyphen rather than an em dash for the NaN cell text under mask_impossible=True.

  • _draw_crosstab_summary(ax, label_b, stats, fontsize) (in metrics_utils.py) renders the right-hand text block with four colored headlines (shared FN, shared TN, TP swap, FP swap) and their italic body lines, all relative to label_b since the swap framing reads naturally with the newer or candidate model in the _b slot.

  • _draw_crosstab_legend(ax, colors, fontsize) (in metrics_utils.py) renders the horizontal color-legend strip using matplotlib’s Patch and ax.legend.

  • _resolve_font_family(font) (in metrics_utils.py) is the font alias expander documented above.

  • _FONT_ALIASES (module-level dict in metrics_utils.py) is the alias table.

Touch-ups to existing overlap functions

  • overlap_summary: restored the n_{label_a} and n_{label_b} columns (the per-model totals for the category, equal to both + {label}_only) after a brief detour to a 5-column partition-only form. The 7-column form was restored because the per-model totals are exactly what the Venn diagram shows beneath each circle, making the table-to-figure cross-reference direct.

  • _print_overlap_summary_legend: updated to document both the partition identity (both + {a}_only + {b}_only + outside == subpop) and the per-model decomposition (n_{label} == both + {label}_only) explicitly, since the difference between n_{label} (full venn circle) and {label}_only (exclusive crescent) is the most common reader-confusion point.

  • overlap_table: tightened the index parameter to raise TypeError with a helpful message when a scalar (e.g. a column name string) is passed instead of an array-like, since pd.Index(scalar) produces the cryptic Index(...) must be called with a collection error.

  • _draw_crosstab_matrix axis-label spacing: dropped the axis-level label_b y-position from n + 0.9 to n + 0.55 so the column-axis label sits at the same visible distance from the column headers as the row-axis label sits from the row headers.

  • _draw_crosstab_summary: applied the :, thousands separator to every number in the swap-summary block (the swap headlines and the body lines were originally missing the comma format, leaving numbers like 1249 in a sea of 1,249 and +1,079).
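For reference, the thousands-separator formatting mentioned in the last item uses Python's standard format spec. The signed form with "+" is an assumption about how deltas such as +1,079 are produced; the values are just the ones quoted above.

```python
shared_count = 1249
net_delta = 1079

plain = f"{shared_count:,}"    # comma as thousands separator
signed = f"{net_delta:+,}"     # explicit sign plus separator
small_loss = f"{-35:+,}"       # negative values keep their minus sign
```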

Design decisions baked in

  • Same input-parameter block as the rest of the overlap family. Swapping between plot_overlap_venns, overlap_summary, overlap_table, overlap_crosstab, and plot_overlap_crosstab for the same comparison is a one-name edit: only the function name changes.

  • Three exclusive input modes per side, no implicit threshold lookup on the y_prob path. The y_prob path defaults to 0.5 because there is no model object to source a stored threshold from. Users who want tuned-threshold behavior with probabilities in hand pass threshold_* explicitly, or use the model path.

  • Strict ValueError on unfound unaliased fonts; graceful degradation for aliased fonts. Library users do not need to install Microsoft fonts on Linux to call font="Arial", but typos and unrecognized fonts loudly fail rather than silently rendering in DejaVu Sans.

  • Font override scoped via mpl.rc_context. The font parameter applies only to the current call and never leaks into the global rcParams, so other figures in the same notebook are unaffected.

  • table_only as the one-flag shortcut for bare matrices. Sets title="", show_summary=False, show_legend=False together, because those are the three knobs anyone toggles together when preparing a slide-ready figure.

  • aspect="equal" on the matrix. Cells stay square at every figure size, which is the right tradeoff for the symmetry-reading use case (the TP-swap and FP-swap pairs read off the diagonal best when cells are square).

  • n_{label} columns retained in overlap_summary despite redundancy with the partition columns. The per-model totals match exactly what the venn diagram shows beneath each circle, making the table-to-figure cross-reference direct.

  • Plot function calls data function internally rather than duplicating the counting logic. plot_overlap_crosstab invokes overlap_crosstab to build the DataFrame, then renders. Single source of truth for the counts; test the data function and the figure inherits correctness for free.

  • Swap summary framed relative to label_b. “Tab+Text catches 5 extra, misses 3” reads naturally when the newer or candidate model is in the _b slot. The narrative arrow points from the baseline (_a) to the candidate (_b).

Version 0.0.5a10

Three sibling functions for comparing two binary classifiers head to head:

  • plot_overlap_venns renders equal-area Venn diagrams across any subset of {"TP", "FP", "FN", "TN"} categories. Each panel shows the overlap between the two models’ predictions within the relevant subpopulation (actual positives for TP/FN, actual negatives for FP/TN).

  • overlap_table returns a per-observation DataFrame classifying each row as TP/FP/FN/TN for both models, with an agreement flag. Useful for drilling into specific observations or merging with other patient-level data.

  • overlap_summary returns a four-row DataFrame indexed by {TP, FP, FN, TN} with per-category counts of the Venn regions. Useful as a numeric companion to the figure.

All three share the same input-parameter block, so swapping between table, summary, and figure for the same comparison is a one-name edit: only the function name changes.

Public functions

plot_overlap_venns

plot_overlap_venns(
    y_true,
    y_pred_a=None,
    y_pred_b=None,
    *,
    y_prob_a=None,
    y_prob_b=None,
    model_a=None,
    model_b=None,
    X_a=None,
    X_b=None,
    threshold_a=None,
    threshold_b=None,
    score_a=None,
    score_b=None,
    categories=("FN", "TN"),
    label_a="Model A",
    label_b="Model B",
    titles=None,
    title_pad=None,
    label_kwgs=None,
    inner_fontsize=12,
    outer_fontsize=12,
    title_fontsize=11,
    figsize=None,
    ncols=None,
    pad=1.08,
    h_pad=None,
    w_pad=None,
    colors=None,
    alpha=0.4,
    save_plot=False,
    image_path_png=None,
    image_path_svg=None,
    image_filename=None,
    ax=None,
)

overlap_table

overlap_table(
    y_true,
    y_pred_a=None,
    y_pred_b=None,
    *,
    y_prob_a=None,
    y_prob_b=None,
    model_a=None,
    model_b=None,
    X_a=None,
    X_b=None,
    threshold_a=None,
    threshold_b=None,
    score_a=None,
    score_b=None,
    label_a="Model A",
    label_b="Model B",
    index=None,
)

Returns a DataFrame with columns y_true, {label_a}_pred, {label_b}_pred, {label_a}_category, {label_b}_category, agree.

overlap_summary

overlap_summary(
    y_true,
    y_pred_a=None,
    y_pred_b=None,
    *,
    y_prob_a=None,
    y_prob_b=None,
    model_a=None,
    model_b=None,
    X_a=None,
    X_b=None,
    threshold_a=None,
    threshold_b=None,
    score_a=None,
    score_b=None,
    label_a="Model A",
    label_b="Model B",
    verbose=False,
)

Returns a four-row DataFrame indexed by {TP, FP, FN, TN} with seven columns: n_{label_a}, n_{label_b}, both, {label_a}_only, {label_b}_only, outside, subpop. Two identities hold for every row:

partition:    both + {a}_only + {b}_only + outside == subpop
per-model:    n_{label} == both + {label}_only

verbose=True prints a human-readable column legend via _print_overlap_summary_legend.
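Both identities can be verified on a toy example. The sketch below recomputes one summary row in plain Python; summary_row and the SPEC table are hypothetical stand-ins, with column names shortened to n_a, a_only, and so on.

```python
# (subpop_val, in_set_val): which actual label defines the subpopulation,
# and which predicted label puts an observation inside a model's circle.
SPEC = {"TP": (1, 1), "FN": (1, 0), "FP": (0, 1), "TN": (0, 0)}

def summary_row(y_true, y_pred_a, y_pred_b, cat):
    """Region counts for one category within its subpopulation."""
    subpop_val, in_set_val = SPEC[cat]
    flags = [
        (a == in_set_val, b == in_set_val)
        for t, a, b in zip(y_true, y_pred_a, y_pred_b)
        if t == subpop_val
    ]
    return {
        "n_a": sum(a for a, _ in flags),
        "n_b": sum(b for _, b in flags),
        "both": sum(a and b for a, b in flags),
        "a_only": sum(a and not b for a, b in flags),
        "b_only": sum(b and not a for a, b in flags),
        "outside": sum(not a and not b for a, b in flags),
        "subpop": len(flags),
    }

y_true   = [1, 1, 1, 1, 0, 0]
y_pred_a = [1, 0, 0, 1, 0, 1]
y_pred_b = [1, 0, 1, 0, 0, 0]
fn = summary_row(y_true, y_pred_a, y_pred_b, "FN")

partition_ok = fn["both"] + fn["a_only"] + fn["b_only"] + fn["outside"] == fn["subpop"]
per_model_ok = fn["n_a"] == fn["both"] + fn["a_only"] and fn["n_b"] == fn["both"] + fn["b_only"]
```

Here the per-model totals are counted independently of the region counts, so the per-model identity is an actual check rather than a definition.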

Prediction-input modes

Each side (a, b) accepts exactly one of three input modes:

  1. y_pred_* (binary predictions, used as is)

  2. y_prob_* (positive-class probabilities, thresholded internally via threshold_*, default 0.5)

  3. model_* + X_* (predictions generated via get_predictions; the model’s stored .threshold attribute is looked up automatically, with threshold_* as an explicit override and score_* selecting a non-default key from the model’s .threshold dict)

Supplying more than one of these on the same side raises ValueError.

When model objects without tuned thresholds are passed, the function silently falls back to a 0.5 default, so plain sklearn classifiers work without extra setup. When pairing this view with the standalone confusion matrix from show_confusion_matrix, pass the same threshold_* value to both to keep the partition counts aligned.

label_kwgs (visibility toggles for plot_overlap_venns)

A single dict parameter controls all text decorations on each Venn panel. Six boolean keys, all default True:

show_title         heading line above the diagram
show_subtitle      auto stats line beneath the heading
show_set_labels    model names beneath each circle
show_set_totals    "FN total: X" line beneath model names
show_inner_count   the number inside each region
show_inner_role    the "both miss" / model-name role text inside each region

Pass only the keys you want to override. Common combinations:

# slide deck: title only, no chrome
label_kwgs={
    "show_subtitle":   False,
    "show_set_totals": False,
    "show_inner_role": False,
}

# bare venn: just circles and numbers
label_kwgs={
    "show_title":      False,
    "show_set_labels": False,
    "show_set_totals": False,
    "show_inner_role": False,
}

Default figsize behavior

When figsize=None (the default) and the function creates its own figure, _venn_default_figsize computes the panel size from title text length:

panel_w = max(5.5, max_chars * (title_fontsize * 0.009) + 0.6)
panel_h = 5.0

Width grows with title length but floors at 5.5 inches to avoid inter-column gaps. Height stays fixed at 5 inches to keep room for the title, the circles, and the bottom set labels without overflowing into the next row. Width and height are decoupled, so a long title widens panels without compressing them vertically.
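A worked instance of the formula, wrapped in a hypothetical helper (venn_panel_size is not the library's _venn_default_figsize, which takes more arguments):

```python
def venn_panel_size(max_chars, title_fontsize=11):
    """Per-panel (width, height) in inches from the longest title length."""
    panel_w = max(5.5, max_chars * (title_fontsize * 0.009) + 0.6)
    panel_h = 5.0
    return panel_w, panel_h

short_title = venn_panel_size(30)  # 30 * 0.099 + 0.6 = 3.57, floored to 5.5
long_title = venn_panel_size(60)   # 60 * 0.099 + 0.6 = 6.54
```

At the default title_fontsize=11, the floor applies to titles of up to 49 characters; longer titles widen the panel while the height stays at 5 inches.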

combine_plots integration

plot_overlap_venns accepts an ax= parameter. When supplied, the function does not create its own figure. It releases the host axes and carves a sub-gridspec from the host axes’ slot, then adds new sub-axes inside that sub-gridspec for each category panel.

Recommended pattern for mixing into a full evaluation suite: add one plot_overlap_venns entry to combine_plots’s plot_calls per category, each with a single-element categories tuple. Nesting all four inside a single panel works but produces visually cramped output when the figure also contains other plots competing for width.

For per-row spacing requirements, combine_plots switches to constrained_layout when hspace is passed. height_ratios=[1, 0.7, 1, 1, 1.4, 1.4] plus hspace=0.15 is a reasonable starting point for the evaluation-suite plus four-venn-rows case.

Supporting helpers in metrics_utils.py

_VENN_CATEGORY_SPEC
    Module-private dict mapping each category key to display metadata:
    title, subpop_val, in_set_val, both_role, outside_label, subpop_name.

_venn_blend(c1, c2)
    RGB midpoint of two matplotlib color specs (via to_rgb).

_venn_resolve_side(side, y_pred, model, X, *, y_true=None, y_prob=None,
                   threshold=None, score=None)
    Three-way exclusivity check on (y_pred, y_prob, model). Returns the
    integer 1-D prediction array for one side. Model path routes through
    get_predictions for threshold-aware prediction.

_venn_category_counts(y_true, y_pred_a, y_pred_b, cat)
    Returns (a_only, b_only, both, outside, n_sub) integer counts for one
    category. Shared backend for plot_overlap_venns and overlap_summary.

_venn_default_figsize(counts_per_cat, categories, titles, label_kwgs,
                      title_fontsize, ncols, nrows)
    Computes per-panel figsize from title text length. Reads show_title
    and show_subtitle from label_kwgs to skip text estimation when titles
    are hidden.

_draw_one_venn(ax, cat, counts, label_a, label_b,
               inner_fontsize, outer_fontsize, title_fontsize,
               colors, alpha, *,
               title_override, title_pad, label_kwgs)
    Renders one category's overlap Venn into the provided axes. Handles
    set-label assembly, inner region text composition, color blending,
    and title formatting per the label_kwgs toggles.

_overlap_table_categorize(yt, yp)
    Per-observation TP/FN/FP/TN classification used by overlap_table.

_print_overlap_summary_legend(label_a, label_b)
    Human-readable column legend for overlap_summary. Documents both the
    partition identity (sum-to-subpop) and the per-model decomposition
    (n_{label} = both + {label}_only) since the difference between
    n_{label} (full venn circle) and {label}_only (exclusive crescent) is
    the most common reader-confusion point.

Design decisions baked in

  • Threshold default uses the model’s stored value, not 0.5. A venn comparison should show each model at its real operating point. Models without a .threshold attribute fall back to 0.5 silently.

  • Single threshold_* parameter per side, not the library’s model_threshold plus custom_threshold pair. The pair only made sense for functions that needed to distinguish “use stored” from “override scalar”. The venn API has one knob.

  • Three exclusive input modes per side (y_pred, y_prob, model). Forces an explicit choice rather than silent ambiguity if multiple are passed.

  • Pure-data companions return DataFrames, not styled objects. Preserves chaining and downstream .loc access. Legend printing is opt-in via verbose=True or a separate helper call.

  • n_{label} columns retained in overlap_summary despite redundancy with the partition columns. The per-model totals match exactly what the venn diagram shows beneath each circle, making the table-to-figure cross-reference direct.

  • Bare-string category accepted (e.g. categories="FN"). Avoids the footgun where iterating the string yields per-character category lookups.

  • Single function for the venn (not a singular/plural split). combine_plots integration uses plot_overlap_venns with a single-element categories tuple rather than introducing a separate plot_overlap_venn function.
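The bare-string decision above guards against a common Python pitfall, shown in the sketch below. normalize_categories is a hypothetical helper, and the validation of unknown keys is an assumption rather than documented behavior.

```python
VALID_CATEGORIES = ("TP", "FP", "FN", "TN")

def normalize_categories(categories):
    """Accept a bare string or an iterable of category keys; return a tuple."""
    if isinstance(categories, str):
        categories = (categories,)  # wrap instead of iterating characters
    categories = tuple(categories)
    unknown = [c for c in categories if c not in VALID_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return categories

wrapped = normalize_categories("FN")
footgun = tuple("FN")  # what naive iteration over the string would produce
```

Without the wrap, iterating "FN" yields "F" and "N", neither of which is a category key.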

Version 0.0.5a9

Important

Corrected package name in pyproject.toml from model_metrics_dev to model_metrics. The 0.0.5a8 release was inadvertently published under the wrong package name and could not be recalled in time; 0.0.5a9 supersedes it.

New Features

  • Added ax=None parameter to seven single-plot functions: show_roc_curve, show_pr_curve, show_confusion_matrix, show_calibration_curve, show_lift_chart, show_gain_chart, and plot_threshold_metrics. When a pre-created matplotlib.axes.Axes object is supplied, each function draws onto that axes and suppresses its internal plt.show() and save_plot_images calls. All existing call signatures remain fully backward-compatible — the default ax=None preserves prior behavior exactly.

  • Added combine_plots function that assembles multiple single-plot function calls into a shared subplot figure. Accepts a list of (func, kwargs) tuples, pre-allocates one axes per panel, and passes each axes to the corresponding function via the ax parameter. Supports configurable grid layout (n_cols, n_rows), custom figsize, suptitle, tight_layout, and full save_plot_images integration. Unused trailing panels are hidden automatically. Panels that raise exceptions render an inline error message rather than aborting the figure.

  • Added tick_fontsize parameter to combine_plots. Both label_fontsize and tick_fontsize are now automatically injected into each panel function via inspect.signature, giving uniform typography across the grid without per-panel repetition. Per-panel overrides in plot_calls always take precedence.

  • Added hspace and wspace parameters to combine_plots for explicit row and column spacing control. When either is supplied, constrained_layout is activated and spacing is applied via fig.get_layout_engine().set() after tight_layout, correctly handling panels with legend_loc="bottom" that place legends outside the axes bbox.

  • Added height_ratios parameter to combine_plots, passed through gridspec_kw, allowing individual rows to be resized independently. Useful when mixing plot types of different natural heights (e.g., confusion matrices alongside full-height curve panels).

  • Fixed the overlay path in show_roc_curve, show_pr_curve, show_lift_chart, show_gain_chart, show_calibration_curve, and plot_threshold_metrics. Previously, overlay=True called plt.figure() unconditionally, creating a standalone figure instead of drawing onto the supplied ax. All six functions now check ax is not None before creating a figure, correctly routing overlay draws into combine_plots panel grids.
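The signature-based injection described for combine_plots can be sketched with inspect.signature. This is an illustration under assumptions: inject_shared_kwargs and the two panel functions are hypothetical stand-ins, not the library's code or real signatures.

```python
import inspect

def inject_shared_kwargs(func, panel_kwargs, **shared):
    """Add shared settings the function accepts; per-panel kwargs win."""
    accepted = inspect.signature(func).parameters
    merged = {
        name: value
        for name, value in shared.items()
        if name in accepted and value is not None
    }
    merged.update(panel_kwargs)  # explicit per-panel values take precedence
    return merged

# Stand-in panel functions with different signatures.
def panel_curve(y_true, y_prob, label_fontsize=12, tick_fontsize=10, ax=None):
    ...

def panel_matrix(y_true, y_pred, label_fontsize=12, ax=None):
    ...

curve_kwargs = inject_shared_kwargs(
    panel_curve, {"tick_fontsize": 8}, label_fontsize=14, tick_fontsize=12
)
matrix_kwargs = inject_shared_kwargs(
    panel_matrix, {}, label_fontsize=14, tick_fontsize=12
)
```

The matrix panel never receives tick_fontsize because its signature does not declare it, which avoids an unexpected-keyword TypeError.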

Bug Fixes

  • Fixed combine_plots axes flattening logic. The previous if num_plots == 1 guard incorrectly wrapped a numpy ndarray of axes in a list when plt.subplots(1, N) was called with N > 1 and only one plot call was provided, causing an AttributeError: 'numpy.ndarray' object has no attribute 'plot'. The fix inspects the axes object directly using hasattr rather than relying on num_plots.

  • Fixed show_confusion_matrix axes reference conflict. The loop variable ax used in the subplots branch shadowed the user-supplied ax parameter. The user-supplied value is now stashed as _user_ax before the loop runs to prevent clobbering.

  • Fixed show_confusion_matrix ignoring model_threshold when passed as a list and X is provided. Previously the full list was forwarded to get_predictions which expects a scalar or dict, causing incorrect threshold application. The list is now indexed per model (model_threshold[idx]) in the X-provided branch.

  • Fixed show_confusion_matrix raising TypeError: 'float' object is not subscriptable when model_threshold was passed as a scalar float in the X is None branch. The branch now checks isinstance before indexing, correctly handling scalar, list, and dict threshold inputs.
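The two threshold fixes above share one idea: check the type before indexing. A minimal sketch, assuming a hypothetical helper name and treating anything that is not a list or tuple as a value to pass through unchanged:

```python
def threshold_for_model(model_threshold, idx):
    """Pick the threshold for model number idx without assuming a list."""
    if isinstance(model_threshold, (list, tuple)):
        return model_threshold[idx]  # one entry per model
    # Scalars, dicts, and None are forwarded as they are.
    return model_threshold

per_model = threshold_for_model([0.3, 0.6], 1)
scalar = threshold_for_model(0.3, 1)
```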

Version 0.0.5a8

New Features

  • Added ax=None parameter to seven single-plot functions: show_roc_curve, show_pr_curve, show_confusion_matrix, show_calibration_curve, show_lift_chart, show_gain_chart, and plot_threshold_metrics. When a pre-created matplotlib.axes.Axes object is supplied, each function draws onto that axes and suppresses its internal plt.show() and save_plot_images calls. All existing call signatures remain fully backward-compatible — the default ax=None preserves prior behavior exactly.

  • Added combine_plots function that assembles multiple single-plot function calls into a shared subplot figure. Accepts a list of (func, kwargs) tuples, pre-allocates one axes per panel, and passes each axes to the corresponding function via the ax parameter. Supports configurable grid layout (n_cols, n_rows), custom figsize, suptitle, tight_layout, and full save_plot_images integration. Unused trailing panels are hidden automatically. Panels that raise exceptions render an inline error message rather than aborting the figure.

  • Added tick_fontsize parameter to combine_plots. Both label_fontsize and tick_fontsize are now automatically injected into each panel function via inspect.signature, giving uniform typography across the grid without per-panel repetition. Per-panel overrides in plot_calls always take precedence.

  • Added hspace and wspace parameters to combine_plots for explicit row and column spacing control. When either is supplied, constrained_layout is activated and spacing is applied via fig.get_layout_engine().set() after tight_layout, correctly handling panels with legend_loc="bottom" that place legends outside the axes bbox.

  • Added height_ratios parameter to combine_plots, passed through gridspec_kw, allowing individual rows to be resized independently. Useful when mixing plot types of different natural heights (e.g., confusion matrices alongside full-height curve panels).

  • Fixed the overlay path in show_roc_curve, show_pr_curve, show_lift_chart, show_gain_chart, show_calibration_curve, and plot_threshold_metrics. Previously, overlay=True called plt.figure() unconditionally, creating a standalone figure instead of drawing onto the supplied ax. All six functions now check ax is not None before creating a figure, correctly routing overlay draws into combine_plots panel grids.
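
The shared-typography injection described for combine_plots can be pictured with a small sketch. This is not the library's code: inject_shared_kwargs and the two toy panel functions are hypothetical names, used only to illustrate an inspect.signature check combined with the rule that per-panel values win.

```python
import inspect


def inject_shared_kwargs(func, panel_kwargs, **shared):
    """Merge shared kwargs into panel_kwargs, but only those func accepts.

    Hypothetical helper for illustration; not part of the library.
    """
    accepted = inspect.signature(func).parameters
    merged = {
        name: value
        for name, value in shared.items()
        if value is not None and name in accepted
    }
    # Per-panel values override the shared defaults.
    merged.update(panel_kwargs)
    return merged


def panel_with_fonts(label_fontsize=12, tick_fontsize=10):
    """Toy panel function that accepts both font-size parameters."""
    return label_fontsize, tick_fontsize


def panel_without_ticks(label_fontsize=12):
    """Toy panel function that has no tick_fontsize parameter."""
    return label_fontsize
```

Filtering on the signature means a shared setting is silently skipped for a panel function that does not declare it, rather than raising an unexpected-keyword error.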

Bug Fixes

  • Fixed combine_plots axes flattening logic. The previous if num_plots == 1 guard incorrectly wrapped a numpy ndarray of axes in a list when plt.subplots(1, N) was called with N > 1 and only one plot call was provided, causing an AttributeError: 'numpy.ndarray' object has no attribute 'plot'. The fix inspects the axes object directly using hasattr rather than relying on num_plots.

  • Fixed show_confusion_matrix axes reference conflict. The loop variable ax used in the subplots branch shadowed the user-supplied ax parameter. The user-supplied value is now stashed as _user_ax before the loop runs to prevent clobbering.

  • Fixed show_confusion_matrix ignoring model_threshold when passed as a list and X is provided. Previously the full list was forwarded to get_predictions which expects a scalar or dict, causing incorrect threshold application. The list is now indexed per model (model_threshold[idx]) in the X-provided branch.

  • Fixed show_confusion_matrix raising TypeError: 'float' object is not subscriptable when model_threshold was passed as a scalar float in the X is None branch. The branch now checks isinstance before indexing, correctly handling scalar, list, and dict threshold inputs.
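
As a rough illustration of the isinstance-based handling of scalar, list, and dict thresholds mentioned in the last two fixes, the sketch below resolves one threshold per model. The helper name resolve_threshold, the dict being keyed by model name, and the 0.5 fallback are all assumptions made for this example, not the library's documented behavior.

```python
def resolve_threshold(model_threshold, idx, name, default=0.5):
    """Return the threshold for the model at position idx.

    Hypothetical helper; the dict keying and the default are assumptions.
    """
    if model_threshold is None:
        return default
    if isinstance(model_threshold, dict):
        # Assumed: dict thresholds are keyed by model name.
        return model_threshold.get(name, default)
    if isinstance(model_threshold, (list, tuple)):
        # List thresholds are indexed per model.
        return model_threshold[idx]
    # A scalar applies to every model and is never indexed.
    return float(model_threshold)
```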

Version 0.0.5a7

Summary

This version drops support for Python 3.7.4 and sets the minimum required Python version to 3.8. Python 3.7 reached end-of-life in June 2023 and is no longer supported by the library. Users on Python 3.7.x must upgrade before installing this version.

This version also delivers four major workstreams across the library: a ground-up hardening of ModelCalculator, full multi-model support for plot_threshold_metrics, a new image_filename parameter across all eight plotting functions, and Python 3.8 compatibility restoration.


1. ModelCalculator

_extract_final_model

The original branching logic was ambiguous and failed silently for several real-world model wrapper patterns. The method now resolves wrappers in a strict, documented priority order:

  1. Plain dict with a "model" key (e.g. model_tuner pkl format {"model": <Model>}), which is unwrapped first, before any other check.

  2. sklearn.pipeline.Pipeline via hasattr(model, "steps"), extracting the last step.

  3. Objects with an estimator attribute (e.g. model_tuner Model objects wrapping a CalibratedClassifierCV).

  4. Objects with a model attribute (e.g. custom wrapper classes).

  5. Standalone sklearn-compatible objects with predict, predict_proba, or decision_function.

The dict unwrap path was entirely absent before this change, causing AttributeError: 'dict' object has no attribute 'predict' when loading pkl files saved in model_tuner format.
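
The priority order above can be sketched as a chain of checks. This is an illustrative stand-in rather than the actual method body: the function name, the choice to keep checking after the dict unwrap, and the ValueError at the end are assumptions.

```python
def extract_final_model(model):
    """Resolve a wrapped model to its final estimator (illustrative sketch)."""
    # 1. Plain dict with a "model" key: unwrap, then keep checking.
    if isinstance(model, dict) and "model" in model:
        model = model["model"]
    # 2. Pipeline-like object: take the estimator from the last step.
    if hasattr(model, "steps"):
        return model.steps[-1][1]
    # 3. Wrapper exposing an estimator attribute.
    if hasattr(model, "estimator"):
        return model.estimator
    # 4. Wrapper exposing a model attribute.
    if hasattr(model, "model"):
        return model.model
    # 5. Standalone estimator with a prediction method.
    if any(
        hasattr(model, method)
        for method in ("predict", "predict_proba", "decision_function")
    ):
        return model
    # Assumed failure mode; the real method may report this differently.
    raise ValueError(f"Cannot resolve an estimator from {type(model).__name__}")
```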

generate_predictions

The prediction block previously called model.predict() and model.predict_proba() directly on the raw object retrieved from model_dict, bypassing _extract_final_model. This caused the same AttributeError on dict-wrapped models. The block now routes through the already-resolved estimator variable for all prediction calls, while retaining the model.threshold check on the original object since that attribute lives on the model_tuner wrapper, not the inner estimator.

_add_metrics

Replaced type(y_test_m) == pd.DataFrame with isinstance(y_test_m, pd.DataFrame) per Python best practices. Also corrected squeeze(axis=0) to squeeze(axis=1), which is the correct axis for collapsing a single-column DataFrame into a Series.

_get_shap_explainer (new helper)

Replaced the generic shap.Explainer auto-detection call, which fired internal probe warnings on every invocation, with an explicit helper that selects the correct explainer class based on model attributes:

  • shap.TreeExplainer for tree models (tree_ or estimators_).

  • shap.LinearExplainer for linear models (coef_).

  • shap.KernelExplainer via _make_predict_proba_wrapper for everything else.

A guard at the top raises ValueError immediately for models without predict_proba so unsupported models fail with a clear message rather than crashing inside SHAP internals.
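
A minimal sketch of that selection logic follows. To stay runnable without shap installed, it returns the name of the explainer class instead of constructing one; the function name choose_explainer_name and the error message are hypothetical.

```python
def choose_explainer_name(model):
    """Pick an explainer family from model attributes (illustrative sketch)."""
    # Guard first, so unsupported models fail with a clear message.
    if not hasattr(model, "predict_proba"):
        raise ValueError("Model must implement predict_proba for SHAP.")
    # Tree models expose tree_ (single tree) or estimators_ (ensembles).
    if hasattr(model, "tree_") or hasattr(model, "estimators_"):
        return "TreeExplainer"
    # Linear models expose coef_.
    if hasattr(model, "coef_"):
        return "LinearExplainer"
    # Everything else falls back to the model-agnostic explainer.
    return "KernelExplainer"
```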

_make_predict_proba_wrapper (new helper)

KernelExplainer internally converts DataFrames to numpy before calling the model function, which caused StandardScaler (fitted with named columns) to emit UserWarning: X does not have valid feature names on every SHAP call. The new wrapper re-attaches the original column names before passing the array to predict_proba, eliminating the warning at the source rather than suppressing it.

_calculate_shap_values

Global SHAP previously iterated row-by-row via itertuples, calling the explainer once per sample. This has been replaced with a single batched explainer(X_transformed) call, which is orders of magnitude faster on datasets of any meaningful size.

Fixed the multi-class SHAP averaging from .mean(axis=0).mean(axis=1) to .mean(axis=2).mean(axis=0), which is the correct reduction order for a (n_samples, n_features, n_classes) tensor.

Added pipeline unwrapping at the top of the method so the method handles dict/Pipeline-wrapped models passed directly rather than only pre-unwrapped estimators.

The include_contributions row-wise path now consistently returns top-N {feature: shap_value} dicts sorted by absolute value, matching the coefficient path. Previously it returned a flat dict over all features regardless of top_n.
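
For a single row, the top-N selection by absolute value might look like the sketch below. The helper name top_n_contributions is hypothetical and the real method may organize the work differently.

```python
def top_n_contributions(feature_names, shap_row, top_n):
    """Return the top_n {feature: shap_value} pairs ranked by |shap_value|."""
    ranked = sorted(
        zip(feature_names, shap_row),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    # dict preserves insertion order, so the ranking is kept.
    return dict(ranked[:top_n])
```

Ranking by absolute value keeps strongly negative contributions in the result instead of dropping them in favor of small positive ones.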

_calculate_coefficients

Added a secondary Pipeline unwrap after _extract_final_model for cases where the extracted model is itself a Pipeline (e.g. model_tuner objects where .estimator is a CalibratedClassifierCV wrapping a Pipeline). Without this, coef_ lookup failed on the Pipeline object rather than its final step.

The include_contributions=False path now returns top-N feature-name lists rather than dicts, making it symmetric with the SHAP path and ensuring subset_results column content is consistent regardless of which explainability method is used.

Note

This is a behavior change for any code consuming the default output of _calculate_coefficients as dicts. Pass include_contributions=True to restore the previous dict output.


2. plot_threshold_metrics

Multi-model support

The function previously accepted only a single model or y_prob array. It now accepts lists of models, y_prob arrays, and thresholds and supports three display modes:

  • Single: one plot per model, unchanged from previous behavior.

  • Overlay: all models on a single shared axes, with curve labels prefixed by model name for disambiguation.

  • Subplots: one subplot per model in an auto-sized grid.

New parameters: model_title, overlay, subplots, n_cols, n_rows, suptitle, suptitle_y, and model_threshold (now accepts a list).

Model title defaulting

When model_title=None and model objects are provided, titles now default to the model class name via extract_model_name() rather than generic “Model 1”, “Model 2” index labels. When only y_prob arrays are provided the index fallback is retained since there is nothing to extract a name from.

n_rows / n_cols auto-derivation

When n_rows is explicitly provided but n_cols is left at its default of 2, n_cols is now automatically derived as ceil(num_models / n_rows). Previously specifying n_rows=1 with 3 models still produced a 2-column grid because n_cols was never recalculated.
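
The derivation can be sketched as below. The helper name resolve_grid is hypothetical, and detecting "left at its default" by comparing n_cols to 2 is an assumption made for this example; the library may track this differently.

```python
import math


def resolve_grid(num_models, n_cols=2, n_rows=None):
    """Return (n_rows, n_cols) for a subplot grid (illustrative sketch)."""
    if n_rows is not None and n_cols == 2:
        # n_rows was given and n_cols looks untouched: derive the columns.
        n_cols = math.ceil(num_models / n_rows)
    elif n_rows is None:
        # Only the column count is known: derive the rows.
        n_rows = math.ceil(num_models / n_cols)
    return n_rows, n_cols
```

With this rule, three models and n_rows=1 give a single row of three panels rather than a 2-column grid.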

y_test flattening

precision_recall_curve and roc_curve inside _plot_single now receive np.asarray(y_test).ravel() to prevent ValueError: Found input variables with inconsistent numbers of samples when y_test is a single-column DataFrame loaded from parquet.
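
A small example of the flattening step, using a NumPy column array to stand in for the (n, 1) shape that a single-column DataFrame converts to. The helper name flatten_labels is hypothetical.

```python
import numpy as np


def flatten_labels(y_test):
    """Coerce labels to a 1-D array, whatever container they arrived in."""
    return np.asarray(y_test).ravel()


column = np.array([[0], [1], [1], [0]])  # 2-D, one column
flat = flatten_labels(column)            # 1-D, same values in order
```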

suptitle / title separation

suptitle controls the overall figure heading above all subplots. title controls per-subplot headings. The two can be set independently: passing title="" suppresses per-subplot titles while still showing the suptitle, and vice versa. Previously there was no way to have both levels of titling independently.


3. image_filename Save Integration

Updated save_plot_images

Added three new parameters: image_filename, fig, and dpi.

Saving is now triggered when either save_plot=True or image_filename is provided, so callers no longer need to set save_plot=True just to use a custom filename. When image_filename is provided it takes precedence over the auto-generated filename. The function now calls fig.savefig() targeting the correct figure object rather than plt.savefig(), which targeted whatever the current active figure happened to be at call time.
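
The trigger and precedence rules can be condensed into a short sketch. The helper resolve_save_name and its auto_name argument are hypothetical; the real save_plot_images takes more parameters and writes files rather than returning a name.

```python
def resolve_save_name(save_plot=False, image_filename=None, auto_name="plot"):
    """Return the filename to save under, or None when nothing is saved."""
    # Saving is triggered by either flag.
    if not (save_plot or image_filename):
        return None
    # A caller-supplied filename wins over the auto-generated one.
    return image_filename if image_filename else auto_name
```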

All eight plotting functions updated

image_filename=None added to the signature immediately after image_path_svg and threaded through every save_plot_images call site across show_confusion_matrix, show_roc_curve, show_pr_curve, show_lift_chart, show_gain_chart, show_calibration_curve, plot_threshold_metrics, and show_residual_diagnostics (22 call sites in total).

The if save_plot: guards that previously wrapped save_plot_images calls in show_calibration_curve (group path) and show_residual_diagnostics (both save sites) have been removed since save_plot_images now handles the trigger logic internally.

All eight function docstrings updated to document image_filename immediately after image_path_svg.


4. Python 3.8 Compatibility and Version Floor Change

The minimum supported Python version has been raised from 3.7.4 to 3.8. This aligns with the broader Python ecosystem where 3.7 has been end-of-life since June 2023 and several upstream dependencies no longer ship 3.7-compatible wheels.

fastparquet replaced with pyarrow

fastparquet fails to build on Python 3.8 due to a Cython ndarray type identifier incompatibility that affects all released versions. The dependency has been replaced with pyarrow>=11.0.0,<=14.0.2, which is the last release with official Python 3.8 wheels. pd.read_parquet() uses pyarrow automatically so no code changes were required.

pip upgrade required for Python 3.8 environment

The venv_3_8 environment shipped with pip 19.2.3 (2019), which cannot parse modern pyproject.toml files used by packages like ninja and scipy. This caused cascading build failures for scikit-learn and pyarrow. Running pip install --upgrade pip before pip install -r requirements.txt unblocks the full install.

typing imports added to plot_utils.py

Optional, List, Dict, Union, and Tuple were used in type annotations but not imported. Python 3.10+ allows X | None union syntax without importing from typing but Python 3.8 requires the explicit import. Added from typing import Optional, List, Dict, Union, Tuple to resolve NameError: name 'Optional' is not defined on import.
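
For reference, annotations written with the typing names look like the example below once the import is in place. The summarize function is a made-up example, not something from plot_utils.py.

```python
from typing import Dict, List, Optional, Tuple, Union


def summarize(
    values: Optional[List[float]] = None,
    limits: Optional[Tuple[float, float]] = None,
) -> Dict[str, Union[int, float]]:
    """Count and total the values, optionally keeping only those in limits."""
    values = values or []
    if limits is not None:
        low, high = limits
        values = [v for v in values if low <= v <= high]
    return {"count": len(values), "total": sum(values)}
```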


5. Test Suite

pytest collection conflict resolved

Both py_scripts/test_model_calculator.py and unittests/test_model_calculator.py share the same basename, causing pytest to fail at collection with an import file mismatch error. Fixed by adding __init__.py to both py_scripts/ and unittests/ and clearing stale __pycache__ artifacts.

test_model_calculator fixes

test_calculate_coefficients: updated assertion from isinstance(row, dict) to isinstance(row, list) since the default path now returns feature-name lists, not dicts.

test_extract_final_model_wrapped_model: resolved by adding the hasattr(model, "model") branch to _extract_final_model. No test change needed.

test_calculate_shap_unsupported_model: resolved by adding the predict_proba guard at the top of _get_shap_explainer so unsupported models raise ValueError before reaching KernelExplainer.

test_calculate_shap_unexpected_shape and test_rowwise_shap_output_unexpected_type: both monkeypatched shap.Explainer which is no longer called. Updated to monkeypatch ModelCalculator._get_shap_explainer instead.

test_model_evaluator fixes

test_plot_threshold_metrics_with_lookup and test_plot_threshold_metrics_all_lookup_metrics: both asserted "Best threshold" in captured.out. The print format now prefixes the model name (e.g. "LogisticRegression -- best threshold for..."). Updated assertions to "best threshold" in captured.out which matches both the old and new format.

Final result: 204 collected, 204 passed. Coverage: metrics_utils 79%, model_calculator 87%, model_evaluator 83%, partial_dependence 89%, plot_utils 70%.


6. Test and Usage Scripts

test_model_calculator.py (py_scripts)

Standalone .py equivalent of the test notebook with section headers, coloured PASS/FAIL labels, and .to_string() DataFrame output for clean terminal readability. Paths are anchored to __file__ via SCRIPT_DIR and PROJECT_ROOT so the script runs correctly regardless of which directory python is invoked from.
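
Anchoring paths to the script location usually follows the pattern sketched here. The helper name resolve_script_paths is hypothetical, and treating the parent of the script directory as the project root is an assumption about the repository layout.

```python
from pathlib import Path


def resolve_script_paths(script_file):
    """Return (script_dir, project_root) for the given script path."""
    script_dir = Path(script_file).resolve().parent
    # Assumed layout: the script directory sits directly under the root.
    project_root = script_dir.parent
    return script_dir, project_root


# Inside a script this would typically be called as:
#     SCRIPT_DIR, PROJECT_ROOT = resolve_script_paths(__file__)
```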

test_model_calculator.ipynb

Updated to use load_breast_cancer() (30 real named features) instead of make_classification() (generic feature_0..9 labels) so SHAP and coefficient outputs show meaningful feature names. Added max_iter=10000 to LogisticRegression to suppress the convergence warning that cluttered notebook output.

Version 0.0.5a6

Bug Fixes

  • Fixed interactive Plotly plot not rendering in Jupyter notebooks unless save_plots was set. Display and saving are now fully decoupled; the plot always renders regardless of save_plots.

  • Removed duplicate HTML save block that existed inside the static plot section.

New Features

  • Added x_label_map and y_label_map parameters for mapping raw axis values to human-readable tick labels; useful for encoded or numeric categorical features.

  • Added modebar_image_format parameter ("png", "svg", "jpeg", "webp") to control the download format of the Plotly modebar camera button. Defaults to "png".

Improvements

  • Docstring updated to document x_label_map, y_label_map, and modebar_image_format.

  • Raises section expanded to cover all ValueError conditions, including invalid save_plots, missing image paths, missing HTML paths, invalid plot_type, and invalid modebar_image_format.

  • Update to plot_3d_pdp docstring

  • Adds full categorical feature support to plot_3d_pdp while preserving backward compatibility with numeric grids. The function now renders categorical axes correctly in both Matplotlib and Plotly by mapping categories to numeric surface positions and overlaying labels. Custom label mapping is supported for cleaner presentation. Interactive hover and axis ticks now display true category names. HTML export logic was refactored to prevent duplicate writes and ensure reliable saving across plot modes.

Version 0.0.5a5

  • Update to plot_3d_pdp docstring

  • Adds full categorical feature support to plot_3d_pdp while preserving backward compatibility with numeric grids. The function now renders categorical axes correctly in both Matplotlib and Plotly by mapping categories to numeric surface positions and overlaying labels. Custom label mapping is supported for cleaner presentation. Interactive hover and axis ticks now display true category names. HTML export logic was refactored to prevent duplicate writes and ensure reliable saving across plot modes.

Version 0.0.5a4

  • Adds full categorical feature support to plot_3d_pdp while preserving backward compatibility with numeric grids. The function now renders categorical axes correctly in both Matplotlib and Plotly by mapping categories to numeric surface positions and overlaying labels. Custom label mapping is supported for cleaner presentation. Interactive hover and axis ticks now display true category names. HTML export logic was refactored to prevent duplicate writes and ensure reliable saving across plot modes.

Version 0.0.5a3

Major Features Added

  • Axis Limits: Added xlim and ylim parameters for standardizing axis ranges across multiple model comparisons

  • Bottom Legend Support: Automatic figure height adjustment when legend_loc="bottom" to prevent legend overlap with x-axis labels

  • Multi-Model Layout: Improved default layout for multiple models with plot_type="all" - now arranges as one row per model (6 columns × N rows) instead of mixed layout

Bug Fixes

  • Heteroskedasticity Tests: Fixed categorical variable handling in all tests (Breusch-Pagan, White, Goldfeld-Quandt) by encoding categorical columns before running tests

  • Legend Formatting:

    • Fixed legend duplication bug in Scale-Location plot (removed test name prepending since interpretations already contain test names)

    • Fixed histogram legend to use apply_legend() for consistent formatting

    • Fixed legend kwargs not being properly passed to Scale-Location plot

  • Group Category Handling: Fixed KeyError when group_category column not in X DataFrame by properly checking column existence before filtering predictor columns

  • Index Alignment: Fixed AssertionError when using external group_category array by ensuring Series index matches X.index

  • Python 3.8 Compatibility: Fixed LaTeX rendering error in Scale-Location y-axis label by replacing \text{} with \mathrm{} for matplotlib <3.3 compatibility

Enhancements

  • Scale-Location Y-Axis: Changed to LaTeX notation r"$\sqrt{|\text{Std. Residuals}|}$" for better readability

  • Histogram Overlay: Removed normal distribution overlay from histogram_type="frequency" for cleaner, simpler visualization (overlay still present for histogram_type="density")

  • Text Wrapping: Added text_wrap parameter support to all subplot titles (previously only worked for suptitles)

  • Helper Functions:

    • Created apply_axis_limits() helper in plot_utils.py

    • Enhanced apply_legend() to handle bottom legend resizing with flag-based prevention of multiple resizes

  • Refactoring: Refactored plot_threshold_metrics() to use apply_plot_title() and apply_legend() helpers for consistency

Version 0.0.5a2

Testing Improvements

  • Added 150+ comprehensive unit tests covering:

    • Edge cases and error handling

    • Parameter validation

    • Different input pathways (model vs y_prob vs y_pred)

    • Group category functionality

    • Styling and customization options

    • Integration scenarios

Code Quality

  • Refactored monolithic 3866-line model_metrics.py into modular components:

    • model_evaluator.py - Main plotting and evaluation functions

    • metrics_utils.py - Utility functions and calculations

    • plot_utils.py - Plotting helper functions

  • Improved code maintainability and organization

  • Enhanced error messages and validation

ROC Curve Enhancements

  • Operating point visualization with two methods:

    • operating_point_method='youden' - Youden’s J statistic

    • operating_point_method='closest_topleft' - Closest to top-left corner

  • DeLong test support for AUC comparison between models via delong parameter

  • Legend ordering - Proper organization: AUC curves → Random Guess → Operating Points

  • Custom operating point styling via operating_point_kwgs

Residual Diagnostics Expansion

  • New plot types:

    • 'influence' - Influence plot with Cook’s distance bubbles

    • 'predictors' - Individual residual plots for each predictor

  • Heteroskedasticity testing with multiple methods:

    • 'breusch_pagan' - Breusch-Pagan test

    • 'white' - White’s test

    • 'goldfeld_quandt' - Goldfeld-Quandt test

    • 'spearman' - Spearman rank correlation

    • 'all' - Run all tests

  • LOWESS smoothing via show_lowess parameter

  • Centroid visualization with two modes:

    • User-defined groups via group_category

    • Automatic K-means clustering via n_clusters

  • Histogram types:

    • histogram_type='frequency' - Raw counts (default)

    • histogram_type='density' - Probability density with normal overlay

  • Diagnostics table - Comprehensive model diagnostics via show_diagnostics_table

  • Return diagnostics - Programmatic access via return_diagnostics=True

Group Category Support

  • All classification plots now support group_category parameter:

    • ROC curves with per-group AUC and counts

    • PR curves with per-group metrics

    • Calibration curves with per-group calibration

  • Residual diagnostics support group visualization with centroids

  • Summary performance supports grouped classification metrics

Gain Chart Enhancement

  • Gini coefficient calculation and display via show_gini parameter

  • Custom decimal places for Gini via decimal_places parameter

Legend Customization

  • Legend location now supports:

    • Standard matplotlib locations ('best', 'upper right', etc.)

    • 'bottom' - Places legend below plot (perfect for group categories)

  • Automatic legend ordering for better readability

summarize_model_performance

  • Added include_adjusted_r2 for regression models

  • Added group_category for grouped classification metrics

  • Added overall_only for regression to show only aggregate metrics

  • Improved coefficient ordering (intercept first)

  • Better handling of feature importances for tree-based models

show_confusion_matrix

  • Added show_colorbar parameter (default: False)

  • Added labels parameter to toggle TN/FP/FN/TP labels

  • Improved font size controls (inner_fontsize, label_fontsize, tick_fontsize)

show_roc_curve

  • Added show_operating_point and operating_point_method

  • Added operating_point_kwgs for custom styling

  • Added delong parameter for AUC comparison

  • Added group_category for stratified analysis

  • Added legend_loc parameter

show_pr_curve

  • Added legend_metric parameter ('ap' or 'aucpr')

  • Added group_category for stratified analysis

  • Added legend_loc parameter

show_calibration_curve

  • Added show_brier_score parameter (default: True)

  • Added brier_decimals for formatting

  • Added group_category for stratified analysis

  • Added legend_loc parameter

show_gain_chart

  • Added show_gini parameter (default: False)

  • Added decimal_places for Gini formatting

plot_threshold_metrics

  • Added lookup_metric and lookup_value for threshold optimization

  • Added model_threshold to highlight specific thresholds

  • Added baseline_thresh to toggle baseline line

  • Added custom styling: curve_kwgs, baseline_kwgs, threshold_kwgs, lookup_kwgs

show_residual_diagnostics

  • Added plot_type options: 'all', 'fitted', 'qq', 'scale_location', 'leverage', 'influence', 'histogram', 'predictors'

  • Added heteroskedasticity_test with multiple test options

  • Added show_lowess for trend lines

  • Added lowess_kwgs for LOWESS styling

  • Added group_category for stratified analysis

  • Added group_kwgs for custom group styling

  • Added show_centroids and centroid_kwgs

  • Added centroid_type ('clusters' or 'groups')

  • Added n_clusters for automatic clustering

  • Added histogram_type ('frequency' or 'density')

  • Added show_diagnostics_table and return_diagnostics

  • Added show_plots to disable plotting

  • Added show_outliers and n_outliers for labeling

  • Added legend_loc parameter

  • Added legend_kwgs to control legend display for groups, centroids, clusters, and het_tests

  • Added kmeans_rstate for reproducible clustering

  • Added n_cols and n_rows for custom subplot layouts

  • Added point_kwgs for scatter point styling (supports edgecolor, linewidth, etc.)

Bug Fixes

  • Fixed confusion matrix colorbar removal when show_colorbar=False

  • Fixed duplicate text handling in confusion matrix displays

  • Fixed legend placement for grouped visualizations

  • Fixed text wrapping for long titles

  • Fixed LOWESS exception handling (now fails gracefully)

  • Fixed feature importance display for tree-based models

  • Fixed coefficient ordering in regression output

  • Fixed empty metric columns in regression feature importance rows

Documentation Improvements

  • Comprehensive docstrings for all major functions

  • Parameter descriptions with examples

  • Error message improvements for better debugging

  • Type hints and validation error messages

  • Usage examples in docstrings

Testing

  • Test suite expanded from ~50 tests to 152 tests

  • Coverage increased from 50% to 86% on core modules

  • All edge cases and error conditions tested

  • Integration tests for real-world workflows

  • Parametrized tests for systematic coverage

Performance

  • No performance regressions

  • Modular code structure improves maintainability

  • Efficient calculation caching where applicable

Migration Guide

From 0.0.5a1 to 0.0.5a2:

No changes required. All existing code will work as before, and the new features are opt-in.

Version 0.0.5a1

Added

  • Operating Point Visualization for ROC Curves: Added show_operating_point parameter to display optimal classification thresholds on ROC curves with two methods:

    • youden: Youden’s J statistic (maximizes TPR - FPR)

    • closest_topleft: Point closest to top-left corner (minimizes distance to perfect classifier)

    • Configurable via operating_point_method and operating_point_kwgs parameters

    • Operating points display threshold values in legends and appear as markers on curves

  • Gini Coefficient for Gain Charts: Added automatic calculation and display of Gini coefficient in show_gain_chart()

    • Prints Gini coefficient for each model (default: 3 decimal places)

    • Displays in legend labels across all plot modes (overlay, subplots, single)

    • Configurable via show_gini and decimal_places parameters

  • Legend Location Control: Added legend_loc parameter to all plotting functions for flexible legend positioning

    • Supports standard matplotlib locations ('lower right', 'upper left', 'best', etc.)

    • Special 'bottom' option places legend below plot with proper spacing

    • Available in: show_roc_curve(), show_pr_curve(), show_calibration_curve(), show_lift_chart(), show_gain_chart()
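
The two operating_point_method options listed above pick a point on the ROC curve by different criteria. The sketch below shows both on plain Python lists; the function names and the sample curve are illustrative only and are not taken from the library.

```python
import math


def youden_index(fpr, tpr):
    """Index of the ROC point that maximizes Youden's J = TPR - FPR."""
    return max(range(len(fpr)), key=lambda i: tpr[i] - fpr[i])


def closest_topleft_index(fpr, tpr):
    """Index of the ROC point nearest the perfect classifier at (0, 1)."""
    return min(range(len(fpr)), key=lambda i: math.hypot(fpr[i], 1.0 - tpr[i]))


# Made-up ROC points on which the two criteria disagree.
fpr = [0.0, 0.2, 0.35, 1.0]
tpr = [0.0, 0.8, 1.0, 1.0]
thresholds = [1.0, 0.7, 0.4, 0.0]

youden_threshold = thresholds[youden_index(fpr, tpr)]
topleft_threshold = thresholds[closest_topleft_index(fpr, tpr)]
```

On this curve Youden's J favors the point with perfect recall, while the distance criterion favors the more balanced point, which is why the method is selectable.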

Improved

  • Legend Ordering for ROC Curves: Standardized legend entry order across all plot modes

    • Order: Model curves with AUC → Random Guess baseline → Operating points

    • Ensures consistent, intuitive legend presentation

  • Overlay Mode for ROC Curves: Enhanced operating point display in overlay plots

    • Combined AUC and operating point threshold in single legend entry

    • Format: “Model Name (AUC = 0.XX, Op = 0.XX)”

    • Operating point markers appear on curves without duplicate legend entries

Technical Details

  • Operating points calculated post-ROC curve generation using optimal threshold selection

  • Gini coefficient derived from area under gain curve: Gini = 2 × AUGC - 1

  • Legend positioning uses bbox_to_anchor for 'bottom' placement with dynamic spacing

  • All changes maintain backward compatibility with existing code
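
The Gini relationship listed above can be checked by hand with a small pure-Python sketch. It builds a cumulative gain curve, integrates it with the trapezoidal rule, and applies Gini = 2 × AUGC - 1. The function name gini_from_gain is hypothetical, the input is assumed to contain at least one positive label, and show_gain_chart may construct its curve differently.

```python
def gini_from_gain(y_true, y_score):
    """Gini coefficient from the area under a cumulative gain curve."""
    # Rank samples from highest to lowest score.
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    n = len(y_true)
    total_pos = sum(y_true)  # assumed to be greater than zero

    # Gain curve: fraction of population targeted vs. positives captured.
    xs, ys = [0.0], [0.0]
    captured = 0
    for rank, i in enumerate(order, start=1):
        captured += y_true[i]
        xs.append(rank / n)
        ys.append(captured / total_pos)

    # Trapezoidal area under the gain curve (AUGC).
    augc = sum(
        (xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2.0
        for k in range(1, len(xs))
    )
    return 2.0 * augc - 1.0
```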

Version 0.0.4a10

Refactored and stabilized the summarize_model_performance function to improve consistency across classification and regression workflows while preserving the exact formatting logic for printed outputs and regression coefficient display.

Changes

  • Consolidated redundant metric computation into dedicated helper functions for classification and regression metrics.

  • Ensured regression coefficients, intercepts, and feature importances are retained and ordered correctly in the final DataFrame output.

  • Fixed grouped classification output so Model Threshold always appears last, and group headers correctly reflect category names.

  • Added conditional handling for grouped classification to prevent KeyError when the "Model" column is absent.

  • Preserved the original manual formatting block to maintain Leon’s custom printing logic for both classification and regression:

    • Right-aligned all table columns for readability.

    • Retained separator-based visual formatting and model-wise breaks.

    • Preserved coefficient and intercept reporting behavior exactly as before, ensuring regression results remain interpretable and consistent.

Impact

  • Classification and regression now produce stable, well-ordered, and readable summaries.

  • Grouped and non-grouped runs behave consistently without disrupting regression coefficient output.

  • Backward compatibility with previous console and DataFrame output formats maintained.

Version 0.0.4a9

This release introduces a new parameter, brier_decimals, to the show_calibration_curve() function, allowing users to control the number of decimal places displayed for the Brier score.

Changes Made

  • Added brier_decimals parameter (default: 3) next to show_brier_score.

  • Updated Brier score display logic to format using round(brier_score, brier_decimals).

  • Improved readability and precision consistency across calibration plots.

Impact

  • No breaking changes.

  • Users now have finer control over Brier score precision in calibration curve visualizations.

Quick Example

from model_metrics import show_calibration_curve
show_calibration_curve(model, X, y, show_brier_score=True, brier_decimals=4)

Version 0.0.4a8

Summary:

Updated hanley_mcneil_auc_test() function to perform a large-sample z-test for comparing correlated AUCs, based on Hanley & McNeil (1982), an analytical approximation of DeLong’s test.

Key Changes:

  • Implemented hanley_mcneil_auc_test() with parameters:

    • y_true, y_scores_1, y_scores_2 for AUC comparison.

    • Optional model_names, verbose, and return_values arguments for flexible use.

  • Added formatted, human-readable print output (when verbose=True).

  • Enabled optional programmatic access with return_values=True.

  • Adopted NumPy-style docstring for clarity and consistency.

  • Integrated helper into show_roc_curve() to enable AUC significance testing when the delong argument is provided.

Notes: This helper can also be used as a standalone function for independent AUC comparison between two models, outside of visualization workflows.

Version 0.0.4a7

  • DeLong’s test (Hanley & McNeil approximation)

    • Implemented a new helper function hanley_mcneil_auc_test() for approximate DeLong’s AUC comparison.

    • Integrated the helper inside show_roc_curve() to optionally print AUC differences and p-values between two models.

    • Added corresponding pytest coverage under test_show_roc_curve_with_delong().

  • Group category support

    • Added the group_category input to summarize_model_performance() to generate subgroup-level performance summaries.

    • Enables stratified metric reporting for fairness or demographic analysis.

Version 0.0.4a6

Reworded the print message inside plot_threshold_metrics() for clarity.

Old:

print(
      f"Best threshold for {lookup_metric} = "
      f"{round(lookup_value, decimal_places)} is: "
      f"{round(best_threshold, decimal_places)}"
)

New:

print(
    f"Best threshold for target {lookup_metric} of "
    f"{round(lookup_value, decimal_places)} is "
    f"{round(best_threshold, decimal_places)}"
)

This removes the equals sign and colon, and adds “target” for a smoother, more descriptive sentence.

Version 0.0.4a5

  • Added a minimal type check to ensure y_prob is always a list at the start of each affected function:

    • summarize_model_performance

    • show_calibration_curve

    • show_confusion_matrix

    • show_lift_chart

    • show_gain_chart

    • show_roc_curve

    • show_pr_curve

# Ensure y_prob is always a list of NumPy arrays
if isinstance(y_prob, np.ndarray):
    y_prob = [y_prob]

This allows y_prob[0] indexing to work whether the caller provides a single NumPy array or a list of arrays.

  • Updated unittests

Version 0.0.4a4

  • Corrected README to reflect the current version.

  • Previous release did not update the README properly because the file was not saved before publishing.

  • No functional changes to the library.

Version 0.0.4a3

  • Added missing scipy (>=1.8,<=1.14.0) requirement to the README.

Version 0.0.4a2

This version updates pyproject.toml and requirements.txt to restrict SciPy to >=1.8,<=1.14.0.

  • Prevents installation of scipy==1.14.1+, which removes _lazywhere and breaks statsmodels.

  • Keeps compatibility with model_tuner and Colab environments.

  • Bumps package version for release.

  • Updated scipy dependency to >=1.8,<=1.14.0

  • Synced requirements.txt with updated constraints
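
To make the bounds concrete, the sketch below approximates the >=1.8,<=1.14.0 range with plain tuple comparison. Both helper names are hypothetical, and the parsing ignores pre-release and local version tags; a real check would more likely rely on the packaging library's SpecifierSet.

```python
def parse_release(version):
    """Turn '1.14.0' into (1, 14, 0). Pre-release tags are not handled."""
    return tuple(int(part) for part in version.split("."))


def scipy_version_allowed(version):
    """Approximate the >=1.8,<=1.14.0 constraint (hypothetical helper)."""
    return (1, 8) <= parse_release(version) <= (1, 14, 0)


for candidate in ["1.7.3", "1.8.0", "1.14.0", "1.14.1"]:
    print(candidate, scipy_version_allowed(candidate))
```

Only 1.8.0 and 1.14.0 pass; 1.7.3 falls below the lower bound and 1.14.1 exceeds the upper one.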

Version 0.0.4a1

  • Replaced the old grid parameter with subplots across plotting functions for consistency.

  • Standardized gridline handling by replacing unconditional plt.grid() calls with plt.grid(visible=gridlines).

Why

  • Aligns function signatures to use subplots consistently instead of grid.

  • Makes gridline visibility configurable through a single gridlines flag.

  • Produces cleaner charts when gridlines=False, with no visual change when gridlines=True.
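
The pattern can be sketched with a small stand-in function. plot_line is hypothetical and only shows how a single gridlines flag is forwarded to plt.grid(); it assumes a Matplotlib version that accepts the visible keyword.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def plot_line(x, y, gridlines=True):
    """Hypothetical plotting helper with a gridlines flag."""
    fig, ax = plt.subplots()
    ax.plot(x, y)
    plt.grid(visible=gridlines)  # one flag decides whether gridlines are drawn
    return fig, ax


fig_on, ax_on = plot_line([0, 1, 2], [0, 1, 4], gridlines=True)
fig_off, ax_off = plot_line([0, 1, 2], [0, 1, 4], gridlines=False)
```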

Version 0.0.4a

Summary

Added the ability to pass predicted probabilities (y_prob) directly into the functions in model_evaluator.py as an alternative to supplying a fitted model and feature matrix. This flexibility lets end users evaluate results in two ways:

  • Using a model object with X (current behavior)

  • Or passing y_prob directly (new option)

Details

  • Updated all relevant evaluator functions (summarize_model_performance, plot_threshold_metrics, etc.) to accept y_prob as input.

  • Added input validation: functions now check that at least one of (model and X) or y_prob is provided.

  • Preserved existing model-based workflows for backward compatibility.

  • Extended unit tests in unittests/ to cover the new probability-based path, including edge cases and validation errors.
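
The two input paths and the validation rule described above might look roughly like this. resolve_y_prob and ConstantModel are hypothetical names for illustration, and the sketch assumes a scikit-learn-style predict_proba; the actual evaluator functions may handle this differently.

```python
def resolve_y_prob(model=None, X=None, y_prob=None):
    """Return positive-class probabilities from either input path (hypothetical)."""
    if y_prob is not None:
        return list(y_prob)
    if model is None or X is None:
        raise ValueError("Provide either (model and X) or y_prob.")
    # Assumes predict_proba returns rows of [p_negative, p_positive].
    return [row[1] for row in model.predict_proba(X)]


class ConstantModel:
    """Stand-in model that predicts the same probability for every row."""

    def predict_proba(self, X):
        return [[0.3, 0.7] for _ in X]


from_model = resolve_y_prob(model=ConstantModel(), X=[[1], [2]])
from_probs = resolve_y_prob(y_prob=[0.2, 0.9])
print(from_model)  # [0.7, 0.7]
print(from_probs)  # [0.2, 0.9]
```

Calling resolve_y_prob() with no arguments raises ValueError, mirroring the "not both missing" rule.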

Why

End users sometimes already have predicted probabilities from external pipelines or pre-computed experiments. This change avoids forcing them to re-supply the model, streamlining the evaluation process.

Version 0.0.3a

  • Added "plotly>=5.18.0, <=5.24.1" to pyproject.toml, setup.py, and README_min.md; required by the partial_dependence.py functions.

Version 0.0.2a

Full Changelog: https://github.com/lshpaner/model_metrics/compare/0.0.1a...0.0.2a

Version 0.0.1a

  • Updated unit tests and README

  • Added statsmodels to library imports

  • Added coefficients and p-values to regression summary

  • Added regression capabilities to summarize_model_performance

  • Added lift and gains charts

  • Updated versions for earlier Python compatibility