Data Management Overview

In any data-driven project, effective management of data is crucial. This section provides essential techniques for handling and preparing data to ensure consistency, accuracy, and ease of analysis. From directory setup and data cleaning to advanced data processing, these methods form the backbone of reliable data management. Dive into the following topics to enhance your data handling capabilities and streamline your workflow.

Data Management Techniques

Path Directories

Ensure that the directory exists. If not, create it.

ensure_directory(path)
Parameters:

path (str) – The path to the directory that needs to be ensured.

Returns:

None

The ensure_directory function is a utility designed to facilitate the management of directory paths within your project. When working with data science projects, it is common to save and load data, images, and other artifacts from specific directories. This function helps in making sure that these directories exist before any read/write operations are performed. If the specified directory does not exist, the function creates it. If it already exists, it does nothing, thus preventing any errors related to missing directories.
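
Functionally, ensure_directory behaves like the standard-library pattern sketched below; this is an illustration of the behavior just described, not necessarily the library's exact implementation:

import os

def ensure_directory_sketch(path):
    # Create the directory only if it does not already exist;
    # otherwise do nothing
    if not os.path.exists(path):
        os.makedirs(path)
        print(f"Created directory: {path}")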

Implementation Example

In the example below, we demonstrate how to use the ensure_directory function to verify and create directories as needed. This example sets up paths for data and image directories, ensuring they exist before performing any operations that depend on them.

First, we define the base path as the parent directory of the current directory. The os.pardir constant, equivalent to "..", is used to navigate up one directory level. Then, we define paths for the data directory and data output directory, both located one level up from the current directory.

Next, we set paths for the PNG and SVG image directories, located within an images folder in the parent directory. Using the ensure_directory function, we then verify that these directories exist. If any of the specified directories do not exist, the function creates them.

from eda_toolkit import ensure_directory
import os  # standard library module for filesystem paths


base_path = os.path.join(os.pardir)

# Go up one level from 'notebooks' to parent directory,
# then into the 'data' folder
data_path = os.path.join(os.pardir, "data")
data_output = os.path.join(os.pardir, "data_output")

# create image paths
image_path_png = os.path.join(base_path, "images", "png_images")
image_path_svg = os.path.join(base_path, "images", "svg_images")

# Use the function to ensure each of the directories exists
ensure_directory(data_path)
ensure_directory(data_output)
ensure_directory(image_path_png)
ensure_directory(image_path_svg)

Output

Created directory: ../data
Created directory: ../data_output
Created directory: ../images/png_images
Created directory: ../images/svg_images

Adding Unique Identifiers

Add a column of unique IDs with a specified number of digits to the DataFrame.

add_ids(df, id_colname='ID', num_digits=9, seed=None, set_as_index=False)
Parameters:
  • df (pd.DataFrame) – The DataFrame to which IDs will be added.

  • id_colname (str, optional) – The name of the new column for the unique IDs. Defaults to "ID".

  • num_digits (int, optional) – The number of digits for the unique IDs. Defaults to 9. The first digit will always be non-zero to ensure proper formatting.

  • seed (int, optional) – The seed for the random number generator to ensure reproducibility. Defaults to None.

  • set_as_index (bool, optional) – If True, the generated ID column will replace the existing index of the DataFrame. Defaults to False.

Returns:

The updated DataFrame with a new column of unique IDs. If set_as_index is True, the new ID column replaces the existing index.

Return type:

pd.DataFrame

Raises:

ValueError – If the number of rows in the DataFrame exceeds the pool of possible unique IDs for the specified num_digits.

Notes

  • The function ensures all IDs are unique by resolving potential collisions during generation, even for large datasets.

  • The total pool size of unique IDs is determined by \(9 \times 10^{(\text{num_digits} - 1)}\), since the first digit must be non-zero.

  • Warnings are printed if the number of rows in the DataFrame approaches the pool size of possible unique IDs, recommending increasing num_digits.

  • If set_as_index is False, the ID column will be added as the first column in the DataFrame.

  • Setting a random seed ensures reproducibility of the generated IDs.
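
As a quick check of the pool-size formula in the notes above, the snippet below computes the number of available IDs for the default num_digits=9:

# First digit: 9 choices (1-9); each remaining digit: 10 choices (0-9)
num_digits = 9
pool_size = 9 * 10 ** (num_digits - 1)
print(pool_size)  # 900000000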

The add_ids function is used to append a column of unique identifiers with a specified number of digits to a given DataFrame. This is particularly useful for creating unique patient or record IDs in datasets. The function allows you to specify a custom column name for the IDs, the number of digits for each ID, and optionally set a seed for the random number generator to ensure reproducibility. Additionally, you can choose whether to set the new ID column as the index of the DataFrame.

Implementation Example

In the example below, we demonstrate how to use the add_ids function to add a column of unique IDs to a DataFrame. We apply it to the census data [1] loaded earlier, generating a unique ID with a specified number of digits for each row.

First, we import the add_ids function from the eda_toolkit. We then call it on the census DataFrame, specifying the column name for the IDs, the number of digits for each ID, a seed for reproducibility, and whether to set the new ID column as the index. Because set_as_index=True here, the generated census_id column replaces the existing index; with the default set_as_index=False, it would instead be added as the first column of the DataFrame.

from eda_toolkit import add_ids

# Add a column of unique IDs with 9 digits and call it "census_id"
df = add_ids(
    df=df,
    id_colname="census_id",
    num_digits=9,
    seed=111,
    set_as_index=True,
)

Output

First 5 Rows of Census Income Data (Adapted from Kohavi, 1996, UCI Machine Learning Repository) [1]

DataFrame index is unique.
 census_id | age | workclass        | fnlwgt | education | education-num | marital-status     | occupation        | relationship
-----------|-----|------------------|--------|-----------|---------------|--------------------|-------------------|---------------
 582248222 | 39  | State-gov        | 77516  | Bachelors | 13            | Never-married      | Adm-clerical      | Not-in-family
 561810758 | 50  | Self-emp-not-inc | 83311  | Bachelors | 13            | Married-civ-spouse | Exec-managerial   | Husband
 598098459 | 38  | Private          | 215646 | HS-grad   | 9             | Divorced           | Handlers-cleaners | Not-in-family
 776705221 | 53  | Private          | 234721 | 11th      | 7             | Married-civ-spouse | Handlers-cleaners | Husband
 479262902 | 28  | Private          | 338409 | Bachelors | 13            | Married-civ-spouse | Prof-specialty    | Wife

Trailing Period Removal

Strip the trailing period from values in a specified column of a DataFrame, if present.

strip_trailing_period(df, column_name)
Parameters:
  • df (pd.DataFrame) – The DataFrame containing the column to be processed.

  • column_name (str) – The name of the column containing values with potential trailing periods.

Returns:

The updated DataFrame with trailing periods removed from the specified column.

Return type:

pd.DataFrame

Raises:

ValueError – If the specified column_name does not exist in the DataFrame.

Notes:

  • For string values, trailing periods are stripped directly.

  • For numeric values represented as strings (e.g., "1234."), the trailing period is removed, and the value is converted back to a numeric type if possible.

  • NaN values are preserved and remain unprocessed.

  • Non-string and non-numeric types are returned as-is.

The strip_trailing_period function is designed to remove trailing periods from values in a specified column of a DataFrame. This can be particularly useful when dealing with data that has been inconsistently formatted, ensuring that all numeric values are correctly represented.

Implementation Example

In the example below, we demonstrate how to use the strip_trailing_period function to clean a column in a DataFrame. We start by importing the necessary libraries and creating a sample DataFrame. We then use the strip_trailing_period function to remove any trailing periods from the specified column.

import pandas as pd
from eda_toolkit import strip_trailing_period

# Create a sample DataFrame with a trailing period in one value
# (stored as the string "6.", as inconsistently formatted sources
# sometimes provide)
data = {
    "values": [1.0, 2.0, 3.0, 4.0, 5.0, "6."],
}
df = pd.DataFrame(data)

# Remove trailing periods from the 'values' column
df = strip_trailing_period(df=df, column_name="values")

Output

First 6 Rows of Data Before and After Removing Trailing Periods

Before:
Index Value
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.
After:
Index Value
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0

Note

The last value is written with a trailing period ("6."); after processing, the period is stripped and the value is converted to the float 6.0.

Standardized Dates

Parse and standardize date strings based on the provided rule.

parse_date_with_rule(date_str)

This function takes a date string and standardizes it to the ISO 8601 format (YYYY-MM-DD). It assumes dates are provided in either day/month/year or month/day/year format. The function first checks whether the first part of the date string is greater than 12, which unambiguously indicates a day/month/year format. If the first part is 12 or less, the function attempts to parse the date as day/month/year, falling back to month/day/year if the former raises a ValueError due to an impossible date (e.g., the month part being greater than 12).

Parameters:

date_str (str) – A date string to be standardized.

Returns:

A standardized date string in the format YYYY-MM-DD.

Return type:

str

Raises:

ValueError – If date_str is in an unrecognized format or if the function cannot parse the date.

Implementation Example

In the example below, we demonstrate how to use the parse_date_with_rule function to standardize date strings. We start by importing the necessary library and creating a sample list of date strings. We then use the parse_date_with_rule function to parse and standardize each date string to the ISO 8601 format.

from eda_toolkit import parse_date_with_rule

date_strings = ["15/04/2021", "04/15/2021", "01/12/2020", "12/01/2020"]

standardized_dates = [parse_date_with_rule(date) for date in date_strings]

print(standardized_dates)

Output

['2021-04-15', '2021-04-15', '2020-12-01', '2020-01-12']
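
A minimal sketch of the parsing rule described above, using only the standard library (the library's actual implementation may differ):

from datetime import datetime

def parse_date_sketch(date_str):
    # Try day/month/year first; fall back to month/day/year when the
    # day/month reading is impossible (e.g., "04/15/2021")
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {date_str}")

This reproduces the output above: "01/12/2020" parses as 1 December ('2020-12-01'), while "04/15/2021" fails day/month parsing and falls back to April 15 ('2021-04-15').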

Important

In the next example, we demonstrate how to apply the parse_date_with_rule function to a DataFrame column containing date strings using the .apply() method. This is particularly useful when you need to standardize date formats across an entire column in a DataFrame.

import pandas as pd

# Creating the DataFrame
data = {
    "date_column": [
        "31/12/2021",
        "01/01/2022",
        "12/31/2021",
        "13/02/2022",
        "07/04/2022",
    ],
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "amount": [100.0, 150.5, 200.75, 250.25, 300.0],
}

df = pd.DataFrame(data)

# Apply the function to the DataFrame column
df["standardized_date"] = df["date_column"].apply(parse_date_with_rule)

print(df)

Output

   date_column     name  amount standardized_date
0   31/12/2021    Alice  100.00        2021-12-31
1   01/01/2022      Bob  150.50        2022-01-01
2   12/31/2021  Charlie  200.75        2021-12-31
3   13/02/2022    David  250.25        2022-02-13
4   07/04/2022      Eve  300.00        2022-04-07

DataFrame Analysis

Analyze DataFrame columns, including data type, null values, and unique value counts.

The dataframe_profiler function provides a comprehensive summary of a DataFrame’s columns, including information on data types, null values, unique values, and the most frequent values. It can output either a plain DataFrame for further processing or a styled DataFrame for visual presentation in Jupyter environments.

dataframe_profiler(df, background_color=None, return_df=False, sort_cols_alpha=False)
Parameters:
  • df (pandas.DataFrame) – The DataFrame to analyze.

  • background_color (str, optional) – Hex color code or color name for background styling in the output DataFrame. Applies to specific columns, such as unique value totals and percentages. Defaults to None.

  • return_df (bool, optional) – If True, returns the plain DataFrame with summary statistics. If False, returns a styled DataFrame for visual presentation. Defaults to False.

  • sort_cols_alpha (bool, optional) – If True, sorts the DataFrame columns alphabetically in the output. Defaults to False.

Returns:

  • If return_df is True or running in a terminal environment, returns the plain DataFrame containing column summary statistics.

  • If return_df is False and running in a Jupyter Notebook, returns a styled DataFrame with optional background styling.

Return type:

pandas.DataFrame

Raises:

ValueError – If the DataFrame is empty or contains no columns.

Notes

  • Automatically detects whether the function is running in a Jupyter Notebook or terminal and adjusts the output accordingly.

  • In Jupyter environments, uses Pandas’ Styler for visual presentation. If the installed Pandas version does not support hide, falls back to hide_index.

  • Utilizes a tqdm progress bar to show the status of column processing.

  • Preprocesses columns to handle NaN values and replaces empty strings with Pandas’ pd.NA for consistency.

Census Income Example

In the example below, we demonstrate how to use the dataframe_profiler function to analyze a DataFrame’s columns.

from eda_toolkit import dataframe_profiler

dataframe_profiler(df=df)

Output

Result on Census Income Data (Adapted from Kohavi, 1996, UCI Machine Learning Repository) [1]

Shape:  (48842, 15)

Processing columns: 100%|██████████| 15/15 [00:00<00:00, 74.38it/s]

Total seconds of processing time: 0.351102
 column         | dtype  | null_total | null_pct | unique_values_total | max_unique_value   | max_unique_value_total | max_unique_value_pct
----------------|--------|------------|----------|---------------------|--------------------|------------------------|---------------------
 age            | int64  | 0          | 0.00     | 74                  | 36                 | 1348                   | 2.76
 workclass      | object | 963        | 1.97     | 9                   | Private            | 33906                  | 69.42
 fnlwgt         | int64  | 0          | 0.00     | 28523               | 203488             | 21                     | 0.04
 education      | object | 0          | 0.00     | 16                  | HS-grad            | 15784                  | 32.32
 education-num  | int64  | 0          | 0.00     | 16                  | 9                  | 15784                  | 32.32
 marital-status | object | 0          | 0.00     | 7                   | Married-civ-spouse | 22379                  | 45.82
 occupation     | object | 966        | 1.98     | 15                  | Prof-specialty     | 6172                   | 12.64
 relationship   | object | 0          | 0.00     | 6                   | Husband            | 19716                  | 40.37
 race           | object | 0          | 0.00     | 5                   | White              | 41762                  | 85.50
 sex            | object | 0          | 0.00     | 2                   | Male               | 32650                  | 66.85
 capital-gain   | int64  | 0          | 0.00     | 123                 | 0                  | 44807                  | 91.74
 capital-loss   | int64  | 0          | 0.00     | 99                  | 0                  | 46560                  | 95.33
 hours-per-week | int64  | 0          | 0.00     | 96                  | 40                 | 22803                  | 46.69
 native-country | object | 274        | 0.56     | 42                  | United-States      | 43832                  | 89.74
 income         | object | 0          | 0.00     | 4                   | <=50K              | 24720                  | 50.61

DataFrame Column Names

unique_values_total

This column indicates the total number of unique values present in each column of the DataFrame. It measures the distinct values that a column holds. For example, in the age column, there are 74 unique values, meaning the ages vary across 74 distinct entries.

max_unique_value

This column shows the most frequently occurring value in each column. For example, in the workclass column, the most common value is Private, indicating that this employment type is the most represented in the dataset. For numeric columns like capital-gain and capital-loss, the most common value is 0, which suggests that the majority of individuals have no capital gain or loss.

max_unique_value_total

This represents the count of the most frequently occurring value in each column. For instance, in the native-country column, the value United-States appears 43,832 times, indicating that the majority of individuals in the dataset are from the United States.

max_unique_value_pct

This column shows the percentage that the most frequent value constitutes of the total number of rows. For example, in the race column, the value White makes up 85.5% of the data, suggesting a significant majority of the dataset belongs to this racial group.

Calculation Details

  • unique_values_total is calculated using the nunique() function, which counts the number of unique values in a column.

  • max_unique_value is determined by finding the value with the highest frequency using value_counts(). For string columns, any missing values (if present) are replaced with the string "null" before computing the frequency.

  • max_unique_value_total is the frequency count of the max_unique_value.

  • max_unique_value_pct is the percentage of max_unique_value_total divided by the total number of rows in the DataFrame, providing an idea of how dominant the most frequent value is.
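
For reference, the sketch below reproduces these four statistics for a single column using plain pandas; the profiler's internals may differ, so treat this as an illustration of the calculations, not the library's code:

import pandas as pd

def column_stats_sketch(s: pd.Series) -> dict:
    # Count missing values under the label "null", mirroring the
    # description above
    counts = s.fillna("null").value_counts()
    return {
        "unique_values_total": s.nunique(),
        "max_unique_value": counts.index[0],
        "max_unique_value_total": int(counts.iloc[0]),
        "max_unique_value_pct": round(100 * counts.iloc[0] / len(s), 2),
    }

print(column_stats_sketch(df["workclass"]))
# e.g. {'unique_values_total': 9, 'max_unique_value': 'Private', ...}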

This analysis helps in identifying columns with a high proportion of dominant values, like <=50K in the income column, which appears 24,720 times, making up 50.61% of the entries. This insight can be useful for understanding data distributions, identifying potential data imbalances, or even spotting opportunities for feature engineering in further data processing steps.

Generating Summary Tables for Variable Combinations

Generate and save summary tables for all possible combinations of specified variables in a DataFrame to an Excel file, complete with formatting and progress tracking.

summarize_all_combinations(df, variables, data_path, data_name, min_length=2)

Generate summary tables for all possible combinations of the specified variables in the DataFrame and save them to an Excel file with a Table of Contents.

Parameters:
  • df (pandas.DataFrame) – The pandas DataFrame containing the data.

  • variables (list of str) – List of column names from the DataFrame to generate combinations.

  • data_path (str) – Directory path where the output Excel file will be saved.

  • data_name (str) – Name of the output Excel file.

  • min_length (int, optional) – Minimum size of the combinations to generate. Defaults to 2.

Returns:

A tuple containing:

  • A dictionary where keys are tuples of column names (combinations) and values are the corresponding summary DataFrames.

  • A list of all generated combinations, where each combination is represented as a tuple of column names.

Return type:

tuple(dict, list)

Notes

  • Excel Output:
    • Each combination is saved as a separate sheet in an Excel file.

    • A “Table of Contents” sheet is created with hyperlinks to each combination’s summary table.

    • Sheet names are truncated to 31 characters to meet Excel’s constraints.

  • Formatting:
    • Headers in all sheets are bold, left-aligned, and borderless.

    • Columns are auto-fitted to content length for improved readability.

    • A left-aligned format is applied to all columns in the output.

  • The function uses tqdm progress bars for tracking:
    • Combination generation.

    • Writing the Table of Contents.

    • Writing individual summary tables to Excel.

The function returns two outputs:

1. summary_tables: A dictionary where each key is a tuple representing a combination of variables, and each value is a DataFrame containing the summary table for that combination. Each summary table includes the count and proportion of occurrences for each unique combination of values.

2. all_combinations: A list of all generated combinations of the specified variables. This is useful for understanding which combinations were analyzed and included in the summary tables.
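
For intuition, the set of combinations mirrors what itertools.combinations yields for every size from min_length up to the number of variables; a minimal sketch (not the library's internals) is shown below:

from itertools import combinations

variables = ["age_group", "workclass", "education"]
min_length = 2

all_combos = [
    combo
    for size in range(min_length, len(variables) + 1)
    for combo in combinations(variables, size)
]
print(all_combos)
# [('age_group', 'workclass'), ('age_group', 'education'),
#  ('workclass', 'education'), ('age_group', 'workclass', 'education')]

With the seven variables used in the example below, this enumeration yields the 120 combinations shown in the output.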

Implementation Example

Below, we use the summarize_all_combinations function to generate summary tables for the specified variables from a DataFrame containing the census data [1].

Note

Before proceeding with any further examples in this documentation, ensure that the age variable is binned into a new variable, age_group. Detailed instructions for this process can be found under Binning Numerical Columns.

from eda_toolkit import summarize_all_combinations

# Define unique variables for the analysis
unique_vars = [
    "age_group",
    "workclass",
    "education",
    "occupation",
    "race",
    "sex",
    "income",
]

# Generate summary tables for all combinations of the specified variables
summary_tables, all_combinations = summarize_all_combinations(
    df=df,
    data_path=data_output,
    variables=unique_vars,
    data_name="census_summary_tables.xlsx",
)

# Print all combinations of variables
print(all_combinations)

Output

Generating combinations: 100%|██████████| 120/120 [00:01<00:00, 76.56it/s]
Writing summary tables: 100%|██████████| 120/120 [00:41<00:00,  2.87it/s]
Finalizing Excel file: 100%|██████████| 1/1 [00:00<00:00, 13706.88it/s]
Data saved to ../data_output/census_summary_tables.xlsx
[('age_group', 'workclass'),
('age_group', 'education'),
('age_group', 'occupation'),
('age_group', 'race'),
('age_group', 'sex'),
('age_group', 'income'),
('workclass', 'education'),
('workclass', 'occupation'),
('workclass', 'race'),
('workclass', 'sex'),
('workclass', 'income'),
('education', 'occupation'),
('education', 'race'),
('education', 'sex'),
('education', 'income'),
('occupation', 'race'),
('occupation', 'sex'),
('occupation', 'income'),
('race', 'sex'),
('race', 'income'),
('sex', 'income'),
('age_group', 'workclass', 'education'),
('age_group', 'workclass', 'occupation'),
('age_group', 'workclass', 'race'),
('age_group', 'workclass', 'sex'),
('age_group', 'workclass', 'income'),
('age_group', 'education', 'occupation'),
('age_group', 'education', 'race'),
('age_group', 'education', 'sex'),
('age_group', 'education', 'income'),
('age_group', 'occupation', 'race'),
('age_group', 'occupation', 'sex'),
('age_group', 'occupation', 'income'),
('age_group', 'race', 'sex'),
('age_group', 'race', 'income'),
('age_group', 'sex', 'income'),
('workclass', 'education', 'occupation'),
('workclass', 'education', 'race'),
('workclass', 'education', 'sex'),
('workclass', 'education', 'income'),
('workclass', 'occupation', 'race'),
('workclass', 'occupation', 'sex'),
('workclass', 'occupation', 'income'),
('workclass', 'race', 'sex'),
('workclass', 'race', 'income'),
('workclass', 'sex', 'income'),
('education', 'occupation', 'race'),
('education', 'occupation', 'sex'),
('education', 'occupation', 'income'),
('education', 'race', 'sex'),
('education', 'race', 'income'),
('education', 'sex', 'income'),
('occupation', 'race', 'sex'),
('occupation', 'race', 'income'),
('occupation', 'sex', 'income'),
('race', 'sex', 'income'),
('age_group', 'workclass', 'education', 'occupation'),
('age_group', 'workclass', 'education', 'race'),
('age_group', 'workclass', 'education', 'sex'),
('age_group', 'workclass', 'education', 'income'),
('age_group', 'workclass', 'occupation', 'race'),
('age_group', 'workclass', 'occupation', 'sex'),
('age_group', 'workclass', 'occupation', 'income'),
('age_group', 'workclass', 'race', 'sex'),
('age_group', 'workclass', 'race', 'income'),
('age_group', 'workclass', 'sex', 'income'),
('age_group', 'education', 'occupation', 'race'),
('age_group', 'education', 'occupation', 'sex'),
('age_group', 'education', 'occupation', 'income'),
('age_group', 'education', 'race', 'sex'),
('age_group', 'education', 'race', 'income'),
('age_group', 'education', 'sex', 'income'),
('age_group', 'occupation', 'race', 'sex'),
('age_group', 'occupation', 'race', 'income'),
('age_group', 'occupation', 'sex', 'income'),
('age_group', 'race', 'sex', 'income'),
('workclass', 'education', 'occupation', 'race'),
('workclass', 'education', 'occupation', 'sex'),
('workclass', 'education', 'occupation', 'income'),
('workclass', 'education', 'race', 'sex'),
('workclass', 'education', 'race', 'income'),
('workclass', 'education', 'sex', 'income'),
('workclass', 'occupation', 'race', 'sex'),
('workclass', 'occupation', 'race', 'income'),
('workclass', 'occupation', 'sex', 'income'),
('workclass', 'race', 'sex', 'income'),
('education', 'occupation', 'race', 'sex'),
('education', 'occupation', 'race', 'income'),
('education', 'occupation', 'sex', 'income'),
('education', 'race', 'sex', 'income'),
('occupation', 'race', 'sex', 'income'),
('age_group', 'workclass', 'education', 'occupation', 'race'),
('age_group', 'workclass', 'education', 'occupation', 'sex'),
('age_group', 'workclass', 'education', 'occupation', 'income'),
('age_group', 'workclass', 'education', 'race', 'sex'),
('age_group', 'workclass', 'education', 'race', 'income'),
('age_group', 'workclass', 'education', 'sex', 'income'),
('age_group', 'workclass', 'occupation', 'race', 'sex'),
('age_group', 'workclass', 'occupation', 'race', 'income'),
('age_group', 'workclass', 'occupation', 'sex', 'income'),
('age_group', 'workclass', 'race', 'sex', 'income'),
('age_group', 'education', 'occupation', 'race', 'sex'),
('age_group', 'education', 'occupation', 'race', 'income'),
('age_group', 'education', 'occupation', 'sex', 'income'),
('age_group', 'education', 'race', 'sex', 'income'),
('age_group', 'occupation', 'race', 'sex', 'income'),
('workclass', 'education', 'occupation', 'race', 'sex'),
('workclass', 'education', 'occupation', 'race', 'income'),
('workclass', 'education', 'occupation', 'sex', 'income'),
('workclass', 'education', 'race', 'sex', 'income'),
('workclass', 'occupation', 'race', 'sex', 'income'),
('education', 'occupation', 'race', 'sex', 'income'),
('age_group', 'workclass', 'education', 'occupation', 'race', 'sex'),
('age_group', 'workclass', 'education', 'occupation', 'race', 'income'),
('age_group', 'workclass', 'education', 'occupation', 'sex', 'income'),
('age_group', 'workclass', 'education', 'race', 'sex', 'income'),
('age_group', 'workclass', 'occupation', 'race', 'sex', 'income'),
('age_group', 'education', 'occupation', 'race', 'sex', 'income'),
('workclass', 'education', 'occupation', 'race', 'sex', 'income'),
('age_group',
'workclass',
'education',
'occupation',
'race',
'sex',
'income')]

When applied to the US Census data, the output Excel file will contain summary tables for all possible combinations of the specified variables. The first sheet will be a Table of Contents with hyperlinks to each summary table.

Saving DataFrames to Excel with Customized Formatting

Save multiple DataFrames to separate sheets in an Excel file with progress tracking and formatting.

This section explains how to save multiple DataFrames to separate sheets in an Excel file with customized formatting using the save_dataframes_to_excel function.

save_dataframes_to_excel(file_path, df_dict, decimal_places=0)

Save DataFrames to an Excel file, applying formatting and autofitting columns.

Parameters:
  • file_path (str) – Full path to the output Excel file.

  • df_dict (dict) – Dictionary where keys are sheet names and values are DataFrames to save.

  • decimal_places (int) – Number of decimal places to round numeric columns. Default is 0. When set to 0, numeric columns are saved as integers.

Notes

  • Progress Tracking: The function uses tqdm to display a progress bar while saving DataFrames to Excel sheets.

  • Column Formatting:
    • Columns are autofitted to content length and left-aligned by default.

    • Numeric columns are rounded to the specified decimal places and formatted accordingly.

    • Non-numeric columns are left-aligned.

  • Header Styling:
    • Headers are bold, left-aligned, and borderless.

  • Dependencies: This function requires the xlsxwriter library.

The function performs the following tasks:

  • Writes each DataFrame to its respective sheet in the Excel file.

  • Rounds numeric columns to the specified number of decimal places.

  • Applies customized formatting to headers and cells.

  • Autofits columns based on content length for improved readability.

  • Tracks the saving process with a progress bar, making it user-friendly for large datasets.

Implementation Example

Below, we use the save_dataframes_to_excel function to save two DataFrames: the original DataFrame and a filtered DataFrame with ages between 18 and 40.

from eda_toolkit import save_dataframes_to_excel

file_name = "df_census.xlsx"  # Name of the output Excel file
file_path = os.path.join(data_path, file_name)

# Filter DataFrame to ages strictly between 18 and 40
filtered_df = df[(df["age"] > 18) & (df["age"] < 40)]

df_dict = {
    "original_df": df,
    "ages_18_to_40": filtered_df,
}

save_dataframes_to_excel(
    file_path=file_path,
    df_dict=df_dict,
    decimal_places=0,
)

Output

Saving DataFrames: 100%|██████████| 2/2 [00:08<00:00,  4.34s/it]
DataFrames saved to ../data/df_census.xlsx

The output Excel file will contain the original DataFrame and a filtered DataFrame as a separate tab with ages between 18 and 40, each on separate sheets with customized formatting.

Creating Contingency Tables

Create a contingency table from one or more columns in a DataFrame, with sorting options.

This section explains how to create contingency tables from one or more columns in a DataFrame, with options to sort the results using the contingency_table function.

contingency_table(df, cols=None, sort_by=0)
Parameters:
  • df (pandas.DataFrame) – The DataFrame to analyze.

  • cols (str or list of str, optional) – Name of the column (as a string) for a single column or list of column names for multiple columns. Must provide at least one column.

  • sort_by (int, optional) – Enter 0 to sort results by column groups; enter 1 to sort results by totals in descending order. Defaults to 0.

Raises:

ValueError – If no columns are specified or if sort_by is not 0 or 1.

Returns:

A DataFrame containing the contingency table with the specified columns, a 'Total' column representing the count of occurrences, and a 'Percentage' column representing the percentage of the total count.

Return type:

pandas.DataFrame

Implementation Example

Below, we use the contingency_table function to create a contingency table from the specified columns in a DataFrame containing census data [1]

from eda_toolkit import contingency_table

contingency_table(
    df=df,
    cols=[
        "age_group",
        "workclass",
        "race",
        "sex",
    ],
    sort_by=1,
)

Note

You may notice a new variable, age_group, is introduced. The logic for generating this variable is provided under Binning Numerical Columns.

Output

The output will be a contingency table with the specified columns, showing the total counts and percentages of occurrences for each combination of values. The table will be sorted by the 'Total' column in descending order because sort_by is set to 1.

    age_group     workclass                race     sex  Total  Percentage
0       30-39       Private               White    Male   5856       11.99
1       18-29       Private               White    Male   5623       11.51
2       40-49       Private               White    Male   4267        8.74
3       18-29       Private               White  Female   3680        7.53
4       50-59       Private               White    Male   2565        5.25
..        ...           ...                 ...     ...    ...         ...
467     50-59   Federal-gov               Other    Male      1        0.00
468     50-59     Local-gov  Asian-Pac-Islander  Female      1        0.00
469     70-79  Self-emp-inc               Black    Male      1        0.00
470     80-89     Local-gov  Asian-Pac-Islander    Male      1        0.00
471                                                      48842      100.00

[472 rows x 6 columns]
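
For orientation, the counts underlying this table can be approximated with a plain pandas groupby. The sketch below is a rough equivalent for the census columns used above, not the function's actual implementation (which also appends the grand-total row and exposes the sorting options):

# Approximate the 'Total' and 'Percentage' columns with plain pandas
counts = (
    df.groupby(["age_group", "workclass", "race", "sex"], observed=True)
    .size()
    .reset_index(name="Total")
    .sort_values("Total", ascending=False)
)
counts["Percentage"] = (100 * counts["Total"] / len(df)).round(2)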

Generating Summaries (Table 1)

Create a summary statistics table for both categorical and continuous variables in a DataFrame. This section describes how to generate a summary “Table 1” from a given dataset using the generate_table1 function. It supports automatic detection of variable types, pretty-printing, optional export to Markdown, and p-value adjustments.

generate_table1(df, apply_bonferroni=False, apply_bh_fdr=False, categorical_cols=None, continuous_cols=None, decimal_places=2, export_markdown=False, drop_columns=None, drop_variables=None, markdown_path=None, max_categories=None, detect_binary_numeric=True, return_markdown_only=False, value_counts=False, include_types='both', combine=True, groupby_col=None, use_fisher_exact=False, use_welch=True)
Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing the data to summarize.

  • apply_bonferroni (bool, optional) – Whether to apply Bonferroni correction to p-values.

  • apply_bh_fdr (bool, optional) – Whether to apply Benjamini-Hochberg FDR correction to p-values.

  • categorical_cols (list of str, optional) – List of categorical column names. If None, inferred automatically.

  • continuous_cols (list of str, optional) – List of continuous column names. If None, inferred automatically.

  • decimal_places (int, optional) – Number of decimal places for rounding statistics.

  • export_markdown (bool, optional) – Whether to export the table(s) to Markdown files.

  • drop_columns (str or list of str, optional) – Column name or list of column names to drop from the final output.

  • drop_variables (str or list of str, optional) – Variable name or list of variables to exclude from output.

  • markdown_path (str or Path, optional) – Output path or filename prefix for Markdown export.

  • max_categories (int, optional) – Maximum number of category levels to display in value counts.

  • detect_binary_numeric (bool, optional) – Whether to reclassify numeric variables with ≤2 unique values as categorical.

  • return_markdown_only (bool, optional) – If True, returns Markdown string(s) instead of DataFrame(s).

  • value_counts (bool, optional) – If True, include value counts and proportions for categorical variables.

  • include_types (str) – Which types of variables to include: "categorical", "continuous", or "both".

  • combine (bool, optional) – If True, return a single DataFrame when include_types is "both"; otherwise, return tuple.

  • groupby_col (str, optional) – Optional column name for binary group comparisons (e.g., treatment vs. control).

  • use_fisher_exact (bool, optional) – Whether to use Fisher’s Exact Test for 2x2 categorical comparisons.

  • use_welch (bool, optional) – Whether to use Welch’s t-test (default). If False, uses Student’s t-test.

Returns:

  • If include_types="both" and combine=False: tuple of two DataFrames (continuous, categorical)

  • If include_types="both" and combine=True: single combined DataFrame

  • If include_types="continuous" or "categorical": single DataFrame

  • If export_markdown=True and return_markdown_only=True: returns Markdown string or dict of strings

Return type:

pandas.DataFrame or tuple or str or dict

Important

By default, combine=True, so the function returns a single DataFrame containing both continuous and categorical summaries. This may introduce visual clutter, as categorical rows do not use columns like mean, standard deviation, median, min, or max. These columns are automatically removed when generating a categorical-only summary.

To separate the summaries for clarity or further processing, set combine=False and unpack the result into two distinct objects:

df_cont, df_cat = generate_table1(df, combine=False)

This provides greater flexibility for formatting, exporting, or downstream analysis.

Important

When using combine=False, the function returns a tuple of two TableWrapper objects — one for continuous variables and one for categorical variables.

The TableWrapper class is a lightweight wrapper designed to override the string representation of a DataFrame, enabling pretty-print formatting (e.g., tables) when the result is printed in a Jupyter cell. The wrapper preserves full access to the underlying DataFrame through attribute forwarding.

class TableWrapper:
    """
    Wraps a DataFrame to override its string output without affecting
    Jupyter display.
    """
    def __init__(self, df, string):
        self._df = df
        self._string = string

    def __str__(self):
        return self._string

    def __getattr__(self, attr):
        return getattr(self._df, attr)

    def __getitem__(self, key):
        return self._df[key]

    def __len__(self):
        return len(self._df)

    def __iter__(self):
        return iter(self._df)

Because TableWrapper overrides __str__ but not __repr__, assigning the result of generate_table1() to a single variable like this:

table1 = generate_table1(df, combine=False)

will show a tuple of object addresses instead of nicely formatted tables:

(<eda_toolkit.data_manager.TableWrapper at 0x...>, <eda_toolkit.data_manager.TableWrapper at 0x...>)

To avoid this, you have two preferred options:

  1. Assign the output to two separate variables:

    table1_cont, table1_cat = generate_table1(df, combine=False)
    

    Then display them individually:

    table1_cont
    table1_cat
    
  2. Don’t assign the result at all when calling from a notebook or REPL cell — this will trigger automatic pretty-printing of both tables, with a blank line in between:

    generate_table1(df, combine=False)
    

This design keeps the output clean for interactive use while still supporting unpacking in scripts and pipelines.

Example 1: Mixed Summary Table (value_counts=False)

In the example below, we generate a summary table from a dataset containing both categorical and continuous variables. We explicitly define which columns fall into each category—although the generate_table1 function also supports automatic detection of variable types if desired.

The summary output is automatically pretty-printed in the console using the table1_to_str utility. This formatting is applied behind the scenes whenever a summary table is printed, making it especially helpful for reading outputs within notebooks or logging environments.

In this case, we specify export_markdown=True and provide a filename via markdown_path. This allows the summary to be exported in Markdown format for use in reports, documentation, or publishing platforms like Jupyter Book or Quarto. When include_types="both" and combine=True (the default), both continuous and categorical summaries are merged into a single DataFrame and written to two separate Markdown files with _continuous.md and _categorical.md suffixes.

We also set value_counts=False to limit each categorical variable to a single summary row, rather than expanding into one row per category-level value.

from eda_toolkit import generate_table1

table1 = generate_table1(
    df=df,
    categorical_cols=["sex", "race", "workclass"],
    continuous_cols=["hours-per-week", "age", "education-num"],
    value_counts=False,
    max_categories=3,
    export_markdown=True,
    decimal_places=0,
    markdown_path="table1_summary.md",
)

print(table1)

Output

 Variable       | Type        | Mean | SD | Median | Min | Max | Mode    | Missing (n) | Missing (%) | Count  | Proportion (%)
----------------|-------------|------|----|--------|-----|-----|---------|-------------|-------------|--------|----------------
 hours-per-week | Continuous  | 40   | 12 | 40     | 1   | 99  | 40      | 0           | 0           | 48,842 | 100
 age            | Continuous  | 39   | 14 | 37     | 17  | 90  | 36      | 0           | 0           | 48,842 | 100
 education-num  | Continuous  | 10   | 3  | 10     | 1   | 16  | 9       | 0           | 0           | 48,842 | 100
 sex            | Categorical |      |    |        |     |     | Male    | 0           | 0           | 48,842 | 100
 race           | Categorical |      |    |        |     |     | White   | 0           | 0           | 48,842 | 100
 workclass      | Categorical |      |    |        |     |     | Private | 963         | 2           | 47,879 | 98

Note

If categorical_cols or continuous_cols are not specified, generate_table1 will automatically detect them based on the column data types. Additionally, numeric columns with two or fewer unique values can be reclassified as categorical using the detect_binary_numeric=True setting (enabled by default).

When value_counts=True, one row will be generated for each category-value pair rather than for the variable as a whole.

Example 2: Mixed Summary Table (value_counts=True)

In this example, we call generate_table1 without manually specifying which columns are categorical or continuous. Instead, the function automatically detects variable types based on data types. Numeric columns with two or fewer unique values are also reclassified as categorical by default (controlled via detect_binary_numeric=True).

We set value_counts=True to generate a separate summary row for each unique value within a categorical variable, rather than a single row per variable. To keep the output concise, we limit each breakdown to the top 3 most frequent values using max_categories=3.

We also enable export_markdown=True to export the summaries in Markdown format. While you can specify a custom markdown_path, if none is provided, the output files are saved to the current working directory.

Since include_types="both" is the default and combine=True by default as well, the underlying summaries are merged into a single DataFrame for display—but two separate Markdown files are still generated with suffixes that reflect the type of summary:

  • table1_summary_continuous.md

  • table1_summary_categorical.md

This setup is ideal for detailed reporting, especially when working with downstream tools like Jupyter Book, Quarto, or static site generators.

from eda_toolkit import generate_table1

table1_cont, table1_cat = generate_table1(
    df=df,
    value_counts=True,
    max_categories=3,
    combine=False,
    export_markdown=True,
    markdown_path="table1_summary.md",
)

table1_cont = table1_cont.drop(columns=["Type", "Mode"])

Output

table1_cont

 Variable       | Mean       | SD         | Median     | Min       | Max          | Missing (n) | Missing (%) | Count  | Proportion (%)
----------------|------------|------------|------------|-----------|--------------|-------------|-------------|--------|----------------
 age            | 38.64      | 13.71      | 37         | 17        | 90           | 0           | 0           | 48,842 | 100
 capital-gain   | 1,079.07   | 7,452.02   | 0          | 0.00      | 99,999.00    | 0           | 0           | 48,842 | 100
 capital-loss   | 87.5       | 403        | 0          | 0.00     | 4,356.00     | 0           | 0           | 48,842 | 100
 education-num  | 10.08      | 2.57       | 10         | 1         | 16           | 0           | 0           | 48,842 | 100
 fnlwgt         | 189,664.13 | 105,604.03 | 178,144.50 | 12,285.00 | 1,490,400.00 | 0           | 0           | 48,842 | 100
 hours-per-week | 40.42      | 12.39      | 40         | 1         | 99           | 0           | 0           | 48,842 | 100

table1_cat

 Variable                            | Type        | Mode               | Missing (n) | Missing (%) | Count  | Proportion (%)
-------------------------------------|-------------|--------------------|-------------|-------------|--------|----------------
 workclass = Private                 | Categorical | Private            | 963         | 1.97        | 33,906 | 69.42
 workclass = Self-emp-not-inc        | Categorical | Private            | 963         | 1.97        | 3,862  | 7.91
 workclass = Local-gov               | Categorical | Private            | 963         | 1.97        | 3,136  | 6.42
 education = HS-grad                 | Categorical | HS-grad            | 0           | 0           | 15,784 | 32.32
 education = Some-college            | Categorical | HS-grad            | 0           | 0           | 10,878 | 22.27
 education = Bachelors               | Categorical | HS-grad            | 0           | 0           | 8,025  | 16.43
 marital-status = Married-civ-spouse | Categorical | Married-civ-spouse | 0           | 0           | 22,379 | 45.82
 marital-status = Never-married      | Categorical | Married-civ-spouse | 0           | 0           | 16,117 | 33
 marital-status = Divorced           | Categorical | Married-civ-spouse | 0           | 0           | 6,633  | 13.58
 occupation = Prof-specialty         | Categorical | Prof-specialty     | 966         | 1.98        | 6,172  | 12.64
 occupation = Craft-repair           | Categorical | Prof-specialty     | 966         | 1.98        | 6,112  | 12.51
 occupation = Exec-managerial        | Categorical | Prof-specialty     | 966         | 1.98        | 6,086  | 12.46
 relationship = Husband              | Categorical | Husband            | 0           | 0           | 19,716 | 40.37
 relationship = Not-in-family        | Categorical | Husband            | 0           | 0           | 12,583 | 25.76
 relationship = Own-child            | Categorical | Husband            | 0           | 0           | 7,581  | 15.52
 race = White                        | Categorical | White              | 0           | 0           | 41,762 | 85.5
 race = Black                        | Categorical | White              | 0           | 0           | 4,685  | 9.59
 race = Asian-Pac-Islander           | Categorical | White              | 0           | 0           | 1,519  | 3.11
 sex = Male                          | Categorical | Male               | 0           | 0           | 32,650 | 66.85
 sex = Female                        | Categorical | Male               | 0           | 0           | 16,192 | 33.15
 native-country = United-States      | Categorical | United-States      | 274         | 0.56        | 43,832 | 89.74
 native-country = Mexico             | Categorical | United-States      | 274         | 0.56        | 951    | 1.95
 native-country = ?                  | Categorical | United-States      | 274         | 0.56        | 583    | 1.19
 income = <=50K                      | Categorical | <=50K              | 0           | 0           | 24,720 | 50.61
 income = <=50K.                     | Categorical | <=50K              | 0           | 0           | 12,435 | 25.46
 income = >50K                       | Categorical | <=50K              | 0           | 0           | 7,841  | 16.05

Example 3: Group Comparisons (P-Values)

In this example we first turn the continuous fnlwgt variable into two new categorical factors:

  1. Equal-width bins (fnlwgt_bin_width), so each interval spans the same numeric range.

  2. Quantile bins (fnlwgt_bin_quantile), so each bin holds roughly the same number of observations.

After creating those, we build a categorical summary table that computes p-values for the two income groups (<=50K vs. >50K), using the equal-width bins alongside the age_group variable.

import pandas as pd

from eda_toolkit import generate_table1

# Equal-width bins: each interval spans the same numeric range
df["fnlwgt_bin_width"] = pd.cut(
    df["fnlwgt"],
    bins=5,
    labels=[f"Bin {i}" for i in range(1, 6)],
    include_lowest=True,
)

# Quantile bins: each bin holds roughly the same number of observations
# (created here to match the description above; only the equal-width
# bins are used in the summary below)
df["fnlwgt_bin_quantile"] = pd.qcut(
    df["fnlwgt"],
    q=5,
    labels=[f"Q{i}" for i in range(1, 6)],
)

# Select the equal-width bins plus other categorical fields
df_table1_cat = df[["fnlwgt_bin_width", "age_group", "income"]]

# Generate Table 1 comparing income groups
p_value_table_1_cat = generate_table1(
    df=df_table1_cat,
    value_counts=True,
    include_types="categorical",
    export_markdown=True,
    groupby_col="income",
    drop_columns=[
        "Missing (n)",
        "Missing (%)",
        "income",   # drop the raw income column itself
        "Type",     # drop internal type metadata
        "Mode",     # drop mode metadata
    ],
    drop_variables="income",
)

print(p_value_table_1_cat)

Output

Using Chi-squared test for variable: fnlwgt_bin_width
Using Chi-squared test for variable: age_group
Using Chi-squared test for variable: income

 Variable                 | Count  | Proportion (%) | ≤50K (n = 37,155) | >50K (n = 11,687) | P-value
--------------------------|--------|----------------|-------------------|-------------------|---------
 fnlwgt_bin_width         | 48,842 | 100.00         | 37,155            | 11,687            | 0.66
 fnlwgt_bin_width = Bin 1 | 42,729 | 87.48          | 32,517 (87.52%)   | 10,212 (87.38%)   |
 fnlwgt_bin_width = Bin 2 | 5,898  | 12.08          | 4,466 (12.02%)    | 1,432 (12.25%)    |
 fnlwgt_bin_width = Bin 3 | 186    | 0.38           | 148 (0.40%)       | 38 (0.33%)        |
 fnlwgt_bin_width = Bin 4 | 22     | 0.05           | 18 (0.05%)        | 4 (0.03%)         |
 fnlwgt_bin_width = Bin 5 | 7      | 0.01           | 6 (0.02%)         | 1 (0.01%)         |
 age_group                | 48,842 | 100.00         | 37,155            | 11,687            | 0.00
 age_group = 18–29        | 13,920 | 28.50          | 13,174 (35.46%)   | 746 (6.38%)       |
 age_group = 30–39        | 12,929 | 26.47          | 9,468 (25.48%)    | 3,461 (29.61%)    |
 age_group = 40–49        | 10,724 | 21.96          | 6,738 (18.13%)    | 3,986 (34.11%)    |
 age_group = 50–59        | 6,619  | 13.55          | 4,110 (11.06%)    | 2,509 (21.47%)    |
 age_group = 60–69        | 3,054  | 6.25           | 2,245 (6.04%)     | 809 (6.92%)       |
 age_group = 70–79        | 815    | 1.67           | 668 (1.80%)       | 147 (1.26%)       |
 age_group = < 18         | 595    | 1.22           | 595 (1.60%)       | 0 (0.00%)         |
 age_group = 80–89        | 131    | 0.27           | 115 (0.31%)       | 16 (0.14%)        |
 age_group = 90–99        | 55     | 0.11           | 42 (0.11%)        | 13 (0.11%)        |
 age_group = 100 +        | 0      | 0.00           | 0 (0.00%)         | 0 (0.00%)         |

Note

  • groupby_col="income" enables group comparison across the two income categories.

  • age_group and fnlwgt_bin_width are summarized as categorical variables with frequency counts and group-wise breakdowns.

  • value_counts=True includes one row per unique value in each categorical variable.

  • drop_columns and drop_variables help simplify the final output for presentation by removing metadata and redundant columns.

  • export_markdown=True writes the summary table to a Markdown file for convenient reporting and sharing.

This example illustrates how the same function can be used to perform basic univariate statistical testing across groups while producing a ready-to-export Markdown summary.

Highlighting Specific Columns in a DataFrame

This section explains how to highlight specific columns in a DataFrame using the highlight_columns function.

Highlight specific columns in a DataFrame with a specified background color.

highlight_columns(df, columns, color='yellow')
Parameters:
  • df (pandas.DataFrame) – The DataFrame to be styled.

  • columns (list of str) – List of column names to be highlighted.

  • color (str, optional) – The background color to be applied for highlighting (default is "yellow").

Returns:

A Styler object with the specified columns highlighted.

Return type:

pandas.io.formats.style.Styler

Implementation Example

Below, we use the highlight_columns function to highlight the age and education columns in the first 5 rows of the census [1] DataFrame with a pink background color.

from eda_toolkit import highlight_columns

highlighted_df = highlight_columns(
    df=df.head(),
    columns=["age", "education"],
    color="#F8C5C8",
)

highlighted_df

Output

The output will be a DataFrame with the specified columns highlighted in the given background color. The age and education columns will be highlighted in pink.

The resulting styled DataFrame can be displayed in a Jupyter Notebook or saved to an HTML file using the Styler object's .to_html() method (older pandas versions used .render()).

 census_id | age | workclass        | fnlwgt | education | education-num | marital-status     | occupation        | relationship
-----------|-----|------------------|--------|-----------|---------------|--------------------|-------------------|---------------
 582248222 | 39  | State-gov        | 77516  | Bachelors | 13            | Never-married      | Adm-clerical      | Not-in-family
 561810758 | 50  | Self-emp-not-inc | 83311  | Bachelors | 13            | Married-civ-spouse | Exec-managerial   | Husband
 598098459 | 38  | Private          | 215646 | HS-grad   | 9             | Divorced           | Handlers-cleaners | Not-in-family
 776705221 | 53  | Private          | 234721 | 11th      | 7             | Married-civ-spouse | Handlers-cleaners | Husband
 479262902 | 28  | Private          | 338409 | Bachelors | 13            | Married-civ-spouse | Prof-specialty    | Wife
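
To save the styled output, something like the following can be used; to_html is the standard pandas Styler export method:

# Write the styled DataFrame to a standalone HTML file
with open("highlighted_columns.html", "w") as f:
    f.write(highlighted_df.to_html())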

Binning Numerical Columns

Binning numerical columns is a technique used to convert continuous numerical data into discrete categories or “bins.” This is especially useful for simplifying analysis, creating categorical features from numerical data, or visualizing the distribution of data within specific ranges. The process of binning involves dividing a continuous range of values into a series of intervals, or “bins,” and then assigning each value to one of these intervals.

Note

The code snippets below create age bins and assign a corresponding age group label to each age in the DataFrame. The pd.cut function from pandas is used to categorize the ages and assign them to a new column, age_group. Adjust the bins and labels as needed for your specific data.

Below, we use the age column of the census data [1] from the UCI Machine Learning Repository as an example:

  1. Bins Definition: The bins are defined by specifying the boundaries of each interval. For example, in the code snippet below, the bin_ages list specifies the boundaries for age groups:

    bin_ages = [
        0,
        18,
        30,
        40,
        50,
        60,
        70,
        80,
        90,
        100,
        float("inf"),
    ]
    

    Each pair of consecutive elements in bin_ages defines a bin. For example:

    • The first bin is [0, 18),

    • The second bin is [18, 30),

    • and so on.

  2. Labels for Bins: The label_ages list provides labels corresponding to each bin:

    label_ages = [
        "< 18",
        "18-29",
        "30-39",
        "40-49",
        "50-59",
        "60-69",
        "70-79",
        "80-89",
        "90-99",
        "100 +",
    ]
    

    These labels are used to categorize the numerical values into meaningful groups.

  3. Applying the Binning: The pd.cut function from Pandas is used to apply the binning process. For each value in the age column of the DataFrame, it assigns a corresponding label based on which bin the value falls into. Here, right=False indicates that each bin includes the left endpoint but excludes the right endpoint. For example, if bin_ages = [0, 10, 20, 30], then a value of 10 will fall into the bin [10, 20) and be labeled accordingly.

    df["age_group"] = pd.cut(
        df["age"],
        bins=bin_ages,
        labels=label_ages,
        right=False,
    )
    

    Mathematically, for a given value x in the age column:

    \[\begin{split}\text{age\_group} = \begin{cases} \text{< 18} & \text{if } 0 \leq x < 18 \\ \text{18-29} & \text{if } 18 \leq x < 30 \\ \vdots \\ \text{100 +} & \text{if } x \geq 100 \end{cases}\end{split}\]

    The parameter right=False in pd.cut makes each bin left-inclusive and right-exclusive. Because the final bin's upper bound is infinity (float("inf")), that bin captures every value of 100 and above.
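
    As a quick sanity check of the right=False behavior, the snippet below reuses the bin_ages and label_ages lists defined above; each boundary value falls into the bin on its right:

    import pandas as pd

    ages = pd.Series([17, 18, 29, 30, 100])
    print(pd.cut(ages, bins=bin_ages, labels=label_ages, right=False).tolist())
    # ['< 18', '18-29', '18-29', '30-39', '100 +']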

Group-by Imputer

Impute missing values using group-level statistics with optional fallback logic.

The groupby_imputer function provides a simple, transparent, and statistically grounded approach for imputing missing values in a DataFrame using group-specific summary statistics. Instead of applying a single global imputation value, this function computes statistics within meaningful subgroups, preserving structure and heterogeneity in the data.

This approach is especially useful when missingness depends on categorical attributes such as demographic groups, job categories, or experimental conditions.

Key Features

  • Group-aware imputation using one or more categorical variables

  • Supports common summary statistics: mean, median, min, max

  • Fallback logic for sparse or empty groups

  • Option to preserve the original column or overwrite in place

  • Fully vectorized and interpretable, no black-box modeling

groupby_imputer(df, impute_col, by, stat='mean', fallback='global', as_new_col=True, new_col_name=None)
Parameters:
  • df (pandas.DataFrame) – Input DataFrame containing missing values to impute.

  • impute_col (str) – Name of the column containing missing values.

  • by (str or list of str) – One or more categorical columns used to define groups. Group-level statistics are computed within each unique combination of these columns.

  • stat (str, optional) – Summary statistic used for imputation. Supported values are "mean", "median", "min", and "max".

  • fallback (str or int or float, optional) – Value used when a group contains no non-null observations for impute_col. "global" uses the overall statistic computed across the entire column; a numeric value (int or float) is used as a fixed fallback.

  • as_new_col (bool, optional) – If True, a new imputed column is created and the original column is left unchanged. If False, imputation is performed in place.

  • new_col_name (str, optional) – Optional custom name for the imputed column. Ignored if as_new_col=False.

Returns:

A DataFrame with missing values imputed according to group-level statistics and fallback logic.

Return type:

pandas.DataFrame

Raises:

ValueError

  • If stat is not one of "mean", "median", "min", or "max".

Notes

  • Group statistics are computed using non-null values only.

  • If a group contains only missing values, the fallback value is applied.

  • Using as_new_col=True is recommended for exploratory workflows to preserve the original data for comparison.

  • This method is deterministic and fully interpretable, making it suitable for regulated or audited modeling environments.
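
The core mechanics follow the classic pandas groupby-transform fill. Below is a simplified sketch of the behavior described in these notes (the actual function adds input validation and the configurable fallback options):

# Group-wise mean imputation with a global-mean fallback, in plain pandas
group_means = df.groupby(["workclass", "education"])["age"].transform("mean")
df["age_imputed_sketch"] = (
    df["age"]
    .fillna(group_means)       # fill from the group-level statistic
    .fillna(df["age"].mean())  # global fallback for all-missing groups
)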

Group-wise Imputation Examples

In the example below, missing values are introduced artificially into the "age" column of the Adult Income dataset. The goal is to impute these missing values using group-level means, defined by combinations of "workclass" and "education".

Two fallback strategies are demonstrated:

  1. Global fallback, where groups with insufficient data fall back to the overall mean

  2. Fixed-value fallback, where a constant value is used instead

from eda_toolkit import groupby_imputer
import numpy as np

# Introduce missingness for demonstration
df.loc[
    df.sample(frac=0.3, random_state=42).index, "age"
] = np.nan

# 1. Group-wise mean imputation with global fallback
X_global = groupby_imputer(
    df=df,
    impute_col="age",
    by=["workclass", "education"],
    stat="mean",
    fallback="global",
    as_new_col=True,
)

print("\n### Head with global fallback ###")
print(X_global[["age", "age_mean_imputed"]].head())

print("Means:")
print(f"Original df['age'].mean(): {df['age'].mean()}")
print(f"Imputed (global fallback) mean: {X_global['age_mean_imputed'].mean()}")

Output

### Head with global fallback ###
            age  age_mean_imputed
census_id
582248222   NaN         38.495238
561810758   NaN         45.110837
598098459  38.0         38.000000
776705221  53.0         53.000000
479262902   NaN         37.249330
Means:
Original df['age'].mean(): 38.597911608997045
Imputed (global fallback) mean: 38.621605730469135
The second strategy replaces the global fallback with a fixed value of 50:

from eda_toolkit import groupby_imputer
import numpy as np

# Introduce missingness for demonstration
df.loc[
    df.sample(frac=0.3, random_state=42).index, "age"
] = np.nan

# 2. Group-wise mean imputation with fixed fallback
X_fixed = groupby_imputer(
    df=df,
    impute_col="age",
    by=["workclass", "education"],
    stat="mean",
    fallback=50,
    as_new_col=True,
)

print("\n### Head with fixed fallback=50 ###")
print(X_fixed[["age", "age_mean_imputed"]].head())
print("Means:")
print(f"Original df['age'].mean(): {df['age'].mean()}")
print(f"Imputed (fixed fallback=50) mean: {X_fixed['age_mean_imputed'].mean()}")

print("\n### Comparison summary ###")
print(f"Missingness in original age: {df['age'].isna().mean()}")
print(f"Global fallback mean: {X_global['age_mean_imputed'].mean()}")
print(f"Fixed fallback=50 mean: {X_fixed['age_mean_imputed'].mean()}")

Output

### Head with fixed fallback=50 ###
    age  age_mean_imputed
0   NaN         38.495238
1   NaN         45.110837
2  38.0         38.000000
3  53.0         53.000000
4   NaN         37.249330

Means:
Original df['age'].mean(): 38.597911608997045
Imputed (fixed fallback=50) mean: 38.6904730183637

### Comparison summary ###
Missingness in original age: 0.30000818967282256
Global fallback mean: 38.621605730469135
Fixed fallback=50 mean: 38.6904730183637

Interpretation

  • Group-level imputation preserves differences across workclass and education categories that would be lost under global imputation.

  • The global fallback keeps the imputed mean close to the original distribution (38.62 vs. 38.60 in this example).

  • Fixed fallback values may be useful for domain-driven constraints but can introduce bias if not chosen carefully.

Group-aware imputation provides a strong baseline that often outperforms naive global strategies while remaining transparent and easy to audit.
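
For intuition, the group-wise logic can be approximated in a few lines of pandas. The sketch below is illustrative only, not the eda_toolkit implementation, and the helper name group_mean_impute is hypothetical.

import pandas as pd

def group_mean_impute(df, impute_col, by, fallback="global"):
    # Illustrative sketch of group-wise mean imputation; not the
    # eda_toolkit implementation.
    out = df.copy()
    # Per-group means are computed from non-null values only
    group_means = out.groupby(by)[impute_col].transform("mean")
    # Groups that are entirely missing fall back to the overall mean
    # ("global") or to a fixed numeric value
    fill = out[impute_col].mean() if fallback == "global" else fallback
    out[f"{impute_col}_mean_imputed"] = (
        out[impute_col].fillna(group_means).fillna(fill)
    )
    return out

The same pattern extends to "median", "min", or "max" by changing the transform statistic.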

Delete Inactive DataFrames

Safely clean up unused pandas DataFrames from a namespace to reduce memory usage.

The del_inactive_dataframes utility is designed to help manage memory in interactive environments such as Jupyter notebooks, scripts, and cloud runtimes. It inspects a namespace (for example, globals() or a custom dictionary), identifies pandas DataFrames, and optionally deletes those that are no longer needed.

Unlike manual del calls, this function provides visibility, safety, and auditability:

  • Preview what would be deleted before committing

  • Preserve explicitly named DataFrames

  • Handle Jupyter output-cache variables that can silently retain memory

  • Optionally track memory usage before and after cleanup

  • Produce readable summaries using Rich tables when available

This makes it particularly useful during exploratory analysis, large ETL pipelines, and long-running notebooks.

Key Capabilities

  • Selective deletion based on an explicit keep list

  • Dry-run mode to preview deletions without modifying state

  • Optional inclusion of IPython cache variables (e.g. _14, _27)

  • Optional garbage collection after deletion

  • Optional memory tracking at the DataFrame and process level

  • Works with globals(), locals(), or any user-managed namespace dictionary

del_inactive_dataframes(dfs_to_keep, del_dataframes=False, namespace=None, include_ipython_cache=False, dry_run=False, run_gc=True, track_memory=False, memory_mode='dataframes', verbose=True)
Parameters:
  • dfs_to_keep (str or iterable of str) – Name or names of DataFrame variables to preserve. All other DataFrames may be deleted when del_dataframes=True.

  • del_dataframes (bool, optional) – Whether to actually delete inactive DataFrames. If False, the function only reports what would be deleted.

  • namespace (dict, optional) – Namespace dictionary to inspect and optionally modify. If None, the function uses globals() of the calling scope.

  • include_ipython_cache (bool, optional) – Whether to include IPython output-cache variables (e.g. _14) when searching for DataFrames.

  • dry_run (bool, optional) – If True, show planned deletions without deleting anything.

  • run_gc (bool, optional) – Whether to call gc.collect() after deletions occur.

  • track_memory (bool, optional) – Whether to record memory usage before and after cleanup.

  • memory_mode (str, optional) – Controls which memory metrics are reported: "dataframes" reports total pandas DataFrame memory only, while "all" also reports process RSS (requires psutil).

  • verbose (bool, optional) – Whether to print formatted output. If False, the function returns results silently.

Returns:

Summary dictionary describing detected, deleted, and remaining DataFrames, along with optional memory metrics.

Return type:

dict

Raises:

ValueError

  • If memory_mode is not one of "dataframes" or "all".

Notes

  • A namespace is simply a dictionary mapping variable names to objects.

  • Deletion is performed by removing keys from the namespace dictionary.

  • Modifying locals() inside a function is not guaranteed to free memory. For reliable cleanup inside functions, store DataFrames in a dict and pass that dict as namespace.

  • Process RSS memory may not decrease immediately even after garbage collection.
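
Following the note on locals(), a reliable pattern for cleanup inside a function is to hold intermediate DataFrames in an explicit dictionary and pass that dictionary as namespace. The sketch below illustrates the idea; the function and variable names are hypothetical.

import pandas as pd

from eda_toolkit import del_inactive_dataframes

def build_features():
    # Keep intermediates in an explicit dict instead of relying on locals()
    frames = {
        "raw": pd.DataFrame({"a": range(1000)}),
        "staged": pd.DataFrame({"b": range(1000)}),
        "final": pd.DataFrame({"c": range(1000)}),
    }
    # Deleting keys from this dict reliably frees the references,
    # unlike modifying locals() directly
    del_inactive_dataframes(
        dfs_to_keep=["final"],
        del_dataframes=True,
        namespace=frames,
        verbose=False,
    )
    return frames["final"]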

Usage Examples

The following examples demonstrate common and advanced usage patterns for del_inactive_dataframes. Each example recreates its own DataFrames so behavior is isolated and easy to follow.

Because del_inactive_dataframes inspects a namespace dictionary, these examples explicitly pass namespace=globals() so the function can see variables defined in this script or notebook.

Example 1: List Active DataFrames (No Deletion)

Inspect which DataFrames are currently active without deleting anything.

from eda_toolkit import del_inactive_dataframes
import pandas as pd

df_main = pd.DataFrame({"a": range(10)})
df_tmp = pd.DataFrame({"b": range(100)})

del_inactive_dataframes(
    dfs_to_keep=["df"],
    del_dataframes=False,
    namespace=globals(),
)

Output

──────────────────────────────────────── Active DataFrames ────────────────────────────────────────
Current Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ _              │
│ __             │
│ ___            │
│ df             │
│ df_copy        │
│ df_main        │
│ df_tmp         │
│ sampled_df     │
└────────────────┘
{'active': ['_',
'__',
'___',
'df',
'df_copy',
'df_main',
'df_tmp',
'sampled_df'],
'to_delete': [],
'deleted': [],
'remaining': [],
'used_rich': True,
'memory': None}

Example 2: Delete Everything Except a Single DataFrame

Remove all DataFrames except the one explicitly listed.

from eda_toolkit import del_inactive_dataframes
import pandas as pd

df_main = pd.DataFrame({"a": range(10)})
df_tmp = pd.DataFrame({"b": range(100)})

del_inactive_dataframes(
    dfs_to_keep=["df"],
    del_dataframes=True,
    namespace=globals(),
)

Output

──────────────────────────────────────── Active DataFrames ────────────────────────────────────────
Current Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ X              │
│ X_fixed        │
│ X_global       │
│ data           │
│ df             │
│ df_impute      │
│ df_main        │
│ df_strip       │
│ df_table1_cat  │
│ df_tmp         │
│ filtered_df    │
│ y              │
└────────────────┘
──────────────────────────────────────── Planned Deletions ────────────────────────────────────────
DataFrames Marked
for Deletion
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ X              │
│ X_fixed        │
│ X_global       │
│ data           │
│ df_impute      │
│ df_main        │
│ df_strip       │
│ df_table1_cat  │
│ df_tmp         │
│ filtered_df    │
│ y              │
└────────────────┘
──────────────────────────────────────── Deleted DataFrames ────────────────────────────────────────
Deleted DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ X              │
│ X_fixed        │
│ X_global       │
│ data           │
│ df_impute      │
│ df_main        │
│ df_strip       │
│ df_table1_cat  │
│ df_tmp         │
│ filtered_df    │
│ y              │
└────────────────┘
─────────────────────────────────────── Remaining DataFrames ───────────────────────────────────────
Remaining Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df             │
└────────────────┘
{'active': ['X',
'X_fixed',
'X_global',
'data',
'df',
'df_impute',
'df_main',
'df_strip',
'df_table1_cat',
'df_tmp',
'filtered_df',
'y'],
'to_delete': ['X',
'X_fixed',
'X_global',
'data',
'df_impute',
'df_main',
'df_strip',
'df_table1_cat',
'df_tmp',
'filtered_df',
'y'],
'deleted': ['X',
'X_fixed',
'X_global',
'data',
'df_impute',
'df_main',
'df_strip',
'df_table1_cat',
'df_tmp',
'filtered_df',
'y'],
'remaining': ['df'],
'used_rich': True,
'memory': None}

Example 3: Dry Run (Preview Deletions)

Preview which DataFrames would be deleted without actually removing them.

from eda_toolkit import del_inactive_dataframes
import pandas as pd

df_main = pd.DataFrame({"a": range(10)})
df_tmp = pd.DataFrame({"b": range(100)})

del_inactive_dataframes(
    dfs_to_keep=["df"],
    del_dataframes=True,
    dry_run=True,
    namespace=globals(),
)

Output

──────────────────────────────────────── Active DataFrames ────────────────────────────────────────
Current Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ __             │
│ ___            │
│ df             │
│ df_copy        │
│ df_main        │
│ df_tmp         │
│ sampled_df     │
└────────────────┘
──────────────────────────────────────── Planned Deletions ────────────────────────────────────────
DataFrames Marked
for Deletion
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ __             │
│ ___            │
│ df_copy        │
│ df_main        │
│ df_tmp         │
│ sampled_df     │
└────────────────┘
Dry run enabled. No DataFrames were deleted.
─────────────────────────────────────── Remaining DataFrames ───────────────────────────────────────
Remaining Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ __             │
│ ___            │
│ df             │
│ df_copy        │
│ df_main        │
│ df_tmp         │
│ sampled_df     │
└────────────────┘
{'active': ['__', '___', 'df', 'df_copy', 'df_main', 'df_tmp', 'sampled_df'],
'to_delete': ['__', '___', 'df_copy', 'df_main', 'df_tmp', 'sampled_df'],
'deleted': [],
'remaining': ['__',
'___',
'df',
'df_copy',
'df_main',
'df_tmp',
'sampled_df'],
'used_rich': True,
'memory': None}

This mode is recommended before running destructive cleanup steps.
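
A dry run also composes with programmatic use: the returned to_delete list can be reviewed or logged before a second, destructive call. A brief sketch:

plan = del_inactive_dataframes(
    dfs_to_keep=["df"],
    del_dataframes=True,
    dry_run=True,
    verbose=False,
    namespace=globals(),
)

# Review the plan, then commit only if it matches expectations
if set(plan["to_delete"]) <= {"df_main", "df_tmp"}:
    del_inactive_dataframes(
        dfs_to_keep=["df"],
        del_dataframes=True,
        namespace=globals(),
    )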

Example 4: Include IPython Output Cache Variables

In Jupyter notebooks, output cache variables such as _14 may reference large DataFrames. These are ignored by default but can be included explicitly.

from eda_toolkit import del_inactive_dataframes
import pandas as pd

df_main = pd.DataFrame({"a": range(10)})
df_tmp = pd.DataFrame({"b": range(100)})

_ = df_tmp  # simulates an IPython output cache reference

del_inactive_dataframes(
    dfs_to_keep=["df"],
    del_dataframes=True,
    include_ipython_cache=True,
    namespace=globals(),
)

Output

──────────────────────────────────────── Active DataFrames ────────────────────────────────────────
Current Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ _              │
│ _10            │
│ _19            │
│ _8             │
│ df             │
│ df_main        │
│ df_tmp         │
└────────────────┘
──────────────────────────────────────── Planned Deletions ────────────────────────────────────────
DataFrames Marked
for Deletion
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ _              │
│ _10            │
│ _19            │
│ _8             │
│ df_main        │
│ df_tmp         │
└────────────────┘
──────────────────────────────────────── Deleted DataFrames ────────────────────────────────────────
Deleted DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ _              │
│ _10            │
│ _19            │
│ _8             │
│ df_main        │
│ df_tmp         │
└────────────────┘
─────────────────────────────────────── Remaining DataFrames ───────────────────────────────────────
Remaining Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df             │
└────────────────┘
{'active': ['_', '_10', '_19', '_8', 'df', 'df_main', 'df_tmp'],
'to_delete': ['_', '_10', '_19', '_8', 'df_main', 'df_tmp'],
'deleted': ['_', '_10', '_19', '_8', 'df_main', 'df_tmp'],
'remaining': ['df'],
'used_rich': True,
'memory': None}

Example 5: Track DataFrame Memory Usage

Report total DataFrame memory usage before and after cleanup.

from eda_toolkit import del_inactive_dataframes
import pandas as pd

df_main = pd.DataFrame({"a": range(10)})
df_tmp = pd.DataFrame({"b": range(100)})

del_inactive_dataframes(
    dfs_to_keep=["df"],
    del_dataframes=True,
    track_memory=True,
    namespace=globals(),
)

Output

──────────────────────────────────────── Active DataFrames ────────────────────────────────────────
Current Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df             │
│ df_main        │
│ df_tmp         │
└────────────────┘
──────────────────────────────────────── Planned Deletions ────────────────────────────────────────
DataFrames Marked
for Deletion
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df_main        │
│ df_tmp         │
└────────────────┘
──────────────────────────────────────── Deleted DataFrames ────────────────────────────────────────
Deleted DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df_main        │
│ df_tmp         │
└────────────────┘
─────────────────────────────────────── Remaining DataFrames ───────────────────────────────────────
Remaining Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df             │
└────────────────┘
────────────────────────────────────────────── Memory ──────────────────────────────────────────────
            Memory Usage (MB)
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Before ┃ After ┃ Delta ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ DataFrames total │   30.2 │  30.2 │  -0.0 │
└──────────────────┴────────┴───────┴───────┘
{'active': ['df', 'df_main', 'df_tmp'],
'to_delete': ['df_main', 'df_tmp'],
'deleted': ['df_main', 'df_tmp'],
'remaining': ['df'],
'used_rich': True,
'memory': {'dataframes_mb': {'before': 30.237445831298828,
'after': 30.23635482788086,
'delta': -0.00109100341796875},
'mode': 'dataframes'}}

This mode uses pandas.DataFrame.memory_usage(deep=True) for reporting.
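
For reference, the same per-DataFrame figure can be computed directly with pandas:

# Deep memory usage of a single DataFrame, in megabytes
mb = df.memory_usage(deep=True).sum() / 1024 ** 2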

Example 6: Track DataFrame Memory and Process RSS

Enable power-user memory tracking, including process-level RSS when available.

from eda_toolkit import del_inactive_dataframes
import pandas as pd

df_main = pd.DataFrame({"a": range(10)})
df_tmp = pd.DataFrame({"b": range(100)})

del_inactive_dataframes(
    dfs_to_keep=["df"],
    del_dataframes=True,
    track_memory=True,
    memory_mode="all",
    namespace=globals(),
)

Output

──────────────────────────────────────── Active DataFrames ────────────────────────────────────────
Current Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df             │
│ df_main        │
│ df_tmp         │
└────────────────┘
──────────────────────────────────────── Planned Deletions ────────────────────────────────────────
DataFrames Marked
for Deletion
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df_main        │
│ df_tmp         │
└────────────────┘
──────────────────────────────────────── Deleted DataFrames ────────────────────────────────────────
Deleted DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df_main        │
│ df_tmp         │
└────────────────┘
─────────────────────────────────────── Remaining DataFrames ───────────────────────────────────────
Remaining Active
    DataFrames
┏━━━━━━━━━━━━━━━━┓
┃ DataFrame Name ┃
┡━━━━━━━━━━━━━━━━┩
│ df             │
└────────────────┘
────────────────────────────────────────────── Memory ──────────────────────────────────────────────
            Memory Usage (MB)
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Before ┃ After ┃ Delta ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Process RSS      │  403.6 │ 403.6 │  +0.0 │
│ DataFrames total │   30.2 │  30.2 │  -0.0 │
└──────────────────┴────────┴───────┴───────┘
{'active': ['df', 'df_main', 'df_tmp'],
'to_delete': ['df_main', 'df_tmp'],
'deleted': ['df_main', 'df_tmp'],
'remaining': ['df'],
'used_rich': True,
'memory': {'dataframes_mb': {'before': 30.237445831298828,
'after': 30.23635482788086,
'delta': -0.00109100341796875},
'mode': 'all',
'process_mb': {'before': 403.5546875, 'after': 403.5546875, 'delta': 0.0}}}

Process RSS reporting requires psutil, and the reported RSS may not decrease immediately after garbage collection.

Example 7: Programmatic Usage (No Console Output)

Suppress console output and consume the returned summary dictionary directly.

from eda_toolkit import del_inactive_dataframes
import pandas as pd

df_main = pd.DataFrame({"a": range(10)})
df_tmp = pd.DataFrame({"b": range(100)})

summary = del_inactive_dataframes(
    dfs_to_keep=["df_main"],
    del_dataframes=True,
    verbose=False,
    track_memory=True,
    namespace=globals(),
)

print("Returned summary dict:")
print(summary)

Output

Returned summary dict:
{'active': ['df_main', 'df_tmp'], 'to_delete': ['df_tmp'], 'deleted': ['df_tmp'],
'remaining': ['df_main'], 'used_rich': False, 'memory': {'dataframes_mb':
{'before': 0.00109100341796875, 'after': 0.000202178955078125,
'delta': -0.000888824462890625}, 'mode': 'dataframes'}}

This pattern is useful in pipelines, automated scripts, and testing workflows.
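
As a sketch of that pattern, a pipeline step can consume the documented return keys to verify that cleanup actually occurred; the assertion below assumes the DataFrames defined in the previous example.

# Fail fast if an expected temporary survived cleanup
assert "df_tmp" in summary["deleted"], "expected df_tmp to be released"

# Report the approximate DataFrame memory freed (delta is negative on release)
freed_mb = -summary["memory"]["dataframes_mb"]["delta"]
print(f"Freed approximately {freed_mb:.4f} MB of DataFrame memory")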

When to Use This Utility

  • Long-lived Jupyter notebooks with many intermediate DataFrames

  • Cloud notebooks with limited RAM

  • ETL pipelines with staged intermediate results

  • Teaching environments where reproducibility and clarity matter

  • Any workflow where memory pressure needs to be controlled explicitly

This function is intentionally conservative, explicit, and observable, favoring safe cleanup over implicit or destructive behavior.