Adult Income Dataset

This example shows what a train.py script could look like for training and concisely evaluating classification models on the Adult Census Income dataset [1] with the model_tuner library [2]. The dataset poses a classic binary classification problem: predicting whether an individual's income exceeds $50K per year from a variety of demographic and employment-related features. The model_tuner framework streamlines model development by handling preprocessing, cross-validation, hyperparameter tuning, and performance evaluation, which enables rapid experimentation with minimal boilerplate code.

Model Configuration & Hyperparameters (`model_params.py`)

This script defines the model configurations and hyperparameter grids for three classifiers—Logistic Regression, Decision Tree, and Random Forest—used in our evaluation of the Adult Income dataset [1].

Each model is structured into a standardized dictionary format compatible with the model_tuner pipeline, containing the estimator, its name, a hyperparameter grid, and metadata flags for tuning strategies. To ensure consistent and reproducible results, a fixed random state (rstate = 222) is used across models.

The Logistic Regression model is tuned over five regularization strengths (C, log-spaced from 1e-4 to 1) with an L2 penalty. It is configured with class balancing (class_weight="balanced") and parallel processing (n_jobs=2). The Decision Tree model explores three tree depths along with two minimum sample thresholds each for splits and leaves, trading flexibility against regularization, and also has class balancing enabled. Lastly, the Random Forest model is kept small for efficiency: the grid covers 10 or 50 estimators and either unbounded trees or a depth cap of 10 to help limit overfitting. Like Logistic Regression, it uses parallel processing (here n_jobs=-1), but it does not set class_weight.
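For reference, scikit-learn documents class_weight="balanced" as weighting each class by n_samples / (n_classes * class_count). The sketch below applies that formula to the class counts reported in the run output of this example (37,155 and 11,687). The helper name balanced_class_weights is hypothetical and only illustrates the arithmetic.

```python
def balanced_class_weights(class_counts):
    """Return {label: weight} using n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Class counts taken from the "income" breakdown printed by train.py.
weights = balanced_class_weights({"<=50K": 37155, ">50K": 11687})
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

The minority class (>50K) receives roughly three times the weight of the majority class, which is what lets the balanced estimators compensate for the skew in the target.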

All three model definitions are collected in a dictionary (model_definitions) that can be easily passed into the model_tuner workflow for training, evaluation, and comparison.
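As a quick sanity check on the cost of each search, the grids can be expanded by hand. This is a small sketch, not part of model_tuner: the grids are copied from model_params.py (with the np.logspace values written out), and expand_grid is a hypothetical helper.

```python
from itertools import product

# Grids copied from model_params.py; lr__C is np.logspace(-4, 0, 5) written out.
grids = {
    "lr": {"lr__penalty": ["l2"], "lr__C": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]},
    "dt": {
        "dt__max_depth": [None, 10, 20],
        "dt__min_samples_split": [2, 10],
        "dt__min_samples_leaf": [1, 5],
    },
    "rf": {
        "rf__n_estimators": [10, 50],
        "rf__max_depth": [None, 10],
        "rf__min_samples_split": [2, 5],
    },
}


def expand_grid(grid):
    """Return every parameter combination in a grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


grid_sizes = {name: len(expand_grid(grid)) for name, grid in grids.items()}
print(grid_sizes)
```

The resulting counts (5, 12, and 8 candidates) line up with the totals shown by the grid-search progress bars in the output sections of this example.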


import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

################################################################################
############################ Global Constants ##################################
################################################################################

rstate = 222  # random state for reproducibility

################################################################################
########################## Logistic Regression #################################
################################################################################

# Define the hyperparameters for Logistic Regression
lr_name = "lr"

lr_penalties = ["l2"]
lr_Cs = np.logspace(-4, 0, 5)
# lr_max_iter = [100, 500]

# Structure the parameters similarly to the RF template
tuned_parameters_lr = [
    {
        "lr__penalty": lr_penalties,
        "lr__C": lr_Cs,
    }
]

lr = LogisticRegression(
    class_weight="balanced",
    random_state=rstate,
    n_jobs=2,
)

lr_definition = {
    "clc": lr,
    "estimator_name": lr_name,
    "tuned_parameters": tuned_parameters_lr,
    "randomized_grid": False,
    "early": False,
}

################################################################################
############################### Decision Trees #################################
################################################################################

# Define Decision Tree parameters
dt_name = "dt"

# Simplified hyperparameters
dt_max_depth = [None, 10, 20]  # Unbounded, shallow, and medium depths
dt_min_samples_split = [2, 10]  # Default and a stricter option
dt_min_samples_leaf = [1, 5]  # Default and larger leaf nodes for regularization

# Define the parameter grid for Decision Trees
tuned_parameters_dt = [
    {
        "dt__max_depth": dt_max_depth,
        "dt__min_samples_split": dt_min_samples_split,
        "dt__min_samples_leaf": dt_min_samples_leaf,
    }
]

# Define the Decision Tree model
dt = DecisionTreeClassifier(
    class_weight="balanced",
    random_state=rstate,
)

# Define the Decision Tree model configuration
dt_definition = {
    "clc": dt,
    "estimator_name": dt_name,
    "tuned_parameters": tuned_parameters_dt,
    "randomized_grid": False,
    "early": False,
}

################################################################################
##############################  Random Forest  #################################
################################################################################


# Define the hyperparameters for Random Forest (trimmed for efficiency)
rf_name = "rf"

# Reduced hyperparameters for tuning
rf_parameters = [
    {
        "rf__n_estimators": [10, 50],  # Reduce number of trees for speed
        "rf__max_depth": [None, 10],  # Limit depth to prevent overfitting
        "rf__min_samples_split": [2, 5],  # Fewer options for splitting
    }
]

# Initialize the Random Forest Classifier with a smaller number of trees
rf = RandomForestClassifier(
    n_estimators=10,
    max_depth=None,
    random_state=rstate,  # shared seed, consistent with the other models
    n_jobs=-1,
)

# Define the Random Forest model setup
rf_definition = {
    "clc": rf,
    "estimator_name": rf_name,
    "tuned_parameters": rf_parameters,
    "randomized_grid": False,
    "early": False,
}

################################################################################

model_definitions = {
    lr_name: lr_definition,
    dt_name: dt_definition,
    rf_name: rf_definition,
}

Model Training (`train.py`)

This script defines the end-to-end workflow for training and evaluating classification models on the Adult Income dataset using the model_tuner library [2]. The process is structured as a command-line interface (CLI) application using typer, with steps that include dataset fetching, preprocessing, model tuning, training, calibration, evaluation, and serialization.

After importing the necessary libraries and establishing paths, the script fetches the Adult dataset [1] directly from the UCI Machine Learning Repository [3] via ucimlrepo. The target column is cleaned and encoded into binary format, while only numeric features are retained for modeling. Additionally, the subgroup columns (race and sex) are extracted for stratified sampling.
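The cleaning steps can be tried on a toy frame before running the full script. The sketch below assumes pandas is installed and uses made-up rows. It mirrors the three operations from train.py: stripping the trailing period from the label, mapping the label to 0/1, and keeping numeric columns only.

```python
import pandas as pd

# Toy stand-ins for the fetched frames; some labels carry a trailing period.
X = pd.DataFrame({"age": [39, 50, 38, 53], "sex": ["Male", "Male", "Female", "Male"]})
y = pd.DataFrame({"income": ["<=50K", ">50K.", "<=50K.", ">50K"]})

# Strip the trailing period so only two distinct labels remain.
cleaned = y["income"].str.rstrip(".")

# Encode the two labels as 0/1.
encoded = cleaned.map({"<=50K": 0, ">50K": 1})

# Keep numeric feature columns only, as train.py does.
X_numeric = X.select_dtypes(include="number")

print(cleaned.value_counts().to_dict())
print(encoded.tolist())
print(list(X_numeric.columns))
```

Without the rstrip step, labels such as ">50K." would not match any key in the mapping and would be encoded as NaN.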

The script retrieves the model configuration based on the model_type argument (e.g., "lr", "dt", or "rf"), then builds a preprocessing pipeline consisting of a standard scaler and simple imputer. This pipeline is passed into the Model class from model_tuner, along with tuning parameters, model metadata, and training settings—including stratification, calibration, and scoring criteria.
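To see what the two preprocessing steps do in that order, here is a minimal sketch using a plain scikit-learn Pipeline as a stand-in for the pipeline that model_tuner assembles internally (an assumption made for illustration; the toy matrix is made up). StandardScaler ignores missing values when fitting and passes them through, and SimpleImputer then fills them with the column mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric feature matrix with one missing value in the second column.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])

# Same step order and step names as in train.py: scale first, then impute.
preprocess = Pipeline(
    [
        ("StandardScalar", StandardScaler()),
        ("Preprocessor", SimpleImputer()),
    ]
)

X_prepared = preprocess.fit_transform(X)
print(X_prepared)
```

Because the imputer runs after scaling, the missing entry is filled with the mean of the scaled column, which is zero.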

A grid search is performed with F-beta threshold tuning enabled (f1_beta_tune=True), and the dataset is split into training, validation, and test sets. The selected model is trained, calibrated when calibrate=True, and evaluated with ROC AUC as the primary metric. Final metrics are printed and collected in a DataFrame, and the trained model object is saved to disk for future use.
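The idea behind threshold tuning can be illustrated without the library. The following is a generic sketch, not model_tuner's implementation: it scans candidate thresholds and keeps the one with the highest F-beta score on some validation labels and predicted probabilities. The function names and the toy values are hypothetical.

```python
def fbeta(y_true, y_pred, beta=1.0):
    """F-beta score from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, beta=1.0):
    """Scan thresholds 0.00..0.99 and return (threshold, score) with the best F-beta."""
    thresholds = [i / 100 for i in range(100)]
    scores = [
        fbeta(y_true, [int(p >= t) for p in y_prob], beta) for t in thresholds
    ]
    best_index = max(range(len(thresholds)), key=scores.__getitem__)
    return thresholds[best_index], scores[best_index]


# Toy validation labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.70, 0.15]

threshold, score = best_threshold(y_true, y_prob, beta=1.0)
print(threshold, round(score, 3))
```

On imbalanced data the best threshold is often well below 0.5, which is consistent with the optimal thresholds reported in the output sections of this example.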

################################################################################
## Step 1. Import Libraries
################################################################################

from pathlib import Path
import typer
from loguru import logger
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo
import os
import model_tuner
from model_tuner import Model, dumpObjects
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from py_scripts.model_params import model_definitions, rstate

################################################################################
## Step 2. Initialize CLI App and Paths
################################################################################

app = typer.Typer()

PROCESSED_DATA_DIR = Path("model_files")
MODELS_DIR = Path("model_files")
RESULTS_DIR = Path("model_files/results")

################################################################################
## Step 3. Define Main Function with CLI Command
################################################################################


@app.command()
def main(
    model_type: str = "lr",
):

    ############################################################################
    ## Display Model Tuner Version Info
    ############################################################################

    print(f"\nModel Tuner version: {model_tuner.__version__}\n")
    print(f"Model Tuner authors: {model_tuner.__author__}\n")

    ############################################################################
    ## Step 4. Fetch and Prepare Dataset
    ############################################################################

    # Fetch dataset
    adult = fetch_ucirepo(id=2)  # UCI Adult dataset
    X = adult.data.features
    y = adult.data.targets

    # Copy X to retrieve original features, used for stratification
    stratify_df = X.copy()

    # Log first five rows of features and targets
    logger.info(f"\n{'=' * 80}\nX\n{'=' * 80}\n{X.head()}")
    logger.info(f"\n{'=' * 80}\ny\n{'=' * 80}\n{y.head()}")

    # Retain numeric columns only
    X = X.select_dtypes(include=np.number)

    # Subset stratify_df to the features used for stratification
    stratify_df = stratify_df[["race", "sex"]]

    # Clean target column by removing trailing period
    y.loc[:, "income"] = y["income"].str.rstrip(".")

    # Display class balance
    print(f"\nBreakdown of y:\n{y['income'].value_counts()}\n")

    # Encode target to binary
    y = y["income"].map({"<=50K": 0, ">50K": 1})

    ############################################################################
    ## Step 5. Extract Model Settings
    ############################################################################

    clc = model_definitions[model_type]["clc"]
    estimator_name = model_definitions[model_type]["estimator_name"]

    # Set the parameters
    tuned_parameters = model_definitions[model_type]["tuned_parameters"]
    early_stop = model_definitions[model_type]["early"]

    metrics = {}

    logger.info(f"\nTraining {estimator_name}...")

    ############################################################################
    ## Step 6. Create Preprocessing Pipeline
    ############################################################################

    pipeline = [
        ("StandardScalar", StandardScaler()),
        ("Preprocessor", SimpleImputer()),
    ]

    print("\n" + "=" * 60)

    ############################################################################
    ## Step 7. Instantiate Model Tuner
    ############################################################################

    model = Model(
        pipeline_steps=pipeline,
        name=estimator_name,
        model_type="classification",
        estimator_name=estimator_name,
        calibrate=True,
        estimator=clc,
        kfold=False,
        grid=tuned_parameters,
        n_jobs=2,
        randomized_grid=False,
        scoring=["roc_auc"],
        random_state=rstate,
        stratify_cols=stratify_df,
        stratify_y=True,
        boost_early=early_stop,
    )

    ############################################################################
    ## Step 8. Grid Search & Data Splitting
    ############################################################################

    model.grid_search_param_tuning(X, y, f1_beta_tune=True)
    X_test, y_test = model.get_test_data(X, y)
    X_valid, y_valid = model.get_valid_data(X, y)

    ############################################################################
    ## Step 9. Train and Calibrate Model
    ############################################################################

    model.fit(X, y, score="roc_auc")

    if model.calibrate:
        model.calibrateModel(X, y, score="roc_auc")

    ############################################################################
    ## Step 10. Evaluate Model
    ############################################################################

    return_metrics_dict = model.return_metrics(
        X,
        y,
        optimal_threshold=True,
        print_threshold=True,
        model_metrics=True,
        return_dict=True,
    )

    metrics = pd.Series(return_metrics_dict).to_frame(estimator_name)
    metrics = round(metrics, 3)
    print("=" * 80)

    ############################################################################
    ## Step 11. Save Trained Model
    ############################################################################

    print("=" * 80)
    dumpObjects(
        {
            "model": model,
        },
        RESULTS_DIR / f"{str(clc).split('(')[0]}.pkl",
    )


if __name__ == "__main__":
    app()

Logistic Regression Output

Model Tuner version: 0.0.28b
Model Tuner authors: Arthur Funnell, Leonid Shpaner, Panayiotis Petousis

--------------------------------------------------------------------------------
X
--------------------------------------------------------------------------------
   age         workclass  fnlwgt  ... capital-loss  hours-per-week native-country
0   39         State-gov   77516  ...            0              40  United-States
1   50  Self-emp-not-inc   83311  ...            0              13  United-States
2   38           Private  215646  ...            0              40  United-States
3   53           Private  234721  ...            0              40  United-States
4   28           Private  338409  ...            0              40           Cuba

[5 rows x 14 columns]
--------------------------------------------------------------------------------
y
--------------------------------------------------------------------------------
income
0  <=50K
1  <=50K
2  <=50K
3  <=50K
4  <=50K
income
<=50K    37155
>50K     11687
Name: count, dtype: int64


============================================================

Pipeline Steps:

┌────────────────────────────────────────────┐
│ Step 1: preprocess_scaler_StandardScalar   │
│ StandardScaler                             │
└────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────┐
│ Step 2: preprocess_imputer_Preprocessor    │
│ SimpleImputer                              │
└────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────┐
│ Step 3: lr                                 │
│ LogisticRegression                         │
└────────────────────────────────────────────┘


0%|          | 0/5 [00:00<?, ?it/s]
20%|██        | 1/5 [00:01<00:04,  1.09s/it]
40%|████      | 2/5 [00:02<00:03,  1.11s/it]
60%|██████    | 3/5 [00:02<00:01,  1.41it/s]
80%|████████  | 4/5 [00:02<00:00,  1.93it/s]
100%|██████████| 5/5 [00:02<00:00,  2.46it/s]
100%|██████████| 5/5 [00:02<00:00,  1.74it/s]
Fitting model with best params and tuning for best threshold ...

0%|          | 0/2 [00:00<?, ?it/s]
50%|█████     | 1/2 [00:00<00:00,  2.54it/s]
100%|██████████| 2/2 [00:00<00:00,  2.96it/s]
100%|██████████| 2/2 [00:00<00:00,  2.89it/s]
Best score/param set found on validation set:
{'params': {'lr__C': 0.1, 'lr__penalty': 'l2'}, 'score': 0.827771193165685}
Best roc_auc: 0.828

roc_auc after calibration: 0.827771193165685
Confusion matrix on set provided:
--------------------------------------------------------------------------------
        Predicted:
            Pos     Neg
--------------------------------------------------------------------------------
Actual: Pos  7293 (tp)   4394 (fn)
        Neg  5880 (fp)  31275 (tn)
--------------------------------------------------------------------------------
Optimal threshold used: 0.3
********************************************************************************
Report Model Metrics: lr

            Metric     Value
0      Precision/PPV  0.553632
1  Average Precision  0.643984
2        Sensitivity  0.624027
3        Specificity  0.841744
4            AUC ROC  0.830118
5        Brier Score  0.130627
********************************************************************************
================================================================================
================================================================================
Object saved!
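Three of the reported metrics follow directly from the confusion matrix above. The short check below recomputes them from the printed counts; rates_from_confusion is a hypothetical helper, and the remaining metrics (AUC ROC, average precision, Brier score) need the predicted probabilities and cannot be recovered this way.

```python
def rates_from_confusion(tp, fn, fp, tn):
    """Derive precision, sensitivity and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts copied from the Logistic Regression confusion matrix above.
lr_rates = rates_from_confusion(tp=7293, fn=4394, fp=5880, tn=31275)
for name, value in lr_rates.items():
    print(f"{name}: {value:.6f}")
```

The same helper can be applied to the Decision Tree and Random Forest confusion matrices to cross-check their reports.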

Decision Tree Output

Model Tuner version: 0.0.28b
Model Tuner authors: Arthur Funnell, Leonid Shpaner, Panayiotis Petousis

--------------------------------------------------------------------------------
X
--------------------------------------------------------------------------------
   age         workclass  fnlwgt  ... capital-loss  hours-per-week native-country
0   39         State-gov   77516  ...            0              40  United-States
1   50  Self-emp-not-inc   83311  ...            0              13  United-States
2   38           Private  215646  ...            0              40  United-States
3   53           Private  234721  ...            0              40  United-States
4   28           Private  338409  ...            0              40           Cuba

[5 rows x 14 columns]
--------------------------------------------------------------------------------
y
--------------------------------------------------------------------------------
income
0  <=50K
1  <=50K
2  <=50K
3  <=50K
4  <=50K
income
<=50K    37155
>50K     11687
Name: count, dtype: int64


============================================================

Pipeline Steps:

┌────────────────────────────────────────────┐
│ Step 1: preprocess_scaler_StandardScalar   │
│ StandardScaler                             │
└────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────┐
│ Step 2: preprocess_imputer_Preprocessor    │
│ SimpleImputer                              │
└────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────┐
│ Step 3: dt                                 │
│ DecisionTreeClassifier                     │
└────────────────────────────────────────────┘


0%|          | 0/12 [00:00<?, ?it/s]
8%|▊         | 1/12 [00:00<00:01,  7.00it/s]
17%|█▋        | 2/12 [00:00<00:01,  7.19it/s]
25%|██▌       | 3/12 [00:00<00:01,  7.72it/s]
42%|████▏     | 5/12 [00:00<00:00,  9.87it/s]
58%|█████▊    | 7/12 [00:00<00:00, 11.15it/s]
75%|███████▌  | 9/12 [00:00<00:00, 11.46it/s]
92%|█████████▏| 11/12 [00:01<00:00, 10.67it/s]
100%|██████████| 12/12 [00:01<00:00, 10.09it/s]
Fitting model with best params and tuning for best threshold ...

0%|          | 0/2 [00:00<?, ?it/s]
50%|█████     | 1/2 [00:00<00:00,  1.95it/s]
100%|██████████| 2/2 [00:01<00:00,  1.90it/s]
100%|██████████| 2/2 [00:01<00:00,  1.90it/s]
Best score/param set found on validation set:
{'params': {'dt__max_depth': 10,
            'dt__min_samples_leaf': 5,
            'dt__min_samples_split': 2},
'score': 0.8445424045851704}
Best roc_auc: 0.845

roc_auc after calibration: 0.8445422030447913
Confusion matrix on set provided:
--------------------------------------------------------------------------------
        Predicted:
            Pos     Neg
--------------------------------------------------------------------------------
Actual: Pos  9265 (tp)   2422 (fn)
        Neg  9688 (fp)  27467 (tn)
--------------------------------------------------------------------------------
Optimal threshold used: 0.21
********************************************************************************
Report Model Metrics: dt

            Metric     Value
0      Precision/PPV  0.488841
1  Average Precision  0.709198
2        Sensitivity  0.792761
3        Specificity  0.739254
4            AUC ROC  0.862208
5        Brier Score  0.117609
********************************************************************************
================================================================================
================================================================================
Object saved!

Random Forest Output

Model Tuner version: 0.0.28b
Model Tuner authors: Arthur Funnell, Leonid Shpaner, Panayiotis Petousis

--------------------------------------------------------------------------------
X
--------------------------------------------------------------------------------
   age         workclass  fnlwgt  ... capital-loss  hours-per-week native-country
0   39         State-gov   77516  ...            0              40  United-States
1   50  Self-emp-not-inc   83311  ...            0              13  United-States
2   38           Private  215646  ...            0              40  United-States
3   53           Private  234721  ...            0              40  United-States
4   28           Private  338409  ...            0              40           Cuba

[5 rows x 14 columns]
--------------------------------------------------------------------------------
y
--------------------------------------------------------------------------------
income
0  <=50K
1  <=50K
2  <=50K
3  <=50K
4  <=50K
income
<=50K    37155
>50K     11687
Name: count, dtype: int64


============================================================

Pipeline Steps:

┌────────────────────────────────────────────┐
│ Step 1: preprocess_scaler_StandardScalar   │
│ StandardScaler                             │
└────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────┐
│ Step 2: preprocess_imputer_Preprocessor    │
│ SimpleImputer                              │
└────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────┐
│ Step 3: rf                                 │
│ RandomForestClassifier                     │
└────────────────────────────────────────────┘


0%|          | 0/8 [00:00<?, ?it/s]
12%|█▎        | 1/8 [00:00<00:00,  7.50it/s]
25%|██▌       | 2/8 [00:00<00:01,  3.66it/s]
38%|███▊      | 3/8 [00:00<00:01,  4.96it/s]
50%|█████     | 4/8 [00:01<00:01,  3.30it/s]
75%|███████▌  | 6/8 [00:01<00:00,  4.43it/s]
100%|██████████| 8/8 [00:01<00:00,  4.60it/s]
100%|██████████| 8/8 [00:01<00:00,  4.43it/s]
Fitting model with best params and tuning for best threshold ...

0%|          | 0/2 [00:00<?, ?it/s]
50%|█████     | 1/2 [00:00<00:00,  2.59it/s]
100%|██████████| 2/2 [00:00<00:00,  2.67it/s]
100%|██████████| 2/2 [00:00<00:00,  2.65it/s]
Best score/param set found on validation set:
{'params': {'rf__max_depth': 10,
            'rf__min_samples_split': 5,
            'rf__n_estimators': 50},
'score': 0.859756111956717}
Best roc_auc: 0.860

roc_auc after calibration: 0.859756111956717
Confusion matrix on set provided:
--------------------------------------------------------------------------------
        Predicted:
            Pos     Neg
--------------------------------------------------------------------------------
Actual: Pos  9835 (tp)   1852 (fn)
        Neg 10861 (fp)  26294 (tn)
--------------------------------------------------------------------------------
Optimal threshold used: 0.15
********************************************************************************
Report Model Metrics: rf

            Metric     Value
0      Precision/PPV  0.475213
1  Average Precision  0.744653
2        Sensitivity  0.841533
3        Specificity  0.707684
4            AUC ROC  0.874146
5        Brier Score  0.111757
********************************************************************************
================================================================================
================================================================================
Object saved!
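Each report above ends with a Brier score, which is the mean squared difference between the predicted probability and the observed outcome (lower is better). The sketch below shows the definition on toy values and, as a point of comparison, the score of a constant prediction equal to the positive-class rate from the class breakdown (11,687 of 48,842). The function name and toy values are illustrative only.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy labels and probabilities to illustrate the definition.
toy_score = brier_score([0, 1, 1, 0], [0.1, 0.8, 0.6, 0.3])

# Baseline: always predict the positive-class rate p, which scores p * (1 - p).
positive_rate = 11687 / (37155 + 11687)
baseline = positive_rate * (1 - positive_rate)

print(round(toy_score, 3))
print(round(baseline, 3))
```

Against that baseline of roughly 0.18, the reported scores of about 0.131 (lr), 0.118 (dt), and 0.112 (rf) indicate that all three calibrated models improve on a constant prediction.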

Loading (Retrieving) The Model Objects and Data Splits

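import os
from pathlib import Path

import pandas as pd
from model_tuner import loadObjects

# Directory the trained models were written to in train.py
RESULTS_DIR = Path("model_files/results")
model_path = RESULTS_DIR

model_lr = loadObjects(os.path.join(model_path, "LogisticRegression.pkl"))
model_dt = loadObjects(os.path.join(model_path, "DecisionTreeClassifier.pkl"))
model_rf = loadObjects(os.path.join(model_path, "RandomForestClassifier.pkl"))

# NOTE: data_path is not defined in train.py; set it to the directory that
# holds X_test.parquet and y_test.parquet before running the lines below.
X_test = pd.read_parquet(os.path.join(data_path, "X_test.parquet"))
y_test = pd.read_parquet(os.path.join(data_path, "y_test.parquet"))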
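The .pkl file names used above follow from how train.py builds the output path: it takes the estimator's string representation and keeps everything before the first opening parenthesis. The sketch below reproduces that expression with a stand-in class, so it runs without scikit-learn; the class is hypothetical and only mimics the estimator's repr.

```python
from pathlib import Path


class LogisticRegression:
    """Stand-in that mimics the repr of a scikit-learn estimator."""

    def __repr__(self):
        return "LogisticRegression(class_weight='balanced', n_jobs=2, random_state=222)"


RESULTS_DIR = Path("model_files/results")
clc = LogisticRegression()

# Same expression as in train.py: keep the text before the first "(".
model_file = RESULTS_DIR / f"{str(clc).split('(')[0]}.pkl"
print(model_file.name)
```

One consequence of this naming scheme is that two runs with the same estimator class write to the same file, so a later run overwrites an earlier one.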