Welcome to the EDA Toolkit Python Library Documentation!
Note
This documentation is for eda_toolkit version 0.0.22.
The eda_toolkit is a comprehensive library designed to streamline and
enhance the process of Exploratory Data Analysis (EDA) for data scientists,
analysts, and researchers. This toolkit provides a suite of functions and
utilities that facilitate the initial investigation of datasets, enabling users
to quickly gain insights, identify patterns, and uncover underlying structures
in their data.
Project Links
Tip
A comprehensive Google Colab notebook is linked below that walks through
nearly everything discussed in this documentation, including data loading,
cleaning, visualization, and reporting workflows using the eda_toolkit.
This notebook is designed as a practical, end-to-end companion to the documentation
and can be run entirely in the browser without any local setup.
What is EDA?
Exploratory Data Analysis (EDA) is a crucial step in the data science workflow. It uses a range of techniques, often visual, to summarize the main characteristics of a dataset. EDA helps in understanding the data better, identifying anomalies, discovering patterns, and forming hypotheses. This process is essential before applying any machine learning models, as it helps verify the quality and relevance of the data.
Purpose of EDA Toolkit
The eda_toolkit library is a comprehensive suite of tools designed to
streamline and automate many of the tasks associated with Exploratory Data
Analysis (EDA). It offers a broad range of functionalities, including:
Data Management: Tools for managing directories, generating unique IDs, standardizing dates, and handling common DataFrame manipulations.
Data Cleaning: Functions to address missing values, remove outliers, and correct formatting issues, ensuring data is ready for analysis.
Data Visualization: A variety of plotting functions, including KDE distribution plots, stacked bar plots, scatter plots with optional best fit lines, and box/violin plots, to visually explore data distributions, relationships, and trends.
Descriptive and Summary Statistics: Methods to generate comprehensive reports on data types, summary statistics (mean, median, standard deviation, etc.), and to summarize all possible combinations of specified variables.
Reporting and Export: Features to save DataFrames to Excel with customizable formatting, create contingency tables, and export generated plots in multiple formats.
Key Features
Ease of Use: The toolkit is designed with simplicity in mind, offering intuitive and easy-to-use functions.
Customizable: Users can customize various aspects of the toolkit to fit their specific needs.
Integration: Seamlessly integrates with popular data science libraries such as Pandas, NumPy, Matplotlib, and Seaborn.
Documentation and Examples: Comprehensive documentation and examples to help users get started quickly and effectively.
Prerequisites
Before you install eda_toolkit, ensure your system meets the following requirements:
Python: version 3.7.4 or higher is required to run eda_toolkit.
Additionally, eda_toolkit depends on the following packages, which will be automatically installed when you install eda_toolkit:
jinja2: version 3.1.4 (exact version required)
matplotlib: version 3.5.3 or higher, capped at 3.9.2
nbformat: version 4.2.0 or higher, capped at 5.10.4
numpy: version 1.21.6 or higher, capped at 2.1.0
pandas: version 1.3.5 or higher, capped at 2.2.3
plotly: version 5.18.0 or higher, capped at 5.24.0
scikit-learn: version 1.0.2 or higher, capped at 1.5.2
scipy: version 1.5.4 or higher, capped at 1.7.3
seaborn: version 0.12.2 or higher, capped below 0.13.2
tqdm: version 4.66.4 or higher, capped below 4.67.1
xlsxwriter: version 3.2.0 (exact version required)
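As a quick way to compare an environment against the requirements above, the following minimal sketch checks the interpreter version and reports which of the listed dependencies are currently installed. It is not part of eda_toolkit: the helper names `python_is_supported` and `installed_versions` are hypothetical, and the sketch assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library. It only reports versions and does not enforce the version caps.

```python
import sys
from importlib import metadata  # standard library in Python 3.8+

# Minimum interpreter version stated in the prerequisites.
MIN_PYTHON = (3, 7, 4)

# Distribution names copied from the dependency list.
DEPENDENCIES = [
    "jinja2", "matplotlib", "nbformat", "numpy", "pandas", "plotly",
    "scikit-learn", "scipy", "seaborn", "tqdm", "xlsxwriter",
]


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter meets the documented minimum."""
    return tuple(version_info[:3]) >= MIN_PYTHON


def installed_versions(names=DEPENDENCIES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running the script before installation shows what pip would need to add or change; running it afterwards shows the versions that were actually resolved.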
Installation
You can install eda_toolkit directly from PyPI:
pip install eda_toolkit
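To confirm that the installation succeeded and matches the version this documentation describes, a small check along these lines may help. It is a sketch under stated assumptions: `describe_install` is a hypothetical helper, not an eda_toolkit function, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata  # standard library in Python 3.8+

DOCUMENTED_VERSION = "0.0.22"  # version this documentation describes


def describe_install(dist_name="eda_toolkit", expected=DOCUMENTED_VERSION):
    """Return a short status message for the named distribution."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return f"{dist_name} is not installed"
    if installed == expected:
        return f"{dist_name} {installed} matches this documentation"
    return f"{dist_name} {installed} is installed; this documentation covers {expected}"


print(describe_install())
```

If the reported version differs from the documented one, some functions or parameters described here may not be available in your environment.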
Description
This guide provides detailed instructions and examples for using the functions
in the eda_toolkit library effectively in your projects.
For most of the ensuing examples, we will leverage the Census Income Data (1994) from
the UCI Machine Learning Repository [1]. This dataset provides a rich source of
information for demonstrating the functionalities of the eda_toolkit.
Table of Contents
Getting Started
Data Management
- Data Management Overview
- Data Management Techniques
- Path directories
- Adding Unique Identifiers
- Trailing Period Removal
- Standardized Dates
- DataFrame Analysis
- Generating Summary Tables for Variable Combinations
- Saving DataFrames to Excel with Customized Formatting
- Creating Contingency Tables
- Generating Summaries (Table 1)
- Highlighting Specific Columns in a DataFrame
- Binning Numerical Columns
- Group-by Imputer
- Delete Inactive DataFrames
- del_inactive_dataframes()
- Example 1: List Active DataFrames (No Deletion)
- Example 2: Delete Everything Except a Single DataFrame
- Example 3: Dry Run (Preview Deletions)
- Example 4: Include IPython Output Cache Variables
- Example 5: Track DataFrame Memory Usage
- Example 6: Track DataFrame Memory and Process RSS
- Example 7: Programmatic Usage (No Console Output)
Plotting Functions
- Creating Effective Visualizations
- Histogram Distribution Plots
- Grouped Distributions
- Distribution Goodness-of-Fit Plots
- Feature Scaling and Outliers
- Stacked Crosstab Plots
- Outcome Crosstab Plots
- Box and Violin Plots
- Scatter Plots and Best Fit Lines
- Correlation Matrices
Theoretical Overview
About EDA Toolkit
- ASCII Art
- Acknowledgements
- Contributors/Maintainers
- Citing EDA Toolkit
- Changelog
- Version 0.0.22
- Version 0.0.21
- Version 0.0.20
- Version 0.0.19
- Version 0.0.18
- Version 0.0.17
- Version 0.0.16
- Version 0.0.15
- Version 0.0.14
- Version 0.0.13
- Add ValueError for Insufficient Pool Size in add_ids and Enhance ID Deduplication
- Enhance strip_trailing_period to Support Strings and Mixed Data Types
- Changes in stacked_crosstab_plot
- Add Environment Detection to dataframe_columns Function
- Add tqdm Progress Bar to dataframe_columns Function
- Other Enhancements and Fixes
- Version 0.0.12
- Version 0.0.11
- Version 0.0.10
- Version 0.0.9
- Version 0.0.8
- Version 0.0.8c
- Version 0.0.8b
- Version 0.0.8a
- Version 0.0.7
- Version 0.0.6
- Version 0.0.5
- Version 0.0.4
- Version 0.0.3
- Version 0.0.2
- Version 0.0.1rc0
- Version 0.0.1b0
- References