5 Functions and Modularity

We will begin by importing the necessary libraries.

# third-party imports
import pandas as pd # used for most data frame operations
import numpy as np # for linear algebra functionality
import matplotlib.pyplot as plt # import plotting library
import re # for regex functionality
from tabulate import tabulate # for tables

# import sklearn's 'metrics' module, which provides functions for
# evaluating model prediction quality, such as precision, recall,
# the confusion matrix, and the ROC curve.
from sklearn import metrics
from sklearn.metrics import (
    average_precision_score, 
    precision_recall_curve, 
    auc, 
    roc_curve, 
    classification_report, 
    recall_score,
    precision_score,
    roc_auc_score,
    brier_score_loss,
    confusion_matrix
)

5.0.1 Basic Functions

“Data Science Project Essentials” underscores the importance of functions and modularity in data science. Throughout this segment, you will see how organizing code into functions promotes readability, reusability, and easier debugging. By modularizing code, data scientists can break complex processes down into manageable units, improving efficiency and scalability. Modular code also makes codebases easier to maintain, streamlines updates, and allows useful tools to be shared across multiple projects.

This section of the book takes you on a deep dive into the practical applications of functions and modularity, from encapsulating pre-processing steps to implementing machine learning algorithms. Using the Chronic Kidney Disease (CKD) dataset, you will understand how you can construct robust, flexible data pipelines that are ready to accommodate changing project requirements.

Whether you’re a novice getting your feet wet in the data science pool or a seasoned professional, the emphasis on functions and modularity will provide a unique lens to appreciate and advance your data science projects. Dive into this essential resource and arm yourself with the knowledge to design elegant and efficient code, the cornerstone of any successful data science project.

In Python, a function is defined with the def keyword, followed by the function’s name and its parameters in parentheses. Here is a simple function that greets a person by name:

def greet(name):
    return f"Hello, {name}!"

In this case, greet is the function’s name, and name is the function’s argument. The return statement specifies the result that the function should produce.

You can call (or use) this function like so:

print(greet("Chloe"))
Hello, Chloe!

This will output: Hello, Chloe!

This function is quite simple but demonstrates the basic idea. You can create more complex functions that perform calculations, manipulate data, or even train machine learning models. Functions can take multiple arguments, and these arguments can be of any type (e.g., numbers, strings, lists, dataframes, or even other functions).
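
As a quick illustration of that last point, here is a minimal sketch of a function that accepts another function as an argument; the names apply_twice and increment are hypothetical, used only for this example.

def apply_twice(func, value):
    # call the supplied function on the value, then again on the result
    return func(func(value))

def increment(x):
    return x + 1

print(apply_twice(increment, 3))
5

Here, increment is passed to apply_twice just like any other argument, so apply_twice(increment, 3) returns 5.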

Here’s a more complex function, which calculates the mean of a list of numbers:

def calculate_mean(numbers):
    return sum(numbers) / len(numbers)

# Usage:
print(calculate_mean([1, 2, 3, 4, 5]))
3.0

This will output: 3.0

These examples illustrate the fundamental structure and usage of functions in Python. By encapsulating code into functions, we can keep our code organized, reusable, and easy to understand.

A function is like a black box that performs a specific task. It takes some inputs, processes them, and produces some outputs. This allows us to abstract away the details of the processing, making our code cleaner and easier to understand. Let’s dive deeper into the concept of inputs (parameters or arguments) and outputs (return values) in the context of functions.

5.0.2 Inputs (Parameters/Arguments)

The inputs to a function are also known as parameters or arguments. These are the values you pass into the function when you call it. In Python, these are defined in the parentheses after the function name. For instance:

def multiply(a, b):
    return a * b

Here, a and b are the input parameters. We can provide values for these parameters when we call the function:

result = multiply(4, 5)
print('result =', result)
result = 20

Here, we are calling the multiply function with 4 and 5 as arguments. This will output 20.

Parameters can be of any data type, and a function can have any number of parameters. If a function takes no parameters, we still include empty parentheses:

def say_hello():
    return "Hello!"

5.0.3 Outputs (Return Values)

The output of a function is the result that it produces. This is defined by the return statement in Python. The return statement ends the function execution and “sends back” the result of the function. For instance:

def add(a, b):
    return a + b

Here, the add function returns the sum of a and b. We can capture this return value when we call the function:

sum_value = add(3, 2)  # sum_value is now 5

Note that a function in Python doesn’t have to return a value. If no return statement is provided, the function returns None by default.

For instance:

def print_hello():
    print("Hello!")

result = print_hello()  # result is None
Hello!
print(result)
None

After print_hello is executed, “Hello!” is printed to the console, but the result variable contains None, the default return value of a Python function that has no return statement.

5.0.4 Data Types Report

In data science, functions often take data and some parameters as input, perform some operation (like a calculation, data transformation, or a machine learning model training), and return a result (like a number, a new dataset, or a trained model).

Now, let’s bring it all together and write a slightly more advanced function and call it data_types. This function is designed to perform an analysis on a given DataFrame df, producing a report on the data types, null values count, and the percentage of null values for each column in the DataFrame. Below is a bit of pseudocode to get us started in the process.

START

Step 1:
    INPUT a DataFrame 'df'
Step 2:
    CREATE a new DataFrame 'dat_type' containing the data types of all columns in 'df'
Step 3:
    FOR each column in 'df':
        CALCULATE the number of null values
        ADD this count as a new column in 'dat_type' named 'Null_Values'
    END FOR
Step 4:
    RESET the index of the 'dat_type' DataFrame
Step 5:
    FOR each column in 'df':
        CALCULATE the percentage of null values
        ROUND the result to the nearest whole number
        ADD this as a new column in 'dat_type' named 'perc_null'
    END FOR
Step 6:
    RENAME the columns of 'dat_type' as follows:
        'index' -> 'Column/Variable'
        '0' -> 'Data Type'
        'Null_Values' -> '# of Nulls'
        'perc_null' -> 'Percent Null'
Step 7:
    OUTPUT 'dat_type'

END

# Data Types Report
def data_types(df):
    '''
    This function provides a data types report on every column in the dataframe,
    showing column names, column data types, number of nulls, and percentage 
    of nulls, respectively.
    Inputs:
        df: dataframe to run the datatypes report on
    Outputs:
        dat_type: report saved out to a dataframe showing column name, 
                  data type, count of null values in the dataframe, and 
                  percentage of null values in the dataframe
    '''
    # Features' Data Types and Their Respective Null Counts
    # This line creates a pandas Series object, dat_type, where the index is the
    # column names and the values are the data types of those columns.
    dat_type = df.dtypes

    # create a new dataframe to inspect data types
    # This line converts the dat_type Series into a DataFrame. This is done 
    # because a DataFrame is more flexible and powerful than a Series, and we 
    # will need to add more columns later on.
    dat_type = pd.DataFrame(dat_type)

    # sum the number of nulls per column in df
    # This line adds a new column to the dat_type DataFrame named 'Null_Values', 
    # which contains the count of null (or missing) values in each column of the 
    # original DataFrame df.
    dat_type['Null_Values'] = df.isnull().sum()

    # reset index w/ inplace=True so dat_type is modified directly
    # Here, the function resets the index of the DataFrame dat_type to default
    # integer values and moves the existing index (the column names) into a new
    # column named 'index'.
    dat_type.reset_index(inplace=True)

    # percentage of null values is produced and cast to new variable
    # This line creates a new column named 'perc_null' in dat_type DataFrame. 
    # This column holds the percentage of null values for each column in the 
    # original DataFrame df. It's calculated as the number of null values 
    # divided by the total number of rows, all multiplied by 100 to convert to 
    # a percentage. The round function rounds this value to the nearest whole 
    # number.
    dat_type['perc_null'] = round(dat_type['Null_Values'] / len(df)*100,0)

    # columns are renamed for a cleaner appearance
    # Finally, this line renames the columns of dat_type DataFrame for a cleaner 
    # appearance.
    dat_type = dat_type.rename(columns={0:'Data Type',
                                          'index': 'Column/Variable',
                                          'Null_Values': '# of Nulls',
                                          'perc_null': 'Percent Null'})
    # The function then returns the `dat_type` DataFrame, which now provides a 
    # comprehensive overview of each column in the original DataFrame - their 
    # data types, the count of null values, and the percentage of null values.
    return dat_type
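
To see the report in action, here is a minimal usage sketch. The small DataFrame sample_df and its columns are hypothetical, created only to illustrate the output format; later chapters call data_types on the CKD dataset instead.

# hypothetical example DataFrame containing a few null values
sample_df = pd.DataFrame({
    'age': [63, 48, None, 70],
    'bp': [80, 70, 90, None],
    'class': ['ckd', 'ckd', 'notckd', 'ckd']
})

# generate and display the data types report
report = data_types(sample_df)
print(tabulate(report, headers='keys', tablefmt='psql', showindex=False))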

5.0.5 Enhancing Efficiency and Adaptability

This function is a great utility for understanding the structure and cleanliness of a DataFrame at a glance, which is especially helpful in the initial stages of data exploration and preprocessing.

In the grand tapestry of programming, the concept of modularity is akin to the individual stitches that make up the larger design. Modularity involves subdividing a program into discrete, self-contained components, or “modules,” each tasked with executing a unique function within the program’s overall operation.

Envision this approach in the context of creating a plot. Perhaps this plot is a critical element of your program, something to be reproduced with variations across multiple instances. The modularity paradigm steps in to streamline this process in the following ways:

Clarity and Ease of Maintenance
Modularity significantly simplifies the understanding and maintenance of the codebase. If you liken a program to a book, the modules are its chapters. Each chapter tells its own story, and its focus is narrow and precise. If a plot isn’t correctly rendered, one would need only to refer to the relevant chapter—the plot module—for debugging, rather than getting lost in the entire tome of the program.

Encouraging Code Reusability
Just as a well-written book chapter can stand alone or be appreciated within the broader narrative, an effectively developed module can operate independently or within different programs. If a particular plot is to be recreated repeatedly, one doesn’t need to rewrite the entire code each time; the relevant module can be invoked with different parameters, making it a practical embodiment of the ‘write once, use many times’ principle.

Efficiency in Development Time
Reusing modules significantly compresses the development timeframe. A module—once written and tested—can be invoked whenever needed, eliminating the time spent on redundant testing.

Enhancing Reliability
Isolated development and testing of modules enhance their reliability. Since a module is reused in various scenarios, it has to pass multiple litmus tests, increasing its robustness and credibility.

Parallel Development
Modularity fosters collaborative development. When a program is broken down into modules, different teams or individuals can concurrently work on separate modules, expediting the overall development process.

Change Isolation
Just as how altering a single chapter doesn’t drastically change the entire book’s narrative, updates or modifications within a module do not disrupt the entire program. This allows for seamless integration of changes and fosters the system’s overall resilience.

By breaking complex problems down into manageable parts, modularity proves itself to be a significant asset in programming. It contributes to the creation of code that is simpler to understand, easier to maintain, and adaptable to changes—an instrumental tool in the programmer’s arsenal.

The subsequent functions, which will be utilized frequently in this book, are now at your disposal. A few of them are dedicated to exploratory data analysis, while others focus on preprocessing and wrangling the data. These are presented here for your convenience, allowing you to easily refer back to them as they are summoned in the later chapters. This way, you will be able to understand their usage in context.
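
For example, as a minimal sketch of this workflow (the module name ckd_utils.py is hypothetical), the helper functions in this chapter could be saved to their own file and imported wherever they are needed:

# assuming the helper functions are saved in a file named ckd_utils.py
# (a hypothetical module name), any notebook or script in the project
# can reuse them with a single import:
from ckd_utils import data_types, sns_boxplot

Each notebook then works from a single, tested copy of these helpers instead of duplicating the code.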

5.0.6 Plotting

import seaborn as sns # import the seaborn library for plotting

def sns_boxplot(df, title, xlabel, ylabel, column):
    '''
    This function plots boxplots of any column of interest
    Inputs: 
        df: dataframe to pass into the function
        title: title of the boxplot
        xlabel: x-axis label of the boxplot
        ylabel: y-axis label of the boxplot
        column: column of interest to run the function on
    '''
    fig = plt.figure(figsize = (15,1.5)) # set figure size
    plt.title(title, fontsize=12) # set plot title
    plt.xlabel(xlabel, fontsize=12) # set plot x-axis label
    plt.ylabel(ylabel, fontsize=12) # set plot y-axis label
    # seaborn boxplot function w/ horizontal orientation
    boxplot = sns.boxplot(x=df[column], palette="coolwarm",
                          orient='h', linewidth=2.5)
    print()
    print('Summarizing', column)
    # Computing IQR
    Q1 = df[column].quantile(0.25) # first quartile
    Q3 = df[column].quantile(0.75) # third quartile
    IQR = Q3-Q1 # interquartile range

    # Computing summary statistics of the selected column
    mean = round(df[column].mean(),2) # calculate mean
    std = round(df[column].std(),2) # calculate standard dev.
    median = round(df[column].median(),2) # calculate median

    # print statements for summary statistics
    print('The first quartile is %s. ' % Q1)
    print('The third quartile is %s. ' % Q3)
    print('The IQR is %s.' % round(IQR, 2))
    print('The mean is %s.' % mean)
    print('The standard deviation is %s.' % std)
    print('The median is %s.' % median)
    # if mean is greater than median, the distribution is positively skewed;
    # if it is smaller, negatively skewed; otherwise approximately symmetric.
    if mean > median:
        print('The distribution is positively skewed.')
    elif mean < median:
        print('The distribution is negatively skewed.')
    else:
        print('The distribution is approximately symmetric.')
    print()
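
As a usage sketch, the call below constructs a small, hypothetical demo_df solely to demonstrate the function; in later chapters sns_boxplot is applied to columns of the CKD dataset.

# hypothetical example data used only to demonstrate the call
demo_df = pd.DataFrame({'age': np.random.default_rng(0).normal(50, 12, 200)})

sns_boxplot(demo_df, title='Distribution of Age',
            xlabel='Age (years)', ylabel='', column='age')
plt.show()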