Introduction to Functional Programming

Functional programming

  • A style of programming in which a function's output is determined solely by its input
  • The function does not change anything outside itself or depend on external data to produce its output

Why use Functional programming?:

  • Makes your code easier to understand, test, debug, and build upon
  • Widely used in data analysis and other fields where reliable, reproducible computation matters

4 Key concepts in functional programming:

  1. Pure Functions: These are functions where the output is determined only by their input

    • Given the same input, they will always produce the same output
    • Have no side effects, meaning they don't change anything in the program outside the function itself
  2. Immutability: In functional programming, data is not changed

    • Instead, new data is created from the existing data, which is easier to follow and debug
  3. First-Class Functions: In functional programming, functions are treated like any other variable

    • They can be assigned to variables, stored in data structures, passed as arguments to other functions, and returned as values from other functions
  4. Higher-Order Functions: These are functions that take one or more functions as arguments, return a function as a result, or both

    • This is a key part of functional programming and you'll see this concept used often
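The four concepts can be previewed in one minimal sketch before the detailed sections that follow (the names here, like double and nums, are made up for illustration):

```python
# Pure function: output depends only on its input, no side effects
def double(x):
    return x * 2

# Immutability: build a new list instead of modifying the original
nums = [1, 2, 3]
doubled = [double(n) for n in nums]  # nums itself is untouched

# First-class function: a function stored in a variable
twice = double

# Higher-order function: map() takes a function as an argument
tripled = list(map(lambda n: n * 3, nums))

print(doubled)   # [2, 4, 6]
print(twice(5))  # 10
print(tripled)   # [3, 6, 9]
```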

Goal: To use functional programming concepts in our Python code

The Essence of Functional Programming in Data Analysis

In data analysis tasks, particularly with tools like pandas, the essence of functional programming—writing clear, concise, and effective code—is achieved by:

  • Minimizing Use of Mutable Data structures: By avoiding in-place modifications (changes made directly to the data structure)
  • Encapsulating Operations in Functions: Encapsulation in a function improves readability, reusability, and allows for better testing and debugging
  • Utilizing Functional Constructs: Leveraging built-in methods that abstract complex operations into simpler, concise statements
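As a small sketch of what these three points look like in pandas (the inj column here is made up, echoing the tornado examples used later), assign() is one functional construct that avoids in-place modification:

```python
import pandas as pd

# Hypothetical injury counts
df = pd.DataFrame({'inj': [1, 2, 100]})

# Functional construct: assign() returns a NEW DataFrame instead of
# modifying df in place, so the original data is preserved
cleaned = df.assign(inj_plus_one=df['inj'] + 1)

print(list(df.columns))       # ['inj']  (original unchanged)
print(list(cleaned.columns))  # ['inj', 'inj_plus_one']
```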
In [57]:
import pandas as pd # should already be pre-installed with conda

from itables import init_notebook_mode, show #to fully take advantage of the ability to display our data as a table format, let's use the itables library 

init_notebook_mode(all_interactive=True) #After this, any Pandas DataFrame, or Series, is displayed as interactive table, which lets you explore, filter or sort your data

1. Pure functions:

Definition: A function is pure if its output is only determined by its input

  • no changes are made to anything outside the function
In [58]:
# Pure function: 
def number_power_two(x):
    return x ** 2
print(number_power_two(2)) # prints 4

# key point: will always produce same output for same input
# Why? Because it uses only a local variable x, which is defined by the input value when the function is called
4

Non-Pure functions:

Definition: a function whose output depends on something other than its arguments (here, a global variable), which means its output can change even with the same input

  • Global Variables: These are variables that are defined in the main body of the script, outside any function, and can be accessed from any function in the code
    • They have a global scope, meaning they can be accessed anywhere in the program
    • However, to modify a global variable within a function, you must use the global keyword before the variable name
  • Local Variables: These are variables that are defined only within a function

    • key point: local variables can only be accessed and modified within that function, and they cease to exist once the function has finished running
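A short sketch of that scope rule (make_total and total are hypothetical names for illustration):

```python
def make_total(a, b):
    total = a + b  # total is a local variable: it exists only during this call
    return total

print(make_total(2, 3))  # prints 5

# Outside the function, the local variable no longer exists
try:
    print(total)
except NameError:
    print("total is not defined outside the function")
```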
In [59]:
# non-Pure function:
y=3 # here is a global variable called y, which is accessible across the entire scope of your code

def number_power_two(x):
    # The function is reading the value of the global variable y,
    # but it's only accessing and not modifying y, so we don't use global keyword
    return x**y
print(number_power_two(2)) # prints 8

# This is NOT a pure function
    # If we change y elsewhere in our code, the same input (e.g., x=2)
    #  will produce a different output
8

Function with Side Effects:

Definition: a function that depends on a global variable and modifies it

  • The global keyword is used before a variable to indicate that the variable is a global variable
  • Keyword global is necessary if you want to modify a global variable from within a function
  • Without the global keyword, Python would treat a variable assigned within a function as a new local variable, leaving the global variable unchanged

In functional programming, it's recommended to avoid using global variables and modifying them within functions, as this can lead to unpredictable side effects (unpredictable things occurring in your program)

  • it's better to use and return local variables, and to pass any necessary data into a function as arguments
In [60]:
# example of what NOT to do in functional programming...
y=3 

def number_power_two_change_y(x):
    global y # we need to use keyword global to access AND modify the global variable y inside the function
    y= y*2 # global variable y has been changed! 
    return x**y

print(number_power_two_change_y(2)) # prints 64: y is doubled to 6 inside the function, so 2**6 = 64
print(y) # prints 6, y has been changed!!

# In a larger program, it would become quite difficult to track the value of y if we keep changing its value ...
64
6
In [61]:
# Do this instead...
# This function uses only local variables and arguments
def add_numbers(x, y):

    # x and y are local variables. They only exist within this function
    result = x + y  # result is also a local variable
    return result

# Rather than using global variables, call the function with those data as arguments.  e.g., 3 and 4
print(add_numbers(3, 4))  # prints: 7
7
In [62]:
#This function doesn't use or modify any global variables, it operates solely on its arguments and local variables
# Makes the function's behavior easy to understand and predict 
# given the same arguments, it will always produce the same result

2. Immutable functions in functional programming

  • Immutability is a key concept in functional programming

  • Immutability: in functional programming, data is not modified after it is created

    • Instead, new data is created from the existing data, which leads to fewer bugs and easier troubleshooting when they do occur

    • Certain data types are inherently immutable: tuples, integers, strings

    • Other data types are mutable: lists, dictionaries, and, most relevant here, DataFrames!
  • Key point: understanding immutability ensures that data does not get changed unexpectedly
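The mutable/immutable split can be seen directly by trying the same kind of change on a list and on a tuple (the city and coordinate values are made up):

```python
# Lists are mutable: in-place changes are allowed
cities = ['Chicago', 'Houston']
cities.append('Boston')
print(cities)  # ['Chicago', 'Houston', 'Boston']

# Tuples are immutable: the same kind of change raises a TypeError
coords = (41.8, -87.6)
try:
    coords[0] = 42.0
except TypeError:
    print("Tuples cannot be modified in place")
```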

In [63]:
# Strings in python are immutable, meaning any operation on them will create a new string

# define our function to modify an input string
def capitalize_string(input_string):
    # this function returns a new string
    return input_string.capitalize() # string method that returns a new string with the first letter capitalized

original_string = 'chicago' 




# Call the function and print the result
print(f"New string is : {capitalize_string(original_string)}") # capitalizes and returns a new string 

# print the original, will it be changed?...
print("Original string is:", original_string) # still the same
New string is : Chicago
Original string is: chicago
In [64]:
# We see that the function did not alter the original string
# The function actually returned a new string with the first letter capitalized
In [65]:
# While strings are immutable, lists are mutable
    # So, you can directly change the list contents like adding, removing, or modifying its elements 
def upper_string(input_string):
    return input_string.upper()

# Original list of city names
original_strings = ['Chicago', 'Philadelphia', 'Houston']

# Modifying the list by replacing elements
for i in range(len(original_strings)):
    # Although strings are immutable, the list that contains them is mutable
    original_strings[i] = upper_string(original_strings[i]) # we are directly modifying the list

print("Modified strings in mutable list: ", original_strings)
Modified strings in mutable list:  ['CHICAGO', 'PHILADELPHIA', 'HOUSTON']
In [66]:
# We see that the function altered the original list of strings
# This is something we need to be aware of if the program refers to the original list elsewhere
# Why?: Because if we change the original data, then how do we... 
    # refer back to the original data?
    # compare the new result with the original?
    # apply a different function to the original data?
In [67]:
# imagine you had a function that processes an input list of cities and appends a 'new' city to the list
# what would happen if this function was called in different parts of the program
# How should it go about adding data to the original list?

def add_new_cities(data_list, city):
    '''Simulates process of adding by appending strings to input list'''
    new_city= city
    data_list.append(new_city) # directly modifies the list passed to it
    # Keep in mind: is your intent to modify the original data?



# Original list of city names
original_list = ['Chicago', 'Philadelphia', 'Houston']

# call the function for the first time
# Function call 1
add_new_cities(original_list, "New Orleans")
print("After first processing:", original_list)

# Function call 2
add_new_cities(original_list,"New Orleans")
print("After second processing:", original_list)
After first processing: ['Chicago', 'Philadelphia', 'Houston', 'New Orleans']
After second processing: ['Chicago', 'Philadelphia', 'Houston', 'New Orleans', 'New Orleans']
In [68]:
# What happened? 
    # We modified the original data 
    # the first time the function was called on the original data, it correctly added the new city to the list
# Later, the function was called again and added to the same list, creating likely unintended additions to the original data
    # This can lead to bugs if we do not understand that the original data is being modified when we append to a list in this way
In [69]:
# Instead, copy the original data
# then return the new local variable that holds the modified list

def add_new_cities(data_list, city):
    '''Simulates process of adding by appending strings to input list'''
    new_city= city
    data_list = data_list.copy() # work on a copy so the original list is untouched
    data_list.append(new_city)
    return data_list
    # Is the intent to modify the original data? Did the caller already pass in a copy? If so, copying again here is NOT necessary



# Original list of city names
original_list = ['Chicago', 'Philadelphia', 'Houston']

# call the function for the first time
# Function call 1
processed_data1 = add_new_cities(original_list, "New Orleans")
print("Original list, after first processing:", original_list)
print("New list, after first processing:", processed_data1)

# Function call 2
processed_data2 = add_new_cities(original_list,"New Orleans")
print("Original list, after second processing:", original_list)
print("New list, after second processing:", processed_data2)
Original list, after first processing: ['Chicago', 'Philadelphia', 'Houston']
New list, after first processing: ['Chicago', 'Philadelphia', 'Houston', 'New Orleans']
Original list, after second processing: ['Chicago', 'Philadelphia', 'Houston']
New list, after second processing: ['Chicago', 'Philadelphia', 'Houston', 'New Orleans']
In [70]:
# Here, the function deliberately avoids altering the original list
# It works on a copy of the list and returns this modified copy
# The purpose is to leave the original data unchanged
In [71]:
# A more practical case, using a dataframe
# Clean up zip code data while preserving the original data

# Create a DataFrame with various zip formats
df = pd.DataFrame({
    'original_zip': ['55255-10202', '12343-29292', '84848-12020', '3033'] # Although numeric, hyphenated zips in the raw data will be interpreted by pandas as strings as represented here in quotes
})


# function to clean up zipcodes and keep only the first 5 digits
def clean_zipcodes(input_zip):
    # Split the input string at the hyphen and select the first part
    first_part= input_zip.split("-")[0] # string method called .split(), [0] selects the first part of the string
    # Pad zeros to ensure zip is always 5 characters
    return first_part.zfill(5)  

# Apply the function to the original_zip column and create a new column for the cleaned zip codes
df['new_zip'] = df['original_zip'].apply(clean_zipcodes) # will splitting and padding zips modify the original data?...


# Display the original and new data 
print("\nModified DataFrame with Cleaned Zip Codes:\n", df[['original_zip', 'new_zip']])

print("\n")

# Display the original data 
print("Original DataFrame:\n", df[['original_zip']]) 
Modified DataFrame with Cleaned Zip Codes:
   original_zip new_zip
0  55255-10202   55255
1  12343-29292   12343
2  84848-12020   84848
3         3033   03033


Original DataFrame:
   original_zip
0  55255-10202
1  12343-29292
2  84848-12020
3         3033
In [72]:
# we performed several manipulations on the original zips without altering the original data
# the value of keeping the original data in the dataframe is that
# it is easier to backtrack what was done later on if we need to verify and compare
In [73]:
# Say we find that we often need to change the data in our dataframe, maybe dates and magnitudes have typos...
# We need a function to simplify this recurring process
# but our dataframe has a tuple-based MultiIndex
# So, our function should take a dataframe and a tuple, then return a NEW dataframe with the updated index


# Create a DataFrame from a dictionary of tuples
data = {
    'mag': (1,1,3,3),
    'date': ('April 10 2024','April 11 2024','April 12 2024','April 13 2024'),
    'inj' : (1,2,100,200)
}
tdata = pd.DataFrame(data)

def set_new_index(df, row, new_index_values):

    print(df) #input data we want to change

    # Create a new DataFrame with the desired index
    df_copy = df.reset_index() # reset the index and create a NEW dataframe

    print(df_copy) # check the results of the NEW dataframe

    df_copy.loc[row, ['mag', 'date']] = new_index_values # change the mag and date for the given row of the NEW dataframe

    print(df_copy)

    df_copy.set_index(['mag', 'date'], inplace=True) # set the index to the tuple of mag, date for the NEW dataframe

    print(df_copy)

    return df_copy # return the NEW dataframe with the modified index

# This is a pure function: it doesn't modify the original DataFrame (it has no side effects)
# and will always produce the same output DataFrame for the same input DataFrame and index (no external variables in the function modify our output)

new_tdata = set_new_index(tdata, 0, (2, 'April 12 2024')) # creates a new DataFrame with the multi index (tuple) updated

print(new_tdata)

# This function doesn't modify the original DataFrame
   mag           date  inj
0    1  April 10 2024    1
1    1  April 11 2024    2
2    3  April 12 2024  100
3    3  April 13 2024  200
   index  mag           date  inj
0      0    1  April 10 2024    1
1      1    1  April 11 2024    2
2      2    3  April 12 2024  100
3      3    3  April 13 2024  200
   index  mag           date  inj
0      0    2  April 12 2024    1
1      1    1  April 11 2024    2
2      2    3  April 12 2024  100
3      3    3  April 13 2024  200
                   index  inj
mag date                     
2   April 12 2024      0    1
1   April 11 2024      1    2
3   April 12 2024      2  100
    April 13 2024      3  200
                   index  inj
mag date                     
2   April 12 2024      0    1
1   April 11 2024      1    2
3   April 12 2024      2  100
    April 13 2024      3  200
In [74]:
#set_new_index is a pure function
# This function doesn't modify the original DataFrame
# will always produce the same output for the same input 

3. First-Class Functions

  • Central concept to functional programming
    • Functions are treated like any other value or variable
    • This concept allows for flexible design patterns and can simplify many complex programming tasks

Key Characteristics:

  • Variable Assignment: Functions can be assigned to variables.
  • Function as Argument: Functions can be passed as arguments to other functions.
  • Function as Return Value: Functions can return other functions.
  • Data Structure Storage: Functions can also be stored in data structures such as lists and dictionaries.

Variable Assignment

In [75]:
# Assigning a function to a variable


# Let's create a function that greets the name entered as the argument
# and modify it with some Seinfeld references...
def Hello(name):  
    if name== "Newman":
        return f" Hello,...Newman!"
    if name == "Jerry":
        return f" Hello, ... Jerry!"
    if name == "Mulva":
        return f" You don't know my name do you?"
    else: 
        return f"Hello, {name}" # this line alone, without the 'else', is sufficient for the example to function, but that would be no fun!!

print(Hello("Mulva"))

# now for the key point here:  let's save that function in a variable 
greet = Hello

# let's check on this data....
print(greet)  # yep, we have a stored function


print(greet("Newman"))
 You don't know my name do you?
<function Hello at 0x0000015DA63BA7A0>
 Hello,...Newman!

Store the function in a data structure

In [76]:
# Stored in Data Structures

def square(x):
    return x * x

def cube(x):
    return x * x * x

functions = [square, cube]

print(type(functions)) # a list
print(functions) # holding two function objects

results =[] # create an empty list to hold the results

for func in functions:
    results.append(func(3)) # append the result to the list
    
print(results)  # Outputs 9 (3^2) and 27 (3^3) respectively
<class 'list'>
[<function square at 0x0000015DA63BB240>, <function cube at 0x0000015DA63BA5C0>]
[9, 27]
In [77]:
# Assign functions to variables and store them in Data structures

import numpy as np

data = {
    'mag': [1,1,3,3], # column 1
    'date': ['April 10 2024','April 11 2024','April 12 2024','April 13 2024'], # column 2
    'inj' : [1,2,100,200] # column 3
}
tdata = pd.DataFrame(data)

# Define some transformation functions

def log(x):
    return np.log(x)


def square(x):
    return x**2

# store these for later use on our data

# function dictionary, holding two functions
transformation = {
    'log' :log,
    'square': square
}

# suppose a user decides which transformation to apply

user_transforms = 'log' # can make into an input() but for simplicity create it as a typical variable

# apply the function from the dictionary
tdata['log_inj'] = tdata['inj'].apply(transformation[user_transforms])

print(tdata) # apply function has applied our function def log from the dictionary called transformation
   mag           date  inj   log_inj
0    1  April 10 2024    1  0.000000
1    1  April 11 2024    2  0.693147
2    3  April 12 2024  100  4.605170
3    3  April 13 2024  200  5.298317

Pass a function as an argument to another function

In [78]:
# pass a function as an argument to other functions

# say you have a function you need to apply often to different inputs
def cube(x):
    return x * x * x


# you can create another function that takes your cube function along with a specific argument to evaluate
def apply_function(func, value):

    return func(value) # this will pass a value to the cube function to be evaluated and returned

result = apply_function(cube, 5)  # you can pass that cube function into another function along with a specific argument to evaluate


print(result)  # Output will be 125
125
In [79]:
# Pass a function as an argument to another function with sample tornado data

data={
    'mag': [1,1,3,3], # column 1
    'date': ['April 10 2024','April 11 2024','April 12 2024','April 13 2024'], # column 2
    'inj' : [1,2,100,200] # column 3
}

tdata = pd.DataFrame(data)

# create a function that takes a dataframe and a function as arguments
def apply_custom_function(df, func):
    return df.apply(func) # uses the built-in pandas apply function to execute the function on the dataframe

# Function to increment by 10
def add_ten(x):
    return x + 10

# the add_ten function is passed as an argument to apply_custom_function
tdata["added_inj"] = apply_custom_function(tdata['inj'], add_ten)
print(tdata)
   mag           date  inj  added_inj
0    1  April 10 2024    1         11
1    1  April 11 2024    2         12
2    3  April 12 2024  100        110
3    3  April 13 2024  200        210

Return a function as a value from another function

In [80]:
# Return a function as a value from another function

def make_multiplier(x):
    def multiplier(y):# when we call make_multiplier, the multiplier function is created
        return x * y #created with a value x that is determined when make_multiplier(x) is called
    
    return multiplier # we return the function and now it holds a fixed value for x

double = make_multiplier(2) # multiplier function is stored in variable double with x=2 as fixed parameter


print(double(5)) # when we call double it means we execute the multiplier function, here with y set to 5 

# Output will be 10

# so we created a function that generates another function called multiplier with a pre-set value of 2
10
In [81]:
# Return a Function as a value from another function

# let's use this to normalize injury counts from a sample of tornado data
data={
    'mag': [1,1,3,3], # column 1
    'date': ['April 10 2024','April 11 2024','April 12 2024','April 13 2024'], # column 2
    'inj' : [1,2,100,200] # column 3
}

tdata = pd.DataFrame(data)


def get_normalizer(min_val, max_val):
    def normalize(x):
        return (x - min_val)/ (max_val - min_val) # fyi, min-max scaling formula scales min to 0 and max to 1
    
    return normalize # notice we returned a function, and it holds fixed values for min and max injuries

# compute min and max
min_A = tdata["inj"].min()
max_A= tdata["inj"].max()

# Return our normalize function and store it in a new variable called normalizer
normalizer = get_normalizer(min_A, max_A)

tdata["inj_normalized"] = tdata['inj'].apply(normalizer)

print(tdata)
   mag           date  inj  inj_normalized
0    1  April 10 2024    1        0.000000
1    1  April 11 2024    2        0.005025
2    3  April 12 2024  100        0.497487
3    3  April 13 2024  200        1.000000
In [82]:
# one more example where we pull all 3 characteristics of first-class functions together

def greet(name):
    return f"Hello {name}!"

def greet_loudly(name):
    return f"HELLO {name}!!!"

def create_greeting(name, func):
    """Apply any greeting function to a name."""
    return func(name)

# Variable assignment
say_hello = greet

print(greet("Bob"))

# Passing a function as an argument
print(create_greeting("Alice", greet))

print(create_greeting("Bob", greet_loudly))

# Returning a function from a function, with a condition
def get_greeter(mood):
    if mood == 'loud':
        return greet_loudly
    else:
        return greet

# Use returned function
current_mood_greeter = get_greeter('loud')
print(current_mood_greeter("Chris"))
Hello Bob!
Hello Alice!
HELLO Bob!!!
HELLO Chris!!!

4. Higher-Order Functions

  • Definition: functions that can take other functions as arguments, return functions as results, or both
  • useful for creating flexible code that can be customized with behavior defined outside the function itself

    Methods:

  • ### map : is primarily used with a pandas Series to apply a function element-wise. A good example of a higher-order function because it takes a function as an argument
    • Recall that a Series is a one-dimensional array, like a column of data paired with an index of row labels for each of the data points
    • In contrast, a DataFrame is a two-dimensional array, like a table or spreadsheet. Here data is accessed by an index of labels for each row and its column names
  • ### apply: in pandas is also used for applying functions along an axis of a DataFrame or on a Series
    • The result is either aggregated or transformed data
    • This is another instance of higher-order functions, where the function passed can produce outputs based on entire columns or rows
    • filtering: pandas' filter method selects rows or columns by label, not by value, but we can use boolean indexing along with apply to create a filter based on a condition
In [83]:
#Map function

# first let's generate a Series, or column of data. Say it is temperature data in Celsius
cdata= pd.Series([22,21,32,30,40])

# Let's create a function to convert celcius to fahrenheit
def cto_Fahrenheit(x):
    return (9/5)*x + 32

# now we use the map function
# the map function takes our custom function as its argument
# applies it to each element in cdata
 
f_temps = cdata.map(cto_Fahrenheit) # key point: we are passing a function as an argument

print("Temps Series data\n", f_temps)


# create a new dataset 
# using a dictionary {} of lists []
#  generates column names and their values as key:value pairs

new_cdata= {
    'C_Temp':[30,33,31,37],
    'Date':['2024-06-04', '2024-06-05','2024-06-06', '2024-06-07']
}

# convert to a DataFrame
temp_df=pd.DataFrame(new_cdata)
print(temp_df)

# Use map to apply the custom function to the C_Temp column
 # Convert values in C_Temp and define a new column to store the converted temp values
temp_df['F_Temp'] = temp_df['C_Temp'].map(cto_Fahrenheit) # key point: we are passing a function as an argument

print(temp_df)


# we used map with our custom function to convert entire columns of data from Celsius to Fahrenheit
Temps Series data
 0     71.6
1     69.8
2     89.6
3     86.0
4    104.0
dtype: float64
   C_Temp        Date
0      30  2024-06-04
1      33  2024-06-05
2      31  2024-06-06
3      37  2024-06-07
   C_Temp        Date  F_Temp
0      30  2024-06-04    86.0
1      33  2024-06-05    91.4
2      31  2024-06-06    87.8
3      37  2024-06-07    98.6
In [84]:
# Apply method

precip_data = {'Precip' : [ 1, 1, 0, 0]} # a simple example, more realistic would need to know day/time values to link the datasets
precip_df = pd.DataFrame(precip_data)
temp_df['Precip'] = precip_df

print(temp_df) # now we added some precip data


# Say the Precip measurements are all off by 1
# We can create a new column that will hold the corrected values 

# Using a lambda function along with apply to modify a DataFrame column
temp_df['Precip_plus_1'] = temp_df['Precip'].apply(lambda x: x + 1) 

print("DataFrame with Precip + 1:\n", temp_df)
   C_Temp        Date  F_Temp  Precip
0      30  2024-06-04    86.0       1
1      33  2024-06-05    91.4       1
2      31  2024-06-06    87.8       0
3      37  2024-06-07    98.6       0
DataFrame with Precip + 1:
    C_Temp        Date  F_Temp  Precip  Precip_plus_1
0      30  2024-06-04    86.0       1              2
1      33  2024-06-05    91.4       1              2
2      31  2024-06-06    87.8       0              1
3      37  2024-06-07    98.6       0              1
In [85]:
# key point: 
# In these precip examples
# we are passing a function as an argument to another function; this time a lambda function was passed to apply
In [86]:
# Filter data based on a condition

# How do we create a function that filters data based on a certain condition?
    # Here we use the apply method to create a mask, a Series of True/False values
    # Then we use that mask to filter the data

print(temp_df) # we start with the full set of data

# Define a function to select only those temps meeting a certain condition
def hightemps(row, temp_value):
    return row > temp_value # returns True for temps above temp_value


# Using apply method to call the function hightemps that filters for values where F_Temp is >90 
    # Here apply works on each value of the column, returning True when the value is above temp_value
hi_temps= temp_df[temp_df['F_Temp'].apply(hightemps, temp_value=90)]



print('\n') # adds space between prints



print('DataFrame with high temperatures:\n', hi_temps)
   C_Temp        Date  F_Temp  Precip  Precip_plus_1
0      30  2024-06-04    86.0       1              2
1      33  2024-06-05    91.4       1              2
2      31  2024-06-06    87.8       0              1
3      37  2024-06-07    98.6       0              1


DataFrame with high temperatures:
    C_Temp        Date  F_Temp  Precip  Precip_plus_1
1      33  2024-06-05    91.4       1              2
3      37  2024-06-07    98.6       0              1
In [87]:
# A simpler approach to filtering based on a condition, without using the apply method

print(temp_df)  # we start with the full set of data

temp_value = 90
# Directly use boolean indexing for filtering
hi_temps = temp_df[temp_df['F_Temp'] > temp_value]

print('\n')  # adds space between prints

print('DataFrame with high temperatures:\n', hi_temps)
   C_Temp        Date  F_Temp  Precip  Precip_plus_1
0      30  2024-06-04    86.0       1              2
1      33  2024-06-05    91.4       1              2
2      31  2024-06-06    87.8       0              1
3      37  2024-06-07    98.6       0              1


DataFrame with high temperatures:
    C_Temp        Date  F_Temp  Precip  Precip_plus_1
1      33  2024-06-05    91.4       1              2
3      37  2024-06-07    98.6       0              1
In [88]:
# the advantage of using a function is that we do not have to 
    # specify the temp value to filter on twice each time we want to filter
    # rewrite the filter line each time 
In [89]:
# Another example, where a single simple boolean comparison would not be enough
data = {
    'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'City': ['CityA', 'CityB', 'CityA', 'CityB', 'CityA'],
    'F_Temp': [89, 92, 88, 93, 95]
}
temp_df = pd.DataFrame(data)

print("Original DataFrame:")
print(temp_df)

# Define a function to select only those temps meeting a certain condition
# For example, filter for temperatures above a user-specified value on specific days or cities
def hightemps(row, temp_value):
    # Complex condition: Temp > user-specified value and either on Monday or in CityA
    return (row['F_Temp'] > temp_value) and (row['Day'] == 'Monday' or row['City'] == 'CityA')



# Using apply method to call the function hightemps that filters rows based on the complex condition
hi_temps = temp_df[temp_df.apply(hightemps, axis=1, temp_value=90)] # can make this a user input like this: float(input("Enter the temperature value: "))

print("\nDataFrame with high temperatures based on complex conditions:")
print(hi_temps)
Original DataFrame:
         Day   City  F_Temp
0     Monday  CityA      89
1    Tuesday  CityB      92
2  Wednesday  CityA      88
3   Thursday  CityB      93
4     Friday  CityA      95

DataFrame with high temperatures based on complex conditions:
      Day   City  F_Temp
4  Friday  CityA      95

Self-test Exercise: Functional Programming

Apply functional programming principles to find a moving average

Exercise 1: Tornado Data Analysis

  • Many natural phenomena have short-term fluctuations that hide patterns or trends in the data.
  • How do we find a trend when there are such short-term fluctuations?
  • Task: Determine the 10-year moving average for annual tornado counts in the United States

    • Have tornado counts increased in the last 20 years? Why or Why not?
    • Do the results surprise you?
  • Moving Averages

    • Moving averages are a fundamental tool in time series analysis, smoothing out short-term fluctuations and highlighting longer-term trends or cycles
    • They are widely used in weather forecasting, stock market analysis, and many other fields
    • Basic Syntax: dataframe['Column'].rolling(window).mean()
      • specify the data set, usually will refer to a column
      • size of the window (the number of periods to include in each average)
      • .mean() average for the data within the window, completing the moving average operation
    • Arguments:

      • data: The dataset or series on which the moving average is to be calculated
      • window: The number of periods over which to calculate the average
      • Window defines the "moving" part of the moving average, as this window slides over the data

      Note: We can treat .rolling() as part of a higher-order-function pattern: it defines a sliding window over the data, and a function (here, mean()) is then applied to each window

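Before applying this to the tornado data, here is a minimal sketch of the window mechanics on a toy Series (the values are made up purely for illustration):

```python
import pandas as pd

# Toy series: five made-up values
s = pd.Series([1, 2, 3, 4, 5])

# window=3: each output is the mean of the current value and the two before it;
# the first two positions have incomplete windows, so they come back as NaN
ma = s.rolling(window=3).mean()
print(ma)  # values: NaN, NaN, 2.0, 3.0, 4.0
```

Those leading NaN values are why, in the tornado example below, the first nine years of a 10-year moving average are empty.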
In [90]:
# load the data

new_tordata = pd.read_csv(r'.\Data\1950-2022_actual_tornadoes.csv')
In [91]:
# the following code cell will give you a nice hover highlighting on your tables when used with itables library
In [105]:
%%html
<style>
  .dataTables_wrapper tbody tr:hover {
    background-color: #6495ED; /* Cornflower Blue */
  }
</style>


<!-- #1E3A8A (Dark Blue) -->
<!-- #0D9488 (Teal) -->
<!-- #065F46 (Dark Green) -->
<!-- #4C1D95 (Dark Purple) -->
<!-- #991B1B (Dark Red) -->
<!-- #374151 (Dark Gray) -->
<!-- #B45309 (Deep Orange) -->
<!-- #164E63 (Dark Cyan) -->
<!-- #4A2C2A (Dark Brown) -->
<!-- #831843 (Dark Magenta) -->
<!-- #1E3A8A (Dark Blue ) -->

<!-- Suggested Light Colors for Light Backgrounds -->
<!-- #AED9E0 (Light Blue) -->
<!-- #A7F3D0 (Light Teal) -->
<!-- #D1FAE5 (Light Green) -->
<!-- #DDD6FE (Light Purple) -->
<!-- #FECACA (Light Red) -->
<!-- #E5E7EB (Light Gray) -->
<!-- #FFEDD5 (Light Orange) -->
<!-- #B2F5EA (Light Cyan) -->
<!-- #FED7AA (Light Brown) -->
<!-- #FBCFE8 (Light Magenta) -->
In [106]:
show(new_tordata,options={'hover': True})
[interactive itables view of new_tordata: columns om, yr, mo, dy, time, tz, st, stf, stn, mag, inj, fat, loss, closs, slat, slon, elat, elon, len, wid, ns, sn, sg, f1, f2, f3, f4, fc]
In [94]:
# Ouuu lala, much easier to scan the rows now
In [95]:
new_tordata = new_tordata.set_index('date')
In [96]:
new_tordata # we changed the index to the date column, making it easier to organize the data by time
Out[96]:
[interactive itables view of new_tordata, now indexed by date]
In [97]:
# hint:
# to set up the data and plot annual counts:
#   - group by year and count tornadoes per year
#   - save the annual tornado counts to a new variable

annual_tors = new_tordata.groupby("yr")["om"].count()

annual_tors.plot(kind='line', xlabel='Year', ylabel='Tornadoes', title='US Total Tornado Count from 1950-2022') # plot a line plot for annual counts

print(annual_tors)
yr
1950     201
1951     260
1952     240
1953     421
1954     550
        ... 
2018    1126
2019    1517
2020    1082
2021    1314
2022    1143
Name: om, Length: 73, dtype: int64
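The groupby-and-count step above can be sketched on a hypothetical miniature of the tornado frame, with one row per tornado event:

```python
import pandas as pd

# Made-up mini version of the tornado data: one row per event,
# 'yr' is the year and 'om' is an event identifier
df = pd.DataFrame({'yr': [1950, 1950, 1951, 1951, 1951],
                   'om': [1, 2, 1, 2, 3]})

# Group rows by year, then count events per group, just like annual_tors above
per_year = df.groupby('yr')['om'].count()
print(per_year)  # 1950 -> 2, 1951 -> 3
```

Note that groupby followed by count builds a new Series rather than modifying df in place, in keeping with the immutability principle.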
In [ ]:
# we see annual fluctuations in tornado counts
# how can we smooth this out to discern a trend more clearly?
In [104]:
# answer: 
# Find the Moving Average for annual tornado counts , averaging counts every 10 years

# group the data by year and count tornado events per year
annual_tors= new_tordata.groupby("yr")["om"].count()
print(annual_tors.head(20))

# define a function that calculates the moving average of our grouped data
# and always produces the same output for the same input (a pure function)

def movingAvg(series, window_size):
    return series.rolling(window=window_size).mean() # rolling is a built-in pandas method used here in a higher-order-function pattern

tor10yrMA = movingAvg(annual_tors, window_size=10) # average tornado count over each 10-year window

print(tor10yrMA.head(20))

# Plotting the annual tornado count
ax = annual_tors.plot(kind='line', label='Annual Tornado Count')

# Plotting the 10-year moving average on the same Axes object
tor10yrMA.plot(kind='line', label='10 Year MA', ax=ax, xlabel='Year', ylabel='Tornadoes', title='US Total Tornado Count from 1950-2022')

# Displaying the legend
ax.legend()
yr
1950    201
1951    260
1952    240
1953    421
1954    550
1955    591
1956    504
1957    858
1958    564
1959    604
1960    616
1961    697
1962    657
1963    463
1964    704
1965    897
1966    585
1967    927
1968    657
1969    608
Name: om, dtype: int64
yr
1950      NaN
1951      NaN
1952      NaN
1953      NaN
1954      NaN
1955      NaN
1956      NaN
1957      NaN
1958      NaN
1959    479.3
1960    520.8
1961    564.5
1962    606.2
1963    610.4
1964    625.8
1965    656.4
1966    664.5
1967    671.4
1968    680.7
1969    681.1
Name: om, dtype: float64
Out[104]:
<matplotlib.legend.Legend at 0x15dac1d74d0>
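A side note on the leading NaN values in the moving average above: pandas' rolling accepts a min_periods argument that lets early, incomplete windows still produce a value. A small sketch on made-up counts:

```python
import pandas as pd

# Made-up annual counts, echoing the early years above
counts = pd.Series([201, 260, 240, 421, 550])

# min_periods=1 means a window with fewer than 3 values still yields an average,
# so the first positions are partial-window means instead of NaN
ma = counts.rolling(window=3, min_periods=1).mean()
print(ma)  # 201.0, 230.5, then full 3-year means
```

Whether to use min_periods is a judgment call: it removes the NaN gap at the start of the plot, but the earliest points average fewer years than the rest.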
In [ ]:
# After removing short-term fluctuations by averaging each point over a 10-year window, we clearly see a general increase in annual tornadoes over time
# The reasons for this (not depicted here) are mostly the spread and adoption of radar technology, which enhanced detection of low-magnitude events


# Conclusion
# Taken together, in the last 20 years we have consistently seen over 1,000 tornadoes per year in the US
