Tornadoes at Night

Historical U.S. Tornado Events Analysis (1950-2022)

I. Data source overview

  • Name: U.S. TORNADOES (1950-2022)
    • Source: NOAA NWS Storm Prediction Center severe weather database
    • Geographical distribution: US and territories (Puerto Rico, Virgin Islands)
    • Description: comma-separated value (.csv) files for tornadoes as compiled from NWS Storm Data. Tornado reports exist back to 1950 and are provided in raw csv format. Actual tornado tracks only (not including individual state segments) are provided in the "Actual_tornadoes.csv" file.
    • File name and size: 1950-2022_actual_tornadoes.csv (7.3 MB)

Definition:

  • Tornado: a narrow, violently rotating column of air that extends from a thunderstorm to the ground
  • Hard to see unless it forms a condensation funnel made up of water droplets, dust, and debris
  • Can be among the most violent phenomena of all the storms we experience

Motivation for this study: Understanding the human impact of tornadoes occurring at night and in urban locations

II. Objectives

1. Investigate Tornado Impact Based on Geographic Location of the Tornado Event:

  • Urban vs. Rural tornado locations
  • Objective: Compare tornado injuries and fatalities between urban and rural areas to identify differences in impact

2. Analyze Temporal Patterns of Tornado Impact:

  • Day vs. Night tornado occurrences
  • Objective: Examine how tornado injuries and fatalities vary between day and night occurrences to understand temporal differences in tornado impacts

3. Analyze Factors Influencing Tornado Injury Severity:

  • Dependent variable: tornado injuries
  • Predictors: path length, magnitude, and urban location
  • Objective: To assess the influence of tornado magnitude, path length, and urban location on the severity of tornado injuries by developing a predictive model
  • Building a linear regression model to quantify the relationship between these predictors and the number of injuries, and identifying the predictors that have the greatest impact (see the OLS sketch after this objectives list)

4. Analyze Factors Influencing Tornado Fatality Severity:

  • Dependent variable: tornado fatalities
  • Predictors: injuries, magnitude, and time of day
  • Objective: To assess the influence of tornado magnitude, injuries, and time of day on the severity of tornado fatalities by developing a predictive model
  • Building a linear regression model to quantify the relationship between these predictors and the number of fatalities, and identifying the predictors that have the greatest impact
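
A minimal sketch of how the two OLS models above could be set up with statsmodels is shown below. It assumes the processed dataframe (filtered_tor_data, built later in this notebook) already carries the 'is_urban' and 'is_night' indicator columns; treat it as an illustration of the modeling approach rather than the final fitted models.

# Minimal OLS sketch for Objectives 3 and 4 (illustrative only; assumes a
# filtered_tor_data dataframe with 'mag', 'len', 'inj', 'fat', 'is_urban', 'is_night')
import statsmodels.api as sm

def fit_ols(df, response, predictors):
    X = sm.add_constant(df[predictors])   # add an intercept term
    return sm.OLS(df[response], X).fit()  # ordinary least squares fit

# Objective 3: injuries ~ magnitude + path length + urban location
# inj_model = fit_ols(filtered_tor_data, 'inj', ['mag', 'len', 'is_urban'])
# print(inj_model.summary())

# Objective 4: fatalities ~ magnitude + injuries + time of day
# fat_model = fit_ols(filtered_tor_data, 'fat', ['mag', 'inj', 'is_night'])
# print(fat_model.summary())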

Conclusions:

  1. Time of Day (Day vs. Night):

    • The analysis of tornado events reveals a significant difference in fatalities based on the time of day
    • Tornadoes that occur at night have a statistically higher fatality count than those occurring during the day (see the comparison sketch after this list)
    • This suggests that nighttime tornadoes are particularly dangerous, possibly due to reduced visibility and slower emergency response times

  2. Urban vs. Rural Impact on Injuries:

    • When comparing urban and rural tornadoes, the findings indicate that urban areas experience significantly more injuries from tornadoes
    • This could be due to the higher population density and denser infrastructure in urban areas, making them more vulnerable to tornado damage and resulting in higher injury counts

  3. OLS Linear Regression Analysis on Injuries:

    • The linear regression analysis identifies tornado magnitude, path length, and urban location as significant predictors of tornado injuries
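
The sketch below illustrates the kind of nonparametric day/night and urban/rural comparison behind conclusions 1 and 2. It assumes the 'is_night' and 'is_urban' indicator columns created later in this notebook and uses scipy's Mann-Whitney U test (imported below), so it should be read as an illustration rather than the exact test configuration behind the reported results.

# Illustrative day/night and urban/rural comparisons (assumes a filtered_tor_data
# dataframe with the 'is_night' and 'is_urban' columns built later in this notebook)
from scipy.stats import mannwhitneyu

night_fat = filtered_tor_data.loc[filtered_tor_data['is_night'] == 1, 'fat']
day_fat = filtered_tor_data.loc[filtered_tor_data['is_night'] == 0, 'fat']
u_stat, p_val = mannwhitneyu(night_fat, day_fat, alternative='greater')
print(f"Night vs. day fatalities: U={u_stat:.0f}, p={p_val:.4g}")

urban_inj = filtered_tor_data.loc[filtered_tor_data['is_urban'] == 1, 'inj']
rural_inj = filtered_tor_data.loc[filtered_tor_data['is_urban'] == 0, 'inj']
u_stat, p_val = mannwhitneyu(urban_inj, rural_inj, alternative='greater')
print(f"Urban vs. rural injuries: U={u_stat:.0f}, p={p_val:.4g}")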

Additional evidence for major severe tornadoes occurring at night:

  • Recording (see below) of the 1965 tornado outbreak known as the "Longest Night" in Minnesota
Sources:
  • https://web.archive.org/web/20120224181632/http://www.lattery.com/vortex100/intro.htm
from IPython.display import Audio # IPython.display lets you embed images, video, audio, and even webpages (via IFrame) in notebook output

# Local file path to the WCCO-AM radio recording from the 1965 outbreak
recording = "WCCO-AM_1965_Tornado_3.mp3"

# Play the recording inside the Jupyter Notebook
Audio(recording)

Import packages

import os
import pandas as pd
import requests
import numpy as np
import matplotlib.colors as mcolors
import matplotlib.patches as mpatches
from matplotlib.lines import Line2D
import cartopy.crs as ccrs # for projection
import cartopy.feature as cfeature # for map features 
from cartopy.util import add_cyclic_point
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import cartopy.io.shapereader as shpreader
import shapely.geometry as sgeom
from matplotlib.colors import LogNorm
from shapely.geometry import Polygon
import geopandas as gpd
from itables import show
from scipy.stats import poisson
from scipy.stats import gamma
from scipy.stats import lognorm
from scipy.stats import chi2_contingency
from scipy.stats import boxcox
# The very important stats module in the scipy package!
import scipy.stats as stats
from scipy.stats import norm
from scipy.stats import mannwhitneyu
import math 
import seaborn as sb
# For producing QQ (quantile-quantile) plots
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error # may need to install sklearn - machine learning package
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

Set up file paths

# local_save_dir = r'G:\My Drive\Python_projects\20240110_ATMS_517\Project\Data\tornado_data'
# file_name= '1950-2022_actual_tornadoes.csv'
# local_file_path = os.path.join(local_save_dir, file_name)

Access and download tornado dataset

# local_save_dir = r'G:\My Drive\Python_projects\20240110_ATMS_517\Project\Data\tornado_data'
# file_name= '1950-2022_actual_tornadoes.csv'

# def download_file(url):
#     # Ensure the save directory exists
#     if not os.path.exists(local_save_dir):
#         os.makedirs(local_save_dir)

#     # Full path to save the file
#     local_file_path = os.path.join(local_save_dir, file_name)

#     # Download the file
#     with requests.get(url) as r:
#         if r.status_code == 200:
#             with open(local_file_path, 'wb') as f:
#                 f.write(r.content)
#             print('file downloaded successfully')

#             return local_file_path

#         else:
#             print("Failed to download file")
#             print(f"Status code: {r.status_code}. Reason: {r.reason}")

#             return None

# url = "https://www.spc.noaa.gov/wcm/data/1950-2022_actual_tornadoes.csv"
# file_path = download_file(url)

III. Load downloaded dataset

  • If starting from this point (i.e., the data have already been downloaded):
  • set up file paths
  • import packages
  • run cell below to load the dataset
local_save_dir = r'G:\My Drive\Python_projects\20240110_ATMS_517\Project\Data\tornado_data'
file_name= '1950-2022_actual_tornadoes.csv'
file_path = os.path.join(local_save_dir, file_name)

tornado_data = pd.read_csv(file_path)

tornado_data_original = tornado_data.copy()
print(tornado_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68701 entries, 0 to 68700
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   om      68701 non-null  int64  
 1   yr      68701 non-null  int64  
 2   mo      68701 non-null  int64  
 3   dy      68701 non-null  int64  
 4   date    68701 non-null  object 
 5   time    68701 non-null  object 
 6   tz      68701 non-null  int64  
 7   st      68701 non-null  object 
 8   stf     68701 non-null  int64  
 9   stn     68701 non-null  int64  
 10  mag     68701 non-null  int64  
 11  inj     68701 non-null  int64  
 12  fat     68701 non-null  int64  
 13  loss    68701 non-null  float64
 14  closs   68701 non-null  float64
 15  slat    68701 non-null  float64
 16  slon    68701 non-null  float64
 17  elat    68701 non-null  float64
 18  elon    68701 non-null  float64
 19  len     68701 non-null  float64
 20  wid     68701 non-null  int64  
 21  ns      68701 non-null  int64  
 22  sn      68701 non-null  int64  
 23  sg      68701 non-null  int64  
 24  f1      68701 non-null  int64  
 25  f2      68701 non-null  int64  
 26  f3      68701 non-null  int64  
 27  f4      68701 non-null  int64  
 28  fc      68701 non-null  int64  
dtypes: float64(7), int64(19), object(3)
memory usage: 15.2+ MB
None

IV. Data Structure & Characteristics:

  • Dataset structure: DataFrame created from a csv file loaded into pandas, giving tabular data with labeled rows and columns; each column is a one-dimensional array-like list of values
  • Dimensions: two-dimensional, with each row a single tornado event and each column a different attribute of the event
  • Coordinates: rows (observations) are indexed by integers starting from 0 by default, and columns (variables) are labeled by column names

Most relevant fields:

  • om: tornado number. count of tornadoes through the year
  • yr: year
  • mo: Month - 1-12
  • dy: Day, 1-31
  • date: Date in yyyy-mm-dd
  • time: Time in HH:MM:SS
  • tz: time zone. Note: all times are CST (tz=3), except '?' = unknown and 9 = GMT
  • st: state two-letter abbreviation
  • stf: State FIPS number. Federal Information Processing System (FIPS) Codes which uniquely identify States and Counties.
  • mag: magnitude on the F-scale (EF-scale after 2007): values are either 0, 1, 2, 3, 4, 5 or -9 (unknown)
  • inj: Injuries note: when summing for state totals use sn=1, not sg=1
  • fat: Fatalities note: when summing for state totals use sn=1, not sg=1
  • loss: Estimated property loss - From 1996 reported as tornado property damage in millions of dollars (see documentation for additional details)
  • slat: Starting latitude of tornado
  • slon: Starting longitude of tornado
  • elat: Ending latitude of tornado
  • elon: Ending longitude of tornado
  • len: Length of tornado track in miles
  • wid: Width of tornado in yards
  • note: the physical attributes of each row (observation) with regard to time and location are defined by the 'date', 'time', 'slat', and 'slon' columns.
# dimensions of the data , with most relevant variables

print(tornado_data.shape)
print(f"The dataset has {tornado_data.ndim} dimensions")
(68701, 29)
The dataset has 2 dimensions

Check for missing values

# Sum all the missing values by column  with data.isnull().sum().sum()
total_null_tornado_data = tornado_data.isnull().sum().sum()

print(f'Total number of missing values in the dataset is {total_null_tornado_data}')

# check for missing values by column  with data.isnull().sum()
null_tornado_data = tornado_data.isnull().sum()

print(null_tornado_data)

# no missing values, so we keep all rows for now...
Total number of missing values in the dataset is 0
om       0
yr       0
mo       0
dy       0
date     0
time     0
tz       0
st       0
stf      0
stn      0
mag      0
inj      0
fat      0
loss     0
closs    0
slat     0
slon     0
elat     0
elon     0
len      0
wid      0
ns       0
sn       0
sg       0
f1       0
f2       0
f3       0
f4       0
fc       0
dtype: int64

Analysis of zeroes

# count columns with zeroes, and find the one with the most zeroes

column_names = tornado_data.columns

print(column_names)


zero_counts_dict = {}

for col in tornado_data.columns:
    zeroes_count = (tornado_data[col]== 0).sum()
    print(f"The column '{col}' contains {zeroes_count} zeroes or {round((zeroes_count)/len(tornado_data[col]),3)*100}% of these values are zero")
    zero_counts_dict[col]= (tornado_data[col]== 0).sum()

zero_count_series = pd.Series(zero_counts_dict)

column_with_most_zeroes = zero_count_series.idxmax()

print(f"The column '{column_with_most_zeroes}' contains the most zeros.")

# From this analysis of zero values in the data
    # ~ 99% of values in column 'f4' are zero, reflecting the rarity of tornado occurrences with a 4th county code 
    # ~ 0.9% of values in column 'sn' are zero, reflecting the rarity of tornado occurrences where the entire track is NOT in the same state (i.e., most sn values = 1)
Index(['om', 'yr', 'mo', 'dy', 'date', 'time', 'tz', 'st', 'stf', 'stn', 'mag',
       'inj', 'fat', 'loss', 'closs', 'slat', 'slon', 'elat', 'elon', 'len',
       'wid', 'ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc'],
      dtype='object')
The column 'om' contains 0 zeroes or 0.0% of these values are zero
The column 'yr' contains 0 zeroes or 0.0% of these values are zero
The column 'mo' contains 0 zeroes or 0.0% of these values are zero
The column 'dy' contains 0 zeroes or 0.0% of these values are zero
The column 'date' contains 0 zeroes or 0.0% of these values are zero
The column 'time' contains 0 zeroes or 0.0% of these values are zero
The column 'tz' contains 8 zeroes or 0.0% of these values are zero
The column 'st' contains 0 zeroes or 0.0% of these values are zero
The column 'stf' contains 0 zeroes or 0.0% of these values are zero
The column 'stn' contains 15344 zeroes or 22.3% of these values are zero
The column 'mag' contains 31776 zeroes or 46.3% of these values are zero
The column 'inj' contains 60943 zeroes or 88.7% of these values are zero
The column 'fat' contains 67128 zeroes or 97.7% of these values are zero
The column 'loss' contains 27172 zeroes or 39.6% of these values are zero
The column 'closs' contains 67755 zeroes or 98.6% of these values are zero
The column 'slat' contains 0 zeroes or 0.0% of these values are zero
The column 'slon' contains 0 zeroes or 0.0% of these values are zero
The column 'elat' contains 26363 zeroes or 38.4% of these values are zero
The column 'elon' contains 26363 zeroes or 38.4% of these values are zero
The column 'len' contains 123 zeroes or 0.2% of these values are zero
The column 'wid' contains 473 zeroes or 0.7% of these values are zero
The column 'ns' contains 0 zeroes or 0.0% of these values are zero
The column 'sn' contains 590 zeroes or 0.9% of these values are zero
The column 'sg' contains 0 zeroes or 0.0% of these values are zero
The column 'f1' contains 602 zeroes or 0.9% of these values are zero
The column 'f2' contains 63078 zeroes or 91.8% of these values are zero
The column 'f3' contains 67599 zeroes or 98.4% of these values are zero
The column 'f4' contains 68378 zeroes or 99.5% of these values are zero
The column 'fc' contains 66838 zeroes or 97.3% of these values are zero
The column 'f4' contains the most zeros.

Analysis of fill values

  • A magnitude of 0 is a valid data point.
  • From the documentation, some rows have fill values set to -9
# count columns with fill value of -9, and find the one with the most fill values

default_data_source_fill_value = -9

column_names = tornado_data.columns

print(column_names)


fillval_minus_9_count_dict = {}

for col in tornado_data.columns:
    fillval_minus_9_count = (tornado_data[col]== -9).sum()
    print(f"The column '{col}' contains {fillval_minus_9_count} {default_data_source_fill_value} fill values or {round(fillval_minus_9_count / len(tornado_data) * 100, 2)}%")
    fillval_minus_9_count_dict[col]= (tornado_data[col]== -9).sum()

fillval_minus_9_count_series = pd.Series(fillval_minus_9_count_dict)

column_with_most_fillval_minus_9 = fillval_minus_9_count_series.idxmax()

print(f"The column '{column_with_most_fillval_minus_9}' contains the most {default_data_source_fill_value} fill values")

# the magnitude column has the most -9 fill values, suggesting we may be able to handle these by masking
Index(['om', 'yr', 'mo', 'dy', 'date', 'time', 'tz', 'st', 'stf', 'stn', 'mag',
       'inj', 'fat', 'loss', 'closs', 'slat', 'slon', 'elat', 'elon', 'len',
       'wid', 'ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc'],
      dtype='object')
The column 'om' contains 0 -9 fill values or 0.0%
The column 'yr' contains 0 -9 fill values or 0.0%
The column 'mo' contains 0 -9 fill values or 0.0%
The column 'dy' contains 0 -9 fill values or 0.0%
The column 'date' contains 0 -9 fill values or 0.0%
The column 'time' contains 0 -9 fill values or 0.0%
The column 'tz' contains 0 -9 fill values or 0.0%
The column 'st' contains 0 -9 fill values or 0.0%
The column 'stf' contains 0 -9 fill values or 0.0%
The column 'stn' contains 0 -9 fill values or 0.0%
The column 'mag' contains 756 -9 fill values or 1.1%
The column 'inj' contains 0 -9 fill values or 0.0%
The column 'fat' contains 0 -9 fill values or 0.0%
The column 'loss' contains 0 -9 fill values or 0.0%
The column 'closs' contains 0 -9 fill values or 0.0%
The column 'slat' contains 0 -9 fill values or 0.0%
The column 'slon' contains 0 -9 fill values or 0.0%
The column 'elat' contains 0 -9 fill values or 0.0%
The column 'elon' contains 0 -9 fill values or 0.0%
The column 'len' contains 0 -9 fill values or 0.0%
The column 'wid' contains 0 -9 fill values or 0.0%
The column 'ns' contains 0 -9 fill values or 0.0%
The column 'sn' contains 0 -9 fill values or 0.0%
The column 'sg' contains 0 -9 fill values or 0.0%
The column 'f1' contains 0 -9 fill values or 0.0%
The column 'f2' contains 0 -9 fill values or 0.0%
The column 'f3' contains 0 -9 fill values or 0.0%
The column 'f4' contains 0 -9 fill values or 0.0%
The column 'fc' contains 0 -9 fill values or 0.0%
The column 'mag' contains the most -9 fill values
# check if mag has the predicted number of fill values 
mag_fill_val_minus_9_data = tornado_data[tornado_data['mag']==-9]

print(f"Number of rows with -9 fill values: {len(mag_fill_val_minus_9_data)}")


# for any analysis of magnitude we will drop any rows with -9 fill values, 756 records 
Number of rows with -9 fill values: 756

Exploratory Analysis Part 1: Counts, Magnitude and Time

What is the annual distribution of tornadoes across the US?

# Group by 'yr', and count the number of tornadoes in each year
total_annual_tornados = tornado_data.groupby('yr')['om'].count()

print("Descriptive statistics for total annual tornadoes across the US from 1950 - 2022")
print(total_annual_tornados.describe())

# Get the maximum and minimum annual tornado counts
max_annual_tornados = total_annual_tornados.max()
min_annual_tornados = total_annual_tornados.min()
median_annual_tor_count= total_annual_tornados.median()
mean_annual_tor_count= total_annual_tornados.mean()

print('\n')


# Calculate the skewness of the annual tornado counts
skewness = total_annual_tornados.skew()

# Print the skewness
print(f"Skewness: {skewness}")

# Calculate the quartiles
Q1 = total_annual_tornados.quantile(0.25)
Q2 = total_annual_tornados.quantile(0.5)
Q3 = total_annual_tornados.quantile(0.75)

# Calculate the Yule-Kendall index
yule_kendall_index = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)
print(f"Yule-Kendall index: {yule_kendall_index}")

# Calculate the IQR
IQR = Q3 - Q1
print(f"Interquartile Range (IQR): {IQR}")

# Define the bins using the square root rule, number bins is the sq root of the number of data points 
num_bins = int(np.sqrt(len(total_annual_tornados))) # datapoints equal to the number of years since we are counting tornadoes per year
print('\n')
print(f'number of bins is {num_bins}')

#print(f'number of years used in total annual tornado counts {len(total_annual_tornados)} years')

# Compute the width of each bin
bin_width= (max_annual_tornados - min_annual_tornados) / num_bins
print(f'bin width is {bin_width}')

print('\n')


# Plot a histogram of the annual tornado counts
total_annual_tornados.plot.hist(bins=num_bins, edgecolor='black')

# Add a title and labels
plt.title('Distribution of Annual US Tornado Counts (1950-2022)')
plt.xlabel('Annual Tornado Count')
plt.ylabel('Number of Years')
plt.axvline(median_annual_tor_count, color='r', linestyle=':', label='Median annual count', zorder=2) 
plt.axvline(mean_annual_tor_count, color='b', linestyle=':', label='Mean annual count', zorder=3) 
# Add a legend
plt.legend()

plt.show()

# Takeaway on annual tornado counts
    # statistics suggest that the annual tornado counts are slightly positively skewed and have considerable variability
    # most years have tornado counts fairly close to the median
    # the distribution has a longer right tail: the years with the highest counts
    # sit farther above the mean than the years with the lowest counts sit below it
Descriptive statistics for total annual tornadoes across the US from 1950 - 2022
count      73.000000
mean      941.109589
std       334.161952
min       201.000000
25%       697.000000
50%       919.000000
75%      1148.000000
max      1817.000000
Name: om, dtype: float64


Skewness: 0.1950846469778466
Yule-Kendall index: 0.015521064301552107
Interquartile Range (IQR): 451.0


number of bins is 8
bin width is 202.0

png

How many tornadoes occurred annually across the US from 1950-2022?

#We will create a time series line plot to visualize the annual count of tornadoes in the US from 1950 to 2022
#This will help us identify any trends or patterns in tornado activity over this period
#For example, we might be able to see if tornadoes are becoming more or less common, or if there are certain years with unusually high or low counts


total_annual_tornados.plot.line()

# Add a title and labels
plt.title('US Total tornado Count from 1950-2022')
plt.xlabel('Year')
plt.ylabel('Count')
Text(0, 0.5, 'Count')

png

# At a glance, there does seem to be an upward trend in the overall counts
# It is interesting to see a sharp increase from ~800 to ~1100 annual tornadoes
# Counts increase over time, especially from the 1980s into the 1990s
# likely related to improved tornado detection from Doppler radar adoption and the spread of tornado-spotter education across the US
# What years historically had the most tornadoes?

year_max_annual_tor =total_annual_tornados.idxmax()
print(f' The most tornadoes in a single year occurred in {year_max_annual_tor} with {total_annual_tornados.loc[year_max_annual_tor]} tornadoes in that year')

top_10_annual_tor =total_annual_tornados.nlargest(10)
print(f' The top 10 years with most tornadoes {top_10_annual_tor}')

# top 10 years with most tornadoes includes 2004, 2011, 2008, and 2019 with over 1500 tornadoes!
 The most tornadoes in a single year occurred in 2004 with 1817 tornadoes in that year
 The top 10 years with most tornadoes yr
2004    1817
2011    1691
2008    1689
2019    1517
2017    1428
1998    1424
2003    1374
1999    1339
2021    1314
1992    1297
Name: om, dtype: int64

What is the monthly distribution of tornadoes across the US?

def num_month_converter(value):
    month_lookup = {1: 'Jan',2: 'Feb', 3: 'Mar', 4: 'Apr', 5:  'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct',11: 'Nov',12: 'Dec'}
    return month_lookup[value]

num_month_converter(1)
'Jan'
# Group by 'mo', and count the number of tornadoes in each group
total_monthly_tor = tornado_data.groupby(['mo'])['om'].count()
print('\n')
print(total_monthly_tor) 

fig, ax = plt.subplots(figsize=(6, 4))
months_index=total_monthly_tor.index.map(num_month_converter) #The map() function applies a function to each element of a Series, which in this case is the index of total_monthly_tor.
# Create a line plot on the axes object
ax.plot(months_index, total_monthly_tor.values, color='blue', marker='o')

# Add a title and labels
ax.set_title('US Total Monthly Tornado Count from 1950-2022')
ax.set_xlabel('Month')
ax.set_ylabel('Count')

# Historically, across the US, May is the month with the greatest count of tornadoes, followed by June and April
mo
1      1756
2      1956
3      4748
4      9792
5     15057
6     12615
7      7035
8      4823
9      3496
10     2838
11     2709
12     1876
Name: om, dtype: int64





Text(0, 0.5, 'Count')

png

# A different approach to above that outputs  a bar plot to display monthly tornado counts


# Group by 'mo', and count the number of tornadoes in each group
monthly_tor_by_yr = tornado_data.groupby('mo')['om'].count()

print("Descriptive statistics for total monthly tornadoes across the US from 1950 - 2022")
print(monthly_tor_by_yr.describe())

print('\n')

print(monthly_tor_by_yr)

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

print(f" the peak month for tornadoes across the US from 1950 - 2022 is {months[monthly_tor_by_yr.idxmax()-1]}")




# Plot the total monthly tornado counts
monthly_tor_by_yr.plot(kind='bar')


# Add a title and labels
plt.title('US tornado count by month 1950-2022')
plt.xlabel('Month')
plt.ylabel('Count')

# Add a legend
plt.legend()

# Add more ticks to the x-axis
plt.xticks(range(0, 12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# Show the plot
plt.show()
Descriptive statistics for total monthly tornadoes across the US from 1950 - 2022
count       12.000000
mean      5725.083333
std       4489.944674
min       1756.000000
25%       2520.750000
50%       4122.000000
75%       7724.250000
max      15057.000000
Name: om, dtype: float64


mo
1      1756
2      1956
3      4748
4      9792
5     15057
6     12615
7      7035
8      4823
9      3496
10     2838
11     2709
12     1876
Name: om, dtype: int64
 the peak month for tornadoes across the US from 1950 - 2022 is May

png

Within any year what is the monthly distribution of tornadoes across the US?

# Group by 'mo', and count the number of tornadoes in each group
monthly_tor_by_yr = tornado_data.groupby(['yr','mo'])['om'].count()

print('\n')
print(monthly_tor_by_yr)
yr    mo
1950  1      7
      2     20
      3     21
      4     15
      5     61
            ..
2022  8     35
      9     25
      10    36
      11    62
      12    58
Name: om, Length: 871, dtype: int64
print("Descriptive statistics for total monthly tornadoes grouped by year across the US from 1950 - 2022")
print(monthly_tor_by_yr.describe())
total_monthly_tor_data =monthly_tor_by_yr.describe()


print('\n')


# Calculate the skewness of the annual tornado counts
skewness = monthly_tor_by_yr.skew()

# Print the skewness
print(f"Skewness: {skewness}")

# Calculate the quartiles
Q1 = monthly_tor_by_yr.quantile(0.25)
Q2 = monthly_tor_by_yr.quantile(0.5)
Q3 = monthly_tor_by_yr.quantile(0.75)

# Calculate the Yule-Kendall index
yule_kendall_index = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)
print(f"Yule-Kendall index: {yule_kendall_index}")

# Calculate the IQR
IQR = Q3 - Q1
print(f"Interquartile Range (IQR): {IQR}")
# Define the bins using the square root rule, number of bins is the square root of the number of data points 
num_mon_bins = int(np.sqrt(len(monthly_tor_by_yr))) # data points equal the number of year-month groups (one count per month per year)

print('\n')

print(f'number of bins is {num_mon_bins}')

# Compute the width of each bin
bin_width= (total_monthly_tor_data.iloc[7]  - total_monthly_tor_data.iloc[3]) / num_mon_bins

print(f'bin width is {bin_width}')



monthly_tor_by_yr.plot.hist(bins=num_mon_bins,edgecolor='black')

# Add a title and labels
plt.title('Distribution of Monthly tornado counts 1950-2022')
plt.xlabel('Monthly Tornado Count')
plt.ylabel('Number of Months')
plt.axvline(total_monthly_tor_data.loc['50%'], color='r', linestyle=':', label='Median monthly count', zorder=2) 
plt.axvline(total_monthly_tor_data.loc['mean'], color='b', linestyle=':', label='Mean monthly count', zorder=3) 

# Add a legend
plt.legend()
Descriptive statistics for total monthly tornadoes grouped by year across the US from 1950 - 2022
count    871.000000
mean      78.876005
std       81.822253
min        1.000000
25%       21.000000
50%       53.000000
75%      110.500000
max      757.000000
Name: om, dtype: float64


Skewness: 2.3179433143776262
Yule-Kendall index: 0.2849162011173184
Interquartile Range (IQR): 89.5


number of bins is 29
bin width is 26.06896551724138





<matplotlib.legend.Legend at 0x2075f9c32f0>

png

month_max_annual_tor =monthly_tor_by_yr.idxmax()
print(f' The year and month with most tornadoes is {month_max_annual_tor}')

print('\n')
#print(f' The most tornadoes in a single month occurred in {month_max_annual_tor.apply(num_month_converter)} with {monthly_tor_by_yr.loc[month_max_annual_tor]} tornadoes')

top_10_monthly_tor =monthly_tor_by_yr.nlargest(10)
print('\n')

print(f' The top 10 months with most tornadoes {top_10_monthly_tor}')

# Historically in the US, April 2011 had the greatest number of tornadoes but otherwise May tends to be in the top 10 for most tornadoes per month in a year
 The year and month with most tornadoes is (2011, 4)




 The top 10 months with most tornadoes yr    mo
2011  4     757
2003  5     542
2019  5     510
2004  5     509
2008  5     460
1992  6     399
1995  5     394
2015  5     381
1998  6     376
1991  5     335
Name: om, dtype: int64

Across the US, how has the magnitude of tornadoes changed over time?

# Group by 'yr' and 'mag', and count the number of tornadoes in each group
total_mag_annual_tornados = tornado_data.groupby(['yr', 'mag'])['om'].count()

print(total_mag_annual_tornados)
yr    mag
1950  0       16
      1       86
      2       68
      3       24
      4        7
            ... 
2022  0      401
      1      448
      2      119
      3       20
      4        4
Name: om, Length: 401, dtype: int64
# Plot of magnitudes over time
# need to rearrange the data so that the grouped object has only one index for years 

# Unstack the 'total_mag_annual_tornados' Series to create a DataFrame
    # the dataframe has two index yr and mag
    # unstack() by level 'mag' , transposes the mag from the rows to the columns
    # Creates a new table with years as rows and columns as magnitude
    # allows us to plot tornado counts by year for each magnitude

tornado_count_by_mag = total_mag_annual_tornados.unstack(level='mag')

# select only the known magnitudes, i.e., exclude -9 

tornado_count_by_mag = tornado_count_by_mag[ [0,1,2,3,4,5]]
print("Descriptive statistics for total annual by magnitude tornadoes across the US from 1950 - 2022")
print(tornado_count_by_mag.describe())

print('\n')

print(tornado_count_by_mag)

# Plot the DataFrame
tornado_count_by_mag.plot(kind='line',marker='o' , figsize=(12, 6))
#df.plot(c='g',marker='o',linestyle='dashed')

# Add a title and labels
plt.title('US Tornado Count by Magnitude Tornado 1950-2022')
plt.xlabel('Year')
plt.ylabel('Count')

# Show the plot
plt.show()
Descriptive statistics for total annual by magnitude tornadoes across the US from 1950 - 2022
mag              0           1           2          3         4          5
count    73.000000   73.000000   73.000000  73.000000  72.00000  30.000000
mean    435.287671  319.630137  132.000000  35.013699   8.12500   1.966667
std     276.464910  101.662501   53.323437  16.705366   5.74073   1.629117
min      16.000000   82.000000   62.000000  10.000000   1.00000   1.000000
25%     192.000000  262.000000   88.000000  23.000000   4.00000   1.000000
50%     359.000000  312.000000  122.000000  32.000000   7.00000   1.000000
75%     688.000000  380.000000  162.000000  41.000000  11.00000   2.750000
max    1216.000000  617.000000  301.000000  95.000000  30.00000   7.000000


mag       0      1      2     3     4    5
yr                                        
1950   16.0   86.0   68.0  24.0   7.0  NaN
1951   49.0  100.0   83.0  23.0   5.0  NaN
1952   32.0   82.0   72.0  36.0  18.0  NaN
1953   66.0  159.0  134.0  40.0  17.0  5.0
1954   89.0  226.0  189.0  39.0   7.0  NaN
...     ...    ...    ...   ...   ...  ...
2018  618.0  401.0   78.0  12.0   NaN  NaN
2019  647.0  531.0  117.0  33.0   3.0  NaN
2020  460.0  405.0   88.0  18.0   6.0  NaN
2021  556.0  427.0  104.0  21.0   3.0  NaN
2022  401.0  448.0  119.0  20.0   4.0  NaN

[73 rows x 6 columns]

png

Across the US, how does the magnitude of Tornadoes vary by month?

# Group by 'mo' and 'mag', and count the number of tornadoes in each group
total_mag_monthly_tor = tornado_data.groupby(['mo', 'mag'])['om'].count()
print(total_mag_monthly_tor)

print('\n')

# # unstack by magnitude to transpose magnitude from rows to columns
tornado_mo_count_by_mag = total_mag_monthly_tor.unstack(level='mag')

tornado_mo_count_by_mag = tornado_mo_count_by_mag[ [0,1,2,3,4,5]] # Exclude -9 mag from selected columns

print(tornado_mo_count_by_mag)
mo  mag
1   -9       7
     0     631
     1     704
     2     324
     3      79
          ... 
12   1     743
     2     389
     3     117
     4      16
     5       2
Name: om, Length: 80, dtype: int64


mag       0       1       2      3      4     5
mo                                             
1     631.0   704.0   324.0   79.0   11.0   NaN
2     626.0   815.0   385.0  108.0   21.0   1.0
3    1679.0  1767.0   894.0  295.0   68.0   4.0
4    3699.0  3614.0  1686.0  532.0  171.0  20.0
5    7396.0  4677.0  1951.0  573.0  131.0  20.0
6    6768.0  3913.0  1426.0  310.0   86.0   9.0
7    3877.0  2232.0   701.0  123.0   23.0   1.0
8    2608.0  1558.0   480.0   89.0   15.0   1.0
9    1792.0  1154.0   430.0   83.0   13.0   NaN
10   1271.0  1028.0   414.0   87.0    6.0   1.0
11    832.0  1128.0   556.0  160.0   24.0   NaN
12    597.0   743.0   389.0  117.0   16.0   2.0
print("Descriptive statistics for total monthly tornadoes across the US from 1950 - 2022")
print(tornado_mo_count_by_mag.describe())

print('\n')


# Plot the total monthly tornado counts
tornado_mo_count_by_mag.plot(kind='bar',figsize=(12, 6))


# Add a title and labels
plt.title('US tornado count by month and magnitude 1950-2022')
plt.xlabel('Month')
plt.ylabel('Count')

# Add a legend
plt.legend(title='Magnitude')

# Add more ticks to the x-axis
plt.xticks(range(0, 12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# Show the plot
plt.show()

# Historically, tornado strength appears to peak in April and May
# coincides with when there are most tornadoes overall
Descriptive statistics for total monthly tornadoes across the US from 1950 - 2022
mag              0            1            2           3           4  \
count    12.000000    12.000000    12.000000   12.000000   12.000000   
mean   2648.000000  1944.416667   803.000000  213.000000   48.750000   
std    2361.886804  1374.510522   566.950054  177.034152   54.151513   
min     597.000000   704.000000   324.000000   79.000000    6.000000   
25%     781.750000   974.750000   407.750000   88.500000   14.500000   
50%    1735.500000  1356.000000   518.000000  120.000000   22.000000   
75%    3743.500000  2577.500000  1027.000000  298.750000   72.500000   
max    7396.000000  4677.000000  1951.000000  573.000000  171.000000

mag            5  
count   9.000000  
mean    6.555556  
std     8.048464  
min     1.000000  
25%     1.000000  
50%     2.000000  
75%     9.000000  
max    20.000000

png

# Plot each magnitude separately to be able to see low frequency mag 5 more clearly
    # Mag 5 is rare so has lots of nans, Fill missing magnitude values with 0
    # To ensure we do not overwrite original data, create new dataframe with filled nans
tornado_mo_count_by_mag_fillna = tornado_mo_count_by_mag.fillna(0)


fig, ax = plt.subplots(6, figsize=(6, 14))

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
colors = ['blue', 'orange', 'green', 'red', 'purple', 'brown']

for i, col in enumerate(tornado_mo_count_by_mag_fillna.columns):
    ax[i].bar(range(0, 12), tornado_mo_count_by_mag_fillna[col], color=colors[i])
    ax[i].set_title(f'US tornado count by month for magnitude {col} (1950-2022)')
    ax[i].set_xlabel('Month')
    ax[i].set_ylabel('Count')
    ax[i].set_xticks(range(0, 12))
    ax[i].set_xticklabels(months)

plt.tight_layout()
plt.show()

# plotting separately for monthly tornadoes across all years
# shows that for the strongest tornadoes (mag 4 and 5) counts reach peak in April
# weaker tornadoes continue to rise to reach peak counts in May 

png

# Historically, for each magnitude, which month has the most tornadoes?

peak_months = tornado_mo_count_by_mag_fillna.idxmax()
peak_months = peak_months.apply(num_month_converter) # convert the numeric month values to month-name abbreviations


print(f"The peak month for tornadoes by magnitude across the US from 1950 - 2022 is  as follows: {'\n'} {peak_months}")
The peak month for tornadoes by magnitude across the US from 1950 - 2022 is  as follows: 
 mag
0    May
1    May
2    May
3    May
4    Apr
5    Apr
dtype: object

What time do most tornadoes occur across the US?

# Function to convert 24-hour time to 12-hour time

def convert_to_12_hour(hour): 

    # when hour % 12 == 0 (i.e., hour is 0 or 12), the 'or' falls back to 12,
    # because 0 is not a 'truthy' value in Python

    return f"{(hour % 12 or 12)}{'am' if hour < 12 else 'pm'}"

V. Data Processing

  • Creates a binary variable categorizing tornadoes occurring at night
  • Creates a binary variable categorizing tornadoes occurring in urban locations
  • Drops columns not used in the analyses

Create a binary time variable marking tornadoes occurring at night

# Convert the 'time' column to datetime to easily extract the hour
tornado_data['time'] = pd.to_datetime(tornado_data['time'], format='%H:%M:%S')

# Create the binary 'is_night' indicator
# the .dt accessor extracts datetime components (year, month, day, hour, minute)
# Define night as between 9:00 PM (hour >= 21) and 5:00 AM (hour < 5)
# .astype(int) converts the boolean result (True/False) to 1/0

tornado_data['is_night'] = ((tornado_data['time'].dt.hour >= 21) | (tornado_data['time'].dt.hour < 5)).astype(int)

# Extract the time part from the 'time' column
# tornado_data['time'] = tornado_data['time'].dt.time

tornado_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68701 entries, 0 to 68700
Data columns (total 30 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   om        68701 non-null  int64         
 1   yr        68701 non-null  int64         
 2   mo        68701 non-null  int64         
 3   dy        68701 non-null  int64         
 4   date      68701 non-null  object        
 5   time      68701 non-null  datetime64[ns]
 6   tz        68701 non-null  int64         
 7   st        68701 non-null  object        
 8   stf       68701 non-null  int64         
 9   stn       68701 non-null  int64         
 10  mag       68701 non-null  int64         
 11  inj       68701 non-null  int64         
 12  fat       68701 non-null  int64         
 13  loss      68701 non-null  float64       
 14  closs     68701 non-null  float64       
 15  slat      68701 non-null  float64       
 16  slon      68701 non-null  float64       
 17  elat      68701 non-null  float64       
 18  elon      68701 non-null  float64       
 19  len       68701 non-null  float64       
 20  wid       68701 non-null  int64         
 21  ns        68701 non-null  int64         
 22  sn        68701 non-null  int64         
 23  sg        68701 non-null  int64         
 24  f1        68701 non-null  int64         
 25  f2        68701 non-null  int64         
 26  f3        68701 non-null  int64         
 27  f4        68701 non-null  int64         
 28  fc        68701 non-null  int64         
 29  is_night  68701 non-null  int32         
dtypes: datetime64[ns](1), float64(7), int32(1), int64(19), object(2)
memory usage: 15.5+ MB
# # Convert 'time' to datetime format
tornado_data['time'] = pd.to_datetime(tornado_data['time'])

# # Extract the hour from 'time'
tornado_data['hour'] = tornado_data['time'].dt.hour

# # Group by 'hour' and count the number of tornadoes
time_tor = tornado_data.groupby('hour')['om'].count()


print(time_tor.describe())

print(time_tor.sort_values(ascending=False))



# Find the most common hour for tornadoes
most_frequent_tor_hour = time_tor.idxmax()

count_most_common_tor_hour = time_tor[most_frequent_tor_hour]
print('\n')
print(f'From 1950-2022 most tornadoes occur at {convert_to_12_hour(most_frequent_tor_hour)} Central Standard Time (CST) with {count_most_common_tor_hour} occurring at this time of day')


# Plot the count of tornadoes by hour of the day
time_tor.plot()
# Use list comprehension to apply convert_to_12_hour to each element of the index
time_tor.index = [convert_to_12_hour(int(hour)) for hour in time_tor.index]

print(f"Times for tornado events:\n{time_tor.sort_values(ascending=False)}")



# Add a title and labels
plt.title('US Total tornado Count by Time of Day from 1950-2022')
plt.xlabel('Time of day - Central Standard Time (CST)')
plt.ylabel('Count')

# Add a shaded region for the most common hour for tornadoes
plt.axvspan(most_frequent_tor_hour-2, most_frequent_tor_hour+2, color='gray', alpha=0.5, label='Most common time for tornadoes')

# Set the x-axis ticks to every 2 hours
plt.xticks(range(0, 24, 2))

# Add a legend
plt.legend()

# Convert the x-axis labels to 12-hour format
labels = [convert_to_12_hour(int(item)) if item.is_integer() else item for item in plt.gca().get_xticks()]
plt.gca().set_xticklabels(labels)
count      24.000000
mean     2862.541667
std      2557.032553
min       737.000000
25%       870.250000
50%      1620.000000
75%      4210.000000
max      8120.000000
Name: om, dtype: float64
hour
17    8120
16    7799
18    7316
15    6770
19    5823
14    5152
20    3896
13    3761
12    2607
21    2597
22    1817
11    1782
23    1458
10    1266
9     1057
0     1027
1      915
8      882
7      835
2      798
6      776
5      764
4      746
3      737
Name: om, dtype: int64


From 1950-2022 most tornadoes occur at 5pm Central Standard Time (CST) with 8120 occurring at this time of day
Times for tornado events:
5pm     8120
4pm     7799
6pm     7316
3pm     6770
7pm     5823
2pm     5152
8pm     3896
1pm     3761
12pm    2607
9pm     2597
10pm    1817
11am    1782
11pm    1458
10am    1266
9am     1057
12am    1027
1am      915
8am      882
7am      835
2am      798
6am      776
5am      764
4am      746
3am      737
Name: om, dtype: int64





[Text(0, 0, '12am'),
 Text(2, 0, '2am'),
 Text(4, 0, '4am'),
 Text(6, 0, '6am'),
 Text(8, 0, '8am'),
 Text(10, 0, '10am'),
 Text(12, 0, '12pm'),
 Text(14, 0, '2pm'),
 Text(16, 0, '4pm'),
 Text(18, 0, '6pm'),
 Text(20, 0, '8pm'),
 Text(22, 0, '10pm')]

png

# Top 10 times tornadoes occur
# Calculate the counts as a percentage of the total for each hour
time_tor_percentage = (time_tor / time_tor.sum()) * 100

top_10_hour_tor = time_tor_percentage.nlargest(10)

print('\n')

print(f"The top 10 times of day - Central Standard Time (CST) with most tornadoes as a percentage of total counts:\n{top_10_hour_tor}")

print(top_10_hour_tor.sum())
The top 10 times of day - Central Standard Time (CST) with most tornadoes as a percentage of total counts:
5pm     11.819333
4pm     11.352091
6pm     10.649044
3pm      9.854296
7pm      8.475859
2pm      7.499163
8pm      5.670951
1pm      5.474447
12pm     3.794705
9pm      3.780149
Name: om, dtype: float64
78.37003828182995
# Time of tornado occurrence 
    # From 1950-2022, most tornadoes occur at 5pm Central Standard Time (CST), with 8120 occurring at this time of day
    # The top 10 times range from 12pm (noon) to 9pm

# Here night is defined as a tornado occurring between 9:00 PM (21) and 5:00 AM (5)

Drop columns not needed for analysis

Save the processed dataset

# Create and save this processed version of the dataset

# tornado_data = tornado_data.drop(['stf', 'stn','ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc', 'elat', 'elon'], axis=1)
# print(tornado_data.columns)

# tornado_data.to_csv(r'..\..\Project\Data\tornado_data\processed_tornado_dataset_v2.csv')
# tornado_data

VI. Spatial join of tornado locations to US urban area geography using QGIS

  • Why? To test whether a tornado event location being in an urban area affects injuries or fatalities
  • Export the saved tornado CSV into QGIS (external software) for the spatial join
  • Download urban and rural population data from the US Census 2023 TIGER/Line Shapefiles: Urban Areas https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2023&layergroup=Urban+Areas
  • In QGIS, tornado records are joined to a shapefile dataset of urban and rural geographical population data (an equivalent in-Python geopandas sketch follows the field list below)

  • Documentation for descriptions of the urban area definition: https://www2.census.gov/geo/pdfs/maps-data/data/tiger/tgrshp2023/TGRSHP2023_TechDoc.pdf

Definition of urban areas

  • The Census Bureau classified all territory, population, and housing units located within Urban Areas (UAs) as urban. The Census Bureau delineates UA boundaries to represent densely developed territory, encompassing residential, commercial, and other non-residential urban land uses. In general, this territory consists of areas of high population density and urban land use resulting in a representation of the urban footprint.
  • Rural areas consist of territory, population, and housing units located outside of UAs.
  • Urban Areas consist of densely developed territory that contain at least 2,000 Housing units or at least 5,000 population.

Relevant Field Names, Length , Type, Description

  • UACE20 5 String 2020 Census urban area code
  • GEOID20 5 String 2020 Census urban area identifier; 2020 Census urban area code
  • GEOIDFQ20 14 String 2020 Fully Qualified GEOID
  • NAME20 100 String 2020 Census urban area name
  • NAMELSAD20 100 String 2020 Census name and the translated legal/statistical area description for urban area
  • LSAD20 2 String 2020 Census legal/statistical area description code for urban area
  • INTPTLAT20 11 String 2020 Census latitude of the internal point
  • INTPTLON20 12 String 2020 Census longitude of the internal point
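
For reference, the same point-in-polygon join could be done directly in Python with geopandas (imported above). The sketch below is a minimal, hypothetical version of the QGIS step; the shapefile file name and CRS handling are assumptions, not details taken from the actual QGIS workflow.

# Hypothetical geopandas equivalent of the QGIS spatial join (shapefile name assumed)
import geopandas as gpd

urban_areas = gpd.read_file('tl_2023_us_uac20.shp')  # 2023 TIGER/Line urban areas (assumed local path)

# Build point geometries from the tornado starting coordinates (slon, slat)
tornado_points = gpd.GeoDataFrame(
    tornado_data,
    geometry=gpd.points_from_xy(tornado_data['slon'], tornado_data['slat']),
    crs='EPSG:4269',  # NAD83, the CRS used by TIGER/Line shapefiles
)

# Left spatial join: points falling within an urban area polygon pick up NAMELSAD20
joined = gpd.sjoin(tornado_points, urban_areas[['NAMELSAD20', 'geometry']],
                   how='left', predicate='within')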

Load processed tornado data

  • These data have been enriched: each tornado event has been assigned a label of urban or rural based on the location of occurrence, using the slon and slat fields
  • We will only keep the column identifying the location as an urban area, i.e., 'NAMELSAD20'
tornado_data_urban_areas = pd.read_csv(r'..\..\Project\Data\tornado_data\processed_tornado_data_urban.csv')
show(tornado_data_urban_areas)
len(tornado_data_urban_areas)
6064
tornado_data_urban_areas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6064 entries, 0 to 6063
Data columns (total 32 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   field_1     6064 non-null   int64  
 1   om          6064 non-null   int64  
 2   yr          6064 non-null   int64  
 3   mo          6064 non-null   int64  
 4   dy          6064 non-null   int64  
 5   date        6064 non-null   object 
 6   time        6064 non-null   object 
 7   tz          6064 non-null   int64  
 8   st          6064 non-null   object 
 9   mag         6064 non-null   int64  
 10  inj         6064 non-null   int64  
 11  fat         6064 non-null   int64  
 12  loss        6064 non-null   float64
 13  closs       6064 non-null   float64
 14  slat        6064 non-null   float64
 15  slon        6064 non-null   float64
 16  len         6064 non-null   float64
 17  wid         6064 non-null   int64  
 18  is_night    6064 non-null   bool   
 19  hour        6064 non-null   int64  
 20  UACE20      6064 non-null   int64  
 21  GEOID20     6064 non-null   int64  
 22  GEOIDFQ20   6064 non-null   object 
 23  NAME20      6064 non-null   object 
 24  NAMELSAD20  6064 non-null   object 
 25  LSAD20      6064 non-null   int64  
 26  MTFCC20     6064 non-null   object 
 27  FUNCSTAT20  6064 non-null   object 
 28  ALAND20     6064 non-null   int64  
 29  AWATER20    6064 non-null   int64  
 30  INTPTLAT20  6064 non-null   float64
 31  INTPTLON20  6064 non-null   float64
dtypes: bool(1), float64(7), int64(16), object(8)
memory usage: 1.4+ MB
tornado_data_urban_areas.columns
Index(['field_1', 'om', 'yr', 'mo', 'dy', 'date', 'time', 'tz', 'st', 'mag',
       'inj', 'fat', 'loss', 'closs', 'slat', 'slon', 'len', 'wid', 'is_night',
       'hour', 'UACE20', 'GEOID20', 'GEOIDFQ20', 'NAME20', 'NAMELSAD20',
       'LSAD20', 'MTFCC20', 'FUNCSTAT20', 'ALAND20', 'AWATER20', 'INTPTLAT20',
       'INTPTLON20'],
      dtype='object')
# drop uneeded columns from the processed data
tornado_data_urban_areas = tornado_data_urban_areas.drop(['field_1','UACE20', 'GEOID20', 'GEOIDFQ20', 'NAME20',
       'LSAD20', 'MTFCC20', 'FUNCSTAT20', 'ALAND20', 'AWATER20', 'INTPTLAT20',
       'INTPTLON20'], axis=1)
# I noticed that in the QGIS-processed data, the time column now includes the year, month, and day
# to remove that we can extract the time with dt.time
# Extract the time part from the 'time' column

# Convert the 'time' column to datetime
tornado_data_urban_areas['time'] = pd.to_datetime(tornado_data_urban_areas['time'], format='%Y/%m/%d %H:%M:%S')

# Extract the time part
tornado_data_urban_areas['time'] = tornado_data_urban_areas['time'].dt.time

# change the date column from this 1974/06/06 format to this format 1974-06-06
# first convert it to a datetime format
tornado_data_urban_areas['date'] = pd.to_datetime(tornado_data_urban_areas["date"])

tornado_data_urban_areas["date"] = tornado_data_urban_areas["date"].dt.strftime('%Y-%m-%d')

tornado_data_urban_areas
om yr mo dy date time tz st mag inj fat loss closs slat slon len wid is_night hour NAMELSAD20
0 521 1974 6 6 1974-06-06 16:10:00 3 AR 3 112 4 7.0 0.0 34.9800 -90.7800 3.80 150 False 16 Forrest City, AR Urban Area
1 417582 2012 12 20 2012-12-20 16:42:00 3 FL 0 0 0 0.0 0.0 30.1158 -83.5814 0.01 10 False 16 Perry, FL Urban Area
2 621491 2022 5 6 2022-05-06 17:06:00 3 FL 0 0 0 75000.0 0.0 30.1114 -83.5833 0.27 50 False 17 Perry, FL Urban Area
3 84 1958 4 15 1958-04-15 11:20:00 3 FL 3 9 0 5.0 0.0 29.8700 -81.3000 3.60 73 False 11 St. Augustine, FL Urban Area
4 351 1959 6 6 1959-06-06 12:55:00 3 FL 1 0 0 2.0 0.0 29.8700 -81.3000 0.10 10 False 12 St. Augustine, FL Urban Area
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6059 583 2002 4 28 2002-04-28 17:02:00 3 MD 2 0 0 1.0 0.0 38.5300 -76.9800 3.00 100 False 17 La Plata, MD Urban Area
6060 241 2004 6 17 2004-06-17 17:10:00 3 MD 0 0 0 0.0 0.0 38.5300 -76.9800 0.30 75 False 17 La Plata, MD Urban Area
6061 614643 2017 2 25 2017-02-25 14:06:00 3 MD 1 0 0 0.0 0.0 38.5551 -76.9841 8.44 125 False 14 La Plata, MD Urban Area
6062 74 1982 3 29 1982-03-29 17:25:00 3 CA 0 0 0 4.0 0.0 37.6800 -121.7700 0.10 10 False 17 Livermore--Pleasanton--Dublin, CA Urban Area
6063 156 1994 4 25 1994-04-25 13:50:00 3 CA 0 0 0 4.0 0.0 37.6800 -121.7700 0.70 30 False 13 Livermore--Pleasanton--Dublin, CA Urban Area

6064 rows × 20 columns

VII. Merge urban locations enriched dataset with our complete tornado dataset

tornado_data=pd.read_csv(r'..\..\Project\Data\tornado_data\processed_tornado_dataset_v2.csv')

# remove the repeated index column, (could've avoided this by saving it with index=False)
tornado_data= tornado_data.drop('Unnamed: 0', axis=1)
# the time column in the QGIS-processed full dataset also now includes the year, month, and day 
# to remove that we can extract the time with dt.time

# Convert the 'time' column to datetime; note the hyphen separator in the data and include it in the format argument, i.e., format='%Y-%m-%d'
tornado_data['time'] = pd.to_datetime(tornado_data['time'], format='%Y-%m-%d %H:%M:%S')

# Extract the time part from the 'time' column
tornado_data['time'] = tornado_data['time'].dt.time
# we perform a merge of the full tornado_data dataframe containing all records with the urban-enriched dataframe
# we want to keep all records in tornado_data and add the new column 'NAMELSAD20' labeling tornadoes occurring at urban locations
# by not specifying any columns, the merge uses all common columns as join keys

tornado_data = tornado_data.merge(tornado_data_urban_areas, how='left')
tornado_data
om yr mo dy date time tz st mag inj fat loss closs slat slon len wid is_night hour NAMELSAD20
0 192 1950 10 1 1950-10-01 21:00:00 3 OK 1 0 0 4.0 0.0 36.7300 -102.5200 15.80 10 1 21 NaN
1 193 1950 10 9 1950-10-09 02:15:00 3 NC 3 3 0 5.0 0.0 34.1700 -78.6000 2.00 880 1 2 NaN
2 195 1950 11 20 1950-11-20 02:20:00 3 KY 2 0 0 5.0 0.0 37.3700 -87.2000 0.10 10 1 2 NaN
3 196 1950 11 20 1950-11-20 04:00:00 3 KY 1 0 0 5.0 0.0 38.2000 -84.5000 0.10 10 1 4 NaN
4 197 1950 11 20 1950-11-20 07:30:00 3 MS 1 3 0 4.0 0.0 32.4200 -89.1300 2.00 37 0 7 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
68696 621922 2022 9 28 2022-09-28 13:32:00 3 FL 0 0 0 0.0 0.0 28.0830 -80.8669 3.00 100 0 13 NaN
68697 621923 2022 9 30 2022-09-30 10:25:00 3 NC 0 0 0 0.0 0.0 33.9128 -78.2882 0.74 20 0 10 NaN
68698 621924 2022 9 30 2022-09-30 13:22:00 3 NC -9 0 0 0.0 0.0 35.3242 -76.7076 0.70 12 0 13 NaN
68699 621900 2022 9 4 2022-09-04 15:44:00 3 OH 0 0 0 12000.0 0.0 41.0210 -80.6559 0.07 15 0 15 Youngstown, OH Urban Area
68700 621901 2022 9 9 2022-09-09 23:21:00 3 SC 0 0 0 0.0 0.0 32.8750 -79.7514 0.68 125 1 23 Charleston, SC Urban Area

68701 rows × 20 columns

VIII. Create binary urban location variable marking tornadoes occurring in urban locations

# Fill NaNs with 'Rural'. This means that if the location is not within an urban area, we assign it a rural designation

tornado_data["NAMELSAD20"]= tornado_data["NAMELSAD20"].fillna("Rural")

#approach 1 
# In the new urban location column, whenever the word Urban Area appears, replace with a 1, else replace with 0
# Creates a binary urban location variable, category 1= urban location, 0= Rural
tornado_data["AreaIsUrban"] = np.where(tornado_data["NAMELSAD20"].str.contains("Urban Area"), 1, 0) 
# approach 2:
    #tornado_data["AreaIsUrban"] = np.where(tornado_data["NAMELSAD20"]=="Urban Area", 1, 0)# Cannot use this approach because there are other words in the value 

# approach 3: using .str.contains('String you are looking for').astype(int)
# turns the presence of the value from True to a 1.
# tornado_data['AreaIsUrban'] = tornado_data['NAMELSAD20'].str.contains('Urban Area').astype(int)

tornado_data
om yr mo dy date time tz st mag inj ... loss closs slat slon len wid is_night hour NAMELSAD20 AreaIsUrban
0 192 1950 10 1 1950-10-01 21:00:00 3 OK 1 0 ... 4.0 0.0 36.7300 -102.5200 15.80 10 1 21 Rural 0
1 193 1950 10 9 1950-10-09 02:15:00 3 NC 3 3 ... 5.0 0.0 34.1700 -78.6000 2.00 880 1 2 Rural 0
2 195 1950 11 20 1950-11-20 02:20:00 3 KY 2 0 ... 5.0 0.0 37.3700 -87.2000 0.10 10 1 2 Rural 0
3 196 1950 11 20 1950-11-20 04:00:00 3 KY 1 0 ... 5.0 0.0 38.2000 -84.5000 0.10 10 1 4 Rural 0
4 197 1950 11 20 1950-11-20 07:30:00 3 MS 1 3 ... 4.0 0.0 32.4200 -89.1300 2.00 37 0 7 Rural 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
68696 621922 2022 9 28 2022-09-28 13:32:00 3 FL 0 0 ... 0.0 0.0 28.0830 -80.8669 3.00 100 0 13 Rural 0
68697 621923 2022 9 30 2022-09-30 10:25:00 3 NC 0 0 ... 0.0 0.0 33.9128 -78.2882 0.74 20 0 10 Rural 0
68698 621924 2022 9 30 2022-09-30 13:22:00 3 NC -9 0 ... 0.0 0.0 35.3242 -76.7076 0.70 12 0 13 Rural 0
68699 621900 2022 9 4 2022-09-04 15:44:00 3 OH 0 0 ... 12000.0 0.0 41.0210 -80.6559 0.07 15 0 15 Youngstown, OH Urban Area 1
68700 621901 2022 9 9 2022-09-09 23:21:00 3 SC 0 0 ... 0.0 0.0 32.8750 -79.7514 0.68 125 1 23 Charleston, SC Urban Area 1

68701 rows × 21 columns

# Change AreaIsUrban column to a similar naming scheme for clarity
tornado_data = tornado_data.rename(columns= {'AreaIsUrban': 'is_urban'})

tornado_data.columns
Index(['om', 'yr', 'mo', 'dy', 'date', 'time', 'tz', 'st', 'mag', 'inj', 'fat',
       'loss', 'closs', 'slat', 'slon', 'len', 'wid', 'is_night', 'hour',
       'NAMELSAD20', 'is_urban'],
      dtype='object')
# To complete our data processing
# filter to relevant columns, excluding rows with fill values in magnitude column
filtered_tor_data = tornado_data[tornado_data['mag'] !=-9].copy()



print(f"There were {tornada_data_original.shape[0]} rows and {tornada_data_original.shape[1]} columns in the original dataset")
print(f"There are {filtered_tor_data.shape[0]} rows and {filtered_tor_data.shape[1]} columns in the processed dataset")
There were 68701 rows and 29 columns in the original dataset
There are 67945 rows and 21 columns in the processed dataset
# Check what was excluded
excluded_rows = tornado_data[tornado_data['mag'] ==-9]

print(excluded_rows.groupby("yr").size())
print(excluded_rows.groupby("yr").size().sum())

##...some years like 2021 now have significantly fewer data points...
yr
2016     30
2017     64
2018     17
2019    186
2020    105
2021    203
2022    151
dtype: int64
756

IX. QC of data processing

# Set the maximum number of displayed columns to None
pd.set_option('display.max_columns', None) # None, displays all columns


# Let's check the top 20 tornadoes with the most injuries to see if they were flagged as urban locations

filtered_tor_data.sort_values('inj',ascending=False).head(20) 

# The highest-injury tornado event occurred in 1979 with 1740 people injured and 42 deaths in a single storm event! 
# Hmmm... so few tornadoes in urban locations..?
# Spot check the list of top 20 by injury counts...
# 1 = yes, urban; 0 = no, not urban; ? = unsure
#  Result: 1, 1, 1, 1, 1, 0, 0, 1, ...1, 1, 1  (the last ones were over 100 miles in length, so they may be more likely to pass through a city...)
    # e.g., the April 27, 2011 tornado event, aka the 'Tuscaloosa-Birmingham tornado' in Alabama, passed through the city of Tuscaloosa, pop >100K
    # The STARTING coordinate, lat 33.0297, lon -87.9350, is roughly in Eutaw, Alabama, a 'city' with a population of ~3000 in 2020, which is why it was marked as 'Rural'
# Hard lesson... we see here why using ONLY tornado starting location coordinates was not a good approach!
om yr mo dy date time tz st mag inj fat loss closs slat slon len wid is_night hour NAMELSAD20 is_urban
19482 124 1979 4 10 1979-04-10 17:50:00 3 TX 4 1740 42 8.0 0.0 33.8200 -98.6500 46.90 1320 0 17 Rural 0
55296 314625 2011 4 27 2011-04-27 15:43:00 3 AL 4 1500 64 2450.0 0.0 33.0297 -87.9350 80.68 2600 0 15 Rural 0
1057 278 1953 6 9 1953-06-09 14:25:00 3 MA 4 1228 94 8.0 0.0 42.4700 -72.1700 34.90 900 0 14 Rural 0
15259 116 1974 4 3 1974-04-03 13:30:00 3 OH 5 1150 36 8.0 0.0 39.6300 -84.0500 31.30 533 0 13 Rural 0
55600 296616 2011 5 22 2011-05-22 16:34:00 3 MO 5 1150 158 2800.1 0.0 37.0524 -94.5932 21.62 1600 0 16 Rural 0
1054 275 1953 6 8 1953-06-08 19:30:00 3 MI 5 844 116 7.0 0.0 43.1000 -83.8500 18.90 833 0 19 Rural 0
8103 152 1965 4 11 1965-04-11 18:20:00 3 IN 4 835 25 8.0 0.0 40.4000 -86.3800 48.00 880 0 18 Rural 0
12411 66 1971 2 21 1971-02-21 16:00:00 3 MS 4 795 58 4.0 0.0 32.7000 -91.0200 202.10 100 0 16 Rural 0
881 162 1953 5 11 1953-05-11 16:10:00 3 TX 5 597 114 7.0 0.0 31.5500 -97.1500 20.90 583 0 16 Rural 0
39849 1147 1999 5 3 1999-05-03 17:26:00 3 OK 5 583 36 1000.0 0.0 35.1300 -97.8500 37.00 1430 0 17 Rural 0
9005 43 1966 4 4 1966-04-04 07:00:00 3 FL 4 530 11 7.0 0.0 27.9200 -82.8000 135.80 300 0 7 Tampa--St. Petersburg, FL Urban Area 1
8925 32 1966 3 3 1966-03-03 18:30:00 3 MS 5 518 58 7.0 0.0 32.1800 -90.5500 202.50 900 0 18 Rural 0
66437 620750 2021 12 10 2021-12-10 20:54:00 3 TN 4 515 57 25000.0 0.0 36.4830 -89.1350 168.53 2600 0 20 Rural 0
12409 64 1971 2 21 1971-02-21 14:50:00 3 LA 5 510 47 7.0 0.0 32.3800 -91.4700 109.20 500 0 14 Rural 0
11853 188 1970 5 11 1970-05-11 20:35:00 3 TX 5 500 26 8.0 0.0 33.5500 -101.9000 8.40 1333 0 20 Rural 0
19367 796 1979 10 3 1979-10-03 13:00:00 3 CT 4 500 3 8.0 0.0 41.8800 -72.6700 11.30 1400 0 13 Rural 0
9695 186 1967 4 21 1967-04-21 17:24:00 3 IL 4 500 33 7.0 0.0 41.6700 -87.8300 15.00 200 0 17 Rural 0
822 90 1953 4 18 1953-04-18 17:00:00 3 AL 3 495 8 7.0 0.0 32.6000 -85.5800 39.80 400 0 17 Rural 0
59136 606471 2015 12 26 2015-12-26 18:46:00 3 TX 4 468 10 26.8 0.0 32.7960 -96.5894 13.04 550 0 18 Dallas--Fort Worth--Arlington, TX Urban Area 1
27561 816 1989 11 15 1989-11-15 16:30:00 3 AL 4 463 21 8.0 0.0 34.6500 -86.6500 18.50 880 0 16 Rural 0

image.png

sources:
- https://mrcc.purdue.edu/gismaps/cntytorn#
- https://www.ncei.noaa.gov/news/2011-tornado-super-outbreak
- https://www.weather.gov/oun/events-19790410

# From manual spot check reviews of tracks within top 20 list
top_inj_update_is_urban=filtered_tor_data.sort_values('inj',ascending=False).head(20) 
#  Need to change the first 5 rows and the 8th row,
# and also the 12th, 13th, and 14th rows
print(top_inj_update_is_urban.index)
# print(top_inj_update_is_urban.index[0:5].values)
# print(top_inj_update_is_urban.index[7])
# print(top_inj_update_is_urban.index[11:14].values)
# combine to list
update_labels_list = list(top_inj_update_is_urban.index[0:5]) + [top_inj_update_is_urban.index[7]] + list(top_inj_update_is_urban.index[11:14])
print(update_labels_list)
# # Change the 'is_urban' value of those rows with the same values as the list of indices
filtered_tor_data.loc[update_labels_list, 'is_urban'] = 1
Index([19482, 55296,  1057, 15259, 55600,  1054,  8103, 12411,   881, 39849,
        9005,  8925, 66437, 12409, 11853, 19367,  9695,   822, 59136, 27561],
      dtype='int64')
[19482, 55296, 1057, 15259, 55600, 12411, 8925, 66437, 12409]
# Let's check to see if we updated the expected rows
filtered_tor_data.sort_values('inj',ascending=False).head(20) 
om yr mo dy date time tz st mag inj fat loss closs slat slon len wid is_night hour NAMELSAD20 is_urban
19482 124 1979 4 10 1979-04-10 17:50:00 3 TX 4 1740 42 8.0 0.0 33.8200 -98.6500 46.90 1320 0 17 Rural 1
55296 314625 2011 4 27 2011-04-27 15:43:00 3 AL 4 1500 64 2450.0 0.0 33.0297 -87.9350 80.68 2600 0 15 Rural 1
1057 278 1953 6 9 1953-06-09 14:25:00 3 MA 4 1228 94 8.0 0.0 42.4700 -72.1700 34.90 900 0 14 Rural 1
15259 116 1974 4 3 1974-04-03 13:30:00 3 OH 5 1150 36 8.0 0.0 39.6300 -84.0500 31.30 533 0 13 Rural 1
55600 296616 2011 5 22 2011-05-22 16:34:00 3 MO 5 1150 158 2800.1 0.0 37.0524 -94.5932 21.62 1600 0 16 Rural 1
1054 275 1953 6 8 1953-06-08 19:30:00 3 MI 5 844 116 7.0 0.0 43.1000 -83.8500 18.90 833 0 19 Rural 0
8103 152 1965 4 11 1965-04-11 18:20:00 3 IN 4 835 25 8.0 0.0 40.4000 -86.3800 48.00 880 0 18 Rural 0
12411 66 1971 2 21 1971-02-21 16:00:00 3 MS 4 795 58 4.0 0.0 32.7000 -91.0200 202.10 100 0 16 Rural 1
881 162 1953 5 11 1953-05-11 16:10:00 3 TX 5 597 114 7.0 0.0 31.5500 -97.1500 20.90 583 0 16 Rural 0
39849 1147 1999 5 3 1999-05-03 17:26:00 3 OK 5 583 36 1000.0 0.0 35.1300 -97.8500 37.00 1430 0 17 Rural 0
9005 43 1966 4 4 1966-04-04 07:00:00 3 FL 4 530 11 7.0 0.0 27.9200 -82.8000 135.80 300 0 7 Tampa--St. Petersburg, FL Urban Area 1
8925 32 1966 3 3 1966-03-03 18:30:00 3 MS 5 518 58 7.0 0.0 32.1800 -90.5500 202.50 900 0 18 Rural 1
66437 620750 2021 12 10 2021-12-10 20:54:00 3 TN 4 515 57 25000.0 0.0 36.4830 -89.1350 168.53 2600 0 20 Rural 1
12409 64 1971 2 21 1971-02-21 14:50:00 3 LA 5 510 47 7.0 0.0 32.3800 -91.4700 109.20 500 0 14 Rural 1
11853 188 1970 5 11 1970-05-11 20:35:00 3 TX 5 500 26 8.0 0.0 33.5500 -101.9000 8.40 1333 0 20 Rural 0
19367 796 1979 10 3 1979-10-03 13:00:00 3 CT 4 500 3 8.0 0.0 41.8800 -72.6700 11.30 1400 0 13 Rural 0
9695 186 1967 4 21 1967-04-21 17:24:00 3 IL 4 500 33 7.0 0.0 41.6700 -87.8300 15.00 200 0 17 Rural 0
822 90 1953 4 18 1953-04-18 17:00:00 3 AL 3 495 8 7.0 0.0 32.6000 -85.5800 39.80 400 0 17 Rural 0
59136 606471 2015 12 26 2015-12-26 18:46:00 3 TX 4 468 10 26.8 0.0 32.7960 -96.5894 13.04 550 0 18 Dallas--Fort Worth--Arlington, TX Urban Area 1
27561 816 1989 11 15 1989-11-15 16:30:00 3 AL 4 463 21 8.0 0.0 34.6500 -86.6500 18.50 880 0 16 Rural 0
# Also check the top 20 sorted by fatalities
filtered_tor_data.sort_values('fat',ascending=False).head(20) 
om yr mo dy date time tz st mag inj fat loss closs slat slon len wid is_night hour NAMELSAD20 is_urban
55600 296616 2011 5 22 2011-05-22 16:34:00 3 MO 5 1150 158 2800.1 0.0 37.0524 -94.5932 21.62 1600 0 16 Rural 1
1054 275 1953 6 8 1953-06-08 19:30:00 3 MI 5 844 116 7.0 0.0 43.1000 -83.8500 18.90 833 0 19 Rural 0
881 162 1953 5 11 1953-05-11 16:10:00 3 TX 5 597 114 7.0 0.0 31.5500 -97.1500 20.90 583 0 16 Rural 0
1057 278 1953 6 9 1953-06-09 14:25:00 3 MA 4 1228 94 8.0 0.0 42.4700 -72.1700 34.90 900 0 14 Rural 1
1929 227 1955 5 25 1955-05-25 22:00:00 3 OK 5 273 80 5.0 0.0 36.8800 -97.1500 56.40 1320 1 22 Rural 0
55280 309488 2011 4 27 2011-04-27 14:05:00 3 AL 5 145 72 1290.0 0.0 34.1043 -88.1479 132.00 2200 0 14 Rural 0
55296 314625 2011 4 27 2011-04-27 15:43:00 3 AL 4 1500 64 2450.0 0.0 33.0297 -87.9350 80.68 2600 0 15 Rural 1
8925 32 1966 3 3 1966-03-03 18:30:00 3 MS 5 518 58 7.0 0.0 32.1800 -90.5500 202.50 900 0 18 Rural 1
12411 66 1971 2 21 1971-02-21 16:00:00 3 MS 4 795 58 4.0 0.0 32.7000 -91.0200 202.10 100 0 16 Rural 1
66437 620750 2021 12 10 2021-12-10 20:54:00 3 TN 4 515 57 25000.0 0.0 36.4830 -89.1350 168.53 2600 0 20 Rural 1
519 54 1952 3 21 1952-03-21 16:50:00 3 AR 4 325 50 6.0 0.0 35.2200 -91.7000 14.60 1760 0 16 Rural 0
12409 64 1971 2 21 1971-02-21 14:50:00 3 LA 5 510 47 7.0 0.0 32.3800 -91.4700 109.20 500 0 14 Rural 1
3244 376 1957 5 20 1957-05-20 18:15:00 3 KS 5 207 44 6.0 0.0 38.4500 -95.5000 69.40 440 0 18 Rural 0
19482 124 1979 4 10 1979-04-10 17:50:00 3 TX 4 1740 42 8.0 0.0 33.8200 -98.6500 46.90 1320 0 17 Rural 1
735 416 1953 12 5 1953-12-05 17:45:00 3 MS 5 270 38 7.0 0.0 32.3300 -90.9000 9.00 500 0 17 Rural 0
542 77 1952 3 21 1952-03-21 23:30:00 3 TN 4 157 38 6.0 0.0 35.2700 -88.9800 46.80 177 1 23 Rural 0
8102 151 1965 4 11 1965-04-11 18:10:00 3 IN 4 321 36 0.0 0.0 41.5800 -86.2000 37.00 333 0 18 Rural 0
15259 116 1974 4 3 1974-04-03 13:30:00 3 OH 5 1150 36 8.0 0.0 39.6300 -84.0500 31.30 533 0 13 Rural 1
39849 1147 1999 5 3 1999-05-03 17:26:00 3 OK 5 583 36 1000.0 0.0 35.1300 -97.8500 37.00 1430 0 17 Rural 0
10627 224 1968 5 15 1968-05-15 20:45:00 3 AR 4 364 35 4.0 0.0 35.7300 -91.1800 20.90 250 0 20 Rural 0
# Conclusion and Limitations: 
    # The current is_urban categorization of tornadoes into rural vs. urban only considers the starting point of each tornado event
    # Some tornadoes travel for hundreds of miles, starting in rural areas and then passing through cities and suburbs, destroying everything along their path
    # Since we used slon and slat to assign urban vs. rural, the current is_urban variable will not fully capture the relationship between tornado location and tornado injuries 
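
A possible improvement (a sketch only, not applied in this analysis): if the raw file's ending coordinates (elat/elon) were retained and a GeoDataFrame of 2020 Census urban-area polygons were available, each track could be approximated as a buffered line from start point to end point and intersected with the urban polygons, instead of classifying on the starting point alone. The names urban_gdf and is_urban_track below are hypothetical.

# Sketch only (not run here): flag a tornado as urban if any part of its approximate track touches an urban area
# Assumes the raw file's elat/elon end coordinates are present and that `urban_gdf` is a
# hypothetical GeoDataFrame of 2020 Census urban-area polygons in EPSG:4326
import geopandas as gpd
from shapely.geometry import LineString

def track_is_urban(row, urban_union, buffer_deg=0.05):
    # Approximate the track as a straight line from start to end and pad it (~5 km buffer, in degrees)
    track = LineString([(row['slon'], row['slat']), (row['elon'], row['elat'])]).buffer(buffer_deg)
    return int(track.intersects(urban_union))

# urban_union = urban_gdf.unary_union  # merge all urban polygons into a single geometry
# has_end = (raw_tor['elat'] != 0) & (raw_tor['elon'] != 0)  # only tracks with recorded end points
# raw_tor.loc[has_end, 'is_urban_track'] = raw_tor[has_end].apply(
#     track_is_urban, axis=1, urban_union=urban_union)

Buffering in degrees is a rough approximation; projecting to a metric CRS before buffering would be more accurate.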

X. Exploratory Analysis Part 2: Injuries and Fatalities

What proportion of tornadoes result in injuries or fatalities?

# What proportion of tornadoes cause injuries or death?
fig,axes = plt.subplots(1,2 , figsize=(12,10))


# Create a Series with counts of tornadoes that caused injuries and didn't cause injuries
injury_counts = tornado_data['inj'].apply(lambda x: 'Injuries' if x > 0 else 'No injuries').value_counts()

# Generate a pie plot
injury_counts.plot.pie(autopct='%1.1f%%', ax=axes[0], colors=['grey', 'orange'])
axes[0].set_ylabel(None)
axes[0].set_title("Proportion of tornadoes that caused injuries from 1950-2022")

# Create a Series with counts of tornadoes that caused deaths and didn't cause deaths
fatalities_counts = tornado_data['fat'].apply(lambda x: 'Deaths' if x > 0 else 'No deaths').value_counts()

# Generate a pie plot
fatalities_counts.plot.pie(autopct='%1.1f%%', ax=axes[1], colors=['grey', 'red'])
axes[1].set_ylabel(None)
axes[1].set_title("Proportion of tornadoes that caused deaths from 1950-2022")
plt.tight_layout()

# Overall, most tornadoes do not cause injuries or deaths

png

# Overall, most tornadoes do not cause injuries or deaths

What is the trend in the proportion of tornadoes that result in injuries or fatalities over time from 1950-2022?

# Create a copy of the DataFrame for plotting
plot_data = filtered_tor_data.copy()

fig,axes = plt.subplots(2,2 , figsize=(14,10))

# Set 'yr' as the index of the copy
plot_data.set_index('yr', inplace=True)


# total injuries over time
ax1 = axes[0,0].plot(plot_data.groupby('yr').size(), label="Total Tornadoes", color="grey")
ax2 = axes[0,0].twinx()  # Create a second y-axis
ax2.plot(plot_data.groupby('yr')["inj"].sum(), label="Injury Count", color="orange")
axes[0,0].set_title("Tornado Injury Counts in US from 1950-2022")
axes[0,0].set_ylabel("Total Tornadoes")
ax2.set_ylabel("Injury Count")
axes[0,0].set_xlabel(" Year")
# ax2.legend(loc='upper left')
# axes[0,0].legend()


# proportion injuries
# gives the number of tornado events that resulted in at least one injury each year
percentage_inj= plot_data[plot_data['inj'] > 0].groupby('yr').size() / plot_data.groupby('yr').size() # count is then divided by the total number of tornado events each year
# Plot 'inj' over time
percentage_inj.plot.line(label="Percentage causing injury", ax=axes[1,0],color="orange")
axes[1,0].set_title("Percentage Tornados causing Injury in US from 1950-2022")
axes[1,0].set_ylabel("% Tornados causing at least 1 injury")
axes[1,0].set_xlabel(" Year")
# Format y-axis labels as percentages
axes[1,0].yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
axes[1,0].legend()  # Add legend to this subplot

# total fatalities over time
ax3 = axes[0,1].plot(plot_data.groupby('yr').size(), label="Total Tornadoes", color="grey")
ax4 = axes[0,1].twinx()  # Create a second y-axis
ax4.plot(plot_data.groupby("yr")["fat"].sum(), label="Fatality Count", color="r")
axes[0,1].set_title("Tornado Fatality Counts in US from 1950-2022")
axes[0,1].set_ylabel("Total Tornadoes")
ax4.set_ylabel("Fatality Count")
axes[0,1].set_xlabel(" Year")
# axes[0,1].legend(loc='upper right')
# ax4.legend(loc='upper left')


# proportion fatalities
# gives the number of tornado events that resulted in at least one fatality each year
percentage_fat= plot_data[plot_data['fat'] > 0].groupby('yr').size() / plot_data.groupby('yr').size()  #count is then divided by the total number of tornado events each year
# Plot 'fat' over time
percentage_fat.plot.line(label="Percentage causing fatalities", ax=axes[1,1],color="r")
axes[1,1].set_title("Percentage Tornados causing Fatalities in US from 1950-2022")
axes[1,1].set_ylabel("% Tornados causing at least 1 fatality")
axes[1,1].set_xlabel(" Year")
# Format y-axis labels as percentages
axes[1,1].yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
axes[1,1].legend()  # Add legend to this subplot

# Get the lines and labels from both plots
lines1, labels1 = axes[0,0].get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
lines3, labels3 = axes[0,1].get_legend_handles_labels()
lines4, labels4 = ax4.get_legend_handles_labels()

# Combine lines and labels
all_lines = lines1 + lines2
all_labels = labels1 + labels2

# Create a single legend for both plots
axes[0,0].legend(all_lines, all_labels, loc='upper left')

# Combine lines and labels
all_lines = lines3 + lines4
all_labels = labels3 + labels4

# Create a single legend for both plots
axes[0,1].legend(all_lines, all_labels, loc='upper left')
# Calculate overall mean percentages
# (.size counts cells, i.e., rows x columns, but both frames have the same columns, so the ratio equals the proportion of rows)
mean_percentage_inj = plot_data[plot_data['inj'] > 0].size / plot_data.size
mean_percentage_fat = plot_data[plot_data['fat'] > 0].size / plot_data.size

# Add horizontal lines at the mean percentages
axes[1,0].axhline(mean_percentage_inj, color='blue', linestyle='--', label='Mean % causing injury')
axes[1,1].axhline(mean_percentage_fat, color='blue', linestyle='--', label='Mean % causing fatality')

# Update the legends to include the new lines
axes[1,0].legend()
axes[1,1].legend()
plt.tight_layout()

png

# Total tornado count trends indicate an increase in the occurrence of tornadoes since the 1950s;
# that upward count trend is likely due to advances in tornado detection systems 
# Although overall tornado counts increase over time, tornado-related fatalities and injuries decrease, particularly around the 1980s-1990s,
# likely due to more widespread adoption of Doppler radars, improvements in detection and tornado warning systems, and increased awareness across the US


# For more info:  https://www.nationalgeographic.org/forces-nature/tornadoes.html
# and... https://www.ametsoc.org/index.cfm/ams/about-ams/ams-statements/archive-statements-of-the-ams/tornado-preparedness-and-safety/
# get the years of the top 10 for tornado injury
top_inj_years = plot_data.groupby('yr')['inj'].sum().nlargest(10).index

#filter the dataframe to get only these years
#'yr' is set as the index , use the DataFrame's index for filtering instead of column
top_inj_years_data= plot_data[plot_data.index.isin(top_inj_years)] # uses the index of the dataframe to filter to the list of top 10 years


# get aggregate stats
top_inj_years_stats = top_inj_years_data.groupby("yr").agg({'len':'mean',
    'wid':'mean',
    'hour':'mean',
    'is_night':'sum',
    'is_urban':'sum',
    'fat':'sum',
    'inj' : 'sum',

    })

# Repeat for fatalities

# get the years of the top 10 for tornado fatalities
top_fat_years = plot_data.groupby('yr')['fat'].sum().nlargest(10).index

#filter the dataframe to get only these years
#'yr' is set as the index , use the DataFrame's index for filtering instead of column
top_fat_years_data= plot_data[plot_data.index.isin(top_fat_years)] # uses the index of the dataframe to filter to the list of top 10 years


# get aggregate stats
top_fat_years_stats = top_fat_years_data.groupby("yr").agg({'len':'mean',
    'wid':'mean',
    'hour':'mean',
    'is_night':'sum',
    'is_urban':'sum',
    'fat':'sum',
    'inj' : 'sum',

    })
# Sort the DataFrame by 'inj' or 'fat' and get the top 100 rows
# These are the top 100 tornadoes causing the most injury or fatality
top_100_inj_tor = plot_data.sort_values('inj', ascending=False).head(100)
top_100_fat_tor = plot_data.sort_values('fat', ascending=False).head(100)

# Define a scaling factor to adjust marker size
scaling_factor_injuries = 0.5  # Adjust this value to your preference
scaling_factor_fatalities = 3  # Adjust this value to your preference

# Initialize Cartopy map with the PlateCarree projection
fig, ax = plt.subplots(figsize=(14, 10), subplot_kw={'projection': ccrs.PlateCarree()})

# Add natural features (like borders, coastlines)
ax.add_feature(cfeature.BORDERS)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.STATES, linestyle=':')

# Set the extent to cover the 50 states
ax.set_extent([-125, -66.5, 24, 50], crs=ccrs.PlateCarree())

# Normalize the counts for the color map using LogNorm for better differentiation
norm_injuries = LogNorm(vmin=top_100_inj_tor['inj'].min(), vmax=top_100_inj_tor['inj'].max())
norm_fatalities = LogNorm(vmin=top_100_fat_tor['fat'].min(), vmax=top_100_fat_tor['fat'].max())

# Plot the top 100 injury tornadoes
sc1 = ax.scatter(top_100_inj_tor['slon'], top_100_inj_tor['slat'],
           s=top_100_inj_tor['inj'] * scaling_factor_injuries
          , alpha=0.7, transform=ccrs.PlateCarree(),
           c=top_100_inj_tor['inj'], cmap='Oranges', norm=norm_injuries)

# Plot the top 100 fatality tornadoes
sc2 = ax.scatter(top_100_fat_tor['slon'], top_100_fat_tor['slat'],
           s=top_100_fat_tor['fat'] * scaling_factor_fatalities,
            alpha=0.7, transform=ccrs.PlateCarree(),
           c=top_100_fat_tor['fat'], cmap='Reds', norm=norm_fatalities)

# Add a color bar for injuries
cbar1 = plt.colorbar(sc1, ax=ax, orientation='horizontal', pad=0.05, shrink=0.7)
cbar1.set_label('Number of Injuries')
cbar1.set_ticks([200, 500, 1000, 1500])  # Set ticks to the values you want
cbar1.set_ticklabels([ '200', '500', '1000', '1500'])  # Set original values

# Add a color bar for fatalities
cbar2 = plt.colorbar(sc2, ax=ax, orientation='horizontal', pad=0.15, shrink=0.7)
cbar2.set_label('Number of Fatalities')
cbar2.set_ticks([ 20, 50, 100, 150])  # Set ticks to the values you want
cbar2.set_ticklabels([ '20', '50', '100', '150'])  # Set original values

# For general reference, mark coordinates of two well-studied (i.e., devastating) tornadoes: Tuscaloosa, Alabama and Joplin, Missouri
tuscaloosa_coords = (33.0297, -87.9350)
joplin_coords = (37.0524, -94.5932)

# Create a rectangular patch for Tuscaloosa, Alabama
rect_tuscaloosa = mpatches.Rectangle((tuscaloosa_coords[1] - 1, tuscaloosa_coords[0] - 1),
                                     2, 2, edgecolor='orange', facecolor='none', linestyle='--', lw=2, transform=ccrs.PlateCarree())
ax.add_patch(rect_tuscaloosa)

# Create a rectangular patch for Joplin, Missouri
rect_joplin = mpatches.Rectangle((joplin_coords[1] - 1, joplin_coords[0] - 1),
                                 2, 2, edgecolor='red', facecolor='none', linestyle='--', lw=2, transform=ccrs.PlateCarree())
ax.add_patch(rect_joplin)

# Add the rectangles to the legend
orange_rect_legend = mpatches.Patch(edgecolor='orange', facecolor='none', linestyle='--', lw=2, label='Tuscaloosa, Alabama')
red_rect_legend = mpatches.Patch(edgecolor='red', facecolor='none', linestyle='--', lw=2, label='Joplin, Missouri')

ax.legend(handles=[sc1, sc2, orange_rect_legend, red_rect_legend], loc='upper right')



# Add a title and a legend
ax.set_title("Top 100 Tornado Events by Injuries and Fatalities (1950-2022)", fontsize=16)
plt.tight_layout()
# Add gridlines and labels
gl = ax.gridlines(draw_labels=True, linestyle='--', linewidth=0.5, color='gray')
gl.top_labels = False
gl.right_labels = False
gl.xlabel_style = {'size': 10}
gl.ylabel_style = {'size': 10}

plt.show()
C:\Users\Brian\AppData\Local\Temp\ipykernel_14308\2278222385.py:67: MatplotlibDeprecationWarning: An artist whose label starts with an underscore was passed to legend(); such artists will no longer be ignored in the future.  To suppress this warning, explicitly filter out such artists, e.g. with `[art for art in artists if not art.get_label().startswith('_')]`.
  ax.legend(handles=[sc1, sc2, orange_rect_legend, red_rect_legend], loc='upper right')

png

# Convert the 'time' column to datetime to easily extract the hour
tornado_data['time'] = pd.to_datetime(tornado_data['time'], format='%H:%M:%S')

# Create the binary 'is_night' indicator
 # the .dt accessor extracts datetime components such as year, month, day, hour, and minute
# Define night as between 9:00 PM (21) and 5:00 AM (5)
# .astype(int) converts the boolean result of the logical evaluation from True/False to 1/0

tornado_data['is_night'] = ((tornado_data['time'].dt.hour >= 21) | (tornado_data['time'].dt.hour < 5)).astype(int)
# # Convert 'time' to datetime format
tornado_data['time'] = pd.to_datetime(tornado_data['time'])

# # Extract the hour from 'time'
tornado_data['hour'] = tornado_data['time'].dt.hour

# # Group by 'hour' and count the number of tornadoes
time_tor = tornado_data.groupby('hour')['om'].count()


print(time_tor.describe())

print(time_tor.sort_values(ascending=False))



# Find the most common hour for tornadoes
most_frequent_tor_hour = time_tor.idxmax()

count_most_common_tor_hour = time_tor[most_frequent_tor_hour]
print('\n')
print(f'From 1950-2022 most tornadoes occur at {convert_to_12_hour(most_frequent_tor_hour)} Central Standard Time (CST) with {count_most_common_tor_hour} occurring at this time of day')


# Plot the count of tornadoes by hour of the day
time_tor.plot()
# Use list comprehension to apply convert_to_12_hour to each element of the index
time_tor.index = [convert_to_12_hour(int(hour)) for hour in time_tor.index]

print(f"Times for tornado events:\n{time_tor.sort_values(ascending=False)}")



# Add a title and labels
plt.title('US Total tornado Count by Time of Day from 1950-2022')
plt.xlabel('Time of day - Central Standard Time (CST)')
plt.ylabel('Count')

# Add a shaded region for the most common hour for tornadoes
plt.axvspan(most_frequent_tor_hour-2, most_frequent_tor_hour+2, color='gray', alpha=0.5, label='Most common time for tornadoes')

# Set the x-axis ticks to every 2 hours
plt.xticks(range(0, 24, 2))

# Add a legend
plt.legend()

# Convert the x-axis labels to 12-hour format
labels = [convert_to_12_hour(int(item)) if item.is_integer() else item for item in plt.gca().get_xticks()]
plt.gca().set_xticklabels(labels)
count      24.000000
mean     2862.541667
std      2557.032553
min       737.000000
25%       870.250000
50%      1620.000000
75%      4210.000000
max      8120.000000
Name: om, dtype: float64
hour
17    8120
16    7799
18    7316
15    6770
19    5823
14    5152
20    3896
13    3761
12    2607
21    2597
22    1817
11    1782
23    1458
10    1266
9     1057
0     1027
1      915
8      882
7      835
2      798
6      776
5      764
4      746
3      737
Name: om, dtype: int64


From 1950-2022 most tornadoes occur at 5pm Central Standard Time (CST) with 8120 occurring at this time of day
Times for tornado events:
5pm     8120
4pm     7799
6pm     7316
3pm     6770
7pm     5823
2pm     5152
8pm     3896
1pm     3761
12pm    2607
9pm     2597
10pm    1817
11am    1782
11pm    1458
10am    1266
9am     1057
12am    1027
1am      915
8am      882
7am      835
2am      798
6am      776
5am      764
4am      746
3am      737
Name: om, dtype: int64





[Text(0, 0, '12am'),
 Text(2, 0, '2am'),
 Text(4, 0, '4am'),
 Text(6, 0, '6am'),
 Text(8, 0, '8am'),
 Text(10, 0, '10am'),
 Text(12, 0, '12pm'),
 Text(14, 0, '2pm'),
 Text(16, 0, '4pm'),
 Text(18, 0, '6pm'),
 Text(20, 0, '8pm'),
 Text(22, 0, '10pm')]

png

Compared to the overall dataset, do the top 10 years for injuries or fatalities have a higher proportion of tornadoes occurring at night or in urban locations?

Compared to the overall dataset, do the 100 most extreme tornadoes for injuries or fatalities have a higher proportion of tornadoes occurring at night or in urban locations?

# Calculate proportions for the overall dataset
overall_proportions = plot_data[['is_night', 'is_urban']].mean()

# Calculate proportions for the top injury years dataset
top_years_proportions = top_inj_years_data[['is_night', 'is_urban']].mean()

# Create a DataFrame for the proportions
proportions1_df = pd.DataFrame({'Overall': overall_proportions, 'Top Injury Years': top_years_proportions})


fig,axes = plt.subplots(2,2,figsize=(14, 10))


# Create a bar plot for the proportions
proportions1_df.plot(kind='bar', color=['grey','orange'], ax=axes[0,0])
axes[0,0].set_ylabel('Percentage')
axes[0,0].set_title('Proportion of Tornadoes Occurring at Night and in Urban Areas')
axes[0,0].yaxis.set_major_formatter(mtick.PercentFormatter(1.0))

# Sort the DataFrame by 'inj' and get the top 100 rows
# These are the top 100 tornadoes causing the most injury
top_100_inj_tor = plot_data.sort_values('inj', ascending=False).head(100)

overall_proportions = plot_data[['is_night', 'is_urban']].mean()
top_100_proportions = top_100_inj_tor[['is_night', 'is_urban']].mean()

proportions2_df = pd.DataFrame({
    'Overall': overall_proportions,
    'Top 100 injury Tornadoes': top_100_proportions
})

proportions2_df.plot(kind='bar', color=['grey','orange'], ax=axes[1,0])
axes[1,0].set_ylabel('Percentage')
axes[1,0].set_title('Proportion of Tornadoes Occurring at Night and in Urban Areas')
axes[1,0].yaxis.set_major_formatter(mtick.PercentFormatter(1.0))

# Repeat plots for fatalities

# Calculate proportions for the top fatality years dataset
top_fat_years_proportions = top_fat_years_data[['is_night', 'is_urban']].mean()

# Create a DataFrame for the proportions
proportions_fat1_df = pd.DataFrame({'Overall': overall_proportions, 'Top Fatality Years': top_fat_years_proportions})

# Create a bar plot for the proportions
proportions_fat1_df.plot(kind='bar', color=['grey','red'], ax=axes[0,1])
axes[0,1].set_ylabel('Percentage')
axes[0,1].set_title('Proportion of Tornadoes Occurring at Night and in Urban Areas')
axes[0,1].yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
# Sort the DataFrame by 'fat' and get the top 100 rows
# These are the top 100 tornadoes causing the most deaths
top_100_fat_tor = plot_data.sort_values('fat', ascending=False).head(100)


top_100_fat_proportions = top_100_fat_tor[['is_night', 'is_urban']].mean()

proportions_fat2_df = pd.DataFrame({
    'Overall': overall_proportions,
    'Top 100 fatality Tornadoes': top_100_fat_proportions
})

proportions_fat2_df.plot(kind='bar', color=['grey','red'], ax=axes[1,1])
axes[1,1].set_ylabel('Percentage')
axes[1,1].set_title('Proportion of Tornadoes Occurring at Night and in Urban Areas')
axes[1,1].yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
plt.tight_layout()

plt.show()

png

# Create box-and-whisker plots, by is_urban , 0(rural) vs 1(urban)
fig,ax = plt.subplots(figsize=(10,5))
filtered_tor_data.boxplot(ax=ax,column='inj', by='is_urban')
plt.title("Injuries from tornado by rural versus urban location from 1950-2022")
plt.ylabel("Injuries")
plt.suptitle(None) #remove default title
Text(0.5, 0.98, '')

png

# Create box-and-whisker plots, by is_night , 0(day) vs 1(night)
fig,ax = plt.subplots(figsize=(10,5))
filtered_tor_data.boxplot(ax=ax,column='fat', by='is_night')
plt.title("Fatalities from tornado by occurence in day versus night from 1950-2022")
plt.ylabel("Fatalities")
plt.suptitle(None) #remove default title
Text(0.5, 0.98, '')

png

XI. Evaluating distribution of tornado injuries by rural vs urban location and day vs night

Grouping by rural and urban

# Groupby urban vs rural location and save to separate variables
is_urban_inj_data = filtered_tor_data.groupby('is_urban')
print(is_urban_inj_data)
rural_tor_injury = is_urban_inj_data.get_group(0).reset_index()[['yr','inj', 'fat','mag', 'len', 'wid','is_night']]
urban_tor_injury = is_urban_inj_data.get_group(1).reset_index()[['yr','inj', 'fat','mag', 'len', 'wid','is_night']]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002075FE8DF40>

Grouping by day and night

# Repeat grouping for day and night tornadoes...
# Groupby day vs night time occurence and save to separate variables
# Here night is defined as a tornado occurring between 9:00 PM (21) and 5:00 AM (5)
is_night_fat_data = filtered_tor_data.groupby('is_night')
print(is_night_fat_data)
day_tor_fatality = is_night_fat_data.get_group(0).reset_index()[['yr','inj', 'fat','mag', 'len', 'wid','is_night','time','hour']]
night_tor_fatality = is_night_fat_data.get_group(1).reset_index()[['yr','inj', 'fat','mag', 'len', 'wid','is_night','time','hour']]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000207624A8A10>
# how many data points per group
print(len(rural_tor_injury))
print(len(urban_tor_injury))
print(len(day_tor_fatality))
print(len(night_tor_fatality))
61878
6067
57889
10056
# Scatter plot to compare rural and urban datasets for injury and fatalities
fig,axes = plt.subplots(1,2, figsize=(16,8))

axes[0].scatter(x=rural_tor_injury['fat'],y=rural_tor_injury['inj'],label='Rural tornadoes')
axes[0].scatter(x=urban_tor_injury['fat'],y=urban_tor_injury['inj'],label='Urban tornadoes')
axes[0].legend()
axes[0].set_xlabel('Fatality Count')
axes[0].set_ylabel('Injury Count')
axes[0].set_title("Tornado Injuries", color="orange", loc="left")
# axes[0].set_title('Tornado Event Injuries, Urban vs. Rural counts')
axes[0].set_title("Urban vs. Rural counts", color="black", loc="center")


axes[1].scatter(x=day_tor_fatality['fat'],y=day_tor_fatality['inj'],label='Day tornadoes', color="green")
axes[1].scatter(x=night_tor_fatality['fat'],y=night_tor_fatality['inj'],label='Night tornadoes', color="red")
axes[1].legend()
axes[1].set_xlabel('Fatality Count')
axes[1].set_ylabel('Injury Count')
# axes[1].set_title('Tornado Event Fatalities, Day vs. Night counts')
axes[1].set_title("Tornado Fatalities", color="red", loc="left")
axes[1].set_title("Day vs. Night counts", color="black", loc="center")

plt.tight_layout()

png

# Variations in data distributions
    # Very large variation in injury count data from 0 to 1750 injuries
    # lots of outliers
    # Most tornado events have zero or near-zero injuries
# Differences between groups
    # There is a noticeable upward shift for urban tornadoes in the scatter plot comparing rural and urban tornadoes (the plot on the left)
    # This upward shift indicates that urban tornadoes generally have higher injury counts compared to rural tornadoes
    # There is a noticeable rightward shift for night tornadoes in the scatter plot comparing day and night tornadoes (the plot on the right)
    # This shift to the right signifies that night tornadoes tend to have higher fatality counts compared to day tornadoes

Correlation analyses

from scipy.stats import pearsonr

# Calculate Pearson correlation
corr, _ = pearsonr(rural_tor_injury['fat'], rural_tor_injury['inj'])

print('Rural Pearsons correlation: %.3f' % corr)



# Calculate Pearson correlation
corr_u, _ = pearsonr(urban_tor_injury['fat'], urban_tor_injury['inj'])

print('Urban Pearsons correlation: %.3f' % corr_u)

# Scatter plot to compare rural and urban datasets for injury and fatalities
fig,axes = plt.subplots(1,2, figsize=(16,8))


# Create a scatter plot
axes[0].scatter(rural_tor_injury['fat'], rural_tor_injury['inj'])
# Create a scatter plot
axes[1].scatter(urban_tor_injury['fat'], urban_tor_injury['inj'])

# Calculate line of best fit
# np.polyfit() fits a polynomial to the data
    # it determines the polynomial that best represents the relationship within the observed data points
    # the second argument is the degree; it selects the model used to describe the relationship between the variables
    # 1 is a first-degree polynomial, which is a straight line
# When we specify a degree of 1, we are fitting a linear polynomial, i.e., a straight-line equation, to our data
# higher degrees would allow for curves and more complex relationships between variables
# the degree you choose determines the type of curve used to model the relationship between variables in the data

# calculates m (slope) and b (intercept) that make this line fit as closely as possible to the data points
m, b = np.polyfit(rural_tor_injury['fat'], rural_tor_injury['inj'], 1)


# calculates m (slope) and b(intercept) that make this line fit as closely as possible to the data points
m_u, b_u = np.polyfit(urban_tor_injury['fat'], urban_tor_injury['inj'], 1)


# Add line of best fit to rural plot
axes[0].plot(rural_tor_injury['fat'], m*rural_tor_injury['fat'] + b, color='red')

# Add line of best fit to urban plot
axes[1].plot(urban_tor_injury['fat'], m_u*urban_tor_injury['fat'] + b_u, color='red')



# Add labels
axes[0].set_title('Relationship between Rural Fatalities and Injuries')
axes[0].set_xlabel('Fatalities')
axes[0].set_ylabel('Injury')
axes[1].set_title('Relationship between Urban Fatalities and Injuries')
axes[1].set_xlabel('Fatalities')
axes[1].set_ylabel('Injury')


# Show the plot
plt.show()
Rural Pearsons correlation: 0.734
Urban Pearsons correlation: 0.819

png

# A correlation of 0.7 suggests a strong positive linear relationship, meaning as one variable increases, the other tends to increase in a linear manner too
# Should we use the grouped or ungrouped data for the univariate model predicting fatalities from injuries?
# Calculate Pearson correlation for rural data
corr, _ = pearsonr(rural_tor_injury['fat'], rural_tor_injury['len'])

print('Rural Pearsons correlation: %.3f' % corr)


# Calculate Pearson correlation for urban data
corr, _ = pearsonr(urban_tor_injury['fat'], urban_tor_injury['len'])

print('Urban Pearsons correlation: %.3f' % corr)

# Scatter plot to compare rural and urban datasets for injury and fatalities
fig,axes = plt.subplots(1,2, figsize=(16,8))

# Create a scatter plot
axes[0].scatter(rural_tor_injury['fat'], rural_tor_injury['len'])

# Calculate line of best fit
m_1, b_1 = np.polyfit(rural_tor_injury['fat'], rural_tor_injury['len'], 1)

# Generate x values for your line of best fit (from minimum to maximum fatality counts)
x_values = np.linspace(min(rural_tor_injury['fat']), max(rural_tor_injury['fat']), num=100) # num =100, means generate 100 evenly spaced points between min and max

# using 100 points for the x values is common practice, 
# balancing computational efficiency and the visual smoothness of the plotted line
    # 100 points is enough to represent the linear relationship without overloading the plot or the resources needed to render it
# Calculate corresponding y values using the equation of the line of best fit
y_values = m_1 * x_values + b_1

# Add line of best fit to plot
axes[0].plot(x_values, y_values, color='red', linewidth=2)

# Add labels correctly, as per your plotted data
axes[0].set_xlabel('Fatalities')
axes[0].set_ylabel('Length')




# Create a scatter plot
axes[1].scatter(urban_tor_injury['fat'], urban_tor_injury['len'])
# Calculate line of best fit
m_u1, b_u1 = np.polyfit(urban_tor_injury['fat'], urban_tor_injury['len'], 1)

# Generate x values for your line of best fit (from minimum to maximum fatality counts)
x_values_u1 = np.linspace(min(urban_tor_injury['fat']), max(urban_tor_injury['fat']), num=100) # num =100, means generate 100 evenly spaced points between min and max

# Calculate corresponding y values using the equation of the line of best fit
y_values_u1 = m_u1 * x_values_u1 + b_u1

# add line of best fit to urban data plot
axes[1].plot(x_values_u1, y_values_u1, color='red', linewidth=2)

# Add labels correctly, as per your plotted data
axes[1].set_xlabel('Fatalities')
axes[1].set_ylabel('Length')


# Show the plot
plt.show()
Rural Pearsons correlation: 0.223
Urban Pearsons correlation: 0.397

png

# Create a scatter plot
plt.scatter(urban_tor_injury['fat'], urban_tor_injury['inj'])

# Calculate line of best fit
m, b = np.polyfit(urban_tor_injury['fat'], urban_tor_injury['inj'], 1)

# Add line of best fit to plot
plt.plot(urban_tor_injury['fat'], m*urban_tor_injury['fat'] + b, color='red')

# Add labels
plt.xlabel('Fatalities')
plt.ylabel('Injuries')

# Show the plot
plt.show()

png

OLS linear regression model: univariate

# # Convert is_night to integer data type 
filtered_tor_data['is_night'] = filtered_tor_data['is_night'].astype(int)

# We exclude magnitude 0 tornadoes since, as seen in the EDA, they account for only ~1% of tornado injuries
filtered_tor_data_hi_mag = filtered_tor_data[filtered_tor_data["mag"]!=0]

# Aggregate data by year
annual_data_fatalities = filtered_tor_data_hi_mag.groupby('yr').agg({
    'fat': 'sum',  # Sum fatalities per year
    'mag': 'mean', # Average magnitude of tornadoes per year
    'len':'mean',  # Mean tornado path length in miles per year
    'is_urban': 'sum',
    'is_night': 'sum',
    'inj':'sum',
    # Add other aggregations as needed
}).reset_index()


# # # Apply log transformation to fatalities
annual_data_fatalities['log_fat'] = np.log(annual_data_fatalities['fat'] + 1)

# # # Apply log transformation to injuries
annual_data_fatalities['log_inj'] = np.log(annual_data_fatalities['inj']+ 1)  # Add 1 to avoid taking log of 0



# Define predictors and outcome
X = annual_data_fatalities[['log_inj']]  # Add other annual predictors as needed
y = annual_data_fatalities[['log_fat']]  # outcome , dependent variable

# Split the data into training and testing sets
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X, y, test_size=0.4, random_state=42)

# Adding a constant for the intercept
    # allows for the calculation of an intercept in the regression model
    # Adding a constant to the predictor variables (feature set) is necessary when using statsmodels' OLS for fitting a regression model to the data
    # This step adds a column of ones to the input predictor variables or feature set, 
    # which allows the OLS algorithm to include an intercept term (β0) in the model 
    # The intercept represents the predicted value of the dependent variable when all the predictor  (independent) variables are zero 
    # Including this constant ensures that the model has a y-axis intercept,
    # which improves the fit and interpretability of the model

X_train_f = sm.add_constant(X_train_f) 
X_test_f = sm.add_constant(X_test_f)


# Fit the OLS model using the training data
model_f = sm.OLS(y_train_f, X_train_f).fit()

# Make predictions using the testing data
y_pred_f = model_f.predict(X_test_f)

# Sort the test data and predictions for plotting
sorted_indices = np.argsort(X_test_f['log_inj'])
sorted_X_test = X_test_f['log_inj'].iloc[sorted_indices]
sorted_y_pred = y_pred_f.iloc[sorted_indices]

# Plot the data to show the R squared line and the data points
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(X_test_f['log_inj'], y_test_f['log_fat'], color='blue', s=15)
ax.set_xlabel('Log Injuries')
ax.set_ylabel('Log Fatalities')

# calculate r squared
r2 = r2_score(y_test_f, y_pred_f)
rmse_err = mean_squared_error(y_test_f, y_pred_f, squared=False)
ax.set_title('Relationship between Log Fatalities and Log Injuries: Univariate Regression, R^2 = {:.2f}, RMSE = {:.2f}'.format(r2, rmse_err))
#linear regression line plotted against the original / test data, so we can visualize the "fit" of the model to the data
ax.plot(sorted_X_test, sorted_y_pred, color='red', linewidth=4)
plt.show()

png

# Regression analysis
# a statistical technique used to model and analyze the relationships between a dependent variable and one or more independent variables
# main goal of Regression analysis is to 1) understand how the dependent variable changes in response to variations in the independent variable(s), 2) and to quantify the strength of those relationships
    # 1) the dependent variable changes are determined by the coefficient of the independent variable
    # 2) R squared quantifies the strength of the relationship and the proportion of variance predictable from the independent variable; a higher R squared indicates stronger explanatory power
# An R squared of 0.57 means that 57% of the variability in the dependent variable fatalities (log-fatalities)  
# can be  explained by the independent variable injuries (log-injuries)
# i.e., 57% variability in fatalities can be explained by changes in injuries
# tells us how well the variations in fatalities are being explained by injuries
# model equation
    # the full regression equation based on the parameters from the model output
    # log(fatalities)  = -2.6174 + 0.9679 * log(Injuries)

# this model equation can be exponentiated to interpret it in the original units
    # The coefficient in the log-log model tells us the expected change in the dependent variable (log-fatalities)
    # for a one-unit change in the independent variable (log-injuries)
    # When log(injuries) increases by 1, the expected change in log(fatalities) is 0.9679 units
    # To find the actual change in fatalities, you exponentiate this change
    # If we double injuries, the fold change in fatalities = e^(0.9679 * log(2)), where log(2) is the change in log(injuries) when injuries double
    # The change in log(fatalities) is then 0.9679 * 0.693 = 0.671, so log(fatalities) increases by 0.671 units for each doubling of injuries
    # Exponentiate the change in log(fatalities) back to original units: 
        # e^0.671 is a 1.96-fold increase in fatalities, or a 96% increase in the ACTUAL NUMBER of fatalities
        # so, if you had 100 fatalities initially, doubling injuries would increase them to 196 fatalities
        # Conclusion: doubling injuries nearly doubles fatalities

# The Y-intercept, e^-2.6174, means that when injuries are at their lowest level (i.e., 1 injury, since we log-transformed injuries)
    # expected fatalities are 0.073
    # This indicates that a single-injury event results in a very low number of fatalities

# X^m or Injuries^0.9679 (coefficient), indicates that the relationship between injuries and fatalities is multiplicative
    # increasing injuries by some percentage leads to a predictable percentage increase in fatalities, determined by the exponent m, i.e., the coefficient
    # changes in fatalities depend on the power m to which injuries variable X is raised 
    # If m >1 fatalities increase at a faster rate than injuries
    # if m <1 fatalities increase at  a slower rate than injuries
    # if m = 1 fatalities increase linearly with injuries , so that a 10-fold increase in injuries would cause 10 fold increase in fatalities
    #  Here m is less than 1 i.e., m= 0.9679
np.log(2)
0.6931471805599453
np.exp(.671)
1.9561925354010234
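
The same arithmetic in one step, using the coefficient and intercept reported in the model summary (values hard-coded here for illustration):

# Worked check of the log-log interpretation using the fitted coefficients
beta0, beta1 = -2.6174, 0.9679           # intercept and log_inj coefficient from the OLS output
fold_change = np.exp(beta1 * np.log(2))  # multiplicative change in expected fatalities when injuries double
baseline = np.exp(beta0)                 # expected fatalities when log(injuries) = 0, i.e., one injury
print(f"Doubling injuries multiplies expected fatalities by ~{fold_change:.2f}")
print(f"Expected fatalities at a single injury: ~{baseline:.3f}")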
print(len(y_pred_f))
print(y_pred_f.dtypes)
print(type(y_pred_f))
30
float64
<class 'pandas.core.series.Series'>
print(len(y_test_f))
print(y_test_f.dtypes)
print(type(y_test_f))
30
log_fat    float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
# Residuals vs. predicted values from the regression


# Ensure that y_pred_f is a Series and aligns with y_test_f
    # If not already the case, can convert y_pred_f to a pandas Series, ensuring it aligns with y_test_f
    # y_pred_f = pd.Series(y_pred_f, index=y_test_f.index)

# Check if there are any index mismatches or length issues
print("Length of y_pred_f:", len(y_pred_f))
print("Length of y_test_f:", len(y_test_f))

if len(y_pred_f) != len(y_test_f):
    print("Error: The predicted and test data lengths do not match.")
else:
    # Calculate residuals
    residuals = y_test_f['log_fat'] - y_pred_f

    # Plotting Residuals vs. Predicted values
    fig, ax = plt.subplots()
    ax.scatter(y_pred_f, residuals, color='blue')
    ax.axhline(y=0, color='red', linestyle='--')
    ax.set_xlabel('Predicted Log Fatalities')
    ax.set_ylabel('Residuals')
    ax.set_title('Residual Plot')
    plt.show()

    # Compute and print R-squared and RMSE
    r2 = r2_score(y_test_f, y_pred_f)
    rmse = mean_squared_error(y_test_f, y_pred_f, squared=False)
    print(f'R-squared: {r2:.3f}')
    print(f'RMSE: {rmse:.3f}')

    # Residuals vs. time
    fig,ax=plt.subplots(figsize=(20,10))
    ax.scatter(y_test_f.index, y_test_f['log_fat'] - y_pred_f,color='b',s=20)
    ax.axhline(y=0, color='red', linestyle='--')
    ax.set_xlabel('Year')
    ax.set_ylabel('Residual')
    ax.set_title('Ordered Residuals')
    plt.show()
Length of y_pred_f: 30
Length of y_test_f: 30

png

R-squared: 0.567
RMSE: 0.446

png

# Let's also evaluate how normal the distribution of the residuals may be, or not be
fig,ax=plt.subplots()
ax.hist(y_test_f['log_fat'] - y_pred_f)
ax.set_title('Distribution of residuals')
Text(0.5, 1.0, 'Distribution of residuals')

png

# Is this model of the relationship between log fatalities and log injuries performing well? 
#Yes, data seem to be randomly distributed around the zero line
    # The residual plots measure how far off the regression predicted values are from the actual test data points
    # Residuals do not trend in a particular direction as the predicted values increase
    # model’s assumptions of linearity and homogeneity of variance (equal variance across all levels of the independent variable) are reasonably met
    #  suggest that this model, while not perfect (some outliers far away from center line), is generally well-specified for the relationship between log injuries and log fatalities 
    # no apparent violations of the major assumptions required for linear regression 
    # The primary takeaway is that the model appears robust across different values and over time
# R squared
# The R squared in the OLS regression summary is calculated using the training data, which reflects how well the model explains the variance in the training set
# The R squared in the r2 score was computed on the test data
# This evaluated how well the model predicts on new unseen data

# Conclusion 
    #   The model is performing better on the training data than on the test data
print(model_f.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                log_fat   R-squared:                       0.821
Model:                            OLS   Adj. R-squared:                  0.816
Method:                 Least Squares   F-statistic:                     187.5
Date:                Tue, 14 May 2024   Prob (F-statistic):           6.90e-17
Time:                        22:11:35   Log-Likelihood:                -14.005
No. Observations:                  43   AIC:                             32.01
Df Residuals:                      41   BIC:                             35.53
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.6174      0.494     -5.295      0.000      -3.616      -1.619
log_inj        0.9679      0.071     13.693      0.000       0.825       1.111
==============================================================================
Omnibus:                        1.493   Durbin-Watson:                   2.113
Prob(Omnibus):                  0.474   Jarque-Bera (JB):                0.847
Skew:                           0.330   Prob(JB):                        0.655
Kurtosis:                       3.193   Cond. No.                         67.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
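
As a quick check of the gap between in-sample and out-of-sample fit (a small sketch reusing the objects defined above):

# Compare R^2 on the training data (what the OLS summary reports) vs. the held-out test data
r2_train = r2_score(y_train_f, model_f.predict(X_train_f))
r2_test = r2_score(y_test_f, y_pred_f)
print(f"Train R^2: {r2_train:.3f}, Test R^2: {r2_test:.3f}")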
# Groupby urban vs rural location and save to separate variables
is_urban_inj_data = filtered_tor_data.groupby('is_urban')
print(is_urban_inj_data)
rural_tor_injury = is_urban_inj_data.get_group(0).reset_index()[['yr','inj', 'fat','mag', 'len', 'wid','is_night']]
urban_tor_injury = is_urban_inj_data.get_group(1).reset_index()[['yr','inj', 'fat','mag', 'len', 'wid','is_night']]

rural_nonzero = rural_tor_injury[rural_tor_injury['inj'] > 0]
urban_nonzero = urban_tor_injury[urban_tor_injury['inj'] > 0]

## Print summary stats, perform skewness checks, then generate histogram and fit distribution curves to our data

# Stats

print(f" Descriptive Summary statistics for total tornado injuries from 1950-2022 {'\n'} {rural_nonzero['inj'].describe()}")
print(f" Descriptive Summary statistics for total tornado injuries from 1950-2022 {'\n'} {urban_nonzero['inj'].describe()}")


rural_tor_inj_stats = rural_nonzero['inj'].describe()

print("\n")
urban_tor_inj_stats = urban_nonzero['inj'].describe()

print("\n")

# Mean injuries per injury-causing rural tornado
mean_inj_annual_r = rural_tor_inj_stats.loc["mean"]

# Mean injuries per injury-causing urban tornado
mean_inj_annual_u = urban_tor_inj_stats.loc["mean"]

# print(f"mean number of injuries occuring annually {round(mean_inj_annual)}")

# Calculate the skewness of the rural injury counts
skewness_r = rural_nonzero['inj'].skew()

# Calculate the skewness of the urban injury counts
skewness_u = urban_nonzero['inj'].skew()


# Print the skewness
print(f"Rural Skewness: {skewness_r}")

# Print the skewness
print(f"Urban Skewness: {skewness_u}")

# Calculate the quartiles
Q1_r = rural_nonzero['inj'].quantile(0.25)
Q2_r = rural_nonzero['inj'].quantile(0.5)
Q3_r = rural_nonzero['inj'].quantile(0.75)

# Calculate the quartiles
Q1_u = urban_nonzero['inj'].quantile(0.25)
Q2_u = urban_nonzero['inj'].quantile(0.5)
Q3_u = urban_nonzero['inj'].quantile(0.75)


# Calculate the Yule-Kendall index
yule_kendall_index_r = ((Q3_r - Q2_r) - (Q2_r - Q1_r)) / (Q3_r - Q1_r)
print(f"Rural Yule-Kendall index: {yule_kendall_index_r}")



# Calculate the Yule-Kendall index
yule_kendall_index_u = ((Q3_u - Q2_u) - (Q2_u - Q1_u)) / (Q3_u - Q1_u)
print(f"Urban Yule-Kendall index: {yule_kendall_index_u}")


# Calculate the IQR
IQR_r = Q3_r - Q1_r
print(f"Rural Interquartile Range (IQR): {IQR_r}")

# Calculate the IQR
IQR_u = Q3_u - Q1_u
print(f"Urban Interquartile Range (IQR): {IQR_u}")

#========================


# Create figure
fig,axes = plt.subplots(1,2, figsize=(14,8))


# bins
# calculate the number of bins
# bins_r = int(np.sqrt(len(rural_nonzero['inj'])))

# # calculate the number of bins
# bins_u = int(np.sqrt(len(urban_nonzero['inj'])))

# # Adjust bins
bins_r = 8
bins_u = 8

# # calculate the bin width
# bin_width_r = (rural_nonzero['inj'].max() - rural_nonzero['inj'].min()) / bins_r
# # calculate the bin width
# bin_width_u = (urban_nonzero['inj'].max() - urban_nonzero['inj'].min()) / bins_u

# print(f"Number of bins: {bins}")
# print(f"Bin width: {bin_width}")

print('\n')

# Plot the histogram for injury counts
axes[0].hist(rural_nonzero['inj'], bins=bins_r,density=False, label='Rural')

# Add vertical lines to mark mean and median
axes[0].axvline(rural_tor_inj_stats.loc['50%'], color='r', linestyle=':', label='Median rural injury count', zorder=2) 
axes[0].axvline(rural_tor_inj_stats.loc["mean"],color='b', linestyle=':', label='Mean rural injury count', zorder=3)


# Plot the histogram for injury counts
axes[1].hist(urban_nonzero['inj'], bins=bins_u,density=False, color='orange', label="Urban")

# Add vertical lines to mark mean and median
axes[1].axvline(urban_tor_inj_stats.loc['50%'], color='r', linestyle=':', label='Median urban injury count', zorder=2) 
axes[1].axvline(urban_tor_inj_stats.loc["mean"],color='b', linestyle=':', label='Mean urban injury count', zorder=3)

#========================


# Fit a theoretical normal distribution to our data:

# 1. Calculate parameters for theoretical distribution:

    # Calculate parameters for the normal distribution:  mean and standard deviation
mu_r, std_r = np.mean(rural_nonzero['inj']), np.std(rural_nonzero['inj'])
    # Calculate parameters for the normal distribution:  mean and standard deviation
mu_u, std_u = np.mean(urban_nonzero['inj']), np.std(urban_nonzero['inj'])


# 2. Generate x values for the normal pdf

# x_r = np.linspace(min(rural_nonzero['inj']), max(rural_nonzero['inj']), 100)
# x_u = np.linspace(min(urban_nonzero['inj']), max(urban_nonzero['inj']), 100)

# x_r = np.linspace(min(rural_nonzero['log_inj']), max(rural_nonzero['boxcox_inj']), 100)
x_r = np.linspace(stats.norm.ppf(0.01, mu_r, std_r), 
                stats.norm.ppf(0.99, mu_r, std_r), 100)

# x_u = np.linspace(min(urban_nonzero['log_inj']), max(urban_nonzero['log_inj']), 100)
x_u = np.linspace(stats.norm.ppf(0.01, mu_u, std_u), 
                stats.norm.ppf(0.99, mu_u, std_u), 100)
# 3.  Using the parameters obtained, compute the pdf
# Scale PDF to match histogram count
scale_r = len(rural_nonzero['inj']) * (rural_nonzero['inj'].max() - rural_nonzero['inj'].min()) / bins_r
scale_u = len(urban_nonzero['inj']) * (urban_nonzero['inj'].max() - urban_nonzero['inj'].min()) / bins_u

pdf_r = norm.pdf(x_r, mu_r, std_r) * scale_r
pdf_u = norm.pdf(x_u, mu_u, std_u) * scale_u


# 4. Generate the plots
    # Plot normal PDF
axes[0].plot(x_r, pdf_r, 'r-', lw=5, alpha=0.6, label='norm pdf')
axes[1].plot(x_u, pdf_u, 'r-', lw=5, alpha=0.6, label='norm pdf')


# Label plots 
axes[0].set_title("Injuries:", color="orange", loc="left")
axes[0].set_title("Distibution of Rural US Tornado Injuries (1950-2022)", loc="right")
axes[0].set_ylabel("Tornado")
axes[0].set_xlabel("Injuries")
axes[0].legend()
# Label plot 
axes[1].set_title("Injuries:", color="orange", loc="left")
axes[1].set_title("Distibution of Urban US Tornado Injuries (1950-2022)", loc="right")
axes[1].set_ylabel("Tornado Counts")
axes[1].set_xlabel("Injuries")
axes[1].legend()

plt.tight_layout()

#========================
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000295CF6CE630>
 Descriptive summary statistics for rural tornado injuries (non-zero) from 1950-2022 
 count    6618.000000
mean       11.259142
std        37.780761
min         1.000000
25%         1.000000
50%         3.000000
75%         7.000000
max       844.000000
Name: inj, dtype: float64
 Descriptive summary statistics for urban tornado injuries (non-zero) from 1950-2022 
 count    1140.000000
mean       20.123684
std       102.358138
min         1.000000
25%         1.000000
50%         3.000000
75%         9.000000
max      1740.000000
Name: inj, dtype: float64




Rural Skewness: 9.737194832995467
Urban Skewness: 11.596327876911191
Rural Yule-Kendall index: 0.3333333333333333
Urban Yule-Kendall index: 0.5
Rural Interquartile Range (IQR): 6.0
Urban Interquartile Range (IQR): 8.0

png

# Notes on log-transformed data:
    # In this log-log model, the independent variable X (log of injuries) and the dependent variable Y (log of fatalities) are both transformed by taking the natural logarithm
    # When X = 0: in the transformed scale, this corresponds to exp(0) = 1 on the original scale due to properties of logarithms
    # Therefore, β0 tells us the expected value of Y when X is 0 in the transformed scale, i.e., when there is one injury (since log(1) = 0)
    # Exponential interpretation:
    # The interpretation of β0 = -2.6174 in the transformed scale is exp(-2.6174) ≈ 0.073 in the original scale,
    # which gives the baseline number of fatalities when there is exactly one injury
# Normality assessment: data are mostly at or near zero, right-skewed, and not normally distributed
    # Distribution characteristics: both datasets show high skewness, indicating they are heavily right-skewed
    # Mean vs. median: the mean is substantially higher than the median in both datasets, emphasizing the presence of outliers
    # Spread: the IQR and standard deviation indicate high variability, especially in the urban data
# Conclusion:
    # Right-skewed distribution: the original data exhibit significant right skewness, meaning they are not normally distributed
    # Outliers: the presence of outliers is evident, with values much higher than the 75th percentile
    # Transformation needed: the skewness and variability suggest that a log or Box-Cox transformation could bring the data closer to normal for modeling
    # Given the wide range of values, with many at or near zero, Box-Cox is more suitable
    # Also, our goal is to make the data closer to a normal distribution, and Box-Cox is designed to find the optimal transformation to achieve normality
# We learned our data do not have a normal distribution, which we would need for subsequent analyses such as hypothesis testing
# The data are highly positively skewed and have an abundance of near-zero values
# We are not interested in zero-value injury counts, so we can focus on comparing injury counts above 0
# To transform the data closer to a Gaussian distribution, we apply a Box-Cox transformation to the injury counts
    # The Box-Cox transformation is designed to make data more normally distributed and stabilize variance
    # Original injury counts are discrete, i.e., each injury count is a whole number (e.g., the rural dataset min is 1, max is 844)
    # The Box-Cox-transformed data are continuous
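    # For reference, scipy.stats.boxcox fits the power parameter lambda by maximum likelihood and applies:
    #   y(lambda) = (y**lambda - 1) / lambda   if lambda != 0
    #   y(lambda) = log(y)                     if lambda == 0
    # boxcox() returns (transformed values, fitted lambda); the fitted lambda is discarded with `_` below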

rural_nonzero = rural_nonzero.copy()  # Make a copy to avoid modifying the original data
urban_nonzero = urban_nonzero.copy()  # Make a copy to avoid modifying the original data

# Filter non-zero data and ensure strictly positive values for Box-Cox
rural_nonzero = rural_nonzero[rural_nonzero['inj'] > 0]
urban_nonzero = urban_nonzero[urban_nonzero['inj'] > 0]


# Apply Box-Cox transformation and store the result in a new column
rural_nonzero['boxcox_inj'], _ = boxcox(rural_nonzero['inj'])
urban_nonzero['boxcox_inj'], _ = boxcox(urban_nonzero['inj'])


## Print summary stats, perform skewness checks, then generate histogram and fit distribution curves to our data

# Stats

# Print descriptive statistics
print(f" Descriptive Summary statistics for total tornado injuries from 1950-2022 {'\n'} {rural_nonzero['boxcox_inj'].describe()}")
print(f" Descriptive Summary statistics for total tornado injuries from 1950-2022 {'\n'} {urban_nonzero['boxcox_inj'].describe()}")


rural_tor_inj_stats = rural_nonzero['boxcox_inj'].describe()

print("\n")
urban_tor_inj_stats = urban_nonzero['boxcox_inj'].describe()

print("\n")

# Mean Box-Cox-transformed injury count per rural tornado
mean_inj_annual_r = rural_tor_inj_stats.loc["mean"]

# Mean Box-Cox-transformed injury count per urban tornado
mean_inj_annual_u = urban_tor_inj_stats.loc["mean"]

# print(f"mean number of injuries occuring annually {round(mean_inj_annual)}")

# Calculate the skewness of the Box-Cox-transformed rural injury counts
skewness_r = rural_nonzero['boxcox_inj'].skew()

# Calculate the skewness of the Box-Cox-transformed urban injury counts
skewness_u = urban_nonzero['boxcox_inj'].skew()


# Print the skewness
print(f"Rural Skewness: {skewness_r}")

# Print the skewness
print(f"Urban Skewness: {skewness_u}")

# Calculate the quartiles
Q1_r = rural_nonzero['boxcox_inj'].quantile(0.25)
Q2_r = rural_nonzero['boxcox_inj'].quantile(0.5)
Q3_r = rural_nonzero['boxcox_inj'].quantile(0.75)

# Calculate the quartiles
Q1_u = urban_nonzero['boxcox_inj'].quantile(0.25)
Q2_u = urban_nonzero['boxcox_inj'].quantile(0.5)
Q3_u = urban_nonzero['boxcox_inj'].quantile(0.75)


# Calculate the Yule-Kendall index
yule_kendall_index_r = ((Q3_r - Q2_r) - (Q2_r - Q1_r)) / (Q3_r - Q1_r)
print(f"Rural Yule-Kendall index: {yule_kendall_index_r}")



# Calculate the Yule-Kendall index
yule_kendall_index_u = ((Q3_u - Q2_u) - (Q2_u - Q1_u)) / (Q3_u - Q1_u)
print(f"Urban Yule-Kendall index: {yule_kendall_index_u}")


# Calculate the IQR
IQR_r = Q3_r - Q1_r
print(f"Rural Interquartile Range (IQR): {IQR_r}")

# Calculate the IQR
IQR_u = Q3_u - Q1_u
print(f"Urban Interquartile Range (IQR): {IQR_u}")

#========================


# Create figure
fig,axes = plt.subplots(1,2, figsize=(14,8))


# bins
# # calculate the number of bins
# bins_r = int(np.sqrt(len(rural_nonzero['inj'])))

# # calculate the number of bins
# bins_u = int(np.sqrt(len(urban_nonzero['inj'])))

# Adjust bins
bins_r = 7
bins_u = 7
# Define bin edges for both rural and urban datasets
# bin_edges_rural = np.linspace(rural_nonzero['boxcox_inj'].min(), rural_nonzero['boxcox_inj'].max(), num=bins_r+1)
# bin_edges_urban = np.linspace(urban_nonzero['boxcox_inj'].min(), urban_nonzero['boxcox_inj'].max(), num=bins_u+1)

# # calculate the bin width
# bin_width_r = (rural_nonzero['inj'].max() - rural_nonzero['inj'].min()) / bins_r
# # calculate the bin width
# bin_width_u = (urban_nonzero['inj'].max() - urban_nonzero['inj'].min()) / bins_u

# print(f"Number of bins: {bins}")
# print(f"Bin width: {bin_width}")

print('\n')

# Plot the histogram for injury counts
axes[0].hist(rural_nonzero['boxcox_inj'], bins=bins_r,density=False, label='Rural')

# Add vertical lines to mark mean and median
axes[0].axvline(rural_tor_inj_stats.loc['50%'], color='r', linestyle=':', label='Median rural injury count (Box-Cox)', zorder=2) 
axes[0].axvline(rural_tor_inj_stats.loc["mean"],color='b', linestyle=':', label='Mean rural injury count (Box-Cox)', zorder=3)


# Plot the histogram for injury counts
axes[1].hist(urban_nonzero['boxcox_inj'], bins=bins_u,density=False, color='orange', label="Urban")

# Add vertical lines to mark mean and median
axes[1].axvline(urban_tor_inj_stats.loc['50%'], color='r', linestyle=':', label='Median urban injury count (Box-Cox)', zorder=2) 
axes[1].axvline(urban_tor_inj_stats.loc["mean"],color='b', linestyle=':', label='Mean urban injury count (Box-Cox)', zorder=3)

#========================


# Fit a theoretical normal distribution to our data:

# 1. Calculate parameters for theoretical distribution:

    # Calculate parameters for the normal distribution:  mean and standard deviation
mu_r, std_r = np.mean(rural_nonzero['boxcox_inj']), np.std(rural_nonzero['boxcox_inj'])
    # Calculate parameters for the normal distribution:  mean and standard deviation
mu_u, std_u = np.mean(urban_nonzero['boxcox_inj']), np.std(urban_nonzero['boxcox_inj'])


# 2. Generate x values for the normal pdf

# x_r = np.linspace(min(rural_nonzero['log_inj']), max(rural_nonzero['log_inj']), 100)
x_r = np.linspace(stats.norm.ppf(0.01, mu_r, std_r), 
                stats.norm.ppf(0.99, mu_r, std_r), 100)

# x_u = np.linspace(min(urban_nonzero['log_inj']), max(urban_nonzero['log_inj']), 100)
x_u = np.linspace(stats.norm.ppf(0.01, mu_u, std_u), 
                stats.norm.ppf(0.99, mu_u, std_u), 100)




# 3.  Using the parameters obtained, compute the pdf
# Scale PDF to match histogram count
scale_r = len(rural_nonzero['boxcox_inj']) * (rural_nonzero['boxcox_inj'].max() - rural_nonzero['boxcox_inj'].min()) / bins_r
scale_u = len(urban_nonzero['boxcox_inj']) * (urban_nonzero['boxcox_inj'].max() - urban_nonzero['boxcox_inj'].min()) / bins_u

pdf_r = norm.pdf(x_r, mu_r, std_r) * scale_r
pdf_u = norm.pdf(x_u, mu_u, std_u) * scale_u


# 4. Generate the plots
    # Plot normal PDF
axes[0].plot(x_r, pdf_r, 'r-', lw=5, alpha=0.6, label='norm pdf')
axes[1].plot(x_u, pdf_u, 'r-', lw=5, alpha=0.6, label='norm pdf')



# Label plots 
axes[0].set_title("Injuries:", color="orange", loc="left")
axes[0].set_title("Distibution of Rural US Tornado Injuries (1950-2022)", loc="right")
axes[0].set_ylabel("Counts")
axes[0].set_xlabel("Transformed Injuries (Box-Cox)")
axes[0].legend()

axes[1].set_title("Injuries:", color="orange", loc="left")
axes[1].set_title("Distibution of Urban US Tornado Injuries (1950-2022)", loc="right")
axes[1].set_ylabel("Counts")
axes[1].set_xlabel("Transformed Injuries (Box-Cox)")
axes[1].legend()

plt.tight_layout()
#========================
 Descriptive Summary statistics for total tornado injuries from 1950-2022 
 count    6618.000000
mean        0.785996
std         0.659445
min         0.000000
25%         0.000000
50%         0.880963
75%         1.331807
max         2.249701
Name: boxcox_inj, dtype: float64
 Descriptive Summary statistics for total tornado injuries from 1950-2022 
 count    1140.000000
mean        0.896069
std         0.725194
min         0.000000
25%         0.000000
50%         0.902424
75%         1.502975
max         2.527979
Name: boxcox_inj, dtype: float64




Rural Skewness: 0.21287735150391573
Urban Skewness: 0.19852852907622287
Rural Yule-Kendall index: -0.3229597358812457
Urban Yule-Kendall index: -0.20085076407621735
Rural Interquartile Range (IQR): 1.331806692988304
Urban Interquartile Range (IQR): 1.5029748759846113

png

# Normality assessment: the Box-Cox-transformed data are still right-skewed, but more symmetrical and closer to a normal distribution
    # Distribution characteristics: the Box-Cox transformation reduces skewness significantly for both datasets
        # Both distributions are more symmetrical now
    # Mean vs. median: the mean and median for both datasets are now closer together, showing less skewness
    # Spread: the IQR is slightly wider for urban data (1.50) than for rural (1.33), indicating a marginally broader distribution of data around the median
    # Conclusion:
        # Rural vs. urban: urban data exhibit a slightly higher average and wider distribution of injuries than rural data,
        # suggesting that tornado injuries in urban areas could have more variation
        # The transformed data are more suitable for statistical modeling that assumes normality, though slight skewness remains
# The highest bin (first bin) is the same for both urban and rural distributions, showing the highest frequency of tornado injuries at the lowest values
# In the urban plot, the fourth bin has the second-highest count of tornado injuries, demonstrating a rightward shift in the urban distribution compared to the rural distribution
# The mean and median are also shifted to the right in the urban plot

# In the urban plot, more data points fall into bins representing higher transformed injury values compared to the rural plot
# Group by day vs. night occurrence and save each group to a separate variable
is_night_fat_data = filtered_tor_data.groupby('is_night')
print(is_night_fat_data)
day_tor_fat = is_night_fat_data.get_group(0).reset_index()[['yr','inj', 'fat','mag', 'len', 'wid','is_night', 'time','hour']]
night_tor_fat = is_night_fat_data.get_group(1).reset_index()[['yr','inj', 'fat','mag', 'len', 'wid','is_night','time', 'hour']]

day_nonzero = day_tor_fat[day_tor_fat['fat'] > 0]
night_nonzero = night_tor_fat[night_tor_fat['fat'] > 0]

## Print summary stats, perform skewness checks, then generate histogram and fit distribution curves to our data

# Stats

print(f" Descriptive Summary statistics for total tornado injuries from 1950-2022 {'\n'} {day_nonzero['fat'].describe()}")
print(f" Descriptive Summary statistics for total tornado injuries from 1950-2022 {'\n'} {night_nonzero['fat'].describe()}")


day_tor_fat_stats = day_nonzero['fat'].describe()

print("\n")
night_tor_fat_stats = night_nonzero['fat'].describe()

print("\n")

# Mean fatality count per day tornado (tornadoes with at least one fatality)
mean_fat_annual_d = day_tor_fat_stats.loc["mean"]

# Mean fatality count per night tornado (tornadoes with at least one fatality)
mean_fat_annual_n = night_tor_fat_stats.loc["mean"]

# print(f"mean number of injuries occuring annually {round(mean_inj_annual)}")

# Calculate the skewness of the day fatality counts
skewness_d = day_nonzero['fat'].skew()

# Calculate the skewness of the night fatality counts
skewness_n = night_nonzero['fat'].skew()


# Print the skewness
print(f"Day Skewness: {skewness_d}")

# Print the skewness
print(f"Night Skewness: {skewness_n}")

# Calculate the quartiles
Q1_d = day_nonzero['fat'].quantile(0.25)
Q2_d = day_nonzero['fat'].quantile(0.5)
Q3_d = day_nonzero['fat'].quantile(0.75)

# Calculate the quartiles
Q1_n = night_nonzero['fat'].quantile(0.25)
Q2_n = night_nonzero['fat'].quantile(0.5)
Q3_n = night_nonzero['fat'].quantile(0.75)


# Calculate the Yule-Kendall index
yule_kendall_index_d = ((Q3_d - Q2_d) - (Q2_d - Q1_d)) / (Q3_d - Q1_d)
print(f"Day Yule-Kendall index: {yule_kendall_index_d}")



# Calculate the Yule-Kendall index
yule_kendall_index_n = ((Q3_n - Q2_n) - (Q2_n - Q1_n)) / (Q3_n - Q1_n)
print(f"Night Yule-Kendall index: {yule_kendall_index_n}")


# Calculate the IQR
IQR_d = Q3_d - Q1_d
print(f"Day Interquartile Range (IQR): {IQR_d}")

# Calculate the IQR
IQR_n = Q3_n - Q1_n
print(f"Night Interquartile Range (IQR): {IQR_n}")

#========================


# Create figure
fig,axes = plt.subplots(1,2, figsize=(14,8))


# bins
# calculate the number of bins
# bins_r = int(np.sqrt(len(rural_nonzero['inj'])))

# # calculate the number of bins
# bins_u = int(np.sqrt(len(urban_nonzero['inj'])))

# # Adjust bins
bins_d = 8
bins_n = 8

# # calculate the bin width
# bin_width_r = (rural_nonzero['inj'].max() - rural_nonzero['inj'].min()) / bins_r
# # calculate the bin width
# bin_width_u = (urban_nonzero['inj'].max() - urban_nonzero['inj'].min()) / bins_u

# print(f"Number of bins: {bins}")
# print(f"Bin width: {bin_width}")

print('\n')

# Plot the histogram for fatality counts
axes[0].hist(day_nonzero['fat'], bins=bins_d,density=False, label='Day', color="green")

# Add vertical lines to mark mean and median
axes[0].axvline(day_tor_fat_stats.loc['50%'], color='r', linestyle=':', label='Median day fatality count', zorder=2) 
axes[0].axvline(day_tor_fat_stats.loc["mean"],color='b', linestyle=':', label='Mean day fatality count', zorder=3)


# Plot the histogram for fatality counts
axes[1].hist(night_nonzero['fat'], bins=bins_n,density=False, color='red', label="Night")

# Add vertical lines to mark mean and median
axes[1].axvline(night_tor_fat_stats.loc['50%'], color='r', linestyle=':', label='Median night fatality count', zorder=2) 
axes[1].axvline(night_tor_fat_stats.loc["mean"],color='b', linestyle=':', label='Mean night fatality count', zorder=3)

#========================


# Fit a theoretical normal distribution to our data:

# 1. Calculate parameters for theoretical distribution:

    # Calculate parameters for the normal distribution:  mean and standard deviation
mu_d, std_d = np.mean(day_nonzero['fat']), np.std(day_nonzero['fat'])
    # Calculate parameters for the normal distribution:  mean and standard deviation
mu_n, std_n = np.mean(night_nonzero['fat']), np.std(night_nonzero['fat'])


# 2. Generate x values for the normal pdf

# x_r = np.linspace(min(rural_nonzero['inj']), max(rural_nonzero['inj']), 100)
# x_u = np.linspace(min(urban_nonzero['inj']), max(urban_nonzero['inj']), 100)

# x_r = np.linspace(min(rural_nonzero['log_inj']), max(rural_nonzero['boxcox_inj']), 100)
x_d = np.linspace(stats.norm.ppf(0.01, mu_d, std_d), 
                stats.norm.ppf(0.99, mu_d, std_d), 100)

# x_u = np.linspace(min(urban_nonzero['log_inj']), max(urban_nonzero['log_inj']), 100)
x_n = np.linspace(stats.norm.ppf(0.01, mu_n, std_n), 
                stats.norm.ppf(0.99, mu_n, std_n), 100)
# 3.  Using the parameters obtained, compute the pdf
# Scale PDF to match histogram count
scale_d = len(day_nonzero['fat']) * (day_nonzero['fat'].max() - day_nonzero['fat'].min()) / bins_d
scale_n = len(night_nonzero['fat']) * (night_nonzero['fat'].max() - night_nonzero['fat'].min()) / bins_n

pdf_d = norm.pdf(x_d, mu_d, std_d) * scale_d
pdf_n = norm.pdf(x_n, mu_n, std_n) * scale_n


# 4. Generate the plots
    # Plot normal PDF
axes[0].plot(x_d, pdf_d, 'k-', lw=5, alpha=0.6, label='norm pdf')
axes[1].plot(x_n, pdf_n, 'k-', lw=5, alpha=0.6, label='norm pdf')


# Label plots 
axes[0].set_title("Fatalities:", color="red", loc="left")
axes[0].set_title("Distibution of Day US Tornado Fatalities (1950-2022)", loc="right")
axes[0].set_ylabel("Tornado")
axes[0].set_xlabel("Fatalities")
axes[0].legend()

axes[1].set_title("Fatalities:", color="red", loc="left")
axes[1].set_title("Distibution of Night US Tornado Fatalities (1950-2022)", loc="right")
axes[1].set_ylabel("Tornado Counts")
axes[1].set_xlabel("Fatalities")
axes[1].legend()

plt.tight_layout()
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000295DCF91D00>
 Descriptive Summary statistics for total tornado fatalities from 1950-2022 
 count    1218.000000
mean        4.128900
std         9.645856
min         1.000000
25%         1.000000
50%         1.000000
75%         3.000000
max       158.000000
Name: fat, dtype: float64
 Descriptive Summary statistics for total tornado fatalities from 1950-2022 
 count    355.000000
mean       3.115493
std        5.815007
min        1.000000
25%        1.000000
50%        2.000000
75%        3.000000
max       80.000000
Name: fat, dtype: float64




Day Skewness: 8.170049802134653
Night Skewness: 8.04904770983996
Day Yule-Kendall index: 1.0
Night Yule-Kendall index: 0.0
Day Interquartile Range (IQR): 2.0
Night Interquartile Range (IQR): 2.0

png

day_nonzero = day_nonzero.copy()  # Make a copy to avoid modifying the original data
night_nonzero = night_nonzero.copy()  # Make a copy to avoid modifying the original data

# Filter non-zero data and ensure strictly positive values for Box-Cox
day_nonzero = day_nonzero[day_nonzero['fat'] > 0]
night_nonzero = night_nonzero[night_nonzero['fat'] > 0]


# Apply Box-Cox transformation and store the result in a new column
day_nonzero['boxcox_fat'], _ = boxcox(day_nonzero['fat'])
night_nonzero['boxcox_fat'], _ = boxcox(night_nonzero['fat'])

## Print summary stats, perform skewness checks, then generate histogram and fit distribution curves to our data

# Stats

print(f" Descriptive Summary statistics for total tornado fatalities from 1950-2022 {'\n'} {day_nonzero ['boxcox_fat'].describe()}")
print(f" Descriptive Summary statistics for total tornado fatalities from 1950-2022 {'\n'} {night_nonzero['boxcox_fat'].describe()}")


day_tor_fat_stats = day_nonzero['boxcox_fat'].describe()

print("\n")
night_tor_fat_stats = night_nonzero['boxcox_fat'].describe()

print("\n")

# Mean Box-Cox-transformed fatality count per day tornado
mean_fat_annual_d = day_tor_fat_stats.loc["mean"]
print(mean_fat_annual_d)
print(f"mean Box-Cox-transformed day fatality count (rounded): {round(mean_fat_annual_d)}")
# Mean Box-Cox-transformed fatality count per night tornado
mean_fat_annual_n = night_tor_fat_stats.loc["mean"]
print(mean_fat_annual_n)
print(f"mean Box-Cox-transformed night fatality count (rounded): {round(mean_fat_annual_n)}")

# Calculate the skewness of the Box-Cox-transformed day fatality counts
skewness_d = day_nonzero['boxcox_fat'].skew()

# Calculate the skewness of the Box-Cox-transformed night fatality counts
skewness_n = night_nonzero['boxcox_fat'].skew()


# Print the skewness
print(f"Day Skewness: {skewness_d}")

# Print the skewness
print(f"Night Skewness: {skewness_n}")

# Calculate the quartiles
Q1_d = day_nonzero['boxcox_fat'].quantile(0.25)
Q2_d = day_nonzero['boxcox_fat'].quantile(0.5)
Q3_d = day_nonzero['boxcox_fat'].quantile(0.75)

# Calculate the quartiles
Q1_n = night_nonzero['boxcox_fat'].quantile(0.25)
Q2_n = night_nonzero['boxcox_fat'].quantile(0.5)
Q3_n = night_nonzero['boxcox_fat'].quantile(0.75)


# Calculate the Yule-Kendall index
yule_kendall_index_d = ((Q3_d - Q2_d) - (Q2_d - Q1_d)) / (Q3_d - Q1_d)
print(f"Day Yule-Kendall index: {yule_kendall_index_d}")



# Calculate the Yule-Kendall index
yule_kendall_index_n = ((Q3_n - Q2_n) - (Q2_n - Q1_n)) / (Q3_n - Q1_n)
print(f"Night Yule-Kendall index: {yule_kendall_index_n}")


# Calculate the IQR
IQR_d = Q3_d - Q1_d
print(f"Day Interquartile Range (IQR): {IQR_d}")

# Calculate the IQR
IQR_n = Q3_n - Q1_n
print(f"Night Interquartile Range (IQR): {IQR_n}")

#========================


# Create figure
fig,axes = plt.subplots(1,2, figsize=(14,8))


# bins
# calculate the number of bins
# bins_r = int(np.sqrt(len(rural_nonzero['inj'])))

# # calculate the number of bins
# bins_u = int(np.sqrt(len(urban_nonzero['inj'])))

# # Adjust bins
bins_d = 4
bins_n = 4

# # calculate the bin width
# bin_width_r = (rural_nonzero['inj'].max() - rural_nonzero['inj'].min()) / bins_r
# # calculate the bin width
# bin_width_u = (urban_nonzero['inj'].max() - urban_nonzero['inj'].min()) / bins_u

# print(f"Number of bins: {bins}")
# print(f"Bin width: {bin_width}")

print('\n')

# Plot the histogram for fatality counts
axes[0].hist(day_nonzero['boxcox_fat'], bins=bins_d,density=False, label='Day', color="green")

# Add vertical lines to mark mean and median
axes[0].axvline(day_tor_fat_stats.loc['50%'], color='r', linestyle=':', label='Median day fatality count (Box-Cox)', zorder=2) 
axes[0].axvline(day_tor_fat_stats.loc["mean"],color='b', linestyle=':', label='Mean day fatality count (Box-Cox)', zorder=3)


# Plot the histogram for fatality counts
axes[1].hist(night_nonzero['boxcox_fat'], bins=bins_n,density=False, color='red', label="Night")

# Add vertical lines to mark mean and median
axes[1].axvline(night_tor_fat_stats.loc['50%'], color='r', linestyle=':', label='Median night fatality count (Box-Cox)', zorder=2) 
axes[1].axvline(night_tor_fat_stats.loc["mean"],color='b', linestyle=':', label='Mean night fatality count (Box-Cox)', zorder=3)

#========================


# Fit a theoretical normal distribution to our data:

# 1. Calculate parameters for theoretical distribution:

    # Calculate parameters for the normal distribution:  mean and standard deviation
mu_d, std_d = np.mean(day_nonzero['boxcox_fat']), np.std(day_nonzero['boxcox_fat'])
    # Calculate parameters for the normal distribution:  mean and standard deviation
mu_n, std_n = np.mean(night_nonzero['boxcox_fat']), np.std(night_nonzero['boxcox_fat'])


# 2. Generate x values for the normal pdf

# x_r = np.linspace(min(rural_nonzero['inj']), max(rural_nonzero['inj']), 100)
# x_u = np.linspace(min(urban_nonzero['inj']), max(urban_nonzero['inj']), 100)

# x_r = np.linspace(min(rural_nonzero['log_inj']), max(rural_nonzero['boxcox_inj']), 100)
x_d = np.linspace(stats.norm.ppf(0.01, mu_d, std_d), 
                stats.norm.ppf(0.99, mu_d, std_d), 100)

# x_u = np.linspace(min(urban_nonzero['log_inj']), max(urban_nonzero['log_inj']), 100)
x_n = np.linspace(stats.norm.ppf(0.01, mu_n, std_n), 
                stats.norm.ppf(0.99, mu_n, std_n), 100)
# 3.  Using the parameters obtained, compute the pdf
# Scale PDF to match histogram count
scale_d = len(day_nonzero['boxcox_fat']) * (day_nonzero['boxcox_fat'].max() - day_nonzero['boxcox_fat'].min()) / bins_d
scale_n = len(night_nonzero['boxcox_fat']) * (night_nonzero['boxcox_fat'].max() - night_nonzero['boxcox_fat'].min()) / bins_n

pdf_d = norm.pdf(x_d, mu_d, std_d) * scale_d
pdf_n = norm.pdf(x_n, mu_n, std_n) * scale_n


# 4. Generate the plots
    # Plot normal PDF
axes[0].plot(x_d, pdf_d, 'k-', lw=5, alpha=0.6, label='norm pdf')
axes[1].plot(x_n, pdf_n, 'k-', lw=5, alpha=0.6, label='norm pdf')


# Label plots 
axes[0].set_title("Fatalities:", color="red", loc="left")
axes[0].set_title("Distibution of Day US Tornado Fatalities (1950-2022)", loc="right")
axes[0].set_ylabel("Tornado")
axes[0].set_xlabel("Transformed Fatalities (Box-Cox)")
axes[0].legend()
# Label plot 
axes[1].set_title("Fatalities:", color="red", loc="left")
axes[1].set_title("Distibution of Night US Tornado Fatalities (1950-2022)", loc="right")
axes[1].set_ylabel("Tornado Counts")
axes[1].set_xlabel("Transformed Fatalities (Box-Cox)")
axes[1].legend()

plt.tight_layout()
 Descriptive Summary statistics for total tornado fatalities from 1950-2022 
 count    1218.000000
mean        0.355496
std         0.405708
min         0.000000
25%         0.000000
50%         0.000000
75%         0.706610
max         1.132169
Name: boxcox_fat, dtype: float64
 Descriptive Summary statistics for total tornado fatalities from 1950-2022 
 count    355.000000
mean       0.374933
std        0.383942
min        0.000000
25%        0.000000
50%        0.526673
75%        0.719931
max        1.170017
Name: boxcox_fat, dtype: float64




0.35549616683406865
mean Box-Cox-transformed day fatality count (rounded): 0
0.37493285161910495
mean Box-Cox-transformed night fatality count (rounded): 0
Day Skewness: 0.4766521322421027
Night Skewness: 0.3307083700646506
Day Yule-Kendall index: 1.0
Night Yule-Kendall index: -0.46312079144374874
Day Interquartile Range (IQR): 0.7066099601722811
Night Interquartile Range (IQR): 0.7199310152287951

png

Hypothesis test 1: Do tornado injuries differ between rural and urban locations?

# Mann-Whitney U test 
# a non-parametric test, meaning it does not require the assumption of normality for the data
# compares the ranks of two independent samples to determine whether one distribution tends to yield larger values than the other

# Two-sided test to see if distributions differ
two_side = mannwhitneyu(rural_nonzero['boxcox_inj'], urban_nonzero['boxcox_inj'], alternative='two-sided')
print(f"Two-sided test result: {two_side}")
# REJECT the null, distributions are different

# One-sided test.  Is my rural_tor_injury distribution less than my urban_tor_injury distribution?
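# Note: in scipy.stats.mannwhitneyu, alternative='less' tests whether the distribution underlying
# the first sample (rural) is stochastically less than the distribution underlying the second (urban)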
one_side = mannwhitneyu(rural_nonzero['boxcox_inj'], urban_nonzero['boxcox_inj'], alternative='less')
print(f"One-sided test result: {one_side}")
# REJECT the null: rural locations have lower injury counts than urban locations
Two-sided test result: MannwhitneyuResult(statistic=3324003.0, pvalue=6.261042406910288e-11)
One-sided test result: MannwhitneyuResult(statistic=3324003.0, pvalue=3.130521203455144e-11)
# The Chi-square test 
# examines differences in the distribution of categorical data 
# Calculate the contingency table
contingency_table = [[sum(rural_tor_injury['inj'] > 0), len(rural_tor_injury['inj']) - sum(rural_tor_injury['inj'] > 0)],
                     [sum(urban_tor_injury['inj'] > 0), len(urban_tor_injury['inj']) - sum(urban_tor_injury['inj'] > 0)]]
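# Table layout -- rows: [rural, urban]; columns: [tornadoes with at least one injury, tornadoes with zero injuries]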

# Chi-square test for proportions
chi2, p, _, _ = chi2_contingency(contingency_table)
print(f'Chi-square Statistic={chi2}, p-value={p}')
Chi-square Statistic=357.166515612783, p-value=1.1656648993911099e-79
# Conclusions for tornadoes impact in urban locations: 


# The Chi-square test: 
    # The chi-square test examines the differences in the distribution of categorical data 
    # significant p-value suggests a statistically significant difference in the proportion of tornadoes causing injuries between rural and urban locations
# Mann-Whitney U test: 
    # The p-values for the two-sided and one-sided tests are very small, indicating statistically significant differences in the distribution of Box-Cox-transformed injuries between rural and urban data
    # The one-sided test shows that the distribution of injuries in rural locations is significantly less than in urban locations
    # Indicates that rural locations have statistically fewer tornado injuries than urban locations

# Conclusion: 
    # The results indicate that tornado injuries are statistically fewer in rural locations compared to urban locations
    # There are statistically significant differences in the proportion of tornadoes causing injuries between rural and urban locations  
    # Could be due to increased population density and increased infrastructure (e.g., more buildings that could lead to higher injuries due to flying debris and structural collapses)

Hypothesis test 2: Do tornado fatalities differ between day and night occurrences?

# Mann-Whitney U test 
# a non-parametric test, meaning it does not require the assumption of normality for the data
# compares the ranks of two independent samples to determine whether one distribution tends to yield larger values than the other


# Two-sided test to see if distributions differ
two_side = mannwhitneyu(day_nonzero['boxcox_fat'], night_nonzero['boxcox_fat'], alternative='two-sided')
print(f"Two-sided test result: {two_side}")
# REJECT the null, distributions are different between day and night tornadoes

# One-sided test. Is my day fatality distribution less than my night fatality distribution?
one_side = mannwhitneyu(day_nonzero['boxcox_fat'], night_nonzero['boxcox_fat'], alternative='less')
print(f"One-sided test result: {one_side}")
# REJECT the null: day tornadoes have lower fatality counts than night tornadoes
Two-sided test result: MannwhitneyuResult(statistic=201508.0, pvalue=0.03511812117143068)
One-sided test result: MannwhitneyuResult(statistic=201508.0, pvalue=0.01755906058571534)
# Conclusions for tornadoes impact during night versus day: 

# Chi-square Test:
    # A significant p-value of the Chi-square test indicates a statistically significant difference in the proportions
    #  night tornadoes have a significantly different proportion of fatalities compared to day tornadoes
# Mann-Whitney U Test:
    # The Mann-Whitney U test is a non-parametric test that compares the ranks of two independent samples to determine whether one distribution tends to yield larger values than the other
    # The two-sided test shows a significant difference in the distribution of fatalities between day and night tornadoes 
    # This suggests that the distributions of fatalities differ significantly between these two time periods
    # The one-sided test shows that night tornadoes have significantly higher fatality counts compared to day tornadoes

#Conclusion:
    # Fatalities are statistically higher in night tornadoes than in day tornadoes
    # There are statistically significant differences in the proportion of tornadoes causing fatalities between day and night   
    # Could be due to factors like reduced visibility, slower response times, and people being less prepared at night

XII. Exploratory Analysis Part 3: Understanding factors contributing to increased injuries and fatalities

  • Goal: develop an OLS linear regression model
fig, axes = plt.subplots(2,1 ,figsize=(12,10))

                      # plot injuries   # groupby tornado magnitude
filtered_tor_data.boxplot(column='inj', by='mag', ax=axes[0])
axes[0].set_title("Injuries by tornado magnitude from 1950-2022")
axes[0].set_ylabel("Injury Counts")


# Add a small constant to avoid taking the log of 0, which is undefined for 0 and negative numbers
#  np.log(x + 1) effectively shifting all data up by one unit 
#  so that the minimum value is 1 instead of 0, and then taking the logarithm
filtered_tor_data['log_inj'] = np.log(filtered_tor_data['inj'] + 1)
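# Note: np.log1p(filtered_tor_data['inj']) computes the same log(1 + x) quantity and is more precise for values near zero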



# Plot log of injuries by tornado magnitude
filtered_tor_data.boxplot(column='log_inj', by='mag', ax=axes[1])
axes[1].set_title("Log of injuries by tornado magnitude from 1950-2022")
axes[1].set_ylabel("Log of Injury Counts")
plt.suptitle(None) #remove default title
Text(0.5, 0.98, '')

png

fig,axes = plt.subplots(2,1, figsize=(12,10))


filtered_tor_data.boxplot(column='fat', by='mag',ax=axes[0])
axes[0].set_title("Fatalities by tornado magnitude from 1950-2022")
axes[0].set_ylabel("Fatality Counts")



# Add a small constant to avoid taking the log of 0, which is undefined for 0 and negative numbers
#  np.log(x + 1) effectively shifting all data up by one unit 
#  so that the minimum value is 1 instead of 0, and then taking the logarithm
filtered_tor_data['log_fat'] = np.log(filtered_tor_data['fat'] + 1)

filtered_tor_data.boxplot(column='log_fat', by='mag',ax=axes[1])
axes[1].set_title("Log of Fatalities by tornado magnitude from 1950-2022")
axes[1].set_ylabel("Log of Fatalities Counts")





plt.suptitle(None) #remove default title
Text(0.5, 0.98, '')

png

# There are many outliers in the injury and fatality count data
# log-transformation of injuries and fatalities normalized the dataset somewhat 
# Calculate the sum of injuries for each magnitude
injuries_by_mag = filtered_tor_data.groupby('mag')['inj'].sum()

# Calculate the total sum of injuries
total_injuries_by_mag = injuries_by_mag.sum()

# Calculate the percentage of injuries caused by each magnitude
percentage_injuries_by_mag = (injuries_by_mag / total_injuries_by_mag) * 100

# Print the result
print(percentage_injuries_by_mag)

fig, axs = plt.subplots(1, 2, figsize=(14, 7))  # Create a figure with 2 subplots

# plot the pie chart for injuries on the first subplot
percentage_injuries_by_mag.plot(kind='pie', labels=percentage_injuries_by_mag.index,autopct='%1.1f%%', ax=axs[0])
axs[0].set_ylabel('')  # Hide the y-label
axs[0].set_title('Percentage of Injuries by Tornado Magnitude',size='18')


# Calculate the sum of deaths for each magnitude
fatalities_by_mag = filtered_tor_data.groupby('mag')['fat'].sum()

# Calculate the total sum of deaths
total_fatalities_by_mag = fatalities_by_mag.sum()

# Calculate the percentage of deaths caused by each magnitude
percentage_fatalities_by_mag = (fatalities_by_mag / total_fatalities_by_mag) * 100

# Print the result
print(percentage_fatalities_by_mag)

# Print the result
print(percentage_fatalities_by_mag.sum())

percentage_fatalities_by_mag.plot(kind='pie', labels=percentage_fatalities_by_mag.index,autopct='%1.1f%%', ax=axs[1])
axs[1].set_ylabel('')  # Hide the y-label
axs[1].set_title('Percentage of Fatalities by Tornado Magnitude', size='18')

plt.legend(title='Magnitudes')
plt.tight_layout()  # Adjust the layout to prevent overlapping
plt.show()
mag
0     0.887598
1     7.396310
2    16.799721
3    24.988200
4    36.633694
5    13.294477
Name: inj, dtype: float64
mag
0     0.407498
1     3.960880
2    10.350448
3    22.689487
4    40.635697
5    21.955990
Name: fat, dtype: float64
100.0

png

# Nearly 99% of all tornado injuries and fatalities result from magnitude 1 or greater tornadoes
# ~75% of total tornado injuries are caused by magnitude 3, 4 and 5 tornadoes
# ~85% of total tornado deaths are caused by magnitude 3, 4 and 5 tornadoes

Correlation matrix and heatmap

# Plot correlation coefficient matrix

#filter to relevant columns
filtered_tor_data_hm = filtered_tor_data[filtered_tor_data['mag'] !=-9].copy()

filtered_tor_data_hm = filtered_tor_data_hm.drop(['yr','time', 'loss','closs','om','dy','date' ,'hour','mo','tz', 'st', 'NAMELSAD20', 'slon', 'slat','log_inj','log_fat'], axis=1)
print(filtered_tor_data_hm.columns)
Index(['mag', 'inj', 'fat', 'len', 'wid', 'is_night', 'is_urban'], dtype='object')
# plot a scatterplot matrix, to visualize any relationships between variables 
sb.pairplot(filtered_tor_data_hm[['inj','fat','mag', 'len', 'wid', 'is_night', 'is_urban']])

# And the correlation and a heatmap, to evaluate strength of these relationships
corr = filtered_tor_data_hm[['inj','fat','mag', 'len', 'wid', 'is_night', 'is_urban']].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

fig,ax = plt.subplots()
cmap='coolwarm'
sb.heatmap(corr, mask=mask, annot=True, fmt = ".2f",cmap=cmap, vmin=-1, vmax=1, square=True, linewidth=1)
<Axes: >

png

png

# In the original data:
# injuries and fatalities have a high positive correlation with each other
# length, width, and magnitude have low positive correlations with injuries and fatalities
# is_urban and is_night show essentially no correlation with injuries or fatalities,
    # which could be because, once the data are annualized, there are far fewer data points than in the overall analyses done above
# Log-transform the 'injuries' and 'fatalities' variables
filtered_tor_data_hm['inj_log'] = np.log(filtered_tor_data_hm['inj'] + 1)
filtered_tor_data_hm['fat_log'] = np.log(filtered_tor_data_hm['fat'] + 1)

# plot a scatterplot matrix, to visualize any relationships between variables 
sb.pairplot(filtered_tor_data_hm[['inj','fat','mag', 'inj_log', 'fat_log', 'len', 'wid', 'is_night', 'is_urban']])

# And the correlation and a heatmap, to evaluate strength of these relationships
corr = filtered_tor_data_hm[['inj','fat','mag', 'inj_log', 'fat_log', 'len', 'wid', 'is_night', 'is_urban']].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

fig,ax = plt.subplots()
cmap='coolwarm'
sb.heatmap(corr, mask=mask, annot=True, fmt = ".2f",cmap=cmap, vmin=-1, vmax=1, square=True, linewidth=1)
<Axes: >

png

png

# In the log-transformed injury and fatality data:
# log-transformed injuries and fatalities have higher positive correlations with some variables
    # length, width, and magnitude now have higher positive correlations with log-transformed injuries and fatalities
# is_urban and is_night still show essentially no correlation,
    # which could be because, once the data are annualized, there are far fewer data points than in the overall analyses done above

What is the distribution of tornado injuries and fatalities from 1950-2022?

# Groupby year and sum injuries per year
total_tor_injuries = filtered_tor_data.groupby('yr')["inj"].sum()  # total number of injuries for each year

print(total_tor_injuries.sort_values(ascending=False)) # sort by highest to lowest injuries per year

print("\n")

# number of groups when grouped by year
print(len(total_tor_injuries))

print("\n")
# number of records for each year
print(filtered_tor_data.groupby('yr')["inj"].size()) # total number of tornado events that occured each year
print("\n")



# Interpretation: ~1335 tornado injuries, on average, occur in the US per year 
print(f"Average annual number of tornado injuries from 1950 - 2022 is : {round(total_tor_injuries.mean())}")
yr
1974    6824
2011    5483
1965    5197
1953    5131
1979    3014
        ... 
2009     396
2004     395
2016     325
2022     314
2018     199
Name: inj, Length: 73, dtype: int64


73


yr
1950     201
1951     260
1952     240
1953     421
1954     550
        ... 
2018    1109
2019    1331
2020     977
2021    1111
2022     992
Name: inj, Length: 73, dtype: int64


Average annual number of tornado injuries from 1950 - 2022 is : 1335
## Print summary stats, perform skewness checks, then generate histogram and fit distribution curves to our data

# Stats

print(f" Descriptive Summary statistics for total tornado injuries from 1950-2022 {'\n'} {total_tor_injuries.describe()}")


total_tor_injuries_stats =total_tor_injuries.describe()

print("\n")


# Interpretation: ~1335 tornado injuries, on average, occur in the US per year 
mean_inj_annual = total_tor_injuries_stats.loc["mean"]
# print(f"mean number of injuries occuring annually {round(mean_inj_annual)}")

# Calculate the skewness of the annual tornado injury counts
skewness = total_tor_injuries.skew()

# Print the skewness
print(f"Skewness: {skewness}")

# Calculate the quartiles
Q1 = total_tor_injuries.quantile(0.25)
Q2 = total_tor_injuries.quantile(0.5)
Q3 = total_tor_injuries.quantile(0.75)

# Calculate the Yule-Kendall index
yule_kendall_index = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)
print(f"Yule-Kendall index: {yule_kendall_index}")

# Calculate the IQR
IQR = Q3 - Q1
print(f"Interquartile Range (IQR): {IQR}")

#========================


# Create figure


fig,ax = plt.subplots()


# bins
# calculate the number of bins
bins = int(np.sqrt(len(total_tor_injuries)))
# calculate the bin width
bin_width = (total_tor_injuries.max() - total_tor_injuries.min()) / bins

# print(f"Number of bins: {bins}")
# print(f"Bin width: {bin_width}")

print('\n')

# Plot the histogram for injury counts
ax.hist(total_tor_injuries, bins=bins,density=False)

# Add vertical lines to mark mean and median
ax.axvline(total_tor_injuries_stats.loc['50%'], color='r', linestyle=':', label='Median annual injury count', zorder=2) 
ax.axvline(total_tor_injuries_stats.loc["mean"],color='b', linestyle=':', label='Mean annual injury count', zorder=3)

#========================


# Fit a theoretical normal distribution to our data:

# 1. Calculate parameters for theoretical distribution:

    # Calculate parameters for the normal distribution:  mean and standard deviation
mu, std = np.mean(total_tor_injuries), np.std(total_tor_injuries)

# 2. Generate x values for the normal pdf

x = np.linspace(min(total_tor_injuries), max(total_tor_injuries), 100)

# 3.  Using the parameters obtained, compute the pdf

normal_pdf = norm.pdf(x, mu, std) * len(total_tor_injuries) * bin_width #   scaled to match original data by total count and bin width


# 4. Generate the plots
    # Plot normal PDF
ax.plot(x, normal_pdf, 'r-', lw=5, alpha=0.6, label='norm pdf')


# Label plot 
ax.set_title("Distibution of Annual US Tornado Injuries (1950-2022)")
ax.set_ylabel("Years")
ax.set_xlabel("Injuries")
ax.legend()

#========================
 Descriptive Summary statistics for total tornado injuries from 1950-2022 
 count      73.000000
mean     1334.986301
std      1216.470321
min       199.000000
25%       703.000000
50%       948.000000
75%      1323.000000
max      6824.000000
Name: inj, dtype: float64


Skewness: 2.7648635200500014
Yule-Kendall index: 0.20967741935483872
Interquartile Range (IQR): 620.0







<matplotlib.legend.Legend at 0x26484fd3830>

png

# Normality assessment: data is right-skewed, and not normally distributed
    # mean is higher than the median, typical of right-skewed data
    # Yule-Kendall index measures skew based on the quartiles, so it is less impacted by extreme values
        # 0 would indicate symmetry, normal distribution
        # positive values, right skew. negative left skew
    # Skewness stat measures the degree of asymmetry, considers all data points and their distance from mean
        # 0 is perfectly symmetrical, positive= right skew, negative = left skew
        # abs value >1, highly skewed
        # abs value between 0.5 to 1 is mod skew
    # Standard deviation is quite large, almost as large as the mean, indicating significant variability
    # Noticeable difference between 75% quantile and max injuries, indicating outliers
        # In the histogram, outliers appear as bars disconnected from the body of the distribution

# Conclusion: original injury counts data is not suitable for models that assume normality of the data
    # Key indicators for a Log transformation : highly skewed data with high variability  
        # Skewness greater than 1 and long right tail, sign of data that could benefit from log transformation to normalize the distribution
        # High standard deviation relative to the mean, due to outliers 
        # Histogram shows long tail to the right
        # Log-transform helps meet OLS linear regression model assumptions, by reducing skewness, extreme outliers, and stabilizing variance
## Print summary stats, perform skewness checks, then generate histogram and fit distribution curves to our data

# We learned our data does not have a normal distribution, so we apply a log transformation to the summed annual injury counts
    # Original injury count data are discrete, i.e., each annual injury total is a whole number (e.g., min is 199, max is 6824)
    # Log-transformed data will be continuous values with decimal points 
total_tor_log_injuries = np.log(total_tor_injuries+1)


# Stats
print(f" Descriptive Summary statistics for the log of annual total tornado injuries from 1950-2022 {'\n'} {total_tor_log_injuries.describe()}")
total_tor_log_injuries_stats =total_tor_log_injuries.describe()

print("\n")

# print(total_tor_log_injuries.sort_values(ascending=False)) # sort by highest to lowest injuries per year

print("\n")


# Calculate the skewness of the log-transformed annual injury counts
skewness = total_tor_log_injuries.skew()

# Print the skewness
print(f"Skewness: {skewness}")

# Calculate the quartiles
Q1 = total_tor_log_injuries.quantile(0.25)
Q2 = total_tor_log_injuries.quantile(0.5)
Q3 = total_tor_log_injuries.quantile(0.75)

# Calculate the Yule-Kendall index
yule_kendall_index = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)
print(f"Yule-Kendall index: {yule_kendall_index}")

# Calculate the IQR
IQR = Q3 - Q1
print(f"Interquartile Range (IQR): {IQR}")

#========================


# Create figure
fig,ax = plt.subplots()

# bins
# calculate the number of bins
bins = int(np.sqrt(len(total_tor_log_injuries)))
# calculate the bin width
bin_width = (total_tor_log_injuries.max() - total_tor_log_injuries.min()) / bins
print("\n")
# print(f"Number of bins: {bins}")
# print(f"Bin width: {bin_width}")

# plot histogram for log injuries
ax.hist(total_tor_log_injuries, bins=bins,density=False)
# Add vertical lines to mark the mean and median
ax.axvline(total_tor_log_injuries_stats.loc['50%'], color='r', linestyle=':', label='Median annual injury count', zorder=2) 
ax.axvline(total_tor_log_injuries_stats.loc["mean"],color='b', linestyle=':', label='Mean annual injury count', zorder=3)

#========================



# Fit the theoretical normal and gamma distributions to our data:

# 1. Calculate parameters for each distribution:

    # Calculate parameters for the normal distribution:  mean and standard deviation
mu, std = np.mean(total_tor_log_injuries), np.std(total_tor_log_injuries)

    # Calculate parameters for the gamma distribution: shape, location, scale
shape, loc, scale = gamma.fit(total_tor_log_injuries)

# 2. Generate x values for the gamma and normal pdf
x = np.linspace(min(total_tor_log_injuries), max(total_tor_log_injuries), 100)


# 3.  Using the parameters obtained, compute the probability density functions (PDF)
    # Calculate the normal PDF
normal_pdf = norm.pdf(x, mu, std) * len(total_tor_log_injuries) * bin_width #   scaled to match original data by total count and bin width

    # Calculate the gamma PDF
gamma_pdf = gamma.pdf(x, shape, loc, scale) * len(total_tor_log_injuries) * bin_width


# 4. Generate the plots
    # Plot normal PDF
ax.plot(x, normal_pdf, 'r-', lw=5, alpha=0.6, label='norm pdf')
    # Plot gamma PDF
ax.plot(x, gamma_pdf, 'g-', lw=5, alpha=0.6, label='gamma pdf')




# label plot
ax.set_title("Distibution of Annual US Tornado Log Injuries (1950-2022)")
ax.set_ylabel("Years")
ax.set_xlabel("Log Injuries")



ax.legend()

#========================
 Descriptive Summary statistics for the log of annual total tornado injuries from 1950-2022 
 count    73.000000
mean      6.943825
std       0.669271
min       5.298317
25%       6.556778
50%       6.855409
75%       7.188413
max       8.828348
Name: inj, dtype: float64




Skewness: 0.612632818476642
Yule-Kendall index: 0.05441992473034167
Interquartile Range (IQR): 0.6316343803389115







<matplotlib.legend.Legend at 0x2648509a4e0>

png

# Normality assessment: log-transformed data is still right-skewed, but more symmetrical and closer to a normal distribution
    # The mean (6.94) is closer to the median (6.86), suggesting a relatively symmetrical distribution
    # Yule-Kendall index measures skew based on quartiles, so less impacted by extreme values:
    #   Index value (0.054) indicates slight right skew, confirming that the distribution is more symmetrical after log transformation
    # Skewness stat measures the degree of asymmetry, considering all data points and their distance from the mean:
    #   Skewness value (0.61) shows the data is slightly right-skewed but significantly less than the non-log-transformed data
    #   Abs value < 1, indicating moderate skew
    # IQR and Standard Deviation indicate less variability in log-transformed data:
    #   IQR (0.63) and standard deviation (0.67) show the spread of data is moderate, with a narrow range for 50% of the data

    # Fit of normal distribution appears reasonable in the center but less so in the tails, particularly on the right tail:
    #   Around the mean, the frequency of data points is similar to what would be expected in a normal distribution
    #   The higher frequency of data points in the right tail compared to the normal PDF suggests more high-value outliers than expected
    # gamma distribution aligns more closely with the histogram and might better capture the asymmetry and slight skewness of the log-transformed data

    # Conclusion: 
    # The log-transformed injury data is more symmetrical, but still not perfectly normal
    # This transformation makes the data more suitable for models that assume normality, although it remains slightly right-skewed with some high-value outliers
# Quantile - Quantile plot
# used to visually show how closely a dataset follows a theoretical distribution, in this case the normal distribution
# For the normal distribution, when fit=True, it standardizes the data to have a mean of 0 and a standard deviation of 1, properties of the standard normal distribution
# this is done to make the data more comparable to the theoretical normal distribution
# !! Standardizing a dataset like this does NOT make it a normal distribution and will NOT change the shape of the data
    # because a distribution is also defined by its shape, i.e., symmetric and bell-shaped
    # if the dataset is skewed or has outliers, the standardized data will also be skewed and have outliers
    # So, subtracting the mean and dividing by the std only rescales the dataset; it DOES NOT MAKE IT MORE NORMALLY DISTRIBUTED
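# Quick check of this point (a minimal sketch using the log-injury series defined above):
# standardizing rescales the data but leaves its skewness unchanged
z = (total_tor_log_injuries - total_tor_log_injuries.mean()) / total_tor_log_injuries.std()
print(f"Skewness before standardizing: {stats.skew(total_tor_log_injuries):.4f}")
print(f"Skewness after standardizing:  {stats.skew(z):.4f}")  # same value up to floating-point error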

# (1) Create the distribution parameters automatically from the data
# (2) Distribution specified here
# (3) reference line to which data is compared
                        # (1)      # (2)           # (3) 
sm.qqplot(total_tor_log_injuries,fit=True,dist=stats.norm,line = '45')   



plt.title('QQ Plot: Fit to Normal Distribution')
plt.xlabel('Theoretical quantiles')
plt.ylabel('Empirical quantiles')
plt.show()

png

# Interpretation:
# Central part of the data conforms well to a normal distribution
    # Lower and upper ends of the data deviate from  the line
    # Data deviates from normal distribution at the tails
    # Data has more extreme values than would be expected for a normal distribution, indicating presence of outliers
    # Good fit in the central part of the QQ plot means that normal-based methods are reliable for central tendency estimates (mean, variance)
    # Poor fit in the tails: if analyzing tails or requiring accuracy at the extremes, other methods or transformations are needed to mitigate outliers
# Conclusion: The log transformation improved the distribution's symmetry and reduced variability, but the dataset still has some deviations from normality, particularly in the tails
sm.qqplot(total_tor_log_injuries,fit=True,dist=stats.gamma,line = '45')   



plt.title('QQ Plot: Fit to Gamma Distribution')
plt.xlabel('Theoretical quantiles')
plt.ylabel('Empirical quantiles')
plt.show()

png

# Gamma distribution appears to fit the data well in the central range, with most points aligning closely with the 45-degree line
# provides a better fit to the data compared to the normal distribution, as seen from the slightly improved alignment with the 45-degree line
# Points at the extreme upper end / upper tail still show deviation from the line, suggests that like the normal distribution, gamma may not capture full extent of outliers
# Convert is_night to integer data type 
filtered_tor_data['is_night'] = filtered_tor_data['is_night'].astype(int)

# We exclude magnitude 0 because, as we saw in the EDA, only ~1% of tornado injuries are caused by magnitude 0 tornadoes
filtered_tor_data_hi_mag = filtered_tor_data[filtered_tor_data["mag"]!=0]

# Aggregate data by year
annual_data = filtered_tor_data_hi_mag.groupby('yr').agg({
    'inj': 'sum',  # Total injuries per year (log transform applied below)
    'mag': 'mean', # Average magnitude of tornadoes per year
    'len':'mean',  # Mean tornado path length in miles per year
    'is_urban': 'sum' # Number of urban tornadoes per year (sum of the 0/1 flag)
}).reset_index()

# Apply log transformations
# To reduce the impact of extreme values and make the distribution more symmetric  
# the log transformation changes the scale of the data from the original units (injuries) to a log scale (log-injuries)
    # Means that coefficients represent the change in log-injuries for a one-unit increase in the predictor, holding other predictors constant; exponentiating a coefficient gives the multiplicative change in injuries
# Apply log transformation after summing injuries
annual_data['log_inj'] = np.log(annual_data['inj'] + 1)  # Adding 1 to avoid log(0)

# Define predictors and outcome
X = annual_data[[ 'mag','len', 'is_urban']] # annual predictors; add others as needed
y = annual_data['log_inj']  # Using transformed injuries

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Adding a constant for the intercept
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# Fit the OLS model using the training data
model = sm.OLS(y_train, X_train).fit()

# Make predictions using the testing data
y_pred = model.predict(X_test)

XIII. OLS linear regression model to assess the influence of tornado magnitude, path length, and urban location on tornado injuries

# Print the summary
ols_model_tor_injury_results =model.summary()
print(model.summary())

# Goal : To assess the influence of tornado magnitude, path length, and urban location on the severity of tornado injuries

# Approach: OLS linear regression model 
    # estimates annual log injuries from 3 predictors: mean magnitude, mean path length in miles, and the number of urban tornadoes per year
    # An R-squared of 0.626 means that about 63% of the variability in log-transformed injuries is explained by the model
    # The Adj. R-squared of 0.597 is slightly lower after adjusting for the number of predictors, but the model still fits well

# When the dependent variable, total annual injuries, is log-transformed 
    # coefficients of the predictors represent the change in log total annual injuries
    # for a one-unit change in the predictor, holding other variables constant; exponentiating a coefficient
    # gives the multiplicative change in injuries (see the quick check after the regression output below)
    # e.g., each one-unit increase in mag is associated with an increase in log total annual injuries of 1.4437 units
    # So, increasing the mean mag from 1 to 2 is associated with a
    # ~4.24-times change (i.e., exp(1.4437)) in total annual injuries 


# Overall conclusions:    
    # significant positive relationship between mean tornado magnitude, length in miles, urban location and annual injuries 
    # suggests that years with more severe tornadoes occurring in urban settings, will have a greater number of total annual injuries
    # however, important to note that the impact of urban settings is relatively small compared to the effects of magnitude and path length
    #   Effect size of each predictor on injuries varies
        #  largest coefficient has the largest impact on the outcome, in this case magnitude has the greatest impact on injuries
        # 'mag' has the largest coefficient (1.4437), followed by 'len' (0.2037), and then 'is_urban' (0.0168)
        # Changes in the mean magnitude of tornadoes have the largest impact on total annual injuries, followed by changes in the mean path length, and then changes in the number of urban tornadoes
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                log_inj   R-squared:                       0.626
Model:                            OLS   Adj. R-squared:                  0.597
Method:                 Least Squares   F-statistic:                     21.74
Date:                Mon, 06 May 2024   Prob (F-statistic):           1.93e-08
Time:                        12:51:53   Log-Likelihood:                -26.959
No. Observations:                  43   AIC:                             61.92
Df Residuals:                      39   BIC:                             68.96
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.7806      0.643      4.323      0.000       1.479       4.082
mag            1.4437      0.549      2.632      0.012       0.334       2.553
len            0.2037      0.072      2.830      0.007       0.058       0.349
is_urban       0.0168      0.004      4.562      0.000       0.009       0.024
==============================================================================
Omnibus:                        0.400   Durbin-Watson:                   1.741
Prob(Omnibus):                  0.819   Jarque-Bera (JB):                0.050
Skew:                           0.066   Prob(JB):                        0.975
Kurtosis:                       3.103   Cond. No.                         606.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
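# Quick check of the multiplicative interpretation above (a minimal sketch using the fitted model):
# exponentiating each coefficient gives the multiplicative change in total annual injuries
# (strictly, in injuries + 1, since log(inj + 1) was modeled) per one-unit increase in that predictor
print(np.exp(model.params))  # e.g., exp(1.4437) for 'mag' is roughly 4.24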
# Statistical significance of the effect our predictors on the dependent variable, injuries (log-transformed)
    # p-values are used to determine the statistical significance of each individual predictor in the model 
    # Here all predictors are statistically significant with P-values less than 0.05, strong evidence against the null hypothesis for each predictor
    # For each predictor, the null hypothesis is that its coefficient is zero (no effect on injuries)
    # These results indicate that each predictor has a statistically significant impact on injuries
# Evaluate Root Mean squared error (RMSE)
    #  a measure of the model's average error in prediction
    # average deviation of the predictions from the actual values
    # expressed in the same units as the dependent variable (log-transformed injuries in this case)


# Compute RMSE in log injuries
    # differences between actual values and the values predicted by the model
    # Since we log-transformed injuries, RMSE is in terms of the log of injuries
rmse_err = mean_squared_error(y_test, y_pred, squared=False) # squared=False returns the root mean squared error
print(f"RMSE: {rmse_err:.3f}")

# Calculate the range of log injuries and percentage RMSE
    # expressed as a percentage of the range, it gives a relative indication of the model's prediction error in comparison to the spread of the data
    # Print the min and max of y_test for context
    # useful for understanding the spread of the test data and for interpreting the RMSE of the model in terms of the range of the test data

min_test, max_test = y_test.min(), y_test.max()
print(f"Range of test data: {min_test:.2f} to {max_test:.2f}")
range_log_injuries = max_test - min_test
percentage_rmse = (rmse_err / range_log_injuries) * 100
print(f"Range of log injuries: {range_log_injuries:.3f}")
print(f"RMSE as a percentage of the range: {percentage_rmse:.2f}%")

# Conclusion about the model error
print(f"The model error (RMSE) is {percentage_rmse:.2f}% of the range of log-injuries")

# Error seems moderate; what level of error is acceptable in practice depends on the application
# In this case the model's predictions deviate from the actual test values by about 0.460 log-injuries on average, or ~16% of the observed range of log-injuries
RMSE: 0.460
Range of test data: 5.72 to 8.61
Range of log injuries: 2.883
RMSE as a percentage of the range: 15.96%
The model error (RMSE) is 15.96% of the range of log-injuries
# Diagnostic plot for validating assumptions of OLS regression model

# Residuals
    # calculated as the difference between the actual and predicted values of the dependent variable log-injuries
    # Residuals analysis is done to check if there are patterns in the errors made by the model
    # These errors can sometimes be correlated with predictors or predicted values, which would indicate issues like heteroscedasticity

# Generate figure
fig, ax = plt.subplots()

# Calculate residuals for the test data
residuals = y_test - y_pred

# Choosing one predictor for residuals analysis
    # allows you to see if there's an error pattern related to this specific predictor
ax.scatter(X_test["mag"], residuals, color='b', s=10) # residuals on the y axis and the magnitude predictor on the x axis
ax.set_xlabel('Magnitude')
ax.set_ylabel('Residuals (log injuries)')
ax.set_title('Residuals vs. Magnitude')

# Draw a horizontal line at y=0
ax.axhline(y=0, color='r', linestyle='--')

png

# Residuals plot: shows no strong pattern, suggesting no systematic relationship between the residuals and this specific predictor
    # The residuals appear to be randomly scattered around the horizontal zero line
    # Overall, suggests that the assumption of homoscedasticity is likely met, 
    # meaning the variance of the residuals is roughly consistent across different values of the predictors
# Let's repeat this check for constant variance of errors, this time with all the predictors used in the model 

# Residuals
# calculated as the difference between the actual and predicted values of the dependent variable


fig, ax = plt.subplots()

# Choosing predicted values for residuals analysis
    # The predicted values (y_pred) are the predictions of the model 
    # Using y_pred provides a comprehensive view of how well the model performs across the entire range of predictions
ax.scatter(y_pred, residuals, color='b', s=10) 
ax.set_xlabel('Predicted log_injuries')
ax.set_ylabel('Residuals (log injuries)')
ax.set_title('Residuals vs. Predicted log_injuries')

# Draw a horizontal line at y=0
ax.axhline(y=0, color='r', linestyle='--')

png

# Residuals plot:  shows no strong pattern, suggesting no systematic relationship between the residuals and predicted values
    # The residuals appear to be randomly scattered around the horizontal zero line
    # suggests that the assumption of homoscedasticity is likely met, 
    # meaning the variance of the residuals is roughly consistent across the range of predicted values
# Let's plot the Residual vs. time
# to see how these residuals change with time

fig,ax=plt.subplots(figsize=(20,10))
ax.scatter(X_test.index, y_test - y_pred,color='b',s=20)
ax.set_xlabel('Year')
ax.set_ylabel('Residuals (log injuries)')
ax.set_title('Ordered Residuals (log_injuries)')
# Draw a horizontal line at y=0
ax.axhline(y=0, color='r', linestyle='--')
plt.show()


# Sort X_test and y_test by index
X_test_sorted = X_test.sort_index()
y_test_sorted = y_test.sort_index()

# Calculate residuals for the sorted test data
residuals_sorted = y_test_sorted - model.predict(X_test_sorted)

fig, ax = plt.subplots(figsize=(20,10))
ax.plot(X_test_sorted.index, residuals_sorted, 'b-o')
ax.set_xlabel('Year')
ax.set_ylabel('Residuals (log injuries)')
ax.set_title('Ordered Residuals (log_injuries)')
# Draw a horizontal line at y=0
ax.axhline(y=0, color='r', linestyle='--')
plt.show()

png

png

# Analysis of residuals vs. time: no distinct trend (increasing or decreasing) over time, but there are outlier events not captured by the model
    # There isn't a clear pattern that would suggest a systematic issue
    # However, the very high and low residuals in specific years point to potential outliers or specific events affecting those years
    # Some years may have seen major events, such as unusually severe tornado outbreaks, that significantly impacted injury counts
    # The current predictors don't directly capture this, so residuals can vary substantially for those years
    # We would have to look at the years associated with these residuals to better understand the nature of these outlier events
# Analysis of outliers years identified in Residuals versus time

# Sort X_test and y_test by index
X_test_sorted = X_test.sort_index()
y_test_sorted = y_test.sort_index()

# Calculate residuals for the sorted test data
residuals_sorted = y_test_sorted - model.predict(X_test_sorted)

# Combine year and residuals into a DataFrame for easier analysis
results = pd.DataFrame({
    'Index': X_test_sorted.index,  # Original indices from the data split
    'Residuals': residuals_sorted
})

# Use the original annual_data to map the index to the actual year
results['Year'] = results['Index'].apply(lambda i: annual_data.loc[i, 'yr'])

# Identify the years with the top 3 highest residuals and bottom 3 lowest residuals
top_3_highest_residuals = results.nlargest(3, 'Residuals')
bottom_3_lowest_residuals = results.nsmallest(3, 'Residuals')

print("Top 3 highest residuals:")
print(top_3_highest_residuals[['Year', 'Residuals']])
print("\nBottom 3 lowest residuals:")
print(bottom_3_lowest_residuals[['Year', 'Residuals']])
Top 3 highest residuals:
    Year  Residuals
61  2011   1.007388
16  1966   0.793839
34  1984   0.767768

Bottom 3 lowest residuals:
    Year  Residuals
72  2022  -0.794757
4   1954  -0.666684
0   1950  -0.614011
# years with high residuals suggest that the model may be underestimating the impact of extreme events on injuries,
# possibly due to insufficient predictors capturing severe tornado outbreaks or other unusual weather events
show(annual_data[annual_data["yr"].isin(top_3_highest_residuals["Year"])])
[interactive table output: annual_data rows for the top-3 high-residual years (columns: yr, inj, mag, len, is_urban, log_inj)]
# years with low residuals indicate that the model may be overestimating injuries due to factors like changing safety standards, 
# advancements in early warning systems, or milder weather conditions that aren't adequately captured by the current predictors
show(annual_data[annual_data["yr"].isin(bottom_3_lowest_residuals["Year"])])
[interactive table output: annual_data rows for the bottom-3 low-residual years (columns: yr, inj, mag, len, is_urban, log_inj)]

XIV. OLS linear regression model to assess the influence of injuries, path length and time of day on tornado fatalities

  • This model is a work in progress...

What is the distribution of tornado fatalities from 1950-2022?

# Aggregate data by year
annual_data_fatalities = filtered_tor_data_hi_mag.groupby('yr').agg({
    'fat': 'sum',  # Sum fatalities per year
    'mag': 'mean', # Average magnitude of tornadoes per year
    'len':'mean',  # Mean tornado path length in miles per year
    'is_urban': 'sum',
    'is_night': 'sum',
    'inj':'sum',
    # Add other aggregations as needed
}).reset_index()


# Apply log transformation to fatalities (add 1 to avoid taking log of 0)
annual_data_fatalities['log_fatalities'] = np.log(annual_data_fatalities['fat'] + 1)

# Apply log transformation to injuries (add 1 to avoid taking log of 0)
annual_data_fatalities['log_inj'] = np.log(annual_data_fatalities['inj'] + 1)



# Define predictors and outcome
X = annual_data_fatalities[['log_inj','len']]  # Add other annual predictors as needed
y = annual_data_fatalities['log_fatalities']  # Using log-transformed fatalities as the outcome

# Split the data into training and testing sets
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X, y, test_size=0.4, random_state=42)

# Adding a constant for the intercept
X_train_f = sm.add_constant(X_train_f)
X_test_f = sm.add_constant(X_test_f)

# Fit the OLS model using the training data
model_f = sm.OLS(y_train_f, X_train_f).fit()

# Make predictions using the testing data
y_pred_f = model_f.predict(X_test_f)
# Print the summary
print(model_f.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         log_fatalities   R-squared:                       0.890
Model:                            OLS   Adj. R-squared:                  0.884
Method:                 Least Squares   F-statistic:                     161.2
Date:                Mon, 06 May 2024   Prob (F-statistic):           7.20e-20
Time:                        13:06:11   Log-Likelihood:                -3.5571
No. Observations:                  43   AIC:                             13.11
Df Residuals:                      40   BIC:                             18.40
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.6366      0.392     -6.718      0.000      -3.430      -1.843
log_inj        0.8106      0.064     12.601      0.000       0.681       0.941
len            0.1912      0.038      5.003      0.000       0.114       0.268
==============================================================================
Omnibus:                        1.119   Durbin-Watson:                   1.952
Prob(Omnibus):                  0.572   Jarque-Bera (JB):                1.140
Skew:                           0.300   Prob(JB):                        0.565
Kurtosis:                       2.475   Cond. No.                         87.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Goal : To predict the total annual fatalities from tornadoes based on yearly aggregated data or annual characteristics

# Approach: OLS linear regression model 
    # built on this aggregated data estimates annual fatalities from annual aggregated predictors mean length in miles and log-transformed injuries
    # R-squared value, represents the proportion of the variance in the log-transformed fatalities that is explained by the model 
    # An R-squared of 0.890 means that about 89% of the variability in log-transformed fatalities is explained by the predictors in the model

# When dependent variable total annual fatalities, is log-transformed 
    # we model the relationship of log(Y) and X (where Y is the dependent variable and X is the predictor),
    # essentially saying that a linear change in X  (length or log_inj) is associated with a multiplicative change in Y (total annual fatalities)
    # So, each coefficient represents the expected change in log(total annual fatalities) for a one-unit change in that predictor, holding other variables constant,
    # which corresponds to a multiplicative (approximately percentage) change in total annual fatalities
    # e.g., a one-unit change in len changes the log-transformed total annual fatalities by 0.1912 units on the log scale
    # so a one-unit increase in mean len (e.g., from a mean length of 1 to 2 miles) is associated with an exp(0.1912) ~ 1.21 times change in total annual fatalities
    # i.e., total annual fatalities are expected to increase by ~21% for a one-mile increase in the mean length of tornadoes
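# A quick worked check of this interpretation (a minimal illustration; assumes numpy is imported as np, as elsewhere in this notebook)
print(np.exp(0.1912))              # ~1.21, the multiplicative change in annual fatalities per one-mile increase in mean length
print((np.exp(0.1912) - 1) * 100)  # ~21, the same effect expressed as a percentage increase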

# Overall conclusions:    
    # significant positive relationship between mean tornado length, log-transformed injuries, and annual fatalities
    # Also, suggests that years with longer tornadoes and more injuries, will have a greater number of total annual fatalities
    # The predictor with the largest coefficient has the largest impact on the outcomes
    # In this case, 'log_inj' has the largest coefficient (0.8106), followed by 'len' (0.1912)
    # Changes in the injuries have the largest impact on total annual fatalities, followed by changes in the mean length of tornadoes
print(y_test_f.min())
print(y_test_f.max())

# useful for understanding the spread of the test data and for interpreting the RMSE of the model in terms of the range of the test data
3.1780538303479458
6.317164686747284
# RMSE, Root mean squared error 
# differences between actual values and the values predicted by the model
# Since we log-transformed fatalities, RMSE is in terms of the log of fatalities
# using y_test_f (actual values of the  outcome variable in original testing data i.e., unseen data)
# y_pred_f (predicted values of the outcome variable for testing data, obtained by applying the model developed on training data to the testing data) 
rmse_err_f = mean_squared_error(y_test_f, y_pred_f,squared=False) # checking how well model generalizes to unseen data
print(rmse_err_f)

# RMSE on the log-transformed scale is a measure of relative error
print(round(rmse_err_f/(y_test_f.max()-y_test_f.min())*100),'%') # useful to express as a percentage when comparing different model outputs

# RMSE is ~0.349 log-fatalities
# The range of log-fatalities in the test data is from ~3.178 (min) to ~6.317 (max)
# 6.317 - 3.178 = 3.139, the range of log-fatalities
# So an RMSE of 0.349 means the model error is about 11% of this range (0.349 / 3.139)
# The model's predictions are off by about 11% of the range of log-fatalities on average
# i.e., the model's error, in terms of log-transformed fatalities, is about 11% of the total spread of log-transformed fatalities in the test data

# convert the RMSE from log-fatalities back to the original units
# exponentiating this value gives a factor representing the average multiplicative error in the original units
# an exponentiated RMSE of ~1.42 means that, on average, predictions differ from the actual fatality counts by a factor of about 1.42 (roughly 42% in either direction)
rmse_err_f_original_units = np.exp(rmse_err_f)
print(rmse_err_f_original_units) 
0.34890556658604427
11 %
1.4175153232100153
# diagnostic plots for issues with regression model

# Residuals
# calculated as the difference between the actual and predicted values of the dependent variable


fig, ax = plt.subplots()

# Calculate residuals for the test data
residuals_f = y_test_f - y_pred_f

ax.scatter(X_test_f["len"], residuals, color='b', s=10) # shows a plot of the residuals on y axis and predicated values on the x axis
ax.set_xlabel('Predicted Magnitude')
ax.set_ylabel('Residuals')
ax.set_title('Residuals vs. Predicted Mag')

# Draw a horizontal line at y=0
ax.axhline(y=0, color='r', linestyle='--')

png

# Points in the plot are randomly scattered around the horizontal axis,
# this typically indicates that the assumption of homoscedasticity is met
# diagnostic plots for issues with regression model

# Residuals
# calculated as the difference between the actual and predicted values of the dependent variable


fig, ax = plt.subplots()

ax.scatter(y_pred_f, residuals_f, color='b', s=10) # residuals on the y axis and predicted values on the x axis
ax.set_xlabel('Predicted log_fatalities')
ax.set_ylabel('Residuals (log fatalities)')
ax.set_title('Residuals vs. Predicted log_fatalities')

# Draw a horizontal line at y=0
ax.axhline(y=0, color='r', linestyle='--')

png

# If the points in the plot are randomly scattered around the horizontal axis,
# this typically indicates that the assumption of homoscedasticity is met.
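# Beyond the visual check, a formal test of constant error variance could complement these plots
    # a minimal sketch of one option (Breusch-Pagan), not part of the original analysis; uses the fitted training results model_f
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(model_f.resid, model_f.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pval:.3f}")  # a large p-value is consistent with homoscedastic (constant-variance) errors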
# Residuals from OLS regression model over time
# Residual vs. time
fig,ax=plt.subplots(figsize=(20,10))
ax.scatter(X_test_f.index, y_test_f - y_pred_f,color='b',s=20)
ax.set_xlabel('Year')
ax.set_ylabel('Residuals (log fatalities)')
ax.set_title('Ordered Residuals (log_fatalities)')
plt.show()


# Sort X_test and y_test by index
X_test_sorted_f = X_test_f.sort_index()
y_test_sorted_f = y_test_f.sort_index()

# Calculate residuals for the sorted test data
residuals_sorted_f = y_test_sorted_f - model_f.predict(X_test_sorted_f)

fig, ax = plt.subplots(figsize=(20,10))
ax.plot(X_test_sorted_f.index, residuals_sorted_f, 'b-o')
ax.set_xlabel('Year')
ax.set_ylabel('Residuals (log fatalities)')
ax.set_title('Ordered Residuals (log_fatalities)')
plt.show()

png

png

# No clear increasing or decreasing trend, which suggests there are no obvious non-linear patterns in the data
# The variance also seems roughly consistent across the years
# However, a few years have particularly high or low residuals; these could be outliers or interesting data points worth investigating further
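# As with the injuries model, the years behind these extreme residuals could be identified by mapping the test-set indices back to calendar years
    # a minimal sketch mirroring the earlier injuries analysis; assumes annual_data_fatalities keeps the index used in the train/test split
results_f = pd.DataFrame({
    'Index': X_test_sorted_f.index,
    'Residuals': residuals_sorted_f
})
results_f['Year'] = results_f['Index'].apply(lambda i: annual_data_fatalities.loc[i, 'yr'])

print("Top 3 highest residuals (fatalities model):")
print(results_f.nlargest(3, 'Residuals')[['Year', 'Residuals']])
print("\nBottom 3 lowest residuals (fatalities model):")
print(results_f.nsmallest(3, 'Residuals')[['Year', 'Residuals']])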
tornado_data["fat"].max()

tor_annual_deaths = tornado_data.groupby('yr')["fat"].sum()

total_tor_deaths_stats = tor_annual_deaths.describe()
print(f" Descriptive Summary statistics for total tornado deaths from 1950-2022 {'\n'} {tor_annual_deaths.describe()}")

print("\n")


print(tor_annual_deaths.head(10).sort_values(ascending=False)) # fatalities for the first 10 years (1950-1959), sorted from highest to lowest

plt.clf()


tor_annual_deaths.plot.hist()
plt.title("Distribution of Annual US Tornado Fatalities (1950-2022)")


plt.axvline(total_tor_deaths_stats.loc['50%'], color='r', linestyle=':', label='Median annual fatalities count', zorder=2) 
plt.axvline(total_tor_deaths_stats.loc['mean'],color='b', linestyle=':', label='Mean annual fatalities count', zorder=3)

# Add a legend
plt.legend()
plt.ylabel("Years")
plt.xlabel("Fatalities")

plt.show()
 Descriptive Summary statistics for total tornado deaths from 1950-2022 
 count     73.000000
mean      84.041096
std       97.429524
min       10.000000
25%       36.000000
50%       55.000000
75%       84.000000
max      553.000000
Name: fat, dtype: float64


yr
1953    523
1952    230
1957    192
1955    129
1956     81
1950     70
1958     67
1959     58
1954     36
1951     34
Name: fat, dtype: int64

png

Conclusions:

  1. The analysis of tornado events reveals a significant difference in fatalities based on the time of day:
    • Tornadoes that occur at night tend to have a statistically higher fatality count compared to those during the day
    • Suggests that nighttime tornadoes are particularly dangerous, possibly due to reduced visibility and slower emergency response times

  2. Urban vs. Rural Impact on Injuries:
    • When comparing urban and rural tornadoes, findings indicate that urban areas experience significantly more injuries from tornadoes
    • Could be due to higher population density and infrastructure in urban areas, making them more vulnerable to tornado damage and resulting in higher injury counts

  3. OLS Linear Regression Analysis on Injuries:
    • The linear regression analysis identifies tornado magnitude, path length, and urban location as significant predictors of tornado injuries
    • Magnitude has the largest impact, aligning with the intuition that stronger tornadoes lead to more injuries
    • Path length is also important, where longer tornado paths correlate with higher injury counts
    • An urban setting further amplifies the injury count, albeit with a smaller influence than magnitude and path length
    • The model explains about 60% of the variability in injury counts, highlighting the importance of these factors in assessing tornado impact

Future work

  • Complete the build for the OLS linear regression model of fatalities
    • incorporate the night predictor (is_night) into the model (a preliminary sketch follows this list)
  • Evaluate the Effectiveness of Tornado Warning Systems and Response Measures:
    • Urban vs. Rural:
    • Objective: Assess the effectiveness of tornado warning systems and response measures in urban and rural settings to identify areas of improvement
    • Day vs. Night:
    • Objective: Examine the effectiveness of tornado warning systems and response measures based on time of occurrence to ensure readiness for both day and night events
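A minimal sketch of how the night predictor could be incorporated, reusing the aggregation and train/test split settings from section XIV above (the annual sum of nighttime tornadoes, is_night, is already present in annual_data_fatalities); this is illustrative only and has not been evaluated here:

# Add the annual count of nighttime tornadoes as a predictor of log-transformed fatalities
X_night = annual_data_fatalities[['log_inj', 'len', 'is_night']]
y_night = annual_data_fatalities['log_fatalities']

# Same split settings as the current fatalities model
Xn_train, Xn_test, yn_train, yn_test = train_test_split(X_night, y_night, test_size=0.4, random_state=42)
Xn_train = sm.add_constant(Xn_train)
Xn_test = sm.add_constant(Xn_test)

# Fit and inspect; the coefficient and p-value for is_night would indicate whether nighttime activity adds explanatory power
model_night = sm.OLS(yn_train, Xn_train).fit()
print(model_night.summary())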

Additional studies:

  • State level analyses
  • County level analyses example (MN)

Which US states have had the greatest number of tornadoes from 1950 to 2022?

  • Group by state, count tornado occurrences
tornado_data_by_states = tornado_data.groupby('st')['om'].count().sort_values(ascending=False)
print(f"There are {len(tornado_data_by_states.index)} US states in the tornado dataset" )
print('\n')
print(tornado_data_by_states.index)

us_states=  ('AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY')

for state in tornado_data_by_states.index:
    if state not in us_states:
        print(f"{state} is in the tornado dataset but not in the typical 50 US states list")


print('\n')

print("Descriptive statistics for tornadoes by state across the US from 1950 - 2022")
print(tornado_data_by_states.describe())
import requests
url='https://www.census.gov/geographies/reference-files/2010/geo/state-area.html'
us_state_area =requests.get(url)
tables= pd.read_html(url, skiprows=4, header=0) # took some testing to find the right number of rows to skip

# The first table on the page is at index 0
us_state_areas_data = tables[0] # selects the first table on the page
# rename , filter and drop nan rows from dataset of US state areas
us_state_areas_data_filtered= us_state_areas_data[['Unnamed: 0','Unnamed: 1']]
us_state_areas_data_filtered.rename(columns={'Unnamed: 0': 'State','Unnamed: 1':'Square Miles'},inplace=True)
us_state_areas_data_filtered.dropna(inplace=True)
us_state_areas_data_filtered = us_state_areas_data_filtered.drop(55) # unnamed group of island territories dropped
print(len(us_state_areas_data_filtered))
area_states_list = ['US','AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'PR', 'AS', 'GU', 'MP', 'VI'] 
print(len(area_states_list))

#add two letter codes
us_state_areas_data_filtered['st'] = area_states_list # gives same name as our tor data
print(us_state_areas_data_filtered)
tornado_data_by_states.plot.bar(figsize=(12,4))

# Add a title and labels
plt.title('Total tornado counts by State from 1950-2022')
plt.xlabel('State')
# Rotate the x-axis labels
plt.xticks(rotation=45)
plt.ylabel('Count')

print(f" From 1950-2022 top states by number of tornadoes is: \n {tornado_data_by_states.nlargest(13)}")

Which US states have had the greatest number of tornadoes from 1950 to 2022 if normalized for differences in total state area?

# convert our grouped tornado count by state into dataframe

tor_by_states_data = pd.DataFrame(tornado_data_by_states)


print(tor_by_states_data)

# merge the states area data with our datasets of tornado counts by state based on the state column

merged_tor_by_states = tor_by_states_data.merge(us_state_areas_data_filtered,on='st')

#normalize tornado counts based on the area of the state in square miles
merged_tor_by_states['normalized om'] = merged_tor_by_states['om']/merged_tor_by_states['Square Miles']
print(merged_tor_by_states.sort_values(by='normalized om', ascending=False))
# sort the data from highest to lowest normalized om
merged_tor_by_states=merged_tor_by_states.sort_values(by='normalized om', ascending=False)

# Set 'st' states as the index
merged_tor_by_states.set_index('st', inplace=True)

# plot the data
merged_tor_by_states['normalized om'].plot.bar(figsize=(12,4))

# Add a title and labels
plt.title('Area Normalized (sq. mi.) total tornado counts by State from 1950-2022')
plt.xlabel('State')
plt.ylabel('Normalized Count')

# Rotate the x-axis labels
plt.xticks(rotation=45)

print(f" From 1950-2022 top states for number of tornadoes normalized by total area of the state is: \n {merged_tor_by_states['normalized om'].nlargest(13)}")

Within any year, which US states averaged the greatest annual number of tornadoes?

  • Group by state and year, count mean annual number
# Group by state and year, and count the number of tornadoes in each group
tornado_count_by_state_year = tornado_data.groupby(['st', 'yr'])['om'].count()
print(tornado_count_by_state_year)

# Group by state, and calculate the mean count in each group
tornado_mean_count_by_state = tornado_count_by_state_year.groupby('st').mean()

# Group by state, and calculate the median count in each group
tornado_median_count_by_state = tornado_count_by_state_year.groupby('st').median()

# Sort the result in descending order
tornado_mean_count_by_state = tornado_mean_count_by_state.sort_values(ascending=False)
# Sort the result in descending order
tornado_median_count_by_state = tornado_median_count_by_state.sort_values(ascending=False)

print('\n')

print("Descriptive statistics for mean annual tornado count by state across the US from 1950 - 2022")

print(tornado_mean_count_by_state.describe())
tornado_mean_count_by_state.plot.bar(figsize=(12,4))

# Add a title and labels
plt.title('Mean annual tornado count by State 1950-2022')
plt.xlabel('State')
# Rotate the x-axis labels
plt.xticks(rotation=45)
plt.ylabel('Count')
# convert our grouped tornado count by state into dataframe

tor_annual_mean_by_state_data = pd.DataFrame(tornado_mean_count_by_state)
print(len(tor_annual_mean_by_state_data))

tor_annual_mean_by_state_data = tor_annual_mean_by_state_data[tor_annual_mean_by_state_data['om']>10]
print(tor_annual_mean_by_state_data)

print(len(tor_annual_mean_by_state_data))

merged_annual_mean_tor_by_states = tor_annual_mean_by_state_data.merge(us_state_areas_data_filtered,on='st')

merged_annual_mean_tor_by_states['normalized om'] = merged_annual_mean_tor_by_states['om']/merged_annual_mean_tor_by_states['Square Miles']

print(merged_annual_mean_tor_by_states.sort_values(by='normalized om', ascending=False))
# sort the data from highest to lowest normalized om
merged_annual_mean_tor_by_states=merged_annual_mean_tor_by_states.sort_values(by='normalized om', ascending=False)

# Set 'st' states as the index
merged_annual_mean_tor_by_states.set_index('st', inplace=True)
top_25_states_mean_annual_ntor =merged_annual_mean_tor_by_states['normalized om'].nlargest(25)
print(top_25_states_mean_annual_ntor)
# plot the data
merged_annual_mean_tor_by_states['normalized om'].nlargest(25).plot.bar(figsize=(12,4))

# Add a title and labels
plt.title('Top 25 US states for Mean annual total tornado counts from 1950-2022 Area Normalized (sq. mi.)')
plt.xlabel('State')
plt.ylabel('Normalized Count')

# Rotate the x-axis labels
plt.xticks(rotation=45)

print(f" From 1950-2022 top states for mean annual number of tornadoes normalized by total area of the state is: \n {merged_annual_mean_tor_by_states['normalized om'].nlargest(13)}")
# # A common projection to use 
proj=ccrs.PlateCarree()

# Create a figure with an axes object on which we will plot. Pass the projection to that axes.
fig, ax = plt.subplots(figsize=(18,15),subplot_kw=dict(projection=proj))

# Remove the frame
#ax.set_frame_on(False)

# Set the extent of the map to the continental US
        # [x min, x max, y min, y max]
        # x min longitude = -130 degrees (West)
        # x max longitude = -60 degrees (West)
        # y min latitude = 15 degrees (North)
        # y max latitude = 50 degrees (North)
ax.set_extent([-130, -60, 15, 50])

# Add geographic features
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.OCEAN)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.add_feature(cfeature.STATES, linestyle='-')
ax.add_feature(cfeature.LAKES, alpha=0.5)  # Add lakes


# Acquire datasets for state borders to help center our datapoints
    # Get the shapefile for US states
    # Uses the shpreader.natural_earth() function 

    # shapefiles store the location, shape, and other attributes of physical objects as lines, points, and polygons
    # so shapefiles help describe physical locations by mapping polygons and other shapes to specific geographical coordinates
shapename = 'admin_1_states_provinces_lakes' # the particular shapefile for state and province boundaries of the entire world

    # need to import cartopy.io.shapereader as shpreader to be able to access and manipulate shapefiles
    # downloads and returns the file path of a shapefile from the Natural Earth database
    # public map dataset with data available at many scales: 1:10 (fine), 1:50, 1:110 million (coarse)

                # resolution is the scale of the shapefile (10, 50, 110); category is physical (rivers, mountains) or cultural (country, state borders)
states_shp = shpreader.natural_earth(resolution='110m', category='cultural', name=shapename)



# Access the geometries and attributes of the features in the downloaded shape file
    #shpreader.Reader(shape_file_name).records()
    # creates a Reader object from the shapefile
    # Reader helps to access the contents of the shapefile
    # has two methods records() and geometries() 
    # records() allows us to access more metadata than geometries()

#=============================Optional- one-time set up for help accessing specific attributes of states
# # Load the shapefile
# reader = shpreader.Reader(states_shp)

# # Get the fields from the first record
# first_state = next(reader.records())
# fields = first_state.attributes.keys()
# print("Fields:", fields)
# # Get the unique states
# states = set(record.attributes['postal'] for record in reader.records())
# print(f"States:, {states}")
#===================================================================================== 

# Create a dictionary to store the geometric centers (centroid) of each state
state_centroids = {}

# Iterate over the states
    #.records() function used to iterate over the records in the shapefile
    # each record in a shapefile consist of its geometry (shape) and its attributes (additional info about the state)



for state in shpreader.Reader(states_shp).records():# 
    # Get the name of the state
    # state.attributes is a dictionary containing the attributes of the state
    # keys in this dictionary are the names of the attributes
    state_name = state.attributes['postal'] # postal is a key used to access the two-letter postal abbreviation of the state

    # Calculate the geometric centers (centroid) of each state using the shapely library, and stored their values
                                # .geometry attribute represents the shape of the state
                                # .geometry attribute came from shpreader.Reader(shape_file_name), which read the shapes from the shapefile
                                #.centroid attribute of the shape is a point that represents the center of this shape
                                #.x and .y attributes of the centroid are lon and lat coordinates of the center             
    centroid = sgeom.Point(state.geometry.centroid.x, state.geometry.centroid.y)# creates a point object based on the center coordinates

    # store each state center in the state centroid dictionary
                                # key is the name of the state
                                #  value is the lon, lat coordinates of the center point (centroid)
    state_centroids[state_name] = centroid

# use state_centroids to plot markers centered on each state in the plot
for state, centroid in state_centroids.items():
    if state not in ['HI' , 'AK']:
        lon, lat = centroid.x, centroid.y
        mean_count = tornado_mean_count_by_state.loc[state]
        # we scale down marker size
        #ro' argument is a shorthand argument for the color and marker style of the plot, r=red, o=circle , ro= red circle.
        ax.plot(lon, lat, 'ro', markersize=mean_count*0.2, transform=ccrs.PlateCarree())
        ax.text(lon, lat, state, transform=ccrs.PlateCarree())

# Add a legend
# Calculate min, median, and max
min_count = tornado_mean_count_by_state.min()
mean_count = tornado_mean_count_by_state.mean()
max_count = tornado_mean_count_by_state.max()

# Create circles for the legend that roughly approximate the sizes and values of circles on the US state map
# sizes of circle are proportional to the tornado count mean values' min, max, mean or half max
circle_min = Line2D([0], [0], marker='o', color='w', markerfacecolor='r', markersize=min_count*3, label=f"{round(min_count)} (min)")
circle_mean = Line2D([0], [0], marker='o', color='w', markerfacecolor='r', markersize=mean_count*0.45, label=round(mean_count))
circle_max = Line2D([0], [0], marker='o', color='w', markerfacecolor='r', markersize=max_count*0.15, label= f"{round(max_count)} (max)")
circle_half_max = Line2D([0], [0], marker='o', color='w', markerfacecolor='r', markersize=max_count*0.1, label= round(max_count/2))

# Add the legend to the plot
ax.legend(title= 'Mean Annual tornado count',handles=[circle_min, circle_mean,circle_half_max, circle_max])

# Some options involving the gridlines
gl = ax.gridlines(draw_labels=True)
gl.top_labels = False
gl.right_labels = False
ax.set_title('Mean Annual tornado count by State 1950-2022')
# # A common projection to use 
proj=ccrs.PlateCarree()

# Create a figure with an axes object on which we will plot. Pass the projection to that axes.
fig, ax = plt.subplots(figsize=(16,14),subplot_kw=dict(projection=proj))


# Create a colormap
cmap = cm.get_cmap('viridis')


# Set the extent of the map to the continental US
        # [x min, x max, y min, y max]
        # x min longitude = -130 degrees (West)
        # x max longitude = -60 degrees (West)
        # y min latitude = 15 degrees (North)
        # y max latitude = 50 degrees (North)
ax.set_extent([-130, -60, 15, 50])

# Add geographic features
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.OCEAN)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.add_feature(cfeature.STATES, linestyle='-')
ax.add_feature(cfeature.LAKES, alpha=0.5)  # Add lakes


# Acquire datasets for state borders to help center our datapoints
    # Get the shapefile for US states
    # Uses the shpreader.natural_earth() function 

    # shapefiles store the location, shape, and other attributes of physical objects as lines, points, and polygons
    # so shapefiles help describe physical locations by mapping polygons and other shapes to specific geographical coordinates
shapename = 'admin_1_states_provinces_lakes' # the particular shapefile for state and province boundaries of the entire world

    # need to import cartopy.io.shapereader as shpreader to be able to access and manipulate shapefiles
    # downloads and returns the file path of a shapefile from the Natural Earth database
    # public map dataset with data available at many scales: 1:10 (fine), 1:50, 1:110 million (coarse)

                # resolution is the scale of the shapefile (10, 50, 110); category is physical (rivers, mountains) or cultural (country, state borders)
states_shp = shpreader.natural_earth(resolution='110m', category='cultural', name=shapename)



# Access the geometries and attributes of the features in the downloaded shape file
    #shpreader.Reader(shape_file_name).records()
    # creates a Reader object from the shapefile
    # Reader helps to access the contents of the shapefile
    # has two methods records() and geometries() 
    # records() allows us to access more metadata than geometries()

#=============================Optional- one-time set up for help accessing specific attributes of states
# # Load the shapefile
# reader = shpreader.Reader(states_shp)

# # Get the fields from the first record
# first_state = next(reader.records())
# fields = first_state.attributes.keys()
# print("Fields:", fields)
# # Get the unique states
# states = set(record.attributes['postal'] for record in reader.records())
# print(f"States:, {states}")
#===================================================================================== 

# Create a dictionary to store the geometric centers (centroid) of each state
state_centroids = {}

# Iterate over the states
    #.records() function used to iterate over the records in the shapefile
    # each record in a shapefile consist of its geometry (shape) and its attributes (additional info about the state)



for state in shpreader.Reader(states_shp).records():# 
    # Get the name of the state
    # state.attributes is a dictionary containing the attributes of the state
    # keys in this dictionary are the names of the attributes
    state_name = state.attributes['postal'] # postal is a key used to access the two-letter postal abbreviation of the state

    # Calculate the geometric centers (centroid) of each state using the shapely library, and stored their values
                                # .geometry attribute represents the shape of the state
                                # .geometry attribute came from shpreader.Reader(shape_file_name), which read the shapes from the shapefile
                                #.centroid attribute of the shape is a point that represents the center of this shape
                                #.x and .y attributes of the centroid are lon and lat coordinates of the center             
    centroid = sgeom.Point(state.geometry.centroid.x, state.geometry.centroid.y)# creates a point object based on the center coordinates

    # store each state center in the state centroid dictionary
                                # key is the name of the state
                                #  value is the lon, lat coordinates of the center point (centroid)
    state_centroids[state_name] = centroid
# Normalize the tornado counts to the range [0, 1]
norm = plt.Normalize(top_25_states_mean_annual_ntor.min(), top_25_states_mean_annual_ntor.max())
# Create a ScalarMappable instance
sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)

# use state_centroids to plot markers centered on each state in the plot
for state, centroid in state_centroids.items():
    if state in top_25_states_mean_annual_ntor.index:
        lon, lat = centroid.x, centroid.y
        area_normalized_count = top_25_states_mean_annual_ntor.loc[state]
        color = cmap(norm(area_normalized_count))# Use the colormap to plot markers
        # scale marker size by the area-normalized count
        # 'o' marker with the fill color taken from the viridis colormap (color passed explicitly instead of a format string like 'ro')
        ax.plot(lon, lat, 'o', color=color, markersize=area_normalized_count*30000, transform=ccrs.PlateCarree())
        ax.text(lon, lat, state, transform=ccrs.PlateCarree())

# Create a new axes for the colorbar at the right of the current axes
cbar_ax = fig.add_axes([0.95, 0.25, 0.03, 0.5])

# Add a colorbar to the new axes
plt.colorbar(sm, cax=cbar_ax, orientation='vertical', label='Area Normalized Mean annual tornado count (Sq. Miles)')

# Some options involving the gridlines
gl = ax.gridlines(draw_labels=True)
gl.top_labels = False
gl.right_labels = False
ax.set_title('Top 25 States for Mean annual tornado count from 1950-2022')

Focus analysis at the single state-level : Minnesota

minnesota_filtered_tor_data = tornado_data[(tornado_data['st']=='MN')]
mn_annual_tor_counts= minnesota_filtered_tor_data.groupby('yr')['om'].count()


print(f"Total number of records in the Minnesota filtered tornado dataset from 1950-2022:{'\n'} {len(minnesota_filtered_tor_data)}")

print('\n')

print("Descriptive statistics for Minnesota total annual tornadoes across the US from 1950 - 2022")
print(mn_annual_tor_counts.describe())


print("Total number of records in the Minnesota filtered tornado dataset from 1950-2022")

# Get the maximum and minimum annual tornado counts
max_annual_tornados = mn_annual_tor_counts.max()
min_annual_tornados = mn_annual_tor_counts.min()
median_annual_tor_count= mn_annual_tor_counts.median()
mean_annual_tor_count= mn_annual_tor_counts.mean()

print('\n')


# Calculate the skewness of the annual tornado counts
skewness = mn_annual_tor_counts.skew()

# Print the skewness
print(f"Skewness: {skewness}")

# Calculate the quartiles
Q1 = mn_annual_tor_counts.quantile(0.25)
Q2 = mn_annual_tor_counts.quantile(0.5)
Q3 = mn_annual_tor_counts.quantile(0.75)

# Calculate the Yule-Kendall index
yule_kendall_index = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)
print(f"Yule-Kendall index: {yule_kendall_index}")

# Calculate the IQR
IQR = Q3 - Q1
print(f"Interquartile Range (IQR): {IQR}")

# Define the bins using the square root rule, number bins is the sq root of the number of data points 
num_bins = int(np.sqrt(len(mn_annual_tor_counts))) # datapoints equal to the number of years since we are counting tornadoes per year
print('\n')
print(f'number of bins is {num_bins}')

#print(f'number of years used in total annual tornado counts {len(total_annual_tornados)} years')

# Compute the width of each bin
bin_width= (max_annual_tornados - min_annual_tornados) / num_bins
print(f'bin width is {bin_width}')

print('\n')


# Plot a histogram of the annual tornado counts
mn_annual_tor_counts.plot.hist(bins=num_bins, edgecolor='black')

# Add a title and labels
plt.title('Distribution of Annual Minnesota Tornado Counts (1950-2022)')
plt.xlabel('Annual Tornado Count')
plt.ylabel('Number of Years')
plt.axvline(median_annual_tor_count, color='r', linestyle=':', label='Median annual count', zorder=2) 
plt.axvline(mean_annual_tor_count, color='b', linestyle=':', label='Mean annual count', zorder=3) 
# Add a legend
plt.legend()

plt.show()

# Takeaway on Minnesota annual tornado counts: overall there are more years with above-average tornado counts
    # the statistics suggest that the annual tornado counts are slightly positively skewed and have considerable variability
    # while most years have tornado counts close to the mean,
    # there are years with above-average tornado counts
    # and the years with the highest counts sit slightly further from the mean than the years with the lowest counts

What is the annual count of tornadoes for Minnesota by County?

# Filter data to state of Minnesota
minnesota_filtered_tor_data = tornado_data[(tornado_data['st']=='MN')]

minnesota_filtered_tor_data.head()
# obtain a list of county coordinates from the Minnesota Geospatial Information Office

url='https://www.mngeo.state.mn.us/chouse/coordinates.html'

# Load the county centroid data
county_boundaries_list = pd.read_html(url,header=0)  # read_html returns a list of the dataframes found on the webpage

# Need to select a table. Select the first table
county_boundaries = county_boundaries_list[0]

print(county_boundaries.head())
rows_to_remove = ['Minnesota (statewide)','Twin Cities  metropolitan area','Canada', 'State of Illinois','State of Iowa','State of North Dakota', 'State of South Dakota',  'State of Wisconsin']


# Replace whitespaces in column names with underscores
county_boundaries.columns = county_boundaries.columns.str.replace('  ', '_')
print(county_boundaries.columns)
cleaned_county_boundaries= county_boundaries[~county_boundaries['Area_covered'].isin(rows_to_remove)]

cleaned_county_boundaries.head()

# Reset the index of county_boundaries DataFrame
cleaned_county_boundaries = cleaned_county_boundaries.reset_index(drop=True)

cleaned_county_boundaries.head(10)
# find the nearest counties to each tornado based on their lat and lon relative to county center points / centroid

# Calculate the centroid coordinates for each county
    # computes the average of the west/east and north/south county bounding coordinates to find the center point for lon and lat

cleaned_county_boundaries ['Longitude'] = (cleaned_county_boundaries ['West_bounding coordinate'] + cleaned_county_boundaries ['East_bounding coordinate']) / 2
cleaned_county_boundaries ['Latitude'] = (cleaned_county_boundaries ['North_bounding coordinate'] + cleaned_county_boundaries ['South_bounding coordinate']) / 2

show(cleaned_county_boundaries.head())
# Initialize a counter to track the number of tornado events where the nearest county could not be found using the index-of-closest-lat equals index-of-closest-lon approach
else_counter = [0] #counter is defined as a list so that it can be modified inside the get_nearest_county function

# Initialize a counter
if_counter = [0] #counter is defined as a list so that it can be modified inside the get_nearest_county function

# Define a function to get the nearest county 
# Based on comparing the lons and lats of a unique list of counties with the lon and lat of each individual tornado event
# Returns the county name that corresponds to the county that is closest to the tornado event
def get_nearest_county(tornado_lat, tornado_lon):# these values are passed into the function for each record in the tornado dataset


    # Calculate the absolute difference between the tornado coordinates and each county centroid
    # Distance from each point of the longitude array from the longitude of a single tornado event 
    # We want absolute value because we want minimum distance, whether its to the east or the west!
    lon_diff = np.fabs(cleaned_county_boundaries ['Longitude'] - tornado_lon)
    lat_diff = np.fabs(cleaned_county_boundaries ['Latitude'] - tornado_lat)

    # Find the index of the minimum difference for longitude and latitude
    # stores the index position of the counties with the smallest difference
    # stores separately the index of the closest counties by lon and by lat
    # The argmin function will find the index in this array (that consists of
    # distance between each county's lat or lon and a tornadoes lat or lon) with the minimum value!
    # I.e., the county latitude that is closest to the tornado event (the minimum distance)
    # And the county longitude that is closest to the tornado event (the minimum distance)
    # Which together gives you the county that is closest to the tornado event!

    nearest_lon_idx = lon_diff.argmin()
    nearest_lat_idx = lat_diff.argmin()

    # checks if the index of the county with the closest longitude to the tornado is the same as the index of the county with the closest latitude to the tornado
    # The reason this approach works is because each index in the cleaned_county_boundaries DataFrame corresponds to a unique county with a specific latitude and longitude. 
    #If the indices for the smallest latitude and longitude differences are the same, 
    #it means that the same county is closest to the tornado event in terms of both latitude and longitude

    if nearest_lon_idx == nearest_lat_idx:
        if_counter[0] +=1
        return cleaned_county_boundaries.loc[nearest_lon_idx, 'Area_covered'] # If the indices are the same, return the county name


    else:
        # If the indices are different, calculate the Euclidean distance to each county centroid and return the nearest one
            # means that one county has a closer longitude but another county has the closer latitude to the tornado event
            #In this case, the function calculates the Euclidean distance from the tornado event to each county
        distances = np.sqrt((cleaned_county_boundaries ['Longitude'] - tornado_lon)**2 + (cleaned_county_boundaries ['Latitude'] - tornado_lat)**2)
        nearest_idx = distances.argmin()

        else_counter[0] += 1 # counts how many tornado events are edge cases not covered by the if statement

        return cleaned_county_boundaries.loc[nearest_idx, 'Area_covered']  #returns the county with the smallest Euclidean distance

# Apply the function to each record in the tornado dataset
    #.apply() applies a lambda function to each row of the DataFrame (because axis=1) 
    #lambda function takes a row of the DataFrame as input
    #passes the 'slat' and 'slon' values of the row to the get_nearest_county() function
    #Result is a Series of the nearest county for each tornado event
    # For each row these values in series are assigned to a new column called 'county', representing the county closest to the tornado event 
minnesota_filtered_tor_data['county'] = minnesota_filtered_tor_data.apply(lambda row: get_nearest_county(row['slat'], row['slon']), axis=1)

# Print the counter
print('The if statement was triggered, i.e., county with closest lon same as county with closest lat', if_counter[0], 'times.')

# Print the counter
print('The else statement was triggered, i.e, county with the closest lon to a tornado event was not the same as the county with the closest lat', else_counter[0], 'times.')
minnesota_filtered_tor_data_original = tornado_data[(tornado_data['st']=='MN')]

print(f'number of records in original Minnesota-filtered tornado dataset {len(minnesota_filtered_tor_data_original)}')
print(f'number of records in county-enriched Minnesota-filtered tornado dataset {len(minnesota_filtered_tor_data)}')

print('\n')
print(minnesota_filtered_tor_data.columns)
print('\n')
minnesota_filtered_tor_data.head()

What time do most tornadoes occur in Minnesota?

# Using loc and .index[0] to access first rows state name
state_name = minnesota_filtered_tor_data.loc[minnesota_filtered_tor_data.index[0], 'st']

# Convert 'time' to datetime format
minnesota_filtered_tor_data['time'] = pd.to_datetime(minnesota_filtered_tor_data['time'])

# Extract the hour from 'time'
minnesota_filtered_tor_data['hour'] = minnesota_filtered_tor_data['time'].dt.hour

# Group by 'hour' and count the number of tornadoes
mn_hour_tor = minnesota_filtered_tor_data.groupby('hour')['om'].count()

print(f'Descriptive Statistics for the state of {state_name}')

print(mn_hour_tor.describe())

print('\n')

print(f' there are {mn_hour_tor.sum()} tornadoes in the hour-grouped dataset')

print('\n')


print(mn_hour_tor.sort_values(ascending=False))


# Find the most common hour for tornadoes
most_frequent_tor_hour = mn_hour_tor.idxmax()
count_most_common_tor_hour = mn_hour_tor[most_frequent_tor_hour]
print('\n')
print(f'From 1950-2022 most tornadoes occur at {convert_to_12_hour(most_frequent_tor_hour)} Central Standard Time (CST) with {count_most_common_tor_hour} occurring at this time of day')


# Plot the count of tornadoes by hour of the day
mn_hour_tor.plot()
# Use list comprehension to apply convert_to_12_hour to each element of the index
mn_hour_tor.index = [convert_to_12_hour(int(hour)) for hour in mn_hour_tor.index]

print(f'Times for tornado events in {state_name}:{'\n'}{mn_hour_tor.sort_values(ascending=False)}')

# Calculate the percentage of the total count for each hour
    # kept in a separate Series so the raw hourly counts in mn_hour_tor are not mixed with percentages
mn_hour_tor_pct = (mn_hour_tor / mn_hour_tor.sum()) * 100


print('\n')

top_10_mn_hour_tor = mn_hour_tor_pct.nlargest(10)

print('\n')

print(f'The top 10 times of day - Central Standard Time (CST) with most tornadoes by percentages of total counts:{'\n'}{top_10_mn_hour_tor}')


# Add a title and labels
plt.title('Minnesota Total tornado Count by Time of Day from 1950-2022')
plt.xlabel('Time of day - Central Standard Time (CST)')
plt.ylabel('Count')

# Add a shaded region for the most common hour for tornadoes
plt.axvspan(most_frequent_tor_hour-2, most_frequent_tor_hour+2, color='gray', alpha=0.5, label='Most common time for tornadoes')

# Set the x-axis ticks to every 2 hours
plt.xticks(range(0, 24, 2))

# Add a legend
plt.legend()

# Convert the x-axis labels to 12-hour format
labels = [convert_to_12_hour(int(item)) if item.is_integer() else item for item in plt.gca().get_xticks()]
plt.gca().set_xticklabels(labels)

What is the distribution of total tornadoes by county in Minnesota from 1950-2022?

# group by county and count the number of tornadoes per county across the entire dataset
MN_tor_totals_by_county =minnesota_filtered_tor_data.groupby('county')['om'].count().sort_values(ascending=False)

print(f' In the county grouped Minnesota tornado data there are {MN_tor_totals_by_county.sum()} tornadoes')

print(MN_tor_totals_by_county.sort_values(ascending=False))
print('\n')
print(f'Descriptive statistics for total county tornado counts for the state of Minnesota from 1950-2022:{'\n'}{MN_tor_totals_by_county.describe()}')
# Get summary statistics for the county-level tornado counts
mn_total_county_stats = MN_tor_totals_by_county.describe()

# Get the maximum, minimum, median, and mean county tornado counts
max_county_tornados = mn_total_county_stats['max']
min_county_tornados = mn_total_county_stats['min']
median_county_tor_count= mn_total_county_stats['50%']
mean_county_tor_count= mn_total_county_stats['mean']

print('\n')


# Calculate the skewness of the county tornado counts
skewness = MN_tor_totals_by_county.skew()

# Print the skewness
print(f"Skewness: {skewness}")

# Calculate the quartiles
Q1 = MN_tor_totals_by_county.quantile(0.25)
Q2 = MN_tor_totals_by_county.quantile(0.5)
Q3 = MN_tor_totals_by_county.quantile(0.75)

# Calculate the Yule-Kendall index
yule_kendall_index = ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)
print(f"Yule-Kendall index: {yule_kendall_index}")

# Calculate the IQR
IQR = Q3 - Q1
print(f"Interquartile Range (IQR): {IQR}")

# Define the bins using the square root rule: the number of bins is the square root of the number of data points
num_bins = int(np.sqrt(len(MN_tor_totals_by_county))) # data points equal the number of counties since we are counting total tornadoes per county
print('\n')
print(f'number of bins is {num_bins}')

#print(f'number of years used in total annual tornado counts {len(total_annual_tornados)} years')

# Compute the width of each bin
bin_width= (max_county_tornados - min_county_tornados) / num_bins
print(f'bin width is {bin_width}')

print('\n')


# Plot a histogram of the county tornado counts
MN_tor_totals_by_county.plot.hist(bins=num_bins, edgecolor='black')

# Add a title and labels
plt.title('Distribution of Total Minnesota Tornado Counts by County (1950-2022)')
plt.xlabel('Total Tornado Count per County')
plt.ylabel('Number of Counties')
plt.axvline(median_county_tor_count, color='r', linestyle=':', label='Median county count', zorder=2) 
plt.axvline(mean_county_tor_count, color='b', linestyle=':', label='Mean county count', zorder=3) 
# Add a legend
plt.legend()

plt.show()

# Takeaway on county tornado counts: overall there are more counties with above-average tornado counts
    # the statistics suggest that the county tornado counts are positively skewed and have considerable variability
    # while most counties have tornado counts close to the mean,
    # there are counties with above-average tornado counts
    # and the counties with the highest counts are further from the mean than the counties with the lowest counts
num_counties_with_24_tornadoes = (MN_tor_totals_by_county == 24).sum() # number of counties with exactly 24 tornadoes
print(num_counties_with_24_tornadoes)

Which counties tended to have the most tornadoes from 1950-2022?

# Plot the county level data
MN_tor_totals_by_county.plot.bar(figsize=(16,4))

plt.title('Total Tornado Count by County from 1950-2022')
plt.xlabel('County')
plt.ylabel('Tornado Count')
plt.legend()
plt.show()
cleaned_county_boundaries.head()
# Group by county and calculate the mean latitude and longitude for each county
    # we use the .agg() method to apply different functions to different columns:
        # we are applying two different functions to 3 different columns
        # applying mean to slat and slon columns
            # we take the mean slat and slon because within any county a tornado can have multiple lat and lon
        # applying count to the om column
county_data = minnesota_filtered_tor_data.groupby('county').agg({
    'slat': 'mean',
    'slon': 'mean',
    'om': 'count'
}).reset_index()
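For readability, the same grouping can also be written with pandas named aggregation; this is an equivalent sketch, and the column name tor_count is introduced here only for illustration (the rest of the notebook keeps using the om count above).

# Equivalent grouping using pandas named aggregation (sketch)
county_data_named = minnesota_filtered_tor_data.groupby('county').agg(
    slat=('slat', 'mean'),
    slon=('slon', 'mean'),
    tor_count=('om', 'count')
).reset_index()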

print(f" Top 10 counties for total tornado count from 1950-2022:{'\n'} {county_data.sort_values(by='om',ascending=False).head(10)}")

proj=ccrs.PlateCarree()
trans=ccrs.PlateCarree()

# Create a figure with an axes object on which we will plot. Pass the projection to that axes.
fig, ax = plt.subplots(figsize=(16,14), subplot_kw=dict(projection=proj))




# Add geographic features
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.OCEAN)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.add_feature(cfeature.STATES, linestyle='-')
ax.add_feature(cfeature.LAKES, alpha=0.5)  # Add lakes

# Read the county borders shapefile
reader = shpreader.Reader(r'G:\My Drive\Python_projects\20240110_ATMS_517\Project\Data\mn_county_data\mn_county_boundaries2.shp')


# Add the county borders to the map
for record in reader.records():
    ax.add_geometries([record.geometry], ccrs.PlateCarree(), facecolor='none', edgecolor='grey')

for i, row in county_data.iterrows():
    plt.text(row['slon'], row['slat'], row['county'], horizontalalignment='right', verticalalignment='bottom', transform=ccrs.PlateCarree(), size=6)

# Plot the data
scatter = ax.scatter(county_data['slon'], county_data['slat'], s=county_data['om']*3, c=county_data['om'], cmap='YlOrRd', alpha=0.5)



cb=fig.colorbar(scatter,shrink=0.4)
cb.set_label("Total tornado count")
gl = ax.gridlines(draw_labels=True)
gl.right_labels= False
gl.top_labels=False
ax.set_title("Total Tornado Counts in Minnesota by County from 1950-2022")
plt.show()

Which counties tended to have the greatest number of tornadoes with magnitude greater than 1 from 1950-2022?

# Filter to tornadoes with magnitude greater than 1, then group by county and calculate the mean latitude and longitude for each county
    # we use the .agg() method to apply different functions to different columns:
        # we apply two different functions across three different columns
        # applying mean to the slat and slon columns
            # we take the mean slat and slon because within any county a tornado can have multiple start latitudes and longitudes
        # applying count to the om column to get the tornado count per county
filtered_data = minnesota_filtered_tor_data[minnesota_filtered_tor_data['mag'] > 1]
county_data = filtered_data.groupby('county').agg({
    'slat': 'mean',
    'slon': 'mean',
    'om': 'count'
}).reset_index()

print(f" Top 10 counties for total tornado count from 1950-2022:{'\n'} {county_data.sort_values(by='om',ascending=False).head(10)}")

proj=ccrs.PlateCarree()
trans=ccrs.PlateCarree()

# Create a figure with an axes object on which we will plot. Pass the projection to that axes.
fig, ax = plt.subplots(figsize=(16,14), subplot_kw=dict(projection=proj))




# Add geographic features
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.OCEAN)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.add_feature(cfeature.STATES, linestyle='-')
ax.add_feature(cfeature.LAKES, alpha=0.5)  # Add lakes

# Read the county borders shapefile
reader = shpreader.Reader(r'G:\My Drive\Python_projects\20240110_ATMS_517\Project\Data\mn_county_data\mn_county_boundaries2.shp')


# Add the county borders to the map
for record in reader.records():
    ax.add_geometries([record.geometry], ccrs.PlateCarree(), facecolor='none', edgecolor='grey')

for i, row in county_data.iterrows():
    plt.text(row['slon'], row['slat'], row['county'], horizontalalignment='right', verticalalignment='bottom', transform=ccrs.PlateCarree(), size=6)

# Plot the data
scatter = ax.scatter(county_data['slon'], county_data['slat'], s=county_data['om']*3, c=county_data['om'], cmap='YlOrRd', alpha=0.5)



cb=fig.colorbar(scatter,shrink=0.4)
cb.set_label("Total tornado count (magnitude > 1)")
ax.set_title("Total Tornado Counts with Magnitude >1 in Minnesota by County from 1950-2022")
gl = ax.gridlines(draw_labels=True)
gl.right_labels= False
gl.top_labels=False
plt.show()
# Group by county and year and calculate the mean latitude and longitude for each county-year pair
    # we use the .agg() method to apply different functions to different columns:
        # we apply two different functions across three different columns
        # applying mean to the slat and slon columns
            # we take the mean slat and slon because within any county a tornado can have multiple start latitudes and longitudes
        # applying count to the om column to get the annual tornado count per county
county_data_annual = minnesota_filtered_tor_data.groupby(['county','yr']).agg({
    'slat': 'mean',
    'slon': 'mean',
    'om': 'count'
}).reset_index()

print(f" Top 10 counties with greatest number of recent annual tornado count from 1950-2022:{'\n'} {county_data_annual.sort_values(by=['yr', 'om'], ascending=[False, False]).head(10)}")
proj=ccrs.PlateCarree()
trans=ccrs.PlateCarree()

# Create a figure with an axes object on which we will plot. Pass the projection to that axes.
fig, ax = plt.subplots(figsize=(16,14), subplot_kw=dict(projection=proj))




# Add geographic features
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.OCEAN)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.add_feature(cfeature.STATES, linestyle='-')
ax.add_feature(cfeature.LAKES, alpha=0.5)  # Add lakes

# Read the county borders shapefile
reader = shpreader.Reader(r'G:\My Drive\Python_projects\20240110_ATMS_517\Project\Data\mn_county_data\mn_county_boundaries2.shp')


# Add the county borders to the map
for record in reader.records():
    ax.add_geometries([record.geometry], ccrs.PlateCarree(), facecolor='none', edgecolor='grey')

# Create a DataFrame with the mean latitude and longitude for each county
county_locations = minnesota_filtered_tor_data.groupby('county').agg({
    'slat': 'mean',
    'slon': 'mean'
}).reset_index()

# Add the county labels to the map
for i, row in county_locations.iterrows():
    plt.text(row['slon'], row['slat'], row['county'], horizontalalignment='right', verticalalignment='bottom', transform=ccrs.PlateCarree(), size=6)

# Plot the data
scatter = ax.scatter(county_data_annual['slon'], county_data_annual['slat'], s=county_data_annual['om']*3, c=county_data_annual['yr'], cmap='rainbow', alpha=0.5)

# Build a size legend
# Calculate the minimum and maximum annual county tornado counts
min_count = county_data_annual['om'].min()
max_count = county_data_annual['om'].max()

# Create proxy circles for the legend that roughly approximate the sizes of the circles on the county map
# marker sizes are scaled by hand so the legend circles visually resemble the smallest and largest scatter points
circle_min = Line2D([0], [0], marker='o', color='w', markerfacecolor='grey', markersize=min_count*4, label=f"{round(min_count)} (min)")
circle_max = Line2D([0], [0], marker='o', color='w', markerfacecolor='grey', markersize=max_count*0.7, label=f"{round(max_count)} (max)")


# Add the legend to the plot
ax.legend(title= 'Annual tornado count',handles=[circle_min, circle_max])
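The hand-scaled proxy circles above only approximate the scatter marker sizes. As an alternative, matplotlib can derive size-legend handles directly from the scatter's PathCollection; the sketch below uses the same variables, converts the plotted sizes (om*3) back to counts, and would replace the hand-built legend if called (num=3 is an arbitrary choice here).

# Alternative: let matplotlib build the size legend from the scatter itself (sketch)
size_handles, size_labels = scatter.legend_elements(prop="sizes", num=3, func=lambda s: s / 3, color='grey', alpha=0.5)
ax.legend(size_handles, size_labels, title='Annual tornado count', loc='lower right')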


ax.set_title("Annual Tornado Counts in Minnesota by County from 1950-2022")
cb=fig.colorbar(scatter,shrink=0.4)
cb.set_label("Year")
gl = ax.gridlines(draw_labels=True)
gl.right_labels= False
gl.top_labels=False
plt.show()
