Introduction to Jupyter Notebooks and Pandas

Jupyter Notebooks

  • This is a jupyter notebook file (file extension is .ipynb for python notebook versus .py for standard python scripts)
  • It is a format used by many data scientists and researchers for analysis and visualization tasks

1. New to Jupyter notebooks?:

  • Jupyter Notebooks are useful because (a) they are easier to work with and understand than single python script files (.py) and
  • (b) enables you to break problems down into steps and quickly see results while building and iterating on your code in blocks called 'cells'
  • Key point: for data analysis a jupyter notebook is clear and efficient way to conduct analysis and communicate the findings

Two types of Jupyter notebook cells:

(1) Markdown cells (Cell->Cell Type->Markdown):

  • Descriptions
  • text (this entire cell that you are looking at is a Markdown cell where text exists)
  • images
  • Key point: Markdown cells are a place for notes and descriptions



(2) Code "cells" (Cell->Cell Type->Code) : Actual code

  • A code "cell" store and compile your code
  • any non-code text must be formatted as a comment - begin with "#"
  • Key point: code cells are the workhorse, they execute your program and output results

Running Jupyter notebook cells

(1) To run every line of code in the entire notebook

  • Cell->Run All

(2) To run only a single code cell

  • Click on the cell and hit Shift-Enter
In [1]:
# This is a code cell
# This is the place where your code is written
# Any lines beginning with the "#" are not code but descriptions to help you and others know what is being done
# hit shift-Enter together to run this cell (Note that we don't expect any result to be generated from this annotation text)
In [2]:
# This is another code cell
# Here we have some code written....it is a 'function' , let's not worry about details of 'functions' at the moment
#  it is just for seeing outputs
# in jupyter notebooks, codes show their outputs below the code cell that was run
def perserverance(text):
    return "so long as you do not stop!"


print(f"{perserverance('It does not matter how slowly you go...')}") 

# hit shift-Enter together to run this cell
# below this example code cell an output will appear
so long as you do not stop!
In [3]:
# another example

print("this will print a message to the output") 

# hit shift-Enter together to run this cell
# print("some text") messages help to a) show result of the code, b) understand what went wrong
this will print a message to the output

Pandas

  • Pandas is a powerful data manipulation library in Python
  • It is widely used in data science, machine learning, scientific computing, and many other data-intensive fields

2. New to pandas?

  • Pandas is useful because (a) it provides flexible data structures for efficient manipulation of structured data and
  • (b) it has rich functionality for data cleaning, transformation, and analysis
  • Key point: for data analysis, pandas provides a high-performance, easy-to-use data structure (DataFrame) and data analysis tools.

Two main data structures in pandas:

(1) Series :

  • A one-dimensional labeled array capable of holding any data type
  • It is similar to a column in a spreadsheet, a field in a database, or a vector in a mathematical matrix
  • Key point: Series is the primary building block of pandas



(2) DataFrame :

  • A two-dimensional labeled data structure with columns potentially of different types.
  • It is similar to a spreadsheet or SQL table
  • Key point: DataFrame is the primary pandas data structure for data manipulation and analysis

Working with pandas

(1) To import the pandas library

  • import pandas as pd

(2) To create a DataFrame

  • df = pd.DataFrame(data)

(3) To read a CSV file into a DataFrame

  • df = pd.read_csv('file.csv')

(4) To get the first 5 rows of the DataFrame

  • df.head()

3. Load your packages

In [4]:
# if you want to see which packages you already have installed from this notebook, uncomment the specified line below  
    #  The '!'  character below is used to run a command checking for your package manager conda directly in the notebook
    # conda is the package manager recommended
# ! conda list  # <--- (highlight this row , press ctrl + / to remove the first  '#' character from this line - otherwise just use backspace)
In [5]:
# I recommend using conda to install install itables in your terminal 
# why in the terminal and not in the code cell? : because conda requires you to input a yes OR no to execute the installation,
#  which may or may not show-up in your environment (e.g., you visual code studio IDE may not prompt you with the y or n choice)
# Instead, in the terminal type: conda install itables, and hit enter
In [6]:
import pandas as pd # should already be pre-installed with conda

import numpy as np # should also already be pre-istalled with conda

from itables import init_notebook_mode, show #to fully take advantage of the ability to display our data as a table format, let's use the itables library 
%matplotlib inline 

init_notebook_mode(all_interactive=True) #After this, any Pandas DataFrame, or Series, is displayed as interactive table, which lets you explore, filter or sort your data
In [7]:
# the following code cell will give you a nice hover highlighting on your tables when used with itables library (for details on itables see the itables lesson on my Git)
In [8]:
%%html
<style>
  .dataTables_wrapper tbody tr:hover {
    background-color: #6495ED; /* Cornflower Blue */
  }
</style>


<!-- #1E3A8A (Dark Blue) -->
<!-- #0D9488 (Teal) -->
<!-- #065F46 (Dark Green) -->
<!-- #4C1D95 (Dark Purple) -->
<!-- #991B1B (Dark Red) -->
<!-- #374151 (Dark Gray) -->
<!-- #B45309 (Deep Orange) -->
<!-- #164E63 (Dark Cyan) -->
<!-- #4A2C2A (Dark Brown) -->
<!-- #831843 (Dark Magenta) -->
<!-- #1E3A8A (Dark Blue ) -->

<!-- Suggested Light Colors for Light Backgrounds -->
<!-- #AED9E0 (Light Blue) -->
<!-- #A7F3D0 (Light Teal) -->
<!-- #D1FAE5 (Light Green) -->
<!-- #DDD6FE (Light Purple) -->
<!-- #FECACA (Light Red) -->
<!-- #E5E7EB (Light Gray) -->
<!-- #FFEDD5 (Light Orange) -->
<!-- #B2F5EA (Light Cyan) -->
<!-- #FED7AA (Light Brown) -->
<!-- #FBCFE8 (Light Magenta) -->

Get into the habit of reading data documentation or Metadata

  • Why? : Some fields are not understandable purely based on their names, some data has caveats, field may have fill values, or other special considerations ### Data source overview
  • Name : U.S. TORNADOES* (1950-2022)
    • Source: NOAA NWS Storm Prediction Center Severe weather database
    • Geographical distribution: US and territories (Puerto Rico, Virgin Islands)
    • Description: comma separated value (.csv) files for tornadoes as compiled from NWS Storm Data. Tornado reports exist back to 1950. Tornado data are provided in raw csv file format. Actual tornado tracks only (not including individual state segments) are provided in the "Actual_tornadoes.csv" file.
    • File name and size: 1950-2022_actual_tornadoes.csv (7.3mb)
    • Data source link: https://www.spc.noaa.gov/gis/svrgis/

Data Structure & Characteristics:

  • Dataset structure: Dataframe created from a csv file loaded into Pandas, generating tabular data with labeled rows and columns, each column is a one-dimensional array-like list of values
  • Dimensions: Two dimensional, each row a single tornado event, and each column a different attributes of the event
  • Coordinates: Index values with integers starting from 0 by default for rows (observations) and columns names for columns (variables)

Most relevant fields:

  • om: tornado number. count of tornadoes through the year
  • yr: year
  • mo: Month - 1-12
  • dy: Day, 1-31
  • date: Date in yyyy-mm-dd
  • time: Time in HH:MM:SS
  • tz: time zone. note: all times except for '?'=unknown and 9=GMT are 3= CST.
  • st: state two-letter abbreviation
  • stf: State FIPS number. Federal Information Processing System (FIPS) Codes which uniquely identify States and Counties.
  • f: F-scale or EF-scale after 2007: values are either 0, 1, 2, 3, 4, 5 or -9 (unknown)
  • inj: Injuries note: when summing for state totals use sn=1, not sg=1
  • fat: Fatalities note: when summing for state totals use sn=1, not sg=1
  • loss: Estimated property loss - From 1996 reported as tornado property damage in millions of dollars (see documentation for additional details)
  • slat: Starting latitude of tornado
  • slon: Starging longitude of tornado
  • elat: Ending latitude of tornado
  • elon: Ending longitude of tornado
  • len: Length of tornado track in miles
  • wid: Width of tornado in yards
  • note: the physical attributes of each row (observation) with regard to time and location are defined by the 'date', 'time', 'lon', and 'lat' columns.

4. Read in the data

In [9]:
# here we will load a historical US tornado dataset using pandas

# you can give it any name you want...just has to follow python convention for names ,search for 'snake case in python to learn more'
tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv") 

print(tor_data.head()) 

# You can show the contents of this dataset as a table below by simply typing its name and running this code cell OR using print(data_set_name)
    om    yr  mo  dy        date      time  tz  st  stf  stn  ...   len  wid  \
0  192  1950  10   1   10/1/1950  21:00:00   3  OK   40   23  ...  15.8   10   
1  193  1950  10   9   10/9/1950   2:15:00   3  NC   37    9  ...   2.0  880   
2  195  1950  11  20  11/20/1950   2:20:00   3  KY   21    1  ...   0.1   10   
3  196  1950  11  20  11/20/1950   4:00:00   3  KY   21    2  ...   0.1   10   
4  197  1950  11  20  11/20/1950   7:30:00   3  MS   28   14  ...   2.0   37   

   ns  sn  sg   f1  f2  f3  f4  fc  
0   1   1   1   25   0   0   0   0  
1   1   1   1   47   0   0   0   0  
2   1   1   1  177   0   0   0   0  
3   1   1   1  209   0   0   0   0  
4   1   1   1  101   0   0   0   0  

[5 rows x 29 columns]
In [10]:
# Because we loaded the itables library, we can see it as an interactive table. You can click on the arrows by the colum names to sort them...
tor_data.head() # note that you do not have to use show() to generate an interactive table unless you want to add addtional customizations 
Out[10]:
om yr mo dy date time tz st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
Loading... (need help?)
In [11]:
# To read any dataset into python you need to specify:

    # 1. WHAT - name to save the data to 
    # 2. WHO - is going to do the work? i.e., which package is going to do the work?, here we rely on pandas 'pd' package or library 
    # 3. HOW - read_csv(), the specific method that pandas will use 
    # 4. WHERE - is the file? the file is specified by its name with relative path (i.e. '.\') or by its full path, including its extension, in this case .csv is the extension


# (1)     (2)    (3)             (4) actual file name and its path is always in quotes
tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv") # Note: we use an 'r' path_name_here because 'r' allows us  to a) copy paste paths directly from your explorer, and mainly b) use '\' without causing error with python
In [12]:
# a good habit is to make  a copy of the original data BEFORE making any changes

orginal_tor_data=tor_data.copy() # now when we want to compare or revert to original data we have a way to do that
In [13]:
# use the itables package with the .show() method
# notice how we can customize the table
show(tor_data.head(), caption="Tornado Data 1950-2022", options={'hover': True})  # itables also allows you to add a title or caption to your table


# caption adds a title to the table, it is below the table But still used full for descriptions
# add a hover highlighting function
Tornado Data 1950-2022
om yr mo dy date time tz st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
Loading... (need help?)
In [14]:
#  column_filters="header" extension for itables, my personal favorite for exploring table data 



show(tor_data, column_filters="header",layout={"topEnd": None}, maxBytes=0 ,options={'hover': True} )  # Adds individual column filters and removes the single default search bar (which isn't that useful )
omyrmodydatetimetzststfstnmaginjfatlossclossslatslonelatelonlenwidnssnsgf1f2f3f4fc
Loading... (need help?)
In [15]:
# this is the table of data we read in 
# it has over 68 thousand rows and 29 columns
# first column has no name, BUT that column contains all the row labels of the data, these row labels are called the index
In [16]:
# Now that we have a good handle on the data
# let us confirm the type of data we are using 

type(tor_data)

# a dataframe is the technical name for the data table with rows and columns that pandas has created for us
Out[16]:
pandas.core.frame.DataFrame
In [17]:
# another way to know how many rows, columns

tor_data.shape
Out[17]:
(68701, 29)
In [18]:
# what if you want to know how many rows, columns, also the data type of each column?

tor_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68701 entries, 0 to 68700
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   om      68701 non-null  int64  
 1   yr      68701 non-null  int64  
 2   mo      68701 non-null  int64  
 3   dy      68701 non-null  int64  
 4   date    68701 non-null  object 
 5   time    68701 non-null  object 
 6   tz      68701 non-null  int64  
 7   st      68701 non-null  object 
 8   stf     68701 non-null  int64  
 9   stn     68701 non-null  int64  
 10  mag     68701 non-null  int64  
 11  inj     68701 non-null  int64  
 12  fat     68701 non-null  int64  
 13  loss    68701 non-null  float64
 14  closs   68701 non-null  float64
 15  slat    68701 non-null  float64
 16  slon    68701 non-null  float64
 17  elat    68701 non-null  float64
 18  elon    68701 non-null  float64
 19  len     68701 non-null  float64
 20  wid     68701 non-null  int64  
 21  ns      68701 non-null  int64  
 22  sn      68701 non-null  int64  
 23  sg      68701 non-null  int64  
 24  f1      68701 non-null  int64  
 25  f2      68701 non-null  int64  
 26  f3      68701 non-null  int64  
 27  f4      68701 non-null  int64  
 28  fc      68701 non-null  int64  
dtypes: float64(7), int64(19), object(3)
memory usage: 15.2+ MB
In [19]:
# Scanning through, we focus on second to last row, showing 3 data types: float, int, and objects
# that tells you how much of each datatype there is , mostly int64 datatype i.e., integer or whole numbers. 
# Then float or decimal and finally objects which is words

# Any missing data? or non-nulls?,
# Not Surprisingly, we learned that there are no empty rows
    # how do we know? Because all the columns say '68701 non-null' , which is the same number of rows as the entire dataframe
    # there are some caveats to what we define as 'missing' but here we just mean that the row value is blank
In [20]:
# let's read in the data again this time let's tell pandas to make the row labels the date

tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv", parse_dates= ['date'])
In [21]:
tor_data
Out[21]:
om yr mo dy date time tz st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
Loading... (need help?)
In [22]:
tor_data.info() 

# check out the date- columns data type now
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68701 entries, 0 to 68700
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   om      68701 non-null  int64         
 1   yr      68701 non-null  int64         
 2   mo      68701 non-null  int64         
 3   dy      68701 non-null  int64         
 4   date    68701 non-null  datetime64[ns]
 5   time    68701 non-null  object        
 6   tz      68701 non-null  int64         
 7   st      68701 non-null  object        
 8   stf     68701 non-null  int64         
 9   stn     68701 non-null  int64         
 10  mag     68701 non-null  int64         
 11  inj     68701 non-null  int64         
 12  fat     68701 non-null  int64         
 13  loss    68701 non-null  float64       
 14  closs   68701 non-null  float64       
 15  slat    68701 non-null  float64       
 16  slon    68701 non-null  float64       
 17  elat    68701 non-null  float64       
 18  elon    68701 non-null  float64       
 19  len     68701 non-null  float64       
 20  wid     68701 non-null  int64         
 21  ns      68701 non-null  int64         
 22  sn      68701 non-null  int64         
 23  sg      68701 non-null  int64         
 24  f1      68701 non-null  int64         
 25  f2      68701 non-null  int64         
 26  f3      68701 non-null  int64         
 27  f4      68701 non-null  int64         
 28  fc      68701 non-null  int64         
dtypes: datetime64[ns](1), float64(7), int64(19), object(2)
memory usage: 15.2+ MB
In [23]:
tor_data['date'].dtypes

#dtype('<M8[ns]') is equivalent to a datetime64 data type in pandas, which is used for date and time data
Out[23]:
dtype('<M8[ns]')
In [24]:
tor_data.set_index('date') # this shows us the result of the data when it is indexed by date
Out[24]:
om yr mo dy time tz st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)
In [25]:
# # Setting the date as the index is helpful for a time series type anaylsis 
# BUT we did not actually yet change the index to be the date
    # NO! this is only a view of the data, the dataframe index has not been modified to use the date column

# if we wanted to make a plot of tornadoes over time we would get an error because we did not change the index of the data to date
# curious to see the error? try this to see an error when tyring to plot the data : tor_data.plot()
In [26]:
# We need to set a special parameter to make sure we actually change the data 
# This is 'inplace' = True parameter to ensure the index is set to date on the tor_data

tor_data.set_index('date', inplace=True) 
In [27]:
# want to see all the columns without using itables?
# Set display options

pd.set_option('display.max_columns', None) # None means show all the columns, but you can use a number to show only that number of columns
In [28]:
# convert time to a date time object , then extract the hour

tor_data["hour"] = pd.to_datetime(tor_data["time"], format='%H:%M:%S').dt.hour

# convert time to time objects to store just the time without the date

tor_data["time"]=pd.to_datetime(tor_data['time'], format= '%H:%M:%S').dt.time



tor_data.head()
Out[28]:
om yr mo dy time tz st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc hour
date
Loading... (need help?)

5. Process and clean the data

Manipulating columns

  • Move columns
    • syntax: columns.insert(index_value, "column_name")
  • Remove columns
    • syntax: columns.remove("column_name")
In [29]:
# Earlier we created a column with the hour for the tornado event but it was placed in the last column
# How would we move a column?

# print columns to see the current names and order 
print(tor_data.columns)

# lets change this to a list

tcolumns = list(tor_data.columns)

print(tcolumns) # what is neat about the fact that we made it into a list?

# now we can handle our list of column names like any list of data

tcolumns.insert(0, "blank") # add names

print(tcolumns)

tcolumns.remove("blank") # remove names

print(tcolumns)

tcolumns.remove("hour") # remove 

tcolumns.insert(6, 'hour') # add name to specific position

tor_data = tor_data[tcolumns] # reorder the dataframe accordin gto the new list of column names
 
tor_data.head()
Index(['om', 'yr', 'mo', 'dy', 'time', 'tz', 'st', 'stf', 'stn', 'mag', 'inj',
       'fat', 'loss', 'closs', 'slat', 'slon', 'elat', 'elon', 'len', 'wid',
       'ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc', 'hour'],
      dtype='object')
['om', 'yr', 'mo', 'dy', 'time', 'tz', 'st', 'stf', 'stn', 'mag', 'inj', 'fat', 'loss', 'closs', 'slat', 'slon', 'elat', 'elon', 'len', 'wid', 'ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc', 'hour']
['blank', 'om', 'yr', 'mo', 'dy', 'time', 'tz', 'st', 'stf', 'stn', 'mag', 'inj', 'fat', 'loss', 'closs', 'slat', 'slon', 'elat', 'elon', 'len', 'wid', 'ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc', 'hour']
['om', 'yr', 'mo', 'dy', 'time', 'tz', 'st', 'stf', 'stn', 'mag', 'inj', 'fat', 'loss', 'closs', 'slat', 'slon', 'elat', 'elon', 'len', 'wid', 'ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc', 'hour']
Out[29]:
om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)
In [30]:
# another approach to move columns, if you just have one to move it works well
if "hour" in tor_data.columns:
    hour_data = tor_data.pop("hour")

    tor_data.insert(6, "hour", hour_data)
tor_data.columns
Out[30]:
Index(['om', 'yr', 'mo', 'dy', 'time', 'tz', 'hour', 'st', 'stf', 'stn', 'mag',
       'inj', 'fat', 'loss', 'closs', 'slat', 'slon', 'elat', 'elon', 'len',
       'wid', 'ns', 'sn', 'sg', 'f1', 'f2', 'f3', 'f4', 'fc'],
      dtype='object')

Check for missing dats

In [31]:
# see if we have missing data

tor_data.isnull().sum() # for more on how this works , do this in two steps i.e, tor_data.isnull(), and then print that alone i.e, it is summing true (null=true) and falses for each columns
Out[31]:
0
Loading... (need help?)
In [32]:
# why do we care about missing data?
    # it is problem if we want to make fair comparisons
    # it complicates the meaning of the results

# Why care about data type?
    # Pandas has special methods that only apply to certain types of data i.e., dates , times

check duplicates

  • Many datasets contain duplicate data entires, possibly due to error
  • May be an issue if you are performing computations assuming each record is unique
  • syntax: df.duplicated()
    • use case: identify records that have been encountered prevoiusly and mark them with True
      • filter and isolate these duplciates, if needed ### drop duplicates
  • Often we need to remove duplicate entries from the data before doing any summary or complex analyses
  • syntax: df.drop_duplicates()
    • use case: dropping records that are duplicated as defined by all or some specific columns
In [33]:
# example of identifying duplciates, using duplicates .duplicates()
# Seed for reproducibility
np.random.seed(0)

# Generate sample data
data = {
    'User_ID': np.random.randint(1, 100, 100), # random.randint selects random integers from 1 and 99
    'Account_Balance': np.random.uniform(1000.0, 5000.0, 100) # random.uniform will create a random selection of floats between values 1000.0 and 5000.0
}

# Create DataFrame
df = pd.DataFrame(data)

# Intentionally add duplicate rows
df = pd.concat([df, df.iloc[:10]], ignore_index=True)

print("\n")

print("first 10 rows of data:\n", df.head(10))

len(df)

print("\n")

print("last 10 rows of data:\n", df.tail(10))


# access the duplicated rows, filter the dataframe for duplicated rows and save them to a new dataframe
dup_values = df[df.duplicated()]

print("\n")

print("only the duplicated values:\n", dup_values)

# what if we want to count the number of times a record is duplicated?

all_dups = df[df.duplicated(keep=False)]

# count the number of duplicates
count_dups =all_dups.value_counts()

print("\n")

print("counts for the duplicated values:\n", count_dups)

first 10 rows of data:
    User_ID  Account_Balance
0       45      4798.284214
1       48      3650.107468
2       65      1054.286542
3       68      3491.384382
4       68      3694.638523
5       10      4887.780010
6       84      4512.773885
7       22      3038.497507
8       37      1222.858775
9       88      2804.636858


last 10 rows of data:
      User_ID  Account_Balance
100       45      4798.284214
101       48      3650.107468
102       65      1054.286542
103       68      3491.384382
104       68      3694.638523
105       10      4887.780010
106       84      4512.773885
107       22      3038.497507
108       37      1222.858775
109       88      2804.636858


only the duplicated values:
      User_ID  Account_Balance
100       45      4798.284214
101       48      3650.107468
102       65      1054.286542
103       68      3491.384382
104       68      3694.638523
105       10      4887.780010
106       84      4512.773885
107       22      3038.497507
108       37      1222.858775
109       88      2804.636858


counts for the duplicated values:
 User_ID  Account_Balance
10       4887.780010        2
22       3038.497507        2
37       1222.858775        2
45       4798.284214        2
48       3650.107468        2
65       1054.286542        2
68       3491.384382        2
         3694.638523        2
84       4512.773885        2
88       2804.636858        2
Name: count, dtype: int64
In [34]:
# Access the duplicate records, default behavior is to keep the first duplicate and mark all other occurences as True
is_tdups = tor_data.duplicated(keep="first")

# print the duplicated record
show(tor_data[is_tdups]) 

# setting keep= False marks all duplciates as True, resulting in all of them being returned
dup_tor_data= tor_data[tor_data.duplicated(keep=False)]

print("\n")

print("Duplicated records in the tornado dataset:\n",dup_tor_data[["om",'yr','dy', 'time', 'slat','slon']]) # showing reduced columns for clarity

print("\n")

# This will count unique rows in dup_tor_data considering all columns
count_tor_dups =dup_tor_data.value_counts()

print("counts for the duplicated values:\n", count_tor_dups)


print("\n")

# Counting duplicates for a selection of columns
count_tor_dups =dup_tor_data[["om",'yr','dy', 'time', 'slat','slon']].value_counts()

print("counts for the duplicated values:\n", count_tor_dups)
om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)

Duplicated records in the tornado dataset:
                 om    yr  dy      time     slat     slon
date                                                    
2015-06-20  610626  2015  20  18:32:00  40.8123 -91.2844
2015-06-20  610626  2015  20  18:32:00  40.8123 -91.2844


counts for the duplicated values:
 om      yr    mo  dy  time      tz  hour  st  stf  stn  mag  inj  fat  loss  closs  slat     slon      elat     elon      len   wid  ns  sn  sg  f1  f2  f3  f4  fc
610626  2015  6   20  18:32:00  3   18    IA  19   0    1    0    0    0.3   0.0    40.8123  -91.2844  40.7783  -91.2494  2.92  50   1   1   1   57  0   0   0   0     2
Name: count, dtype: int64


counts for the duplicated values:
 om      yr    dy  time      slat     slon    
610626  2015  20  18:32:00  40.8123  -91.2844    2
Name: count, dtype: int64
In [35]:
# Drop the duplicate records, save to a new dataframe that we specify as the 'cleaned' tornado dataframe for clarity
cleaned_tor_data=tor_data.drop_duplicates()

# print row count before dropping records
print(" Number of rows before dropping duplicates", len(tor_data)) # notice the original data still has the duplciate record


print(" Number of rows before after duplicates", len(cleaned_tor_data))

# Access the duplicate records, default behavior is to keep the first duplicate and mark all other occurences as True
is_tdups = cleaned_tor_data.duplicated(keep="first")


print("New cleaned dataframe duplicates:\n")

# Check if we still have duplicated records
show(cleaned_tor_data[is_tdups])


# Repeat for the original data to confirm that we have not modified the original 

    # Access the duplicate records, default behavior is to keep the first duplicate and mark all other occurences as True
is_org_tdups = tor_data.duplicated(keep="first")

# Check if we still have duplicated records

print("\n")

print("original dataframe duplicates:\n")
show(tor_data[is_org_tdups])
 Number of rows before dropping duplicates 68701
 Number of rows before after duplicates 68700
New cleaned dataframe duplicates:

om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)

original dataframe duplicates:

om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)

6. Summarize the Data

  • Foundation of Data Analysis: Statistics are mathmetical tools used to transform complex data into an understandable and actionable form
  • Interpretation of Large Datasets: Raw data, often challenging to interpret, is condensed by statistics into meaningful summary measures
  • Simplification and Insight: These summary measures are key to be able to have insight into the data's central tendencies, variability, distribution, and interrelationships between datapoints, thereby simplifying what might otherwise be an incomprehensible table of numbers

Benefits of Statistical Summaries

These summary measures help us:

  • Understand the Data: Quickly get a sense of the data's main characteristics, such as typical values (mean, median), variability (standard deviation, IQR), and overall distribution (min, max)
  • Make Comparisons: Easily compare different datasets or subgroups within a dataset based upon these summary measure
  • Identify Patterns: Recognize patterns, trends, and anomalies, which are essential for hypothesis testing, decision making, and predictive analytics
  • Communicate Findings: Effectively communicate the data's story to others in a clear and concise manner

syntax: df.describe()

  • use-case: helps in quickly understanding the range, central value, and spread of each variable in your dataset, which is crucial for initial data assessment and lays the groundwork for further analyses
In [36]:
# .describe() gives us a neat statiscal table summary of the numerical data

tor_data.describe()
Out[36]:
om yr mo dy tz hour stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
Loading... (need help?)
In [72]:
# need to see summary stats for specific columns?

    # add double brackets and list your columns in quotes,

    # dataframe_name[['column1', 'column2']].describe()

tor_data[["mag", "len","time", "hour"]].describe()
Out[72]:
mag len hour
Loading... (need help?)
In [38]:
# so just using the .describe() method we can access a bunch of summary statistics
# for example, now we can see
    # the average length a tornado travels across the entire dataset is 3.48 miles!
    # the average time tornadoes occur is ~15:00 central time. ~3pm in 12 hour time format
In [39]:
# But wait, these is issue with our data!...
    # in the magnitude 'mag' column we have a -9 value. What is that?...
    # From the documentation we learn that -9 is used as a fill value, denoting missing data
    # So, we decide that we are going to drop any rows with -9 fill values in the 'mag' column

# filter to relevant columns, excluding rows with fill values in magnitude column
filtered_tor_data = tor_data[tor_data['mag'] !=-9].copy() # use copy to ensure the new dataframe is treated as such
In [40]:
len(filtered_tor_data)
Out[40]:
67945
In [73]:
filtered_tor_data[["mag", "len","time", "hour"]].describe()
Out[73]:
mag len hour
Loading... (need help?)

Boxplots

  • Provide a visual sumary of the distribution of the data
  • syntax: df.boxplot(column=["name1","name2", "name3"])

    • Median (Central Line within the boxes): Shows the data's central value
    • Interquartile Range (IQR, the ends of Boxes you see in the plot): The middle 50% of the data, providing a sense of the data's spread
    • Whiskers: Extend from the box to the highest and lowest values within 1.5 times the IQR from the quartiles, offering a view of the range
    • Outliers: Are any points (black outlined circles) found outside the whiskers, indicating data points that differ significantly from the rest of the data
  • use-case: identifying outliers, understanding the variability of the data, and comparing distributions across different variables or groups
In [69]:
# We can visualize the spread of the data in boxplot, which shows the min, max, median and IQR
filtered_tor_data.boxplot(column=["mag", "len", "hour"])
Out[69]:
<Axes: >
In [43]:
#  Generating statistics and using tools like .describe() and boxplots are about making data understandable and actionable
#  These summaries of the raw data become the basis for making meaningful insights, guiding further analysis, hypothesis testing, and decision-making
In [70]:
# with the help of matplotlib which pandas plots are built on we can plot all three columns separately 
# this makes the data more understanable by not sharing the same y-axis, whose values vary widely across the different columns

import matplotlib.pyplot as plt


# Create a figure and a set of subplots
    # key to generating separate plots are the fig and axes objects
    #  fig represents the entire figure or plot area, and axes is an array of subplots within the figure



fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))    # essentially makes 3 separate empty plots for use to plot our data

# Plot each column on its own subplot
filtered_tor_data.boxplot(column='hour', ax=axes[0])
filtered_tor_data.boxplot(column='len', ax=axes[1])
filtered_tor_data.boxplot(column='mag', ax=axes[2])

# Display the plot
plt.tight_layout() # adjusts the layout to prevent plot elements from overlapping, making it easier to read the plots
plt.show()
In [ ]:
# After generating summary measures of the raw data from ~68,000 tornadoes from 1950-2022 we see the following:
    # typically tornadoes occur around 15:00 central time(3pm), this is actually actionable information in these sense we can prepare better 
    # Historically, most tornadoes have a magnitude of 1 or lower
    # There is great variability in the distance or length traveled by tornadoes but the summary table shows the average distance is ~3.5 miles

# Effectivaly interpreting boxplots:
    # Position of Outliers: the position and arrangement of outliers can tell you a lot about the less frequent occurrences within your dataset
    # for example, in the context of this tornado data:
    # Outliers near the upper end of the len (length) boxplot indicate unusually long paths traveled by some tornadoes, which are rare but possible events
    # Outliers in the hour boxplot in the early hours indicate rare but possible early morning tornado occurrences

#Position of the Box:

    # The position of the box itself gives a visual representation of where the bulk of the data lies, particularly the center line, representing the median value 
    # Median line being close to the 75th percentile edge of the boxplot for mag suggests that the majority of tornadoes have a magnitude near 1 
    # Median line being near the 25th percentile edge in the length boxplot implies most tornadoes travel a relatively short distance 
In [44]:
from IPython.core.getipython import get_ipython

class StopExecution(Exception):
    def _render_traceback_(self):
        pass

get_ipython().set_custom_exc((StopExecution,), lambda etype, value, tb: None)

8. Access specific subsets of data : Filtering

In [45]:
# only interested in a certain year?
# filter the data by that year
tor_data[(tor_data['yr']==2011)]
Out[45]:
om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)

9. Sorting the data

In [46]:
#  Say we are interested in injuries in 2011, a year known for having the largest tornado outbreaks i.e., multiple tornado events on the same day
#  So, maybe we should sort by the highest injuries within that known year for tornado outbreaks? 

# filter the data by that year, and then sort by injuries 
tor_data[(tor_data['yr']==2011)].sort_values(by='inj', ascending=False)
Out[46]:
om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)
In [47]:
# Say within 2011, we are only interested in a certain subset of that year?
# Then filter the data even further. 
# Looking at 2011 tornado events AND those magnitude 2 and above
tor_data[(tor_data['yr']==2011) & (tor_data['mag']>=2)].sort_values(by='inj', ascending=False) # notice how we are chaining together the commands? Filter by year AND mag >=2, then sort injuries highest to lowest
Out[47]:
om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)

10. Selecting specific rows with .loc()

  • use to select rows and columns by their name
  • structure
    • DataFrame.loc[rows, columns]
    • rows can be a single label, a list of labels, or a boolean (a condition that the rows must meet)
    • columns can be single , list , or slice object like this df.loc[:, 'column1':column3']
In [48]:
# if we are very interested in that specific date
tor_data.loc[(tor_data.index == '2011-04-27')] # note if the date was not the index we would use: df.loc[(df['date'] == '2022-01-01')]
Out[48]:
om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)
In [49]:
# Select only a specific row label 
# slice only the first few columns up to mag
tor_data.loc[tor_data.index == '2011-04-27',  :'mag'] # ':' means select all , combined with mag means select all up to and including mag 
Out[49]:
om yr mo dy time tz hour st stf stn mag
date
Loading... (need help?)
In [50]:
# specify the names for only those columns you need
tor_data.loc[tor_data.index == '2011-04-27',  ['st','mag','inj','fat', 'time']] # notice how we are using brackets around the column name list
Out[50]:
st mag inj fat time
date
Loading... (need help?)
In [51]:
# if we are very interested in that specific date and the tornadoes occuring with magnitude 2 and up
tor_data.loc[(tor_data.index == '2011-04-27') & (tor_data['mag'] >= 2)] # note if the date was not the index we would use: df.loc[(df['date'] == '2022-01-01') & (df['mag'] >= 2)]
Out[51]:
om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)
In [52]:
# up until now we were simply filtering the data down to a subset of interest
# and then viewing that subset
# usually what you want is to save that subset of interest into a variable 
# so you can (a) not have to repeat the filtering each time you want to see this subset, (b) focus on it separately from the rest of the data

# if we are very interested in that specific date and the tornadoes occuring with magnitude 2 and up
tor_data_2011_mag2= tor_data.loc[(tor_data.index == '2011-04-27') & (tor_data['mag'] >= 2)] # note if the date was not the index we would use: df.loc[(df['date'] == '2022-01-01') & (df['mag'] >= 2)]
In [53]:
tor_data_2011_mag2.head().sort_values(by="inj", ascending=False) # What is wrong with this?
Out[53]:
om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)
In [54]:
# the order in which you do things matters...
# The outcome is correct if we sort first, then take a look at the data
tor_data_2011_mag2.sort_values(by="inj", ascending=False).head() # notice the first row is different from above?
Out[54]:
om yr mo dy time tz hour st stf stn mag inj fat loss closs slat slon elat elon len wid ns sn sg f1 f2 f3 f4 fc
date
Loading... (need help?)
In [55]:
# 1500 injuries occurred associated with this tornado event on April 04 2011 in Alabama! 

11. Basic plotting in Pandas

In [56]:
# Want to understand the data by year?
# So, group your data by year 
# How? using groupby

# 1. groupby is a method that can group by a variable column and within that group count by another column
# 2. here we group by year "yr" column
# 3. within those groups select the "om" column
# 4. Count up the distinct values in the omm column
# 5. call the plot method that is built-into pandas

#            (1)  (2)   (3)    (4)     (5)
tor_data.groupby("yr")["om"].count().plot()
Out[56]:
<Axes: xlabel='yr'>
In [57]:
# what did we do here?
# we plotted the data by the year column over time using the builtin pandas plotting .plot()
# it should choose the most logical plot for the data 
In [58]:
# what we are actually plotting is a line plot

tor_data.groupby("yr")["om"].count().plot(kind='line')
Out[58]:
<Axes: xlabel='yr'>
In [59]:
# .plot() has different parameters, one is kind, you can set this to 'line' or something else... 
In [60]:
# what if we had a special criteria ?
# only show the data for the years with magnitude greater than or equal to 3

tor_mag_3_data =tor_data[(tor_data["mag"]>=3)] 
In [61]:
tor_mag_3_data.groupby("yr")["om"].count().plot(kind='line')
Out[61]:
<Axes: xlabel='yr'>
In [62]:
# plots the mag 3 and up data
In [63]:
tor_mag_3_data.groupby(["st"])["om"].count().sort_values(ascending=False).plot(kind='bar')
Out[63]:
<Axes: xlabel='st'>
In [64]:
# bar plot for the magnitude 3 and up tornadoes grouped by state
# kind is the bar plot
# Options include 'line' (default), 'bar', 'barh', 'hist', 'box', 'kde', 'density', 'area', 'pie', 'scatter', 'hexbin'
# legend: Whether to show the legend. Defaults to True for line plots and False otherwise.
In [65]:
# Always good to cross check your plots in another way
# here we start with the dataset for magnitude 3 and up
# Then we are filtering the tor_mag_3_data for those rows that have state as 'TX' , i.e., texas
 
tor_mag_3_data[tor_mag_3_data["st"]=="TX"].count()
Out[65]:
0
Loading... (need help?)
In [66]:
# bar plot results for mag 3 and up:
# Very clear that Texas has the most tornadoes among those with mag 3 and up
# Interesting...
#   But what about overall torndaoes?  Texas still at the top?
In [67]:
# add additional parameters to add labels, change figure size and color etc.
tor_data.groupby("st")["om"].count().sort_values(ascending=False).plot(
    kind='bar', title="Tornado Count by State: 1950-2022", rot=45, legend=True, figsize=(14,8), color='r', grid=False, xlabel="state", ylabel="Tornado counts", fontsize=14)

# Overall, Texas still has the most tornadoes
Out[67]:
<Axes: title={'center': 'Tornado Count by State: 1950-2022'}, xlabel='state', ylabel='Tornado counts'>
In [68]:
# A lot that can be customized even using only this built-in pandas plotting function .plot()!
    # pandas is using matplotlib 'under the hood' to generate these plots
# But even more can be done by using matplotlib library directly 
# Or with another libray Seaborn
#  both are specialized for creating very visually sophisticated plots
#... until next time!

links

social