Jupyter Notebooks¶
- This is a Jupyter notebook file (the file extension is .ipynb for a Python notebook, versus .py for a standard Python script)
- It is a format used by many data scientists and researchers for analysis and visualization tasks
1. New to Jupyter notebooks?:¶
- Jupyter Notebooks are useful because (a) they are easier to work with and understand than single Python script files (.py) and
- (b) they enable you to break problems down into steps and quickly see results while building and iterating on your code in blocks called 'cells'
- Key point: for data analysis, a Jupyter notebook is a clear and efficient way to conduct analysis and communicate the findings
Two types of Jupyter notebook cells:¶
(1) Markdown cells (Cell->Cell Type->Markdown):
- Descriptions
- text (this entire cell that you are looking at is a Markdown cell where text exists)
- images
- Key point: Markdown cells are a place for notes and descriptions
(2) Code cells (Cell->Cell Type->Code): actual code
- A code cell stores and runs your code
- Any non-code text must be formatted as a comment, i.e. it must begin with "#"
- Key point: code cells are the workhorse; they execute your program and output results
Running Jupyter notebook cells¶
(1) To run every line of code in the entire notebook
- Cell->Run All
(2) To run only a single code cell
- Click on the cell and hit Shift-Enter
In [1]:
# This is a code cell
# This is the place where your code is written
# Any lines beginning with the "#" are not code but descriptions to help you and others know what is being done
# hit shift-Enter together to run this cell (Note that we don't expect any result to be generated from this annotation text)
In [2]:
# This is another code cell
# Here we have some code written. It is a 'function'; let's not worry about the details of functions at the moment
# it is just for seeing outputs
# in Jupyter notebooks, code shows its output below the code cell that was run
def perseverance(text):
    # append the second half of the quote to the text that was passed in
    return f"{text} so long as you do not stop!"
print(perseverance("It does not matter how slowly you go..."))
# hit shift-Enter together to run this cell
# below this example code cell an output will appear
In [3]:
# another example
print("this will print a message to the output")
# hit shift-Enter together to run this cell
# print("some text") messages help you to a) show the result of the code and b) understand what went wrong
Pandas¶
- Pandas is a powerful data manipulation library in Python
- It is widely used in data science, machine learning, scientific computing, and many other data-intensive fields
2. New to pandas?¶
- Pandas is useful because (a) it provides flexible data structures for efficient manipulation of structured data and
- (b) it has rich functionality for data cleaning, transformation, and analysis
- Key point: for data analysis, pandas provides a high-performance, easy-to-use data structure (DataFrame) and data analysis tools.
Two main data structures in pandas:¶
(1) Series :
- A one-dimensional labeled array capable of holding any data type
- It is similar to a column in a spreadsheet, a field in a database, or a vector in a mathematical matrix
- Key point: Series is the primary building block of pandas
(2) DataFrame :
- A two-dimensional labeled data structure with columns potentially of different types.
- It is similar to a spreadsheet or SQL table
- Key point: DataFrame is the primary pandas data structure for data manipulation and analysis
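The relationship between the two structures can be sketched with a few lines of code. The names and values below (temps, weather, the temperatures) are made up purely for illustration:

```python
import pandas as pd

# A Series: one-dimensional, with an index label for every value
temps = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"], name="temp_c")

# A DataFrame: two-dimensional, and its columns may hold different types
weather = pd.DataFrame(
    {"temp_c": [21.5, 19.0, 23.1], "sky": ["clear", "rain", "clear"]},
    index=["mon", "tue", "wed"],
)

# Selecting a single column of a DataFrame gives back a Series
column = weather["temp_c"]
print(type(column).__name__)
print(column.equals(temps))
```

In other words, a DataFrame can be thought of as a set of Series that share the same row labels.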
Working with pandas¶
(1) To import the pandas library
import pandas as pd
(2) To create a DataFrame
df = pd.DataFrame(data)
(3) To read a CSV file into a DataFrame
df = pd.read_csv('file.csv')
(4) To get the first 5 rows of the DataFrame
df.head()
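Here is a minimal, self-contained sketch of steps (1) to (4). The city names and counts are invented, and io.StringIO stands in for a CSV file on disk so the example runs without any external file:

```python
import io
import pandas as pd

# (2) Build a DataFrame from a dictionary: the keys become column names
data = {"city": ["Tulsa", "Omaha", "Dallas"], "count": [3, 1, 2]}
df = pd.DataFrame(data)

# (3) read_csv accepts a file path or a file-like object
csv_text = "city,count\nTulsa,3\nOmaha,1\nDallas,2\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# (4) head() returns the first 5 rows by default; pass a number to get fewer
print(df_from_csv.head(2))
```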
3. Load your packages¶
In [4]:
# if you want to see which packages you already have installed from this notebook, uncomment the specified line below
# The '!' character below is used to run a command checking for your package manager conda directly in the notebook
# conda is the package manager recommended
# ! conda list # <--- (highlight this row , press ctrl + / to remove the first '#' character from this line - otherwise just use backspace)
In [5]:
# I recommend using conda to install itables from your terminal
# why in the terminal and not in a code cell? Because conda asks you to answer yes or no before it runs the installation,
# and that prompt may or may not show up in your environment (e.g., your Visual Studio Code IDE may not prompt you with the y or n choice)
# Instead, in the terminal type: conda install itables, and hit enter
In [6]:
import pandas as pd # should already be pre-installed with conda
import numpy as np # should also already be pre-installed with conda
from itables import init_notebook_mode, show #to fully take advantage of the ability to display our data as a table format, let's use the itables library
%matplotlib inline
init_notebook_mode(all_interactive=True) #After this, any Pandas DataFrame, or Series, is displayed as interactive table, which lets you explore, filter or sort your data
In [7]:
# the following code cell will give you a nice hover highlighting on your tables when used with itables library (for details on itables see the itables lesson on my Git)
In [8]:
%%html
<style>
.dataTables_wrapper tbody tr:hover {
background-color: #6495ED; /* Cornflower Blue */
}
</style>
<!-- #1E3A8A (Dark Blue) -->
<!-- #0D9488 (Teal) -->
<!-- #065F46 (Dark Green) -->
<!-- #4C1D95 (Dark Purple) -->
<!-- #991B1B (Dark Red) -->
<!-- #374151 (Dark Gray) -->
<!-- #B45309 (Deep Orange) -->
<!-- #164E63 (Dark Cyan) -->
<!-- #4A2C2A (Dark Brown) -->
<!-- #831843 (Dark Magenta) -->
<!-- Suggested Light Colors for Light Backgrounds -->
<!-- #AED9E0 (Light Blue) -->
<!-- #A7F3D0 (Light Teal) -->
<!-- #D1FAE5 (Light Green) -->
<!-- #DDD6FE (Light Purple) -->
<!-- #FECACA (Light Red) -->
<!-- #E5E7EB (Light Gray) -->
<!-- #FFEDD5 (Light Orange) -->
<!-- #B2F5EA (Light Cyan) -->
<!-- #FED7AA (Light Brown) -->
<!-- #FBCFE8 (Light Magenta) -->
Get into the habit of reading data documentation or Metadata¶
- Why?: Some fields are not understandable purely from their names, some data has caveats, and fields may have fill values or other special considerations
Data source overview¶
- Name : U.S. TORNADOES* (1950-2022)
- Source: NOAA NWS Storm Prediction Center Severe weather database
- Geographical distribution: US and territories (Puerto Rico, Virgin Islands)
- Description: comma separated value (.csv) files for tornadoes as compiled from NWS Storm Data. Tornado reports exist back to 1950. Tornado data are provided in raw csv file format. Actual tornado tracks only (not including individual state segments) are provided in the "Actual_tornadoes.csv" file.
- File name and size: 1950-2022_actual_tornadoes.csv (7.3 MB)
- Data source link: https://www.spc.noaa.gov/gis/svrgis/
Data Structure & Characteristics:¶
- Dataset structure: DataFrame created from a CSV file loaded into pandas, giving tabular data with labeled rows and columns; each column is a one-dimensional array-like list of values
- Dimensions: Two-dimensional; each row is a single tornado event, and each column is a different attribute of the event
- Coordinates: Rows (observations) are indexed by integers starting from 0 by default, and columns (variables) are identified by their column names
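These structural properties can be checked directly on any DataFrame. The sketch below uses a hypothetical three-row sample that borrows a few of the dataset's column names; the values are invented, not taken from the real file:

```python
import pandas as pd

# Hypothetical three-row sample using a few of the dataset's column names
sample = pd.DataFrame({"yr": [1950, 1950, 1951], "st": ["MO", "IL", "TX"], "f": [3, 3, 2]})

n_rows, n_cols = sample.shape      # dimensions as (rows, columns)
row_labels = list(sample.index)    # default integer index starting at 0
col_labels = list(sample.columns)  # column names

print(n_rows, n_cols)
print(row_labels)
print(col_labels)
```

The same three attributes (shape, index, columns) work on the full tornado DataFrame once it is loaded.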
Most relevant fields:¶
- om: tornado number. count of tornadoes through the year
- yr: year
- mo: Month, 1-12
- dy: Day, 1-31
- date: Date in yyyy-mm-dd
- time: Time in HH:MM:SS
- tz: Time zone. Note: all times are 3 = CST, except for '?' = unknown and 9 = GMT.
- st: state two-letter abbreviation
- stf: State FIPS number. Federal Information Processing System (FIPS) Codes which uniquely identify States and Counties.
- f: F-scale or EF-scale after 2007: values are either 0, 1, 2, 3, 4, 5 or -9 (unknown)
- inj: Injuries note: when summing for state totals use sn=1, not sg=1
- fat: Fatalities note: when summing for state totals use sn=1, not sg=1
- loss: Estimated property loss - From 1996 reported as tornado property damage in millions of dollars (see documentation for additional details)
- slat: Starting latitude of tornado
- slon: Starting longitude of tornado
- elat: Ending latitude of tornado
- elon: Ending longitude of tornado
- len: Length of tornado track in miles
- wid: Width of tornado in yards
- note: the time and location of each row (observation) are defined by the 'date' and 'time' columns and by the starting and ending coordinates ('slat', 'slon', 'elat', 'elon').
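Two of the field conventions above lend themselves to a short illustration: the separate 'date' and 'time' strings, and the -9 fill value in 'f'. This is only a sketch on made-up rows. The new column names (datetime, f_clean) are my own, and time zone handling via 'tz' is left out:

```python
import pandas as pd

# Made-up rows, only to illustrate the field conventions described above
sample = pd.DataFrame({
    "date": ["1950-01-03", "2016-05-09"],
    "time": ["11:00:00", "16:35:00"],
    "f": [3, -9],
})

# Combine the separate date and time strings into a single datetime column
sample["datetime"] = pd.to_datetime(sample["date"] + " " + sample["time"])

# Turn the documented fill value -9 (unknown) into a missing value (NaN)
# so that it does not distort averages or counts by rating
sample["f_clean"] = sample["f"].mask(sample["f"] == -9)

print(sample[["datetime", "f_clean"]])
```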
4. Read in the data¶
In [9]:
# here we will load a historical US tornado dataset using pandas
# you can give it any name you want; it just has to follow Python naming conventions (search for 'snake case in Python' to learn more)
tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv")
print(tor_data.head())
# You can show the contents of this dataset as a table below by simply typing its name and running this code cell OR using print(data_set_name)
In [10]:
# Because we loaded the itables library, we can see it as an interactive table. You can click on the arrows by the column names to sort them...
tor_data.head() # note that you do not have to use show() to generate an interactive table unless you want to add additional customizations
Out[10]:
In [11]:
# To read any dataset into python you need to specify:
# 1. WHAT - name to save the data to
# 2. WHO - is going to do the work? i.e., which package is going to do the work?, here we rely on pandas 'pd' package or library
# 3. HOW - read_csv(), the specific method that pandas will use
# 4. WHERE - is the file? the file is specified by its name with relative path (i.e. '.\') or by its full path, including its extension, in this case .csv is the extension
# (1) (2) (3) (4) actual file name and its path is always in quotes
tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv") # Note: we put an 'r' before the opening quote (a raw string) because it lets us a) copy and paste paths directly from the file explorer, and mainly b) use '\' without causing an error in Python
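To see why the 'r' prefix matters, here is a small sketch. The path "C:\new_data" is hypothetical. The second half shows pathlib from the standard library as one possible alternative for building paths; it is a suggestion, not something this lesson relies on:

```python
from pathlib import Path

# In a normal string, "\n" is a single newline character;
# in a raw string it stays as two characters: a backslash and an "n"
normal = "C:\new_data"
raw = r"C:\new_data"
print(len(normal), len(raw))

# pathlib joins folder and file names with the right separator for the current operating system
csv_path = Path("Data") / "1950-2022_actual_tornadoes.csv"
print(csv_path.name)
print(csv_path.suffix)
```

pd.read_csv() accepts a Path object as well as a string, so a path built this way can be passed to it directly.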
In [12]:
# a good habit is to make a copy of the original data BEFORE making any changes
orginal_tor_data=tor_data.copy() # now when we want to compare or revert to original data we have a way to do that
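The reason .copy() is needed, rather than a plain assignment, can be shown with a tiny made-up DataFrame (the names df, alias, and backup are only for this illustration):

```python
import pandas as pd

df = pd.DataFrame({"f": [1, 2, 3]})

alias = df          # a second name for the same object
backup = df.copy()  # an independent copy of the data

df["f"] = df["f"] * 10  # modify the working DataFrame

print(alias["f"].tolist())   # follows the change, because alias and df are the same object
print(backup["f"].tolist())  # keeps the original values
```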
In [13]:
# use the itables package with the .show() method
# notice how we can customize the table
show(tor_data.head(), caption="Tornado Data 1950-2022", options={'hover': True}) # itables also allows you to add a title or caption to your table
# caption adds a title to the table; it appears below the table but is still useful for descriptions
# add a hover highlighting function
In [14]:
# column_filters="header" extension for itables, my personal favorite for exploring table data
show(tor_data, column_filters="header", layout={"topEnd": None}, maxBytes=0, options={'hover': True}) # Adds individual column filters and removes the single default search bar (which isn't that useful)