Lab 2: Summarizing Data and Basic Python Plotting
Caution
Please use this webpage as a set of general recommendations. The lab itself is located in Google Colab. Make sure to copy the notebook as you work through it.
Warning
Please make sure to submit the lab by October 10/11, 2022, depending on your section.
This lab will teach you how to import, export, and analyze data in Python. It covers the following concepts:
Working with data (importing and saving data with GeoPandas from Google Drive and from a URL)
Calculating measures of location and variation
Creating basic plots and charts in Matplotlib
Optional useful tutorials on lab topics:
Working with data#
Reading the data#
Rectangular data in Python is typically handled via Pandas. This includes files with the following extensions: .csv, .xls and .xlsx (Excel), .tab (tab delimited), .txt (text format), and many more.
import pandas as pd
df = pd.read_csv('path_to_file')
In Pandas, data is typically stored in two-dimensional DataFrame objects with (un)labeled rows and columns. Another way to think about Pandas DataFrames is by analogy with programmable spreadsheets. A one-dimensional array (any one variable) is called a Pandas Series.
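As a minimal sketch of the DataFrame/Series distinction, the example below reads a small CSV from an in-memory string instead of a file on disk. The column names and values are invented for illustration only.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """state,year,acres
CA,2019,120.5
CA,2020,310.0
FL,2020,45.2
"""

# read_csv accepts a path, a URL, or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))           # the whole table is a DataFrame
print(type(df['acres']))  # a single column is a Series
```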
Tip
When working with data in languages other than English, consider using the 8-bit Unicode Transformation Format, or UTF-8 for short. Most read and write functions in both Python and R packages provide an option to set the encoding when reading from or writing to disk.
# import data specifying the encoding
df = pd.read_csv('path_to_file', encoding='utf-8')
Note
Computers store information in binary, as sequences of 0s and 1s. Because text is made of individual characters, each character is represented by a string of bits. For example, the letter ‘A’ is represented as ‘01000001’ and the letter ‘a’ as ‘01100001’. A standardized encoding system translates between bytes and characters. One early example is the American Standard Code for Information Interchange (ASCII), which assigns a numeric code to each character: ‘A’ is 65 and ‘a’ is 97. The problem with ASCII is that it is a 7-bit code covering only 128 characters (essentially the unaccented Latin alphabet, digits, and punctuation), and even its 8-bit extensions are limited to 256, which is not enough to encode characters from other languages. Thus, Unicode and its UTF-8 encoding came into being. UTF-8 can encode all 1,112,064 valid Unicode code points (yes, that includes emoji). For example, the letter ‘A’ has the code point ‘U+0041’, which UTF-8 stores as the single byte 0x41, the same value as in ASCII.
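The bit patterns and byte counts described in this note can be checked directly in Python. The short sketch below uses only built-in functions; the accented letter and the emoji are arbitrary examples chosen for illustration.

```python
# Code point of a character and its 8-bit binary representation
print(ord('A'), format(ord('A'), '08b'))   # 65 01000001
print(ord('a'), format(ord('a'), '08b'))   # 97 01100001

# ASCII characters occupy a single byte in UTF-8
ascii_bytes = 'A'.encode('utf-8')
print(ascii_bytes, len(ascii_bytes))       # b'A' 1

# Characters outside ASCII need more than one byte
accented = 'é'.encode('utf-8')
emoji = '🔥'.encode('utf-8')
print(len(accented), len(emoji))           # 2 4
```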
Normally, the first thing to check after importing the data is whether it was imported correctly. This typically implies 1) checking the total number of variables (if we know how many features/variables the original file had) and 2) checking that each variable has the correct type. To check the number of records and variables in the dataset we can use the shape attribute, which returns a tuple such as (110, 22), where the first number denotes the number of records and the second denotes the number of variables.
print(df.shape)
You can check the data type of each variable after reading the file using the dtypes attribute of the data frame.
df.dtypes
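To make these two checks concrete, here is a small sketch on an invented table (the column names and values are assumptions, not the lab data):

```python
import pandas as pd

# Hypothetical table with 3 records and 3 variables
df = pd.DataFrame({
    'state': ['CA', 'CA', 'FL'],
    'year': [2019, 2020, 2020],
    'acres': [120.5, 310.0, 45.2],
})

# shape is an attribute (no parentheses) holding (records, variables)
n_records, n_variables = df.shape
print(n_records, n_variables)   # 3 3

# dtypes lists the type of each column
print(df.dtypes)
```

If a numeric column shows up with a text type here, that is usually a sign that something in the file (a stray symbol, a thousands separator) prevented it from being parsed as a number.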
Calling Variables#
Variables (features) can be accessed using either of the following conventions:
df.variable_name
df['variable_name']
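The two conventions return the same Series, but they are not fully interchangeable. The sketch below, using invented column names, shows one case where only the bracket form works: a variable name that contains a space.

```python
import pandas as pd

df = pd.DataFrame({'acres': [120.5, 310.0], 'fire name': ['A', 'B']})

# Both notations return the same Series
same = df.acres.equals(df['acres'])
print(same)  # True

# Dot notation cannot express a name containing a space; brackets can
names = df['fire name']
print(names.tolist())  # ['A', 'B']
```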
Filtering and subsetting the data#
We can subset the data either by label or index.
Label-based filtering.
# select only records for California
ca_df = df.loc[df.state=="CA",]
# select only records for CA and FL
ca_fl_df = df.loc[(df.state=="CA")|(df.state=="FL"),]
# if we only needed specific variables in the previous selections
ca_df2 = df.loc[df.state=="CA", ['first_var', 'second_var', 'third_var']]
Index-based filtering
# select first 100 rows
df.iloc[:100,]
# select first 100 rows and the columns at positions 2, 3 and 4 (zero-based; the slice end is excluded)
df.iloc[:100, 2:5]
To select all records for specific variables use double square brackets like so:
other_df = df[['first_var', 'second_var', 'third_var']]
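The selections above can be tried on a small invented table. This is only a sketch: the values are made up, and isin() is shown as one possible alternative to chaining conditions with |.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['CA', 'FL', 'CA', 'NY'],
    'first_var': [1, 2, 3, 4],
    'second_var': [10, 20, 30, 40],
})

# Label-based: boolean mask on rows, list of names on columns
ca_df = df.loc[df['state'] == 'CA', ['first_var', 'second_var']]

# isin() is a compact alternative to chaining conditions with |
ca_fl_df = df.loc[df['state'].isin(['CA', 'FL'])]

# Index-based: the end of a slice is excluded, so 1:3 selects positions 1 and 2
first_two = df.iloc[:2, 1:3]

print(ca_df.shape, ca_fl_df.shape, first_two.shape)  # (2, 2) (3, 3) (2, 2)
```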
Introducing Data for the Lab#
There are many wonderful resources online containing different types of data: Kaggle, Google, Data.gov, Census, OpenStreetMap and many more. Also, check out this wonderful GitHub repo. For this lab, we will be working with fire perimeter data from CalFire. The geodatabase dump can be downloaded from this page. Here is the description of the data from the CalFire website:
This is a multi-agency statewide database of fire history. For CAL FIRE, timber fires 10 acres or greater, brush fires 30 acres and greater, and grass fires 300 acres or greater are included. For the USFS, there is a 10 acre minimum for fires since 1950.
This dataset contains wildfire history, prescribed burns and other fuel modification projects.
This data can also be viewed interactively on CA.gov.
The file is available as an ArcGIS file geodatabase (file extension .gdb). Because some labels are not available when the file is read with GeoPandas, we will have to recode the variable ‘CAUSE’ manually, using the following values:
rec_vars = {
1: 'Lightning',
2: 'Equipment Use',
3: 'Smoking',
4: 'Campfire',
5: 'Debris',
6: 'Railroad',
7: 'Arson',
8: 'Playing with fire',
9: 'Miscellaneous',
10: 'Vehicle',
11: 'Powerline',
12: 'Firefighter Training',
13: 'Non-Firefighter Training',
14: 'Unknown'
}
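One possible way to apply such a dictionary is the Series method map(). The sketch below assumes that ‘CAUSE’ holds integer codes; it repeats only a subset of the lookup table so that it runs on its own, and the output column name CAUSE_LABEL is a hypothetical choice.

```python
import pandas as pd

# A subset of the lookup table above, repeated so the sketch runs on its own
rec_vars = {1: 'Lightning', 7: 'Arson', 14: 'Unknown'}

# Hypothetical records; the real data would come from the geodatabase
df = pd.DataFrame({'CAUSE': [1, 7, 14, 1]})

# map() replaces each code with its label; codes missing from the dictionary become NaN
df['CAUSE_LABEL'] = df['CAUSE'].map(rec_vars)

# Frequency of each cause, most common first
print(df['CAUSE_LABEL'].value_counts())
```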
Modifying variables#
Working with datetime#
Some variables denote date and time. These should normally be processed as ‘datetime’ variables, so we convert them:
df['my_date_variable'] = pd.to_datetime(df.my_date_variable)
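Once two columns are stored as datetimes, they can be subtracted to obtain a duration. The following is a sketch on invented dates; the column names start_date, end_date and duration_days are hypothetical and do not refer to the lab dataset.

```python
import pandas as pd

# Hypothetical start and end dates stored as text
df = pd.DataFrame({
    'start_date': ['2020-08-01', '2020-09-10'],
    'end_date': ['2020-08-05', '2020-09-12'],
})

df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

# Subtracting two datetime columns gives a timedelta; .dt.days converts it to whole days
df['duration_days'] = (df['end_date'] - df['start_date']).dt.days

# The .dt accessor also exposes components such as the year
df['year'] = df['start_date'].dt.year

print(df[['duration_days', 'year']])
```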
Calculating Measures of Location and Variation of Variables#
Normally we are interested in assessing specific variables. In Pandas and GeoPandas, variables can be accessed using either square brackets or dot notation. See the pseudocode below, which calculates the average of a variable.
df.variable_name.mean()
df['variable_name'].mean()
Pandas implements convenience functions such as mean(), median(), mode(), std() to name a few.
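As a short illustration of these functions, the sketch below applies them to a small invented Series. Note that std() in Pandas computes the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

acres = pd.Series([10, 20, 20, 30, 120])

# Measures of location
print(acres.mean())            # 40.0
print(acres.median())          # 20.0
print(acres.mode().tolist())   # [20]

# Measures of variation
print(acres.std())                  # sample standard deviation (ddof=1)
print(acres.max() - acres.min())    # range: 110
```

The single large value pulls the mean well above the median, which is a quick hint that the distribution is right-skewed.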
Visualizing Data#
Setting up canvas for plotting#
Matplotlib remains the default backend for many utility functions in Pandas. Therefore, it is important to know how to change figure attributes. Here is a typical workflow for modifying canvas size, axis labels and a title.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,10)) # figure size in inches
df.variable_name.plot(kind='hist', ax=ax) # set axis to be axis we modified above
ax.set_title('My Title', fontsize=16)
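ax.set_xlabel('My X Label', fontsize=12) # placeholder label text
ax.set_ylabel('My Y Label', fontsize=12) # placeholder label text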
Histograms#
We discussed in class that data distributions can be visualized via histograms. Pandas makes this process extremely easy:
df.variable_name.plot(kind='hist')
When a distribution is strongly right-skewed (for example, roughly exponential), we can put the y-axis on a log scale and plot density instead of counts to see more variation in the data:
df.variable_name.plot(kind='hist', logy=True, density=True)
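Putting the canvas setup and the histogram options together, a complete runnable sketch might look like the following. The data is synthetic (drawn from an exponential distribution), and the column name, bin count and labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic right-skewed data standing in for a real variable
rng = np.random.default_rng(0)
df = pd.DataFrame({'acres': rng.exponential(scale=100, size=1000)})

fig, ax = plt.subplots(figsize=(8, 5))  # figure size in inches

# bins controls the number of bars; logy puts the y-axis on a log scale
df['acres'].plot(kind='hist', bins=30, logy=True, density=True, ax=ax)

ax.set_title('Distribution of acres (synthetic data)', fontsize=16)
ax.set_xlabel('Acres')
ax.set_ylabel('Density (log scale)')
```

In a notebook the figure is rendered automatically at the end of the cell.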
Cartographic Mapping in GeoPandas#
In GeoPandas dataframes, geometric information is stored in a dedicated geometry column, and each geometry is displayed in well-known text (WKT) format.
There are three basic types of vector GIS data: [multi]point, [multi]line and [multi]polygon.
Creating maps in GeoPandas is easy and straightforward.
import geopandas as gpd
gdf = gpd.read_file('path_to_file')
gdf.plot()
For polygons, we can use GeoPandas to generate a choropleth map by passing the name of the variable we want to use for thematic mapping, like so:
gdf.plot('my_variable')
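For readers who want to try this without a data file, here is a sketch that builds a tiny GeoDataFrame by hand. The two square polygons and the attribute values are invented, and it assumes that the shapely package (a GeoPandas dependency) is available.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two hypothetical square polygons with an invented attribute
gdf = gpd.GeoDataFrame(
    {'my_variable': [10, 50]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
)

# Each geometry can be printed in WKT form
print(gdf.geometry.iloc[0].wkt)

# column= selects the variable used for coloring; legend=True adds a color bar
ax = gdf.plot(column='my_variable', legend=True)
ax.set_title('Choropleth of my_variable (synthetic polygons)')
```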
Lab Instructions#
Describe and characterize wildfires in the assigned CA county. The assignment is available in the Google Doc. Find your assignment by your PERM number. There are two students per county. It is up to you whether to collaborate on the lab, but each of you should submit an individual lab, and your code and plots should have different styling. There are still some counties that have not been assigned to anyone. If you want to work on a different county, feel free to put your PERM next to it on the sign-up sheet.
Open, create a copy and continue working in this Lab 2 Template Notebook!#
Tip
Please pay attention to what the question asks you. A few things to note:
Identify whether the question asks for one number (a summary statistic) or a series of numbers.
For questions that require a series of numbers or a time plot, identify:
which variable needs to appear inside the groupby() call and which variable needs to follow it.
For questions asking for ‘number’ use count() function (see examples in Lab 2 Google Colab notebook)
For questions asking for ‘average’ use mean() function (see examples in Lab 2 Google Colab notebook)
For questions asking for a ‘sum’ or ‘total’ use sum() function (see examples in Lab 2 Google Colab notebook)
‘Annual’ means that you need to aggregate your data (~ group your data) for every year.
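The groupby() pattern described in these tips can be sketched on a small invented table. The column names and values below are hypothetical; the point is only where each variable goes.

```python
import pandas as pd

# Hypothetical fire records: one row per fire
df = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020, 2020],
    'acres': [100.0, 300.0, 50.0, 150.0, 400.0],
})

# The grouping variable goes inside groupby(); the variable to summarize follows it
fires_per_year = df.groupby('year')['acres'].count()   # number of fires per year
mean_acres = df.groupby('year')['acres'].mean()        # average acreage per year
total_acres = df.groupby('year')['acres'].sum()        # total acreage per year

print(fires_per_year.tolist())  # [2, 3]
print(mean_acres.tolist())      # [200.0, 200.0]
print(total_acres.tolist())     # [400.0, 600.0]
```

Each result is a Series indexed by year, so calling .plot() on it produces a time plot directly.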
Create a geographic map of the county and fires
Calculate the total number of fires in your assigned county.
Calculate average acreage (use variable ‘GIS_ACRES’) within the assigned county.
Calculate the average duration of wildfires in the county. The variable for duration was calculated in the Lab 2 Google Colab notebook; you just need to use it.
Plot the histogram of wildfire durations in the county.
Create a time series plot of the total number of fires for each year in the assigned county.
Calculate and plot average acreage (use variable ‘GIS_ACRES’) of fires over years.
Download the average annual temperature for your county from NOAA
Go to the website
Input the following: Parameter - Average Temperature, Time Scale - Annual, Start Year - 1950, End Year - 2021, State - California, County - County_Of_Your_Assignment
Click ‘Plot’, scroll down and download data in csv format.
Upload the file to your Google Drive and continue working. Alternatively, right-click on the Excel logo above the rendered table, to the right of the word ‘Download’, and click ‘Copy link address’.
Plot three subfigures (use any orientation you find useful): a) annual number of fires, b) total sum of annual acreage of fires, c) average temperature (that you just downloaded).
Edit your report with markdown headings and text where necessary. MAKE SURE TO COMMENT AND INTERPRET EVERY PLOT.
Submit via GauchoSpace as geog172_firstlastname_lab02.ipynb.
Optional 1: recode the variable ‘CAUSE’ using the dictionary above and generate a bar chart with the most common causes of fires in CA.
Optional 2: Plot the number of fires by cause over time.
Optional 3: Create a bar plot or a line plot with the average number of wildfires in each month.
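For the three-panel figure requested in the list above, one possible layout is sketched here. All numbers are invented, the column names are hypothetical, and a vertical stack with a shared x-axis is only one of several reasonable orientations.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented annual summaries; replace with the series computed from your data
annual = pd.DataFrame(
    {
        'n_fires': [3, 5, 4],
        'total_acres': [400.0, 900.0, 650.0],
        'avg_temp': [58.1, 59.4, 58.8],
    },
    index=[2019, 2020, 2021],
)

# Three stacked panels sharing the x-axis (years)
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(8, 9), sharex=True)

annual['n_fires'].plot(ax=axes[0], marker='o')
axes[0].set_ylabel('Number of fires')

annual['total_acres'].plot(ax=axes[1], marker='o')
axes[1].set_ylabel('Total acres')

annual['avg_temp'].plot(ax=axes[2], marker='o')
axes[2].set_ylabel('Average temperature')
axes[2].set_xlabel('Year')

# Adjust spacing so labels do not overlap
fig.tight_layout()
```

Sharing the x-axis keeps the years aligned across panels, which makes it easier to compare fire activity with temperature by eye.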