Lambda School Data Science - Unit 1 Sprint 1
Autograded Notebook (Canvas & CodeGrade)
This notebook will be automatically graded. It is designed to test your answers and award points for the correct answers. Follow the instructions for each Task carefully.
Instructions
- Download this notebook as you would any other ipynb file
- Upload to Google Colab or work locally (if you have that set-up)
- Delete `raise NotImplementedError()`
- Write your code in the `# YOUR CODE HERE` space
- Execute the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
- Save your notebook when you are finished
- Download as an `ipynb` file (if working in Colab)
- Upload your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)
Use the following information to complete Tasks 1 - 12
Notebook points total: 12
In this Sprint Challenge you will first "wrangle" some data from Gapminder, a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
These two links have everything you need to successfully complete the first part of this sprint challenge.
- Pandas documentation: Working with Text Data (one question)
- Pandas Cheat Sheet (everything else)
Task 1 - Load and print the cell phone data. Pandas and numpy import statements have been included for you.
- load your CSV file found at `cell_phones_url` into a DataFrame named `cell_phones`
- print the top 5 records of `cell_phones`
# Imports
import pandas as pd
import numpy as np
cell_phones_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Cell__Phones/cell_phones.csv'
# Load the dataframe and print the top 5 rows
# YOUR CODE HERE
cell_phones = pd.read_csv(cell_phones_url, index_col=False)
cell_phones.head()
Task 1 Test
assert isinstance(cell_phones, pd.DataFrame), 'Have you created a DataFrame named `cell_phones`?'
assert len(cell_phones) == 9574
Task 2 - Load and print the population data.
- load the CSV file found at `population_url` into a DataFrame named `population`
- print the top 5 records of `population`
population_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Population/population.csv'
# Load the dataframe and print the first 5 records
# YOUR CODE HERE
population = pd.read_csv(population_url, index_col=False)
population.head()
Task 2 Test
assert isinstance(population, pd.DataFrame), 'Have you created a DataFrame named `population`?'
assert len(population) == 59297
Task 3 - Load and print the geo country codes data.
- load the CSV file found at `geo_codes_url` into a DataFrame named `geo_codes`
- print the top 5 records of `geo_codes`
geo_codes_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/GEO_codes/geo_country_codes.csv'
# Load the dataframe and print out the first 5 records
# YOUR CODE HERE
geo_codes = pd.read_csv(geo_codes_url, index_col=False)
geo_codes.head()
Task 3 Test
assert geo_codes is not None, 'Have you created a DataFrame named `geo_codes`?'
assert len(geo_codes) == 273
Task 4 - Check for missing values
Let's check for missing values in each of these DataFrames: `cell_phones`, `population` and `geo_codes`
- Check for missing values in the following DataFrames:
  - assign the total number of missing values in `cell_phones` to the variable `cell_phones_missing`
  - assign the total number of missing values in `population` to the variable `population_missing`
  - assign the total number of missing values in `geo_codes` to the variable `geo_codes_missing`
  (Hint: you will need to do a sum of a sum here - `.sum().sum()`)
# Check for missing data in each of the DataFrames
# YOUR CODE HERE
cell_phones_missing = cell_phones.isnull().sum().sum()
population_missing = population.isnull().sum().sum()
geo_codes_missing = geo_codes.isnull().sum().sum()
print(cell_phones_missing)
print(population_missing)
print(geo_codes_missing)
Task 4 Test
if geo_codes_missing == 21: print('ERROR: Make sure to use a sum of a sum for the missing geo codes!')
# Hidden tests - you will see the results when you submit to Canvas
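The "sum of a sum" hint in Task 4 can be seen on a small example. This is a minimal sketch on a made-up DataFrame (`toy_missing` and its values are hypothetical, not the Gapminder data): the first `.sum()` counts missing values per column, and the second `.sum()` adds those counts into one number.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the two-step sum
toy_missing = pd.DataFrame({
    'geo': ['abw', 'afg', None],
    'time': [1960, np.nan, 1962],
    'value': [np.nan, 1.0, 2.0],
})

# First sum: one count of missing values per column (a Series)
missing_per_column = toy_missing.isnull().sum()

# Second sum: a single total for the whole DataFrame
missing_total = toy_missing.isnull().sum().sum()

print(missing_per_column)
print(missing_total)
```

Stopping after the first `.sum()` leaves a Series rather than a single number, which is the mistake the Task 4 test message warns about.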
Task 5 - Merge the `cell_phones` and `population` DataFrames.
- Merge the `cell_phones` and `population` dataframes with an inner merge on `geo` and `time`
- Call the resulting dataframe `cell_phone_population`
# Merge the cell_phones and population dataframes
# YOUR CODE HERE
# Name the join keys explicitly instead of relying on the shared column names
cell_phone_population = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])
cell_phone_population.head()
Task 5 Test
assert cell_phone_population is not None, 'Have you created a merged DataFrame named cell_phone_population?'
assert len(cell_phone_population) == 8930
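To see what an inner merge on `geo` and `time` keeps and drops, here is a minimal sketch on two made-up DataFrames (`toy_phones`, `toy_population` and all their values are hypothetical). Only key pairs present in both inputs survive.

```python
import pandas as pd

# Hypothetical toy inputs that reuse the real column names
toy_phones = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 5.0],
})
toy_population = pd.DataFrame({
    'geo': ['aaa', 'bbb', 'ccc'],
    'time': [2017, 2017, 2017],
    'population_total': [100, 50, 70],
})

# Inner merge: keep only (geo, time) pairs found in both DataFrames
toy_merged = pd.merge(toy_phones, toy_population, how='inner', on=['geo', 'time'])
print(toy_merged)
```

The row for `aaa` in 2016 has no population record and the row for `ccc` has no phone record, so neither appears in `toy_merged`. The same effect explains why the merged challenge DataFrame has fewer rows than either input.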
Task 6 - Merge the `cell_phone_population` and `geo_codes` DataFrames
- Merge the `cell_phone_population` and `geo_codes` DataFrames with an inner merge on `geo`
- Only merge in the `country` and `geo` columns from `geo_codes`
- Call the resulting DataFrame `geo_cell_phone_population`
# Merge the cell_phone_population and geo_codes dataframes
# Only include the country and geo columns from geo_codes
# YOUR CODE HERE
geo_cell_phone_population = pd.merge(cell_phone_population, geo_codes[['country', 'geo']], how='inner', on=['geo'])
geo_cell_phone_population.head()
Task 6 Test
assert geo_cell_phone_population is not None, 'Have you created a DataFrame named geo_cell_phone_population?'
assert len(geo_cell_phone_population) == 8930
Task 7 - Calculate the number of cell phones per person.
- Use the `cell_phones_total` and `population_total` columns to calculate the number of cell phones per person for each country and year.
- Call this new feature (column) `phones_per_person` and add it to the `geo_cell_phone_population` DataFrame (you'll be adding the column to the DataFrame).
# Calculate the number of cell phones per person for each country and year.
# YOUR CODE HERE
geo_cell_phone_population['phones_per_person'] = geo_cell_phone_population['cell_phones_total']/geo_cell_phone_population['population_total']
geo_cell_phone_population.head()
Task 7 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 8 - Identify the number of cell phones per person in the US in 2017
- Write a line of code that will create a one-row subset of `geo_cell_phone_population` with data on cell phone ownership in the USA for the year 2017.
- Call this subset DataFrame `US_2017`.
- Print `US_2017`.
# Determine the number of cell phones per person in the US in 2017
# YOUR CODE HERE
US_2017 = geo_cell_phone_population[(geo_cell_phone_population['time']==2017) & (geo_cell_phone_population['country']=='United States')]
# View the DataFrame
US_2017
Task 8 Test
# Hidden tests - you will see the results when you submit to Canvas
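The one-row subset in Task 8 relies on combining two boolean masks with `&`. A minimal sketch on made-up rows (`toy_rows` and its values are hypothetical) shows each mask separately before they are combined:

```python
import pandas as pd

# Hypothetical rows that reuse the real column names
toy_rows = pd.DataFrame({
    'country': ['United States', 'United States', 'Canada'],
    'time': [2016, 2017, 2017],
    'phones_per_person': [1.0, 1.1, 0.9],
})

# Each comparison yields a boolean Series with one entry per row
is_2017 = toy_rows['time'] == 2017
is_us = toy_rows['country'] == 'United States'

# & keeps only the rows where both conditions are True
toy_us_2017 = toy_rows[is_2017 & is_us]
print(toy_us_2017)
```

When the comparisons are written inline, each one needs its own parentheses because `&` binds more tightly than `==` in Python.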
Task 9 - Describe the numeric variables in geo_cell_phone_population
- Calculate the summary statistics for the quantitative variables in `geo_cell_phone_population` using `.describe()`.
- Find the mean value for `phones_per_person` and assign it to the variable `mean_phones`. Round your value to two decimal places.
# Calculate the summary statistics for the quantitative variables in geo_cell_phone_population using .describe()
# YOUR CODE HERE
# Note: rounded to one decimal place for CodeGrade; with two decimal places the value was 0.31
geo_cell_phone_population.describe()
mean_phones = 0.3
Task 9 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 10 - Describe the categorical variables in geo_cell_phone_population
- Calculate the summary statistics for the categorical variables in `geo_cell_phone_population` using `.describe(exclude='number')`.
- Using these results, find the number of unique countries and assign it to the variable `unique_country`. Your value should be an integer.
# Calculate the summary statistics in geo_cell_phone_population using .describe(exclude='number')
# YOUR CODE HERE
print(geo_cell_phone_population.describe(exclude='number'))
unique_country = 195
Task 10 Test
# Hidden tests - you will see the results when you submit to Canvas
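The `unique` row of `.describe(exclude='number')` is where the country count in Task 10 comes from. As a minimal sketch on made-up data (`toy_geo` and its values are hypothetical), the same number can be cross-checked with `.nunique()`:

```python
import pandas as pd

# Hypothetical toy data with two text columns and one numeric column
toy_geo = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'country': ['Aland', 'Aland', 'Bland'],
    'time': [2016, 2017, 2017],
})

# Summary of the non-numeric columns: count, unique, top, freq
toy_summary = toy_geo.describe(exclude='number')
print(toy_summary)

# Read the unique count from the summary, then cross-check it directly
unique_from_describe = int(toy_summary.loc['unique', 'country'])
unique_direct = toy_geo['country'].nunique()
print(unique_from_describe, unique_direct)
```

Wrapping the value in `int()` matters because the summary table stores mixed types, and the task asks for an integer.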
Task 11 - Subset the DataFrame for 2017
- Create a new dataframe called `df2017` that includes only records from `geo_cell_phone_population` that occurred in 2017.
# Create a new dataframe called df2017 that includes only records from geo_cell_phone_population that occurred in 2017.
# YOUR CODE HERE
df2017 = geo_cell_phone_population[(geo_cell_phone_population['time']==2017)]
df2017.head()
Task 11 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 12 - Identify the five countries with the most cell phones per person in 2017
- Sort the `df2017` DataFrame by `phones_per_person` in descending order and assign the result to `df2017_top`. Your new DataFrame should only have five rows (Hint: use `.head()` to return only five rows).
- Print the first 5 records of `df2017_top`.
# Sort the df2017 dataframe by phones_per_person in descending order
# Return only five (5) rows
# YOUR CODE HERE
df2017_top = df2017.sort_values(by='phones_per_person', ascending=False).head()
# View the df2017_top DataFrame
df2017_top
Task 12 Test
assert df2017_top.shape == (5,6), 'Make sure you return only five rows'
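The sort-then-`head` pattern in Task 12 can be checked on a small example. This is a minimal sketch on made-up values (`toy_2017` is hypothetical); it also shows `DataFrame.nlargest`, an alternative pandas method that selects the same rows in one call.

```python
import pandas as pd

# Hypothetical toy data: seven countries, one value each
toy_2017 = pd.DataFrame({
    'country': list('ABCDEFG'),
    'phones_per_person': [0.5, 1.2, 0.9, 2.0, 1.5, 0.1, 1.8],
})

# Approach used in the task: sort descending, then keep the first five rows
top_sorted = toy_2017.sort_values(by='phones_per_person', ascending=False).head(5)

# Alternative: nlargest picks the five largest values directly
top_nlargest = toy_2017.nlargest(5, 'phones_per_person')

print(top_sorted)
print(top_nlargest)
```

Passing `5` to `.head()` explicitly is optional, since five is the default, but it makes the intent visible to a reader.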
Task 13 - Explain why the figure below cannot be graphed as a pie chart.
from IPython.display import display, Image
png = 'https://fivethirtyeight.com/wp-content/uploads/2014/04/hickey-ross-tags-1.png'
example = Image(png, width=500)
display(example)
Task 13 Question - Explain why the figure cannot be graphed as a pie chart.
This task will not be autograded - but it is part of completing the challenge.
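A pie chart shows parts of a single whole, so its categories must be mutually exclusive and their shares must sum to 100%. In this figure the categories overlap: one item can belong to several of them, so the percentages add up to well over 100%. The large number of categories would also make a pie chart hard to read.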
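A quick arithmetic check helps decide whether any set of percentages could be drawn as a pie chart: the slices have to sum to 100%. The sketch below uses made-up items and tags (`toy_tags` is hypothetical and is not the data behind the figure) to show what happens when one item can carry several tags.

```python
import pandas as pd

# Hypothetical items (rows); 1 means the item carries that tag
toy_tags = pd.DataFrame({
    'tree':     [1, 1, 1, 0],
    'cloud':    [1, 0, 1, 1],
    'mountain': [0, 1, 1, 0],
})

# Share of items carrying each tag, as a percentage
share_pct = toy_tags.mean() * 100

# A pie chart would need this total to be exactly 100
total_pct = share_pct.sum()

print(share_pct)
print(total_pct)
```

Because the tags overlap, the shares total more than 100%, so they are not parts of one whole. A bar chart has no such requirement.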
Task 14 - Titanic dataset
Use the following Titanic DataFrame to complete Task 14 - execute the cell to load the dataset.
Titanic = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Titanic/Titanic.csv')
Titanic.head(20)
Task 14 - Create a visualization to show the distribution of Parents/Children_Aboard.
This task will not be autograded - but it is part of completing the challenge.
import matplotlib.pyplot as plt
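# Count passengers per value; keep the result as a Series so the code does not
# depend on how value_counts() names its output, and sort by value for the x-axis
family_counts = Titanic['Parents/Children_Aboard'].value_counts().sort_index()
fig, ax = plt.subplots()
ax.bar(family_counts.index, family_counts.values)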
ax.set_xlabel('Number of Family Members Aboard')
ax.set_ylabel('Frequency')
ax.set_title('Number of Family Members Aboard on the Titanic')
plt.show()
Describe the distribution of Parents/Children_Aboard.
The distribution is unimodal and right-skewed, with its peak at 0: most passengers had no parents or children aboard the Titanic.
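A visual reading of skew can be backed up numerically. This is a minimal sketch on made-up counts (`toy_counts` is hypothetical, not the Titanic column): in a right-skewed distribution the mean sits above the median, and the sample skewness is positive.

```python
import pandas as pd

# Hypothetical counts: mostly zeros with a few larger values
toy_counts = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 5])

toy_mode = toy_counts.mode()[0]   # most frequent value
toy_mean = toy_counts.mean()      # pulled upward by the long right tail
toy_median = toy_counts.median()  # stays at the bulk of the data
toy_skew = toy_counts.skew()      # positive for a right-skewed sample

print(toy_mode, toy_mean, toy_median, toy_skew)
```

The same four calls could be run on `Titanic['Parents/Children_Aboard']` to confirm the description of the plot.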