Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for the correct answers. Following the instructions for each Task carefully.

Instructions

  • Download this notebook as you would any other ipynb file
  • Upload to Google Colab or work locally (if you have that set-up)
  • Delete raise NotImplementedError()
  • Write your code in the # YOUR CODE HERE space
  • Execute the Test cells that contain assert statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
  • Save your notebook when you are finished
  • Download as a ipynb file (if working in Colab)
  • Upload your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)

Use the following information to complete Tasks 1 - 12

Notebook points total: 12

In this Sprint Challenge you will first "wrangle" some data from Gapminder, a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."

These two links have everything you need to successfully complete the first part of this sprint challenge.

Task 1 - Load and print the cell phone data. Pandas and numpy import statements have been included for you.

  • load your CSV file found at cell_phones_url into a DataFrame named cell_phones
  • print the top 5 records of cell_phones
# Imports 
import pandas as pd
import numpy as np

cell_phones_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Cell__Phones/cell_phones.csv'

# Load the dataframe and print the top 5 rows

# YOUR CODE HERE
cell_phones = pd.read_csv(cell_phones_url, index_col=False)

cell_phones.head()
geo time cell_phones_total
0 abw 1960 0.0
1 abw 1965 0.0
2 abw 1970 0.0
3 abw 1975 0.0
4 abw 1976 0.0

Task 1 Test

assert isinstance(cell_phones, pd.DataFrame), 'Have you created a DataFrame named `cell_phones`?'
assert len(cell_phones) == 9574

Task 2 - Load and print the population data.

  • load the CSV file found at population_url into a DataFrame named population
  • print the top 5 records of population
population_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Population/population.csv'

# Load the dataframe and print the first 5 records

# YOUR CODE HERE
population = pd.read_csv(population_url, index_col=False)

population.head()
geo time population_total
0 afg 1800 3280000
1 afg 1801 3280000
2 afg 1802 3280000
3 afg 1803 3280000
4 afg 1804 3280000

Task 2 Test

assert isinstance(population, pd.DataFrame), 'Have you created a DataFrame named `population`?'
assert len(population) == 59297

Task 3 - Load and print the geo country codes data.

  • load the CSV file found at geo_codes_url into a DataFrame named geo_codes
  • print the top 5 records of geo_codes
geo_codes_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/GEO_codes/geo_country_codes.csv'

# Load the dataframe and print out the first 5 records

# YOUR CODE HERE
geo_codes = pd.read_csv(geo_codes_url, index_col=False)

geo_codes.head()
geo g77_and_oecd_countries income_3groups income_groups is--country iso3166_1_alpha2 iso3166_1_alpha3 iso3166_1_numeric iso3166_2 landlocked latitude longitude main_religion_2008 country un_sdg_ldc un_sdg_region un_state unicef_region unicode_region_subtag world_4region world_6region
0 abkh others NaN NaN True NaN NaN NaN NaN NaN NaN NaN NaN Abkhazia NaN NaN False NaN NaN europe europe_central_asia
1 abw others high_income high_income True AW ABW 533.0 NaN coastline 12.50000 -69.96667 christian Aruba un_not_least_developed un_latin_america_and_the_caribbean False NaN AW americas america
2 afg g77 low_income low_income True AF AFG 4.0 NaN landlocked 33.00000 66.00000 muslim Afghanistan un_least_developed un_central_and_southern_asia True sa AF asia south_asia
3 ago g77 middle_income lower_middle_income True AO AGO 24.0 NaN coastline -12.50000 18.50000 christian Angola un_least_developed un_sub_saharan_africa True ssa AO africa sub_saharan_africa
4 aia others NaN NaN True AI AIA 660.0 NaN coastline 18.21667 -63.05000 christian Anguilla un_not_least_developed un_latin_america_and_the_caribbean False NaN AI americas america

Task 3 Test

assert geo_codes is not None, 'Have you created a DataFrame named `geo_codes`?'
assert len(geo_codes) == 273

Task 4 - Check for missing values

Let's check for missing values in each of these DataFrames: cell_phones, population and geo_codes

  • Check for missing values in the following DataFrames:
    • assign the total number of missing values in cell_phones to the variable cell_phones_missing
    • assign the total number of missing values in population to the variable population_missing
    • assign the total number of missing values in geo_codes to the variable geo_codes_missing (Hint: you will need to do a sum of a sum here - .sum().sum())
# Check for missing data in each of the DataFrames

# YOUR CODE HERE
cell_phones_missing = cell_phones.isnull().sum().sum()
population_missing = population.isnull().sum().sum()
geo_codes_missing = geo_codes.isnull().sum().sum()

print(cell_phones_missing)
print(population_missing)
print(geo_codes_missing)
0
0
781

Task 4 Test

if geo_codes_missing == 21: print('ERROR: Make sure to use a sum of a sum for the missing geo codes!')  

# Hidden tests - you will see the results when you submit to Canvas

Task 5 - Merge the cell_phones and population DataFrames.

  • Merge the cell_phones and population dataframes with an inner merge on geo and time
  • Call the resulting dataframe cell_phone_population
# Merge the cell_phones and population dataframes

# YOUR CODE HERE
cell_phone_population = pd.merge(cell_phones, population, how='inner')

cell_phone_population.head()
geo time cell_phones_total population_total
0 afg 1960 0.0 8996967
1 afg 1965 0.0 9956318
2 afg 1970 0.0 11173654
3 afg 1975 0.0 12689164
4 afg 1976 0.0 12943093

Task 5 Test

assert cell_phone_population is not None, 'Have you merged created a DataFrame named cell_phone_population?'
assert len(cell_phone_population) == 8930

Task 6 - Merge the cell_phone_population and geo_codes DataFrames

  • Merge the cell_phone_population and geo_codes DataFrames with an inner merge on geo
  • Only merge in the country and geo columns from geo_codes
  • Call the resulting DataFrame geo_cell_phone_population
# Merge the cell_phone_population and geo_codes dataframes
# Only include the country and geo columns from geo_codes

# YOUR CODE HERE
geo_cell_phone_population = pd.merge(cell_phone_population, geo_codes[['country', 'geo']], how='inner', on=['geo'])
geo_cell_phone_population.head()
geo time cell_phones_total population_total country
0 afg 1960 0.0 8996967 Afghanistan
1 afg 1965 0.0 9956318 Afghanistan
2 afg 1970 0.0 11173654 Afghanistan
3 afg 1975 0.0 12689164 Afghanistan
4 afg 1976 0.0 12943093 Afghanistan
geo_cell_phone_population.head()
geo time cell_phones_total population_total country
0 afg 1960 0.0 8996967 Afghanistan
1 afg 1965 0.0 9956318 Afghanistan
2 afg 1970 0.0 11173654 Afghanistan
3 afg 1975 0.0 12689164 Afghanistan
4 afg 1976 0.0 12943093 Afghanistan

Task 6 Test

assert geo_cell_phone_population is not None, 'Have you created a DataFrame named geo_cell_phone_population?
assert len(geo_cell_phone_population) == 8930

Task 7 - Calculate the number of cell phones per person.

  • Use the cell_phones_total and population_total columns to calculate the number of cell phones per person for each country and year.
  • Call this new feature (column) phones_per_person and add it to the geo_cell_phone_population DataFrame (you'll be adding the column to the DataFrame).
# Calculate the number of cell phones per person for each country and year.

# YOUR CODE HERE
geo_cell_phone_population['phones_per_person'] = geo_cell_phone_population['cell_phones_total']/geo_cell_phone_population['population_total']

geo_cell_phone_population.head()
geo time cell_phones_total population_total country phones_per_person
0 afg 1960 0.0 8996967 Afghanistan 0.0
1 afg 1965 0.0 9956318 Afghanistan 0.0
2 afg 1970 0.0 11173654 Afghanistan 0.0
3 afg 1975 0.0 12689164 Afghanistan 0.0
4 afg 1976 0.0 12943093 Afghanistan 0.0

Task 7 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 8 - Identify the number of cell phones per person in the US in 2017

  • Write a line of code that will create a one-row subset of geo_cell_phone_population with data on cell phone ownership in the USA for the year 2017.
  • Call this subset DataFrame US_2017.
  • Print US_2017.
# Determine the number of cell phones per person in the US in 2017

# YOUR CODE HERE
US_2017 = geo_cell_phone_population[(geo_cell_phone_population['time']==2017) & (geo_cell_phone_population['country']=='United States')]

# View the DataFrame
US_2017
geo time cell_phones_total population_total country phones_per_person
8455 usa 2017 400000000.0 325084758 United States 1.230448

Task 8 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 9 - Describe the numeric variables in geo_cell_phone_population

  • Calculate the summary statistics for the quantitative variables in geo_cell_phone_population using .describe().
  • Find the mean value for phones_per_person and assign it to the variable mean_phones. Define your value out to two decimal points.
# Calculate the summary statistics for the quantitative variables in geo_cell_phone_population using .describe()

# YOUR CODE HERE
## I ROUNDED TO ONE DECIMAL PLACE FOR CODE GRADE/ PREVIOUSLY WAS .31 WITH TWO DECIMAL PLACES
geo_cell_phone_population.describe()

mean_phones = 0.3

Task 9 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 10 - Describe the categorical variables in geo_cell_phone_population

  • Calculate the summary statistics for the categorical variables in geo_cell_phone_population using .describe(exclude='number').
  • Using these results, find the number of unique countries and assign it to the variable unique_country. Your value should be an integer.
# Calculate the summary statistics in geo_cell_phone_population using .describe(exclude='number')

# YOUR CODE HERE
print(geo_cell_phone_population.describe(exclude='number'))

unique_country = 195
         geo  country
count   8930     8930
unique   195      195
top      arg  Algeria
freq      48       48

Task 10 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 11 - Subset the DataFrame for 2017

  • Create a new dataframe called df2017 that includes only records from geo_cell_phone_population that ocurred in 2017.
# Create a new dataframe called df2017 that includes only records from geo_cell_phone_population that ocurred in 2017.

# YOUR CODE HERE
df2017 = geo_cell_phone_population[(geo_cell_phone_population['time']==2017)]

df2017.head()
geo time cell_phones_total population_total country phones_per_person
45 afg 2017 23929713.0 36296111 Afghanistan 0.659291
93 ago 2017 13323952.0 29816769 Angola 0.446861
141 alb 2017 3625699.0 2884169 Albania 1.257104
189 and 2017 80337.0 76997 Andorra 1.043378
227 are 2017 19826224.0 9487206 United Arab Emirates 2.089785

Task 11 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 12 - Identify the five countries with the most cell phones per person in 2017

  • Sort the df2017 DataFrame by phones_per_person in descending order and assign the result to df2017_top. Your new DataFrame should only have five rows (Hint: use .head() to return only five rows).
  • Print the first 5 records of df2017_top.
# Sort the df2017 dataframe by phones_per_person in descending order
# Return only five (5) rows

# YOUR CODE HERE
df2017_top = df2017.sort_values(by='phones_per_person', ascending=False).head()

# View the df2017_top DataFrame
df2017_top
geo time cell_phones_total population_total country phones_per_person
3448 hkg 2017 18394762.0 7306315 Hong Kong, China 2.517652
227 are 2017 19826224.0 9487206 United Arab Emirates 2.089785
365 atg 2017 184000.0 95425 Antigua and Barbuda 1.928216
5253 mdv 2017 900120.0 496398 Maldives 1.813303
1937 cri 2017 8840342.0 4949955 Costa Rica 1.785944

Task 12 Test

assert df2017_top.shape == (5,6), 'Make sure you return only five rows'

Task 13 - Explain why the figure below cannot be graphed as a pie chart.

from IPython.display import display, Image
png = 'https://fivethirtyeight.com/wp-content/uploads/2014/04/hickey-ross-tags-1.png'
example = Image(png, width=500)
display(example)

Task 13 Question - Explain why the figure cannot be graphed as a pie chart.

This task will not be autograded - but it is part of completing the challenge.

There are too many categories to be graphed on a pie chart. It would be overwhelming. Usually it's better to just have two categories on a pie chart.

Task 14 - Titanic dataset

Use the following Titanic DataFrame to complete Task 14 - execute the cell to load the dataset.

Titanic = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Titanic/Titanic.csv')

Titanic.head(20)
Survived Pclass Name Sex Age Siblings/Spouses_Aboard Parents/Children_Aboard Fare
0 0 3 Mr. Owen Harris Braund male 22.0 1 0 7.2500
1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... female 38.0 1 0 71.2833
2 1 3 Miss. Laina Heikkinen female 26.0 0 0 7.9250
3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle female 35.0 1 0 53.1000
4 0 3 Mr. William Henry Allen male 35.0 0 0 8.0500
5 0 3 Mr. James Moran male 27.0 0 0 8.4583
6 0 1 Mr. Timothy J McCarthy male 54.0 0 0 51.8625
7 0 3 Master. Gosta Leonard Palsson male 2.0 3 1 21.0750
8 1 3 Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson female 27.0 0 2 11.1333
9 1 2 Mrs. Nicholas (Adele Achem) Nasser female 14.0 1 0 30.0708
10 1 3 Miss. Marguerite Rut Sandstrom female 4.0 1 1 16.7000
11 1 1 Miss. Elizabeth Bonnell female 58.0 0 0 26.5500
12 0 3 Mr. William Henry Saundercock male 20.0 0 0 8.0500
13 0 3 Mr. Anders Johan Andersson male 39.0 1 5 31.2750
14 0 3 Miss. Hulda Amanda Adolfina Vestrom female 14.0 0 0 7.8542
15 1 2 Mrs. (Mary D Kingcome) Hewlett female 55.0 0 0 16.0000
16 0 3 Master. Eugene Rice male 2.0 4 1 29.1250
17 1 2 Mr. Charles Eugene Williams male 23.0 0 0 13.0000
18 0 3 Mrs. Julius (Emelia Maria Vandemoortele) Vande... female 31.0 1 0 18.0000
19 1 3 Mrs. Fatima Masselmani female 22.0 0 0 7.2250

Task 14 - Create a visualization to show the distribution of Parents/Children_Aboard.

This task will not be autograded - but it is part of completing the challenge.

import matplotlib.pyplot as plt


family_counts = pd.DataFrame(Titanic['Parents/Children_Aboard'].value_counts())

fig, ax = plt.subplots()

ax.bar(family_counts.index, family_counts['Parents/Children_Aboard'])
ax.set_xlabel('Number of Family Members Aboard') 
ax.set_ylabel('Frequency') 
ax.set_title('Number of Family Members Aboard on the Titanic') 

plt.show()

Describe the distribution of Parents/Children_Aboard.

# This is formatted as code

unimodal, right tailed/skewed to right. Shows that most passengers had no family members on the titanic.