Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for the correct answers. Following the instructions for each Task carefully.

Instructions

Download this notebook as you would any other ipynb file
Upload to Google Colab or work locally (if you have that set-up)
Delete raise NotImplementedError()
Write your code in the # YOUR CODE HERE space
Execute the Test cells that contain assert statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
Save your notebook when you are finished
Download as a ipynb file (if working in Colab)
Upload your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)

Use the following information to complete Tasks 1 - 12

Notebook points total: 12

In this Sprint Challenge you will first "wrangle" some data from Gapminder, a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."

These two links have everything you need to successfully complete the first part of this sprint challenge.

Pandas documentation: Working with Text Data (one question)
Pandas Cheat Sheet (everything else)

Task 1 - Load and print the cell phone data. Pandas and numpy import statements have been included for you.

load your CSV file found at cell_phones_url into a DataFrame named cell_phones
print the top 5 records of cell_phones

# Imports 
import pandas as pd
import numpy as np

cell_phones_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Cell__Phones/cell_phones.csv'

# Load the dataframe and print the top 5 rows

# YOUR CODE HERE
cell_phones = pd.read_csv(cell_phones_url, index_col=False)

cell_phones.head()

Task 1 Test

assert isinstance(cell_phones, pd.DataFrame), 'Have you created a DataFrame named `cell_phones`?'
assert len(cell_phones) == 9574

Task 2 - Load and print the population data.

load the CSV file found at population_url into a DataFrame named population
print the top 5 records of population

population_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Population/population.csv'

# Load the dataframe and print the first 5 records

# YOUR CODE HERE
population = pd.read_csv(population_url, index_col=False)

population.head()

Task 2 Test

assert isinstance(population, pd.DataFrame), 'Have you created a DataFrame named `population`?'
assert len(population) == 59297

Task 3 - Load and print the geo country codes data.

load the CSV file found at geo_codes_url into a DataFrame named geo_codes
print the top 5 records of geo_codes

geo_codes_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/GEO_codes/geo_country_codes.csv'

# Load the dataframe and print out the first 5 records

# YOUR CODE HERE
geo_codes = pd.read_csv(geo_codes_url, index_col=False)

geo_codes.head()

Task 3 Test

assert geo_codes is not None, 'Have you created a DataFrame named `geo_codes`?'
assert len(geo_codes) == 273

Task 4 - Check for missing values

Let's check for missing values in each of these DataFrames: cell_phones, population and geo_codes

Check for missing values in the following DataFrames:
- assign the total number of missing values in cell_phones to the variable cell_phones_missing
- assign the total number of missing values in population to the variable population_missing
- assign the total number of missing values in geo_codes to the variable geo_codes_missing (Hint: you will need to do a sum of a sum here - .sum().sum())

# Check for missing data in each of the DataFrames

# YOUR CODE HERE
cell_phones_missing = cell_phones.isnull().sum().sum()
population_missing = population.isnull().sum().sum()
geo_codes_missing = geo_codes.isnull().sum().sum()

print(cell_phones_missing)
print(population_missing)
print(geo_codes_missing)

0
0
781

Task 4 Test

if geo_codes_missing == 21: print('ERROR: Make sure to use a sum of a sum for the missing geo codes!')  

# Hidden tests - you will see the results when you submit to Canvas

Task 5 - Merge the cell_phones and population DataFrames.

Merge the cell_phones and population dataframes with an inner merge on geo and time
Call the resulting dataframe cell_phone_population

# Merge the cell_phones and population dataframes

# YOUR CODE HERE
cell_phone_population = pd.merge(cell_phones, population, how='inner')

cell_phone_population.head()

Task 5 Test

assert cell_phone_population is not None, 'Have you merged created a DataFrame named cell_phone_population?'
assert len(cell_phone_population) == 8930

Task 6 - Merge the cell_phone_population and geo_codes DataFrames

Merge the cell_phone_population and geo_codes DataFrames with an inner merge on geo
Only merge in the country and geo columns from geo_codes
Call the resulting DataFrame geo_cell_phone_population

# Merge the cell_phone_population and geo_codes dataframes
# Only include the country and geo columns from geo_codes

# YOUR CODE HERE
geo_cell_phone_population = pd.merge(cell_phone_population, geo_codes[['country', 'geo']], how='inner', on=['geo'])
geo_cell_phone_population.head()

geo_cell_phone_population.head()

Task 6 Test

assert geo_cell_phone_population is not None, 'Have you created a DataFrame named geo_cell_phone_population?
assert len(geo_cell_phone_population) == 8930

Task 7 - Calculate the number of cell phones per person.

Use the cell_phones_total and population_total columns to calculate the number of cell phones per person for each country and year.
Call this new feature (column) phones_per_person and add it to the geo_cell_phone_population DataFrame (you'll be adding the column to the DataFrame).

# Calculate the number of cell phones per person for each country and year.

# YOUR CODE HERE
geo_cell_phone_population['phones_per_person'] = geo_cell_phone_population['cell_phones_total']/geo_cell_phone_population['population_total']

geo_cell_phone_population.head()

Task 7 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 8 - Identify the number of cell phones per person in the US in 2017

Write a line of code that will create a one-row subset of geo_cell_phone_population with data on cell phone ownership in the USA for the year 2017.
Call this subset DataFrame US_2017.
Print US_2017.

# Determine the number of cell phones per person in the US in 2017

# YOUR CODE HERE
US_2017 = geo_cell_phone_population[(geo_cell_phone_population['time']==2017) & (geo_cell_phone_population['country']=='United States')]

# View the DataFrame
US_2017

Task 8 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 9 - Describe the numeric variables in geo_cell_phone_population

Calculate the summary statistics for the quantitative variables in geo_cell_phone_population using .describe().
Find the mean value for phones_per_person and assign it to the variable mean_phones. Define your value out to two decimal points.

# Calculate the summary statistics for the quantitative variables in geo_cell_phone_population using .describe()

# YOUR CODE HERE
## I ROUNDED TO ONE DECIMAL PLACE FOR CODE GRADE/ PREVIOUSLY WAS .31 WITH TWO DECIMAL PLACES
geo_cell_phone_population.describe()

mean_phones = 0.3

Task 9 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 10 - Describe the categorical variables in geo_cell_phone_population

Calculate the summary statistics for the categorical variables in geo_cell_phone_population using .describe(exclude='number').
Using these results, find the number of unique countries and assign it to the variable unique_country. Your value should be an integer.

# Calculate the summary statistics in geo_cell_phone_population using .describe(exclude='number')

# YOUR CODE HERE
print(geo_cell_phone_population.describe(exclude='number'))

unique_country = 195

         geo  country
count   8930     8930
unique   195      195
top      arg  Algeria
freq      48       48

Task 10 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 11 - Subset the DataFrame for 2017

Create a new dataframe called df2017 that includes only records from geo_cell_phone_population that ocurred in 2017.

# Create a new dataframe called df2017 that includes only records from geo_cell_phone_population that ocurred in 2017.

# YOUR CODE HERE
df2017 = geo_cell_phone_population[(geo_cell_phone_population['time']==2017)]

df2017.head()

Task 11 Test

# Hidden tests - you will see the results when you submit to Canvas

Task 12 - Identify the five countries with the most cell phones per person in 2017

Sort the df2017 DataFrame by phones_per_person in descending order and assign the result to df2017_top. Your new DataFrame should only have five rows (Hint: use .head() to return only five rows).
Print the first 5 records of df2017_top.

# Sort the df2017 dataframe by phones_per_person in descending order
# Return only five (5) rows

# YOUR CODE HERE
df2017_top = df2017.sort_values(by='phones_per_person', ascending=False).head()

# View the df2017_top DataFrame
df2017_top

Task 12 Test

assert df2017_top.shape == (5,6), 'Make sure you return only five rows'

Task 13 - Explain why the figure below cannot be graphed as a pie chart.

from IPython.display import display, Image
png = 'https://fivethirtyeight.com/wp-content/uploads/2014/04/hickey-ross-tags-1.png'
example = Image(png, width=500)
display(example)

Task 13 Question - Explain why the figure cannot be graphed as a pie chart.

This task will not be autograded - but it is part of completing the challenge.

There are too many categories to be graphed on a pie chart. It would be overwhelming. Usually it's better to just have two categories on a pie chart.

Task 14 - Titanic dataset

Use the following Titanic DataFrame to complete Task 14 - execute the cell to load the dataset.

Titanic = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Titanic/Titanic.csv')

Titanic.head(20)

Task 14 - Create a visualization to show the distribution of Parents/Children_Aboard.

This task will not be autograded - but it is part of completing the challenge.

import matplotlib.pyplot as plt


family_counts = pd.DataFrame(Titanic['Parents/Children_Aboard'].value_counts())

fig, ax = plt.subplots()

ax.bar(family_counts.index, family_counts['Parents/Children_Aboard'])
ax.set_xlabel('Number of Family Members Aboard') 
ax.set_ylabel('Frequency') 
ax.set_title('Number of Family Members Aboard on the Titanic') 

plt.show()

Describe the distribution of Parents/Children_Aboard.

# This is formatted as code

unimodal, right tailed/skewed to right. Shows that most passengers had no family members on the titanic.

	geo	time
0	abw	1960
1	abw	1965
2	abw	1970
3	abw	1975
4	abw	1976

	geo	time	population_total
0	afg	1800	3280000
1	afg	1801	3280000
2	afg	1802	3280000
3	afg	1803	3280000
4	afg	1804	3280000

	geo	g77_and_oecd_countries	income_3groups	income_groups	is--country	iso3166_1_alpha2	iso3166_1_alpha3	iso3166_1_numeric	iso3166_2	landlocked	latitude	longitude	main_religion_2008	country	un_sdg_ldc	un_sdg_region	un_state	unicef_region	unicode_region_subtag	world_4region	world_6region
0	abkh	others	NaN	NaN	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Abkhazia	NaN	NaN	False	NaN	NaN	europe	europe_central_asia
1	abw	others	high_income	high_income	True	AW	ABW	533.0	NaN	coastline	12.50000	-69.96667	christian	Aruba	un_not_least_developed	un_latin_america_and_the_caribbean	False	NaN	AW	americas	america
2	afg	g77	low_income	low_income	True	AF	AFG	4.0	NaN	landlocked	33.00000	66.00000	muslim	Afghanistan	un_least_developed	un_central_and_southern_asia	True	sa	AF	asia	south_asia
3	ago	g77	middle_income	lower_middle_income	True	AO	AGO	24.0	NaN	coastline	-12.50000	18.50000	christian	Angola	un_least_developed	un_sub_saharan_africa	True	ssa	AO	africa	sub_saharan_africa
4	aia	others	NaN	NaN	True	AI	AIA	660.0	NaN	coastline	18.21667	-63.05000	christian	Anguilla	un_not_least_developed	un_latin_america_and_the_caribbean	False	NaN	AI	americas	america

	geo	time	population_total
0	afg	1960	8996967
1	afg	1965	9956318
2	afg	1970	11173654
3	afg	1975	12689164
4	afg	1976	12943093

	geo	time	population_total	country
0	afg	1960	8996967	Afghanistan
1	afg	1965	9956318	Afghanistan
2	afg	1970	11173654	Afghanistan
3	afg	1975	12689164	Afghanistan
4	afg	1976	12943093	Afghanistan

	geo	time	cell_phones_total	population_total	country	phones_per_person
45	afg	2017	23929713.0	36296111	Afghanistan	0.659291
93	ago	2017	13323952.0	29816769	Angola	0.446861
141	alb	2017	3625699.0	2884169	Albania	1.257104
189	and	2017	80337.0	76997	Andorra	1.043378
227	are	2017	19826224.0	9487206	United Arab Emirates	2.089785

	geo	time	cell_phones_total	population_total	country	phones_per_person
3448	hkg	2017	18394762.0	7306315	Hong Kong, China	2.517652
227	are	2017	19826224.0	9487206	United Arab Emirates	2.089785
365	atg	2017	184000.0	95425	Antigua and Barbuda	1.928216
5253	mdv	2017	900120.0	496398	Maldives	1.813303
1937	cri	2017	8840342.0	4949955	Costa Rica	1.785944

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses_Aboard	Parents/Children_Aboard	Fare
0	0	3	Mr. Owen Harris Braund	male	22.0	1	0	7.2500
1	1	1	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	0	71.2833
2	1	3	Miss. Laina Heikkinen	female	26.0	0	0	7.9250
3	1	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	0	53.1000
4	0	3	Mr. William Henry Allen	male	35.0	0	0	8.0500
5	0	3	Mr. James Moran	male	27.0	0	0	8.4583
6	0	1	Mr. Timothy J McCarthy	male	54.0	0	0	51.8625
7	0	3	Master. Gosta Leonard Palsson	male	2.0	3	1	21.0750
8	1	3	Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson	female	27.0	0	2	11.1333
9	1	2	Mrs. Nicholas (Adele Achem) Nasser	female	14.0	1	0	30.0708
10	1	3	Miss. Marguerite Rut Sandstrom	female	4.0	1	1	16.7000
11	1	1	Miss. Elizabeth Bonnell	female	58.0	0	0	26.5500
12	0	3	Mr. William Henry Saundercock	male	20.0	0	0	8.0500
13	0	3	Mr. Anders Johan Andersson	male	39.0	1	5	31.2750
14	0	3	Miss. Hulda Amanda Adolfina Vestrom	female	14.0	0	0	7.8542
15	1	2	Mrs. (Mary D Kingcome) Hewlett	female	55.0	0	0	16.0000
16	0	3	Master. Eugene Rice	male	2.0	4	1	29.1250
17	1	2	Mr. Charles Eugene Williams	male	23.0	0	0	13.0000
18	0	3	Mrs. Julius (Emelia Maria Vandemoortele) Vande...	female	31.0	1	0	18.0000
19	1	3	Mrs. Fatima Masselmani	female	22.0	0	0	7.2250