Lambda School Data Science - Unit 1 Sprint 2
Autograded Notebook (Canvas & CodeGrade)
This notebook will be automatically graded. It is designed to test your answers and award points for the correct answers. Follow the instructions for each Task carefully.
Instructions
- Download this notebook as you would any other ipynb file
- Upload to Google Colab or work locally (if you have that set-up)
- Delete `raise NotImplementedError()`
- Write your code in the `# YOUR CODE HERE` space
- Execute the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
- Save your notebook when you are finished
- Download as an `ipynb` file (if working in Colab)
- Upload your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)
Part A: Statistical Analysis
Use the following information to complete tasks 1 - 8
Dataset description:
Anyone who is a fan of detective TV shows has watched a scene where human remains are discovered and some sort of expert is called in to determine when the person died. But is this science fiction or science fact? Is it possible to use evidence from skeletal remains to determine how long a body has been buried (a decent approximation of how long the person has been dead)?
Researchers sampled long bone material from bodies exhumed from coffin burials in two cemeteries in England. In each case, date of death and burial (and therefore interment time) was known. This data is given in the Longbones.csv dataset, which you can find here.
What can we learn about the bodies that were buried in the cemetery?
The variable names are:
- Site = Site ID, either Site 1 or Site 2
- Time = Interment time in years
- Depth = Burial depth in ft.
- Lime = Burial with quicklime (0 = No, 1 = Yes)
- Age = Age at time of death in years
- Nitro = Nitrogen composition of the long bones in g per 100g of bone.
- Oil = Oil contamination of the grave site (0 = No contamination, 1 = Oil contamination)
Source: D.R. Jarvis (1997). "Nitrogen Levels in Long Bones from Coffin Burials Interred for Periods of 26-90 Years," Forensic Science International, Vol. 85, pp. 199-208.
Task 1 - Load the data
As usual, let's begin by loading the data! The URL has been provided.
- Load your CSV file into a DataFrame named `df`
import pandas as pd
import numpy as np
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Longbones/Longbones.csv'
# YOUR CODE HERE
df = pd.read_csv(data_url)
# Print out your DataFrame
df.head()
Task 1 - Test
assert isinstance(df, pd.DataFrame), 'Have you created a DataFrame named `df`?'
assert len(df) == 42
Task 2 - Missing data
Now, let's determine if there is any missing data in the dataset. If there is, drop the row that contains a missing value.
- Check for missing/null values and assign the sum to `num_null`
- The result should be the sum of all the null values, as a single integer (Hint: you will compute the sum of a sum)
- If there are null values, drop them in place (your DataFrame should still be `df`)
Hint: If you need to go back and update your DataFrame, read in the data again before calculating the null values
# Hint: Make sure to read in the data again if you re-do your null calculation
# YOUR CODE HERE
# Sum the nulls in each column, then sum again to get a single integer
num_null = df.isnull().sum().sum()
# Drop any rows with missing values, keeping the DataFrame named df
df.dropna(inplace=True)
print(num_null)
Task 2 - Test
# Hidden tests - you will see the results when you submit to Canvas
Use the following information to complete tasks 3 - 8
The mean nitrogen composition in living individuals is 4.3g per 100g of bone.
We wish to use the Longbones sample to test the null hypothesis that the mean nitrogen composition per 100g of bone in the deceased is 4.3g (equal to that of living humans) vs the alternative hypothesis that the mean nitrogen composition per 100g of bone in the deceased is not 4.3g (not equal to that of living humans).
Task 3 - Statistical hypotheses
Write the null and alternative hypotheses described above.
This task will not be autograded - but it is part of completing the challenge.
Task 3 ANSWER:
$H_0: \mu =$ 4.3g
$H_a: \mu \neq$ 4.3g
Task 4 - Statistical distributions
What is the appropriate test for these hypotheses? A t-test or a chi-square test? Explain your answer in a sentence or two.
This task will not be autograded - but it is part of completing the challenge.
Task 4 ANSWER:
We want a one-sample t-test because we are comparing a sample mean to a known value, 4.3 g per 100 g of bone. A chi-square test checks for an association between two categorical variables, which is not what we have here.
Task 5 - Hypothesis testing
Use a built-in Python function to conduct the statistical test you identified earlier. The scipy stats module has been imported.
- Assign the t statistic to the variable `t`
- Assign the p-value to the variable `p`
Hint: Review the documentation to verify what it returns. You can assign the two variables in one step or two steps.
# Use this import for your calculation
from scipy import stats
# YOUR CODE HERE
# One-sample t-test of the mean Nitro value against 4.3
t, p = stats.ttest_1samp(df['Nitro'], 4.3)
print(t)
print(p)
Task 5 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 6 - Conclusion
What is the p-value for this hypothesis test? Do you reject or fail to reject the null hypothesis at the 0.05 level?
This task will not be autograded - but it is part of the project!
Task 6 ANSWER:
The p-value is approximately 8.1 × 10^-18. Since that is less than the 0.05 significance level, we reject the null hypothesis and conclude that the mean nitrogen level per 100 g of bone in the deceased is not 4.3 g.
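As a quick illustration (not required for the task), the decision rule can be written out in code, assuming `p` from Task 5 is still in scope:
# Illustrative decision rule (assumes p from Task 5 is defined)
alpha = 0.05
if p < alpha:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')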
Task 7 - Confidence Interval
Calculate a 95% confidence interval for the mean nitrogen composition in the longbones of a deceased individual using the t.interval function.
- Assign the lower end of the confidence interval to the variable `l`
- Assign the upper end of the confidence interval to the variable `u`
Hint: You will need to calculate other statistics to complete the confidence interval calculation. These variables can be named whatever you like - just make sure to name your confidence interval variables as specified above.
# Use this import for your calculation
from scipy.stats import t
# YOUR CODE HERE
mean = df['Nitro'].mean()
sd = df['Nitro'].std()
n = df['Nitro'].count()
se = sd / (n ** 0.5)  # standard error of the mean
l, u = t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(l)
print(u)
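As an optional sanity check (assuming `df['Nitro']` has no remaining missing values after Task 2), `scipy.stats.sem` should give the same standard error computed manually above:
# Optional check: stats.sem should match the manual standard-error calculation
from scipy import stats
print(stats.sem(df['Nitro']))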
Task 7 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 8 - Conclusion
Write an interpretation of your 95% confidence interval.
This task will not be autograded - but it is part of completing the challenge.
Task 8 ANSWER:
We are 95% confident that the mean nitrogen level per 100 g of bone in the deceased is between 3.73 g and 3.86 g.
A/B Testing and Udacity
Udacity is an online learning platform geared toward tech professionals who want to develop skills in programming, data science, etc. These classes are intensive - both for the students and instructors - and the learning experience is best when students are able to dedicate enough time to the classes and there is not a lot of student churn.
Udacity wished to determine if presenting potential students with a screen that would remind them of the time commitment involved in taking a class would decrease the enrollment of students who were unlikely to succeed in the class.
At the time of the experiment, when a student selected a course, she was taken to the course overview page and presented with two options: "start free trial", and "access course materials".
If the student clicked "start free trial", she was asked to enter her credit card information and was enrolled in a free trial for the paid version of the course (which would convert to a paid membership after 14 days).
If the student clicked "access course materials", she could view the videos and take the quizzes for free but could not access all the features of the course such as coaching.
Here's the experiment: Udacity tested a change where if the student clicked "start free trial", she was asked how much time she had available to devote to the course.
If the student indicated 5 or more hours per week, she would be taken through the checkout process as usual. If she indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion and suggesting that the student might like to access the course materials for free.
At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead.
Now we wish to see if there was an association between the screen the potential student viewed and whether or not the student enrolled in the paid version of the course.
The Udacity data is linked below and is in a non-tidy format. We'll be focusing on the number of enrolling customers who convert to paying customers.
You don't need to do anything with the non-tidy data in this Challenge; we're sharing it here so you can get an idea of what data looks like before we clean it.
import pandas as pd
import numpy as np
# Load data
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Udacity%20AB%20testing%20data/AB%20testing%20data.csv'
ABtest_ = pd.read_csv(data_url)
print(ABtest_.shape)
ABtest_.head()
Now, here is the enrollment and payment data in tidy format. You can see how I set it up here.
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Udacity%20AB%20testing%20data/AB_test_payments.csv'
ABtest = pd.read_csv(data_url, skipinitialspace=True, header=0)
print(ABtest.shape)
ABtest.head()
Dataset information
The "tidy" data has the following values for the columns:
- Group = Control or Experimental depending on the screen viewed
- Payment = 0 if the individual did not enroll as a paying customer, 1 if the individual did enroll as a paying customer
Our goal is to determine if there is an association between the screen that a potential student viewed as she was signing up for a course and whether or not she converted to a paying customer.
Task 9 - Statistical hypotheses
Write the null and alternative hypothesis to test if there is an association between the screen that a potential student viewed as she was signing up for a course and whether or not he or she converted to a paying customer.
This task will not be autograded - but it is part of completing the challenge.
Task 9 ANSWER:
Ho: There is no relationship between the screen viewed and converting to a paying customer.
Ha: There is a relationship between the screen viewed and converting to a paying customer.
Task 10 - Frequency and relative frequency
Calculate the frequency and relative frequency of viewing the control version of the website and the experimental version of the website.
- Use `pd.crosstab()`
- Assign the frequency table the name `group_freq`
- Assign the relative frequency table the name `group_pct`. Multiply by 100 to convert the proportions in the table to percents.
# YOUR CODE HERE
group_freq = ABtest['Group'].value_counts()
group_pct = ABtest['Group'].value_counts(normalize=True) * 100
# One possible pd.crosstab() alternative:
# group_freq = pd.crosstab(index=ABtest['Group'], columns='count')
# group_pct = pd.crosstab(index=ABtest['Group'], columns='count', normalize=True) * 100
print(group_freq)
print(group_pct)
Task 10 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 11 - Frequency and relative frequency
Calculate the frequency and relative frequency of converting to a paying customer.
- Use `pd.crosstab()`
- Assign the frequency table the name `pay_freq`
- Assign the relative frequency table the name `pay_pct`. Multiply by 100 to convert the proportions in the table to percents.
# YOUR CODE HERE
pay_freq = ABtest['Payment'].value_counts()
pay_pct = ABtest['Payment'].value_counts(normalize = True)*100
print(pay_freq)
print(pay_pct)
Task 11 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 12 - Joint distribution
Calculate the joint distribution of experimental condition and conversion to a paying customer.
- Use the experimental group as the index variable
- Name the results of the joint distribution `joint_dist`
# YOUR CODE HERE
joint_dist = pd.crosstab(index = ABtest['Group'], columns=ABtest['Payment'])
joint_dist
Task 12 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 13 - Marginal distribution
Add the table margins to the joint distribution of experimental condition and conversion to a paying customer.
- Use the experimental group as the index variable
- Name the results of the distribution `marginal_dist`
# YOUR CODE HERE
marginal_dist = pd.crosstab(index = ABtest['Group'], columns=ABtest['Payment'], margins=True)
marginal_dist
Task 13 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 14 - Conditional distribution
Calculate the distribution of payment conversion conditional on the text the individual saw when he or she was signing up for Udacity.
- Use the experimental group as the index variable
- Name the results of the distribution `conditional_dist` and make sure to multiply the result by 100
# YOUR CODE HERE
conditional_dist = pd.crosstab(index=ABtest["Group"], columns=ABtest["Payment"],normalize="index")*100
conditional_dist
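As a quick, purely illustrative check, each row of the conditional distribution should sum to 100 percent:
# Each row of the conditional distribution should sum to 100 (percent)
print(conditional_dist.sum(axis=1))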
Task 14 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 15 - Statistical distributions
Identify the appropriate statistical test to determine if there is an association between the screen that a potential student viewed as she was signing up for a course and whether or not he or she converted to a paying customer.
This task will not be autograded - but it is part of completing the challenge.
Task 15 ANSWER:
It is appropriate to use a chi-square test because we are testing for an association between two categorical variables: the screen viewed and whether the individual became a paying customer.
Task 16 - Hypothesis testing
Conduct the hypothesis test you identified in Task 15.
- Assign the p-value to the variable `p`
Hint: The `chi2_contingency()` function returns more than one value - make sure to read the documentation to assign the correct one to your p-value
from scipy.stats import chi2_contingency
# YOUR CODE HERE
# Chi-square test of independence on the Group x Payment contingency table
g, p, dof, expctd = chi2_contingency(pd.crosstab(index=ABtest["Group"], columns=ABtest["Payment"]))
print(p)
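Since `joint_dist` from Task 12 is the same contingency table, an equivalent alternative (assuming that cell has been run) is to pass it directly:
# Equivalent: reuse the joint distribution table from Task 12
g, p, dof, expctd = chi2_contingency(joint_dist)
print(p)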
Task 16 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 17 - Conclusions
Do you reject or fail to reject the null hypothesis at the 0.05 significance level?
This task will not be autograded - but it is part of completing the challenge.
Task 17 ANSWER:
Since 0.008 < 0.05, we reject the null hypothesis and conclude that there is a statistically significant association between the screen viewed and converting to a paying customer.
Task 18 - Visualization
Draw a side-by-side boxplot illustrating the conditional distribution of conversion by experimental group.
This task will not be autograded - but it is part of completing the challenge.
import matplotlib.pyplot as plt
import seaborn as sns
# YOUR CODE HERE
# Bar plot of the mean of Payment (i.e. the conversion rate) for each group
sns.barplot(x='Group', y='Payment', data=ABtest, ci=None);
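An alternative, illustrative view (not required) is a side-by-side count plot of Payment within each group:
# Alternative: side-by-side counts of Payment within each group
sns.countplot(x='Group', hue='Payment', data=ABtest);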
Task 19 - Bayesian and Frequentist Statistics
In a few sentences, describe the difference between Bayesian and Frequentist statistics.
This task will not be autograded - but it is part of completing the challenge.
Task 19 ANSWER:
Bayesian statistics incorporates prior beliefs in addition to the data, whereas frequentist statistics uses only the data as a source of information. Frequentists treat parameters as fixed quantities and define probability as a long-run frequency; Bayesians treat parameters as random variables and define probability as a degree of belief.
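As a small, purely hypothetical illustration of the difference for an estimated conversion rate (assuming a uniform Beta(1, 1) prior for the Bayesian estimate):
# Hypothetical example: 20 conversions out of 100 trials
conversions, trials = 20, 100
# Frequentist point estimate: the observed long-run frequency
freq_estimate = conversions / trials
# Bayesian posterior mean under a uniform Beta(1, 1) prior on the rate
prior_a, prior_b = 1, 1
bayes_estimate = (prior_a + conversions) / (prior_a + prior_b + trials)
print(freq_estimate, bayes_estimate)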