Lambda School Data Science - Unit 1 Sprint 2
Autograded Notebook (Canvas & CodeGrade)
This notebook will be automatically graded. It is designed to test your answers and award points for the correct answers. Follow the instructions for each Task carefully.
Instructions
- Download this notebook as you would any other ipynb file
- Upload to Google Colab or work locally (if you have that set-up)
- Delete `raise NotImplementedError()`
- Write your code in the `# YOUR CODE HERE` space
- Execute the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
- Save your notebook when you are finished
- Download as an `ipynb` file (if working in Colab)
- Upload your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)
Part A: Statistical Analysis
Use the following information to complete tasks 1 - 8
Dataset description:
Anyone who is a fan of detective TV shows has watched a scene where human remains are discovered and some sort of expert is called in to determine when the person died. But is this science fiction or science fact? Is it possible to use evidence from skeletal remains to determine how long a body has been buried (a decent approximation of how long the person has been dead)?
Researchers sampled long bone material from bodies exhumed from coffin burials in two cemeteries in England. In each case, date of death and burial (and therefore interment time) was known. This data is given in the Longbones.csv dataset, which you can find here.
What can we learn about the bodies that were buried in the cemetery?
The variable names are:
- Site = Site ID, either Site 1 or Site 2
- Time = Interment time in years
- Depth = Burial depth in ft.
- Lime = Burial with quicklime (0 = No, 1 = Yes)
- Age = Age at time of death in years
- Nitro = Nitrogen composition of the long bones in g per 100g of bone.
- Oil = Oil contamination of the grave site (0 = No contamination, 1 = Oil contamination)
Source: D.R. Jarvis (1997). "Nitrogen Levels in Long Bones from Coffin Burials Interred for Periods of 26-90 Years," Forensic Science International, Vol. 85, pp. 199-208.
Task 1 - Load the data
As usual, let's begin by loading the data! The URL has been provided.
- Load your CSV file into a DataFrame named `df`
import pandas as pd
import numpy as np
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Longbones/Longbones.csv'
# YOUR CODE HERE
df = pd.read_csv(data_url)
# Print out your DataFrame
df.head()
Task 1 - Test
assert isinstance(df, pd.DataFrame), 'Have you created a DataFrame named `df`?'
assert len(df) == 42
Task 2 - Missing data
Now, let's determine if there is any missing data in the dataset. If there is, drop the row that contains a missing value.
- Check for missing/null values and assign the sum to `num_null`
- The result should be the sum of all the null values, as a single integer (Hint: you will compute the sum of a sum)
- If there are null values, drop them in place (your DataFrame should still be `df`)
Hint: If you need to go back and update your DataFrame, read in the data again before calculating the null values
# Hint: Make sure to read in the data again if you re-do your null calculation
# YOUR CODE HERE
# Sum the nulls in each column, then sum again to get a single integer
num_null = df.isnull().sum().sum()
# Drop any rows with missing values, keeping the DataFrame named df
df.dropna(inplace=True)
print(num_null)
Task 2 - Test
# Hidden tests - you will see the results when you submit to Canvas
Use the following information to complete tasks 3 - 8
The mean nitrogen composition in living individuals is 4.3g per 100g of bone.
We wish to use the Longbones sample to test the null hypothesis that the mean nitrogen composition per 100g of bone in the deceased is 4.3g (equal to that of living humans) vs the alternative hypothesis that the mean nitrogen composition per 100g of bone in the deceased is not 4.3g (not equal to that of living humans).
Task 3 - Statistical hypotheses
Write the null and alternative hypotheses described above.
This task will not be autograded - but it is part of completing the challenge.
Task 3 ANSWER:
$H_0: \mu =$ 4.3g
$H_a: \mu \neq$ 4.3g
Task 4 - Statistical distributions
What is the appropriate test for these hypotheses? A t-test or a chi-square test? Explain your answer in a sentence or two.
This task will not be autograded - but it is part of completing the challenge.
Task 4 ANSWER:
We want a one-sample t-test because we are comparing a sample mean to a known value, 4.3 g per 100 g of bone. A chi-square test checks for an association between two categorical variables, which is not what we have here.
Task 5 - Hypothesis testing
Use a built-in Python function to conduct the statistical test you identified earlier. The scipy stats module has been imported.
- Assign the t statistic to the variable `t`
- Assign the p-value to the variable `p`
Hint: Review the documentation to verify what it returns. You can assign the two variables in one step or two steps.
# Use this import for your calculation
from scipy import stats
# YOUR CODE HERE
# One-sample t-test of the mean Nitro value against 4.3
t, p = stats.ttest_1samp(df['Nitro'], 4.3)
print(t)
print(p)
Task 5 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 6 - Conclusion
What is the p-value for this hypothesis test? Do you reject or fail to reject the null hypothesis at the 0.05 level?
This task will not be autograded - but it is part of the project!
Task 6 ANSWER:
The p-value is approximately 8.1 × 10^-18. Since that is less than the 0.05 significance level, we reject the null hypothesis and conclude that the mean nitrogen level per 100 g of bone in the deceased is not 4.3 g.
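As a quick illustration (not required for the task), the decision rule can be written out in code, assuming `p` from Task 5 is still in scope:
# Illustrative decision rule (assumes p from Task 5 is defined)
alpha = 0.05
if p < alpha:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')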
Task 7 - Confidence Interval
Calculate a 95% confidence interval for the mean nitrogen composition in the longbones of a deceased individual using the t.interval function.
- Assign the lower end of the confidence interval to the variable `l`
- Assign the upper end of the confidence interval to the variable `u`
Hint: You will need to calculate other statistics to complete the confidence interval calculation. These variables can be named whatever you like - just make sure to name your confidence interval variables as specified above.
# Use this import for your calculation
from scipy.stats import t
# YOUR CODE HERE
mean = df['Nitro'].mean()
sd = df['Nitro'].std()
n = df['Nitro'].count()
se = sd / (n ** 0.5)  # standard error of the mean
l, u = t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(l)
print(u)
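As an optional sanity check (assuming `df['Nitro']` has no remaining missing values after Task 2), `scipy.stats.sem` should give the same standard error computed manually above:
# Optional check: stats.sem should match the manual standard-error calculation
from scipy import stats
print(stats.sem(df['Nitro']))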
Task 7 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 8 - Conclusion
Write an interpretation of your 95% confidence interval.
This task will not be autograded - but it is part of completing the challenge.
Task 8 ANSWER:
We are 95% confident that the mean nitrogen level per 100 g of bone in the deceased is between 3.73 g and 3.86 g.
A/B Testing and Udacity
Udacity is an online learning platform geared toward tech professionals who want to develop skills in programming, data science, etc. These classes are intensive - both for the students and instructors - and the learning experience is best when students are able to dedicate enough time to the classes and there is not a lot of student churn.
Udacity wished to determine if presenting potential students with a screen that would remind them of the time commitment involved in taking a class would decrease the enrollment of students who were unlikely to succeed in the class.
At the time of the experiment, when a student selected a course, she was taken to the course overview page and presented with two options: "start free trial", and "access course materials".
If the student clicked "start free trial", she was asked to enter her credit card information and was enrolled in a free trial for the paid version of the course (which would convert to a paid membership after 14 days).
If the student clicked "access course materials", she could view the videos and take the quizzes for free but could not access all the features of the course such as coaching.
Here's the experiment: Udacity tested a change where if the student clicked "start free trial", she was asked how much time she had available to devote to the course.
If the student indicated 5 or more hours per week, she would be taken through the checkout process as usual. If she indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion and suggesting that the student might like to access the course materials for free.
At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead.
Now we wish to see if there was an association between the screen the potential student viewed and whether or not the student enrolled in the paid version of the course.
The Udacity data is linked below and is in a non-tidy format. We'll be focusing on the number of enrolling customers who convert to paying customers.
You don't need to do anything with the non-tidy data in this Challenge; we're sharing it here so you can get an idea of what data looks like before we clean it.
import pandas as pd
import numpy as np
# Load data
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Udacity%20AB%20testing%20data/AB%20testing%20data.csv'
ABtest_ = pd.read_csv(data_url)
print(ABtest_.shape)
ABtest_.head()
Now, here is the enrollment and payment data in tidy format. You can see how I set it up here.
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Udacity%20AB%20testing%20data/AB_test_payments.csv'
ABtest = pd.read_csv(data_url, skipinitialspace=True, header=0)
print(ABtest.shape)
ABtest.head()
Dataset information
The "tidy" data has the following values for the columns:
- Group = Control or Experimental depending on the screen viewed
- Payment = 0 if the individual did not enroll as a paying customer, 1 if the individual did enroll as a paying customer
Our goal is to determine if there is an association between the screen that a potential student viewed as she was signing up for a course and whether or not she converted to a paying customer.
Task 9 - Statistical hypotheses
Write the null and alternative hypothesis to test if there is an association between the screen that a potential student viewed as she was signing up for a course and whether or not he or she converted to a paying customer.
This task will not be autograded - but it is part of completing the challenge.
Task 9 ANSWER:
Ho: There is no relationship between the screen viewed and converting to a paying customer.
Ha: There is a relationship between the screen viewed and converting to a paying customer.
Task 10 - Frequency and relative frequency
Calculate the frequency and relative frequency of viewing the control version of the website and the experimental version of the website.
- Use `pd.crosstab()`
- Assign the frequency table the name `group_freq`
- Assign the relative frequency table the name `group_pct`. Multiply by 100 to convert the proportions in the table to percents.
# YOUR CODE HERE
group_freq = ABtest['Group'].value_counts()
group_pct = ABtest['Group'].value_counts(normalize=True) * 100
# One possible pd.crosstab() alternative:
# group_freq = pd.crosstab(index=ABtest['Group'], columns='count')
# group_pct = pd.crosstab(index=ABtest['Group'], columns='count', normalize=True) * 100
print(group_freq)
print(group_pct)
Task 10 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 11 - Frequency and relative frequency
Calculate the frequency and relative frequency of converting to a paying customer.
- Use `pd.crosstab()`
- Assign the frequency table the name `pay_freq`
- Assign the relative frequency table the name `pay_pct`. Multiply by 100 to convert the proportions in the table to percents.
# YOUR CODE HERE
pay_freq = ABtest['Payment'].value_counts()
pay_pct = ABtest['Payment'].value_counts(normalize = True)*100
print(pay_freq)
print(pay_pct)
Task 11 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 12 - Joint distribution
Calculate the joint distribution of experimental condition and conversion to a paying customer.
- Use the experimental group as the index variable
- Name the results of the joint distribution `joint_dist`
# YOUR CODE HERE
joint_dist = pd.crosstab(index = ABtest['Group'], columns=ABtest['Payment'])
joint_dist
Task 12 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 13 - Marginal distribution
Add the table margins to the joint distribution of experimental condition and conversion to a paying customer.
- Use the experimental group as the index variable
- Name the results of the distribution `marginal_dist`
# YOUR CODE HERE
marginal_dist = pd.crosstab(index = ABtest['Group'], columns=ABtest['Payment'], margins=True)
marginal_dist
Task 13 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 14 - Conditional distribution
Calculate the distribution of payment conversion conditional on the text the individual saw when he or she was signing up for Udacity.
- Use the experimental group as the index variable
- Name the results of the distribution `conditional_dist` and make sure to multiply the result by 100
# YOUR CODE HERE
conditional_dist = pd.crosstab(index=ABtest["Group"], columns=ABtest["Payment"],normalize="index")*100
conditional_dist
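As a quick, purely illustrative check, each row of the conditional distribution should sum to 100 percent:
# Each row of the conditional distribution should sum to 100 (percent)
print(conditional_dist.sum(axis=1))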
Task 14 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 15 - Statistical distributions
Identify the appropriate statistical test to determine if there is an association between the screen that a potential student viewed as she was signing up for a course and whether or not he or she converted to a paying customer.
This task will not be autograded - but it is part of completing the challenge.
Task 15 ANSWER:
It is appropriate to use a chi-square test because we are testing for an association between two categorical variables: the screen viewed and whether the individual became a paying customer.
Task 16 - Hypothesis testing
Conduct the hypothesis test you identified in Task 15.
- Assign the p-value to the variable `p`
Hint: The `chi2_contingency()` function returns more than one value - make sure to read the documentation to assign the correct one to your p-value
from scipy.stats import chi2_contingency
# YOUR CODE HERE
# Chi-square test of independence on the Group x Payment contingency table
g, p, dof, expctd = chi2_contingency(pd.crosstab(index=ABtest["Group"], columns=ABtest["Payment"]))
print(p)
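Since `joint_dist` from Task 12 is the same contingency table, an equivalent alternative (assuming that cell has been run) is to pass it directly:
# Equivalent: reuse the joint distribution table from Task 12
g, p, dof, expctd = chi2_contingency(joint_dist)
print(p)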
Task 16 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 17 - Conclusions
Do you reject or fail to reject the null hypothesis at the 0.05 significance level?
This task will not be autograded - but it is part of completing the challenge.
Task 17 ANSWER:
Since 0.008 < 0.05, we reject the null hypothesis and conclude that there is a statistically significant association between the screen viewed and converting to a paying customer.
Task 18 - Visualization
Draw a side-by-side boxplot illustrating the conditional distribution of conversion by experimental group.
This task will not be autograded - but it is part of completing the challenge.
import matplotlib.pyplot as plt
import seaborn as sns
# YOUR CODE HERE
# Bar plot of the mean of Payment (i.e. the conversion rate) for each group
sns.barplot(x='Group', y='Payment', data=ABtest, ci=None);
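An alternative, illustrative view (not required) is a side-by-side count plot of Payment within each group:
# Alternative: side-by-side counts of Payment within each group
sns.countplot(x='Group', hue='Payment', data=ABtest);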
Task 19 - Bayesian and Frequentist Statistics
In a few sentences, describe the difference between Bayesian and Frequentist statistics.
This task will not be autograded - but it is part of completing the challenge.
Task 19 ANSWER:
Bayesian statistics incorporates prior beliefs in addition to the data, whereas frequentist statistics uses only the data as a source of information. Frequentists treat parameters as fixed quantities and define probability as a long-run frequency; Bayesians treat parameters as random variables and define probability as a degree of belief.
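As a small, purely hypothetical illustration of the difference for an estimated conversion rate (assuming a uniform Beta(1, 1) prior for the Bayesian estimate):
# Hypothetical example: 20 conversions out of 100 trials
conversions, trials = 20, 100
# Frequentist point estimate: the observed long-run frequency
freq_estimate = conversions / trials
# Bayesian posterior mean under a uniform Beta(1, 1) prior on the rate
prior_a, prior_b = 1, 1
bayes_estimate = (prior_a + conversions) / (prior_a + prior_b + trials)
print(freq_estimate, bayes_estimate)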