Lambda School Data Science - Unit 1 Sprint 3
Autograded Notebook (Canvas & CodeGrade)
This notebook will be automatically graded. It is designed to test your answers and award points for correct answers. Follow the instructions for each Task carefully.
Instructions
- Download this notebook as you would any other ipynb file
- Upload to Google Colab or work locally (if you have that set up)
- Delete `raise NotImplementedError()`
- Write your code in the `# YOUR CODE HERE` space
- Execute the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
- Save your notebook when you are finished
- Download as an `ipynb` file (if working in Colab)
- Upload your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)
Welcome to the final Sprint Challenge of Unit 1!
In this challenge, we're going to explore two different datasets where you can demonstrate your skills by fitting linear regression models and practicing some of the linear algebra concepts you learned.
Make sure to follow the instructions in each task carefully! The autograded tests are very specific: they are designed to check that you followed the exact instructions.
Good luck!
Part A: Linear Regression
Use the following information to complete Tasks 1 - 11
Dataset description
The data you will work on for this Sprint Challenge is from the World Happiness Report. The report compiles survey data from countries around the world and looks at factors such as economic production, social support, life expectancy, freedom, absence of corruption, and generosity to determine a happiness "score".
In this Sprint Challenge, we're only going to look at the report for years 2018 and 2019. We're going to see how much the happiness "score" depends on some of the factors listed above.
For more information about the data, you can look here: Kaggle: World Happiness Report
# URL provided
url = "https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Happy/happiness_years18_19.csv"
# YOUR CODE HERE
import pandas as pd
import numpy as np
happy = pd.read_csv(url, index_col='Overall_rank')
# Print out the DataFrame
happy.head()
Task 1 - Test
assert isinstance(happy, pd.DataFrame), 'Have you created a DataFrame named `happy`?'
assert happy.index.name == 'Overall_rank', "Your index should be 'Overall_rank'."
assert len(happy) == 312
Task 2 - Explore the data and find NaNs
Now you want to take a look at the dataset, determine the variable types of the columns, identify missing values, and generally better understand your data.
Your tasks
- Use `describe()` and `info()` to learn about any missing values, the data types, and descriptive statistics for each numeric column
- Sum the null values and assign that number to the variable `num_null`; the variable type should be a `numpy.int64` integer.

Hint: `.isnull().sum()` returns the number of null values in each column. You want the total number of null values in the entire DataFrame; one way to do this is to apply the `.sum()` method twice: `.isnull().sum().sum()`
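For example, here is a minimal sketch on a tiny, made-up DataFrame (the values are hypothetical and only illustrate the difference between the per-column counts and the grand total):
import pandas as pd
import numpy as np
# Toy DataFrame with one missing value in each column
toy = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, 5, 6]})
print(toy.isnull().sum())        # null count per column: a -> 1, b -> 1
print(toy.isnull().sum().sum())  # grand total for the whole DataFrame: 2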
# YOUR CODE HERE
happy.describe()
happy.info()
num_null = happy.isnull().sum().sum()
# Print out your integer result
print("The total number of null values is:", num_null)
Task 2 Test
import numpy as np
assert isinstance(num_null, np.int64), 'The sum of the NaN values should be an integer.'
Task 3 - Drop a column
As you noticed in the previous task, the column `Perceptions_corruption` has a lot of missing values. Let's determine how many are missing and then drop the column. Note: dropping a column isn't always the best choice when faced with missing values, but we're choosing that option here, partly for practice.
- Calculate the percentage of NaN values in `Perceptions_corruption` and assign the result to the variable `corruption_nan`; the value should be a float between `1.0` and `100.0`.
- Drop the `Perceptions_corruption` column from `happy` but keep the DataFrame name the same; use the parameter `inplace=True`. You will also want to specify the axis on which to operate.
# YOUR CODE HERE
corruption_nan = happy['Perceptions_corruption'].isnull().sum() / len(happy['Perceptions_corruption']) * 100
happy.drop(['Perceptions_corruption'], axis=1, inplace=True)
# Print the percentage of NaN values
print(corruption_nan)
# Print happy to verify the column was dropped
happy.head()
Task 3 Test
assert isinstance(corruption_nan, (float, np.floating)), 'The percentage of NaN values should be a float.'
assert corruption_nan >= 1, 'Make sure you calculated the percentage and not the decimal fraction.'
Task 4 - Visualize the dataset
Next, we'll create a visualization for this dataset. We know from the introduction that we're trying to predict the happiness score from the other factors. Before we do that, let's visualize the dataset using a seaborn `pairplot` to look at all of the columns plotted as "pairs".
Your tasks
- Use the seaborn library `sns.pairplot()` function to create your visualization (use the starter code provided)

This task will not be autograded - but it is part of completing the challenge.
# (NOT autograded but fill in your code!)
# Import seaborn
import seaborn as sns
# Use sns.pairplot(data) where data is the name of your DataFrame
# sns.pairplot()
# YOUR CODE HERE
sns.pairplot(happy)
Task 5 - Identify the dependent and independent variables
Before we fit a linear regression to the variables in this data set, we need to determine the dependent variable (the target or y variable) and independent variable (the feature or x variable). For this dataset, we have one dependent variable and a few choices for the independent variable(s). Using the information about the data set and what you know from previous tasks, complete the following:
- Assign the dependent variable to `y_var`
- Choose one independent variable and assign it to `x_var`
# YOUR CODE HERE
y_var = happy['Score']
x_var = happy['GDP_per_capita']
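One optional way to sanity-check your choice of independent variable is to look at how strongly each numeric column correlates with the happiness score. A small sketch (the `numeric_only` flag assumes a reasonably recent version of pandas):
# Correlation of each numeric column with 'Score', strongest first
print(happy.corr(numeric_only=True)['Score'].sort_values(ascending=False))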
Task 5 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 6 - Fit a line using seaborn
Before we fit the linear regression model, we'll check how well a line fits. Because you have some choices for which independent variable to select, we're going to complete the rest of our analysis using `GDP_per_capita` as the independent variable. We're using `Score` as the dependent (target) variable.
The seaborn `lmplot()` documentation can be found here. You can also use `regplot()`; its documentation is here.
This task will not be autograded - but it is part of completing the challenge!
Your tasks:
- Create a scatter plot using seaborn with `GDP_per_capita` and `Score`
- Use `sns.lmplot()` or `sns.regplot()` and specify a confidence interval of 0.95
- Answer the questions about your plot (not autograded).
# YOUR CODE HERE
sns.regplot(x="GDP_per_capita", y="Score", data=happy, ci=95);  # seaborn expresses the confidence interval as a percentage
Task 6 - Short answer
- Does it make sense to fit a linear model to these two variables? In other words, are there any problems with this data, like extreme outliers, non-linearity, etc.?
- Over what range of your independent variable does the linear model not fit the data well? Over what range does a line fit the data well?
- Yes, it makes sense to fit the linear model. There are relatively few outliers, though there are some at the very highest and lowest levels of GDP_per_capita.
- The fit is weakest right around 0 and above about 2 for GDP_per_capita, where the data isn't very close to the line of best fit. Between roughly 0.1 and 1.75, a line fits the data well.
Task 7 - Fit a linear regression model
Now it's time to fit the linear regression model! We have two variables (`GDP_per_capita` and `Score`) that we are going to use in our model.
Your tasks:
- Use the provided import for the `statsmodels.formula.api` library `ols` method
- Fit a single variable linear regression model and assign the model to the variable `model_1`
- Print out the model summary and assign the value of R-squared for this model to `r_square_model_1`. Your value should be defined to three decimal places (example: `r_square_model_1 = 0.123`)
- Answer the questions about your resulting model parameters (these short answer questions will not be autograded).

NOTE: For this task to be correctly autograded, you need to input the model parameters as specified in the code cell below. Part of this Sprint Challenge is correctly implementing the instructions in each task.
# Import the OLS model from statsmodels
from statsmodels.formula.api import ols
# YOUR CODE HERE
model_1 = ols('Score~GDP_per_capita', data=happy).fit()
# Print the model summary
print(model_1.summary())
r_square_model_1 = 0.637
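If you prefer not to transcribe the value from the printed summary, R-squared can also be read directly from the fitted results object; a short sketch using the statsmodels `rsquared` attribute:
# Equivalent: pull R-squared from the model and round to three decimal places
r_square_model_1 = round(model_1.rsquared, 3)
print(r_square_model_1)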
Task 7 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 8 - Interpret your model
Using the model summary you printed out above, answer the following questions.
- Assign the slope of `GDP_per_capita` to the variable `slope_model_1`; define it to two decimal places (example: 1.23). This variable should be a float.
- Assign the p-value for this model parameter to `pval_model_1`; this variable could be either an integer or a float.
- Assign the 95% confidence interval to the variables `ci_low` (lower value) and `ci_upper` (upper value); define them to two decimal places.
Answer the following questions (not autograded):
- Is the correlation between your variables positive or negative?
- How would you write the confidence interval for your slope coefficient?
- State the null hypothesis to test for a statistically significant relationship between your two variables.
- Using the p-value from your model, do you reject or fail to reject the null hypothesis?
- Positive correlation
- The 95% confidence interval for the slope is between 2.064 and 2.444
- Ho: $\beta_1$ = 0 , Ha: $\beta_1 \neq$ 0
- The p-value is essentially 0, which is less than 0.05, so we reject the null hypothesis
# YOUR CODE HERE
slope_model_1 = 2.25
pval_model_1 = 0.00
ci_low = 2.06
ci_upper = 2.44
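The hardcoded values above are read off the printed summary; they can also be pulled programmatically from the fitted model. A sketch using the statsmodels results attributes `params`, `pvalues`, and `conf_int()` (keep in mind the hidden tests may expect the rounded summary values, so treat this as a cross-check):
# Slope, p-value, and 95% confidence interval for GDP_per_capita, read from model_1
slope_model_1 = round(model_1.params['GDP_per_capita'], 2)
pval_model_1 = model_1.pvalues['GDP_per_capita']
ci = model_1.conf_int().loc['GDP_per_capita']   # columns 0 (lower) and 1 (upper)
ci_low, ci_upper = round(ci[0], 2), round(ci[1], 2)
print(slope_model_1, pval_model_1, ci_low, ci_upper)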
Task 8 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 9 - Fit a multiple predictor linear regression model
For this next task, we'll add in an additional independent or predictor variable. Let's look back at the pairplot and choose another variable - we'll use `Social_support`. Recall from the Guided Projects and Module Projects that we are looking to see if adding the variable `Social_support` is statistically significant after accounting for the `GDP_per_capita` variable.
We're going to fit a linear regression model using two predictor variables: `GDP_per_capita` and `Social_support`.
Your tasks:
- Fit a model with both predictor variables and assign the model to `model_2`. The format of the input to the model is `Y ~ X1 + X2`, where X1 = `GDP_per_capita` and X2 = `Social_support`.
- Print out the model summary and assign the value of R-squared for this model to `r_square_model_2`. Your value should be defined to three decimal places.
- Assign the value of the adjusted R-squared to `adj_r_square_model_2`. Your value should be defined to three decimal places.
- Answer the questions about your resulting model parameters (these short answer questions will not be autograded)
# YOUR CODE HERE
model_2 = ols('Score~GDP_per_capita+Social_support', data=happy).fit()
r_square_model_2 = 0.712
adj_r_square_model_2 = 0.710
# Print the model summary
print(model_2.summary())
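As in Task 7, both values can be read from the fitted results instead of the printed table; a short sketch using the `rsquared` and `rsquared_adj` attributes:
# Equivalent: R-squared and adjusted R-squared from model_2, to three decimal places
r_square_model_2 = round(model_2.rsquared, 3)
adj_r_square_model_2 = round(model_2.rsquared_adj, 3)
print(r_square_model_2, adj_r_square_model_2)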
Task 9 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 10 - Multiple regression model interpretation
Now that we have added an additional variable to our regression model, let's look at how the explained variance (the R-squared value) changes.
Your tasks
- Find the explained variance from `model_1` and assign it to the variable `r_square_percent1`; your variable should be expressed as a percentage and should be rounded to the nearest integer.
- Find the explained variance (adjusted!) from `model_2` and assign it to the variable `r_square_adj_percent2`; your variable should be expressed as a percentage and should be rounded to the nearest integer.
Question (not autograded):
How does the adjusted R-squared value change when a second predictor variable is added?
The adjusted R-squared value increases, so the model fits the data even better.
# YOUR CODE HERE
r_square_percent1 = 64
r_square_adj_percent2 = 71
print(r_square_percent1)
print(r_square_adj_percent2)
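If you would rather derive the percentages from the models than type them in by hand, a small equivalent sketch (the rounded results match the values above):
# Explained variance as whole-number percentages, computed from the fitted models
r_square_percent1 = round(model_1.rsquared * 100)          # ~64
r_square_adj_percent2 = round(model_2.rsquared_adj * 100)  # ~71
print(r_square_percent1, r_square_adj_percent2)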
Task 10 Test
assert r_square_percent1 >= 1, 'Make sure you use the percentage and not the decimal fraction.'
assert r_square_adj_percent2 >= 1, 'Make sure you use the percentage and not the decimal fraction.'
# Hidden tests - you will see the results when you submit to Canvas
Task 11 - Making a prediction and calculating the residual
We're going to use our model to make a prediction. Refer to the `happy` DataFrame and find the `GDP_per_capita` value for "Iceland" (index 4). Then, once we have a prediction, we can calculate the residual. There are actually two row entries for Iceland, each with slightly different column values. Use the column values that you can see when you print `happy.head()`.
Prediction
- Assign the `GDP_per_capita` value to the variable `x_iceland`; it should be a float and defined out to two decimal places.
- Using your slope and intercept values from `model_1`, calculate the `Score` for Iceland (`x_iceland`); assign this value to `predict_iceland` and it should be a float.
Residual
- Assign the observed `Score` for Iceland to the variable `observe_iceland`; it should be a float and defined out to two decimal places (careful with the rounding!).
- Determine the residual for the prediction you made and assign it to the variable `residual_iceland` (use your Guided Project or Module Project notebooks if you need a reminder of how to do a residual calculation).

Hint: Define your slope and intercept values out to two decimal places! Your resulting prediction for Iceland should have at least two decimal places. Make sure to use the parameters from the first model (`model_1`).
happy.head()
#print(model_1.params)
#model_1.params[0]
# YOUR CODE HERE
x_iceland = 1.34
predict_iceland = model_1.params['Intercept'] + model_1.params['GDP_per_capita']*x_iceland
observe_iceland = 7.50
residual_iceland = observe_iceland - predict_iceland
# View your prediction
print('Prediction for Iceland :', predict_iceland)
print('Residual for Iceland prediction :', residual_iceland)
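As an optional sanity check on the by-hand calculation, the fitted model can produce the same prediction through its `predict()` method (for a formula-based model this accepts a DataFrame containing the predictor column):
# Cross-check the manual slope/intercept prediction with model_1.predict()
check = model_1.predict(pd.DataFrame({'GDP_per_capita': [x_iceland]}))
print('Model predict() for Iceland:', check.iloc[0])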
Task 11 Test
assert residual_iceland >= 0, 'Check your residual calculation (use observed - predicted).'
assert round(x_iceland, 1) == 1.3, 'Check your Iceland GDP value.'
assert round(observe_iceland, 1) == 7.5, 'Check your Iceland observation value for "Score".'
# Hidden tests - you will see the results when you submit to Canvas
Part B: Vectors and cosine similarity
In this part of the challenge, we're going to look at how similar two vectors are. Remember, we can calculate the cosine similarity between two vectors by using this equation:
$$\cos \theta= \frac{\mathbf {A} \cdot \mathbf {B} }{\left\|\mathbf {A} \right\|\left\|\mathbf {B} \right\|}$$
$\qquad$
where
- The numerator is the dot product of the vectors $\mathbf {A}$ and $\mathbf {B}$
- The denominator is the norm of $\mathbf {A}$ times the norm of $\mathbf {B}$ (a small NumPy sketch of this calculation is given below)
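The formula maps directly onto NumPy's dot product and vector norm; here is a minimal sketch of a reusable helper (the function name `cosine_similarity` is just illustrative):
import numpy as np
def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Quick check: parallel vectors give 1.0, orthogonal vectors give 0.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0
print(cosine_similarity([1, 0], [0, 1]))        # 0.0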
Three documents, two authors
For this task, you will calculate the cosine similarity between three vectors. But here's the interesting part: each vector represents a "chunk" of text from a novel (a few chapters of text). This text was cleaned to remove non-alphanumeric characters and numbers and then each document was transformed into a vector representation as described below.
Document vectors
In the dataset you are going to load below, each row represents a word that occurs in at least one of the documents, so the rows cover all of the words that appear in our three documents.
Each column represents a document (doc0, doc1, doc2). Now the fun part: the value in each cell is how frequently that word (row) occurs in that document (term-frequency) divided by how many documents that words appears in (document-frequency).
cell value = term_frequency / document_frequency
Use the above information to complete the remaining tasks.
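To make the cell values concrete, here is a tiny, made-up example (the counts are hypothetical, not taken from the real documents): suppose a word appears 0 times in doc0, 4 times in doc1, and 6 times in doc2. It appears in two documents, so its row would hold 0.0, 2.0, and 3.0.
# Hypothetical raw counts of one word in each of the three documents
import numpy as np
term_frequency = np.array([0, 4, 6])               # doc0, doc1, doc2
document_frequency = (term_frequency > 0).sum()    # the word appears in 2 documents
print(term_frequency / document_frequency)         # [0. 2. 3.]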
Task 12 - Explore the text documents
You will be using cosine similarity to compare each document vector to the others. Remember that there are three documents, but two authors. Your task is to use the cosine similarity calculations to determine which two document vectors are most similar (written by the same author).
Your tasks:
- Load in the CSV file that contains the document vectors (this is coded for you - just run the cell)
- Look at the DataFrame you just loaded in any way that helps you understand the format, what's included in the data, the shape of the DataFrame, etc.
You can use document vectors just as they are - you don't need to code anything for Task 12.
# Load the data - DON'T DELETE THIS CELL
url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_4/unit1_nlp/text_vectors.csv'
text = pd.read_csv(url)
## Explore the data
# (this part is not autograded)
text.head()
Task 13 - Calculate cosine similarity
Calculate the cosine similarity for three pairs of vectors and assign the results to the following variables (each variable will be a float):
- assign the cosine similarity of doc0-doc1 to `cosine_doc0_1`
- assign the cosine similarity of doc0-doc2 to `cosine_doc0_2`
- assign the cosine similarity of doc1-doc2 to `cosine_doc1_2`

Print out the results so you can refer to them for the short answer section.
- Answer the questions after you have completed the cosine similarity calculations.
# Use these imports for your cosine calculations (DON'T DELETE)
from numpy import dot
from numpy.linalg import norm
# YOUR CODE HERE
cosine_doc0_1 = dot(text['doc0'], text['doc1'])/(norm(text['doc0'])*norm(text['doc1']))
cosine_doc0_2 = dot(text['doc0'], text['doc2'])/(norm(text['doc0'])*norm(text['doc2']))
cosine_doc1_2 = dot(text['doc1'], text['doc2'])/(norm(text['doc1'])*norm(text['doc2']))
# Print out the results
print('Cosine similarity for doc0-doc1:', cosine_doc0_1)
print('Cosine similarity for doc0-doc2:', cosine_doc0_2)
print('Cosine similarity for doc1-doc2:', cosine_doc1_2)
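Since the same calculation is repeated for each pair, it can also be written as a loop over document pairs; this sketch is functionally equivalent to the three lines above:
from itertools import combinations
from numpy import dot
from numpy.linalg import norm
# Cosine similarity for every pair of document columns
for col_a, col_b in combinations(['doc0', 'doc1', 'doc2'], 2):
    sim = dot(text[col_a], text[col_b]) / (norm(text[col_a]) * norm(text[col_b]))
    print(f'Cosine similarity for {col_a}-{col_b}:', sim)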
Task 13 - Short answer
- Using your cosine similarity calculations, which two documents are most similar?
- If doc1 and doc2 were written by the same author, are your cosine similarity calculations consistent with this statement?
- What process would we need to follow to add an additional document column? In other words, why can't we just stick another column with (term-frequency/document-frequency) values onto our current DataFrame `text`?
- Documents 1 and 2 are the most similar
- Yes, they would be consistent because 1 and 2 had the greatest similarity of the three.
- We would have to recompute every cell, because adding another document can change each word's document frequency, which in turn changes the term-frequency/document-frequency value in the existing columns.
Task 13 Test
# Hidden tests - you will see the results when you submit to Canvas
Additional Information about the texts used in this analysis:
You can find the raw text here. Document 0 (doc0) is chapters 1-3 from "Pride and Prejudice" by Jane Austen. Document 1 (doc1) is chapters 1-4 from "Frankenstein" by Mary Shelley. Document 2 (doc2) is also from "Frankenstein", chapters 11-14.