Lambda School Data Science - Unit 1 Sprint 3
Autograded Notebook (Canvas & CodeGrade)
This notebook will be automatically graded. It is designed to test your answers and award points for correct answers. Follow the instructions for each Task carefully.
Instructions
- Download this notebook as you would any other ipynb file
- Upload to Google Colab or work locally (if you have that set up)
- Delete `raise NotImplementedError()`
- Write your code in the `# YOUR CODE HERE` space
- Execute the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
- Save your notebook when you are finished
- Download as an `ipynb` file (if working in Colab)
- Upload your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)
Welcome to the final Sprint Challenge of Unit 1!
In this challenge, we're going to explore two different datasets where you can demonstrate your skills by fitting linear regression models and practicing some of the linear algebra concepts you learned.
Make sure to follow the instructions in each task carefully! The autograded tests are very specific: they are designed to check that you followed the exact instructions.
Good luck!
Part A: Linear Regression
Use the following information to complete Tasks 1 - 11
Dataset description
The data you will work on for this Sprint Challenge is from the World Happiness Report. The report compiles survey data from countries around the world and looks at factors such as economic production, social support, life expectancy, freedom, absence of corruption, and generosity to determine a happiness "score".
In this Sprint Challenge, we're only going to look at the report for years 2018 and 2019. We're going to see how much the happiness "score" depends on some of the factors listed above.
For more information about the data, you can look here: Kaggle: World Happiness Report
# URL provided
url = "https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Happy/happiness_years18_19.csv"
# YOUR CODE HERE
import pandas as pd
import numpy as np
happy = pd.read_csv(url, index_col='Overall_rank')
# Print out the DataFrame
happy.head()
Task 1 - Test
assert isinstance(happy, pd.DataFrame), 'Have you created a DataFrame named `happy`?'
assert happy.index.name == 'Overall_rank', "Your index should be 'Overall_rank'."
assert len(happy) == 312
Task 2 - Explore the data and find NaNs
Now you want to take a look at the dataset, determine the variable types of the columns, identify missing values, and generally better understand your data.
Your tasks
- Use `describe()` and `info()` to learn about any missing values, the data types, and descriptive statistics for each numeric column
- Sum the null values and assign that number to the variable `num_null`; the variable type should be a `numpy.int64` integer.

Hint: `.isnull().sum()` returns the number of null values in each column. You want the total number of null values in the entire DataFrame; one way to do this is to apply the `.sum()` method twice: `.isnull().sum().sum()`
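For example, here is a minimal sketch on a tiny, made-up DataFrame (the values are hypothetical and only illustrate the difference between the per-column counts and the grand total):
import pandas as pd
import numpy as np
# Toy DataFrame with one missing value in each column
toy = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, 5, 6]})
print(toy.isnull().sum())        # null count per column: a -> 1, b -> 1
print(toy.isnull().sum().sum())  # grand total for the whole DataFrame: 2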
# YOUR CODE HERE
happy.describe()
happy.info()
num_null = happy.isnull().sum().sum()
# Print out your integer result
print("The total number of null values is:", num_null)
Task 2 Test
import numpy as np
assert isinstance(num_null, np.int64), 'The sum of the NaN values should be an integer.'
Task 3 - Drop a column
As you noticed in the previous task, the column `Perceptions_corruption` has a lot of missing values. Let's determine how many are missing and then drop the column. Note: dropping a column isn't always the best choice when faced with missing values, but we're choosing that option here, partly for practice.
- Calculate the percentage of NaN values in `Perceptions_corruption` and assign the result to the variable `corruption_nan`; the value should be a float between `1.0` and `100.0`.
- Drop the `Perceptions_corruption` column from `happy` but keep the DataFrame name the same; use the parameter `inplace=True`. You will also want to specify the axis on which to operate.
# YOUR CODE HERE
corruption_nan = happy['Perceptions_corruption'].isnull().sum() / len(happy['Perceptions_corruption']) * 100
happy.drop(['Perceptions_corruption'], axis=1, inplace=True)
# Print the percentage of NaN values
print(corruption_nan)
# Print happy to verify the column was dropped
happy.head()
Task 3 Test
assert isinstance(corruption_nan, (float, np.floating)), 'The percentage of NaN values should be a float.'
assert corruption_nan >= 1, 'Make sure you calculated the percentage and not the decimal fraction.'
Task 4 - Visualize the dataset
Next, we'll create a visualization for this dataset. We know from the introduction that we're trying to predict the happiness score from the other factors. Before we do that, let's visualize the dataset using a seaborn `pairplot` to look at all of the columns plotted as "pairs".
Your tasks
- Use the seaborn library `sns.pairplot()` function to create your visualization (use the starter code provided)

This task will not be autograded - but it is part of completing the challenge.
# (NOT autograded but fill in your code!)
# Import seaborn
import seaborn as sns
# Use sns.pairplot(data) where data is the name of your DataFrame
# sns.pairplot()
# YOUR CODE HERE
sns.pairplot(happy)
Task 5 - Identify the dependent and independent variables
Before we fit a linear regression to the variables in this data set, we need to determine the dependent variable (the target or y variable) and independent variable (the feature or x variable). For this dataset, we have one dependent variable and a few choices for the independent variable(s). Using the information about the data set and what you know from previous tasks, complete the following:
- Assign the dependent variable to `y_var`
- Choose one independent variable and assign it to `x_var`
# YOUR CODE HERE
y_var = happy['Score']
x_var = happy['GDP_per_capita']
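One optional way to sanity-check your choice of independent variable is to look at how strongly each numeric column correlates with the happiness score. A small sketch (the `numeric_only` flag assumes a reasonably recent version of pandas):
# Correlation of each numeric column with 'Score', strongest first
print(happy.corr(numeric_only=True)['Score'].sort_values(ascending=False))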
Task 5 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 6 - Fit a line using seaborn
Before we fit the linear regression model, we'll check how well a line fits. Because you have some choices for which independent variable to select, we're going to complete the rest of our analysis using `GDP_per_capita` as the independent variable. We're using `Score` as the dependent (target) variable.
The seaborn `lmplot()` documentation can be found here. You can also use `regplot()`; its documentation is here.
This task will not be autograded - but it is part of completing the challenge!
Your tasks:
- Create a scatter plot using seaborn with `GDP_per_capita` and `Score`
- Use `sns.lmplot()` or `sns.regplot()` and specify a confidence interval of 0.95
- Answer the questions about your plot (not autograded).
# YOUR CODE HERE
sns.regplot(x="GDP_per_capita", y="Score", data=happy, ci=95);  # seaborn expresses the confidence interval as a percentage
Task 6 - Short answer
- Does it make sense to fit a linear model to these two variables? In other words, are there any problems with this data, like extreme outliers, non-linearity, etc.?
- Over what range of your independent variable does the linear model not fit the data well? Over what range does a line fit the data well?
- Yes, it makes sense to fit the linear model. There are relatively few outliers, though there are some at the very highest and lowest levels of GDP_per_capita.
- The fit is weakest right around 0 and above about 2 for GDP_per_capita, where the data isn't very close to the line of best fit. Between roughly 0.1 and 1.75, a line fits the data well.
Task 7 - Fit a linear regression model
Now it's time to fit the linear regression model! We have two variables (`GDP_per_capita` and `Score`) that we are going to use in our model.
Your tasks:
- Use the provided import for the `statsmodels.formula.api` library `ols` method
- Fit a single variable linear regression model and assign the model to the variable `model_1`
- Print out the model summary and assign the value of R-squared for this model to `r_square_model_1`. Your value should be defined to three decimal places (example: `r_square_model_1 = 0.123`)
- Answer the questions about your resulting model parameters (these short answer questions will not be autograded).

NOTE: For this task to be correctly autograded, you need to input the model parameters as specified in the code cell below. Part of this Sprint Challenge is correctly implementing the instructions in each task.
# Import the OLS model from statsmodels
from statsmodels.formula.api import ols
# YOUR CODE HERE
model_1 = ols('Score~GDP_per_capita', data=happy).fit()
# Print the model summary
print(model_1.summary())
r_square_model_1 = 0.637
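If you prefer not to transcribe the value from the printed summary, R-squared can also be read directly from the fitted results object; a short sketch using the statsmodels `rsquared` attribute:
# Equivalent: pull R-squared from the model and round to three decimal places
r_square_model_1 = round(model_1.rsquared, 3)
print(r_square_model_1)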
Task 7 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 8 - Interpret your model
Using the model summary you printed out above, answer the following questions.
- Assign the slope of `GDP_per_capita` to the variable `slope_model_1`; define it to two decimal places (example: 1.23). This variable should be a float.
- Assign the p-value for this model parameter to `pval_model_1`; this variable could be either an integer or a float.
- Assign the 95% confidence interval to the variables `ci_low` (lower value) and `ci_upper` (upper value); define them to two decimal places.
Answer the following questions (not autograded):
- Is the correlation between your variables positive or negative?
- How would you write the confidence interval for your slope coefficient?
- State the null hypothesis to test for a statistically significant relationship between your two variables.
- Using the p-value from your model, do you reject or fail to reject the null hypothesis?
- Positive correlation
- The 95% confidence interval for the slope is between 2.064 and 2.444
- Ho: $\beta_1$ = 0 , Ha: $\beta_1 \neq$ 0
- The p-value is essentially 0, which is less than 0.05, so we reject the null hypothesis
# YOUR CODE HERE
slope_model_1 = 2.25
pval_model_1 = 0.00
ci_low = 2.06
ci_upper = 2.44
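The hardcoded values above are read off the printed summary; they can also be pulled programmatically from the fitted model. A sketch using the statsmodels results attributes `params`, `pvalues`, and `conf_int()` (keep in mind the hidden tests may expect the rounded summary values, so treat this as a cross-check):
# Slope, p-value, and 95% confidence interval for GDP_per_capita, read from model_1
slope_model_1 = round(model_1.params['GDP_per_capita'], 2)
pval_model_1 = model_1.pvalues['GDP_per_capita']
ci = model_1.conf_int().loc['GDP_per_capita']   # columns 0 (lower) and 1 (upper)
ci_low, ci_upper = round(ci[0], 2), round(ci[1], 2)
print(slope_model_1, pval_model_1, ci_low, ci_upper)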
Task 8 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 9 - Fit a multiple predictor linear regression model
For this next task, we'll add in an additional independent or predictor variable. Let's look back at the pairplot and choose another variable - we'll use `Social_support`. Recall from the Guided Projects and Module Projects that we are looking to see if adding the variable `Social_support` is statistically significant after accounting for the `GDP_per_capita` variable.
We're going to fit a linear regression model using two predictor variables: `GDP_per_capita` and `Social_support`.
Your tasks:
- Fit a model with both predictor variables and assign the model to `model_2`. The format of the input to the model is `Y ~ X1 + X2`, where X1 = `GDP_per_capita` and X2 = `Social_support`.
- Print out the model summary and assign the value of R-squared for this model to `r_square_model_2`. Your value should be defined to three decimal places.
- Assign the value of the adjusted R-squared to `adj_r_square_model_2`. Your value should be defined to three decimal places.
- Answer the questions about your resulting model parameters (these short answer questions will not be autograded)
# YOUR CODE HERE
model_2 = ols('Score~GDP_per_capita+Social_support', data=happy).fit()
r_square_model_2 = 0.712
adj_r_square_model_2 = 0.710
# Print the model summary
print(model_2.summary())
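As in Task 7, both values can be read from the fitted results instead of the printed table; a short sketch using the `rsquared` and `rsquared_adj` attributes:
# Equivalent: R-squared and adjusted R-squared from model_2, to three decimal places
r_square_model_2 = round(model_2.rsquared, 3)
adj_r_square_model_2 = round(model_2.rsquared_adj, 3)
print(r_square_model_2, adj_r_square_model_2)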
Task 9 Test
# Hidden tests - you will see the results when you submit to Canvas
Task 10 - Multiple regression model interpretation
Now that we have added an additional variable to our regression model, let's look at how the explained variance (the R-squared value) changes.
Your tasks
- Find the explained variance from `model_1` and assign it to the variable `r_square_percent1`; your variable should be expressed as a percentage and should be rounded to the nearest integer.
- Find the explained variance (adjusted!) from `model_2` and assign it to the variable `r_square_adj_percent2`; your variable should be expressed as a percentage and should be rounded to the nearest integer.
Question (not autograded):
How does the adjusted R-squared value change when a second predictor variable is added?
The adjusted R-squared value increases, so the model fits the data even better.
# YOUR CODE HERE
r_square_percent1 = 64
r_square_adj_percent2 = 71
print(r_square_percent1)
print(r_square_adj_percent2)
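If you would rather derive the percentages from the models than type them in by hand, a small equivalent sketch (the rounded results match the values above):
# Explained variance as whole-number percentages, computed from the fitted models
r_square_percent1 = round(model_1.rsquared * 100)          # ~64
r_square_adj_percent2 = round(model_2.rsquared_adj * 100)  # ~71
print(r_square_percent1, r_square_adj_percent2)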
Task 10 Test
assert r_square_percent1 >= 1, 'Make sure you use the percentage and not the decimal fraction.'
assert r_square_adj_percent2 >= 1, 'Make sure you use the percentage and not the decimal fraction.'
# Hidden tests - you will see the results when you submit to Canvas
Task 11 - Making a prediction and calculating the residual
We're going to use our model to make a prediction. Refer to the `happy` DataFrame and find the `GDP_per_capita` value for "Iceland" (index 4). Then, once we have a prediction, we can calculate the residual. There are actually two row entries for Iceland, each with slightly different column values. Use the column values that you can see when you print `happy.head()`.
Prediction
- Assign the `GDP_per_capita` value to the variable `x_iceland`; it should be a float and defined out to two decimal places.
- Using your slope and intercept values from `model_1`, calculate the `Score` for Iceland (`x_iceland`); assign this value to `predict_iceland` and it should be a float.
Residual
- Assign the observed `Score` for Iceland to the variable `observe_iceland`; it should be a float and defined out to two decimal places (careful with the rounding!).
- Determine the residual for the prediction you made and assign it to the variable `residual_iceland` (use your Guided Project or Module Project notebooks if you need a reminder of how to do a residual calculation).

Hint: Define your slope and intercept values out to two decimal places! Your resulting prediction for Iceland should have at least two decimal places. Make sure to use the parameters from the first model (`model_1`).
happy.head()
#print(model_1.params)
#model_1.params[0]
# YOUR CODE HERE
x_iceland = 1.34
predict_iceland = model_1.params['Intercept'] + model_1.params['GDP_per_capita']*x_iceland
observe_iceland = 7.50
residual_iceland = observe_iceland - predict_iceland
# View your prediction
print('Prediction for Iceland :', predict_iceland)
print('Residual for Iceland prediction :', residual_iceland)
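As an optional sanity check on the by-hand calculation, the fitted model can produce the same prediction through its `predict()` method (for a formula-based model this accepts a DataFrame containing the predictor column):
# Cross-check the manual slope/intercept prediction with model_1.predict()
check = model_1.predict(pd.DataFrame({'GDP_per_capita': [x_iceland]}))
print('Model predict() for Iceland:', check.iloc[0])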
Task 11 Test
assert residual_iceland >= 0, 'Check your residual calculation (use observed - predicted).'
assert round(x_iceland, 1) == 1.3, 'Check your Iceland GDP value.'
assert round(observe_iceland, 1) == 7.5, 'Check your Iceland observation value for "Score".'
# Hidden tests - you will see the results when you submit to Canvas
Part B: Vectors and cosine similarity
In this part of the challenge, we're going to look at how similar two vectors are. Remember, we can calculate the cosine similarity between two vectors by using this equation:
$$\cos \theta= \frac{\mathbf {A} \cdot \mathbf {B} }{\left\|\mathbf {A} \right\|\left\|\mathbf {B} \right\|}$$
$\qquad$
where
- The numerator is the dot product of the vectors $\mathbf {A}$ and $\mathbf {B}$
- The denominator is the norm of $\mathbf {A}$ times the norm of $\mathbf {B}$ (a small NumPy sketch of this calculation is given below)
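The formula maps directly onto NumPy's dot product and vector norm; here is a minimal sketch of a reusable helper (the function name `cosine_similarity` is just illustrative):
import numpy as np
def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Quick check: parallel vectors give 1.0, orthogonal vectors give 0.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0
print(cosine_similarity([1, 0], [0, 1]))        # 0.0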
Three documents, two authors
For this task, you will calculate the cosine similarity between three vectors. But here's the interesting part: each vector represents a "chunk" of text from a novel (a few chapters of text). This text was cleaned to remove non-alphanumeric characters and numbers and then each document was transformed into a vector representation as described below.
Document vectors
In the dataset you are going to load below, each row represents a word that occurs in at least one of the documents, so the rows cover all of the words that appear in our three documents.
Each column represents a document (doc0, doc1, doc2). Now the fun part: the value in each cell is how frequently that word (row) occurs in that document (term-frequency) divided by how many documents that words appears in (document-frequency).
cell value = term_frequency / document_frequency
Use the above information to complete the remaining tasks.
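To make the cell values concrete, here is a tiny, made-up example (the counts are hypothetical, not taken from the real documents): suppose a word appears 0 times in doc0, 4 times in doc1, and 6 times in doc2. It appears in two documents, so its row would hold 0.0, 2.0, and 3.0.
# Hypothetical raw counts of one word in each of the three documents
import numpy as np
term_frequency = np.array([0, 4, 6])               # doc0, doc1, doc2
document_frequency = (term_frequency > 0).sum()    # the word appears in 2 documents
print(term_frequency / document_frequency)         # [0. 2. 3.]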
Task 12 - Explore the text documents
You will be using cosine similarity to compare each document vector to the others. Remember that there are three documents, but two authors. Your task is to use the cosine similarity calculations to determine which two document vectors are most similar (written by the same author).
Your tasks:
- Load in the CSV file that contains the document vectors (this is coded for you - just run the cell)
- Look at the DataFrame you just loaded in any way that helps you understand the format, what's included in the data, the shape of the DataFrame, etc.
You can use document vectors just as they are - you don't need to code anything for Task 12.
# Load the data - DON'T DELETE THIS CELL
url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_4/unit1_nlp/text_vectors.csv'
text = pd.read_csv(url)
## Explore the data
# (this part is not autograded)
text.head()
Task 13 - Calculate cosine similarity
Calculate the cosine similarity for three pairs of vectors and assign the results to the following variables (each variable will be a float):
- assign the cosine similarity of doc0-doc1 to `cosine_doc0_1`
- assign the cosine similarity of doc0-doc2 to `cosine_doc0_2`
- assign the cosine similarity of doc1-doc2 to `cosine_doc1_2`

Print out the results so you can refer to them for the short answer section.
- Answer the questions after you have completed the cosine similarity calculations.
# Use these imports for your cosine calculations (DON'T DELETE)
from numpy import dot
from numpy.linalg import norm
# YOUR CODE HERE
cosine_doc0_1 = dot(text['doc0'], text['doc1'])/(norm(text['doc0'])*norm(text['doc1']))
cosine_doc0_2 = dot(text['doc0'], text['doc2'])/(norm(text['doc0'])*norm(text['doc2']))
cosine_doc1_2 = dot(text['doc1'], text['doc2'])/(norm(text['doc1'])*norm(text['doc2']))
# Print out the results
print('Cosine similarity for doc0-doc1:', cosine_doc0_1)
print('Cosine similarity for doc0-doc2:', cosine_doc0_2)
print('Cosine similarity for doc1-doc2:', cosine_doc1_2)
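Since the same calculation is repeated for each pair, it can also be written as a loop over document pairs; this sketch is functionally equivalent to the three lines above:
from itertools import combinations
from numpy import dot
from numpy.linalg import norm
# Cosine similarity for every pair of document columns
for col_a, col_b in combinations(['doc0', 'doc1', 'doc2'], 2):
    sim = dot(text[col_a], text[col_b]) / (norm(text[col_a]) * norm(text[col_b]))
    print(f'Cosine similarity for {col_a}-{col_b}:', sim)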
Task 13 - Short answer
- Using your cosine similarity calculations, which two documents are most similar?
- If doc1 and doc2 were written by the same author, are your cosine similarity calculations consistent with this statement?
- What process would we need to follow to add an additional document column? In other words, why can't we just stick another column with (term-frequency/document-frequency) values onto our current DataFrame `text`?
- Documents 1 and 2 are the most similar
- Yes, they would be consistent because 1 and 2 had the greatest similarity of the three.
- We would have to recompute every cell, because adding another document can change each word's document frequency, which in turn changes the term-frequency/document-frequency value in the existing columns.
Task 13 Test
# Hidden tests - you will see the results when you submit to Canvas
Additional Information about the texts used in this analysis:
You can find the raw text here. Document 0 (doc0) is chapters 1-3 from "Pride and Prejudice" by Jane Austen. Document 1 (doc1) is chapters 1-4 from "Frankenstein" by Mary Shelley. Document 2 (doc2) is also from "Frankenstein", chapters 11-14.