-
To be able to use probability density functions to calculate probability of specific values.
-
To identify normally distributed features.
-
To perform a hypothesis test to compare numeric data between 2 groups.
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
sns.set_context('talk')
mpl.rcParams['figure.figsize'] = [12,6]
Dataset: https://archive.ics.uci.edu/ml/datasets/student+performance
pd.set_option('display.max_columns',100)
## read in the Data/student/student-mat.csv (it uses ";" as the sep)
# display info and .head
## Calculate an Avg Grade column by averaging G1, G2,G3,
# then divide by 20, and * 100 (to make %'s')
## plot the distribution of Avg Grade
Is it normally distributed?
## use scipy's normaltest
- We have our p-value for our normaltest, but what does it mean??
- Check the docstring for the normaltest to find out the null hypothesis of the test.
## Get the mean, std, min, and max for the Avg Grade column
## generate a linearly-spaced array of values that span the min to the max
## use stats.norm.pdf to get the PDF curve that corresponds to your distribution's values
## Plot the histogram again AND then plot the pdf we calculated.
Looks pretty normal! But can we confirm for a fact that its normal?
## Plot the histogram again AND pdf again
## Add a vpsan to the plot showing the region we want to calc prob for
How can we calculate this probability? Can we use the PDF?
## try making a list of values from 90-100 and getting the pdf values
## Sum the values to get the total probability.
Whats the flaw to this approach?
## Use the cumulative density function to find prob of 90 OR lower.
Now, we want the opposite probability, probability of being GREATER Than 90.
# calc 1-prob of 90 or lower.
- Answer: there is a 2.4% chance of having a score greater than 90.
Q: Do students with internet access have different average grades than students who do not have internet access?
-
$H_0$ (Null Hypothesis): Students with internet access have the same average grades as students who do not. -
$H_A$ (Alternative Hypothesis): Students with internet access have significantly different average grades compared to students who do not.
- Visualize the histogram of Avg Grade again, but separate it into groups based on the "internet" column.
- Note: when comparing 2 groups with seaborn's histplot, you will want to add
common_norm=False
## visualize the histobram of Avg Grade again, but separate it by "internet"
## Plot a bar plot of the Avg Grade for students with internet vs those that do not have it
## Separate the 2 groups into 2 varaibles
-
Since we are comparing a numeric measurement between 2 groups, we want to run a 2-sample (AKA independent T-test).
-
The Assumptions are:
- No significant outliers
- Normality
- Equal Variance
## check yes group for outliers using z-score >3 rule.
## check no group for outliers using z-score >3 rule.
No outliers to worry about! Assumption met.
## use normaltest to check if yes group is normally distributed
## use normaltest to check if no group is normally distributed
- Did we meet the assumption of normality?
## use Levene's test to check if groups have equal variance
Did we meet the assumption of equal variance?
- Since we met all of the assumptions for the test we can proceed with our t-test.
- Next class we will discuss what we would do if we did NOT meet the assumptions.
## run stats.ttest_ind on the 2 groups
What is our p-value? Is it less than our alpha of .05? What does this mean?
Our T-Test returned a p-value of
____
. Since p</>
.05, wecan reject/fail to reject
the null hypothesis that students with internet access have the same average grades as students who do not.
We therefore conclude that there is/is not
a significant difference in Average Grades between students who do/do not have internet access.
Our visualization below shows that students with internet access have HIGHER/LOWER/EQUAL
average grades.
## Add a summary visual to support our results.