CodeAlong: From Distributions to Hypotheses

Learning Objectives

To be able to use probability density functions to calculate the probability of specific ranges of values.
To identify normally distributed features.
To perform a hypothesis test to compare numeric data between 2 groups.

import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.set_context('talk')
mpl.rcParams['figure.figsize'] = [12,6]

Exploring Distributions

Dataset: https://archive.ics.uci.edu/ml/datasets/student+performance

pd.set_option('display.max_columns',100)

## read in the Data/student/student-mat.csv (it uses ";" as the sep)

# display info and .head

## Calculate an Avg Grade column by averaging G1, G2, and G3,
# then divide by 20 and multiply by 100 (to convert to percentages)
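The calculation described above can be sketched as follows. The three-row DataFrame is a hypothetical stand-in for the student data (the real file would be loaded with `pd.read_csv` and `sep=";"`), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataset has G1, G2, G3 on a 0-20 scale.
df = pd.DataFrame({"G1": [10, 15, 20], "G2": [12, 15, 18], "G3": [14, 15, 16]})

# Row-wise mean of the three grades, rescaled from 0-20 to a percentage.
df["Avg Grade"] = df[["G1", "G2", "G3"]].mean(axis=1) / 20 * 100
print(df)
```

Using `mean(axis=1)` averages across the columns of each row, which is what produces one average per student.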

## plot the distribution of Avg Grade

Is it normally distributed?

## use scipy's normaltest

We have our p-value for our normaltest, but what does it mean?
- Check the docstring for the normaltest to find out the null hypothesis of the test.
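As a rough sketch of how the result can be read: scipy's docstring states that `normaltest` tests the null hypothesis that the sample comes from a normal distribution. The sample below is randomly generated (an assumption for illustration, not the student data), and the alpha of .05 is the threshold used later in this lesson.

```python
import numpy as np
from scipy import stats

# Hypothetical sample standing in for the Avg Grade column.
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=15, size=395)

result = stats.normaltest(sample)
alpha = 0.05

# p < alpha: reject the null hypothesis of normality.
# p >= alpha: fail to reject it (the data is consistent with a normal distribution).
is_rejected = result.pvalue < alpha
print(f"statistic={result.statistic:.3f}, p-value={result.pvalue:.3f}, reject null: {is_rejected}")
```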

Calculating Probabilities with Scipy's Probability Density Functions

## Get the mean, std, min, and max for the Avg Grade column

## generate a linearly-spaced array of values that span the min to the max

## use stats.norm.pdf to get the PDF curve that corresponds to your distribution's values
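One possible way to build the curve is shown below. The `grades` array is simulated (an assumption; in the notebook it would be the Avg Grade column), and `ddof=1` is chosen so the standard deviation matches what pandas' `.std()` returns by default.

```python
import numpy as np
from scipy import stats

# Hypothetical grades standing in for the Avg Grade column.
rng = np.random.default_rng(0)
grades = rng.normal(55, 18, size=395).clip(0, 100)

mean = grades.mean()
std = grades.std(ddof=1)  # sample standard deviation, as pandas computes it

# Evenly spaced x values from the smallest to the largest observed grade.
xs = np.linspace(grades.min(), grades.max(), 200)

# Height of the normal curve with this mean and std at each x value.
pdf = stats.norm.pdf(xs, loc=mean, scale=std)
```

When the curve is drawn over a seaborn `histplot`, passing `stat='density'` to the histogram puts both on the same vertical scale.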

## Plot the histogram again AND then plot the pdf we calculated.

Looks pretty normal! But can we confirm for a fact that it's normal?

Q1: what is the probability of a student getting a score of 90 or above?

## Plot the histogram again AND pdf again

## Add a vspan (e.g. with axvspan) to the plot showing the region we want to calc prob for

How can we calculate this probability? Can we use the PDF?

## try making a list of values from 90-100 and getting the pdf values


## Sum the values to get the total probability.

What's the flaw in this approach?
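A small experiment can make the flaw visible. PDF values are densities (curve heights), not probabilities, so their sum depends on how many points are sampled. The mean and standard deviation below are hypothetical placeholders, not the values from the student data.

```python
import numpy as np
from scipy import stats

mean, std = 55.0, 18.0  # hypothetical values for illustration

# Same interval (90 to 100), sampled at two different step sizes.
coarse = np.linspace(90, 100, 11)   # step of 1
fine = np.linspace(90, 100, 101)    # step of 0.1

sum_coarse = stats.norm.pdf(coarse, loc=mean, scale=std).sum()
sum_fine = stats.norm.pdf(fine, loc=mean, scale=std).sum()

# The raw sums disagree by roughly a factor of 10, so neither is a probability.
print(sum_coarse, sum_fine)

# Multiplying by the step width approximates the area under the curve,
# which is what the CDF gives exactly.
area_fine = sum_fine * 0.1
exact = stats.norm.cdf(100, loc=mean, scale=std) - stats.norm.cdf(90, loc=mean, scale=std)
print(area_fine, exact)
```

The probability of a range is the area under the PDF, which is why the cumulative distribution function is the better tool here.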

## Use the cumulative distribution function (CDF) to find prob of 90 OR lower.

Now, we want the opposite probability: the probability of being GREATER than 90.

# calc 1-prob of 90 or lower.
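The two steps can be sketched like this. The mean and standard deviation are hypothetical placeholders, so the printed probability will not match the answer from the real data. `stats.norm.sf` (the survival function) is included only as an alternative that returns 1 - CDF directly.

```python
from scipy import stats

mean, std = 55.0, 18.0  # hypothetical values; use the Avg Grade mean and std instead

# Probability of a score of 90 or lower.
p_90_or_lower = stats.norm.cdf(90, loc=mean, scale=std)

# Complement: probability of a score greater than 90.
p_above_90 = 1 - p_90_or_lower

# Equivalent shortcut using the survival function.
p_above_90_sf = stats.norm.sf(90, loc=mean, scale=std)

print(f"P(score > 90) = {p_above_90:.4f}")
```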

Answer: there is a 2.4% chance of having a score greater than 90.

Hypothesis Testing

Q: Do students with internet access have different average grades than students who do not have internet access?

State The Hypothesis

$H_0$ (Null Hypothesis): Students with internet access have the same average grades as students who do not.
$H_A$ (Alternative Hypothesis): Students with internet access have different average grades than students who do not.

Visualize and Separate Groups

Visualize the histogram of Avg Grade again, but separate it into groups based on the "internet" column.
Note: when comparing 2 groups with seaborn's histplot, you will want to add common_norm=False

## visualize the histogram of Avg Grade again, but separate it by "internet"

## Plot a bar plot of the Avg Grade for students with internet vs those that do not have it

## Separate the 2 groups into 2 variables
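A minimal sketch of the separation, assuming the "internet" column holds the strings "yes" and "no". The five-row DataFrame and the names `yes_grades` and `no_grades` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the student DataFrame.
df = pd.DataFrame({
    "internet": ["yes", "no", "yes", "no", "yes"],
    "Avg Grade": [60.0, 45.0, 75.0, 50.0, 55.0],
})

# Boolean masks select the rows for each group; keep only the grade column.
yes_grades = df.loc[df["internet"] == "yes", "Avg Grade"]
no_grades = df.loc[df["internet"] == "no", "Avg Grade"]

print(len(yes_grades), len(no_grades))
```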

T-Test Assumptions

Since we are comparing a numeric measurement between 2 groups, we want to run a 2-sample (AKA independent) T-test.
The assumptions are:
- No significant outliers
- Normality
- Equal Variance

Assumption: No Sig. Outliers

## check yes group for outliers using z-score >3 rule.

## check no group for outliers using z-score >3 rule.
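The z-score rule can be sketched as below for a single group; the same check would be repeated for the other group. The data is simulated, and one extreme value is appended on purpose so the check has something to flag (both are assumptions for illustration).

```python
import numpy as np
from scipy import stats

# Hypothetical group of grades plus one deliberately extreme value.
rng = np.random.default_rng(1)
group = np.append(rng.normal(55, 18, size=300), 400.0)

# z-score: how many standard deviations each value sits from the group mean.
z_scores = stats.zscore(group)

# Flag values more than 3 standard deviations from the mean in either direction.
is_outlier = np.abs(z_scores) > 3
print(f"{is_outlier.sum()} outlier(s) found")

# Keep only the non-outlier values if any were flagged.
group_clean = group[~is_outlier]
```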

No outliers to worry about! Assumption met.

Assumption: Normally Distributed Groups

## use normaltest to check if yes group is normally distributed

## use normaltest to check if no group is normally distributed

Did we meet the assumption of normality?

Assumption: Equal Variance

## use Levene's test to check if groups have equal variance
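A short sketch of the check with `stats.levene`, whose null hypothesis is that the groups come from populations with equal variances. Both groups are simulated here (an assumption), so the printed p-value says nothing about the student data.

```python
import numpy as np
from scipy import stats

# Hypothetical groups standing in for the yes/no internet grades.
rng = np.random.default_rng(2)
yes_grades = rng.normal(55, 18, size=300)
no_grades = rng.normal(50, 18, size=60)

result = stats.levene(yes_grades, no_grades)

# p < .05: reject equal variances (assumption NOT met).
# p >= .05: fail to reject equal variances (assumption met).
equal_variance_met = result.pvalue >= 0.05
print(f"p-value={result.pvalue:.3f}, equal variance assumption met: {equal_variance_met}")
```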

Did we meet the assumption of equal variance?

Perform Final Hypothesis Test (T-Test)

Since we met all of the assumptions for the test we can proceed with our t-test.
- Next class we will discuss what we would do if we did NOT meet the assumptions.

## run stats.ttest_ind on the 2 groups
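The final test might look like the following sketch. The two groups are simulated placeholders, so the result is illustrative only. `stats.ttest_ind` assumes equal variances by default (`equal_var=True`), which fits the case where that assumption was met.

```python
import numpy as np
from scipy import stats

# Hypothetical groups standing in for the yes/no internet grades.
rng = np.random.default_rng(3)
yes_grades = rng.normal(55, 18, size=300)
no_grades = rng.normal(50, 18, size=60)

# Independent 2-sample t-test comparing the group means.
result = stats.ttest_ind(yes_grades, no_grades)

alpha = 0.05
reject_null = result.pvalue < alpha
print(f"t={result.statistic:.3f}, p-value={result.pvalue:.4f}, reject null: {reject_null}")
```

A positive t statistic means the first group passed in has the higher sample mean, which helps when wording the conclusion.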

What is our p-value? Is it less than our alpha of .05? What does this mean?

Our T-Test returned a p-value of ____. Since p </>.05, we can reject/fail to reject the null hypothesis that students with internet access have the same average grades as students who do not.

We therefore conclude that there is/is not a significant difference in Average Grades between students who do/do not have internet access.

Our visualization below shows that students with internet access have HIGHER/LOWER/EQUAL average grades.

## Add a summary visual to support our results.
