Slides: Testing Hypotheses with Data
TESTING HYPOTHESES WITH DATA
ST101 – DR. ARIC LABARR
A hypothesis test uses data to help evaluate an initial claim about a parameter from the population.
HYPOTHESIS TESTING
I have a coin that you believe is fair to start.
To test if this coin is fair, you ask me to flip the coin repeatedly and record the results.
HYPOTHESIS TESTING THROUGH EXAMPLE
Flip Number | Result | Probability
1           | Heads  | 0.50
2           | Heads  | 0.25
3           | Heads  | 0.125
4           | Heads  | 0.0625
5           | Heads  | 0.03125
Do you still think the coin is fair?
No longer believe the coin is fair.
In hypothesis testing terms: the belief that the coin is fair is the NULL hypothesis, the recorded flips are the test statistic, the probability of seeing those results from a fair coin is the p-value, and no longer believing the coin is fair is the decision on the NULL hypothesis.
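The probabilities in the coin example can be reproduced with a few lines of Python. This is a minimal sketch, assuming independent flips and a fair coin (probability 0.5 of heads on each flip); the function name `prob_all_heads` is made up for illustration.

```python
def prob_all_heads(flips: int, p_heads: float = 0.5) -> float:
    """Probability of seeing `flips` heads in a row from independent flips."""
    return p_heads ** flips


# Reproduce the probability column of the coin example
for flip in range(1, 6):
    print(f"{flip} heads in a row: {prob_all_heads(flip)}")
```

Each extra head halves the probability, which is why the belief in a fair coin gets harder to hold as the run of heads continues.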
According to the Central Limit Theorem (CLT), sample means approximately follow a Normal distribution as long as the sample size is big enough.
BIKE DATA EXAMPLE WITH MEANS
You believe the average daily number of total users is 4,000, but you want to know if it is more than that.
You collect a sample of 731 days with an average daily number of total users of 4,504 and a standard deviation of 1,937.
What is the probability of seeing a sample like this under the initial thought of 4,000 for an average? < 0.0001!
Do you still believe your original hypothesis?
BIKE DATA EXAMPLE WITH MEANS
NULL Hypothesis: the average daily number of total users is 4,000.
Test Statistic: built from the sample of 731 days with an average of 4,504 and a standard deviation of 1,937.
P-value: the probability of seeing this sample under the initial thought of 4,000 for an average.
Decision on NULL Hypothesis: do you still believe your original hypothesis?
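The "< 0.0001" figure for the bike data can be checked numerically. The sketch below is an approximation, not necessarily the calculation behind the slide: it assumes the CLT applies, so the sample mean is treated as Normal with standard error s/sqrt(n). The function name `upper_tail_probability` is hypothetical.

```python
from math import erfc, sqrt


def upper_tail_probability(sample_mean, null_mean, sample_sd, n):
    """Return (z, P(sample mean at least this large)) under the null mean.

    Assumes the sampling distribution of the mean is approximately Normal (CLT).
    """
    standard_error = sample_sd / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # erfc gives an accurate upper-tail Normal probability even far in the tail
    p_value = 0.5 * erfc(z / sqrt(2))
    return z, p_value


z, p_value = upper_tail_probability(4504, 4000, 1937, 731)
print(f"z = {z:.2f}, upper-tail probability = {p_value:.2e}")
```

The sample mean sits about 7 standard errors above 4,000, so the probability is far below 0.0001.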
A hypothesis test uses data to help evaluate an initial claim about a parameter from the population.
There are 4 main steps to hypothesis testing:
State the hypotheses
Test statistic
P-value
Decision on null hypothesis
SUMMARY
NULL AND ALTERNATIVE HYPOTHESIS
It is not always obvious how the null and alternative hypotheses should be formulated.
The context of the situation is very important in determining how the hypotheses should be stated.
In some cases it is easier to identify the alternative hypothesis first!
Typically, the alternative is what we are trying to test and want to collect evidence for.
DEVELOPING NULL AND ALTERNATIVE
The null hypothesis is the status quo, or the initial claim about the data.
For example, the average daily number of total users is 4,000.
The null hypothesis is about the population parameter of interest, NOT sample statistics.
Parameters are unknown, while statistics are known.
NULL VS. ALTERNATIVE
One-Sided Tests
Two-Sided Test
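For the bike data, the one-sided and two-sided forms might be written as follows. The symbols are assumptions that do not appear in the slide text: \mu for the population average daily number of total users, H_0 for the null hypothesis, and H_a for the alternative.

```latex
\begin{align*}
\text{One-sided (upper):} \quad & H_0: \mu = 4000 \quad \text{vs.} \quad H_a: \mu > 4000 \\
\text{One-sided (lower):} \quad & H_0: \mu = 4000 \quad \text{vs.} \quad H_a: \mu < 4000 \\
\text{Two-sided:}         \quad & H_0: \mu = 4000 \quad \text{vs.} \quad H_a: \mu \neq 4000
\end{align*}
```

In every case the hypotheses are statements about the parameter \mu, not about the sample average of 4,504.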
SUMMARY
TEST STATISTIC
The test statistic summarizes the amount of information provided in the sample.
Imagine this like evidence in a court case.
Test statistics have a common form:
TEST STATISTIC
Test Statistic = (Sample Information − Null Hypothesis Information) / (Estimated Variability from Sampling Distribution of Statistic)
Test statistics for sample means use the t-distribution because the population standard deviation is unknown and must be estimated from the sample.
TEST STATISTIC FOR MEANS
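In symbols, the test statistic for a mean is usually written as below. The notation is an assumption (the slide's own formula is not in the text): \bar{x} is the sample mean, \mu_0 the null value, s the sample standard deviation, and n the sample size. The second line plugs in the bike data as a worked example.

```latex
\begin{align*}
t &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \text{df} = n - 1 \\
t &= \frac{4504 - 4000}{1937 / \sqrt{731}} \approx \frac{504}{71.64} \approx 7.03, \qquad \text{df} = 730
\end{align*}
```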
Sample proportions use the Normal distribution.
TEST STATISTIC FOR PROPORTIONS
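A sketch of the proportion version, following the same common form (statistic minus null value, divided by standard error). The standard error formula under the null, sqrt(p0 * (1 - p0) / n), is the usual textbook choice and is assumed here. The data (60 heads in 100 flips) are hypothetical and only serve to show the arithmetic.

```python
from math import sqrt


def proportion_test_statistic(successes: int, n: int, null_proportion: float) -> float:
    """z statistic for one proportion: (p_hat - p0) / sqrt(p0 * (1 - p0) / n)."""
    sample_proportion = successes / n
    # Standard error is computed under the null hypothesis value p0
    standard_error = sqrt(null_proportion * (1 - null_proportion) / n)
    return (sample_proportion - null_proportion) / standard_error


# Hypothetical data: 60 heads in 100 flips, null hypothesis of a fair coin (p0 = 0.5)
z = proportion_test_statistic(60, 100, 0.5)
print(f"z = {z:.2f}")
```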
The test statistic calculation typically requires 3 pieces of information:
Statistic – information obtained from the sample.
Null value – information about the null hypothesis.
Standard error – measure of variability for the sampling distribution of the statistic.
SUMMARY
P-VALUE AND SIGNIFICANCE LEVEL
Once the test statistic has been determined, we can calculate the probability of getting the information we did from our sample, assuming that the null hypothesis is true.
The p-value is the probability of getting our sample, or a sample more extreme, assuming the null hypothesis is true.
P-VALUES
If the p-value is low, this implies that the sample we obtained from the population is extremely rare IF we assume that the null hypothesis is true.
This leads us to question the validity of the null hypothesis – rejecting the null hypothesis if the p-value is low enough.
How low is low enough?
SIGNIFICANCE LEVEL VS. P-VALUE
[Figures: sampling distributions with the p-value shaded. A small p-value means the sample statistic and the null value are "far apart"; a large p-value means they are "close together". In a two-sided test the p-value is split across both tails, with p-value/2 in each.]
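The relationship between the one-tail and two-tail p-value can be sketched in code. This example assumes a Normal sampling distribution and uses a hypothetical test statistic of 1.75; neither the value nor the function name `p_values` comes from the slides.

```python
from statistics import NormalDist


def p_values(z: float) -> dict:
    """One-tail and two-tail p-values for a statistic from a standard Normal."""
    # Area beyond |z| in a single tail
    one_tail = 1 - NormalDist().cdf(abs(z))
    # A two-sided test counts both tails, so each tail holds p-value / 2
    return {"one_tail": one_tail, "two_tail": 2 * one_tail}


result = p_values(1.75)  # hypothetical test statistic
print(f"one-tail: {result['one_tail']:.4f}, two-tail: {result['two_tail']:.4f}")
```

The one-tail value here assumes the alternative points in the same direction as the observed statistic. With this statistic the one-tail p-value falls below 0.05 while the two-tail p-value does not, so the choice of alternative can change the decision.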
HYPOTHESIS TESTING THROUGH EXAMPLE
After five heads in a row (probability 0.03125 for a fair coin), you no longer believe the coin is fair. But could it be fair? YES!
The significance level defines the unlikely values of the sample statistic if the null hypothesis is true.
This area is typically called the rejection region of the sampling distribution.
It is selected before the hypothesis test is even run!
Typical values are 0.01, 0.05, and 0.10.
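The comparison between p-value and significance level can be sketched as a small decision rule. It assumes the common convention of rejecting when the p-value is below the significance level; the function name `decide` is made up for illustration. The p-value used is the one from the coin example (five heads in a row).

```python
def decide(p_value: float, significance_level: float) -> str:
    """Reject the null hypothesis when the p-value is below the significance level."""
    if p_value < significance_level:
        return "reject null"
    return "fail to reject null"


coin_p_value = 0.5 ** 5  # 0.03125, five heads in a row from a fair coin

# The significance level is fixed before looking at the data
for alpha in (0.01, 0.05, 0.10):
    print(f"significance level {alpha}: {decide(coin_p_value, alpha)}")
```

The same evidence leads to rejection at 0.05 and 0.10 but not at 0.01, which is why the significance level has to be chosen before the test is run.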
The p-value is the probability we got our sample, or a sample more extreme, under the null hypothesis.
If the p-value is low, this implies that the sample we obtained from the population is extremely rare IF we assume that the null hypothesis is true.
The significance level defines the unlikely values of the sample statistic if the null hypothesis is true.
SUMMARY
HYPOTHESIS TEST FOR MEANS
You believe the average daily number of total users is 4,000, but you want to know if it is more than that so you can decide on orders for future bikes to be added.
You collect a sample of 731 days with an average daily number of total users at 4,504 with a standard deviation of 1,937.
With a significance level of 0.05, conduct a hypothesis test on this claim.
BIKE DATA EXAMPLE FOR ONE-TAIL HYPOTHESIS TEST
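FINDING P-VALUE
(t-table; first column is degrees of freedom)
One-Tail | 0.25  | 0.20  | 0.15  | 0.10  | 0.05  | 0.025 | 0.01  | 0.005 | 0.001 | 0.0005
Two-Tail | 0.50  | 0.40  | 0.30  | 0.20  | 0.10  | 0.05  | 0.02  | 0.01  | 0.002 | 0.001
...      | ...   | ...   | ...   | ...   | ...   | ...   | ...   | ...   | ...   | ...
90       | 0.677 | 0.846 | 1.042 | 1.291 | 1.662 | 1.987 | 2.368 | 2.632 | 3.183 | 3.402
100      | 0.677 | 0.845 | 1.042 | 1.290 | 1.660 | 1.984 | 2.364 | 2.626 | 3.174 | 3.390
250      | 0.675 | 0.843 | 1.039 | 1.285 | 1.651 | 1.969 | 2.341 | 2.596 | 3.123 | 3.330
500      | 0.675 | 0.842 | 1.038 | 1.283 | 1.648 | 1.965 | 2.334 | 2.586 | 3.107 | 3.310
1000     | 0.675 | 0.842 | 1.037 | 1.282 | 1.646 | 1.962 | 2.330 | 2.581 | 3.098 | 3.300
...      | ...   | ...   | ...   | ...   | ...   | ...   | ...   | ...   | ...   | ...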
P-value < 0.0005, which is below the significance level of 0.05, so the null hypothesis is rejected.
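The table lookup for the one-tail test can be sketched as follows. This is an illustration, not the slide's own working: the test statistic is computed from the sample figures given above, and because the table has no row for 730 degrees of freedom, the df = 500 row is assumed as a conservative stand-in. The function name `one_tail_p_bound` is hypothetical.

```python
from math import sqrt

# One-tail column headers and the df = 500 row, copied from the t-table
ONE_TAIL = [0.25, 0.20, 0.15, 0.10, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005]
T_ROW_DF_500 = [0.675, 0.842, 1.038, 1.283, 1.648, 1.965, 2.334, 2.586, 3.107, 3.310]


def one_tail_p_bound(t_stat, critical_values, tail_probs):
    """Smallest tabled one-tail probability whose critical value t_stat exceeds.

    Returns None when t_stat does not exceed even the first critical value.
    """
    bound = None
    for prob, crit in zip(tail_probs, critical_values):
        if t_stat > crit:
            bound = prob  # columns run from large to small tail probability
    return bound


# Test statistic for the bike data: (statistic - null value) / standard error
t_stat = (4504 - 4000) / (1937 / sqrt(731))
bound = one_tail_p_bound(t_stat, T_ROW_DF_500, ONE_TAIL)
print(f"t = {t_stat:.2f}, one-tail p-value < {bound}")
```

The statistic is larger than every value in the row, so the table can only say that the p-value is below its smallest one-tail column, 0.0005.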
You believe the average daily number of total users is 4,000, but you want to know if it is more than that, so you can decide on orders for future bikes to be added, OR less than 4,000, so you can pull stock from the streets and bikes don’t sit unused.
You collect a sample of 731 days with an average daily number of total users at 4,504 with a standard deviation of 1,937.
With a significance level of 0.05, conduct a hypothesis test on this claim.
BIKE DATA EXAMPLE FOR TWO-TAIL HYPOTHESIS TEST
P-value/2 < 0.0005 in each tail, so P-value < 0.001. This is below the significance level of 0.05, so the null hypothesis is rejected.
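An equivalent way to reach the two-tail decision is to compare the test statistic with the rejection region directly. The sketch below assumes the df = 500 row of the t-table as a conservative stand-in for 730 degrees of freedom, where the two-tail 0.05 column gives 1.965. The function name `two_tail_reject` is made up for illustration.

```python
from math import sqrt

# Two-tail 0.05 column, df = 500 row of the t-table (assumed stand-in for df = 730)
TWO_TAIL_CRITICAL_VALUE = 1.965


def two_tail_reject(t_stat: float, critical_value: float) -> bool:
    """Reject when the statistic falls in either tail of the rejection region."""
    return abs(t_stat) > critical_value


# Same bike data as the one-tail test; only the alternative hypothesis changed
t_stat = (4504 - 4000) / (1937 / sqrt(731))
print(f"t = {t_stat:.2f}, reject null: {two_tail_reject(t_stat, TWO_TAIL_CRITICAL_VALUE)}")
```

Taking the absolute value is what makes the test two-sided: a statistic far below zero would lead to rejection just as one far above zero does.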
ETHICS AROUND INFERENCE WITH DATA
Hypothesis tests depend on sample data.
Therefore, hypothesis tests may be wrong!
There are two types of errors in hypothesis testing – Type I and Type II errors.
ERRORS IN HYPOTHESIS TESTS
TYPE I VS. TYPE II ERRORS
CHOICE (rows) vs. TRUTH (columns)
                        | Null is true | Null is false
Do not reject the null  | Correct      | Type II
Reject the null         | Type I       | Correct
A Type I error is rejecting the null hypothesis when the null hypothesis was actually true.
In other words, you have a false rejection.
The probability of making a Type I error in a hypothesis test is called the significance level.
Most hypothesis tests are referred to as significance tests because they only control the Type I error.
TYPE I ERROR
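A Type I error rate can be computed exactly for the coin example by listing every sequence a fair coin could produce. The decision rule below is an assumption made for illustration: reject "the coin is fair" only when every flip is heads and the probability of that run is below the significance level.

```python
from itertools import product


def type_one_error_rate(flips: int = 5, significance_level: float = 0.05) -> float:
    """Fraction of equally likely fair-coin sequences that lead to a false rejection."""
    sequences = list(product("HT", repeat=flips))
    rejections = 0
    for seq in sequences:
        if all(side == "H" for side in seq):
            p_value = 0.5 ** flips  # probability of an unbroken run of heads
            if p_value < significance_level:
                rejections += 1  # the coin is fair here, so this rejection is a Type I error
    return rejections / len(sequences)


print(type_one_error_rate(5, 0.05))
```

With five flips, exactly 1 of the 32 equally likely sequences triggers a rejection, so a fair coin is falsely rejected 3.125% of the time. That is at most the 0.05 significance level; with only a handful of discrete outcomes the actual rate cannot hit 0.05 exactly.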
A Type II error is accepting (failing to reject) the null hypothesis when the null hypothesis was actually false.
In other words, you have a false acceptance.
The probability of NOT making a Type II error in a hypothesis test is called the power.
It is difficult to control the Type II error.
You can only control for either the Type I or the Type II error at a time.
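The trade-off between the two error types can be illustrated with a power calculation. Everything specific here is an assumption: the test is treated as a one-sided Normal (z) test, the sample standard deviation of 1,937 is treated as the known population value, and the "true" mean of 4,100 is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided(null_mean, true_mean, sd, n, significance_level):
    """Power of an upper-tail z test: P(reject null) when the mean is true_mean."""
    standard_error = sd / sqrt(n)
    # Smallest sample mean that lands in the rejection region under the null
    cutoff = null_mean + NormalDist().inv_cdf(1 - significance_level) * standard_error
    # Probability the sample mean exceeds the cutoff when the truth is true_mean
    return 1 - NormalDist(true_mean, standard_error).cdf(cutoff)


for alpha in (0.05, 0.01):
    power = power_one_sided(4000, 4100, 1937, 731, alpha)
    print(f"significance level {alpha}: power = {power:.3f}, Type II = {1 - power:.3f}")
```

Tightening the significance level from 0.05 to 0.01 lowers the chance of a Type I error but also lowers the power, so the chance of a Type II error goes up.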
TYPE II ERROR
What if your sample of data happened to be drawn only from summer months with clear days?
The number of daily users might then be estimated too high.
This could lead to incorrect actions being taken.
Hypothesis tests completely depend on the data they are built from.
Garbage in → Garbage out
Hypothesis test results reveal something, but not everything!
CAREFUL WITH INFERENCE
People sometimes forget the possibility of errors when making claims from a statistical test.
For example:
Instead of “We know that more than 4,000 bikes per day are used on average.”
say “We have strong evidence that more than 4,000 bikes per day are used on average.”
Remember the analogy of a court case → we incorrectly claim people are guilty sometimes. Careful about rushing to judgment!
CAREFUL ABOUT JUSTIFICATION
A Type I error is rejecting the null hypothesis when the null hypothesis was actually true.
A Type II error is accepting (failing to reject) the null hypothesis when the null hypothesis was actually false.
Hypothesis tests completely depend on the data they are built from.
People sometimes forget the possibility of errors when making claims from a statistical test.
SUMMARY