NEXT STEPS WITH DATA

ST101 – DR. ARIC LABARR


ANALYSIS OF VARIANCE


So far, the hypothesis tests and confidence intervals we have studied have focused on one population parameter – for example, one average compared to a number.

However, sometimes we would like to compare multiple parameters against each other, like comparing two or more averages.


COMPARING TWO OR MORE AVERAGES


When comparing averages between two groups of data, we must think about how spread out the data is. 

When we compare many averages statistically, the procedure is called an Analysis of Variance (ANOVA).

Need to account for the spread in the data when comparing means!


COMPARING TWO AVERAGES

 

 

How close are these values?


COMPARING TWO AVERAGES

 

 

Don’t appear to be too close!

Spread of the two distributions is not overlapping by much.


COMPARING TWO AVERAGES

 

 

How close are these values?



COMPARING TWO AVERAGES

 

 

Appear to be rather close!

Spread of the two distributions seems to overlap a lot.


COMMON QUESTIONS ANOVA CAN HELP WITH

Do accountants, on average, make more than teachers?

Do people treated with one of the two new drugs have higher average T-cell counts than people in the control group?

Do people spend different amounts depending on which type of credit card they have?


DOES WINTER HAVE LOWER AVERAGE TOTAL USERS?

Winter looks to be lower on average!


Look at how spread out the values of users in winter are!


ONE-WAY ANOVA

One-Way ANOVA – a design in which independent samples are obtained from 2 or more categories of a single explanatory variable, followed by a test of whether these categories have equal means.

For example:

Variable of interest is total users – looking at the average total users.

Explanatory variable is season – looking to see if average total users changes across seasons.

Number of categories is 4 – looking to see if the average total users changes across the 4 seasons.


ONE-WAY ANOVA HYPOTHESES

For k categories of the explanatory variable:

Null Hypothesis: H₀: μ₁ = μ₂ = … = μₖ (all k category means are equal)

Alternative Hypothesis: Hₐ: at least one category mean is different


Assumptions:

Groups are Normally distributed – total users in each season are Normally distributed

Groups have equal variance / spread – total users in each season have equal variance

Independence of observations – total users from each day don’t depend on each other


SOURCES OF VARIATION

Variation in an ANOVA can come from two places – within a category and between different categories.

Between-Sample Variability – variability in the variable of interest that exists between categories of an explanatory variable. This is what the categories CAN explain.

Within-Sample Variability – variability in the variable of interest that exists within a category of an explanatory variable. This is what the categories CANNOT explain.


COMPARING TWO AVERAGES

 

 

Between-Sample Variability

Within-Sample Variability


COMPARING TWO AVERAGES

 

 

Within-Sample Variability

Between-Sample Variability


SPLITTING VARIABILITY IN ANOVA

Total Variability = Variability Between Groups + Variability Within Groups

Compare the ratio of these variances!

If within group/sample variability is much bigger, then the categories’ averages aren’t that different.

If between group/sample variability is much bigger, then the categories’ averages are different.
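The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names and the three small groups of numbers are made up for the example, and the ratio is formed by dividing each sum of squares by its degrees of freedom (k – 1 between groups, n – k within groups), which is the usual way the F ratio is built.

```python
from statistics import mean


def split_variability(groups):
    """Split total variability into between-group and within-group parts."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)

    # Between groups: how far each group's average is from the overall average.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each value is from its own group's average.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Total: how far each value is from the overall average.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between, ss_within, ss_total


def f_ratio(groups):
    """Ratio of between-group variance to within-group variance."""
    k = len(groups)                    # number of categories
    n = sum(len(g) for g in groups)    # total number of observations
    ss_between, ss_within, _ = split_variability(groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical data: three categories with three observations each.
groups = [[10, 12, 14], [20, 22, 24], [30, 32, 34]]
ss_between, ss_within, ss_total = split_variability(groups)
f_stat = f_ratio(groups)
print(ss_between, ss_within, ss_total, f_stat)
```

In this made-up data the group averages are far apart and the values within each group are tight, so the between-group part dominates and the ratio is large.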


 

BIKE DATA EXAMPLE


SUMMARY

One-Way ANOVA – design in which independent samples are obtained from 2 or more categories of a single explanatory variable, then testing whether these categories have equal means.

Variation in an ANOVA can come from two places – within a category and between different categories.

Between-Sample Variability – variability in the variable of interest that exists between categories of an explanatory variable.

Within-Sample Variability – variability in the variable of interest that exists within a category of an explanatory variable.


MULTIPLE COMPARISONS


DOES WINTER HAVE LOWER AVERAGE TOTAL USERS?

Winter looks to be lower on average!


NEXT STEPS AFTER ANOVA

If you reject the null hypothesis on the F-test, what does that mean? The evidence shows that at least one category is different.

But which category?

Once a difference is detected, each individual pair of categories must be tested to find where all the differences are – a process called multiple comparisons or post-hoc testing.


MULTIPLE COMPARISONS PROBLEM

You have a coin which lands on heads 50% of the time when flipped.

What is the probability of flipping a head on your first flip? 50%

What is the probability of flipping a head on your second flip? 50%

What is the probability of flipping at least one head in two flips? 75%, since 1 – (0.5 × 0.5) = 0.75.

Now suppose you have a test which makes an error 5% of the time when performed.

What is the probability of making an error on your first test? 5% (the comparison-wise error)

What is the probability of making an error on your second test? 5% (the comparison-wise error)

What is the probability of making at least one error in two tests? 9.75%, since 1 – (0.95 × 0.95) = 0.0975 when the two tests are independent (the experiment-wise error)


 

TWO DIFFERENT TYPES OF ERROR


MULTIPLE COMPARISONS METHODS

Number of Groups Compared    Number of Comparisons    Experiment-wise Error Rate (α = 0.05)
2                            1                        0.05
3                            3                        0.14
4                            6                        0.26
5                            10                       0.40

Comparison-wise Error: α = 0.05

Experiment-wise Error: 1 – (1 – α)^(number of comparisons)
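The table can be reproduced with a short Python sketch. It assumes, as the formula does, that the individual comparisons are independent, and it counts the comparisons as the number of distinct pairs among the groups. The function name is made up for this illustration.

```python
from math import comb


def experimentwise_error(num_groups, alpha=0.05):
    """Return (number of pairwise comparisons, chance of at least one error).

    Assumes every pair of groups is compared once and that the
    comparisons are independent of each other.
    """
    num_comparisons = comb(num_groups, 2)   # number of distinct pairs
    return num_comparisons, 1 - (1 - alpha) ** num_comparisons


rows = [(k, *experimentwise_error(k)) for k in range(2, 6)]
for k, m, rate in rows:
    print(f"{k} groups, {m:2d} comparisons: {rate:.2f}")
```

The error rate grows quickly with the number of groups, which is why the comparisons need to be controlled across the whole experiment.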


 

TUKEY-KRAMER TESTS


 

BIKE DATA EXAMPLE – ANOVA 


BIKE DATA EXAMPLE – MULTIPLE COMPARISONS

95% Experiment-wise Confidence Intervals

Spring – Winter

Summer – Winter

Fall – Winter

Summer – Spring

Fall – Spring

Fall – Summer

If confidence interval contains 0, then two seasons are NOT statistically different.

Fall is only 264.2 total users a day lower on average than Spring.

Winter does NOT seem to be close statistically to any of the other seasons.
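The rule for reading these intervals can be written as a small check. The intervals below are hypothetical numbers chosen only to show the rule. They are not the bike data results, and the group names are placeholders.

```python
def statistically_different(interval):
    """Two groups differ if the interval for their difference excludes 0."""
    low, high = interval
    return not (low <= 0 <= high)


# Hypothetical experiment-wise intervals for differences in averages.
intervals = {
    "Group B - Group A": (120.0, 480.0),    # entirely above 0
    "Group C - Group A": (-90.0, 210.0),    # contains 0
    "Group C - Group B": (-400.0, -50.0),   # entirely below 0
}

verdicts = {pair: statistically_different(ci) for pair, ci in intervals.items()}
for pair, differs in verdicts.items():
    print(pair, "different" if differs else "NOT statistically different")
```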


SUMMARY

If you reject the null hypothesis on the F-test, there is evidence that at least one category is different.

Once a difference is detected, each individual pair of categories must be tested to find where all the differences are – a process called multiple comparisons or post-hoc testing.

When many individual pairs are tested, errors in single comparisons (comparison-wise) are bound to happen unless the error rate is controlled across the entire experiment (experiment-wise).


LINEAR REGRESSION


 

REVIEW OF CORRELATION


CORRELATION IS NOT EVERYTHING

Correlation measures the strength of a linear relationship but does not say what that linear relationship is.

The plot has two sets of data with exactly the same correlation of 0.99.

However, the relationship is different between the two.
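A small Python sketch can make this concrete. The numbers are made up for illustration and are not the data behind the plot. The second data set is the first one divided by 4, which leaves the correlation unchanged but makes the line four times flatter.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def slope(xs, ys):
    """Slope of the least squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: same x values, two different responses.
x = [1, 2, 3, 4, 5]
y_steep = [2.1, 3.9, 6.2, 7.8, 10.1]
y_flat = [value / 4 for value in y_steep]   # same pattern, smaller scale

r_steep, r_flat = correlation(x, y_steep), correlation(x, y_flat)
slope_steep, slope_flat = slope(x, y_steep), slope(x, y_flat)
print(round(r_steep, 3), round(r_flat, 3))
print(round(slope_steep, 3), round(slope_flat, 3))
```

Both correlations are above 0.99, yet a one-unit change in x moves the first response about four times as much as the second.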



REGRESSION MODELING

Many people across industries devote research funding to discover how variables are related (modeling).

The simplest graphical technique to relate two quantitative variables is through a straight-line relationship – called the simple linear regression (SLR) model.

Most models are more extensive and complicated than SLR models, but SLR models form a good foundation.


BIKE DATA EXAMPLE

What if you wanted to predict the number of registered users based on the temperature outside?

What is the best guess line for the following?

 


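BIKE DATA EXAMPLE

Simple Linear Regression: a straight line described by two numbers.

Intercept – the predicted number of registered users when the temperature is 0.

Slope – the change in the predicted number of registered users for each one-unit increase in temperature.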


BIKE DATA EXAMPLE

What if you wanted to predict the number of registered users based on the temperature outside?

What is the best guess line for the following?

 

How do we determine this?


SIMPLE EXAMPLE

Predicting sales revenue (thousands of $) with advertising expenditure (hundreds of $).

What is the “best” line through these 5 data points?



SIMPLE EXAMPLE

Let’s pick one line and work through how we would approach this.


SIMPLE EXAMPLE

How “wrong” were you at each point?

Look at the vertical deviations from the data point to the line – called residuals.

We can sum up all the deviations to calculate “total” error.

These errors have both positive and negative values, so they would cancel each other out if we just added them.

Summing the squared errors (error²) removes the effect of the direction of the error.
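The steps above can be sketched in Python. The five data points and the candidate line below are hypothetical, chosen only so the arithmetic is easy to follow. They are not the values from the slide's scatterplot.

```python
# Hypothetical data: advertising expenditure (x) and sales revenue (y).
ad_spend = [1, 2, 3, 4, 5]
sales = [1, 1, 2, 2, 4]


def predict(x, intercept=-1.0, slope=1.0):
    """A candidate line picked by eye: sales = -1 + 1 * ad_spend."""
    return intercept + slope * x


# Residual = actual value minus the value the line predicts.
residuals = [y - predict(x) for x, y in zip(ad_spend, sales)]

# Adding the raw residuals lets positive and negative errors cancel.
total_error = sum(residuals)

# Squaring first removes the sign, so every miss counts.
sum_squared_errors = sum(r ** 2 for r in residuals)

print(residuals)
print(total_error, sum_squared_errors)
```

Here the raw residuals add up to zero even though the line misses two of the five points, while the sum of squared errors is 2.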


LEAST SQUARES REGRESSION

It can be shown that there is only one line for which the sum of the squared errors is minimized.

This line is called the line of best fit or the least squares regression line.


SIMPLE EXAMPLE

The line of best fit for the 5 data points in the scatterplot is shown in the darker line.

Not the original line we used for our prediction.

Computers can easily and quickly calculate this best line for us.
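As a rough sketch of what the computer does, the least squares line for one explanatory variable has a closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept makes the line pass through the point of averages. The five data points and the hand-picked comparison line are hypothetical, not the slide's values.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line minimizing squared errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # line passes through the averages
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: advertising expenditure (x) and sales revenue (y).
ad_spend = [1, 2, 3, 4, 5]
sales = [1, 1, 2, 2, 4]

best_intercept, best_slope = least_squares_line(ad_spend, sales)
best_sse = sum_squared_errors(ad_spend, sales, best_intercept, best_slope)

# A hand-picked line (intercept -1, slope 1) for comparison.
guess_sse = sum_squared_errors(ad_spend, sales, -1.0, 1.0)

print(round(best_intercept, 2), round(best_slope, 2))
print(round(best_sse, 2), round(guess_sse, 2))
```

For this data the least squares line has a smaller sum of squared errors (about 1.1) than the hand-picked line (2.0).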


 


BIKE DATA EXAMPLE

What if you wanted to predict the number of registered users based on the temperature outside?

The best fit line for this relationship is:



This line is closer to all of the points at once than any other line, in terms of total squared vertical distance.

 


SUMMARY

The simplest graphical technique to relate two quantitative variables is through a straight-line relationship – called the simple linear regression (SLR) model.

The vertical deviations from the data points to the line are called residuals.

It can be shown that there is only one line for which the sum of the squared errors is minimized – called the line of best fit or the least squares regression line.


Last modified: Monday, 17 October 2022, 13:29