Slides: Relationships in Data
RELATIONSHIPS IN DATA
ST101 – DR. ARIC LABARR
EXPLORING DATA RELATIONSHIPS
Exploring data reveals potential insights and valuable uses of that information.
Visuals help explore data.
Distributions, bar charts, stacked bar charts, boxplots, scatterplots, etc.
Are visuals enough?
BOXPLOTS
EXAMPLE – BIKE RENTAL DATA
Boxplots are visual representations of 5 (and sometimes more!) summary measures about a set of data.
Minimum value
1st quartile
Median
3rd quartile
Maximum value
5 NUMBER SUMMARY
The minimum value of a variable is the numerically lowest value that the variable takes.
Here are the lowest temperatures from our bike rental data (°F): 22.6°, 25.8°, 25.8°, 26.7°, …
MINIMUM VALUE
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.
For example, a student’s test score in the 93rd percentile means that at least 93% of the scores were at or below that score.
QUARTILES AND PERCENTILES
Quartiles are specific percentiles that are commonly used.
The first quartile, Q1, is the 25th percentile.
The second quartile is the 50th percentile, which we already defined – the median.
The third quartile, Q3, is the 75th percentile.
QUARTILES AND PERCENTILES
Quartiles are calculated in a similar way to the median, with all data ordered from smallest to largest.
Typically, we let computers do this calculation.
In our bike rental data, these are the quartiles:
Q1 = 46.08°F
Median = 59.76°F
Q3 = 73.08°F
QUARTILE CALCULATION
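As a minimal sketch of letting a computer do this calculation, the Python standard library function statistics.quantiles returns the three quartile cut points. The nine temperatures below are hypothetical (they are not the bike rental data), and the "inclusive" method is only one of several quartile conventions, so other software may report slightly different values.

```python
import statistics

# Hypothetical temperatures in °F (illustration only, not the bike rental data).
temps = [55, 30, 70, 45, 60, 35, 65, 40, 50]

# quantiles() sorts the data itself; n=4 asks for the three quartile cut points.
q1, median, q3 = statistics.quantiles(temps, n=4, method="inclusive")

print(f"Q1 = {q1}, Median = {median}, Q3 = {q3}")
```

With this convention the middle cut point matches the median of the data, which is why the second quartile and the median are the same thing.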
The interquartile range (IQR) of a data set is the difference between the third and first quartiles: IQR = Q3 − Q1
The IQR spans the middle 50% of the data set and is not affected by extreme observations in the tails of the data set.
INTERQUARTILE RANGE (IQR)
In our bike rental data, these are the quartiles:
Q1 = 46.08°F
Median = 59.76°F
Q3 = 73.08°F
IQR = 73.08°F − 46.08°F = 27.00°F
INTERQUARTILE RANGE (IQR)
The maximum value of a variable is the numerically highest value that the variable takes.
Here are the highest temperatures from our bike rental data (°F): … , 88.5°, 89.4°, 89.4°, 90.5°
MAXIMUM VALUE
BOXPLOT
Boxplots are visual representations of 5 (and sometimes more!) summary measures about a set of data.
Maximum value = 90.50°F
3rd quartile = 73.08°F
Median = 59.76°F
Mean = 59.51°F
1st quartile = 46.08°F
Minimum value = 22.60°F
EXAMPLE – BIKE RENTAL DATA
Outliers
On this boxplot, the minimum and maximum value lines are drawn at the smallest and largest values that fall within an outlier boundary based on the IQR.
Anything outside of this boundary (low or high) is considered an outlier.
OUTLIERS ON A BOXPLOT
Lower boundary = Q1 − 1.5 × IQR
Upper boundary = Q3 + 1.5 × IQR
1.5 IQR RULE
Bike Data:
Maximum value = 90.50°F
3rd quartile = 73.08°F
Median = 59.76°F
1st quartile = 46.08°F
Minimum value = 22.60°F
IQR = 27.00°F
1.5 IQR RULE
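The arithmetic behind the rule can be sketched in a few lines of Python using the temperature values listed on the slide. This assumes the usual form of the 1.5 IQR rule, with boundaries placed 1.5 × IQR below Q1 and 1.5 × IQR above Q3; the variable names are illustrative only.

```python
# Quartiles and extremes for temperature, taken from the slide (°F).
q1, q3 = 46.08, 73.08
minimum, maximum = 22.60, 90.50

iqr = q3 - q1                     # about 27.00
lower_boundary = q1 - 1.5 * iqr   # about 5.58
upper_boundary = q3 + 1.5 * iqr   # about 113.58

# A value is flagged as an outlier only if it falls outside the boundaries.
has_low_outliers = minimum < lower_boundary
has_high_outliers = maximum > upper_boundary

print(f"IQR = {iqr:.2f}, boundaries = [{lower_boundary:.2f}, {upper_boundary:.2f}]")
print(f"Low outliers: {has_low_outliers}, high outliers: {has_high_outliers}")
```

Under these assumptions, neither the minimum (22.60°F) nor the maximum (90.50°F) falls outside the boundaries, so no temperature value would be flagged as an outlier.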
Boxplots are visual representations of 5 (and sometimes more!) summary measures about a set of data.
Minimum value – smallest value in a variable
1st quartile – 25th percentile of a variable
Median – 50th percentile of a variable
Mean – average value of the variable
3rd quartile – 75th percentile of a variable
Maximum value – largest value of a variable
Outliers – anything outside of the 1.5 IQR Rule boundary on your variable
SUMMARY
IDEA OF ANALYSIS OF VARIANCE (ANOVA)
DOES WINTER HAVE LOWER AVERAGE TOTAL USERS?
Winter looks to be lower on average!
Look at how spread out the values of users in winter are!
When comparing averages between two groups of data, we must think about how spread out the data is.
Compare the 5 number summary between Summer and Winter.
COMPARING AVERAGES
COMPARING AVERAGES
Summer 5ish Number Summary
Maximum: 8714
3rd Quartile: 6954
Mean: 5644.3
Median: 5354
1st Quartile: 4580
Minimum: 1115
Winter 5ish Number Summary
Maximum: 7836
3rd Quartile: 3472
Mean: 2604.1
Median: 2209
1st Quartile: 1534
Minimum: 431
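One way to compare the two summaries is to put the gap between the means next to the spread of each season. The sketch below uses only the Summer and Winter values listed above; the dictionary layout and the helper name spread are illustrative choices.

```python
# Values copied from the Summer and Winter summaries above (total users).
summer = {"min": 1115, "q1": 4580, "median": 5354, "mean": 5644.3, "q3": 6954, "max": 8714}
winter = {"min": 431, "q1": 1534, "median": 2209, "mean": 2604.1, "q3": 3472, "max": 7836}

def spread(summary):
    """Return (range, IQR) for a dictionary of summary measures."""
    return summary["max"] - summary["min"], summary["q3"] - summary["q1"]

summer_range, summer_iqr = spread(summer)
winter_range, winter_iqr = spread(winter)
mean_gap = summer["mean"] - winter["mean"]

print(f"Summer: range = {summer_range}, IQR = {summer_iqr}")
print(f"Winter: range = {winter_range}, IQR = {winter_iqr}")
print(f"Difference in means = {mean_gap:.1f}")
```

The difference in means (about 3040 users) is larger than either season's IQR, yet both ranges are more than twice that gap, which is why the spread has to be part of the comparison.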
Comparing many groups’ averages statistically is called an analysis of variance (ANOVA).
We need to account for the spread in the data when comparing means!
COMPARING AVERAGES
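To make the role of spread concrete, here is a rough sketch of the F statistic used in a one-way ANOVA: the variation between group means divided by the variation within groups. The slides do not show this calculation, so the function name f_statistic and the tiny data sets are assumptions for illustration only.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic: between-group variation over within-group variation."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean (the spread).
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical groups: both pairs have group means 3 and 7, but different spread.
tight = [[2, 3, 4], [6, 7, 8]]
spread_out = [[0, 3, 6], [4, 7, 10]]

f_tight = f_statistic(tight)
f_spread_out = f_statistic(spread_out)
print(f_tight, f_spread_out)
```

The gap between the group means is identical in both cases, but the more spread-out groups give a much smaller F statistic, so the same difference in averages is less convincing when the data are more variable.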
When comparing many groups’ averages statistically we call this an analysis of variance (ANOVA).
Need to account for the spread in the data when comparing means.
SUMMARY
INTERPRETING SCATTERPLOTS
EXAMPLE – BIKE RENTAL DATA
Scatterplots are visual comparisons of two different quantitative variables.
For each observation in your data, the values of the two quantitative variables are plotted against each other, one on each axis.
SCATTERPLOT
EXAMPLE – BIKE RENTAL DATA
Date      Weekday    Season  Weather Type           Temperature (°F)  Humidity (%)  # Casual Users  # Registered Users
1/1/2011  Saturday   Winter  Misty                  46.7              80.6          331             654
1/2/2011  Sunday     Winter  Misty                  48.4              69.6          131             670
1/3/2011  Monday     Winter  Clear / Partly Cloudy  34.2              43.7          120             1229
1/4/2011  Tuesday    Winter  Clear / Partly Cloudy  34.5              59.0          108             1454
1/5/2011  Wednesday  Winter  Clear / Partly Cloudy  36.8              43.7          82              1518
…
EXAMPLE – BIKE RENTAL DATA
Temperature = 46.7, Registered Users = 654
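As a small sketch of how a table row becomes a point on the scatterplot, the Python below pairs temperature (x-axis) with registered users (y-axis) for the five rows shown in the table. Only those five rows are used, and the variable names are illustrative.

```python
# The five rows shown in the table: (date, temperature in °F, registered users).
rows = [
    ("1/1/2011", 46.7, 654),
    ("1/2/2011", 48.4, 670),
    ("1/3/2011", 34.2, 1229),
    ("1/4/2011", 34.5, 1454),
    ("1/5/2011", 36.8, 1518),
]

# Each observation becomes one (x, y) point on the scatterplot.
points = [(temperature, registered) for _, temperature, registered in rows]

for x, y in points:
    print(f"Temperature = {x}, Registered Users = {y}")
```

Five winter days are far too few to judge the overall pattern; the full scatterplot uses every day in the data.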
VISUALIZING RELATIONSHIPS
Viewing the relationship between two quantitative variables on a scatterplot is very beneficial.
Linear relationship – relationship between variables that exhibits a fairly straight / linear pattern
Nonlinear relationship – relationship between variables that exhibits a pattern that is nonlinear in nature
Positive relationship – as one variable increases (or decreases) the other has a tendency to do the same
Negative relationship – as one variable increases (or decreases) the other has a tendency to do the opposite
POSITIVE AND RELATIVELY LINEAR RELATIONSHIP
Scatterplots are visual comparisons of two different quantitative variables, which are plotted one on each axis in the plot.
Positive relationship – as one variable increases (or decreases) the other has a tendency to do the same
Negative relationship – as one variable increases (or decreases) the other has a tendency to do the opposite
SUMMARY
CORRELATION
CORRELATION
The Pearson correlation coefficient is unitless – it has no units.
Bounded between -1 and 1
Negative values imply a negative linear relationship
Positive values imply a positive linear relationship
Values near 0 imply no real linear relationship
Values of 1 or -1 imply a perfect linear relationship between y and x.
CORRELATION COEFFICIENT
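These properties can be checked with a short sketch that computes the Pearson correlation coefficient from its standard definition (the co-variation of x and y divided by the square root of the product of their individual variations). The function name pearson_r and the data are hypothetical, chosen so the relationships are perfectly linear.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (illustration only).
x = [1, 2, 3, 4, 5]
y_up = [10, 20, 30, 40, 50]      # rises exactly with x
y_down = [50, 40, 30, 20, 10]    # falls exactly with x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

# Changing units (here, multiplying x by 100) leaves r unchanged.
r_rescaled = pearson_r([100 * a for a in x], y_up)
print(r_up, r_down, r_rescaled)
```

The perfectly rising data give a correlation of 1, the perfectly falling data give -1, and rescaling x does not change the result, which is what being unitless means in practice.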
CORRELATION OF 0.54
CORRELATION OF -0.22
POTENTIAL ISSUES WITH CORRELATION
Two of the biggest problems with correlation are the following:
Outliers
Causation
OUTLIERS IN CORRELATION
Outliers can lead to false conclusions about correlation if you don’t visualize the data to see what might be going on.
Outliers can create the appearance of relationships that aren’t really there.
Outliers can hide relationships that are really there.
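Both effects can be reproduced with a small sketch. The data below are hypothetical and deliberately extreme, and pearson_r is an illustrative helper that computes the Pearson correlation coefficient from its standard definition.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Case 1 (hypothetical): no linear relationship until one outlier is added.
x_flat, y_flat = [1, 2, 3, 4, 5], [3, 1, 2, 1, 3]
r_flat = pearson_r(x_flat, y_flat)
r_flat_outlier = pearson_r(x_flat + [20], y_flat + [20])

# Case 2 (hypothetical): a perfect linear relationship until one outlier is added.
x_line, y_line = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
r_line = pearson_r(x_line, y_line)
r_line_outlier = pearson_r(x_line + [20], y_line + [5])

print(f"Case 1: r = {r_flat:.2f} without the outlier, {r_flat_outlier:.2f} with it")
print(f"Case 2: r = {r_line:.2f} without the outlier, {r_line_outlier:.2f} with it")
```

In the first case a single point turns a correlation of zero into a strong positive one; in the second, a single point pulls a perfect correlation down to nearly zero. A scatterplot would reveal the outlier immediately in both cases.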
SUMMARY
CORRELATION AND CAUSATION
Confusing correlation and causation is a common phenomenon.
All that correlation implies is that a linear trend may exist between two variables of interest.
There are many famous examples of correlations that are not causal.
CORRELATION VS. CAUSATION
Ice cream sales are positively correlated with shark attacks.
Is it because we taste better with more ice cream?
What else may be causing this relationship?
CLASSIC EXAMPLE
[Diagram: the link from Ice Cream Sales to Shark Attacks is crossed out (X); High Temperatures and More Swimming are shown instead.]
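The idea of a shared underlying factor can be sketched with made-up numbers. In the Python below, both ice cream sales and shark attacks are built from temperature plus a little noise, and neither is computed from the other. All values, formulas, and names are hypothetical and exist only to illustrate the point.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical daily high temperatures (°F).
temperature = [60, 65, 70, 75, 80, 85, 90]

# Both series depend on temperature plus small made-up noise.
# Neither one is computed from the other.
ice_cream_sales = [2 * t + e for t, e in zip(temperature, [3, -2, 1, -3, 2, -1, 0])]
shark_attacks = [(t - 50) // 5 + e for t, e in zip(temperature, [0, 0, -1, 1, 0, 1, -1])]

r_ice_shark = pearson_r(ice_cream_sales, shark_attacks)
print(f"Correlation between ice cream sales and shark attacks: {r_ice_shark:.2f}")
```

The two series come out strongly positively correlated even though, by construction, the only thing connecting them is temperature.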
In some examples, there is an underlying factor that is related to both of the correlated variables.
Not always the case:
Divorce rate in Maine and US consumption of margarine per person.
US consumption of mozzarella cheese (per person) and PhDs awarded in civil engineering.
Decrease in number of pirates and increase in global warming.
Many, many more…
CORRELATION VS. CAUSATION
Correlation does not imply causation.
In some examples, there is an underlying factor that is related to both of the correlated variables, but this is not always the case.
SUMMARY
IDEA OF REGRESSION
CORRELATION IS NOT EVERYTHING
Correlation is a measure of the strength of a linear relationship, but it does not say what the linear relationship is.
The plot has two sets of data with the exact same correlation of 0.99.
However, the relationship is different between the two.
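A small sketch shows how this can happen. The two hypothetical data sets below share the same x values, and the second y is simply five times the first, so the correlation is identical while the slope of the line differs. The helper names pearson_r and slope are illustrative, and the slope uses the least-squares formula, which is assumed here as the usual way to fit a straight line.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def slope(x, y):
    """Least-squares slope of the straight line relating y to x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: y_b is y_a multiplied by 5.
x = [1, 2, 3, 4, 5]
y_a = [12, 19, 32, 38, 51]
y_b = [5 * v for v in y_a]

r_a, r_b = pearson_r(x, y_a), pearson_r(x, y_b)
slope_a, slope_b = slope(x, y_a), slope(x, y_b)
print(f"r: {r_a:.2f} vs {r_b:.2f}; slope: {slope_a:.1f} vs {slope_b:.1f}")
```

Both data sets have a correlation of about 0.99, but y rises five times faster with x in the second one.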
REGRESSION MODELING
Many people across industries devote research funding to discover how variables are related (modeling).
The simplest graphical technique to relate two quantitative variables is through a straight-line relationship – called the simple linear regression (SLR) model.
Most models are more extensive and complicated than SLR models, but SLR models form a good foundation.
BIKE DATA EXAMPLE
What if you wanted to predict the number of registered users based on the temperature outside?
What is the best guess line for the following?
SIMPLE LINEAR REGRESSION MODEL
Simple Linear Regression: y = Intercept + Slope × x
SIMPLE LINEAR REGRESSION MODEL
The intercept is the average number of registered users when the temperature equals zero (0°F).
The slope is the average change in the number of registered users for a one-degree (°F) increase in temperature.
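As a hedged sketch of how an intercept and slope can be obtained, the Python below fits a straight line by ordinary least squares, which is assumed here as the usual meaning of the "best guess" line. The four data points are hypothetical (they are not the bike rental data), and fit_line is an illustrative helper.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: temperature (°F) and number of registered users.
temperature = [30, 50, 70, 90]
registered = [2100, 2900, 3900, 5100]

intercept, slope = fit_line(temperature, registered)

# Use the fitted line to predict registered users on a 60°F day.
predicted_at_60 = intercept + slope * 60
print(f"Intercept = {intercept:.1f}, Slope = {slope:.1f}, prediction at 60°F = {predicted_at_60:.1f}")
```

For these made-up points the slope is 50, meaning about 50 more registered users for each additional degree, and the intercept of 500 is the predicted average at 0°F, a temperature well outside the range of the data used to fit the line.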
MORE COMPLICATED MODELING
Models can be more complicated than just straight-line relationships.
These models are beyond the scope of this course.
Correlation is a measure of the strength of a linear relationship, but it does not say what the linear relationship is.
The simplest graphical technique to relate two quantitative variables is through a straight-line relationship – called the simple linear regression (SLR) model.
The intercept is the average value of the y-axis variable when the x-axis variable equals zero.
The slope is the average change in the y-axis variable for a one-unit increase in the x-axis variable.
SUMMARY