Slides: Review of Data
REVIEW OF DATA
ST101 – DR. ARIC LABARR
WHAT IS/ARE DATA?
data
noun
\ˈdā - tə\
factual information used as a basis for reasoning, discussion, or calculation
Information – measurements or values describing an object, person, place, thing, etc.
Inference – using information to come to some conclusion.
Want to use the information to draw conclusions and make better decisions in the context of our problem.
Who, what, where, when, why, how?
DATA TABLE
…
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
# Registered Users
1/1/2011
Saturday
Winter
Misty
46.7
80.6
331
654
1/2/2011
Sunday
Winter
Misty
48.4
69.6
131
670
1/3/2011
Monday
Winter
Clear /
Partly Cloudy
34.2
43.7
120
1229
1/4/2011
Tuesday
Winter
Clear /
Partly Cloudy
34.5
59.0
108
1454
1/5/2011
Wednesday
Winter
Clear /
Partly Cloudy
36.8
43.7
82
1518
Observations
DATA TABLE
…
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
# Registered Users
1/1/2011
Saturday
Winter
Misty
46.7
80.6
331
654
1/2/2011
Sunday
Winter
Misty
48.4
69.6
131
670
1/3/2011
Monday
Winter
Clear /
Partly Cloudy
34.2
43.7
120
1229
1/4/2011
Tuesday
Winter
Clear /
Partly Cloudy
34.5
59.0
108
1454
1/5/2011
Wednesday
Winter
Clear /
Partly Cloudy
36.8
43.7
82
1518
Variables
QUALITATIVE (CATEGORICAL) VARIABLES
…
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
# Registered Users
1/1/2011
Saturday
Winter
Misty
46.7
80.6
331
654
1/2/2011
Sunday
Winter
Misty
48.4
69.6
131
670
1/3/2011
Monday
Winter
Clear /
Partly Cloudy
34.2
43.7
120
1229
1/4/2011
Tuesday
Winter
Clear /
Partly Cloudy
34.5
59.0
108
1454
1/5/2011
Wednesday
Winter
Clear /
Partly Cloudy
36.8
43.7
82
1518
QUANTITATIVE (NUMERICAL) VARIABLES
…
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
# Registered Users
1/1/2011
Saturday
Winter
Misty
46.7
80.6
331
654
1/2/2011
Sunday
Winter
Misty
48.4
69.6
131
670
1/3/2011
Monday
Winter
Clear /
Partly Cloudy
34.2
43.7
120
1229
1/4/2011
Tuesday
Winter
Clear /
Partly Cloudy
34.5
59.0
108
1454
1/5/2011
Wednesday
Winter
Clear /
Partly Cloudy
36.8
43.7
82
1518
GATHERING DATA
Data is everywhere.
1 zettabyte = 1,000,000,000,000 GB
With all this data being gathered and stored, we need to understand good practices of gathering data.
Data gathered without thinking ahead of time leaves itself open for problems later.
*IDC Digital Universe
GATHERING DATA
Population
Sample
Statistic
Parameter
Population – set of all objects/individuals of interest.
Sample – subset of the population that information is actually obtained.
Statistic – measures computed from a sample.
Parameter – measures computed from a population.
8
RANDOMNESS AND SAMPLING
Population
Sample
Statistic
Parameter
Having randomness helps make the sample representative of the population.
Protects us from having certain pieces ofinformation overly influence our sample.
RANDOMNESS AND SAMPLING
Population
Sample
Statistic
Parameter
Having a good representative samplemeans the inference we make from thestatistic to the parameter is reasonable!
Need good sampling to have good estimates.
Bias – certain outcomes are favored over other outcomes in samples.
2 Common Types of Bias:
Selection Bias
Undercoverage
Nonresponse
Sampling Bias
Convenience sampling
Voluntary sampling
BAD SAMPLING METHODS LEAD TO BIAS
Need good sampling to have good estimates.
4 Common Techniques:
Simple Random Sampling (SRS)
Stratified Random Sampling (STS)
Cluster Sampling
Systematic Sampling
GOOD SAMPLING METHODS
The gathering of data leads to questions around the ethical collection and use of that data.
As Christians we are held to an even higher standard around ethical considerations.
In observational studies / experiments we must keep the interest of the subject we are collecting data from at the forefront.
Collection of data:
Institutional review boards
Informed consent
Confidentiality
ETHICAL CONSIDERATIONS AROUND DATA
EXPLORING DIFFERENT TYPES OF VARIABLES
…
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
# Registered Users
1/1/2011
Saturday
Winter
Misty
46.7
80.6
331
654
1/2/2011
Sunday
Winter
Misty
48.4
69.6
131
670
1/3/2011
Monday
Winter
Clear /
Partly Cloudy
34.2
43.7
120
1229
1/4/2011
Tuesday
Winter
Clear /
Partly Cloudy
34.5
59.0
108
1454
1/5/2011
Wednesday
Winter
Clear /
Partly Cloudy
36.8
43.7
82
1518
Qualitative – explore within a category or across categories.
EXPLORING DIFFERENT TYPES OF VARIABLES
…
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
# Registered Users
1/1/2011
Saturday
Winter
Misty
46.7
80.6
331
654
1/2/2011
Sunday
Winter
Misty
48.4
69.6
131
670
1/3/2011
Monday
Winter
Clear /
Partly Cloudy
34.2
43.7
120
1229
1/4/2011
Tuesday
Winter
Clear /
Partly Cloudy
34.5
59.0
108
1454
1/5/2011
Wednesday
Winter
Clear /
Partly Cloudy
36.8
43.7
82
1518
Quantitative – explore center, spread, and “look” of variables.
EXPLORING VARIABLES VISUALLY
Qualitative Variable Plots
Pie chart – graph in which a circle is divided into sections that each represent a proportion of the whole.
Bar chart – numerical values of variables are represented by the height or length of lines or rectangles of equal width.
Quantitative Variable Plots
Line graph – uses lines to connect individual data points over time.
Scatterplot – the values of two variables are plotted along two axes, the pattern of the resulting points revealing any relationship present.
EXPLORING VARIABLES NUMERICALLY
Measures of Center or “Typical”
Mode – the mode of a variable is the most common value.
Mean – the mean of a variable is the sum of all the values divided by the number of values.
Median – value in the middle when the data items are arranged in ascending order.
Measures of Spread or Variation
Range – difference between the largest and smallest values.
Variance – measure of dispersion around the mean of the data set.
Standard deviation – the square root of the variance (helps with units of variance).
PROBABILITY
The probability that an event happens is a numerical measure of the likelihood of that event’s occurrence.
0
0.5
1
Likelihood of Occurrence
Probability
The event is
very unlikely
to happen.
The event is
equally likely
to happen as
unlikely to.
The event is
very likely
to happen.
18
LAW OF LARGE NUMBERS
The law of large numbers states that as the number of independent trials increases, in the long run the proportion for a certain event gets closer and closer to a single value (the probability of the event).
The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
Relative frequencies can be used as estimates to the probability of an event occurring.
Probability distributions for discrete random variables are best described with tables, graphs, or equations.
PROBABILITY DISTRIBUTION
Let x be the number of TV’s sold at a small department store in one day where x can only take the values of {0, 1, 2, 3, 4, 5}
We expect to sell 1.88 TV’s per day on average with variance of 2.522.
DISCRETE PROBABILITY EXAMPLE
TV’s Sold
Number of Days (Freq)
0
90
0.25
0.00
0.883
1
85
0.23
0.23
0.177
2
70
0.19
0.38
0.002
3
45
0.12
0.36
0.150
4
50
0.14
0.56
0.629
5
25
0.07
0.35
0.681
365
1.00
1.88
2.522
The binomial distribution looks at the probabilities of the number of successes occurring in the n independent trials.
The binomial probability function is comprised of two intuitive pieces:
BINOMIAL DISTRIBUTION
Number of outcomes providing exactly
x successes in n trials
Probability of a particular
sequence of trial outcomes
with x successes in n trials
PROBABILITIES ON INTERVALS
A random variable follows a uniform distribution whenever the probability is proportional to the interval’s length.
In other words, every value has an equal probability of happening.
The probability density function for the uniform distribution is:
UNIFORM PROBABILITY DISTRIBUTION
The Normal probability distribution is one of the most common and important distributions for describing a continuous random variable.
The Normal distribution is the foundation of statistical inference:
Hypothesis Testing
Confidence Intervals
Regression Analysis
Appears in nature and real-world data.
NORMAL PROBABILITY DISTRIBUTION
CHARACTERISTICS OF NORMAL DISTRIBUTION
More Probable
Less Probable
EMPIRICAL RULE
All Normal distributions can be converted into standard Normal distributions for ease of computing probabilities under the curve.
CONVERSION OF NORMAL DISTRIBUTIONS
SAMPLING ERROR
Population: 1, 3, 5, 5, 7, 9, 4, 6, 10, 2
Sample 1: 1, 10, 6, 9
Sample 2: 1, 3, 2, 5
If sample statistics (like the sample mean) had a predictable pattern,
then the errors would have a typical pattern as well!
CENTRAL LIMIT THEOREM
A point estimator cannot be expected to provide the exact value of the population parameter.
An interval estimate can be computed by adding and subtracting a margin or error to the point estimate:
The purpose of an interval estimate is to provide information about how close the point estimate is to the value of the parameter.
MARGIN OF ERROR
33
Confidence Intervals are interval estimates where we say we have a certain level of confidence in the interval.
For example, we are 95% confident that the population average daily number of total users of the bike rental company is between 4,000 and 5,000.
CONFIDENCE INTERVALS
If we were to take many samples (same size) that
each produced different confidence intervals, then
95% of them would contain the true parameter.
CONFIDENCE INTERVALS EXAMPLE
CONFIDENCE INTERVALS FOR MEANS AND PROPORTIONS
Proportions:
Means:
CONFIDENCE INTERVALS FOR MEANS AND PROPORTIONS
Proportions:
Means:
I have a coin that you believe is fair to start.
To test if this coin is fair, you ask me to flip the coin repeatedly and record the results.
No longer believe the coin is fair.
HYPOTHESIS TESTING THROUGH EXAMPLE
Flip Number
Result
P-value
1
Heads
0.50
2
Heads
0.25
3
Heads
0.125
4
Heads
0.0625
5
Heads
0.03125
NULL Hypothesis
Test Statistic
Decision on NULL Hypothesis
HYPOTHESIS TESTING
A hypothesis test uses data to help evaluate an initial claim about a parameter from the population.
There are 4 main steps to hypothesis testing:
State the hypotheses
Test statistic
P-value
Decision on null hypothesis
How likely is this to happen?
TYPE I VS. TYPE II ERRORS
Correct
Type II
Type I
Correct
TRUTH
CHOICE
Data is everywhere.
We can do amazing things with data – explore it, understand it, make inferences from it.
God gives us glimpses of this world through data.
Must use the information we find wisely and ethically.
SUMMARY
COMPLETE EXAMPLE WITH BIKE DATA
REVIEW OF DATA
You have been hired as a data analyst by IAA BikeSharing Inc. (the “Customer”) and your task is to provide key insights and recommendations about their business.
The Customer is a bike rental business, operating in Washington DC and Arlington, VA.
OVERVIEW OF PROBLEM
“Bike sharing systems are a new generation of traditional bike rentals where the whole process from membership to rental and return has become automatic. Through these systems, a user can easily rent a bike from a particular position and return it back to another position. Currently, there are over 500 bike-sharing programs around the world which are composed of over 500 thousand bicycles. Today, there exists great interest in these systems due to their important role in traffic, environment and health issues…
Compared to other transport services such as bus or subway, the duration of travel from the departure to the arrival position is explicitly recorded in these systems. This feature turns bike sharing systems in a virtual sensor network that can be used for sensing mobility in the city…”
MISSION STATEMENT
The company has provided detailed rental and environmental data for a two-year period (2011 and 2012).
The data are based on their Washington DC operations and cover measures such as daily rental counts, precipitation, day of week, season, and other variables which might have a potential impact on rental behavior.
DATA
Describe the key statistical measures of registered and casual users.
Identify any extreme observations in the counts of registered and casual users.
Do they tend to appear on certain days?
Certain seasons?
Describe any changes in registered users from 2011 to 2012.
Do you see any improvements?
Are these improvements evident across all months?
Are there any months that stand out, in terms of over- or under- performance?
KEY GOALS OF ANALYSIS – PART 1
Identify key probabilities around customers:
Probability a customer is a registered user given the season is Fall
Probability a customer is a registered user given the season is Summer
Provide interval estimates for these probabilities
The Customer’s Marketing Division also has preconceived notions on the average number of total users for the 2012 seasons as follows to help develop their marketing budgets:
The average number of total users in the Summer is no less than 6500.
The average number of total users in the Fall is no more than 6500.
Validate the above claims using the appropriate statistical tests.
KEY GOALS OF ANALYSIS – PART 2
Describe the key statistical measures of registered and casual users.
Identify any extreme observations in the counts of registered and casual users.
Do they tend to appear on certain days?
Certain seasons?
Describe any changes in registered users from 2011 to 2012.
Do you see any improvements?
Are these improvements evident across all months?
Are there any months that stand out, in terms of over- or under- performance?
KEY GOALS OF ANALYSIS – PART 1
REGISTERED VS. CASUAL USERS
Registered Users
Casual Users
MEASURES OF CENTER / TYPICAL
Registered Users
Mean = 3,656.2 users per day
Median = 3,662 users per day
Casual Users
Mean = 848 users per day
Median = 713 users per day
MEASURES OF VARIABILITY
Registered Users
Range = 6,946 – 20 = 6,926 users per day
Standard Deviation = 686.6 users per day
IQR = 4,783.3 – 2,488.5 = 2,294.8 users per day
Casual Users
Range = 3,410 – 2 = 3,408 users per day
Standard Deviation = 1,560.3 users per day
IQR = 1,096.5 – 315.3 = 781.2 users per day
Describe the key statistical measures of registered and casual users.
Identify any extreme observations in the counts of registered and casual users.
Do they tend to appear on certain days?
Certain seasons?
Describe any changes in registered users from 2011 to 2012.
Do you see any improvements?
Are these improvements evident across all months?
Are there any months that stand out, in terms of over- or under- performance?
KEY GOALS OF ANALYSIS – PART 1
REGISTERED VS. CASUAL USERS
Registered Users
Casual Users
5 HIGHEST REGISTERED USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Registered Users
9/26/2012
Wednesday
Fall
Clear /
Partly Cloudy
71.3
63.1
6,946
9/21/2012
Friday
Summer
Clear /
Partly Cloudy
68.3
66.9
6,917
10/10/2012
Wednesday
Fall
Clear /
Partly Cloudy
61.1
63.1
6,911
10/24/2012
Wednesday
Fall
Clear /
Partly Cloudy
67.3
63.6
6,898
10/3/2012
Wednesday
Fall
Misty
73.2
79.4
6,844
5 HIGHEST REGISTERED USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Registered Users
9/26/2012
Wednesday
Fall
Clear /
Partly Cloudy
71.3
63.1
6,946
9/21/2012
Friday
Summer
Clear /
Partly Cloudy
68.3
66.9
6,917
10/10/2012
Wednesday
Fall
Clear /
Partly Cloudy
61.1
63.1
6,911
10/24/2012
Wednesday
Fall
Clear /
Partly Cloudy
67.3
63.6
6,898
10/3/2012
Wednesday
Fall
Misty
73.2
79.4
6,844
5 HIGHEST REGISTERED USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Registered Users
9/26/2012
Wednesday
Fall
Clear /
Partly Cloudy
71.3
63.1
6,946
9/21/2012
Friday
Summer
Clear /
Partly Cloudy
68.3
66.9
6,917
10/10/2012
Wednesday
Fall
Clear /
Partly Cloudy
61.1
63.1
6,911
10/24/2012
Wednesday
Fall
Clear /
Partly Cloudy
67.3
63.6
6,898
10/3/2012
Wednesday
Fall
Misty
73.2
79.4
6,844
5 HIGHEST REGISTERED USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Registered Users
9/26/2012
Wednesday
Fall
Clear /
Partly Cloudy
71.3
63.1
6,946
9/21/2012
Friday
Summer
Clear /
Partly Cloudy
68.3
66.9
6,917
10/10/2012
Wednesday
Fall
Clear /
Partly Cloudy
61.1
63.1
6,911
10/24/2012
Wednesday
Fall
Clear /
Partly Cloudy
67.3
63.6
6,898
10/3/2012
Wednesday
Fall
Misty
73.2
79.4
6,844
5 HIGHEST CASUAL USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
5/19/2012
Saturday
Spring
Clear /
Partly Cloudy
68.4
45.6
3,410
5/27/2012
Sunday
Spring
Clear /
Partly Cloudy
76.0
69.7
3,283
4/7/2012
Saturday
Spring
Clear /
Partly Cloudy
54.6
25.4
3,252
9/15/2012
Saturday
Summer
Clear /
Partly Cloudy
69.1
50.1
3,160
3/17/2012
Saturday
Winter
Misty
61.1
75.6
3,155
5 HIGHEST CASUAL USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
5/19/2012
Saturday
Spring
Clear /
Partly Cloudy
68.4
45.6
3,410
5/27/2012
Sunday
Spring
Clear /
Partly Cloudy
76.0
69.7
3,283
4/7/2012
Saturday
Spring
Clear /
Partly Cloudy
54.6
25.4
3,252
9/15/2012
Saturday
Summer
Clear /
Partly Cloudy
69.1
50.1
3,160
3/17/2012
Saturday
Winter
Misty
61.1
75.6
3,155
5 HIGHEST CASUAL USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
5/19/2012
Saturday
Spring
Clear /
Partly Cloudy
68.4
45.6
3,410
5/27/2012
Sunday
Spring
Clear /
Partly Cloudy
76.0
69.7
3,283
4/7/2012
Saturday
Spring
Clear /
Partly Cloudy
54.6
25.4
3,252
9/15/2012
Saturday
Summer
Clear /
Partly Cloudy
69.1
50.1
3,160
3/17/2012
Saturday
Winter
Misty
61.1
75.6
3,155
5 HIGHEST CASUAL USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
5/19/2012
Saturday
Spring
Clear /
Partly Cloudy
68.4
45.6
3,410
5/27/2012
Sunday
Spring
Clear /
Partly Cloudy
76.0
69.7
3,283
4/7/2012
Saturday
Spring
Clear /
Partly Cloudy
54.6
25.4
3,252
9/15/2012
Saturday
Summer
Clear /
Partly Cloudy
69.1
50.1
3,160
3/17/2012
Saturday
Winter
Misty
61.1
75.6
3,155
5 LOWEST REGISTERED USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Registered Users
10/29/2012
Monday
Fall
Rain or Snow
54.8
88.0
20
1/27/2011
Thursday
Winter
Clear /
Partly Cloudy
34.1
68.8
416
12/26/2012
Wednesday
Winter
Rain or Snow
38.2
82.3
432
12/25/2011
Sunday
Winter
Clear /
Partly Cloudy
40.8
68.1
451
1/26/2011
Wednesday
Winter
Rain or Snow
36.0
86.3
472
5 LOWEST CASUAL USER DAYS
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
10/29/2012
Monday
Fall
Rain or Snow
54.8
88
2
12/26/2012
Wednesday
Winter
Rain or Snow
38.2
82.3
9
1/18/2011
Tuesday
Winter
Misty
35.9
86.2
9
1/27/2011
Thursday
Winter
Clear /
Partly Cloudy
34.1
68.8
15
1/12/2011
Wednesday
Winter
Clear /
Partly Cloudy
32.2
60.0
25
Describe the key statistical measures of registered and casual users.
Identify any extreme observations in the counts of registered and casual users.
Do they tend to appear on certain days?
Certain seasons?
Describe any changes in registered users from 2011 to 2012.
Do you see any improvements?
Are these improvements evident across all months?
Are there any months that stand out, in terms of over- or under- performance?
KEY GOALS OF ANALYSIS – PART 1
REGISTERED USERS OVER TIME
REGISTERED USERS OVER TIME
REGISTERED USERS OVER TIME
REGISTERED USERS OVER TIME
AVERAGE REGISTERED USERS YEAR OVER YEAR
AVERAGE REGISTERED USERS YEAR OVER YEAR
AVERAGE REGISTERED USERS YEAR OVER YEAR
Identify key probabilities around customers:
Probability a customer is a registered user given the season is Fall
Probability a customer is a registered user given the season is Summer
Provide interval estimates for these probabilities
The Customer’s Marketing Division also has preconceived notions on the average number of total users for the 2012 seasons as follows to help develop their marketing budgets:
The average number of total users in the Summer is no less than 6500.
The average number of total users in the Fall is no more than 6500.
Validate the above claims using the appropriate statistical tests.
KEY GOALS OF ANALYSIS – PART 2