Slides: Exploring Data
EXPLORING DATA
ST101 – DR. ARIC LABARR
Why do we organize and explore our data?
A lot of insights can be drawn from just organizing, exploring, and looking at your data.
Different types of data need to be summarized differently.
EXPLORATION
TYPES OF VARIABLES
There are two main types of variables:
Qualitative – data with a measurement scale inherently categorical.
Quantitative – data that are numeric and define a value or quantity.
EXPLORING DIFFERENT TYPES OF VARIABLES
…
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
# Registered Users
1/1/2011
Saturday
Winter
Misty
46.7
80.6
331
654
1/2/2011
Sunday
Winter
Misty
48.4
69.6
131
670
1/3/2011
Monday
Winter
Clear /
Partly Cloudy
34.2
43.7
120
1229
1/4/2011
Tuesday
Winter
Clear /
Partly Cloudy
34.5
59.0
108
1454
1/5/2011
Wednesday
Winter
Clear /
Partly Cloudy
36.8
43.7
82
1518
EXPLORING DIFFERENT TYPES OF VARIABLES
…
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
# Registered Users
1/1/2011
Saturday
Winter
Misty
46.7
80.6
331
654
1/2/2011
Sunday
Winter
Misty
48.4
69.6
131
670
1/3/2011
Monday
Winter
Clear /
Partly Cloudy
34.2
43.7
120
1229
1/4/2011
Tuesday
Winter
Clear /
Partly Cloudy
34.5
59.0
108
1454
1/5/2011
Wednesday
Winter
Clear /
Partly Cloudy
36.8
43.7
82
1518
Qualitative – explore within a category or across categories.
Qualitative – explore within a category or across categories.
Example questions:
What do Saturdays look like?
What do clear days look like in comparison to rainy days?
Is the winter different than the summer?
EXPLORING QUALITATIVE DATA
EXPLORING DIFFERENT TYPES OF VARIABLES
…
Date
Weekday
Season
Weather Type
Temperature
(°F)
Humidity
(%)
# Casual Users
# Registered Users
1/1/2011
Saturday
Winter
Misty
46.7
80.6
331
654
1/2/2011
Sunday
Winter
Misty
48.4
69.6
131
670
1/3/2011
Monday
Winter
Clear /
Partly Cloudy
34.2
43.7
120
1229
1/4/2011
Tuesday
Winter
Clear /
Partly Cloudy
34.5
59.0
108
1454
1/5/2011
Wednesday
Winter
Clear /
Partly Cloudy
36.8
43.7
82
1518
Quantitative – explore center, spread, and “look” of variables.
Quantitative – explore center, spread, and “look” of variables.
Example questions:
What is the typical temperature in my data?
Is the number of users trending up or down?
What is the range of humidity values? Are they narrow or very spread out?
EXPLORING QUANTITATIVE DATA
A lot of insights can be drawn from just organizing, exploring, and looking at your data.
Different types of data need to be summarized differently.
Qualitative – explore within a category or across categories.
Quantitative – explore center, spread, and “look” of variables.
SUMMARY
DISPLAYING QUALITATIVE DATA
EXPLORING DATA
Qualitative – explore within a category or across categories.
Example questions:
What do Saturdays look like?
What do clear days look like in comparison to rainy days?
Is the winter different than the summer?
EXPLORING QUALITATIVE DATA
Qualitative – explore within a category or across categories.
Different graphs are used for different tasks:
Pie chart – comparison across all categories (distribution of categories).
Bar chart – comparison across specific categories
Regular
Side-by-side
Stacked
EXPLORING QUALITATIVE DATA
PIE CHART
Pie chart – graph in which a circle is divided into sections that each represent a proportion of the whole.
Best used when looking to show entire distribution (or set of categories) for a qualitative variable.
PIE CHART
Pie chart – graph in which a circle is divided into sections that each represent a proportion of the whole.
Best used when looking to show entire distribution (or set of categories) for a qualitative variable.
Donut chart is basically the same thing with the center missing.
PIE CHART
Donut charts allow us to compare different groups’ distributions across all categories.
Casual
Registered
BAR CHART
Bar chart – numerical values of variables are represented by the height or length of lines or rectangles of equal width.
Best used when looking to compare specific categories to each other.
BAR CHART
Bar chart – numerical values of variables are represented by the height or length of lines or rectangles of equal width.
Best used when looking to compare specific categories to each other.
Stacked bar charts break down the first categories into subcategories.
BAR CHART
Bar chart – numerical values of variables are represented by the height or length of lines or rectangles of equal width.
Best used when looking to compare specific categories to each other.
Side-by-side bar charts look at these comparisons across multiple categories.
Qualitative – explore within a category or across categories.
Pie chart – graph in which a circle is divided into sections that each represent a proportion of the whole.
Bar chart – numerical values of variables are represented by the height or length of lines or rectangles of equal width.
SUMMARY
DISPLAYING QUANTITATIVE DATA
EXPLORING DATA
Quantitative – explore center, spread, and “look” of variables.
Example questions:
What is the typical temperature in my data?
Is the number of users trending up or down?
What is the range of humidity values? Are they narrow or very spread out?
EXPLORING QUANTITATIVE DATA
Quantitative – explore center, spread, and “look” of variables.
Different graphs are used for different tasks:
Line graph – look at how a variable changes over time.
Scatterplot – comparison between different quantitative variables.
EXPLORING QUANTITATIVE DATA
LINE GRAPH
Line graph – uses lines to connect individual data points over time.
Best when wanting to see how things change across time.
LINE GRAPH
Line graph – uses lines to connect individual data points over time.
Best when wanting to see how things change across time.
LINE GRAPH
Line graph – uses lines to connect individual data points over time.
Best when wanting to see how things change across time.
LINE GRAPH
Line graph – uses lines to connect individual data points over time.
Best when wanting to see how things change across time.
SCATTERPLOT
Scatterplot – the values of two variables are plotted along two axes, the pattern of the resulting points revealing any relationship present.
Best used when trying to explore relationships between two quantitative variables.
SCATTERPLOT
Scatterplot – the values of two variables are plotted along two axes, the pattern of the resulting points revealing any relationship present.
Best used when trying to explore relationships between two quantitative variables.
Quantitative – explore center, spread, and “look” of variables.
Line graph – uses lines to connect individual data points over time.
Scatterplot – the values of two variables are plotted along two axes, the pattern of the resulting points revealing any relationship present.
SUMMARY
DESCRIBING CENTER
EXPLORING DATA
When exploring data, a good summary of a variable might be what a “typical” value of that variable would look like.
What is “typical” really mean?
Qualitative variable – most common category.
Quantitative variable – focus on the center of the values of the variable.
“TYPICAL” VALUE
MODE
Mode – the mode of a variable is the most common value.
Typically reported with qualitative variables more than quantitative variables.
In our data, the “typical” weather day (according to mode) is clear or cloudy.
MEAN
Mean – the mean of a variable is the sum of all the values divided by the number of values.
Number of the
observations in
the sample
MEAN
Mean – the mean of a variable is the sum of all the values divided by the number of values.
Sum of the values
of the n observations
MEAN
Mean – the mean of a variable is the sum of all the values divided by the number of values.
MEAN
Mean – the mean of a variable is the sum of all the values divided by the number of values.
In our data, the “typical” weather day (according to mean) is 59.51°F.
MEDIAN
Median – value in the middle when the data items are arranged in ascending order.
For an odd number of observations:
26
30
27
22
24
29
13
Original Data
MEDIAN
Median – value in the middle when the data items are arranged in ascending order.
For an odd number of observations:
26
30
27
22
24
29
13
Original Data
13
22
24
26
27
29
30
Ascending Order
MEDIAN
Median – value in the middle when the data items are arranged in ascending order.
For an odd number of observations:
26
30
27
22
24
29
13
13
22
24
26
27
29
30
Median
MEDIAN
Median – value in the middle when the data items are arranged in ascending order.
For an even number of observations:
26
30
27
22
24
29
13
13
22
24
26
27
29
30
27
27
MEDIAN
Median – value in the middle when the data items are arranged in ascending order.
In our data, the “typical” weather day (according to median) is 59.76°F.
MEAN VS. MEDIAN
Whenever a data set has extreme values, the median is the preferred measure of center.
Mean is bothered by extreme values, while median is not.
26
30
27
22
24
29
13
13
22
24
26
27
29
30
Median = 26
Mean = 24.42
MEAN VS. MEDIAN
Whenever a data set has extreme values, the median is the preferred measure of center.
Mean is bothered by extreme values, while median is not.
26
300
27
22
24
29
13
13
22
24
26
27
29
300
Median = 26
Mean = 63
When exploring data, a good summary of a variable might be what a “typical” (or center) value of that variable would look like.
Mode – the mode of a variable is the most common value.
Mean – the mean of a variable is the sum of all the values divided by the number of values.
Median – value in the middle when the data items are arranged in ascending order.
SUMMARY
DESCRIBING SPREAD
EXPLORING DATA
Center can only get you so far with describing a variable’s “typical” value.
Typically, we also consider how spread out a data set is as well.
This is called variability or dispersion.
MEASURES OF VARIABILITY
RANGE
Range – difference between the largest and smallest values.
Highest temperature = 90.5°F
Lowest temperature = 22.6°F
RANGE
Range – difference between the largest and smallest values.
Highest temperature = 90.5°F
Lowest temperature = 22.6°F
Range = 90.5 – 22.6 = 67.9°F
RANGE
Range – difference between the largest and smallest values.
Very sensitive to observations with extreme values as it only focuses on the largest and smallest values in the data.
Highest temperature = 90.5°F
Lowest temperature = 22.6°F
Range = 90.5 – 22.6 = 67.9°F
RANGE
Range – difference between the largest and smallest values.
In our data, the “spread” of temperature (according to range) is 67.9°F.
Highest temperature = 90.5°F
Lowest temperature = 22.6°F
Range = 90.5 – 22.6 = 67.9°F
VARIANCE
Variance – measure of dispersion around the mean of the data set.
Average of the squared distances between each data value and mean.
VARIANCE
Variance – measure of dispersion around the mean of the data set.
Average of the squared distances between each data value and mean.
Temperature
(°F)
46.7
-12.8
163.8
48.4
-11.1
123.2
34.2
-25.3
640.1
34.5
-25.0
625.0
36.8
-22.7
515.3
…
…
…
72.5
13.0
169.0
VARIANCE
Variance – measure of dispersion around the mean of the data set.
Average of the squared distances between each data value and mean.
Temperature
(°F)
46.7
-12.8
163.8
48.4
-11.1
123.2
34.2
-25.3
640.1
34.5
-25.0
625.0
36.8
-22.7
515.3
…
…
…
72.5
13.0
169.0
Add together and divide by 730 (= 731 – 1)
VARIANCE
Variance – measure of dispersion around the mean of the data set.
In our data, the “spread” of temperature (according to variance) is 239.8°F squared.
Temperature
(°F)
46.7
-12.8
163.8
48.4
-11.1
123.2
34.2
-25.3
640.1
34.5
-25.0
625.0
36.8
-22.7
515.3
…
…
…
72.5
13.0
169.0
VARIANCE
Variance – measure of dispersion around the mean of the data set.
In our data, the “spread” of temperature (according to variance) is 239.8°F squared.
Temperature
(°F)
46.7
-12.8
163.8
48.4
-11.1
123.2
34.2
-25.3
640.1
34.5
-25.0
625.0
36.8
-22.7
515.3
…
…
…
72.5
13.0
169.0
???
STANDARD DEVIATION
The problem with variance is that it is in terms of squared units of the data.
To correct for this, we have the standard deviation, which is just the square root of the variance.
STANDARD DEVIATION
The problem with variance is that it is in terms of squared units of the data.
To correct for this, we have the standard deviation, which is just the square root of the variance.
In our data, the “spread” of temperature (according to standard deviation) is 15.5°F.
Variance (and standard deviation) possess two common characteristics:
If the variance equals zero, then all of the data in the data set has the same value.
All measures of spread are positive (or nonnegative if zero spread) in value.
2 CHARACTERISTICS OF VARIANCE
Don’t only look at center, but also variability of a variable.
Range – difference between the largest and smallest values.
Variance – measure of dispersion around the mean of the data set.
Standard deviation – the square root of the variance (helps with units of variance).
SUMMARY