Video Transcript: Relationships in Data - Part 1
Welcome to this section of the course. In this section, we're going to be talking about relationships in data. One of the hardest things to do when teaching an introductory course is to sell the idea of how impactful this material is when we don't cover all of the great impacts it can have. Again, we're just laying the foundation of statistics here, so what we're going to do in this section is talk briefly about things like boxplots and scatterplots and how they set up the more advanced analysis that we'll touch on at the very end of this course. We have to lay the foundation for that advanced analysis throughout the rest of this course. Remember what we've talked about previously: exploring data reveals potential insights, and those insights provide valuable uses of information. We explored our data a lot of times with the help of visuals. We talked about distributions, bar charts, stacked bar charts, boxplots, and scatterplots, and then we started thinking: are visuals actually enough? That's what this section of the course is really going to explore: how can we use some of these visuals, like boxplots and scatterplots, to help inform more advanced thoughts around analysis? So let's jump into the first one. Let's talk about boxplots. Remember, you've seen this boxplot before. It shows the daily users by season, with the seasons along the bottom on the x axis (spring, summer, fall, and winter), and you can get an idea of the distribution of the number of daily users for each one of those seasons. Before, we talked about those x's in the middle of the boxplot that told us the mean, the average, and then we talked about this idea of spread, but let's actually define all the points in the boxplot. Boxplots are a visual representation of what we sometimes call a five number summary.
A five number summary of a set of data consists of the following five numbers: the minimum value, the first quartile, the median, the third quartile, and the maximum value. We're going to talk in more detail about each one of these, so let's talk first about the minimum value. The minimum value of a variable is exactly how it sounds. It's just the value that's numerically lowest, so if you were to line up all of the values of a variable from smallest to largest, it would be the first one. It is the lowest value, the smallest value, the furthest to the left if you lined them all up from smallest to largest. So if we were to look at the lowest temperatures from our bike rental data set, again from smallest all the way up, you would see that the four lowest temperatures are 22.6 degrees, 25.8 degrees, another 25.8 degrees, and 26.7 degrees, and so on. Well, the minimum value is going to be the smallest out of all of these, 22.6 degrees, so again it is the lowest value in our data set. Now, the minimum value makes sense intuitively. Let's talk about the idea of quartiles and percentiles, though. You may not have ever heard of a quartile, and two of the five numbers in a five number summary are quartiles, the first quartile and the third quartile. But to understand a quartile, we need to understand a percentile. A percentile provides information
about how the data are spread over a certain interval, again from the smallest all the way up to the largest value. The pth percentile of a data set is a value such that at least p percent of the items take on that value or less, and the opposite is also true: at least 100 minus p percent of the items take on that value or more. For example, you've probably heard something before along the lines of a student's test score being in the 93rd percentile. Well, what does that mean? That means that 93% of all test scores were the same as or below that student's test score. Of course, if 93% of all test scores were the same as or below that student's test score, another way of thinking about it is that only 7% of test scores were higher than this student's test score. Another example might be height. Maybe you know you're in the 80th percentile of height. That would mean that 80% of the people in the population are your height or shorter. Of course, that would also mean that 20% of the people in the population are taller than you. That's the idea of a percentile. So, again, although it sounds complicated initially, it really is intuitive, and something you've probably used already in the past. Okay, well, if those are percentiles, then what are quartiles? Well, quartiles are very specific percentiles that are commonly used. The first quartile is basically the 25th percentile: 25% of the values in a variable are below that point, and 75% are above it. The second quartile is the 50th percentile. Basically, half of your data is below that point, and half of your data is above it. We have already defined something for this, though. The second quartile has a very specific name. That name is the median. Remember what the median was from our previous lectures. The median was just the halfway point of your data. Half of your data was below it, half of your data was above it. So that's the same thing as the second quartile.
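To make the percentile idea concrete, here is a minimal Python sketch that counts what share of values fall at or below a given value. The function name `percent_at_or_below` and the list of test scores are made up for illustration; they are not from the course data.

```python
def percent_at_or_below(values, x):
    """Return the percentage of values that are less than or equal to x."""
    if not values:
        raise ValueError("values must not be empty")
    count = sum(1 for v in values if v <= x)
    return 100.0 * count / len(values)


# Hypothetical test scores, invented purely for illustration.
scores = [55, 61, 67, 70, 72, 75, 78, 80, 84, 88, 90, 91, 93, 95, 99]
student_score = 95

pct = percent_at_or_below(scores, student_score)
print(f"{pct:.1f}% of scores are at or below {student_score}")
print(f"{100 - pct:.1f}% of scores are higher")
```

With these invented scores, 14 of the 15 values are at or below 95, so that student sits at roughly the 93rd percentile, and only one score is higher.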
The second quartile is where half of your data is below and half of your data is above. The third quartile is the 75th percentile: again, 75% of your data is below that point, and 25% is above it. So, how do we do a quartile calculation? Well, quartiles are calculated in a similar way to how we calculated the median. We have to line up all of our data from smallest to largest. For the median, we picked the number in the middle, and we basically said, all right, half of our data is below and half of our data is above, so that is our median value. We can do the same thing with quartiles. We can take the bottom half of our data and pick the number in the middle, and that would be the first quartile. Take the top half of our data, pick the number in the middle, and that would be the third quartile. Now, we're not going to get too much into the calculation of this. When you start calculating things beyond just medians, we typically let computers do the calculation. So I'll go ahead and report the quartiles for you for our bike rental data set. When it comes to temperatures, our bike rental data set has a first quartile of 46.08 degrees. In other words, 25% of all the days in our data set have a temperature of 46.08 degrees Fahrenheit or lower. The median, if you remember from a previous lecture, was 59.76 degrees, or in other words, half of the days in our data set have a temperature that is 59.76 degrees or lower. The
third quartile is 73.08 degrees, or in other words, 75% of the days in our data set have a temperature that is 73.08 degrees or lower. Quartiles provide us with an extra piece of information. We call this the interquartile range, or IQR for short. The interquartile range of a data set is the distance, or the difference, between the third and the first quartile. You can think about this as the middle 50% of the data. Remember, the first quartile has 25% of the data below it, and the third quartile has 75% below it, or 25% above it. Then what's in between the first and third quartiles would be the middle 50% of your data. So, if you wanted to get an idea of, okay, what does my data look like in terms of the middle of my data? Well, the interquartile range can provide that information for you. Just like the median, the interquartile range is not bothered by extreme observations, by outliers in the tails of your data set. It can still give you a good sense of your data. For our temperature data, for example, the interquartile range would be 73.08 degrees minus 46.08 degrees. This would be an interquartile range of 27 degrees. Wonderful, so we have a spread of 27 degrees for our interquartile range. Awesome, one more number to go. The maximum value of a variable is numerically the highest value of that variable. So, again, if the minimum is the smallest value, the maximum is the largest value. If we were to rank our observations from smallest to largest, you would pick the highest one. For temperatures in our bike rental data set, our maximum value would be 90.5 degrees. And there we go, we have our five numbers. With those five numbers, we can create the boxplot. The maximum value, 90.5 degrees, is the upper line, the line at the very, very top. The minimum value, 22.6 degrees, is the line at the very, very bottom. With that being the case, the box in the middle, those are your quartiles.
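As a rough illustration of the "median of each half" idea behind the five number summary, here is a small Python sketch. The function name `five_number_summary` and the temperature list are hypothetical, and one assumption is made that the lecture does not specify: when there is an odd number of values, the middle value is left out of both halves. Statistical software often interpolates between values instead, so its quartiles can differ slightly from this sketch.

```python
from statistics import median


def five_number_summary(values):
    """Minimum, Q1, median, Q3, and maximum using the median-of-halves method.

    Assumption: with an odd number of values, the middle value is excluded
    from both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }


# Hypothetical temperatures, not the actual bike rental data.
temps = [30, 41, 45, 52, 58, 60, 66, 71, 75, 82, 90]

summary = five_number_summary(temps)
iqr = summary["q3"] - summary["q1"]  # width of the box, the middle 50%
print(summary)
print("IQR:", iqr)
```

For the bike rental temperatures, the same subtraction is the 73.08 minus 46.08 from the lecture, which gives the interquartile range of 27 degrees.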
The box is defined by your interquartile range, your 46.08 degrees and your 73.08 degrees, with our median being the line inside that box. So again, with these five numbers, we've created the entire boxplot that you see here on the right. We have the box in the middle, again defined by your quartiles, with the median splitting it with the line inside the box. Then we have whiskers extending all the way out to the maximum and the minimum. So sometimes you'll hear this referred to as a box and whisker plot. Same idea. Now, I mentioned previously that although a boxplot is typically a five number summary, some people throw in a little bit of extra information. For example, we can always show the mean as well. That's the X on a boxplot. Again, it's not a traditional thing. Boxplots are just a summary of the five main numbers, but it is nice to be able to also see the mean on a boxplot. It'll typically be denoted with some kind of X or a line in the middle, but that's not the only thing that boxplots can reveal to us. They can also reveal to us things that we call outliers. If you notice here on our boxplot for daily users by season, we have a couple of points that go beyond what we thought were the maximum and the minimum. You see, these points, called outliers, go so far away from the main part of the box that we
want to label them specifically, but how do we determine how far away is far enough away? When we have outliers on a boxplot, the minimum and maximum lines now mark the smallest and largest values that fall within some kind of outlier boundary. Anything outside of this boundary, high or low, is considered an outlier. So, again, looking back here, these outliers are actual data points. For the fall, we had a really, really low day. For the winter, we had a really, really high day. For winter, that would be the actual maximum value. For the fall, that would be the actual minimum value. But sometimes people use an outlier boundary to limit the maximum and minimum lines, to really signify points that just look different. So how do we define this outlier boundary? It's defined by something we call the one and a half IQR rule, or the one and a half interquartile range rule. Remember, we've talked about the idea of an interquartile range. It's the spread of the box in the middle. Well, if you were to take the first quartile, the bottom part of the box, and subtract off one and a half times the actual width of the box, the interquartile range, that would be the lower boundary for an outlier. If you were to take the third quartile and add one and a half times the interquartile range, that would be the upper boundary for an outlier. Let me show you with some numbers. It may help you see things a little bit better. Remember, we have our five numbers here that defined our boxplot for temperatures. The interquartile range for our temperature data was 27 degrees, so if we were to take 46.08 degrees, that's our first quartile, and subtract off one and a half times 27, which is 40.5, then we would be left with 5.58 degrees.
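The one and a half IQR rule just described can be sketched in a few lines of Python. The quartiles 46.08 and 73.08 come from the lecture; the function names `outlier_bounds` and `find_outliers` and the short sample list (which includes an invented 2 degree day) are hypothetical and only there to show the rule in action.

```python
def outlier_bounds(q1, q3, k=1.5):
    """Return (lower, upper) outlier boundaries using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the outlier boundaries."""
    low, high = outlier_bounds(q1, q3, k)
    return [v for v in values if v < low or v > high]


# Quartiles reported in the lecture for the bike rental temperatures.
q1, q3 = 46.08, 73.08
low, high = outlier_bounds(q1, q3)
print("lower boundary:", round(low, 2))
print("upper boundary:", round(high, 2))

# Hypothetical sample: the 2.0 is an invented extreme day for illustration.
sample = [2.0, 22.6, 59.76, 90.5]
print("outliers:", find_outliers(sample, q1, q3))
```

Keeping the multiplier `k` as a parameter is a design choice for this sketch only; the lecture uses one and a half throughout.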
Now, we don't have any temperatures in our data set that are lower than 5.58 degrees, so the bottom line will only go down to the minimum value. But let's imagine you had a temperature of two degrees. Then that bottom line would stop at the lowest temperature that is still above 5.58, and two would be left as a little circle down there at the value of two, signifying that this two degrees is really, really far away from the rest of your data. It looks like an outlier. We can do the same thing up top. If we take our third quartile, 73.08 degrees, and add one and a half times 27, we will have 113.58 degrees. In other words, if we had a temperature higher than 113.58 degrees in our data set, it would be an outlier. Now, again, the maximum temperature we have in our data set is only 90.5 degrees, so the top of our boxplot will only go up to that value, but at least it gives you an idea of what these ranges look like and why we sometimes see those little dots on the plots. So again, in our bike temperature data, this outlier boundary is 5.58 to 113.58 degrees. Any temperature outside of that is considered an outlier. Again, we don't have any in our temperature variable, but we do have some in daily users. We can see a really small count of daily users in the fall that looks abnormally small, and a really large count of daily users in the winter that looks abnormally large. That's why you see these two little dots. Awesome, so let's summarize. Now you know how to read a boxplot. A boxplot is just a visual representation
of a five number summary, sometimes with a little bit more. The main five numbers of a boxplot are the minimum, the first quartile, the median, the third quartile, and the maximum values. However, sometimes people will also put on there the average, or the mean, as well as outliers, if they want to define some kind of outlier boundary using the one and a half IQR rule. So that is the end of this lecture, and I look forward to seeing you next time.