Welcome to this section of the course. In this section, we're going to be talking about relationships in data. One of the hardest things about teaching an introductory course is selling how impactful these ideas are when we can't cover all of the great uses they have. Again, we're just laying the foundation of statistics here. So in this section we're going to talk briefly about things like boxplots and scatterplots, and how they set up the more advanced analysis that we'll touch on at the very end of this course. We have to lay the foundation for that advanced analysis throughout the rest of the course.

Remember what we've talked about previously. Exploring data reveals potential insights, and those insights provide valuable uses of information. A lot of the time we explored our data with the help of visuals. We talked about distributions, bar charts, stacked bar charts, boxplots, and scatterplots, and then we started asking: are visuals actually enough? That's what this section is really going to explore: how can we use some of these visuals, like boxplots and scatterplots, to inform more advanced thinking about analysis?

So let's jump into the first one. Let's talk about boxplots. Remember, you've seen this boxplot before. It shows the daily users by season, with each of the seasons along the bottom on the x axis: spring, summer, fall, and winter. You can get an idea of the distribution of the number of daily users for each one of those seasons. Previously, we talked about those X's in the middle of the boxplot that told us the mean, the average, and then we talked about this idea of spread. But let's actually define all of the points in the boxplot. Boxplots are a visual representation of what we sometimes call a five number summary.
A five number summary of a set of data consists of the following five numbers: the minimum value, the first quartile, the median, the third quartile, and the maximum value. We're going to talk in more detail about each one of these, so let's start with the minimum value.

The minimum value of a variable really is exactly how it sounds. It's just the value that's numerically lowest. If you were to line up all of the values of a variable from smallest to largest, it would be the first one: the lowest value, the smallest value, the furthest to the left. So if we were to look at the lowest temperatures from our bike rental data set, lined up from smallest all the way up, you would see that the first four are 22.6 degrees, 25.8 degrees, another 25.8 degrees, and 26.7 degrees, and so on. The minimum value is the smallest out of all of these, so here it is 22.6 degrees.

Now, the minimum value makes sense intuitively. Let's talk about the idea of quartiles and percentiles, though. You may not have ever heard of a quartile, and two of the five numbers in a five number summary are quartiles: the first quartile and the third quartile. But to understand a quartile, we need to understand a percentile.

A percentile provides information about how the data are spread over a certain interval, again from the smallest all the way up to the largest value. The pth percentile of a data set is a value such that at least p percent of the items take on that value or less, and the opposite is also true: at least 100 minus p percent of the items take on that value or more. For example, you've probably heard something along the lines of: a student's test score was in the 93rd percentile. Well, what does that mean? That means that 93% of all test scores were the same as or below that student's test score. Another way of thinking about it is that only 7% of test scores were higher than this student's test score. Another example might be height. Maybe you know you're in the 80th percentile of height. That would mean that 80% of the people in the population are your height or shorter. Of course, that would also mean that 20% of the people in the population are taller than you. That's the idea of a percentile. So, again, although it sounds complicated initially, it really is intuitive, and something you've probably used already in the past.

Okay, well, if those are percentiles, then what are quartiles? Quartiles are very specific percentiles that are commonly used. The first quartile is the 25th percentile: 25% of the values in a variable are below that point, and 75% are above it. The second quartile is the 50th percentile: half of your data is below that point, and half of your data is above it. We have already defined something for this, though. The second quartile has a very specific name, and that name is the median. Remember what the median was from our previous lectures. The median was just the halfway point of your data. Half of your data was below it, and half of your data was above it. So that's the same thing as the second quartile.
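The percentile idea above can be sketched in a few lines of Python. This is only an illustration: the test scores are made up, and `percent_at_or_below` is a hypothetical helper name, not something from the course materials. It counts what share of the values are the same as or below a given score.

```python
def percent_at_or_below(values, score):
    """Return the percentage of values that are less than or equal to score."""
    count = sum(1 for v in values if v <= score)
    return 100 * count / len(values)


# Hypothetical test scores, invented for illustration only.
scores = [55, 62, 68, 70, 74, 77, 81, 85, 90, 96]

pct = percent_at_or_below(scores, 90)
print(f"{pct:.0f}% of scores are at or below 90")
print(f"{100 - pct:.0f}% of scores are above 90")
```

With these ten made-up scores, a score of 90 has nine of the ten scores at or below it, so it sits at roughly the 90th percentile, and only one score in ten is higher.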
The second quartile is where half of your data is below and half of your data is above. The third quartile is the 75th percentile: 75% of your data is below that point, and 25% is above it.

So, how do we do a quartile calculation? Quartiles are calculated in a similar way to the median. For the median, we lined up all of our data from smallest to largest and picked the number in the middle, so that half of our data is below it and half of our data is above it. We can do the same thing with quartiles. Take the bottom half of the data and pick the number in the middle: that is the first quartile. Take the top half of the data and pick the number in the middle: that is the third quartile. Now, we're not going to get too much into the calculation of this. Once you start calculating things beyond just medians, we typically let computers do the calculation.

So I'll go ahead and report the quartiles for the temperatures in our bike rental data set. The first quartile is 46.08 degrees. In other words, 25% of all the days in our data set have a temperature of 46.08 degrees Fahrenheit or lower. The median, if you remember from a previous lecture, was 59.76 degrees, or in other words, half of the days in our data set have a temperature that is 59.76 degrees or lower. The third quartile is 73.08 degrees, or in other words, 75% of the days in our data set have a temperature that is 73.08 degrees or lower.

Quartiles provide us with an extra piece of information called the interquartile range, or IQR for short. The interquartile range of a data set is the difference between the third quartile and the first quartile. You can think about this as the middle 50% of the data. Remember, the first quartile has 25% of the data below it, and the third quartile has 75% below it, or 25% above it. So what's in between the first and third quartiles is the middle 50% of your data. If you want to get an idea of what the middle of your data looks like, the interquartile range can provide that information for you. Just like the median, the interquartile range is not bothered by extreme observations, by outliers in the tails of your data set, so it can still give you a good sense of your data. For our temperature data, for example, the interquartile range would be 73.08 degrees minus 46.08 degrees, which is 27 degrees. Wonderful, so we have a spread of 27 degrees for our interquartile range.

Awesome, one more number to go. The maximum value of a variable is numerically the highest value of that variable. So, if the minimum is the smallest value, the maximum is the largest value. Again, if we were to rank our observations from smallest to largest, you would pick the highest one. For temperatures in our bike rental data set, our maximum value is 90.5 degrees.

And there we go, we have our five numbers. With those five numbers, we can create the boxplot. The maximum value, 90.5 degrees, is the upper line, the line at the very top. The minimum value, 22.6 degrees, is the line at the very bottom. The box in the middle, those are your quartiles.
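Here is a minimal sketch of the "middle of each half" calculation described above, assuming one common convention: when the number of values is odd, the median itself is left out of both halves. The temperatures are made up, `five_number_summary` is a hypothetical helper name, and real software often uses slightly different quartile rules, so its answers may not match this sketch exactly.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves idea.

    Assumes at least two values. For an odd count, the middle value is
    excluded from both halves (one convention among several).
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])


# Hypothetical temperatures, invented for illustration only.
temps = [30, 41, 45, 52, 58, 60, 66, 71, 75, 83, 88]

minimum, q1, med, q3, maximum = five_number_summary(temps)
iqr = q3 - q1  # interquartile range: third quartile minus first quartile
print(minimum, q1, med, q3, maximum, iqr)
```

For these eleven made-up values the bottom half is 30 through 58 and the top half is 66 through 88, so the quartiles are simply the middle numbers of those two halves.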
The box is defined by your interquartile range, from 46.08 degrees to 73.08 degrees, with our median being the line in the middle of that box. So again, with these five numbers, we've created the entire boxplot that you see here on the right. We have the box in the middle, defined by your quartiles, with the median splitting it as the line inside the box. Then we have whiskers extending all the way out to the maximum and the minimum. So sometimes you'll hear this referred to as a box and whisker plot. Same idea.

Now, I mentioned previously that although a five number summary is what a boxplot typically is, some people throw in a little bit of extra information. For example, we can also show the mean; that's the X on a boxplot. It's not a traditional thing. Boxplots are just a summary of the five main numbers, but it is nice to be able to also see the mean on a boxplot. It will typically be denoted with some kind of X or a line in the middle.

But that's not the only thing that boxplots can reveal to us. They can also reveal things that we call outliers. If you notice here on our boxplot for daily users by season, we have a couple of points that go beyond what we thought were the maximum and the minimum. These points, called outliers, sit so far away from the main part of the box that we want to label them specifically. But how do we determine how far away is far enough?

When we show outliers on a boxplot, the minimum and maximum lines become the smallest and largest values within some kind of outlier boundary. Anything outside of this boundary, high or low, is considered an outlier. Looking back here, these outliers are actual data points. We had a really, really low day in the fall and a really, really high day in the winter. For winter, that point is the actual maximum value. For fall, that point is the actual minimum value. But sometimes people stop the maximum and minimum lines at an outlier boundary to really flag points that just look different.

So how do we define this outlier boundary? It's defined by something we call the one and a half IQR rule, or the one and a half interquartile range rule. Remember, we've talked about the idea of an interquartile range: it's the spread of the box in the middle. If you take the first quartile, the bottom part of the box, and subtract off one and a half times the length of the box, the interquartile range, that is the lower boundary for an outlier. If you take the third quartile and add one and a half times the interquartile range, that is the upper boundary for an outlier.

Let me show you with some numbers. It may help you see things a little bit better. Remember, we have our five numbers here that defined our boxplot for temperatures. The interquartile range for our temperature data was 27 degrees. So if we take 46.08 degrees, that's our first quartile, and subtract off one and a half times 27, which is 40.5, then we are left with 5.58 degrees.
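The one and a half IQR rule can be written as a short Python sketch using the quartiles reported in this lecture. The function name `outlier_boundaries` and the parameter `k` are hypothetical names chosen for illustration; only the quartile values come from the lecture.

```python
def outlier_boundaries(q1, q3, k=1.5):
    """Return the (lower, upper) outlier boundaries from the two quartiles."""
    iqr = q3 - q1  # interquartile range, the length of the box
    return q1 - k * iqr, q3 + k * iqr


# Quartiles for temperature as reported in the lecture (degrees Fahrenheit).
lower, upper = outlier_boundaries(46.08, 73.08)
print(f"lower: {lower:.2f}, upper: {upper:.2f}")
# prints "lower: 5.58, upper: 113.58"
```

Any temperature below the lower boundary or above the upper boundary would be drawn as its own point rather than as part of a whisker.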
Now, we don't have any temperatures in our data set that are lower than 5.58 degrees, so the bottom line only goes down to the minimum value. But let's imagine you had a temperature of two degrees. Then that bottom line would stop at the lowest temperature still inside the boundary, no lower than 5.58, and the two would be left as a little circle down at the value of two, signifying that this two degrees is really, really far away from the rest of your data. It looks like an outlier.

We can do the same thing up top. If we take our third quartile, 73.08 degrees, and add one and a half times 27, we get 113.58 degrees. In other words, if we had a temperature higher than 113.58 degrees in our data set, it would be an outlier. Again, the maximum temperature we have in our data set is only 90.5 degrees, so the top of our boxplot only goes up to that value. But at least it gives you an idea of what these ranges look like, and why we sometimes see those little dots on the plots. So in our bike temperature data, the outlier boundary is 5.58 to 113.58 degrees. Any temperature outside of that is considered an outlier. We don't have any in our temperature variable, but we do have some in daily users. We can see a really small count of daily users in the fall that looks abnormally small, and a really large count of daily users in the winter that looks abnormally large. That's why you see these two little dots.

Awesome, so let's summarize. Now you know how to read a boxplot. A boxplot is just a visual representation of a five number summary, sometimes with a little bit more. The main five numbers of a boxplot are the minimum, the first quartile, the median, the third quartile, and the maximum. However, sometimes people will also add the average, or mean, as well as outliers, if they want to define some kind of outlier boundary using the one and a half IQR rule. So that is the end of this lecture, and I look forward to seeing you next time.
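As a recap, the sketch below applies the one and a half IQR rule to a small list of daily user counts and separates the outliers from the values the whiskers would reach. The counts are invented for illustration and are not the course's bike rental data, and `split_outliers` is a hypothetical helper name. The quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the rule your own software uses.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Return (whisker_low, whisker_high, outliers) under the 1.5 IQR rule."""
    q1, _median, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low_bound = q1 - k * iqr
    high_bound = q3 + k * iqr
    # Whiskers reach the most extreme values still inside the boundary.
    inside = [v for v in values if low_bound <= v <= high_bound]
    outliers = sorted(v for v in values if v < low_bound or v > high_bound)
    return min(inside), max(inside), outliers


# Hypothetical daily user counts, invented for illustration only.
daily_users = [120, 2100, 2300, 2500, 2600, 2700, 2800, 2900, 3100, 3300, 6000]

whisker_low, whisker_high, outliers = split_outliers(daily_users)
print(whisker_low, whisker_high, outliers)
```

In this made-up example the very small count and the very large count fall outside the boundary, so they would appear as the two little dots, while the whiskers stop at the nearest values inside it.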



Last modified: Monday, 1 June 2026, 13:57