Welcome. Let's continue our discussion of relationships in data by talking about the idea of analysis of variance. We've seen this chart before. It's a bar chart comparing the average total users by season. When we looked at this chart previously, we wanted to say that winter has a lower number of total users on average; it definitely seems to be lower than at least all the others. We've looked at this a few different times, with numbers as well as visualizations. But remember our last lecture. In our last lecture, we talked about boxplots, and boxplots help reveal the variability that exists within a specific category. So, when just looking at these averages, winter's average does seem noticeably lower. However, we can see that winter still has a wide spread in terms of daily users: it can be anywhere from around 500 daily users all the way up to almost 8000 daily users. So how can we reconcile this idea that winter looks lower on average, but in terms of its spread doesn't look all that different from the other seasons?

That's the idea of how we compare averages statistically. When comparing averages between two groups of data, or even more than two groups of data, we must think about how spread out the data actually is, and the five number summary is a great way of summarizing that spread. So let's just compare two groups. Let's think about summer and compare it to winter. Instead of just looking at the average in summer and the average in winter, let's look at the five number summary for summer and for winter. Let's look, essentially, at a boxplot with numbers. So, summer's five number summary: remember, the minimum, the first quartile, the median, the third quartile, and the maximum. Those are displayed on the left-hand side.
Now, I've also added in that sixth number, the mean, just because that's what we're actually comparing. On the right-hand side, you see winter's five number summary: the minimum, the first quartile, the median, the third quartile, and the maximum. By putting the mean on winter's summary as well, we can see that the average value for winter is 2604.1 daily users, versus an average value for summer of 5644.3 daily users. Winter is definitely lower, or at least appears much lower. But again, how much lower is it when we account for the spread? Let's take a look.

The minimum value on the five number summary for summer is 1115 users. We had a day in summer where only a little over 1000 users actually used our bike rental service. However, if we look on the right-hand side at the maximum value for winter, we see that there was one day in winter where we had almost 8000 people use it; specifically, 7836 people used our bike rental service. So yes, on average summer looks to be higher than winter, but boy, there's a lot of spread, and there's some definite overlap in this spread. So, how do we reconcile this? How can we legitimately make an argument that one average is bigger than another when we know that there's a lot of overlap between these two groups? When you want to compare averages statistically, we call this analysis of variance.
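To make the "boxplot with numbers" idea concrete, here is a minimal sketch in Python of how a five number summary plus the mean could be computed for two groups. The daily counts are hypothetical, made up for illustration; only the summer minimum (1115) and the winter maximum (7836) are taken from the lecture. The function name `six_number_summary` is also just an illustrative choice.

```python
from statistics import mean, quantiles


def six_number_summary(values):
    """Five number summary (min, Q1, median, Q3, max) plus the mean."""
    data = sorted(values)
    # The "inclusive" method interpolates between the observed minimum and
    # maximum; other quartile conventions can give slightly different Q1/Q3.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "mean": mean(data),
    }


# Hypothetical daily user counts, NOT the lecture's data set.
# Only 1115 (summer minimum) and 7836 (winter maximum) come from the lecture.
summer = [1115, 4200, 5100, 5600, 6300, 7100, 8400]
winter = [500, 1500, 2100, 2600, 3300, 4400, 7836]

summer_summary = six_number_summary(summer)
winter_summary = six_number_summary(winter)

# The two groups overlap when each group's minimum sits at or below
# the other group's maximum.
ranges_overlap = (
    summer_summary["min"] <= winter_summary["max"]
    and winter_summary["min"] <= summer_summary["max"]
)

for name, summary in (("summer", summer_summary), ("winter", winter_summary)):
    print(name, summary)
print("ranges overlap:", ranges_overlap)
```

Even in this toy data, the summer mean is higher than the winter mean while the two ranges still overlap, which is exactly the tension described above.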

Sometimes we call this ANOVA. The idea of analysis of variance is that you're taking into account the variance, or the spread, of your data when comparing the center of your data. So again, let's go back. I want to be able to compare the average, the mean, on the left-hand side to the average, or mean, on the right-hand side. I want to see if 5600 is bigger than 2600. Well, I mean, it looks like it's definitely bigger; it's 3000 bigger. However, again, I have this idea of a very spread out data set, so I should at least take into account that the spread of these data sets is rather wide. When we want to take that spread into account, and remember, one way of measuring spread is variance, we can use analysis of variance. That is how we can statistically compare these things. If I wanted to say statistically that winter is definitely lower than summer on average, then this is the type of analysis we would use.

Now, we don't have enough under our belts yet to be able to pull this off. We need to learn about the idea of randomness, and we need to learn about the idea of hypothesis testing, so we have a lot to do. However, I want to make sure I mention this early in the course, because I feel that a lot of times it's really hard in an introductory course to see the forest for the trees. There's so much detail about foundational things that you may not always realize the benefit that is to come later on. This is one of those benefits: I want to be able to compare statistically whether winter really is lower than summer, but we have to account for the variability that exists in our data. To do that, we need to account for possible randomness, and we need hypothesis testing. So when we start learning these concepts, we're just laying the foundation for much more advanced analysis that we can do later.

All right, so in summary: when comparing many groups' averages statistically, we call this an analysis of variance.
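As a peek ahead, and only as a sketch, here is one way the "compare the centers while accounting for the spread" idea can be put into a single number. One-way analysis of variance is usually built on an F ratio: the variability between the group averages divided by the variability within the groups. The lecture has not introduced this formula, so treat the function `f_ratio` and the data below as illustrative assumptions; turning the ratio into a formal conclusion needs the hypothesis testing material that comes later.

```python
from statistics import mean


def f_ratio(*groups):
    """One-way ANOVA F ratio: between-group variance / within-group variance."""
    k = len(groups)                        # number of groups
    n_total = sum(len(g) for g in groups)  # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # How far each group's average sits from the overall average.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # How spread out the observations are around their own group's average.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical data: both groups share the same spread around their average.
offsets = [-2000, -1000, 0, 1000, 2000]
summer = [5000 + o for o in offsets]       # average 5000
winter_far = [2000 + o for o in offsets]   # average 2000
winter_near = [4999 + o for o in offsets]  # average 4999

f_far = f_ratio(summer, winter_far)
f_near = f_ratio(summer, winter_near)

print("5000 vs 2000:", f_far)
print("5000 vs 4999:", f_near)
```

With the same spread in both scenarios, the gap between averages of 5000 and 2000 gives a ratio well above 1, while the gap between 5000 and 4999 gives a ratio very close to 0. A larger ratio means the difference in averages is large relative to the spread inside each group.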
The reason we call this an analysis of variance is that we need to account for the spread in the data when comparing means, when comparing averages. Now, you might think, well, why would I even care? One number is higher than the other number. Fair point, but let's think about it this way. If I told you that summer, on average, had 5000 users a day, and winter had 2000 users a day, you'd go, "I really think summer is higher than winter." Okay. Well, what if I told you summer had 5000 users a day on average, and winter had 4999 users a day on average? Well, 5000 is higher than 4999, but you'd probably look at me and go, "You know what, those are close enough. I don't really think they're that different." Ah, that's the point. So, how different is different enough for us to say, statistically, that one group's average is different from another's? That's what we're talking about with analysis of variance. Perfect. So, analysis of variance is one way of taking a boxplot and putting some statistical analysis behind it.

Let's look at another foundational plot. Let's look at a scatterplot. When looking at a scatterplot like the one you see here, remember that what we're trying to do is compare two different variables. Here we're comparing the variable temperature, on the x axis, the horizontal axis, the bottom axis, to our total daily registered users. Now, this is not total users, this is registered users, and that is on the y axis, the left-hand axis, the vertical axis. So, how do we look at a scatterplot? Well, again, scatterplots are just visual representations comparing two different quantitative variables for each observation in your data set. You're looking at the values of two quantitative variables plotted at the same time on two different axes.

Let's take a look. Let's imagine we look at the first day in our data set, January 1, 2011. Now, let's look at the two variables we're trying to visualize with our scatterplot: temperature and number of registered users. Perfect. For that first observation, January 1, 2011, the temperature was 46.7 degrees, and we had 654 registered users use our bike rental service on that day. So if we were to look at a scatterplot, that point, 46.7 on the x axis and 654 on the y axis, would be represented by this exact point here. Now, if we were to do that for every single point in our data set, that fills in all the other dots that you see here on this plot. Each of these other dots represents a single day in our data set, and a single combination of temperature and number of registered users.

Wonderful. But what can these plots reveal for us in terms of relationships? When we're looking at scatterplots, we can see a variety of things about the relationship between these two quantitative variables. The first is whether or not we have what we call a linear relationship. A linear relationship is a relationship between two variables that exhibits a fairly straight-line pattern.
Basically, it looks like a straight line: the points are not perfectly on a straight line, but the data cloud seems to be moving in a straight line. A nonlinear relationship, on the other hand, would be a relationship between variables that exhibits a pattern that is nonlinear in nature. It seems to be curving a little bit. Again, we can view these things on our scatterplot, so we can take a look at this scatterplot to try and determine whether we have what appears to be a linear relationship or a nonlinear relationship.

But that's not the only thing we can look for in a scatterplot. We also have what we call positive and negative relationships. A positive relationship implies that as one variable moves, the other variable has a tendency to move in the same way. So, for our example, as temperature increases, the number of users tends to also increase, or, thinking about it the other way, as temperature decreases, the number of users tends to also decrease. This is referred to as a positive relationship. They move in the same direction: as temperature goes up, number of users tends to go up, and as temperature goes down, number of users tends to go down. If the variables were to move in opposite directions, we would say they have a negative relationship. As one variable moves, the other variable has a tendency to move in the opposite direction. For our example, that would imply that as temperature went up, the number of users would actually go down, and as temperature went down, the number of users would go up.

Now, that's not what we see in our data set. In our data set, it looks like we have a positive relationship: as the temperature increases, the number of daily registered users has a tendency to increase. It's not perfect, it's not a straight line, but we can see that as temperature goes down, daily registered users tends to go down, and as temperature goes up, daily registered users tends to go up. So again, they move in the same direction. This would be a positive relationship. Now, whether or not we have a linear relationship may be a little bit harder to see with the naked eye. It's hard to tell because we have a rather big blob of data here, and we're trying to see, as temperature goes up, does registered users also go up linearly, or does it look a little bit curvilinear? Now, we can deal with that in another lecture. For right now, it's enough to understand what a scatterplot can tell you.

Scatterplots, again, are just a visual representation comparing two different quantitative variables. When we look at these two variables, the main thing that we can see is whether we have a positive or a negative relationship. A positive relationship implies that the two variables have a tendency to move together. A negative relationship implies that the variables have a tendency to move opposite to each other. We can also detect linear and nonlinear relationships, but that is a little bit harder to detect with the eyes than positive or negative.

Wonderful. So, we saw how to use boxplots in a statistical way by looking at analysis of variance, and then we started talking about the idea of a scatterplot. In our next lecture, we're going to learn how to apply some statistical analysis to that scatterplot, but for right now that is the end of this lecture, and I look forward to seeing you in the next.
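For readers who want to check the direction of a relationship with numbers rather than by eye, here is a small sketch. It uses the sign of the sample covariance, which is one common way to decide whether two variables tend to move together or in opposite directions; that choice is an assumption on my part, not a method stated in this lecture. Only the first point (46.7 degrees, 654 registered users) comes from the lecture; the remaining values are hypothetical.

```python
from statistics import mean


def direction(xs, ys):
    """Classify a relationship as positive or negative via sample covariance."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Each product is positive when x and y sit on the same side of their
    # averages, and negative when they sit on opposite sides.
    covariance = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)
    ) / (len(xs) - 1)
    if covariance > 0:
        return "positive"
    if covariance < 0:
        return "negative"
    return "no linear direction"


# First pair is the lecture's January 1, 2011 observation;
# every other value is hypothetical.
temperature = [46.7, 55.0, 63.2, 71.8, 80.1]
registered_users = [654, 1200, 2100, 2900, 3600]

same_way = direction(temperature, registered_users)
# Reversing the user counts makes them fall as temperature rises.
opposite_way = direction(temperature, list(reversed(registered_users)))

print(same_way)
print(opposite_way)
```

This only captures direction. It says nothing about whether the pattern is linear or curved, which, as noted above, is harder to judge.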



Last modified: Monday, June 1, 2026, 1:58 PM