Video Transcript: Relationships in Data - Part 2
Welcome. Let's continue our discussion of relationships in data by talking about the idea of analysis of variance. We've seen this chart before. It's a bar chart comparing the average total users by season. When we looked at this chart previously, we wanted to say that winter looks like it has a lower number of total users on average; it definitely seems to be lower than all the others. We've looked at this a few different times, with numbers as well as visualizations. But remember our last lecture, where we talked about boxplots. Boxplots help reveal the variability that exists within a specific category. So, when just looking at these averages, winter's average does seem noticeably lower. However, we can see that winter still has a wide spread in terms of daily users: it can be anywhere from around 500 daily users all the way up to almost 8000 daily users. So how can we reconcile this idea that winter looks lower on average, but in terms of its spread doesn't look all that different from the other seasons?
That's the idea of comparing things statistically. When comparing averages between two groups of data, or even more than two groups, we must think about how spread out the data actually is, and the five number summary is a great way of summarizing that. So let's just compare two groups. Let's think about summer and compare it to winter. Instead of just looking at the average in summer and the average in winter, let's look at the five number summary for each. Let's look, essentially, at a boxplot with numbers. Summer's five number summary, remember, is the minimum, the first quartile, the median, the third quartile, and the maximum, and it is displayed on the left hand side.
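As a supplement to the five number summary just described, here is a minimal Python sketch of how those five numbers, plus the mean, could be computed. The function name `five_number_summary` and the sample values are hypothetical and are not the lecture's bike rental data. Quartile conventions also vary between tools, so the "inclusive" method used here is one assumption among several.

```python
from statistics import mean, quantiles


def five_number_summary(values):
    """Return the five number summary of `values`, plus the mean.

    Quartiles use the "inclusive" interpolation method; other tools
    may use a different convention and report slightly different numbers.
    """
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "mean": mean(ordered),
    }


# Hypothetical daily user counts, invented purely for illustration.
hypothetical_daily_users = [500, 1200, 1800, 2400, 3100, 4200, 7800]
summary = five_number_summary(hypothetical_daily_users)
```

For this hypothetical sample the summary has a minimum of 500, quartiles of 1500, 2400 and 3650, a maximum of 7800, and a mean of 3000. Notice that the mean sits above the median because one very busy day pulls it upward.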
Now, I've also added in a sixth number, the mean, just because that's what we're actually comparing. On the right hand side, you see winter's five number summary: the minimum, the first quartile, the median, the third quartile, and the maximum. By putting the mean on winter's summary as well, we can see that the average value for winter is 2604.1 daily users versus an average value for summer of 5644.3 daily users. Winter is definitely lower, or at least appears much lower. But again, how much lower when we account for the spread? Let's take a look. The minimum value in the five number summary for summer is 1115 users. We had a day in summer where only a little over 1000 users actually used our bike rental service. However, if we look on the right hand side at the maximum value for winter, we see that there was one day in winter where we had almost 8000 people use it; specifically, 7836 people used our bike rental service. So yes, on average summer looks to be higher than winter, but boy, there's a lot of spread, and there's some definite overlap in this spread.
So how do we reconcile this? How can we legitimately make an argument that one average is bigger than another when we know that there's a lot of overlap between these two groups? When you want to compare averages statistically, we call this analysis of variance.
Sometimes we call this ANOVA. The idea of analysis of variance is that you're taking into account the variance, or the spread, of your data when comparing the center of your data. So again, let's go back. I want to be able to compare the average, the mean, on the left hand side to the average, or mean, on the right hand side. I want to see if 5600 is bigger than 2600. Well, I mean, it looks like it's definitely bigger; it's 3000 bigger. However, again, I have very spread out data sets, so I should at least take that into account. When we want to take that spread into account (and remember, one way of measuring spread is variance), we can use analysis of variance. That is how we can statistically compare these things. If I wanted to say statistically that winter is definitely lower than summer on average, then this is the type of analysis we would use.
Now, we don't have enough under our belts yet to be able to pull this off. We need to learn about the idea of randomness, and we need to learn about the idea of hypothesis testing, so we have a lot to do. However, I want to make sure I mention this early in the course, because I feel that a lot of times it's really hard in an introductory course to see the forest for the trees. There's so much detail about foundational things that you may not always realize the benefit that is to come later on. So this is one of those benefits: I want to be able to compare statistically whether winter really is lower than summer, but we have to account for the variability that exists in our data, and for that we need to understand randomness and hypothesis testing. When we start learning these concepts, we're just laying the foundation for much more advanced analysis that we can do later.
All right, so in summary, when comparing many groups' averages statistically, we call this an analysis of variance.
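As a preview only, the idea of weighing a gap in averages against the spread can be sketched in Python. The sketch below computes the one-way ANOVA F statistic, which divides the variability between group means by the variability within the groups. The function name `one_way_f_statistic` and all of the numbers are hypothetical, not the lecture's data, and interpreting the result properly needs the hypothesis testing ideas that come later in the course.

```python
from statistics import mean


def one_way_f_statistic(groups):
    """Ratio of between-group to within-group variability (one-way ANOVA F)."""
    k = len(groups)                              # number of groups
    n_total = sum(len(g) for g in groups)        # total number of observations
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # How far each group's mean sits from the overall mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # How spread out the observations are around their own group's mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical data: both pairs have group means of 5000 and 2000,
# but the second pair is far more spread out.
tight_summer, tight_winter = [4900, 5000, 5100], [1900, 2000, 2100]
wide_summer, wide_winter = [1000, 5000, 9000], [500, 2000, 3500]

f_tight = one_way_f_statistic([tight_summer, tight_winter])
f_wide = one_way_f_statistic([wide_summer, wide_winter])
```

With these made-up numbers, the gap between the averages is 3000 in both cases, yet the tight groups give an F statistic of 1350 while the spread-out groups give only about 1.48. The same difference in means is far less convincing once the spread is taken into account.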
The reason we call this an analysis of variance is that we need to account for the spread in the data when comparing means, when comparing averages. Now, you might think, well, why would I even care? One number is higher than the other number. Fair point, but let's think about it this way. If I told you that summer, on average, had 5000 users a day, and winter had 2000 users a day, you'd say, "I really think summer is higher than winter." Okay. Well, what if I told you summer had 5000 users a day on average, and winter had 4999 users a day on average? Well, 5000 is higher than 4999, but you'd probably look at me and say, "You know what, those are close enough. I don't really think they're that different." Ah, that's the point. So how different is different enough for us to say that one group is, as we call it, statistically different from another in terms of its average? That's what we're talking about with analysis of variance. Perfect. So, analysis of variance is one way of taking a boxplot and putting some statistical analysis behind it.
Let's look at another foundational plot. Let's look at a scatterplot. When looking at a scatterplot like the one you see here, remember that what we're trying to do is compare two different variables. Here we're comparing temperature, on the x axis (the horizontal axis, the bottom axis), to our total daily registered users. Now, this is not total users, this is registered users, and that is on the y axis (the left hand axis, the vertical axis). So, how do we look at a scatterplot? Well, again, scatterplots are just visual representations comparing two different quantitative variables for each observation in your data set. You're looking at the values of two quantitative variables plotted at the same time on two different axes.
Let's take a look. Let's imagine we take the first day in our data set, so we're looking at January 1, 2011. Now, let's take a look at the two variables we're trying to visualize with our scatterplot: temperature and number of registered users. Perfect. So for that first observation, January 1, 2011, the temperature was 46.7 degrees, and we had 654 registered users use our bike rental service on that day. If we were to look at a scatterplot, that point, 46.7 on the x axis and 654 on the y axis, would be represented by this exact point here. Now, if we were to do that for every single day in our data set, that fills in all the other dots that you see here on this plot. Each of these dots represents a single day in our data set, and a single combination of temperature and number of registered users.
Wonderful, but what can these plots reveal for us in terms of relationships? When we're looking at scatterplots, we can see a variety of things about the relationship between these two quantitative variables. The first is whether or not we have what we call a linear relationship. A linear relationship is a relationship between two variables that exhibits a fairly straight pattern.
Basically, it looks like a straight line. The points are not perfectly on a straight line, but the data cloud seems to be moving along a straight line. A nonlinear relationship, by contrast, would be a relationship between variables that exhibits a pattern that is nonlinear in nature. It seems to be curving a little bit. Again, we can view these things on our scatterplot, so we can take a look at this scatterplot to try to determine whether we have what appears to be a linear relationship or a nonlinear relationship.
But that's not the only thing we can look for in a scatterplot. We also have what we call positive and negative relationships. A positive relationship would imply that as one variable moves, the other variable has a tendency to move in the same way. So for our example, as temperature increases, the number of users tends to also increase, or, thinking about it the other way, as temperature decreases, the number of users tends to also decrease. This is referred to as a positive relationship. They move in the same direction: as temperature goes up, the number of users tends to go up; as temperature goes down, the number of users tends to go down. If the variables were to move in opposite directions, we would say they have a negative relationship. So, as one variable moves, the other variable has a tendency to move in the opposite direction. For our example, it would imply that as temperature went up, the number of users would actually go down, and as temperature went down, the number of users would go up.
Now, that's not what we see in our data set. In our data set, it looks like we have a positive relationship: as the temperature increases, the number of daily registered users has a tendency to increase. It's not perfect, it's not a straight line, but we can see that as temperature goes down, daily registered users tend to go down, and as temperature goes up, daily registered users tend to go up. So again, they move in the same direction. This would be a positive relationship. Now, whether or not we have a linear relationship may be a little bit harder to see with the naked eye. It's hard to tell because we have a rather big blob of data here, and we're trying to see whether, as temperature goes up, registered users also go up linearly, or whether it looks a little bit curvilinear. Now, we can deal with that in another lecture, but for right now, just understand the idea of what a scatterplot can tell you.
Scatterplots, again, are just a visual representation comparing two different quantitative variables. When we look at these two variables, the main thing that we can see is whether we have a positive or a negative relationship. A positive relationship would imply that the two variables have a tendency to move together. A negative relationship would imply that the variables have a tendency to move opposite each other. We can also detect linear and nonlinear relationships, but that is a little bit harder to detect with the eyes than something like positive or negative.
Wonderful. So we saw how to use boxplots in a statistical way by looking at analysis of variance, and then we started talking about the idea of a scatterplot. In our next lecture, we're going to learn how to apply some statistical analysis to that scatterplot, but for right now that is the end of this lecture, and I look forward to seeing you in the next.
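As a supplement to the positive and negative relationships described in this lecture, here is a minimal Python sketch of one way to check the direction of a relationship numerically rather than by eye. It looks at whether the two variables tend to sit above or below their own averages at the same time. The function name `relationship_direction` is hypothetical, and only the first pair of values (46.7 degrees, 654 registered users) comes from the lecture; the remaining values are invented for illustration.

```python
from statistics import mean


def relationship_direction(xs, ys):
    """Label the overall direction in which two variables move together."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Positive terms: both values above, or both below, their averages.
    # Negative terms: one value above its average while the other is below.
    co_movement = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if co_movement > 0:
        return "positive"
    if co_movement < 0:
        return "negative"
    return "no clear direction"


# First pair is the lecture's January 1, 2011 observation;
# the other four pairs are hypothetical.
temperatures = [46.7, 55.0, 63.2, 71.5, 80.1]
registered_users = [654, 1200, 2100, 2900, 3400]

direction = relationship_direction(temperatures, registered_users)
```

Here `direction` comes out as "positive", matching the pattern described for the scatterplot. This only captures direction; it says nothing about whether the pattern is linear or curved.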