Video Transcript: Relationships in Data - Part 3
Welcome. Let's finish off our section on relationships and data by now trying to look at some statistical relationships around those scatterplots we talked about last time. Here we're going to look at the idea of correlation. Now, I'll be honest, correlation is a popular term, you've probably used it yourself before. In fact, it's
thrown around a lot by people who may not understand the implications, or really the lack thereof, of what they're saying in the world of statistics. If you were to tell someone that something has a correlation, you're most likely implying what we refer to as the Pearson correlation coefficient, also denoted as little r. It is a measure of strength, not just of the relationship between two variables, but of the linear relationship between two variables. So, again, a lot of times people will say, oh, this is correlated with this other thing. In a statistical world, that would imply a linear relationship between them, so you have to be careful in how you use that term. When it comes to the Pearson correlation coefficient, this number is what we call unitless. Basically, it has no units when describing it. It's just a number, specifically a number that's bounded by negative one and one. It can be any value between negative one and one, but again, it really doesn't have any units. We don't sit there and say, well, you have a correlation of point five dollars or you have a correlation of negative point two inches. There are no units, you just have a correlation of a specific number. Now, when a correlation is negative, basically below zero, that implies that you have a negative relationship between the two variables that you're talking about. So, with correlation, we're describing two variables much like we are in a scatterplot, specifically two quantitative variables. A negative relationship would result in a negative correlation, something between zero and negative one. Of course, that means that a positive correlation, a number between zero and positive one, would imply a positive linear relationship between these two quantitative variables. So, again, remember, a positive relationship, as one number goes up, the other number has a tendency of going up.
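To make the idea of a unitless number between negative one and one concrete, here is a minimal sketch of how r can be computed by hand. The lecture does not show a formula, so this uses the standard textbook definition, and the temperature and user numbers are made up for illustration (they are not the bike data set).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for this example.
temps_f = [50, 55, 61, 68, 72, 80, 85]
users = [310, 400, 380, 520, 490, 610, 590]

r_f = pearson_r(temps_f, users)

# Changing the units (Fahrenheit to Celsius) does not change r.
temps_c = [(t - 32) * 5 / 9 for t in temps_f]
r_c = pearson_r(temps_c, users)

# Flipping the direction of one variable flips the sign of r.
r_neg = pearson_r(temps_f, [-u for u in users])

print(f"r (Fahrenheit) = {r_f:.3f}")
print(f"r (Celsius)    = {r_c:.3f}")
print(f"r (flipped)    = {r_neg:.3f}")
```

The Fahrenheit and Celsius results agree because r is built from deviations that get rescaled on the top and bottom of the fraction by the same amount, which is why it carries no units.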
A negative relationship, as one number goes up, the other has a tendency of going down, and we can see that here, right? As x gets bigger from left to right, y gets smaller from top to bottom, whereas in our positive relationship, as x gets bigger from left to right, y gets bigger from bottom to top. Any value of correlation that's near zero would imply no real linear relationship. So you can see here with this blob of data points in this middle chart, as x moves, well, y doesn't really seem to move in any kind of pattern. X could go up, y may go up or down; y could go up, x could go up or down. There's really no apparent relationship between these two variables, and that would again imply a correlation of basically zero. Now, the closer a number is to one or negative one, the stronger the relationship. In fact, a value of one or negative one actually implies a perfect linear relationship between your two quantitative variables, again, I'm just calling them x and y. So, on the left-hand plot here, what can we see? We can see as x goes up, y goes
up, perfect. So, on the left-hand plot, we have a positive relationship, but notice how it goes up perfectly in a straight line, that would imply the correlation of the plot on the left hand side would be a correlation of exactly one, a perfect positive
relationship. On the right-hand plot, as x goes up, y goes down, and again it does so perfectly along a straight line. That would mean that the correlation on the right-hand plot would be negative one. We have a perfect negative relationship. So we can see here by going back to these plots, if we look at a plot where we see a blob of data, you really don't have any relationship. If you see a positive relationship, you're going to have a positive correlation; a negative relationship, a negative correlation. The closer those numbers are to one or negative one, the tighter, the stronger the relationship, all the way up until you get to one or negative one, and you have a perfect relationship between these two variables. So, again, if we were to have a correlation of .9 versus a correlation of .2, although both are positive, .9 has a stronger, a tighter, a more linear relationship than .2. So let's look at our data. Again, we're looking at the idea of temperature versus total daily registered users. The correlation between these two things is .54. We noticed last time it looks like there's a positive relationship: as temperature goes up, total daily registered users tends to also go up, but it doesn't go up in a perfectly straight line, so it doesn't have a perfect correlation of one. Here, this correlation is .54. Let me show you another correlation. This is a correlation between wind speed and total daily registered users, and this looks a lot more like a blob than it does a linear relationship.
We don't really see a lot of direction to this. If we were to calculate the correlation between these two variables, again, wind speed and total daily registered users, we'd have a correlation of negative .22. So .22 is not as strong as the .54 that we saw earlier, which is not surprising. This .54, it looks like, is a little bit tighter of a relationship as compared to this blob of data. But it does look like, at least according to the negative sign, that it has a little bit of a negative relationship. You know, if we were to look at low wind speeds, you can really see any number of daily registered users, but if you look at really high wind speeds, we don't typically see as many daily registered users. So we have a correlation that's negative, but boy, this is not really a strong correlation, at least in comparison to what we saw previously. Now, there are a couple potential issues to deal with when it comes to correlation. They are outliers and causation. So, let's talk about each one of these. First, let's talk about the idea of outliers, and we've talked about outliers before. Outliers, again, are just the idea that you have a data point that doesn't really look like the rest. Well, outliers can actually lead to false conclusions about correlation if you don't visualize the data to help you see what's going on. Again, that's why we focus so much in this class early on on visualization and good plots to use to visualize things. If we don't visualize our data and we just jump straight into numbers and summarizing things and analysis, we could screw up some things and not even
ever know it. So, let me show you an example. Outliers can make relationships exist that aren't really there. Take a look at this blob of data. This blob of data between x and y has a correlation of zero. Now, this line that you see would be what we call a best fit line. We'll talk about that in a little bit, but essentially think about that line as trying to find the positive or the negative relationship between these two variables. Because it's not going up or going down, it doesn't really find any relationship, positive or negative, between these two variables. However, if I were to add an outlier, this red dot up here on the upper right-hand side of my plot, the new correlation between these two variables is now .92. Wow, we went from zero, no correlation, add a single data point, and now my correlation is .92. Well, why is that the case? Well, take a look at the line. If we were to try and draw a line through our data set, it would look like, wow, we have a really strong positive relationship, but in reality that's only because we have one single data point that's driving that relationship. Without that data point, we really don't think there's any relationship at all. So we have to be very careful. Outliers in a data set may throw off some of our calculations. In fact, let's go back to our wind speeds. Take a look at that wind speed of almost 35 miles per hour on the lower right-hand side, that little outlier right there. We don't see a lot of other wind speeds that high, and it could be making us think that this relationship is more negative than it actually is, so we have to be careful, okay? So that's one way that outliers can make relationships that aren't really there. However, outliers can also hide relationships that are really there. So again, let me give you an example. Here's an example where we have a perfect linear positive relationship. All these dots are increasing, and they're increasing in a straight line.
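Both outlier effects can be reproduced with a few lines of code. The sketch below uses made-up points, not the data behind the slides, so the correlations it prints will not match the slide values exactly. It only illustrates how one extra point can manufacture a strong correlation out of a blob, or pull a perfect line's correlation far away from one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (standard textbook formula)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Case 1: a hypothetical "blob", a 5-by-5 grid of points with no linear trend.
blob_x = [x for x in range(1, 6) for _ in range(1, 6)]
blob_y = [y for _ in range(1, 6) for y in range(1, 6)]
r_blob = pearson_r(blob_x, blob_y)
# Add a single far-away point in the upper right corner.
r_blob_outlier = pearson_r(blob_x + [30], blob_y + [30])

# Case 2: a hypothetical perfect straight line, y = 2x + 1.
line_x = list(range(1, 11))
line_y = [2 * x + 1 for x in line_x]
r_line = pearson_r(line_x, line_y)
# Add a single point far to the right and well below the line.
r_line_outlier = pearson_r(line_x + [30], line_y + [0])

print(f"blob alone:            r = {r_blob:.2f}")
print(f"blob + one outlier:    r = {r_blob_outlier:.2f}")
print(f"perfect line alone:    r = {r_line:.2f}")
print(f"line + one outlier:    r = {r_line_outlier:.2f}")
```

In both cases the number alone would be misleading, and a scatterplot of the same points would make the odd point obvious right away.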
So the variables x and y have a perfect positive relationship. Their correlation is one, but again, let's add in a red dot. Oh wow, we add in that one red dot, and suddenly the correlation of this data set between x and y now is negative point one. So, according to just the correlation number alone, it would say you have a weak relationship between x and y, and that weak relationship, because it's really close to zero, is also negative, because it has a negative value, but if you were to just look at your data, you'd go, oh no, no, no, that little red dot doesn't look like it fits in with the rest of the pattern. Most of my data is really positively related. We have this one random data point that may be pulling things down, so you have to be very careful. This is why we always, always, always visualize our data. So, let's summarize real quick. The Pearson correlation coefficient, r, is what we typically refer to when we say two things are correlated. It is a measure of strength of relationship between two quantitative variables. However, it is also a measure of strength of linear relationship. So you have to be careful when you say two things are correlated statistically, you're saying they're linearly related to each other. Negative values of correlation imply a negative linear relationship. Positive values of correlation imply a positive linear relationship, and values near zero imply there's really no linear
relationship. Also, remember that outliers can make relationships that are not really there or hide relationships that really are there, so you have to be careful, which is why we always, always, always visualize our data. But like we said, outliers aren't the only thing that can bother correlation. Another mistake that people make when it comes to correlation is trying to assign what we call causation. So, again, two of the biggest problems with correlation are outliers and causation. We already mentioned outliers. Let's talk about causation. So we see that our two variables, temperature and total daily registered users, have a correlation of .54. Now that means they have a positive relationship. However, that does not mean that temperature causes daily registered users to go up or down. It just means they have a tendency of moving together. So, again, as temperature goes up, that's not causing more daily registered users. It just means as temperature goes up, we also tend to see daily registered users go up. You have to be careful about confusing correlation and causation. It's really common, and unfortunately, it implies that one thing may actually directly impact another when that's not actually the case. Again, all correlation does is imply that there's some kind of trend between these two variables. It implies some kind of relationship: as one moves, the other tends to move, but it doesn't mean that one causes the other to move. Let me give you some examples, because there are many famous examples of correlations that are not causations. For example, ice cream sales are positively correlated with shark attacks. The more ice cream that people sell, the more people get eaten by sharks. Well, I mean, if we were to think that this correlation is because of causation, then we would say, well, ice cream sales then make us taste better. If we taste better, if I taste a little bit more chocolatey, then I guess the sharks are going to like eating me more.
So that would mean that ice cream sales cause shark attacks. No, that's not actually true, and so, since that's not actually true, there must be some underlying variable that's actually going on. So, what else may be causing this relationship, where, when we see high ice cream sales, we also see high numbers of shark attacks. Well, it's probably rather intuitive to think it's all because of high temperatures. If the temperature outside is really hot, people have a tendency of buying more ice cream. If the temperature outside is really hot, then people also have a tendency of swimming in the ocean more, and more swimming in the ocean means you're around sharks more often, which means you have a tendency of having more shark attacks, so you see it's not that ice cream sales cause shark attacks, it's that they both happen to be moving in the same way, they have a relationship, but not a causal relationship. In fact, what's actually going on is that high temperatures drive ice cream sales and high temperatures drive shark attacks, so a lot of times there's an underlying factor that is related to both of the correlated variables, and so again temperature would be an example, but like I said, there are many famous examples of correlation and no causation, and not even having any kind of
underlying factor that's related. So, for example, did you know the divorce rate in Maine and the US consumption of margarine per person are highly correlated? We're not saying that consuming more margarine is going to make people in Maine hate each other more and want to get a divorce. There really is no underlying factor that would make sense of why these two things are related in the first place. Same thing for US consumption of mozzarella cheese. Did you know that the US consumption of mozzarella cheese per person is highly correlated with the number of awarded PhDs in civil engineering? Did you also know that the decrease in the number of pirates in the world is actually highly correlated with the increase in global warming? Again, there are these examples that have nothing to do with each other, so you have to be very, very careful about this idea of correlation and causation. So, remember, correlation does not imply causation. Now, am I saying that causation isn't actually there? No, there actually may be causation, but correlation is not the reason why there is causation. There has to be some other kind of thought around it. So, for example, it is probably true that smoking causes lung cancer, but it is not statistics that says smoking causes lung cancer. It's medical research that says smoking causes lung cancer. So, again, we have a correlation between temperature and bike usage. I can only say that those two things are related. I'm not going to say that high temperatures drive people to use our bikes; that's probably not actually the case. So, again, a lot of times in these examples, there's an underlying factor, but that's not always true. Some things are correlated, and we just have no real logical reason why. But let's extend this idea of correlation just a little bit further. Let's talk about the idea of regression. So, correlation is not everything.
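Before moving on, the ice cream and shark attack story from earlier can be simulated to show how an underlying factor produces a correlation with no causation behind it. This is only a sketch under stated assumptions: every number below is invented, the two outcomes are generated from temperature alone, and the "remove the straight-line effect of temperature" step is a technique this lecture does not cover, used here purely as an illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient (standard textbook formula)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals(xs, ys):
    """What is left of ys after subtracting its least-squares straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the simulation is repeatable

# Hypothetical daily temperatures; they drive BOTH simulated outcomes.
temps = [random.uniform(10, 35) for _ in range(500)]
# Ice cream sales depend on temperature plus noise (never on sharks).
ice_cream = [20 * t + random.gauss(0, 50) for t in temps]
# A made-up shark attack index depends on temperature plus noise (never on ice cream).
shark_attacks = [0.5 * t + random.gauss(0, 2) for t in temps]

r_raw = pearson_r(ice_cream, shark_attacks)
# Strip the temperature effect out of each variable, then correlate what remains.
r_temp_removed = pearson_r(residuals(temps, ice_cream),
                           residuals(temps, shark_attacks))

print(f"ice cream vs shark attacks:          r = {r_raw:.2f}")
print(f"same, with temperature effect removed: r = {r_temp_removed:.2f}")
```

Neither simulated variable is used to generate the other, yet they come out strongly correlated. Once temperature is accounted for, what is left of the relationship is close to nothing.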
Correlation implies that there's some kind of linear relationship between two variables. However, correlation is just a measure of strength of linear relationship. It doesn't actually say what the linear relationship is. Let me give you an example. Take a look at the plot on the right-hand side of the screen. The plot here on the right-hand side has two different sets of data, marked by circles and marked by x's. Notice how both of these circles and x's have very strong linear relationships. In fact, both of them have a correlation of .99. The x's have a correlation of .99, and the circles have a correlation of .99. However, we see that they're not pointing in the exact same direction, so correlation just measures the strength of the relationship. It doesn't necessarily measure what that exact linear relationship is. That is done through what we call regression modeling. Now, again, regression modeling we'll touch on at the very end of this course, but really we can't talk about regression modeling until we have some foundational concepts under our belt. We still need the idea of randomness that we're going to talk about, and this idea of correlation that we're talking about now. We have to get the idea of residuals, we have to get the idea of hypothesis testing. So, again, all I'm trying to do with these next couple slides is just show you what we can do once we get this foundational
statistics course under our belt. So many people across many industries devote a lot of research money to discovering how variables are related to each other. This is what we call modeling. A simple graphical technique to relate two quantitative variables is through a straight-line relationship. Basically, try and put a straight line in a scatterplot; that would be called a simple linear regression model. Sometimes we'll call this an SLR. Now, most models are more extensive and complicated than these simple linear regressions, but simple linear regressions do form a good foundation. So, let me give you an example of our bike data set again. What if you wanted to predict the number of registered users based only on the outside temperature? In other words, if I give you the outside temperature, you tell me your best guess of total daily registered users. Well, what is the best guess line for that scatterplot? Well, the best guess line for that scatterplot that you see over there on the right-hand side, that dotted straight line, is the line that fits our data the best. We'll talk more about that in a later lecture, but you can see the equation for it here on the left-hand side of the screen. Now, there are a lot of things going on in that equation, so let me go ahead and define some things for you. That equation is called the simple linear regression equation. On the left-hand side, we have the number of predicted users, what we think is going to happen. On the right-hand side, we have a number called the intercept, then we add a number called the slope times the other variable, here temperature. So we're going to use temperature by multiplying it by a slope, then adding an intercept to predict the number of users. This is kind of like what you may remember in grade school when looking at the idea of something graphically. If you remember slope-intercept form, you probably saw this as y equals mx plus b.
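The slide with the equation is not part of this transcript, so here is a sketch of what the description above corresponds to in symbols. The hat over "users" (marking a predicted value) and the exact layout are assumptions; the symbols for the intercept and slope follow the spoken description of a beta with a zero and a beta with a one. The `\text` command assumes the amsmath package.

```latex
\[
  \widehat{\text{users}} \;=\; \beta_0 + \beta_1 \times \text{temperature}
  \qquad\text{compared with}\qquad
  y \;=\; b + m\,x
\]
```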
Same idea here: the y would be predicted users, the x would be temperature, and the m here is represented by that funky looking b, that's the Greek letter beta, with a little one beside it. Then the intercept, which is what you would think of as b, is here represented by again that funky letter beta with a little zero beside it. Let me try and visually show you what I mean by the intercept. The intercept is literally where that diagonal dashed line would cross the y axis. In other words, if temperature were zero, how many predicted users do I think we would have? It'd be a really small number, and that really small number you can see here on the right-hand side plot. That number again is represented by beta zero. The slope, on the other hand, is basically the idea of the direction, the angle of this line. Another way we think about slope is what we call rise over run, basically how far up do we go on total daily registered users for a certain amount of left and right on temperature. That is represented by the beta one value, the slope. So you can think about it like this: the intercept is the value of the average number of registered users when the temperature is zero, and the slope is the average increase in registered users with a one degree increase in temperature. So, if temperature goes up by a degree, how many more people do I think are going to actually use the bike service? Again, we would use
a computer to figure out all these numbers, and by using a computer, the numbers are the following. So, our predicted number of users is going to be 418.42 plus 54.4 times temperature. In other words, you give me a temperature, I would multiply it by 54.4, then I would add 418.42, and that would be the number of predicted users I would guess would actually use our bike rental service that day. Now, again, how did we get these numbers? How do we find that line of best fit? That's all something we'll cover by the end of the course, but I'm just trying to show you how these scatterplots and these correlations really are going to drive some more advanced analysis later on. Now, of course, a straight line, as we can see on the right-hand side, may not fit your data the best. Maybe you can get something that is a little bit more complicated, like this curved line that you see, that may fit your data a little bit better, but anything that's not a straight line is well beyond the scope of this course. So, let's summarize. Correlation is a measure of strength of linear relationship; however, it does not say what that linear relationship is. The simplest graphical technique to relate two quantitative variables is through a straight-line relationship, and that's called the simple linear regression model. That simple linear regression model is just that slope-intercept form that you probably learned back in grade school. You have an intercept and you have a slope. This again relates to that idea of just trying to understand the relationship between two quantitative variables that we've been talking about over the last couple lectures. Wow, we've looked at a lot in this section trying to understand just relationships that exist in variables, but that is the end of this lecture. That is the end of this section, and I look forward to seeing you in the next one.
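As a recap of the regression summary, here is a small sketch of the two ideas in code. `predict_users` simply evaluates the fitted line quoted in the lecture (418.42 plus 54.4 times temperature). `fit_line` assumes that "best fit" means ordinary least squares, which this lecture does not spell out, and the circle and x data sets are invented so that they share the same correlation while having different slopes.

```python
import math

def predict_users(temperature):
    """Fitted line quoted in the lecture: 418.42 + 54.4 * temperature."""
    return 418.42 + 54.4 * temperature

def fit_line(xs, ys):
    """Return (intercept, slope, r) for an ordinary least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx                      # rise over run
    intercept = mean_y - slope * mean_x    # predicted y when x is zero
    r = sxy / math.sqrt(sxx * syy)         # strength of the linear relationship
    return intercept, slope, r

# Hypothetical data: the "circles" rise steeply, the "x's" rise gently.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
circles = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
crosses = [0.25 * y + 3 for y in circles]  # same pattern, squashed and shifted

b0_c, b1_c, r_c = fit_line(xs, circles)
b0_x, b1_x, r_x = fit_line(xs, crosses)

print(f"circles: intercept = {b0_c:.2f}, slope = {b1_c:.2f}, r = {r_c:.3f}")
print(f"x's:     intercept = {b0_x:.2f}, slope = {b1_x:.2f}, r = {r_x:.3f}")
print(f"predicted users at temperature 10: {predict_users(10):.2f}")
```

The two made-up data sets end up with the same r but different slopes and intercepts, which is the point of the summary: correlation tells you how tight the line is, and the regression equation tells you which line it is.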