Welcome. Let's finish off our section on relationships and data by looking at some statistical relationships around those scatterplots we talked about last time. Here we're going to look at the idea of correlation. Now, I'll be honest, correlation is a popular term, and you've probably used it yourself before. In fact, it's thrown around a lot by people who may not understand the implications, or really the lack thereof, of what they're saying in the world of statistics.

If you tell someone that something has a correlation, you're most likely implying what we refer to as the Pearson correlation coefficient, also denoted as little r. It is a measure of strength, not just of the relationship between two variables, but of the linear relationship between two variables. So, a lot of times people will say, oh, this is correlated with this other thing. In a statistical world, that would imply a linear relationship between them, so you have to be careful in how you use that term.

When it comes to the Pearson correlation coefficient, this number is what we call unitless. Basically, it has no units when describing it. It's just a number, specifically a number that's bounded by negative one and one. It can be any value between negative one and one, but again, it really doesn't have any units. We don't sit there and say, well, you have a correlation of 0.5 dollars, or you have a correlation of negative 0.2 inches. There are no units, you just have a correlation of a specific number.

Now, with correlation, we're describing two variables much like we are in a scatterplot, specifically two quantitative variables. When a correlation is negative, basically below zero, that implies a negative linear relationship between those two variables, so a negative relationship results in a correlation somewhere between zero and negative one. Of course, that means that a positive correlation, a number between zero and positive one, implies a positive linear relationship between these two quantitative variables. So, again, remember, a positive relationship means that as one number goes up, the other number has a tendency of going up.
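To make "unitless" and "bounded by negative one and one" a little more concrete, here is a minimal sketch of how r can be computed by hand. The function name `pearson_r` and the numbers are made up for illustration (they are not the course's bike data); the formula is the standard one for the Pearson coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: temperature in Celsius and a daily user count.
temps_c = [2.0, 8.0, 12.0, 19.0, 25.0, 30.0]
users = [900, 1500, 1400, 2600, 2900, 2700]

# The same temperatures expressed in Fahrenheit.
temps_f = [c * 9 / 5 + 32 for c in temps_c]

r_celsius = pearson_r(temps_c, users)
r_fahrenheit = pearson_r(temps_f, users)

print(round(r_celsius, 3), round(r_fahrenheit, 3))
```

Changing the units of temperature leaves r unchanged (up to floating-point rounding), which is one way to see why the number carries no units of its own.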
A negative relationship means that as one number goes up, the other has a tendency of going down, and we can see that here, right? As x gets bigger from left to right, y gets smaller from top to bottom, whereas in our positive relationship, as x gets bigger from left to right, y gets bigger from bottom to top.

Any value of correlation that's near zero would imply no real linear relationship. So you can see here with this blob of data points in this middle chart, as x moves, well, y doesn't really seem to move in any kind of pattern. x could go up and y may go up or down; y could go up and x could go up or down. There's really no apparent relationship between these two variables, and that would again imply a correlation of basically zero.

Now, the closer a number is to one or negative one, the stronger the relationship. In fact, a value of one or negative one implies a perfect linear relationship between your two quantitative variables, which I'm just calling x and y. So, on the left-hand plot here, what can we see? As x goes up, y goes up, so we have a positive relationship, but notice how it goes up perfectly in a straight line. That would imply the correlation of the plot on the left-hand side is exactly one, a perfect positive relationship. On the right-hand plot, as x goes up, y goes down, and again it does so perfectly along a straight line. That would mean that the correlation on the right-hand plot is negative one. We have a perfect negative relationship.

So we can see here, going back to these plots, that if we look at a plot where we see a blob of data, you really don't have any relationship. If you see a positive relationship, you're going to have a positive correlation; a negative relationship, a negative correlation. The closer those numbers are to one or negative one, the tighter and stronger the relationship, all the way up until you get to one or negative one, where you have a perfect relationship between these two variables. So, again, if we were to have a correlation of 0.9 versus a correlation of 0.2, although both are positive, 0.9 has a stronger, tighter, more linear relationship than 0.2.

So let's look at our data. Again, we're looking at temperature versus total daily registered users. The correlation between these two things is 0.54. We noticed last time that it looks like there's a positive relationship: as temperature goes up, total daily registered users tends to also go up. But it doesn't go up in a perfectly straight line, so it doesn't have a perfect correlation of one. Here, this correlation is 0.54.

Let me show you another correlation. This is the correlation between wind speed and total daily registered users, and this looks a lot more like a blob than it does a linear relationship.
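The three cases described with the plots (perfect positive, perfect negative, and no linear trend) can be checked numerically. This is only a sketch with invented points, and `pearson_r` is a hypothetical helper written out here so the example runs on its own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]

# Every point sits exactly on an upward-sloping straight line.
perfect_up = [2 * x + 1 for x in xs]

# Every point sits exactly on a downward-sloping straight line.
perfect_down = [10 - 3 * x for x in xs]

# A symmetric pattern with no overall upward or downward trend.
no_trend = [5, 1, 3, 3, 1, 5]

r_up = pearson_r(xs, perfect_up)
r_down = pearson_r(xs, perfect_down)
r_flat = pearson_r(xs, no_trend)

print(r_up, r_down, r_flat)
```

Note that the steepness of the line does not matter for the first two cases: any exact straight line with a nonzero slope gives a correlation of one or negative one, up to rounding.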
We don't really see a lot of direction to this. If we were to calculate the correlation between these two variables, wind speed and total daily registered users, we'd have a correlation of negative 0.22. So, in terms of strength, 0.22 is not as strong as the 0.54 that we saw earlier, which is not surprising: the 0.54 looks like a somewhat tighter relationship compared to this blob of data. But it does look like, at least according to the negative sign, that there is a little bit of a negative relationship. If we look at low wind speeds, you can really see any number of daily registered users, but if you look at really high wind speeds, we don't typically see as many daily registered users. So we have a correlation that's negative, but boy, this is not really a strong correlation, at least in comparison to what we saw previously.

Now, there are a couple of potential issues to deal with when it comes to correlation. They are outliers and causation. So, let's talk about each one of these. First, let's talk about the idea of outliers, and we've talked about outliers before. Outliers, again, are just data points that don't really look like the rest. Well, outliers can actually lead to false conclusions about correlation if you don't visualize the data to help you see what's going on. Again, that's why we focus so much in this class early on about visualization and good plots to use to visualize things. If we don't visualize our data and we just jump straight into numbers and summarizing things and analysis, we could screw up some things and not even know it.

So, let me show you an example. Outliers can make relationships appear that aren't really there. Take a look at this blob of data. This blob of data between x and y has a correlation of zero. Now, this line that you see is what we call a best fit line. We'll talk about that in a little bit, but essentially think about that line as trying to find the positive or the negative relationship between these two variables. Because it's not going up or going down, it doesn't really find any relationship, positive or negative, between these two variables. However, if I were to add an outlier, this red dot up here on the upper right-hand side of my plot, the new correlation between these two variables is now 0.92. Wow, we went from zero, no correlation, to a correlation of 0.92 by adding a single data point.

Well, why is that the case? Take a look at the line. If we were to try and draw a line through our data set, it would look like, wow, we have a really strong positive relationship, but in reality that's only because we have one single data point that's driving that relationship. Without that data point, we really don't think there's any relationship at all. So we have to be very careful. Outliers in a data set may throw off some of our calculations. In fact, let's go back to our wind speeds. Take a look at that wind speed of almost 35 miles per hour on the lower right-hand side, that little outlier right there. We don't see a lot of other wind speeds that high, and it could be making us think that this relationship is more negative than it actually is, so we have to be careful, okay?

So that's one way that outliers can make relationships that aren't really there. However, outliers can also hide relationships that are really there. So again, let me give you an example. Here's an example where we have a perfect linear positive relationship. All these dots are increasing, and they're increasing in a straight line.
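Both outlier effects can be reproduced with a few invented points. The sketch below does not use the lecture's plotted data, so the resulting numbers will differ from the ones quoted on the slides; the point is only the direction of the change. `pearson_r` is a hypothetical helper repeated here so the block is self-contained.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Case 1: an outlier creates a relationship.
# A 3-by-3 grid of points is a "blob" with a correlation of zero.
blob_x = [x for x in (1, 2, 3) for _ in (1, 2, 3)]
blob_y = [y for _ in (1, 2, 3) for y in (1, 2, 3)]
r_blob = pearson_r(blob_x, blob_y)

# Add one far-away point in the upper right corner.
r_blob_with_outlier = pearson_r(blob_x + [20], blob_y + [20])

# Case 2: an outlier hides a relationship.
# Ten points exactly on the line y = x have a correlation of one.
line_x = list(range(1, 11))
line_y = list(range(1, 11))
r_line = pearson_r(line_x, line_y)

# Add one point far to the right and well below the line.
r_line_with_outlier = pearson_r(line_x + [20], line_y + [-1])

print(round(r_blob, 3), round(r_blob_with_outlier, 3))
print(round(r_line, 3), round(r_line_with_outlier, 3))
```

In the first case a single point pushes the correlation from zero to well above 0.9; in the second, a single point pulls a perfect correlation down to roughly zero. A scatterplot would reveal either point immediately, which the number alone does not.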
So the variables x and y have a perfect positive relationship. Their correlation is one. But again, let's add in a red dot. Oh wow, we add in that one red dot, and suddenly the correlation between x and y in this data set is negative 0.1. So, according to just the correlation number alone, you have a weak relationship between x and y, because it's really close to zero, and that weak relationship is also negative, because it has a negative value. But if you were to just look at your data, you'd go, oh no, no, no, that little red dot doesn't look like it fits in with the rest of the pattern. Most of my data is really positively related. We have this one random data point that may be pulling things down, so you have to be very careful. This is why we always, always, always visualize our data.

So, let's summarize real quick. The Pearson correlation coefficient, r, is what we typically refer to when we say two things are correlated. It is a measure of the strength of the relationship between two quantitative variables, and more specifically of the strength of the linear relationship. So you have to be careful: when you say two things are correlated statistically, you're saying they're linearly related to each other. Negative values of correlation imply a negative linear relationship. Positive values of correlation imply a positive linear relationship, and values near zero imply there's really no linear relationship. Also, remember that outliers can create relationships that aren't really there or hide relationships that are, so you have to be careful, which is why we always, always, always visualize our data.

But like we said, outliers aren't the only thing that can cause trouble with correlation. Another mistake people make when it comes to correlation is trying to assign what we call causation. So, again, two of the biggest problems with correlation are outliers and causation. We already covered outliers. Let's talk about causation.

We see that our two variables, temperature and total daily registered users, have a correlation of 0.54. Now, that means they have a positive relationship. However, that does not mean that temperature causes daily registered users to go up or down. It just means they have a tendency of moving together. So, again, as temperature goes up, that's not causing more daily registered users. It just means that as temperature goes up, we also tend to see daily registered users go up. You have to be careful about confusing correlation and causation. It's really common, and unfortunately, it implies that one thing directly impacts another when that may not actually be the case. Again, all correlation does is imply that there's some kind of trend between these two variables. It implies some kind of relationship: as one moves, the other tends to move. But it doesn't mean that one causes the other to move.

Let me give you some examples, because there are many famous examples of correlations that are not causations. For example, ice cream sales are positively correlated with shark attacks. The more ice cream that people sell, the more people get eaten by sharks. Well, I mean, if we were to think that this correlation is because of causation, then we would say, well, ice cream sales make us taste better. If I taste a little bit more chocolatey, then I guess the sharks are going to like eating me more.
So that would mean that ice cream sales cause shark attacks. No, that's not actually true, and since that's not actually true, there must be some underlying variable that's actually at work. So, what else may be causing this relationship, where, when we see high ice cream sales, we also see high numbers of shark attacks? Well, it's probably rather intuitive to think it's all because of high temperatures. If the temperature outside is really hot, people have a tendency of buying more ice cream. If the temperature outside is really hot, then people also have a tendency of swimming in the ocean more, and more swimming in the ocean means you're around sharks more often, which means you tend to have more shark attacks. So you see, it's not that ice cream sales cause shark attacks, it's that they both happen to be moving in the same way. They have a relationship, but not a causal relationship. In fact, what's actually going on is that high temperatures drive ice cream sales and high temperatures drive shark attacks. So a lot of times there's an underlying factor that is related to both of the correlated variables, and temperature would be an example.

But like I said, there are many famous examples of correlation with no causation, and without even any kind of underlying factor that's related. So, for example, did you know that the divorce rate in Maine and the US consumption of margarine per person are highly correlated? We're not saying that consuming more margarine is going to make people in Maine hate each other more and want to get a divorce, and there really is no underlying factor that would explain why these two things are related in the first place. Same thing for US consumption of mozzarella cheese. Did you know that the US consumption of mozzarella cheese per person is highly correlated with the number of awarded PhDs in civil engineering? Did you also know that the decrease in the number of pirates in the world is highly correlated with the increase in global warming? Again, these are examples of things that have nothing to do with each other, so you have to be very, very careful about this idea of correlation and causation.

So, remember, correlation does not imply causation. Now, am I saying that causation isn't actually there? No, there may actually be causation, but correlation is not the reason why there is causation. There has to be some other kind of reasoning behind it. So, for example, it is true that smoking causes lung cancer, but it is not the statistics that say smoking causes lung cancer. It's medical research that says smoking causes lung cancer. So, again, we have a correlation between temperature and bike usage. I can only say that those two things are related. I'm not going to say that high temperatures drive people to use our bikes; that's probably not actually the case. So, again, a lot of times in these examples there's an underlying factor, but that's not always true. Some things are correlated, and we just have no real logical reason why.

But let's extend this idea of correlation just a little bit further. Let's talk about the idea of regression. So, correlation is not everything.
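The ice cream and shark attack story can be imitated with simulated data. This is a sketch under stated assumptions: every number below is invented, the variable names are hypothetical, and temperature is deliberately built in as the only thing the two series have in common.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# 200 simulated days with a random temperature each day.
temps = [rng.uniform(10, 35) for _ in range(200)]

# Each series depends on temperature plus its own independent noise.
# Neither series is ever used to generate the other.
ice_cream = [20 * t + rng.gauss(0, 30) for t in temps]
shark_index = [0.2 * t + rng.gauss(0, 0.5) for t in temps]

r_ice_sharks = pearson_r(ice_cream, shark_index)

# Because we built the data, we know the temperature effect exactly
# and can subtract it, leaving only the independent noise.
ice_leftover = [i - 20 * t for i, t in zip(ice_cream, temps)]
shark_leftover = [s - 0.2 * t for s, t in zip(shark_index, temps)]
r_leftover = pearson_r(ice_leftover, shark_leftover)

print(round(r_ice_sharks, 2), round(r_leftover, 2))
```

The two series come out strongly correlated even though neither influences the other, and the correlation largely disappears once the shared temperature effect is removed. With real data the underlying factor is not known in advance, which is part of why a correlation alone cannot establish causation.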
Correlation implies that there's some kind of linear relationship between two variables. However, correlation is just a measure of the strength of the linear relationship. It doesn't actually say what the linear relationship is. Let me give you an example. Take a look at the plot on the right-hand side of the screen. The plot has two different sets of data, marked by circles and marked by x's. Notice how both the circles and the x's have very strong linear relationships. In fact, both of them have a correlation of 0.99. The x's have a correlation of 0.99, and the circles have a correlation of 0.99. However, we see that they're not pointing in the exact same direction. So correlation just measures the strength of the relationship. It doesn't necessarily measure what that exact linear relationship is. That is done through what we call regression modeling.

Now, again, regression modeling is something we'll touch on at the very end of this course, but really we can't talk about regression modeling until we have some foundational concepts under our belt. We still need the idea of randomness that we're going to talk about, and this idea of correlation that we're talking about now. We have to get the idea of residuals, and we have to get the idea of hypothesis testing. So, again, all I'm trying to do with these next couple of slides is show you what we can do once we get this foundational statistics course under our belt.

Many people across many industries devote a lot of research money to discovering how variables are related to each other. This is what we call modeling. A simple graphical technique is to relate two quantitative variables through a straight-line relationship. Basically, try and put a straight line in a scatterplot; that would be called a simple linear regression model. Sometimes we'll call this an SLR. Now, most models are more extensive and complicated than these simple linear regressions, but simple linear regressions do form a good foundation.

So, let me give you an example with our bike data set again. What if you wanted to predict the number of registered users based only on the outside temperature? In other words, if I give you the outside temperature, you tell me your best guess of total daily registered users. Well, what is the best guess line for that scatterplot? The best guess line for the scatterplot that you see over there on the right-hand side, that dotted straight line, is the line that fits our data the best. We'll talk more about that in a later lecture, but you can see the equation for it here on the left-hand side of the screen.

Now, there are a lot of things going on in that equation, so let me go ahead and define some things for you. That equation is called the simple linear regression equation. On the left-hand side, we have the number of predicted users, what we think is going to happen. On the right-hand side, we have a number called the intercept, then we add a number called the slope times the other variable, here temperature. So we're going to use temperature by multiplying it by a slope, then adding an intercept, to predict the number of users. This is kind of like what you may remember from grade school when looking at lines graphically. If you remember slope-intercept form, you probably saw this as y equals mx plus b.
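The circles-and-x's point, that two data sets can share the same correlation while following different lines, can be sketched with made-up numbers. The helper names `pearson_r` and `fit_line` are hypothetical, and `fit_line` uses the usual least-squares formulas, which is an assumption about what "the line that fits our data the best" means; the lecture defers that definition to later.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = list(range(10))
wiggle = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.0]

# A steep, nearly straight data set (roughly y = 3x + 1).
steep = [3 * x + 1 + w for x, w in zip(xs, wiggle)]

# The same pattern squashed to a quarter of the height.
shallow = [y / 4 for y in steep]

r_steep = pearson_r(xs, steep)
r_shallow = pearson_r(xs, shallow)
_, slope_steep = fit_line(xs, steep)
_, slope_shallow = fit_line(xs, shallow)

print(round(r_steep, 4), round(r_shallow, 4))
print(round(slope_steep, 3), round(slope_shallow, 3))
```

The two correlations agree, yet one fitted slope is four times the other. The correlation describes how tightly the points hug a line, and the regression equation describes which line it is.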
Same idea here. The y would be predicted users, and the x would be temperature. The m here is represented by that funky-looking b, that's the Greek letter beta, and the intercept, what you would think of as b, is here represented by that funky letter beta with a little zero beside it.

Let me try and visually show you what I mean by the intercept. The intercept is literally where that diagonal dashed line would cross the y axis. In other words, if temperature were zero, how many predicted users do I think we would have? It'd be a really small number, and you can see that really small number here on the right-hand side plot. That number again is represented by beta zero. The slope, on the other hand, is basically the direction, the angle of this line. Another way we think about slope is what we call rise over run: basically, how far up do we go on total daily registered users for a certain amount of left and right on temperature? That is represented by the beta one value, the slope. So you can think about it this way: the intercept is the average number of registered users when the temperature is zero, and the slope is the average increase in registered users with a one-degree increase in temperature. So, if temperature goes up by a degree, how many more people do I think are going to actually use the bike service?

Again, we would use a computer to figure out all these numbers, and by using a computer, the numbers are the following. Our predicted number of users is going to be 418.42 plus 54.4 times temperature. In other words, you give me a temperature, I would multiply it by 54.4, then I would add 418.42, and that would be the number of predicted users I would guess would actually use our bike rental service that day. Now, again, how did we get these numbers? How do we find that line of best fit? That's all something we'll cover by the end of the course, but I'm just trying to show you how these scatterplots and these correlations are going to drive some more advanced analysis later on.

Now, of course, a straight line, as we can see on the right-hand side, may not fit your data the best. Maybe you can get something that is a little bit more complicated, like this curved line that you see, that may fit your data a little bit better, but anything that's not a straight line is well beyond the scope of this course.

So, let's summarize. Correlation is a measure of the strength of a linear relationship; however, it does not say what that linear relationship is. The simplest graphical technique to relate two quantitative variables is through a straight-line relationship, and that's called the simple linear regression model. That simple linear regression model is just that slope-intercept form that you probably learned back in grade school. You have an intercept and you have a slope. This again relates to that idea of just trying to understand the relationship between two quantitative variables that we've been talking about over the last couple of lectures.

Wow, we've looked at a lot in this section trying to understand the relationships that exist between variables, but that is the end of this lecture. That is the end of this section, and I look forward to seeing you in the next one.
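As a small worked example of the fitted equation quoted in the lecture, predicted users = 418.42 + 54.4 times temperature, the sketch below simply plugs temperatures into that line. The function name `predicted_users` and the sample temperatures are hypothetical, and the lecture does not state the temperature units, so none are assumed here.

```python
import math

# Coefficients quoted in the lecture for the bike data.
INTERCEPT = 418.42  # beta zero: predicted users when temperature is zero
SLOPE = 54.4        # beta one: extra predicted users per one-degree increase

def predicted_users(temperature):
    """Predicted total daily registered users for a given temperature."""
    return INTERCEPT + SLOPE * temperature

at_zero = predicted_users(0)
at_twenty = predicted_users(20)
one_degree_step = predicted_users(21) - predicted_users(20)

print(round(at_zero, 2))          # the intercept
print(round(at_twenty, 2))        # 418.42 + 54.4 * 20
print(round(one_degree_step, 2))  # the slope
```

At a temperature of zero the prediction is just the intercept, and each additional degree adds the slope, which matches the rise-over-run reading of the line. The prediction is a best guess for an average day, not an exact count.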



Last modified: Tuesday, 2 June 2026, 08:20