Video Transcript: Exploring Data - Part 2
Let's continue our discussion around exploring data. In this lecture, we're going to be talking about the idea of center and really how we can describe a typical value. So, let's jump in. When exploring data, a good summary of a variable might be something along the lines of a typical value that the variable might take. For example, suppose a qualitative variable, that is, a categorical variable such as weather, came up. What would you recommend as the most typical value? If someone asked you what the typical day looked like in terms of weather, most likely you would pick the most common category. That is what you would probably mean by typical. For a quantitative variable, on the other hand, we like to focus in on the center value of the variable, where center can be defined a couple of different ways. This is what we're going to be talking about here in this lecture.
Let's jump into typical for a qualitative variable first. The first thing we're going to talk about is referred to as the mode. The mode of a variable is the most common value. This is the thing that we would typically report with a categorical or qualitative variable, more so than a quantitative variable. I don't really care about the exact value of temperature, for example, that occurred most often. But when it comes to a qualitative or categorical variable, something like you see here on the right-hand side in our donut chart, you can see what weather could look like in three possible categories: clear or cloudy, misty, and rainy or snowy. So, again, if someone were to ask you what the typical weather day is in your data set, you would look at these three categories and say clear or cloudy happens more often than not. It is the one that happens the most often. Again, it is the mode. So, if someone were to ask about the typical weather day, you'd probably tell them it was clear or cloudy.
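As a small illustration of the mode, here is a minimal Python sketch. The category labels follow the donut chart described above, but the day counts are made-up placeholders rather than the lecture's data, and the names `weather_days` and `typical_weather` are assumptions for this example only.

```python
from collections import Counter

# Hypothetical weather labels for ten days; the counts are invented for
# illustration and are NOT the values behind the lecture's donut chart.
weather_days = [
    "clear or cloudy", "misty", "clear or cloudy", "rainy or snowy",
    "clear or cloudy", "misty", "clear or cloudy", "clear or cloudy",
    "misty", "clear or cloudy",
]

# Count how many days fall into each category.
counts = Counter(weather_days)

# The mode is the category with the largest count.
typical_weather, days_observed = counts.most_common(1)[0]

print(typical_weather, days_observed)
```

With these placeholder days, the most frequent category is the one reported as typical, which is all the mode does.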
Again, this makes intuitive sense when it comes to qualitative variables. It wouldn't make sense, if someone asked you what a typical weather day looks like, to pick rainy or snowy, because that doesn't really happen all too often in your data set. So, again, the mode is what we would commonly think of with a categorical or qualitative variable, but what about a quantitative variable? If the mode doesn't work very well for quantitative variables, what does?
The first thing we're going to look at is what we refer to as the mean. Another name for the mean is the average. So, what do we mean when we say the average or the mean? Well, the average or mean of a variable is the sum of all of the values of that variable divided by the total number of values that you have. So, for example, say we have a variable x. Again, you can think of x as temperature, or height, or weight, whatever you like, but we have some variable x. What we would do is sum up all of the values of x inside of our data set. So x1, for example, would be the first value of x; if x was temperature, it would be the first temperature in your data set. x2 would be the second temperature in your data set, x3 would be the third temperature, and so on and so forth all the way up to x n, where n represents the number of observations you have in your data or your sample. So again, we sum up all the values: x1 plus x2 plus x3, and so on and so forth, all the way out to x n. After we sum up all these values, we divide by the sample size n, and that gives us the average or the mean.
A shorter way of writing this is what you see here on the screen. That big symbol in front of the x is referred to as the summation symbol. Notice how we have a couple of things written beside the summation symbol: we have i equals one on the bottom and n on the top, followed by x with a little i beside it. Well, what does that mean? The summation symbol means that we're going to sum up, we're going to add together, a bunch of different values. Well, what values are we going to add? We're going to add the x's. You see the x there on the right. Okay, but what values of x are we going to add? I'm going to add the values of x from the first all the way up to the nth; that is what the i equals one on the bottom and the n on the top are referring to. I'm going to sum up all the values where i takes the value of 1, 2, 3, all the way up to n. Well, let's find where that i is. That i is the little subscript right beside x. So, again, if you were to replace x i with x1, you would get the first value that you saw in the previous equation. Then we add the next value of i, x2, which gives us the next value in this equation. Then we add x3 and x4, all the way out to x n. So again, what you see here, x1 plus x2 plus x3 all the way out to x n, is the same thing we have written here; it's just a little bit shorter to write it this way. So we sum up all the values of x, we divide by the sample size, and that is how we calculate the average.
Again, in our data set, we would just sum up all the values of temperature that we have. First, we have 731 days in our data set, hence the 731 you see at the bottom. Then what we do is sum up all the values on the top. 46.7 degrees was the temperature on the first day.
48.4 degrees was the temperature on the second day. So we do 46.7 plus 48.4 plus 34.2, and so on and so forth until we add all 731 days in our data set. We divide by 731, and we find out that the average temperature in our data set is 59.51 degrees Fahrenheit. So in our data, if someone asked you about the typical weather day in terms of temperature, then according to the average or mean, it would be 59.51 degrees Fahrenheit. Again, hopefully that makes intuitive sense. We want to try and summarize what temperature is typical, and when asked for typical, a lot of times people report things like the average. If I were to ask you what you think the typical height of Americans is, you'd probably guess the average height of Americans, or if I asked you about the typical income of you and your friends, you would probably give me the average. It's the same idea here. We're trying to look at the idea of typical through the average.
However, like I mentioned last time, the average is not the only way of measuring typical when it comes to a quantitative variable. Another way of looking at typical is what we refer to as the median. The median is literally the value in the middle when your data items are arranged from smallest to largest; we call this ascending order. So, again, the median is just literally the value in the middle. Okay. Well, let's see a couple of examples of the median. Let's imagine we have an odd number of observations. Over here on the right-hand side, you have seven observations. This is an odd number, like 1, 3, 5, 7. Okay, well, literally we want the observation in the middle, but these observations aren't in order. If you picked the observation in the middle here, it wouldn't be the right value. We first have to line them up from smallest to largest, so that's what we've done here. The smallest value is 13, the largest value is 30. Now I want you to try and find it. What value is in the middle of these seven observations? There it is, literally right there in the middle. Three observations on the left, three observations on the right. 26 would be the median of these numbers.
Okay. Well, that makes sense for odd numbers, where you're always going to have the same number of observations on the left and on the right of the middle value, but what do we do if we have an even number of observations? Two observations, four observations, six, eight. Well, we have eight observations here. That top row contains eight different values, but first things first, we need to arrange them in order, so that's what we're doing on the bottom: we're arranging them again in order from smallest to largest. So 13 is still the smallest, 30 is still the largest. So what are we going to look at for the median? Well, there is no one number in the middle. There are three numbers to the left of 26 and three numbers to the right of 27, so we have those two numbers in the middle. When we're calculating the median and we have two numbers in the middle, what we do is take the average, the mean, of those two numbers. So, again, remember what we just talked about.
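The rule just described (sort first, then take the middle value, or the mean of the two middle values when the count is even) can be sketched in Python. This is only an illustration: the function name `median_by_hand` is an assumption, the seven values are the ones the lecture lists for this example, their unsorted order here is arbitrary, and the eighth value (28) is a hypothetical stand-in because the transcript does not state every number on the eight-observation slide.

```python
def median_by_hand(values):
    """Return the median: the middle value of the sorted data."""
    ordered = sorted(values)          # ascending order comes first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: exactly one value sits in the middle.
        return ordered[middle]
    # Even count: average the two values that share the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Seven observations from the lecture, listed out of order on purpose.
seven = [27, 13, 30, 24, 29, 22, 26]
print(median_by_hand(seven))

# Eight observations; the 28 is a hypothetical extra value.
eight = [27, 13, 30, 24, 28, 29, 22, 26]
print(median_by_hand(eight))
```

Sorting inside the function guards against the mistake the lecture warns about, picking the middle of unsorted data.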
When it came to the average, or the mean, we summed the values and divided by how many there were. Here, we sum these two numbers up, 26 plus 27, and divide by two, the total number of numbers that we're taking the average of. So, the median of this data set is the average of the middle two numbers, which is 26.5. Remember, when you're calculating the median, you have to make sure that things are in order from smallest to largest. So if you were to look at our typical weather day again, according to the median it would be 59.76 degrees Fahrenheit. If you were to look at the data set that's on the website, you could rearrange all 731 days of temperature from the smallest temperature all the way to the largest temperature. If you were to do that and find the number right in the middle, it would be 59.76 degrees.
Now, why do some people report averages or means while others report medians? Well, let's take a look at a comparison to see what happens if something extreme happens in our data. Whenever a data set has an extreme value, a value that doesn't look like the rest of the values (sometimes we call these outliers), the median is definitely preferred as a measure of center. You see, the mean is bothered and affected by extreme values. The median, well, is not. Because of that, the median a lot of times might give a better idea of what typical means for the center of something as compared to the average.
Let me show you an example that you see here. We have again the same seven numbers that we were looking at earlier. We saw earlier that these seven numbers have a median of 26. Well, if you were to take the average of these seven numbers, which again means we add them all together, 13 plus 22 plus 24 plus 26 plus 27 plus 29 plus 30, and divide that by seven, you would get an average of about 24.43. Okay, well, these two numbers, 26 and 24.43, are relatively close to each other, so it doesn't look like they're reporting too different a piece of information. But let's change one of the values, shall we? Instead of 30 as that last number on the bottom row, what if we had 300? So, again, we have seven numbers. If we were to look at the median of those seven numbers, 26 is still the number in the middle. Even with that change, we have the same order: 13, 22, 24, 26, 27, 29, but instead of 30 it's now 300. So the median does not change because the order did not change, and if the order did not change, then the median won't change. However, take a look at the average. Wow, that average value, the mean, is now 63. Again, extreme observations bother the average. Imagine if you were to take a look at the average height of people you know, but then you met an NBA player who was over seven feet tall. Well, that average height just shot up tremendously, not really because more people you know got taller; it's just because you met one extremely tall person. The median wouldn't have changed, but the average would have. So, again, when you have data that doesn't have a lot of extreme values, the average won't be too bad as a measure of center. However, when you do, the median is a much better measure of center. Hopefully that helps you see this a little bit.
So let's summarize this lecture. When exploring data, a good summary of a variable might be what we would call a typical value or a center value of that variable.
For categorical variables, the mode is the most common measure of typical. It's the value of the variable that happens most often. It's the most reported category. However, when it comes to quantitative variables, we have two different ways of looking at typical or center: the mean, also known as the average, and the median. The mean is just the sum of all the values divided by however many values you just summed up, and the median is literally the value in the middle when your data is arranged from smallest to largest. And remember, means are bothered by extreme observations, while medians really aren't. So that gives us an idea of how to look at the center of different variables. Next time we'll look at the spread of variables, but for right now that is the end of this lecture, and I look forward to seeing you next time.
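To make the recap concrete, the comparison between the mean and the median can be reproduced with Python's standard `statistics` module. This is a minimal sketch using the seven numbers from the lecture's example; the variable names are assumptions chosen for illustration.

```python
import statistics

# The seven values from the lecture's example.
original = [13, 22, 24, 26, 27, 29, 30]

# Same data, but the largest value 30 is replaced by the extreme value 300.
with_outlier = [13, 22, 24, 26, 27, 29, 300]

# Mean by the formula: sum of all values divided by the number of values.
mean_original = sum(original) / len(original)
mean_outlier = sum(with_outlier) / len(with_outlier)

# statistics.median sorts the data and picks the middle value.
median_original = statistics.median(original)
median_outlier = statistics.median(with_outlier)

print(round(mean_original, 2), median_original)   # mean, median without the outlier
print(round(mean_outlier, 2), median_outlier)     # mean, median with the outlier
```

The mean moves from roughly 24.43 to 63 while the median stays at 26, which is the behavior the lecture describes.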