Video Transcript: Distributions of Statistics from Data - Part 2
Welcome. Let's continue talking about the idea of the distributions of statistics from data, but now focusing in on a very specific distribution. We're going to be looking at the sampling distribution for the sample mean, x bar. Remember, sample statistics are just guesses. They're point estimates of the population parameter, and different population parameters have different sample statistics. So we're going to focus on the most common one, the average. If you wanted to know the average of a population, you would take a sample, you would calculate the average of your sample, and that would probably be your best guess. So, let's talk about what those sample averages would look like. The sampling distribution of a sample average, or sample mean, x bar, is the probability distribution of all the possible values of the sample mean. Think about it this way: if I had the ability to look at every single possible sample of the same size from a population, and I were to plot all of the means on a histogram, what would they look like? That's what we're talking about with the sampling distribution of x bar. I'll show you visually here in a moment, but these distributions, as we've talked about previously, have some characteristics about them. They have their own mean and they have their own variance. So, here's a couple of nice facts. The sampling distribution of the sample mean has an expected value, an average, equal to the population mean, mu. Hold on, let's think about that for a moment.
So, if I were to look at all possible samples of the same size, calculate the average from each one of those samples, and then take the average of all those averages (I know, a lot of things going on), that would equal the population mean. That's why the sample mean is a good guess of the population mean: on average, in other words, if I were to look at all sample means and take their average, they would actually equal the population mean. We can also say something about the standard deviation of x bar, the standard deviation of the sample means. The standard deviation of the sample means is the population standard deviation, sigma, divided by the square root of the sample size. Hold on, so wait, what now? The spread in that distribution of all sample means is whatever the spread was in the population divided by the square root of the sample size of all of those samples. So again, let's imagine you had a bunch of samples. Pick any size sample you want. Let's say you looked at samples of 100 from the population of the United States. If you were to look at all possible samples of size 100 from the United States, and you were to calculate the average from each one of those samples, and you were to look at the standard deviation of that distribution of averages, it would be the population standard deviation divided by the square root of 100. And that would make sense, that the distribution of sample means, the distribution of averages, should have a smaller spread than the population's distribution, right? If I were to look at an average of any group of people, it's not going to be as extreme as any one person. That's the idea of why the spread is going to be smaller. Now, this is just a little fun side fact. If your sample is larger
than 5% of the population, we do have a little bit of an adjustment we can make, where the capital N here is the size of the population, and the little n is the size of the sample. Most of the time, you're never going to deal with this kind of situation, so don't worry about it. But some people may actually deal with this. You may have a small population. Maybe you're trying to measure something from a sample of a species that may be running low on its population. Well, if that's the case, then maybe your sample actually is larger than 5% of the population, but like I said, most of the time you won't have to deal with this. So, let's talk about this idea of samples and averages of these samples and distributions of these averages. Let's talk about what I mean, and let's talk about it visually. Let's imagine what you see here on the slide is the population. The population is normally distributed. It has a mean of zero, so the population mean, the parameter mu, is zero. It has a standard deviation of one, so the population standard deviation, sigma, is one. So let's imagine you were to take a random sample from this population, and let's imagine that random sample had just 10 observations in it, so you see these 10 numbers. These 10 numbers were drawn from a normal distribution with a mean of zero and a standard deviation of one. So you have these 10 numbers in your sample. If you were to take the average of those 10 numbers, you would get negative 0.1, close to zero, not exactly zero. That negative 0.1, remember from our last lecture, is just a guess of the true population mean, and it doesn't look like too bad of a guess. The true value is zero, I guessed negative 0.1, not too bad. Well, let's imagine you took another sample, a completely different sample, 10 completely different numbers. They have a sample mean of negative 0.6. Okay, awesome. Again, we're taking another sample of numbers and looking at their average. Let's do it again.
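Going back to the adjustment for samples larger than 5% of the population: the formula itself is on the slide and is not captured in this transcript. Assuming the slide shows the usual finite population correction for sampling without replacement, the two versions of the standard deviation of x bar would read roughly as follows, with N the population size and n the sample size:

```latex
\[
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
\qquad \text{(usual case)}
\]
\[
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}
\qquad \text{(when } n > 0.05\,N \text{)}
\]
```

When n is tiny compared to N, the extra square root factor is very close to 1, which is why the adjustment can usually be ignored.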
Let's look at another sample of 10 numbers. Here's another sample of 10 numbers from this same normal distribution, and with these 10 numbers we can take another average, and that average is 0.3. Let's do it again. So we have another sample, another 10 numbers, and their average is 0.4. Let's imagine you kept doing this over and over and over. You looked at every single possible sample of size 10 from this distribution, and you looked at all of their sample means. That's what I'm talking about. You take a sample, write down its mean, take another sample, write down its mean, take another sample, write down its mean, and do this over and over again as many times as you possibly can until you've got all of the samples. The question is, what would that distribution of sample means look like if I were to put those sample averages on a histogram? The best part is, we know it's predictable. They follow a normal distribution, so if you were to look at all of these sample means, these x bars would follow a normal distribution. What would be the average of this normal distribution? If you were to take the average of all the x bars, you would get an average of zero. If you were to take
the standard deviation of all those x bars, it would be the population standard deviation, 1, divided by the square root of your sample size, 10, which is about 0.32. This is something we know about sample means. So notice again, it has the same mean as the population, but it's a little bit more narrow. Now, you may be thinking, okay, this kind of makes intuitive sense. If the population is normally distributed, then sure, I could believe that sample means also are normally distributed. Okay, let's look at another population. Let's look at a completely different population, one that doesn't look anything like a normal distribution. Let's look at a uniform distribution, and we've talked about uniform distributions previously. This uniform distribution again has a mean of zero and a standard deviation of one. But unlike the normal distribution, where you get a lot of values around zero and values become less likely the further away from zero you get, with a uniform distribution you have an equal chance of getting anything from negative 1.73 all the way up to 1.73. Any number in here has an equal chance of being selected. So let's do the exact same thing that we did previously. Let's take a sample again. Here we're taking a sample of 10 different observations. If we were to take the average of that sample, it would be 0.3. If we were to do it again, we take another sample, again another 10 observations. These 10 observations came from this population. Everybody in this population has an equal chance of being selected. That's not the case with the normal distribution. In the normal distribution, because there are more people around the center, hence the big hump in the middle, they're more likely to be selected, whereas here everyone's got an equal chance. So you have sample one, it has an average of 0.3. You have sample two, it has an average of negative 0.1. Let's take another sample.
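The repeated sampling from the normal population described earlier can be imitated on a computer. The following is only a rough sketch: we cannot literally list every possible sample, so it draws a large number of random samples of size 10 instead. The function name `simulate_sample_means`, the seed, and the choice of 20,000 samples are assumptions made for illustration, not something from the lecture.

```python
import random
import statistics

def simulate_sample_means(draw, sample_size, num_samples, seed=0):
    """Return the means of num_samples random samples of sample_size values each.

    draw is a function that takes a random.Random instance and returns one
    observation from the population.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

# Population from the lecture: normal with mean 0 and standard deviation 1.
means = simulate_sample_means(
    lambda rng: rng.gauss(0, 1), sample_size=10, num_samples=20_000
)

mean_of_means = statistics.fmean(means)   # should land near mu = 0
sd_of_means = statistics.pstdev(means)    # should land near 1 / sqrt(10)

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f}")
```

Because the samples are random, the two printed values will only be close to 0 and to 1 over the square root of 10, not exactly equal to them.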
It has an average of negative 0.2. You get what we're doing at this point. Let's take another sample. It has an average of 0.1. So, again, over and over and over again, we're going to take a look at all of these samples, all of these averages, and we're going to try and plot their distribution. We're going to put them all on a histogram, and I know what you're probably thinking: cool, I bet these things follow a... wait a minute, they follow a normal distribution? They follow a normal distribution that also has a mean of zero and also has a standard deviation of one divided by the square root of our sample size, 10. Wait a minute, but this population is a uniform distribution. The previous population was a normal distribution, and you're telling me that no matter what the population is, I get the same shape for the sample averages? Yes, that's the beauty of mathematics. This is what we call the central limit theorem, and it holds as long as we take a large enough sample, which we consider to be 50 or more. So I guess our example wasn't exactly the greatest. We were taking samples of size 10, but that made it easier to show you the numbers. But if we were to take a large sample, size 50 or more, the central limit theorem states that the sampling distribution of all sample means is approximately normal. That is, if you were to look at all possible samples of the same size, if
you were to calculate the sample mean, the sample average, from each one of those samples, and you were to plot that distribution, it would be approximately normally distributed, no matter what the original population looks like. The original population can be normal. The original population can be uniform. The original population can be exponential. It can be anything you want. It does not matter what the original population is. According to the central limit theorem, if I look at large enough samples, the sampling distribution of the sample mean is approximately normal, and not only is it approximately normal, it has a mean of the population mean, mu, and a standard deviation that's sigma over the square root of n. Wow, think about the power of what we've just done. I told you at the end of last lecture, if we only had a predictable pattern for sample statistics, like the sample mean, then it wouldn't matter if all we had was one sample and we didn't know the population mean, mu. We could still get some idea about what's going on. That's the beauty of it. No matter what your population looks like, sample means are going to follow an approximately normal distribution as long as your sample size is large enough. Now you may be asking, what if my sample size isn't large enough? What if I have less than 50 observations, less than 50 people in my sample? Well, then the sampling distribution of x bar is only normal if the original population is normal. So, if your original population is normally distributed, then sample means are always going to be normally distributed. That's fine. The power of the central limit theorem, though, is that if you take large enough samples, it doesn't matter what the population looks like, your sample averages will still follow a normal distribution. So, how powerful is this? How is this helping us?
Well, let's take a look at an example using our bike data set. The average daily number of total users is 4504 with a standard deviation of 1937. What is the probability that a sample of 50 days has an average number of users between 4000 and 5000? I'm not talking about a single day anymore, and I'm not saying that every one of the 50 days has to do this, only that the sample of 50 days has an average in that range. Let's take a look at how we can answer this. Based on our previous example, all of the possible sample means from samples of size 50 would have the following distribution. We have a large sample size, 50 or more. Therefore, the sample means will follow an approximately normal distribution, and it doesn't matter what the original population looks like. Well, what's the center of that distribution? It's the same as the population mean, mu. It's 4504. What's the standard deviation of this sample mean distribution? It will be the population standard deviation, 1937, divided by the square root of our sample size, the square root of 50. That gives a sample mean standard deviation of 273.93, and, more importantly, we have a normal distribution. Now, what have we learned about normal distributions? You can
turn any normal distribution into a standard normal distribution. All you've got to do is take the point you're interested in, subtract off the normal distribution's mean, and divide by the normal distribution's standard deviation, and you can answer any question you want. Are you starting to see the power of this now? In previous chapters, we had to assume the distribution was normal before we could answer these questions. Here, as long as you deal with sample means that come from large enough samples, you already have an approximately normal distribution, no assumption needed. So the average daily total number of users is 4504 with a standard deviation of 1937. What's the probability a sample of 50 days has an average between 4000 and 5000? Well, because I'm looking at a sample of 50 or more, the averages of those samples follow a normal distribution, so I can take a look at this calculation. I take whatever number I'm interested in, subtract the mean, and divide by the standard deviation, which, remember, means subtracting off the population mean and dividing by the population standard deviation over the square root of n. And so we can just plug these numbers in. I want 5000, that's the number I'm interested in, minus the population mean, 4504, divided by the standard deviation of 273.93, and I get a z value of 1.81. If you remember, last time we looked this up on a standard normal table. What we're saying is that the point 5000, on the distribution that has an average of 4504 and a standard deviation of 273.93, is the same point as 1.81 on a standard normal distribution, which means that we can look it up in a table. The probability we see a number smaller than 1.81 is 0.9649. In other words, 96.49% of the time we will see a sample of 50 days having an average lower than 5000. But that's not the whole question. It wasn't just lower than 5000, it was between 4000 and 5000. So we've got to deal with that 4000 number as well.
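The standardization step just described (subtract the mean, divide by sigma over the square root of n, then look up the area to the left) can be sketched in a few lines. This is only an illustration: the function name `prob_sample_mean_below` is made up for this example, and Python's `statistics.NormalDist` stands in for the printed standard normal table, so the result can differ from the table value in the last decimal place because z is not rounded to two places first.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_below(cutoff, mu, sigma, n):
    """Return (z, P(x_bar < cutoff)), treating x_bar as normal
    with mean mu and standard deviation sigma / sqrt(n)."""
    standard_error = sigma / sqrt(n)       # standard deviation of x bar
    z = (cutoff - mu) / standard_error     # point on the standard normal scale
    return z, NormalDist().cdf(z)          # area to the left of z

# Bike data from the lecture: mu = 4504, sigma = 1937, samples of 50 days.
z, p = prob_sample_mean_below(5000, mu=4504, sigma=1937, n=50)
print(f"z = {z:.2f}, P(x_bar < 5000) = {p:.4f}")
```

The same function works for any cutoff, so the other end of the range can be handled with one more call.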
So that's exactly what we're doing down here. We're doing the exact same calculation, except instead of 5000 we're doing it for 4000. Again, same idea: 4000 minus the mean, 4504, divided by the standard deviation, 273.93, gives us a point of negative 1.84. If we were to look that up on a standard normal table, the area to the left of negative 1.84 would be 0.0329, or in other words, there is a 3.29% chance that a sample of 50 days has an average below 4000. Well, wait a minute. If I know the probability of being below 5000 and the probability of being below 4000, then I can look in the middle. The middle would be 96.49% minus 3.29%, and that would be 93.2%, or a probability of 0.932. And this is a great way of viewing it, looking at the shaded area in the middle. The shaded area to the left of 5000 was 0.9649. The shaded area to the left of 4000 was 0.0329. So that means the area in the middle has to be 0.9320. There's a 93% chance that a sample of 50 days will have an average between 4000 and 5000 total users. Not all the days have to be in that range, just the average of these 50 days. There's a 93% chance that's going to happen. That's the power of the central limit theorem. That's the power
of the normal distribution. That's why we focus so much on the normal distribution, because of properties like this, because sample means follow a normal distribution as long as you have a large enough sample size. So let's imagine instead of looking at a sample of size 50, we take a sample of size 100. The expected value, the average, would remain the same, 4504. However, the standard error, the standard deviation of x bar, would decrease. Instead of 1937 over the square root of 50, it would be 1937 over the square root of 100, which would be 193.7. In other words, it would be a more narrow distribution, which would make sense, right? If you had a sample of 100 days, you're going to have a lot better idea of what's going to happen. It's going to be a lot tighter, a lot less spread out than a sample of 50 days, right? If I asked you, "Hey, what do you have more confidence in? What do you believe more, a sample of 100 or a sample of 50?" you'd probably say a sample of 100, because there's more data there. Well, then, if you believe in it more, wouldn't it have a more narrow spread? That's the idea. You increase the sample size, your spread is going to get more narrow. You decrease the sample size, your spread is going to get a little wider, because you have a little bit less idea of what's going to happen. Oh boy, we have talked about so much. Let's summarize here real quick. The sampling distribution of x bar is the probability distribution of all possible values of the sample mean x bar with samples of the same size.
Now, this sampling distribution of x bar has a mean, an expected value, equal to the population mean, mu, and a standard deviation of sigma over the square root of n. And now the most important thing we learned: if we have a large sample, a sample size of 50 or more, then the central limit theorem states that the sampling distribution of x bar is approximately normally distributed, regardless of what the population distribution looks like, and that's powerful. Hopefully, that gets you a little bit excited about statistics. I know it still gets me excited about statistics. But that is the end of this lecture, and I look forward to seeing you in the next one.
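As a quick check on the summary, here is a small simulation sketch using the uniform population from the lecture (mean 0, standard deviation 1, so values between roughly negative 1.73 and 1.73) with samples of size 50. The seed, the 20,000 repetitions, and the variable names are assumptions chosen for illustration. If the sample means are close to normal, their mean should be near 0, their standard deviation near 1 over the square root of 50, and about 95% of them should fall within 1.96 standard errors of 0.

```python
import random
import statistics
from math import sqrt

rng = random.Random(1)
half_width = sqrt(3)   # uniform on [-sqrt(3), sqrt(3)] has mean 0 and sd 1
n = 50                 # sample size, the lecture's threshold for "large"
num_samples = 20_000   # number of simulated samples (cannot enumerate all)

# One sample mean per simulated sample of n uniform draws.
sample_means = [
    statistics.fmean(rng.uniform(-half_width, half_width) for _ in range(n))
    for _ in range(num_samples)
]

standard_error = 1 / sqrt(n)                      # sigma / sqrt(n) with sigma = 1
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
# Share of sample means within 1.96 standard errors of the population mean.
share_within = sum(abs(m) <= 1.96 * standard_error for m in sample_means) / num_samples

print(f"mean of sample means: {mean_of_means:.4f}")
print(f"sd of sample means:   {sd_of_means:.4f} (theory: {standard_error:.4f})")
print(f"share within 1.96 SE: {share_within:.3f}")
```

Swapping the `rng.uniform(...)` call for a draw from another population, for example `rng.expovariate(1)`, is one way to try the same check on a different shape, keeping in mind that the population mean and standard deviation change with it.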