MAT 161 - Introduction to Statistics (3 Credits): Video Transcript: Distributions of Statistics from Data - Part 3

Welcome. Let's finish up this section on the distributions of statistics from data by talking about another sampling distribution. In our last lecture we looked at the sampling distribution of the sample mean. Here we're going to talk about the sampling distribution of our estimate of the population proportion: the sample proportion.

Let's talk about proportions. Means are not the only thing of interest in a population. Another typical problem is to estimate the proportion p of the population that has a certain attribute. We've talked about the average height of all Americans. Now let's talk about the proportion of Americans with brown hair. That is the idea of a population proportion p. Just like with any other population parameter, since we cannot view the whole population, we have to use an estimate: the sample proportion, p with a little caret on top, which we call p hat.

Now we can ask the same question we asked for sample means. For sample means, we said: if we were to look at all possible samples of the same size and take the average of each of those samples, what would those sample means look like if we plotted them all in a histogram? What would their distribution be? We can ask the same thing about sample proportions. If we were to take all samples of the same size, calculate from each sample the proportion that has some attribute, and look at all of those possible sample proportions on a histogram, what distribution would they follow? That's the idea of the sampling distribution of the sample proportion. To understand it, let's take a closer look at proportions, because sample proportions are actually quite similar to sample means. Let me show you what I mean.
Let's imagine we have these five customers. Three of them are male and two of them are female. So, what is the proportion of females in this five-customer sample? Well, the proportion of females would be 2 females divided by the 5 total people I have here. So the sample proportion would be 0.4, or 40% of my sample are females. Okay. Well, what if, instead of labeling these as male and female, we labeled them as zeros and ones? Every female gets a one, every male gets a zero, and I tell you, let's take the average of those zeros and ones. What would the average be? We've talked about averages before. The average would be the sum of all five of these numbers, so 0 + 1 + 1 + 0 + 0, divided by the total number, which is 5, and that would give us the same thing. It would give us 0.4. You see, sample means and sample proportions are actually the same thing. The sample proportion of a group of people that has a certain attribute is exactly the same as the average of a variable that is 1 for people who have that attribute and 0 for people who don't. Here we have males and females, so 1 for being a female and 0 for our males. We take the average of those 0s and 1s, and we get the actual proportion of females, which is 0.4.

So, actually, sample proportions are just sample means. Well, wait a minute. If sample proportions are sample means, then the sampling distribution of p hat actually follows the normal distribution. We can go back to the central limit theorem from our last lecture. There we said that sample means follow a predictable distribution, the normal distribution, as long as the sample size is big enough. Because p hat is basically the same thing as x bar, the sampling distribution of p hat is also approximately normal whenever the sample size is large enough.

However, here, when dealing with zeros and ones, "large enough" is a little bit different. A population proportion never takes a value outside of zero to one, right? We're talking about the proportion of people that have an attribute. It's like the probability of somebody having an attribute, so every answer is contained between zero and one. That's unlike data values such as height or weight, which can take a wide range of numbers, so the sample size requirement is a little different. For numbers that have to be between zero and one, a large enough sample size means that the sample size times the population proportion p has to be greater than or equal to five, and the sample size times one minus the population proportion p also has to be greater than or equal to five. Well, what does that mean? Essentially, you can think about these as how many ones and how many zeros you expect to have in your data set. You need at least five in each of the two categories.
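As a small illustration that is not part of the lecture itself, the two ideas above can be sketched in Python: the five-customer proportion computed both as a count and as an average of zeros and ones, and the "at least five in each category" rule. The helper name `is_large_enough` and its `threshold` parameter are hypothetical, chosen only for this sketch.

```python
# Label each of the five customers: 1 = female, 0 = male.
labels = [0, 1, 1, 0, 0]

# Sample proportion computed directly: number of ones over the sample size.
p_hat = labels.count(1) / len(labels)

# Sample mean of the very same zeros and ones.
mean_of_labels = sum(labels) / len(labels)


def is_large_enough(n, p, threshold=5):
    """Hypothetical helper: check that n*p and n*(1-p) are both at least 5."""
    return n * p >= threshold and n * (1 - p) >= threshold


print(p_hat, mean_of_labels)      # the two numbers agree
print(is_large_enough(10, 0.5))   # five expected ones and five expected zeros
print(is_large_enough(10, 0.1))   # only one expected success, so too small
```

The check works with expected counts, n times p and n times one minus p, which is how the rule is stated in the lecture.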
So, for example, if we had males and females as our two categories, we would need at least five males and at least five females. If we were talking about brown hair versus not brown hair, then we would need at least five people with brown hair in our sample and at least five people without brown hair. That's what we need for a large enough sample size. So it's a little bit different from what we had for sample means, but the same premise still holds. As long as you have a large sample size, p hat, the sample proportion, will follow an approximately normal distribution. Even a small sample size like 10 can still be close to a normal distribution for values of p near 0.5, because what's one half of 10? Well, that's five, so that would be five yeses and five nos. However, with values of p close to zero or close to one, we need much larger sample sizes. So it's not a generic number. You know, with means the rule was just to get 50. With these being ones and zeros, or two categories, we need at least five in each one of the categories. And if your situation is rare, it may take a lot of observations just to get five successes. If we were trying to figure out the proportion of days in Florida that drop below freezing, let's imagine that doesn't happen too often. We might need a lot of days before we actually see five days below freezing. That's the idea.

Now, because we can use the central limit theorem here, we know that the sampling distribution of p hat looks like the normal distribution. The normal distribution, if you remember, is completely defined by a mean and a variance, or a mean and a standard deviation. So what do we have for those? Well, they're actually known. If we were to take all the possible samples of the same size, look at all of the possible sample proportions from those samples, and take an average of all of those sample proportions, that would be their expected value, and that would be the population proportion p. That makes sense, right? If we looked at all samples of the same size, computed the sample proportion of females in each, and took the average of all of those sample proportions, it would be the same as the true population proportion of females.

When it comes to the standard deviation, again, we have a known calculation. It involves square roots and it involves the sample size, but it's not quite the same as what we saw for means. Here we have the population proportion p times one minus the population proportion p, divided by the sample size, and then you take the square root of the whole thing. All right, now we've defined the sampling distribution of p hat. It's normally distributed as long as the sample size is large enough.
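One way to see these two facts, sketched here as an illustration rather than anything shown in the lecture, is to simulate many samples and compare the sample proportions with the formulas. We cannot list all possible samples, so this sketch draws a large number of random ones instead. The values p = 0.4 and n = 50, the function name `simulate_p_hats`, and the number of simulated samples are all assumptions made for the example.

```python
import math
import random
import statistics


def simulate_p_hats(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return each sample's proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    p_hats = []
    for _ in range(num_samples):
        # Each observation is a 1 with probability p, otherwise a 0.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats


p, n = 0.4, 50  # assumed values, chosen only for illustration
p_hats = simulate_p_hats(p, n, num_samples=20000)

# Theoretical standard deviation of p hat: sqrt(p * (1 - p) / n).
theory_sd = math.sqrt(p * (1 - p) / n)

print("average of p hats:", statistics.mean(p_hats), "vs p:", p)
print("sd of p hats:", statistics.pstdev(p_hats), "vs formula:", theory_sd)
```

With this many simulated samples, the average of the p hats should land very close to p and their standard deviation very close to the formula, though a simulation only approximates the "all possible samples" idea.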
It has an average that is the true population proportion p, and a standard deviation that is the square root of p times one minus p over n. But let's see this in action. Let's go back to our bike data again. Let's imagine you think that people are more likely to rent a bike on a clear or cloudy day than on a misty, rainy, or snowy one, and in your data 63% of the days are clear or cloudy. So, what is the probability that you sample 50 days and less than half of those 50 days are clear or cloudy? Again, now we're dealing with proportions, right? I want to know the probability that the proportion of clear or cloudy days in my sample is less than 0.5, because I want clear or cloudy days, not misty, rainy, or snowy days.

First of all, do we have a large enough sample size? We have a sample size of 50, but that's not a guarantee that it's large enough. Let's see. We know that 63% of all the days in my data set are clear or cloudy, so 0.63 times 50 is 31.5, and we expect 31.5 out of those 50 days to be clear or cloudy. Well, 31.5 is bigger than 5, so we're good there. But let's look at the other side. One minus 0.63 is 0.37, and if we multiply that by 50, we get 18.5. So, of those 50 days, I expect 31.5 of them to be clear or cloudy.

I expect 18.5 of them to be misty, rainy, or snowy. Both of those numbers are bigger than 5, so I'm good. I'm set. I've got a big enough sample size here, so now I can do my calculation, and the calculation is the exact same one that we've been doing. We know that p hat, because we have a large sample size, follows a normal distribution. Well, if it has a normal distribution, then we can convert it to a standard normal distribution. All we do is take the number that we're interested in, p hat, subtract off the average of p hat, which is just p, and divide by the standard deviation of p hat, which is just the square root of p times one minus p over n.

Again, do you see the value and the beauty of the central limit theorem and the normal distribution? Because of the central limit theorem, we know that p hat follows a normal distribution, and because normal distributions are so nice to work with, we can answer any question about any normal distribution by converting it to a standard normal distribution. So we can plug these numbers in. The p hat I'm interested in is 0.5. Where's that 0.5 coming from? I want to know the probability that less than half of the days, so 0.5 of the days, are clear or cloudy. So I take 0.5 minus 0.63, my population proportion, divided by the square root of 0.63 times 1 minus 0.63 over 50. In other words, I want to know where the spot 0.5 falls on a normal distribution with a mean of 0.63 and a standard deviation of 0.068. That is at negative 1.91 on a standard normal distribution, and if we look that number up in our normal distribution table, we get the value 0.0281. In other words, there's a 2.81% chance that I sample 50 days and less than half of them are clear or cloudy. That makes sense, right? If I know that in my whole data set it's clear or cloudy 63% of the time, then what are the chances of me getting 50 days where less than half of them are clear or cloudy? It's going to be rather small. How small? There's a 2.81% chance of that happening. Again, this is the beauty of what we can do with the normal distribution and with sampling distributions. So, let's summarize.
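The bike calculation above can be retraced in a few lines of Python. This is a sketch, not the lecture's own method: it replaces the normal distribution table with a small `normal_cdf` helper built on the standard library's `math.erf`, and the variable names are made up for the example. Because the code does not round the standard deviation to 0.068 before dividing, its z value and probability can differ from the table-based figures in the third decimal place.

```python
import math


def normal_cdf(z):
    """Standard normal probability of falling below z, using the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


p = 0.63      # proportion of clear or cloudy days in the data set
n = 50        # number of days sampled
cutoff = 0.5  # "less than half of the days"

# Large-enough check: expected counts in both categories must be at least 5.
expected_clear = n * p        # 31.5
expected_other = n * (1 - p)  # 18.5
assert expected_clear >= 5 and expected_other >= 5

# Standard deviation of p hat, then the z-score of the cutoff.
sd_p_hat = math.sqrt(p * (1 - p) / n)
z = (cutoff - p) / sd_p_hat

# Probability that the sample proportion falls below the cutoff.
prob = normal_cdf(z)

print("sd of p hat:", round(sd_p_hat, 3))
print("z-score:", round(z, 2))
print("probability:", round(prob, 4))

# For comparison, the table lookup at the lecture's rounded z of -1.91.
print("table-style value at z = -1.91:", round(normal_cdf(-1.91), 4))
```

Either way, the answer is a little under 3%, which matches the lecture's conclusion that this outcome is rather unlikely.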
So, another typical problem is to estimate the proportion of the population that has a certain attribute. We call that p, and we estimate the population proportion p with the sample proportion p hat. The sampling distribution of p hat is approximately normal, and that's the beauty of it. The central limit theorem holds: p hat is approximately normally distributed as long as our sample size is large enough, meaning that we have at least five successes and five failures in our data set. Because of that, we can answer questions about proportions, just as the sampling distribution of x bar being normal lets us answer questions involving x bar.

At the very beginning of this section, I posed the question: what can we do when we don't know everything, because all we get is one sample? Wouldn't it be nice if sample statistics followed predictable patterns? They do. Sample statistics, like the sample mean or the sample proportion, follow a predictable pattern called the normal distribution, and you've seen the normal distribution. From the normal distribution, you are able to answer a variety of questions, and we can now answer those same questions about sample means and sample proportions. This is just laying the foundation for even more fun stuff to come, but that is the end of this lecture. That is the end of this section, and I look forward to seeing you in the next.

Last modified: Monday, 22 June 2026, 8:28 AM
