Welcome. Let's finish up this section on the distributions of statistics from data by talking about another sampling distribution. Instead of the sampling distribution for the sample mean, like in our last lecture, here we're going to talk about the sampling distribution for our guess at the population proportion: the sample proportion.
Let's talk about proportions. Means are not the only thing of interest in a population. Another typical problem is to estimate the proportion of the population, p, that has a certain attribute. We've talked about the average height of all Americans. Now let's talk about the proportion of Americans with brown hair. That is the idea of a population proportion p. Just like with every other population quantity, since we cannot view the whole population, we have to use the sample proportion as our estimate. We write it as p with a little caret on top, and we call it p hat.
Now we can ask the same question we asked for sample means. For sample means, we said: if we were to look at all possible samples of the same size and compute the average of each of those samples, what would those sample means look like if we plotted them all in a histogram? What would their distribution be? We can ask the same thing about sample proportions. If we were to take all samples of the same size, calculate from each one the proportion that has some attribute, and plot all of those possible sample proportions in a histogram, what distribution would they follow? That's the idea of the sampling distribution of the sample proportion.
To understand the sampling distribution for sample proportions, let's take a look at the idea of proportions, because sample proportions are actually quite similar to sample means. Let me show you what I mean.
Let's imagine we have these five customers. Three of them are male and two of them are female. So, what is the proportion of females in this five-customer sample? Well, the proportion of females would be 2 females divided by the 5 total people I have here. So the sample proportion would be 0.4, or 40% of my sample are females. Okay.
Well, what if, instead of labeling these as male and female, we labeled them as zeros and ones? Every female gets a one, every male gets a zero, and then I tell you, let's take the average of those zeros and ones. What would the average be? We've talked about averages before. The average would be the sum of all five of these numbers, so 0 + 1 + 1 + 0 + 0, divided by the total number, which is 5. And that would give us, wait, the same thing. It would give us 0.4. You see, sample means and sample proportions are actually the same thing. The sample proportion of a group of people that has a certain attribute is exactly the same as the average of a variable that is 1 for people who have that attribute and 0 for people who don't. Here we have 1 for being female and 0 for not, our males. We take the average of those 0s and 1s, and we get the actual proportion of females, which is 0.4. So, actually, sample proportions are just sample means.
Well, wait a minute. If sample proportions are sample means, then the sampling distribution of p hat actually follows the normal distribution. Now we can go back to the central limit theorem from our last lecture. There, we said that sample means follow a predictable distribution, the normal distribution, as long as the sample size is big enough. Because p hat is basically the same thing as x bar, the sampling distribution of p hat is also approximately normal whenever the sample size is large enough.
However, here, when dealing with zeros and ones, "large enough" is a little bit different. When we're talking about a population proportion, we're not going to see any numbers outside of zero to one, right? We're talking about the proportion of people that have an attribute. It's like the probability of somebody having an attribute, so all of the answers are contained between zero and one. Height can be any number, weight can be any number, but a proportion cannot, and so the sample size requirement is a little different from the one for numbers that can take any range. When dealing with numbers that have to be between zero and one, a large enough sample size means that the sample size times the population proportion p has to be greater than or equal to five, and the sample size times one minus the population proportion p also has to be greater than or equal to five.
Well, what does that mean? Let's think about it in slightly different terms. Essentially, you can think about these as how many ones and how many zeros you have in your data set. You need at least five in each of the two categories.
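Both ideas so far, that a sample proportion is just the mean of zeros and ones, and that "large enough" means at least five expected in each category, can be sketched in a few lines of Python. This is only an illustration: the list is_female encodes the five-customer example, while the function name large_enough and the trial values passed to it are assumptions made up for this sketch.

```python
from statistics import mean

# The five-customer example: 1 = female, 0 = male.
is_female = [0, 1, 1, 0, 0]

# Counting the females and dividing by the sample size...
p_hat_by_count = sum(is_female) / len(is_female)
# ...gives the same number as averaging the 0/1 labels.
p_hat_by_mean = mean(is_female)


def large_enough(n: int, p: float, threshold: float = 5) -> bool:
    """Rule of thumb from the lecture: n*p >= 5 and n*(1 - p) >= 5.

    The function name and the `threshold` parameter are hypothetical.
    """
    return n * p >= threshold and n * (1 - p) >= threshold


print(p_hat_by_count, p_hat_by_mean)
print(large_enough(10, 0.5))  # five expected ones and five expected zeros
print(large_enough(10, 0.1))  # only one expected one
```

With n = 10 the rule passes at p = 0.5 but fails at p = 0.1, where only one "one" is expected.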
So, for example, if we had males and females as our two categories, we would need at least five males and at least five females. If we were talking about brown hair versus not brown hair, then we would need at least five people with brown hair in our sample and at least five people without brown hair in our sample. That's what we need for a large enough sample size. So it's a little bit different from what we had for sample means, but the same premise still holds: as long as you have a large sample size, p hat, the sample proportion, will follow an approximately normal distribution.
Even small sample sizes like 10 can still be close to a normal distribution for values of p near 0.5, because what's one half of 10? That's five, so that would be five yeses and five nos. However, with values of p close to zero or close to one, we need much larger sample sizes. So it's not a generic number. With means, it was just "get 50." With ones and zeros, or two categories, we need at least five in each of the categories. And if your situation is rare, it may take a lot of observations just to get five successes. If we were trying to figure out the proportion of days in Florida that drop below freezing, and let's imagine that doesn't happen too often, we might need a lot of days before we actually see five days below freezing. That's the idea.
Now, because we can use the central limit theorem here, we know that the sampling distribution of p hat looks like the normal distribution. The normal distribution, if you remember, is completely defined by a mean and a variance, or a mean and a standard deviation. So what do we have for those? Well, it's actually a known thing. Take all the possible samples of the same size and look at all of the possible sample proportions from those samples. If we took the average of all of those sample proportions, that would be their expected value, and it would be the population proportion p. That makes sense, right? If we looked at all samples of the same size, computed the sample proportion of females in each, and took the average of all of those sample proportions, it would be the same as the true population proportion of females.
When it comes to the standard deviation, again, we have a known calculation. It involves square roots and it involves the sample size, but it's not quite the same as what we saw for means. Here we have the population proportion p times one minus the population proportion p, divided by the sample size, and then you take the square root of the whole thing.
All right, now we've defined the sampling distribution of p hat. It's normally distributed as long as the sample size is large enough.
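To see that mean and standard deviation for p hat in action, here is a small simulation sketch in Python. We can't really take all possible samples, so it draws a few thousand random ones instead. The values p = 0.63 and n = 50, the number of repetitions, the fixed seed, and the helper name sample_proportion are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

p = 0.63     # assumed population proportion (illustrative value)
n = 50       # size of each sample
reps = 5000  # number of simulated samples, standing in for "all possible samples"


def sample_proportion(n: int, p: float) -> float:
    """Draw n yes/no observations with P(yes) = p and return the share of yeses."""
    return sum(random.random() < p for _ in range(n)) / n


# One p hat per simulated sample.
p_hats = [sample_proportion(n, p) for _ in range(reps)]

# The standard deviation of p hat that the formula predicts.
theory_sd = sqrt(p * (1 - p) / n)

print("mean of p hats:", mean(p_hats), "compared with p:", p)
print("sd of p hats:", pstdev(p_hats), "compared with sqrt(p(1-p)/n):", theory_sd)
```

The simulated mean and standard deviation should land close to p and to the square root of p times one minus p over n, though never exactly on them, since this is a random sample of samples rather than all of them.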
It has an average that is the true population proportion p, and a standard deviation that is the square root of p times one minus p over n. But let's see this in action.
Let's go back to our bike data. Let's imagine you think that people are more likely to rent a bike on a clear or cloudy day compared to a misty, rainy, or snowy one, and in your data 63% of the days are clear or cloudy. So, what is the probability that you sample 50 days and less than half of those 50 days are clear or cloudy? Again, now we're dealing with proportions, right? I want to know the probability that the proportion of clear or cloudy days in my sample is less than 0.5, because I want clear or cloudy days, not misty, rainy, or snowy days.
First of all, do we have a large enough sample size? We have a sample size of 50, but that's not a guarantee that it's large enough. We know that 63% of all the days in my data set are clear or cloudy, so 0.63 times 50 is 31.5, and we expect 31.5 out of those 50 days to be clear or cloudy. Well, 31.5 is bigger than 5, so we're good there. But let's look at the other side. One minus 0.63 is 0.37, and if we multiply that by 50, we get 18.5. So, of those 50 days, I expect 31.5 of them to be clear or cloudy and 18.5 of them to be misty, rainy, or snowy. Both of those numbers are bigger than 5, so I'm good. I'm set. I've got a big enough sample size here.
Now I can do my calculation, and it's the exact same calculation we've been doing. We know that p hat, because we have a large sample size, follows a normal distribution. If it has a normal distribution, then we can convert it to a standard normal distribution. All we do is take the number that we're interested in for p hat, subtract off the average of p hat, which is just p, and divide by the standard deviation of p hat, which is just the square root of p times one minus p over n. Again, do you see the value and the beauty of the central limit theorem and the normal distribution? Because of the central limit theorem, we know that p hat follows a normal distribution, and because normal distributions are so nice to work with, we can answer any question about any normal distribution by converting it to a standard normal distribution.
So we can plug these numbers in. The p hat I'm interested in is 0.5. Where's that 0.5 coming from? I want to know the probability that less than half of the days, so 0.5 of the days, are clear or cloudy. So I take 0.5 minus 0.63, which is my population parameter, divided by the square root of 0.63 times 1 minus 0.63 over 50. In other words, I want to know where the spot 0.5 is on a normal distribution with a mean of 0.63 and a standard deviation of 0.068. That is at about negative 1.91 on a standard normal distribution, and if we look that number up in our normal distribution table, we get the value 0.0281. In other words, there's a 2.81% chance that I sample 50 days and less than half of them are clear or cloudy. That makes sense, right? If I know that in my whole data set clear or cloudy happens 63% of the time, then what are the chances of me getting 50 days where less than half of them are clear or cloudy? It's going to be rather small. How small? There's a 2.81% chance of that happening. Again, this is the beauty of what we can do with the normal distribution and with sampling distributions.
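The bike-data calculation can also be checked without a table. The sketch below uses Python's standard library, with NormalDist from the statistics module playing the role of the normal distribution table. The numbers 0.63, 50, and 0.5 come from the example, while the variable names are just illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

p = 0.63      # share of clear or cloudy days in the data set
n = 50        # number of sampled days
cutoff = 0.5  # "less than half of the days"

# Large-enough check: expected counts in both categories should be at least 5.
expected_yes = n * p        # about 31.5 clear or cloudy days
expected_no = n * (1 - p)   # about 18.5 misty, rainy, or snowy days
assert expected_yes >= 5 and expected_no >= 5

se = sqrt(p * (1 - p) / n)   # standard deviation of p hat, about 0.068
z = (cutoff - p) / se        # where 0.5 sits on the standard normal scale
prob = NormalDist().cdf(z)   # P(p hat < 0.5) under the normal approximation

print(round(se, 3), round(z, 2), round(prob, 4))
```

Because the code does not round the standard deviation to 0.068 before dividing, z comes out nearer to negative 1.90 than negative 1.91, and the probability lands between 0.028 and 0.029 rather than exactly on the table value of 0.0281. The conclusion is the same: a chance of a little under 3%.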
So, to recap, another typical problem is to estimate the proportion of the population that has a certain attribute. We call that p, and we estimate that population proportion p with the sample proportion p hat. The sampling distribution of p hat is approximately normal, and that's the beauty of it: the central limit theorem holds, so p hat is approximately normally distributed as long as our sample size is large enough, meaning that we have at least five successes and five failures in our data set. Because of that, we can answer any question about population proportions. And because the sampling distribution of x bar is normal, we can answer any question about normal distributions involving x bar as well.
At the very beginning of this section, I posed the question: what can we do when we don't know everything, because all we get is one sample? Wouldn't it be nice if sample statistics followed predictable patterns? They do. Sample statistics, like the sample mean or the sample proportion, follow a predictable pattern called the normal distribution, and you've seen the normal distribution. From the normal distribution, you are able to answer a variety of questions, and we can now answer those same questions about sample means and sample proportions. This is just laying the foundation for even more fun stuff to come, but that is the end of this lecture and the end of this section. I look forward to seeing you in the next.



Last modified: Monday, 22 June 2026, 8:28 AM