Welcome. Let's finish up this section of the course by talking about interval estimation for the sample statistic x bar, our sample mean. As another reminder, an interval estimate is computed by taking the point estimate that you calculate from your sample and adding and subtracting a margin of error, just like we talked about a couple of lectures ago. The purpose of an interval estimate, again, is to provide some idea of wiggle room, some notion of how close we are with our guess.

You'll also remember that the sampling distribution of x bar is going to be a key factor in this. Just as the sampling distribution of p hat was important for calculating our confidence interval for p hat in the last lecture, the sampling distribution of x bar is going to play a key role in helping us calculate the confidence interval for x bar in this lecture. Now, what is the sampling distribution of x bar? If you remember, according to the central limit theorem, the sampling distribution of x bar is approximately normal as long as your sample size is big enough, where big enough here means at least 50 observations in the sample. That means the sampling distribution of x bar looks like the following: a normal distribution centered around the true population mean, mu, with the standard deviation of x bar being the population standard deviation, sigma, divided by the square root of n, our sample size. As a quick refresher, if we were to take many, many samples all of the same size, calculate the sample average x bar from each one of those samples, and plot them all on a distribution, you would see this normal distribution, as long as those samples were big enough, at least 50 observations.

However, there is a problem here that we did not have when it came to proportions. That problem exists right here.
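The refresher above, many samples of the same size with each sample mean plotted, can be tried out with a short simulation. This is only a sketch under assumptions that are not part of the lecture: the population is taken to be Exponential(1) (deliberately not normal, with mean 1 and standard deviation 1), and the function name and seed are made up for illustration.

```python
import math
import random
import statistics

def simulate_sample_means(n, num_samples, seed=0):
    """Draw num_samples samples of size n from an Exponential(1) population
    (mean 1, standard deviation 1) and return the list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

n = 50            # the "big enough" sample size used in this lecture
sigma = 1.0       # population standard deviation of Exponential(1)
means = simulate_sample_means(n, num_samples=5000)

center = statistics.fmean(means)   # should land near mu = 1
spread = statistics.stdev(means)   # should land near sigma / sqrt(n)
theory = sigma / math.sqrt(n)

print(f"mean of the sample means: {center:.3f}")
print(f"sd of the sample means:   {spread:.3f}")
print(f"sigma / sqrt(n):          {theory:.3f}")
```

The two printed spreads should come out close to each other, which is the sigma over the square root of n result described above. Notice that the simulation can only make that comparison because it knows sigma.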
We don't know the population standard deviation. Now I know you're thinking, wait a minute, we kind of had a problem like this with proportions. With proportions, we didn't know what the population proportion p was, so we just estimated it with p hat. Well, that's great, but we already had an estimate for p. It was p hat, because p hat was what we were already estimating. The problem with means is that we don't already have an estimate for sigma. We have an estimate for mu: our sample mean is an estimate for the population mean, just like our sample proportion is an estimate for the population proportion. But we don't already have an estimate for sigma, so we're going to have to calculate another estimate. So we have two estimates going on here. We're estimating the population mean, mu, with x bar, and now we also have to estimate the population standard deviation, sigma, and we're going to estimate that with the sample standard deviation, s.

I know you're thinking, okay, so now we have to make two estimates, and that doesn't sound like it would be too big of a problem. It actually is quite a big problem. Since we do not know the population standard deviation and need to estimate it, we have to build extra error into our calculations. This makes sense. Estimating two things is going to carry more error than just estimating one, right? If I gave you the choice between guessing whether one coin flip lands heads or tails and guessing two coin flips, you would always say, I'll just take guessing one. Why? There's more chance of you being wrong when you have to guess twice than when you have to guess once. And because we now have to guess twice, the population mean with x bar and the population standard deviation with s, we need to add in some more built-in wiggle room. Again, we didn't have this problem with proportions, because there we're still only estimating one number, the population proportion p. I'm just using that estimate in multiple places. Here I need to estimate two, and because I need to estimate two, the normal distribution is no longer a good approximation for the sampling distribution of x bar. You see, the central limit theorem works really well if all you care about is x bar, but the second you need to estimate something else as well, we're going to need to use another distribution.

Luckily, we have another distribution that we can use. It's called the Student's t distribution. The t distribution is a family of distributions, much like the normal distribution is a family of distributions. The normal distribution can take many shapes: it can have many different means and many different spreads. Well, the t distribution is also a family. We won't get into the details behind why it's called the Student's t distribution, that's a fun little Google experiment if you really care, so we'll just call it the t distribution from now on.
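To see why estimating sigma with s calls for extra wiggle room, here is a small simulation sketch. It builds intervals using the normal distribution's 1.96 but with s in place of sigma, and counts how often they capture the true mean. The Normal(0, 1) population, the sample sizes, the number of trials, and the function name are all assumptions chosen for illustration, not something from the lecture.

```python
import math
import random

def coverage_with_normal_multiplier(n, trials, seed=0):
    """Fraction of intervals x_bar +/- 1.96 * s / sqrt(n) that contain mu
    when samples come from a Normal(mu=0, sigma=1) population."""
    rng = random.Random(seed)
    mu = 0.0
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        x_bar = sum(sample) / n
        # Sample standard deviation s (divides by n - 1); this estimates sigma.
        s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
        margin = 1.96 * s / math.sqrt(n)   # normal multiplier, but with s
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

small = coverage_with_normal_multiplier(n=5, trials=20000)
large = coverage_with_normal_multiplier(n=100, trials=5000)
print(f"n = 5:   intervals containing mu: {small:.3f}")
print(f"n = 100: intervals containing mu: {large:.3f}")
```

With only 5 observations, these "95%" intervals should miss the true mean noticeably more than 5% of the time, while with 100 observations they should come close to 95%. That shortfall for small samples is the gap the t distribution is meant to close.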
The t distribution is symmetric, just like the normal distribution. However, it has thicker tails. In other words, it has a little bit more wiggle room in the tails, a wider margin of error, if you will. The t distribution is defined by a single number that we call the degrees of freedom. The degrees of freedom tell us essentially how wide the t distribution is. Think of the degrees of freedom as playing a role similar to a normal distribution's standard deviation. A little more formally, degrees of freedom are the number of independent pieces of information that go into the computation of s, our sample standard deviation. More degrees of freedom lead to less dispersion: the more degrees of freedom we have, the more information we have to calculate s, which makes our distribution more and more narrow.

I know that may be a little confusing, so let's think about this intuitively. Your degrees of freedom are calculated as the sample size minus one, n minus one. Basically, the bigger your sample size, the more confidence (probably not the best word), the more belief you have in your number, so the bigger your sample size, the narrower the t distribution is going to be. The smaller your sample size, the more wiggle room the t distribution has to have, to account for the fact that you don't have a lot of data backing this up. As samples get larger and larger, the t distribution becomes approximately just the standard normal distribution. So when we have really large samples, it's basically just like using a normal distribution, but for small samples it's going to add even more wiggle room. And that's the idea, right? We need to estimate two things, and because we need to estimate two things, we should add more wiggle room. Of course, if we have extremely large samples, thousands upon thousands of people, then the extra wiggle room we need to add is really, really small. But if we don't have a lot of people in our sample, then we need to add a little extra wiggle room, because we're estimating two things as compared to one. That's all I'm saying.

Here's a picture to try and visualize this for you. You have the standard normal distribution, which is the narrowest curve that you see here. Then you see two different t distributions, one with 20 degrees of freedom and one with 10 degrees of freedom. You'll notice that the t distribution with 10 degrees of freedom is a little bit wider than the t distribution with 20 degrees of freedom, which is in turn wider than the standard normal distribution. Again, this is to help you visualize what we're doing. Because we have to estimate two numbers, we need to add in some more wiggle room, and the t distribution being wider, with more probability in the tails and less in the middle, is a good way to add that extra wiggle room while still keeping the bell-shaped, symmetric curve that we like to see.

All right, so we can use the same idea that we had when it came to the empirical rule, except we're going to use the t distribution instead of the normal distribution for the confidence interval. We still have the same idea, though.
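The comparison of the three curves can also be checked numerically. The sketch below evaluates the standard normal density and the t density with 20 and 10 degrees of freedom at the center (0) and out in the tail (3). The t density is the standard textbook formula; the function names and the choice of the points 0 and 3 are assumptions made for illustration.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    # Work on the log scale so the gamma functions do not overflow for large df.
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

for x in (0.0, 3.0):
    print(f"x = {x}: normal {normal_pdf(x):.4f}, "
          f"t(20) {t_pdf(x, 20):.4f}, t(10) {t_pdf(x, 10):.4f}")
```

At 0 the normal curve should be the tallest and the t curve with 10 degrees of freedom the shortest, and at 3 the order should flip. That is the "less in the middle, more in the tails" pattern described above.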
If we want the middle percentage of our data, we take the error and split it into two pieces, one below the interval and one above the interval. Just like we did with proportions, we take our estimate, and then we add and subtract some number, except instead of coming from a normal distribution, it comes from a t distribution, and we multiply that by the standard deviation of x bar.

All right, so let's imagine again that we want a 95% confidence interval. That means we put 2.5% of our error below the confidence interval and 2.5% of our error above it. The idea is that we're going to be wrong 5% of the time, but we don't know whether we'll be wrong high or wrong low, so we split that error into two pieces. Just like we asked for the normal distribution, what is the middle 95%, we need to do the same thing for the t distribution. What is the middle 95%, and what is that value of t when our sample size is, for example, 30?

Well, just like we had a normal table, we also have a t table. Yay, more tables to look at. I know, I know, they can be a little bit confusing. The nice part is that calculators and a lot of computer software already have these built in, and we can also use the tables. The t table is actually designed much more for confidence intervals than the normal table we've been using up until now. Let me direct your eyes to the t table here. I'm just showing you a few of the rows. The far left-hand column is the degrees of freedom, so you just find the degrees of freedom you're interested in. The column names in the very top row tell you what level of confidence you would like. So, for example, we have a sample size of 30, which means we have 30 minus one, or 29, degrees of freedom. I would look at row 29, then at the 95% column, and that tells me that the value of t alpha over two, that piece of the confidence interval, is a little bigger than two: 2.045. This isn't too surprising, right? This number is a little bit bigger than the normal distribution's value at 95%. The normal distribution at 95% was 1.96. This is 2.045. Again, we're adding in a little extra wiggle room because we had to estimate an extra number. That's the idea. So we would have our estimate minus 2.045 times our standard error, and our estimate plus 2.045 times our standard error, and that is our confidence interval: your estimate x bar plus or minus your margin of error.

More formally, the confidence interval for x bar with a confidence coefficient of one minus alpha, alpha being the error rate, is your point estimate x bar plus or minus the value from the t distribution, not from the normal distribution, times the standard error of x bar, s over the square root of n. Now remember, why do we call this a standard error? Well, the standard deviation of x bar is sigma over the square root of n. We don't know sigma, so we have to estimate it, and because we're estimating a standard deviation, it is now called a standard error. Awesome.

Now we do have some assumptions we have to think about here. Remember, this all still relies on the central limit theorem, so it applies to large samples.
With a sample size greater than or equal to 50, I can calculate this confidence interval for the mean from any population. It doesn't matter what the population distribution is. Your data can look however it wants in terms of the population's distribution; as long as you have large samples, the central limit theorem holds and you can do the calculation we just talked about. However, for small samples, samples with fewer than 50 observations, we need to assume the population follows a normal distribution to actually pull off that calculation. So that's just one additional thing to think about. As long as you take a big sample, you shouldn't have to worry.

All right, so let's go ahead and work through an example with our bike data. The average daily number of total users is 4504, with a standard deviation of 1937. Now, remember, we have a sample of 731 days, so let's build a 95% confidence interval for the average daily number of total users. Again, we need to use the t distribution instead of the normal distribution for confidence intervals for x bar. So the question becomes, what is the value of t for n equal to 731? For an n of 731, our degrees of freedom would be 730, and that corresponds to a t value of 1.965. Whoa, hold on a second. The normal distribution's value was 1.96. Yes, and as the sample size gets bigger and bigger, what do we know? We know that the t distribution looks more and more like the normal distribution, so this number gets closer and closer to the normal distribution's 1.96. For us it's 1.965. Awesome.

So now we can just plug it into our equation. We have our sample mean, 4504, plus or minus 1.965 times 1937, our sample standard deviation, divided by the square root of our sample size, 731. In other words, our confidence interval for the average daily number of users is 4504 plus or minus about 141 people, more specifically 140.8. Or, if you wanted to say it a different way, you could say that our confidence interval for the average daily number of users runs from 4363.2 people up to 4644.8 people. Personally, I like the left-hand side better. I like 4504 plus or minus 140.8. I don't know why, it just resonates with me a little bit better. Everyone's different, though. You might like the idea of point estimate plus or minus margin of error, or you might like looking at the interval itself. Either way, it's the same thing, so you'd be fine.

All right, let's wrap it all up. The confidence interval for the sample mean x bar with a confidence coefficient of one minus alpha, that is, an error rate of alpha, is the following: x bar plus or minus the value from the t distribution, t alpha over two, times the standard error of x bar, s over the square root of n. Wow, we've covered so much in this section of the course. Hopefully, you can start to see how all these sections are now starting to build on each other.
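The bike calculation can be reproduced in a few lines. The sketch below plugs in the lecture's numbers (mean 4504, standard deviation 1937, n of 731, t value 1.965) and, as a stand-in for the t table, also finds t values by numerically integrating the t density and bisecting. The helper names, the integration settings, and the search range are assumptions made for illustration; statistical software would normally supply the t value directly.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_central_area(t, df, steps=2000):
    """Area under the t density between -t and +t, using Simpson's rule."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2.0 * total * h / 3.0   # doubled because the density is symmetric

def t_critical(df, confidence=0.95):
    """Value t whose central area equals `confidence`, found by bisection."""
    lo, hi = 0.0, 50.0             # assumed search range, wide enough here
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mean_confidence_interval(x_bar, s, n, t_value):
    """Return (lower, upper, margin) for x_bar +/- t_value * s / sqrt(n)."""
    margin = t_value * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin, margin

# Numbers from the bike example, with the t value quoted in the lecture.
low, high, margin = mean_confidence_interval(4504, 1937, 731, t_value=1.965)
print(f"4504 +/- {margin:.1f}, that is {low:.1f} up to {high:.1f}")

# Table look-ups redone numerically: 29 and 730 degrees of freedom at 95%.
t_29 = t_critical(29)
t_730 = t_critical(730)
print(f"t for 29 df: {t_29:.3f}, t for 730 df: {t_730:.3f}")
```

One thing to be aware of: the numerical value for 730 degrees of freedom comes out near 1.963, slightly below the 1.965 used above, which may simply reflect reading a printed table at a nearby row. Using 1.963 instead would shrink the margin of error by only about a tenth of a user, so the conclusion doesn't change.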
Like I mentioned, we talked about probabilities, we talked about normal distributions, we talked about sampling distributions, all to lead up to these calculations of confidence intervals. So now, whenever we have a sample, it's not just, well, I think the average number of users is 4504. No, I think the average number of users is 4504 plus or minus 141. I'm giving myself a little wiggle room. It adds credence to whatever you say when you're trying to share your results. People are going to believe you a lot more if you give them some idea of wiggle room as compared to just a single number, right? If I walked up to you and said the average height of Americans, according to my sample, is six feet tall, okay, well, that's good, thank you. But if I said I think the average height of all Americans, according to my sample, is six feet tall, plus or minus three inches, you'd probably believe that second statement a little bit more, because you're like, oh, okay, that helps. It gives me some context, a little bit of wiggle room on that estimate.

That's the whole idea. When we report estimates, we should typically report wiggle room with them, so that people can believe in them more and can see what kind of error we think we're going to have with those estimates. This is why we do these things in statistics. This is why we laid all that foundation, so we could start giving numbers like this, which are so much more meaningful when we report results from data than just a single number. But that is the end of this lecture and the end of this section, and I look forward to seeing you in the next one.



Last modified: Monday, 22 June 2026, 08:35