Video Transcript: Distributions of Statistics from Data - Part 1
Welcome. Let's continue into our next section of the course, where we're going to be talking about distributions of statistics that we get from data. Now, we've been talking about a lot of distributions lately. Over the last two sections, we talked about distributions of discrete data and distributions of continuous data. Now we're talking about distributions of statistics, but let's remember what a statistic is. Let's have a little bit of a review. Remember this slide, where we talked about four different things: the population, the parameter, the sample, and the statistic. Remember, the population is the set of all objects or all individuals that you're interested in finding information out about. Of course, in a real-world scenario, we rarely get a chance to talk to the entire population. Instead, we have to sample. A sample is a subset of the population. This is where we actually gather our information. The goal of any good sampling technique is to have a sample that represents the population well. Now, if the sample is where we're actually obtaining information, then what we actually measure from that sample is what we call a statistic. Remember, again, a statistic is some kind of measure that's computed from a sample. Now, this statistic is what we use to estimate a parameter. A parameter is a measure computed from a population, so you can think about the idea like this. I am interested in knowing the average height of all Americans. All Americans would be the population. The average height would be the parameter from that population. I can't actually talk to all Americans, so I'll take a sample of them. From that sample, I can calculate that sample's average height. That average height from my sample is going to be my best guess for that average height parameter from the population. So everything works together. Remember, population and parameter both start with p, and sample and statistic both start with s. It's the easiest way to remember.
Statistics describe samples; parameters describe populations. But again, why do we care so much about this idea of a statistic and a parameter? Remember, a statistic is a guess of a parameter. More formally, we call those statistics point estimators. They're point estimators because each one gives a single-number estimate of a population parameter; that's what we're looking for. Now, different population parameters have different corresponding sample statistics. For example, take the population parameter mu; that's again that u with the little tail on the front. That's the population mean. The point estimator, the statistic for the population mean, would just be the sample mean. If you have to guess at a population average, your sample's average is the best guess you have. When it comes to variance, sigma squared describes a population; that's the population's variance. Again, we don't typically see this, so instead we're going to measure the variance of our sample, which we call s squared. Remember, we've also talked about the idea of proportions. You have a proportion p, let's say the proportion of people that have brown hair, so the proportion p is your population parameter. But again, we don't get a chance to measure that, so we'll take a sample, and the proportion in our sample is p with a little caret over top of it; we call it p hat.
Now, remember, samples are estimates. We hope that they do a good job of estimating the population, but they aren't the population. That's the whole point. We can't talk to the whole population, but we can talk to a sample, so that sample isn't the whole population; it's just an estimate of it. That means statistics, which come from samples, are just estimates of the parameters. They're not going to be exactly right, because they're just educated guesses. As you can imagine, with any kind of estimation comes a chance of making errors, so let's talk about that.
Let's imagine you had a population that consisted of these 10 numbers: 1, 3, 5, 5, 7, 9, 4, 6, 10, and 2. If you wanted to know the average of these 10 numbers, that would be 5.2. That would be a population average; that is a parameter. Now, obviously, you probably wouldn't have a population of size 10, but it gives us a good example to look at. Let's take a sample, again imagining I can't talk to the entire population. I'm going to take a random sample of four of those 10 observations. That random sample consists of the numbers 1, 10, 6, and 9. If I were to take the sample average from this sample, that would be six and a half. That 6.5 is a statistic, because it was calculated from a sample. That 5.2 on the upper right-hand side is a parameter, because it was calculated from a population. But let's imagine you got a different sample instead of sample one. Let's say you wanted to take another sample. Let's call it sample two. Again, you randomly select four out of the 10 observations, and you get the numbers 1, 3, 2, and 5.
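As a quick check on the numbers in this example, here is a minimal Python sketch that computes the population mean and the mean of each of the two samples. The variable names are just illustrative choices, not anything from the slides.

```python
from statistics import mean

# The ten-number population from the example.
population = [1, 3, 5, 5, 7, 9, 4, 6, 10, 2]

# The two random samples of four observations described in the lecture.
sample_one = [1, 10, 6, 9]
sample_two = [1, 3, 2, 5]

# A parameter is computed from the whole population.
population_mean = mean(population)

# A statistic is computed from a sample.
sample_one_mean = mean(sample_one)
sample_two_mean = mean(sample_two)

print("population mean (parameter):", population_mean)
print("sample one mean (statistic):", sample_one_mean)
print("sample two mean (statistic):", sample_two_mean)
```

Both sample means are attempts at the same population mean, which is what makes them comparable.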
Well, the sample average, the statistic from this sample, is 2.75. Well, wait a minute, both of those samples produced statistics that were wrong. They were both trying to estimate the same number, weren't they? Remember, the population average is 5.2. The sample is trying to estimate the population, so both of those estimates, 6.5 and 2.75, are wrong. That gets us to this idea of sampling error. Sampling error occurs when there is a difference between a sample point estimate and the corresponding population parameter. In all honesty, sampling error happens all the time. It's just a matter of how big that sampling error is. So, again, let's take a look here. For that first sample, we could say the sampling error was our guess, 6.5, minus the truth, 5.2. Our sampling error would be 1.3; I was off by 1.3. For sample two, on the other hand, our guess was 2.75. Our truth was 5.2. Our sampling error from the second sample is negative 2.45. So again, here are our errors.
However, let's think about this a little bit more realistically. In a realistic scenario, what do you actually know? This. This is typically all we know. Think about it. If you knew the whole population, and you knew the population mean, mu, why would you ever take a sample? The reason you're taking a sample is because you don't know the population, which means you don't know the truth. You don't know that the population average, in this case, was 5.2. In fact, very rarely do you ever actually get more than one sample. Typically, you only get one sample, so all you see is a sample here that has an average of 6.5, and that's your best guess. In fact, because we don't know the true value of 5.2, we don't even know how wrong we are. We don't know if this 6.5 is a good guess or a bad guess. So we have all these possible guesses that we could get from all of our possible samples. Let's start thinking about this a little more. If sample statistics like we have here, like the sample mean, had a predictable pattern, then the errors would have a predictable pattern as well. And if the errors have a predictable pattern, then even if we don't know the true value, we could potentially say something about how right or how wrong we are. That's what we're talking about in this section of the course. We're talking about sampling error, and we're talking about distributions of these sample statistics. What if we were to look at the distribution of all possible sample means? What would that look like?
Well, let's go ahead and summarize. Sample statistics are just single-number estimates. We call these point estimates, and they're point estimates of some kind of population parameter. Sampling error occurs when there's a difference between a sample point estimate and the corresponding population parameter. Unfortunately, we don't get a chance to measure this sampling error all too often, because we don't know the true population parameter. However, if sample statistics like the sample mean that we used in our example had a predictable pattern, then the errors would have a predictable pattern as well, and that's what we're going to look at in our next lecture. But for now, that is the end of this lecture, and I look forward to seeing you next time.
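For anyone who wants to experiment before the next lecture, the sampling errors from the example, and the question of what all possible sample means look like, can be explored with a short Python sketch. It rests on two assumptions that the lecture does not state outright: samples are drawn without replacement, and the order of the four observations does not matter. Under those assumptions there are 210 possible samples of four from a population of 10. The helper name `sampling_error` is made up for this illustration.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# The ten-number population from the example; mu is its mean (the parameter).
population = [1, 3, 5, 5, 7, 9, 4, 6, 10, 2]
mu = mean(population)


def sampling_error(sample, parameter):
    """Hypothetical helper: point estimate (sample mean) minus the parameter."""
    return mean(sample) - parameter


# Sampling errors for the two samples discussed in the lecture.
error_one = sampling_error([1, 10, 6, 9], mu)
error_two = sampling_error([1, 3, 2, 5], mu)
print("sample one error:", round(error_one, 2))
print("sample two error:", round(error_two, 2))

# Assumption: sampling without replacement, order ignored.
# combinations() picks by position, so the two 5s count as separate observations.
all_means = [float(mean(s)) for s in combinations(population, 4)]
print("number of possible samples:", len(all_means))
print("mean of all sample means:", round(mean(all_means), 2))

# A rough text histogram of the sample means: one '#' per sample with that mean.
for value, count in sorted(Counter(all_means).items()):
    print(f"{value:5.2f} {'#' * count}")
```

In practice we only ever see one of those samples, which is why the overall pattern of the sample means matters more than any single one.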