Video Transcript: Distribution of Continuous Data - Part 2
Welcome. Let's continue our conversation about distributions of continuous data by focusing on what is probably the most popular distribution in all of statistics. We call that the normal distribution. The normal distribution is one of the most common and important distributions for describing a continuous random variable. In all honesty, the normal distribution is the foundation of statistical inference: hypothesis testing, confidence intervals, regression analysis. All these things hinge on the normal distribution. We'll be talking about all three of these things later on in the course, which is why we have to understand the normal distribution now, because again it will be the underpinning of all these other concepts. The most interesting part about the normal distribution as well is that it appears all over the place in nature, in real-world data. Again, it's amazing to see the patterns that God has put out there in the world. This is one of the most popular ones, the bell-shaped curve, the normal distribution. You have probably seen it before. This is the shape of the normal distribution. Now, this shape, this bell-shaped curve, where it has a big hump in the middle and gets smaller and smaller the further out in the tails you go, does have an actual equation to it. Now, you do not need to memorize or understand this equation in all its details, but some people like seeing these things, and so I wanted to make sure I did bring it forward. This is the equation to calculate that nice, pretty bell-shaped curve you see there. The big thing I want to focus on in this equation is two different things that are involved in it. The first is the mean. Wait a minute, that's mu, that's the mean. We've seen that before. So the middle of your data, the average of your data, the mean of your data plays a big role in the normal distribution. We also have the standard deviation, sigma.
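For reference, since the slide itself is not part of this transcript: the equation being described is presumably the standard normal density, which in the usual notation (mu for the mean, sigma for the standard deviation, x for a data value) can be written as:

```latex
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}
\]
```

The height of the curve at any x depends only on how far x is from mu, measured in units of sigma.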
Again, the standard deviation plays a big role in the normal distribution. In fact, these are really the only two things you can define in this equation. The pi that you see here is the good old 3.14159 number that you've probably learned about before. The e that you see there is Euler's number, roughly 2.718, the base of the exponential function. So the only things there that are unknown would be mu and sigma. The x's that you see there, all the x's, are just your data itself. So, really, how your data looks in terms of its average and how spread out your data is is all you need to know to completely define this normal distribution. In fact, that's one of the many important characteristics that the normal distribution has. So, starting at the bottom again, the normal distribution is completely defined by the mean and the standard deviation. We'll get to more of that here in a moment, but let's talk about some of these other points as well. The normal distribution is perfectly symmetric, or more formally, we would say it has a skewness of zero. It's not skewed one way or the other, and it looks like that here in the picture as well, right? You have a symmetric distribution. If we were to draw a vertical line down the middle, you would see the same thing on both sides. The normal distribution is also what we refer to as unimodal. It basically means you have one big
collection of data in the middle. If you had bimodal data, you would see two humps in your data curve. If it was trimodal data, you would see three humps in your data curve. But because you have one big hump of data here, with the tails just getting smaller and smaller away from that collection of data in the middle, we call this a unimodal distribution. Another fun characteristic about the normal distribution: the mean, the median, and the mode are all equal to each other, and they all sit at the exact middle. So, the median is at the peak of the curve, the mean is at the peak of the curve, the mode is at the peak of the curve. So, when you look at the average of your data, it would be right in the middle of everything that we see. Let's talk about another one of these characteristics: asymptotic to the x-axis. Whoa, okay, hold on a second. There are some big words there that are a little bit more mathematical. Let's try and understand what they mean. Asymptotic to the x-axis means that the curve gets closer and closer to the x-axis but never actually touches it. Literally, a normal distribution can take any value, any value from negative infinity all the way up to positive infinity. I know what you're thinking. Well, wait a minute, it looks like the curve kind of ends on the left-hand side and ends on the right-hand side at some point. Well, actually, that's not the case. The curve goes off into infinity in both directions. It's just that there's so small a chance of actually seeing things far away from the middle of the distribution that it's very, very unlikely. And in fact, that's really what the normal distribution is. Kind of think about it like a bar graph, but with a lot of bars stacked side by side, so the height of your normal distribution shows where things are more likely to take place. The more probable values of your data are going to take place closer to the middle of your distribution, and the less probable values are going to take place far away from the middle of your distribution.
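A minimal sketch of those shape properties, assuming a mean of 0 and a standard deviation of 1 purely for illustration (the function name normal_pdf is made up for this example and is not from the lecture):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Illustrative values only: a mean of 0 and a standard deviation of 1
mu, sigma = 0.0, 1.0

# Symmetric: the curve is the same height at equal distances either side of the mean
left = normal_pdf(mu - 1.5, mu, sigma)
right = normal_pdf(mu + 1.5, mu, sigma)

# Unimodal: the single highest point is at the mean
peak = normal_pdf(mu, mu, sigma)

# Asymptotic to the x-axis: far out in the tail the height is tiny, but never zero
far_tail = normal_pdf(mu + 8 * sigma, mu, sigma)

print(left, right, peak, far_tail)
```

The far-tail value is positive but extremely small, which is why the drawn curve looks as if it ends even though it does not.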
Kind of think about it like heights of people. Let's say the average height of people is six feet tall. That means that most people would be around six feet tall. That doesn't mean that there aren't people that are much taller or much shorter than six feet tall, but that they occur with a lower probability. So, again, there are also people who are seven feet tall. There's just not as many of them, and so they are less probable because they're further away from the middle, the mean of your data. Let's talk about that last point that we led with as well: completely defined by the mean and standard deviation. What do you mean? Well, let's talk about the mean. The mean, the average of your data, basically defines where this normal distribution is located. Like I said, the mean is the middle of the distribution, so you can see here three different normal distributions. All of them have the same spread, they have the same shape, however, they're centered at three different points: one of them is centered at zero, one of them is centered at five, and one of them is centered at negative 10. So you can see that shifting the same distribution can be done with the mean. So if we had something with a mean of zero and I wanted to shift the whole distribution to the right, I could add five, for example, to every value, and it would shift the whole distribution to the right, and we can do
the same thing going the other direction by subtracting things, but really, again, the average of your data defines the center of this bell-shaped curve. The standard deviation of your data defines how spread out the bell-shaped curve is. So, again, here are two different normal distributions. Now, both of these normal distributions have the same center; they're both centered at the same mean. However, they have different spreads. The one that is more spread out does not have as high of a peak in the middle, and more of its data is spread out into the tails, so it's wider, whereas the one with the smaller standard deviation is a little bit more narrow in terms of the hump itself. So again, I don't want to get this confused with the last concept of asymptotic to the x-axis. Both of the normal distributions you see here on this screen can take any value from negative infinity to positive infinity. However, it's a matter of where most of the data is located, so if your data is more spread out, it's not located as tightly around the mean as another data set might be. That's the idea. So, let's summarize real quick. Well, the normal probability distribution, again, is one of the most common in nature, as well as one of the most important distributions in mathematics. The normal distribution is the foundation of statistical inference. Later chapters in this course, like hypothesis testing, confidence intervals, even introductions to regression analysis, are all going to hinge on the normal distribution, and this normal distribution has some wonderful mathematical characteristics. Again, it's symmetric, it's asymptotic to the x-axis, it's unimodal, it's completely defined by the mean and the standard deviation. The mean and the median and the mode are all equal to each other. All of these are wonderful characteristics that we can take advantage of.
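The roles of the mean and the standard deviation can be sketched numerically. The means of 0, 5, and negative 10 come from the example above; the standard deviations of 1 and 2, the grid of x values, and the helper names are assumptions made only for this illustration:

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def peak_location(mu, sigma, grid):
    """Grid point where the curve is highest."""
    return max(grid, key=lambda x: normal_pdf(x, mu, sigma))

# x values from -20.0 to 20.0 in steps of 0.1 (an arbitrary grid for this sketch)
grid = [i / 10 for i in range(-200, 201)]

# Same spread, different means: the peak simply moves to the mean
peaks = {mu: peak_location(mu, 1.0, grid) for mu in (0, 5, -10)}

# Same mean, different spreads: the wider curve has the lower peak
narrow_peak = normal_pdf(0.0, 0.0, 1.0)
wide_peak = normal_pdf(0.0, 0.0, 2.0)

print(peaks)
print(narrow_peak, wide_peak)
```

Doubling the standard deviation halves the height of the peak, because the total area under the curve has to stay the same.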
We're going to take advantage of some of them now as we talk about what we call the empirical rule. So, what is the empirical rule? Well, the empirical rule is basically the idea that the normal distribution has a very predictable shape, and because it's predictable, we can use that shape to our advantage. This all goes back to the world of probabilities: the probabilities for a normal random variable are determined by the area underneath that bell-shaped curve, and the total area under the curve is one. Now, since the total area underneath that bell-shaped curve that I showed you previously is one, and we know that the normal distribution is perfectly symmetric around the mean, which is also the median, then the area under the curve below the mean and the area under the curve above the mean are both 0.5. Half of your data is below the average, half of your data is above the average. That's one of the beautiful aspects of the normal distribution. So, if I were to sit there and ask you a question like, well, what's the probability you get someone that is greater than six feet tall, and you know that the average height of your people is six feet, then, well, half of them are going to be greater than six feet tall, because they follow a normal distribution. Now, again, this only works if it follows a normal distribution. Without that normal distribution, we wouldn't be able to say what we're going to be talking about
here, but again, that split of 50/50 isn't the only thing we can do when it comes to the normal distribution. Like I mentioned, we have something that we refer to as the empirical rule. Well, what does the empirical rule state? Well, the empirical rule has three pieces to it. The first piece is about 68%. To be more exact, it's about 68.27%, but we say roughly 68%, or if you want, roughly two thirds of your data. With the normal distribution, roughly 68% of your data is contained within one standard deviation of the mean. Well, what do I mean by that? I mean that if we were to look one standard deviation below the mean, so the mean minus one standard deviation, and then we were to look one standard deviation above the mean, the mean plus one standard deviation, then we take everything in the middle of that, so everything in between one standard deviation below and one standard deviation above. If we were to look at all of that and ask how much of our data exists in that range, well, if your data follows a normal distribution, then no matter what the mean is, no matter what the standard deviation is, 68% of your data is within one standard deviation of that mean. Isn't that neat? So, again, no matter what the mean or standard deviation is, if you have a normal distribution, we know it has this characteristic. But, like I said, there are three parts to the empirical rule. This is just the first. The second part of the empirical rule is that roughly 95% of your data, again to be more exact, about 95.45%, but roughly 95% of your data, is within two standard deviations of the mean. In other words, if we were to look two standard deviations below the mean and two standard deviations above the mean, and look at all the possible values in between those two, then that would encompass 95% of our data. And again, it doesn't matter what that mean and standard deviation are; if your data follows a normal distribution, this holds true.
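These percentages can be checked directly. The sketch below uses the error function from Python's standard library, which gives the area within k standard deviations of the mean as erf(k / sqrt(2)); the helper names and the two example distributions are assumptions made for illustration:

```python
import math
from statistics import NormalDist

def prob_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

def prob_within_for(mu, sigma, k):
    """The same probability, computed from one specific normal distribution's CDF."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.2%}")

# Two arbitrary distributions, chosen only to show that the answer
# does not depend on the mean or the standard deviation
example_a = prob_within_for(0.0, 1.0, 1)
example_b = prob_within_for(7.5, 2.5, 1)
print(example_a, example_b)
```

To two decimal places the areas come out to about 68.27%, 95.45%, and 99.73%, which is where the rounded 68, 95, and 99.7 figures come from.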
The last component of the empirical rule is that 99.7% of your data, almost all of your data, not quite all of it, but almost all of your data, is within three standard deviations of the mean. So again, if we were to look three standard deviations below the mean, the mean minus three standard deviations, and three standard deviations above the mean, the mean plus three standard deviations, and look at all the possible values of our data in between those two boundaries, then that would basically have 99.7% of our data. Almost all of our data is within three standard deviations of the mean. So all three of these pieces together form the empirical rule, or as some people like to call it, the 68-95-99.7 rule. Again, it has a completely understandable name; it just helps them remember what the values are. So, again, we more formally call it the empirical rule, but like I said, it could be called the 68-95-99.7 rule. And if you think about it, because our distribution is symmetric, and because we know what percentage of the data falls within certain ranges in our normal distribution, we can know a lot of things, right? So, if you were to look between the average, the mean, and one standard deviation above the mean, that little slice is 34%. Well, why? Because if you go one standard deviation on
either side of the mean, that's 68%. So, if we were to split that in two, that would be 34%, and so on and so forth. We can actually fill in all these little blocks. So, again, if you know that 95% of your data is within two standard deviations of the mean, and 68% is within one standard deviation of the mean, that means there's about 27% in between those two ranges. Again, divide that by two, and you've got 13 and a half percent in each of those pieces. So, like I said, we can figure out what all of these little pieces of the normal distribution are. Now, I know you might be thinking this just sounds like math for the sake of math, but this can actually help us analyze data. Let me show you. Let's imagine that new employees at a company have previous years of professional experience that follow a normal distribution where the average is seven and a half years, and the standard deviation is two and a half years. Okay. Well, because it follows a normal distribution, because I know the mean, because I know the standard deviation, I can answer a question like this: What's the probability that any random new employee has between five and 10 years of experience? Well, let's take a look at our normal distribution. So our normal distribution would have this shape again, same shape, but the center is 7.5, the mean is 7.5, and the standard deviation is 2.5. Well, that means that if I were to go from 7.5 down to five and from 7.5 up to 10, one standard deviation below, one standard deviation above, then between five and 10 years of experience contains 68% of my data. In other words, there's a 68% chance that a random new employee has between five and 10 years of experience. Isn't that neat?
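As a rough check on this example, here is a sketch using NormalDist from Python's standard library with the stated mean of 7.5 years and standard deviation of 2.5 years. The variable and function names are made up for illustration:

```python
from statistics import NormalDist

# Years of prior experience for new employees, as described in the example
experience = NormalDist(mu=7.5, sigma=2.5)

def prob_between(dist, low, high):
    """Area under the normal curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)

# One standard deviation either side of the mean: roughly 68%
p_5_to_10 = prob_between(experience, 5, 10)

# The individual slices used to fill in the chart
slice_mean_to_plus_1 = prob_between(experience, 7.5, 10)    # roughly 34%
slice_plus_1_to_plus_2 = prob_between(experience, 10, 12.5)  # roughly 13.5%

print(f"P(5 to 10 years)    = {p_5_to_10:.4f}")
print(f"P(7.5 to 10 years)  = {slice_mean_to_plus_1:.4f}")
print(f"P(10 to 12.5 years) = {slice_plus_1_to_plus_2:.4f}")
```

The computed areas agree with the empirical rule's rounded 68%, 34%, and 13.5% to within a fraction of a percentage point.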
And we can answer a variety of other questions, because we can fill in this normal distribution. Again, we start with a mean of 7.5 in the middle, then we go down by standard deviations: from 7.5 we go down to five, then 2.5, then zero, because again our standard deviation is 2.5. We can do the same thing above the mean: 7.5 plus one standard deviation is 10, plus another standard deviation is 12.5, plus another standard deviation is 15. So again, we can answer a lot of different questions. Let's ask another one, same distribution. What is the probability that any random new employee has between two and a half and 10 years of experience? Well, again, if we were to fill out this chart, we could basically isolate where the two and a half is, where the 10 is, and we can just add those pieces together. We would have about 13 and a half percent of our people between two and a half and five years, about 34% of our people between five and seven and a half years, and another 34% of our people between seven and a half and 10 years. Well, that adds up to around 81 and a half percent of our people having between two and a half and 10 years of experience. In other words, the probability of any one random new employee having between two and a half and 10 years of experience is 81 and a half percent. This is the power of the normal distribution. If we can show our data follows a normal distribution, we can answer so many questions about it. It's kind of like those distributions we were playing with, with the discrete distributions. Remember how we were
dealing with binomial situations? We were just sort of filling in the equation to get an idea of what was going on. Same idea here, except the normal distribution has such well-defined areas that we can look at a variety of possibilities. All right, let's summarize. The empirical rule, also known as the 68-95-99.7 rule, is good for quick, rough analysis of data that follows a normal distribution. Now, it's not the best for exact analysis, unless we're only interested in whole numbers of standard deviations. So for example, if I wanted to go back and you said, well, what about someone with three years of experience all the way up to 14 years of experience, I can't do that as easily here. So that's a downside of this empirical rule approach. So we'll need to find another way to quickly calculate area under the curve when we go fractions of standard deviations away from the mean, but that's the next lecture. For now, that is the end of this lecture, and I look forward to seeing you in the next one.
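For readers who want to check the numbers from this lecture, here is a sketch using the cumulative distribution function from Python's standard library. This is one possible way to get areas for fractional standard deviations, not necessarily the method the next lecture uses, and the names below are made up for illustration:

```python
from statistics import NormalDist

# Same hypothetical employee-experience distribution as in the lecture example
experience = NormalDist(mu=7.5, sigma=2.5)

def prob_between(dist, low, high):
    """Area under the normal curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)

# Whole-number case: 2.5 to 10 years is two standard deviations below to one above
p_2_5_to_10 = prob_between(experience, 2.5, 10)

# Fractional case the empirical rule cannot handle directly: 3 to 14 years
z_low = (3 - experience.mean) / experience.stdev    # -1.8 standard deviations
z_high = (14 - experience.mean) / experience.stdev  # 2.6 standard deviations
p_3_to_14 = prob_between(experience, 3, 14)

print(f"P(2.5 to 10 years) = {p_2_5_to_10:.4f}")
print(f"3 and 14 years sit at {z_low:.1f} and {z_high:.1f} standard deviations")
print(f"P(3 to 14 years)   = {p_3_to_14:.4f}")
```

The first result lands close to the 81 and a half percent obtained by adding the rounded slices, and the second shows that 3 to 14 years covers roughly 96% of new employees under this distribution.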