Video Transcript: Distribution of Continuous Data - Part 3
All right, let's finish our conversation about continuous data by continuing where we left off in our last lecture. We finished with the idea of the empirical rule. However, we realized at the very end that the empirical rule works really well when you're looking at specific points that sit exactly one, two, or three standard deviations from the mean. If we have values that aren't on those exact points, it's a little bit harder to figure out. That is where standardized scores come in. A random variable having a normal distribution with a mean of zero and a standard deviation of one is said to have what we call a standard, or standardized, normal probability distribution. The reason this is important is that people have basically taken the empirical rule to the extreme. They have done it not just for one, two, and three standard deviations away from the mean. For a standard normal distribution, one with a mean of zero and a standard deviation of one, they have taken the empirical rule all the way down to multiple decimal places: 0.01 standard deviations from the mean, 0.02 standard deviations from the mean, and so on, all the way up to 3.49 standard deviations from the mean. They have put all of these calculations into a table that we call a probability table. So we can use these previously done calculations (or, in all honesty, you can use software like Excel, a variety of other software, or a calculator to do this as well) to answer the same kinds of questions we had last time about the normal distribution, but with a lot more fine-tuned detail. Now, the only downside is that these calculations have been done for a single normal distribution: again, a normal distribution with a mean of zero and a standard deviation of one.
I know what you might be thinking: well, what if my normal distribution doesn't have a mean of zero or a standard deviation of one? What in the world am I to do? The nice part about the mathematics of the normal distribution is that all normal distributions can be converted into standard normal distributions, so that we can calculate these probabilities much more easily, and that's what we're going to be talking about in this lecture. First of all, let's talk about the idea of a standard normal table. On the course website, you'll see an actual full standard normal table. This is just a snippet of it. Here, the standard normal table is an extension, again, of the empirical rule, where the area underneath the standard normal curve to the left of any point is calculated for points given to two decimal places. Let me show you. You'll notice that there are rows and that there are columns. Together, these form the two-decimal-place number. So let's take a look at the row 0.8 and the column 0.03. What that is telling you is that, on a standard normal distribution, the probability to the left of the point 0.83 is 0.7967. Some people see things better with numbers, but some people see things better with pictures. Let's take a look at a picture. So, again, we're going to take a look at the point 0.83. Well, what is that point? It's the point 0.83 standard deviations above the mean.
If we were to look at the point that is 0.83 standard deviations above the mean, and we were to look at all of the data that exists below that point, so if we were to shade in the entire normal curve below the point that is 0.83 standard deviations above the mean, then the shaded area would be roughly 80% of your data, or more specifically, 0.7967. So, hopefully that makes sense. Again, we can do the same thing with any point given to two decimal places on a standard normal curve. For example, if we were to look at one half of a standard deviation above the mean, we would look at 0.5 for the row, 0.00 for the column, and we would see a probability of 0.6915. So again, we're looking at different numbers of standard deviations and how much data exists below each of these points. Now you may be thinking, what if I don't want the data below a point? What if I want the data above a point? Well, the nice part is, remember some of those rules of probability. So now all of the class is going to start building on itself. One of the rules of probability is that all possible outcomes have to add up to a probability of one. So if I know that 0.7967, or 79.67%, of my data is below the point 0.83, then the probability above that point, or to the right of that point, would be one minus that number. In other words, if 79.67% of your data is below this point, then 20.33% of your data must be above this point, right? If all of your data makes up 100%, then pick any point: if you know the probability of being below that point, you also know the probability of being above that point. It's just one minus whatever you had before. Now, again, this works great if we have a standard normal distribution, but what if we don't have a standard normal distribution? What if our data doesn't follow that? If our data doesn't follow a standard normal distribution, we're going to have to convert it.
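For anyone who would rather check these lookups with software than with the printed table, here is a minimal Python sketch. It assumes Python 3.8 or later (for `statistics.NormalDist`), and the helper names `area_below` and `area_above` are made up for illustration; they are not from the lecture.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def area_below(z: float) -> float:
    """Area under the standard normal curve to the left of z (what the table lists)."""
    return STANDARD_NORMAL.cdf(z)

def area_above(z: float) -> float:
    """Area to the right of z, using the rule that all probabilities add up to one."""
    return 1.0 - area_below(z)

# Row 0.8 with column 0.03 gives z = 0.83; row 0.5 with column 0.00 gives z = 0.50.
for row, column in [(0.8, 0.03), (0.5, 0.00)]:
    z = row + column
    print(f"z = {z:.2f}: below = {area_below(z):.4f}, above = {area_above(z):.4f}")
```

Rounded to four decimal places, the "below" values should agree with the table entries quoted above (0.7967 and 0.6915), and each "above" value is simply one minus the matching "below" value.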
Luckily, all normal distributions can be converted into standard normal distributions to make these probabilities under the curve easier to calculate. So let's imagine you had a normal distribution like the one you see here. It's a little more spread out than a standard normal, and it's not centered at zero; it's centered at 10. What we can do, essentially, is shift the distribution so it's centered at zero, and then either shrink or expand the distribution so it has a standard deviation of one. Once we do that, we can rely on the premise that all normal distributions have the same relationship when it comes to standard deviations and means. So if I want to know this shaded area on the upper distribution, and I can find that same point on a standard normal distribution, then these two shaded areas will be the same. If I know that, for example, roughly 20% of my data is above the point 0.83 on a standard normal distribution, then wherever that corresponding point is on my normal distribution, it is also going to have roughly 20% of the data above it. Now, how do we convert these other normal distributions to standardized normal distributions? What we do is use what we call a z score, or a standardization score, that allows us to convert any single point on a normal
distribution with a mean of mu and a standard deviation of sigma to the corresponding point on the standard normal distribution. In other words, pick any point on any normal distribution, subtract that normal distribution's mean, and divide by its standard deviation: take the point x, subtract the mean, divide by the standard deviation, and that's going to give you the point on the standard normal distribution.
Let me show you through an example again. Let's use our bike data example, and let's assume the daily number of total users follows a normal distribution. If we assume that, and we know that the average daily number of total users is 4504 with a standard deviation of 1937, then we can ask ourselves this question: what's the probability that any random day has more than 6000 total users? Again, a very reasonable question. We know 6000 is more rare than 4504, because 4504 is right in the middle of our data, but we want to know just how rare this is. So, when a typical day is 4504 users, what's the probability that any random day has more than 6000 users? Unfortunately, there's not a normal probability table for the normal distribution with a mean of 4504 and a standard deviation of 1937, but there is a table for the standard normal distribution. So, let's convert our data to that. If we take 6000, subtract our mean of 4504, and divide by our standard deviation, 1937, then we get a value of 0.77. In other words, the value 6000 is 0.77 standard deviations above the mean of 4504. If you were to take 0.77 times 1937, that would be roughly the difference between 6000 and 4504. That's all we're saying. And so the point 6000 on that normal distribution is the same point as 0.77 on the standard normal distribution. So let's look up our standard normal table.
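The conversion just described is z = (x - mean) / standard deviation. A minimal Python sketch of that arithmetic for the bike example follows; the function name `z_score` and the constant names are hypothetical, chosen only for illustration, and the figures are the ones quoted in the lecture.

```python
def z_score(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x sits above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (x - mean) / std_dev

# Bike data figures quoted in the lecture.
MEAN_DAILY_USERS = 4504
STD_DAILY_USERS = 1937

z = z_score(6000, MEAN_DAILY_USERS, STD_DAILY_USERS)
z_for_table = round(z, 2)  # the printed table only goes to two decimal places
print(f"z = {z:.4f}, rounded for the table: {z_for_table:.2f}")
```

The unrounded value is a little above 0.77; rounding to two decimal places is what lets us use the row 0.7 and the column 0.07 in the table.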
Well, if we were to try and figure out the area under the curve, the probability of being below this point on a standardized normal distribution, we would look at the row 0.7 and the column 0.07. That gives us the point 0.77, and we'll see the probability is 0.7794. Oh, neat. Okay, so in other words, there is a 77.94% chance that we're going to have fewer than 6000 total users in a day, but that's not what the question was. The question was, what's the probability that any random day has more than 6000 total users? Well, if the table told us that there's a 77.94% chance of being less than 6000, then we can just subtract that from one, and we would know there's a 22.06% chance of being greater than 6000, or a probability of 0.2206. See how that works? I didn't need the empirical rule. In fact, I couldn't use the empirical rule, because 6000 wasn't a nice number on my distribution with a mean of 4504 and a standard deviation of 1937, but I was still able to figure out the probability of being above this number. Visually, what we're saying is that the point 6000 on the normal distribution with 4504 as the mean and 1937 as the standard deviation is the same point as 0.77 on a standard normal distribution with a mean of zero and a standard deviation of one, which means the shaded area in the tail is the exact same 0.2206. Now I know this
chart that you see here, this picture, has shaded areas that don't look like they're exactly the same. You'll have to pardon me. This is not an exact normal distribution, but with two exact normal distributions, the areas would be exactly the same.
Let's work through another example to make sure we've got this. Again, assume that the daily number of total users follows a normal distribution. The average daily number of total users again is 4504, with a standard deviation of 1937. Well, instead of asking what's the probability that we have more or less than a certain number of users in a day, you may ask sort of the reverse question: what is the number of daily users that would be in the bottom 10% of daily users? So, if we wanted to know, you know, how bad could bad get on a day of daily users? I know on average I expect 4504. Great, but I want to know what the worst 10% of days look like. How many total users did I have on the bottom 10% of days? We would work this problem in reverse. Instead of starting from x and looking up a probability, we're now starting from a probability and trying to look up a z value, so we would start from the center of the table and work our way outwards. Basically, what point on a standard normal distribution has only 10% of the data below it? Well, if we were to look at the point negative 1.28, so the row negative 1.2 and the column 0.08, we would see a value that's really close to 10%: 0.1003. It's as close as we can get with our table. In other words, on a standard normal distribution, 10% of your data is below negative 1.28, but what's that on our distribution? Well, we just have to work backwards. If we know negative 1.28 is the point on the standard normal distribution we want to look at, we need to find x.
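Working backwards means solving z = (x - mean) / standard deviation for x, which gives x = mean + z * standard deviation. Here is a rough Python sketch of that reverse lookup for the bike example. It assumes Python 3.8 or later for `statistics.NormalDist`, and `value_from_z` is a hypothetical helper name used only for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def value_from_z(z: float, mean: float, std_dev: float) -> float:
    """Undo the z score: solve z = (x - mean) / std_dev for x."""
    return mean + z * std_dev

# Bike data figures quoted in the lecture.
MEAN_DAILY_USERS = 4504
STD_DAILY_USERS = 1937

# Reverse lookup: which z has 10% of the area to its left?
z_exact = STANDARD_NORMAL.inv_cdf(0.10)  # software is not limited to two decimals
z_table = -1.28                          # closest entry in the printed table

x_table = value_from_z(z_table, MEAN_DAILY_USERS, STD_DAILY_USERS)
x_exact = value_from_z(z_exact, MEAN_DAILY_USERS, STD_DAILY_USERS)
print(f"z from software: {z_exact:.4f}")
print(f"cutoff using the table's z: {x_table:.2f}")
print(f"cutoff using the software's z: {x_exact:.2f}")
```

Because the software's z carries more decimal places than the table's negative 1.28, its cutoff comes out a few users lower than the table-based one. Either way, the answer is roughly 2000 users.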
So I know x minus the mean, divided by the standard deviation, has to give me negative 1.28. Through some playing around with algebra, we can figure out that x is 2,024.64. In other words, on the worst 10% of days, we're going to have roughly 2000 users or fewer that actually use our bike rental service that day. Again, what we're basically saying is that the point 2,024.64 is the same point as negative 1.28 on the standard normal distribution. So hopefully this has allowed you to see that no matter what normal distribution you have, you can answer any kind of probability question on it, because we have this nice consistent shape across all normal distributions.
So let's summarize. A random variable having a normal distribution with a mean of zero and a standard deviation of one is said to have a standard normal distribution. All normal distributions can be converted into this standard normal distribution, which is so helpful for us, because we actually know what kind of probabilities exist on a standard normal distribution through the help of standard normal probability tables. That means that we can ask and answer any kind of probability question on any kind of normal distribution. You have your data, you have a certain normal distribution, and you can say, hey, what's the probability above or below a certain point? All we have to do is convert it to a standard normal distribution first, and then we can answer that
exact question. That's the beauty of math, and it's why the normal distribution is so important to statistics, and really beyond statistics: it can answer so many questions for us. But that is the end of this lecture. That is the end of this section, and I look forward to seeing you in the next.