Video Transcript: Distribution of Continuous Data - Part 1
Welcome. In this next section of the course, we're going to be talking about distributions of continuous data. Let's remind ourselves of some things first. Remember, we've talked previously about what a random variable is. A random variable is a numerical description of the outcome of an experiment. It's basically the notion that we do not know exactly what is going to happen, but we know the set of possible things that could happen. Now, this random variable can either be discrete or continuous. In the previous section of the course, we talked about the discrete case, and remember, a discrete random variable may assume either a finite number of values, for example, the number of TVs sold at a small department store, 0, 1, 2, 3, 4, or 5, or an infinite sequence of values, such as the number of people that could walk into the department store, 0, 1, 2, 3, etc. Well, now we're going to move into the idea of a continuous random variable. A continuous random variable may assume any numerical value in an interval or a collection of intervals. Remember, we've talked about this a little bit previously. We said, what if we had a number like distance to the store, where we could always find a value in between two other values? For example, when looking at a discrete random variable, the number of people who walk into the store, 0, 1, 2, 3, we can't have any logical value between, for example, two and three. There is no such thing as half a person, so 2.5 people walking into the store wouldn't make any sense. However, with a continuous random variable, something like distance, I can always have a distance that exists in between two other distances. So for example, if someone lived two miles away from a store and someone else lived three miles away from the store, it would make sense that someone could actually live two and a half miles away from the store.
Okay, well, if someone can live two miles away from the store, or two and a half miles away from the store, again, I can make a logical argument that someone could live two and a quarter miles away from the store, and see how we can keep doing this over and over and over again. Someone could live between two and 2.0001 miles away from the store. There's always a smaller interval, and there's always a value that makes sense in between two other ones. That's the idea of a continuous random variable. Now, if you remember from last time, when it came to discrete random variables, we were looking at their distributions, and we were trying to calculate things like probability. What's the probability that someone sells two TVs from this small department store today? What's the probability that I roll a die and get a two? What's the probability of me flipping a coin four times and getting heads each time? We were talking about probabilities of random variables. Unfortunately, for continuous random variables, it is not possible to talk about the probability of a random variable taking on a particular value. I don't have the ability here to ask the kind of question I could in the discrete case: what's the probability of me selling exactly two televisions? What's the probability of me rolling exactly three twos in
10 rolls of a die? Instead, when it comes to continuous random variables, we talk about the probability of a random variable assuming a value inside of some kind of interval. Let me show you what I mean visually. So, the probability of a random variable assuming a value inside of a given interval, let's say between any two numbers, let's call those numbers x1 and x2. So, if we were to graph out the distribution, you can see this bell-shaped distribution here on the screen, and I wanted to know, well, what's the probability that my random variable that follows this bell-shaped curve is between the numbers x1 and x2? Well, that's what the shaded area is here on this graph, and in fact, the probability is the value of this shaded area. More mathematically, we call this the area under the curve, or the area under the graph, in between those two points. That graph, that bell-shaped curve you see, is what we call the probability density function. Now, I know I've thrown a lot at you here, so let me go ahead and try and help you connect these dots. Remember when we were talking about discrete random variables, we could ask something like, what's the probability of selling two TVs today, or what's the probability of selling four TVs today, or you could ask, what's the probability of selling two or three or four TVs today? All of those things were possible. However, a random variable being continuous means that there's an infinite number of possible values, and each single value takes up such an infinitesimally small share that the probability of any one exact value is zero. So instead I can give you a range of values. So, what's the probability that, for example, the height of the people in this class is between five foot six and six feet tall?
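As a rough illustration of "probability equals area under the curve", the sketch below approximates the shaded area between x1 and x2 for a bell-shaped curve. It assumes a standard normal curve (mean 0, standard deviation 1), which the lecture does not specify, and the function names and the values of x1 and x2 are made up for this note.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Height of the bell curve at x (assumption: a normal distribution)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def area_under_curve(pdf, x1, x2, steps=10_000):
    """Approximate the area under pdf between x1 and x2 (trapezoid rule)."""
    width = (x2 - x1) / steps
    total = 0.5 * (pdf(x1) + pdf(x2))
    for i in range(1, steps):
        total += pdf(x1 + i * width)
    return total * width

# Probability of landing between x1 = -1 and x2 = 1 (illustrative values).
p_interval = area_under_curve(normal_pdf, -1.0, 1.0)

# An interval of zero width has zero area, which is why a single
# exact value has probability zero for a continuous variable.
p_single_value = area_under_curve(normal_pdf, 1.0, 1.0)
```

Widening the interval increases the area, and shrinking it toward a single point drives the area to zero, which matches the point made above about exact values.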
Again, there can be so many different heights in between five foot six inches and six feet tall, and with that being the case, I am saying I can't give a meaningful probability of being exactly one specific height, but I can calculate the probability of being in between two numbers. That's the general idea, that's what we're doing with continuous variables now. If it doesn't completely make sense yet, don't worry, that's what this whole section of the course is about. This whole section of the course is about continuous random variables and their distributions. So, we'll see many examples of how we deal with this in the real world that will hopefully connect this idea for you, but again, this is the foundation of everything we're going to be talking about over the next three lectures. Here are some popular continuous distributions. You may have heard about them before. The first one is the uniform distribution; we'll talk about this one here in this lecture in a few moments. It basically is the idea that every value has an equal chance of happening. It's kind of like the flip of a coin with the discrete random variable, but here I have an infinite number of values, and they all have an equal shot. Another common distribution you may have heard about is what we refer to as the exponential distribution. Think of something growing exponentially, or something decreasing exponentially. And last, but not least, one of the most
common distributions that we work with in statistics is what we refer to as the normal distribution. Sometimes, more formally, people call it the Gaussian distribution. I prefer to call it the normal distribution, so we'll go with that. And although you may not have ever known its name, you've probably seen it before. It's that bell-shaped curve. So, in summary, a continuous random variable may assume any numerical value in an interval or a collection of intervals. Now, again, it is not possible to talk about the probability of the random variable assuming a particular value; instead we talk about probabilities of intervals for these variables. So let's talk about that first distribution we mentioned, the uniform distribution. It's a great distribution to start with, and it's a great one for seeing some basic examples that help solidify this concept of ranges of values instead of an exact value. So a random variable follows a uniform distribution whenever the probability is proportional to the interval's length. Whoa, okay, that first bullet point has a lot of big words there, and I'm not sure I understand all of them, so let's talk about it again. A random variable follows a uniform distribution whenever the probability is proportional to the interval's length. In other words, whenever every single possibility has an equal chance of happening. That's the basic idea here. Every value has an equal probability of occurring. Okay, so the probability density function for the uniform distribution is given by this equation. So, what box equation am I talking about? The picture where you saw just this box, that's the distribution. If you want to know what its equation is, that's what you see here. It is one over b minus a. Well, wait, what are b and what are a?
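The slide itself is not part of this transcript, so as a note, here is the standard textbook form of the uniform probability density function that "one over b minus a" refers to (the symbol f is an assumption of this note; a, b, and x are the speaker's):

```latex
f(x) =
\begin{cases}
\dfrac{1}{b - a}, & a \le x \le b \\[6pt]
0, & \text{elsewhere}
\end{cases}
```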
B and a are the endpoints of the range of values of x. So for example, if I have a uniform distribution between the values of zero and one, then a would be zero, b would be one, and the function would be one over one minus zero, which is one. Now that is the main part of the distribution. For the rest of the values of x, the function takes a value of zero. Anything outside this range has a zero chance of actually happening. Let me give it to you in a real world example. Let's assume that sales calls that go into a company are uniformly distributed by the years of experience of the sales staff, so that everyone has the same chance of getting a call. So again, think about a variety of different people on a sales staff, each taking sales calls. To avoid any favoritism, every single person has an equal chance of getting a sales call. Okay, well, then that means that it doesn't matter what years of experience you have. You could have anywhere from two years of experience to 12 years of experience, and that's what we're going to say for our example, but every one of them has an equal chance of getting a sales call. So if we were to think about this in terms of that equation, we would say x, the number of years of experience, takes values between two and 12, and the chances of getting any one of those values of x between two and 12 years of experience are the same. Basically, the chances of you getting anyone in the sales department are equal. The years of experience
has no bearing on your likelihood of getting a sales call, so our equation would be one over b minus a, one over 12 minus two, and that would give a value of 1/10. So, let me show you what this looks like visually. Okay, so I have one over 10 as the height of this rectangle. The rectangle runs between two and 12. Oh, wait a minute. The length, then, if you think about it, is 10, right? What's 12 minus two? Well, 12 minus two is 10, so it's actually not surprising that the height of this rectangle is 1/10, because I'm spreading a total probability of one equally across that entire length of 10. So I have a height of 1/10. Let me go back. That is what I mean by this first bullet point. A random variable follows a uniform distribution whenever the probability is proportional to the interval's length. Remember, in other words, every value is equally likely. The length of this interval is 10, it's between two and 12, and every value gets that same height of 1/10. Hopefully that helps connect the dots a little bit, but how can we use this? Well, you could answer some questions if you knew that this was true. If you knew that sales calls go into a company and they're uniformly distributed by the years of experience of the sales staff, then you could ask yourself a question like this: What is the probability a call is answered by an employee with 10 to 12 years of experience? Oh, well, that's just this shaded area of the rectangle, and that's what we're talking about when it comes to an interval. I can't tell you the exact probability that someone with exactly 10 years of experience will answer the sales call. I can't tell you the exact probability of someone with exactly 11 years of experience answering the sales call, because someone could have 10.5 years of experience or 10.8 years of experience. There are so many different values, but I can tell you the probability that someone with between 10 and 12 years of experience answers that sales call.
It's the highlighted area of the graph. So, again, going back to what we saw previously (sorry, I'm going to jump back a few slides here) when we were looking at this interval: we said that if we looked at an interval between two values and looked at the area under the graph, that is the probability, and that's all we're doing right here. What is the area under that graph? It's the probability that the person who takes your call has between 10 and 12 years of experience, if calls are uniformly distributed across all salespeople. Well, that just means that it's going to be a length of two, 12 minus 10, times the height. We're just looking at the area of that rectangle. Now we're just going back to the idea of how you calculate the area of a rectangle. You take its length and multiply it by its height. So here I go from 10 to 12, so I have a length of two. I multiply it by its height of 1/10. Oh, I get a number that's 0.2. In other words, there's a 20% chance that if you were to call someone at this sales center, they would have between 10 and 12 years of experience. Now, again, that would make intuitive sense, right? If you can have anywhere from two to 12 years of experience, that is an interval of 10. And if I wanted to look at the chances of you having between 10 and 12 years of experience, that's an interval of two.
Well, two out of 10 is the same as 20%. That's what we're doing here. So hopefully that example helps you see how to look at an interval when it comes to a continuous variable. Now, the nice part about distributions is that here we also have the idea of being able to calculate expected values and variances. The expected value of a uniform distribution is the sum of the two values on the outside of the range, a plus b, divided by two. Oh, wait a minute, it's basically the average of the smallest value and the largest value. Now, what is variance? The variance, or the spread of this distribution, is just going to be the largest value b minus the smallest value a; we square that number and divide it by 12. Don't worry about the math on how we got that equation, that's for a more complicated class. This is just a measure of what the spread is. Okay, so again, now we can go back to our same example. Assume that sales calls that go into a company are uniformly distributed by the years of experience of the sales staff, so that everyone has the same chance of getting a call. So another question we could ask, instead of what's the probability of you getting someone with between 10 and 12 years of experience, is: what is the expected years of experience of a person answering a new sales call? Okay, well, again, that makes sense. That's a reasonable question. What do we expect to have in terms of experience for people answering sales calls? Well, we can calculate that. The expected value is nothing but the smallest number plus the largest number divided by two, or just the average of those two numbers, which for us is two plus 12, that's 14, divided by two, which is seven. That would be right in the middle, right in between two and 12, which again makes intuitive sense, right? If everyone has an equal chance of being selected, the youngest person in terms of experience has two years of experience.
The person with the most years of experience has 12 years of experience. Then essentially, what we have is, well, I expect someone with about seven years of experience, right in the middle, to be the person answering the sales call. Not every time, but on average. It's like rolling a die, right? What is the average roll of a die? Well, I'm not going to get a one every time or a six every time, I'm going to get all the values equally often, but on average I'll roll something in the middle. Same here: I can get a two just as easily as I can get a 12, but on average I would expect a seven, in the middle. We can do the same thing with variance. The variance, the spread of this, would be the largest value, 12, minus the smallest value, two. So, 12 minus two is 10. 10 squared is 100, and 100 divided by 12 is about 8.33. So, again, we have a value of 8.33 in terms of the spread of this distribution. Now, we've talked about standard deviation in the past. If you'd like, you can take the square root of 8.33 to get the standard deviation of this distribution. I invite you to do that yourself. All right, we've talked about a lot in this lecture. Let's summarize. So a random variable follows a uniform distribution whenever the probability is proportional to the interval's length. Remember the example we went through: we had an interval length of 10, you had between two
and 12 years of experience, therefore the height of the density at each of those values was 1/10. Now, the probability density function for the uniform distribution is the equation that you see here. Again, it helps relate that idea: pick any interval between a and b, and the probability density function for each value is just one over the difference between a and b. So, again, in our example, a was 2, b was 12, so there's a difference of 10. So our calculation was one over 10. I know it's a lot of stuff, but through all these examples, we're hopefully solidifying this idea of how to think about randomness, how to think about probability, now not just for a discrete distribution, but for a continuous distribution. So, that is the end of this lecture. I look forward to seeing you in the next one.
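The numbers from the sales-call example can be checked with a short Python sketch. The function names are invented for this note and are not from the lecture. The formulas are the ones stated in the lecture: a height of one over b minus a, interval probability as length times height, expected value as a plus b over two, and variance as b minus a, squared, over 12.

```python
import math

def uniform_pdf(x, a, b):
    """Density of a uniform distribution on [a, b]: 1/(b - a) inside, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_interval_prob(x1, x2, a, b):
    """P(x1 <= X <= x2): length of the overlap with [a, b] times the height."""
    lo, hi = max(x1, a), min(x2, b)
    return max(hi - lo, 0.0) / (b - a)

def uniform_mean(a, b):
    """Expected value: the midpoint of the range."""
    return (a + b) / 2.0

def uniform_variance(a, b):
    """Variance: the squared length of the range divided by 12."""
    return (b - a) ** 2 / 12.0

a, b = 2, 12  # years of experience in the lecture's example

height = uniform_pdf(5, a, b)                     # 1/10
p_10_to_12 = uniform_interval_prob(10, 12, a, b)  # 2 * 1/10 = 0.2
expected = uniform_mean(a, b)                     # 7.0
variance = uniform_variance(a, b)                 # about 8.33
std_dev = math.sqrt(variance)                     # about 2.89
```

The last line answers the exercise left open in the lecture: the square root of 8.33 is roughly 2.89 years.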