Video Transcript: Distribution of Discrete Data - Part 1
Welcome to the next section of the course. Let's talk about distributions of discrete data. Of course, to be able to talk about distributions, we have to answer the question: what are distributions? Well, unfortunately, to answer that question, we need to ask another question: what is a random variable? A random variable is a numerical description of the outcome of an experiment. Random variables can be either discrete or continuous. A discrete random variable is one that can assume either a finite number of values or an infinite sequence of values. Let me give you some examples to hopefully help you understand these concepts.
Let's imagine we have a random variable, let's call it X, and let X be the number of TVs sold at a small department store in one day. So again, a random variable here is a numerical description of the outcome of an experiment. Let's let the experiment be how many TVs we're going to sell at a small department store. We don't know what that value is going to be ahead of time. That's what makes it random. It's a variable because it's a numerical description of some kind of outcome. So a random variable is a numerical description whose value we're not sure of ahead of time, out of a known set of possible values. Again, the idea of random here is that we know what could happen; we're just not sure which outcome actually does happen.
The idea of discrete is a notion where we have a finite number of values. So, for example, again let X be the number of TVs sold at a small department store in one day, but let's imagine the store only has five televisions in stock. Therefore, the number of TVs sold at that department store in one day can only take the values 0, 1, 2, 3, 4, and 5. It can't be anything other than those. It can't be more than 5, because 5 is all I have. It can't be anything that's not an integer, for example 2.3, because what is 2.3 televisions? So it can only take the values zero through 5.
That is an example of a finite number of values, and that would make this random variable, the number of TVs sold at a small department store, a discrete random variable. But a finite number of values isn't the only way you can have a discrete random variable. We also could have an infinite sequence of values. So, for example, let's now let our random variable, again we'll call it X, be the number of customers arriving in one day at that small department store. That being the case, X can take a wide range of values, starting at zero, where no one shows up to the department store, then 1, 2, 3, 4, and so on and so forth, with really no cap on top. You could almost imagine there could be an infinite number of people that show up at this small department store in a day. Now, again, this is an example of a discrete random variable. You may be thinking, well, why is this discrete? I have so many possible values this could take. Well, it's because it only takes predetermined values, for example, the integers 0, 1, 2, 3, 4, and so on. So, again, and this is a little hard to grasp sometimes, a discrete random variable is a random variable, a numerical description of the outcome of an experiment, where that outcome is one of either a finite number of values or an infinite sequence of values, something like the integers 0, 1, 2, and so on.
But I said that random variables can either be discrete or continuous. So, what would a continuous random variable be? Well, a continuous random variable may assume any numerical value in an interval or collection of intervals. Really, you can think about it as any single possible value between two numbers. So let's again go over a couple of examples. It may help us. Let's look at a discrete random variable first. Let X be a discrete random variable, where it's the number of individuals living in a home. Again, that can take on a value of 0, 1, 2, all the way up, probably to the capacity of the home. So, again, it is a discrete set of values. However, a continuous random variable example would be something where X, being a random variable, is the distance in miles from home to a store. Now, if you think about distance, it can take on any possible value in between two numbers. Let's imagine you have a store that is two miles away. Well, then that means there could also be a store that's 1.9 miles away, which means there could also be a store that's 1.8 miles away. Well, is there a store that's in between 1.8 and 1.9? Well, yes, you could have a store that's 1.85 miles away. The idea of a continuous random variable is that you can always find another possible value in between any two values that you name. This is not the case for discrete. Again, let's look at discrete as the number of individuals living in a home. Let's imagine I give you two possibilities, zero and 5. Well, you can find a value that's possible in between, let's say 3. Okay, well, now you can find a value between zero and three, let's say 1, but now you have to stop. There are no values possible for the number of people living in a home between zero and one. We can't have fractions of a person. However, for a continuous example, you can always find a value in between.
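If it helps to see that "in between" idea in code, here is a minimal sketch in Python. The helper names `midpoint` and `integers_strictly_between` are made up for this illustration, and exact fractions are used only so the arithmetic comes out clean. It checks the two examples from the lecture: a distance between 1.8 and 1.9 miles, and a whole number of people between 0 and 1.

```python
from fractions import Fraction


def midpoint(a, b):
    """Return the exact value halfway between a and b."""
    return (Fraction(a) + Fraction(b)) / 2


def integers_strictly_between(a, b):
    """List the whole numbers strictly between the integers a and b."""
    return list(range(a + 1, b))


# Continuous: distance in miles. Between 1.8 and 1.9 there is always
# another possible distance, for example the midpoint.
mid_distance = midpoint(Fraction("1.8"), Fraction("1.9"))

# Discrete: number of people living in a home. There are values between
# 0 and 5, but nothing at all between 0 and 1.
between_0_and_5 = integers_strictly_between(0, 5)
between_0_and_1 = integers_strictly_between(0, 1)

print(float(mid_distance))  # 1.85
print(between_0_and_5)      # [1, 2, 3, 4]
print(between_0_and_1)      # []
```

The empty list at the end is the gap that makes the count of people discrete. For distance, you could keep calling `midpoint` on narrower and narrower ranges and never run out of values.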
So, again, if I said, "Well, I live from zero to 10 miles away," you can find a number in between zero and 10. Let's say 5. Okay. Well, you can find a number between zero and 5. Let's say 1. You can still find a distance in between zero and one. Let's say half a mile. Well, you can also find a distance that's between zero and half a mile, a quarter of a mile, and so on and so forth. Again, there's an infinite number of possible values in a small range, because you can always find another number in between them. That's what makes it continuous. There is no gap in between any two values; you can always find another value in between. Again, I know these concepts can be a little bit difficult sometimes, but it's okay. Take a moment and try to wrap your mind around these things. The best way to think about discrete is that it can take only predefined values, where you can find some gap in between two values, and a continuous variable can take on any value in a range, where you can't find any gap in between two values.
So, let's summarize. A random variable is a numerical description of the outcome of an experiment. Now, a random variable can be either discrete, where it may assume either a finite number of values or an infinite sequence of values, or a random variable can be continuous, where it may assume any numerical value in an interval or collection of intervals. Sometimes it's best to see these in examples, so that's what the rest of this section is going to do. We're going to be talking about discrete random variables and discrete distributions. The next section of the course will talk about continuous random variables and continuous distributions.
So, let's jump in. Let's talk about discrete probability distributions. Hold on, I threw another word in there: probability. Luckily, we've seen the word probability before. We talked about it in our previous section on randomness. Now we're just applying that to a distribution. The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable. Well, remember, what is a random variable? It's a numerical description of the outcome of an experiment, with a set of possible values. So, if I'm assigning probabilities over the values of a random variable, then I'm assigning a probability to each possible outcome. Essentially, we're trying to ask the question: what is the frequency of occurrence of the different values of the variable? If we know how often something occurs, or with what probability something occurs, then we can understand the distribution of that variable, or the distribution of that data.
Let's talk about some brief terminology first. When I say frequency, I mean the number of observations in each category of the data set. If we wanted to think about it simply, just count up the number of times you see that category occur in your data. Relative frequency is the proportion of times you see that category occur. So, for example, if we had 10 observations and 5 of them were blue, then the frequency would be 5 and the relative frequency would be one half, 5 of the 10. Now, the cumulative frequency is the sum of the frequencies of all of the categories up to a certain point. So, again, let's imagine you had multiple categories.
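Here is a rough sketch of those three terms in Python, just to make the counting concrete. The 10 observations with 5 blue come from the example above; the other two colors and their counts (3 red, 2 yellow) are made up for illustration, as are the variable names.

```python
from collections import Counter
from itertools import accumulate

# Ten observations. "5 of them were blue" is from the lecture; the
# red/yellow split is an assumption made only for this sketch.
observations = ["blue"] * 5 + ["red"] * 3 + ["yellow"] * 2

# Frequency: how many observations fall in each category.
frequency = Counter(observations)

# Relative frequency: the proportion of observations in each category.
total = len(observations)
relative_frequency = {color: count / total for color, count in frequency.items()}

# Cumulative frequency: a running total as categories are added in order
# (blue, then blue + red, then blue + red + yellow).
categories = list(frequency)
running_totals = accumulate(frequency[color] for color in categories)
cumulative_frequency = dict(zip(categories, running_totals))

print(dict(frequency))       # {'blue': 5, 'red': 3, 'yellow': 2}
print(relative_frequency)    # {'blue': 0.5, 'red': 0.3, 'yellow': 0.2}
print(cumulative_frequency)  # {'blue': 5, 'red': 8, 'yellow': 10}
```

Notice that the last cumulative frequency always equals the total number of observations, which is a handy sanity check.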
The idea of cumulative frequency is that you're not just looking at the number of observations in one category, but you're looking at the number of observations growing as you add more categories. So again, let's imagine I had an example where I had 10 observations, and I was looking at three different colors. Let's imagine colors of car, let's say blue, red, and yellow. Well, I could look at the number of blue cars, then I could look at the number of blue and red cars, then I could look at the number of blue, red, and yellow cars. See how I'm adding these categories together? That's what we call cumulative frequency. I'm cumulatively bringing the frequencies together. The same thing holds for cumulative relative frequency. If relative frequency is the proportion of times a category occurs, the cumulative relative frequency is again accumulating all of those proportions together. So we can use these relative frequencies as an estimate of the probability of an event occurring. Remember, when we talked about probability in our last section, we said that we could use relative frequencies, or historical data, to estimate the probability of something occurring. That's exactly what we're going to do.
Probability distributions for discrete data, for discrete random variables, are best described with tables, graphs, or equations. Let's see an example. Let's go back to our small department store, so let X be the number of TVs sold at a small department store in one day, where X can only take the values 0, 1, 2, 3, 4, or 5. So let's imagine we observed the past year of data. Let's imagine we observed 365 days. Well, let's see what we have down below. In the first column, we have the number of TVs sold that day. In the second column, we have the number of days where we saw that number of TVs sold. Again, this is the frequency. So, for example, let's look at that first row. That first row is saying that there were 90 days in the last year where we sold zero TVs. The second row in our data is telling us that there were 85 days in the last year where we sold one TV, the third row is telling us there were 70 days in the last year where we sold two TVs, and you can see the rest for three, four, and 5 TVs sold in a day.
Let's think about what the cumulative frequency and the relative frequency would be for these. For the first row, this is just the first category, so the cumulative frequency would still remain at 90. It's still only looking at zero TVs sold. The relative frequency takes the frequency of this category, the number of days we observed it, and divides it by the total number of days, or the total number of observations in our data set. Well, there are 365 days in our data set, and we observed this category, zero TVs sold, 90 times. In other words, about 25% of the time, or a proportion of .25 of the time, this small department store sold zero TVs. Okay, let's now look at the second row. We have one TV sold 85 times, so on 85 days of the last year we only sold one TV. Well, what's the cumulative frequency? The cumulative frequency would be the summation of these. It would be looking at zero TVs sold or one TV sold. So now we have 90 plus 85, which is 175. We're adding the categories together.
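To check those numbers yourself, here is a small Python sketch of the table arithmetic. It uses only the three counts stated so far (90, 85, and 70 days) and the 365-day total; the counts for three, four, and five TVs are not stated in this transcript, so they are left out rather than guessed. The names `TOTAL_DAYS`, `days_by_tvs_sold`, and `rows` are made up for the example.

```python
TOTAL_DAYS = 365

# Number of TVs sold in a day -> number of days that happened.
# Only the rows read out so far; rows 3 to 5 are intentionally omitted.
days_by_tvs_sold = {0: 90, 1: 85, 2: 70}

rows = []
cumulative = 0
for tvs_sold, days in days_by_tvs_sold.items():
    cumulative += days                      # running total of days so far
    relative = round(days / TOTAL_DAYS, 2)  # share of the whole year
    rows.append((tvs_sold, days, cumulative, relative))

print("TVs sold | frequency | cumulative | relative")
for tvs_sold, days, cumulative, relative in rows:
    print(f"{tvs_sold:8d} | {days:9d} | {cumulative:10d} | {relative:8.2f}")
```

Each relative frequency is divided by the full 365 days, not by the running total, which is why the sketch keeps `TOTAL_DAYS` separate from `cumulative`.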
That's the idea of cumulative frequency. How do we get the relative frequency? Here, 85 days in the one-TV-sold category out of the 365 days leaves you with a proportion of .23, or let's say around 23%. Again, I invite you to fill out the rest of this table to see if you understand all the concepts. So, if we go down to the two-TVs-sold row, how many days did we sell two TVs? Well, there were 70 days where we sold two TVs. The cumulative frequency for this row would cover the 0, 1, and 2 categories: 90 plus 85 plus 70, which is 245. And the relative frequency would be 70 over 365, or about .19. Again, I invite you to go and fill in rows 3, 4, and 5, and make sure you understand how I'm getting the numbers for the cumulative frequency and relative frequency that you see.
Well, let's take a look at that relative frequency for a moment. Those kind of look like, well, probabilities, right? If I told you, okay, let's just pick a random day over the last year. What's the probability that you sold zero TVs on that day? Well, you could just look at the relative frequency and say, well, about 25% of the time we sold zero TVs. So, if I had to say what the probability is that we sold zero TVs on one single day in the last year, it would be about 25%. Again, this goes back to our last section, where we were talking about probabilities and using relative frequencies to estimate what probabilities are. Now, when it came to flipping a coin, we could use intuition. But here, when we have data like this and the question is how many TVs am I going to sell tomorrow, well, that's a lot harder to use intuition on. Historical data will be able to help us with that. So now we can look at the historical data of the last year to help us answer that question.
So again, we have a discrete random variable, and we have an unknown value: how many TVs will I sell tomorrow? We know what values it could take: 0, 1, 2, 3, 4, and 5. But since we don't know what will happen tomorrow yet, we can look at all the possibilities and look historically at how many times we sold 0, 1, 2, 3, 4, or 5 TVs. Again, from this example we could sit here and say, oh well, there's about a 25% chance tomorrow we're not going to sell any TVs, and if we go back to the table, there is a 14% chance, for example, that we sell four TVs. So, instead of thinking about this last column as relative frequency, you could also think about it as the probability that the random variable takes that specific value. Okay, hold on, that just sounded a lot more complicated, but in the end the concept is still simple. Instead of thinking about something as a relative frequency when looking back at historical data, you can think about it, in this case because our variable is discrete, as the probability that our random variable takes that specific value. So, what's the probability that we sell zero TVs on a single day? Well, in the past we saw that happen about 25% of the time. So, our estimate of the probability that it happens tomorrow would also be about 25%. That's the beauty of being able to use data and distributions of data to help us answer questions around probabilities. So again, if you were working at this small department store, you could look back at the historical data and say, "Hey, I have an idea of how many TVs we're going to sell tomorrow."
All right. I've thrown a lot at you in this lecture.
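Written as a formula, the estimate we just talked through looks roughly like this. The notation is my own shorthand rather than anything from the slides: f(x) stands for the number of days on which x TVs were sold, and n for the total number of days observed.

```latex
\[
P(X = x) \approx \frac{f(x)}{n},
\qquad
P(X = 0) \approx \frac{90}{365} \approx 0.25,
\qquad
P(X = 1) \approx \frac{85}{365} \approx 0.23 .
\]
```

The "approximately equal" sign matters here: these are estimates from one year of historical data, not exact probabilities.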
Let's go ahead and summarize. So, the probability distribution for a random variable describes how probabilities are distributed over the values of the random variable. In other words, the probability distribution for a discrete random variable, where each possible value acts like a category, describes the probability of getting each category. The way we do that is we calculate some things about each category historically. We look at the frequency; that's just the number of observations in each category in our data set. We look at the relative frequency; that's the proportion of observations in that category. This is how we estimate probability. And we can also look at both frequency and relative frequency cumulatively. All right, that's a lot of stuff to look at, but now we're starting to piece it all together. This is how we can use data to answer more complicated questions than just what the typical day at this store is. I could look at things like, for example, the distribution of TV sales at this store, and now I can get a lot more fine-tuned when answering the question, what is the typical day? So that is the end of this lecture. I look forward to seeing you in the next one.