Video Transcript: Randomness in Data - Part 2
Welcome. Let's continue our conversation around randomness in data by exploring some basic probability rules. Last time we talked about the idea of probability and what it is, but there are also some rules for calculating and manipulating probabilities. For example, you do not always know all of the sample point probabilities in an event. However, there are some basic probability relationships that you can still use to calculate the probability that an event occurs. Here are the four most common ones: the complement of an event, the union of two events, the intersection of two events, and mutually exclusive events. So let's talk about each one of these in this lecture.
The complement of an event is defined to be the event consisting of all the sample points that are not in a specific event. For example, let's imagine you have the chart on the right-hand side. The big rectangle is all the possible outcomes that could occur. Event A is the circle you see there on the left-hand side. The complement of A is basically everything else that is, well, not A: if we have event A, its complement is all the other outcomes that aren't in A. The complement of A is typically denoted by A with a little c in the upper right-hand corner, or sometimes A with a bar over the top. I don't like the idea of A with a bar over the top, because that's how we defined averages earlier, so we'll go with A with a little c in the upper right-hand corner. So if you think about rolling a die, and event A is rolling a 1, then the complement of A would be rolling anything other than a 1: a 2, 3, 4, 5, or 6.
All right, let's continue with these rules. What about the union of two events? Well, the union of an event A and an event B is the event containing all the sample points that are in A or B, or both.
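Both definitions so far are just set operations on the sample space. As a minimal sketch (the variable names and the two events used for the union are illustrative choices, not something shown on the lecture slides), a single die roll might be modeled like this:

```python
# Sample space for one roll of a six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# Complement: every outcome in the sample space that is NOT in the event.
roll_one = {1}
roll_one_complement = sample_space - roll_one

# Union: every outcome that is in A or B, or both.
# These two events are picked only to illustrate the definition.
event_a = {1, 2}
event_b = {2, 3}
a_union_b = event_a | event_b

print(sorted(roll_one_complement))  # [2, 3, 4, 5, 6]
print(sorted(a_union_b))            # [1, 2, 3]
```

Note that the outcome 2 belongs to both events but appears only once in the union, because a set never stores duplicates.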
So, again, you have the sample space of all possible things that could happen. Event A is defined here by the left-hand circle, and event B by the right-hand circle. The union of A and B is all of the outcomes that are in A or B, or both. We typically denote this union as A, with a U in the middle, B (written A ∪ B), which is read as A union B. You can see a drawing of that here on the right-hand side: A and B essentially become one combined shape.
However, we also have what we call the intersection of two events. The intersection of an event A and an event B is the event containing all of the sample points that are in both A and B. So it's not just A and it's not just B; a sample point has to be contained in both A and B to be considered part of the intersection. That is the darkly shaded region between the two events A and B that you see here on the right-hand side: the outcomes that both of these events share.
So, looking at a union of two events, let's again imagine rolling a die. Let's imagine event A is rolling a 1 or a 2, and event B is rolling a 2 or a 3. The union of those would be rolling a 1, 2, or 3, because it contains all the possible outcomes of both events. You can think of the union as an "or" statement: what are the chances of you getting A or B? Well, the chances of getting A or B are the chances of A, rolling a 1 or a 2, or B, rolling a 2 or a 3; in other words, rolling a 1, a 2, or a 3. The intersection, however, is where they both happen. This is an "and" statement. Again, if event A is rolling a 1 or a 2 on a die, and event B is rolling a 2 or a 3, then the intersection of those is rolling a 2, because that is the only outcome they have in common. We denote this intersection by an upside-down U, so we write A, with an upside-down U, B (written A ∩ B), which is read as A intersect B. Again, you can see it as the shaded region on the right-hand side.
With these three concepts, the complement, the union, and the intersection, we now have what we call the addition law. The addition law provides a way to compute the probability of the union of events A and B, that is, the probability of the event A union B, or if you want to think about it another way, the probability of event A or B. When you hear the word "or," think of the term union. When you hear the word "and," think of the term intersection.
So, how do we calculate the probability of A or B, in other words, the union of A and B? Well, that is the probability of A plus the probability of B minus the probability of any overlap that A and B have, so minus the probability of A intersect B. That makes sense, right? If I want to calculate the union of the events A and B, I'm going to take all of the outcomes in A, that is the circle of A, and all of the outcomes in B, that's the circle of B. But notice that I've overcounted that little piece in the middle that they share: I've counted the intersection twice. So to figure out the shaded area here, I want all the outcomes in A and all the outcomes in B, but since I overcounted the outcomes they share, I have to subtract off the intersection so that each outcome is counted only once. Hopefully that makes a little bit of intuitive sense, but we'll see it again in an example shortly.
Now, what if A and B didn't have an intersection? What if A and B can't happen at the same time? Those would be what we refer to as mutually exclusive events. Two events are mutually exclusive if they have no sample points in common. In other words, there is no intersection between them. This also means that the events cannot both occur.
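To check the addition law on the die example, here is a small sketch. It assumes a fair six-sided die, so every face is equally likely (the lecture does not state this explicitly), and the helper names `prob` and `mutually_exclusive` are hypothetical ones introduced only for illustration.

```python
from fractions import Fraction

# Assumption: a fair six-sided die, so each outcome has probability 1/6.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event as (outcomes in event) / (all outcomes)."""
    return Fraction(len(event & sample_space), len(sample_space))

def mutually_exclusive(x, y):
    """Two events are mutually exclusive if they share no outcomes."""
    return x.isdisjoint(y)

event_a = {1, 2}  # rolling a 1 or a 2
event_b = {2, 3}  # rolling a 2 or a 3

union = event_a | event_b          # the "or" event: {1, 2, 3}
intersection = event_a & event_b   # the "and" event: {2}

# Addition law: P(A or B) = P(A) + P(B) - P(A and B)
by_addition_law = prob(event_a) + prob(event_b) - prob(intersection)

print(by_addition_law, prob(union))          # 1/2 1/2
print(mutually_exclusive(event_a, event_b))  # False, both contain 2
print(mutually_exclusive({1}, {6}))          # True, no shared outcome
```

Counting the union directly and applying the addition law give the same answer, which is the point of subtracting the intersection. For events with no outcomes in common, the intersection term is zero and the law reduces to a plain sum.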
If one event occurs, the other cannot. So again, let's imagine you had something like the flip of a coin. Heads and tails would be mutually exclusive events: I can't get both a heads and a tails with one flip of a coin. If I were to roll a die, rolling a 1 and rolling a 6 would be mutually exclusive events: I can't roll a 1 and a 6 at the same time. However, in our earlier example of rolling a die, event A was rolling a 1 or a 2, and event B was rolling a 2 or a 3. Both of those events can actually occur at the same time. If I roll a 2, I've satisfied event A, a 1 or a 2, and I've also satisfied event B, a 2 or a 3. That being the case, those events do have an intersection; they are not mutually exclusive. But if we had two mutually exclusive events and you wanted to know the probability of A union B, or the probability of A or B, you would just add the two probabilities together. Again, take the probability of flipping a heads plus the probability of flipping a tails: .5 plus .5 is 1. There is no intersection between them, since you can't get both a heads and a tails, so you do not have to worry about adjusting this addition.
Let's see this through an example; it may make it a little bit easier. So, let's look at an example from our data set here. What I have is the weather in the far left-hand column, represented by three categories: clear or cloudy, misty, and rain or snow. The other columns that you see across the top represent the different seasons: spring, summer, fall, and winter. Notice the last column on the right-hand side gives me the total number of users for each one of those weather categories, so the first row gives me the total number of clear or cloudy users, essentially how many users used our bike rental on a clear or cloudy day. The bottom row gives me the total for each of the seasons, so the first column of that row tells me the total number of users in the spring. And of course, the bottom row and farthest right-hand column tells me the total number of users altogether.
So, again, just to make sure we understand this chart: what we're saying is that there are 626,986 users who used our service in the spring on a clear or cloudy day. There are 799,443 users who used our service on a summer day that was clear or cloudy, and so on and so forth, all the way down to only 3,739 users who used our service on a rainy or snowy day in the winter. So this shows us all of our users across weather and across the different seasons.
So what is the probability that a random customer uses the bike service in the fall and it was raining or snowing? Okay, well, we can again go back to our table, so let's take a look at the fall column and the rainy or snowy row.
Okay, well, of the little over 3 million total users, 19,616 times a user used our service in the fall when it was rainy or snowy. But what's the probability? Well, again, let's look at how many times we actually saw the event happen. We saw this event happen 19,616 times out of the 3,292,679 different user events that we saw, which gives a probability of about .006, or in other words, 0.6%. So 0.6% of the time, a customer uses the bike service in the fall when it is rainy or snowy. If we had all of our users in a bucket and you were to close your eyes and randomly select a user out of the bucket, we're saying there's a 0.6% chance that the specific user you selected used the bike service in the fall when it was raining or snowing.
Awesome, let's change one word in this question. What is the probability that a random customer uses the bike service in the fall or when it was raining or snowing? Big difference here. Before, we said it had to be both fall and rain or snow; that was an intersection. Here, it's fall or rain or snow; this is a union. Let's take a look.
So, here we have all of our fall users in the highlighted fall column. There are 841,613 users in the fall; that is how many people used the service in the fall out of the roughly 3.3 million total users we have. Now let's look at rain or snow. Let's look at that whole highlighted row. That row says that there are 37,869 users who used our service in the rain or the snow. That again is the sum of the whole row: 37,869 is 3,507 plus 11,007 plus 19,616 plus 3,739. For the fall column, the 841,613 is the sum of all of the fall numbers: 519,487 plus 302,510 plus 19,616.
So, if we wanted to know how many users used our service in the fall or in the rain or the snow, we could just add the 841,613 users in the fall to the 37,869 users in the rain or the snow. But wait, hold on. Where did those two cross? We counted the 19,616 highlighted users twice. The 841,613 users in the fall include those 19,616 users, and the 37,869 users in the rain or the snow include those same 19,616 users as well. So if we want to know how many users used our service in the fall or in the rain or the snow, we should subtract off those 19,616, because we double counted them, and that's exactly what we do. We're going to take the 37,869 users, the total number of users in the rain or the snow, divided by the total number of users overall. Then we're going to add the fall users, the 841,613 fall users, divided again by the total number of users.
However, again, in adding those first two numbers, we accidentally counted 19,616 people twice, so we need to subtract off those 19,616 people, because we don't want to double count them. That leaves us with 859,866 users who used our service in the fall or when it was raining or snowing. If we divide that by the total number of users, 3,292,679, we get a probability of .261, or a little bit over 26%. There's about a 26% chance that a randomly selected customer used the bike service in the fall or when it was rainy or snowy. So again, be very careful with the addition law, because we can accidentally count people twice. That is the same idea as the intersection we saw when we wanted to add together events A and B: we counted that intersection twice. So the addition law has us subtract off that intersection whenever our events are not mutually exclusive. These events are not mutually exclusive, because people were able to use our service both in the fall and in the rain or snow, and so we accidentally counted those 19,616 people twice. But after subtracting that off, we get the right number for our probability.
So let's summarize. The complement of an event A is defined to be all of the sample points that are not in A. The union of an event A with an event B, denoted A union B, with that little U in the middle, is the event containing all sample points that are in A or B, or both. We can compute the probability of this union with the addition law; just be careful you don't accidentally count that intersection twice. The intersection of events A and B, denoted A intersect B, with an upside-down U, is the event containing all the sample points that are in both A and B. Again, when you think of probabilities, for unions think of "or": what's the probability of A or B? For intersections think of "and": what's the probability of A and B? Now, lastly, two events are mutually exclusive if the events have no sample points in common. In other words, if they don't intersect, then they can't both happen at the same time.
Wow, I know we've ramped up the math in this class really quickly with this lecture, so hopefully you understand a little bit about what we're talking about when it comes to probabilities, unions, and intersections, as well as how to use those together to get the addition law. So that is the end of this lecture, and I look forward to seeing you in the next.
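As a recap of the bike rental example, the arithmetic can be reproduced in a few lines. This is only a sketch: the counts are the ones quoted in the lecture, while the variable names are illustrative and the full table is not reconstructed here.

```python
# Counts quoted in the lecture's weather-by-season table.
total_users = 3_292_679
fall_users = 841_613         # total of the fall column
rain_snow_users = 37_869     # total of the rain-or-snow row
fall_and_rain_snow = 19_616  # cell where that column and row cross

# Consistency checks against the cell values read out in the lecture.
assert sum([3_507, 11_007, 19_616, 3_739]) == rain_snow_users
assert sum([519_487, 302_510, 19_616]) == fall_users

# Intersection: fall AND rain or snow.
p_fall_and_rain_snow = fall_and_rain_snow / total_users

# Union via the addition law: fall OR rain or snow.
# The crossing cell is in both totals, so subtract it once.
fall_or_rain_snow = fall_users + rain_snow_users - fall_and_rain_snow
p_fall_or_rain_snow = fall_or_rain_snow / total_users

print(fall_or_rain_snow)               # 859866
print(round(p_fall_and_rain_snow, 3))  # 0.006
print(round(p_fall_or_rain_snow, 3))   # 0.261
```

Subtracting the counts before dividing, as done here, gives the same result as dividing each count by the total first and then subtracting the probabilities.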