Video Transcript: Gathering Data - Part 2
Let's go ahead and talk about our next section in gathering data. Let's focus on the idea of randomness. Let's go back to the example that we talked about previously. Remember when we were talking about the retail chain trying to understand how well a new product is selling across the entire chain? Remember how we couldn't talk to all 2135 stores, so we talked to 179 stores instead? But there was a key word in there. We said that we picked those stores randomly. Well, what do you think of when you think of randomness? A lot of times people will think of not knowing what's going to happen, or sometimes people will think of the idea of fairness, like equal chances for outcomes. But what do we mean in statistics when we say something is random? If an outcome is random, then we know the particular outcomes that something could have, but we are unsure which of those outcomes is about to happen. So, not knowing what is about to happen is a common way people think of random, and that is kind of true in how we think about random in statistics. We know what could happen, but we're just not sure which of the outcomes will happen. So, it's not that we don't know what's going to happen at all. We know what could happen. We're just trying to figure out which of the outcomes will. Let's take the example of flipping a fair coin. We know ahead of time that this coin can produce either heads or tails; however, we are unsure which one it's going to produce. Notice here we know the outcomes, we're just not sure of the specific outcome that's about to occur. Well, what about that idea of fairness or equal chances of outcomes? When we say something is random in statistics, fairness could be true, but it is not required. An unfair coin, for example, is still random.
If I had a weighted coin that landed on heads more often than it landed on tails, it's still random. I'm still not sure which outcome will occur, but it doesn't mean that the outcomes each have an equal chance of happening. So, when we talk about random, don't think about it as not knowing what's going to happen, or as fairness. Think about it as knowing what could happen, but not being sure which of those possible outcomes is going to happen. So, how does this play a role in what we're doing? Well, remember, we can't talk to the entire population. Instead, we have to talk to a sample. Having randomness helps make this sample representative of the population. Basically, it protects us from having certain pieces of information overly influence our sample. Go back to the example I used previously of trying to measure the average height of American adults and getting a sample of only NBA players. Most likely that sample was not collected randomly, because a random sample of the adult population in the United States probably won't end up with all NBA players inside the sample. Maybe some, but not all. If we have a good representative sample, it means that the inferences we make, the insights we get from the sample statistic, are reasonable estimates for that population parameter that we really
care about. Remember, if I want to know something more about the average height of all Americans, I hope my sample represents the adult heights of Americans well, because then the information I get from my sample is going to be a good guess of the number that I care about. So, in summary, when we talk about randomness in statistics, an outcome is random if we know the particular outcomes that something could take. The open question is which of those outcomes is actually about to happen. Having randomness helps make samples representative of their respective populations. We like to use randomness to collect samples, because having a good sample means that the insights we get from that sample, the inferences we make from it, and its statistics are reasonable estimates of the parameter, which, remember, is what we care about in the population. Okay, so we've talked about the idea that we want randomness in samples. Let's talk about some different sampling techniques. We're going to first start with bad sampling techniques, to show you how it's not done, and then we're going to move into good sampling techniques to show you how to do it well. Again, I keep going back to this chart, because this chart summarizes everything we need to know about populations, samples, statistics, and parameters. We need good sampling to make good estimates. Without good sampling, the insights we get from our data aren't going to be very reliable. There are many different ways to sample data. Mistakes in sampling, though, can lead to what we call bias. Bias means that certain outcomes are favored over other outcomes in our sample in a way that doesn't represent the population. That last part is key. We want a sample to represent the population. NBA players are a rare collection of American adults, and in fact, they include international adults as well.
So, that being the case, we don't want to eliminate all NBA players from our sample, but we don't expect our sample to be only NBA players, because that means our sample would be biased. It's favoring taller individuals just because we did bad sampling. So, here are two common types of bias that we see in sampling. The first is what we call selection bias. The second is what we call sampling bias. Let's work through each one of these. Let's talk about selection bias first. There are a couple of different ways we can get selection bias. The first is called undercoverage. The second is called nonresponse. When looking at undercoverage, what do we mean? We mean specifically that the sampling frame and the population are not equal to each other. In other words, we're trying to gather a group of people that is supposed to represent the population; however, the list we're gathering this group from doesn't fully represent the population, and that becomes a problem. If we're trying to make inferences about a population, then that sampling frame (remember, the sampling frame is the actual list from which we get our sample) had better look like the population. In fact, we hope the population and the sampling frame are the same. If the sampling frame and the
population are not the same, then the sample probably doesn't represent the population, which can lead to incorrect and biased inferences being made. I used this example earlier, but let me go back to it. Let's take the example of a phone book. It's a dated example, which I think makes it even more reliable. There are still phone books out there, phone books that supposedly list adults living in the area and their respective phone numbers. However, these phone numbers are usually landline phone numbers. A significant number of adults nowadays use only cell phones instead of a landline. If that's the case, then people who only use cell phones are not in the phone book. Also, some people request to not be listed in the phone book, whether that is because of privacy concerns or because they are extremely wealthy. We also have people on the other side of that spectrum who are inherently not in the phone book: people who don't have houses, people who are either homeless or who are living with a friend or another family member. These people are not listed in phone books either. So you can see, if I were to go to the phone book and call people in it to try and garner information from them, that phone book doesn't represent the whole group of people I'm probably trying to get a hold of, unless the population of interest is people listed in the phone book. But if the population was something like adults in a specific county, then unfortunately the phone book is probably not a good representation of adults in that county. This is an example of undercoverage. Nonresponse is another type of bias. What is nonresponse? The idea of nonresponse is that a subject in a sample cannot or will not respond or allow themselves to be measured. Again, the problem with this is that those who respond or who actually allow themselves to be measured may not represent the population as a whole.
So again, we have this piece where we have disagreement between the population and the sample we took. The best part is that most likely you've participated in nonresponse bias. Have you ever had a telemarketer call you? That means you were selected to be part of a sample. Have you ever not responded to a telemarketer, or picked up the phone, realized it was a telemarketer, and decided not to give them your information? That is nonresponse. You are a subject in a sample that either could not or would not respond, so the people who do respond to telemarketers may be different from the people who don't. Okay, so that gives us some idea around selection bias. What about sampling bias? Again, there are two common types of sampling bias: the first one being convenience sampling, the second one being voluntary sampling. Let's take a look at those. Convenience sampling: I like to call this laziness. The idea of convenience sampling is simple. It's a technique that selects subjects from a population just based on how easy it is to reach them, how accessible they are. Just because subjects are easy to talk with, however, does not mean that they represent the population of interest as a whole. And again, the second you don't have the population being
represented well, then that means you could have biased inferences being made. Let's go with another example you may have participated in before. Have you ever walked around a shopping store and seen people with a clipboard trying to stop people and ask them to take a survey? How many times have you walked around those people, or walked a completely different way, just so they wouldn't bug you to take that survey? Well, there are some people that love taking those surveys. What do you think that says about them? Those people are probably different than the population as a whole. So, again, surveyors that only talk to people who want to be talked to could have a problem with convenience sampling. Another thing that is close to this is what we refer to as voluntary sampling. Basically, it's a technique where subjects volunteer themselves to the sample. Have you ever been part of a crowd where someone asks for volunteers? Well, if you have, then you probably noticed that the people who volunteer may be a little bit different than the people who don't. So, again, people who volunteer don't necessarily represent the population of interest as a whole. One funny story about this that I remember from a few years back is that there was a survey sent out to people in England. It was a marriage questionnaire, and they were trying to figure out pieces of information about marriages. Well, based on that survey, the people running it in the United Kingdom basically found that 80% of people in marriages were unhappy. You might think, wow, that seems rather high, and it was. The problem with the marriage questionnaire they sent out was that it was extremely long. We're talking 10 plus pages of questions to fill out. So, who do you think takes the time to fill out that many questions? Most of the people that filled out those questions were unhappy in their marriages.
Their attitude was, well, finally I get to tell someone how I don't like my marriage, whereas people who were happy in their marriages were like, I'm not going to take the time to fill this out, I'm good. So, you see, again, the people who volunteer may be different than the people who don't. So, in summary, we need good sampling to have good estimates. Bias is when certain outcomes are favored over other outcomes in samples. There are two common types of bias: selection bias and sampling bias. Two common ways of getting selection bias are through undercoverage and nonresponse. Two common ways of getting sampling bias are through convenience sampling and voluntary sampling. All right, so we talked about randomness and why we need it. We can see how there is a lack of randomness in some of these biased techniques that we've talked about, these bad sampling techniques. So, let's focus on some good sampling techniques to finish off this lecture. Again, we need good sampling to have good estimates. So, what are some common techniques that we can use? Well, statistical sampling techniques use selection methods based on chance and randomness instead of convenience or judgment, so that the samples can be better representations of the population as a whole. Four of the most
common statistical sampling techniques are simple random sampling, also known as SRS, stratified random sampling, cluster sampling, and systematic sampling. Let's go through those. First, simple random sampling. Simple random sampling is a method of sampling items from a population such that every possible sample of a specific size has an equal chance of being selected. So, for example, say you have the population of US adults and you want to take a sample of 500 people. Every single possible group of 500 people in the United States has an equal chance of being selected. That's the idea of simple random sampling. What are some advantages here? At least in terms of statistics, there's no bias here. And one of the best parts about this is that there's no previous information about the sample needed ahead of time. You just really need a list of all individuals. If you have a list of all individuals, you can basically have a computer select some of them at random. The downside is, well, you need a list of all individuals. I don't know about you, but I don't have a list of everyone in the United States, so if you wanted to get a sample of people that represents adults in the United States, you don't have that list, and that makes this technique really hard. Also, if you did have a list of everyone in the United States, a simple random sample is going to send you all over the place. A simple random sample can be expensive, it can be time consuming, and in all honesty, it can be hard to implement, yet it remains the gold standard. It's the thing we would like to use if we had the ability to do it. Another kind of sampling is what we refer to as stratified random sampling. This method of sampling has items in the population divided ahead of time, so before we ever sample, we're going to divide our population into subgroups.
These subgroups are what we refer to as strata, or a single one as a stratum, and each member of the population will belong to exactly one of these subgroups. What we're going to do is sample items from every single stratum, for example with simple random sampling within each group. Again, let me give you an example. Let's imagine you want to market your new line of clothing to both males and females in the population of the United States. Okay, so you don't want just the average height of adults. You understand that males and females might have different heights, so you want to make sure your sample includes both males and females. Wonderful, so you can split the population into two groups, and then make sure you sample from each. Now, what are some advantages here? One advantage is that a stratified random sample with a smaller sample size can actually achieve the same accuracy as a simple random sample, and if you think about it, that makes sense. We're making sure that our sample is representing all the groups of interest in a population, and that's the goal of a sample. The only downside, though, is you need information about the population ahead of time to split on. So again, if you wanted to split on something like gender, or age groups, or income, or education level, you
need to know that piece of information ahead of time about your population, so you can split people into those subgroups. Well, what is a way of doing this grouping where we may not need to know all that information ahead of time? That's what we refer to as cluster sampling. Cluster sampling is a method of sampling items where the population, again, is divided ahead of time into subgroups. However, we're going to call these subgroups clusters now. Each member of the population again belongs to only one cluster. However, unlike stratified sampling, where we made sure we sampled from every single subgroup, in cluster sampling we're only going to sample from some of the subgroups. You can think about it as sampling from a sample of clusters, so we're not going to look at all of the subgroups. We'll just look at a small number of them and only sample those. Some of the advantages of cluster sampling are that it overcomes issues with travel time and expense, and it's a lot easier to implement than a simple random sample or a stratified random sample. So, for example, if we still wanted to get the average height of all Americans, maybe I don't want to talk to people just at random across the country. What I'll do is take my list of states, for example, and randomly select four states. Wonderful. So now I don't need to go all over the place. Now I can go to only these four states, and maybe within each of these four states I sample only five cities. So, again, you can imagine states, and cities within states, as clusters. Now, this does mean that we need information about the population ahead of time, but we don't need a total list. I just need a list of possible clusters, and then I can dive deeper. Now, the downside of this technique, though, is it might have a little bit of bias if those random clusters aren't representative of the population as a whole.
For example, if those four states all happen to be in the northeast part of the country, and my clothing line is not just for people in the Northeast, then that may not be a good representation. So you want to make sure that when you look at only a subset of clusters, these clusters still represent the population as a whole. All right, one last one. We're going to talk about systematic sampling. It's a method of sampling items that involves selecting every kth item in the population after randomly selecting a starting point between one and k. Well, hold on, what do I mean by that? Well, essentially, let's imagine that you had a population of a million people that you could talk to, and you want a sample of 10,000 of those people. Okay, well, if you had a list (again, this is the downside), what you could do is select every 100th person on your list, because a million divided by 10,000 is 100, to get that sample from your population. We do this all the time in manufacturing. I want to do a quality check on things coming off of a manufacturing line. Say, for example, I'm making widgets. I want to make sure my widgets are being produced with high quality, but I don't have the time to check every single widget, so I'm going to gather every 20th widget off the line, and if those widgets look good, then I think the machine is working fine. Again, the advantages to
systematic sampling: it's very easy to get a sample if you have a selection process much like that assembly line I just mentioned. However, there could be a little bit of bias. What if, for example, every 20th widget just happens to be really good? Well, again, there's no reason why that would necessarily be the case in that example, but that could lead to a problem. All right, so let's work through an example with some of these good sampling techniques. A large worldwide financial company wants to develop a new retirement plan for the company. They want to survey different managers of branches around the world to find out the most important strategies the new retirement plan should contain. They have 5000 branches worldwide and want to personally interview these branch managers. They have information about the branch size and the state or province location of each branch, and they want to talk to 50 branch managers. Well, let's look at those four different strategies. If we were to do a simple random sample, we would just randomly sample 50 branches, go find their managers, and then interview them. We would need a list of branches to be able to pull this off, and again, this might cause some travel issues and some time concerns, depending on where these 50 branches are located. In a stratified random sample approach, we would want to get all of small, medium, and large branches. We want to make sure each one of them is represented, so what we're going to do is select a sample that's proportional from each group to make sure it looks like the population. So let's imagine, as you see in this pie chart on the right-hand side, that 60% of our branches are small, 32% of our branches are medium, and only 8% of our branches are large. So, if we wanted to talk to 50 branch managers in all, we could talk to 30 small branch managers. Why? Because 30 is 60% of 50.
We could talk to 16 medium branch managers, because 16 is 32% of 50, and we could talk to four large branch managers, because four is 8% of 50. So, again, I have 50 branch managers I'm talking to, but I've broken my branches down into three groups, and I'm making sure I get people from each group. Cluster sampling, on the other hand, would be something like this. We could split our branches up by state, and we're going to randomly sample five states, like you see here. Then we're going to randomly select 10 branches in each of these states, and that would give us our sample of 50. Now, what about some potential bias? Well, what if these five states don't represent the population of all states very well? New York and Texas, for example, are some of the largest population states. Maybe that would give us a different perspective than some of the others, but notice how we have some low population states in here as well, like South Dakota. So, again, we're trying to balance that out in how we do our clustering. Last but not least, let's imagine how we would do this with systematic sampling. Well, we have a list of 5000 branches, and we want to sample 50 branches. If we take 5000 and divide it by 50, that means k is 100. So basically, what we would do is take a look at the first 100 branches, and we would select one of those at
random, let's say branch nine, and then we would take every 100th branch after that: branch 9, branch 109, branch 209, branch 309, all the way up to branch 4909. That would be the idea of systematic sampling. So this puts it all together, summarizing all four of those techniques across the problem that we're looking at. In summary, again, we need good sampling to have good estimates. Four common statistical sampling techniques that are good are simple random sampling, stratified random sampling, cluster sampling, and systematic sampling. Each has some advantages and disadvantages, but these techniques will be much better than those biased techniques that we talked about earlier. Well, good. Now you're learning a little bit about how we have some good ways and some bad ways of collecting and gathering data. That is the end of this lecture, and I look forward to seeing you in the next one.
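As a rough sketch of the branch example from this lecture, the four techniques could be tried out in Python like this. The list of branches, the region labels standing in for states and provinces, and the fixed seed are all assumptions made up for illustration. Only the 5000 branches, the sample of 50, the 60/32/8 percent split, and the five clusters of 10 come from the lecture.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed (an assumption) so the sketch is repeatable

# Hypothetical sampling frame: 5000 branches with an id, a size, and a region.
# Sizes follow the lecture's 60% / 32% / 8% split; the 50 region labels are
# invented stand-ins for the state or province of each branch.
sizes = ["small"] * 3000 + ["medium"] * 1600 + ["large"] * 400
rng.shuffle(sizes)
branches = [
    {"id": i + 1, "size": sizes[i], "region": f"region_{i % 50}"}
    for i in range(5000)
]
n = 50  # number of branch managers to interview

# 1. Simple random sample: every group of 50 branches is equally likely.
srs = rng.sample(branches, n)

# 2. Stratified random sample: split by size, then take a simple random
#    sample from every stratum, proportional to the stratum's share.
stratified = []
for size in ("small", "medium", "large"):
    stratum = [b for b in branches if b["size"] == size]
    take = round(n * len(stratum) / len(branches))  # 30, 16, 4
    stratified.extend(rng.sample(stratum, take))

# 3. Cluster sample: randomly pick 5 regions, then 10 branches in each.
regions = sorted({b["region"] for b in branches})
chosen_regions = rng.sample(regions, 5)
cluster = []
for region in chosen_regions:
    members = [b for b in branches if b["region"] == region]
    cluster.extend(rng.sample(members, 10))

# 4. Systematic sample: k = 5000 / 50 = 100, random start between 1 and k,
#    then every k-th branch on the list after that.
k = len(branches) // n
start = rng.randint(1, k)  # randint includes both endpoints
systematic = [branches[i - 1] for i in range(start, len(branches) + 1, k)]

print(len(srs), len(stratified), len(cluster), len(systematic))
print(Counter(b["size"] for b in stratified))
print(sorted(chosen_regions))
print([b["id"] for b in systematic[:4]])
```

Each of the four samples ends up with 50 branches, and the stratified one always contains 30 small, 16 medium, and 4 large branches. Which specific branches and regions come out depends on the seed, which is why no particular ids are shown here.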