Let's move on to our next section in gathering data and focus on the idea of randomness. Go back to the example we talked about previously: the retail chain trying to understand how well a new product is selling across the entire chain. Remember how we couldn't talk to all 2135 stores, so we talked to 179 stores instead? There was a key word in there. We said that we picked those stores randomly.

What do you think of when you think of randomness? A lot of times people think of not knowing what's going to happen, and sometimes people think of fairness, like equal chances for every outcome. But what do we mean in statistics when we say something is random? If an outcome is random, we know the particular outcomes that something could have, but we are unsure which of those outcomes is about to happen. So "not knowing what is about to happen," the common way people think of random, is only partly true in statistics. It's not that we don't know what's going to happen at all. We know what could happen; we're just not sure which of the outcomes will happen.

Let's take the example of flipping a fair coin. We know ahead of time that this coin can produce either heads or tails; we just are unsure which one it's going to produce. Notice that we know the possible outcomes, we're just not sure of the specific outcome that's about to occur.

What about that idea of fairness, or equal chances of outcomes? When we say something is random in statistics, fairness could be true, but it is not required. An unfair coin, for example, is still random.
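To make the coin point concrete, here is a minimal Python sketch. It is only an illustration: the function name `flip`, the seed, and the 70% heads weighting are assumptions chosen for the example, not values from the lecture. Both coins can only ever produce heads or tails, so both are random in the statistical sense, even though only one of them is fair.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def flip(p_heads):
    """Return 'H' with probability p_heads, otherwise 'T' (hypothetical helper)."""
    return "H" if rng.random() < p_heads else "T"

# The possible outcomes are known in advance for both coins: heads or tails.
fair = [flip(0.5) for _ in range(10_000)]
weighted = [flip(0.7) for _ in range(10_000)]  # assumed weighting, for illustration

# Neither list can be predicted flip by flip, but the long-run shares differ.
print("fair heads share:    ", fair.count("H") / len(fair))
print("weighted heads share:", weighted.count("H") / len(weighted))
```

The weighted coin lands on heads more often, yet each individual flip is still uncertain, which is all that "random" requires here.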
If I had a weighted coin that landed on heads more often than it landed on tails, it's still random. I'm still not sure which outcome will occur, but the outcomes don't each have an equal chance of happening. So, when we talk about random, don't think of it as having no idea what's going to happen, and don't think of it as fairness. Think of it as knowing what could happen, but not being sure which of those possible outcomes is going to happen.

So, how does this play a role in what we're doing? Well, remember, we can't talk to the entire population. Instead, we have to talk to a sample. Randomness helps make that sample representative of the population. Basically, it protects us from having certain pieces of information overly influence our sample. Go back to the example I used previously: trying to measure the average height of American adults and getting a sample of only NBA players. Most likely that sample was not collected randomly, because a random sample of the adult population in the United States probably won't end up with all NBA players in it. Maybe some, but not all. If we have a good representative sample, then the inferences we make from the sample statistic are reasonable estimates of the population parameter that we really care about. Remember, if I want to know something about the average height of all American adults, I hope my sample represents the heights of American adults well, because then the information I get from my sample is going to be a good guess of the number that I care about.

So, in summary, when we talk about randomness in statistics, an outcome is random if we know the particular outcomes that something could take but not which of those outcomes is actually about to happen. Randomness helps make samples representative of their respective populations. We like to use randomness to collect samples because having a good sample means that the statistics and inferences we get from that sample are reasonable estimates of the parameter, which, remember, is what we care about in the population.

Okay, so we've talked about the idea that we want randomness in samples. Let's talk about some different sampling techniques. We're going to start with bad sampling techniques, to show you how it's not done, and then we're going to move into good sampling techniques to show you how to do it well. Again, I keep going back to this chart, because it summarizes everything we need to know about populations, samples, statistics, and parameters. We need good sampling to make good estimates. Without good sampling, the insights we get from our data aren't going to be very reliable.

There are many different ways to sample. Mistakes in sampling, though, can lead to what we call bias. Bias means that certain outcomes are favored over other outcomes in our sample in a way that doesn't represent the population. That last part is key. We want a sample to represent the population. NBA players are a rare collection of American adults, and in fact, they include international adults as well.
So, that being the case, we don't want to eliminate all NBA players from our sample, but we don't expect our sample to be only NBA players, because that would mean our sample is biased. It favors taller individuals just because we did bad sampling.

Here are two common types of bias that we see in sampling. The first is what we call selection bias. The second is what we call sampling bias. Let's work through each one.

Let's talk about selection bias first. There are a couple of different ways we can get selection bias. The first is called undercoverage. The second is called nonresponse. What do we mean by undercoverage? We mean specifically that the sampling frame and the population are not equal to each other. In other words, we're trying to gather a group of people that is supposed to represent the population, but the list we're gathering this group from doesn't represent the population well. That becomes a problem if we're trying to make inferences about a population. The sampling frame (remember, that's the actual list we draw our sample from) had better look like the population. In fact, we hope the population and the sampling frame are the same. If they are not the same, then the sample probably doesn't represent the population, which can lead to incorrect and biased inferences.

I used this example earlier, but let me go back to it: the phone book. It's a dated example, which I think makes it even more relatable. There are still phone books out there that supposedly list the adults living in an area and their respective phone numbers. However, these are usually landline numbers. A significant number of adults nowadays use only cell phones instead of a landline, so people who only use cell phones are not in the phone book. Also, some people request not to be listed, whether because of privacy concerns or because they are extremely wealthy. On the other side of that spectrum, we also have people who are not in the phone book because they don't have houses: people who are homeless, or who are living with a friend or another family member. These people are not listed in phone books either. So you can see that if I were to call people from the phone book to gather information, the phone book doesn't represent the whole group of people I'm probably trying to reach, unless the population of interest is people listed in the phone book. If the population were something like adults in a specific county, then unfortunately the phone book is probably not a good representation of adults in that county. This is an example of undercoverage.

Nonresponse is another type of selection bias. What is nonresponse? The idea is that a subject in a sample cannot or will not respond or allow themselves to be measured.
The problem with this is that those who respond, or who actually allow themselves to be measured, may not represent the population as a whole. So again, we have disagreement between the population and the sample we took. The best part is that you've most likely participated in nonresponse bias yourself. Have you ever had a telemarketer call you? That means you were selected to be part of a sample. Have you ever not answered a telemarketer, or picked up the phone, realized it was a telemarketer, and decided not to give them your information? That is nonresponse. You are a subject in a sample who either could not or would not respond, and the people who do respond to telemarketers may be different from the people who don't.

Okay, so that gives us some idea about selection bias. What about sampling bias? Again, there are two common types: the first is convenience sampling, and the second is voluntary sampling. Let's take a look at those.

Convenience sampling: I like to call this laziness. The idea is simple. It's a technique that selects subjects from a population just based on how easy they are to reach, how accessible they are. Just because subjects are easy to talk with, however, does not mean that they represent the population of interest as a whole. And again, the second the population isn't represented well, you could have biased inferences being made. Let's go with another example you may have participated in before. Have you ever walked around a store and seen people with a clipboard trying to stop people and ask them to take a survey? How many times have you walked around those people, or walked a completely different way, just so they wouldn't bug you to take that survey? Well, there are some people who love taking those surveys. What do you think that says about them? Those people are probably different from the population as a whole. So, again, surveyors who only talk to people who want to be talked to could have a problem with convenience sampling.

Another thing that is close to this is what we refer to as voluntary sampling. Basically, it's a technique where subjects volunteer themselves for the sample. Have you ever been part of a crowd where someone asks for volunteers? If you have, then you've probably noticed that the people who volunteer may be a little bit different from the people who don't. So, again, people who volunteer don't necessarily represent the population of interest as a whole.

One funny story about this that I remember from a few years back: there was a survey sent out to people in England, a marriage questionnaire, trying to figure out pieces of information about marriages. Based on that survey, they basically found that 80% of people in marriages in the United Kingdom were unhappy. You might think, wow, that seems rather high, and it was. The problem with the marriage questionnaire was that it was extremely long. We're talking 10-plus pages of questions to fill out. So, who do you think takes the time to fill out that many questions? Most of the people who filled out those questions were unhappy in their marriages.
They thought, "Well, finally I get to tell someone how I don't like my marriage," whereas people who were happy in their marriages were like, "I'm not going to take the time to fill this out, I'm good." So, you see, again, the people who volunteer may be different from the people who don't.

So, in summary, we need good sampling to have good estimates. Bias is when certain outcomes are favored over other outcomes in samples in a way that doesn't represent the population. There are two common types of bias: selection bias and sampling bias. Two common ways of getting selection bias are undercoverage and nonresponse. Two common ways of getting sampling bias are convenience sampling and voluntary sampling.

All right, so we talked about randomness and why we need it. We can see the lack of randomness in some of these biased techniques, these bad sampling techniques. So, let's focus on some good sampling techniques to finish off this lecture. Again, we need good sampling to have good estimates. So, what are some common techniques that we can use? Statistical sampling techniques use selection methods based on chance and randomness instead of convenience or judgment, so that the samples can be better representations of the population as a whole. Four of the most common statistical sampling techniques are simple random sampling (also known as SRS), stratified random sampling, cluster sampling, and systematic sampling. Let's go through those.

Simple random sampling is a method of sampling items from a population such that every possible sample of a specific size has an equal chance of being selected. So, for example, say you have the population of US adults and you want to take a sample of 500 people. Every single possible group of 500 adults in the United States has an equal chance of being selected. That's the idea of simple random sampling. What are some advantages here? At least in terms of statistics, there's no bias. And one of the best parts is that no other information about the population is needed ahead of time. You just really need a list of all individuals. If you have that list, you can basically have a computer select some of them at random. The downside is, well, you need a list of all individuals. I don't know about you, but I don't have a list of everyone in the United States. So if you wanted a sample that represents adults in the United States, you don't have that list, and that makes this technique really hard. Also, even if you did have a list of everyone in the United States, the sample is going to send you all over the place. A simple random sample can be expensive, it can be time consuming, and in all honesty, it can be hard to implement. Even so, it's the gold standard: it's the thing we would like to use if we had the ability to do it.

Another kind of sampling is what we refer to as stratified random sampling. In this method, the items in the population are divided ahead of time, so before we ever sample, we're going to divide our population into subgroups.
These subgroups are what we refer to as strata (a single one is a stratum), and each member of the population belongs to exactly one of these subgroups. Then we're going to sample items from every single stratum, for example with simple random sampling within each group. Let me give you an example. Let's imagine you want to market your new line of clothing to both males and females in the population of the United States. You don't want just the average height of adults; you understand that males and females might have different heights, so you want to make sure your sample includes both. Wonderful. You can split the population into two groups, and then make sure you sample from each.

Now, what are some advantages here? One advantage is that a stratified random sample with a smaller sample size can actually achieve the same accuracy as a simple random sample, and if you think about it, that makes sense. We're making sure that our sample represents all the groups of interest in the population, and that's the goal of a sample. The downside, though, is that you need information about the population ahead of time to split on. So if you wanted to split on something like gender, age group, income, or education level, you need to know that piece of information about your population ahead of time so you can split people into those subgroups.

Well, what is a way of doing this grouping where we may not need to know all that information ahead of time? That's what we refer to as cluster sampling. Cluster sampling is a method of sampling items where the population, again, is divided ahead of time into subgroups. However, we're going to call these subgroups clusters now. Each member of the population again belongs to only one cluster. However, unlike stratified sampling, where we made sure we sampled from every single subgroup, in cluster sampling we're only going to sample from some of the subgroups. You can think about it as sampling from a sample of clusters. We're not going to look at all of the subgroups; we'll just pick a small number of them and only sample those.

Some of the advantages of cluster sampling are that it overcomes issues with travel time and expense, and it's a lot easier to implement than a simple random sample or a stratified random sample. So, for example, if we still wanted to get the average height of all Americans, maybe I don't want to talk to people just at random across the country. What I'll do is take my list of states and randomly select four states. Wonderful. Now I don't need to go all over the place. I can go to only these four states, and maybe within each of these four states I sample only five cities. So you can imagine states, and cities within states, as clusters. Now, this does mean that we need some information about the population ahead of time, but we don't need a total list. I just need a list of possible clusters, and then I can dive deeper. The downside of this technique, though, is that it might have a little bit of bias if those randomly chosen clusters aren't representative of the population as a whole.
For example, if those four states all happen to be in the northeast part of the country, and my clothing line is not just for people in the Northeast, then that may not be a good representation. So you want to make sure that when you look at only a subset of clusters, those clusters still represent the population as a whole.

All right, one last one: systematic sampling. It's a method of sampling items that involves selecting every kth item in the population after randomly selecting a starting point between 1 and k. Well, hold on, what do I mean by that? Let's imagine that you had a population of a million people that you could talk to, and you want a sample of 10,000 of those people. If you had a list (again, this is the downside: you need a list), what you could do is pick a random starting point among the first 100 people and then select every 100th person on your list, because 1,000,000 divided by 10,000 is 100, to get that sample from your population. We do this all the time in manufacturing. Say I want to do a quality check on things coming off a manufacturing line. For example, I'm making widgets. I want to make sure my widgets are being produced with high quality, but I don't have the time to check every single widget, so I'm going to pull every 20th widget off the line, and if those widgets look good, then I think the machine is working fine. The advantage of systematic sampling is that it's very easy to get a sample if you have a selection process much like that assembly line. However, there could be a little bit of bias. What if, for example, every 20th widget just happens to be really good? There's no reason why that would necessarily be the case in that example, but it could lead to a problem.

All right, so let's work through an example with some of these good sampling techniques. A large worldwide financial company wants to develop a new retirement plan for the company. They want to survey managers of branches around the world to find out the most important strategies the new retirement plan should contain. They have 5000 branches worldwide and want to personally interview branch managers. They have information about the branch size and the state or province where each branch is located, and they want to talk to 50 branch managers.

Let's look at those four different strategies. If we were to do a simple random sample, we would just randomly sample 50 branches, go find their managers, and interview them. We would need a list of branches to pull this off, and again, this might cause some travel issues and some time concerns, depending on where these 50 branches are located.

In a stratified random sample approach, we would want to get all of small, medium, and large branches. We want to make sure each one of them is represented, so we're going to select a sample that's proportional from each group to make sure it looks like the population. In the pie chart on the right-hand side, 60% of our branches are small, 32% are medium, and only 8% are large. So, if we wanted to talk to 50 branch managers in all, we could talk to 30 small branch managers. Why? Because 30 is 60% of 50.
We could talk to 16 medium branch managers, because 16 is 32% of 50, and we could talk to four large branch managers, because four is 8% of 50. So, again, I have 50 branch managers I'm talking to, but I've broken my branches down into three groups and made sure I get people from each group.

Cluster sampling, on the other hand, would be something like this. We could split our branches up by state, randomly sample five states, like you see here, and then randomly select 10 branches in each of these states. That would give us our sample of 50. Now, what about some potential bias? Well, what if these five states don't represent the population of all states very well? New York and Texas, for example, are some of the largest states by population. Maybe that would give us a different perspective than some of the others, but notice how we have some low-population states in here as well, like South Dakota. So, again, we're trying to balance that out in how we do our clustering.

Last but not least, let's imagine how we would do this with systematic sampling. We have a list of 5000 branches and we want to sample 50 branches. If we take 5000 and divide it by 50, we get 100. So basically, we would take a look at the first 100 branches and select one of those at random, let's say branch 9, and then we would take every 100th branch after that: branch 9, branch 109, branch 209, branch 309, all the way up to branch 4909. That would be the idea of systematic sampling. So this puts it all together again, summarizing all four of those techniques across the problem that we're looking at.

In summary, again, we need good sampling to have good estimates. Four common statistical sampling techniques that are good are simple random sampling, stratified random sampling, cluster sampling, and systematic sampling. Each has some advantages and disadvantages, but these techniques will be much better than those biased techniques that we talked about earlier.

Well, good. Now you're learning a little bit about the good ways and the bad ways of collecting and gathering data. That is the end of this lecture, and I look forward to seeing you in the next one.
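As a recap of the branch-manager example, here is a minimal Python sketch of the four techniques. It is a sketch under stated assumptions, not the company's actual procedure: the branch list is made up, the ten state names and the way branches are assigned to states are hypothetical, and every function name is introduced only for illustration. The 60% / 32% / 8% size split and the 5000 branches / 50 managers figures come from the lecture.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 5000 branches with a size and a state.
# The state list and the state assignment are assumptions for illustration.
STATES = ["New York", "Texas", "South Dakota", "Ohio", "Oregon",
          "Florida", "Maine", "Utah", "Iowa", "Nevada"]
SIZES = ["small"] * 3000 + ["medium"] * 1600 + ["large"] * 400  # 60% / 32% / 8%
rng.shuffle(SIZES)
branches = [{"id": i + 1, "size": SIZES[i], "state": STATES[i % len(STATES)]}
            for i in range(5000)]
N_SAMPLE = 50

def simple_random_sample(frame, n):
    # Every possible group of n branches is equally likely to be chosen.
    return rng.sample(frame, n)

def stratified_sample(frame, n, key="size"):
    # Split the frame into strata, then take a proportional simple random
    # sample from every stratum (here 50 * 0.60 = 30, 50 * 0.32 = 16, 50 * 0.08 = 4).
    strata = {}
    for branch in frame:
        strata.setdefault(branch[key], []).append(branch)
    picked = []
    for members in strata.values():
        # Rounding is exact for these shares; other shares may not sum to n.
        share = round(n * len(members) / len(frame))
        picked.extend(rng.sample(members, share))
    return picked

def cluster_sample(frame, n_clusters, per_cluster, key="state"):
    # Randomly choose some clusters, then sample only within those clusters.
    clusters = {}
    for branch in frame:
        clusters.setdefault(branch[key], []).append(branch)
    chosen = rng.sample(sorted(clusters), n_clusters)
    picked = []
    for name in chosen:
        picked.extend(rng.sample(clusters[name], per_cluster))
    return picked

def systematic_sample(frame, n):
    # k = 5000 // 50 = 100: random start between 1 and k, then every kth branch.
    k = len(frame) // n
    start = rng.randint(1, k)
    return frame[start - 1::k]

srs = simple_random_sample(branches, N_SAMPLE)
strat = stratified_sample(branches, N_SAMPLE)
clus = cluster_sample(branches, n_clusters=5, per_cluster=10)
syst = systematic_sample(branches, N_SAMPLE)
print(len(srs), len(strat), len(clus), len(syst))
```

Each call returns 50 branches. The stratified sample always contains 30 small, 16 medium, and 4 large branches, the cluster sample touches only five states, and the systematic sample's branch numbers are always exactly 100 apart, which mirrors the trade-offs described in the lecture.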



Last modified: Tuesday, 26 May 2026, 9:01 AM