Welcome to the next section of the course. In the previous section, we really tried to define what data is. Now, let's talk about how to gather data. Like we've mentioned previously, data is everywhere. With all this data being gathered and stored, we need to understand good practices for gathering data. You can't just gather any data you want, in any way you want, without thinking ahead of time, and still expect to get the insights and inferences that we want from data. If you do, you're going to be left open to problems later on, so we need to try and prevent that with some forward thinking. That's what this next section of the course is about.

Here are the main concepts in this section of the course. First, we're going to be talking about the difference between samples and populations. Then we're going to look at the idea of randomness. After randomness, we're going to be talking about both good and bad sampling methods. You can't really understand what's good unless you also understand what's bad. Then we'll wrap up by talking about ethical concerns around data.

So, why do we even care? Why are we collecting data in the first place? Well, remember, as we talked about previously, we're collecting data so we can make better decisions about a group of people, places, things, whatever you're interested in. We want to get insights about that group and make better decisions around it. Along those lines, why would data help with that? If the data represents the things we are interested in, and that is a big if, it can provide insights. However, this first piece, where data represents the things we are interested in, is not trivial. This is not something we can just take for granted. In fact, it is foundational to everything we do. Without this, we can have a lot of problems. If data does not represent the things we are interested in, it can provide us with misleading results.
Those misleading results can lead to incorrect decisions, and those incorrect decisions can lead to problems. So, again, it's really crucial for us to make sure that our data represents the things we're interested in. We have to put care into how we collect data.

Let's talk about an example. Imagine you wanted to know the average height of the adult population in the United States. Imagine you want to know this because you're designing a new clothing line for adults. So you take a sample of people. We'll define sample in a little bit, but for right now, just think of a subset of people, since you think it'll be impossible to ask everyone in the United States what their height is. Now, suppose your sample consists entirely of professional basketball players. Do you see any problem here? Well, professional basketball players are probably taller than most adults in the United States. So, if their heights are taller, then our guess of the average adult height in the United States will be too tall. Therefore, if we take our clothes, which were originally meant to be designed for the average person, and design them for someone from our sample of professional basketball players, then we're not going to be able to sell these clothes to the typical person, because they're not going to fit very well. That means we've also wasted resources, and we've sent employees down the wrong path. So you can see that data problems here led us to incorrect conclusions, and those incorrect conclusions led us to bad insights and bad actions. The data we made decisions from did not represent the people we wanted to serve. That's the basic idea. If you can make sure your data represents the people you want to serve, that'll make this so much easier. It's not that the data was bad. It's not that I don't believe that you are able to correctly get the heights of those NBA players. It's just that your data wasn't collected in a way that provided the insights we wanted.

So how do we ensure we don't make these mistakes in the future? It all revolves around those three subjects we were talking about: samples and populations, randomness, and the idea of what is a good versus a bad sampling method.

So, let's quickly summarize. Data gathered without thinking ahead of time leaves itself open to problems later. If data represents the things we're interested in, wonderful, it can provide insights. However, if data doesn't represent the things we're interested in, it can provide misleading results and lead to incorrect decisions. So let's jump in and start talking about the idea of samples and populations, so we can further understand how to gather data well.

Before we start gathering data, it's good for us to know who or what we are interested in gathering information about. We should also consider what we want to know about this group. In other words: Who do you want to talk to? Why do you want to talk to them? That leads us to what we call a population. A population is the set of all individuals, or all objects, or all places, or all things. It's the entire set of what you're interested in. If you're looking to try and understand the average height of all Americans, all Americans would be your population.
If you are trying to understand what the most common color of car is in the United States, cars in the United States would be your population. Now, one thing you can probably quickly see from those two examples: populations are usually too large to obtain information about. We really don't have the opportunity, whether it's because of cost, time, or feasibility, to obtain information from the entire population. Again, what if you want to know the average height of adults in the United States? Adults in the United States would be your population. Well, it's probably impossible to actually get information from all adults in the United States in a timely, inexpensive, and efficient fashion. If you are actually able to obtain information from the whole population, this is what is called a census. We've seen censuses before, and again, these are where we can actually talk to, hopefully, everybody in the population.

Most of the time, we can't take a census to answer all of the questions that we have, so we must pay attention to the details of the population. You can't just define the population loosely. Again, I want to know the average height of adults in the United States. Well, that leads to some questions. What do you consider an adult? 18 years old and older? 21 years old and older? What is actually an adult to you? Another question: if this is for marketing a new clothing line, do you actually want to talk to all adults? Maybe you want adults of a certain age range. Maybe you want adults between the ages of 18 and 35. Is this a business or a casual clothing line? That, again, may change the population of people you want to talk to. What about a certain region of the country? Does this clothing line involve heavy winter jackets? Maybe people who would use those more often would be a better group to look at than people who wouldn't. In all honesty, a lot of problems with sampling don't come from people not trying to do it right. A lot of problems with sampling come from not defining the population well. So again, maybe you do want to know the average height of adults in the United States to try and make your clothing line better, but getting more details about specifically what you want can actually help with all of the sampling concerns that we might deal with later.

All right, so we want to talk to these adults. Wonderful. What do we want to collect about these adults? If a population is the set of all objects or individuals of interest, what you want to know about the population is what we call a parameter. A parameter is some kind of measure, some kind of collected information that you get from a population. So, for example, if our population was adults in the United States, our parameter would be the average height. It is something that we want to measure or compute about our population. Now, like I mentioned earlier, we're probably not going to be able to talk to everybody in the population, so instead, we talk to a subset of the population. This is where our information is actually gathered from. This subset of the population is referred to as a sample.
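
To make the basketball example from earlier a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the heights are made-up numbers drawn from a bell curve (not real measurements), the population size of 100,000 and the sample size of 500 are arbitrary, and "only the tallest people" is just a stand-in for a sample of basketball players. It computes the parameter from the whole pretend population and then compares it with the average from two different samples.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch gives the same numbers on every run

# Hypothetical population: 100,000 adult heights in inches (made-up values).
population = [random.gauss(67, 4) for _ in range(100_000)]

# Parameter: the average height of the entire population.
# In real life we usually cannot compute this, which is why we sample.
parameter = statistics.mean(population)

# Sample A: 500 people picked at random from the whole population.
random_sample = random.sample(population, 500)

# Sample B: only the 500 tallest people, a stand-in for "basketball players".
tallest_sample = sorted(population)[-500:]

print(f"population average (parameter): {parameter:.1f}")
print(f"random sample average:          {statistics.mean(random_sample):.1f}")
print(f"tallest-only sample average:    {statistics.mean(tallest_sample):.1f}")
```

In this sketch the randomly picked sample should land close to the population average, while the tallest-only sample lands well above it. Nothing was measured incorrectly in either sample; the second one just doesn't represent the group we care about.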
Now, for this to be a good sample, it should represent the population well. Go back to our example earlier with the basketball players. That sample did not represent the population we were interested in well, and because of that it led to problems. But how do you collect a sample? Typically, we use what is called a sampling frame. A sampling frame is the list from which the sample is taken. So, for example, let's imagine you wanted to look up adults in the United States and ask them about their heights. So you pull out a phone book. Now, I know that's a little bit of an outdated example, but I think it serves the point really well. The question now becomes, are you in your local phone book? If you don't know what a phone book is, you might not be. And if you're not in your local phone book, that gives you an example of how the sampling frame, the place you're trying to collect your information and your sample from, may not equal the population. This leads to a bad type of sampling that we'll talk about later.

Okay, so we have a sample. This is our subset of the population. We had a parameter that describes the population. What describes the sample? That is what we refer to as a statistic. A statistic is something that we measure or compute from a sample. So, again, if the population was adults in the United States, and the population parameter was the average height of those adults, our sample would be the group of people we actually ask the height of, and our statistic would be the average height of that group of people we actually talk to. Our sample statistics are what we call point estimates of the population parameter. Basically, you don't know the average height of all Americans. You have a guess, though. That guess is your statistic, the statistic you obtained from your sample. So, again, our point estimate is just a single-number estimate, or a single-number guess, of some unknown parameter. There's something you want to know, you collect some data to help you figure that out, and you make a guess of what that answer is. That is called a point estimate.

So, let's see all these things put together. First, we have our population. Remember, it's the set of all individuals that we're interested in. Unfortunately, we can't actually talk to the entire population, so instead we talk to a sample. This sample is a subset of the population that we actually collect information about. What are we going to do with that sample? We're going to calculate a statistic. That statistic is some number or some calculation that we make from the sample itself: average height, most prominent eye color, average weight, something that we're trying to calculate about the sample. Now, what we're going to use that statistic for is to make a guess of a parameter. That parameter measures the same thing about the population. It is something that we would calculate if we were able to get hold of the entire population, so that parameter summarizes the population. Do you see it all together? We have a population we're interested in. We can't talk to all of them, so we take a sample. From that sample, we're going to calculate the thing we are interested in, the number that we're interested in, the statistic.
That statistic we're going to use to estimate some parameter, and remember, that parameter summarizes the population that we were interested in in the first place.

Let's work through an example. Imagine a retail chain is trying to determine if a new product they introduced is selling well across their stores. The retail chain has 2135 stores nationwide. The analyst in charge of this product is tasked with estimating the average daily sales of this new product across all stores. Now, because of older computing technology (I'm sure there are people listening to this lecture who understand that completely), the company is not able to talk to all the stores nationwide. Instead, they're going to have to randomly pick 179 stores, spread out evenly throughout the nation, to collect the data from. The average daily sales from these 179 stores is $129.19. So let's identify the population, the sample, the parameter, and the statistic, and then we can ponder a little bit on whether or not there are any sampling frame issues.

First things first, let's identify the population. The population we have is the 2135 stores nationwide. This is the collection of all stores we wish we could talk to. Unfortunately, we cannot talk to all of those stores, so instead we're going to have to talk to a subset of those stores, a smaller collection of them, specifically 179 stores spread evenly throughout the nation. That is our sample. Now, what is it that we wanted to know about the population? Well, we wanted to know what the average daily sales of this new product was across those 2135 stores nationwide. That is our parameter. Again, we couldn't talk to the whole population, so there's no way for us to get this actual number. However, we can guess that number by calculating the same thing from our 179 stores. So we can calculate the average daily sales of this new product across the 179 stores in our sample. That provides us with an average daily sales of $129.19, and that would be your statistic.

Do you see it all put together? We want to talk to the 2135 stores. We're not able to, so we talked to 179 stores instead. What did we want to know about those 2135 stores? We wanted to know what the average daily sales of a new product was. I can't talk to all of them, so I can only talk to these 179. From those 179 stores, though, the average daily sales of the product is $129.19, so my best guess of the average daily sales of the new product across all stores would be this same number. My guess of the parameter is going to be this $129.19. Now, am I going to be exactly right? Probably not. But in all honesty, if these 179 stores represent the 2135 stores well, that number is probably going to be pretty close. However, if those 179 stores don't represent those 2135 stores well, that number could be really off.

Do you notice any kind of sampling frame issues here? At least initially, it doesn't look like there are too many problems. We have 2135 stores. We're not picking the easiest 179 stores to talk with. We're not picking them all in one region of the country.
We're spreading them out evenly throughout the country to make sure we understand things that are happening all over the country when it comes to this new product. So at least initially, it doesn't look like there are any sampling frame issues here that would make us question whether or not those 179 stores actually did a good job of representing the 2135 stores in all.

All right, let's wrap all this up in a summary. Remember, a population is just the set of all of the individuals or objects that you're interested in gathering information about. Unfortunately, you usually can't talk to the entire population, so because of that, you have to talk to a sample. This sample is just a subset of the population. Typically, we use something called a sampling frame to actually draw a sample from. We hope this sampling frame is actually equal to the population. Now, from that sample, we're going to calculate what we call a statistic. It's an actual measure from our sample itself. That statistic is going to be our point estimate of something grander, something larger: the parameter. That parameter is the thing we're interested in about the population. So we have this population, this group of things or people that we're interested in. The thing we're interested in about them is the parameter, and we're going to use the statistic we calculated from our sample as our best guess of what that parameter is.

So hopefully you can start to see the importance here of how we gather data. We have to understand who we're talking to in order to gather it well. That's the end of this lecture, and I look forward to seeing you in the next one.
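
The retail example from this lecture can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-store sales figures are made up (the lecture does not give them), the $80 to $180 range is arbitrary, and the 179 stores are chosen with a simple random pick rather than being deliberately spread evenly across regions. The only numbers taken from the lecture are the 2135 stores in the population and the 179 stores in the sample.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

N_STORES = 2135  # population size, from the lecture example
N_SAMPLE = 179   # sample size, from the lecture example

# Hypothetical average daily sales (in dollars) for every store.
# In the real scenario these values are unknown for most stores.
population_sales = [random.uniform(80, 180) for _ in range(N_STORES)]

# Sample: 179 stores picked at random, without repeats (an assumption;
# the lecture's stores are also spread evenly across the nation).
sample_sales = random.sample(population_sales, N_SAMPLE)

# Statistic: computed from the sample. This is our point estimate.
statistic = statistics.mean(sample_sales)

# Parameter: computed from the whole population. We can only do this
# here because the population is simulated.
parameter = statistics.mean(population_sales)

print(f"statistic (average over {N_SAMPLE} stores): ${statistic:.2f}")
print(f"parameter (average over {N_STORES} stores): ${parameter:.2f}")
print(f"difference: ${abs(statistic - parameter):.2f}")
```

Because the population here is simulated, we get to peek at the parameter and see how close the point estimate lands. In practice you would only ever have the statistic, which is exactly why it matters that the sample represents the population well.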



Last modified: Tuesday, 26 May 2026, 8:57 AM