Video Transcript: Gathering Data - Part 1
Welcome to the next section of the course. In the previous section, we tried to define what data is. Now, let's talk about how to gather data. Like we've mentioned previously, data is everywhere. With all this data being gathered and stored, we need to understand good practices for gathering data. You can't just gather any data you want, in any way you want, and expect to get the insights and inferences we want from it. Without thinking ahead of time, you're going to be left open to problems later on, so we need to try to prevent that with some forward thinking. That's what this next section of the course is about.
Here are the main concepts in this section of the course. First, we're going to talk about the difference between samples and populations. Then we're going to look at the idea of randomness. After randomness, we're going to talk about both good and bad sampling methods. You can't really understand what's good unless you also understand what's bad. Then we'll wrap up by talking about ethical concerns around data.
So, why do we even care? Why are we collecting data in the first place? Remember, as we talked about previously, we're collecting data so we can make better decisions about a group of people, places, things, whatever you're interested in. We want to get insights about that group and make better decisions around it. Well, along those lines, why would data help with that? If the data represents the things we are interested in, and that is a big if, it can provide insights. However, this first piece, where data represents the things we are interested in, is not trivial. This is not something we can just take for granted. In fact, it is foundational to everything we do. Without this, we can have a lot of problems. If data does not represent the things we are interested in, it can provide us with misleading results.
Those misleading results can lead to incorrect decisions, and those incorrect decisions can lead to problems. So, again, it's really crucial for us to make sure that our data represents the things we're interested in. We have to put care into how we collect data.
Let's talk about an example. Imagine you wanted to know the average height of the adult population in the United States. Imagine you want to know this because you're designing a new clothing line for adults. So you take a sample of people. We'll define sample in a little bit, but for right now, just think of a subset of people, since you think it'll be impossible to ask everyone in the United States what their height is. Now, your sample consists entirely of professional basketball players. Do you see any problem here? Well, professional basketball players are probably taller than most adults in the United States. So, if they are taller, then our guess of the average adult height in the United States will be too high. Therefore, if we take our clothes, which were originally meant to be designed for the average person, and design them for someone from our sample of professional basketball players, then we're not going to be able to sell these clothes to the typical person, because they're not going to fit very well. That means we've also wasted resources, and we've sent employees down the wrong path. So you can see that data problems here led us to make incorrect conclusions, and those incorrect conclusions led us to bad insights and bad actions. The data we made decisions from did not represent the people we wanted to serve. That's the basic idea. If you can make sure your data represents the people you want to serve, that'll make this so much easier. It's not that the data was bad. It's not that I don't believe you were able to correctly get the heights of those NBA players. It's just that your data wasn't collected in a way that provided the insights we wanted.
So how do we ensure we don't make these mistakes in the future? It all revolves around those three subjects we were talking about: samples and populations, randomness, and the idea of what is a good versus a bad sampling method.
So, let's quickly summarize. Data gathered without thinking ahead of time leaves itself open to problems later. If data represents the things we're interested in, wonderful, it can provide insights. However, if data doesn't represent the things we're interested in, it can provide misleading results and lead to incorrect decisions. So let's jump in and start talking about the idea of samples and populations, so we can further understand how to gather data well.
Before we start gathering data, it's good for us to know who or what we are interested in gathering information about. We should also consider what we want to know about this group. In other words: Who do you want to talk to? Why do you want to talk to them? That leads us to what we call a population. A population is the set of all individuals, objects, places, or things you're interested in. It's the entire set. If you're looking to understand the average height of all Americans, all Americans would be your population.
If you are trying to understand what the most common color of car is in the United States, cars in the United States would be your population. Now, one thing you can probably quickly see from those two examples is that populations are usually too large to obtain information about. We really don't have the opportunity, whether because of cost, time, or feasibility, to obtain information from the entire population. Again, what if you want to know the average height of adults in the United States? Adults in the United States would be your population. Well, it's probably impossible to actually get information from all adults in the United States in a timely, inexpensive, and efficient fashion. If you are actually able to obtain information from the whole population, that is what is called a census. We've seen censuses before, and again, these are where we can actually talk to, hopefully, everybody in the population.
Most of the time, we can't take a census to answer all of the questions that we have, so we must pay attention to the details of the population. You can't just define the population loosely. Again, I want to know the average height of adults in the United States. Well, that leads to some questions. What do you consider an adult? 18 years old and older? 21 years old and older? What is actually an adult to you? Another question: if this is for marketing a new clothing line, do you actually want to talk to all adults? Maybe you want adults in a certain age range, say between the ages of 18 and 35. Is this a business or a casual clothing line? That may again change the population of people you want to talk to. What about a certain region of the country? Does this clothing line involve heavy winter jackets? Maybe people who would use those more often would be a better group to look at than people who wouldn't.
In all honesty, a lot of problems with sampling don't come from people not trying to do it right. A lot of problems with sampling come from not fully defining the population. So again, maybe you do want to know the average height of adults in the United States to try to make your clothing line better, but getting more specific about what you want can actually help with all of the sampling concerns that we might deal with later.
All right, so we want to talk to these adults. Wonderful. What do we want to collect about these adults? If a population is the set of all objects or individuals of interest, what you want to know about the population is what we call a parameter. A parameter is some kind of measure, some kind of collected information, that you get from a population. So, for example, if our population was adults in the United States, our parameter would be the average height. It is something that we want to measure or compute about our population.
Now, like I mentioned earlier, we're probably not going to be able to talk to everybody in the population, so instead we talk to a subset of the population. This is where our information is actually gathered from. This subset of the population is referred to as a sample. Now, for this to be a good sample, it should represent the population well.
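To make "represent the population well" a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the heights are synthetic (a made-up mean of 170 cm and spread of 10 cm, not real measurements), the function names are hypothetical, and "only the tallest people" is just a stand-in for a sample that does not represent the population. It compares the average from a random sample with the average from the tallest-only sample.

```python
import random
import statistics


def simulate_heights(seed=0, population_size=100_000):
    """Build a synthetic 'population' of adult heights in centimeters.

    The mean (170) and standard deviation (10) are made-up illustration
    values, not real measurements of adults in the United States.
    """
    rng = random.Random(seed)
    return [rng.gauss(170, 10) for _ in range(population_size)]


def compare_samples(population, sample_size=500, seed=1):
    """Return (parameter, random-sample mean, tallest-only mean)."""
    rng = random.Random(seed)

    # The parameter: the average over the whole population.
    # In practice this is the number we usually cannot compute.
    parameter = statistics.mean(population)

    # A sample where every individual is equally likely to be picked.
    random_sample = rng.sample(population, sample_size)

    # A sample made up only of the tallest individuals, standing in for
    # a group that does not represent the population.
    tallest_only = sorted(population)[-sample_size:]

    return (
        parameter,
        statistics.mean(random_sample),
        statistics.mean(tallest_only),
    )


if __name__ == "__main__":
    parameter, random_stat, biased_stat = compare_samples(simulate_heights())
    print(f"Population mean (parameter):    {parameter:.1f} cm")
    print(f"Random sample mean (statistic): {random_stat:.1f} cm")
    print(f"Tallest-only mean (statistic):  {biased_stat:.1f} cm")
```

Under these assumptions, the random sample's average should land close to the population average, while the tallest-only average should sit well above it, even though both samples contain the same number of people.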
Go back to our example earlier with the basketball players. That sample did not represent the population we were interested in well, and because of that, it led to problems.
But how do you collect a sample? Typically, we use what is called a sampling frame. A sampling frame is the list from which the sample is taken. So, for example, let's imagine you wanted to ask adults in the United States about their heights. So you pull out a phone book. Now, I know that's a little bit of an outdated example, but I think it serves the point really well. The question now becomes, are you in your local phone book? If you don't know what a phone book is, you might not be. And if you're not in your local phone book, that shows you that the sampling frame, the list you're drawing your sample from, may not equal the population. This leads to a bad type of sampling that we'll talk about later.
Okay, so we have a sample. This is our subset of the population. We had a parameter that describes the population. What describes the sample? That is what we refer to as a statistic. A statistic is something that we measure or compute from a sample. So, again, if the population was adults in the United States, and the population parameter was the average height of those adults, our sample would be the group of people we actually ask about their height, and our statistic would be the average height of that group of people we actually talk to.
Sample statistics are what we call point estimates of the population parameter. Basically, you don't know the average height of all Americans. You have a guess, though. That guess is your statistic, the statistic you obtained from your sample. So, again, our point estimate is just a single-number estimate, or a single-number guess, of some unknown parameter. There's something you want to know, you collect some data to help you figure that out, and you make a guess of what that answer is. That guess is called a point estimate.
So, let's see all these things put together. First, we have our population. Remember, it's the set of all individuals that we're interested in. Unfortunately, we can't actually talk to the entire population, so instead we talk to a sample. This sample is a subset of the population that we actually collect information about. What are we going to do with that sample? We're going to calculate a statistic. That statistic is some number or some calculation that we make from the sample itself: average height, most prominent eye color, average weight, something that we're trying to calculate about the sample. Now, what we're going to use that statistic for is to make a guess of a parameter. That parameter measures the same thing about the population. It is something that we would calculate if we were able to get hold of the entire population, so that parameter summarizes the population.
Do you see it all together? We have a population we're interested in. We can't talk to all of them, so we take a sample. From that sample, we're going to calculate the thing we are interested in, the number that we're interested in, the statistic.
It's that statistic we're going to use to estimate some parameter, and remember, that parameter summarizes the population that we were interested in in the first place.
Let's work through an example. Imagine a retail chain is trying to determine if a new product they introduced is selling well across their stores. The retail chain has 2135 stores nationwide. The analyst in charge of this product is tasked with estimating the average daily sales of this new product across all stores. Now, because of older computing technology (I'm sure there are people listening to this lecture who understand that completely), the company is not able to talk to all the stores nationwide. Instead, they're going to randomly pick 179 stores spread out evenly throughout the nation to collect the data from. The average daily sales from these 179 stores is $129.19. So let's identify the population, the sample, the parameter, and the statistic, and then we can ponder a little bit on whether or not there are any sampling frame issues.
First things first, let's identify the population. The population is the 2135 stores nationwide. This is the collection of all stores we wish we could talk to. Unfortunately, we cannot talk to all of those stores, so instead we're going to talk to a subset of those stores, a smaller collection of them, specifically 179 stores spread evenly throughout the nation. That is our sample.
Now, what is it that we wanted to know about the population? Well, we wanted to know what the average daily sales of this new product was across those 2135 stores nationwide. That is our parameter. Again, we couldn't talk to the whole population, so there's no way for us to get this actual number. However, we can guess that number by calculating the same thing from our 179 stores. So we can calculate the average daily sales of this new product across the 179 stores in our sample. That provides us with an average daily sales of $129.19. That would be your statistic.
Do you see it all put together? We want to talk to the 2135 stores. We're not able to, so we talked to 179 stores instead. What did we want to know about those 2135 stores? We wanted to know what the average daily sales of a new product was. I can't talk to all of them, so I can only talk to these 179. From those 179 stores, though, the average daily sales of the product is $129.19, so my best guess of the average daily sales of the new product across all stores would be this same number. My guess of the parameter is going to be this $129.19.
Now, am I going to be exactly right? Probably not. But in all honesty, if these 179 stores represent the 2135 stores well, that number is probably going to be pretty close. However, if those 179 stores don't represent those 2135 stores well, that number could be really off.
Do you notice any kind of sampling frame issues here? At least initially, it doesn't look like there are too many problems. We have 2135 stores. We're not picking the easiest 179 stores to talk with. We're not picking them all in one region of the country.
We're spreading them out evenly throughout the country to make sure we understand what is happening all over the country when it comes to this new product. So, at least initially, it doesn't look like there are any sampling frame issues here that would raise questions about whether those 179 stores did a good job of representing all 2135 stores.
All right, let's wrap all this up in a summary. Remember, a population is just the set of all of the individuals or objects that you're interested in gathering information about. Unfortunately, you usually can't talk to the entire population, so because of that, you have to talk to a sample. This sample is just a subset of the population. Typically, we use something called a sampling frame to actually draw a sample from. We hope this sampling frame is actually equal to the population. Now, from that sample, we're going to calculate what we call a statistic. It's an actual measure from our sample itself. That statistic is going to be our point estimate of something grander, something larger: the parameter. That parameter is the thing we're interested in about the population. So we have this population, this group of things or people that we're interested in. The thing we're interested in about them is the parameter, and we're going to use the statistic we calculated from our sample to be our best guess of what that parameter is.
So hopefully you can start to see the importance here of how we gather data. We have to understand who we're talking to in order to gather it well. That's the end of this lecture, and I look forward to seeing you in the next one.
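As a recap of the retail example from this lecture, here is a rough Python sketch of the population, sampling frame, sample, statistic, and parameter. Only the store counts (2135 and 179) come from the lecture. The per-store sales figures are synthetic, the function names are hypothetical, and the sketch draws a simple random sample rather than the evenly spread selection described in the lecture, since the lecture gives no regions to spread across. It will not reproduce the $129.19 figure.

```python
import random
import statistics

N_STORES = 2135    # population size, from the lecture
SAMPLE_SIZE = 179  # sample size, from the lecture


def make_store_sales(seed=42):
    """Synthetic average daily sales (in dollars) for every store.

    Hypothetical numbers for illustration only. In the lecture's scenario
    the analyst cannot actually see all of these values.
    """
    rng = random.Random(seed)
    return {store_id: rng.uniform(50.0, 200.0) for store_id in range(N_STORES)}


def point_estimate(sales_by_store, sample_size=SAMPLE_SIZE, seed=7):
    """Draw a simple random sample of stores; return (sample_ids, statistic)."""
    rng = random.Random(seed)

    # The sampling frame is the list the sample is drawn from. Here it is
    # the full list of store ids, so it equals the population.
    sampling_frame = list(sales_by_store)

    # The sample: a subset of stores, chosen without replacement.
    sample_ids = rng.sample(sampling_frame, sample_size)

    # The statistic: average daily sales across the sampled stores only.
    statistic = statistics.mean(sales_by_store[s] for s in sample_ids)
    return sample_ids, statistic


if __name__ == "__main__":
    sales = make_store_sales()
    # The parameter is computable here only because the data is simulated.
    parameter = statistics.mean(sales.values())
    sample_ids, statistic = point_estimate(sales)
    print(f"Stores sampled: {len(sample_ids)} of {len(sales)}")
    print(f"Statistic (point estimate):   ${statistic:.2f}")
    print(f"Parameter (normally unknown): ${parameter:.2f}")
```

The statistic printed here plays the role of the $129.19 in the lecture: a single-number guess at the parameter, which should be close when the sample represents the population well.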