Video Transcript: Gathering Data - Part 3
So let's finish off this section on gathering data with our last lecture. The first thing we're going to talk about in this lecture is the idea of experiments. Some of you may have collected data before through the use of an experiment, but what is an experiment? Well, to understand what an experiment is, we first have to understand what an observational study is. Data collection usually gets classified as either an observational study or an experimental study. Let's look at an observational study first, because that's really what all of the examples we've been using up until now have been. In an observational study, the researcher or the person collecting the data does not interfere or intervene in the process of collecting data. Basically, it requires selecting a sample. However, in an experimental study, also known as an experiment, the researcher or the person gathering the data specifically manipulates the conditions in which the study is carried out. This is by design. It requires selecting a sample and designing and conducting an experiment around that sample. Let's take a look at some examples. Imagine you wanted to know the average height of the adult population in the United States, because you're designing a new clothing line for adults. Well, this would be an example of an observational study. We're just observing what has happened, height in our population of interest. The bike data set would be another example of an observational study. All we're doing is looking back in time and seeing how many people have used our service in the past and what the weather happened to be at that time. An experiment is structured a little bit differently, though. But before we can understand an experiment and look at an example, let's talk a little bit about the terminology people use when referring to an experiment.
In an experiment, the researcher, or the person gathering the data, randomly assigns what they call treatments to experimental units. Okay, what do we mean? Well, an experimental unit is just an observation. Think about the person of interest or the object of interest. Now, let's start from the bottom and work our way up. A treatment is a specific experimental condition. Okay, well, what is an experimental condition? It's something that we can potentially control. Well, what can we control in experiments? We can control certain variables. These variables are known as factors. A factor is basically a variable used to predict, and it takes on a finite number of values. Again, think about it as a categorical variable, or a qualitative variable. In experiments, we call them factors. The levels of a factor are basically the categories inside of our qualitative variable. So for factor, think qualitative variable; for level, think category in that qualitative variable. So again, when we apply a treatment, this is a specific experimental condition. Usually it's a level of a factor, if there's only one factor, or a certain combination of the levels from several factors. Let me give you an example to help walk through some of these things. Let's imagine a mechanical engineer wanted to determine which variables influence the gas mileage of a certain year and model of car. Well, gas mileage is the variable that we're interested in. You can think of cars as the
experimental units. The factors that we're going to study will be tire pressure, which has two levels, low and standard, as well as octane rating of fuel, which will have three levels: regular, mid grade, and premium. Again, these factors are just qualitative variables, right? Tire pressure is just a qualitative variable. Octane rating of fuel is just a qualitative variable, but in experiments we call them factors. Now we're also going to measure other things, but we're going to try and control for these things. For example, we're going to try and control for weather conditions, or route, or tire type. So this would be an example of what we're looking at. A treatment would be a combination of things like low tire pressure with regular octane rating, and then standard tire pressure with regular octane rating, just to be able to compare low versus standard tire pressure at one level of octane rating, and we would do that for all the levels of octane rating as well. The key thing that makes this study an experimental study is the active role the researcher plays in manipulating the environment. Again, take a look back at this example. We're going to take very specific conditions and test those specific predetermined conditions. This is not something we're going back and looking at afterwards. That active role makes it really hard sometimes to actually run a true experiment. In fact, a lot of times in certain healthcare scenarios and economic scenarios, we can't actually do real experiments. Now, yes, you may have heard of drug studies before, and those are experiments, but let's imagine you wanted to measure the effects of smoking on children. You think smoking is bad, and you want to see the impact that it may have on children. Well, it would be rather unethical to actually have an experiment where we basically said this group of children is going to receive secondhand smoke, and this other group of children is not.
Let's see which one's bothered more. That would be an unethical thing to do. Same idea for the second example. Let's imagine that we're trying to measure the effects of family income during childhood on college performance. In other words, we're going to basically put certain families in poverty, we're going to give other families a lot of money, let the children grow up, and see how they perform in college. Again, that's not an ethical thing to do. These questions would have to be answered with observational studies instead. We can look back and see children who have been around secondhand smoke before and see if we can measure some effects of what's going on, but we're not intentionally putting children in the future into a situation that may harm them. Again, the big difference between an experimental study and an observational one is that in an observational one, we look back on data that has already occurred. In experimental studies, we design them ahead of time to collect data in the future. There are three key components to a well-designed experiment. So, let's talk about these components. The first deals with randomization, where we're basically taking treatments and randomly assigning them to experimental units. We've dealt with randomness before, but now we're talking about it in terms of an experiment. We want to make sure that
we're not specifically designing and assigning treatments to very specific people. We want to randomly assign them to make sure we're getting an unbiased approach. The second key component is what we call replication. Replication is when we have multiple subjects assigned the same treatment. If you only tested a new drug, for example, on one person, and it worked really well for that one person, how do you know you didn't just get lucky? The idea of an experiment is to make sure we can repeat it, so we have replication. Subjects who have the same treatment are called replicates. The more replication you have, the more confidence you can have in your study conclusions. That is why you see big, well-designed experiments typically run on many individuals. We also have what we call control. Control is where some study conditions are held constant. This helps us reduce variability. Controlling certain variables that can impact what we're interested in, which we sometimes call nuisance variables, allows us to make a better inference about what's actually going on. It basically makes sure we can see things more easily. We can actually see differences because of our treatments, not because of other things. Again, we can go back to that car example. The whole idea of holding constant the weather conditions, the route, or the tire type is that those are potential nuisance factors that get in the way of measuring gas mileage, so these would be things that we are trying to control. So when we're taking different cars and measuring their gas mileage under different conditions like tire pressure or octane rating, we're going to make sure those different cars drive the same route. We're going to make sure they have the same tire type, and we're going to make sure we do it under the same weather conditions.
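To make the car example concrete, here is a minimal sketch in Python of how the treatments could be built and randomly assigned. The factor names and levels come from the example above; the function name assign_treatments, the 18 hypothetical cars, the choice of three replicates, and the wording of the held-constant conditions are assumptions made only for illustration.

```python
import itertools
import random

# Factors and their levels, taken from the gas-mileage example.
tire_pressure = ["low", "standard"]
octane_rating = ["regular", "mid grade", "premium"]

# A treatment is one combination of factor levels: 2 x 3 = 6 treatments.
treatments = list(itertools.product(tire_pressure, octane_rating))

# Control: conditions held constant for every car (illustrative values).
held_constant = {"route": "same route", "tire type": "same tires",
                 "weather": "same conditions"}


def assign_treatments(units, treatments, replicates, seed=None):
    """Randomly assign each treatment to `replicates` experimental units."""
    if len(units) != len(treatments) * replicates:
        raise ValueError("need exactly len(treatments) * replicates units")
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # randomization: chance decides which car gets what
    # Replication: every treatment appears `replicates` times in the plan.
    plan = [t for t in treatments for _ in range(replicates)]
    return dict(zip(shuffled, plan))


# Hypothetical experimental units: 18 cars of the same year and model.
cars = [f"car_{i:02d}" for i in range(1, 19)]
assignment = assign_treatments(cars, treatments, replicates=3, seed=42)

for car in sorted(assignment):
    print(car, assignment[car], held_constant)
```

Shuffling the cars before pairing them with the plan is one simple way to get a random assignment; every treatment still ends up with the same number of replicates.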
Holding those conditions constant means they won't impact our study of what tire pressure and octane rating can do to gas mileage. So let's summarize. An observational study, which, I'll be honest with you, is a lot of what data analysis is these days, is where researchers or data gatherers do not interfere or intervene in the process of collecting data. We just observe data, and we try to understand associations and relationships after the fact. An experimental study, however, is where the researcher specifically manipulates the conditions in which the study is carried out. What we are trying to do with experiments is isolate the effects of treatments, so we can more confidently say that something is happening because of a treatment. Now, there are three key components to a well-designed experiment: first is randomization, second is replication, and last we have control. Excellent. So we've talked about the idea of experiments, but that does bring up another subject that we should talk about when it comes to gathering data, and that is the idea of data ethics. The gathering of data leads to questions around the ethical collection and use of that data. We talked about an example previously. We want to understand the effects of smoking on children. It would not be an ethical thing to subject children to these kinds of situations just so we can answer some experimental question. As
Christians, though, we're held to an even higher standard around ethical considerations, so we must always keep these things in mind as we're living up to that higher standard. I personally believe that God not only makes us stewards of money, but he also makes us stewards of a variety of things in our lives: our talents, our jobs. Here, it's the data that we collect. We need to do so in an ethical manner. So, in observational studies or experiments, we must keep the interest of the subject we're collecting data from at the forefront. Again, in that example with the children and the effects of smoking, if we keep the interest of the children at the forefront, then we can say, you know, this is not a good idea. In fact, in 1964 the Helsinki Declaration of the World Medical Association made this statement: the interests of the subject must always prevail over the interests of society and science. So, when we start collecting our data, there are safeguards that we can have: for example, institutional review boards, informed consent, and confidentiality. Let's talk about each one of these. What is an institutional review board? Well, there have to be people who have the best interest of the subjects of the data collection in mind. Sometimes the people running the experiments may get blinded by wanting results. So, an institutional review board comes in, and they are the people who have the best interest of the subjects in mind, to keep the experimenters, or the people gathering data in an observational study, focused on what's best for the subjects. For example, medical studies actually require institutional review boards to evaluate every single study before it is conducted, so that the subjects are not put in harm's way.
There are many horror stories from before institutional review boards were put in place, where medical studies were done on people that honestly should not have been done. So again, it is crucial to have others step in to keep the best interest of the subjects in mind. Unfortunately, these boards are not required for a lot of business studies. However, the people collecting the data, potentially you, should take the subject into account before any data collection is performed. You can be the voice of reason at whatever company you're doing this for. So, let's talk about informed consent. What do we mean by informed? Well, to be informed, a subject should be told what data is needed from them and what potential outcomes come from the data being given to the people collecting it. It's not just that we should tell people, hey, I need to collect this data about you. I also need to tell them the potential outcomes of them giving me that data. So, we must ensure that all information is shared. Now, this may be hard for those gathering the data, since they believe in their work and its usefulness. However, you have to think about all the potential risks of having that data and make sure those risks are revealed to the subject. So that's the idea of being informed. Then comes the idea of consent. After being informed, subjects must agree to the collection of data. Usually, this is done in writing, of course. If this is
the case, we also have to consider who can actually give consent. For example, can a small child give consent? Usually not. Usually, a parent or a guardian has to give consent for the child. The same idea applies to mentally ill subjects as well. So, again, these things are done a lot of times in medical studies, but in business studies they may not be done. So, I invite you to be the person in charge of this. If you're in charge of collecting data, if you're in charge of analyzing data, it is your responsibility to make sure that we are good stewards and good examples of what people should be doing with this data. Now, I'll be honest, some people are afraid that consent is harder to come by if you reveal all the possible bad outcomes, no matter how unlikely they are. But is that a bad thing? Is it bad to say, you know, there's a very, very slim chance that this data could be used for something that you don't want it to be used for? We need to make sure that consent is given only after being fully informed. Last but not least, we have confidentiality. Once data is collected, privacy is extremely important. Confidentiality is where the subjects in the data have their identifying information masked, so you can report overall statistics about the data that is gathered, but not who it belonged to, unless you're reporting results to others who own the data. So you can say, well, the average age of this sample of people is this, without revealing everyone's individual age. Of course, in the modern age of technology, we see many stories of confidential data being leaked due to computer hacking. Again, that's the downside of data collection: there's risk. So we must take all the steps that we can to protect the data that has been given to us. We are stewards of that data. This data has been entrusted to us. We must keep it confidential and private.
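As a small illustration of reporting overall statistics without revealing who the data belongs to, here is a hedged sketch in Python. The records, the field names, and the helper functions mask_identifiers and summarize_age are all made up for this example. Dropping a name field is only the simplest form of masking and does not, by itself, guarantee that nobody can be re-identified.

```python
import statistics

# Hypothetical records; the names and ages are invented for illustration.
records = [
    {"name": "Subject A", "age": 34},
    {"name": "Subject B", "age": 41},
    {"name": "Subject C", "age": 29},
    {"name": "Subject D", "age": 52},
]


def mask_identifiers(records, id_fields=("name",)):
    """Return copies of the records with the identifying fields removed."""
    return [
        {key: value for key, value in record.items() if key not in id_fields}
        for record in records
    ]


def summarize_age(records):
    """Report only overall statistics, never an individual's age."""
    ages = [record["age"] for record in records]
    return {"n": len(ages), "mean_age": statistics.mean(ages)}


masked = mask_identifiers(records)
summary = summarize_age(masked)
print(summary)
```

Only the summary would be shared; the original records, which still contain names, stay with the people entrusted with the data.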
Now, being anonymous is a little bit different from being confidential. Anonymity is when identifying information about the subjects is never actually known in the data collection. Anonymity is even more private than confidentiality. With confidentiality, we know who the information is collected on. With anonymity, we don't. An example would be not actually writing down the names of the people whose heights or ages you're collecting, so now you just have a collection of measurements that you can't really tie back to anybody. So, let's walk through an example of a website. You want to know which website design will work better to get people to click on your products. You randomly show one of two designs to people who visit your website to measure which design performs better. Are there any concerns around institutional review? Informed consent? How about confidentiality? Did anyone double check this website design study that you wanted to run? Did you have anyone else review it? If you're showing different websites to different people, have you actually informed the people in this study that they're in a study? Did they give consent? Are you tracking what these people are doing on your website? If so, are you taking confidentiality into account? These are important questions. What about a wearable medical device? You wear a watch that tracks
your heart rate and sends that information off to a company. That company uses the information to determine trends and characteristics of people at risk for heart disease. Again, did you agree to that? Maybe you did. Maybe you did unknowingly. Was there any institutional review of this study? Again, I ask, did you agree to it? Maybe it was in all that small print at the bottom when you bought that wearable device, and maybe buying it was you giving consent. Again, we have to be very careful about this. And what about confidentiality? Are these companies making sure that we can't be linked to our health data, which could potentially be stolen? So in summary, the gathering of data leads to questions around the ethical collection and use of that data. As Christians, we're held to an even higher standard around these ethical considerations, because we are stewards of this data. In observational studies and experiments, we must keep the interest of the subject we're collecting data from at the forefront. That's why we have institutional review boards, informed consent, and confidentiality. All right, so let's wrap everything up with some intuition for collecting data. So, what have we done? For the main concepts in this section, we talked about samples and populations, we talked about randomness, good and bad sampling methods, as well as ethical concerns around data. So, again, intuition. Start with the population of interest. Who are you really interested in gathering data around? Well, the biggest problem with setting a population is not providing enough detail. So, make sure you provide plenty of detail around what you're actually interested in. It'll actually save you time later on. Of course, we can't talk to that entire population, so we have to take a representative sample, and that brings up a good question. Does your sample represent your population? Good sampling methods that involve randomness help you get a sample that represents the population.
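Here is a rough sketch in Python of what a sampling method with randomness can look like, along with a quick common-sense comparison of the sample against the population. Everything in it is an assumption for illustration: the population is simulated, the mean of 67 inches and spread of 4 inches are invented numbers, and simple_random_sample is a hypothetical helper. In a real study you usually cannot see the whole population, so the comparison would be against what you already know about it.

```python
import random
import statistics


def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Simulated population of adult heights in inches (made-up parameters).
population_rng = random.Random(0)
population = [population_rng.gauss(67, 4) for _ in range(10_000)]

sample = simple_random_sample(population, 200, seed=1)

# Quick check: does the sample look like the population?
print("population mean:", round(statistics.mean(population), 2))
print("sample mean:    ", round(statistics.mean(sample), 2))
```

If the two means were far apart, that would be a signal to look more closely at the sample or at how it was drawn.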
Even with a good sampling method, it's still good practice to explore your data to make sure it looks like the population in a common sense way. For example, it's possible to randomly get really unlucky and select only NBA players for your height study. However, upon quick investigation, you realize your sample probably isn't right, so you take another sample. Of course, if we're going to talk about sampling, we should talk about good sampling. Does your sampling favor certain outcomes over others? This is called bias. It's always good to think about your sampling method to make sure you haven't built in any biases. Again, randomness helps with this, so make sure your sampling method has randomness to protect you against bias. And if we're going to talk about good sampling, we should also talk about ethical considerations. Can anyone be harmed or burdened by the collection and use of your data? It's an important question. Think about the possible harm the collection of your data could cause. You must be open and honest with the people you're collecting data on. Remember, God holds us to a higher standard than the world. Let's represent him well. So overall, it's extremely hard to protect yourself and consider all these things by yourself. Ask for help. I always like to ask others who I know, especially if they have different perspectives and
experiences than I do, to make sure I'm not missing anything when I start these studies. So intuition and careful thought can protect you a lot of times when it comes to data gathering, and don't be afraid to ask other people, especially those who are different from you, to help make sure you're considering all the things you need to. Well, that wraps up this section on gathering data. I look forward to seeing you in our next section.