Video Transcript: What is Data? - Part 3
So let's talk about the last lecture here in the section, What is Data? In this lecture, we're going to talk about the ideas of association and correlation, as well as look at some examples of how data is used in the modern world.
Let's continue our ideas around exploring data relationships. Remember, in the previous lecture we talked about the idea of exploring data and how helpful it was for revealing potential insights and possible uses of that information. In fact, we even talked about using visuals to help explore that data. We previewed some visuals that we'll be seeing throughout this course, things like distributions, bar charts, stacked bar charts, and scatterplots. However, this raises the question: Are visuals enough? Can we get all the information we need from visuals alone?
Let's go back to a visual that we looked at last time. Again, this is a bar chart of the average total users by season. For example, in the spring we have nearly 5000 average users a day. In the summer we're closer to 6000 average users a day, whereas the fall is a little below 5000 and the winter is a little below 3000 average users in a day. But what does this actually tell us? We just looked at the averages, but we saw previously that there's a wide spread of possibilities.
Let's look at a different visual called a boxplot. This boxplot represents not only the average, which is all the previous plot represented, but also the spread. For example, the x in each one of these boxes represents the average of each season, so if you keep your eyes focused on the x's, those are the exact same heights as the bars that you see here. Dizzy yet? All right, so we have the averages represented on this chart, but we also have other things represented. We have what we call the range of each season, or you can think about it as a visual representation of both the minimum and the maximum of each season.
In other words, what is the smallest daily number of users in the spring, and what's the largest daily number of users in the spring? Same for the summer, fall, and the winter. So we can see that winter does have a lower average: its x is lower than all the other x's. But we notice there is still a wide spread to winter's daily users. Winter, in fact, doesn't even have the lowest number of daily users across all the seasons. If you take a look at fall, fall has the lowest number of daily users. Take a look also at winter. Take a look at that dot in the upper right-hand side. One day in winter we had almost 8000 daily users. So again we can see that on average winter may be lower; however, each season has a very wide spread.
This spread is what we refer to in statistics as variation. Basically, data points vary from one another, and that is expected. We don't always see everything, and averages or measures of center don't always represent everything. Again, you can think about this with the example of heights. Think about the average height of your family. Is everyone in your family the same height? Probably not. You probably have some people that are taller than you, and potentially people that are shorter than you as well. Maybe you're the tallest, maybe you're the shortest, but again, there's variation. Data points vary from one another. Because there's variation, that might change our thoughts on some of the insights that we had earlier. We don't always see everything, and of course when we measure things, we may measure them imperfectly. Now, something like clicks on a website or the number of users that we have for a bike rental service may be a little bit more exact. Height, on the other hand, you may all be measuring a little bit differently. I know my youngest child always stands on her tippy toes, trying to get just a little bit taller.
So, why does one day in summer differ from another day in summer? What if the temperature is the same? What if I had two days in summer, both the same temperature, but with a different number of total users? Well, in all honesty, in the real world, we're not exactly sure how to predict everything perfectly. In fact, that's part of God's plan, that it's not all perfectly able to be predicted. So, we can't be perfectly sure why days are different, especially when we have two summer days with the same temperature. However, differences are expected because of what we will call in statistics randomness. Now, these ideas of variation and randomness we're going to cover in much more detail later in the course. Remember, this is just a preview.
Well, now that we have some variation, and now that we've admitted that there may be a little bit of randomness from day to day in how many users we have, we have to go back to the original question that we asked: are the seasons truly different from each other, or could there be some expected random variation from day to day that may explain why we see some differences? Take a look again at this boxplot. We'll detail the rest of the boxplot, the actual box itself and all the little lines coming out of it, in a later lecture. But again, let's focus in on this.
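The question of whether a seasonal gap could be explained by random day-to-day variation can be made concrete with a small simulation. The sketch below is one possible approach, a permutation test, and not necessarily the method used later in the course. The daily user counts are made up for illustration (they are not the lecture's data), and the function name `permutation_p_value` is hypothetical.

```python
import random
from statistics import mean

# Made-up daily user counts, for illustration only (not the lecture's data).
winter = [1200, 2500, 3100, 2800, 1900, 3600, 2200, 7800, 2600, 3000]
summer = [4800, 6100, 5900, 6500, 4300, 7200, 5600, 6800, 5100, 6300]


def permutation_p_value(a, b, n_shuffles=5000, seed=0):
    """Estimate how often randomly relabelling the days produces a gap
    in group means at least as large as the gap actually observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed_gap = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend season labels do not matter
        shuffled_gap = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if shuffled_gap >= observed_gap:
            hits += 1
    return hits / n_shuffles


# Averages alone hide the spread, so print the range as well.
print("winter: mean", mean(winter), "min", min(winter), "max", max(winter))
print("summer: mean", mean(summer), "min", min(summer), "max", max(summer))

p_value = permutation_p_value(winter, summer)
print("share of shuffles with a gap at least this large:", p_value)
```

A small share suggests the gap between the two seasons is hard to explain by shuffling alone; a large share suggests ordinary day-to-day variation could produce it.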
We can see that on average, winter is lower, but winter does have some high days. We can see that on average fall is higher than winter, but there is one day in fall that is lower than any day in winter. So this might lead you to question your original idea that winter is lower than the other seasons. This is why in statistics we have what we call hypothesis testing. This will help us answer these questions. The idea of statistical hypothesis testing is this: we're going to use the data that we have available to us, the information that we've collected, to see if the differences we see (for example, winter looks lower than the other seasons) are expected due to just random variation from day to day, or if we can say there's an actual association between the season and the number of users.
Now, what do I mean by association? Well, an association is a statistical relationship between a qualitative and a quantitative variable. Our qualitative variable here would be something like season; it's measured in categories. Our quantitative variable would be something like the number of daily users. I want to know if there's an association between season and number of daily users. That is what we could use statistical hypothesis testing to help us out with. This can also provide so many more insights. Here are some examples. We can test if the previous marketing campaign actually brought in more customers. We can test if a drug treatment actually helped the patient, or if we had something like a placebo effect. We can test if our program for veterans helped them find jobs after they left the service. These are just a few examples of how to use data and information, as well as statistical testing, to make more inferences and gain more insights around questions you may have involving your data.
Again, we can take a look at temperature compared to total daily users. What did we see previously? Well, it looks like as temperature increases, the daily number of users also has a tendency to increase. However, again, it's not perfect. There is some variation. Let's take a look at one specific temperature as an example. At 75 degrees, we have a daily number of users ranging anywhere from a little over 1000 to a little under 8000. So although, yes, it looks like as temperature goes up, the total daily users has a tendency to go up, we can still see that there is variation in our data. So when we draw these conclusions, we need to add a little bit of statistical testing to put more statistical backing behind the inferences and insights that we want to make. For example, we have what we call statistical correlation. Again, variation occurs when you're looking for relationships. Specifically, now I'm looking at relationships between two quantitative variables, here temperature and total daily users. So again, we want to use the data available to see if the differences we see are expected due to just random variation (for example, again, 75 degrees has anywhere from 1000 to 8000 total users), or if we can say there's actually a correlation between temperature and number of users. Now, you've probably heard the word correlation before.
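To make the temperature and daily users relationship concrete, here is a minimal sketch of computing a correlation coefficient (Pearson's r) by hand in Python. The temperatures and user counts are invented for illustration, including two very different days at 75 degrees, and the function name `pearson_r` is a hypothetical helper, not something from the course.

```python
from math import sqrt

# Invented data for illustration only: temperature and total daily users.
# Two days share 75 degrees but have very different user counts.
temps = [50, 55, 60, 65, 70, 75, 75, 80, 85, 90]
users = [2100, 2600, 3900, 3500, 4800, 1500, 7600, 5900, 6200, 6000]


def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship
    between two quantitative variables, always between -1 and 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # How the two variables move together around their means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # How much each variable varies on its own.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(temps, users)
print("correlation between temperature and users:", round(r, 2))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. The sign shows whether users tend to rise or fall as temperature rises.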
You may have even used the word correlation before when describing a relationship between two things, but when we talk about a statistical correlation, we mean that there is a statistical linear relationship between two numeric or quantitative variables. The stronger the correlation, the stronger the linear relationship between them. Again, we'll focus a lot on correlation later on in the course, but this, as well as association, gives you at least a little bit of a preview of what we'll be seeing later.
All right. In summary, data has a natural and expected variation. Some of this variation might be because you have actual associations or correlations; however, some could be due to, in all honesty, just apparent randomness, at least to us. To help with that, statistical testing can evaluate whether the variation we see is random or intentional. When we're talking about relationships between a qualitative variable and a quantitative variable, we refer to those relationships as an association. When we are talking about a relationship between two quantitative variables, specifically a linear relationship between two quantitative variables, we call that a correlation. You can think about association and correlation as sort of God revealing a little bit more of his plan. Things may appear random, but they're intentional, and there are patterns going on underneath. That's why I love statistics so much. It's like God wrote something in the math; he put something in the numbers for us to be able to find, for us to be able to see. So, hopefully the last two sections have given you a little bit of a preview and gotten you excited about some of the things we'll be seeing in this course.
But before we finish with this lecture, I did want to briefly talk about data in the world around us, just to drive home the value that we're going to get by being able to look at and analyze data. Did you know that data is pretty much everywhere? I mean, you probably did. Everything from your credit card to your cell phone to anything you put in the mail: data is out there. If you want to see how much data is actually out there, you can see the chart here on the right-hand side. According to the source, IDC Digital Universe, the projected amount of digital data produced worldwide in a given year was 2.8 zettabytes back in 2012. In 2020 that increased over tenfold to 44 zettabytes, and by the year 2025 it is projected to increase to nearly four times that much, over 160 zettabytes. What in the world is a zettabyte, you might be asking? Well, a zettabyte is a lot. Let's just put it that way. If you've never heard of a zettabyte, that's okay. You might have heard of a gigabyte. Well, take a gigabyte and multiply it by 1 trillion: 1 trillion gigabytes is one zettabyte. So, think about the amount of digital data produced worldwide in the year 2025. What's projected is that in that year alone we will produce 163 trillion gigabytes of information. To help you see that a little bit more: if you were to fill the latest smartphones full of data and stack them end to end, they would go all the way to the moon, back to Earth, and back to the moon again. That's how much information we're producing worldwide by 2025.
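The unit conversion and growth figures above can be checked with a few lines of arithmetic. This sketch assumes the decimal (SI) definitions, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the yearly figures are the ones quoted in the lecture, and the variable names are only illustrative.

```python
# Decimal (SI) unit sizes; this is an assumption about which
# definition of gigabyte and zettabyte the chart uses.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21

# One zettabyte expressed in gigabytes: 10**12, i.e. one trillion.
gigabytes_per_zettabyte = BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE

# Projected zettabytes produced per year, as quoted in the lecture.
zettabytes_by_year = {2012: 2.8, 2020: 44, 2025: 163}

growth_2012_to_2020 = zettabytes_by_year[2020] / zettabytes_by_year[2012]
growth_2020_to_2025 = zettabytes_by_year[2025] / zettabytes_by_year[2020]

print("gigabytes in one zettabyte:", gigabytes_per_zettabyte)
print("growth 2012 to 2020:", round(growth_2012_to_2020, 1), "times")
print("growth 2020 to 2025:", round(growth_2020_to_2025, 1), "times")
print("2025 in gigabytes:", zettabytes_by_year[2025] * gigabytes_per_zettabyte)
```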
Data is everywhere, and because data is everywhere, all different types of industries use that data to make decisions and draw insights, everything from banking and finance to marketing to healthcare to supply chain and agriculture. Everybody uses data. Let me give you a couple of examples.
In the world of banking and finance, where I come from, banks use data to help them make more informed decisions. For example, who do banks give loans to? Well, microfinance banks actually help impoverished and developing nations through small business loans. So, how can these banks make more informed decisions about who to make these small business loans to, to help develop these nations further and to bring people out of poverty? Data modeling can be used to help find which clients would be best to loan money to at the least amount of risk. So as money comes into the bank, the bank can make a better decision on who to loan it to, get that money back, and loan it to somebody else, again to try to bring people out of poverty.
What about the world of marketing? You might ask questions like, who do you advertise to, or how do you advertise to them? Marketing companies use statistical hypothesis tests to compare the effectiveness of different campaigns.
We call these A/B tests. For example, have you ever gone to a website and clicked on a link? Why did that company put that link right there? It was probably statistically designed to get you to click on it more frequently than if it was somewhere else. Data about customer purchases also helps companies group customers by similar buying habits, so they know how best to market to you. Have you ever received a flyer in the mail from a grocery store? Your flyer might actually be different from your neighbor's flyer. Why? Because that grocery store might be showing you things that you are more likely to buy, which might be different from what they're showing your neighbor. In fact, the coupons you sometimes get at the checkout counter are often different from the coupons of the person checking out behind you.
What about healthcare? Well, healthcare costs are increasing, so how can we make healthcare more efficient without losing quality of care? Hospitals and medical agencies use data to help identify the onset of disease sooner, determine who is at a higher risk for hospital readmission, and provide more specialized care for patients. Again, we're trying to use data to help people make better decisions, except here these better decisions actually impact people's lives.
Last but not least, let's talk about supply chain, or agriculture. With an ever-growing population, we need to ask ourselves, how do we use food more efficiently? Agricultural companies use data to efficiently track food from the seed to the field, to the store, and then to the table. This helps keep food fresher longer and allows people to know where their food came from, what went into it, how far it's traveled, and how fresh it is.
So, as you can see, there are so many different ways that people use data. Data exists in all types of industries. It just really depends on how it's used.
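The A/B tests mentioned in the marketing example can be sketched with a two-proportion z-test, which is one common way to compare click rates. It is an assumption here that this is the kind of test a company would run. The visitor and click counts are invented, and `two_proportion_z_test` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Compare the click rates of version A and version B.
    Returns the z statistic and a two-sided p-value."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Pooled rate: the click rate if A and B were really the same.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    standard_error = sqrt(
        pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)
    )
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented example: link placement A versus link placement B.
z, p_value = two_proportion_z_test(
    clicks_a=120, visitors_a=2000, clicks_b=165, visitors_b=2000
)
print("z statistic:", round(z, 2))
print("p-value:", round(p_value, 4))
```

A small p-value suggests the difference in click rates would be unusual if placement made no difference; a large one suggests random variation between visitors could explain it.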
Knowledge of data and its usefulness is helpful in all aspects of life, not just in the career you're going into, but in other aspects of your life as well. Hopefully, this gets you excited about the rest of the course. This is the end of this section describing what data is.