So let's talk about the last lecture here in the section "What Is Data?" In this lecture, we're going to talk about the ideas of association and correlation, as well as look at some examples of how data is used in the modern world.

Let's continue our ideas around exploring data relationships. Remember, in the previous lecture we talked about the idea of exploring data and how helpful it was to be able to reveal potential insights and possible uses of that information. In fact, we even talked about using visuals to help explore that data. We previewed some visuals that we'll be seeing throughout this course, things like distributions, bar charts, stacked bar charts, and scatterplots. However, this raises the question: are visuals enough? Can we get all the information we need based on visuals alone?

Let's go back to a visual that we looked at last time. Again, this is a bar chart of the average total users by season. For example, in the spring we have nearly 5000 average users a day. In the summer we're closer to 6000 average users a day, whereas the fall is a little below 5000 and the winter is a little below 3000 average users a day. But does this tell us everything? We just looked at the averages, but we saw previously that there's a wide spread of possibilities.

Let's look at a different visual called a boxplot. This boxplot represents not only the average, which is all the previous plot represented, but also the spread. For example, the x in each one of these boxes represents the average of each season, so if you keep your eyes focused on the x's, those are the exact same heights as the bars that you see here. Dizzy yet? All right, so we have the averages represented on this chart, but we also have other things represented. We have what we call the range of each season, or you can think about it as a visual representation of both the minimum and the maximum of each season.
In other words, what is the smallest daily number of users in the spring, and what's the largest daily number of users in the spring? Same for the summer, the fall, and the winter. So we can see that although winter does have a lower average (its x is lower than all the other x's), there is still a wide spread to winter's daily users. Winter, in fact, doesn't even have the lowest number of daily users across all the seasons. If you take a look at fall, fall has the lowest number of daily users. Take a look also at winter, at that dot in the upper right-hand side. One day in winter we had almost 8000 daily users. So again, we can see that on average winter may be lower; however, each season has a very wide spread.

This spread is what we refer to in statistics as variation. Basically, data points vary from one another, and that is expected. We don't always see everything, and averages or measures of center don't always represent everything. Again, you can think about this as an example of heights. Think about the average height of your family. Is everyone in your family the same height? Probably not. You probably have some people that are taller than you, and potentially people that are shorter than you as well. Maybe you're the tallest, maybe you're the shortest, but again, there's variation. Data points vary from one another, and the fact that there's variation might change our thoughts on some of the insights that we had earlier.

We don't always see everything, and of course when we measure things, we may measure things imperfectly. Now, something like clicks on a website or the number of users that we have for a bike rental service may be a little bit more exact. Height, on the other hand, you may each measure a little bit differently. I know my youngest child always stands on her tippy toes trying to get just a little bit taller.

So, why does one day in summer differ from another day in summer? What if temperature is the same? What if I had two days in summer, both of them the same temperature, but they had a different number of total users? Well, in all honesty, in the real world, we're not exactly sure how to predict everything perfectly. In fact, that's part of God's plan, that it's not all perfectly able to be predicted. So, we can't be perfectly sure why days are different, especially when we have two summer days with the same temperature. However, differences are expected because of what we will call in statistics randomness. Now, these ideas of variation and randomness we're going to cover in much more detail later in the course. Remember, this is just a preview.

Well, now that we have some variation, and now that we've admitted that there may be a little bit of randomness from day to day in how many users we have, we have to go back to the original question that we asked: are the seasons truly different from each other, or could there be some expected random variation from day to day that may explain why we see some differences? So again, take a look at this boxplot. We'll detail the rest of the boxplot in a later lecture: the actual box itself and all the little lines coming out of it. For now, let's focus in on the averages and the extremes.
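As a rough illustration of the average, minimum, and maximum that the boxplot shows for each season, here is a minimal Python sketch. The daily counts are made up for illustration (they are not the course data), and the names `daily_users` and `summarize` are hypothetical.

```python
from statistics import mean

# Hypothetical daily user counts per season (made-up numbers, not the course data).
daily_users = {
    "spring": [2100, 4800, 5200, 6900, 5600],
    "summer": [3900, 5800, 6100, 7400, 6500],
    "fall": [400, 4700, 5300, 6800, 6200],
    "winter": [700, 1800, 2100, 2500, 7800],
}


def summarize(values):
    """Return the average, minimum, and maximum of a list of daily counts."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


summary = {season: summarize(values) for season, values in daily_users.items()}

for season, stats in summary.items():
    print(f"{season:>6}: mean={stats['mean']:.0f} min={stats['min']} max={stats['max']}")
```

In this made-up data, winter has the lowest average, yet fall has the single lowest day and winter has a day near 8000, which is the kind of pattern an average alone would hide.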
We can see that on average, winter is lower, but winter does have some high days. We can see that on average fall is higher than winter, but there is one day in fall that is lower than any day in winter, so this might lead you to question your original idea that winter is lower than the other seasons. This is why in statistics we have what we call hypothesis testing. This will help us answer these questions.

The idea of statistical hypothesis testing is this: we're going to use the data that we have available to us, the information that we've collected, to see if the differences we see (for example, winter looks lower than the other seasons) are expected due to just random variation from day to day, or if we can say there's an actual association between the season and the number of users.

Now, what do I mean by association? Well, an association is a statistical relationship between a qualitative and a quantitative variable. Our qualitative variable here would be something like season; it's measured in categories. Our quantitative variable would be something like the number of daily users. I want to know if there's an association between season and number of daily users. That is what we could use statistical hypothesis testing to help us with.

This can also provide so many more insights. Here are some examples. We can test if the previous marketing campaign actually brought in more customers. We can test if a drug treatment actually helped the patient, or if we had something like a placebo effect. We can test if our program for veterans helped them find jobs after they left the service. These are just a few examples of how to use data and information, as well as statistical testing, to make more inferences and draw more insights around questions you may have involving your data.

Again, we can take a look at temperature compared to total daily users. What did we see previously? Well, it looks like as temperature increases, the daily number of users also has a tendency to increase. However, again, it's not perfect. There is some variation. Let's take a look at one specific temperature as an example. At 75 degrees, we have a daily number of users ranging anywhere from a little over 1000 to a little under 8000. So although, yes, it looks like as temperature goes up the total daily users has a tendency to go up, we can still see that there is variation in our data. So when we draw these conclusions, we need to add a little bit of statistical testing to put more statistical backing behind the inferences and insights that we want to make.

For example, we have what we call statistical correlation. Again, variation occurs when you're looking for relationships. Specifically, now I'm looking at relationships between two quantitative variables, here temperature and total daily users. So again, we want to use the data available to see if the differences we see are expected due to just random variation (for example, 75 degrees has anywhere from 1000 to 8000 total users), or if we can say there's actually a correlation between temperature and number of users. Now, you've probably heard the word correlation before.
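As a rough illustration of measuring how strongly temperature and daily users move together, here is a minimal Python sketch of the Pearson correlation coefficient, one standard measure of the strength of a linear relationship. The lecture does not say which measure the course uses, so treat this choice as an assumption. The data points are made up, and `pearson_r` is a hypothetical helper name. Note that this only measures the relationship; it is not by itself a hypothesis test.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from each mean.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations for each variable.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)


# Hypothetical (temperature, daily users) pairs; made-up values.
# Two days at 75 degrees have very different user counts.
temperatures = [40, 50, 60, 70, 75, 75, 80, 90]
users = [1500, 2600, 3900, 5200, 1200, 7800, 6100, 6900]

r = pearson_r(temperatures, users)
print(f"r = {r:.2f}")
```

A value of r near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. With the made-up data above, r is positive but below 1, because the two 75-degree days pull the points away from a straight line.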
You may have even used the word correlation before when describing a relationship between two things, but when we talk about a statistical correlation, we mean that there is a statistical linear relationship between two numeric or quantitative variables. The stronger the correlation, the stronger the linear relationship between them. Again, we'll focus a lot on correlation later on in the course, but this, as well as association, gives you at least a little bit of a preview of what we'll be seeing later.

All right. In summary, data has a natural and expected variation. Some of this variation might be because you have actual associations or correlations; however, some could be due to, in all honesty, just apparent randomness, at least to us. To help with that, statistical testing can evaluate whether the variation we see is random or intentional. When we're talking about relationships between a qualitative variable and a quantitative variable, we refer to those relationships as an association. When we are talking about a relationship between two quantitative variables, specifically a linear relationship between two quantitative variables, we call that a correlation. You can think about association and correlation as sort of God revealing a little bit more of his plan. Things may appear as randomness, but they're intentional, and there are patterns going on underneath. That's why I love statistics so much. It's like God wrote something in the math; he put something in the numbers for us to be able to find, for us to be able to see.

So, hopefully the last two sections have given you a little bit of a preview and gotten you excited about some of the things we'll be seeing in this course. But before we finish with this lecture, I did want to briefly talk about the idea of data in the world around us, just to drive home the value that we're going to get by being able to look at and analyze data.

Did you know that data is pretty much everywhere? I mean, you probably did. Everything from your credit card to your cell phone to anything you put in the mail: data is out there. If you want to see how much data actually is out there, you can see the chart here on the right-hand side. According to the source, IDC Digital Universe, the projected amount of digital data produced worldwide in a given year back in 2012 was 2.8 zettabytes of information. In 2020 that increased over tenfold to 44 zettabytes of information, and by the year 2025 it is projected to increase to nearly four times that much, to over 160 zettabytes of information.

What in the world is a zettabyte, you might be asking? Well, a zettabyte is a lot. Let's just put it that way. If you've never heard of a zettabyte, that's okay. You might have heard of a gigabyte. Well, if you took 1 trillion gigabytes, that's one zettabyte. So, if you think about the amount of digital data produced worldwide in the year 2025, what's projected is that in that year alone we will produce 163 trillion gigabytes of information. To help you see that a little bit more: if you were to fill the latest smartphones full of data and stack them end to end, they would go all the way to the moon, back to earth, and back to the moon again. That's how much information we're producing worldwide by 2025.
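The unit arithmetic above can be checked with a minimal Python sketch. It assumes decimal (SI) prefixes, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the function name `zettabytes_to_gigabytes` is hypothetical, and the figures are the ones quoted from the chart.

```python
GIGABYTE = 10**9    # bytes, decimal (SI) prefix
ZETTABYTE = 10**21  # bytes, decimal (SI) prefix

# How many gigabytes fit in one zettabyte.
GIGABYTES_PER_ZETTABYTE = ZETTABYTE // GIGABYTE


def zettabytes_to_gigabytes(zettabytes):
    """Convert a zettabyte count to gigabytes."""
    return zettabytes * GIGABYTES_PER_ZETTABYTE


# Figures quoted in the lecture, in zettabytes.
data_2012, data_2020, data_2025 = 2.8, 44, 163

print(f"1 ZB = {GIGABYTES_PER_ZETTABYTE:,} GB")
print(f"2025 projection = {zettabytes_to_gigabytes(data_2025):,} GB")
print(f"2012 to 2020 growth: {data_2020 / data_2012:.1f}x")
print(f"2020 to 2025 growth: {data_2025 / data_2020:.1f}x")
```

One trillion is 10^12, so the first line confirms that one zettabyte is one trillion gigabytes, and 163 zettabytes is 163 trillion gigabytes.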
Data is everywhere, and because data is everywhere, all different types of industries use that data to make further decisions and draw insights, everything from banking and finance to marketing to healthcare to supply chain and agriculture. Everybody uses data. Let me give you a couple of examples.

In the world of banking and finance, where I come from, banks use data to help them make more informed decisions. For example, who do banks give loans to? Well, microfinance banks actually help impoverished and developing nations through small business loans. So, how can these banks make more informed decisions around who to make these small business loans to, to help develop these nations further and to bring people out of poverty? Data modeling can be used to help find which clients would be best to loan money to at the least amount of risk. As the money comes into the bank, the bank can then make a better decision on who to loan money to, so that it can get that money back and loan it to somebody else, again to try and bring people out of poverty.

What about the world of marketing? You might ask questions like: who do you advertise to, or how do you advertise to them? Marketing companies use statistical hypothesis tests to compare the effectiveness of different campaigns.
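As a rough illustration of comparing two campaigns with a hypothesis test, here is a minimal Python sketch of a two-proportion z-test. The lecture does not say which test marketers use, so this is one common choice named as an assumption, not the course's method. The click counts are made up, and `two_proportion_z_test` is a hypothetical helper name.

```python
from math import erf, sqrt


def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in click rates."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Pooled click rate, assuming both versions share one true rate.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF at |z|, computed with the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical results: version A got 120 clicks from 2000 visitors,
# version B got 165 clicks from 2000 visitors (made-up numbers).
z, p_value = two_proportion_z_test(120, 2000, 165, 2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in click rates would be unlikely if it came from random variation alone. The z-test relies on a normal approximation, so it is only reasonable when each version has a fairly large number of visitors.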

We call these A/B tests. For example, have you ever gone to a website and clicked on a link? Why did that company put that link right there? It was probably statistically designed to get you to click on it more frequently than if it was somewhere else. Data about customer purchases also helps companies group customers by similar buying habits, so they know how to market to you best. Have you ever received a flyer in the mail from a grocery store? Your flyer might actually be different from the flyer of your neighbor. Why? Because that grocery store might be showing you things that you are more likely to buy, which might be different from what they're showing your neighbor. In fact, the coupons you sometimes get at the checkout counter are often different from the coupons of the person checking out behind you.

What about healthcare? Well, healthcare costs are increasing, so how can we make healthcare more efficient without losing quality of care? Hospitals and medical agencies use data to help identify the onset of disease sooner, determine who is at a higher risk for hospital readmission, and provide more specialized care for patients. Again, we're trying to use data to help people make better decisions, except here these better decisions actually impact people's lives.

Last but not least, let's talk about supply chain, or agriculture. With an ever-growing population, we need to ask ourselves: how do we use food more efficiently? Agricultural companies use data to efficiently track food from the seed to the field, to the store, and then to the table. This helps keep food fresher longer, and it allows people to know where their food came from, what went into it, how far it has traveled, and how fresh it is.

So, as you can see, there are so many different ways that people use data. Data exists in all types of industries. It just really depends on how it's used.
Knowledge of data and its usefulness is helpful in all aspects of life, not just in the career you're going into, but in other aspects of your life as well. Hopefully, this gets you excited about the rest of the course, but this is the end of this section, "What Is Data?"



Last modified: Tuesday, 19 May 2026, 8:57 AM