Video Transcript: What is Data? - Part 1
Welcome. In this first section of the course, we're going to be talking about what is data. And really, when we say "what is data," it could also be "what are data." Data can be treated as both a plural and a singular word, so sometimes you'll hear people say data as in more than one thing, and sometimes data as a singular thing. Either way, the idea is that data is factual information used as a basis for reasoning, discussion, or calculation. Let's break that definition down a little bit more.
What do we mean by information? By information, we mean the idea of measuring something. Think about values that describe something: an object, a person, a place, or a thing. Here are some examples. If we were trying to describe a person, we could use their height, their weight, their age, their race, or their spending habits. If we were trying to describe an object or a thing like a car, we could describe it by its mileage, how good its gas mileage is, its color, or its size. Again, a variety of different characteristics or measurements. Last but not least, think about a website. With a website, we can measure things like the number of clicks or page views, or how much revenue we get on a specific ad. Through all of these examples, hopefully you can see that the goal is to gather information, where information is some kind of measurement used to describe something.
Okay, so we have this information. But notice the second half of the definition: "used as a basis for reasoning, discussion, or calculation." What do we mean by that? We mean the idea of inference. Inference means we're using information to come to some kind of grander conclusion. Think about it.
We want to use the information that we've collected to draw some important conclusions, but not necessarily only about the pieces of information that we have. We want to be able to make better decisions in the context of our entire problem. So we're going to use this information to answer important questions: who, what, where, when, why, and how. Who is buying my product? What product do they prefer? Where do the people who buy my product live? When do people typically buy my product? Why are people buying my product? How are they buying my product, in person or online? As you can see, we can use the information, those measurements about different objects or people or places, to draw better conclusions for whatever context or problem we're trying to solve.
An example data set that we're going to work with throughout this course is the one you see here, and it's provided on your course website. The data set comes from a bike rental organization in Washington, DC. Now, this information is a little outdated. It was collected back in 2011 and 2012, but it still serves a purpose for us. It measures not only how many people used the bike service that we offer, but also a variety of things about the time of year and the specific days, including temperature, humidity, and other factors.
If you were to take a look at that data set, what you would see is a table, and that is how we typically look at data. We have rows and we have columns. We typically call the rows observations. Well, what are observations? Observations are the individuals or objects that we're collecting information about. Going back to the idea of data, data is a series of measurements. Okay, well, what are we collecting measurements on? Individuals, objects, places, or things. Those are what we typically store as the rows in our data set, or data table.
But what about the columns? The columns we call variables. A variable is just a characteristic that describes those observations. Think about it as a piece of information, a piece of data. In our case, we can describe those days and those users of our bike rental service by a variety of different characteristics. These are what we typically have as columns in our data table.
Now let's break down those variables a little bit more and focus in on the columns. There are two main types of variables, or two main types of columns, in a data table. The first is qualitative variables. The second is quantitative variables. So, what do I mean by qualitative? A qualitative variable is a piece of data whose measurement scale is inherently categorical. Let me show you an example, going back to that same data table I showed you earlier. Things like date, weekday, season, and type of weather are categorical pieces of information. They're not numeric in structure. Take season: winter, fall, spring, summer. Or weather: misty, clear, snowy. These are categorical variables.
Now, when it comes to these qualitative or categorical variables, we can break them down further into two specific groups. The first is what we call a nominal categorical variable. A nominal variable has categories with no logical ordering. In other words, you can list the categories in really any order.
It doesn't really matter. Think, for example, of the color of a car. It doesn't matter if you list the colors as green, yellow, blue, red; or blue, red, yellow, green; or yellow, red, blue, green. There's no logical ordering to those categories, so that would be a nominal piece of information.
An ordinal qualitative variable, or ordinal categorical variable, on the other hand, has categories with a natural logical ordering, an order in which you would say them. For example: low, medium, high, or high, medium, low. Those are the logical orderings of those three categories. You wouldn't see someone list them as medium, low, high, because that wouldn't make intuitive sense.
Let me show you some further examples with that same data table. Take weekday: Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday. This is a categorical variable, but it is an ordinal categorical, or ordinal qualitative, variable. Why? Because the seven categories of weekday have a natural logical ordering, Sunday through Saturday. It's the same for season: winter, spring, summer, fall. There's a natural logical ordering. You wouldn't see someone list the seasons as fall, spring, summer, winter. So something like season or weekday could be considered an ordinal qualitative variable: ordinal because of the logical ordering of the categories, qualitative because the values are categorical in nature, and variable because it describes individual observations.
Now, something like the weather type variable may be a little harder to classify as nominal or ordinal. In our data set, the weather types are clear or partly cloudy, misty, and rainy or snowy. You might be able to make an argument that these are nominal, that it doesn't really matter what order you list these categories in. However, some people might argue that they are ordinal categories, going from nicer weather to more precipitation. That one's a little bit up for debate, but something like weekday or season is clearly an ordinal qualitative variable.
All right, we've focused a lot on qualitative variables. Let's go to the other main type of variable, a quantitative variable. A quantitative variable is a column of information whose values are numeric and measure some value or quantity. So you can think of qualitative pieces of information as categorical in nature, whereas quantitative pieces of information are numeric in nature. Again, let's go to our data table and see some examples: temperature, humidity, number of casual users, number of registered users. All of these are quantitative pieces of information.
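To make the ordinal idea from a moment ago concrete, here is a minimal Python sketch, offered only as an illustration and not as part of the lecture. It assumes the weekday order listed above (Sunday through Saturday); the helper name `sort_ordinal` and the sample list `observed` are hypothetical. Sorting alphabetically treats weekday as if it had no order, while sorting by the listed order respects its natural ordering.

```python
# Natural order of the weekday categories, as listed in the lecture.
WEEKDAY_ORDER = [
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday",
]


def sort_ordinal(values, order):
    """Sort categorical values by their position in a given category order."""
    rank = {category: position for position, category in enumerate(order)}
    return sorted(values, key=lambda value: rank[value])


# Hypothetical sample of weekday values, in no particular order.
observed = ["Wednesday", "Sunday", "Friday", "Monday"]

# Alphabetical sorting ignores the meaning of the categories.
alphabetical = sorted(observed)

# Sorting by rank follows the natural ordering of an ordinal variable.
natural = sort_ordinal(observed, WEEKDAY_ORDER)

print("Alphabetical:", alphabetical)
print("Natural order:", natural)
```

For a nominal variable such as car color, there would be no `WEEKDAY_ORDER`-style list to sort by, which is exactly what distinguishes it from an ordinal one.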
Again, taking a look at temperature, we can see that on that first day, January 1, 2011, the high temperature was 46.7 degrees Fahrenheit. That is a numeric piece of information. It describes some kind of value or quantity, so this is a quantitative variable.
Same idea for the number of registered users. People who use our bike service could be registered users or more casual users. Registered users sign up to make sure that they have access to a bike. Casual users just sort of come and go and see whether a bike is available when they need it. They don't register ahead of time. So we can see how many registered users used our biking service on that day: 654 registered users. This is measuring some quantity about the observation we're interested in, so it's a quantitative variable. Same for the number of casual users: on that day, 331 casual users used our bike rental service.
Now, one quick point: not all variables that are numeric are quantitative. Some examples would be date, social security number, or zip code. Zip code may be written with numbers, but it's really a qualitative piece of information. It's a geographical location.
I know it may be a little hard to discern. We have these things that are numeric, but they're not quantitative. So, what's an easy way to tell when something we see is numeric but not actually a quantitative variable? You can take a look at the variable and ask this question: can I do basic arithmetic on this variable and have it be meaningful? What do I mean by that? For example, if you were to take the average, and I know we haven't talked about averages yet, or if you were to add up the values inside of a variable, would it make sense? If you were to look at the average height of people in your family, well, that makes sense. You have multiple people, each with their own height, and we can take the average of those heights. So height is a quantitative piece of information. We can't take the average of a zip code, though. What's the average of two zip codes? It doesn't make intuitive sense. Same idea for a social security number, a driver's license number, or any kind of personal identification number. You can't really average two people's social security numbers and have the result make intuitive sense.
So, let's go back to our data table, where we can see some examples of quantitative variables that pass this test. Take the temperature variable. The first day in our data set was 46.7 degrees Fahrenheit. The second day was 48.4 degrees Fahrenheit. Are these quantitative pieces of information? Can we take the average temperature between those two days? Yes, we can. I can take the average of 46.7 degrees Fahrenheit and 48.4 degrees Fahrenheit, and because I can perform that basic arithmetic and have it be meaningful, this would be a quantitative piece of information.
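The arithmetic test just described can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the two temperatures are the ones quoted for the first two days, while the two zip codes are hypothetical values chosen for the example.

```python
from statistics import mean

# Temperatures (degrees Fahrenheit) quoted in the lecture for the first two days.
temperatures_f = [46.7, 48.4]

# Hypothetical zip codes, used here only for illustration.
zip_codes = [20001, 20500]

# The average temperature is a meaningful quantity (roughly 47.55 degrees).
average_temperature = mean(temperatures_f)

# Python will happily average the zip codes too, but the result does not
# describe any real place, which is the sign that zip code is qualitative.
average_zip = mean(zip_codes)

print(f"Average temperature: {average_temperature:.2f} F")
print(f"'Average' zip code: {average_zip} (not meaningful)")
```

The point of the sketch is that the computer cannot make this judgment for you. Both averages compute without error, and only the question "is the result meaningful?" separates the quantitative variable from the qualitative one.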
Again, we can do that same thing when it comes to humidity, the number of casual users, or the number of registered users. In fact, we can add up the number of casual users on all Sundays to see the total number of casual users we see on Sundays for our bike rental service. That implies casual users is a quantitative piece of information. I can sum the values and still have the result make sense.
Notice on the far left-hand side of the screen, though, you have date. Date, although written with numbers, is not a quantitative piece of information. You can't take the average of January 1 and January 5, for example, and have that make meaningful sense. That is why, if you remember, we listed date earlier as a qualitative piece of information. You can think about it as categorical information. Now, date is an ordinal qualitative variable. It has an order to it: there's a certain order in which you would list dates. But dates are not inherently numeric and quantitative. Hopefully that helps distinguish a qualitative variable from a quantitative variable.
So let's summarize this lecture. First, data is factual information used as a basis for reasoning, discussion, or calculation. The whole idea is that we're going to use that information to make inferences, or grander conclusions, about things that we're interested in.
Of course, as we're collecting this information, this data, we will structure it in what we call a data table, or a data set. Typically, data tables are structured so that your rows are specific observations, the things you're collecting information about, and your columns are the pieces of information you're collecting for those rows. These columns we typically call variables. So think about it this way: the columns, the variables, describe the rows, the observations, in your data table.
Also in this lecture, we talked about two different types of variables. The first is qualitative. Think about these as categorical pieces of information. Remember, we had two different types of categorical information. Nominal categorical pieces of information can be listed in any order, something like color of car, whereas ordinal categorical pieces of information have an inherent order to them, such as day of the week. The other type of variable we talked about was a quantitative variable, a piece of information that's measured in some numerical way.
So, that is the end of this lecture, and I look forward to the next one with you.
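As a recap, the structure summarized above can be sketched in Python. This is a minimal illustration under stated assumptions, not the course's actual data file: the column names (`date`, `season`, `temp_f`, `casual`, `registered`, `weekday`) are hypothetical, and only the values quoted in the lecture for January 1, 2011 are used. The weekday is derived from the date rather than taken from the data set.

```python
from datetime import date

# Weekday names indexed the way date.weekday() counts them (Monday is 0).
WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# One observation (row): a single day of the bike rental service.
# Keys are hypothetical column names; values are those quoted in the lecture.
first_day = {
    "date": date(2011, 1, 1),
    "season": "winter",
    "temp_f": 46.7,
    "casual": 331,
    "registered": 654,
}
# Derive the weekday from the date instead of assuming it.
first_day["weekday"] = WEEKDAY_NAMES[first_day["date"].weekday()]

# The data table is a collection of observations (rows).
data_table = [first_day]

# Each variable (column) with the type the lecture assigns to it.
variable_types = {
    "date": "qualitative (ordinal)",
    "weekday": "qualitative (ordinal)",
    "season": "qualitative (ordinal)",
    "temp_f": "quantitative",
    "casual": "quantitative",
    "registered": "quantitative",
}

# Arithmetic is meaningful on quantitative variables, for example
# adding casual and registered users for one observation.
total_users = first_day["casual"] + first_day["registered"]

for row in data_table:
    for name, value in row.items():
        print(f"{name:<12}{str(value):<12}{variable_types[name]}")
print("Total users on the first day:", total_users)
```

Each dictionary plays the role of a row, and each key plays the role of a column, which mirrors the observations-and-variables layout of the data table.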