Video Transcript: Exploring Data - Part 1
Welcome to the next section of the course. In this section, we're going to be talking about exploring data. So, why do we organize and explore our data? Well, remember what we talked about previously: a lot of insights can be drawn from just organizing, exploring, and even just looking at your data. However, different types of data need to be summarized differently.
Remember back to what we talked about. There are two main types of variables. The first is qualitative variables, or you can think about those as categorical variables. They are data with a measurement scale that is inherently categorical, and remember, we even had two different types of qualitative variables, nominal and ordinal. The other type of variable that we have is a quantitative variable, essentially a data variable that is numeric and defines some value or quantity. These variables need to be explored differently, so let's go back to the data set that we've been working with.
When it comes to looking at the users for our bike industry, on the left-hand side we have our qualitative variables. With qualitative variables, a lot of times we will either explore within a category or explore across categories. In fact, let's talk about some example questions you could ask. Something like: What do Saturdays look like? This will be exploring within the category of Saturdays. What do clear days look like in comparison to rainy days? This will be exploring across categories, as will the question: Is the winter different than the summer? Remember, we talked about that last one last time.
When it comes to quantitative variables, though, we need to handle them a little bit differently. With quantitative variables, a lot of times we like to explore the center or even the spread of our data, and you can sort of think about it as the look of your variables.
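To make "center" and "spread" a little more concrete, here is a minimal sketch in Python. The temperature values are hypothetical and are not taken from the course data set, and the `summarize` helper is just an illustrative name. The mean and median are two common ways to describe the center, and the range is one simple way to describe the spread.

```python
from statistics import mean, median

# Hypothetical daily temperatures in degrees; not values from the course data set.
temperatures = [22, 35, 48, 61, 70, 75, 95]


def summarize(values):
    """Return simple measures of center (mean, median) and spread (range)."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


summary = summarize(temperatures)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Which measure counts as "typical" is a choice: the mean and the median can disagree when the data are lopsided, so it can help to look at both.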
Some example questions you may ask with quantitative data would be something like: What is the typical temperature in my data set? And we're going to have to define typical. Is the number of users trending up, or is it trending downward? When looking at the overall look, we also look at spread, something like: What is the range of humidity values? Are they in a narrow range, or are they very spread out?
So, as you can see, a lot of insights can be drawn from just organizing, exploring, and looking at your data. However, with our two main types of variables, qualitative and quantitative, we need to be able to summarize them differently. With qualitative variables, we like to explore either within a category or across categories. With quantitative variables, we like to explore the look of our variables: where the center is located and how spread out our variables are. In fact, that's what this lecture is going to be doing. We're going to be exploring these two types of variables visually.
So, let's jump into the first one. Let's talk about displaying qualitative or categorical data. Again, when it comes to qualitative data, we like to explore either within or across categories. Same sample questions as I already mentioned: What do Saturdays look like? What do clear days look like in comparison to rainy days? Is the winter different than the summer?
Well, when it comes to visually trying to explore qualitative data, there are different graphs that are used for different tasks. A pie chart, for example, is a comparison across all categories. We call this the distribution of categories. You've probably seen a pie chart before. A bar chart, which you've also probably seen before, is a comparison across specific categories. We don't actually have to view all of the categories when we view a bar chart. A pie chart, on the other hand, is explicitly defined across all categories. Now, when it comes to a bar chart, there actually are a few different types. We could have a regular bar chart, a side-by-side bar chart, or even a stacked bar chart. But let's start with pie charts and work our way through this list.
So, a pie chart is a graph in which a circle is divided into sections, kind of like pieces of a pie, that each represent a portion of the whole. These kinds of charts are best used when looking to show the entire distribution, or all of the categories, for a specific qualitative variable. Take a look at the pie chart you see here on the right-hand side. What we've done is we've looked at the total users by season, broken down by the four specific categories: spring, summer, fall, and winter. We can see by looking at the distribution of all these categories that summer has the largest slice of the pie, so summer accounts for the most total users. Spring and fall are rather close to each other, with spring being slightly higher, and winter is noticeably lower than the rest. Again, when looking at a pie chart, it is easy for us to compare these four seasons.
There is another variation of a pie chart called a donut chart. It's basically the same thing with the center missing. So, instead of a pie, think of a donut. The nice part about a donut chart, though, is that donut charts actually allow us to compare different groups' distributions across all of the categories.
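The idea behind both the pie chart and the donut chart is that every slice is a share of the whole. Here is a rough sketch of that arithmetic in Python. The counts are hypothetical and chosen only for illustration, and `distribution` is an illustrative helper name, not something from the course materials.

```python
def distribution(counts):
    """Convert raw category counts into percentage shares of the whole."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


# Hypothetical seasonal counts for two user groups (illustrative only).
registered = {"spring": 3300, "summer": 4000, "fall": 1700, "winter": 1000}
casual = {"spring": 540, "summer": 740, "fall": 420, "winter": 300}

registered_shares = distribution(registered)
casual_shares = distribution(casual)

# One row per season, one column per group, like reading two donut rings together.
for season in registered:
    print(
        f"{season:<7} registered={registered_shares[season]:5.1f}%  "
        f"casual={casual_shares[season]:5.1f}%"
    )
```

Because each group is converted to percentages of its own total, the two groups can be compared even though one is much larger than the other, which is what nesting two donut rings does visually.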
So, you'll notice here on the right-hand side that there actually are two donut charts, one contained within the other. The inner donut is the casual user distribution, and the outer donut is the registered user distribution. So, remember, previously with the pie chart, we looked at total users, which is the combination of casual and registered. Here, we've broken that down into each of the separate groups, so we can see that, just like registered users, casual users are dominated by the summer. However, we can see that the distribution is different. The summer slice of the donut is a little bit smaller for casual and a little bit larger for registered. In fact, it's the same for the spring users as well. Take a look at the spring for registered and casual. For registered users, 33% of all registered users fall in the spring. However, only 27% of casual users fall in the spring. Take a look at the other side. Let's take a look at winter. Only 10% of registered users fall in the winter. However, 15% of casual users fall in the winter. So, again, we can compare these two different groups' distributions across all of the categories.
Now, let's move into the idea of a bar chart. A bar chart is essentially numerical values of variables represented by the height or, if it's turned on its side, the length of the bars, the rectangles of equal width that you see here. This is best used when looking to compare specific categories to each other. Here we're actually viewing all four of the seasons, but I don't have to. If I wanted to, I could just compare summer to winter and look at those two specifically. However, we can still use all four, just to give us a general idea. We've seen this bar chart in a previous lecture, and we can see the same pattern that we saw earlier. Summer seems to dominate total users across the seasons. Winter seems to be the one at the end, the one with the lowest number of total users.
A stacked bar chart breaks down the first categories into subcategories, so what you see here is a comparison of registered versus casual users. Registered users are the lighter shade inside of the bar, and we can see that the predominant share of total users is still registered users. Casual users are the darker shade in the bar, and they make up a smaller portion. However, we can also take a look within a season. So, for summer, there are definitely more registered users than there are casual users. In the winter, this comparison looks a little bit different, although there are still definitely more registered users than casual users in the winter in this chart.
Maybe it'll be a little bit easier to see this in another chart. We call this a side-by-side bar chart. Side-by-side bar charts look at these comparisons across multiple categories. When looking at a stacked bar chart, we can see what the comparison is within a category, but it's hard to really compare across. However, when we look at the side-by-side bar chart, it's a lot easier to make these comparisons across multiple categories.
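Here is a minimal sketch of the numbers that sit behind a stacked bar chart and a side-by-side bar chart. All of the counts are hypothetical, and the function names are made up for this illustration. The stacked view adds the subcategories together within a season, while the side-by-side view keeps them separate so they can be compared across seasons.

```python
# Hypothetical counts per season, split by user type (illustrative only).
users = {
    "spring": {"registered": 3000, "casual": 900},
    "summer": {"registered": 4000, "casual": 1500},
    "fall": {"registered": 2800, "casual": 800},
    "winter": {"registered": 1500, "casual": 200},
}


def stacked_totals(data):
    """Height of each stacked bar: the sum of its subcategories."""
    return {season: sum(groups.values()) for season, groups in data.items()}


def casual_per_registered(data):
    """Side-by-side comparison: casual users per registered user in each season."""
    return {
        season: groups["casual"] / groups["registered"]
        for season, groups in data.items()
    }


totals = stacked_totals(users)
ratios = casual_per_registered(users)
for season in users:
    print(
        f"{season:<7} total={totals[season]:>5}  "
        f"casual per registered={ratios[season]:.2f}"
    )
```

With these made-up counts, the casual-to-registered ratio is higher in the summer than in the winter, which is the kind of across-category difference that is hard to judge from stacked bars and easier to see side by side.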
Again, we can compare the look of casual versus registered, but we can also compare the overall distribution of registered and the overall distribution of casual. We can see that for both registered and casual, winter seems to be the smallest season, and summer seems to be the largest season. But we can also see that comparison we mentioned earlier with the stacked bar chart: casual users are a lot closer to registered users in the summer than they are in the winter.
So, in summary, when looking at qualitative variables, we like to explore either within a category or across multiple categories, and we saw a couple of different ways of doing that. We saw a pie chart, which is a graph where a circle is divided into sections that represent the entire distribution of categories for a specific variable, and we also saw a bar chart. Now, remember, in the bar chart, we do not have to represent every category. We did, just to be able to make a comparison, but we're still looking at numerical values of variables represented by the height or length of bars inside of a graph to make them easy to compare.
Excellent, so now we've looked at qualitative variables. Let's finish off this lecture by visually displaying some quantitative data. Remember, when it comes to quantitative data, we like to explore things like the center or the spread, and really the overall look of the variables. Again, some of our sample questions would be something like: What is the typical temperature in my data? Is the number of users trending up or trending down? What is the range of humidity values? Are they narrow, or are they very spread out?
So, let's take a look at how we can do this visually for numeric variables. Again, there are a couple of main graphs that we can use. We can first look at a line graph. This essentially plots a numeric variable over time. So, for example, if we wanted to see how our users are trending over time, a line graph would be great for this. Another graph we can look at for quantitative or numeric variables would be a scatterplot. We've seen this in a previous lecture, but essentially what we're doing is comparing different quantitative variables to each other. So, let's take a look at these two different types of plots.
The first one is a line graph. What a line graph does is it uses a line to connect individual data points over time. Line graphs are best used when wanting to see how things change across time. Take a look at the right-hand side. On the x axis, the horizontal axis, the bottom axis, are dates. On the other axis, the y axis, the vertical axis, the left-hand axis, you can see daily users. So we can take a look at how daily users are changing over time, and it does look like they may be trending a little bit upwards. However, at certain times of year, it definitely looks like our daily users seem to drop. They seem to have low periods during the winter months. We can see in January, February, and March, or even in late December, users being at a lower count than during the other months of the year, where we see peak times in the spring, summer, and early fall months.
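One way to back up what the eye sees in a line graph, an upward trend with seasonal peaks and valleys, is to fit a straight line through the values in time order and look at its slope. This is a sketch under stated assumptions: the monthly figures are hypothetical, `trend_slope` is an illustrative helper, and the least-squares fit is an added technique that the lecture itself does not use.

```python
def trend_slope(values):
    """Least-squares slope of values against their position in time (0, 1, 2, ...)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical average daily users per month, January through December.
year_one = [1200, 1400, 2100, 3000, 3900, 4400, 4500, 4300, 3800, 3100, 2300, 1500]
# Second year: same seasonal shape, shifted upward (an assumption for illustration).
year_two = [value + 1500 for value in year_one]
monthly_users = year_one + year_two

slope = trend_slope(monthly_users)
print(f"average change per month: {slope:.1f} users")

# The seasonal valley: the lowest month within the first year (index 0 is January).
lowest_month_index = year_one.index(min(year_one))
print(f"lowest month index in year one: {lowest_month_index}")
```

A positive slope says the series is rising on average, even though individual winter months sit well below the summer months around them.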
So again, we can start seeing that our users are changing over time. We can even see trends in when our users are high and when our users are low, and we can see that, although there are peaks and valleys, our daily users are increasing over time, which, from a marketing standpoint, is wonderful. This is exactly what we would want. We want our users to continue to go up over time.
Let's look at our other plot. Let's look at a scatterplot now. A scatterplot has the values of two variables plotted along two different axes, and the pattern of the resulting points would hopefully reveal any relationship that may exist between those two variables. Scatterplots are best used when trying to explore two variables and their potential relationship.
So, let's take a look at the plot you see here on the right-hand side. This is a comparison of temperature and total users. On the horizontal axis, which we call the x axis, or the bottom axis, we see temperature. The axis runs from 15 all the way up to 95 degrees, and we can see that we have a range of temperatures. Our lowest temperature seems to be in the low 20s. It's the dot on the farthest left-hand side. Then our temperatures reach upwards into the 90s, as we can see on the right-hand side of the plot.
Now, the vertical axis, typically called the y axis, or the left-hand axis, is where we're plotting total daily users, and we can see our plot goes from zero all the way up to 10,000. So the lower points, looking vertically, are smaller numbers of daily users, and the higher points are larger numbers of daily users, everything from zero users all the way up to almost 9,000 users in a day.
But again, the value of a scatterplot is in relating these two variables together, so we can make a comparison between temperature and total daily users. Basically, what happens as temperature goes up? Well, as temperature goes up, we notice that total daily users have a tendency to go up as well. It's not a perfectly straight line, but we do see that the cloud of data points seems to be going upwards. So, as temperature increases, it looks like total daily users have a tendency to increase. Of course, the opposite can also be said: as temperature decreases, the number of total daily users seems to decrease as well.
So again, scatterplots are wonderful for seeing if there's any kind of relationship, and at least right now it looks like there might be some kind of relationship between temperature and daily users, which would make intuitive sense. The nicer the day is and the higher the temperature, the more people won't mind using our bike service. However, as the temperature drops, our bike service doesn't seem to be as popular as it would normally be. So, let's summarize what we've talked about in this lecture.
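Before the summary, the relationship read off the scatterplot can also be put into a single number. The lecture judges the relationship visually; the Pearson correlation coefficient used below is an added technique, not something covered in this lecture, and the temperature and user pairs are hypothetical. This is only a sketch of the idea.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect upward line, -1 a perfect downward line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariation = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_spread = sqrt(sum((x - x_mean) ** 2 for x in xs))
    y_spread = sqrt(sum((y - y_mean) ** 2 for y in ys))
    return covariation / (x_spread * y_spread)


# Hypothetical (temperature, total daily users) pairs; not the course data.
temperatures = [22, 35, 48, 55, 63, 70, 78, 85, 92]
daily_users = [900, 1800, 3100, 2700, 4500, 5200, 6100, 5600, 7400]

r = pearson_r(temperatures, daily_users)
print(f"correlation between temperature and daily users: {r:.2f}")
```

A value close to +1 matches the picture of a cloud of points drifting upward. It describes how the two variables move together and does not by itself say that temperature causes the change in users.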
So, when it comes to quantitative variables, again, we like to explore center and spread, and really the look of the variables, and you can see that look in our two plots. The line graph used lines to connect the data points, our total daily users over time. It allowed us to see that our daily users seem to be trending upwards, although there are some peaks and valleys at different times of year when daily users go up and come back down.
Also, when exploring the look of these variables, we used a scatterplot. The scatterplot was used to compare two quantitative variables to try to reveal if there's any kind of relationship between them. We used it to compare total daily users to temperature, to see what happens to total daily users as temperature changes, and our intuition held: as temperature went up, our total daily users went up, and as temperature went down, so did our total daily users.
Again, all the plots that we've seen in today's lecture, from the qualitative plots to the quantitative plots, are wonderful for continuing to explore and understand your data. Exploring and understanding your data is a great first step in anything you do in statistics. Once you have your data, you need to explore it. It helps drive insights, which we're going to talk about in later lectures. But for right now, that is the end of this lecture, and I look forward to seeing you next time.