Welcome to the next section of the course. In this section, we're going to be talking about exploring data. So, why do we organize and explore our data? Well, remember what we talked about previously: a lot of insights can be drawn from just organizing, exploring, and even just looking at your data. However, different types of data need to be summarized differently.

Remember back to what we talked about. There are two main types of variables. The first is qualitative variables, or you can think about those as categorical variables. They are data with a measurement scale that is inherently categorical, and remember, we even had two different types of qualitative variables: nominal and ordinal. The other type of variable that we have is a quantitative variable, essentially a data variable that is numeric and really defines some value or quantity.

These variables need to be explored differently, so let's go back to the data set that we've been working with. When it comes to looking at the users for our bike industry, on the left-hand side we have our qualitative variables. With qualitative variables, a lot of times we will either explore within a category or explore across categories. In fact, let's talk about some example questions you could ask. What do Saturdays look like? This would be exploring within the category of Saturdays. What do clear days look like in comparison to rainy days? This would be exploring across categories, as would the question, is the winter different than the summer? Remember, we talked about that last one last time.

When it comes to quantitative variables, though, we need to handle them a little bit differently. With quantitative variables, a lot of times we like to explore the center or even the spread of our data, and you can sort of think about it as the look of your variables.
Some example questions you may ask with quantitative data would be something like, what is the typical temperature in my data set? (And we're going to have to define "typical.") Is the number of users trending up or is it trending downward? Of course, when looking at the overall look, we also look at spread, with a question like, what is the range of humidity values? Are they in a narrow range or are they very spread out?

So as you can see, a lot of insights can be drawn from just organizing, exploring, and looking at your data. However, with our two main types of variables, qualitative and quantitative, we need to be able to summarize them differently. With qualitative variables, we like to explore either within a category or across categories. With quantitative variables, we like to explore the look of our variables. Where is the center located? How spread out are our variables? In fact, that's what this lecture is going to be doing. We're going to be exploring these two types of variables visually.

So, let's jump into the first one. Let's talk about displaying qualitative, or categorical, data. Again, when it comes to qualitative data, we like to explore either within or across categories. These are the same sample questions I already mentioned. What do Saturdays look like? What do clear days look like in comparison to rainy days? Is the winter different than the summer?

Well, when it comes to visually exploring qualitative data, there are different graphs that are used for different tasks. A pie chart, for example, is a comparison across all categories. We call this the distribution of categories, and you've probably seen a pie chart before. A bar chart, which you've also probably seen before, is a comparison across specific categories. We don't actually have to view all of the categories when we view a bar chart. A pie chart, on the other hand, is explicitly defined across all categories. Now, when it comes to a bar chart, there actually are a few different types. We could have a regular bar chart, a side-by-side bar chart, or even a stacked bar chart. But let's start with pie charts and work our way through this list.

So, a pie chart is a graph in which a circle is divided into sections, kind of like pieces of a pie, that each represent a portion of the whole. These kinds of charts are best used when looking to show the entire distribution, or all of the categories, for a specific qualitative variable. Take a look at the pie chart you see here on the right-hand side. What we've done is we've looked at the total users by season, broken down for us by the four specific categories: spring, summer, fall, and winter. We can see here, by looking at the distribution of all these categories, that summer has the largest slice of the pie, so summer takes up the most total users. Spring and fall are rather close to each other, with spring being slightly higher, and winter is noticeably lower than the rest. Again, when looking at a pie chart, it is easy for us to compare these four seasons.

There is another variation of a pie chart called a donut chart. It's basically the same thing with the center missing. So, instead of a pie, think of a donut.
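To make the "portion of the whole" idea concrete, here is a minimal sketch in Python of the arithmetic behind each slice of a pie or donut chart. The season totals are made up for illustration and are not the numbers behind the chart in this lecture, and the names `season_totals` and `category_shares` are hypothetical.

```python
# Hypothetical season totals, made up for illustration (not the course's data).
season_totals = {"spring": 900, "summer": 1200, "fall": 850, "winter": 450}


def category_shares(totals):
    """Return each category's percentage of the grand total (its slice of the pie)."""
    grand_total = sum(totals.values())
    return {name: 100 * count / grand_total for name, count in totals.items()}


shares = category_shares(season_totals)
for season, pct in shares.items():
    print(f"{season:<7}{pct:5.1f}%")
```

Because every category is included, the slices always add up to 100%, which is why a pie chart only makes sense for the full distribution of categories.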
The nice part about a donut chart, though, is that donut charts actually allow us to compare different groups' distributions across all of the categories. You'll notice here on the right-hand side that there actually are two donut charts, one contained within the other. The inner donut is the casual user distribution, and the outer donut is the registered user distribution. Remember, previously with the pie chart, we looked at total users, which is the combination of casual and registered. Here, we've broken that down into each of the separate groups, so we can see that, just like registered users, casual users are dominated by the summer months.

However, we can see that the distribution is different. The slice of the donut is a little bit smaller on casual and a little bit larger on registered when it comes to those summer users. In fact, it's the same for the spring users as well. Take a look at the spring for registered and casual. For registered users, 33% of all registered users are in the spring. However, 27% of casual users are in the spring. Now take a look at the other side. Let's take a look at winter. Only 10% of registered users are in the winter. However, 15% of casual users are in the winter. So, again, we can compare these two different groups' distributions across all of the categories.

Now, let's move into the idea of a bar chart. A bar chart is essentially numerical values of variables represented by the height (or, if it's turned on its side, the length) of the bars, the rectangles of equal width that you see here. This is best used when looking to compare specific categories to each other. Here we're actually viewing all four of the seasons, but I don't have to. If I wanted to, I could just compare summer to winter, and you could take a look at those two specifically. However, we can still use all four, just to give us a general idea. We've seen this bar chart in a previous lecture, and we can see the same pattern that we saw earlier. Summer seems to dominate total users across seasons. Winter seems to be the one at the end, the one with the lowest number of total users.

A stacked bar chart breaks down the first categories into subcategories, so what you see here is a comparison of registered versus casual users. Registered users are the lighter shade inside of the bar, and we can see that the predominant number of total users is still registered users. Casual users are the darker shade in the bar, and they make up a small portion. However, we can also take a look within a season. For summer, there are definitely more registered users than there are casual users. However, in the winter, this comparison is a little bit different: there are definitely more registered users than casual users in the winter on this chart.

Maybe it'll be a little bit easier to see this in another chart. We call this a side-by-side bar chart. Again, side-by-side bar charts look at these comparisons across multiple categories. When looking at a stacked bar chart, we can see what the comparison is within a category, but it's hard to really compare across categories. However, when we look at the side-by-side bar chart, it's a lot easier to make these comparisons across multiple categories.
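As a rough illustration of the two kinds of comparison, here is a minimal Python sketch that draws side-by-side bars as text and also computes the casual share within each season. The counts are made up for illustration and are not the course's data, and `users`, `casual_share`, and `text_bar` are hypothetical names.

```python
# Hypothetical registered and casual counts per season, made up for illustration.
users = {
    "spring": {"registered": 700, "casual": 200},
    "summer": {"registered": 800, "casual": 400},
    "fall": {"registered": 700, "casual": 150},
    "winter": {"registered": 400, "casual": 50},
}


def casual_share(groups):
    """Within one season: the percentage of that season's users who are casual."""
    return 100 * groups["casual"] / (groups["registered"] + groups["casual"])


def text_bar(count, per_mark=50):
    """Draw a bar whose length is proportional to the count (one mark per 50 users)."""
    return "#" * (count // per_mark)


for season, groups in users.items():
    # Side by side: one bar per group, so each group can be followed across seasons.
    for group, count in groups.items():
        print(f"{season:<7}{group:<11}{text_bar(count)} {count}")
    # Within a category: how the two groups split inside this one season.
    print(f"{season:<7}casual share: {casual_share(groups):.0f}%")
```

Reading down the registered bars or down the casual bars is the across-category comparison; the casual share line is the within-category comparison that a stacked bar shows.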
Again, we can compare the look of casual versus registered, but we can also compare the overall distribution of registered and the overall distribution of casual. We can see that for both registered and casual, winter seems to be the smallest season and summer seems to be the largest season. We can also see the comparison we mentioned earlier with the stacked bar chart: casual users are a lot closer to registered users in the summer than they are in the winter.

So, in summary, when looking at qualitative variables, we like to explore either within a category or across multiple categories, and we saw a couple of different ways of doing that. We saw a pie chart, which is a graph where a circle is divided into sections that represent the entire distribution of categories for a specific variable, and we also saw a bar chart. Now, remember, in the bar chart we do not have to represent every category. We did, just to be able to make a comparison, but we're still looking at numerical values of variables represented by the height or length of bars inside of a graph to make them easy to compare.

Excellent, so now we've looked at qualitative variables. Let's finish off this lecture by visually displaying some quantitative data. Remember, when it comes to quantitative data, we like to explore things like the center or the spread, and really the overall look of the variables. Again, some of our sample questions would be something like, what is the typical temperature in my data? Is the number of users trending up or trending down? What is the range of humidity values, and are they narrow or are they very spread out?

So, let's take a look at how we can do this visually for numeric variables. Again, there are a couple of main graphs that we can use. We can first look at a line graph. This essentially plots a numeric variable over time. So, for example, if we wanted to see how our users are trending over time, a line graph would be great for this. Another graph we can use for quantitative, or numeric, variables is a scatterplot. We've seen this in a previous lecture, but essentially what we're doing is comparing different quantitative variables to each other.

So, let's take a look at these two different types of plots. The first one is a line graph. What a line graph does is use a line to connect individual data points over time. Line graphs are best used when wanting to see how things change across time. Take a look at the right-hand side. On the x axis (the horizontal axis, the bottom axis) are dates. On the y axis (the vertical axis, the left-hand axis) you can see daily users, and so we can take a look at how daily users are changing over time. It does look like it may be trending a little bit upwards. However, at certain times of year, it definitely looks like our daily users drop. They all seem to have low periods during the winter months. We can see users at a lower count in January, February, and March, or even in late December, than during the other months of the year, like you see here, where we see peak times really in the spring, summer, and early fall months.
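One way to check an "is it trending up?" impression with numbers is to smooth out the peaks and valleys first. The sketch below uses a simple moving average, which is a swapped-in technique and not something this lecture covers. The daily counts are made up for illustration, and `daily_users` and `moving_average` are hypothetical names.

```python
from statistics import mean

# Hypothetical daily user counts in date order, made up for illustration.
daily_users = [120, 90, 150, 200, 260, 240, 180, 130, 170, 230, 300, 280]


def moving_average(values, window):
    """Average each run of `window` consecutive values to smooth peaks and valleys."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


smoothed = moving_average(daily_users, window=4)
print(smoothed)
# Compare the start and end of the smoothed series as a rough trend check.
print("trending up" if smoothed[-1] > smoothed[0] else "not trending up")
```

The raw series still goes up and down, but the smoothed values make it easier to compare where the series starts with where it ends, which is the same judgment we make by eye on a line graph.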
So again, we can start seeing that our users are changing over time. We can even see trends in when our users are high and when our users are low, and we can see that, although there are peaks and valleys, our daily users are increasing over time, which, from a marketing standpoint, is wonderful. This is exactly what we would want. We want our users to continue to go up over time.

Let's look at our other plot. Let's look at a scatterplot now. A scatterplot has the values of two variables plotted along two different axes, and the pattern of the resulting points will hopefully reveal any relationship that may exist between those two variables. Scatterplots are best used when trying to explore two variables and their potential relationship.

So, let's take a look at the plot you see here on the right-hand side. This is a comparison of temperature and total users. On the horizontal axis (again, we call this the x axis, or the bottom axis) we see temperature. The bottom axis runs from 15 all the way up to 95 degrees, and we can see here that we have a range of temperatures. Our lowest temperature seems to be in the low 20s; it's the dot on the far left-hand side. Our temperatures then reach upwards into the 90s, as we can see on the right-hand side of the plot.

Now, the vertical axis you see here, typically called the y axis, or the left-hand axis, is where we're plotting total daily users, and we can see the axis goes from zero all the way up to 10,000. So the lower points, looking vertically, are smaller numbers of daily users, and the higher points are larger numbers of daily users, everything from zero users all the way up to almost 9,000 users in a day.

But again, the value of a scatterplot is in relating these two variables together, so we can make a comparison between temperature and total daily users. Basically, what happens as temperature goes up? Well, as temperature goes up, we notice that total daily users have a tendency of going up as well. It's not a perfectly straight line, but we do see that the cloud of data points seems to be going upwards. So, as temperature increases, it looks like total daily users have a tendency of increasing. Of course, the opposite can also be said: as temperature decreases, the number of total daily users seems to decrease as well.

So again, scatterplots are wonderful for seeing if there's any kind of relationship, and at least right now it looks like there might be some kind of relationship between temperature and daily users, which would make intuitive sense. The nicer the day is, that is, the higher the temperature, the more people won't mind using our bike service. However, as the temperature drops, our bike service doesn't seem to be as popular as it would normally be. So, let's summarize what we've talked about in this lecture.
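Going back to the scatterplot for a moment, here is a minimal Python sketch of one rough way to check the "cloud going upwards" impression: group the days into temperature bands and average the users in each band. The banding approach is an assumption chosen for illustration and is not a method from this lecture, the data points are made up, and `days`, `BAND_WIDTH`, and `average_users_by_band` are hypothetical names.

```python
from statistics import mean

# Hypothetical (temperature, total daily users) pairs, made up for illustration.
days = [(25, 900), (38, 2100), (47, 3300), (55, 4200), (63, 5100),
        (71, 6000), (78, 6900), (86, 7400), (92, 7000)]

BAND_WIDTH = 20  # Width of each temperature band in degrees (an arbitrary choice).


def average_users_by_band(points, band_width=BAND_WIDTH):
    """Group days into temperature bands and average the users within each band."""
    bands = {}
    for temperature, users in points:
        band_start = (temperature // band_width) * band_width
        bands.setdefault(band_start, []).append(users)
    return {start: mean(counts) for start, counts in sorted(bands.items())}


band_averages = average_users_by_band(days)
for start, avg in band_averages.items():
    print(f"{start}-{start + BAND_WIDTH - 1} degrees: about {avg:.0f} users per day")
```

If the averages rise from the coldest band to the warmest, that matches what the scatterplot suggests, even though individual days (like the hottest one in this made-up data) do not sit on a perfectly straight line.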
So, when it comes to quantitative variables, we like to explore center and spread, and really the look of the variables, and you can really see that look in our two plots. The line graph used lines to connect the data points, our total daily users over time. It allowed us to see that our daily users seem to be trending upwards, although there are some peaks and valleys at different times of year when our daily users go up and come back down.

Also, when exploring the look of these variables, we used a scatterplot. The scatterplot was used to compare two quantitative variables to try to reveal if there's any kind of relationship between them. We used it to compare total daily users to temperature, to see what happens to total daily users as temperature changes, and our intuition held: as temperature went up, our total daily users went up, and as temperature went down, so did our total daily users.

Again, all the plots that we've seen in today's lecture, from the qualitative plots to the quantitative plots, are wonderful for continuing to explore and understand your data. Exploring and understanding your data is a great first step in anything you do in statistics. Once you have your data, you need to explore it. It helps drive insights, which we're going to talk about in later lectures. But for right now, that is the end of this lecture, and I look forward to seeing you next time.



Last modified: Tuesday, 26 May 2026, 09:05