Video Transcript: What is Data? - Part 2
Welcome to the next lecture in our section on what is data. In this lecture, we're going to be talking about exploring relationships with data. Going back to the previous lecture, remember that our goal is to drive inference, to come up with some grander conclusions. Essentially, what we're saying is that data by itself is just that: information, all alone. For us to take advantage of data, we need to draw insights from it to help us make better decisions. Rarely do we collect data just to collect it. We collect it for a grander purpose. Now, maybe we don't know what that purpose is initially, but we want to be able to use it to make better decisions.
Well, how can we make better decisions? How can we draw insights from our data? That's really what this whole course is about: helping you learn to draw insights. Insights come from exploring your data. Later on in the course, we're going to have an entire section on exploring data, but we're going to at least preview that here.
Again, we're going to go back to the same data table, the same bike rental data that we had in our last lecture and that we're going to use throughout the entirety of this course. We can see here on the left-hand side we have some rows. These rows are summarizing different days, and really we're summarizing what those days looked like, as well as how many users we had on each one of those days. The columns that you see here, remember, are variables. These variables describe these days, everything from the categorical or qualitative variables on the left-hand side, like weekday, season, or weather type, to the quantitative variables on the right-hand side, like temperature, humidity, and number of casual or registered users.
So let's imagine that we had some piece of information. Let's imagine we knew that the historical average bike rentals is 4000 per day.
A new employee sees low bike rental numbers over the first few days of the new year. Does that necessarily mean trouble? Again, if we go back to our data table, if you were to look just at these five days of the new year, and if I told you that the historical average total users in a day is around 4000, oof. If we look at the last two columns, casual users and registered users, we're a little bit over 1000, maybe in between 1000 and 2000 total users. Well, this new employee sees that and thinks it may be a problem. But again, let's explore our data.
For example, this is the distribution of all total users by day. On the x axis, basically the horizontal axis down at the very, very bottom, you can see different ranges of number of total users, everything from below 500 total users in a day all the way up to 8500 to 9000 users in a day. On the vertical axis, on the left-hand side, the zero, 10, 20, 30, what that is doing is telling you how many days had the corresponding number of users. So we can see by looking at this, we have a lot of days where we have more than 4000 total users in a day. Maybe it's just those first few days that didn't show everything.
So what we've done is we've looked at the distribution of daily bike rentals. Essentially, we looked at all the different values for bike rentals that we've seen historically, and then we tried to look at what we would refer to as an average, basically the center of that collection, that distribution. The nice part is average and distribution are going to be things that we're going to be focusing on in this course. We're going to learn about distributions, we're going to learn about averages, so you can make these same kinds of insights. This is just a preview.
Well, maybe bike rentals drop in the winter. That would make sense. We were looking at days in the early part of winter, in January. If you remember that data table, there were some rather cold days. I don't know about you, but I don't prefer to ride a bike outside when it's really that cold, so maybe that's what's driving the lower numbers that the new employee is seeing. Again, this would be a piece of information that we've collected. Now, let's try and draw insight from it.
What you can see here is the average total users, broken up now by season. All I want you to do is focus on the tallest bar and the shortest bar. The tallest bar you see is the summer bar, while the shortest bar is the winter bar. In other words, we typically have more users in the summer and typically have fewer users in the winter. Again, here we're trying to draw insight. We can look at what we refer to as a bar chart of the data to see some kind of possible association between the number of people who use our bike rental service and the season itself. Again, the nice part is a bar chart is one of the things we're going to be learning about in this course. This is again just a preview. This whole lecture is really a preview of a way of being able to draw insights from data.
All right, so this is very intriguing. We have this idea where we have the total number of users that we've looked at. We looked at the total number of users just across all days, and it looked like there was a nice wide spread. Some days we had fewer than 500 total users, some days we had over 8500 total users, so there's a wide range of users that we could actually have on any given day. Then what we did is we looked at those users across different seasons, and we saw what appeared to be some kind of relationship between the season of the year and the total number of users, where in the summer months we have a lot of people using our bike rental service.
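
The two summaries described so far, counting days per range of total users and averaging total users by season, can be sketched in a few lines of Python. This is only a rough illustration: the rows in `days`, the helper names `count_days_per_range` and `average_by_season`, and the 500-user range width are all assumptions made up for this sketch, and the numbers are not the course's actual bike rental data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (season, total_users) rows. These values are invented for
# illustration and are NOT taken from the course's bike rental data.
days = [
    ("winter", 1200), ("winter", 1800), ("winter", 2400),
    ("spring", 4100), ("spring", 4900),
    ("summer", 5600), ("summer", 6400),
    ("fall", 4300), ("fall", 3700),
]


def count_days_per_range(totals, width=500):
    """Count how many days fall in each range of total users (a distribution)."""
    counts = defaultdict(int)
    for total in totals:
        # Key each range by its lower edge, e.g. 4100 falls in the 4000 range.
        counts[(total // width) * width] += 1
    return dict(sorted(counts.items()))


def average_by_season(rows):
    """Average the total users separately for each season (the bar heights)."""
    by_season = defaultdict(list)
    for season, total in rows:
        by_season[season].append(total)
    return {season: mean(totals) for season, totals in by_season.items()}


distribution = count_days_per_range([total for _, total in days])
season_avg = average_by_season(days)

# Text version of the bar chart: one '#' per 500 average users.
for season, avg in season_avg.items():
    print(f"{season:<7} {'#' * round(avg / 500)} {avg:.0f}")
```

With real data, the same two helpers would give the heights of the distribution chart and of the season bar chart, so the tallest and shortest bars can be read straight from `season_avg`.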
However, in the winter months, that's when we typically see a dip. Okay. Well, now again, let's continue to ask questions. Asking questions around data is a great way of discovering more insights. The first question was, does it look like our data is following this historical average of 4000 people per day using our service? Then the next question was, I wonder if season makes a difference. Okay, well, let's continue that question and ask: is the drop in the winter months that we saw in our last chart the same for registered users and for casual users?
Well, let's just ponder this to start. Remember, a registered user is someone who registers ahead of time to make sure that a bike is available for them. They essentially pay a service fee for this. You can think about these people really as workers, probably using the bike rental service to go to their job, or maybe to do their day-to-day chores, such as going to the grocery store or going to the gym. You can think about casual users, on the other hand, as being people who are sort of just using the bike rental service as they need it. Maybe I want to take a stroll through a park, and so I'm going to rent a bike and go biking through the park, but if we're a casual user, we're probably not going to be using it on a very consistent basis. So now again, the question makes sense: is the drop in winter the same for registered and casual users? We'd figure that registered users probably need to use the bike rental service whether it's cold outside or not, whereas casual users, on the other hand, if they're just using it when they want to, may not choose to use it in the winter months.
So, let's take a look at another chart. This is the exact same chart that I showed you previously. So, again, the taller the bar, the higher the number of total users on average, but now what I've done is I've broken those same four bars that you saw earlier into two different pieces. The darker shade on the upper part of the bar is the average of the registered users, whereas the lighter shade on the lower part of the bar is the average of the casual users, so we can compare the breakdown of registered versus casual users in each season. What I'd like you to do is focus on the left-hand two bars, the spring and the summer. It looks like casual users make up a bigger piece of the overall average, which would make sense: in the nice spring and summer months, when it's warm outside and people want to get out and move around a little bit, we can have more people who are casually using our bike rental service. However, let's focus in on the right two bars, the fall and the winter. We can see that the light blue casual users in those right two bars seem to be a smaller piece than what we saw in the spring and the summer.
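
To make the "bigger piece" versus "smaller piece" comparison concrete, here is a small sketch that computes the casual share of each season's average and prints a text version of the stacked bars. The seasonal averages in `season_avgs` and the helper `casual_share` are assumptions invented for this illustration; they are not the values behind the chart in the lecture.

```python
# Hypothetical seasonal averages as (casual, registered) pairs. These numbers
# are invented for illustration and are NOT the course's bike rental data.
season_avgs = {
    "spring": (1000, 3500),
    "summer": (1300, 4200),
    "fall": (700, 3800),
    "winter": (300, 2100),
}


def casual_share(casual, registered):
    """Fraction of the whole bar (casual + registered) made up by casual users."""
    return casual / (casual + registered)


shares = {
    season: casual_share(casual, registered)
    for season, (casual, registered) in season_avgs.items()
}

# Text version of a stacked bar: 'c' for casual, 'R' for registered,
# one character per 250 average users.
for season, (casual, registered) in season_avgs.items():
    bar = "c" * round(casual / 250) + "R" * round(registered / 250)
    print(f"{season:<7} {bar}  casual share: {shares[season]:.0%}")
```

Comparing shares rather than raw counts is what lets us say the casual piece shrinks in fall and winter even though the whole bar also changes height.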
Again, this would make intuitive sense. As the weather starts to turn cold, it starts to become a little bit less appealing to go bike around for fun. Fewer and fewer casual users seem to be using our service. Isn't it amazing the kinds of insights we can find just by exploring our data? What I showed you previously is what we refer to as a stacked bar chart, which breaks down the original bar chart into different groups so we can see how the groups break down across different bars. Again, a stacked bar chart is something we're going to learn in a later lecture. This is still just a preview to show you how easy it is to explore data and draw insights from it.
Wow, all right. So this has been very intriguing. We've seen a lot of different things revealed by our data so far. Well, why do you think customers use bike rentals less in the winter? I mean, it probably has to do with the weather, right? And so, again, remember we are using information from a bike rental service in Washington, DC, and if you've never been to Washington, DC before, it has a tendency of being colder in the winter and warmer in the summer. This isn't something like Hawaii, which is warm all year round. So, maybe people use bike rentals less in the winter because it has lower temperatures.
So, again, let's explore our data. What I'm showing you here is what's referred to as a scatterplot. On the bottom axis, the horizontal axis, the zero, 10, 20, 30, 40, 50 that you're seeing at the very bottom, that represents temperature. The low temperatures are on the left-hand side. The high temperatures are on the right-hand side. The vertical axis, the axis on the left-hand side of this chart, the up-and-down axis, is the total daily users again on average, so we can see everything from zero users all the way up to almost 9000 users. So, let's again take a look. We can see a little bit of a trend here, right? As temperature goes up, it looks like there are more total daily users, whereas as temperature goes down, there are fewer users. Look at the coldest day; find the dot that's furthest to the bottom left. Notice how that's almost 20 degrees outside for a high, and really we only had about 1000 users that day. However, if we go into something like the 70s or the 80s for temperature, we see a lot more total daily users, somewhere around 3000 to 8000 users in a day. Again, it looks like as temperature goes up, our total daily users tend to increase. Like I said, that is what we refer to as a scatterplot, specifically a scatterplot between temperature and user count, to try and see some kind of possible relationship, where that relationship might be able to give us some insights.
Well, what about registered users versus casual users? Again, let's break those down, similarly to what we did earlier. Let's take a look at those separately. What you see here is registered users, same plot. Just now, instead of looking at total users, we're looking at registered ones. We can see the same idea: as temperature has a tendency of going up, so does the total daily number of registered users. Again, on the colder days, in the 20 and 30 degrees, we typically have between 500 and maybe 2500 registered users. However, on the warmer days, 80 degrees, we typically see between 2000 and 7000.
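
Reading ranges off a scatterplot like this ("on the colder days we see between this many and that many users") can be mimicked by grouping days into temperature bands and reporting the smallest and largest user count in each band. The sketch below is a rough illustration under stated assumptions: the `observations` pairs are invented, the 20-degree band width is an arbitrary choice, and the helpers `band` and `user_range_by_band` are hypothetical names, not part of any course material.

```python
# Hypothetical (high temperature in degrees, registered users) pairs. These
# values are invented for illustration and are NOT the course's data.
observations = [
    (22, 600), (28, 1500), (35, 2400), (55, 3100),
    (68, 4200), (79, 2200), (82, 5200), (84, 6800),
]


def band(temp, width=20):
    """Return the (low, high) temperature band containing temp, e.g. (20, 39)."""
    low = (temp // width) * width
    return (low, low + width - 1)


def user_range_by_band(pairs, width=20):
    """For each temperature band, report the (min, max) users seen in that band."""
    grouped = {}
    for temp, users in pairs:
        grouped.setdefault(band(temp, width), []).append(users)
    return {b: (min(u), max(u)) for b, u in sorted(grouped.items())}


ranges = user_range_by_band(observations)
for (low, high), (fewest, most) in ranges.items():
    print(f"{low}-{high} degrees: between {fewest} and {most} users")
```

Running the same grouping once on registered counts and once on casual counts would allow the side-by-side comparison the lecture makes by eye.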
So, again, it seems that as temperature goes up, the number of total daily registered users also has a tendency of going up. Let's see if this pattern still holds for casual users. Oh, okay. Now we're seeing something a little bit different. Again, we see that temperature has a relationship, it appears, with the total daily casual users. As temperature goes up, so do the total daily casual users. However, look at those low temperatures. Where we have really low temperatures, we really do not see a lot of casual users. In fact, below 40 degrees we see not a lot of users, anything from zero to 500 total daily casual users, whereas if we're up in the 70 to 80 degrees, we can see anywhere from 500 up to 3500 casual users in a single day. So, although we can look back and see that for registered users, yes, temperature has a relationship, it seems to have a more impactful relationship when it comes to casual users. Again, these are some insights we can draw by exploring our data.
So, let's wrap this up in summary. Remember, data by itself is really just information, not overly helpful. However, exploring our data, looking at our data, can reveal potential insights and valuable uses for that information. Now, I showed you a lot of different plots and a lot of different things that we're going to be exploring in further detail later on in the course, things like distributions, bar charts, stacked bar charts, and scatterplots. We also talked about averages. All these things will be explored in much further detail later on in the course, but hopefully this gives you a preview, and hopefully, in all honesty, gets you a little excited about some of the things that we can do to explore our data. I'll be honest, visuals help explore data so well. By taking a look at our data, we can draw these insights.
Let's remember some of the insights that we had just by exploring and looking at our data in this lecture. It seemed like we have a wide variety of possibilities for the number of total users on any given day, anything from below 500 all the way up to almost 9000 total users. However, season is an important factor in how many total users we see on a given day. We saw that winter had a tendency of being lower than things like spring or summer. When we delve into this further, we see that temperature plays an important role. As temperature increases, we can see that the number of total users tends to increase. However, when we broke this down, looking at the difference between registered and casual users, temperature has a much bigger impact on casual users, especially at the lower temperatures. Isn't it amazing what we can do with data? So, that is the end of this lecture, and I look forward to seeing you in our next one.