Data Cleaning on Hong Kong Census

on August 13, 2018

In my last blog, we went through the process of Data Analysis with Power BI. In summary, these are the four steps:

  1. Data Collection
  2. Data Cleaning
  3. Data Insights
  4. Data Visualisation

In this blog, I am going to focus on Data Cleaning - an important process in Data Analytics. Without proper data cleaning, you will not be able to proceed with generating useful insights and visualisation with Pivot Tables or Power BI. Again, we will demonstrate the process step by step, feel free to follow through the steps and try it yourselves.

Before we start, What are we looking to achieve in the data cleaning process?

The objective of data cleaning is to arrange the data collected to “Tidy Data” - to arrange data in a way that each variable is a column and each observation is a row. The idea is thoroughly explained in Hadley Wickham’s “Tidy Data”. tidydata

Tools, Skills & Data

In this exercise, we will be using data drawn from Hong Kong Census on Transportation Means. We will be using Unpivot, a feature available in both Power Query and Power BI. Note that these applications are Windows exclusive, so if you are working with Mac, you may still install Windows with Bootcamp or Parallel and use these tools.

We will be working with Hong Kong By-census on Transportation Means that tracks means of transport for each district. Search for the dataset “Working Population with Fixed Place of Work in Hong Kong by District Council District, Place of Work, Year and Main Mode of Transport to Place of Work”.

The Process

The dataset is very extensive and detailed. As it is not following the tidy data principle, it is hard to generate any good insights. You can see variables are defined in both columns and rows. We are going to correct that in this exercise.

Step1: Reorganising the dataset

In PowerBI click on Get Data -> Excel, then select the file you just downloaded. Select the Result table on the left by ticking the box. Then click on Edit. Now we need to re-organize the file and rename the column accordingly. First Column will be for the district. We need to fill down the district name so that it’s available for each row. Using Power Query it’s very simple, just go on the transform tab after you select the column, then click on Fill -> Down, this will copy the data across (See Below Screenshot). powerbiscreencap1

As you can see this is very easy to use and can save you a lot of time. You can then rename column 3 as Work Location, remove the top 4 rows and clear the unnecessary rows. Once you are done you should have two columns with the district names and work location. The other columns are the transport types with the observations The table should now look like this: powerbiscreencap2

Step 2: Unpivot

Now it’s time to go with the Unpivot. Select the all the columns except the first two. Then go on Transform and select the unpivot option. This stores all the observations as rows. At this stage, the data is now properly cleaned, and can be used for further analysis:

powerbiunpivot

Step 3: Data Visualisation

Using the formatted data we have now, we can now use Power BI to perform exploratory analysis:

tidydata

We can already see MTR is the most used form of transport, followed by bus, walking coming on third. The top right indicate the breakdown in terms of population. If you look at the bottom part, I used an interactive slicer where you can see the percentage of the population and where they live (same district or other district in the different regions).

We can see that people living in Central and Western for example tend to work in the same district, considering how many companies present there it is not surprising. Only Tuen Mun, Yau Tsim Mong and Kwun Tong have the same profile. We customize the bar chart to highlight people working in the same district so that it is easier to visualize.

Next, we tried to visualize how district rank by mode of transport. To do that, we created multiple bar chart and then use visual filter for each type of transport. Here, there are no surprising results. Other than Yau Tsim Mong and Central/Western are walking to get to work, which confirm the observation earlier that they tend to live and work in the same district. Central/Western and Wanchai are also the biggest users of Taxi.

tidydata

Now, these are not ground breaking discoveries. It was just an exercise to see how simply you can go from a dataset that Is hard to read to clean data that is visualized and interactive. This information could be used for companies wanting to target specific demographics, with that they can see where they could put advertising (in MTR, Bus etc…). Using the Census data you could also combined other information like the age breakdown, industry they work in to make your profiling more detailed.

Another point to make, people will often say you need big data to perform better analysis. However we demonstrated here that a small data set (provided the information collected is good) can give us some good insights.


  coding