Data, and the insights it provides, are invaluable drivers of business success.
Internal data is used daily to report on the activities that drive business within an organisation. These reports keep a company informed of its progress against business objectives, support a range of executive decisions, and allow it to reflect on the performance of the business across a range of areas. They include sales reports, marketing performance reports, human resources reports and information technology reports.
Internal data is not the only data that can be used to generate a meaningful report.
There is a wealth of additional data sources an organisation can leverage, either to enhance reporting or to build services on top of. This kind of data can also put a company's position in context, relative to its competitors and to industry or market factors.
One such source is open data, which opendefinition.org defines as:
Open data and content can be freely used, modified, and shared by anyone for any purpose
To be clear, open data is not necessarily free of charge: you may have to purchase it, but once purchased you are free to do with it as you please. The Open Data Institute has extended the definition to require that the data be machine readable, which means data made available only in PDF format does not qualify as open data.
Some examples of open data sources:
- OpenCorporates contains data for over 90 million companies worldwide, including those in the UK.
- http://data.gov.uk/ provides a listing of available public sector and governmental data within the UK.
- The European Data Portal similarly provides public sector and governmental data.
- http://enigma.io/ is a global open data aggregator service.
- Amongst others, there are industry reports, investment analyst reports, weather data, economic data, government data of various types, World Bank data and IMF data.
To make this post a bit more practical, I will use two sources of open data to generate a set of visualisations and, hopefully, a bit more insight into how public services are used within London. The first source is Transport for London (TfL, data.tfl.gov.uk) and the second is weather data.
The TfL data used is the public cycle hire data set, which includes all the cycle hire data from January 2012 to December 2015, though the focus here is only on the 2015 data. Once you have registered on the TfL site, the data is freely available to download. One challenge was that the text files were not all consistent and needed manipulation to bring them into a common format. For the clean-up I used OpenRefine, which made the task less complex. Once the .csv files were cleaned and ready to use, I uploaded them to Google Cloud Storage, which is part of the Google Cloud Platform.
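The clean-up in this post was done with OpenRefine, but the same kind of header normalisation can be sketched in Python for readers who prefer a scripted approach. The column names and header variants below are hypothetical, chosen only to illustrate the idea; the real TfL files may differ.

```python
import csv
import io

# Hypothetical mapping from header variants to one canonical name.
# The real TfL column names may differ; these are illustrative only.
CANONICAL = {
    "rental id": "rental_id",
    "rentalid": "rental_id",
    "start date": "start_date",
    "startdate": "start_date",
    "startstation name": "start_station",
    "start station name": "start_station",
    "endstation name": "end_station",
    "end station name": "end_station",
}

def normalise_rows(text):
    """Yield dicts with canonical keys from CSV text whose headers vary."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        clean = {}
        for key, value in row.items():
            if key is None:               # surplus cells beyond the header
                continue
            name = CANONICAL.get(key.strip().lower())
            if name is not None:          # drop columns we do not recognise
                clean[name] = (value or "").strip()
        yield clean

# Two made-up files with inconsistent headers.
file_a = (
    "Rental Id,Start Date,StartStation Name,EndStation Name\n"
    "1,01/01/2015 08:05,Waterloo,Bank\n"
)
file_b = (
    "rentalid, startdate ,Start Station Name,End Station Name,extra\n"
    "2,01/01/2015 17:40,Bank,Waterloo,x\n"
)

rows = list(normalise_rows(file_a)) + list(normalise_rows(file_b))
```

After this step both files share the same four keys, so they can be written back out as consistent .csv files before uploading.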
The weather data was a bit more challenging and required some Python code to scrape it from a weather aggregator. This allowed me to download hourly weather data for London, which aligned with the cycle hire data. Once cleaned and ready to use, it was similarly uploaded to Google Cloud Storage.
The diagram below shows the end-to-end process, from acquiring the data, through storing and processing, to finally visualising it.
As the diagram shows, the step before using Tableau to visualise the data was to import the data sets into Google BigQuery. This allowed the data sets to be merged and/or processed into smaller data sets that are easier to visualise within Tableau. Google Cloud SQL could be an alternative to BigQuery, depending on the size of each data set and the queries that need to be performed on the data.
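To illustrate what "aligned" means here, the sketch below truncates each hire timestamp to the start of its hour and looks up the weather reading for that hour. It does not show the scraping itself. The timestamp format, field names and weather values are all assumptions made up for the example.

```python
from datetime import datetime

HIRE_FORMAT = "%d/%m/%Y %H:%M"  # assumed timestamp format, not confirmed by the source

def hour_key(timestamp):
    """Truncate a hire timestamp string to the start of its hour."""
    parsed = datetime.strptime(timestamp, HIRE_FORMAT)
    return parsed.replace(minute=0, second=0, microsecond=0)

# Hypothetical hourly weather observations keyed by the start of each hour.
weather = {
    datetime(2015, 1, 1, 8): {"temp_c": 4.0, "rain_mm": 0.0},
    datetime(2015, 1, 1, 17): {"temp_c": 6.5, "rain_mm": 1.2},
}

# Hypothetical cycle hire records.
hires = [
    {"rental_id": "1", "start_date": "01/01/2015 08:05"},
    {"rental_id": "2", "start_date": "01/01/2015 17:40"},
    {"rental_id": "3", "start_date": "01/01/2015 23:10"},  # no weather reading
]

def attach_weather(hires, weather):
    """Return hires with the matching hourly weather, or None where missing."""
    joined = []
    for hire in hires:
        observation = weather.get(hour_key(hire["start_date"]))
        joined.append({**hire, "weather": observation})
    return joined

joined = attach_weather(hires, weather)
```

Keeping unmatched hires with a `None` reading, rather than dropping them, makes gaps in the weather data visible before anything is loaded further down the pipeline.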
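The merge-and-reduce step can be illustrated locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for BigQuery, so it is not the BigQuery API, and the table names, column names and rows are hypothetical. It joins hires to hourly weather and reduces them to one row per hour and station.

```python
import sqlite3

# In-memory SQLite database used purely as a local stand-in for BigQuery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hires (rental_id INTEGER, start_hour TEXT, start_station TEXT);
    CREATE TABLE weather (hour TEXT, rain_mm REAL);
""")

# Hypothetical sample rows; start_hour is already truncated to the hour.
conn.executemany("INSERT INTO hires VALUES (?, ?, ?)", [
    (1, "2015-01-01 08:00", "Waterloo"),
    (2, "2015-01-01 08:00", "Waterloo"),
    (3, "2015-01-01 08:00", "Bank"),
    (4, "2015-01-01 17:00", "Bank"),
])
conn.executemany("INSERT INTO weather VALUES (?, ?)", [
    ("2015-01-01 08:00", 0.0),
    ("2015-01-01 17:00", 1.2),
])

# Merge the two data sets and reduce them to journeys per hour per station.
summary = conn.execute("""
    SELECT h.start_hour, h.start_station, COUNT(*) AS journeys, w.rain_mm
    FROM hires AS h
    JOIN weather AS w ON w.hour = h.start_hour
    GROUP BY h.start_hour, h.start_station, w.rain_mm
    ORDER BY h.start_hour, h.start_station
""").fetchall()

for row in summary:
    print(row)
```

A summary table like this is far smaller than the raw journey records, which is what makes it easier to work with in a visualisation tool.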
Below is a dashboard generated in Tableau using the TfL data. Selecting a specific station shows the journeys to and from that station.
View the live Tableau dashboard here.
Some observations from the data:
- The vast majority of weekday cycle hire activity can clearly be attributed to commuters.
- In most places, a significant proportion of people who start at a specific cycle hire location end up at the same location.
- One can clearly identify residential vs non-residential zones by observing the number of journeys that occur during the various hours of the day. In residential zones, you see a significant spike in cycle hire usage between 07:00 and 09:00, and in non-residential zones you see a significant spike in the number of journeys at around 17:00 to 19:00.
- People typically make their shortest journeys at around midday and their longest journeys during late evening and very early morning.
- The busiest cycle hire points are centred around Waterloo Station.
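Two of the measures behind these observations, departures by hour for a station and the share of journeys that end where they started, are simple to compute. The following is a minimal sketch over made-up journeys; the station names and figures are illustrative and are not taken from the TfL data.

```python
from collections import Counter

# Hypothetical journeys: (start_hour, start_station, end_station).
journeys = [
    (8, "Waterloo", "Bank"),
    (8, "Waterloo", "Holborn"),
    (12, "Hyde Park", "Hyde Park"),
    (17, "Bank", "Waterloo"),
    (18, "Bank", "Waterloo"),
]

def departures_by_hour(journeys, station):
    """Count departures from one station, grouped by hour of day."""
    return Counter(hour for hour, start, _ in journeys if start == station)

def round_trip_share(journeys):
    """Fraction of journeys that end at the station where they started."""
    if not journeys:
        return 0.0
    same = sum(1 for _, start, end in journeys if start == end)
    return same / len(journeys)

waterloo = departures_by_hour(journeys, "Waterloo")
bank = departures_by_hour(journeys, "Bank")
share = round_trip_share(journeys)
```

Comparing the hourly profiles of two stations in this way is one route to the morning versus evening spike pattern described above.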
Open data can be a valuable source of additional data for organisations, enabling them to generate more meaningful and insightful reports. From company performance comparisons, to comparing product sentiment with competitors' products, to benchmarking across the industry, open data allows a more holistic view when used correctly. Ultimately, opening up the data silos and using external data in a structured way will benefit the organisation by allowing better decision making and a clearer vision for the future.