Automating Data Extraction

Recently, our Tech Team received a request from our Marketing Team to automate the extraction of certain social media statistics, such as the number of views, comments, and likes on our posts within a specified period.

Turns out there was a new initiative to track the performance of our marketing activities, and part of it involves monitoring various metrics - such as reach, impressions and interactions - on our public-facing platforms, like:

Each platform has its own analytics dashboard - Facebook has a Creator Studio for Facebook and Instagram, Spotify has Spotify for Artists, etc. - but having to log in to each platform to extract the exact information we want (“how many views do we have on each video posted between Jan and March?”) is a tedious and unenviable manual task.

Design

Luckily, these platforms also have public-facing APIs (Application Programming Interfaces) that we can use to automate this process. Our general approach is to :

  • Extract the data we want (impressions, views, likes, saves, etc.) from our social media platforms using scripts (or extractors).

  • Load the data into a database.

  • Build visualisations and a dashboard based on the data.

  • Repeat this process every day.

This is a miniature of a classic procedure called ETL (Extract, Transform, Load) used to consolidate data for data analysis and business intelligence applications.

Untitled Diagram-3.png

There are 4 components to our system:

  • Our extractors are simple scripts written in Python

  • Airflow is a task management tool for running and scheduling the extractor jobs

  • PostgreSQL is a database for storing all the raw and transformed data

  • Redash is a data visualisation tool

These applications are all run on our own DigitalOcean server.

API

In order to use the APIs, we have to first create API keys, used to authenticate API requests. The instructions can be easily found on the respective platform’s websites, such as:

Afterwards, it is just a matter of looking through the various documentation to find out which APIs can give us the data we want.

  • For YouTube, we only need the views, likes and comments, which can be retrieved through the YouTube Data API. We also needed the data for unique viewers, but sadly it is no longer available through their API.

  • For Facebook and Instagram, everything is queried through their Graph API.

  • For Spotify, we need the number of listens per track. Unfortunately. there is no public API available for the Spotify for Artists platform, but we were able to design a web scraping workflow to get the data we want.

ETL pipeline

More accurately, we are using an ELT workflow (Load first, then Transform).

ELT is a more flexible workflow for our use case; since we do not need to optimise for storage yet (our social media analytics isn’t exactly big data, plus data is cheap nowadays), we can just dump the raw data into the database, and design the transformation step accordingly. Right now, we are still not 100% certain of the metrics we need, so this workflow allows us to have the flexibility of changing how we process the raw data into a query-friendly format.

Our extractors are run using Airflow, which organises processes in units called tasks. Airflow allows us to visualise the flow easily, and inspect the logs/output and status of each task as they run.

3D976254-E076-480E-950A-03A3822C9477.png

To keep our data updated, Airflow can also be configured to run our extractors automatically on a regular basis. We have set them up to run daily.

Data Visualisation

Now that we have the data we want in our database, we can use Redash to query the data and visualise the output.

Redash is a BI tool that is meant for users with some knowledge of SQL. Thankfully, the SQL knowledge we need to get what we want is not very complicated - we just need to know the basic select, group, order and some window functions.

1DB2081F-C2B8-4773-A40E-1EE1231CDADC.png

From the output, we can even create some simple time series graphs! The chart below shows the number of new impressions/reach/engagements/likes for all our posts this year, over the past few days.

22E7CA44-2DF7-411E-8047-AE12B31E148B.png

We can then create a dashboard to view all the charts and tables in one place.

CFDA8AD9-090C-4A80-ADFC-C3E4054E0359.png

Acknowledgements

Believe it or not - this was all set up in a single day.

We would like to thank Andrew Goh from Grab for helping us with building this system! Andrew reached out to us through Grab’s Social Impact initiative, which enabled him to commit one full day to this project.

We are also very grateful to DigitalOcean for supporting us with infrastructure credits through their Hub for Good program.