How reading a book made Plio 5-10 times faster
Since the start of this year, we’ve been spending a small fraction of our time improving our technical knowledge, alongside our routine of adding new features, making enhancements, and fixing bugs. Within a month, this practice has already made the platform 5-10 times faster, cut our costs by one-third, and drastically simplified our systems. In this blog post, we’ll share the nitty-gritty details of how that happened.
The starting point
Before we talk about the changes we made and why we made them, let’s set the context of what our setup used to look like. Broadly, this was our tech stack:
backend: Django + PostgreSQL database (AWS RDS)
data warehouse: Google BigQuery
analytics: CubeJS (open-source analytics API to build analytics features on top of a data warehouse)
In our tech stack above, the important thing to focus on is analytics. Now, there are two use cases of analytics:
Plio-level analytics: this includes the metrics for a given plio that the creator of that plio is looking at.
Internal analytics: this represents the analysis that we do internally to look at the broader usage metrics of Plio.
We had initially assumed that both posed the same technical challenge: scanning a large number of rows for a given set of columns to calculate aggregate metrics like mean, sum, count, etc.
Now, back then, we had very briefly read the following:
There are two types of databases: transactional and analytical.
PostgreSQL (which we run on AWS RDS) is a transactional database and BigQuery is an analytical one.
For analysis, using an analytical database is more performant.
So, being as naive as we were back then, we simply took this to mean that we should use BigQuery for all our analytics-related queries.
Also, for some weird reason, we wanted to use some fancy analytics API so that some redundant code could be avoided (basically a few SQL queries and a little bit of Python). So, to do that, we adopted CubeJS, never mind the fact that we had to create a new repository for it along with a new docker image, deploy it as a separate microservice, create a separate Redis instance, add load balancers, health checks, monitoring, etc. All of this just to avoid writing a little bit of code on our backend.
And thus, we had to set up a data pipeline that copied data from our RDS database to BigQuery every hour. Our CubeJS instance connected to BigQuery, and our frontend made API calls to either our backend or the CubeJS instance depending on the type of query.
We thought we were thinking about the long-term and making the smart choice (using the complexity of our system as a proxy for our smartness).
However, reading the book “Designing Data-Intensive Applications” over the last month enlightened us.
Firstly, we realized that the two applications of analytics that we highlighted above involve completely different technical challenges. The analytics for a plio will actually involve reading only a few rows in the database corresponding to that plio. This is different from internal analytics where we are likely to look at all the rows in the database to come up with the broader usage metrics of the tool itself. Thus, a row-oriented database (transactional) is more suited to handling the first use-case and a column-oriented database (analytical) is ideal only for the second one.
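To make the distinction above a little more concrete, here is a minimal sketch in plain Python of the two access patterns. It is only an illustration: the records, the field names (`plio_id`, `watch_time`) and the helper functions are hypothetical and are not taken from Plio's actual schema, and real databases are far more sophisticated than a dictionary and a few lists.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical session records: (session_id, plio_id, watch_time_seconds).
ROWS = [
    (1, "plio_a", 120),
    (2, "plio_b", 45),
    (3, "plio_a", 60),
    (4, "plio_c", 300),
    (5, "plio_b", 75),
]

# Row-oriented view: whole records are kept together, with an index on
# plio_id so one plio's records can be found without scanning everything.
rows_by_plio = defaultdict(list)
for row in ROWS:
    rows_by_plio[row[1]].append(row)


def plio_metrics(plio_id):
    """Plio-level analytics: touches only the rows belonging to one plio."""
    matching = rows_by_plio.get(plio_id, [])
    watch_times = [watch_time for _, _, watch_time in matching]
    return {
        "sessions": len(watch_times),
        "avg_watch_time": mean(watch_times) if watch_times else 0,
    }


# Column-oriented view: each column is stored contiguously, which suits
# aggregates that read one column across every row in the dataset.
COLUMNS = {
    "session_id": [r[0] for r in ROWS],
    "plio_id": [r[1] for r in ROWS],
    "watch_time": [r[2] for r in ROWS],
}


def overall_avg_watch_time():
    """Internal analytics: scans a single column over the whole dataset."""
    return mean(COLUMNS["watch_time"])
```

Calling `plio_metrics("plio_a")` only looks at the two records indexed under that plio, while `overall_avg_watch_time()` reads every value of one column and ignores the rest, which is roughly the split between the two use cases.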
Next, transactional databases are optimized for real-time performance, which is critical for web applications that users interact with directly, whereas analytical databases are meant for use cases where real-time performance is not critical. Thus, using BigQuery for our analytical queries was drastically slowing Plio down, as it was never meant to sit behind a web application in the first place.
Thus, we shifted all our analytical queries to our backend as well and completely removed any ties to our CubeJS instance. The load time of our home page dropped from 5-6 seconds to 1 second, and the time to download a plio’s report dropped from 10 seconds to 1 second.
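As a rough illustration of what a plio-level query on the backend can look like, the sketch below runs an aggregate directly against a transactional database. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere; the table `plio_session`, its columns, and the function names are assumptions made for this example, not our actual schema or code.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical session table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE plio_session ("
        "id INTEGER PRIMARY KEY, "
        "plio_id TEXT NOT NULL, "
        "watch_time REAL NOT NULL)"
    )
    # The index lets the database jump straight to one plio's rows.
    conn.execute("CREATE INDEX idx_session_plio ON plio_session (plio_id)")
    conn.executemany(
        "INSERT INTO plio_session (plio_id, watch_time) VALUES (?, ?)",
        [("plio_a", 120), ("plio_a", 60), ("plio_b", 45)],
    )
    return conn


def plio_report(conn, plio_id):
    """Aggregate metrics for a single plio using one indexed query."""
    count, avg_time, total_time = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(AVG(watch_time), 0), "
        "COALESCE(SUM(watch_time), 0) "
        "FROM plio_session WHERE plio_id = ?",
        (plio_id,),
    ).fetchone()
    return {
        "sessions": count,
        "avg_watch_time": avg_time,
        "total_watch_time": total_time,
    }
```

With `conn = build_demo_db()`, calling `plio_report(conn, "plio_a")` aggregates only the rows for that plio. Because the `WHERE` clause filters on an indexed column, the database does not need to scan the whole table, which is the kind of query a transactional database handles well.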
Finally, we shut down our analytics instance and everything else surrounding it: Redis, load balancers, elastic IPs, docker images, triggers, alarms, automations, Fargate instances, etc.
This experience has made us realize that although moving fast is important, we also need to give enough importance to improving our own technical knowledge and building a deep understanding before implementing something (especially something that can add a lot of complexity and cost us a lot of money). It felt very satisfying to deploy this change, see the instant speed improvement, shut down those useless instances and prevent our future dollars from being wasted.