Software architecture is a hard problem, and software architecture that scales is even harder. We’d all like to live in a world where problems of scale don’t creep into our code and can be kept in the realm of infrastructure, and to a certain extent they can. But eventually as developers we do have to think about these problems and find creative architectural ways to solve them.
The problem of scale I’d like to focus on in this post is query response time. It’s a problem that many teams just accept as a painful reality of a growing database and the increasingly complex queries that drive rich UI functionality. In reality, this is a solvable problem, and one that can be solved elegantly using an existing architectural pattern: CQRS.
As an example, we’re going to create a fictional user forum named “SlackerNews”. We’re going to fill it with auto-generated post data and benchmark some common queries and then look at replacing some of the heaviest queries using an Algolia search index to show how we can both improve search UX and also reduce database load and cost across the board.
Command Query Responsibility Segregation is a data modeling and system design pattern which says, in essence, that you can use different models for writing data than for reading it. Instead of reading and writing against the same database model, you write to one database, then copy a version of that data into a separate and different type of data storage system, and run your queries against that instead. Usually this second data store holds a denormalized version of the data, which is more read-performant than a relational database will be.
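To make the write path concrete, here’s a minimal sketch of the idea in Go. The types and in-memory maps stand in for the real Postgres tables and Algolia index, and all names here are illustrative, not the project’s actual code:

```go
package main

import "fmt"

// Post is the write-side (relational) model.
type Post struct {
	ID      int
	ForumID int
	Body    string
}

// IndexedPost is a hypothetical denormalized read-side record,
// flattened for a search index rather than for joins.
type IndexedPost struct {
	ObjectID  string
	Body      string
	ForumPath []int // ancestor forum IDs, precomputed at write time
}

// SavePost sketches the CQRS write path: persist to the source of
// truth first, then push a denormalized copy to the read store.
func SavePost(db map[int]Post, index map[string]IndexedPost, p Post, forumPath []int) {
	db[p.ID] = p // 1. write to the relational source of truth
	objectID := fmt.Sprintf("post_%d", p.ID)
	index[objectID] = IndexedPost{ // 2. sync the read model
		ObjectID:  objectID,
		Body:      p.Body,
		ForumPath: forumPath,
	}
}

func main() {
	db := map[int]Post{}
	index := map[string]IndexedPost{}
	SavePost(db, index, Post{ID: 1, ForumID: 7, Body: "hello"}, []int{3, 7})
	fmt.Println(len(db), len(index))
}
```

In a real system the second step would be an asynchronous sync (queue, CDC, or batch job) rather than an inline call, so a slow index write never blocks the command path.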
CQRS will increase the complexity of your system; however, if done right and used strategically for the right types of reads, the benefits can be huge. We’re going to look at how we can apply CQRS to a forum database to drastically improve text search query response times, improve UX, and decrease load on your source-of-truth database.
You may already be familiar with this pattern from how we generally design object storage. We don’t usually store large binary files such as images, music, or videos in a database, opting instead to put them in an object storage medium where the cost of storing that data is lower and the impact on the main database is minimal. The only difference here is that we store the data in both the relational database and the search index, and we treat the relational database as the source of truth.
Before we get to the query benchmarking and optimization, we have to generate some dummy data in our database. For this example, I’m going to use PostgreSQL as the database, as it’s one of the most performant and popular RDBMS in use today, and also because it has some great out-of-the-box text search optimization features.
The project is broken down into two parts:
1. Data generation
2. Query benchmarking
both of which can be found in this Query Performance GitLab project.
Here’s a class diagram for the very simplified forum database we’re building. It should be relatively easy to follow. The main takeaways for our queries are:
1. Forums can have sub-forums via the ParentID foreign key
2. Tags and Posts have a many-to-many relationship, which we achieve through the use of a bridging table, PostTag
3. Through a recursive query (that we’ll show later on) we can fetch all posts in all child forums underneath any given parent forum.
The goal is to generate 1 million records which will be saved into Postgres in a normalized format using the tables and relationships which represent the data model above. The same data will also be denormalized and sent to an Algolia index.
To generate the data itself we’ll be using some sample posts from HackerNews, and then using Markov chains to generate entirely new random posts from that sample data. The code for this can be found in this Data Generator GitLab project.
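The generator itself lives in the linked project, but the core Markov-chain idea can be sketched in a few lines of Go. This is an illustrative first-order chain (one word of context), not the actual generator code:

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// BuildChain maps each word in the sample text to the list of
// words that follow it (duplicates preserved, so frequent
// successors are picked more often).
func BuildChain(sample string) map[string][]string {
	words := strings.Fields(sample)
	chain := map[string][]string{}
	for i := 0; i < len(words)-1; i++ {
		chain[words[i]] = append(chain[words[i]], words[i+1])
	}
	return chain
}

// Generate walks the chain from a starting word, picking a random
// successor at each step, to produce an entirely new "post".
func Generate(chain map[string][]string, start string, n int, rng *rand.Rand) string {
	out := []string{start}
	word := start
	for i := 0; i < n; i++ {
		next, ok := chain[word]
		if !ok {
			break
		}
		word = next[rng.Intn(len(next))]
		out = append(out, word)
	}
	return strings.Join(out, " ")
}

func main() {
	chain := BuildChain("the cat sat on the mat")
	rng := rand.New(rand.NewSource(1))
	fmt.Println(Generate(chain, "the", 5, rng))
}
```

Fed a large sample of real HackerNews posts, even a simple chain like this produces text that is statistically similar to real forum content, which is what matters for benchmarking.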
To test the performance of both our Postgres database and our Algolia index, I’ve written two AWS Lambda functions: Orchestrator and Runner.
The orchestrator takes some information about our test scenario and will delegate the execution to the runners using a fan-out pattern. The runner will take a batch of queries and execute them against either the database or Algolia index. Each runner will calculate the average query time for its batch and return that to the orchestrator. The orchestrator will collect all of the averages from the runners and calculate a total average query execution time for that benchmark.
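The aggregation step can be sketched as follows. In the real project the runners execute concurrently via Lambda fan-out; here they run sequentially for simplicity, and the function names are illustrative:

```go
package main

import "fmt"

// runnerAverage mimics what each runner returns: the mean query
// time (in ms) over its batch of queries.
func runnerAverage(batch []float64) float64 {
	sum := 0.0
	for _, t := range batch {
		sum += t
	}
	return sum / float64(len(batch))
}

// orchestrate fans the batches out to runners and averages the
// runners' averages into a single benchmark figure.
func orchestrate(batches [][]float64) float64 {
	total := 0.0
	for _, b := range batches {
		total += runnerAverage(b)
	}
	return total / float64(len(batches))
}

func main() {
	// two batches of query times in ms
	fmt.Println(orchestrate([][]float64{{10, 20}, {30, 40}}))
}
```

Note that an average of averages only equals the true overall mean when every batch is the same size, which holds here because the orchestrator hands each runner an identical batch size.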
The orchestrator is called with the following arguments:
– Batch Size
– Total Batches
– Runner Type (DB or Algolia)
– Complexity (Simple or Complex)
We want to make sure we’re being fair to both of the data stores and that we’re playing to their strengths, so we’re going to do a few things to ensure that’s the case. For the Postgres database, we’ll be using Amazon RDS Aurora PostgreSQL with a master instance and a read replica instance, both using db.r5.xlarge instance types (4 vCPU, 32 GB memory).
When creating the records in the database, we also pre-calculate a tsvector (text search vector) for the body of each post and save it against the database record, then search for specific text tokens in the records using tsquery in our WHERE clause. We also create a GIN index to make this text search more efficient.
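A schema fragment for that setup might look like the following. This is illustrative DDL assuming a posts table with a body column; the actual schema lives in the linked project:

```sql
-- Precompute the text-search vector at write time so queries
-- don't pay the tokenization cost on every read
alter table posts
  add column body_tokens tsvector
  generated always as (to_tsvector('english', body)) stored;

-- GIN index so tsquery matches don't scan every row
create index idx_posts_body_tokens on posts using gin (body_tokens);
```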
On the Algolia side, we also denormalize most of the related information onto the index records. For example, we will be searching a parent forum for all posts in all child forums that match the search text, so for each post record on the index, we add all parent forum IDs to the _tags attribute. We will also be adding the actual forum tags to the _tags attribute as well.
We create two queries, one simple and one complex. The simple query doesn’t use any joins and just does a text search on the body of the posts.
In each complex query we randomly select a forum that we want to filter by. Forums that have more sub-forums will be a bit slower to search through but we will run this multiple times to get a good reading of the average query response times for each approach.
Simple Query
select *
from posts
where body_tokens @@ to_tsquery('git')
order by created_at desc
limit 100
Complex Query
with recursive subforums as (
  select id, name, parent_id
  from forums
  where id = @id
  union
  select f.id, f.name, f.parent_id
  from forums f
  inner join subforums sf on sf.id = f.parent_id
)
select p.*
from subforums sf
join posts p on p.forum_id = sf.id
where p.body_tokens @@ to_tsquery('git')
order by p.created_at desc
limit 100
The Algolia search versions of these queries are much simpler, due to the nature of searching and Algolia’s SDK.
Simple Search
params = []interface{}{
	opt.HitsPerPage(100),
}
postIndex.Search("git", params)
Complex Search
params = []interface{}{
	opt.TagFilter(fmt.Sprintf("%v", (*forums)[forumToUse].ID)),
	opt.HitsPerPage(100),
}
postIndex.Search("git", params)
In this table you can see each benchmark and the 4 arguments to the orchestrator function. The average query time was established from taking the top 3 results from 10 runs. The reason for this approach was that both the Postgres database and the Algolia search would occasionally throw outlier results (longer response times) on the first run of any given query but would quickly stabilize after that and return consistent results.
This chart shows a plot of each test for each data store, layered on top of each other. I used a logarithmic scale for the y-axis to demonstrate that both data stores do get slower as the concurrency goes up. As you can see, however, the response time quickly increases in the Postgres database, whereas the Algolia search doesn’t really experience much of a slowdown at all. I also included a version of the chart without the log scale so you can visualize the real difference in response time.
Interestingly, the complexity of the search didn’t have any impact in these particular benchmarks; however, I suspect this was because the cardinality of posts to forums, tags, and users was actually very low in our generated data. If I were to run these tests again, I would generate hundreds or thousands of each to get more realistic results from the increase in rows that needed to be joined.
In this second chart you can see that for Postgres we experience very large response times (~7.5 seconds) as concurrency increases to 1000, whereas Algolia scales easily to that level of search concurrency with response times orders of magnitude lower at ~15ms (0.015 seconds).
If anyone would like to sponsor me to run these benchmarks at much larger scale I’d be more than willing to oblige.
We as developers have a lot of power in the data storage choices we make as we build software. We can and should be employing the CQRS pattern when it makes sense and think out of the box when it comes to storing and querying our data, especially as our products scale and as customers hit our traditional databases harder. By doing so we can drastically improve the user experience and the underlying database costs associated with searching our data. The lift to incorporate a search data store like Algolia into your tech stack is relatively low for the extraordinary quality of life improvement it can introduce.
I hope that this adventure into data storage and querying has been informative and helpful and I look forward to reading your feedback on my approach and anything I may have missed in my benchmarking method.
Daniel Parker
Senior Integrations Engineer @ GitLab