
From the beginning at Algolia, we decided not to place any load balancing infrastructure between our users and our search API servers. We made this choice to keep things simple, to remove any potential single point of failure and to avoid the costs of monitoring and maintaining such a system.

An Algolia application runs on top of the following infrastructure components:

  • a cluster of 3 servers which process both indexing and search queries,
  • some DSN (Distributed Search Network) servers, not to be confused with DNS. These are read-only replicas serving only search queries. Their primary purpose is to provide faster search to people located geographically far away from the main cluster.

Instead of putting hardware or software between our search servers and our users, we chose to rely on the round-robin feature of DNS to spread the load across the servers. Each Algolia application instance is associated with a unique DNS record, which responds in a round-robin fashion with one of the bare metal servers that handles the given Algolia app.
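To make the rotation concrete, here is a minimal Python sketch of round-robin resolution. It is a toy model, not our actual DNS setup: the `RoundRobinDNS` class and the server names are hypothetical, and a real DNS server typically rotates the order of a list of records rather than returning a single one.

```python
from itertools import cycle


class RoundRobinDNS:
    """Toy resolver that hands out one server per resolution, in rotation."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def resolve(self):
        # Each resolution returns the next server in the rotation.
        return next(self._rotation)


# Hypothetical host names for a 3-server cluster.
dns = RoundRobinDNS(["server-1", "server-2", "server-3"])
answers = [dns.resolve() for _ in range(6)]
print(answers)
```

Six consecutive resolutions go through the cluster twice, so each server is handed out the same number of times.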

We consider the most common and optimal usage of Algolia to be a front-end implementation. In this case, mobile devices or laptops communicate directly with our bare metal servers. In such a context, we can assume there will be a large number of DNS resolutions, each leading to only a few search requests. This is the ideal situation for round-robin DNS load balancing: many users resolve the DNS record to reach Algolia servers, and each of them performs a few searches, so the server load follows the round-robin distribution of the DNS answers. Additionally, to force even more frequent DNS resolution, we lowered the DNS TTL to one minute.

In the end, this system was simple. It required no dedicated hardware or software for us to manage, and things went pretty well.

That is, until Black Friday.

DNS-based load balancing limitations

Back-end implementation and uneven load balancing

As mentioned earlier, we strongly recommend that our customers go with front-end search implementations. Several factors motivate this recommendation, one of which is that it makes the most of our DNS-based load balancing system. Yet, this isn’t always doable: some customers have specific constraints, like legacy design or security concerns, which lead them to opt for a back-end implementation. In that case, their back-end servers relay all search queries to our infrastructure.

In this specific context, we already knew that our DNS-based load balancing was suboptimal:

  • Now, a small group of customer servers performs only a few DNS resolutions, and each forwards a considerable number of requests to whichever search server its resolution returned. Instead of 1,000 users making 10 queries each, we now have 1 user making 10,000 queries.
  • As the sessions with our search servers can live longer, the back-end server can send even more requests without needing to re-perform DNS resolution.
  • Sometimes, the customer servers even override our DNS TTL so they can use their DNS cache longer.
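The contrast between the two traffic shapes can be sketched with a small simulation. This is a simplified model under stated assumptions: every client resolves DNS once and reuses that answer for all of its queries, and the `simulate` function and server names are hypothetical.

```python
from itertools import cycle

SERVERS = ["server-1", "server-2", "server-3"]  # hypothetical names


def simulate(clients, queries_per_client):
    """Count queries per server when each client reuses one DNS answer."""
    rotation = cycle(SERVERS)
    load = {server: 0 for server in SERVERS}
    for _ in range(clients):
        server = next(rotation)             # one DNS resolution per client
        load[server] += queries_per_client  # every query reuses that answer
    return load


# Front-end: 1,000 users making 10 queries each.
front_end = simulate(clients=1000, queries_per_client=10)
# Back-end: 1 server making 10,000 queries.
back_end = simulate(clients=1, queries_per_client=10_000)
print(front_end)
print(back_end)
```

Both runs send 10,000 queries in total. In the front-end case the three servers end up within 10 queries of each other, while in the back-end case a single server receives everything.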

That said, the main focus we had when we designed our infrastructure was resilience. This means that, for most customers, a single cluster node can handle all the search load. Consequently, an uneven load across the cluster nodes wouldn’t have any impact on the search experience.

DSN for horizontal scaling

Initially, the DSNs were introduced to increase performance for users who perform search requests far away from the main cluster, by bringing read-only servers near them. Yet, we soon realized that it was also an easy way to bring more search capacity in a given region, by scaling the servers horizontally to absorb more search requests.

The Black Friday Incident

We had a big customer with a back-end implementation for which the load was too big to be handled by a single server. We had already deployed many DSNs in addition to the cluster, all in the same region, to absorb the search load coming from their back-end servers.

Yet, when Black Friday arrived, they started to experience an increased number of search queries. Even though we had sized the infrastructure to absorb the load, they ended up with slow search queries and even some failing ones. For end users, this meant a highly degraded search experience with increased latency, at a time of the year when you expect an e-commerce website to be highly performant.

The load was uneven: the servers available on our side to handle their requests outnumbered the servers on their side sending requests. With our DNS-based load balancing, even in the best-case scenario, each of their servers would choose one of ours and stick to it for a few minutes, overloading it and leaving a few others not used at all.
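The arithmetic behind this imbalance is simple enough to sketch. The sizes below are hypothetical (the actual server counts are not given here), and the random-pick estimate assumes each client chooses a server uniformly and independently, which is a simplification of how DNS rotation and caching behave.

```python
def best_case_servers_used(clients, servers):
    """Every pinned client lands on a different server: the most even outcome."""
    return min(clients, servers)


def expected_servers_used(clients, servers):
    """Expected number of distinct servers hit if each client picked one
    uniformly at random (a simplifying assumption, not real DNS behavior)."""
    return servers * (1 - (1 - 1 / servers) ** clients)


# Hypothetical sizes: 4 customer back-end servers, 10 search servers.
clients, servers = 4, 10
used = best_case_servers_used(clients, servers)
idle = servers - used
print(f"best case: {used} servers loaded, {idle} idle")
print(f"random picks: about {expected_servers_used(clients, servers):.2f} servers loaded")
```

As long as clients stay pinned to one server each, adding more search servers than there are clients only adds idle capacity.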

This made us reconsider our DNS-based load balancing method, at least in this specific use case, which combines a heavy search load with a back-end implementation.

Here comes the Load Balancer

First iteration

To solve the issue during Black Friday, we went for a quick fix and deployed a rudimentary load balancer. We leveraged Nginx and its ability to proxy requests and load balance them across a group of upstream servers (in our case, the Algolia servers).

We saved the day, and the traffic was evenly load balanced. This confirmed we needed such a system in some cases. Yet, at this point, it was more a workaround than an actual long-term solution. The whole thing was mainly static, with customer-specific parameters hardcoded in the Nginx configuration. This situation raised many questions:

  • How to make such a system customer-agnostic?
  • How to dynamically target the right group of search API servers for a given incoming request?
  • How to make it handle our daily infrastructure operations like changing, adding, or removing servers over time?

Second iteration

For the second iteration, the focus was to find a way to make the load balancer generic. The primary challenge was to dynamically build the list of upstream servers able to serve an incoming request. To solve this kind of issue, you can think of two opposite approaches:

  • either the load balancers know in advance all the information they need to operate,
  • or they learn what they need to know when they handle the incoming requests.

We went for the second solution, mostly because the amount of data we would otherwise have to go through for each request was too large to keep search latency low. We implemented a slow learning workflow to keep everything as simple as possible and to avoid managing a large, complicated distributed data store.

Each time the load balancer receives a request from a customer it doesn’t already know about, it goes through a slower process to get the list of upstream servers associated with this customer. All the following requests for the same customer are handled much faster, as they then fetch the needed upstream information directly from the local cache.
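The learn-on-first-request workflow can be sketched in a few lines of Python. This is an illustration only: the `UpstreamResolver` class, the `fake_internal_lookup` function, and the data it returns are hypothetical, and a plain dictionary stands in for the local cache.

```python
class UpstreamResolver:
    """Learn a customer's upstream servers on first request, then cache them."""

    def __init__(self, slow_lookup):
        self._slow_lookup = slow_lookup  # stands in for the slower process
        self._cache = {}                 # stands in for the local cache
        self.slow_lookups = 0

    def upstreams_for(self, customer_id):
        if customer_id not in self._cache:
            # Slow path: first request seen for this customer.
            self.slow_lookups += 1
            self._cache[customer_id] = self._slow_lookup(customer_id)
        # Fast path: every later request is served from the cache.
        return self._cache[customer_id]


def fake_internal_lookup(customer_id):
    # Hypothetical data source returning made-up server names.
    return [f"{customer_id}-1", f"{customer_id}-2", f"{customer_id}-3"]


resolver = UpstreamResolver(fake_internal_lookup)
first = resolver.upstreams_for("app-a")   # goes through the slow path
second = resolver.upstreams_for("app-a")  # answered from the cache
print(first, resolver.slow_lookups)
```

Only the first request for a given customer pays the lookup cost, which is why the load balancer needs no customer-specific configuration ahead of time.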

We tried several technical solutions to achieve this:

  • HAProxy offers Lua support for dynamic configuration, but from what we tested, it was too limited for our use case.
  • Envoy was (and still is) quite promising, but the learning curve is pretty steep, and even though we managed to build a working PoC, its current load balancing algorithms are too restrictive for our long-term vision.
  • We tried to build a custom load balancer in Go. The PoC worked fine, but it is difficult to assess the security and performance of such a solution on our own. It’s also a lot harder to maintain.
  • We finally tried OpenResty, which is Nginx-based and lets you run custom Lua code at different steps of request processing. It has a well-developed community, there are a bunch of available modules, either official or community-driven, and the documentation is good.

We decided to go with OpenResty. We combined it with Redis for the caching part, as OpenResty offers a convenient module to interact with Redis.

With this iteration, we managed to make our load balancer more scalable and easier to maintain by finding mechanisms to remove any static configuration from it. Still, a few things were missing to make it production-ready:

  • How to make sure it correctly and transparently handles upstream server failures?
  • How to make sure we can still make changes to the infrastructure, as we do daily?
  • What happens if it can no longer access our internal API?

Third (and current) iteration

In the third and latest implementation, we introduced some mechanisms to make the whole system more failure-proof.

In addition to OpenResty handling the load balancing logic, and Redis caching the dynamic data, we added lb-helper, a custom Go daemon.

The complete load balancer now looks like this:

(Diagram of the load balancer: OpenResty, Redis, and the lb-helper daemon.)

The lb-helper daemon has two different roles:

  • Abstract our internal API. OpenResty learns about the upstream servers through the local lb-helper, which periodically fetches data from our internal API. If the load balancer fails to connect to our internal API, it can still operate with potentially slightly outdated data.
  • Manage failures. Each time an upstream server fails more than 10 times in a row, we consider it down and remove it from the active cache. From there, the lb-helper probes the down upstream to check whether it’s back or not.
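The failure-management role can be illustrated with a short Python sketch. It is not the lb-helper code (which is a Go daemon): the `UpstreamHealth` class, its method names, and the server names are hypothetical, and only the threshold of more than 10 consecutive failures comes from the description above.

```python
FAILURE_THRESHOLD = 10  # a server is down after MORE than 10 failures in a row


class UpstreamHealth:
    """Track consecutive failures and move servers between active and down."""

    def __init__(self, servers):
        self.active = set(servers)
        self.down = set()
        self._consecutive_failures = {server: 0 for server in servers}

    def record_success(self, server):
        # Any success breaks the streak of consecutive failures.
        self._consecutive_failures[server] = 0

    def record_failure(self, server):
        self._consecutive_failures[server] += 1
        if self._consecutive_failures[server] > FAILURE_THRESHOLD:
            self.active.discard(server)
            self.down.add(server)

    def probe_down_servers(self, is_healthy):
        """Re-activate each down server for which is_healthy(server) is true."""
        for server in list(self.down):
            if is_healthy(server):
                self.down.remove(server)
                self.active.add(server)
                self._consecutive_failures[server] = 0


health = UpstreamHealth(["server-1", "server-2"])
for _ in range(11):
    health.record_failure("server-2")
print(sorted(health.active), sorted(health.down))

# Hypothetical probe that reports every server as healthy again.
health.probe_down_servers(lambda server: True)
print(sorted(health.active), sorted(health.down))
```

Resetting the counter on every success is what makes the rule apply to failures in a row rather than to failures overall, so a server with occasional errors stays in rotation.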

Bottom line

Today, we still mainly rely on our DNS-based load balancing, as it fits 99% of our use cases. That said, we’re now also aware that this approach has some limitations in certain situations, such as customers with back-end implementations combined with a heavy search load. In such a context, deploying a set of our load balancers brings back an even load on the search infrastructure.

Figure: Requests per second distribution over time for a set of servers, first without, then with a load balancer.

Also, these experiments showed us that we built much more than a simple load balancing device. It brings an abstraction layer on top of our search infrastructure, making failures, infrastructure changes or scaling almost fully transparent to our customers.

We’re currently working on the fourth iteration, in which we’re attempting to replace the current round-robin algorithm with a latency-based one. The long-term plan is to check whether we can bring a worldwide abstraction layer on top of our search infrastructure. Yet, trying to go global at this scale brings a new set of constraints. That’s a topic for another blog post!

About the author
Paul Berthaux

Sr. Site Reliability Engineer
