
The simplicity of Heroku, for a team consisting only of developers, made it easy for us to get our prototype into production. However, as our product matured and customer expectations grew, we needed more robustness and fine-grained control over our infrastructure. We knew that Kubernetes was the right choice for us. However, the migration was not a simple task.

Here’s the backstory. About a year ago, we decided to prototype a web crawler for Algolia. As you may know, Algolia users make their content searchable by uploading their data to a searchable index, using Algolia’s Search API. Several potential customers asked if we could populate their search index for them, automatically, by crawling their website’s content. In response, we quickly built a web crawler prototype in Node.js and deployed it to Heroku.

We were not disappointed: adding services such as a database server or RabbitMQ was just a single click away. All we had to do was git push to deploy new versions, and our prototype went to production.

In a few months, our web crawler became popular among Algolia customers, and many others started to express a need for it.

However, the crawler required lots of components – some running in the background, others on demand; additionally, some customers required customized components. As the product grew more complex, we asked for help from our infrastructure colleagues.

A good example of this complexity is with IP Whitelisting. One of our customers wanted us to crawl from a fixed IP address so that they could whitelist that IP for high-rate crawling without being throttled by their load balancer. Only two engineers were developing the crawler, so we asked other colleagues to set up an HTTP proxy with a fixed IP address. Yet, as the number of customers grew, many more started asking for the same thing, and our infrastructure team told us it was time for us to take care of it ourselves.

Therefore, we decided to move to a cloud platform that would provide more control over our infrastructure, and eventually allow us to programmatically set up and tear down proxies. That’s how we decided it was time for us to migrate from Heroku to Google Kubernetes Engine (GKE). As a first step, we wanted to get our crawler to work on a GKE cluster, with as few codebase changes as possible. Only then would we make it more robust and maintainable for production.

This was far from being as straightforward as we had initially thought.

In this article, we describe the architecture of our crawler and explain how we made it run on GKE, sharing three challenges that we tackled while migrating. We then conclude with a few lessons learned and benefits brought about by the migration.

Setting things up

Before we go down the rabbit hole, let's start with an overview of the web crawler – its architecture, its underlying services, and how we run it both locally and in production.

The crawler itself is a set of three components:

  • the worker is responsible for fetching a web page, extracting information from its HTML content, and storing this information into an Algolia index;
  • the manager is responsible for dispatching URLs to be crawled to workers, with rules and constraints associated with each customer (e.g., rate limiting), as well as any configuration updates that they may have requested;
  • the web server is responsible for handling API requests addressed to the crawler (e.g., from the Algolia dashboard) and serving its own management and monitoring dashboard.

These components sit on top of several services:

  • a RabbitMQ queue that holds the list of URLs to crawl;
  • a PostgreSQL database that holds the state of the crawler, like customer configurations, a list of URLs, and external data to improve the relevance of searched records;
  • a Redis store that holds the user sessions of our dashboard;
  • a Tika server that acts as a proxy to extract content from PDF files and other types of non-HTML documents;
  • and a Rendertron server that acts as a proxy to extract content from single-page applications that require the execution of JavaScript code to render their content in the DOM.

To run all these components and services locally while developing the crawler, we set up a docker-compose file that specifies Docker images and parameters for all of them.
On Heroku, we activated add-ons for each service and wrote a Procfile to specify what command should be executed to start each component. Then, by simply executing git push heroku master, we ensured that the latest versions of our components would be automatically uploaded and started in our Heroku dynos. It was a breeze.
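For illustration, here is a trimmed-down sketch of that kind of docker-compose file; the images, versions, and environment values are assumptions, not our exact setup.

```yaml
# Sketch of a local development stack; images, versions, and
# environment values are illustrative.
version: "3"
services:
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "5672:5672"    # AMQP
      - "15672:15672"  # management UI
  postgres:
    image: postgres:11
    environment:
      POSTGRES_PASSWORD: example
      POSTGRES_DB: crawler
  redis:
    image: redis:5
  worker:
    build: .
    command: node worker.js
    environment:
      AMQP_URL: amqp://rabbitmq
      DATABASE_URL: postgres://postgres:example@postgres/crawler
      REDIS_URL: redis://redis
    depends_on:
      - rabbitmq
      - postgres
      - redis
```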

Kubernetes is a system that schedules containers, grouped into pods, based on developer-defined deployments and services. Our first goal was to make our components and services run the same way as in our existing docker-compose.yaml file. In theory, we would only have to convert that file to the Kubernetes format and find the right commands to start everything.

After spending a few hours trying to do that with kompose, without much success, we decided to ask for help. A coworker helped us in three ways: setting up a cluster on GKE, providing us with examples of Kubernetes definition files for deployments and services, and recommending that we use services managed by Google (namely, PubSub and CloudSQL) instead of running our own RabbitMQ and PostgreSQL Docker containers as pods. This was all excellent advice, but it came too soon. To better understand how Kubernetes works, and to feel more confident with it, we decided to solve one problem at a time: first, get our services to run in containers by mirroring our docker-compose definition, and only then consider replacing them with Google-managed services.

We therefore started writing Kubernetes definition files for each service.

Implementation – Let’s first get Kubernetes to run

We defined them like this:
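The snippet below is a simplified reconstruction of that shape rather than one of our actual files; the name, image, and replica count are assumptions.

```yaml
# deployment-worker.yaml -- illustrative sketch only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
        - name: worker
          image: gcr.io/our-project/crawler:latest  # hypothetical image name
          command: ["node", "worker.js"]
```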

To summarize:

  • a deployment is the description of a piece of software that can be deployed, run on a given number of instances, and stopped;
  • a service exposes a deployment so that it can handle requests from other parts of the system.

For example, to run RabbitMQ on Kubernetes (see the sketch after this list), we need to:

  • define a deployment by specifying a Docker image that runs a RabbitMQ server;
  • and define a service that exposes two ports: one for AMQP queries, and an optional one to serve the management UI.
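Here is a minimal sketch of those two definitions, assuming a current apiVersion and illustrative labels (not our production files):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rabbitmq
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
        - name: rabbitmq
          image: rabbitmq:3-management
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
spec:
  selector:
    app: rabbitmq
  ports:
    - name: amqp
      port: 5672
    - name: management-ui
      port: 15672
```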

We defined our crawler's components the same way, as deployments, apart from the web server, which needed to be defined as a service. Because these components aren't available as public images on Docker Hub, we also had to write a Dockerfile to generate an image from our source code, upload that image to our Google container registry, and then refer to its identifier from our three deployments. To do that, we had to learn how to use the gcloud and kubectl command-line interface (CLI) tools.
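In practice, the publish workflow boiled down to something like the following; the project and image names are placeholders, and the exact registry-authentication command has changed over time.

```bash
# Build the crawler image, push it to the registry, and deploy.
docker build -t gcr.io/our-project/crawler:v42 .
gcloud auth configure-docker                # let Docker push to gcr.io
docker push gcr.io/our-project/crawler:v42
kubectl apply -f deployment-worker.yaml     # the deployment references the pushed image
```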

After defining our deployments and services in YAML files, we needed them to connect to each other. For example, our three crawler components expect environment variables containing the URLs of all the services they need to connect to. On Heroku, we had a list of global environment variables shared by all our dynos. We could edit them from the Heroku dashboard or through the CLI. That said, most of our add-ons (e.g., the managed PostgreSQL database) automatically set environment variables to provide direct access to their data, so we didn't need to do much.

In the Kubernetes world, environment variables are set at the deployment level, which means each deployment file should contain the values of the environment variables it needs. Furthermore, because Kubernetes can kill and restart pods on different nodes (i.e., physical machines of the cluster) at any time, their IP addresses and ports can change. Consequently, we can't hard-code those values into our components' environment variables.

Fortunately, we learned that Kubernetes automatically injects environment variables describing every service into each container, named <SERVICE-NAME>_SERVICE_HOST and <SERVICE-NAME>_SERVICE_PORT. We also discovered that it was possible to inject the value of one environment variable into another, using the following YAML syntax:
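The snippet below is a reconstruction of that pattern rather than our exact file; the variable and service names are illustrative.

```yaml
env:
  - name: AMQP_URL
    # RABBITMQ_SERVICE_HOST and RABBITMQ_SERVICE_PORT are generated by
    # Kubernetes for a Service named "rabbitmq".
    value: "amqp://$(RABBITMQ_SERVICE_HOST):$(RABBITMQ_SERVICE_PORT)"
```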

Confidential environment variables, like passwords, needed a different process. For that, we used Kubernetes Secrets.

Secrets are Kubernetes entities that can hold confidential values. They are recommended for storing passwords, certificates, and any other kind of private information: these values are never added in plain text to YAML files, and accessing them requires special permissions.
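For reference, a Secret itself is declared with base64-encoded values, along these lines (the name and key are made up for the example):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: crawler-secrets                   # hypothetical name
type: Opaque
data:
  postgres-password: cGxhY2Vob2xkZXI=     # base64-encoded value
```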

To be stored as environment variables, Secrets must also be declared in the YAML files of the deployments where they are required. Yet, they aren’t structured the same way as environment variables: we needed to mount a volume to load Secrets, then import their values as environment variables.

We later learned that it was possible to share environment variables across several deployments using ConfigMaps. These are Kubernetes entities that can hold several named values and be imported as environment variables into deployments.
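A ConfigMap is declared like this (the name and values are illustrative); a deployment can then import the whole map as environment variables with envFrom and configMapRef, or a single key with valueFrom and configMapKeyRef.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: crawler-config           # hypothetical name
data:
  LOG_LEVEL: "info"
  CRAWL_RATE_LIMIT: "10"
```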

Using ConfigMaps prevented us from duplicating configuration, but we couldn't find any way to include, in a ConfigMap, Secrets or any other entries that wrap the value of other environment variables (e.g., using the $() syntax seen above). Therefore, we ended up using ConfigMaps for invariable configuration values, Secrets for passwords and keys, and inline environment variable definitions for values that depend on other environment variables.
Furthermore, as we wanted our YAML files to provision two different clusters (namely, production and staging) on separate domain names, we ended up turning some of them into templates and writing a script that rendered them into final YAML files using sed. We're pretty sure there's a more standard way to achieve this, but this method was a good compromise for us, given the time we were able to spend on this migration.
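To give an idea, the rendering step amounted to little more than this kind of substitution; the file and variable names are made up.

```bash
# Render a YAML template into a final, cluster-specific file.
export TARGET_DOMAIN="crawler-staging.example.com"   # placeholder value
sed "s/{{TARGET_DOMAIN}}/${TARGET_DOMAIN}/g" ingress.template.yaml > ingress.yaml
```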

At that point, we had written 10 YAML files and 5 bash scripts to define our components and services. We were finally ready to provision our GKE cluster and see them run. The command to upload our YAML files and get them running on our cluster was kubectl apply -f ., run from the directory containing those files.

To give you an example of the kind of scripts we wrote, here is the list of commands we were running to restart all components after deploying an update that may contain database migrations:
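The exact script isn't reproduced here, but it boiled down to a sequence of this kind; the deployment names and the pod-deletion trick are assumptions on our part.

```bash
# Upload the updated definitions, then force each component to restart.
kubectl apply -f .
for deployment in manager worker web; do     # assumed deployment names
  kubectl delete pods -l app=$deployment     # pods are recreated by the deployment
  kubectl rollout status deployment/$deployment
done
```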

Not so fast. A colleague warned us that, in order to make our dashboard accessible from the Internet, we had to define an Ingress resource to connect our “web” Service to Google’s HTTP Load Balancer.

He provided an example that ended up looking like this:
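The definition below is a reconstruction along those lines rather than his exact file; the host name is a placeholder, and the apiVersion shown is the current one (older clusters used extensions/v1beta1).

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
    - host: crawler.example.com        # placeholder domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web              # our "web" Service
                port:
                  number: 80           # assumed port
```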

After a few minutes, our dashboard was finally live!

Unfortunately, we couldn’t log in using our Single Sign-on system before enabling HTTPS access to that endpoint. Let’s dig into that.

Wire an SSL certificate

In the Heroku world, enabling HTTPS/SSL is a piece of cake. All you have to do is click a button.

Heroku would automatically generate a free SSL certificate using Let's Encrypt, reply to the ACME challenge, and repeat that process every 3 months to renew the certificate, all without us even noticing. It just worked.

We hoped that Google would also provide an easy way to set that up on our GKE cluster. Think again! GKE's documentation clearly states that, while it's possible to associate an SSL certificate with Google's Load Balancer through kubectl or the Google Cloud Console, it doesn't provide a way to generate one.

We searched for solutions and found several projects that promised to generate and renew an SSL certificate for a GKE cluster and automatically associate it with our load balancer. Unfortunately, all of them included disclaimers such as “do not use in production” or “we do not currently offer strong guarantees around our API stability”. Therefore, we decided that, until Google provided a reliable way to do this automatically, we would manually generate a Let's Encrypt certificate and attach it to our load balancer. The only catch: we needed to remember to do this every few months.
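In practice, the manual routine looked something like this; the domain is a placeholder, and storing the certificate as a TLS Secret referenced from the Ingress is one possible way to attach it to the load balancer.

```bash
# Generate a certificate with Let's Encrypt (interactive challenge).
certbot certonly --manual -d crawler.example.com

# Store it as a TLS secret that the Ingress can reference in its tls: section.
kubectl create secret tls web-tls \
  --cert=/etc/letsencrypt/live/crawler.example.com/fullchain.pem \
  --key=/etc/letsencrypt/live/crawler.example.com/privkey.pem
```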

Our crawler was fully functional at that point. The only remaining concern before migrating was the potential loss of data from our PostgreSQL database, as it was still running in a Docker container, without a persistent volume or any backup routine.

Disclaimer: Since our migration, other solutions have made that process easier. We haven’t tried them yet.

Plug into a managed database

Data is a serious matter. Our customers rely on us, so their data should be available at all times, stay responsive at any scale, never leak, and be quickly recoverable in case of an incident. These are all excellent reasons to trust a managed database service instead of running the database ourselves in a Docker container that Kubernetes can kill at any time.

As part of the Google Cloud ecosystem, CloudSQL had recently promoted its managed PostgreSQL service to “production-ready” status, so this was a no-brainer for us: we would plug our crawler into that service. Our colleagues told us that we would have to integrate a so-called CloudSQL Proxy to connect to a CloudSQL-managed PostgreSQL server.

To do this, we followed a tutorial provided by Google. In addition to replacing our PostgreSQL Service with a CloudSQL Proxy Service (see the sketch after this list), we had to:

  • create a database user;
  • store that user’s password securely as a Kubernetes Secret;
  • create a service account to access the CloudSQL instance from our components;
  • load the secret as a dynamic environment variable in all our deployments that needed to connect to the database.
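Put together, the replacement looked roughly like the sketch below: a proxy deployment plus a Service that our components connect to as if it were PostgreSQL. The instance connection name, Secret name, and proxy version are illustrative assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudsql-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudsql-proxy
  template:
    metadata:
      labels:
        app: cloudsql-proxy
    spec:
      containers:
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.16   # version is illustrative
          command:
            - /cloud_sql_proxy
            - -instances=our-project:europe-west1:crawler-db=tcp:0.0.0.0:5432
            - -credential_file=/secrets/cloudsql/key.json
          volumeMounts:
            - name: cloudsql-credentials
              mountPath: /secrets/cloudsql
              readOnly: true
      volumes:
        - name: cloudsql-credentials
          secret:
            secretName: cloudsql-service-account   # hypothetical Secret
---
apiVersion: v1
kind: Service
metadata:
  name: postgres       # keeps the name our components already expect
spec:
  selector:
    app: cloudsql-proxy
  ports:
    - port: 5432
```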

Despite the help we got, it was not straightforward to integrate this into our system. The tutorial provided by Google explained how to run that proxy in a “sidecar” fashion, meaning that the CloudSQL Proxy would run in the same Pod as the application itself, making it easy for the application to connect to the proxy.

In our system, we have three separate components that need to access the same database, and we felt that attaching a separate CloudSQL Proxy to each of them would have been overkill and harder to maintain. Therefore, we had to take our time to better understand how to configure the deployments. On top of that, it’s sometimes necessary to access our production database from outside the cluster (e.g., from our development laptops for debugging purposes). Since all database connections must go through CloudSQL Proxy, we had two options:

  • connect through the CloudSQL Proxy that is running in our production cluster;
  • or set up a local CloudSQL Proxy with one dedicated service account per developer.

For security reasons, we picked the second solution. After downloading the JSON key associated with the service account we created for each of us, here’s how we were able to run CloudSQL Proxy on our laptops:
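The command was of this form; the project, region, instance, and key file names are placeholders.

```bash
# Run the CloudSQL Proxy locally, authenticated with a per-developer
# service account key, and expose PostgreSQL on localhost:5432.
./cloud_sql_proxy \
  -instances=our-project:europe-west1:crawler-db=tcp:5432 \
  -credential_file=./my-service-account-key.json
```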

If you decide to follow that route, make sure that you do not keep the JSON key associated with your service account on your laptop. We recommend using a system like Vault to either store these keys more securely, or even generate a new, short-lived key with every connection.

Disclaimer: It’s now possible to access a CloudSQL database directly from GKE. We haven’t tried this yet.

Conclusion

The migration took time, despite some of the shortcuts we took, such as first mirroring our system's Docker containers to reach a functionally equivalent state, and only then replacing our database container with a managed solution. In the end, we're glad to have taken a step-by-step approach. It allowed us to better understand how Kubernetes works and to master the tools we use to maintain our GKE cluster. It also prevented us from having to tackle more than one problem at a time, which could have become a major source of frustration and demotivation, given the complexity of GKE and the countless ways problems can be addressed in that ecosystem.

It took us several weeks to get our crawler up and running sustainably on GKE, and to finally shut down our Heroku dynos and add-ons. While the migration was more tedious and less straightforward than we had anticipated – despite the help we got from colleagues with experience in Kubernetes and Google Cloud products – at the end of the day, we are satisfied with what it brings to the table. For example:

  • We now have a better understanding of each of our components’ hardware requirements and their behaviors in isolation. Migrating to Kubernetes has made them more robust on a dynamic infrastructure (e.g., nodes can be turned off and re-allocated at any time, and at different IP addresses).
  • We are currently discovering how to horizontally scale the number of replicas of our “worker” deployments automatically and in a more efficient way than we could have done on Heroku.
  • We are confident that we will be able to set up static-IP proxy servers programmatically on our cluster, based on our customers' needs.

The good news is that recent evolutions in Google Kubernetes Engine have addressed some of our difficulties and made this process easier.

By the way, we’re organizing an event about Kubernetes on March 7th in our Paris office with some Google experts. If you’re in Europe at that time, feel free to register!

The author would like to thank Rémy-Christophe Schermesser, Sarah Dayan, Peter Villani, Angélique Fey and Tiphaine Gillet for their contribution to this article. And all the colleagues who helped proofread it. ❤️

About the author
Adrien Joly

Senior Software Engineer

