Search by Algolia
What is online retail merchandising? An introduction
e-commerce

What is online retail merchandising? An introduction

Done any shopping on an ecommerce website lately? If so, you know a smooth online shopper experience is not optional ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

5 considerations for Black Friday 2023 readiness
e-commerce

5 considerations for Black Friday 2023 readiness

It’s hard to imagine having to think about Black Friday less than 4 months out from the previous one ...

Piyush Patel

Chief Strategic Business Development Officer

How to increase your sales and ROI with optimized ecommerce merchandising
e-commerce

How to increase your sales and ROI with optimized ecommerce merchandising

What happens if an online shopper arrives on your ecommerce site and: Your navigation provides no obvious or helpful direction ...

Catherine Dee

Search and Discovery writer

Mobile search UX best practices, part 3: Optimizing display of search results
ux

Mobile search UX best practices, part 3: Optimizing display of search results

In part 1 of this blog-post series, we looked at app interface design obstacles in the mobile search experience ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Mobile search UX best practices, part 2: Streamlining search functionality
ux

Mobile search UX best practices, part 2: Streamlining search functionality

In part 1 of this series on mobile UX design, we talked about how designing a successful search user experience ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Mobile search UX best practices, part 1: Understanding the challenges
ux

Mobile search UX best practices, part 1: Understanding the challenges

Welcome to our three-part series on creating winning search UX design for your mobile app! This post identifies developer ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Teaching English with Zapier and Algolia
engineering

Teaching English with Zapier and Algolia

National No Code Day falls on March 11th in the United States to encourage more people to build things online ...

Alita Leite da Silva

How AI search enables ecommerce companies to boost revenue and cut costs
ai

How AI search enables ecommerce companies to boost revenue and cut costs

Consulting powerhouse McKinsey is bullish on AI. Their forecasting estimates that AI could add around 16 percent to global GDP ...

Michelle Adams

Chief Revenue Officer at Algolia

What is digital product merchandising?
e-commerce

What is digital product merchandising?

How do you sell a product when your customers can’t assess it in person: pick it up, feel what ...

Catherine Dee

Search and Discovery writer

Scaling marketplace search with AI
ai

Scaling marketplace search with AI

It is clear that for online businesses and especially for Marketplaces, content discovery can be especially challenging due to the ...

Bharat Guruprakash

Chief Product Officer

The changing face of digital merchandising
e-commerce

The changing face of digital merchandising

This 2-part feature dives into the transformational journey made by digital merchandising to drive positive ecommerce experiences. Part 1 ...

Reshma Iyer

Director of Product Marketing, Ecommerce

What’s a convolutional neural network and how is it used for image recognition in search?
ai

What’s a convolutional neural network and how is it used for image recognition in search?

A social media user is shown snapshots of people he may know based on face-recognition technology and asked if ...

Catherine Dee

Search and Discovery writer

What’s organizational knowledge and how can you make it accessible to the right people?
product

What’s organizational knowledge and how can you make it accessible to the right people?

How’s your company’s organizational knowledge holding up? In other words, if an employee were to leave, would they ...

Catherine Dee

Search and Discovery writer

Adding trending recommendations to your existing e-commerce store
engineering

Adding trending recommendations to your existing e-commerce store

Recommendations can make or break an online shopping experience. In a world full of endless choices and infinite scrolling, recommendations ...

Ashley Huynh

Ecommerce trends for 2023: Personalization
e-commerce

Ecommerce trends for 2023: Personalization

Algolia sponsored the 2023 Ecommerce Site Search Trends report which was produced and written by Coleman Parkes Research. The report ...

Piyush Patel

Chief Strategic Business Development Officer

10 ways to know it’s fake AI search
ai

10 ways to know it’s fake AI search

You think your search engine really is powered by AI? Well maybe it is… or maybe not.  Here’s a ...

Michelle Adams

Chief Revenue Officer at Algolia

Cosine similarity: what is it and how does it enable effective (and profitable) recommendations?
ai

Cosine similarity: what is it and how does it enable effective (and profitable) recommendations?

You looked at this scarf twice; need matching mittens? How about an expensive down vest? You watched this goofy flick ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

What is cognitive search, and what could it mean for your business?
ai

What is cognitive search, and what could it mean for your business?

“I can’t find it.”  Sadly, this conclusion is often still part of the modern enterprise search experience. But ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Looking for something?

Computing Statistics on Terabytes of Data on Multiple Machines
facebookfacebooklinkedinlinkedintwittertwittermailmail

At one point, every company needs to compute some basic statistics on its data. I’m not speaking about advanced statistics but simple ones, like means, top values, etc. The algorithms to compute those metrics are fairly straightforward, but can take a lot of memory. So what happens when the amount of data is so big that it can’t fit in a reasonable amount of RAM?

Today at Algolia we generate 2 TB of logs per day, so computing metrics on this doesn’t fit into any of our machines. Furthermore, the data team would like to have those metrics in pseudo real time, so we need to process the logs in real time.

How to compute metrics on terabytes of data

For some of our data, it was OK not to have an exact value but an approximation. For example, instead of having an exact average of 3.9, we get 4.0, and that’s ok. Thankfully, some algorithms and data structures are able to lower their precision to have a lower memory footprint. Those data structures are what we call probabilistic data structures (or algorithms), and they all share some properties:

  • Have a small memory footprint — we are talking KB of memory instead of TB.
  • Usable in a single pass: each piece of data is processed once by the algorithm.
  • No statistical hypothesis on the data being processed.
  • Have a precision within an acceptable error margin. Most of the time the margin is a parameter of the algorithm.

We’ll see later how they work.

How to compute metrics on a stream of events

As we have logs arriving continuously, the team thought that we could leverage this and compute metrics on the fly. There are algorithms designed to process data on the fly, and they are called streaming algorithms. Sure, the name is fancy, but we code them every day. Let’s take an example.

Let’s say we have an array of integers, and we would like to compute the sum of each element. In pseudo code we would probably do something like:

Now imagine that our array is infinite (what we call a stream). Our code won’t change, and at any point in time, the variable sum would contain the sum of all elements we saw. This is a streaming algorithm: an algorithm that can process an infinite number of elements and that is able to give a coherent state after each iteration.

Bonus point: most probabilistic data structures are streaming algorithms.

How to compute metrics across multiple machines

We are able to compute simple metrics with a small amount of RAM and from a stream of data. But what if your stream is so big that one machine can’t handle it? As we found a way to reduce the memory footprint, couldn’t we find another trick for CPU?

One simple option would be to split the workload across multiple CPUs. For this, we will use a mathematical property. Yes, math. Bear with me a few seconds, you won’t be disappointed.

Let’s say you have a set of elements. On this set you have an operation, let’s call it +, that takes 2 elements of this set and gives back an element of the same set. We say that this set and this operation is a monoid if:

  • For any a, b, c that are elements of this set, this holds: (a + b) + c = a + (b + c)
  • There exists an element e of this set where: e + a = a + e = a

Let’s take some examples:

  • Integers and addition, where e is zero
  • Strings and concatenation, where e is the empty string
  • Lists and concatenation, where e is the empty list
  • Booleans and &&, where e is True

Why did I bother you with this?

Let’s take a simple example, with integers and addition. If you want to sum 1, 2, 3 & 4, you can ask Google or you can spread this sum on multiple CPUs, let’s say 2. Because this is a monoid you know you can do sub additions, for example: 1+2+3+4 = (1+2) + (3+4). So, you can ask the first CPU to compute 1+2, the second one 3+4, and only then sum those sub sums. And voilà, we have our final sum, which is 10.

Therefore, if some set and operation validates 2 properties, we have a sound way to spread the computation on multiple CPUs or machines.

Bonus point: most probabilistic data structures and streaming algorithms are monoids.

Conclusion

Algorithms like the ones above are not new — far from it. The first probabilistic data structure was invented in 1985 by Philippe Flajolet (hence the pf* commands in redis). Monoids are even older.

Happily for us, a lot of these algorithms are already implemented in big data software (think Spark, Apache Beam, Algebird, etc.).

The main thing we should remember is that some simple mathematical properties give us many nice coding features. Perhaps now you feel sad if you slept through your math classes :).

About the author
Rémy-Christophe Schermesser

Staff Software Engineer

Recommended Articles

Powered byAlgolia Algolia Recommend

Harry Logger and the Metrics’ Stone
engineering

Rémy-Christophe Schermesser

Staff Software Engineer

Algolia's top 10 tips to achieve highly relevant search results
product

Julien Lemoine

Co-founder & former CTO at Algolia

Inside the Algolia Engine Part 2 — The Indexing Challenge of Instant Search
engineering

Julien Lemoine

Co-founder & former CTO at Algolia