Search by Algolia
Removing outliers for A/B search tests
engineering

Removing outliers for A/B search tests

How do you measure the success of a new feature? How do you test the impact? There are different ways ...

Christopher Hawke

Senior Software Engineer

Easily integrate Algolia into native apps with FlutterFlow
engineering

Easily integrate Algolia into native apps with FlutterFlow

Algolia's advanced search capabilities pair seamlessly with iOS or Android Apps when using FlutterFlow. App development and search design ...

Chuck Meyer

Sr. Developer Relations Engineer

Algolia's search propels 1,000s of retailers to Black Friday success
e-commerce

Algolia's search propels 1,000s of retailers to Black Friday success

In the midst of the Black Friday shopping frenzy, Algolia soared to new heights, setting new records and delivering an ...

Bernadette Nixon

Chief Executive Officer and Board Member at Algolia

Generative AI’s impact on the ecommerce industry
ai

Generative AI’s impact on the ecommerce industry

When was your last online shopping trip, and how did it go? For consumers, it’s becoming arguably tougher to ...

Vincent Caruana

Senior Digital Marketing Manager, SEO

What’s the average ecommerce conversion rate and how does yours compare?
e-commerce

What’s the average ecommerce conversion rate and how does yours compare?

Have you put your blood, sweat, and tears into perfecting your online store, only to see your conversion rates stuck ...

Vincent Caruana

Senior Digital Marketing Manager, SEO

What are AI chatbots, how do they work, and how have they impacted ecommerce?
ai

What are AI chatbots, how do they work, and how have they impacted ecommerce?

“Hello, how can I help you today?”  This has to be the most tired, but nevertheless tried-and-true ...

Catherine Dee

Search and Discovery writer

Algolia named a leader in IDC MarketScape
algolia

Algolia named a leader in IDC MarketScape

We are proud to announce that Algolia was named a leader in the IDC Marketscape in the Worldwide General-Purpose ...

John Stewart

VP Corporate Marketing

Mastering the channel shift: How leading distributors provide excellent online buying experiences
e-commerce

Mastering the channel shift: How leading distributors provide excellent online buying experiences

Twice a year, B2B Online brings together America’s leading manufacturers and distributors to uncover learnings and industry trends. This ...

Jack Moberger

Director, Sales Enablement & B2B Practice Leader

Large language models (LLMs) vs generative AI: what’s the difference?
ai

Large language models (LLMs) vs generative AI: what’s the difference?

Generative AI and large language models (LLMs). These two cutting-edge AI technologies sound like totally different, incomparable things. One ...

Catherine Dee

Search and Discovery writer

What is generative AI and how does it work?
ai

What is generative AI and how does it work?

ChatGPT, Bing, Bard, YouChat, DALL-E, Jasper…chances are good you’re leveraging some version of generative artificial intelligence on ...

Catherine Dee

Search and Discovery writer

Feature Spotlight: Query Suggestions
product

Feature Spotlight: Query Suggestions

Your users are spoiled. They’re used to Google’s refined and convenient search interface, so they have high expectations ...

Jaden Baptista

Technical Writer

What does it take to build and train a large language model? An introduction
ai

What does it take to build and train a large language model? An introduction

Imagine if, as your final exam for a computer science class, you had to create a real-world large language ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

The pros and cons of AI language models
ai

The pros and cons of AI language models

What do you think of the OpenAI ChatGPT app and AI language models? There’s lots going on: GPT-3 ...

Catherine Dee

Search and Discovery writer

How AI is transforming merchandising from reactive to proactive
e-commerce

How AI is transforming merchandising from reactive to proactive

In the fast-paced and dynamic realm of digital merchandising, being reactive to customer trends has been the norm. In ...

Lorna Rivera

Staff User Researcher

Top examples of some of the best large language models out there
ai

Top examples of some of the best large language models out there

You’re at a dinner party when the conversation takes a computer-science-y turn. Have you tried ChatGPT? What ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

What are large language models?
ai

What are large language models?

It’s the era of Big Data, and super-sized language models are the latest stars. When it comes to ...

Catherine Dee

Search and Discovery writer

Mobile search done right: Common pitfalls and best practices
ux

Mobile search done right: Common pitfalls and best practices

Did you know that 86% of the global population uses a smartphone? The 7 billion devices connected to the Internet ...

Alexandre Collin

Staff SME Business & Optimization - UI/UX

Cloud Native meetup: Observability & Sustainability
engineering

Cloud Native meetup: Observability & Sustainability

The Cloud Native Foundation is known for being the organization behind Kubernetes and many other Cloud Native tools. To foster ...

Tim Carry

Looking for something?

facebookfacebooklinkedinlinkedintwittertwittermailmail

Just as with a school kid who’s left unsupervised when their teacher steps outside to deal with a distraction, unsupervised learning in a machine-learning scenario can have its drawbacks. For instance, consider search technology. With no data scientist overseeing the generative process, Ms. or Mr. Machine could get creative in the sources it deems accurate, stringing together unrelated information to present search results that don’t quite compute — that is, they aren’t necessarily true or accurate.

In our era of Big Data, supervised learning has inherent value. It seems safer. When a kid isn’t left alone to let who-knows-what ensue, there aren’t usually issues.

But when it comes to AI-powered search, among other applications, unsupervised learning can’t simply be written off as too risky. For one thing, it offers way too many major benefits. In terms of data science, one huge advantage is the removal of human bias. Instead of having a researcher predefine classification groups, all on its own, the independently minded machine puts forth a number of clusters based on empirical proof.

So an unsupervised data mining system that can detect many kinds of data groupings based on similarities it can detect — for instance, as applied to customer segmentation or image segmentation — is undoubtedly a powerful tool.

How clustering works

Unsupervised learning utilizes clustering: grouping data points in scatter plots in classes that share similar characteristics. Instead of predefined categories, unlabeled data is classified (clustered) by features.

When doing clustering segmentation with words, things can get complex because some words can be placed in a variety of clusters. For example, a word like “moving” might be placed in multiple grammatical clusters (because it’s a verb and an –ing word), as well as clusters related to driving a car and relocation.

Fortunately, the various problems arising from establishing word meaning in machine learning can be summarily solved. And that’s where the k-means clustering algorithm comes in handy.

What is k-means clustering?

The variable k, referring to the number of centroids you need in the dataset, is predefined. It represents the number of groups or categories. The data is divided into k clusters (classes). 

The “means” in k-means refers to the averaging of data related to the location of a center point, or centroid. A centroid is the (real or imaginary) center of the cluster.

K-means looks for a fixed number (k) of clusters in a dataset.

With k-means, the distance between each point in the dataset and each initial centroid is calculated. According to one study, “Distance measure plays a very important rule [in] the performance of this algorithm…. Based on the values found, points are assigned to the centroid with minimum distance. Hence, this distance calculation plays [a] vital role in the clustering algorithm.”

The goal of using the k-means algorithm is to minimize the sum of distances between the data points and their corresponding clusters, grouping similar data points together and thereby being able to see underlying patterns.

Each dataset can belong to only one group of similar properties, which pinpoints the location of the cluster center of mass (the centroid, a collection of feature values).

How the k-means algorithm works

The k-means clustering algorithm starts by randomly choosing data points as proposed cluster centroids, the beginner points for the initial clusters. A k number of centroids is identified.

The algorithm assigns each data point to its nearest center point. Every point is allocated to the nearest cluster by reducing the in-cluster sum of squares, while simultaneously ensuring that the centroids stay as compact as possible. The points near the center create the cluster.

The algorithm uses an iterative (repeating) approach to pinpoint the best value for each centroid, recalculating new centroids until it ultimately comes up with workable final cluster assignments. When the initial centroids are estimated, these estimates are pinpointed by repeatedly assigning examples to their closest centroids and, based on the new assignments, recomputing the centroids.

With this step-1 activity complete, a new data point is assigned for each cluster. The process is repeated until convergence is reached, with the recalculated centroids matching the initial centroids. The centroid values no longer change and the defined number of iterations has been reached. If the convergence criterion is not attained, there is still a limit on the maximum number of iterations (between 1 and 999). Either way, the algorithm has a trigger to stop creating and optimizing clusters.

Determining the number of clusters

With k-means, to determine the number of clusters in a data set, data scientists often apply the aptly named elbow method. Based on where the curve of the plotted data shows an “elbow” or starts to straighten out after heading up, they can see where returns will diminish. Based on that, they can determine the optimal number of clusters to use.

With k-means, there’s another clustering method as well. You can alternatively choose the number of clusters by analyzing the silhouette plot, which shows how close each point in a cluster is to points in other clusters.

Why is the k-means algorithm so popular?

The k-means clustering algorithm is employed extensively to analyze data clusters in applications ranging from search to business analysis.

Among the various unsupervised machine-learning algorithms available, what makes k-means stand out?

In a word, speed. Whether there’s not much information or there’s a very large dataset, k-means excels at quickly dividing data into clusters. It’s also flexible, and it’s known to work well immediately. This is an expedient way to figure out the categories of groups in an unlabeled dataset without training a machine.

The advantage of this approach is also that the human bias is taken out of the equation. Instead of having a researcher create classification groups, the machine creates its own clusters based on empirical proof, as opposed to people’s assumptions.

The downsides? A few considerations:

  • K-means performance is typically not as impressive as with other high-level clustering techniques, as minor variations in the data could possibly lead to high levels of variance.
  • Clusters are expected to be spherical and sized similarly, without too many outliers; when they’re not, it could lower the accuracy of k-means-clustering Python results.
  • In the k-means method of cluster analysis, observations are grouped by minimizing Euclidean distances between them. According to Wikipedia, “k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances…. The problem is computationally difficult…however, efficient heuristic algorithms converge quickly to a local optimum.”

For more information

Hope you’ve enjoyed this quick introduction to k-means clustering. If you want to learn more:

  • Consult the scikit-learn library for a straightforward explanation of how k-means clustering works
  • Check out some of the excellent in-depth tutorials available on k-means clustering
  • Get a good visualization of how the k-means machine-learning algorithm works using the Python programming language

Want the advantages of k-means-powered search?

Looking for potent optimization of your website? Algolia utilizes k-means clustering among other algorithms to return fast, accurate AI-powered search results and recommendations for our clients’ sites and apps.

See how our groundbreaking NeuralSearch technology can improve the search and discovery experience for your users, plus potentially boost your site metrics up along with your customer satisfaction levels. Contact us for all the details!

About the author
Catherine Dee

Search and Discovery writer

linkedin

Recommended Articles

Powered byAlgolia Algolia Recommend

What is clustering?
ai

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

An introduction to machine learning for images and text — now and in the (near) future
ai

Peter Villani

Sr. Tech & Business Writer

How We Tackled Color Identification
engineering

Léo Ercolanelli

Software Engineer