
AI Vector Search: Build or Buy?


To get the obvious out of the way, hi! We’re Algolia. We’re the AI search people who serve almost 2 trillion search requests every year, so we know a thing or two about this topic.

In our many years of participating in this community, we’ve observed that many companies know that they want AI search, but they're torn between building or buying it. So let’s do a little compare and contrast today, shall we? Is it worth building your own AI search implementation? How would you even go about that? What are the pros and cons of going with a search vendor like Algolia?

In this blog, we're going to show you how to actually build an AI search engine. It's not that hard to just get something up and running. However, as you'll see, what is hard is all the stuff that goes into making a search engine good, like relevant results, custom ranking, managing your index, A/B testing, building a great UI, and so on.

Building your own AI search

The basic process isn’t too difficult. There are a million optimizations we could make, but just to get a solid understanding of the underlying tech, we’re going to go the easy route.

If you’d like to follow along, here’s a GitHub repo with all the steps in code. You can pip install all the dependencies, put a file called inputs/initial_dataset.json in the repo with your product database, and run each step as we go with the modifications that suit your data. Here’s the basic process:

  1. generate_corpus.py — First, we need to look through the product database, filter out the records that don’t include enough information, and generate a corpus of words that are likely to appear together. For my dataset, I built this corpus by extracting the name, brand, description, and category fields from each record, combining them, stripping out non-letters, lowercasing everything, and splitting on spaces.
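
    Here’s a minimal sketch of what that cleanup could look like. The field names and file paths match my dataset, and the exact cleanup rules are my own assumption, so adjust them to fit yours:

    import json
    import re
    
    FIELDS = ('name', 'brand', 'description', 'category')
    
    with open('inputs/initial_dataset.json', 'r') as file:
      products = json.load(file)
    
    corpus = []
    for product in products:
      # filter out records that don't include enough information
      if not all(product.get(field) for field in FIELDS):
        continue
      # combine the fields, strip non-letters, lowercase, and split on spaces
      text = ' '.join(str(product[field]) for field in FIELDS)
      corpus.append(re.sub(r'[^A-Za-z ]', ' ', text).lower().split())
    
    with open('corpus.json', 'w') as file:
      json.dump(corpus, file)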

  2. generate_vectors.py — Second, we’ll create a Word2Vec model and train it on a general-purpose corpus so that it understands how words relate to each other in everyday usage.

    from gensim.models import Word2Vec
    import gensim.downloader as api
    
    # text8 is a small general-purpose corpus (cleaned Wikipedia text) that
    # gensim can download, so the model learns everyday English first
    base_corpus = api.load('text8')
    model = Word2Vec(base_corpus, vector_size=100, window=5, workers=4, min_count=1)
    

    If we trained it only on our product database, it wouldn’t recognize the many words someone might search for that never appear in our product listings. Because we train it on the generalized corpus first and then refine it with the product-specific corpus we generated in step 1, it understands English broadly while still being responsive to company-specific associations.

    import json
    with open('../step_1_generate_corpus/corpus.json', 'r') as file:
      connected_lists = json.load(file)
    # register the product-specific vocabulary before continuing training;
    # without this, words that never appeared in text8 would get no vectors
    model.build_vocab(connected_lists, update=True)
    model.train(connected_lists, total_examples=len(connected_lists), epochs=10)
    model.save('trained.model')  # step 5 loads this file later
    
  3. vectorize_index.py — Next, load up the trained model and use it to create a vector for each record in the product database. Since we’ve already done the formatting work, we can pull in the product corpus from step 1 again and vectorize each of those word lists with this function:

    import numpy as np
    # word_vectors is the .wv attribute of the trained model from step 2
    def get_word_list_vector(word_list):
      # drop any words the model has never seen, then average the rest
      word_list = [word for word in word_list if word in word_vectors.key_to_index]
      return np.mean(word_vectors[word_list], axis=0)
    

    This is easily the most time- and resource-intensive step of the process. It took several hours to vectorize my entire dataset, even after splitting up the work among 15 parallel threads.
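
    If you want to replicate that fan-out, here’s a rough sketch using a process pool (the worker count and chunking scheme are assumptions, and it presumes corpus and the model are loaded at module level so forked workers can see them; plain Python threads won’t give CPU-bound work like this much of a speedup, hence processes):

    from multiprocessing import Pool
    
    N_WORKERS = 15
    
    def vectorize_chunk(chunk):
      # each worker averages word vectors for its own slice of the corpus
      return [get_word_list_vector(word_list) for word_list in chunk]
    
    if __name__ == '__main__':
      size = -(-len(corpus) // N_WORKERS)  # ceiling division
      chunks = [corpus[i:i + size] for i in range(0, len(corpus), size)]
      with Pool(N_WORKERS) as pool:
        chunked = pool.map(vectorize_chunk, chunks)
      vectors = [vector for chunk in chunked for vector in chunk]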

  4. generate_index_model.py — Then, we’ll create a simple dictionary that maps each unique ID from our product database to the vector representing that product’s text. For convenience, we’ll fill a gensim.models.KeyedVectors model with this info. We’re not doing any more training at this point; we’re just using KeyedVectors as a datatype. Its benefit over a plain dictionary is the .most_similar() method, which finds the vectors in the model closest to a given query vector. We’ll use that later to actually run queries, so this data structure is essentially our “vector search index”. And since it can be written to a file just as easily as a dictionary can, we save it for the query script to load later.
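
    Here’s a minimal sketch of that step, assuming step 3 left us with two parallel lists, product_ids and product_vectors (those names are mine; match them to whatever your step 3 produces):

    from gensim.models import KeyedVectors
    
    # no training happening here; KeyedVectors is just a handy vector store
    index_model = KeyedVectors(vector_size=100)  # must match step 2's vector_size
    index_model.add_vectors(product_ids, product_vectors)
    index_model.save('index.model')  # loaded by the query script in step 5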

  5. search_by_query.py — This is the actual searching part. We load in the Word2Vec model that can consistently vectorize whatever words we throw into it, as well as the KeyedVectors structure that contains our index. We will also need the actual product database data we cleaned all the way back in step 1.

    from gensim.models import Word2Vec, KeyedVectors
    import json
    # the fine-tuned word vectors from step 2 and our "index" from step 4
    sentence_vectors = Word2Vec.load("../step_2_generate_vectors/trained.model").wv
    product_vectors = KeyedVectors.load("../step_4_generate_index_model/index.model")
    # maps each unique product ID back to its full record in the database
    with open('../step_4_generate_index_model/index_dict.json', 'r') as file:
      index = json.load(file)
    

    Now, we can vectorize the input query (if you run the program as-is from the command line, it arrives as plain, unquoted text: python search_by_query.py query goes here), pass it into the similarity check, and look up each resulting unique ID in the JSON product database.

    import numpy as np
    # drop query words the model has never seen to avoid a KeyError
    query = [word for word in query if word in sentence_vectors.key_to_index]
    query_vector = np.mean(sentence_vectors[query], axis=0)
    similar_products = product_vectors.most_similar(query_vector, topn=10)
    search_results = [index[result[0]] for result in similar_products]
    

And that’s it! It only took me a couple of hours to build, and the concepts aren’t that difficult to grasp. There are a bunch of potential optimizations here, but maybe that’s your next weekend project!

Should I roll my own AI search?

Now that you know the process of rolling your own vector-based AI search engine, do you think it’s worth it? You’re definitely not the first person to ask that question. Almost every ecommerce site has a search function, and surely many teams have at some point wondered which path forward is the most flexible, cost-effective, efficient, and easy to maintain. So let’s go over some of the pros and cons of rolling your own AI search like we did above.

Pro: It’s not hard to write a simple version

This whole project took me about an afternoon. It didn’t require anything but a basic understanding of Python syntax and access to the gensim docs. It’s true that this took a lot longer than just uploading my product database straight into an Algolia search index, but it’s also true that some Algolia features can take a similar amount of finagling to find the right settings and get everything working just right.

Con: It’s incredibly slow to index and query

The afternoon was just the coding part — I actually had the vectorize_index.py script running overnight because it was taking soooooo long. Then, as it turned out, I had a mistake in my original code and none of that work was saved. 🤦‍♂️ To be fair, I was running this on a fairly large dataset, but even when I split the data into chunks and ran the vectorizer model on 15 different threads, it still took several hours. Usually a one-time upfront cost like this isn’t too bad, but you’d have to rerun the vectorizer on every new record you add. Plus, since the vectorizer itself was partially trained on your initial product dataset, as you add whole new categories of products (products you only keep in inventory seasonally, for example), the AI model that creates the vectors will slowly drift out of tune with your actual product database. The accuracy of your search will slowly decrease, taking your revenue with it.

Pro: It helps you grow as an engineer

Projects like this can teach you valuable skills. For example, this project taught me a lot about installing Python packages (getting the right version of gensim and all of its dependencies was a pain in the neck — I had to learn about wheels and pip commands to get it all straightened out). I also came away with a better understanding of how vectors work, even though I already had some experience in the field.

Con: It’s tough to use in production

Loading huge AI models into RAM is tough, especially when you’re trying to stick to a lean backend architecture. If you’re running a microservice, importing that file into your code every time it boots up is going to slow down the query time significantly and probably cost you a pretty penny (especially at scale). Turning it into an API for your frontend is going to require backend engineers and a lot of patience.

Pro: You can control everything

One benefit of rolling your own anything is control. Sometimes, if you have a very specific core business function that involves search, it might be better to tackle that task yourself. This wouldn’t apply to most ecommerce companies, since their “business” isn’t search, it’s selling their products. Search just assists. On the other hand, if you’re getting into the burgeoning privacy-focused web search engine market and your whole value proposition is that your tool can search the web really fast and really accurately, then that’s a different story. At that point, search isn’t just supplemental to your business, it is your business. You need control over it, so it might make sense to build it all yourself.

Con: Search is much more than just vectors and keywords

The thing about Algolia is that with just the one index, you can do so much. Just looking at the navigation menu in the Algolia Docs should give you some idea.

  • You can integrate with different frameworks and platforms;
  • easily sort, group, facet, and filter search results;
  • set synonyms and re-ranking criteria;
  • create a fully functional UI just from prebuilt widgets;
  • build an autocomplete bar;
  • track analytics data;
  • personalize results based on user profiles;
  • recommend items from the index that the user might like or that look similar;
  • categorize queries to trim down the possible search results;
  • automatically find synonyms and rerank results based on trends;
  • and so, so much more.

So many of the features that seem like obvious additions, things that should come out of the box with any sophisticated search engine, just aren’t there with our homemade version. Since many of them involve a lot more complexity than the search index itself, it would be quite the undertaking to try to implement even one of these features.

What’s the verdict?

In short, it’s almost always better to buy, meaning to use a search tool made by specialists. Unless you’re in very specific circumstances, vector search and all of its associated technologies are too intricate to entrust to a team that hasn’t spent years perfecting them. The benefits of using a tool like Algolia for everything from prototypes to production applications are immense, and they’re easy to see in the satisfaction of countless development teams.

Are you ready to equip your business with powerful, seamless AI vector search? You’ve learned how it works; now let us show you how it’s done — sign up for free today.
