Textual Matching - The Bare Minimum for Search
A good way to understand relevance is to imagine search with no relevance.
What happens if all you do is compare text letter by letter?
I have 4 sentences:
- To be or not to be that is the question
- This was the best of times the worst of times
- Ask not what your country asks of you but what you can do for your country
- For a long time I used to go to bed early
Matching Complete Words
A search for “ask” returns 1 record: 3.
A search for “time” returns 2 records: 2 (“times”, if the plural counts as the same word) and 4 (“time”).
Matching Partial Words
A search for “be” returns 3 records: 1 (“be”), 2 (“best”), and 4 (“bed”).
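Here is a minimal sketch of both kinds of matching, run against the 4 sentences above. The function names and the lowercase word splitting are assumptions made for illustration, not how any particular engine does it.

```python
import re

SENTENCES = [
    "To be or not to be that is the question",
    "This was the best of times the worst of times",
    "Ask not what your country asks of you but what you can do for your country",
    "For a long time I used to go to bed early",
]


def words(text):
    """Lowercase a text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def match_whole_word(query, sentences):
    """Return the 1-based numbers of records containing the query as a complete word."""
    q = query.lower()
    return [i for i, s in enumerate(sentences, start=1) if q in words(s)]


def match_prefix(query, sentences):
    """Return the 1-based numbers of records where some word starts with the query."""
    q = query.lower()
    return [
        i
        for i, s in enumerate(sentences, start=1)
        if any(w.startswith(q) for w in words(s))
    ]


print(match_whole_word("ask", SENTENCES))  # [3]
print(match_prefix("be", SENTENCES))       # [1, 2, 4]
```

Nothing here knows what a word means or which match is better. It only compares characters.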
But search is not just about matching text.
Where text matching doesn’t work
What happens with “qestion” (“question” mistyped)? Simple textual matching doesn’t correct spelling. So no records are returned.
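To make this concrete, here is a small sketch that searches the 4 sentences for “qestion”, first with exact word matching and then allowing one typo. It assumes typos are measured with Levenshtein edit distance, which is one common choice and not necessarily what a given engine uses. The `search` function and its `max_typos` parameter are hypothetical names.

```python
import re

SENTENCES = [
    "To be or not to be that is the question",
    "This was the best of times the worst of times",
    "Ask not what your country asks of you but what you can do for your country",
    "For a long time I used to go to bed early",
]


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep) a character
            ))
        prev = curr
    return prev[-1]


def search(query, sentences, max_typos=0):
    """Return 1-based record numbers with a word within max_typos edits of the query."""
    q = query.lower()
    hits = []
    for i, sentence in enumerate(sentences, start=1):
        sentence_words = re.findall(r"[a-z]+", sentence.lower())
        if any(edit_distance(q, w) <= max_typos for w in sentence_words):
            hits.append(i)
    return hits


print(search("qestion", SENTENCES))               # []
print(search("qestion", SENTENCES, max_typos=1))  # [1]
```

With no typos allowed, the query finds nothing. Allowing one edit brings back record 1, because “qestion” is one inserted letter away from “question”.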
What about “best”? Is this a search for the word “best”, or for the idea of “best”, as in the best quotes?
What about the order of the results? When searching for “be”, which record should come first? The ones that match exactly? Or is there some other criterion?
What about a search for “literary quotes”? Textual matching doesn’t know anything about these quotes - are these political or literary quotes? The quotes need to be classified, tagged as coming from novels or from politicians.
Same with the author: what if I want to find a quote by a French author?
So what’s missing? All these queries require something more: features like typo tolerance, filtering and faceting, synonyms, prioritizing attributes, and many other textual matching techniques. Let’s look at some of these.
Going beyond simple text matching
So let’s improve our sentences, adding more information.
- “To be or not to be, that is the question”, shakespeare, british, hamlet, theatre
- “It was the best of times, it was the worst of times”, dickens, british, tale of two cities, novel
- “Ask not what your country asks of you but what you can do for your country”, kennedy, american, speech, politics
- “For a long time I used to go to bed early”, proust, french, recherche, novel
Now we can start filtering these records to find literature, author names, genres of quote, and so on. So we’ve improved search by adding more data and structuring it smartly.
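As a rough sketch of what this structure buys us, the 4 records can be stored as dictionaries and filtered on exact field values. The field names (`author`, `nationality`, `source`, `genre`) are my own labels for the comma-separated values above, and `filter_records` is a hypothetical helper.

```python
RECORDS = [
    {"id": 1, "quote": "To be or not to be, that is the question",
     "author": "shakespeare", "nationality": "british",
     "source": "hamlet", "genre": "theatre"},
    {"id": 2, "quote": "It was the best of times, it was the worst of times",
     "author": "dickens", "nationality": "british",
     "source": "tale of two cities", "genre": "novel"},
    {"id": 3, "quote": "Ask not what your country asks of you but what you can do for your country",
     "author": "kennedy", "nationality": "american",
     "source": "speech", "genre": "politics"},
    {"id": 4, "quote": "For a long time I used to go to bed early",
     "author": "proust", "nationality": "french",
     "source": "recherche", "genre": "novel"},
]


def filter_records(records, **criteria):
    """Return the ids of records whose fields equal every given criterion."""
    return [
        r["id"]
        for r in records
        if all(r.get(field) == value for field, value in criteria.items())
    ]


print(filter_records(RECORDS, nationality="french"))                  # [4]
print(filter_records(RECORDS, genre="novel"))                         # [2, 4]
print(filter_records(RECORDS, nationality="british", genre="novel"))  # [2]
```

The filters never look at the quote text. They work only because each record now carries data that says what kind of quote it is.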
But we still haven’t resolved the misspelling of “qestion”. And we might have another problem related to intention. Let’s add a few more sentences to see this.
- “Shakespeare and Dickens are both famous British authors”, georgette, british, history of british literature, essay
- “By George, I should’ve known that! I’ll be a dickens!”, landly, british, my fair lady, theatre
- “An accomplished woman almost always knows more than we men, though her knowledge is of a different sort.”, george eliot, british, middlemarch, novel
If I search for “shakespeare”, I get records 1 and 5. But what if my intention was to find quotes by Shakespeare, not quotes about him? The same goes for “george” and “dickens”. How does the engine detect intention?
One way to solve this is to start prioritizing data. For example, we can have the search engine look at the authors’ names first, before searching the quotes.
This would solve the “shakespeare” problem, but it would create another one. If we search for “by george”, three records are equally valid (records 5, 6, and 7), and nothing tells us which one is the best match.
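Here is one way to sketch attribute priority over the 7 records, keeping only the author and the quote. It assumes prefix matching on single words and a simple two-level priority (author first, then quote). The `ATTRIBUTE_PRIORITY` list and the `search` function are hypothetical, and a real engine would combine many more criteria.

```python
import re

# (id, author, quote): a trimmed-down view of the 7 records above.
RECORDS = [
    (1, "shakespeare", "To be or not to be, that is the question"),
    (2, "dickens", "It was the best of times, it was the worst of times"),
    (3, "kennedy", "Ask not what your country asks of you but what you can do for your country"),
    (4, "proust", "For a long time I used to go to bed early"),
    (5, "georgette", "Shakespeare and Dickens are both famous British authors"),
    (6, "landly", "By George, I should've known that! I'll be a dickens!"),
    (7, "george eliot", "An accomplished woman almost always knows more than we men, "
                        "though her knowledge is of a different sort."),
]

ATTRIBUTE_PRIORITY = ["author", "quote"]  # attributes are checked in this order


def has_prefix(query, text):
    """True if any word in the text starts with the query (case-insensitive)."""
    q = query.lower()
    return any(w.startswith(q) for w in re.findall(r"[a-z']+", text.lower()))


def search(query, records):
    """Return (priority, id) pairs, best priority first. Equal priorities are ties."""
    hits = []
    for rec_id, author, quote in records:
        fields = {"author": author, "quote": quote}
        for priority, name in enumerate(ATTRIBUTE_PRIORITY):
            if has_prefix(query, fields[name]):
                hits.append((priority, rec_id))
                break  # keep only the highest-priority attribute that matched
    return sorted(hits)


print(search("shakespeare", RECORDS))  # [(0, 1), (1, 5)]
print(search("george", RECORDS))       # [(0, 5), (0, 7), (1, 6)]
```

For “shakespeare”, the author match (record 1) now comes ahead of the quote that only mentions him (record 5). For “george”, records 5 and 7 both match on the author, and this sketch has nothing further to separate them, so ties remain.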
We could continue this introduction, showing how every new search raises a different problem and then outlining a solution for each one. But let’s not do this.
Here’s a general statement: simple textual matching will not solve every problem. We need concepts like filtering, ranking, attribute priorities, handling typos, synonyms, and other language-based characteristics. We also need insights from analytics, personalization, and A/B testing. Simple text matching, therefore, while central to any search engine, is only the starting point. It is not the whole subject nor the full solution.