An introduction to machine learning for images and text — now and in the (near) future

With the remarkable growth in machine learning technologies, it’s become a reflex to ask – what’s next? But to talk about the future of ML-based search (search using machine learning), we first need to understand what’s already in place. In this article, we’re going to take a close look at two popular applications that illustrate the state of the art – image recognition and semantic search. Understanding how machine learning enables search by image or by descriptive and question-based text will help us anticipate the near future.

We’ll demonstrate how machines recognize images to create search-by-image functionality. Image recognition will provide an intuitive understanding of how machines learn. Then we’ll move on to semantic search, showing you how machines learn the meaning of words, thus enabling us to search by everyday language (idioms, sentences, and questions). 

We’ll stay high-level. So don’t expect to see any mathematical models or terminology such as neurons, hidden layers, weights, biases, and cost functions. Instead, you’ll understand the mechanics that drive machine learning – which is the best place to start before diving deeper into the network. 

Machine learning – recognizing pixels and pictures, finding similar images

Supervised machine learning

If you feed a computer thousands of images of dogs and cats, correctly labeled “dog” or “cat”, an ML algorithm can eventually learn what a dog or a cat looks like. It does this as follows: it breaks the images down into pixels, converts the pixels into numbers (representing colors), plots these numbers on a graph, and then discerns patterns that are typical of a dog or a cat image. Some pixels might contain patterns in the shape of a nose (wedged or snout-like), or ears (pointy or drooping).

Labeling every image as “dog” or “cat” helps the computer know whether its guess is right or wrong, and by how much. If it’s wrong, it re-processes the images, sharpening its pattern detection, until it minimizes the error of its judgment. The process is finished when the computer can predict – with very high confidence – whether an image is a dog or a cat.
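For readers who want a peek under the hood, here is a deliberately tiny Python sketch of that guess-and-correct loop. The two “features” per animal and the update rule are invented for illustration; real image recognition learns from millions of pixel-derived numbers.

```python
# A toy sketch of the supervised loop described above: guess, measure the
# error, adjust, repeat. The features and data are invented for illustration.

examples = [([0.9, 0.2], 0), ([0.8, 0.1], 0),   # cats: pointy ears, short snout
            ([0.3, 0.9], 1), ([0.2, 0.8], 1)]   # dogs: droopy ears, long snout

pattern = [0.0, 0.0]   # the learned pattern, adjusted after every mistake
offset = 0.0
step = 0.5             # how strongly each mistake nudges the pattern

for _ in range(20):    # re-process the images until errors disappear
    for features, label in examples:
        score = sum(p * f for p, f in zip(pattern, features)) + offset
        guess = 1 if score > 0 else 0
        error = label - guess          # 0 if right, +1 or -1 if wrong
        pattern = [p + step * error * f for p, f in zip(pattern, features)]
        offset += step * error

def predict(features):
    """1 means 'dog', 0 means 'cat'."""
    return 1 if sum(p * f for p, f in zip(pattern, features)) + offset > 0 else 0
```

Twenty passes over four examples is plenty here; real models repeat this cycle over enormous datasets until the error stops shrinking.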

From supervised to unsupervised learning

The dog-and-cat image-recognition algorithm just described is an example of supervised learning, which means it uses the image labels (or other metadata) to guide its learning. However, labeling is tedious – and, at scale, impossible: we couldn’t label every object in the world if we want our machines to understand the world. Annotating dogs and cats is easy, but how would we annotate the prognosis in a medical report? So we need an algorithm that can learn without explicit labeling. We need to teach computers to learn unsupervised – that is, from unlabeled data. This is a significant change in machine learning that affects all aspects of the ML modeling process.

However, one thing remains the same: both supervised and unsupervised learning rely on error detection; that is, they both run and rerun the ML process until the errors in its predictions are minimized. With supervised learning, the ML model uses a true/false error analysis (is it a dog? is it a cat?); with unsupervised, while it can use a binary yes/no analysis, it more often applies a probability test for the likelihood of truth. 

Unsupervised machine learning – classifying objects (clustering)

Unsupervised learning uses clustering – that is, it groups objects into classes that share similar characteristics. So, instead of comparing a dog to a dog, unsupervised learning feeds several animals into a system and allows the machine to discern different groups that share similar features. It could be that the computer will find cats and dogs, but it will also find groups like furry-animals, small animals, ones with teeth, others with fins or paws. In general, unsupervised learning is looking for similarity and difference based on the characteristics that we feed into the system.

You might prefer a supervised system that classifies dogs as dogs and cats as cats. But an unsupervised system that can detect many kinds of groupings, based on the many possible similarities that exist in the world, is quite a powerful retrieval tool. For example, when you feed in a new dog image, the system not only retrieves a dog but might retrieve dogs of the same breed in a similar pose. This is possible because unsupervised learning considers all of the detailed pixel information, which may allow it to classify the animals by breed and pose.
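One common way to do this kind of grouping is the classic k-means procedure – our choice for illustration; the article doesn’t prescribe an algorithm. The machine is never told which animal is which; it only groups nearby points and refines the groups. The two features (say, “fur length” and “body size”) and the data are invented:

```python
# A bare-bones k-means sketch: assign each point to its nearest center, move
# each center to its group's average, and repeat until the groups settle.

def cluster(points, centers, rounds=10):
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:                 # 1. put each point with its nearest center
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            groups[nearest].append(p)
        centers = [                      # 2. move each center to its group's average
            [sum(axis) / len(g) for axis in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

animals = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0),   # one natural grouping
           (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]   # another
groups = cluster(animals, centers=[(0.0, 0.0), (6.0, 6.0)])
```

No labels anywhere – the two groups simply emerge from the similarity of the numbers.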

It’s all about clustering.

Image clusters

Source: GraphicMama-team

Machines learn to identify objects and cluster them by notable features: 1, fluffy dogs; 2, something in their mouths; 3, tails; 4, spotted.


Clustering – unsupervised machines learn like children

Machine learning models break content down into small pieces (pixels in pictures, words in text) and assign characteristics to each piece. For example, pixels have colors and positions which, when put together, form shapes and a picture; words are surrounded by other words which, when put together, form meaningful phrases and context.

We can think of machines as children playing with geometric blocks (cubes, spheres, three-dimensional triangles, etc.). The game is to group them together. One child may feel the blocks and turn them over. Another child will taste them, or throw them against a wall and watch how they bounce. Both children are classifying the objects by finding characteristics and grouping them by similarity. The first child uses sides and angles; the other uses taste and density. While neither child may be able to name the objects, both will eventually put the cubes and spheres into distinct groups (clusters).

Machines do the same. We tell the machine what to look at (shapes, angles – maybe not taste), then feed tons of different shapes into it. The machine goes through a back-and-forth process (the ML model and algorithm) until it successfully clusters the different shapes into meaningful classes.

Patterns and grouping the world into clusters

Most unsupervised learning models represent real objects as a set of numbers, called a vector, because computers can only calculate and compare numbers. The machines then take these sets of numbers and adjust them many times (like a child fiddling with a cube many times) until clear patterns emerge from the model that enable the model to separate out the objects into meaningful clusters. 

In the geometry example, the objects are simple – triangles, squares, circles, etc. – and each object is represented by a vector ([number of sides, lengths, widths, angles]). If you feed a large sample of these values into the learning process, the model performs a set of calculations that transform the vectorized numbers and plot them onto a graph (a vector space). This transformation is where the learning occurs. At first, the objects are scattered randomly, but as you feed the transformed vectors back into the machine and reapply the calculations, they settle into different parts of a two-dimensional graph. Objects near each other are similar in meaning: squares end up near squares, triangles near triangles, and straight-line objects (triangles and squares) are pushed further away from circles.
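To make the vector idea concrete, here is a minimal sketch. The exact features ([sides, corner angle, aspect ratio]) are invented for illustration; the point is simply that once everything is a number, “similar” becomes “close together”:

```python
# Shapes as vectors: similar shapes sit close together in the vector space,
# and distance between vectors is how the machine compares objects.
import math

shapes = {
    "square":    [4, 90, 1.0],
    "rectangle": [4, 90, 2.0],
    "triangle":  [3, 60, 1.0],
    "circle":    [0, 0, 1.0],   # no straight sides, no corners
}

def distance(a, b):
    return math.dist(a, b)      # straight-line distance in the vector space
```

Measured this way, a square is close to a rectangle, a bit further from a triangle, and far from a circle – exactly the layout described above.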

The world is made up of complex objects

Of course, the world is not so simple. For example, how can an ML-based self-driving car recognize that a child is running across a street? We first need far more characteristics than just two dimensions – thousands if not millions of dimensions, such as speed, movement, color, shape, size, weight, and direction. The more characteristics we feed into the system to represent a moving, three-dimensional image of a running child, the better the eventual clustering will be – and the better the computer will “know” whether the car should keep driving, slow down, or stop.

From images to text – unsupervised machine learning as applied to text

Moving from image to text, the state of the art as applied to textual search has grown immeasurably in the last 10 years. In this section, we’ll see how machine learning has either replaced or combined with time-honored search techniques to create a semantic search that classifies and detects meaningful relationships between words. 

Word clusters and embeddings

Laying the foundation of textual search: non-ML word matching and keyword search

Search was built upon understanding text – letters and words – not meaning. The goal was always to word match a query to the underlying index of texts. 

In the 90s, advanced word-matching techniques using natural language processing (NLP) allowed us to build powerful search engines; we couldn’t build anything semantic at the time. Word matching relies on dictionaries and NLP techniques that understand grammar and language. It also involves creating pre-defined rules and synonyms, and detecting misspellings and typos, among other techniques.

On top of that, keyword search and word-metric statistics like TF-IDF (term frequency–inverse document frequency) are used to improve word matching. Some search engines only match whole words; others perform prefix matching, which matches the query, as it is typed, against the beginnings of indexed words. Prefix search returns instantaneous results as a user types.
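Prefix matching is simple enough to sketch in a couple of lines (the word list here is invented for illustration):

```python
# A minimal prefix-match sketch: the query typed so far is matched against the
# beginnings of indexed words, which is why results can update on every keystroke.
index = ["phone", "photo", "photography", "piano"]

def prefix_search(query):
    return [word for word in index if word.startswith(query)]
```

Real engines use far more efficient index structures, but the contract is the same: every keystroke narrows the candidate words.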

Many of these techniques are powerful and still in use. For example, an NLP-based keyword search brings back perfectly relevant results instantly when working with structured data, such as clearly defined objects. Think of a collection of kitchen products, or films, or even architectural plans. On the other hand, a statistics-based method like TF-IDF works better for long unstructured texts, where the frequency of words helps determine relevance. Finding the frequency of the word “whale”, for instance, should bring up Moby Dick as well as books on whale sounds and the whaling industry.

Machine learning takes keyword search to the next level.

Machine learning and unstructured text: finding context

What’s missing from standard word-matching techniques is a semantic understanding of text. Using supervised and unsupervised learning techniques, computers can look at millions of documents with billions of word relationships to determine semantics. For example, machines can determine that banks, mutual funds, and hedging are all related to financial institutions and financial management. When someone searches for “best place to invest my money”, machine learning algorithms can start to make semantic sense of the question.

Clustering and building word relationships

As with image recognition, machine learning involves classifying words (clustering) based on similar characteristics. With words, the clustering can get quite complex because words can be placed in many different clusters. The same word, like “running”, can be placed in grammatical clusters (verb, ing-words), sports, movement, something animals do, something water does, and so forth. And that’s just one word. 

Let’s take a simple example: if a user asks, “how hot is it today”, the computer can detect the context (the weather, even though “weather” was not mentioned) and the central idea (temperature, from “hot”). With that in mind, a well-trained computer can answer the question as follows: “Today’s temperature will reach 90 degrees Fahrenheit, be sure to wear light clothing and plenty of sunscreen!”

How a machine learns to cluster words: finding patterns

Similar to image recognition, where images are fed into the machine to find patterns that help classify them, documents and phrases can be fed in so the machine can discern semantic word patterns and plot words into clusters based on their similarities and differences.

The main technique is to look at phrases and see how each word in each phrase is used. The catchy axiom from linguist J.R. Firth, “You shall know a word by the company it keeps”, captures this logic: computers (and children) learn context by examining how words are used in meaningful text and conversation. The context is the phrase.

Notice that “dog” and “cat” are interchangeable in the following two phrases: “the cat jumped on the roof” and “the dog jumped on the roof”. Thus, “cat” and “dog” are similar. Dogs and cats can also “jump” – therefore, cats and dogs can be clustered near the word “jump” – whereas tables, airplanes, and rugs are very distant from the term “jump”. Then again, they can all “leap” off the page… you can see how complex it gets! In the end, you can end up with a large number of fairly well-defined – and overlapping – clusters.
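This “company it keeps” idea can be sketched by literally counting neighbors. The sentences below are invented; the point is that words used in similar contexts (cat, dog) end up with similar neighbor counts – the raw intuition behind word embeddings:

```python
# A toy co-occurrence sketch: count each word's immediate neighbors, then
# compare the counts. Words with similar company get similar counts.
from collections import Counter

sentences = [
    "the cat jumped on the roof",
    "the dog jumped on the roof",
    "the cat chased the ball",
    "the dog chased the ball",
    "the table stood in the room",
]

def neighbors(target):
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:   # collect the word just before and just after
                counts.update(words[max(0, i - 1):i] + words[i + 1:i + 2])
    return counts

def shared_context(a, b):
    """How many neighbor counts two words have in common."""
    na, nb = neighbors(a), neighbors(b)
    return sum(min(na[w], nb[w]) for w in set(na) | set(nb))
```

“cat” and “dog” share the neighbors “jumped” and “chased”, so they score high together; “table” keeps different company and scores low.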

However, this is no small task. You can imagine that if this kind of language learning is done for every word, the computer could be learning hundreds of thousands of words with millions of subtle variations, resulting in who knows how many overlapping clusters. And many sentences don’t contain any useful information, such as “I saw a cat” – which means that a cat is an object that can be seen, but what kind of object is not given.

Semantic search, today and tomorrow: mixing the tools of the past with advances in machine learning

Natural language processing (NLP) will always be necessary to sift through the syntax and grammar of the thousands of languages in the world, with their different rules and exceptions.

However, semantic search has replaced the need to manually create synonyms, rules, or statistics. Machine learning generates word similarities, which are even more powerful than synonyms, and builds an index that doesn’t need rules or statistics.

What about prefix, instant, and keyword search? They’re not going anywhere, as we’ll see below. 

Thus, to get back to the initial goal of this article, what’s in store for the near future? 

Structured text and keywords

Word and keyword matching will never go away, especially when the content is structured and well-labeled. Even complex searches like “cheap blue phone with 500 gig memory” can be resolved perfectly well with keyword search because of its easily-detected keyword-based structure. 

That said, machine learning can still lend a hand in structured-content domains. For example, what happens when you type in “show me shoes for dancing” and the product listing does not include the word “dancing”? Here is where semantic search shines. Thanks to semantic machine learning, “dancing” can be understood by its context: shoe descriptions containing words like ballroom, ballerina, wedding shoes, and evening wear will be contextually similar to “dancing” shoes.
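Here is a hypothetical sketch of that match. These tiny 3-number “meaning vectors” are invented by hand purely for illustration – a real model would learn hundreds of dimensions – but the mechanism is the same: descriptions whose vectors point the same way as the query match, even when they never contain the query word.

```python
# Rank description words by how closely their (invented) meaning vectors
# align with the query word "dancing", using cosine similarity.
import math

vectors = {
    "dancing":   [0.9, 0.8, 0.1],
    "ballroom":  [0.8, 0.9, 0.2],
    "ballerina": [0.7, 0.9, 0.1],
    "hiking":    [0.1, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

ranked = sorted((w for w in vectors if w != "dancing"),
                key=lambda w: cosine(vectors["dancing"], vectors[w]),
                reverse=True)
```

“ballroom” and “ballerina” rank near the top while “hiking” falls to the bottom – so the dancing-shoe listings surface without ever containing the word “dancing”.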

Unstructured text and machine-learned semantics

But semantic search adds its greatest value with unstructured text. Not all data can be structured. Consider an online bookseller. Semantics will help the engine understand that the query “children with magical powers” matches Harry Potter books, or that “chaos theory and dinosaurs” matches Jurassic Park books.

Advances in ML and semantic search will continue to empower NLP and keywords

ML-based search will continue to change search in significant ways. For example:

  • Laborious NLP coding will be replaced by ML-based NLP that can understand language’s exceptional complexity – in some cases, even better than humans. 
  • ML-based image search will work side-by-side with keyword search. For example, a user who searches with a drawing of a house to find similar houses can drill down further by typing “5 rooms” in the search bar and clicking on the facet “wood paneling”. This is possible because machines can extract keywords during the image recognition process, creating an index (and cluster) of keywords for each image.

If there is any pattern in the explosive growth of machine learning, it’s in how ML continues to combine with keyword search, NLP, and other robust search technologies, to create an even more powerful search and discovery platform. 

About the author

Peter Villani

Sr. Tech & Business Writer
