An introduction to transformer models in neural networks and machine learning

What do OpenAI and DeepMind have in common?

Give up? These innovative organizations both utilize technology known as transformer models.

What are transformer models?  

The transformer (represented by the T in ChatGPT, GPT-2, GPT-3, GPT-3.5, etc.) is the key element that makes generative AI so, well, transformational.

Transformer models are a type of neural network architecture designed to process sequential data, such as sentences or time-series data.

The concept of a transformer, an attention-layer-based, sequence-to-sequence (“Seq2Seq”) encoder-decoder architecture, was introduced in the 2017 paper “Attention Is All You Need” by deep-learning pioneer Ashish Vaswani and colleagues. Since then, in the realms of AI and machine learning, transformer models have emerged as a groundbreaking approach to various language-related tasks.

Compared with traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), transformers differ in their ability to capture long-range dependencies and contextual information.

The transformer “requires less training time than previous recurrent neural architectures, such as long short-term memory (LSTM), and its later variation has been prevalently adopted for training large language models on large (language) datasets,” notes Wikipedia.

From machine translation to natural language processing (NLP) to computer vision, plus audio and multi-modal processing, transformers have revolutionized the field with their ability to capture long-range dependencies and efficiently process sequential data. They’re used widely in neural machine translation (NMT), and to perform or improve AI and NLP business tasks as well as streamline enterprise workflows. Transformer technology has also paved the way for generative pretrained transformers (GPTs) and Bidirectional Encoder Representations from Transformers (BERT).

Multi-head attention

A transformer measures relationships between pairs of input tokens (for example, if the content is text, the tokens are words or pieces of words); this measurement is known as attention. Attention heads are a key feature of transformers. A transformer uses parallel multi-head attention, meaning the attention module runs several attention computations in parallel, affording more ability to encode nuances of word meanings. The outputs of the individual heads are then combined (concatenated and passed through a final linear layer) into a single result.
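
To make the idea of parallel heads a little more concrete, here is a minimal pure-Python sketch. It is an illustration under simplifying assumptions, not a production implementation: the function names (`attention`, `multi_head_attention`) and the toy token vectors are made up for this example, and the learned per-head projections and final output projection that real transformers use are left out.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def multi_head_attention(x, num_heads):
    """Split each token vector into slices, attend per slice, then concatenate.

    Simplification: each head uses its slice directly as query, key, and value.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        part = [tok[h * d_head:(h + 1) * d_head] for tok in x]
        head_outputs.append(attention(part, part, part))
    # Concatenate the heads back into one vector per token
    return [[v for head in head_outputs for v in head[t]] for t in range(len(x))]

# Three toy "tokens", each a 4-dimensional vector
tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_attention(tokens, num_heads=2)
```

Each head only sees its own slice of the vectors, so different heads are free to pick up on different relationships between the same tokens.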

In addition to multi-head attention mechanisms, transformers rely on layer normalization, residual connections, feed-forward layers, and positional embeddings.
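
As a rough sketch of how these supporting pieces fit together, the snippet below wraps a tiny feed-forward layer in a residual connection followed by layer normalization, the "add and norm" pattern from the original transformer paper. The weights `w1` and `w2` are arbitrary illustrative values, and biases and the learned scale and shift of layer normalization are omitted for brevity.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Rescale a vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def feed_forward(vec, w1, w2):
    """Two linear maps with a ReLU non-linearity in between (no biases)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def add_and_norm(vec, sublayer_out):
    """Residual connection (add the input back), then layer normalization."""
    return layer_norm([a + b for a, b in zip(vec, sublayer_out)])

# Hypothetical weights: expand 2 dimensions to 3, then project back to 2
w1 = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
w2 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]

x = [1.0, 3.0]
y = add_and_norm(x, feed_forward(x, w1, w2))
```

The residual connection gives the original input a direct path around the sub-layer, and the normalization keeps values in a stable range as layers are stacked.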

How do transformer models work?

Here’s how the transformer architecture works: 

1. Input embedding 

The first step in transformer operations is understanding the input data. The model takes a sentence (or another sequence of data) and turns each word or element into a numerical representation known as a vector embedding. These embeddings capture the meanings of the words or elements. Various techniques can be employed for input embedding, such as word embeddings and character embeddings.

This allows the model to work with continuous representations rather than discrete symbols.  
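
Here is a minimal sketch of an embedding lookup, assuming a tiny hand-made vocabulary and randomly initialized vectors. The names `vocab`, `embedding_table`, and `embed` are illustrative; in a real model the table is learned during training, and tokenization is more sophisticated than splitting on spaces.

```python
import random

random.seed(0)  # reproducible toy values

# Toy vocabulary mapping each word to a row index
vocab = {"the": 0, "cat": 1, "is": 2, "on": 3, "mat": 4}
d_model = 4  # length of each embedding vector

# One row per vocabulary entry; random here, learned in a real model
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]

def embed(sentence):
    """Map each word to its ID, then look up that ID's vector."""
    token_ids = [vocab[word] for word in sentence.lower().split()]
    return [embedding_table[i] for i in token_ids]

vectors = embed("The cat is on the mat")
```

Both occurrences of “the” map to exactly the same vector, which is one reason the next step, positional encoding, is needed.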

2. Positional encoding 

Next, the transformer model gets to know the order. Transformers don’t naturally understand the order of words, so they use positional encoding to give the model information about each element’s position. In the original design, this is done by adding vectors built from sinusoidal functions (remember sine from trigonometry class?) to the embeddings, which helps the model understand where the parts of the sequence sit relative to one another. For example, if the input sentence is “The cat is on the mat,” positional encoding lets the transformer tell that “cat” comes before “mat” and distinguish the first “the” from the second.  
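
The sinusoidal scheme from “Attention Is All You Need” can be sketched in a few lines. Even-numbered dimensions use sine and odd-numbered dimensions use cosine, each at a different wavelength. The helper names and the toy embedding values below are assumptions made for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Build one sinusoidal position vector per position in the sequence."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same wavelength
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_positions(embeddings):
    """Add the position vector to each embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

# The same toy word embedding appearing at positions 0 and 1
same_word = [[0.2, 0.1, -0.3, 0.4]] * 2
encoded = add_positions(same_word)
```

After the position vectors are added, the two copies of the same word no longer look identical to the model, so it can tell them apart by where they sit in the sequence.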

3. Encoder layers

The embedded and encoded input sequence is passed through multiple encoder layers. Each layer consists of two sub-layers called the self-attention mechanism and the feed-forward neural network.  

  • The self-attention mechanism allows the model to focus on different parts of the input sequence and capture dependencies. It calculates attention scores for each element based on its relationships with other elements in the sequence.

For each word in a sentence, the self-attention layer computes three vectors (key, value, query). To determine which words are contextually related to a given word, the model takes the dot product of that word’s query vector with the key vectors of the words in the sentence; a larger dot product indicates a stronger relationship.

  •  The feed-forward neural network applies a non-linear transformation to the outputs of the self-attention mechanism, adding expressive power to the model. In the standard design, the feed-forward layers account for roughly two-thirds of the parameters in each transformer layer.
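
The query, key, and value computation described above can be sketched as follows. This is a single-head illustration in plain Python; the projection matrices `w_q`, `w_k`, and `w_v` are set to the identity purely so the numbers are easy to follow, whereas a trained model learns them.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Return the attended outputs and the attention weights for each token."""
    queries = [matvec(w_q, tok) for tok in x]
    keys = [matvec(w_k, tok) for tok in x]
    values = [matvec(w_v, tok) for tok in x]
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this token's query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Weighted average of the value vectors
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs, all_weights

identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned projections
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, identity, identity, identity)
```

In this toy setup, the first token attends more strongly to itself than to the second token, because their vectors point in unrelated directions and so produce a smaller dot product.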

4. Decoder layers 

The encoder’s output is fed into the decoder layers next. Each decoder layer consists of three sub-layers: the self-attention mechanism, the encoder-decoder attention mechanism, and, as in the encoder, a feed-forward neural network. 

  • The self-attention mechanism in the decoder allows it to attend to different parts within the output sequence, capturing dependencies between elements. It calculates attention scores based on the relationships between positions in the output sequence, with each position limited to attending to itself and earlier positions so the model can’t peek at output it hasn’t generated yet.  
  • The encoder-decoder attention mechanism enables the decoder to focus on different parts of the input sequence, incorporating information from the encoder. This helps the decoder understand the context of the input sequence, aiding in generating the output sequence.
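
One detail of the original transformer design worth illustrating is that the decoder’s self-attention is masked: each output position may attend only to itself and to earlier positions. Here is a minimal sketch of that masking, using made-up vectors and a hypothetical function name.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(queries, keys):
    """Attention weights where position i can only see positions 0..i."""
    d_k = len(keys[0])
    rows = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                # Future position: a score of -inf becomes weight 0 after softmax
                scores.append(float("-inf"))
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        rows.append(softmax(scores))
    return rows

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(x, x)
```

The first position can only attend to itself, while the last position can attend to everything before it. The encoder-decoder attention uses no such mask, since the whole input sequence is already available.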

5. Output projection 

The output of the decoder layers is passed through a linear projection layer, which maps each position’s output to a vector with one score per vocabulary entry. Because these scores can take any value between negative and positive infinity, a softmax activation function is applied to turn them into a probability distribution for each position in the output sequence. The token with the highest probability is considered the predicted output.  
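
As a small sketch of this step, the snippet below projects a decoder output vector onto a four-word vocabulary, applies softmax, and picks the most probable word. The vocabulary, the projection weights `w_out`, and the decoder state are all invented for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat", "mat"]

# Hypothetical projection weights: one row per vocabulary entry
w_out = [[0.1, 0.3], [1.2, -0.4], [0.2, 0.9], [-0.5, 0.5]]

def predict_token(decoder_state):
    """Project to vocabulary-sized scores, then pick the most probable word."""
    logits = [sum(w * h for w, h in zip(row, decoder_state)) for row in w_out]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs

token, probs = predict_token([2.0, 0.5])
```

With these made-up numbers, “cat” receives the highest score and therefore the highest probability.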

6. Training and optimization 

Transformers are typically trained using supervised learning (or self-supervised learning, in which the target sequence comes from the text itself). The model’s predictions are compared with the correct target sequence, and optimization algorithms adjust the model’s parameters to minimize the difference between predicted and correct outputs. This is done by going through the training data in batches and improving the model’s performance. 
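
To show what “minimizing the difference” can look like, here is a deliberately tiny sketch using cross-entropy loss and plain gradient descent. As a simplifying assumption, the three logits themselves are treated as the trainable parameters; a real transformer instead backpropagates through all of its layers and typically uses a more elaborate optimizer.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss is small when the correct token gets high probability."""
    return -math.log(probs[target_index])

def train_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent update; returns new logits and the loss before it."""
    probs = softmax(logits)
    loss = cross_entropy(probs, target_index)
    # For softmax plus cross-entropy, the gradient is probs minus the one-hot target
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    new_logits = [z - learning_rate * g for z, g in zip(logits, grads)]
    return new_logits, loss

logits = [0.0, 0.0, 0.0]  # start with no preference among three tokens
losses = []
for _ in range(5):
    logits, loss = train_step(logits, target_index=2)
    losses.append(loss)
```

The loss shrinks with every step as the score of the correct token is nudged upward relative to the others.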

7. Inference 

A pretrained model can then be used for inference to generate predictions for new input sequences. During inference, the trained model applies the same preprocessing steps as during training (such as input embedding and positional encoding) to an input sequence, then feeds it through the encoder and decoder layers.  

 The model generates predictions for each position in the output sequence, producing the most probable output at each step. The predictions are then decoded into the desired format, such as when generating a translation or sequence of words. 
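
Picking the most probable output at each step is often called greedy decoding. The loop below sketches it with a hypothetical lookup table, `TOY_SCORES`, standing in for a trained model. Note the simplification: this stand-in scores the next token from the previous token only, whereas a real transformer conditions on the input and on everything generated so far.

```python
# Hypothetical stand-in for a trained model: next-token scores by previous token
TOY_SCORES = {
    "<start>": {"the": 2.0, "cat": 0.5, "sat": 0.1, "<end>": 0.0},
    "the": {"the": 0.0, "cat": 1.5, "sat": 0.3, "<end>": 0.1},
    "cat": {"the": 0.1, "cat": 0.0, "sat": 1.8, "<end>": 0.4},
    "sat": {"the": 0.2, "cat": 0.1, "sat": 0.0, "<end>": 2.2},
}

def greedy_decode(max_len=10):
    """Repeatedly append the highest-scoring next token until <end>."""
    output = ["<start>"]
    while len(output) < max_len:
        scores = TOY_SCORES[output[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == "<end>":
            break
        output.append(next_token)
    return output[1:]  # drop the <start> marker

result = greedy_decode()
```

The `max_len` cap is a safety net so that generation stops even if the model never produces the end marker.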

Applications of transformer models 

Just how much of a help are transformer models in deciphering real-world challenges?

As documented by Google, Vaswani et al’s paper shows that “the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks. On top of higher translation quality, the Transformer requires less computation to train and is a much better fit for modern machine learning hardware, speeding up training by up to an order of magnitude.”

Because of this high level of effectiveness, transformer neural networks are used for various types of applications, including: 

Machine translation 

Earlier machine translation approaches relied on statistical methods and phrase-based models, which often struggled to capture the semantic meaning and syntactic structure of sentences. With the introduction of transformer models, translation accuracy has significantly improved. 

In the transformer, the self-attention mechanism allows the model to attend to different parts of the input sequence, capturing long-range dependencies and improving the overall translation quality. Because transformer models can effectively learn the patterns in source and target languages, they can generate more-fluent and accurate translations.  

Some of the most successful machine translation systems powered by transformers include Google Translate, Microsoft Translator, and DeepL. This application can improve global communication between organizations as well as fine-tune multilingual chatbot support and content localization.  

Natural language processing

Transformer models’ ability to handle long-range dependencies and capture contextual information makes them super effective in language understanding and humanlike text generation. Their functionality has been applied to tasks such as sentiment analysis, text classification, named entity recognition, and text summarization.  

In sentiment analysis, for example, models powered by transformers can accurately determine the sentiment expressed in text. This enables companies, for instance, to gain insight from customer feedback, identifying areas for improvement and ways to better manage their brand reputation. 

Furthermore, transformer-powered NLP is used in industries such as finance and healthcare to understand and analyze legal and regulatory documents. This helps organizations ensure compliance, identify potential risks, and detect fraud. 

Speech recognition 

Their ability to capture dependencies and contextual information has enabled transformer models to transcribe spoken language very accurately. This has led to utilization in popular voice assistants such as Amazon’s Alexa, Apple’s Siri, and Google Assistant.  

These models process the audio input, segment it into smaller units, and generate the corresponding text representation. Transformers have improved the accuracy and fluency of the transcriptions.

One result: more-seamless interaction between humans and machines, especially when it comes to chatbots. The ecommerce, finance, and health industries routinely employ chatbots in their customer service operations. By improving response quality, transformers have ensured that shoppers, clients, and patients can all chat with an AI entity to quickly get the support they need. 

Image captioning 

Images contain rich visual information, while captions provide textual descriptions of the image content. Transformer models encode the visual features of an image and then decode them into corresponding captions.  

The transformer’s ability to capture dependencies and generate coherent text makes it effective in producing accurate and contextually relevant captions. Image captioning powered by transformers has found application in areas such as content understanding, visual search, and accessibility for visually impaired individuals. 

In ecommerce, image captioning is utilized to automatically generate captions for product images. Descriptive captions proactively provide shoppers with valuable information such as product features and dimensions and other specifications, thereby enhancing the shopping experience. 

Transform your outlook 

That’s it for this introduction to how transformers work their magic.

Want to use this technology to transform your ecommerce revenue? Here at Algolia, we’re incorporating transformer models and other amazing technology to improve our clients’ search results and recommendations. We use vector representation, along with machine-learning techniques such as spelling correction, language processing, and category matching, to make sense of language. Our smart search experiences have proven to enhance user engagement and increase conversion for a vast array of clients. 

Want to know more? Let’s chat, or take the next step and request a demo of how our AI-powered NeuralSearch can give your site surprisingly on-target search results.

About the author

Vincent Caruana

Senior Digital Marketing Manager, SEO
