Transformers: Why everyone is talking about it?

Transformers!! yay 😀

Update 11th Dec 2021: see what Andrej Karpathy is saying about transformers

I had been planning to read about the Transformers for a while, but I had to wait until I had a few more days to get my hands on it. This is a seminal concept in NLP and I am going to try to explain it in laymen terms as possible. I was overwhelmed with the level of details in transformers whenever I tried reading about it. So I will give you the building blocks of transformers so that you don’t hesitate to read further.

Why Transformers?

Its important to understand why Transformers are so popular. Recurrent Neural Networks,long short-term memory and gated RNNs are the popularly approaches used for Sequence Modelling tasks such as machine translation and language modeling. However, RNN/CNN handle sequences word-by-word in a sequential fashion. This sequentiality is an obstacle toward parallelization of the process. Moreover, when such sequences are too long, the model is prone to forgetting the content of distant positions in sequence or mix it with following positions’ content.
Recent works have achieved significant improvements in computational efficiency and model performance through factorization tricks and conditional computation. But they are not enough to eliminate the fundamental constraint of sequential computation. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance inthe input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.But, the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and has reached a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Now we know why we are using Transformers.

The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution

The main characteristics are:
Non-Sequential: sentences are processed as a whole rather than word by word.

(Masked)Self Attention: this is the newly introduced ‘unit’ used to compute similarity scores between words in a sentence.

Positional encoding: another innovation introduced to replace recurrence. The idea is to use fixed or learned weights that encode information related to a specific position of a token in a sentence.

Layer Normalization: a normalization technique that is used to stabilize the variance of activations in a layer. Neural net layers work best when input vectors have a uniform mean and std in each dimension.
Self-attention is a key component of the TransformersLet us distil how it works.


Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Say the following sentence is an input sentence we want to translate:
”The animal didn’t cross the street because it was too tired”
What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.
When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.
As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

So far, we have understood why transformers are being used and what are main components. lets talk about model architecture.

Here, the encoder maps an input sequence of symbol representations (x1, …, xn) to a sequenceof continuous representations z = (z1, …, zn). Given z, the decoder then generates an outputsequence (y1, …, ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.The Transformer follows this overall architecture using stacked self-attention and point-wise, fullyconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,respectively.

Transformers Architecture


The idea of this post was to explain the what and why of Transformers and how it works. This was the seminal concept in NLP but it is so generic that it is expanding into other domains like vision as well. It allows you cross-pollinate your ideas even while working on completely different domains. I am pretty excited about the future of Transformers. For the longest time, I did not know why we are using transformers and why it outperforms other techniques. Once I understood it, it became go-to technique for NLP tasks. I hope it helped you to understand the concept and how it can be used in your own projects.
I highly recommend you to read the below sources as I have skipped a lot of details not to scare you away from transformers like I was scared whenever I tried to understand transformers.  


[Attention all you need-*The seminal paper on transformers*](
[Transformers- Highly recommended for deep understanding](
[Transformers vs RNN/LSTM](


A concise overview of cost function optimizers!

Photo by Jon Tyson on Unsplash

Optimizers are needed to find the optimal solution for the given task. Optimizers associate themselves with cost function and model parameters together by updating the model. i.e. when you want to identify weights that minimize your mean squared error in linear regression, you need to use some function to find parameters such that mean squared error is minimum, this function is called optimizer. So you use the optimizer function to reach global minima with respect to the cost function.

Types of optimizers:

1. Gradient Descent

2. Momentum

3. Nesterov Momentum

4. Adagrad

5. RMSProp

6. Adam

Gradient Descent

Gradient Descent, is one of the simplest optimization algorithms. It uses just one static learning rate for all parameters during the entire training phase.

The static learning rate does not imply an equal update after every minibatch. As the optimizers approach an (sub)optimal value, their gradients start to decrease.

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

Types of gradient descent method:

a. Batch Gradient Descent

b. Stochastic Gradient Descent

c. Mini-batch gradient Descent


1. It does not guarantee convergence and slower than other newer methods.

2. It can stuck in local minima.

3. Choosing a right learning rate is difficult.


This method has literal meaning. Its a method that helps accelerate SGD in the relavant direction and dampen oscillations.

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way. The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation. The momentum helps to avoid local minima.

Read more about momentum [here](

Nesterov Momentum

It is same as Momentum but with one additional information of notional momentum.

However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We’d like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.


Adagrad is a gradient based algorithm that adapts the learning rate to the parameters. In momentum based optimizers we adapted our updates to the slope of error function and speed up SGD. While adagrad updates based on learning rate.

It adapts the learning rate to the parameters, performing smaller updates

(i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.

Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.


RMSProp was developed to resolve the weakness of adagrad’s radically diminishing learning rates.

To combat that problem, RMSprop decay the past accumulated gradient, so only a portion of past gradients are considered. Now, instead of considering all of the past gradients, RMSprop behaves like moving average. RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients.


Adam is the latest state of the art of first order optimization method that’s widely used in the real world. It’s a modification of RMSprop. Loosely speaking, Adam is RMSprop with momentum. So, Adam tries to combine the best of both world of momentum and adaptive learning rate.

Which optimizer to use?

So, which optimizer should you now use? If your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you won’t need to tune the learning rate but likely achieve the best results with the default value.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates.Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, and Adam are very similar algorithms that do well in similar circumstances. its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually achieves to find a minimum, but it might take significantly longer than with some of the optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.


In this post we looked at the optimization algorithms beyond SGD. We looked at two classes of algorithms: momentum based and adaptive learning rate methods.

Further Reading