Transformers: Why is everyone talking about them?

Transformers!! yay 😀

Update 11th Dec 2021: see what Andrej Karpathy is saying about transformers

I had been planning to read about Transformers for a while, but I had to wait until I had a few free days to get my hands on them. This is a seminal concept in NLP, and I am going to try to explain it in layman's terms as far as possible. I was overwhelmed by the level of detail whenever I tried reading about Transformers. So I will give you the building blocks, so that you don’t hesitate to read further.

Why Transformers?

It's important to understand why Transformers are so popular. Recurrent neural networks (RNNs), long short-term memory (LSTM) networks and gated RNNs are the popular approaches for sequence modelling tasks such as machine translation and language modeling. However, RNNs handle sequences word by word, in a sequential fashion. This sequentiality is an obstacle to parallelizing the process. Moreover, when such sequences are too long, the model is prone to forgetting the content of distant positions in the sequence or mixing it with the content of following positions.
Recent work has achieved significant improvements in computational efficiency and model performance through factorization tricks and conditional computation, but these are not enough to eliminate the fundamental constraint of sequential computation. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network. The Transformer, by contrast, is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and reached a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Now we know why we are using Transformers.

The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

The main characteristics are:
Non-Sequential: sentences are processed as a whole rather than word by word.

(Masked) Self-Attention: this is the newly introduced ‘unit’ used to compute similarity scores between words in a sentence.

Positional encoding: another innovation introduced to replace recurrence. The idea is to use fixed or learned weights that encode information related to a specific position of a token in a sentence.

Layer Normalization: a normalization technique used to stabilize the activations in a layer. Neural net layers tend to work best when their input vectors have a consistent mean and standard deviation in each dimension.
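
The positional encoding idea from the list above can be sketched in a few lines. This is a minimal sketch of the fixed (sinusoidal) variant, assuming the usual sine/cosine formula with a base of 10000; the function name and the tiny sizes are made up for illustration, and learned position weights would look different.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of fixed position encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one wavelength: sine on even, cosine on odd.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Hypothetical sizes: 4 token positions, 6-dimensional embeddings.
pe = sinusoidal_positional_encoding(num_positions=4, d_model=6)
```

Each row of `pe` would be added to the embedding of the token at that position, so that two identical words at different positions end up with different input vectors.
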
Self-attention is a key component of the Transformer. Let us distil how it works.


Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Say the following sentence is an input sentence we want to translate:
“The animal didn’t cross the street because it was too tired”
What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.
When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.
As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
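
To make this concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It makes a few loud assumptions: the two-dimensional word vectors are hand-picked toy values, and the queries, keys and values are simply the word vectors themselves, whereas a real Transformer first multiplies them by learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """For every query, return a weighted average of the values plus the weights used."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this position with every position, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy, made-up vectors: "it" is deliberately placed close to "animal".
tokens = ["animal", "street", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(vectors, vectors, vectors)
it_weights = dict(zip(tokens, weights[2]))
```

In this toy setup `it_weights` gives "animal" the largest weight, which is the behaviour described above: the new representation of "it" is mostly built from "animal".
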

So far, we have understood why Transformers are used and what their main components are. Let's talk about the model architecture.

Here, the encoder maps an input sequence of symbol representations (x1, …, xn) to a sequence of continuous representations z = (z1, …, zn). Given z, the decoder then generates an output sequence (y1, …, ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

Transformers Architecture


The idea of this post was to explain the what and why of Transformers and how they work. This was a seminal concept in NLP, but it is so generic that it is expanding into other domains like vision as well. It allows you to cross-pollinate your ideas even while working on completely different domains. I am pretty excited about the future of Transformers. For the longest time, I did not know why we use Transformers and why they outperform other techniques. Once I understood them, they became my go-to technique for NLP tasks. I hope this helped you understand the concept and how it can be used in your own projects.
I highly recommend reading the sources below, as I have skipped a lot of details so as not to scare you away from Transformers, the way I was scared whenever I tried to understand them.


[Attention Is All You Need-*The seminal paper on transformers*](
[Transformers- Highly recommended for deep understanding](
[Transformers vs RNN/LSTM](


A concise overview of cost function optimizers!

Photo by Jon Tyson on Unsplash

Optimizers are needed to find the optimal solution for a given task. An optimizer ties the cost function and the model parameters together by updating the parameters in response to the cost. For example, when you want to identify the weights that minimize mean squared error in linear regression, you need some procedure that searches for parameters at which the mean squared error is minimal; this procedure is called an optimizer. So you use the optimizer to reach a minimum (ideally the global minimum) of the cost function.

Types of optimizers:

1. Gradient Descent

2. Momentum

3. Nesterov Momentum

4. Adagrad

5. RMSProp

6. Adam

Gradient Descent

Gradient Descent is one of the simplest optimization algorithms. It uses just one static learning rate for all parameters during the entire training phase.

The static learning rate does not imply an equal update after every minibatch. As the optimizer approaches a (sub)optimal value, the gradients start to decrease, and so do the updates.

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

Types of gradient descent method:

a. Batch Gradient Descent

b. Stochastic Gradient Descent

c. Mini-batch gradient Descent
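
The three variants differ only in how many samples feed each gradient estimate, which a single hypothetical `batch_size` argument can express. The sketch below fits y = w * x + b by minimizing mean squared error on a tiny made-up dataset; the function name, learning rate and epoch count are illustrative assumptions, not tuned values.

```python
import random

def gradient_descent(xs, ys, learning_rate=0.05, epochs=500, batch_size=None, seed=0):
    """Fit y = w * x + b by minimizing mean squared error."""
    rng = random.Random(seed)
    n = len(xs)
    batch_size = n if batch_size is None else batch_size  # None means full batch
    w, b = 0.0, 0.0
    indices = list(range(n))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Made-up, noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w_batch, b_batch = gradient_descent(xs, ys)               # batch: all samples per update
w_sgd, b_sgd = gradient_descent(xs, ys, batch_size=1)     # stochastic: one sample per update
w_mini, b_mini = gradient_descent(xs, ys, batch_size=2)   # mini-batch: a few samples per update
```

On this easy problem all three should land close to w = 2 and b = 1; on real data the smaller batches give noisier but cheaper updates, which is the trade-off mentioned above.
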


Disadvantages of gradient descent:

1. It does not guarantee convergence and is slower than other, newer methods.

2. It can get stuck in local minima.

3. Choosing the right learning rate is difficult.


Momentum

The name is meant literally. It is a method that helps accelerate SGD in the relevant direction and dampens oscillations.

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way. The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation. Momentum can also help the optimizer roll past shallow local minima.
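
The rolling-ball picture can be sketched on a toy problem. The example below assumes a made-up elongated bowl f(x, y) = x² + 10y² and illustrative values for the learning rate and the momentum coefficient `gamma`; setting `gamma` to 0 turns the same function into plain gradient descent for comparison.

```python
def grad(params):
    """Gradient of the toy bowl f(x, y) = x**2 + 10 * y**2."""
    x, y = params
    return [2 * x, 20 * y]

def momentum_descent(start, learning_rate=0.01, gamma=0.9, steps=300):
    params = list(start)
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = grad(params)
        # Keep a fraction gamma of the previous update and add the new gradient step.
        velocity = [gamma * v + learning_rate * gi for v, gi in zip(velocity, g)]
        params = [p - v for p, v in zip(params, velocity)]
    return params

with_momentum = momentum_descent([5.0, 2.0])
without_momentum = momentum_descent([5.0, 2.0], gamma=0.0)  # plain gradient descent
```

With the same learning rate and number of steps, `with_momentum` should end up noticeably closer to the minimum at (0, 0) along the flat x direction than `without_momentum`.
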

Read more about momentum [here](

Nesterov Momentum

It is the same as Momentum, with one addition: a notion of where the momentum is about to take the parameters.

However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We’d like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.
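
One way to give the ball that "notion of where it is going" is to evaluate the gradient at the look-ahead position instead of the current one. The sketch below assumes the same kind of made-up toy bowl, f(x, y) = x² + 10y², and illustrative hyperparameters.

```python
def grad(params):
    """Gradient of the toy bowl f(x, y) = x**2 + 10 * y**2."""
    x, y = params
    return [2 * x, 20 * y]

def nesterov_descent(start, learning_rate=0.01, gamma=0.9, steps=300):
    params = list(start)
    velocity = [0.0, 0.0]
    for _ in range(steps):
        # Peek at where the accumulated momentum is about to carry us...
        lookahead = [p - gamma * v for p, v in zip(params, velocity)]
        # ...and measure the slope there, not at the current position.
        g = grad(lookahead)
        velocity = [gamma * v + learning_rate * gi for v, gi in zip(velocity, g)]
        params = [p - v for p, v in zip(params, velocity)]
    return params

final_params = nesterov_descent([5.0, 2.0])
```

The only difference from plain momentum is the `lookahead` line: if the slope already points uphill at the look-ahead point, the update is reduced before the ball overshoots.
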


Adagrad

Adagrad is a gradient-based algorithm that adapts the learning rate to the parameters. With the momentum-based optimizers, we adapted our updates to the slope of the error function and sped up SGD. Adagrad instead adapts the size of each update to the individual parameter.

It performs smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.

Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
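
The shrinking learning rate is easy to see in a small sketch. The example assumes a made-up toy bowl f(x, y) = x² + 10y² and illustrative hyperparameters, and it records the effective per-parameter learning rate at every step so the decay can be inspected.

```python
import math

def grad(params):
    """Gradient of the toy bowl f(x, y) = x**2 + 10 * y**2."""
    x, y = params
    return [2 * x, 20 * y]

def adagrad(start, learning_rate=0.5, epsilon=1e-8, steps=500):
    params = list(start)
    sum_sq = [0.0, 0.0]      # running sum of squared gradients, one per parameter
    effective_rates = []     # learning rate actually applied at each step
    for _ in range(steps):
        g = grad(params)
        sum_sq = [s + gi * gi for s, gi in zip(sum_sq, g)]
        # The sum in the denominator only ever grows, so these rates only shrink.
        rates = [learning_rate / (math.sqrt(s) + epsilon) for s in sum_sq]
        params = [p - r * gi for p, r, gi in zip(params, rates, g)]
        effective_rates.append(rates)
    return params, effective_rates

final_params, effective_rates = adagrad([5.0, 2.0])
first_rate, last_rate = effective_rates[0][0], effective_rates[-1][0]
```

Comparing `first_rate` with `last_rate` shows the effective learning rate of the first parameter dropping over training, even though `learning_rate` itself never changes.
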


RMSProp

RMSProp was developed to resolve the weakness of Adagrad's radically diminishing learning rates.

To combat that problem, RMSprop decays the past accumulated gradients, so only a portion of the past gradients is considered. Instead of considering all of the past gradients, RMSprop behaves like a moving average: it divides the learning rate by an exponentially decaying average of squared gradients.
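
In code, the change from Adagrad is a single line: the running sum becomes a decaying average. This is a minimal sketch on a made-up toy bowl f(x, y) = x² + 10y²; the decay of 0.9 and the other hyperparameters are illustrative assumptions.

```python
import math

def grad(params):
    """Gradient of the toy bowl f(x, y) = x**2 + 10 * y**2."""
    x, y = params
    return [2 * x, 20 * y]

def rmsprop(start, learning_rate=0.01, decay=0.9, epsilon=1e-8, steps=1000):
    params = list(start)
    avg_sq = [0.0, 0.0]  # exponentially decaying average of squared gradients
    for _ in range(steps):
        g = grad(params)
        # Old squared gradients fade out instead of piling up forever.
        avg_sq = [decay * a + (1 - decay) * gi * gi for a, gi in zip(avg_sq, g)]
        params = [p - learning_rate * gi / (math.sqrt(a) + epsilon)
                  for p, gi, a in zip(params, g, avg_sq)]
    return params

final_params = rmsprop([5.0, 2.0])
```

Because the step is normalized by recent gradient magnitudes, each update has a size of roughly `learning_rate`, so with a constant learning rate the parameters end up hovering near the minimum rather than settling exactly on it.
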


Adam

Adam is a first-order optimization method, often treated as the default state-of-the-art choice, that is widely used in the real world. It’s a modification of RMSprop. Loosely speaking, Adam is RMSprop with momentum. So Adam tries to combine the best of both worlds: momentum and adaptive learning rates.
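
"RMSprop with momentum" can be sketched as two running averages, one of the gradients and one of their squares, each corrected for starting at zero. The toy bowl f(x, y) = x² + 10y² and the learning rate are made up for illustration; the beta values are the commonly quoted defaults.

```python
import math

def grad(params):
    """Gradient of the toy bowl f(x, y) = x**2 + 10 * y**2."""
    x, y = params
    return [2 * x, 20 * y]

def adam(start, learning_rate=0.05, beta1=0.9, beta2=0.999, epsilon=1e-8, steps=1000):
    params = list(start)
    m = [0.0, 0.0]  # decaying average of gradients (the momentum part)
    v = [0.0, 0.0]  # decaying average of squared gradients (the RMSprop part)
    for t in range(1, steps + 1):
        g = grad(params)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: both averages start at zero and would otherwise be too small early on.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        params = [p - learning_rate * mh / (math.sqrt(vh) + epsilon)
                  for p, mh, vh in zip(params, m_hat, v_hat)]
    return params

final_params = adam([5.0, 2.0])
```

Dropping the `m` average (using the raw gradient instead) would give back an RMSprop-style update, which is the sense in which Adam extends it.
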

Which optimizer to use?

So, which optimizer should you use? If your input data is sparse, then you are likely to achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you won’t need to tune the learning rate and will likely achieve the best results with the default value.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. Adam, finally, adds bias-correction and momentum to RMSprop. In this respect, RMSprop and Adam are very similar algorithms that do well in similar circumstances. Its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. So Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually manages to find a minimum, but it might take significantly longer than some of the other optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and are training a deep or complex neural network, you should choose one of the adaptive learning rate methods.


In this post we looked at the optimization algorithms beyond SGD. We looked at two classes of algorithms: momentum based and adaptive learning rate methods.


Why Problem Solving is the most important skill in Data Science?

A few years back, when I started my journey in data science, I was fascinated by Kaggle. I used to devote my time there. I was overwhelmed by the knowledge and the problems. I wanted to be competitive, and Kaggle shows a leaderboard. You win and make a name. I used to have casual discussions with my manager on a variety of topics. One day I asked my manager which things are most important in data science and what his opinion about Kaggle was.

He said, “Problem solving is one of the most important skills for a data scientist. As far as Kaggle is concerned, I don’t think Kaggle helps you become a good problem solver. If you know your Xs and Y, then almost anyone can solve the problem. The most important part is defining your Xs and Y.”

After some time I gave it some thought, and he was right. In my experience, solving most of the problems came down to the way the problem was defined. It is almost never a question of building the most accurate model; rather, it is about solving the problem. Kaggle comes in handy for learning different approaches to solving problems. But it's not critical to your success in a data science career.

Read the below blog to get more perspective on this

Stitchfix: comfortable with ambiguity and successfully framing problems

A long time!!

Hi Guys!
How have you been doing? Sorry for not writing anything in the past few years. I had been doing crazy shit work. Ohh no.. that is just an excuse I am telling myself. The motivation for restarting the blog came because I am seeing a lot of junk content on the web about data science (last week, I came across advice that a 4GB laptop is good for data science, and much more rubbish).

There are a lot of things I want to talk about. The things which most people do not talk about. I will cover the following things in upcoming articles:

  • Why almost all the paid data science courses are just junk and how to judge if a course is good for you?
  • How to maximize the success rate of data science projects?
  • How to customize agile development for data science?
  • How to be more productive in your work?
  • Why data science is hard and what can you do about it?
  • How to succeed in data science in startups?
  • How to structure data teams to get the best outcome?
  • How to design data science projects?
  • How to think about data science problems?
  • Is data science the right career for you?
  • What have I learnt from 5 years in data science?
  • How to evaluate a data science job?
  • Best practices in data science.

Please tell me in the comments if you think I have missed something that you want me to talk about.

Data Science Interview Questions – Part 1

Data science is a field which has no end. It doesn’t matter how much you read, it will always be too little. One interviewer told me that you use only 5% of what you learn in data science. It's actually true. The type of questions changes according to the job profile, though. I will list all the questions that I ask during interviews and also the questions which were asked to me.

Keep checking this blog post, as I will keep updating it. (I will also write the answers soon.)

Analytics and Consulting firms

  1. Explain logistic regression. Why do we use it? What are the assumptions of linear regression?
  2. Clustering questions: How do you choose between K-means and hierarchical clustering?
  3. Explain the ROC curve and precision-recall.
  4. What do you mean by p-value? (My favorite question. Most people don’t know the answer to this question.)
  5. Explain the steps in a data science project.
  6. Difference between machine learning and statistical modeling.
  7. Explain logistic regression to me in LAYMAN'S TERMS. (Without using technical words)
  8. What is correlation? Is it bad or good?
  9. What do you mean by data science? (Another fav)
  10. Types of join. (Must)
  11. What is R square?
  12. What is random forest?
  13. Explain any algorithm end to end. (Most often logistic regression and decision tree)
  14. What's the most challenging project you have done? How did you overcome the challenges?
  15. Explain the Central Limit Theorem.




Deep Learning Explained in layman's terms

Deep learning is pattern recognition via so-called neural networks. Neural networks are a set of algorithms, modeled after the human brain. They are sensors: a form of machine perception. Deep learning is a name for a certain type of stacked neural network composed of several node layers. Each layer’s output is simultaneously the subsequent layer’s input, starting from an initial input layer.

Deep-learning networks are distinguished from the more commonplace single-hidden-layer neural networks by their depth; that is, the number of node layers through which data is passed in a multistep process of pattern recognition. As a rule of thumb, more than three layers (including input and output) counts as deep learning. Anything less is simply machine learning.


Deep learning is motivated by intuition, theoretical arguments from circuit theory, empirical results, and current knowledge of neuroscience.


  • The main concept in deep learning algorithms is automating the extraction of representations (abstractions) from the data.


  • A key concept underlying Deep Learning methods is distributed representations of the data, in which a large number of possible configurations of the abstract features of the input data is feasible, allowing for a compact representation of each sample and leading to a richer generalization.


  • Deep learning algorithms lead to abstract representations because more abstract representations are often constructed based on less abstract ones. An important advantage of more abstract representations is that they can be invariant to local changes in the input data.
  • Deep learning algorithms are actually Deep architectures of consecutive layers.


  • Stacking up the nonlinear transformation layers is the basic idea in deep learning algorithms.


  • It is important to note that the transformations in the layers of deep architecture are non-linear transformations which try to extract underlying explanatory factors in the data.


  • The final representation of data constructed by the deep learning algorithm (output of the final layer) provides useful information from the data which can be used as features in building classifiers, or even can be used for data indexing and other applications which are more efficient when using abstract representations of data rather than high dimensional sensory data.



Let’s understand this in layman’s terms:

Imagine you’re building a shopping recommendation engine, and you discover that if an item is trending and a user has browsed the category of that item in the last day, they are very likely to buy the trending item.

These two variables are so accurate together that you can combine them into a new single variable, or feature (Call it “interested_in_trending_category”, for example).

Finding connections between variables and packaging them into a new single variable is called feature engineering.
Deep learning is automated feature engineering.
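
Done by hand, the feature from the shopping example might look like the sketch below. The field names and the sample rows are made up for illustration; the point is only that a human had to notice the pattern and write this rule, which is the step deep learning tries to learn on its own.

```python
def interested_in_trending_category(item_is_trending, hours_since_category_browse):
    """Hand-made feature: 1 if the item is trending AND the user browsed its category in the last day."""
    browsed_in_last_day = (
        hours_since_category_browse is not None
        and hours_since_category_browse <= 24
    )
    return int(item_is_trending and browsed_in_last_day)

# Made-up rows: (trending, browsed 3 hours ago), (trending, never browsed), (not trending, browsed 1 hour ago).
rows = [
    {"item_is_trending": True, "hours_since_category_browse": 3},
    {"item_is_trending": True, "hours_since_category_browse": None},
    {"item_is_trending": False, "hours_since_category_browse": 1},
]
features = [
    interested_in_trending_category(r["item_is_trending"], r["hours_since_category_browse"])
    for r in rows
]
```

Only the first row satisfies both conditions, so only it gets the combined feature switched on.
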






Setting up a GPU based Deep Learning Machine

Using GPUs for deep learning brings a tremendous performance gain. It has been reported that execution on a GPU is 10x to 50x faster than CPU-based deep learning, and it is also a lot cheaper than a CPU-based system. You can see this in the picture below.


I was curious to check deep learning performance on my laptop, which has a GeForce GT 940M GPU.
Today I will walk you through how to set up a GPU-based deep learning machine. I have used Tensorflow for deep learning on a Windows system. Using a GPU on Windows is really a pain. You can’t get it to work if you don’t follow the correct steps. But if you follow the steps, it will be very easy to set up Tensorflow with GPU on Windows.


  • Python 3.5 – Currently Tensorflow on windows doesn’t support python 2.7.
  • nvidia cuda GPU


  • CUDA toolkit
    Use this link to install cuda-
    According to your windows version, you can install this toolkit.
    Recommended version: Cuda Toolkit 8.0
  • cuDNN
    Use this link to install cuDNN -
    You need to register to download this. You need to choose cuDNN v5.1. I have tried the latest version but it didn’t work out. After downloading, you need to copy the cuDNN files into this location, replacing the existing ones:
    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0

    Now you also need to set path for environment variables. Check below snapshots and make the required changes. If they are not there you have to do it manually.

  • Python
    Install using Anaconda. Use whichever Anaconda Python version (2.7 or 3.5) you want for your daily tasks, because we will create a separate environment for Python 3.5.
  • Tensorflow with GPU
    Create a virtual environment for tensorflow

    conda create --name tensorflow-gpu python=3.5  

    Then activate this virtual environment:

    activate tensorflow-gpu  

    And finally, install TensorFlow with GPU support:

    pip install tensorflow-gpu  

    Test the TensorFlow installation

    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello, TensorFlow!')
    >>> sess = tf.Session()
    >>> print(sess.run(hello))
    Hello, TensorFlow!  

    If you run into any error check below link-
    Any other link might lead you to different problems.

  • Let’s play with Tensorflow GPU 

    Let’s check performance on MNIST data using convolution neural network.
    Download the code - let's run it and check its performance.

  • GPU based Tensorflow

We can see that each step takes roughly 40 ms. Now we want to see whether this GPU performance is worth it or not.

  • CPU Tensorflow

Let’s take a look at CPU performance. Really? Each step takes ~370 ms. Wow, what a difference!! Tensorflow with GPU is roughly 9x to 10x faster than Tensorflow with CPU.

Next steps:

Further, you can install the Keras library to do more advanced things in deep learning. Keras is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Here Keras uses Tensorflow as its backend. Keras also works seamlessly on CPU and GPU. Follow the commands below. Install Jupyter notebook too if you love working with notebooks.

conda install jupyter
conda install scipy pandas 
conda install mingw libpython  # theano dependencies
conda install theano 
pip install keras

In case of any trouble, leave a comment, and let me know your thoughts about this article.

Happy hunting with deep learning !!

A Data Science Project- Part 4: Chi-Square Test of Independence

In the last article, we discussed the ANOVA test, which gave us insight into checking the distribution of a response variable among the groups of an independent variable. Today, we will learn how to check the relationship between two categorical variables.
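
As a rough preview of the idea, the chi-square statistic compares the observed counts in a contingency table with the counts we would expect if the two variables were independent. The sketch below uses a made-up 2x2 table and a hypothetical helper name; it computes only the statistic and the degrees of freedom, not the p-value.

```python
def chi_square_statistic(table):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Made-up counts: rows are two groups, columns are two categories.
observed = [[30, 10],
            [20, 40]]
statistic, dof = chi_square_statistic(observed)
```

For this table the statistic is about 16.67 with 1 degree of freedom, well above the 3.841 critical value at the 5% level, so independence would be rejected for these made-up counts.
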

A Data Science Project- Part 3: Hypothesis testing and ANOVA

This post is a continuation of the A Data Science Project series. We use the ANOVA test whenever we need to check whether two or more groups are different from each other. For example, let’s say there are four races in a school: White, Hispanic, Black and Indian. Now the school management wants to know whether the marks scored by these races are statistically different or not. If they are, the management would like to do something about it.
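
The core of a one-way ANOVA is the F statistic: the variation between group means divided by the variation within groups. The sketch below uses three made-up groups of marks and a hypothetical helper name; it stops at the F statistic and its degrees of freedom and does not compute a p-value.

```python
def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    # Variation of individual values around their own group mean.
    ss_within = sum((value - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for value in group)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within

# Made-up marks for three groups of students.
marks = [[61, 62, 63], [62, 63, 64], [65, 66, 67]]
f_statistic, df_between, df_within = one_way_anova_f(marks)
```

A large F value means the group means differ by more than the spread inside each group would explain; here the between-group mean square is 13 and the within-group mean square is 1, giving F = 13.
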



A Data Science Project- Part 2: Making Sense of Data

I hope you have gone through part1 and part2. Today I will tell you not only how to explore data through visualization but also the most important part: how to interpret the visualizations. Don’t miss the summary at the end of every post.

We will create bar charts, histograms, scatter plots, box plots etc. We will also check the association between two variables. Let’s start.