## A concise overview of cost function optimizers!

Optimizers are needed to find the optimal solution for the given task. Optimizers associate themselves with cost function and model parameters together by updating the model. i.e. when you want to identify weights that minimize your mean squared error in linear regression, you need to use some function to find parameters such that mean squared error is minimum, this function is called optimizer. So you use the optimizer function to reach global minima with respect to the cost function.

Types of optimizers:

2. Momentum

3. Nesterov Momentum

5. RMSProp

Gradient Descent, is one of the simplest optimization algorithms. It uses just one static learning rate for all parameters during the entire training phase.

The static learning rate does not imply an equal update after every minibatch. As the optimizers approach an (sub)optimal value, their gradients start to decrease.

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

Challenges:

1. It does not guarantee convergence and slower than other newer methods.

2. It can stuck in local minima.

3. Choosing a right learning rate is difficult.

### Momentum

This method has literal meaning. Its a method that helps accelerate SGD in the relavant direction and dampen oscillations.

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way. The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation. The momentum helps to avoid local minima.

### Nesterov Momentum

It is same as Momentum but with one additional information of notional momentum.

However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We’d like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

(i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.

Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.

### RMSProp

To combat that problem, RMSprop decay the past accumulated gradient, so only a portion of past gradients are considered. Now, instead of considering all of the past gradients, RMSprop behaves like moving average. RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients.

Adam is the latest state of the art of first order optimization method that’s widely used in the real world. It’s a modification of RMSprop. Loosely speaking, Adam is RMSprop with momentum. So, Adam tries to combine the best of both world of momentum and adaptive learning rate.

### Which optimizer to use?

So, which optimizer should you now use? If your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you won’t need to tune the learning rate but likely achieve the best results with the default value.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates.Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, and Adam are very similar algorithms that do well in similar circumstances. its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually achieves to find a minimum, but it might take significantly longer than with some of the optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.

### Conclusion

In this post we looked at the optimization algorithms beyond SGD. We looked at two classes of algorithms: momentum based and adaptive learning rate methods.

https://cs231n.github.io/neural-networks-3/#sgd

## Data Science Interview Questions – Part 1

Data science is a field which has no ends. It doesn’t matter how much you will read it will always be less. One interviewer told me that you use only 5% of knowledge what you learn in data science. Its actually true. Although the type of questions changes according to the job profiles. I will list all the questions that I ask during the interview and also the questions which were asked to me.

keep checking this blog post as I will always be updating this post.(I will also write their answers soon)

Analytics and Consulting firms

1. Explain logistic regression? why do we use it? Assumptions of linear regression
2. Clustering questions- How do you choose between K means and Hierarchical clustering?
3. Explain ROC curve, Precision- Recall.
4. What do you mean by p-value(My favorite question. Most people don’t know answer to this question)
5. Explain the steps in a data science project.
6. Difference between machine learning and statistical modeling.
7. Explain me logistic regression in LAYMEN TERMS. (Without using technical words)
8. What is the correlation? Is it bad or good?
9. What do you mean by data science.(Another fav)
10. Types of join.(Must)
11. What is R square?
12. What is random forest?
13. Explain any algorithm end to end.(Most often logistic regression and decision tree)
14. Whats the most challenging project you have done? How did you overcome ?
15. Explain Central limit theorem.

## Setting up a GPU based Deep Learning Machine

Using GPU for deep learning has seen a tremendous performance. It has been reported that execution time using GPU is 10x -50x times faster than CPU-based deep learning and It is also a lot cheaper than CPU-based system. You can see this below in the picture.

I was curious to check deep learning performance on my laptop which has GeForce GT 940M GPU.
Today I will walk you through how to set up GPU based deep learning machine to make use of GPUs. I have used Tensorflow for deep learning on a windows system. Using GPU in windows system is really a pain. You can’t get it to work if you don’t follow correct steps. But if you follow the steps it will be very easy to set up Tensorflow with GPU for windows.

Requirement:

• Python 3.5 – Currently Tensorflow on windows doesn’t support python 2.7.
• nvidia cuda GPU

Installation:

• CUDA toolkit
According to your windows version, you can install this toolkit.
Recommended version: Cuda Toolkit 8.0
• cuDNN
Use this link to install cuDNN -https://developer.nvidia.com/cudnn
You need to register to install this. You need to choose cuDNN v5.1. I have tried latest version but it didn’t work out.After downloading, You need to copy and replace these files
into this location C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0

Now you also need to set path for environment variables. Check below snapshots and make the required changes. If they are not there you have to do it manually.

• Python
Install using anaconda . Use whatever anaconda python 2.7 or 3.5 you want to use for your daily tasks because we will create a separate environment for python 3.5 .
• Tensorflow with GPU
Create a virtual environment for tensorflow

``````conda create --name tensorflow-gpu python=3.5
``````

Then activate this virtual environment:

``````activate tensorflow-gpu
``````

And finally, install TensorFlow with GPU support:

``````pip install tensorflow-gpu
``````

Test the TensorFlow installation

``````python
...
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!  ``````

If you run into any error check below link-

https://www.tensorflow.org/install/install_windows

• Let’s play with Tensorflow GPU

Let’s check performance on MNIST data using convolution neural network.

• GPU based Tensorflow

We can see  each step is taking roughly around ~40 ms. Now we want to see if this gpu performanvce worth or not.

• CPU Tensorflow

Let’s take a look at CPU performance. Really? Each step is taking ~370 ms . Wow what a performance!! Tensorflow with GPU is 10x faster than Tensorflow with CPU.

Next steps:

Further, You can install Keras library to do more advance things in deep learning. Keras is a high-level neural networks API, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation.Keras uses Tensorflow as backend. Keras also work seamlessly on CPU and GPU. Follow below commands. Install jupyter notebook too if you love working with notebooks.

``````conda install jupyter
conda install scipy pandas
conda install mingw libpython (theano dependencies)
conda install theano
pip install keras

``````

Happy hunting with deep learning !!

## How to improve performance of Neural Networks

Neural networks have been the most promising field of research for quite some time. Recently they have picked up more pace. In earlier days of neural networks, it could only implement single hidden layers and still we have seen better results.
Deep learning methods are becoming exponentially more important due to their demonstrated success at tackling complex learning problems. At the same time, increasing access to high-performance computing resources and state-of-the-art open-source libraries are making it more and more feasible for enterprises, small firms, and individuals to use these methods.

Neural network models have become the center of attraction in solving machine learning problems.

Now, What’s the use of knowing something when we can’t apply our knowledge intelligently. There are various problems with neural networks when we implement them and if we don’t know how to deal with them, then so-called “Neural Network” becomes useless.

Some Issues with Neural Network:

1. Sometimes neural networks fail to converge due to low dimensionality.
2. Even a small change in weights can lead to significant change in output. sometimes results may be worse.
3. The gradient may become zero . In this case , weight optimization fails.
4. Data overfitting.
5. Time complexity is too high. Sometimes algorithm runs for days even on small data set.
6. We get the same output for every input when we predict.

So what next!!

One day I sat down(I am not kidding!)  with neural networks to check What can I do for better performance of neural networks.  I have tried and tested various use cases to discover solutions.
Let’s dig deeper now. Now we’ll check out the proven way to improve the performance(Speed and Accuracy both) of neural network models:

1. Increase hidden Layers

we have always been wondering what happens if we can implement more hidden layers!! In theory, it has been established that many of the functions will converge in a higher level of abstraction. So it seems more layers better results

Multiple hidden layers for networks are created using the mlp function in the RSNNS package and neuralnet in the neuralnet package. As far as I know, these are the only neural network functions in R that can create multiple hidden layers(I am not talking about Deep Learning here). All others use a single hidden layer. Let’s start exploring the neural net package first.

I won’t go into the details of the algorithms. You can google it yourself about their training process.  I have used a data set and want to predict Response/Target  variable. Below is a sample code for 4 layers.

R code

A. Neuralnet Package

library(neuralnet)

set.seed(1000000)

multi_net = neuralnet(action_click~ FAL_DAYS_last_visit_index+NoofSMS_30days_index+offer_index+Days_last_SMS_index+camp_catL3_index+Index_weekday , algorithm= ‘rprop+’, data=train, hidden = c(6,9,10,11) ,stepmax=1e9 , err.fct = “ce”   ,linear.output =F)

I have tried several iteration. Below are the confusion matrix of some of  the results

B. RSNNS Package

library(RSNNS)
set.seed(10000)

a = mlp(train[,2:7], train\$action_click, size = c(5,6), maxit = 5000,

initFunc = “Randomize_Weights”, initFuncParams = c(-0.3, 0.3),

learnFunc = “Std_Backpropagation”, learnFuncParams = c(0.2,0),

hiddenActFunc = “Act_Logistic”, shufflePatterns = TRUE, linOut = FALSE )

I have tried several iteration. Below are the confusion matrix of some of  the results.

From my experiment, I have concluded that when you increase layers, it may result in better accuracy but it’s not a thumb rule. You have to just test it with a different number of layers. I have tried several data set with several iterations and it seems neuralnet package performs better than RSNNS.  Always start with single layer then gradually increase if you don’t have performance improvement .

Figure 2 . A multi layered Neural Network

2. Change Activation function

Changing activation function can be a deal breaker for you. I have tested results with sigmoid, tanh and Rectified linear units. Simplest and most successful activation function is rectified linear unit. Mostly we use sigmoid function network.  Compared to sigmoid, the gradients of ReLU does not approach zero when x is very big. ReLU also converges faster than other activation function. You should know how to use these activation function i.e. when you use “tanh” activation function you should categorize your binary classes into “-1” and “1”.  The classes encoded in 0 and 1 , won’t work in tanh activation function.

3. Change Activation function in Output layer

I have experimented with trying a different activation function in output layer than that of in hidden layers. In some cases, results were better so its better to try with different activation function in output neuron.

As with the single-layered ANN, the choice of activation function for the output layer will depend on the task that we would like the network to perform (i.e. categorization or regression). However, in multi-layered NN, it is generally desirable for the hidden units to have nonlinear activation functions (e.g. logistic sigmoid or tanh). This is because multiple layers of linear computations can be equally formulated as a single layer of linear computations. Thus using linear activations for the hidden layers doesn’t buy us much. However, using linear activations for the output unit activation function (in conjunction with nonlinear activations for the hidden units) allows the network to perform nonlinear regression.

4. Increase number of neurons

If an inadequate number of neurons are used, the network will be unable to model complex data, and the resulting fit will be poor. If too many neurons are used, the training time may become excessively long, and, worse, the network may overfit the data. When overfitting \$ occurs, the network will begin to model random noise in the data. The result is that the model fits the training data extremely well, but it generalizes poorly to new, unseen data. Validation must be used to test for this.

There is no rule of thumb in choosing number of neurons but you can consider this one –

N is number of hidden neurons-

• N = 2/3 the size of the input layer, plus the size of the output layer.
• N < twice the size of the input layer

5. Weight initialization

While training neural networks, first-time weights are assigned randomly. Although weight updation does take place, but sometimes neural network can converge in local minima. When we use multilayered architecture, random weights does not perform well. We can supply optimal initial weights. You should try with different random seed to generate different random weights then choose the seed number which works well for your problem.
You can use methods like Adaptive weight initialization, Xavier weight initialization etc  to initialize weights.

The random values of initial synaptic weights generally lead to a big error. So learning is finding a proper value for the synaptic weights, in order to find the minimum value for output error. below figure shows being trapped in local minima in order to find optimal weights-

Figure 3: Local minima problem due to random initialization of weights

6. More data

When We have lots of data , then neural network generalizes well. otherwise, it may overfits data. So it’s better to have more data. Overfitting is a general problem when using neural networks. The amount of data needed to train a neural network is very much problem-dependent. The quality of training data (i.e., how well the available training data represents the problem space) is as important as the quantity (i.e., the number of records, or examples of input-output pairs). The key is to use training data that generally span the problem data space. For relatively small datasets (fewer than 20 input variables, 100 to several thousand records) a minimum of 10 to 40 records (examples) per input variable is recommended for training. For relatively large datasets (more than 20 000 records), the dataset should be sub-sampled to obtain a smaller dataset that contains 30 – 50 records per input variable. In either case, any “extra” records should be used for validating the neural networks produced.

7. Normalizing/Scaling data

Most of the times scaling/normalizing your input data can lead to improvement. There are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs. When NN use gradient descent to optimize parameters , standardizing covariates may speed up convergence (because when you have unscaled covariates, the corresponding parameters may inappropriately dominate the gradient).

8. Change learning algorithm parameters

Try different learning rates (0.01 to 0.9). Also try different momentum parameters, if your algorithm supports it (0.1 to 0.9). Changing learning rate parameter can help us to identify if we are getting stuck in local minima.

The two plots below nicely emphasize the importance of choosing learning rate by illustrating two most common problems with gradient descent:

(i) If the learning rate is too large, gradient descent will overshoot the minima and diverge.

(ii) If the learning rate is too small, the algorithm will require too many epochs to converge and can become trapped in local minima more easily.

Figure 4 : Effect of learning rate parameter values

9. Deep learning for auto feature generation

Machine learning is one of the fastest-growing and most exciting fields out there, and deep learning represents its true bleeding edge. Usual neural networks are not efficient in creating features. Like other machine learning models, Neural networks algorithm’s performance also depends on the quality of features. If we have better features then we would have better accuracy. When we use deep architecture then features are created automatically and every layer refines the features. i.e.

10. Misc- You can try with a different number of epoch and different random seed. Various parameters like dropout ratio, regularization weight penalties, early stopping etc can be changed while training neural network models.

To improve generalization on small noisy data, you can train multiple neural networks and average their output or you can also take a weighted average. There are various types of neural network model and you should choose according to your problem. i.e. while doing stock prediction you should first try Recurrent Neural network models.

Figure 5 : After dropout, insignificant neurons do not participate in training

References:

## Feature Learning , Deep Learning and Machine learning

Machine learning is a very successful technology but applying it today often requires spending substantial effort hand-designing features. This is true for applications in vision, audio, and text

Any machine learning algorithm performs as good as provided features are. Let’s understand this by using image classification example. When we try to classify an image into “motorcycle” and “Not Motorcycle”.

Algorithm needs features so that it can draw information from them.

Earlier Several researchers have spent decades to hand design these features . Below image shows sample process of creating features.

fig 1 – Feature vector creation

In the case of images, audio, and text , coming up with features is difficult , time-consuming and it requires expert knowledge. When we work on applications of learning, we spend a lot of time in tuning these features.

So this is where usual machine learning fails us. What about if we can automate this feature learning task instead of hand -engineering them ?

Self-taught learning/ Unsupervised Feature Learning

In particular, the promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn from ”unlabeled” data, then we can easily obtain and learn from massive amounts of it. Even though a single unlabeled example is less informative than a single labeled example, if we can get tons of the former—for example, by downloading random unlabeled images/audio clips/text documents off the internet—and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.

In Self-taught learning and Unsupervised feature learning, we will give our algorithms a large amount of unlabeled data with which to learn a good feature representation of the input.

Deep Learning

“Deep Learning” algorithms  can automatically learn feature representations (often from unlabeled data) thus avoiding a lot of time-consuming engineering. These algorithms are based on building massive artificial neural networks that were loosely inspired by cortical (brain) computations. Below image shows comparison of deep learning feature discovery process among other algorithms.

fig 2 –  Deep Learning Feature creation

To simulate the brain’s visual processing, sparse coding was developed to explain early visual processing in the brain(edge – detection).

Input: Images x (1) , x (2) , …, x (m) (each in Rn* n)

Learn: Dictionary of bases f1 , f2 , …, fk (also Rn*n), so that each input x can be approximately decomposed as:

x  =  å aj fj                           s.t. aj ’s are mostly zero (“sparse”)

Sparse coding algorithm automatically learns to represent an image in terms of the edges that appear in it. It gives a more succinct, higher-level representation than the raw pixels.

Lets understand this by using  below example.

fig 3 – Face detection using deep  learning

In the first layer, deep learning algorithm uses sparse coding and express images in succinct, higher-level representation. Rectangles are shown in each layer look like the same size but higher level up features look at the bigger version of the image. In 2nd layer ,  we can say that some neuron detects an eye which looks like that, similarly, some neuron found ear feature too. And in highest layer, neurons find a way to detect faces.

Theoretical results suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need  deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas.

Conclusion :

Usual Machine learning is simply a curve fitting. It is capable of producing great results but we spent a lot of time in feature discovery.  While deep learning is closer to AI. It automatically learns features instead to creating it manually. We might miss some important features while creating them but deep learning tries to learn higher level features by itself.

we can use Theono(Python)  for implementation of deep learning.

Theano is a Python library that lets you to define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs.

References :

## RECOMMENDER SYSTEMS 101

Have you ever given a thought how e-commerce websites show you products with “Customer who bought this also bought this” or how Netflix recommends movies based on your interest or how facebook discovers “Person you may know” list?

Let’s look at the below pictures:

Amazon: “When you buy a book”

Netflix: “Other Movies You Might Enjoy”

You can see in these pictures that different products are getting recommended on the basis of your behavior and content. How?

You might have heard the term “Recommendation Engine”.

Recommendation engines have changed the way websites interact with visitors. Rather than providing a static experience in which users search for and potentially buy products, recommender engines personalize user experience by recommendation of products or making suggestions on the basis of past purchases, search and other behavioral traits.

Recommendation engines are algorithms for filtering and sorting items and information. These use opinions of the user community to help individuals in that community to discover interesting and relevant content from a potentially overwhelming set of choices.

One can build recommendation engines using different techniques or ensemble of techniques. Some popular techniques are:

### CONTENT BASED FILTERING

Content-based recommendation engine works with existing profiles of users . A profile has information about a user and their taste. Taste is based on user rating for different items. Generally, whenever a user creates his profile, Recommendation engine does a user survey to get initial information about the user in order to avoid new user problem. In the recommendation process, the engine compares the items that are already positively rated by the user with the items he didn’t rate and looks for similarities. Items similar to the positively rated ones will be recommended to the user. Here, based on user’s taste and behavior a content-based model can be built by recommending articles relevant to user’s taste. This model is efficient and personalized yet it lacks something. Let us understand this with an example. Assume there are four categories of news A) Politics B) Sports C) Entertainment D) Technology and there is a user A  who has read  articles related to Technology and Politics. Content-based recommendation engine will only recommend articles related to these categories and may never recommend anything in other categories as user never viewed those articles before. This problem can be solved using another variant of recommendation algorithm known as Collaborative Filtering.

Example of Content-Based Recommendation

### COLLABORATIVE FILTERING

The idea of collaborative filtering is finding users in a community that share appreciations. If two users have same or almost same rated items in common, then they have similar taste. Such users build a group or a so called neighborhood. User gets recommendations for those items that user hasn’t rated before but were positively rated by users in his/her neighborhood.

Example of collaborative recommendation

Collaborative filtering has basically two approaches:

1. User Based Approach
In this approach, Items that are recommended to a user are based on evaluation of items by users of  same neighborhood, with whom he/she shares common preferences. If the article was positively rated by the community, it will be recommended to the user. In the user-based approach  articles which  already rated by user  play an important role in searching for a group that shares appreciations with him/her.
2. Item Based Approach
Referring to the fact that the taste of users remains constant or change very slightly, similar articles build neighborhoods based on appreciations of users. Afterwards the system generates recommendations with articles in the neighborhood that a user might prefer.

Example of User-based CF & Item-based CF

Let’s try to understand above picture. Let’s say there are three users, A,B & C. In user-based CF, user A and C are similar because both of them like Strawberry and Watermelon. Now user A likes Grapes and Orange too. So user-based CF will recommend Grapes and Orange to user C.

In item-based CF, Grapes and Watermelon will form the similar items neighborhood which means irrespective of users, different items which are similar will form a neighborhood. So when user C likes Watermelon, the other item from the same neighborhood  i.e Grapes will be recommended by item-based CF.

### HYBRID APPROACH

For better results, we can combine collaborative and content-based recommendation algorithms. Netflix is a good example of a hybrid recommendation engine. It makes recommendations by comparing the browsing and search habits of similar users (i.e. collaborative filtering) as well as by offering movies that have similar characteristics to a movie which a user has rated highly (content-based filtering).Using hybrid approaches we can avoid some limitations of pure recommender systems, like the cold-start problem. Hybrid approach can be implemented in different ways:

a. Separate implementation of algorithms and joining the results.
b. Utilize some rules of content-based filtering in collaborative approach.
c. Utilize some rules of collaborative filtering in content based approach.
d. Create a unified recommender system that brings together both approaches.

One such hybrid approach is Context-aware Approach. Context is the information about the environment of a user and the details of situation user is in. These details can play more significant role in recommendations than ratings or popularity of articles. Some recommendations can be more suited to a user in evening and may not  match user preference in  morning at all. User may  like to do one thing when it’s cold and completely different when it’s hot outside. Recommender engines that pay attention and utilize such information in generating  recommendations are called context-aware recommender systems.

It’s difficult to generate recommendations for new users when their profile is almost empty and taste and preferences are unknown. This is called the cold start problem. We can do following things to overcome this challenge:

1. Recommend subset of most popular articles from various categories to the user.
2. A better approach would be a hybrid one like context-aware approach, we can initially collect some data about the user’s environment, situation etc. (maybe by using cookies data). And then recommend the articles after having some information of the user.

### CONCLUSION

We have thrown light on some popular techniques in building recommendation engines. There are some well- known challenges in building these systems. i.e. users can exploit recommendation system  to favor one product over another- based on positive feedback on a product and negative feedback on competitive products. A good recommender system must address these issues.
Recommender Engine  use different algorithms like Pearson correlation , Adaptive Resonance Theory (ART) family, Fuzzy C-means, and Expectation-Maximization (probabilistic clustering) etc.

I hope you like this post In next post I will cover recommendation engines in depth.

## Boosting Performance of Machine Learning Models

People often get stuck when they are asked to improve the performance of predictive models. What usually they do is try different algorithms and check their results. But often they end up not improving the model. Today I will walk you through what we can do to improve our models.

You can build a predictive model in many ways. There is no ‘must-follow’ rule. But, if you follow these ways (shared below), you’d surely achieve high accuracy in your models (given that the data provided is sufficient to make predictions).

1. Add more data: More data is always useful. It helps us to capture all the variance that the data has.
I understand, we don’t get an option to add more data. For example, we do not get a choice to increase the size of training data in data science competitions. But while working on a company project, I suggest you to ask for more data, if possible. This will reduce your pain of working on limited data sets.

1. More Features: Adding new features decreases bias on the expense of variance of the model. New features might help algorithms to capture the effect of that feature. i.e. While predicting daily withdrawal from ATMs, People may follow different pattern in the start of month by drawing higher amounts from ATMs. So it’s better to create a new feature that is responsive to the start of the month.
2. Feature selection– This is also one of the most important aspects of predictive models. If we keep all the features in the data it might overfit the model and it will behave poorly on the unseen data. So it’s always advisable to choose important features in the model and built the model again only with important and significant features.

1. Missing value and Outlier Treatment: Outliers can deflect your model so badly that sometimes it becomes essential to treat these outliers. There might be some data which is wrong or illogical. i.e. Once I was working on airline industry data, in the data there were some passengers whose age is 100+ and some of them were 2000 years. So it is illogical to use this data. This is harder to explain but it is likely that some users intentionally entered their age incorrectly for privacy reasons. Another reason might be that they might have placed their birth year in the age column. Either way, these values would appear to be errors that will need to be addressed. In the same way, missing value issue should also be addressed.

1. Ensemble Models: Ensemble models can produce better results most of the times. Bagging (Bootstrap Aggregating) and Boosting are some of the ways which can be used. These methods are generally more complex and black box type approaches.

We can also ensemble several weak models and produce better results by taking the simple average or weighted average of all those models. The idea behind is that one model might be only capturing variance of the data and another model might be better at capturing the only trend. In these types of cases, ensemble method works great.

1. Using the suitable Machine learning algorithm: Choosing the right algorithm is a crucial step in building a better model. Once I was working with holtzwinter model for prediction but It performed badly for real-time forecasting so I had to move on neural network models. Some algorithms are just better suited to some data sets than others. Identifying the right type of models could be really tricky, though!
2. Auto- feature generation: There is a lot of buzz around the term “deep learning”. The quality of features is critical to the accuracy of the resulting machine learned algorithm; no machine learning method will work well with poorly chosen features. However, due to the size and complexity of programs, theoretically there are an infinite number of potential features to choose from. If you are doing image classification or hand writing classification then deep learning is for you. Deep learning does not require you to provide the best possible features, it learns by its own. Image processing tasks have seen amazing results using deep learning.

1. Miscellaneous: It is always better to explore the data efficiently. The data distribution might be suggesting for transformation. The data might be following the gaussian function or some other family of function, in that case, we can apply algorithm with a little transformation to have better predictions. Once we get the right data distribution, the algorithm can work efficiently. Another thing we can do is fine tuning of parameters of algorithms.

## Difference between Usual Machine Learning and Deep Learning Explained!

As wikipedia defines: “Machine learning is a subfield of computer science[1] that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.[1] Machine learning explores the study and construction of algorithms that can learn from and make predictions on data.[2] Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions,[3]:2 rather than following strictly static program instructions.”
i.e. you collect a bunch of face images and non-face images, choose an algorithm, and wait for the computations to finish.  This is the spirit of machine learning.  “Machine Learning” emphasizes that the computer program (or machine) must do some work after it is given data.

Loosely speaking, most often, ML algorithms work on precise set of features extracted from your raw data. Features could be very simple, such as pixel values for images, temporal values for a signal, or complex features such as Bag of Words feature representation for text. Most known ML algorithms only work as good as the features represent the data. Correct feature identification is the close representative of your all states of your data is a crucial step.

Importance of feature extractor:
Making correct feature extractor is great deal of science in itself. Most of these features extractors (from data) are very specific in function and utility. For ex: for face detection one needs a feature extractor which correctly represents parts of face, resistant to spatial aberrations etc. Each and every type of data and task may have its own class of feature extraction. ( Ex: Speech Recognition, Image Recognition)

These feature extractors can then be used to extract correct data features for a given sample, and pass this information to a classifier/ predictor.

How is Deep Learning different ?
Deep Learning is broader family of Machine Learning methods that tries to learn high level features from the given data. Thus, the problem it solves is reducing task of making new feature extractor for each and every type of data (speech, image etc.)

For last example, Deep Learning algo’s will try to learn features such as difference between human face , a dog and room structure  etc. features when image recognition task is presented to them. They may use this info for classification, prediction etc tasks. Thus, this is a major step away from previous “Shallow Learning Algorithms.”
The main difference is that regular machine learning involves a lot of handcrafted feature extraction while deep learning does all the feature extraction by itself.
So, Deep learning is essentially a set of techniques that help you to parameterize deep neural network structures, neural networks with many, many layers and parameters.

It’s a growing trend in ML due to some favorable results in applications where the target function is very complex and the datasets are large. For example in Hinton et al. (2012),  Hinton and his students managed to beat the status quo prediction systems on five well known datasets: Reuters, TIMIT, MNIST, CIFAR and ImageNet. This covers speech, text and image classification – and these are quite mature datasets, so a win on any of these gets some attention. A win on all of them gets a lot of attention. Deep learning networks differ from “normal” neural networks and SVMs because they can be trained in an UNSUPERVISED or SUPERVISED manner for both UNSUPERVISED and SUPERVISED learning tasks.
Prof. Andrew Ng remarks that Deep Learning focuses on original aim of One Learning, an ideal Algorithm envisioned for an AI.