A Data Science Project- Part 1(b)

In the previous article (https://d4datascience.wordpress.com/2016/11/10/a-data-science-project-part-1/), we did basic data analysis: calculating means, frequency tables, summaries, and so on. Now we will derive new variables. Why?
Derived variables help us understand the data better. For example, we derived the variable ip (from the incomeperperson variable), which tells us how many observations fall into the lower or higher income class. Similarly, other variables such as le and ac were transformed into new variables.
How do we decide the values of these cutoff points?
Either the business stakeholders define them, or you explore the data yourself to divide it into meaningful buckets. A rough illustration is sketched below.
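As an illustration in R (the project itself used SAS), here is one way to derive a bucketed income variable. The data frame name gapminder and the cutoff values are assumptions for this sketch, not the cutoffs used in the original analysis.

# derive a categorical income variable from incomeperperson (illustrative cutoffs)
gapminder$ip <- cut(gapminder$incomeperperson,
                    breaks = c(0, 1000, 10000, Inf),
                    labels = c("low", "middle", "high"))
table(gapminder$ip)   # how many observations fall into each income class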

If you have any queries, let me know in the comments section.


A Data Science Project- Part 1(a)

In the earlier post, we discussed our hypothesis and the different variables impacting life expectancy. In this post, we will start digging into the data. I suggest you go through the codebook (see the earlier post); it is important to understand the variables before doing any analysis. I have chosen SAS as the analysis language, but you are free to choose any language, such as R or Python. The process and logic will be the same; just the syntax will differ. In this part, we will explore all the variables and check their frequencies and distributions. For numeric variables, you can draw histograms, boxplots, and quantiles to understand them; for categorical variables, you can draw barplots and frequency tables.
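For readers following along in R rather than SAS, a minimal exploration sketch might look like this (assuming the Gapminder extract is loaded as a data frame named gapminder):

summary(gapminder$lifeexpectancy)                  # five-number summary of a numeric variable
hist(gapminder$incomeperperson)                    # distribution of income per person
boxplot(gapminder$alcconsumption)                  # spot extreme values
quantile(gapminder$co2emissions, na.rm = TRUE)     # quartiles of CO2 emissions
table(cut(gapminder$lifeexpectancy, breaks = 5))   # frequency table from binned life expectancy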

A Data Science Project-Introduction: How can we have better life expectancy!

Hi, future data science buddies!

I have been asked many times how to become a data scientist and what one should do to get there. Here I have written about some of the technical skills required to be a data scientist:

https://d4datascience.wordpress.com/2016/02/13/how-to-make-career-into-data-science/

But the above article was missing one thing, and Monica Rogati has summed it up perfectly.

https://www.quora.com/How-can-I-become-a-data-scientist-1/answer/Monica-Rogati

This advice from Monica Rogati is a must-read for all beginners who want to get into data science. In one sentence, she says:

Do a project you care about. Make it good and share it.

Now the question is: how do you choose a data science project?

So I thought I would give you some ideas about projects and how to start one and finish it. I am starting a series of blog posts in which I will show you, step by step, the journey of a data science project.

I am sharing a project I undertook during a course. It is a basic project, but reading this article will help you understand the approach you will follow in any project. You will figure out why hypothesis generation is necessary, why people say every good data analysis starts with a question, and what types of questions data science can answer. You can choose any project topic as long as you have relevant data available.

My research project topic was: how is life expectancy related to social, economic, and environmental factors?

I downloaded the dataset from Gapminder: http://www.gapminder.org/data/. Details of the data are given in the codebook.

This is the Gapminder data's codebook link:
https://www.dropbox.com/s/lfzmkvnwb5r84ff/GapMinder%20Codebook%20.pdf?dl=0

After looking through the codebook for the Gapminder study, I decided that I am particularly interested in life expectancy, so I chose the variable lifeexpectancy from the codebook.
I want to study how life expectancy behaves with respect to social, economic, and environmental factors, so I picked out the variables related to these factors:

  • alcconsumption (alcohol consumption)
  • co2emissions (cumulative CO2 emissions)
  • oilperperson (oil consumption per person)
  • incomeperperson

I performed a literature review to see what work has been done on this topic so that I could bring out something new. I have taken references from the research below:

  • Life expectancy of people with intellectual disability: a 35-year follow-up study – K. Patja, M. Iivanainen, H. Vesala, H. Oksanen & I. Ruoppila
  • Income distribution and life expectancy – R. G. Wilkinson
  • Changes in life expectancy in Russia in the mid-1990s – Vladimir Shkolnikov, Prof. Martin McKee, Prof. David A. Leon
  • Drinking Pattern and Mortality: The Italian Risk Factor and Life Expectancy Pooling Project
After going through this literature, I generated the hypothesis that people with better income, lower alcohol consumption, and better environmental conditions have better life expectancy. So, according to my hypothesis, these factors are positively correlated with life expectancy.

We would like to find out the association between life expectancy and social, economic, and environmental factors, and to address several questions:

  1. Do alcohol consumption and oil consumption determine life expectancy?
  2. Does better income mean better life expectancy?
  3. Is there a causal relationship?
  4. Does higher CO2 emission mean lower life expectancy?

The variables we will consider for the hypothesis are:

  1. incomeperperson (Gross Domestic Product per capita in constant 2000 US$)
  2. alcconsumption (alcohol consumption per adult (age 15+): recorded and estimated average per capita consumption, in litres of pure alcohol)
  3. co2emissions (cumulative CO2 emissions (metric tons): total CO2 emitted since 1751)
  4. lifeexpectancy (life expectancy at birth (years): the average number of years a newborn child would live if current mortality patterns were to stay the same)
  5. oilperperson (oil consumption per capita (tonnes per year per person))

This will help us find answers to important questions such as: How can a country achieve better life expectancy without compromising on development projects that cause a lot of CO2 emissions? Where should a country focus its agenda to improve life expectancy? So let's dive into the data…

 

 

How to improve performance of Neural Networks

 

Neural networks have been one of the most promising fields of research for quite some time, and recently they have picked up even more pace. In the early days of neural networks, we could only implement a single hidden layer, and we still saw good results.
Deep learning methods are becoming exponentially more important due to their demonstrated success at tackling complex learning problems. At the same time, increasing access to high-performance computing resources and state-of-the-art open-source libraries are making it more and more feasible for enterprises, small firms, and individuals to use these methods.

Neural network models have become the center of attraction in solving machine learning problems.

Now, what's the use of knowing something if we can't apply that knowledge intelligently? Various problems come up when we implement neural networks, and if we don't know how to deal with them, the so-called "neural network" becomes useless.

Some Issues with Neural Network:

  1. Sometimes neural networks fail to converge due to low dimensionality.
  2. Even a small change in weights can lead to a significant change in output; sometimes the results get worse.
  3. The gradient may become zero, in which case weight optimization fails.
  4. The model overfits the data.
  5. Time complexity is too high; sometimes the algorithm runs for days even on a small dataset.
  6. We get the same output for every input when we predict.

 

So, what next?

 

One day I sat down (I am not kidding!) with neural networks to see what I could do to make them perform better. I tried and tested various use cases to discover solutions.
Let's dig deeper now and check out proven ways to improve the performance (both speed and accuracy) of neural network models:

1. Increase hidden Layers

We have always wondered what happens if we add more hidden layers. In theory, it has been established that many functions can be represented at a higher level of abstraction with more layers, so it seems that more layers should mean better results.

Multiple hidden layers can be created using the mlp function in the RSNNS package and the neuralnet function in the neuralnet package. As far as I know, these are the only neural network functions in R that can create multiple hidden layers (I am not talking about deep learning here); all the others use a single hidden layer. Let's start by exploring the neuralnet package.

I won't go into the details of the algorithms; you can google their training processes yourself. I have used a dataset in which I want to predict a response/target variable. Below is sample code for four hidden layers.

R code

     A. Neuralnet Package

library(neuralnet)

set.seed(1000000)

multi_net <- neuralnet(action_click ~ FAL_DAYS_last_visit_index + NoofSMS_30days_index +
                         offer_index + Days_last_SMS_index + camp_catL3_index + Index_weekday,
                       algorithm = "rprop+", data = train, hidden = c(6, 9, 10, 11),
                       stepmax = 1e9, err.fct = "ce", linear.output = FALSE)

 

I tried several iterations. Below are the confusion matrices for some of the results.

[Figure: confusion matrices from the neuralnet runs]

    B. RSNNS Package

library(RSNNS)

set.seed(10000)

a <- mlp(train[, 2:7], train$action_click, size = c(5, 6), maxit = 5000,
         initFunc = "Randomize_Weights", initFuncParams = c(-0.3, 0.3),
         learnFunc = "Std_Backpropagation", learnFuncParams = c(0.2, 0),
         hiddenActFunc = "Act_Logistic", shufflePatterns = TRUE, linOut = FALSE)

 

Again, I tried several iterations. Below are the confusion matrices for some of the results.

[Figure: confusion matrices from the RSNNS runs]

From my experiments, I have concluded that increasing the number of layers may result in better accuracy, but it is not a rule of thumb; you have to test with different numbers of layers. I tried several datasets with several iterations, and the neuralnet package seems to perform better than RSNNS. Always start with a single layer, then gradually add more if you don't see a performance improvement.

Figure 2: A multi-layered neural network

 

2. Change Activation function

Changing the activation function can be a game changer for you. I have tested results with sigmoid, tanh, and rectified linear units (ReLU). The simplest and most successful activation function is the rectified linear unit; mostly we use the sigmoid function in our networks. Compared to sigmoid, the gradient of ReLU does not approach zero when x is very large, and ReLU also converges faster than other activation functions. You should also know how to use these activation functions: when you use the tanh activation function, you should encode your binary classes as -1 and 1; classes encoded as 0 and 1 won't work well with tanh.
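For example, with the neuralnet model used earlier, the hidden-layer activation can be switched via act.fct. The sketch below reuses the hypothetical train data and two of its predictors, and recodes the binary target to -1/1 for tanh as discussed above.

library(neuralnet)
# recode the 0/1 target to -1/1 so it matches the output range of tanh
train$action_click_tanh <- ifelse(train$action_click == 1, 1, -1)
net_tanh <- neuralnet(action_click_tanh ~ offer_index + Index_weekday,
                      data = train, hidden = c(6, 9),
                      act.fct = "tanh", linear.output = FALSE)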

 

3. Change Activation function in Output layer

I have experimented with using a different activation function in the output layer than in the hidden layers. In some cases the results were better, so it is worth trying a different activation function for the output neuron.

As with the single-layered ANN, the choice of activation function for the output layer will depend on the task that we would like the network to perform (i.e. categorization or regression). However, in multi-layered NN, it is generally desirable for the hidden units to have nonlinear activation functions (e.g. logistic sigmoid or tanh). This is because multiple layers of linear computations can be equally formulated as a single layer of linear computations. Thus using linear activations for the hidden layers doesn’t buy us much. However, using linear activations for the output unit activation function (in conjunction with nonlinear activations for the hidden units) allows the network to perform nonlinear regression.
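In neuralnet this choice is controlled by linear.output: when it is FALSE the chosen act.fct is also applied to the output neuron, and when it is TRUE the output unit stays linear (useful for regression). A sketch, again assuming the hypothetical train data; some_numeric_target is a made-up regression target.

library(neuralnet)
# nonlinear (logistic) output unit for classification
net_class <- neuralnet(action_click ~ offer_index + Index_weekday, data = train,
                       hidden = 5, act.fct = "logistic", linear.output = FALSE)
# nonlinear hidden units with a linear output unit for regression
net_regr <- neuralnet(some_numeric_target ~ offer_index + Index_weekday, data = train,
                      hidden = 5, act.fct = "logistic", linear.output = TRUE)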

4. Increase number of neurons

If an inadequate number of neurons are used, the network will be unable to model complex data, and the resulting fit will be poor. If too many neurons are used, the training time may become excessively long, and, worse, the network may overfit the data. When overfitting occurs, the network will begin to model random noise in the data. The result is that the model fits the training data extremely well, but it generalizes poorly to new, unseen data. Validation must be used to test for this.

There is no strict rule for choosing the number of neurons, but you can consider these guidelines, where N is the number of hidden neurons (a quick check is sketched below):

  • N = 2/3 the size of the input layer, plus the size of the output layer.
  • N < twice the size of the input layer.
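A quick check of these guidelines for a network like the earlier examples (6 inputs, 1 output):

n_in <- 6; n_out <- 1
ceiling((2/3) * n_in + n_out)   # about 5 hidden neurons as a starting point
2 * n_in                        # keep N below roughly 12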

 

5. Weight initialization

When training neural networks, the weights are initially assigned randomly. Although the weights are updated during training, the network can sometimes converge to a local minimum. With multi-layered architectures, purely random weights often do not perform well, so we can supply better initial weights. Try different random seeds to generate different initial weights, then choose the seed that works well for your problem.
You can also use methods like adaptive weight initialization or Xavier weight initialization.
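A minimal sketch of the seed-selection idea, reusing the hypothetical train data and two predictors from the earlier examples; the final training error is read from neuralnet's result.matrix.

library(neuralnet)
best <- NULL
for (s in c(1, 42, 1000, 99999)) {
  set.seed(s)                                  # different seed -> different initial weights
  nn <- neuralnet(action_click ~ offer_index + Index_weekday,
                  data = train, hidden = 5, linear.output = FALSE)
  err <- nn$result.matrix["error", 1]          # training error for this run
  if (is.null(best) || err < best$err) best <- list(seed = s, err = err)
}
best   # seed whose random initialization gave the lowest training error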

Random initial synaptic weights generally lead to a large error, so learning means finding proper values for the synaptic weights that minimize the output error. The figure below shows how the search for optimal weights can get trapped in a local minimum.


Figure 3: Local minima problem due to random initialization of weights

6. More data

When we have lots of data, the neural network generalizes well; otherwise, it may overfit. So it is better to have more data. Overfitting is a general problem when using neural networks. The amount of data needed to train a neural network is very much problem-dependent. The quality of the training data (i.e., how well the available training data represents the problem space) is as important as the quantity (i.e., the number of records, or examples of input-output pairs). The key is to use training data that generally spans the problem data space. For relatively small datasets (fewer than 20 input variables, 100 to several thousand records), a minimum of 10 to 40 records (examples) per input variable is recommended for training. For relatively large datasets (more than 20,000 records), the dataset should be sub-sampled to obtain a smaller dataset that contains 30-50 records per input variable. In either case, any "extra" records should be used for validating the neural networks produced.

7. Normalizing/Scaling data

Most of the time, scaling/normalizing your input data leads to improvement. There are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chance of getting stuck in local optima. Weight decay and Bayesian estimation can also be done more conveniently with standardized inputs. When a neural network uses gradient descent to optimize parameters, standardizing the covariates may speed up convergence (because with unscaled covariates, the corresponding parameters can inappropriately dominate the gradient).
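A minimal sketch, standardizing the same input columns used in the RSNNS example above and applying the training-set means and standard deviations to a hypothetical test set:

x_train <- scale(train[, 2:7])                             # center and scale each input column
x_test  <- scale(test[, 2:7],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))   # reuse the training statistics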

8. Change learning algorithm parameters

Try different learning rates (0.01 to 0.9). Also try different momentum parameters if your algorithm supports them (0.1 to 0.9). Changing the learning rate can help us identify whether we are getting stuck in a local minimum.
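For instance, with the RSNNS model used earlier, the first element of learnFuncParams is the learning rate for Std_Backpropagation, so a small sweep might look like this (train is the same hypothetical data as before):

library(RSNNS)
for (lr in c(0.01, 0.1, 0.5)) {
  set.seed(10000)
  fit <- mlp(train[, 2:7], train$action_click, size = c(5, 6), maxit = 2000,
             learnFunc = "Std_Backpropagation", learnFuncParams = c(lr, 0))
  cat("learning rate", lr, "-> training MSE",
      mean((train$action_click - fit$fitted.values)^2), "\n")
}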

The two plots below nicely emphasize the importance of choosing the learning rate by illustrating the two most common problems with gradient descent:

(i) If the learning rate is too large, gradient descent will overshoot the minima and diverge.

(ii) If the learning rate is too small, the algorithm will require too many epochs to converge and can become trapped in local minima more easily.


Figure 4 : Effect of learning rate parameter values

 

9. Deep learning for auto feature generation

Machine learning is one of the fastest-growing and most exciting fields out there, and deep learning represents its true bleeding edge. Ordinary neural networks are not efficient at creating features. Like other machine learning models, a neural network's performance depends on the quality of its features; better features give better accuracy. When we use a deep architecture, features are created automatically and every layer refines them, as illustrated below.

[Figure: automatic feature generation with deep learning]

 

10. Miscellaneous: You can try different numbers of epochs and different random seeds. Various parameters like the dropout ratio, regularization weight penalties, and early stopping can also be tuned while training neural network models.

To improve generalization on small, noisy data, you can train multiple neural networks and average their outputs, or take a weighted average. There are various types of neural network models, and you should choose according to your problem; for example, for stock prediction you should first try recurrent neural network models.


Figure 5 : After dropout, insignificant neurons do not participate in training

 

References:

1.  http://stats.stackexchange.com/
2. http://stackoverflow.com/
3. https://www.quora.com/
4. http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html
5. http://www.nexyad.net/html/upgrades%20site%20nexyad/e-book-Tutorial-Neural-Networks.html

Feature Learning, Deep Learning and Machine Learning

Machine learning is a very successful technology, but applying it today often requires spending substantial effort hand-designing features. This is true for applications in vision, audio, and text.

Any machine learning algorithm performs only as well as the features it is given. Let's understand this with an image classification example, where we try to classify an image as "motorcycle" or "not motorcycle".

The algorithm needs features from which it can draw information.

Researchers have spent decades hand-designing these features. The image below shows a sample feature-creation process.


fig 1 – Feature vector creation

 

In the case of images, audio, and text, coming up with features is difficult and time-consuming, and it requires expert knowledge. When we work on learning applications, we spend a lot of time tuning these features.

So this is where usual machine learning fails us. What if we could automate this feature-learning task instead of hand-engineering the features?

Self-taught learning/ Unsupervised Feature Learning

In particular, the promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn from ”unlabeled” data, then we can easily obtain and learn from massive amounts of it. Even though a single unlabeled example is less informative than a single labeled example, if we can get tons of the former—for example, by downloading random unlabeled images/audio clips/text documents off the internet—and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.

In self-taught learning and unsupervised feature learning, we give our algorithms a large amount of unlabeled data with which to learn a good feature representation of the input.

Deep Learning

"Deep learning" algorithms can automatically learn feature representations (often from unlabeled data), thus avoiding a lot of time-consuming engineering. These algorithms are based on building massive artificial neural networks that were loosely inspired by cortical (brain) computations. The image below compares deep learning's feature discovery process with that of other approaches.


fig 2 –  Deep Learning Feature creation

 

To simulate the brain's visual processing, sparse coding was developed to explain early visual processing in the brain (edge detection).

Input: images x(1), x(2), …, x(m) (each in R^(n×n))

Learn: a dictionary of bases φ1, φ2, …, φk (also in R^(n×n)), so that each input x can be approximately decomposed as

x ≈ Σj aj φj,   such that the aj are mostly zero ("sparse")

The sparse coding algorithm automatically learns to represent an image in terms of the edges that appear in it. This gives a more succinct, higher-level representation than the raw pixels.

Let's understand this using the example below.

 

fig 3 – Face detection using deep learning

 

In the first layer, the deep learning algorithm uses sparse coding to express images in a succinct, higher-level representation. The rectangles shown in each layer look the same size, but higher-level features look at larger regions of the image. In the second layer, some neurons detect an eye-like shape, and similarly some neurons detect an ear. In the highest layer, neurons learn to detect whole faces.

Theoretical results suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need  deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas.

Conclusion:

Usual machine learning is simply curve fitting; it is capable of producing great results, but we spend a lot of time on feature discovery. Deep learning, on the other hand, is closer to AI: it learns features automatically instead of requiring us to create them manually. We might miss some important features when creating them by hand, but deep learning tries to learn higher-level features by itself.

 

We can use Theano (Python) to implement deep learning.

Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs.

To learn more about Theano, follow this link – http://deeplearning.net/software/theano/introduction.html

References :

fig 1- http://www.cs.stanford.edu/people/ang//slides/DeepLearning-Mar2013.pptx

fig 2 – http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/

fig 3 – http://www.cs.stanford.edu/people/ang//slides/DeepLearning-Mar2013.pptx

 

 

A brief introduction to Outliers and Outlier Removal methods

What is an Outlier?

Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.

Let's take an example: we profile our customers and find that their average annual income is $0.8 million. But there are two customers with annual incomes of $4 million and $4.2 million. These two customers' annual incomes are much higher than the rest of the population's, so these two observations will be treated as outliers.

What are the types of Outliers?

Outliers can be of two types: univariate and multivariate. Above, we discussed an example of a univariate outlier; these outliers can be found by looking at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space; to find them, you have to look at distributions in multiple dimensions.

Let us understand this with an example: say we are studying the relationship between height and weight, and we have the univariate and bivariate distributions of height and weight. Looking at the box plots alone, we find no outliers (using the most common method, values beyond 1.5×IQR). But looking at the scatter plot, we see two values below and one above the average in a specific segment of weight and height.

What causes Outliers?

Whenever we come across outliers, the ideal way to tackle them is to find out why they are there; the method for dealing with them then depends on the reason for their occurrence. Causes of outliers can be classified into two broad categories:

  1. Artificial (Error) / Non-natural
  2. Natural.

Let’s understand various types of outliers in more detail:

  • Data Entry Errors:- Human errors such as errors caused during data collection, recording, or entry can cause outliers in data. For example: Annual income of a customer is $100,000. Accidentally, the data entry operator puts an additional zero in the figure. Now the income becomes $1,000,000 which is 10 times higher. Evidently, this will be the outlier value when compared with rest of the population.
  • Measurement Error: It is the most common source of outliers. This is caused when the measurement instrument used turns out to be faulty. For example: There are 10 weighing machines. 9 of them are correct, 1 is faulty. Weight measured by people on the faulty machine will be higher / lower than the rest of people in the group. The weights measured on faulty machine can lead to outliers.
  • Experimental Error: Another cause of outliers is experimental error. For example: In a 100m sprint of 7 runners, one runner missed out on concentrating on the ‘Go’ call which caused him to start late. Hence, this caused the runner’s run time to be more than other runners. His total run time can be an outlier.
  • Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example, teens typically underreport the amount of alcohol they consume, and only a fraction of them report the actual value. Here the actual values might look like outliers because the rest of the teens are underreporting their consumption.
  • Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors may lead to outliers in the dataset.
  • Sampling error: For instance, we have to measure the height of athletes. By mistake, we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
  • Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance, in my last assignment with a renowned insurance company, I noticed that the performance of the top 50 financial advisors was far higher than the rest of the population's. Surprisingly, it was not due to any error. Hence, whenever we performed any data mining activity involving advisors, we treated this segment separately.

What is the Impact of Outliers on a dataset?

Outliers can drastically change the results of the data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in the data set:

  • It increases the error variance and reduces the power of statistical tests
  • If the outliers are non-randomly distributed, they can decrease normality
  • They can bias or influence estimates that may be of substantive interest
  • They can violate the basic assumptions of regression, ANOVA, and other statistical models.

To understand the impact more deeply, let's look at what happens to a dataset with and without outliers.

As you can see, a dataset with outliers has a significantly different mean and standard deviation. In the first scenario (without the outlier), the average is 5.45; with the outlier, the average soars to 30. This would change the estimate completely.
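A toy illustration in R; the numbers are chosen to reproduce the means quoted above and are not the original table:

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5)
mean(x)              # about 5.45
sd(x)
mean(c(x, 300))      # one extreme value drags the mean up to 30
sd(c(x, 300))        # the standard deviation explodes as well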

How to detect Outliers?

The most commonly used method to detect outliers is visualization. We use various visualization methods, like box plots, histograms, and scatter plots (above, we used a box plot and a scatter plot). Some analysts also use various rules of thumb to detect outliers. Some of them are:

  • Any value beyond the range Q1 − 1.5×IQR to Q3 + 1.5×IQR
  • Use capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier
  • Data points three or more standard deviations away from the mean are considered outliers
  • Outlier detection is merely a special case of examining the data for influential points, and it also depends on business understanding
  • Bivariate and multivariate outliers are typically measured using an index of influence, leverage, or distance. Popular indices such as Mahalanobis distance and Cook's D are frequently used to detect outliers.
  • In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and influential observations, we also look at statistical measures like STUDENT, COOKD, RSTUDENT, and others.

How to remove Outliers?

Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:

Deleting observations: We delete outlier values if they are due to data entry or data processing errors, or if the outlier observations are very small in number. We can also trim at both ends to remove outliers.

Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation; decision tree algorithms deal with outliers well because they bin variables. We can also assign weights to different observations.

Imputing: As with missing values, we can also impute outliers, using mean, median, or mode imputation. Before imputing, we should analyse whether the outlier is natural or artificial; if it is artificial, imputation is appropriate. We can also use a statistical model to predict the value of the outlier observation and impute it with the predicted value.

Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the rest of the data as two different groups, build an individual model for each group, and then combine the outputs.

Outlier treatment methods:

1. Box plot: Points beyond the whiskers (roughly 1.5×IQR past the quartiles) are flagged as outliers; see the sketch below.
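A minimal sketch on toy data (the vector x is made up for illustration):

set.seed(1)
x <- c(rnorm(100), 8)      # toy data with one injected extreme value
boxplot(x)                 # the extreme value shows up beyond the whisker
boxplot.stats(x)$out       # values flagged as outliers by the 1.5 * IQR rule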

2. Histogram:

A histogram consists of parallel vertical bars that graphically show the frequency distribution of a quantitative variable. The area of each bar is equal to the frequency of items found in each class.

Example:

In the data set faithful, the histogram of the eruptions variable is a collection of parallel vertical bars showing the number of eruptions classified according to their durations.

Problem

Find the histogram of the eruption durations in faithful.

Solution

We apply the hist function to produce the histogram of the eruptions variable.

> duration = faithful$eruptions 
> hist(duration,    # apply the hist function 
+   right=FALSE)    # intervals closed on the left

Outlier detection methods in R

1. Using the "outliers" package

library(outliers)
# `target_column` stands for whichever numeric column you are checking
outlier_tf = outlier(data_full$target_column, logical = TRUE)
# This gives an array with all values FALSE, except for the outlier (as defined in the package
# documentation: "Finds value with largest difference between it and sample mean, which can be
# an outlier"). That value is returned as TRUE.
find_outlier = which(outlier_tf == TRUE, arr.ind = TRUE)
# This finds the location of the outlier by finding the TRUE value within the outlier_tf array.
data_new = data_full[-find_outlier, ]
# This creates a new dataset based on the old data, removing the one row that contains the outlier.

2. using DMwR Package       

Quartile Method

The quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data

The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set

The third quartile (Q3) is the middle value between the median and the highest value of the data set.

The interquartile range (IQR) is the difference between the third and first quartiles.

A point n is flagged as an outlier if

n > Q3 + 1.5*IQR

or

n < Q1 - 1.5*IQR

The same can be done in R using box-and-whisker plots; a direct computation is sketched below.
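A direct computation of the quartile rule on a generic numeric vector x:

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1                                # interquartile range
x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]    # values flagged as outliers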

3. LOF (Local Outlier Factor)

Outliers are found based on their local neighbourhoods, more specifically on their local densities.

To calculate the local outlier factor scores (lofactor comes from the DMwR package):

score = lofactor(data, k)

where k is the number of neighbours.

To plot the density of the scores:

plot(density(score))

To find the data points with the highest outlier scores (the greater the score, the greater the chance that the point is an outlier), for example the top 5:

outliers <- order(score, decreasing = TRUE)[1:5]
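Putting the pieces together, a small end-to-end sketch on the built-in iris data, assuming the DMwR package provides lofactor as used above:

library(DMwR)
num   <- iris[, 1:4]                          # numeric columns only
score <- lofactor(num, k = 5)                 # LOF score per observation
plot(density(score))
top5 <- order(score, decreasing = TRUE)[1:5]
num[top5, ]                                   # the five most outlying observations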

RECOMMENDER SYSTEMS 101

Have you ever given a thought to how e-commerce websites show you products under "Customers who bought this also bought", how Netflix recommends movies based on your interests, or how Facebook builds its "People you may know" list?

Let’s look at the below pictures:

Amazon: “When you buy a book”


Netflix: “Other Movies You Might Enjoy”


You can see in these pictures that different products are getting recommended on the basis of your behavior and content. How?

You might have heard the term “Recommendation Engine”.

Recommendation engines have changed the way websites interact with visitors. Rather than providing a static experience in which users search for and potentially buy products, recommender engines personalize the user experience by recommending products or making suggestions on the basis of past purchases, searches, and other behavioral traits.

Recommendation engines are algorithms for filtering and sorting items and information. These use opinions of the user community to help individuals in that community to discover interesting and relevant content from a potentially overwhelming set of choices.

One can build recommendation engines using different techniques or ensemble of techniques. Some popular techniques are:

CONTENT BASED FILTERING

A content-based recommendation engine works with existing profiles of users. A profile holds information about a user and their taste, where taste is based on the user's ratings for different items. Generally, whenever a user creates a profile, the recommendation engine runs a user survey to get initial information about the user and avoid the new-user problem. In the recommendation process, the engine compares the items that the user has already rated positively with the items the user hasn't rated and looks for similarities; items similar to the positively rated ones are recommended to the user. Based on the user's taste and behavior, a content-based model can thus be built by recommending articles relevant to that taste. This model is efficient and personalized, yet it lacks something. Let us understand this with an example. Assume there are four categories of news: A) politics, B) sports, C) entertainment, D) technology, and there is a user A who has read articles related to technology and politics. A content-based recommendation engine will only recommend articles from these categories and may never recommend anything in other categories, because the user never viewed those articles before. This problem can be solved using another variant of recommendation algorithm known as collaborative filtering.


Example of Content-Based Recommendation

 

COLLABORATIVE FILTERING

The idea of collaborative filtering is to find users in a community who share appreciations. If two users rate the same items the same or almost the same way, they have similar taste. Such users form a group, or so-called neighborhood. A user gets recommendations for items that he or she hasn't rated before but that were positively rated by users in his or her neighborhood.

Example of collaborative recommendation

 

Collaborative filtering has basically two approaches:

  1. User Based Approach
    In this approach, the items recommended to a user are based on how items were evaluated by users in the same neighborhood, with whom he or she shares common preferences. If an article was positively rated by the community, it will be recommended to the user. In the user-based approach, the articles already rated by the user play an important role in finding a group that shares his or her appreciations.
  2. Item Based Approach
    Relying on the fact that users' tastes remain constant or change only slightly, similar articles form neighborhoods based on users' appreciations. The system then generates recommendations from articles in the neighborhood that a user might prefer.


Example of User-based CF & Item-based CF

Let's try to understand the picture above. Say there are three users: A, B, and C. In user-based CF, users A and C are similar because both of them like strawberry and watermelon. User A also likes grapes and orange, so user-based CF will recommend grapes and orange to user C.

In item-based CF, grapes and watermelon form a neighborhood of similar items; that is, irrespective of users, items that are similar form a neighborhood. So when user C likes watermelon, the other item from the same neighborhood, i.e. grapes, will be recommended by item-based CF.
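One easy way to experiment with both flavours is the recommenderlab package in R. A minimal sketch using its bundled MovieLense ratings follows; the package, methods, and dataset are real, but this is only an illustration, not the code behind the sites mentioned above.

library(recommenderlab)
data(MovieLense)                                      # user x movie ratings matrix
train_set <- MovieLense[1:500]
rec_ubcf <- Recommender(train_set, method = "UBCF")   # user-based collaborative filtering
rec_ibcf <- Recommender(train_set, method = "IBCF")   # item-based collaborative filtering
pred <- predict(rec_ubcf, MovieLense[501:502], n = 5)
as(pred, "list")                                      # top-5 recommendations for two held-out users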

HYBRID APPROACH

For better results, we can combine collaborative and content-based recommendation algorithms. Netflix is a good example of a hybrid recommendation engine: it makes recommendations by comparing the browsing and search habits of similar users (collaborative filtering) as well as by offering movies that share characteristics with movies a user has rated highly (content-based filtering). Using hybrid approaches, we can avoid some limitations of pure recommender systems, like the cold-start problem. A hybrid approach can be implemented in different ways:

a. Separate implementation of algorithms and joining the results.
b. Utilize some rules of content-based filtering in collaborative approach.
c. Utilize some rules of collaborative filtering in content based approach.
d. Create a unified recommender system that brings together both approaches.

One such hybrid approach is the context-aware approach. Context is information about a user's environment and the details of the situation the user is in. These details can play a more significant role in recommendations than the ratings or popularity of articles. Some recommendations may suit a user in the evening and not match the user's preferences in the morning at all; a user may like to do one thing when it's cold and something completely different when it's hot outside. Recommender engines that pay attention to and utilize such information when generating recommendations are called context-aware recommender systems.

Addressing Cold-Start Problem

It's difficult to generate recommendations for new users when their profile is almost empty and their taste and preferences are unknown. This is called the cold-start problem. We can do the following things to overcome this challenge:

  1. Recommend a subset of the most popular articles from various categories to the user.
  2. A better approach would be a hybrid one, like the context-aware approach: we can initially collect some data about the user's environment and situation (perhaps from cookie data), and then recommend articles once we have some information about the user.

CONCLUSION

We have shed light on some popular techniques for building recommendation engines. There are some well-known challenges in building these systems; for example, users can exploit a recommendation system to favor one product over another by giving positive feedback on that product and negative feedback on competing products. A good recommender system must address these issues.
Recommender engines use different algorithms such as Pearson correlation, the Adaptive Resonance Theory (ART) family, fuzzy c-means, and Expectation-Maximization (probabilistic clustering).

I hope you like this post. In the next post, I will cover recommendation engines in depth.

Introduction To Lasso Regression

Lasso regression analysis is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes the regression coefficients for some variables to shrink toward zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model, while variables with non-zero regression coefficients are most strongly associated with the response variable. Explanatory variables can be quantitative, categorical, or both.

I have used the Gapminder dataset. All predictor variables (incomeperperson, alcconsumption, co2emissions, oilperperson, suicideper100th, employrate) were quantitative, and the response variable (lifeexpectancy) was also quantitative. All predictor variables were standardized to have a mean of zero and a standard deviation of one.

Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

1.SAS Code

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new; set mydata.gapminder;

keep COUNTRY incomeperperson alcconsumption lifeexpectancy co2emissions oilperperson suicideper100th employrate;

proc sort; by COUNTRY; /* sort the data by country */

* delete observations with missing data;
*if cmiss(of _all_) then delete;
*run;

ods graphics on;

* split data randomly into test and training data;
proc surveyselect data=new out=traintest seed=123
samprate=0.7 method=srs outall;
run;

* lasso multiple regression with the LAR algorithm and k=10 fold cross validation;
proc glmselect data=traintest plots=all seed=123;
partition ROLE=selected(train='1' test='0');
model lifeexpectancy = incomeperperson alcconsumption co2emissions oilperperson employrate suicideper100th / selection=lar(choose=cv stop=none) cvmethod=random(10);
run;
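For readers working in R instead of SAS, a rough equivalent of this lasso fit can be sketched with the glmnet package. It assumes the Gapminder extract is loaded as a data frame named gapminder, and it approximates, rather than exactly reproduces, the GLMSELECT/LAR procedure above.

library(glmnet)
vars <- c("incomeperperson", "alcconsumption", "co2emissions",
          "oilperperson", "suicideper100th", "employrate")
gap  <- na.omit(gapminder[, c(vars, "lifeexpectancy")])
x <- scale(as.matrix(gap[, vars]))              # standardize predictors to mean 0, sd 1
y <- gap$lifeexpectancy
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # alpha = 1 is the lasso, with 10-fold CV
coef(cv, s = "lambda.min")                      # coefficients shrunk to zero drop out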

2.Output

Read more on  http://yesthedataguy.tumblr.com/post/141154990096/introduction-to-lasso-regression

Boosting Performance of Machine Learning Models

People often get stuck when they are asked to improve the performance of predictive models. What they usually do is try different algorithms and check the results, but often they end up not improving the model. Today I will walk you through what we can do to improve our models.

You can build a predictive model in many ways. There is no ‘must-follow’ rule. But, if you follow these ways (shared below), you’d surely achieve high accuracy in your models (given that the data provided is sufficient to make predictions).

  1. Add more data: More data is always useful; it helps us capture the variance in the data.
    I understand that we don't always get the option to add more data; for example, we cannot increase the size of the training data in data science competitions. But while working on a company project, I suggest you ask for more data if possible. This will reduce the pain of working with limited datasets.

 

  2. More features: Adding new features decreases bias at the expense of the variance of the model. New features can help the algorithm capture effects the existing features miss. For example, while predicting daily withdrawals from ATMs, people may follow a different pattern at the start of the month by drawing higher amounts, so it is better to create a new feature that responds to the start of the month.
  3. Feature selection: This is also one of the most important aspects of predictive modeling. If we keep all the features, the model may overfit and behave poorly on unseen data. So it is always advisable to choose the important features and rebuild the model with only the important and significant ones.


  4. Missing value and outlier treatment: Outliers can deflect your model so badly that it sometimes becomes essential to treat them. There may be data that is wrong or illogical. For example, I was once working on airline industry data in which some passengers' ages were 100+ and some were 2,000 years, which is clearly illogical to use. It is likely that some users intentionally entered their age incorrectly for privacy reasons, or that they entered their birth year in the age column. Either way, these values are errors that need to be addressed. In the same way, missing values should also be addressed.

 

  5. Ensemble models: Ensemble models produce better results most of the time. Bagging (bootstrap aggregating) and boosting are some of the ways this can be done. These methods are generally more complex, black-box approaches.

We can also ensemble several weak models and produce better results by taking a simple average or a weighted average of all those models. The idea is that one model might only be capturing the variance of the data while another might be better at capturing the trend; in such cases, an ensemble works great. A small sketch follows.
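A minimal sketch of averaging two models' predicted probabilities; the data frames train and valid, the binary target y, and the predictors x1 and x2 are all hypothetical:

library(randomForest)
m1 <- glm(y ~ x1 + x2, data = train, family = binomial)   # model 1: logistic regression
m2 <- randomForest(factor(y) ~ x1 + x2, data = train)     # model 2: random forest
p1 <- predict(m1, newdata = valid, type = "response")
p2 <- predict(m2, newdata = valid, type = "prob")[, "1"]
p_ens <- 0.5 * p1 + 0.5 * p2   # simple average; the weights can be tuned on validation data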

 

  6. Using a suitable machine learning algorithm: Choosing the right algorithm is a crucial step in building a better model. I was once working with a Holt-Winters model for prediction, but it performed badly for real-time forecasting, so I had to move to neural network models. Some algorithms are simply better suited to some datasets than others. Identifying the right type of model can be really tricky, though!
  7. Auto feature generation: There is a lot of buzz around the term "deep learning". The quality of the features is critical to the accuracy of the resulting machine-learned model; no machine learning method works well with poorly chosen features. However, due to the size and complexity of real problems, there are in theory an infinite number of potential features to choose from. If you are doing image classification or handwriting recognition, deep learning is for you: it does not require you to hand-craft the best possible features, because it learns them on its own. Image processing tasks have seen amazing results using deep learning.

[Figure: automatic feature generation with deep learning]

  8. Miscellaneous: It is always better to explore the data thoroughly. The data distribution might suggest a transformation: the data might follow a Gaussian or some other family of distributions, in which case applying the algorithm after a small transformation can give better predictions. Once we get the right data distribution, the algorithm can work efficiently. Another thing we can do is fine-tune the parameters of the algorithms.

Difference between Usual Machine Learning and Deep Learning Explained!

As Wikipedia defines it: "Machine learning is a subfield of computer science[1] that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.[1] Machine learning explores the study and construction of algorithms that can learn from and make predictions on data.[2] Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions,[3]:2 rather than following strictly static program instructions."
For example, you collect a bunch of face images and non-face images, choose an algorithm, and wait for the computations to finish. This is the spirit of machine learning: "machine learning" emphasizes that the computer program (or machine) must do some work after it is given data.

Loosely speaking, most ML algorithms work on a precise set of features extracted from your raw data. Features can be very simple, such as pixel values for images or temporal values for a signal, or complex, such as a bag-of-words representation for text. Most known ML algorithms only work as well as the features represent the data, so identifying features that closely represent all the states of your data is a crucial step.


 

Importance of the feature extractor:
Building the correct feature extractor is a great deal of science in itself. Most feature extractors are very specific in function and utility. For example, for face detection one needs a feature extractor that correctly represents the parts of a face and is resistant to spatial aberrations. Each type of data and task may have its own class of feature extraction (e.g. speech recognition, image recognition).

These feature extractors can then be used to extract the correct features for a given sample and pass this information to a classifier/predictor.

How is deep learning different?
Deep learning is a broader family of machine learning methods that tries to learn high-level features from the given data. The problem it solves is reducing the task of building a new feature extractor for every type of data (speech, images, etc.).

For the earlier example, deep learning algorithms will try to learn features such as the differences between a human face, a dog, and the structure of a room when an image recognition task is presented to them. They may then use this information for classification, prediction, and similar tasks. This is a major step away from the previous "shallow" learning algorithms.
The main difference is that regular machine learning involves a lot of handcrafted feature extraction, while deep learning does the feature extraction by itself.
So deep learning is essentially a set of techniques that help you parameterize deep neural network structures: neural networks with many, many layers and parameters.

It’s a growing trend in ML due to some favorable results in applications where the target function is very complex and the datasets are large. For example in Hinton et al. (2012),  Hinton and his students managed to beat the status quo prediction systems on five well known datasets: Reuters, TIMIT, MNIST, CIFAR and ImageNet. This covers speech, text and image classification – and these are quite mature datasets, so a win on any of these gets some attention. A win on all of them gets a lot of attention. Deep learning networks differ from “normal” neural networks and SVMs because they can be trained in an UNSUPERVISED or SUPERVISED manner for both UNSUPERVISED and SUPERVISED learning tasks.
Prof. Andrew Ng remarks that deep learning brings us back to the original aim of "one learning algorithm", an ideal envisioned for AI.