A brief introduction to Outliers and Outlier Removal methods

What is an Outlier?

 Simply speaking, Outlier is an observation that appears far away and diverges from an overall   pattern in a sample.

Let’s take an example, we do customer profiling and find out that the average annual income of customers is $0.8 million. But, there are two customers having annual income of $4 and $4.2 million. These two customers annual income is much higher than rest of the population. These two observations will be seen as Outliers.

What are the types of Outliers?

Outlier can be of two types: Univariate and Multivariate. Above, we have discussed the example of univariate outlier. These outliers can be found when we look at distribution of a single variable. Multi-variate outliers are outliers in an n-dimensional space. In order to find them, you have to look at distributions in multi-dimensions.

Let us understand this with an example. Let us say we understand the relationship between height and weight. Below, we have univariate and bivariate distribution for Height, Weight. Take a look at the box plot. We do not have any outlier (above and below 1.5*IQR, most common method). Now look at the scatter plot. Here, we have two values below and one above the average in a specific segment of weight and height.

What causes Outliers?

Whenever we come across outliers, the ideal way to tackle them is to find out the reason of having these outliers. The method to deal with them would then depend on the reason of their occurrence. Causes of outliers can be classified in two broad categories:

  1. Artificial (Error) / Non-natural
  2. Natural.

Let’s understand various types of outliers in more detail:

  • Data Entry Errors:- Human errors such as errors caused during data collection, recording, or entry can cause outliers in data. For example: Annual income of a customer is $100,000. Accidentally, the data entry operator puts an additional zero in the figure. Now the income becomes $1,000,000 which is 10 times higher. Evidently, this will be the outlier value when compared with rest of the population.
  • Measurement Error: It is the most common source of outliers. This is caused when the measurement instrument used turns out to be faulty. For example: There are 10 weighing machines. 9 of them are correct, 1 is faulty. Weight measured by people on the faulty machine will be higher / lower than the rest of people in the group. The weights measured on faulty machine can lead to outliers.
  • Experimental Error: Another cause of outliers is experimental error. For example: In a 100m sprint of 7 runners, one runner missed out on concentrating on the ‘Go’ call which caused him to start late. Hence, this caused the runner’s run time to be more than other runners. His total run time can be an outlier.
  • Intentional Outlier: This is commonly found in self-reported measures that involves sensitive data. For example: Teens would typically under report the amount of alcohol that they consume. Only a fraction of them would report actual value. Here actual values might look like outliers because rest of the teens is under reporting the consumption.
  • Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors may lead to outliers in the dataset.
  • Sampling error: For instance, we have to measure the height of athletes. By mistake, we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
  • Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance: In my last assignment with one of the renowned insurance company, I noticed that the performance of top 50 financial advisors was far higher than rest of the population. Surprisingly, it was not due to any error. Hence, whenever we perform any data mining activity with advisors, we used to treat this segment separately.

What is the Impact of Outliers on a dataset?

Outliers can drastically change the results of the data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in the data set:

  • It increases the error variance and reduces the power of statistical tests
  • If the outliers are non-randomly distributed, they can decrease normality
  • They can bias or influence estimates that may be of substantive interest
  • They can also impact the basic assumption of Regression, ANOVA and other statistical model assumptions.

To understand the impact deeply, let’s take an example to check what happens to a data set with and without outliers in the data set.

As you can see, data set with outliers has significantly different mean and standard deviation. In the first scenario, we will say that average is 5.45. But with the outlier, average soars to 30. This would change the estimate completely.

How to detect Outliers?

Most commonly used method to detect outliers is visualization. We use various visualization methods, like Box-plot, Histogram, Scatter Plot (above, we have used box plot and scatter plot for visualization). Some analysts also various thumb rules to detect outliers. Some of them are:

  • Any value, which is beyond the range of -1.5 x IQR to 1.5 x IQR
  • Use capping methods. Any value which out of range of 5th and 95th percentile can be considered as outlier
  • Data points, three or more standard deviation away from mean are considered outlier
  • Outlier detection is merely a special case of the examination of data for influential data points and it also depends on the business understanding
  • Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or distance. Popular indices such as Mahalanobis’ distance and Cook’s D are frequently used to detect outliers.
  • In SAS, we can use PROC Univariate, PROC SGPLOT. To identify outliers and influential observation, we also look at statistical measure like STUDENT, COOKD, RSTUDENT and others.

How to remove Outliers?

Most of the ways to deal with outliers are similar to the methods of missing values like deleting observations, transforming them, binning them, treat them as a separate group, imputing values and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:

Deleting observations: We delete outlier values if it is due to data entry error, data processing error or outlier observations are very small in numbers. We can also use trimming at both ends to remove outliers.

Transforming and binning values: Transforming variables can also eliminate outliers. Natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. Decision Tree algorithm allows to deal with outliers well due to binning of variable. We can also use the process of assigning weights to different observations.

Imputing: Like imputation of missing values, we can also impute outliers. We can use mean, median, mode imputation methods. Before imputing values, we should analyse if it is natural outlier or artificial. If it is artificial, we can go with imputing values. We can also use statistical model to predict values of outlier observation and after that we can impute it with predicted values.

Treat separately: If there are significant number of outliers, we should treat them separately in the statistical model. One of the approach is to treat both groups as two different groups and build individual model for both groups and then combine the output.

Outlier treatment methods:

1.Box plot:

2.Histogram:

histogramconsists of parallel vertical bars that graphically show the frequency distribution of a quantitative variable. The area of each bar is equal to the frequency of items found in each class.

Example:

In the data set faithful, the histogram of the eruptions variable is a collection of parallel vertical bars showing the number of eruptions classified according to their durations.

Problem

Find the histogram of the eruption durations in faithful.

Solution

We apply the hist function to produce the histogram of the eruptions variable.

> duration = faithful$eruptions 
> hist(duration,    # apply the hist function 
+   right=FALSE)    # intervals closed on the left

R detection methods in R

1. Using  “outliers” package.
  
outlier_tf = outlier(data_full$target column,logical=TRUE)
#This gives an array with all values False, except for the outlier (as defined in the package documentation “Finds value with largest difference between it and sample mean, which can be an outlier”).  That value is returned as True. 
find_outlier = which(outlier_tf==TRUE,arr.ind=TRUE)
#This finds the location of the outlier by finding that “True” value within the “outlier_tf” array. 
data_new = data_full[-find_outlier,]
#This creates a new dataset based on the old data, removing the one row that contains the outlier .

2. using DMwR Package       

Quartile Method

The quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data

The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set

The third quartile (Q3) is the middle value between the median and the highest value of the data set.

Inter Quartile Range(IQR) refers to the difference between third and first quartile

To find the oultier n,

n> Q3+1.5*IQR

or

n<Q1 -1.5*IQR

The same can be done in R using Box Whisker Plots

3.LOF(LOCAL OUTLIER FACTOR)

Outliers are found based on their local neighbourhoods,more specifically on the local densities.

To calculate the local outlier factor scores

score=lofactor(data,k)

k is the number of neighbours

To plot density plots

plot(density(score)

To find the  data points with the highest outlier scores(greater the score greater the chance of the data point being an outlier)

For example the top 5 scores

outliers <- order(outlier.scores, decreasing=T)[1:5]