Outliers in Data Mining

Outlier detection is a widely studied topic in the field of data mining. Let's discuss what outliers are, how they arise, and how to detect them.

An outlier is a data point that deviates markedly from the rest of the data.

An outlier may reflect genuine variability in the measurements or an experimental error. In other words, an outlier is an observation that lies far from the overall pattern of the sample data.

Outliers can indicate that the population has a heavy-tailed distribution or that a measurement error has occurred.

Outliers can be categorized as:

  1. Collective outliers
  2. Point outliers
  3. Contextual outliers

Collective outliers are subsets of data points that deviate from the rest of the data as a group, and they can signal novelties in the data. For example, an unusual signal may indicate the discovery of a new phenomenon in the data set.

Point outliers are individual data points that lie far from the rest of the data distribution.

Contextual outliers are values that are anomalous only in a particular context, and they often resemble noise in the data. Examples of such noise include punctuation symbols when analyzing text, or background noise in a voice recording when doing speech recognition.

Types of outliers

Depending on the number of variables involved, there are two types of outliers:

  1. Univariate outliers
  2. Multivariate outliers

A univariate outlier is a data point with an extreme value on a single variable. A multivariate outlier is a data point whose combination of values on two or more variables is unusual, even if each value looks ordinary on its own. Both univariate and multivariate outliers can influence the overall outcome of the data analysis.
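
The difference between the two types can be illustrated with a small sketch. The article does not prescribe a method for multivariate outliers; the Mahalanobis distance used here is one common choice, and the data set and the function name `mahalanobis_2d` are made up for illustration.

```python
import math
from statistics import mean, pstdev

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the mean of all points."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Population variances and covariance of the two variables.
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in points) / n
    det = var_x * var_y - cov ** 2  # determinant of the covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        d2 = (dx * dx * var_y - 2 * dx * dy * cov + dy * dy * var_x) / det
        distances.append(math.sqrt(d2))
    return distances

# Illustrative data: y grows with x, except for the last point (3, 8).
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 6.0),
          (7, 6.9), (8, 8.2), (9, 8.8), (10, 10.1), (3, 8.0)]

xs, ys = zip(*points)
z_x = (3 - mean(xs)) / pstdev(xs)    # z-score of the last point on x alone
z_y = (8.0 - mean(ys)) / pstdev(ys)  # z-score of the last point on y alone
distances = mahalanobis_2d(points)
most_unusual = max(range(len(points)), key=lambda i: distances[i])

print(round(z_x, 2), round(z_y, 2))
print(points[most_unusual])
```

Taken one variable at a time, the point (3, 8.0) is unremarkable because both of its values sit within one standard deviation of the mean. Only the combination of a low x with a high y makes it stand out, which is what the multivariate distance picks up.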

Causes of outliers

Outliers can have many different causes. Some of these causes are mentioned below.

  • Sudden malfunction of the instruments used to take measurements.
  • Errors in data transmission.
  • Changes in system behavior.
  • Fraudulent behavior.
  • Human error.
  • Natural deviations in populations.
  • Flaws in the assumed theory.
  • Incorrect data collection.

How to detect outliers in data mining

A simple cluster-based algorithm for detecting outliers:

  1. Calculate the mean of each cluster of the data.
  2. Initialize the threshold value.
  3. Calculate the distance of the test data point from each cluster mean.
  4. Find the cluster nearest to the test data point.
  5. If the distance to the nearest cluster mean is greater than the threshold, flag the test data point as an outlier.
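
The steps above can be sketched roughly as follows. This is a minimal illustration, assuming the clusters are already known, that distance means Euclidean distance, and that the threshold is chosen by hand; the function names and sample data are hypothetical.

```python
import math

def cluster_means(clusters):
    """Step 1: return the mean point of each cluster."""
    means = []
    for cluster in clusters:
        dims = len(cluster[0])
        means.append(tuple(sum(p[d] for p in cluster) / len(cluster)
                           for d in range(dims)))
    return means

def is_outlier(point, means, threshold):
    """Steps 3 to 5: compare the distance to the nearest cluster mean with the threshold."""
    distances = [math.dist(point, m) for m in means]               # step 3
    nearest = min(range(len(means)), key=lambda i: distances[i])   # step 4
    return distances[nearest] > threshold, nearest, distances[nearest]  # step 5

# Illustrative clusters of 2-D points.
clusters = [
    [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)],
]
means = cluster_means(clusters)
threshold = 3.0  # step 2: assumed value, in practice tuned to the data

print(is_outlier((1.8, 1.2), means, threshold))   # close to the first cluster
print(is_outlier((5.0, 14.0), means, threshold))  # far from both clusters
```

The first test point lies well inside the first cluster and is not flagged. The second is more than the threshold away from even its nearest cluster mean, so it is flagged as an outlier.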

There are many methods of outlier detection. Some of them are listed below:

  • Z-Score Normalization
  • Linear Regression Models (PCA, LMS)
  • Information Theory Models
  • High Dimensional Outlier Detection Methods (high dimensional sparse data)
  • Proximity Based Models (non-parametric)
  • Probabilistic and Statistical Modeling (parametric)
  • Numeric Outlier

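Numeric Outlier

The numeric outlier technique is a nonparametric outlier detection technique for a one-dimensional feature space. Numeric outliers are identified by means of the interquartile range (IQR), the difference between the third and first quartiles: values that lie far below the first quartile or far above the third quartile are flagged as outliers.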
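
As a rough sketch of the IQR approach, the snippet below flags values that lie more than 1.5 times the IQR outside the quartiles. The multiplier 1.5 is a common convention rather than something the article specifies, and the function name and data are made up for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [12, 13, 14, 15, 15, 16, 17, 18, 19, 95]
print(iqr_outliers(data))
```

On this sample only the value 95 falls outside the fences. A larger `k` makes the test more tolerant, a smaller one stricter.
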
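
Z-Score

The Z-score is a data normalization technique that assumes a Gaussian distribution of the data. It expresses each value as the number of standard deviations it lies from the mean, so values with a large absolute Z-score can be flagged as outliers.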
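
A minimal sketch of Z-score based detection is shown below. The cutoff of 3 standard deviations is a frequently used default, not a value taken from the article, and the function name and data are illustrative only.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing can stand out
    return [v for v in values if abs((v - mu) / sigma) > threshold]

data = [48, 50, 51, 49, 52, 50, 47, 53, 50, 49, 51, 50, 48, 52, 120]
print(zscore_outliers(data))
```

Here only the value 120 is flagged. Note that an extreme value inflates the standard deviation it is measured against, so on very small samples no value can reach a Z-score of 3.
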

DBSCAN

This technique is based on the DBSCAN clustering algorithm. It is a density-based, nonparametric outlier detection technique for feature spaces of one or more dimensions. In DBSCAN, every data point is classified as one of the following, and the noise points are treated as outliers:

  1. Core Points
  2. Border Points
  3. Noise Points
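
The three kinds of points can be illustrated with a small sketch. It only labels the points and does not build the clusters themselves, so it is not a full DBSCAN implementation. It assumes the two usual DBSCAN parameters, a neighborhood radius (`eps`) and a minimum neighbor count (`min_pts`, counting the point itself); the function name and data are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using the DBSCAN definitions."""
    # Neighborhood of each point: indices of all points within eps (itself included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core points have at least min_pts points in their neighborhood.
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but near a core point
        else:
            labels.append("noise")   # near no core point: treated as an outlier
    return labels

points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8), (1.6, 1.2), (6.0, 6.0)]
labels = classify_points(points, eps=0.5, min_pts=4)
for point, label in zip(points, labels):
    print(point, label)
```

With these settings the first two points are core points, the next three are border points, and the isolated point (6.0, 6.0) is noise. The result depends strongly on the values chosen for `eps` and `min_pts`.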