Data Cleaning, Handling missing, incomplete and noisy data, Binning

Data Cleaning

Data cleaning is the process of detecting and correcting dirty data, that is, values that are missing, inconsistent, or noisy.

Raw data is rarely clean. Values can be incorrect for many reasons, such as hardware errors or failures, network errors, or human error, so the data must be cleaned before it is mined.

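Dirty Data Examples

  • Incomplete data: a value is missing, e.g. salary = " "
  • Inconsistent data: values contradict each other, e.g. Age = "5 years" with Birthday = "06/06/1990" and Current Year = "2017"
  • Noisy data: values are impossible or meaningless, e.g. Salary = "-5000", Name = "123"
  • Intentional error: an application sometimes assigns a default value to an attribute automatically, e.g. some applications set gender to male by default, giving gender = "male"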
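
As a rough illustration, the first three kinds of dirty data above can be flagged with simple rule-based checks. The sketch below is only one possible approach: the function name `find_problems`, the field names (`salary`, `name`, `age`, `birth_year`), and the one-year tolerance are assumptions made for this example. Intentional errors such as a default gender cannot be detected from a single record, so they are not covered.

```python
def find_problems(record, current_year):
    """Return a list of human-readable problems found in one record."""
    problems = []

    # Incomplete or noisy salary: missing/blank, or a negative number.
    salary = record.get("salary")
    if salary is None or (isinstance(salary, str) and not salary.strip()):
        problems.append("incomplete: salary is missing")
    elif isinstance(salary, (int, float)) and salary < 0:
        problems.append("noisy: salary is negative")

    # Noisy name: a name made up only of digits is not plausible.
    name = str(record.get("name", ""))
    if name.isdigit():
        problems.append("noisy: name contains only digits")

    # Inconsistent age: compare the stated age with the birth year.
    age = record.get("age")
    birth_year = record.get("birth_year")
    if age is not None and birth_year is not None:
        expected = current_year - birth_year
        # Allow one year of slack: the birthday may not have passed yet.
        if abs(expected - age) > 1:
            problems.append(
                f"inconsistent: age {age} does not match birth year {birth_year}"
            )
    return problems


record = {"salary": -5000, "name": "123", "age": 5, "birth_year": 1990}
for problem in find_problems(record, current_year=2017):
    print(problem)
```

Each rule is independent, so a record can be reported under more than one category at once.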

How to Handle Incomplete/Missing Data?

  • Ignore the tuple, i.e. drop the record that has the missing value
  • Fill in the missing value manually
  • Fill in the missing value automatically with
    • the attribute mean
    • a constant value, if a suitable one exists
    • the most probable value, predicted with a Bayesian formula or a decision tree
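
The simpler strategies listed above can be sketched in a few lines of Python. This is a minimal illustration, not a complete solution: it assumes missing values are represented as `None`, and the helper names (`drop_missing`, `fill_with_mean`, `fill_with_constant`) are made up for this example. Predicting the most probable value with a Bayesian formula or a decision tree needs a trained model and is left out.

```python
from statistics import mean


def drop_missing(rows, field):
    """Ignore the tuple: keep only rows where `field` is present."""
    return [row for row in rows if row.get(field) is not None]


def fill_with_mean(values):
    """Replace each None with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate the mean from
    average = mean(known)
    return [average if v is None else v for v in values]


def fill_with_constant(values, constant):
    """Replace each None with a fixed constant."""
    return [constant if v is None else v for v in values]


salaries = [3000, None, 5000, 4000, None]
print(fill_with_mean(salaries))         # missing entries become 4000
print(fill_with_constant(salaries, 0))  # missing entries become 0
```

Dropping tuples is the simplest choice but discards the other attributes of the record, which is why filling is often preferred when many records have a gap.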

How to Handle Noisy Data?

  • Binning
  • Regression
  • Clustering
  • Combined computer and human inspection.

What is Binning?

Binning is a smoothing technique in which we first sort the data and then partition it into equal-frequency bins, so that each bin holds the same number of values.

Bin 1: 2, 3, 6, 8
Bin 2: 14, 16, 18, 24
Bin 3: 26, 28, 30, 32
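
The sort-then-partition step can be sketched as follows. The function name `equal_frequency_bins` is an assumption made for this example, and so is the rule for leftovers: when the values do not divide evenly, the earlier bins each take one extra value.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins = []
    start = 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


data = [24, 2, 30, 8, 16, 28, 3, 14, 32, 6, 26, 18]
for number, values in enumerate(equal_frequency_bins(data, 3), start=1):
    print(f"Bin {number}: {values}")
```

With the twelve values used in this example and three bins, each bin receives four values, matching the bins shown above.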


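Types of Binning

Once the data is in bins, the values in each bin can be smoothed in several ways. Some of them are as follows:

  • Smooth by bin means: every value in a bin is replaced by the mean of that bin
  • Smooth by bin medians: every value in a bin is replaced by the median of that bin
  • Smooth by bin boundaries: every value in a bin is replaced by the closer of the bin's minimum and maximum, etc.

For example, smoothing the three bins above by their bin means gives:

Bin 1: 4.75, 4.75, 4.75, 4.75
Bin 2: 18, 18, 18, 18
Bin 3: 29, 29, 29, 29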
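
The three smoothing methods can be sketched in Python as below. The function names are assumptions made for this example, each bin is assumed to be non-empty, and for boundary smoothing a value exactly halfway between the two boundaries is sent to the lower one, which is a choice of this sketch rather than a fixed rule.

```python
from statistics import mean, median


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties (equal distance to both boundaries) go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = [[2, 3, 6, 8], [14, 16, 18, 24], [26, 28, 30, 32]]
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

Mean smoothing is sensitive to an extreme value inside a bin, while median and boundary smoothing are less affected by it.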