Data Cleaning, Handling missing, incomplete and noisy data, Binning

Data Cleaning

Data cleaning is the process of detecting and then correcting or removing dirty data, that is, records that are incomplete, inconsistent, or noisy.

Real-world data is rarely clean. Values can be wrong for many reasons, such as hardware errors or failures, network errors, or human error. So the data should be cleaned before mining, otherwise the mining results cannot be trusted.

Types of dirty data, with examples:

  • Incomplete data: salary = " "
  • Inconsistent data: Age = "5 years", Birthday = "06/06/1990", Current Year = "2017"
  • Noisy data: Salary = "-5000", Name = "123"
  • Intentional error: Sometimes applications allot a default value to an attribute automatically. For example, some applications set the gender value to male by default: gender = "male"
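The examples above can be turned into simple checks. The following is only a minimal sketch: the function name find_problems, the record keys (salary, name, age, birth_year) and the one-year tolerance on age are assumptions made for illustration, not part of any library.

```python
def find_problems(record, current_year):
    """Return a list of problems found in one record (a dict).

    The field names used here are hypothetical and chosen to match
    the examples in the text.
    """
    problems = []

    # Incomplete data: the salary is missing or blank.
    salary = record.get("salary")
    if salary is None or str(salary).strip() == "":
        problems.append("incomplete: salary is missing")
    else:
        # Noisy data: the salary is not a number, or it is negative.
        try:
            if float(salary) < 0:
                problems.append("noisy: salary is negative")
        except (TypeError, ValueError):
            problems.append("noisy: salary is not a number")

    # Noisy data: a name made up only of digits.
    if str(record.get("name", "")).isdigit():
        problems.append("noisy: name is numeric")

    # Inconsistent data: the stated age disagrees with the birth year.
    age = record.get("age")
    birth_year = record.get("birth_year")
    if age is not None and birth_year is not None:
        expected_age = current_year - birth_year
        # Assumed tolerance of one year, since the birthday
        # may not have passed yet in the current year.
        if abs(expected_age - age) > 1:
            problems.append("inconsistent: age does not match birth year")

    return problems


dirty = {"salary": "-5000", "name": "123", "age": 5, "birth_year": 1990}
print(find_problems(dirty, current_year=2017))
```

Checks like these only flag suspicious records. Deciding how to repair them is a separate step.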

How to Handle Incomplete/Missing Data?

  • Ignore the tuple (drop the record that has the missing value)
  • Fill in the missing value manually
  • Fill in the missing value automatically with
    • The mean of the attribute
    • A fixed constant, if a sensible constant exists for the attribute
    • The most probable value, estimated with a Bayesian formula or a decision tree
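Three of the options above (ignoring the tuple, filling with a constant, and filling with the attribute mean) can be sketched in plain Python as follows. The function names and the sample employees data are assumptions made for illustration, and a missing value is assumed to be stored as None. The Bayesian and decision tree options need a trained model and are not shown.

```python
from statistics import mean


def drop_incomplete(rows, attribute):
    """Ignore the tuple: keep only rows where the attribute is present."""
    return [dict(row) for row in rows if row.get(attribute) is not None]


def fill_with_constant(rows, attribute, constant):
    """Replace each missing value of the attribute with a fixed constant."""
    return [
        {**row, attribute: constant} if row.get(attribute) is None else dict(row)
        for row in rows
    ]


def fill_with_mean(rows, attribute):
    """Replace each missing value with the mean of the known values."""
    known = [row[attribute] for row in rows if row.get(attribute) is not None]
    if not known:
        raise ValueError("no known values to compute a mean from")
    return fill_with_constant(rows, attribute, mean(known))


# Hypothetical sample data: the salary of B is missing.
employees = [
    {"name": "A", "salary": 3000},
    {"name": "B", "salary": None},
    {"name": "C", "salary": 5000},
]

print(drop_incomplete(employees, "salary"))
print(fill_with_constant(employees, "salary", 0))
print(fill_with_mean(employees, "salary"))
```

Each function returns a new list and leaves the input rows unchanged, so the different strategies can be compared on the same data.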

How to Handle Noisy Data?

  • Binning
  • Regression
  • Clustering
  • Combined computer and human inspection.

What is Binning?

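Binning is a technique in which we first sort the data and then partition the sorted values into equal-frequency bins, so that every bin holds the same number of values. The values inside each bin are then smoothed using their neighbours in the same bin. For example, twelve sorted values split into three bins of four values each: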

Bin 1: 2, 3, 6, 8
Bin 2: 14, 16, 18, 24
Bin 3: 26, 28, 30, 32
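The partitioning step can be sketched as below. The function name equal_frequency_bins is an assumption made for illustration. When the number of values does not divide evenly, this sketch gives the extra values to the first bins, which is one possible choice and not the only one.

```python
def equal_frequency_bins(values, bin_count):
    """Sort the values and split them into bin_count bins of (nearly) equal size."""
    if bin_count <= 0:
        raise ValueError("bin_count must be positive")

    ordered = sorted(values)
    size, remainder = divmod(len(ordered), bin_count)

    bins = []
    start = 0
    for i in range(bin_count):
        # Assumption: the first `remainder` bins take one extra value each.
        end = start + size + (1 if i < remainder else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


data = [26, 2, 16, 8, 30, 3, 24, 14, 6, 32, 18, 28]
for number, values in enumerate(equal_frequency_bins(data, 3), start=1):
    print(f"Bin {number}: {values}")
```

With the twelve values above and three bins, this reproduces the three bins of four values listed in the example.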


Types of binning:

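There are several ways to smooth the values inside each bin. Some of them are as follows:

  1. Smoothing by bin means: every value in a bin is replaced by the mean of that bin. For the bins 2, 3, 6, 8 and 14, 16, 18, 24 and 26, 28, 30, 32 this gives:
     Bin 1: 4.75, 4.75, 4.75, 4.75
     Bin 2: 18, 18, 18, 18
     Bin 3: 29, 29, 29, 29
  2. Smoothing by bin medians: every value in a bin is replaced by the median of that bin.
  3. Smoothing by bin boundaries: the smallest and largest values of a bin are its boundaries, and every value is replaced by the closer boundary.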
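The three smoothing methods can be sketched as follows, using the bins 2, 3, 6, 8 and 14, 16, 18, 24 and 26, 28, 30, 32. The function names are assumptions made for illustration. For boundary smoothing, a value that is equally far from both boundaries is moved to the lower one here, which is an assumed tie-breaking rule.

```python
from statistics import mean, median


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the median of that bin."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = [[2, 3, 6, 8], [14, 16, 18, 24], [26, 28, 30, 32]]

print(smooth_by_means(bins))       # Bin 1 becomes 4.75, 4.75, 4.75, 4.75
print(smooth_by_medians(bins))     # Bin 1 becomes 4.5, 4.5, 4.5, 4.5
print(smooth_by_boundaries(bins))  # Bin 1 becomes 2, 2, 8, 8
```

Means and medians flatten each bin to a single value, while boundary smoothing keeps two distinct values per bin and so preserves a little more of the original spread.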
Fazal Rehman Shamil