Data Cleaning, Handling missing, incomplete and noisy data, Binning

Last modified on December 23rd, 2018 at 7:45 am

Data Cleaning

Data cleaning is the process of detecting and correcting dirty data, that is, data that is incomplete, inconsistent, or noisy.

Real-world data is rarely clean. Values can be wrong for many reasons, including hardware errors or failures, network errors, and human error. The data should therefore be cleaned before mining.

| Dirty data | Examples |
|---|---|
| Incomplete data | salary = " " |
| Inconsistent data | Age = "5 years", Birthday = "06/06/1990", Current Year = "2017" |
| Noisy data | Salary = "-5000", Name = "123" |
| Intentional error | Some applications allot a default value to an attribute automatically, e.g. an application sets the gender to male by default: gender = "male" |
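
The kinds of dirty data listed above can often be flagged with simple checks. The following is a minimal sketch, not a complete validator. The function name `find_issues`, the field names (`name`, `salary`, `age`, `birth_year`), and the one-year tolerance on age are assumptions made for illustration.

```python
def find_issues(record, current_year):
    """Return a list of data-quality problems found in one record (a dict)."""
    issues = []

    # Incomplete data: salary is absent or blank.
    salary = record.get("salary")
    if salary is None or str(salary).strip() == "":
        issues.append("incomplete: salary is missing")
    else:
        # Noisy data: salary is not numeric or is negative.
        try:
            if float(salary) < 0:
                issues.append("noisy: salary is negative")
        except ValueError:
            issues.append("noisy: salary is not a number")

    # Noisy data: a name made up only of digits.
    if str(record.get("name", "")).isdigit():
        issues.append("noisy: name is numeric")

    # Inconsistent data: stated age disagrees with the birth year.
    age = record.get("age")
    birth_year = record.get("birth_year")
    if age is not None and birth_year is not None:
        expected_age = current_year - birth_year
        if abs(expected_age - age) > 1:  # assumed tolerance of one year
            issues.append("inconsistent: age does not match birth year")

    return issues


record = {"name": "123", "salary": "-5000", "age": 5, "birth_year": 1990}
issues = find_issues(record, current_year=2017)
for issue in issues:
    print(issue)
```

A default value such as gender = "male" cannot be detected from a single record in this way. It usually shows up only as a suspicious pattern across many records.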

How to Handle Incomplete/Missing Data?

  • Ignore the tuple
  • Fill in the missing value manually
  • Fill in the value automatically using
    • The attribute mean
    • A global constant, if a suitable one exists
    • The most probable value, predicted with a Bayesian formula or a decision tree
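
Three of the options above (ignoring the tuple, filling with the attribute mean, and filling with a constant) can be sketched in a few lines. This is an illustrative sketch only: the helper names and the sample rows are assumptions, a missing value is assumed to be stored as `None`, and the Bayesian or decision-tree approach is not shown.

```python
from statistics import mean


def drop_incomplete(rows, key):
    """Ignore the tuple: keep only rows where `key` has a value."""
    return [r for r in rows if r.get(key) is not None]


def fill_with_mean(rows, key):
    """Replace missing values of `key` with the mean of the known values."""
    known = [r[key] for r in rows if r.get(key) is not None]
    average = mean(known)
    return [{**r, key: average if r.get(key) is None else r[key]} for r in rows]


def fill_with_constant(rows, key, constant):
    """Replace missing values of `key` with a fixed constant."""
    return [{**r, key: constant if r.get(key) is None else r[key]} for r in rows]


# Hypothetical sample data; None marks a missing salary.
rows = [
    {"name": "A", "salary": 3000},
    {"name": "B", "salary": None},
    {"name": "C", "salary": 5000},
]

print(drop_incomplete(rows, "salary"))
print(fill_with_mean(rows, "salary"))
print(fill_with_constant(rows, "salary", 0))
```

Ignoring the tuple loses the other attributes of that row, so it is usually reasonable only when few rows are affected. Filling with the mean keeps the row but can hide real variation in the data.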

How to Handle Noisy Data?

  • Binning
  • Regression
  • Clustering
  • Combined computer and human inspection

What is Binning?

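Binning is a technique in which we first sort the data and then partition it into equal-frequency bins, so that each bin holds the same number of values. For example, twelve sorted values can be split into three bins of four values each:

Bin 1: 2, 3, 6, 8
Bin 2: 14, 16, 18, 24
Bin 3: 26, 28, 30, 32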
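
The sort-then-partition step can be sketched as follows. The function name `equal_frequency_bins` is an assumption made for illustration, and giving any leftover values to the first bins is one possible choice for inputs that do not divide evenly.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


data = [24, 2, 30, 8, 16, 28, 3, 14, 32, 6, 26, 18]
bins = equal_frequency_bins(data, 3)
for number, contents in enumerate(bins, start=1):
    print(f"Bin {number}: {contents}")
```

With the twelve values used in the example above, this yields the same three bins of four values each.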

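Types of binning:

Once the data is in bins, the values in each bin can be smoothed in several ways. Some of them are as follows:

  1. Smoothing by bin means: each value is replaced by the mean of its bin.
Bin 1: 4.75, 4.75, 4.75, 4.75
Bin 2: 18, 18, 18, 18
Bin 3: 29, 29, 29, 29

  2. Smoothing by bin medians: each value is replaced by the median of its bin.

  3. Smoothing by bin boundaries: each value is replaced by the closest boundary (minimum or maximum) of its bin.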
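
The three smoothing methods can be sketched as below, using the bins from the example. The function names are assumptions made for illustration. For smoothing by boundaries, a value that is equally far from both boundaries is sent to the lower one here, which is an arbitrary choice.

```python
from statistics import mean, median


def smooth_by_means(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin's median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption, not a fixed rule).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = [[2, 3, 6, 8], [14, 16, 18, 24], [26, 28, 30, 32]]

print(smooth_by_means(bins))       # bin means: 4.75, 18, 29
print(smooth_by_medians(bins))     # bin medians: 4.5, 17, 29
print(smooth_by_boundaries(bins))  # [[2, 2, 8, 8], [14, 14, 14, 24], [26, 26, 32, 32]]
```

The mean is sensitive to extreme values inside a bin, while the median and the boundaries are less affected by a single outlier.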

Prof. Fazal Rehman Shamil
Researcher, Publisher of International Journal Of Software Technology & Science ISSN: 2616-5325
Instructor, SEO Expert, Web Programmer and poet.