Data Cleaning, Handling missing, incomplete and noisy data, Binning
Data cleaning is the process of detecting and correcting, or removing, dirty data: values that are missing, inconsistent, or wrong.
Real-world data is rarely clean. Values can be incorrect for many reasons, such as hardware faults, network errors, or human error, so the data must be cleaned before mining.
|Incomplete data||salary = " "|
|Inconsistent data||Age = "5 years", Birthday = "06/06/1990", Current Year = "2017"|
|Noisy data||Salary = "-5000", Name = "123"|
|Intentional error||Some applications automatically assign a default value to an attribute. For example, an application may set gender to male by default: gender = "male"|
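Checks like the ones in the table above can be automated. The following is a minimal sketch, assuming each record is a plain dictionary; the function name `find_issues`, the field names, the `DD/MM/YYYY` birthday format, and the one-year tolerance on age are all illustrative assumptions. Intentional defaults are left out because they cannot be recognised from a single record.

```python
def find_issues(record, current_year):
    """Return a list of data-quality problems found in one record."""
    issues = []

    # Incomplete: the salary field is absent or blank.
    raw_salary = record.get("salary")
    salary = "" if raw_salary is None else str(raw_salary).strip()
    if salary == "":
        issues.append("incomplete: salary is missing")
    else:
        # Noisy: a salary must be a non-negative number.
        try:
            if float(salary) < 0:
                issues.append("noisy: salary is negative")
        except ValueError:
            issues.append("noisy: salary is not a number")

    # Noisy: a name should contain at least one letter.
    name = str(record.get("name", ""))
    if name and not any(ch.isalpha() for ch in name):
        issues.append("noisy: name contains no letters")

    # Inconsistent: the stated age should agree with the birth year.
    age = record.get("age")
    birthday = record.get("birthday")  # assumed format: DD/MM/YYYY
    if age is not None and birthday:
        birth_year = int(birthday.split("/")[-1])
        expected_age = current_year - birth_year
        if abs(expected_age - age) > 1:  # tolerance is an assumption
            issues.append("inconsistent: age does not match birthday")

    return issues


record = {"name": "123", "salary": "-5000", "age": 5, "birthday": "06/06/1990"}
for issue in find_issues(record, current_year=2017):
    print(issue)
```

For the sample record, which combines the noisy and inconsistent values from the table, the function reports a negative salary, a name without letters, and an age that does not match the birthday.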
How to Handle Incomplete/Missing Data?
- Ignore the tuple
- Fill in the missing value manually
- Fill in the missing value automatically, using one of the following:
  - The attribute mean
  - A constant value, if a suitable one exists
  - The most probable value, inferred with a Bayesian formula or a decision tree
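The simpler options in this list can be sketched in a few lines of Python. This is a minimal illustration, assuming missing values are represented as `None`; the function names and the sample salaries are hypothetical. Filling with the most probable value is not shown, since it requires a model trained on the other attributes.

```python
from statistics import mean


def drop_incomplete(rows):
    """Ignore the tuple: keep only rows in which no field is missing."""
    return [row for row in rows if all(v is not None for v in row)]


def fill_with_mean(values):
    """Replace each missing value with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate the mean from
    avg = mean(known)
    return [avg if v is None else v for v in values]


def fill_with_constant(values, constant):
    """Replace each missing value with a fixed constant."""
    return [constant if v is None else v for v in values]


salaries = [3000, None, 5000, 4000, None]  # made-up sample data
print(fill_with_mean(salaries))
print(fill_with_constant(salaries, 0))
print(drop_incomplete([("Ann", 3000), ("Bob", None)]))
```

With these sample salaries, the two missing entries are filled with 4000, the mean of the three known values. Filling with the mean keeps every record but pulls the filled values toward the centre of the distribution, whereas ignoring the tuple discards information from the other attributes.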
How to Handle Noisy Data?
- Combined computer and human inspection.
What is Binning?
Binning is a technique in which we first sort the data and then partition it into equal-frequency bins, that is, bins that each hold the same number of values.
|Bin 1||2, 3, 6, 8|
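The sort-then-partition step can be sketched as follows. This is a minimal illustration; the function name `equal_frequency_bins` and the input list are assumptions, with the input chosen so that the first bin matches Bin 1 above and the remaining values made up.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values, then cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


data = [6, 2, 8, 3, 15, 11, 12, 9]  # illustrative input
for number, bin_values in enumerate(equal_frequency_bins(data, bin_size=4), start=1):
    print(f"Bin {number}: {bin_values}")
# Bin 1: [2, 3, 6, 8]
# Bin 2: [9, 11, 12, 15]
```

If the number of values is not a multiple of `bin_size`, the last bin in this sketch simply holds fewer values.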
Types of binning:
Once the data is binned, the values in each bin can be smoothed in several ways, including the following:
- Smoothing by bin means: each value in the bin is replaced by the bin mean
|Bin 1||4.75, 4.75, 4.75, 4.75|
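- Smoothing by bin medians: each value in the bin is replaced by the bin median
- Smoothing by bin boundaries: each value in the bin is replaced by the closer of the bin's minimum and maximum values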
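The three smoothing methods can be compared on Bin 1 from the example. The code below is a minimal sketch; the function names are illustrative, and sending ties to the lower boundary is an arbitrary choice made for this example.

```python
from statistics import mean, median


def smooth_by_mean(bin_values):
    """Replace every value in the bin with the bin mean."""
    avg = mean(bin_values)
    return [avg] * len(bin_values)


def smooth_by_median(bin_values):
    """Replace every value in the bin with the bin median."""
    mid = median(bin_values)
    return [mid] * len(bin_values)


def smooth_by_boundaries(bin_values):
    """Replace every value with the closer of the bin's minimum and maximum."""
    low, high = min(bin_values), max(bin_values)
    # A value exactly halfway between the boundaries goes to the lower one.
    return [low if v - low <= high - v else high for v in bin_values]


bin_1 = [2, 3, 6, 8]
print(smooth_by_mean(bin_1))        # [4.75, 4.75, 4.75, 4.75]
print(smooth_by_median(bin_1))      # [4.5, 4.5, 4.5, 4.5]
print(smooth_by_boundaries(bin_1))  # [2, 2, 8, 8]
```

Smoothing by means or medians collapses each bin to a single value, while smoothing by boundaries keeps the range of the bin and moves only the interior values.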