What is meant by data cleaning?
Data cleaning is the process of detecting and correcting dirty data. Real-world data is rarely clean: it can be incorrect for many reasons, such as hardware failure, network errors, or human error. It is therefore essential to clean the data before mining it.
What are the importance and benefits of data cleaning?
1. Data cleaning removes major errors.
2. Data cleaning ensures happier customers, more sales, and more accurate decisions.
3. Data cleaning removes inconsistencies that are likely to occur when multiple sources of data are stored in one dataset.
4. Data cleaning makes the dataset more efficient, more reliable, and more accurate.
Sources of Missing Values
- There are many sources of missing data. Let's look at some major ones:
- The user forgot to fill in a field.
- A programming error.
- Data was lost while being transferred manually from a legacy database.
| Dirty data | Examples |
| --- | --- |
| Incomplete data | salary = "" |
| Inconsistent data | Age = "5 years", Birthday = "06/06/1990", Current Year = "2017" |
| Noisy data | Salary = "-5000", Name = "123" |
| Intentional error | Some applications assign a default value to an attribute, e.g., gender = "male" by default. |
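To make these categories concrete, here is a minimal sketch that flags each kind of dirty record with pandas. The DataFrame and its column names are hypothetical, invented only for illustration:

```python
import pandas as pd

# Hypothetical records illustrating the dirty-data categories above.
df = pd.DataFrame({
    "name": ["Alice", "123", "Bob"],
    "age": [5, 26, 31],
    "birthday": ["06/06/1990", "01/02/1991", "03/04/1986"],
    "salary": ["", "-5000", "52000"],
    "current_year": [2017, 2017, 2017],
})

# Incomplete data: empty strings stand in for missing salaries.
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")  # "" becomes NaN
incomplete = df["salary"].isna()

# Noisy data: a negative salary or a purely numeric name is suspicious.
noisy = (df["salary"] < 0) | df["name"].str.fullmatch(r"\d+")

# Inconsistent data: the age should match the birthday and current year.
birth_year = pd.to_datetime(df["birthday"], format="%m/%d/%Y").dt.year
inconsistent = (df["current_year"] - birth_year) != df["age"]

print(df[incomplete | noisy | inconsistent])
```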
How to Handle Incomplete/Missing Data?
- Ignore the tuple.
- Fill in the missing value manually.
- Fill in the values automatically (see the sketch after this list) by:
  - using the attribute mean,
  - using a constant value, if a suitable one exists,
  - predicting the most probable value with a Bayesian formula or a decision tree.
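As an illustration of automatic filling, here is a minimal sketch of mean imputation with pandas; the DataFrame and column names are hypothetical:

```python
import pandas as pd

# Hypothetical dataset with a missing salary value.
df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"],
                   "salary": [50000, None, 58000]})

# Fill the missing entry with the attribute (column) mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())
print(df)  # Bob's salary becomes (50000 + 58000) / 2 = 54000
```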
How to Handle Noisy Data?
- Binning
- Regression
- Clustering
- Combined computer and human inspection.
What is Binning?
Binning is a technique in which we first sort the data and then partition it into equal-frequency bins.
| Bin | Values |
| --- | --- |
| Bin 1 | 2, 3, 6, 8 |
| Bin 2 | 14, 16, 18, 24 |
| Bin 3 | 26, 28, 30, 32 |
Types of binning:
There are many types of binning. Some of them are as follows (a code sketch of mean smoothing appears after this list):
- Smooth by getting the bin means
| Bin | Smoothed values |
| --- | --- |
| Bin 1 | 4.75, 4.75, 4.75, 4.75 |
| Bin 2 | 18, 18, 18, 18 |
| Bin 3 | 29, 29, 29, 29 |
- Smooth by getting the bin medians
- Smooth by getting the bin boundaries, etc.
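Below is a minimal sketch of equal-frequency binning with smoothing by bin means, reproducing the numbers from the tables above; it needs only the Python standard library:

```python
data = [8, 16, 24, 26, 28, 30, 3, 2, 6, 18, 14, 32]

# Step 1: sort, then split into equal-frequency bins of 4 values each.
data.sort()
bins = [data[i:i + 4] for i in range(0, len(data), 4)]
# bins == [[2, 3, 6, 8], [14, 16, 18, 24], [26, 28, 30, 32]]

# Step 2: smooth by bin means -- replace each value with its bin's mean.
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(smoothed)
# [[4.75, 4.75, 4.75, 4.75], [18.0, 18.0, 18.0, 18.0], [29.0, 29.0, 29.0, 29.0]]
```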
Data cleaning steps
There are six major steps for data cleaning.
1. Monitoring the Errors
It is very important to monitor errors and to identify which source is responsible for most of them.
2. Standardization of the mining Processes
We standardize the point of entry and check its importance. When we standardize the data process, it leads to a good point of entry and reduces the risk of duplication.
3. Validation of data Accuracy
We need to validate the accuracy of our data once the database has been cleaned. There are many tools that help us clean our data in real time.
4. Scrub for Duplicate Data
It is very important to identify duplicates, because removing them saves a lot of time when we perform data analysis (see the sketch below).
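As one way to scrub duplicates, here is a minimal sketch using pandas; the DataFrame and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical customer records containing an exact duplicate row.
df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Alice"],
    "email": ["alice@example.com", "bob@example.com", "alice@example.com"],
})

# Drop exact duplicates, keeping the first occurrence of each record.
deduped = df.drop_duplicates()

# Or treat rows with the same email as duplicates even if other fields differ.
deduped_by_email = df.drop_duplicates(subset=["email"], keep="first")
print(deduped_by_email)
```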
5. Analyze
Before this activity, our data must be standardized, validated, and scrubbed for duplicates. There are many third-party sources that can capture information directly from our databases; they help us clean and compile the data to ensure completeness, accuracy, and reliability for business decision-making.
6. Communicate with the Team
Finally, we must communicate with our team and tell them about the new standardized cleaning process.
Data cleaning tools
There are many data cleaning tools. Here, I am sharing with you the top 10 data cleaning tools.
- OpenRefine
- Trifacta Wrangler
- Drake
- Data Ladder
- DataCleaner
- Cloudingo
- Reifier
- IBM InfoSphere QualityStage
- TIBCO Clarity
- WinPure
FAQ
Question
________ is the process of obtaining, cleaning, organizing, relating, and cataloging source data.
Answer: "Data Cleaning is the process of obtaining, cleaning, organizing, relating, and cataloging source data."
How do slack variables help SVM with noisy data?
Slack variables are non-negative, per-point quantities that relax the hard requirement of linear separability: each training point is allowed to fall inside the margin, or even on the wrong side of the separating hyperplane, by the amount of its slack, at a penalty. This is how they help the support vector machine cope with noisy data.
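For reference, here is the standard textbook soft-margin SVM objective that introduces the slack variables (with training points \((\mathbf{x}_i, y_i)\) and penalty parameter \(C\); this is the general formulation, not specific to this article):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_{i}
\quad \text{subject to} \quad
  y_{i}\left(\mathbf{w}^{\top}\mathbf{x}_{i} + b\right) \ge 1 - \xi_{i},
  \qquad \xi_{i} \ge 0 .
```

Each \(\xi_i\) measures how far point \(i\) violates the margin, and \(C\) trades off margin width against total violation, so noisy points are absorbed as paid-for violations instead of making the problem infeasible.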