Data Cleaning in Data Mining

What is meant by data cleaning?

Data cleaning is the process of detecting and correcting dirty data. Real-world data is rarely clean: values can be incorrect for many reasons, such as hardware errors or failures, network errors, or human error. The data therefore needs to be cleaned before mining.

What are the importance and benefits of data cleaning?

1. Data cleaning removes major errors.
2. Cleaner data supports more accurate decisions, which in turn can lead to happier customers and more sales.
3. Data cleaning removes the inconsistencies that are likely to occur when multiple sources of data are stored in one data-set.
4. Data cleaning makes the data-set more efficient, more reliable, and more accurate.

Sources of Missing Values

There are many sources of missing data. Some of the major ones are:

  1. The user forgot to fill in a field.
  2. A programming error lost or skipped the value.
  3. Data was lost while being transferred manually from a legacy database.

Dirty data and examples:

  • Incomplete data: salary = " "
  • Inconsistent data: Age = "5 years", Birthday = "06/06/1990", Current Year = "2017"
  • Noisy data: Salary = "-5000", Name = "123"
  • Intentional error: Sometimes an application allots a default value to an attribute automatically, e.g. some applications set the gender value to male by default: gender = "male"
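
The first three kinds of dirty data listed above can be flagged with simple rule checks. The following is a minimal Python sketch, not a complete validator. The record layout, the field names, the function name `find_problems`, and the one-year tolerance are all assumptions made for illustration.

```python
def find_problems(record, current_year):
    """Return a list of problems found in one record.

    `record` is a hypothetical dict with the keys "name", "salary",
    "age" and "birth_year"; a real data-set needs its own rules.
    """
    problems = []

    # Incomplete data: a field that is missing or only whitespace.
    for field, value in record.items():
        if value is None or str(value).strip() == "":
            problems.append(f"incomplete: {field} is empty")

    # Inconsistent data: the stated age disagrees with the birth year.
    age, birth_year = record.get("age"), record.get("birth_year")
    if isinstance(age, int) and isinstance(birth_year, int):
        expected = current_year - birth_year
        # Allow one year of slack: the birthday may not have passed yet.
        if abs(expected - age) > 1:
            problems.append(
                f"inconsistent: age {age} but birth year implies about {expected}"
            )

    # Noisy data: values that cannot be right for the attribute.
    salary = record.get("salary")
    if isinstance(salary, (int, float)) and salary < 0:
        problems.append("noisy: salary is negative")
    name = record.get("name")
    if isinstance(name, str) and name.strip().isdigit():
        problems.append("noisy: name contains only digits")

    return problems


record = {"name": "123", "salary": -5000, "age": 5, "birth_year": 1990}
for problem in find_problems(record, current_year=2017):
    print(problem)
```

An intentional default such as gender = "male" cannot be caught by rules like these, because the stored value looks perfectly valid. Detecting it usually requires knowing how the application fills the field.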

How to Handle Incomplete/Missing Data?

  • Ignore the tuple
  • Fill in the missing value manually
  • Fill in the missing values automatically, using
    • The attribute mean
    • A constant value, if a suitable constant exists
    • The most probable value, predicted with a Bayesian formula or a decision tree
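
The simpler options above can be sketched in a few lines of Python. This is only an illustration under assumptions: missing values are represented as `None`, the function names are made up for this example, and the "Unknown" placeholder is an arbitrary choice. Predicting the most probable value with a Bayesian formula or a decision tree needs a trained model and is not shown.

```python
from statistics import mean


def drop_incomplete(rows):
    """Ignore the tuple: keep only rows that have no missing field."""
    return [row for row in rows if all(value is not None for value in row)]


def fill_with_mean(values):
    """Replace missing entries with the mean of the known values."""
    known = [value for value in values if value is not None]
    if not known:
        return list(values)  # nothing to average, leave the column as it is
    attribute_mean = mean(known)
    return [attribute_mean if value is None else value for value in values]


def fill_with_constant(values, constant="Unknown"):
    """Replace missing entries with one fixed placeholder value."""
    return [constant if value is None else value for value in values]


salaries = [3000, None, 5000, 4000, None]
print(fill_with_mean(salaries))

departments = ["Sales", None, "Support"]
print(fill_with_constant(departments))

print(drop_incomplete([(1, "a"), (2, None), (3, "c")]))
```

Ignoring the tuple is the simplest choice but throws away the other fields of the row, so it tends to be reasonable only when few rows are affected.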

How to Handle Noisy Data?

What is Binning?

Binning is a technique in which we first sort the data and then partition it into equal-frequency bins, so that each bin holds the same number of values.

Bin 1: 2, 3, 6, 8
Bin 2: 14, 16, 18, 24
Bin 3: 26, 28, 30, 32
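
The partitioning step can be sketched as follows. This is a minimal Python illustration, assuming the values fit in memory. The function name `equal_frequency_bins` is made up for this example, and giving the leftover values to the first bins when the split is uneven is one possible convention among several.

```python
def equal_frequency_bins(values, bin_count):
    """Sort the values and split them into bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), bin_count)
    bins, start = [], 0
    for index in range(bin_count):
        # The first `extra` bins take one more value when the split is uneven.
        end = start + size + (1 if index < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# The same twelve values as in the bins above, given in unsorted order.
data = [24, 2, 30, 8, 16, 26, 3, 32, 14, 6, 28, 18]
for number, bin_values in enumerate(equal_frequency_bins(data, 3), start=1):
    print(f"Bin {number}: {bin_values}")
```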

Types of binning:

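There are several ways to smooth the values in each bin. Some of them are as follows:

  1. Smooth by bin means: each value in a bin is replaced by the mean of that bin.
     Bin 1: 4.75, 4.75, 4.75, 4.75
     Bin 2: 18, 18, 18, 18
     Bin 3: 29, 29, 29, 29
  2. Smooth by bin medians: each value in a bin is replaced by the median of that bin.
  3. Smooth by bin boundaries: each value in a bin is replaced by the closer of the bin's minimum and maximum values.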
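
The three smoothing methods can be compared on the example bins with a short Python sketch. The function names are made up for this illustration, and sending ties to the lower boundary is an assumption, since the text does not say how ties are resolved.

```python
from statistics import mean, median


def smooth_by_mean(bin_values):
    """Replace every value with the mean of its bin."""
    return [mean(bin_values)] * len(bin_values)


def smooth_by_median(bin_values):
    """Replace every value with the median of its bin."""
    return [median(bin_values)] * len(bin_values)


def smooth_by_boundaries(bin_values):
    """Replace every value with the nearer of the bin's minimum and maximum."""
    low, high = min(bin_values), max(bin_values)
    # Ties go to the lower boundary (an assumed convention).
    return [low if value - low <= high - value else high for value in bin_values]


bins = [[2, 3, 6, 8], [14, 16, 18, 24], [26, 28, 30, 32]]
for number, bin_values in enumerate(bins, start=1):
    print(f"Bin {number} means:      {smooth_by_mean(bin_values)}")
    print(f"Bin {number} medians:    {smooth_by_median(bin_values)}")
    print(f"Bin {number} boundaries: {smooth_by_boundaries(bin_values)}")
```

For Bin 1 the mean is 4.75, the median is 4.5, and boundary smoothing gives 2, 2, 8, 8, because 3 is closer to 2 and 6 is closer to 8.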

Data cleaning steps

There are six major steps for data cleaning.
1. Monitoring the Errors
It is very important to monitor where errors come from and to identify which source is responsible for most of them.
2. Standardization of the Mining Processes
We standardize the point of entry and check its importance. A standardized data process leads to a good point of entry and reduces the risk of duplication.
3. Validation of Data Accuracy
We need to validate the accuracy of our data once the database has been cleaned. Many tools can help us clean our data in real time.

4. Scrub for Duplicate Data

It is very important to identify duplicates, because removing them saves time when we perform data analysis.
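
Scrubbing for duplicates can be sketched in Python as follows. This is only an illustration: the function names are made up, the sample records are invented, and treating two records as duplicates when they match after trimming whitespace and lower-casing is an assumed rule that a real data-set may need to replace.

```python
def normalize(record):
    """Build a comparison key; trimming and lower-casing is an assumed rule."""
    return tuple(str(value).strip().lower() for value in record)


def remove_duplicates(records):
    """Keep the first occurrence of each record and preserve the order."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Invented sample data: the second row repeats the first with different casing.
customers = [
    ("Alice", "alice@example.com"),
    ("alice ", "ALICE@example.com"),
    ("Bob", "bob@example.com"),
]
print(remove_duplicates(customers))
```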

5. Analyze

Before this activity, our data must be standardized, validated, and scrubbed for duplicates. There are many third-party sources that can capture information directly from our databases. They help us clean and compile the data to ensure its completeness, accuracy, and reliability for business decision making.

6. Communicate with the Team

Finally, we must communicate with our team and tell them about the new standardized cleaning process.

Data cleaning tools

There are many data cleaning tools. Here are 10 of the top data cleaning tools.

  1. OpenRefine
  2. Trifacta Wrangler
  3. Drake
  4. Data Ladder
  5. Data Cleaner
  6. Cloudingo
  7. Reifier
  8. IBM Infosphere Quality Stage
  9. TIBCO Clarity
  10. Winpure

FAQ

Question

________ is the process of obtaining, cleaning, organizing, relating, and cataloging source data.

Answer: Data acquisition is the process of obtaining, cleaning, organizing, relating, and cataloging source data. Data cleaning is one step within that process.

How do slack variables help SVM with noisy data?

Slack variables are non-negative quantities, one per training point, that relax the strict condition of linear separability. Each training point is allowed to fall inside the margin, or even on the wrong side of the separating hyperplane, at a cost. This lets the support vector machine tolerate noisy data instead of failing to find a separating hyperplane.
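
For reference, the standard soft-margin formulation makes the role of the slack variables explicit. The notation below follows the usual textbook convention and is not taken from this article: w and b define the hyperplane, ξ_i is the slack variable of training point (x_i, y_i), n is the number of training points, and C is a penalty parameter chosen by the user.

```latex
\min_{w,\,b,\,\xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

A point with ξ_i = 0 respects the margin, a point with 0 < ξ_i < 1 lies inside the margin but is still classified correctly, and a point with ξ_i > 1 is misclassified. A larger C penalizes such violations more heavily, so noisy points have more influence on the hyperplane.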

Latest posts by Prof. Fazal Rehman Shamil (see all)
