Data Mining in DBMS
A database is an organized collection of related data. When a large amount of data (big data) is stored, extracting useful information from it becomes very difficult.
Data mining is a technique to extract useful information from data.
Raw data on its own carries little meaning for a user, so it has to be mined. Data mining is now widely used to explore data and is especially useful for business analytics.
Several techniques are used to mine data and to support the mining process. Some of these techniques are as follows:
- Pattern tracking to recognize recurring patterns in the data.
- Classification to assign data items to predefined classes.
- Association techniques to identify associations among items in data sets.
- Outlier detection to find data that differs markedly from the rest.
- Clustering to group similar data items into clusters.
- Regression to estimate the relationships among variables.
- Prediction to forecast results from existing data.
Now, let's discuss some important terms in data mining.
Data cleaning is the process of correcting or removing dirty data.
Data is rarely clean. Values can be incorrect for many reasons, such as hardware errors or failures, network errors, or human error, so the data should be cleaned before mining.
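As a small illustration of cleaning, the sketch below drops records whose value is missing or outside a plausible range. The field name `age` and the range 0 to 120 are assumptions made up for this example; real cleaning rules depend on the data set.

```python
def clean_records(records, valid_range=(0, 120)):
    """Keep only records whose 'age' is present and inside valid_range.

    The 'age' field and the default range are hypothetical examples.
    """
    lo, hi = valid_range
    cleaned = []
    for record in records:
        age = record.get("age")
        if age is None:
            continue  # missing value, e.g. lost during transfer
        if not lo <= age <= hi:
            continue  # impossible value, e.g. a typing error
        cleaned.append(record)
    return cleaned


raw = [{"age": 25}, {"age": None}, {"age": -3}, {"age": 240}, {"age": 61}]
cleaned = clean_records(raw)
```

Dropping records is only one option. Depending on the task, a missing value could instead be filled in, for example with the attribute mean.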
Z-score normalization rescales each value of an attribute by subtracting the attribute's mean and dividing by its standard deviation.
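A minimal sketch of z-score normalization is given here. It assumes the population standard deviation is used; some texts use the sample standard deviation instead. The function name and the sample values are illustrative only.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Return (v - mean) / std for each value (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so there is no spread to scale by.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scores = z_score_normalize([10, 20, 30, 40, 50])
```

After normalization the values have a mean of 0 and a standard deviation of 1, so attributes measured in different units become comparable.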
Min Max normalization
Min-max normalization is a technique that rescales the data linearly. It typically scales the data to the range 0 to 1.
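The scaling can be sketched as follows. The parameters `new_min` and `new_max` are assumptions added for flexibility; with their defaults the result lies between 0 and 1.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no range, so map everything to new_min.
        return [new_min for _ in values]
    return [
        (v - lo) / (hi - lo) * (new_max - new_min) + new_min
        for v in values
    ]


scaled = min_max_normalize([200, 300, 400, 600, 1000])
# scaled is [0.0, 0.125, 0.25, 0.5, 1.0]
```

The smallest value maps to 0 and the largest to 1, so a single extreme value can compress all the other values into a narrow band.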
Decimal scaling is a data normalization technique in which the decimal point of the attribute's values is moved. The number of places the decimal point moves depends on the maximum absolute value among all values of the attribute.
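A short sketch of decimal scaling, assuming the common definition in which each value is divided by 10^j and j is the smallest integer that brings the largest absolute value below 1. The function name and the sample values are illustrative.

```python
def decimal_scaling(values):
    """Divide every value by 10**j, with j chosen so max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j


scaled, j = decimal_scaling([-986, 217, 45])
# j is 3, so -986 becomes -0.986
```

Here the maximum absolute value is 986, so the decimal point moves three places for every value in the attribute.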
Data is stored as tuples of attributes, and each attribute can be normalized by using its standard deviation.
Data discretization converts a large number of data values into a smaller number of intervals or categories, so that data evaluation and data management become much easier.
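One simple way to discretize is equal-width partitioning, sketched below. This is only one of several discretization methods, and the function name and the ages are made up for the example.

```python
def equal_width_discretize(values, n_bins):
    """Replace each value with the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width == 0:
            idx = 0
        else:
            # The maximum value would land in bin n_bins, so clamp it.
            idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(idx)
    return labels


ages = [3, 17, 25, 38, 41, 59, 63]
labels = equal_width_discretize(ages, 3)
# labels is [0, 0, 1, 1, 1, 2, 2]
```

Seven distinct ages are reduced to three interval labels, which could then be named, for example, young, middle-aged, and senior.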
Binning Methods for Data Smoothing
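The binning method can be used to smooth data. The sorted values are distributed into bins, and each value is replaced by a representative of its bin, such as the bin mean.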
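Smoothing by bin means, one common variant of binning, can be sketched like this. It assumes equal-frequency bins of a fixed size; the function name and the price values are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins, and replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
# bins (4, 8, 15), (21, 21, 24), (25, 28, 34) become 9.0, 22.0, 29.0
```

Other variants replace each value with the bin median or with the nearest bin boundary.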
Correlation analysis of nominal data can be done with the chi-square test, which checks whether two nominal attributes are related.
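The chi-square statistic for a contingency table can be computed as sketched below. The counts in `table` are hypothetical, and the function only returns the statistic; looking up a significance level is left out.

```python
def chi_square_statistic(table):
    """Sum (observed - expected)**2 / expected over all cells of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows and columns are the values of two nominal attributes (made-up counts).
table = [
    [250, 200],
    [50, 1000],
]
chi2 = chi_square_statistic(table)
```

A larger statistic means the observed counts are further from what independence would predict. For a table with r rows and c columns the degrees of freedom are (r - 1)(c - 1).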
Apriori is an algorithm that helps in mining frequent itemsets.
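A compact sketch of the Apriori idea follows: frequent itemsets of size k are built only from frequent itemsets of size k - 1. It is a simplified teaching version, not an optimized implementation, and the baskets and the `min_support` count are made up.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    current = count({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
frequent = apriori(baskets, min_support=3)
```

With these baskets every single item and every pair is frequent, but the three-item set appears in only two baskets and is dropped.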
Clustering is the process of partitioning a set of data into smaller groups, or clusters, on the basis of similarity and dissimilarity.
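As one concrete example of clustering, here is a minimal k-means sketch for one-dimensional data. K-means is only one of many clustering algorithms, and the starting centroids are passed in explicitly (an assumption made to keep the example deterministic).

```python
def k_means_1d(points, centroids, max_iter=100):
    """Assign points to the nearest centroid, then move centroids to cluster means."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters


points = [1, 2, 3, 10, 11, 12]
centroids, clusters = k_means_1d(points, centroids=[1, 12])
# centroids is [2.0, 11.0]; clusters is [[1, 2, 3], [10, 11, 12]]
```

Similarity here is plain distance on the number line. For records with several attributes a distance such as Euclidean distance would be used instead.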
What is Boosting?
Boosting is an ensemble technique that combines many weak learners into a strong learner.
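To show how weak learners can add up to a strong one, the sketch below uses AdaBoost, one well-known boosting algorithm, with one-dimensional threshold rules (decision stumps) as weak learners. The data set and function names are made up, and labels are assumed to be +1 or -1.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak learner: predict polarity at or below the threshold, else the opposite."""
    return polarity if x <= threshold else -polarity


def train_stump(xs, ys, weights):
    """Pick the (threshold, polarity) pair with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified points, lower the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

No single threshold can separate this data, since the -1 labels sit in the middle. After three rounds the weighted vote of three stumps classifies every training point correctly.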
RainForest Algorithm / Framework
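RainForest is a framework specially designed for classifying large data sets.
RainForest maintains an AVC set for each attribute at each tree node.
An AVC set (Attribute-Value, Classlabel) records, for one attribute, the number of tuples for each combination of attribute value and class label.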
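To make the AVC set concrete, the sketch below assumes its usual description: for one attribute, a count of tuples for every (attribute value, class label) pair. The rows, the attribute `age`, and the class label `buys_computer` are hypothetical.

```python
from collections import Counter


def build_avc_set(rows, attribute, class_label):
    """Count tuples per (attribute value, class label) pair for one attribute."""
    return Counter((row[attribute], row[class_label]) for row in rows)


rows = [
    {"age": "youth", "buys_computer": "no"},
    {"age": "youth", "buys_computer": "yes"},
    {"age": "senior", "buys_computer": "yes"},
    {"age": "youth", "buys_computer": "no"},
]
avc_age = build_avc_set(rows, "age", "buys_computer")
# avc_age[("youth", "no")] is 2
```

Because only counts are kept, an AVC set is usually much smaller than the data it summarizes, which is what makes the approach suitable for large data sets.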
All data is randomly divided into data sets of equal size, e.g.:
- Training set
- Test set
- Validation set
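The random split described above can be sketched as follows. The equal three-way split follows the text; the `seed` parameter is an assumption added so that the shuffle is repeatable.

```python
import random


def split_dataset(records, seed=0):
    """Shuffle the records and cut them into three parts of equal size."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    third = len(shuffled) // 3
    training = shuffled[:third]
    test = shuffled[third:2 * third]
    # Any leftover records (when the size is not divisible by 3) go here.
    validation = shuffled[2 * third:]
    return training, test, validation


training, test, validation = split_dataset(range(12))
```

Every record lands in exactly one of the three sets. In practice the proportions are often unequal, with the training set given the largest share.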
Bootstrap aggregation, commonly known as bagging, is a simple and powerful ensemble method.
An ensemble method is a technique that combines the predictions of many machine learning models to make more reliable and accurate predictions than any individual model. Because bagging aggregates many models, its predictions tend to be more stable than those of a single model.
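A minimal sketch of bagging for classification: each model is trained on a bootstrap sample (drawn with replacement) and the final prediction is a majority vote. The 1-nearest-neighbor base learner, the data, and the parameter names are assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) records with replacement."""
    return [rng.choice(data) for _ in data]


def nearest_neighbor_predict(sample, x):
    """Base learner: label of the closest point in the sample."""
    _, label = min(sample, key=lambda record: abs(record[0] - x))
    return label


def bagging_predict(data, x, n_models=25, seed=0):
    """Train n_models base learners on bootstrap samples and take a majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        votes[nearest_neighbor_predict(sample, x)] += 1
    return votes.most_common(1)[0][0], votes


data = [(1, "low"), (2, "low"), (3, "low"), (10, "high"), (11, "high"), (12, "high")]
label, votes = bagging_predict(data, 2.5)
```

An individual bootstrap sample may miss some records, so a single model can be wrong, but the vote across many samples is much harder to mislead.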