List of clustering algorithms in data mining
This tutorial introduces the basics of clustering algorithms in data mining. The algorithms covered are listed below:
- K-Means Clustering
- Agglomerative Hierarchical Clustering
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
- Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
- Mean-Shift Clustering
K-Means Clustering
K-Means clustering is a technique that partitions the data into k clusters by assigning each data point to the cluster whose centroid (mean) is nearest, so that similar points end up in the same cluster.
Step 1: Choose k initial centroids at random.
Step 2: Assign each data point to the cluster of its nearest centroid.
Step 3: Recompute each centroid as the mean of the points assigned to it, then repeat Step 2. The total error (the sum of squared distances from the points to their centroids) changes on every pass. When the error no longer changes, we can stop and finalize the clusters.
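The three steps can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function names `sq_dist` and `kmeans`, the `max_iter` and `seed` parameters, and the sample points are assumptions made for this example, and the error is taken to be the sum of squared Euclidean distances.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of numeric tuples into k groups (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: pick k data points at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels, error, prev_error = [], None, None
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
        # Step 3: stop once the total error no longer changes.
        error = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        if prev_error is not None and abs(prev_error - error) < 1e-12:
            break
        prev_error = error
    return labels, centroids, error


sample = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids, error = kmeans(sample, k=2)
print(labels, centroids, error)
```

Because the initial centroids are random, different seeds can lead to different final clusters on less clearly separated data, so it is common to run the procedure several times and keep the result with the lowest error.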
Agglomerative Hierarchical Clustering
Hierarchical clustering is also known as hierarchical cluster analysis (HCA). In this type of clustering, we build a hierarchy of clusters. There are two types of strategies for hierarchical clustering:
- Agglomerative Strategies
- Divisive Strategies
In agglomerative strategies, each observation starts in its own cluster, and pairs of clusters are merged as we move up the hierarchy. This is known as the bottom-up strategy.
In divisive strategies, all observations start in one cluster, and clusters are split as we move down the hierarchy. This is known as the top-down strategy.
The merges and splits are determined in a greedy way.
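The bottom-up strategy can be sketched as a loop that repeatedly merges the two closest clusters. The sketch below is only an illustration under stated assumptions: the function name `agglomerative`, the `n_clusters` stopping rule, and the choice of single linkage (distance between the closest pair of members) are all choices made for this example, and the brute-force search is meant for small inputs.

```python
import math


def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage (hypothetical helper).

    Returns the final clusters as lists of point indices, plus the
    sequence of merges, which records the hierarchy that was built.
    """
    # Each observation starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Greedy step: find the pair of clusters that is currently closest.
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


sample = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(sample, n_clusters=3)
print(clusters)
print(merges)
```

Each merge is final: once two clusters are joined, the loop never separates them again, which is what makes the procedure greedy.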
Advantages of Agglomerative Hierarchical Clustering
- Hierarchical clustering orders the objects in a way that is informative for data display.
- Generating smaller clusters can be helpful for discovering information in the data.
Disadvantages of Agglomerative Hierarchical Clustering
- It does not allow relocation of objects that may have been wrongly grouped at an early stage. The result should be examined in detail to ensure that it gives accurate information.
- Using different distance metrics to measure the distance between clusters may produce different results.
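The last point can be made concrete with a small sketch. The two measures below, often called single linkage and complete linkage, are used here as example ways of measuring the distance between clusters. The one-dimensional clusters `a`, `b`, and `c` are made-up values chosen so that the two measures disagree about which cluster is closest to `a`.

```python
def single_link(x, y):
    """Distance between the closest pair of members of two 1-D clusters."""
    return min(abs(p - q) for p in x for q in y)


def complete_link(x, y):
    """Distance between the farthest pair of members of two 1-D clusters."""
    return max(abs(p - q) for p in x for q in y)


# Made-up one-dimensional clusters for illustration.
a = [0.0, 1.0]
b = [2.2, 6.0]
c = [-3.0, -2.0]

for name, measure in [("single", single_link), ("complete", complete_link)]:
    d_ab = measure(a, b)
    d_ac = measure(a, c)
    nearest = "b" if d_ab < d_ac else "c"
    print(name, "linkage: d(a,b) =", d_ab, "d(a,c) =", d_ac, "-> merge a with", nearest)
```

With these values, the closest-pair measure would merge `a` with `b`, while the farthest-pair measure would merge `a` with `c`, so the resulting hierarchy depends on the measure chosen.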
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
This clustering algorithm was proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
Given a set of points, it groups together points that are close neighbors and marks as outliers the points that lie far away from any group.
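A compact sketch of this idea follows. It uses the two usual DBSCAN parameters, a neighborhood radius `eps` and a minimum neighbor count `min_pts`. The function name, the `NOISE` marker, and the sample data are assumptions made for this example, and the brute-force neighbor search is only suitable for small data sets.

```python
import math

NOISE = -1  # label used here for outlier points


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [
            j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps
        ]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is also a core point
                queue.extend(reachable)
        cluster += 1
    return labels


sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))
```

Unlike K-Means, the number of clusters is not given in advance; it follows from `eps` and `min_pts`, and the isolated point at (50, 50) is left unassigned as noise.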
Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
Gaussian mixture models (GMMs) are widely used for data clustering. Given a fitted GMM, each query data point is assigned to the component that yields the highest posterior probability. Assigning each data point to exactly one cluster in this way is called hard clustering.
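The assignment rule can be illustrated for a one-dimensional mixture. The sketch assumes the GMM has already been fitted (for example by EM) and only shows the assignment step; the component weights, means, and standard deviations below are made-up values, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def posteriors(x, weights, means, stds):
    """Posterior probability of each component given the point x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)  # assumes x is not so extreme that this underflows to 0
    return [j / total for j in joint]


def hard_assign(x, weights, means, stds):
    """Index of the component with the highest posterior probability."""
    post = posteriors(x, weights, means, stds)
    return max(range(len(post)), key=post.__getitem__)


# Made-up parameters standing in for a fitted two-component GMM.
weights = [0.5, 0.5]
means = [0.0, 5.0]
stds = [1.0, 1.0]

for x in (0.3, 2.4, 4.2):
    print(x, posteriors(x, weights, means, stds), hard_assign(x, weights, means, stds))
```

The posterior probabilities themselves can also be kept as the result, in which case each point belongs to every component to some degree rather than to exactly one.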
Mean-Shift Clustering
Mean-shift clustering is a simple and flexible clustering technique that has several advantages over other approaches. First of all, we need to represent our data mathematically. The method builds on the concept of kernel density estimation (KDE).
KDE is a technique for estimating the probability distribution underlying a set of data.
It works by placing one kernel on each point in the data set. A kernel is a weighting function. There are many types of kernels, but the Gaussian kernel is a common choice.
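The following sketch shows a one-dimensional Gaussian KDE and how mean shift can use it: a point is repeatedly moved to the kernel-weighted mean of the data until it settles on a density peak. The function names, the `bandwidth` value, the stopping tolerance, and the sample data are assumptions made for this example.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian weighting function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, data, bandwidth):
    """Density estimate at x: the average of one kernel placed on each data point."""
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)


def mean_shift_point(x, data, bandwidth, tol=1e-6, max_iter=500):
    """Move x uphill on the density estimate until it stops moving."""
    for _ in range(max_iter):
        weights = [gaussian_kernel((x - xi) / bandwidth) for xi in data]
        # Shift x to the kernel-weighted mean of the data around it.
        new_x = sum(w * xi for w, xi in zip(weights, data)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


data = [0.8, 1.0, 1.2, 4.7, 5.0, 5.3]
bandwidth = 0.5

print(kde(1.0, data, bandwidth), kde(3.0, data, bandwidth))
# Points that end up at the same peak belong to the same cluster.
print([round(mean_shift_point(x, data, bandwidth), 3) for x in data])
```

The bandwidth controls how wide each kernel is: a larger value smooths the estimate and tends to produce fewer, larger clusters, while a smaller value produces more peaks.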