K-Means Clustering in Data Mining

What is clustering?

Clustering is the process of partitioning a set of data into smaller groups, called clusters, on the basis of similarity and dissimilarity.

What is K-Means clustering in data mining?

K-Means clustering is a clustering method in which each data item (attribute value) is assigned to the cluster whose centroid it is nearest to.

How does K-means clustering work?

Step 1:

Choose the initial centroids at random. It is better to take the boundary and middle values as the centroids.

Step 2:

Assign each item (attribute value) to the cluster with the nearest centroid.

Step 3:

Recompute each centroid as the mean of its cluster and repeat the whole process. Every repetition changes the total sum of squared error. When the error stops changing, finalize the clusters and their item sets (attribute values).
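The three steps above can be sketched for one-dimensional data such as the Age example below. This is an illustrative sketch, not a library API; the function name and structure are my own, and it assumes no cluster ever goes empty:

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Assign each value to its nearest centroid, recompute centroids as
    cluster means, and stop when the total squared error stops changing."""
    prev_error = None
    for _ in range(max_iter):
        # Step 2: assign each value to the cluster with the nearest centroid
        clusters = [[] for _ in centroids]
        error = 0.0
        for v in values:
            distances = [abs(v - c) for c in centroids]
            k = distances.index(min(distances))
            clusters[k].append(v)
            error += min(distances) ** 2
        # Step 3: stop once the total squared error no longer changes
        if error == prev_error:
            break
        prev_error = error
        # Recompute each centroid as the mean of its cluster
        # (assumes every cluster is non-empty)
        centroids = [sum(c) / len(c) for c in clusters]
    return clusters, centroids, prev_error

ages = [23, 33, 28, 23, 65, 67, 64, 73, 68, 43, 34, 43, 52, 49]
clusters, centroids, error = kmeans_1d(ages, [23, 65, 43])
print(round(error, 2))  # 220.75, matching the final iteration below
```

Starting from the same initial centroids (23, 65, 43) as the worked example, this sketch converges to the same total error of 220.75.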

 

In each table below, Dk is the distance from the Age value to the centroid of cluster k, Cluster is the cluster of the nearest centroid, and Error is the squared distance to that centroid. The bottom row gives the three centroids and the total error.

Iteration 1:

| id | Age | D1 | D2 | D3 | Cluster | Error |
|----|-----|----|----|----|---------|-------|
| 1 | 23 | 0 | 42 | 20 | 1 | 0 |
| 2 | 33 | 10 | 32 | 10 | 1 | 100 |
| 3 | 28 | 5 | 37 | 15 | 1 | 25 |
| 4 | 23 | 0 | 42 | 20 | 1 | 0 |
| 5 | 65 | 42 | 0 | 22 | 2 | 0 |
| 6 | 67 | 44 | 2 | 24 | 2 | 4 |
| 7 | 64 | 41 | 1 | 21 | 2 | 1 |
| 8 | 73 | 50 | 8 | 30 | 2 | 64 |
| 9 | 68 | 45 | 3 | 25 | 2 | 9 |
| 10 | 43 | 20 | 22 | 0 | 3 | 0 |
| 11 | 34 | 11 | 31 | 9 | 3 | 81 |
| 12 | 43 | 20 | 22 | 0 | 3 | 0 |
| 13 | 52 | 29 | 13 | 9 | 3 | 81 |
| 14 | 49 | 26 | 16 | 6 | 3 | 36 |
| Centroid | | 23 | 65 | 43 | | 401 |
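The first iteration can be checked directly: with centroids 23, 65, and 43, each error is the squared distance from the age to its nearest centroid. A small check in Python:

```python
# Reproduce the first iteration: centroids 23, 65, 43.
ages = [23, 33, 28, 23, 65, 67, 64, 73, 68, 43, 34, 43, 52, 49]
centroids = [23, 65, 43]

total_error = 0
for age in ages:
    # Assign to the nearest centroid; on a tie (age 33 is 10 away from
    # both 23 and 43), the first centroid wins, as in the table.
    nearest = min(centroids, key=lambda c: abs(age - c))
    total_error += (age - nearest) ** 2

print(total_error)  # 401, the total in the bottom row of the table
```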

Iteration 2:

| id | Age | D1 | D2 | D3 | Cluster | Error |
|----|-----|----|----|----|---------|-------|
| 1 | 23 | 3.75 | 44.4 | 21.2 | 1 | 14.0625 |
| 2 | 33 | 6.25 | 34.4 | 11.2 | 1 | 39.0625 |
| 3 | 28 | 1.25 | 39.4 | 16.2 | 1 | 1.5625 |
| 4 | 23 | 3.75 | 44.4 | 21.2 | 1 | 14.0625 |
| 5 | 34 | 7.25 | 33.4 | 10.2 | 1 | 52.5625 |
| 6 | 65 | 38.25 | 2.4 | 20.8 | 2 | 5.76 |
| 7 | 67 | 40.25 | 0.4 | 22.8 | 2 | 0.16 |
| 8 | 64 | 37.25 | 3.4 | 19.8 | 2 | 11.56 |
| 9 | 73 | 46.25 | 5.6 | 28.8 | 2 | 31.36 |
| 10 | 68 | 41.25 | 0.6 | 23.8 | 2 | 0.36 |
| 11 | 43 | 16.25 | 24.4 | 1.2 | 3 | 1.44 |
| 12 | 43 | 16.25 | 24.4 | 1.2 | 3 | 1.44 |
| 13 | 52 | 25.25 | 15.4 | 7.8 | 3 | 60.84 |
| 14 | 49 | 22.25 | 18.4 | 4.8 | 3 | 23.04 |
| Centroid | | 26.75 | 67.4 | 44.2 | | 257.2725 |
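The centroids used in iteration 2 are simply the means of the clusters formed in the first iteration. Grouping the ages by their first-iteration cluster:

```python
# Iteration-2 centroids are the means of the first iteration's clusters.
cluster1 = [23, 33, 28, 23]        # ages assigned to cluster 1
cluster2 = [65, 67, 64, 73, 68]    # ages assigned to cluster 2
cluster3 = [43, 34, 43, 52, 49]    # ages assigned to cluster 3

for ages in (cluster1, cluster2, cluster3):
    print(sum(ages) / len(ages))   # 26.75, 67.4, 44.2 in turn
```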

Iteration 3: 

| id | Age | D1 | D2 | D3 | Cluster | Error |
|----|-----|----|----|----|---------|-------|
| 1 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 2 | 33 | 4.8 | 34.4 | 13.75 | 1 | 23.04 |
| 3 | 28 | 0.2 | 39.4 | 18.75 | 1 | 0.04 |
| 4 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 5 | 34 | 5.8 | 33.4 | 12.75 | 1 | 33.64 |
| 6 | 65 | 36.8 | 2.4 | 18.25 | 2 | 5.76 |
| 7 | 67 | 38.8 | 0.4 | 20.25 | 2 | 0.16 |
| 8 | 64 | 35.8 | 3.4 | 17.25 | 2 | 11.56 |
| 9 | 73 | 44.8 | 5.6 | 26.25 | 2 | 31.36 |
| 10 | 68 | 39.8 | 0.6 | 21.25 | 2 | 0.36 |
| 11 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 12 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 13 | 52 | 23.8 | 15.4 | 5.25 | 3 | 27.5625 |
| 14 | 49 | 20.8 | 18.4 | 2.25 | 3 | 5.0625 |
| Centroid | | 28.2 | 67.4 | 46.75 | | 220.75 |


Iteration 4:

| id | Age | D1 | D2 | D3 | Cluster | Error |
|----|-----|----|----|----|---------|-------|
| 1 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 2 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 3 | 28 | 0.2 | 39.4 | 18.75 | 1 | 0.04 |
| 4 | 33 | 4.8 | 34.4 | 13.75 | 1 | 23.04 |
| 5 | 34 | 5.8 | 33.4 | 12.75 | 1 | 33.64 |
| 6 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 7 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 8 | 49 | 20.8 | 18.4 | 2.25 | 3 | 5.0625 |
| 9 | 52 | 23.8 | 15.4 | 5.25 | 3 | 27.5625 |
| 10 | 64 | 35.8 | 3.4 | 17.25 | 2 | 11.56 |
| 11 | 65 | 36.8 | 2.4 | 18.25 | 2 | 5.76 |
| 12 | 67 | 38.8 | 0.4 | 20.25 | 2 | 0.16 |
| 13 | 68 | 39.8 | 0.6 | 21.25 | 2 | 0.36 |
| 14 | 73 | 44.8 | 5.6 | 26.25 | 2 | 31.36 |
| Centroid | | 28.2 | 67.4 | 46.75 | | 220.75 |

Iteration stops:

The iterations stop here because the total error is the same in iteration 3 and iteration 4. The error is now fixed at 220.75, so no further iterations are needed, and the clusters are final.

Shortcomings of K-Means clustering:

It is sensitive to outliers.

It is not well suited for categorical or nominal data.
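The outlier sensitivity comes from using the mean as the centroid: a single extreme value can drag a centroid far away from the rest of its cluster. A small illustration with made-up ages:

```python
# One extreme value drags a mean-based centroid away from the bulk
# of the cluster (these ages are invented for illustration).
cluster = [23, 28, 33, 34]
print(sum(cluster) / len(cluster))            # 29.5

with_outlier = cluster + [120]                # a single outlying age
print(sum(with_outlier) / len(with_outlier))  # 47.6
```

The centroid jumps from 29.5 to 47.6, nowhere near any actual member of the original cluster; this is why variants such as K-Medoids are often preferred when outliers are expected.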