K-Means Clustering in Data Mining

Last modified on December 9th, 2018 at 9:19 pm

What is clustering?

Clustering is the process of partitioning a set of data into smaller groups, called clusters, so that items in the same cluster are similar to each other and dissimilar to items in other clusters.

What is K-Means clustering in data mining?

K-Means clustering is a clustering method that assigns every data item (attribute value) to the cluster whose centroid is nearest to it.

How does K-means clustering work?

Step 1:

Choose the initial centroids randomly, one for each cluster. It is better to take the boundary and middle values as the initial centroids.

Step 2:

Assign each item (attribute value) to the cluster with the nearest centroid.

Step 3:

Recompute each centroid as the mean of the items assigned to it, then repeat the process. Every repetition changes the total sum of errors. When the total error stops changing, finalize the clusters and their items (attribute values).
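The three steps above can be sketched in a few lines of Python for one-dimensional data such as ages. This is a minimal sketch, not a definitive implementation: the function name `kmeans_1d`, the `max_iter` limit and the `1e-9` tolerance are illustrative assumptions, the error of an item is assumed to be its squared distance to the assigned centroid, and ties are assumed to go to the lowest-numbered cluster.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Cluster 1-D values. Returns (centroids, labels, total error per iteration)."""
    centroids = list(centroids)
    history = []
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each value to the nearest centroid (first one wins ties).
        labels = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        # Total error (assumed): sum of squared distances to the assigned centroid.
        total = sum((v - centroids[k]) ** 2 for v, k in zip(values, labels))
        history.append(total)
        # Step 3: stop once the total error no longer changes.
        if len(history) > 1 and abs(history[-1] - history[-2]) < 1e-9:
            break
        # Move each centroid to the mean of its members (keep it if it has none).
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return centroids, labels, history


# Step 1: initial centroids chosen by hand, as in the example that follows.
ages = [23, 33, 28, 23, 65, 67, 64, 73, 68, 43, 34, 43, 52, 49]
centroids, labels, history = kmeans_1d(ages, [23, 65, 43])
print(centroids)
print(history)
```

With the ages and initial centroids used in the example below, this sketch stops after four passes, with total errors of approximately 401, 257.2725, 220.75 and 220.75.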

 

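Iteration 1:

| id | Age | D1 | D2 | D3 | Cluster | Error |
|----|-----|----|----|----|---------|-------|
| 1 | 23 | 0 | 42 | 20 | 1 | 0 |
| 2 | 33 | 10 | 32 | 10 | 1 | 100 |
| 3 | 28 | 5 | 37 | 15 | 1 | 25 |
| 4 | 23 | 0 | 42 | 20 | 1 | 0 |
| 5 | 65 | 42 | 0 | 22 | 2 | 0 |
| 6 | 67 | 44 | 2 | 24 | 2 | 4 |
| 7 | 64 | 41 | 1 | 21 | 2 | 1 |
| 8 | 73 | 50 | 8 | 30 | 2 | 64 |
| 9 | 68 | 45 | 3 | 25 | 2 | 9 |
| 10 | 43 | 20 | 22 | 0 | 3 | 0 |
| 11 | 34 | 11 | 31 | 9 | 3 | 81 |
| 12 | 43 | 20 | 22 | 0 | 3 | 0 |
| 13 | 52 | 29 | 13 | 9 | 3 | 81 |
| 14 | 49 | 26 | 16 | 6 | 3 | 36 |
| | Centroid | 23 | 65 | 43 | Total | 401 |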
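The columns of the first table can be recomputed with a short sketch. The text does not define the Error column. The values in the table are consistent with the squared distance from each age to its nearest centroid, so that is what this sketch assumes. It also assumes that ties, such as age 33, go to the lowest-numbered cluster.

```python
ages = [23, 33, 28, 23, 65, 67, 64, 73, 68, 43, 34, 43, 52, 49]
centroids = [23, 65, 43]  # initial centroids from the first table

rows = []
for i, age in enumerate(ages, start=1):
    distances = [abs(age - c) for c in centroids]  # D1, D2, D3
    nearest = min(distances)
    cluster = distances.index(nearest) + 1         # first minimum wins ties
    error = nearest ** 2                           # assumed: squared distance
    rows.append((i, age, *distances, cluster, error))

for row in rows:
    print(row)

total_error = sum(row[-1] for row in rows)
print("total error:", total_error)
```

Each printed tuple follows the column order id, Age, D1, D2, D3, Cluster, Error, and the total matches the 401 shown in the last row of the table.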

Iteration 2:

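| id | Age | D1 | D2 | D3 | Cluster | Error |
|----|-----|----|----|----|---------|-------|
| 1 | 23 | 3.75 | 44.4 | 21.2 | 1 | 14.0625 |
| 2 | 33 | 6.25 | 34.4 | 11.2 | 1 | 39.0625 |
| 3 | 28 | 1.25 | 39.4 | 16.2 | 1 | 1.5625 |
| 4 | 23 | 3.75 | 44.4 | 21.2 | 1 | 14.0625 |
| 5 | 34 | 7.25 | 33.4 | 10.2 | 1 | 52.5625 |
| 6 | 65 | 38.25 | 2.4 | 20.8 | 2 | 5.76 |
| 7 | 67 | 40.25 | 0.4 | 22.8 | 2 | 0.16 |
| 8 | 64 | 37.25 | 3.4 | 19.8 | 2 | 11.56 |
| 9 | 73 | 46.25 | 5.6 | 28.8 | 2 | 31.36 |
| 10 | 68 | 41.25 | 0.6 | 23.8 | 2 | 0.36 |
| 11 | 43 | 16.25 | 24.4 | 1.2 | 3 | 1.44 |
| 12 | 43 | 16.25 | 24.4 | 1.2 | 3 | 1.44 |
| 13 | 52 | 25.25 | 15.4 | 7.8 | 3 | 60.84 |
| 14 | 49 | 22.25 | 18.4 | 4.8 | 3 | 23.04 |
| | Centroid | 26.75 | 67.4 | 44.2 | Total | 257.2725 |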
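The centroids used in iteration 2 (26.75, 67.4 and 44.2) appear to be the mean age of each cluster from the first table. A minimal sketch of that update step, with the cluster membership copied by hand from the Cluster column of the first table:

```python
# Ages grouped by the cluster they were assigned to in the first table.
clusters = {
    1: [23, 33, 28, 23],
    2: [65, 67, 64, 73, 68],
    3: [43, 34, 43, 52, 49],
}

# Each new centroid is the arithmetic mean of the ages in its cluster.
new_centroids = {k: sum(members) / len(members) for k, members in clusters.items()}
print(new_centroids)
```

Age 34 sits in cluster 3 in the first table but is closer to the new centroid 26.75 than to 44.2, which is why it moves to cluster 1 in iteration 2.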

Iteration 3: 

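| id | Age | D1 | D2 | D3 | Cluster | Error |
|----|-----|----|----|----|---------|-------|
| 1 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 2 | 33 | 4.8 | 34.4 | 13.75 | 1 | 23.04 |
| 3 | 28 | 0.2 | 39.4 | 18.75 | 1 | 0.04 |
| 4 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 5 | 34 | 5.8 | 33.4 | 12.75 | 1 | 33.64 |
| 6 | 65 | 36.8 | 2.4 | 18.25 | 2 | 5.76 |
| 7 | 67 | 38.8 | 0.4 | 20.25 | 2 | 0.16 |
| 8 | 64 | 35.8 | 3.4 | 17.25 | 2 | 11.56 |
| 9 | 73 | 44.8 | 5.6 | 26.25 | 2 | 31.36 |
| 10 | 68 | 39.8 | 0.6 | 21.25 | 2 | 0.36 |
| 11 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 12 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 13 | 52 | 23.8 | 15.4 | 5.25 | 3 | 27.5625 |
| 14 | 49 | 20.8 | 18.4 | 2.25 | 3 | 5.0625 |
| | Centroid | 28.2 | 67.4 | 46.75 | Total | 220.75 |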


Iteration 4:

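| id | Age | D1 | D2 | D3 | Cluster | Error |
|----|-----|----|----|----|---------|-------|
| 1 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 2 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 3 | 28 | 0.2 | 39.4 | 18.75 | 1 | 0.04 |
| 4 | 33 | 4.8 | 34.4 | 13.75 | 1 | 23.04 |
| 5 | 34 | 5.8 | 33.4 | 12.75 | 1 | 33.64 |
| 6 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 7 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 8 | 49 | 20.8 | 18.4 | 2.25 | 3 | 5.0625 |
| 9 | 52 | 23.8 | 15.4 | 5.25 | 3 | 27.5625 |
| 10 | 64 | 35.8 | 3.4 | 17.25 | 2 | 11.56 |
| 11 | 65 | 36.8 | 2.4 | 18.25 | 2 | 5.76 |
| 12 | 67 | 38.8 | 0.4 | 20.25 | 2 | 0.16 |
| 13 | 68 | 39.8 | 0.6 | 21.25 | 2 | 0.36 |
| 14 | 73 | 44.8 | 5.6 | 26.25 | 2 | 31.36 |
| | Centroid | 28.2 | 67.4 | 46.75 | Total | 220.75 |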

Iteration stops:

The iterations stop here because the total error is the same in iteration 3 and iteration 4. The total error is now fixed at 220.75, so no further iterations are needed. The clusters are final.

Shortcomings of K-Means clustering:

It is sensitive to outliers.

It is not well suited to categorical or nominal data.
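The sensitivity to outliers follows from the centroid being a mean. As a rough illustration, the sketch below takes the five ages of cluster 2 from the example and adds one hypothetical outlier (an age of 150, which is not part of the original data) to show how far a single extreme value can pull the centroid.

```python
cluster_2 = [65, 67, 64, 73, 68]   # ages in cluster 2 of the example
outlier = 150                      # hypothetical value, not in the original data

centroid_before = sum(cluster_2) / len(cluster_2)
with_outlier = cluster_2 + [outlier]
centroid_after = sum(with_outlier) / len(with_outlier)

print("centroid without outlier:", centroid_before)
print("centroid with outlier:   ", centroid_after)
print("shift:                   ", centroid_after - centroid_before)
```

In this sketch the centroid moves from about 67.4 to above 81, which is higher than every original member of the cluster.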

Prof. Fazal Rehman Shamil
Researcher, Publisher of International Journal Of Software Technology & Science ISSN: 2616-5325
Instructor, SEO Expert, Web Programmer and poet.