By: Prof. Dr. Fazal Rehman | Last updated: March 3, 2022
What is clustering?
Clustering is the process of partitioning a set of data into smaller groups, or clusters, on the basis of similarity and dissimilarity.
What is K-Means clustering in data mining?
K-Means clustering is a clustering method in which every data item (attribute value) is assigned to the cluster whose centroid is nearest to it.
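How does K-Means clustering work?
Step 1: Choose the initial centroids randomly. It is better to take the boundary and middle values as the centroids.
Step 2: Assign each item (attribute value) to the cluster whose centroid is nearest.
Step 3: Recompute each centroid as the mean of the items in its cluster and repeat the process. Each repetition changes the total error. When the total error stops changing, finalize the clusters and their items (attribute values).
In the worked example that follows, D1, D2 and D3 are the distances from each Age value to the three centroids, Cluster is the number of the nearest centroid, and Error is the squared distance to that centroid. The last row of each table lists the centroids used in that iteration and the total error.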
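The three steps can be sketched in a few lines of Python. This is a minimal illustration for one-dimensional data, not a reference implementation: the function name `kmeans_1d`, the `max_iter` cap, and the tie-breaking rule (the first centroid wins a tie) are assumptions made for this sketch. The ages and starting centroids are taken from the worked example, and clusters are numbered from 0 here rather than from 1.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Run K-Means on 1-D values, starting from the given centroids."""
    centroids = list(centroids)
    prev_error = None
    history = []  # total error recorded at each iteration
    for _ in range(max_iter):
        # Step 2: assign each value to the nearest centroid (first one wins ties).
        labels = [min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
                  for v in values]
        # Total error: sum of squared distances to the assigned centroid.
        error = sum((v - centroids[k]) ** 2 for v, k in zip(values, labels))
        history.append(error)
        # Step 3: stop once the total error no longer changes.
        if prev_error is not None and abs(prev_error - error) < 1e-9:
            break
        prev_error = error
        # Recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = sum(members) / len(members)
    return centroids, labels, history


# Ages in the order of the first table; initial centroids 23, 65, 43.
ages = [23, 33, 28, 23, 65, 67, 64, 73, 68, 43, 34, 43, 52, 49]
centroids, labels, history = kmeans_1d(ages, [23, 65, 43])
print(centroids)
print(history)
```

Under these assumptions the loop should stop on the fourth pass, when the total error repeats the value from the third pass.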
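Iteration 1:

| id | Age | D1 | D2 | D3 | Cluster | Error |
|---|---|---|---|---|---|---|
| 1 | 23 | 0 | 42 | 20 | 1 | 0 |
| 2 | 33 | 10 | 32 | 10 | 1 | 100 |
| 3 | 28 | 5 | 37 | 15 | 1 | 25 |
| 4 | 23 | 0 | 42 | 20 | 1 | 0 |
| 5 | 65 | 42 | 0 | 22 | 2 | 0 |
| 6 | 67 | 44 | 2 | 24 | 2 | 4 |
| 7 | 64 | 41 | 1 | 21 | 2 | 1 |
| 8 | 73 | 50 | 8 | 30 | 2 | 64 |
| 9 | 68 | 45 | 3 | 25 | 2 | 9 |
| 10 | 43 | 20 | 22 | 0 | 3 | 0 |
| 11 | 34 | 11 | 31 | 9 | 3 | 81 |
| 12 | 43 | 20 | 22 | 0 | 3 | 0 |
| 13 | 52 | 29 | 13 | 9 | 3 | 81 |
| 14 | 49 | 26 | 16 | 6 | 3 | 36 |
| Centroid | | 23 | 65 | 43 | | 401 |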
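As a check on the arithmetic, the rows of the table above can be reproduced with a short script. This is a sketch that assumes D1, D2 and D3 are absolute differences between Age and the centroids 23, 65 and 43, that Error is the square of the smallest distance, and that a tie goes to the lower-numbered cluster (as for Age 33, which is equally far from 23 and 43). The variable names are illustrative.

```python
ages = [23, 33, 28, 23, 65, 67, 64, 73, 68, 43, 34, 43, 52, 49]
centroids = [23, 65, 43]  # initial centroids for clusters 1, 2, 3

rows = []
for i, age in enumerate(ages, start=1):
    dists = [abs(age - c) for c in centroids]   # D1, D2, D3
    cluster = dists.index(min(dists)) + 1       # 1-based; first centroid wins ties
    error = min(dists) ** 2                     # squared distance to nearest centroid
    rows.append((i, age, *dists, cluster, error))

total_error = sum(row[-1] for row in rows)

for row in rows:
    print(*row, sep="\t")
print("Total error:", total_error)
```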
Iteration 2:
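
| id | Age | D1 | D2 | D3 | Cluster | Error |
|---|---|---|---|---|---|---|
| 1 | 23 | 3.75 | 44.4 | 21.2 | 1 | 14.0625 |
| 2 | 33 | 6.25 | 34.4 | 11.2 | 1 | 39.0625 |
| 3 | 28 | 1.25 | 39.4 | 16.2 | 1 | 1.5625 |
| 4 | 23 | 3.75 | 44.4 | 21.2 | 1 | 14.0625 |
| 5 | 34 | 7.25 | 33.4 | 10.2 | 1 | 52.5625 |
| 6 | 65 | 38.25 | 2.4 | 20.8 | 2 | 5.76 |
| 7 | 67 | 40.25 | 0.4 | 22.8 | 2 | 0.16 |
| 8 | 64 | 37.25 | 3.4 | 19.8 | 2 | 11.56 |
| 9 | 73 | 46.25 | 5.6 | 28.8 | 2 | 31.36 |
| 10 | 68 | 41.25 | 0.6 | 23.8 | 2 | 0.36 |
| 11 | 43 | 16.25 | 24.4 | 1.2 | 3 | 1.44 |
| 12 | 43 | 16.25 | 24.4 | 1.2 | 3 | 1.44 |
| 13 | 52 | 25.25 | 15.4 | 7.8 | 3 | 60.84 |
| 14 | 49 | 22.25 | 18.4 | 4.8 | 3 | 23.04 |
| Centroid | | 26.75 | 67.4 | 44.2 | | 257.2725 |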
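The centroids 26.75, 67.4 and 44.2 used in this iteration are the means of the clusters assigned in the first table. The sketch below recomputes them from those memberships, and then computes the next set from the assignments in this table. The helper name `update_centroids` is hypothetical.

```python
from statistics import mean


def update_centroids(clusters):
    """Return the mean of each cluster's members (the new centroids)."""
    return [mean(members) for members in clusters]


# Cluster memberships read off the first table (initial centroids 23, 65, 43).
clusters_iter1 = [[23, 33, 28, 23], [65, 67, 64, 73, 68], [43, 34, 43, 52, 49]]
# Memberships after the second pass: Age 34 has moved from cluster 3 to cluster 1.
clusters_iter2 = [[23, 33, 28, 23, 34], [65, 67, 64, 73, 68], [43, 43, 52, 49]]

centroids_iter2 = update_centroids(clusters_iter1)
centroids_iter3 = update_centroids(clusters_iter2)
print(centroids_iter2)
print(centroids_iter3)
```

The second call should give 28.2, 67.4 and 46.75. Only Age 34 changes cluster between the two passes, so only the first and third centroids move.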
Iteration 3:
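
| id | Age | D1 | D2 | D3 | Cluster | Error |
|---|---|---|---|---|---|---|
| 1 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 2 | 33 | 4.8 | 34.4 | 13.75 | 1 | 23.04 |
| 3 | 28 | 0.2 | 39.4 | 18.75 | 1 | 0.04 |
| 4 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 5 | 34 | 5.8 | 33.4 | 12.75 | 1 | 33.64 |
| 6 | 65 | 36.8 | 2.4 | 18.25 | 2 | 5.76 |
| 7 | 67 | 38.8 | 0.4 | 20.25 | 2 | 0.16 |
| 8 | 64 | 35.8 | 3.4 | 17.25 | 2 | 11.56 |
| 9 | 73 | 44.8 | 5.6 | 26.25 | 2 | 31.36 |
| 10 | 68 | 39.8 | 0.6 | 21.25 | 2 | 0.36 |
| 11 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 12 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 13 | 52 | 23.8 | 15.4 | 5.25 | 3 | 27.5625 |
| 14 | 49 | 20.8 | 18.4 | 2.25 | 3 | 5.0625 |
| Centroid | | 28.2 | 67.4 | 46.75 | | 220.75 |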
Iteration 4:
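
| id | Age | D1 | D2 | D3 | Cluster | Error |
|---|---|---|---|---|---|---|
| 1 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 2 | 23 | 5.2 | 44.4 | 23.75 | 1 | 27.04 |
| 3 | 28 | 0.2 | 39.4 | 18.75 | 1 | 0.04 |
| 4 | 33 | 4.8 | 34.4 | 13.75 | 1 | 23.04 |
| 5 | 34 | 5.8 | 33.4 | 12.75 | 1 | 33.64 |
| 6 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 7 | 43 | 14.8 | 24.4 | 3.75 | 3 | 14.0625 |
| 8 | 49 | 20.8 | 18.4 | 2.25 | 3 | 5.0625 |
| 9 | 52 | 23.8 | 15.4 | 5.25 | 3 | 27.5625 |
| 10 | 64 | 35.8 | 3.4 | 17.25 | 2 | 11.56 |
| 11 | 65 | 36.8 | 2.4 | 18.25 | 2 | 5.76 |
| 12 | 67 | 38.8 | 0.4 | 20.25 | 2 | 0.16 |
| 13 | 68 | 39.8 | 0.6 | 21.25 | 2 | 0.36 |
| 14 | 73 | 44.8 | 5.6 | 26.25 | 2 | 31.36 |
| Centroid | | 28.2 | 67.4 | 46.75 | | 220.75 |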
Iteration stops: The iterations stop here because the total error is the same in iteration 3 and iteration 4. The error is now fixed at 220.75, so no further iterations are needed. The clusters are now final.
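The total error tracked in the tables can be written as a sum of squared distances between each value and the centroid of its cluster. The symbols below are notation introduced here for illustration, not taken from the original text: $C_k$ is the set of Age values in cluster $k$, $\mu_k$ is its centroid, and $E$ is the total error. The second line splits the final value into the contributions of clusters 1, 2 and 3, using the Error column of the last table.

```latex
\[
E = \sum_{k=1}^{3} \sum_{x \in C_k} (x - \mu_k)^2,
\qquad \mu_1 = 28.2,\quad \mu_2 = 67.4,\quad \mu_3 = 46.75
\]
\[
E = 110.8 + 49.2 + 60.75 = 220.75
\]
```

Because neither the assignments nor the centroids change between iterations 3 and 4, every term in the sum is identical, which is why the total repeats.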
Shortcomings of K-Means clustering
It is sensitive to outliers.
It is not well suited to categorical or nominal data.
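The first shortcoming is easy to see with the mean itself. In the small sketch below, the cluster is the final cluster 2 from the example, and the value 200 is a made-up outlier added purely for illustration.

```python
cluster = [64, 65, 67, 68, 73]          # final cluster 2 from the example
centroid = sum(cluster) / len(cluster)  # mean of the original members

with_outlier = cluster + [200]          # 200 is a hypothetical outlier
shifted = sum(with_outlier) / len(with_outlier)

print(centroid, shifted)
```

A single extreme value pulls the centroid from about 67.4 to 89.5, which is farther from every original member than the old centroid was.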