K-Means Clustering in Data Mining

What is clustering?

Clustering is the process of partitioning a set of data objects into smaller groups, called clusters, on the basis of their similarity and dissimilarity.

What is K-Means clustering in data mining?

K-Means clustering is a partitioning method in which every data item (attribute value) is assigned to the cluster whose centroid is nearest to it.

How does K-means clustering work?

Step 1:

Pick the initial centroids at random. In practice it is often better to take values near the boundaries and the middle of the data as the initial centroids.

Step 2:

Assign each item (attribute value) to the cluster with the nearest centroid.

Step 3:

Recompute each centroid as the mean of the items assigned to it and repeat the process. Every iteration the total sum of squared errors changes; when the error stops changing, the clusters and their items (attribute values) are final.
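
As a rough sketch, this whole procedure for a single numeric attribute (such as Age) can be written in a few lines of Python. The code below is only an illustration of the three steps above; the names one_dimensional_kmeans, values, centroids and max_iters are made up for this example and do not come from any library.

    def one_dimensional_kmeans(values, centroids, max_iters=100):
        # values    : the numbers to cluster (e.g. the Age column)
        # centroids : the initial centroid guesses, e.g. [23, 65, 43]
        # Returns the final centroids, cluster assignments and total squared error.
        prev_error = None
        for _ in range(max_iters):
            # Step 2: assign every value to its nearest centroid and
            # record the squared distance to it (the Error column).
            assignments = []
            total_error = 0.0
            for v in values:
                distances = [abs(v - c) for c in centroids]
                cluster = distances.index(min(distances))
                assignments.append(cluster)
                total_error += min(distances) ** 2

            # Step 3: stop once the total error no longer changes.
            if total_error == prev_error:
                break
            prev_error = total_error

            # Recompute each centroid as the mean of the values assigned to it.
            for k in range(len(centroids)):
                members = [v for v, a in zip(values, assignments) if a == k]
                if members:  # keep the old centroid if a cluster becomes empty
                    centroids[k] = sum(members) / len(members)

        return centroids, assignments, prev_error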

Iteration 1:

In the tables that follow, D1, D2 and D3 are the absolute distances of each Age value from the centroids of clusters 1, 2 and 3, Cluster is the number of the nearest centroid, and Error is the squared distance to that centroid. Note that the rows are regrouped after each iteration, so the id column is simply a row number within each table.

id Age D1 D2 D3 Cluster Error
1 23 0 42 20 1 0
2 33 10 32 10 1 100
3 28 5 37 15 1 25
4 23 0 42 20 1 0
5 65 42 0 22 2 0
6 67 44 2 24 2 4
7 64 41 1 21 2 1
8 73 50 8 30 2 64
9 68 45 3 25 2 9
10 43 20 22 0 3 0
11 34 11 31 9 3 81
12 43 20 22 0 3 0
13 52 29 13 9 3 81
14 49 26 16 6 3 36
              
Centroids: 23, 65, 43        Total error: 401
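
To see how a row is filled in, take id 3 (Age 28): its distances to the three centroids are |28 - 23| = 5, |28 - 65| = 37 and |28 - 43| = 15, so the nearest centroid is 23, the item goes to cluster 1, and its error is 5 × 5 = 25. The total error of 401 is simply the sum of the Error column. After the assignment, each centroid is recomputed as the mean of the ages in its cluster: cluster 1 gives (23 + 33 + 28 + 23) / 4 = 26.75, cluster 2 gives (65 + 67 + 64 + 73 + 68) / 5 = 67.4, and cluster 3 gives (43 + 34 + 43 + 52 + 49) / 5 = 44.2. These are the centroids used in iteration 2.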

Iteration 2:

id Age D1 D2 D3 Cluster Error
1 23 3.75 44.4 21.2 1 14.0625
2 33 6.25 34.4 11.2 1 39.0625
3 28 1.25 39.4 16.2 1 1.5625
4 23 3.75 44.4 21.2 1 14.0625
5 34 7.25 33.4 10.2 1 52.5625
6 65 38.25 2.4 20.8 2 5.76
7 67 40.25 0.4 22.8 2 0.16
8 64 37.25 3.4 19.8 2 11.56
9 73 46.25 5.6 28.8 2 31.36
10 68 41.25 0.6 23.8 2 0.36
11 43 16.25 24.4 1.2 3 1.44
12 43 16.25 24.4 1.2 3 1.44
13 52 25.25 15.4 7.8 3 60.84
14 49 22.25 18.4 4.8 3 23.04
             
Centroids: 26.75, 67.4, 44.2        Total error: 257.2725
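
The same update explains the iteration 3 centroids. With centroids 26.75, 67.4 and 44.2, the value 34 is closer to 26.75 (distance 7.25) than to 44.2 (distance 10.2), so it moves from cluster 3 to cluster 1. Recomputing the means then gives (23 + 33 + 28 + 23 + 34) / 5 = 28.2 for cluster 1 and (43 + 43 + 52 + 49) / 4 = 46.75 for cluster 3, while cluster 2 keeps the same members and stays at 67.4.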

Iteration 3: 

id Age D1 D2 D3 Cluster Error
1 23 5.2 44.4 23.75 1 27.04
2 33 4.8 34.4 13.75 1 23.04
3 28 0.2 39.4 18.75 1 0.04
4 23 5.2 44.4 23.75 1 27.04
5 34 5.8 33.4 12.75 1 33.64
6 65 36.8 2.4 18.25 2 5.76
7 67 38.8 0.4 20.25 2 0.16
8 64 35.8 3.4 17.25 2 11.56
9 73 44.8 5.6 26.25 2 31.36
10 68 39.8 0.6 21.25 2 0.36
11 43 14.8 24.4 3.75 3 14.0625
12 43 14.8 24.4 3.75 3 14.0625
13 52 23.8 15.4 5.25 3 27.5625
14 49 20.8 18.4 2.25 3 5.0625
             
Centroids: 28.2, 67.4, 46.75        Total error: 220.75

Iteration 4:

id Age D1 D2 D3 Cluster Error
1 23 5.2 44.4 23.75 1 27.04
2 23 5.2 44.4 23.75 1 27.04
3 28 0.2 39.4 18.75 1 0.04
4 33 4.8 34.4 13.75 1 23.04
5 34 5.8 33.4 12.75 1 33.64
6 43 14.8 24.4 3.75 3 14.0625
7 43 14.8 24.4 3.75 3 14.0625
8 49 20.8 18.4 2.25 3 5.0625
9 52 23.8 15.4 5.25 3 27.5625
10 64 35.8 3.4 17.25 2 11.56
11 65 36.8 2.4 18.25 2 5.76
12 67 38.8 0.4 20.25 2 0.16
13 68 39.8 0.6 21.25 2 0.36
14 73 44.8 5.6 26.25 2 31.36
             
Centroids: 28.2, 67.4, 46.75        Total error: 220.75

Iteration stops:

The iterations are stopped here because the total error is the same for iteration 3 and iteration 4. The error is now fixed at 220.75, so there is no need for further iterations, and the clusters are final.
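
For reference, feeding the fourteen Age values and the same starting centroids (23, 65 and 43) into the Python sketch given earlier should reproduce roughly the same result as the tables above:

    ages = [23, 33, 28, 23, 65, 67, 64, 73, 68, 43, 34, 43, 52, 49]
    centroids, clusters, error = one_dimensional_kmeans(ages, [23, 65, 43])
    print(centroids)  # approximately [28.2, 67.4, 46.75]
    print(error)      # approximately 220.75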

Shortcomings of K-Means clustering

  1. It is sensitive to outliers.
  2. It is not well suited to categorical or nominal data.


Next Similar Tutorials

  1. KMeans Clustering in data mining.
  2. KMeans clustering on two attributes in data mining.
  3. List of clustering algorithms in data mining.
  4. Markov cluster process Model with Graph Clustering.