TF-IDF Cosine Similarity: Formula and Examples in Data Mining
What is Cosine similarity?
Cosine similarity is a measure of the similarity between two files or documents, computed as the cosine of the angle between their term-frequency vectors.
Example of cosine similarity
What is the similarity between two files, file 1 and file 2?
Cosine similarity Formula
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where d1 and d2 are the term-frequency vectors of file 1 and file 2:
d1 = (0, 3, 0, 0, 2, 0, 0, 2, 0, 5)
d2 = (1, 2, 0, 0, 1, 1, 0, 1, 0, 3)
d1 · d2 = 0*1 + 3*2 + 0*0 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 5*3 = 25
||d1|| = (0*0 + 3*3 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 5*5)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 2*2 + 0*0 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 3*3)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 * 4.123) ≈ 0.94
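As a quick sanity check, here is a minimal Python sketch that reproduces the calculation above. Using NumPy here is our own assumption; it is not part of the original example.

import numpy as np

# Term-frequency vectors of file 1 and file 2 from the worked example
d1 = np.array([0, 3, 0, 0, 2, 0, 0, 2, 0, 5])
d2 = np.array([1, 2, 0, 0, 1, 1, 0, 1, 0, 3])

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos_sim = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 2))  # 0.94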
What is a good cosine similarity: 0 or 1?
- A similarity of 0 means no similarity (the vectors are orthogonal).
- A similarity of 1 means the documents are identical.
- A similarity above 0.5 can be a good starting point, though the right threshold depends on the application.
Is cosine similarity a metric?
Cosine similarity is widely used as a similarity metric to measure the similarity between two objects. Strictly speaking, though, its associated distance (1 − cosine similarity) does not satisfy the triangle inequality, so it is not a metric in the formal mathematical sense.
When to use cosine similarity over Euclidean distance?
Cosine similarity focuses on the angle between two vectors, while Euclidean distance focuses on the straight-line distance between two points. Cosine similarity therefore captures the pattern of the data rather than its magnitude.
For example, suppose we want to analyze the purchase data of a shop:
- User 1 bought 1x copy, 1x pencil, and 1x rubber from the shop.
- User 2 bought 100x copy, 100x pencil, and 100x rubber from the shop.
- User 3 bought 1x copy, 1x Pepsi, and 1x shoe polish from the shop.
By Euclidean distance, User 3 appears much closer to User 1 than User 2 does, simply because the quantities differ so much. By cosine similarity, User 1 and User 2 are identical (similarity 1) because they bought the same items in the same proportions, while User 3 is the odd one out. A short sketch comparing the two measures follows below.
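The following minimal sketch is our own illustration; the five-item vocabulary and the vector encoding are assumptions, not part of the original.

import numpy as np

# Item counts over the vocabulary (copy, pencil, rubber, Pepsi, shoe polish)
user1 = np.array([1, 1, 1, 0, 0])
user2 = np.array([100, 100, 100, 0, 0])
user3 = np.array([1, 0, 0, 1, 1])

def cosine(a, b):
    # Cosine of the angle between vectors a and b
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(user1, user2))           # 1.0    -> identical buying pattern
print(cosine(user1, user3))           # ~0.33  -> different pattern
print(np.linalg.norm(user1 - user2))  # ~171.5 -> far apart in Euclidean terms
print(np.linalg.norm(user1 - user3))  # 2.0    -> deceptively close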
Cosine similarity in Python
Suppose we have the following text in three documents:
Doc Imran Khan (A) : Mr. Imran Khan won the president seat after winning the National election 2020-2021. Though he lost the support of some republican friends, Imran Khan is friends with President Nawaz Sharif.
Doc Imran Khan Election (B) : President Imran Khan says Nawaz Sharif had no political interference in the election outcome. He claimed President Nawaz Sharif is a friend who had nothing to do with the election.
Doc Nawaz Sharif (C) : Post elections, Vladimir Nawaz Sharif won the president seat of Russia. President Nawaz Sharif had served as the Prime Minister earlier in his political career.
# Define the documents
doc_imran_khan = "Mr. Imran Khan won the president seat after winning the National election 2020-2021. Though he lost the support of some republican friends, Imran Khan is friends with President Nawaz Sharif"
doc_election = "President Imran Khan says Nawaz Sharif had no political interference in the election outcome. He claimed President Nawaz Sharif is a friend who had nothing to do with the election"
doc_nawaz_sharif = "Post elections, Vladimir Nawaz Sharif won the president seat of Russia. President Nawaz Sharif had served as the Prime Minister earlier in his political career"

documents = [doc_imran_khan, doc_election, doc_nawaz_sharif]
If we want to compute the cosine similarity, we first count the words in documents A, B, and C. The CountVectorizer or the TfidfVectorizer from scikit-learn computes this for us; the output is a sparse matrix. Converting it to a pandas DataFrame is not compulsory, but it lets us inspect the word frequencies in tabular form.
# Let's begin with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create the document-term matrix, removing English stop words
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)

# Optional: convert the sparse matrix to a pandas DataFrame
# to see the word frequencies in tabular form.
doc_term_matrix = sparse_matrix.toarray()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names_out(),  # get_feature_names() on scikit-learn < 1.0
                  index=['doc_imran_khan', 'doc_election', 'doc_nawaz_sharif'])
df
Output: the document-term matrix as a DataFrame, with one row per document and one column per word.
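With the document-term matrix in hand, we can compute the pairwise cosine similarities. This step is a sketch using scikit-learn's cosine_similarity helper; the original walkthrough stops short of this call.

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between the three count vectors;
# the result is a symmetric 3x3 matrix with 1.0 on the diagonal.
print(cosine_similarity(sparse_matrix, sparse_matrix))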
It is often better to use the TfidfVectorizer() instead of the CountVectorizer(), because TF-IDF downweights words that occur frequently across all the documents (here, a word such as "president" appears in every document), so they contribute less to the similarity score.
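As a sketch of that variant (assuming the same documents list defined above), swap in TfidfVectorizer and recompute the similarities:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF weighting downweights terms that appear in every document
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print(cosine_similarity(tfidf_matrix, tfidf_matrix))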