What is Cosine similarity?
Cosine similarity is a measure of how similar two files/documents are: it is the cosine of the angle between the vectors that represent them.
Example of cosine similarity
What is the similarity between two files, file 1 and file 2?
Cosine similarity Formula
cos(file 1, file 2) = (file 1 • file 2) / (||file 1|| * ||file 2||)
file 1 = (0, 3, 0, 0, 2, 0, 0, 2, 0, 5)
file 2 = (1, 2, 0, 0, 1, 1, 0, 1, 0, 3)
file 1 • file 2 = 0*1 + 3*2 + 0*0 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 5*3 = 25
||file 1|| = (0*0 + 3*3 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 5*5)^0.5 = (42)^0.5 = 6.481
||file 2|| = (1*1 + 2*2 + 0*0 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 3*3)^0.5 = (17)^0.5 = 4.123
cos(file 1, file 2) = 25 / (6.481 * 4.123) = 0.94
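To double-check the arithmetic, here is a minimal Python sketch of the same calculation (the names file1 and file2 are just illustrative):

import math

file1 = [0, 3, 0, 0, 2, 0, 0, 2, 0, 5]
file2 = [1, 2, 0, 0, 1, 1, 0, 1, 0, 3]

dot = sum(a * b for a, b in zip(file1, file2))    # 25
norm1 = math.sqrt(sum(a * a for a in file1))      # sqrt(42) ~ 6.481
norm2 = math.sqrt(sum(b * b for b in file2))      # sqrt(17) ~ 4.123
print(round(dot / (norm1 * norm2), 2))            # 0.94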
What is a good cosine similarity: 0 or 1?
- A similarity of 0 means no similarity (the vectors are orthogonal).
- A similarity of 1 means the two vectors are identical in direction.
- A similarity above 0.5 might be a good starting point.
Is cosine similarity a metric?
Yes, cosine similarity is commonly used as a similarity metric to measure the similarity between two objects. Strictly speaking, it is a similarity measure rather than a distance metric, since the corresponding cosine distance does not satisfy the triangle inequality.
When to use cosine similarity over Euclidean distance?
Cosine similarity focuses on the angle between two vectors, whereas Euclidean distance focuses on the straight-line distance between two points.
For example, suppose we want to analyse the purchase data of a shop:
- User 1 bought 1x copy, 1x pencil and 1x rubber from the shop.
- User 2 bought 100x copy, 100x pencil and 100x rubber from the shop.
- User 3 bought 1x copy, 1x PEPSI and 1x Shoes Polish from the shop.
According to cosine similarity, user 1 and user 2 are more similar to each other, while according to Euclidean distance, user 3 is more similar to user 1.
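As a rough illustration, here is a minimal sketch using scipy (an assumption; any vector library would do). The dimensions of each vector are [copy, pencil, rubber, Pepsi, shoe polish]:

from scipy.spatial.distance import cosine, euclidean

user1 = [1, 1, 1, 0, 0]
user2 = [100, 100, 100, 0, 0]
user3 = [1, 0, 0, 1, 1]

print(1 - cosine(user1, user2))   # 1.0    -> cosine sees user 1 and user 2 as pointing the same way
print(1 - cosine(user1, user3))   # ~0.33
print(euclidean(user1, user2))    # ~171.5 -> Euclidean sees user 1 and user 2 as far apart
print(euclidean(user1, user3))    # 2.0    -> and user 3 as close to user 1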
Cosine similarity in Python
Suppose we have the text of three documents:
Doc Imran Khan (A) : Mr. Imran Khan win the president seat after winning the National election 2020-2021. Though he lost the support of some republican friends, Imran Khan is friends with President Nawaz Sharif.
Doc Imran Khan Election (B) : President Imran Khan says Nawaz Sharif had no political interference is the election outcome. He claimed President Nawaz Sharif is a friend who had nothing to do with the election.
Doc Nawaz Sharif (C) : Post elections, Vladimir Nawaz Sharif win the president seat of Russia. President Nawaz Sharif had served as the Prime Minister earlier in his political career.
Here, we can see that Doc B has more in common with Doc A than with Doc C, so we can expect the cosine similarity between documents A and B to be larger than that between documents B and C.
# Define the documents
doc_imran_khan = "Mr. Imran Khan win the president seat after winning the National election 2020-2021. Though he lost the support of some republican friends, Imran Khan is friends with President Nawaz Sharif"

doc_election = "President Imran Khan says Nawaz Sharif had no political interference is the election outcome. He claimed President Nawaz Sharif is a friend who had nothing to do with the election"

doc_nawaz_sharif = "Post elections, Vladimir Nawaz Sharif win the president seat of Russia. President Nawaz Sharif had served as the Prime Minister earlier in his political career"

documents = [doc_imran_khan, doc_election, doc_nawaz_sharif]
To compute the cosine similarity, we first count the words in documents A, B, and C. The CountVectorizer or the TfidfVectorizer from scikit-learn lets us do this; the output is a sparse matrix.
It is not compulsory, but let's convert it to a pandas DataFrame to see the word frequencies in tabular format.
# Let's begin with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()   # the second (plain) vectorizer is the one actually used below

sparse_matrix = count_vectorizer.fit_transform(documents)

# OPTIONAL: convert the sparse matrix to a pandas DataFrame
# if you want to see the word frequencies in tabular form.
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names(),
                  index=['doc_imran_khan', 'doc_election', 'doc_nawaz_sharif'])
df
Doc-Term Matrix
It is better to use the TfidfVectorizer() function instead of the CountVectorizer() function, because TF-IDF downweights words that occur frequently across all the documents.
Finally, we can use the cosine_similarity() function from scikit-learn to get the final output. It can take the document term matrix as a pandas DataFrame or as a sparse matrix as input.
# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df, df))
#> [[ 1.          0.48927489  0.37139068]
#>  [ 0.48927489  1.          0.38829014]
#>  [ 0.37139068  0.38829014  1.        ]]
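Since TfidfVectorizer was recommended above, here is a minimal sketch of the TF-IDF variant. It reuses the documents list from the first block; the resulting similarity values will differ from the CountVectorizer output.

# Sketch: TF-IDF document term matrix and its cosine similarities
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(cosine_similarity(tfidf_matrix, tfidf_matrix))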
Example of TF-IDF Cosine Similarity
Document 1: T4Tutorials website is a website and it is for professionals.
Document 2: T4Tutorials website is also for good students.
Document 3: i love T4Tutorials
Step 1:
Term Frequency (TF)
Term Frequency, commonly known as TF, measures the number of times a word appears in a selected document.
Term Frequency Matrix / Document-term matrix
Let's see some terms and their frequency in each of the documents. In this example, there are three documents.
TF for Document 1
Document1 | T4Tutorials | website | is | a | and | it | for | professionals |
Term Frequency | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |
TF for Document 2
Document2 | T4Tutorials | website | is | also | for | Good | Students |
Term Frequency | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
TF for Document 3
Document3 | i | love | T4Tutorials |
Term Frequency | 1 | 1 | 1 |
Raw counts are hard to compare when documents differ in size, so it is better to normalize the term frequency by the size of the document. This can be done with different normalization techniques such as min-max, decimal scaling and Z-score normalization; the simplest approach here is to divide the term frequency by the total number of terms in the document.
For example, in Document 1 the term website occurs two times and the total number of terms in Document 1 is 10, so the normalized term frequency is 2 / 10 = 0.2.
Now, let's see the normalized term frequency for Document 1, Document 2 and Document 3.
Normalized TF for Document 1
Document1 | T4Tutorials | website | is | a | and | it | for | professionals |
Normalized TF | 0.1 | 0.2 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
Normalized TF for Document 2
Document2 | T4Tutorials | website | is | also | for | Good | students |
Normalized TF | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 |
Normalized TF for Document 3
Document3 | i | love | T4Tutorials |
Normalized TF | 0.333333 | 0.333333 | 0.333333 |
Given below is the Python code for the normalized TF calculation.
def termFrequency(term, document):
    # Normalized TF: occurrences of the term divided by the total number of words.
    normalizeDocument = document.lower().split()
    return normalizeDocument.count(term.lower()) / float(len(normalizeDocument))
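For example, applied to Document 1 above (the variable name document1 is just for illustration; punctuation is ignored for simplicity):

document1 = "T4Tutorials website is a website and it is for professionals"
print(termFrequency("website", document1))   # 2 / 10 = 0.2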
Step 2:
Inverse Document Frequency (IDF)
Let us compute the Inverse Document Frequency for the term website.
IDF(website) = 1 + loge(Total Number Of Documents / Number Of Documents with the term website in it)

Total documents are 3: Document1, Document2, Document3
The term website appears in Document1 and Document2

IDF(website) = 1 + loge(3 / 2)
             = 1 + 0.405465
             = 1.405465
Inverse Document Frequency (IDF) in Python
Here is the Python code to calculate the IDF; it assumes the documents are passed in as a list of strings.
from math import log

def inverseDocumentFrequency(term, allDocuments):
    # allDocuments is assumed to be a list of document strings.
    # Count how many documents contain the term.
    numDocumentsWithThisTerm = 0
    for doc in allDocuments:
        if term.lower() in doc.lower().split():
            numDocumentsWithThisTerm = numDocumentsWithThisTerm + 1

    if numDocumentsWithThisTerm > 0:
        return 1.0 + log(float(len(allDocuments)) / numDocumentsWithThisTerm)
    else:
        return 1.0
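For example, with the three documents stored as a plain list of strings (an assumption of this sketch), the IDF of the term website comes out as calculated above:

documents = [
    "T4Tutorials website is a website and it is for professionals",
    "T4Tutorials website is also for good students",
    "i love T4Tutorials",
]
print(inverseDocumentFrequency("website", documents))   # 1 + loge(3/2) = 1.405465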
Step 3:
We need to multiply the Term Frequency by the Inverse Document Frequency, i.e. TF * IDF.
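A minimal helper that combines the two functions above (the name tfidf is illustrative, not part of any library):

def tfidf(term, document, allDocuments):
    # TF-IDF weight of a term in one document, given the whole collection.
    return termFrequency(term, document) * inverseDocumentFrequency(term, allDocuments)

print(tfidf("website", documents[0], documents))   # 0.2 * 1.405465 = 0.281093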
Step 4:
Cosine Similarity
Cosine Similarity (d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

Dot product(d1, d2) = d1[0] * d2[0] + d1[1] * d2[1] + ... + d1[n] * d2[n]
||d1|| = square root(d1[0]^2 + d1[1]^2 + ... + d1[n]^2)
||d2|| = square root(d2[0]^2 + d2[1]^2 + ... + d2[n]^2)
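Putting the pieces together, here is a minimal sketch that implements this formula in plain Python and applies it to the TF-IDF vectors of Document 1 and Document 2. It reuses termFrequency(), inverseDocumentFrequency(), tfidf() and the documents list defined above; the names cosineSimilarity and vocabulary are illustrative.

import math

def cosineSimilarity(vector1, vector2):
    # Dot product of the two vectors divided by the product of their lengths.
    dotProduct = sum(a * b for a, b in zip(vector1, vector2))
    length1 = math.sqrt(sum(a * a for a in vector1))
    length2 = math.sqrt(sum(b * b for b in vector2))
    return dotProduct / (length1 * length2)

# Build TF-IDF vectors for Document 1 and Document 2 over the combined vocabulary.
vocabulary = sorted(set(" ".join(documents).lower().split()))
vector1 = [tfidf(term, documents[0], documents) for term in vocabulary]
vector2 = [tfidf(term, documents[1], documents) for term in vocabulary]
print(cosineSimilarity(vector1, vector2))   # roughly 0.4 for Document 1 and Document 2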