TF IDF Cosine similarity Formula Examples in data mining

What is Cosine similarity?

Cosine similarity measures how similar two files or documents are by computing the cosine of the angle between their vector representations (for example, vectors of word counts).

Example of cosine similarity

What is the similarity between two files, file 1 and file 2?

Cosine similarity Formula

cos(file 1, file 2) = (file 1 • file 2) / (||file 1|| × ||file 2||)

file 1 = (0, 3, 0, 0, 2, 0, 0, 2, 0, 5)

file 2 = (1, 2, 0, 0, 1, 1, 0, 1, 0, 3)

file 1 • file 2 = 0*1 + 3*2 + 0*0 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 5*3

= 25

||file 1|| = (0*0 + 3*3 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 5*5)^0.5

= (42)^0.5 = 6.481

||file 2|| = (1*1 + 2*2 + 0*0 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 3*3)^0.5

= (17)^0.5 = 4.123

cos(file 1, file 2) = 25 / (6.481 × 4.123) ≈ 0.94
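The same calculation can be sketched in plain Python. This is a minimal illustration, not a library implementation; the function name cosine_similarity is our own choice, and the sketch assumes neither vector is all zeros.

```python
import math

file1 = [0, 3, 0, 0, 2, 0, 0, 2, 0, 5]
file2 = [1, 2, 0, 0, 1, 1, 0, 1, 0, 3]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors.

    Assumes neither vector is all zeros (otherwise the norm is 0).
    """
    dot = sum(x * y for x, y in zip(a, b))        # file 1 . file 2
    norm_a = math.sqrt(sum(x * x for x in a))     # ||file 1||
    norm_b = math.sqrt(sum(y * y for y in b))     # ||file 2||
    return dot / (norm_a * norm_b)


similarity = cosine_similarity(file1, file2)
print(round(similarity, 2))  # 0.94
```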

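What is a good cosine similarity, 0 or 1?

  • Similarity 0 means no similarity
  • Similarity 1 means identical (the two vectors point in exactly the same direction)
  • A similarity above 0.5 might be a good starting point when deciding whether two documents are similar, but the right threshold depends on the data.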

Is cosine similarity a metric?

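Not in the strict mathematical sense. Cosine similarity is a similarity measure, and the corresponding cosine distance (1 − cosine similarity) does not satisfy the triangle inequality, so it is not a true distance metric. It is still widely used to measure the similarity between two objects.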

When to use cosine similarity over Euclidean distance?

Cosine similarity focuses on the angle between two vectors, whereas Euclidean distance focuses on the distance between two points.

For example, suppose we want to analyse the following data from a shop:

  • User 1 bought 1x copy, 1x pencil and 1x rubber from the shop.
  • User 2 bought 100x copy, 100x pencil and 100x rubber from the shop.
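  • User 3 bought 1x copy, 1x Pepsi and 1x shoe polish from the shop.

According to cosine similarity, user 1 is most similar to user 2, because they bought the same items in the same proportions. According to Euclidean distance, user 1 is closer to user 3, because the quantities they bought are almost the same.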
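The shop example can be checked with a short sketch. The item order used to build the vectors (copy, pencil, rubber, Pepsi, shoe polish) and the helper names are assumptions made for illustration.

```python
import math

# Assumed item order: copy, pencil, rubber, Pepsi, shoe polish
user1 = [1, 1, 1, 0, 0]
user2 = [100, 100, 100, 0, 0]
user3 = [1, 0, 0, 1, 1]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean_distance(a, b):
    """Straight-line distance between two points (lower means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


cos_12 = cosine_similarity(user1, user2)
cos_13 = cosine_similarity(user1, user3)
dist_12 = euclidean_distance(user1, user2)
dist_13 = euclidean_distance(user1, user3)

print(f"cosine:    user1-user2 = {cos_12:.2f}, user1-user3 = {cos_13:.2f}")
print(f"euclidean: user1-user2 = {dist_12:.2f}, user1-user3 = {dist_13:.2f}")
```

Cosine similarity ignores the size of the basket and looks only at its composition, while Euclidean distance is dominated by the quantities.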

 

Cosine similarity in Python

Suppose we have the following text in three documents:

Doc Imran Khan (A) : Mr. Imran Khan win the president seat after winning the National election 2020-2021. Though he lost the support of some republican friends, Imran Khan is friends with President Nawaz Sharif.

Doc Imran Khan Election (B) : President Imran Khan says Nawaz Sharif had no political interference is the election outcome. He claimed President Nawaz Sharif is a friend who had nothing to do with the election.

Doc Nawaz Sharif (C) : Post elections, Vladimir Nawaz Sharif win the president seat of Russia. President Nawaz Sharif had served as the Prime Minister earlier in his political career.

Here, we can see that Doc B has more in common with Doc A than with Doc C, so we can expect the cosine similarity between documents A and B to be larger than the one between documents B and C.

To compute the cosine similarity, we first count how many times each word occurs in documents A, B, and C. The CountVectorizer or the TfidfVectorizer class from scikit-learn does this for us. The output is a sparse matrix.

It is not required, but let’s convert it to a pandas DataFrame to see the word frequencies in a tabular format.

Doc-Term Matrix

It is usually better to use TfidfVectorizer instead of CountVectorizer, because it downweights words that occur frequently across all the documents and therefore say little about any single one.

Finally, we can call the cosine_similarity() function from scikit-learn (sklearn.metrics.pairwise) to get the final output. It accepts the document-term matrix either as a pandas DataFrame or as a sparse matrix.
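The last two steps might look like the following sketch, which assumes scikit-learn is installed and uses TfidfVectorizer with its default settings. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    # Doc A
    "Mr. Imran Khan win the president seat after winning the National "
    "election 2020-2021. Though he lost the support of some republican "
    "friends, Imran Khan is friends with President Nawaz Sharif.",
    # Doc B
    "President Imran Khan says Nawaz Sharif had no political interference "
    "is the election outcome. He claimed President Nawaz Sharif is a friend "
    "who had nothing to do with the election.",
    # Doc C
    "Post elections, Vladimir Nawaz Sharif win the president seat of Russia. "
    "President Nawaz Sharif had served as the Prime Minister earlier in his "
    "political career.",
]

# Build the TF-IDF weighted document-term matrix (a sparse matrix).
tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Pairwise cosine similarity between all documents: a 3 x 3 array.
similarity_matrix = cosine_similarity(tfidf_matrix)
print(similarity_matrix.round(2))
```

Entry [0, 1] of the result is the similarity between Doc A and Doc B, and entry [1, 2] is the similarity between Doc B and Doc C.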

Examples of TF IDF Cosine Similarity

Document 1: T4Tutorials website is a website and it is for professionals.

Document 2: T4Tutorials website is also for good students.

Document 3: i love T4Tutorials

Step 1:

Term Frequency (TF)

Term Frequency, commonly known as TF, measures the number of times a word appears in a selected document.

Term Frequency Matrix / Document-term matrix

Let’s see the terms and their frequency in each of the documents. In this example, there are three documents.

TF for Document 1

Document 1     | T4Tutorials | website | is | a | and | it | for | professionals
Term Frequency | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1

TF for Document 2

Document 2     | T4Tutorials | website | is | also | for | good | students
Term Frequency | 1 | 1 | 1 | 1 | 1 | 1 | 1

TF for Document 3

Document 3     | i | love | T4Tutorials
Term Frequency | 1 | 1 | 1
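The raw term frequencies can be reproduced with a small sketch. The tokenizer (lowercase, split on anything that is not a letter or digit) is an assumption made for illustration, as are the function and variable names.

```python
import re
from collections import Counter

documents = {
    "Document 1": "T4Tutorials website is a website and it is for professionals.",
    "Document 2": "T4Tutorials website is also for good students.",
    "Document 3": "i love T4Tutorials",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


# One Counter per document: term -> number of occurrences in that document.
term_frequency = {name: Counter(tokenize(text)) for name, text in documents.items()}

for name, counts in term_frequency.items():
    print(name, dict(counts))
```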

With large documents and large counts, raw frequencies are difficult to compare, because longer documents naturally contain more occurrences of every term. So, it is better to normalize the term frequencies by document size. Normalization techniques such as min-max, decimal scaling and Z-score exist, but the simplest approach here is to divide each term frequency by the total number of terms in the document.

For example, in Document 1 the term website occurs two times. The total number of terms in Document 1 is 10. So, the normalized term frequency is 2 / 10 = 0.2.

Now, let’s see the normalized term frequency for Document 1, Document 2 and Document 3.

Normalized TF for Document 1

Document 1    | T4Tutorials | website | is | a | and | it | for | professionals
Normalized TF | 0.1 | 0.2 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1

Normalized TF for Document 2

Document 2    | T4Tutorials | website | is | also | for | good | students
Normalized TF | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857 | 0.142857

Normalized TF for Document 3

Document 3    | i | love | T4Tutorials
Normalized TF | 0.333333 | 0.333333 | 0.333333

The normalized TF calculation can also be done in Python.
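One possible way to do the normalized TF calculation is sketched here. The original listing is not available, so the function names and the simple tokenizer are assumptions.

```python
import re
from collections import Counter

documents = {
    "Document 1": "T4Tutorials website is a website and it is for professionals.",
    "Document 2": "T4Tutorials website is also for good students.",
    "Document 3": "i love T4Tutorials",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalized_tf(text):
    """Divide each term's count by the total number of terms in the document."""
    tokens = tokenize(text)
    total_terms = len(tokens)
    return {term: count / total_terms for term, count in Counter(tokens).items()}


normalized = {name: normalized_tf(text) for name, text in documents.items()}

for name, weights in normalized.items():
    print(name, {term: round(weight, 6) for term, weight in weights.items()})
```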

Step 2:

Inverse Document Frequency (IDF)

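Inverse Document Frequency measures how rare a term is across the whole collection of documents. Let us compute the Inverse Document Frequency for the term website.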
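The text does not say which IDF variant it uses, so the worked value here assumes one common definition: the natural logarithm of the total number of documents N divided by the number of documents df(t) that contain the term. The term website appears in Document 1 and Document 2, so df = 2 and N = 3.

```latex
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{idf}(\text{website}) = \ln\frac{3}{2} \approx 0.405
```

Other variants (for example, adding 1 to the result or smoothing the counts) give different numbers but the same ordering of terms.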

Inverse Document Frequency (IDF) in Python

The IDF can be calculated in Python as well.
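A sketch of the IDF calculation follows. It assumes the plain variant idf(t) = ln(N / df(t)), which may differ from the formula the original listing used. The function names and tokenizer are illustrative.

```python
import math
import re

documents = [
    "T4Tutorials website is a website and it is for professionals.",
    "T4Tutorials website is also for good students.",
    "i love T4Tutorials",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def inverse_document_frequency(documents):
    """Return {term: ln(N / df)} where df is the number of documents containing the term."""
    n_docs = len(documents)
    token_sets = [set(tokenize(doc)) for doc in documents]
    vocabulary = set().union(*token_sets)
    return {
        term: math.log(n_docs / sum(term in tokens for tokens in token_sets))
        for term in vocabulary
    }


idf = inverse_document_frequency(documents)
print(round(idf["website"], 3))      # 0.405 (appears in 2 of 3 documents)
print(round(idf["t4tutorials"], 3))  # 0.0   (appears in every document)
print(round(idf["love"], 3))         # 1.099 (appears in 1 of 3 documents)
```

Under this variant, a term that occurs in every document gets an IDF of 0, so it contributes nothing to the final weights.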

Step 3:

We need to multiply the Term Frequency by the Inverse Document Frequency, that is, TF-IDF = TF * IDF.
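The multiplication can be sketched as follows, combining the normalized TF from Step 1 with an assumed IDF of ln(N / df). All names are illustrative.

```python
import math
import re
from collections import Counter

documents = [
    "T4Tutorials website is a website and it is for professionals.",
    "T4Tutorials website is also for good students.",
    "i love T4Tutorials",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


token_lists = [tokenize(doc) for doc in documents]
n_docs = len(documents)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in token_lists for term in set(tokens))

# TF-IDF = normalized TF * IDF, with IDF assumed to be ln(N / df).
tfidf = [
    {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in Counter(tokens).items()
    }
    for tokens in token_lists
]

for number, weights in enumerate(tfidf, start=1):
    print(f"Document {number}:", {t: round(w, 4) for t, w in weights.items()})
```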

Step 4:

Cosine Similarity
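To finish the walk-through, the TF-IDF vectors of the three documents can be compared with cosine similarity. This sketch repeats the helpers so that it runs on its own. The IDF formula ln(N / df), the tokenizer and all function names are assumptions.

```python
import math
import re
from collections import Counter

documents = [
    "T4Tutorials website is a website and it is for professionals.",
    "T4Tutorials website is also for good students.",
    "i love T4Tutorials",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: normalized TF * ln(N / df)} dictionary per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return [
        {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in Counter(tokens).items()
        }
        for tokens in token_lists
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


doc1, doc2, doc3 = tfidf_vectors(documents)
sim_12 = cosine_similarity(doc1, doc2)
sim_13 = cosine_similarity(doc1, doc3)
sim_23 = cosine_similarity(doc2, doc3)

print(f"Document 1 vs Document 2: {sim_12:.3f}")
print(f"Document 1 vs Document 3: {sim_13:.3f}")
print(f"Document 2 vs Document 3: {sim_23:.3f}")
```

Document 3 shares only the term T4Tutorials with the other two documents. Because that term occurs in every document, its IDF is 0 under the assumed formula, so Document 3 has a similarity of 0 with both of them.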