Understanding TF-IDF: Term Frequency-Inverse Document Frequency


What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to a collection of documents (a corpus). The core idea is to give a high weight to words that occur often in a specific document but appear in few documents across the corpus.

Term Frequency (TF): How often a term t appears in a document d, normalized by the length of the document.

TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)

Inverse Document Frequency (IDF): Measures how rare a term is across all documents. The fewer documents contain the term, the higher its IDF.

IDF(t) = log(Total number of documents / Number of documents containing term t)

The examples below use the natural logarithm.

Example

Let’s illustrate with a simple example:

Documents:

  1. “the cat sat on the mat”
  2. “the cat sat”

Step 1: Compute TF

For the term “cat” in Document 1:

TF(cat,d1) = 1/6 ≈ 0.167

For the term “cat” in Document 2:

TF(cat,d2) = 1/3 ≈ 0.333

Step 2: Compute IDF

The term “cat” appears in both documents:

IDF(cat) = log(2/2) = log(1) = 0

For a term like “mat”, which appears only in Document 1, the IDF is positive:

IDF(mat) = log(2/1) = log(2) ≈ 0.693

Step 3: Compute TF-IDF

For “cat” in Document 1:

TF-IDF(cat,d1) = 0.167×0 = 0

For “mat” in Document 1:

TF-IDF(mat,d1) = 0.167×0.693 ≈ 0.116
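
The arithmetic in the three steps above can be checked with a few lines of Python. This is only a minimal sketch: the helper names tf and idf are made up for illustration, tokens are split on whitespace, and the natural logarithm is assumed (which is what gives log(2) ≈ 0.693).

```python
import math

# The two example documents from above, split on whitespace.
doc1 = "the cat sat on the mat".split()
doc2 = "the cat sat".split()
docs = [doc1, doc2]

def tf(term, doc):
    # Share of the document's tokens that are this term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / documents containing the term).
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

for term in ("cat", "mat"):
    score = tf(term, doc1) * idf(term, docs)
    print(term, round(tf(term, doc1), 3), round(idf(term, docs), 3), round(score, 3))
# prints:
# cat 0.167 0.0 0.0
# mat 0.167 0.693 0.116
```

“cat” scores 0 because it occurs in every document, so it does nothing to distinguish Document 1 from Document 2.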

Simple Code Example

Here’s a basic implementation in Python:

import math
from collections import Counter

def compute_tf(text):
    # Term frequency: each word's count divided by the document's length.
    words = text.split()
    total_words = len(words)
    return {word: count / total_words for word, count in Counter(words).items()}

def compute_idf(corpus):
    # Tokenize each document once and keep only its distinct words.
    doc_word_sets = [set(text.split()) for text in corpus]
    total_docs = len(corpus)
    all_words = set().union(*doc_word_sets)
    idf = {}
    for word in all_words:
        # Number of documents in which the word occurs at least once.
        containing_docs = sum(1 for words in doc_word_sets if word in words)
        idf[word] = math.log(total_docs / containing_docs)
    return idf

def compute_tf_idf(tf, idf):
    # Multiply each term's TF by its corpus-wide IDF.
    return {word: tf_val * idf[word] for word, tf_val in tf.items()}

corpus = ["the cat sat on the mat", "the cat sat"]
tf_doc1 = compute_tf(corpus[0])
idf_values = compute_idf(corpus)
tf_idf_doc1 = compute_tf_idf(tf_doc1, idf_values)

print("TF-IDF for Document 1:", tf_idf_doc1)

This code calculates the TF-IDF values for the terms in the first document of the given corpus.
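
The script above scores only the first document. One possible way to score every document and rank its terms is sketched below. The function name tf_idf_per_document is hypothetical, and the sketch keeps the same assumptions as the code above: whitespace tokenization, natural logarithm, and no smoothing.

```python
import math
from collections import Counter

def tf_idf_per_document(corpus):
    # Tokenize once so TF and IDF are computed from the same tokens.
    tokenized = [text.split() for text in corpus]
    total_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # TF (count / document length) times IDF (log of N / document frequency).
        scores.append({
            term: (count / len(tokens)) * math.log(total_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

corpus = ["the cat sat on the mat", "the cat sat"]
for i, doc_scores in enumerate(tf_idf_per_document(corpus), start=1):
    # Highest-scoring terms first.
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(f"Document {i}:", [(term, round(score, 3)) for term, score in ranked])
```

For this corpus, “on” and “mat” come out on top for Document 1 (about 0.116 each), while every term in Document 2 scores 0 because each of its words also appears in Document 1.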

Conclusion

TF-IDF is a simple, widely used weighting scheme in information retrieval and text mining. By emphasizing terms that are frequent in a specific document but rare across the corpus, it helps identify the words most characteristic of each document.
