What is TF-IDF?
TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a statistical measure of how important a word is to a document relative to a collection of documents (a corpus). The core idea is to highlight words that occur often in a specific document but are not common across the corpus as a whole.
Term Frequency (TF): How often a term t appears in a document d, normalized by the length of the document.
TF(t,d) = Number of times term t appears in document d / Total number of terms in document d
Inverse Document Frequency (IDF): Measures how rare a term is across all documents. The rarer the term, the higher its IDF. Here log is the natural logarithm.
IDF(t) = log(Total number of documents / Number of documents containing term t)
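The two definitions combine into a single score by multiplication, which is the step the worked example relies on. As a compact restatement in LaTeX (the symbols f_{t,d} for the raw count of t in d, N for the number of documents, and n_t for the number of documents containing t are notation assumed here, not taken from the text above; \text assumes the amsmath package):

```latex
\[
\mathrm{TF}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \ln\frac{N}{n_t}, \qquad
\text{TF-IDF}(t,d) = \mathrm{TF}(t,d)\cdot\mathrm{IDF}(t)
\]
```

A term that appears in every document has n_t = N, so its IDF, and therefore its TF-IDF, is 0.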
Example
Let’s illustrate with a simple example:
Documents:
- Document 1: “the cat sat on the mat”
- Document 2: “the cat sat”
Step 1: Compute TF
For the term “cat” in Document 1:
TF(cat,d1) = 1/6 ≈ 0.167
For the term “cat” in Document 2:
TF(cat,d2) = 1/3 ≈ 0.333
Step 2: Compute IDF
The term “cat” appears in both documents:
IDF(cat) = log(2/2) = log(1) = 0
However, for a term like “mat” which appears only in Document 1:
IDF(mat) = log(2/1) = log(2) ≈ 0.693
Step 3: Compute TF-IDF
For “cat” in Document 1:
TF-IDF(cat,d1) = 0.167×0 = 0
For “mat” in Document 1:
TF-IDF(mat,d1) = 0.167×0.693 ≈ 0.116
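The arithmetic in the three steps can be checked with a few lines of Python. This is only a sanity check of the numbers above, assuming whitespace tokenization and the natural logarithm; the variable names are illustrative.

```python
import math

# Values taken from the two example documents above.
n_docs = 2
doc1 = "the cat sat on the mat".split()  # 6 tokens

tf_cat = doc1.count("cat") / len(doc1)   # 1/6
tf_mat = doc1.count("mat") / len(doc1)   # 1/6
idf_cat = math.log(n_docs / 2)           # "cat" is in both documents
idf_mat = math.log(n_docs / 1)           # "mat" is only in Document 1

print(round(tf_cat * idf_cat, 3))  # 0.0
print(round(tf_mat * idf_mat, 3))  # 0.116
```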
Simple Code Example
Here’s a basic implementation in Python:
import math
from collections import Counter


def compute_tf(text):
    # Term frequency: each word's count divided by the document length.
    words = text.split()
    tf_values = Counter(words)
    total_words = len(words)
    tf = {word: count / total_words for word, count in tf_values.items()}
    return tf


def compute_idf(corpus):
    # Inverse document frequency (natural log) for every word in the corpus.
    idf = {}
    total_docs = len(corpus)
    all_words = set(word for text in corpus for word in text.split())
    for word in all_words:
        # Count the documents that contain the word as a whole token.
        containing_docs = sum(1 for text in corpus if word in text.split())
        idf[word] = math.log(total_docs / containing_docs)
    return idf


def compute_tf_idf(tf, idf):
    # Multiply each term's TF by its corpus-wide IDF.
    tf_idf = {word: tf_val * idf[word] for word, tf_val in tf.items()}
    return tf_idf


corpus = ["the cat sat on the mat", "the cat sat"]
tf_doc1 = compute_tf(corpus[0])
idf_values = compute_idf(corpus)
tf_idf_doc1 = compute_tf_idf(tf_doc1, idf_values)
print("TF-IDF for Document 1:", tf_idf_doc1)
This code calculates the TF-IDF values for the terms in the first document of the given corpus.
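To score every document rather than only the first, the same calculation can be wrapped in a loop and the terms sorted by score. The sketch below is one possible way to do that, assuming whitespace tokenization, natural-log IDF, and non-empty documents; tf_idf_per_document and top_terms are hypothetical helper names introduced for illustration.

```python
import math
from collections import Counter


def tf_idf_per_document(corpus):
    """Return one {term: score} dict per document (natural-log IDF)."""
    tokenized = [text.split() for text in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(scores, k=2):
    """Hypothetical helper: the k highest-scoring terms of one document."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


corpus = ["the cat sat on the mat", "the cat sat"]
all_scores = tf_idf_per_document(corpus)
for i, scores in enumerate(all_scores, start=1):
    print(f"Document {i}:", top_terms(scores))
```

For this corpus the top terms of Document 1 are "on" and "mat", the only words that do not also occur in Document 2. Every word of Document 2 appears in both documents, so all of its scores are 0 and the ranking there carries no information, which shows how little a two-document corpus can distinguish.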
Conclusion
TF-IDF is a simple and widely used weighting scheme in information retrieval and text mining. By emphasizing terms that are frequent in a specific document but rare across the corpus, it helps identify the words that best characterize each document.