杰瑞科技汇

How to implement TF-IDF with Python NLTK?

Of course! Let's break down how to use TF-IDF with NLTK and Scikit-learn in Python.

This is a very common task in Natural Language Processing (NLP) for tasks like text classification, information retrieval, and keyword extraction.

What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic that reflects how important a word is to a document in a collection or corpus. It's a clever way to convert text into numbers (a process called vectorization) for machine learning models.

It's composed of two parts:

  1. Term Frequency (TF): How often a word appears in a single document.

    • Intuition: If a word appears many times in a document, it's probably important to that document.
    • Simple Formula: TF(word, document) = (Number of times the word appears in the document) / (Total number of words in the document)
  2. Inverse Document Frequency (IDF): How rare or common a word is across all documents in the corpus.

    • Intuition: Words that appear in many documents (like "the", "a", "is") are not very distinctive. We want to down-weight them. Words that appear in only a few documents are more important and should be up-weighted.
    • Formula: IDF(word, corpus) = log_e(Total number of documents / Number of documents containing the word)

The final TF-IDF score is the product of TF and IDF:

TF-IDF(word, document, corpus) = TF(word, document) * IDF(word, corpus)
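The two formulas above can be checked by hand in a few lines of plain Python. This is a toy sketch using the simple (unsmoothed) definitions with the natural log, before any library is involved:

```python
import math

def tf(word, document):
    # Term frequency: count of the word divided by the document length
    words = document.split()
    return words.count(word) / len(words)

def idf(word, corpus):
    # Inverse document frequency: log of (total docs / docs containing the word)
    containing = sum(1 for doc in corpus if word in doc.split())
    return math.log(len(corpus) / containing)

def tf_idf(word, document, corpus):
    return tf(word, document) * idf(word, corpus)

corpus = ["cat sat mat", "dog sat log", "cat dog friends"]
print(tf_idf("cat", corpus[0], corpus))  # "cat" appears in 2 of 3 docs: lower IDF
print(tf_idf("mat", corpus[0], corpus))  # "mat" is unique to doc 0: higher IDF
```

Note that scikit-learn's `TfidfVectorizer`, used below, applies a smoothed variant of IDF plus row normalization, so its numbers will differ slightly from this textbook version.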


Step-by-Step Python Implementation using NLTK and Scikit-learn

While NLTK is excellent for text preprocessing (like tokenization and stop word removal), the most common and efficient library for calculating TF-IDF is Scikit-learn. We'll use both: NLTK for preparation and Scikit-learn for the heavy lifting.

Step 1: Installation

First, make sure you have the necessary libraries installed (pandas is used later to display the results):

pip install nltk scikit-learn pandas

Step 2: Import Libraries and Download NLTK Data

We'll import the required libraries and download the 'stopwords' and 'punkt' (for tokenization) data from NLTK.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
# Download necessary NLTK data (only need to do this once)
nltk.download('punkt')
nltk.download('stopwords')
# Newer NLTK releases may also require the 'punkt_tab' data for word_tokenize:
# nltk.download('punkt_tab')

Step 3: Prepare Your Corpus (Collection of Documents)

Let's define a simple list of documents. In a real-world scenario, this would be much larger.

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
    "Cats and dogs are common pets."
]

Step 4: Preprocessing the Text (Using NLTK)

It's good practice to clean the text before vectorization. This usually involves:

  1. Tokenization: Splitting sentences into words.
  2. Lowercasing: Converting all words to lowercase.
  3. Stop Word Removal: Removing common words that don't carry much meaning (e.g., "the", "is", "on").

Let's create a function to do this.

stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Remove stop words and non-alphabetic characters
    filtered_tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return " ".join(filtered_tokens)
# Apply the preprocessing function to our corpus
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]
print("--- Preprocessed Corpus ---")
for i, doc in enumerate(preprocessed_corpus):
    print(f"Document {i+1}: {doc}")

Output of Preprocessing:

--- Preprocessed Corpus ---
Document 1: cat sat mat
Document 2: dog sat log
Document 3: cat dog friends
Document 4: cats dogs common pets

Notice how "the", "on", "and", "are" have been removed.

Step 5: Create and Fit the TF-IDF Vectorizer (Using Scikit-learn)

Now we'll use TfidfVectorizer from Scikit-learn. It handles the TF and IDF calculations for us.

# Create the TfidfVectorizer.
# max_features limits the vocabulary to the N most frequent terms;
# our preprocessed corpus has exactly 10 distinct tokens, so nothing is cut here.
vectorizer = TfidfVectorizer(max_features=10)
# Fit the vectorizer to our preprocessed corpus and transform it into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)
# Get the feature names (the words)
feature_names = vectorizer.get_feature_names_out()
print("\n--- Feature Names (Vocabulary) ---")
print(feature_names)

Step 6: Inspect the Results

The tfidf_matrix is a sparse matrix. To see the actual TF-IDF scores, we can convert it to a dense array and then to a pandas DataFrame for better readability.

import pandas as pd
# Convert the sparse matrix to a dense array
tfidf_array = tfidf_matrix.toarray()
# Create a DataFrame
tfidf_df = pd.DataFrame(tfidf_array, columns=feature_names)
print("\n--- TF-IDF Matrix ---")
print(tfidf_df)

Output of the TF-IDF Matrix:

--- TF-IDF Matrix ---
        cat  cats  common       dog  dogs   friends       log       mat  pets       sat
0  0.526405   0.0     0.0  0.000000   0.0  0.000000  0.000000  0.667679   0.0  0.526405
1  0.000000   0.0     0.0  0.526405   0.0  0.000000  0.667679  0.000000   0.0  0.526405
2  0.526405   0.0     0.0  0.526405   0.0  0.667679  0.000000  0.000000   0.0  0.000000
3  0.000000   0.5     0.5  0.000000   0.5  0.000000  0.000000  0.000000   0.5  0.000000

How to interpret this table:

  • Rows correspond to our original documents (after preprocessing).
  • Columns correspond to the words in our vocabulary (feature_names).
  • Values are the TF-IDF scores.

Let's look at Document 1 (cat sat mat):

  • "cat" and "sat" each score about 0.53. Both words appear in two of the four documents, so their IDF is moderate.
  • "mat" scores about 0.67. It appears only in this document, so its IDF (and therefore its TF-IDF) is higher.
  • All other words (like "dog", "log", "friends") have a score of 0 because they don't appear in this document.

Now let's look at Document 3 (cat dog friends):

  • "cat" and "dog" each score about 0.53, again because each appears in two documents.
  • "friends" scores about 0.67. This word is unique to Document 3, so its IDF is high, resulting in a higher TF-IDF score.

Two details worth noting: scikit-learn uses a smoothed IDF (log((1 + N) / (1 + df)) + 1) and L2-normalizes each row, so the numbers differ from the textbook formula shown earlier. Also, "cat"/"cats" and "dog"/"dogs" count as separate features because we didn't apply stemming or lemmatization.
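Once documents are vectors, they can be compared numerically. A common next step is cosine similarity between rows of the TF-IDF matrix; here is a short standalone sketch using the same preprocessed corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

preprocessed_corpus = [
    "cat sat mat",
    "dog sat log",
    "cat dog friends",
    "cats dogs common pets",
]
tfidf_matrix = TfidfVectorizer().fit_transform(preprocessed_corpus)

# Pairwise cosine similarity between all document vectors
sim = cosine_similarity(tfidf_matrix)

print(sim[0, 1])  # Documents 1 and 2 share "sat", so similarity is > 0
print(sim[0, 3])  # Documents 1 and 4 share no features, so similarity is 0
```

This is what powers TF-IDF-based information retrieval: rank documents by their cosine similarity to a query vector produced by the same fitted vectorizer.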

Complete Code Example

Here is the full script from start to finish.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# --- 1. Setup ---
# Download NLTK data
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
try:
    word_tokenize("test")
except LookupError:
    nltk.download('punkt')
    # Newer NLTK releases may also require:
    # nltk.download('punkt_tab')
# --- 2. Data and Preprocessing ---
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
    "Cats and dogs are common pets."
]
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return " ".join(filtered_tokens)
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]
print("--- Preprocessed Corpus ---")
for i, doc in enumerate(preprocessed_corpus):
    print(f"Document {i+1}: {doc}")
# --- 3. TF-IDF Vectorization ---
vectorizer = TfidfVectorizer(max_features=10)
tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)
feature_names = vectorizer.get_feature_names_out()
# --- 4. Results ---
print("\n--- Feature Names (Vocabulary) ---")
print(feature_names)
tfidf_array = tfidf_matrix.toarray()
tfidf_df = pd.DataFrame(tfidf_array, columns=feature_names)
print("\n--- TF-IDF Matrix (DataFrame) ---")
print(tfidf_df)
# You can also get the scores for a specific word
print("\n--- IDF Scores for each word ---")
print(dict(zip(feature_names, vectorizer.idf_)))

Key Parameters of TfidfVectorizer

  • max_features: An integer that specifies the number of most frequent features (words) to keep. Useful for reducing dimensionality.
  • ngram_range: A tuple (min_n, max_n). For example, (1, 2) will create features for both single words (unigrams) and two-word phrases (bigrams). This can capture context (e.g., "new york" vs. "new" and "york").
  • stop_words: You can pass a list of stop words (like stopwords.words('english')) or the string 'english' to use a built-in list.
  • tokenizer: You can pass your own custom tokenization function. If not provided, it uses a simple regex-based tokenizer. NLTK's word_tokenize is a good custom option.
  • lowercase: A boolean (default True) that converts all characters to lowercase before tokenizing.