杰瑞科技汇

How to implement TF-IDF with Python NLTK?

Of course! Let's break down how to use TF-IDF with NLTK and Scikit-learn in Python.

This is a very common task in Natural Language Processing (NLP) for tasks like text classification, information retrieval, and keyword extraction.

What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic that reflects how important a word is to a document in a collection or corpus. It's a clever way to convert text into numbers (a process called vectorization) for machine learning models.

It's composed of two parts:

  1. Term Frequency (TF): How often a word appears in a single document.

    • Intuition: If a word appears many times in a document, it's probably important to that document.
    • Simple Formula: TF(word, document) = (Number of times the word appears in the document) / (Total number of words in the document)
  2. Inverse Document Frequency (IDF): How rare or common a word is across all documents in the corpus.

    • Intuition: Words that appear in many documents (like "the", "a", "is") are not very distinctive. We want to down-weight them. Words that appear in only a few documents are more important and should be up-weighted.
    • Formula: IDF(word, corpus) = log_e(Total number of documents / Number of documents containing the word)

The final TF-IDF score is the product of TF and IDF:

TF-IDF(word, document, corpus) = TF(word, document) * IDF(word, corpus)
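The two formulas above can be checked by hand in a few lines of plain Python. This is a toy sketch using the simple (unsmoothed) definitions with the natural log, before any library is involved:

```python
import math

def tf(word, document):
    # Term frequency: count of the word divided by the document length
    words = document.split()
    return words.count(word) / len(words)

def idf(word, corpus):
    # Inverse document frequency: log of (total docs / docs containing the word)
    containing = sum(1 for doc in corpus if word in doc.split())
    return math.log(len(corpus) / containing)

def tf_idf(word, document, corpus):
    return tf(word, document) * idf(word, corpus)

corpus = ["cat sat mat", "dog sat log", "cat dog friends"]
print(tf_idf("cat", corpus[0], corpus))  # "cat" appears in 2 of 3 docs: lower IDF
print(tf_idf("mat", corpus[0], corpus))  # "mat" is unique to doc 0: higher IDF
```

Note that scikit-learn's `TfidfVectorizer`, used below, applies a smoothed variant of IDF plus row normalization, so its numbers will differ slightly from this textbook version.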


Step-by-Step Python Implementation using NLTK and Scikit-learn

While NLTK is excellent for text preprocessing (like tokenization and stop word removal), the most common and efficient library for calculating TF-IDF is Scikit-learn. We'll use both: NLTK for preparation and Scikit-learn for the heavy lifting.

Step 1: Installation

First, make sure you have the necessary libraries installed (pandas is used later to display the results):

pip install nltk scikit-learn pandas

Step 2: Import Libraries and Download NLTK Data

We'll import the required libraries and download the 'stopwords' and 'punkt' (for tokenization) data from NLTK.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
# Download necessary NLTK data (only need to do this once)
nltk.download('punkt')
nltk.download('stopwords')
# Newer NLTK releases may also require the 'punkt_tab' data for word_tokenize:
# nltk.download('punkt_tab')

Step 3: Prepare Your Corpus (Collection of Documents)

Let's define a simple list of documents. In a real-world scenario, this would be much larger.

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
    "Cats and dogs are common pets."
]

Step 4: Preprocessing the Text (Using NLTK)

It's good practice to clean the text before vectorization. This usually involves:

  1. Tokenization: Splitting sentences into words.
  2. Lowercasing: Converting all words to lowercase.
  3. Stop Word Removal: Removing common words that don't carry much meaning (e.g., "the", "is", "on").

Let's create a function to do this.

stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Remove stop words and non-alphabetic characters
    filtered_tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return " ".join(filtered_tokens)
# Apply the preprocessing function to our corpus
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]
print("--- Preprocessed Corpus ---")
for i, doc in enumerate(preprocessed_corpus):
    print(f"Document {i+1}: {doc}")

Output of Preprocessing:

--- Preprocessed Corpus ---
Document 1: cat sat mat
Document 2: dog sat log
Document 3: cat dog friends
Document 4: cats dogs common pets

Notice how "the", "on", "and", "are" have been removed.

Step 5: Create and Fit the TF-IDF Vectorizer (Using Scikit-learn)

Now we'll use TfidfVectorizer from Scikit-learn. It handles the TF and IDF calculations for us.

# Create the TfidfVectorizer.
# max_features limits the vocabulary to the N most frequent terms;
# our preprocessed corpus has exactly 10 distinct tokens, so nothing is cut here.
vectorizer = TfidfVectorizer(max_features=10)
# Fit the vectorizer to our preprocessed corpus and transform it into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)
# Get the feature names (the words)
feature_names = vectorizer.get_feature_names_out()
print("\n--- Feature Names (Vocabulary) ---")
print(feature_names)

Step 6: Inspect the Results

The tfidf_matrix is a sparse matrix. To see the actual TF-IDF scores, we can convert it to a dense array and then to a pandas DataFrame for better readability.

import pandas as pd
# Convert the sparse matrix to a dense array
tfidf_array = tfidf_matrix.toarray()
# Create a DataFrame
tfidf_df = pd.DataFrame(tfidf_array, columns=feature_names)
print("\n--- TF-IDF Matrix ---")
print(tfidf_df)

Output of the TF-IDF Matrix:

--- TF-IDF Matrix ---
        cat  cats  common       dog  dogs   friends       log       mat  pets       sat
0  0.526405   0.0     0.0  0.000000   0.0  0.000000  0.000000  0.667679   0.0  0.526405
1  0.000000   0.0     0.0  0.526405   0.0  0.000000  0.667679  0.000000   0.0  0.526405
2  0.526405   0.0     0.0  0.526405   0.0  0.667679  0.000000  0.000000   0.0  0.000000
3  0.000000   0.5     0.5  0.000000   0.5  0.000000  0.000000  0.000000   0.5  0.000000

How to interpret this table:

  • Rows correspond to our original documents (after preprocessing).
  • Columns correspond to the words in our vocabulary (feature_names).
  • Values are the TF-IDF scores.

Let's look at Document 1 (cat sat mat):

  • "cat" and "sat" each score about 0.53. Both words appear in two of the four documents, so their IDF is moderate.
  • "mat" scores about 0.67. It appears only in this document, so its IDF (and therefore its TF-IDF) is higher.
  • All other words (like "dog", "log", "friends") have a score of 0 because they don't appear in this document.

Now let's look at Document 3 (cat dog friends):

  • "cat" and "dog" each score about 0.53, again because each appears in two documents.
  • "friends" scores about 0.67. This word is unique to Document 3, so its IDF is high, resulting in a higher TF-IDF score.

Two details worth noting: scikit-learn uses a smoothed IDF (log((1 + N) / (1 + df)) + 1) and L2-normalizes each row, so the numbers differ from the textbook formula shown earlier. Also, "cat"/"cats" and "dog"/"dogs" count as separate features because we didn't apply stemming or lemmatization.
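Once documents are vectors, they can be compared numerically. A common next step is cosine similarity between rows of the TF-IDF matrix; here is a short standalone sketch using the same preprocessed corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

preprocessed_corpus = [
    "cat sat mat",
    "dog sat log",
    "cat dog friends",
    "cats dogs common pets",
]
tfidf_matrix = TfidfVectorizer().fit_transform(preprocessed_corpus)

# Pairwise cosine similarity between all document vectors
sim = cosine_similarity(tfidf_matrix)

print(sim[0, 1])  # Documents 1 and 2 share "sat", so similarity is > 0
print(sim[0, 3])  # Documents 1 and 4 share no features, so similarity is 0
```

This is what powers TF-IDF-based information retrieval: rank documents by their cosine similarity to a query vector produced by the same fitted vectorizer.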

Complete Code Example

Here is the full script from start to finish.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# --- 1. Setup ---
# Download NLTK data
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
try:
    word_tokenize("test")
except LookupError:
    nltk.download('punkt')
    # Newer NLTK releases may also require:
    # nltk.download('punkt_tab')
# --- 2. Data and Preprocessing ---
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat and the dog are friends.",
    "Cats and dogs are common pets."
]
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return " ".join(filtered_tokens)
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]
print("--- Preprocessed Corpus ---")
for i, doc in enumerate(preprocessed_corpus):
    print(f"Document {i+1}: {doc}")
# --- 3. TF-IDF Vectorization ---
vectorizer = TfidfVectorizer(max_features=10)
tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)
feature_names = vectorizer.get_feature_names_out()
# --- 4. Results ---
print("\n--- Feature Names (Vocabulary) ---")
print(feature_names)
tfidf_array = tfidf_matrix.toarray()
tfidf_df = pd.DataFrame(tfidf_array, columns=feature_names)
print("\n--- TF-IDF Matrix (DataFrame) ---")
print(tfidf_df)
# You can also get the scores for a specific word
print("\n--- IDF Scores for each word ---")
print(dict(zip(feature_names, vectorizer.idf_)))

Key Parameters of TfidfVectorizer

  • max_features: An integer that specifies the number of most frequent features (words) to keep. Useful for reducing dimensionality.
  • ngram_range: A tuple (min_n, max_n). For example, (1, 2) will create features for both single words (unigrams) and two-word phrases (bigrams). This can capture context (e.g., "new york" vs. "new" and "york").
  • stop_words: You can pass a list of stop words (like stopwords.words('english')) or the string 'english' to use a built-in list.
  • tokenizer: You can pass your own custom tokenization function. If not provided, it uses a simple regex-based tokenizer. NLTK's word_tokenize is a good custom option.
  • lowercase: A boolean (default True) that converts all characters to lowercase before tokenizing.