
How do you implement the multinomial distribution in Python?

Of course! The term "multinomial" in Python can refer to two main things:

  1. The Multinomial Distribution: A fundamental concept in probability theory, which is an extension of the binomial distribution for scenarios with more than two outcomes (e.g., rolling a 6-sided die).
  2. The Multinomial Naive Bayes Classifier: A popular machine learning algorithm used for text classification and other tasks where features represent counts or frequencies.

Let's break down both.


The Multinomial Distribution (from scipy.stats)

The multinomial distribution models the probability of observing a particular set of counts for each possible outcome when you run n independent trials, each of which produces exactly one of k possible outcomes.

Key Parameters:

  • n: The total number of trials (e.g., the number of times you roll the die).
  • p: A list of probabilities for each of the k possible outcomes. The probabilities must sum to 1.
  • k: The number of possible outcomes (in scipy.stats.multinomial this is not passed explicitly; it is inferred from the length of p).

Use Case: You want to know the probability of getting a specific count for each face of a die when you roll it 20 times.
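For a count vector (x_1, ..., x_k) with x_1 + ... + x_k = n, the probability that scipy evaluates is the standard multinomial probability mass function:

P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1!\,x_2!\cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}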

Example: Rolling a Loaded Die

Imagine a 4-sided die with the following probabilities:

  • Face 1: 10% chance (p=0.1)
  • Face 2: 20% chance (p=0.2)
  • Face 3: 30% chance (p=0.3)
  • Face 4: 40% chance (p=0.4)

We want to find the probability of rolling the die 10 times and getting:

  • Face 1: 1 time
  • Face 2: 2 times
  • Face 3: 3 times
  • Face 4: 4 times

(Note: 1 + 2 + 3 + 4 = 10, which matches our number of trials, n=10).

import numpy as np
from scipy.stats import multinomial

# 1. Define the parameters
n_trials = 10  # The total number of die rolls
probabilities = [0.1, 0.2, 0.3, 0.4]  # p-values for each face (must sum to 1)

# 2. Define the specific outcome we're interested in
# The counts for each face, in the order of the probabilities list
outcome_counts = [1, 2, 3, 4]

# 3. Calculate the probability of this exact outcome
# The .pmf() method calculates the Probability Mass Function
probability = multinomial.pmf(outcome_counts, n=n_trials, p=probabilities)
print(f"The probability of the outcome {outcome_counts} is: {probability:.6f}")

# You can also generate random samples from the distribution
# This simulates rolling the die 10 times, 5 different times.
samples = multinomial.rvs(n=n_trials, p=probabilities, size=5)
print("\n5 random samples (each sample is a list of counts for the 4 faces):")
print(samples)

Output:

The probability of the outcome [1, 2, 3, 4] is: 0.034836
5 random samples (each sample is a list of counts for the 4 faces):
[[2 3 2 3]
 [1 1 5 3]
 [0 4 3 3]
 [2 2 2 4]
 [1 2 2 5]]
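
As a quick sanity check, you can reproduce that pmf value by hand from the formula above using only the standard library (a small sketch; the variable names are just for illustration):

from math import factorial

n = 10
p = [0.1, 0.2, 0.3, 0.4]
counts = [1, 2, 3, 4]  # note: these are exactly the expected counts, n * p_i

# Multinomial coefficient: 10! / (1! * 2! * 3! * 4!) = 12600
coefficient = factorial(n)
for c in counts:
    coefficient //= factorial(c)

# Multiply by p_i ** x_i for each face
probability = float(coefficient)
for p_i, c in zip(p, counts):
    probability *= p_i ** c

print(coefficient)            # 12600
print(round(probability, 6))  # 0.034836, matching multinomial.pmf above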

The Multinomial Naive Bayes Classifier (from sklearn)

This is a classification algorithm based on Bayes' theorem. It's "naive" because it assumes that all features are independent of each other, which is often not true in practice but works surprisingly well, especially for text data.
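Concretely, given a document represented by word counts x = (x_1, ..., x_k), multinomial Naive Bayes picks the class c that maximizes the product of the class prior and the per-word likelihoods (this is the standard formulation; alpha is the Laplace/Lidstone smoothing parameter exposed by sklearn's MultinomialNB):

\hat{y} = \arg\max_{c}\; P(c) \prod_{i=1}^{k} P(w_i \mid c)^{x_i},
\qquad
\hat{P}(w_i \mid c) = \frac{N_{ci} + \alpha}{N_c + \alpha k}

where N_ci is the total count of word i across the training documents of class c, and N_c is the sum of those counts over all words.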

Why "Multinomial"? It's called "Multinomial" because it is specifically designed for features that are counts or frequencies. For example:

  • Text Classification: Each feature can represent the count of a particular word in a document.
  • Image Classification: Each feature can represent the count of a particular pixel intensity.

Example: Document Classification (Spam vs. Ham)

Let's classify text messages as "spam" or "ham" (not spam).

Step 1: Setup and Data Preparation

We'll use CountVectorizer to convert the text documents into a matrix of token counts; these word counts are exactly the kind of count-valued features the "multinomial" model expects.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# 1. Sample data: text messages and their labels
# 'spam' = 1, 'ham' = 0
corpus = [
    'Get a free vacation now!',      # spam
    'Your package has arrived',      # ham
    'Exclusive offer just for you',  # spam
    'Meeting at 3pm today',          # ham
    'Claim your prize now',          # spam
    'Lunch tomorrow?',               # ham
    'Free money, click here',        # spam
    'Thanks for your message'        # ham
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1 for spam, 0 for ham

# 2. Vectorize the text data into a matrix of token counts
# This creates our multinomial features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# The features are now counts of each word
# (CountVectorizer lowercases the text and, by default, drops single-character tokens like 'a')
print("Feature Matrix (X):")
print(X.toarray())

# The feature names (words)
print("\nFeature Names (vocabulary):")
print(vectorizer.get_feature_names_out())

# 3. Split data into training and testing sets
# stratify=labels keeps one spam and one ham message in the tiny test set
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

Step 2: Train the Multinomial Naive Bayes Model

# 4. Initialize and train the Multinomial Naive Bayes classifier
# The model learns the probability of each word given the class (spam or ham)
model = MultinomialNB()
model.fit(X_train, y_train)
print("\nModel trained successfully!")

Step 3: Make Predictions

# 5. Make predictions on the test data
y_pred = model.predict(X_test)
# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2f}")
# The test set is small, so the report might be simple
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

Step 4: Classify New, Unseen Data

# 7. Classify a new, unseen message
new_messages = [
    "You won a free iPhone, claim now!", # Should be spam
    "Are we still on for dinner?"         # Should be ham
]
# IMPORTANT: You must use the SAME vectorizer to transform the new data
new_messages_counts = vectorizer.transform(new_messages)
# Predict the labels
predictions = model.predict(new_messages_counts)
print("\n--- Predictions for New Messages ---")
for message, prediction in zip(new_messages, predictions):
    label = "Spam" if prediction == 1 else "Ham"
    print(f"Message: '{message}' -> Predicted Label: {label}")

Output of the full example:

Feature Matrix (X):
[[0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0]
 [0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1]
 [0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1]]
Feature Names (vocabulary):
['3pm' 'arrived' 'at' 'claim' 'click' 'exclusive' 'for' 'free' 'get' 'has'
 'here' 'just' 'lunch' 'meeting' 'message' 'money' 'now' 'offer' 'package'
 'prize' 'thanks' 'today' 'tomorrow' 'vacation' 'you' 'your']
Model trained successfully!
Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support
         Ham       1.00      1.00      1.00         1
        Spam       1.00      1.00      1.00         1
    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
--- Predictions for New Messages ---
Message: 'You won a free iPhone, claim now!' -> Predicted Label: Spam
Message: 'Are we still on for dinner?' -> Predicted Label: Ham
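
If you also want the model's confidence rather than only the hard label, MultinomialNB provides predict_proba; continuing from the example above (the exact probabilities will depend on the tiny training corpus):

# Per-class probabilities for the new messages
probas = model.predict_proba(new_messages_counts)
for message, proba in zip(new_messages, probas):
    # Columns follow model.classes_, i.e. index 0 = ham, index 1 = spam
    print(f"'{message}' -> P(ham) = {proba[0]:.3f}, P(spam) = {proba[1]:.3f}")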

Summary: Which One to Use?

scipy.stats.multinomial

  • Purpose: Probability theory. Calculate probabilities or generate samples from a multinomial distribution.
  • Input: n (trials), p (probabilities), and the outcome_counts you are interested in.
  • Output: A single probability value (.pmf) or random samples (.rvs).
  • Common use case: Modeling experiments with multiple categorical outcomes (e.g., election results, A/B/C testing).

sklearn.naive_bayes.MultinomialNB

  • Purpose: Machine learning. Build a classification model.
  • Input: A feature matrix X (counts/frequencies) and a target vector y (labels).
  • Output: A trained model that can predict the class of new data.
  • Common use case: Text classification (spam detection, sentiment analysis), document categorization.
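
As a final practical note, the vectorizer-plus-classifier workflow above is commonly wrapped in a scikit-learn Pipeline, which makes it impossible to forget the "use the same vectorizer" rule. A minimal sketch using the same toy corpus and labels from the example (the predicted labels shown are only what you would typically expect on such a small dataset):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# One object that vectorizes and classifies in a single fit/predict call
# (corpus and labels are the same lists defined in the example above)
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(corpus, labels)  # raw strings go in; no manual transform step

print(text_clf.predict(["Claim your free prize now", "See you at lunch tomorrow"]))
# Expected output along the lines of: [1 0]  (1 = spam, 0 = ham)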