
How to Apply and Optimize Naive Bayes in Python?

Here is a complete guide to implementing Naive Bayes in Python.


There are two ways to do this:

  1. Using Scikit-Learn: The industry-standard way (easy, fast, optimized).
  2. From Scratch: The educational way (to understand the math).

The Practical Approach (Scikit-Learn)

For real-world projects, you should use scikit-learn. The most common version is Multinomial Naive Bayes, which is famous for text classification (like Spam vs. Ham).

Here is a full example using a small hand-made spam/ham dataset to classify messages.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# 1. Create a dummy dataset (Text, Label)
data = {
    'text': [
        "Free money now!!!", 
        "Hi Bob, how about a game of golf tomorrow?",
        "URGENT! Your bank account is compromised.",
        "Meeting reminder: Project sync at 10 AM.",
        "Win a brand new car! Click here.",
        "Can we reschedule dinner to next week?",
        "Limited time offer! Buy one get one free.",
        "Don't forget to bring the snacks for the party."
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}
df = pd.DataFrame(data)
# 2. Convert text to numbers (Bag of Words)
# Naive Bayes cannot read text; it needs frequency counts.
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])
y = df['label']
# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 4. Initialize and Train the Model
model = MultinomialNB()
model.fit(X_train, y_train)
# 5. Make Predictions
y_pred = model.predict(X_test)
# 6. Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Test on a new sentence
new_sentence = ["Congratulation! You won the lottery."]
new_sentence_vectorized = vectorizer.transform(new_sentence)
prediction = model.predict(new_sentence_vectorized)
print(f"\nPrediction for '{new_sentence[0]}': {prediction[0]}")

Choosing the Right Naive Bayes

Scikit-learn offers several Naive Bayes variants; the three most common are listed below. Choose based on how your features are distributed:

  1. MultinomialNB: Best for Text (Word counts, TF-IDF).
    • Example: Spam filtering, Topic categorization.
  2. GaussianNB: Best for Continuous Features (assuming a normal distribution).
    • Example: Iris flower classification (petal length/width).
  3. BernoulliNB: Best for Binary Features (presence/absence).
    • Example: Checking whether specific words exist in a document (yes/no); a short sketch follows the Gaussian example below.

Example: Gaussian Naive Bayes (Continuous Data)

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Model
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict
print("Predictions:", gnb.predict(X_test))

The "From Scratch" Approach (Educational)

To understand how it works, here is a simplified Gaussian Naive Bayes implementation written with nothing but NumPy.

Naive Bayes scores each class with the posterior $P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$: the class prior multiplied by the product of the per-feature likelihoods, which is exactly where the independence assumption enters.

import numpy as np
class NaiveBayesScratch:
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)
        # Calculate mean, var, and prior for each class
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._priors = np.zeros(n_classes, dtype=np.float64)
        for idx, c in enumerate(self._classes):
            X_c = X[y == c]
            self._mean[idx, :] = X_c.mean(axis=0)
            # Add a tiny epsilon so a zero-variance feature cannot cause division by zero in the Gaussian pdf
            self._var[idx, :] = X_c.var(axis=0) + 1e-9
            self._priors[idx] = X_c.shape[0] / float(n_samples)
    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)
    def _predict(self, x):
        posteriors = []
        # Calculate posterior probability for each class
        for idx, c in enumerate(self._classes):
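            # Work in log space: summing log-probabilities avoids numerical underflow from multiplying many tiny numbers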
            prior = np.log(self._priors[idx])
            posterior = np.sum(np.log(self._pdf(idx, x)))
            posterior = prior + posterior
            posteriors.append(posterior)
        # Return class with highest posterior
        return self._classes[np.argmax(posteriors)]
    def _pdf(self, class_idx, x):
        # Probability Density Function (Gaussian)
        mean = self._mean[class_idx]
        var = self._var[class_idx]
        numerator = np.exp(-((x - mean) ** 2) / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
# Testing the scratch model (on its own training data, purely to sanity-check the implementation)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_redundant=0, random_state=42)
model = NaiveBayesScratch()
model.fit(X, y)
predictions = model.predict(X)
print(f"Accuracy from scratch: {np.mean(predictions == y)}")

Summary of Pros and Cons

Pros:

  • Extremely Fast: Training is almost instantaneous.
  • Works on Small Data: Doesn't need massive datasets to work well.
  • Good for Text: Handles high-dimensional data (like thousands of words) better than many complex algorithms.

Cons:

  • The "Naive" Assumption: It assumes features are independent (e.g., the word "Bank" appearing is unrelated to the word "Money" appearing). In reality, this is rarely true, but the algorithm works anyway.
  • Zero Frequency: If a word never appears with a given class in the training data, its estimated conditional probability is zero, which wipes out the entire product for that class. Scikit-learn avoids this automatically with Laplace/Lidstone smoothing, controlled by the alpha parameter (default alpha=1.0); see the short sketch below.
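
To make the smoothing point concrete, here is a minimal sketch that reuses the vectorized X_train and y_train from the first example and compares the default alpha=1.0 (Laplace smoothing) with a near-zero alpha; the exact numbers are unimportant, the point is that smoothing keeps every conditional log-probability finite and reasonable.

from sklearn.naive_bayes import MultinomialNB
# alpha=1.0 (the default) adds a pseudo-count of 1 to every word in every class
smoothed = MultinomialNB(alpha=1.0)
smoothed.fit(X_train, y_train)
# A near-zero alpha approaches the unsmoothed estimate: words a class has never
# seen get vanishingly small conditional probabilities
barely_smoothed = MultinomialNB(alpha=1e-10)
barely_smoothed.fit(X_train, y_train)
print('min log P(word|class), alpha=1.0  :', smoothed.feature_log_prob_.min())
print('min log P(word|class), alpha=1e-10:', barely_smoothed.feature_log_prob_.min())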