Of course! Let's dive deep into how to calculate the Area Under the Curve (AUC) using Python and the scikit-learn library.

What is AUC?
The Area Under the Curve (AUC) most commonly refers to the Area Under the Receiver Operating Characteristic (ROC) Curve. It's a performance measurement for classification problems at various threshold settings.
- ROC Curve: A plot that shows the performance of a classification model at all classification thresholds. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
  - True Positive Rate (Recall/Sensitivity): TP / (TP + FN). Of all the actual positives, how many did we correctly predict?
  - False Positive Rate: FP / (FP + TN). Of all the actual negatives, how many did we incorrectly predict as positive?
- AUC Score: The AUC of the ROC curve provides a single-number summary of the model's ability to discriminate between positive and negative classes (see the worked example after this list).
  - AUC = 1.0: Perfect classifier. It ranks all positive instances higher than all negative instances.
  - AUC = 0.5: No discriminative ability, equivalent to random guessing.
  - AUC < 0.5: The model is worse than random guessing. It's systematically getting it wrong.
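A useful way to read the score is the ranking interpretation: AUC equals the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one (ties counted as half). The snippet below is a minimal sketch of that equivalence on a tiny made-up set of labels and scores (the numbers are purely illustrative), comparing a brute-force pairwise count against sklearn.metrics.roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data: three negatives, three positives, with made-up scores
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.30])

# Brute-force AUC: fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a correct pair
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p, n) for p in pos for n in neg]
correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
auc_manual = correct / len(pairs)

print(f"Pairwise-ranking AUC: {auc_manual:.4f}")                       # 0.7778
print(f"roc_auc_score:        {roc_auc_score(y_true, y_score):.4f}")   # same value
```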
How to Calculate AUC in sklearn
sklearn provides a straightforward function to calculate the AUC. The key steps are:

- Train a classification model (e.g., Logistic Regression, Random Forest).
- Get the prediction probabilities for the positive class. You need probabilities, not just the final class labels (0 or 1).
- Use sklearn.metrics.roc_auc_score to calculate the AUC from the true labels and the predicted probabilities.
Let's walk through a complete example.
Step 1: Import Necessary Libraries
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
```
Step 2: Create a Sample Dataset
We'll use make_classification to create a synthetic binary classification dataset.
```python
# Generate a synthetic dataset
X, y = make_classification(
    n_samples=1000,    # 1000 data points
    n_features=20,     # 20 features
    n_informative=5,   # 5 of which are useful
    n_redundant=5,     # 5 are linear combinations of the useful ones
    n_classes=2,       # 2 classes (0 and 1)
    random_state=42    # for reproducibility
)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_test: {y_test.shape}")
print(f"Class distribution in y_test: {np.bincount(y_test)}")
```
Step 3: Train a Classification Model
We'll use a simple LogisticRegression model.
```python
# Initialize and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```
Step 4: Get Predicted Probabilities
This is the most critical step. We need the probability that the model assigns to the positive class (class 1).

```python
# Get the predicted probabilities for the positive class (class 1)
y_pred_proba = model.predict_proba(X_test)[:, 1]

print("\nFirst 5 predicted probabilities for class 1:")
print(y_pred_proba[:5])
```
Step 5: Calculate the AUC Score
Now we can use roc_auc_score with the true labels (y_test) and the predicted probabilities (y_pred_proba).
```python
# Calculate the AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"\nAUC Score: {auc_score:.4f}")
```
Output:
```
AUC Score: 0.9261
```
An AUC of 0.9261 is excellent, indicating that the model has a very good ability to distinguish between the two classes.
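Note that roc_auc_score only needs scores that rank the positive class higher; they do not have to be calibrated probabilities. For estimators that expose decision_function instead of predict_proba (for example LinearSVC), the raw decision scores can be passed directly. Here is a minimal sketch reusing the train/test split from above:

```python
from sklearn.svm import LinearSVC

# LinearSVC has no predict_proba, but its decision_function output
# ranks samples, which is all roc_auc_score needs
svm = LinearSVC(max_iter=10000, random_state=42)
svm.fit(X_train, y_train)

svm_scores = svm.decision_function(X_test)
print(f"AUC from decision_function scores: {roc_auc_score(y_test, svm_scores):.4f}")
```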
Visualizing the ROC Curve
To better understand the AUC score, it's very helpful to plot the ROC curve.
Step 6: Calculate FPR and TPR
The roc_curve function calculates the FPR, TPR, and corresponding thresholds for you.
```python
# Calculate the FPR, TPR, and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
```
Step 7: Plot the Curve
We'll use matplotlib to create the plot.
```python
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random classifier (AUC = 0.50)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
```
This will generate a plot showing the model's performance curve compared to a random guess. The larger the area under the orange curve, the better the model.
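If you prefer a one-liner, newer versions of scikit-learn (1.0 and later) provide the RocCurveDisplay helper, which wraps the roc_curve-plus-matplotlib steps above. This is a minimal sketch assuming a recent sklearn install; the name argument is just a legend label.

```python
from sklearn.metrics import RocCurveDisplay

# Build the ROC plot directly from true labels and predicted scores
RocCurveDisplay.from_predictions(y_test, y_pred_proba, name="Logistic Regression")
plt.show()
```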
Important Considerations and Common Pitfalls
- Probabilities, Not Predictions: roc_auc_score requires the probability estimates for the positive class, not the final class labels (e.g., the output of model.predict(X_test)). A short demonstration follows this list.
  - Correct: model.predict_proba(X_test)[:, 1]
  - Incorrect: model.predict(X_test)
- Multi-Class Classification: The standard AUC-ROC is for binary classification. For multi-class problems, you have two main strategies:
  - One-vs-Rest (OvR) / One-vs-All (OvA): Calculate the AUC for each class against all other classes, then average the results.
  - One-vs-One (OvO): Calculate the AUC for every unique pair of classes and then average the results.
sklearn's roc_auc_score handles this automatically with the multi_class parameter:

```python
# Example for multi-class
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_multi, y_multi = make_classification(n_samples=1000, n_features=20, n_classes=3, n_informative=5, random_state=42)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi, test_size=0.3, random_state=42)

model_multi = RandomForestClassifier(random_state=42)
model_multi.fit(X_train_multi, y_train_multi)

# Get probabilities for all classes
y_pred_proba_multi = model_multi.predict_proba(X_test_multi)

# Calculate AUC using One-vs-Rest strategy
auc_multi_ovr = roc_auc_score(y_test_multi, y_pred_proba_multi, multi_class='ovr')
print(f"\nMulti-class AUC (OvR): {auc_multi_ovr:.4f}")

# Calculate AUC using One-vs-One strategy
auc_multi_ovo = roc_auc_score(y_test_multi, y_pred_proba_multi, multi_class='ovo')
print(f"Multi-class AUC (OvO): {auc_multi_ovo:.4f}")
```

- Imbalanced Datasets: AUC is a very good metric for imbalanced datasets because it evaluates the model's ranking ability across all thresholds, not just the one that minimizes error. It is not affected by the class distribution in the same way as accuracy.
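To make the first pitfall concrete, the sketch below (reusing model, X_test, and y_test from the binary example) shows that passing hard 0/1 labels still runs without error but effectively evaluates a single threshold, typically yielding a lower AUC than the probability-based version:

```python
# Correct: rank-aware scores (probabilities for the positive class)
auc_from_proba = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Incorrect: hard 0/1 labels; no error is raised, but only one threshold is evaluated
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))

print(f"AUC from predict_proba: {auc_from_proba:.4f}")
print(f"AUC from predict:       {auc_from_labels:.4f}")
```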
Summary of Key sklearn Functions
| Function | Purpose |
|---|---|
| sklearn.metrics.roc_auc_score(y_true, y_score) | The main function to calculate the AUC. Takes true labels and predicted probabilities. |
| sklearn.metrics.roc_curve(y_true, y_score) | Calculates the FPR, TPR, and thresholds for plotting the ROC curve. |
| model.predict_proba(X)[:, 1] | The standard way to get the predicted probabilities for the positive class. |
