
How do you calculate AUC with sklearn in Python?

Of course! Let's dive deep into how to calculate the Area Under the Curve (AUC) using Python and the scikit-learn library.


What is AUC?

The Area Under the Curve (AUC) most commonly refers to the Area Under the Receiver Operating Characteristic (ROC) Curve. It's a performance measurement for classification problems at various threshold settings.

  • ROC Curve: A plot that shows the performance of a classification model at all classification thresholds. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR).

    • True Positive Rate (Recall/Sensitivity): TP / (TP + FN) - Of all the actual positives, how many did we correctly predict?
    • False Positive Rate: FP / (FP + TN) - Of all the actual negatives, how many did we incorrectly predict as positive?
  • AUC Score: The AUC of the ROC curve provides a single-number summary of the model's ability to discriminate between positive and negative classes (a tiny worked example follows this list).

    • AUC = 1.0: Perfect classifier. It ranks all positive instances higher than all negative instances.
    • AUC = 0.5: No discriminative ability, equivalent to random guessing.
    • AUC < 0.5: The model is worse than random guessing. It's systematically getting it wrong.
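To make the ranking interpretation concrete, here is a tiny hand-made example (the labels and scores below are invented purely for illustration):

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])             # ground-truth labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8])  # model scores for the positive class
# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUC = 0.75
print(roc_auc_score(y_true, y_scores))      # 0.75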

How to Calculate AUC in sklearn

sklearn provides a straightforward function to calculate the AUC. The key steps are:

  1. Train a classification model (e.g., Logistic Regression, Random Forest).
  2. Get the prediction probabilities for the positive class. You need probabilities, not just the final class labels (0 or 1).
  3. Use sklearn.metrics.roc_auc_score to calculate the AUC from the true labels and the predicted probabilities.

Let's walk through a complete example.

Step 1: Import Necessary Libraries

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

Step 2: Create a Sample Dataset

We'll use make_classification to create a synthetic binary classification dataset.

# Generate a synthetic dataset
X, y = make_classification(
    n_samples=1000,        # 1000 data points
    n_features=20,        # 20 features
    n_informative=5,      # 5 of which are useful
    n_redundant=5,        # 5 are linear combinations of the useful ones
    n_classes=2,          # 2 classes (0 and 1)
    random_state=42       # for reproducibility
)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_test: {y_test.shape}")
print(f"Class distribution in y_test: {np.bincount(y_test)}")

Step 3: Train a Classification Model

We'll use a simple LogisticRegression model.

# Initialize and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Step 4: Get Predicted Probabilities

This is the most critical step. We need the probability that the model assigns to the positive class (class 1).

# Get the predicted probabilities for the positive class (class 1)
y_pred_proba = model.predict_proba(X_test)[:, 1]
print("\nFirst 5 predicted probabilities for class 1:")
print(y_pred_proba[:5])
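One small sanity check worth adding here: column index 1 of predict_proba corresponds to model.classes_[1], which happens to be class 1 in this dataset; if your labels were strings or ordered differently, check classes_ before slicing:

# Confirm which column of predict_proba belongs to the positive class
print(model.classes_)   # [0 1] -> column index 1 holds the probability of class 1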

Step 5: Calculate the AUC Score

Now we can use roc_auc_score with the true labels (y_test) and the predicted probabilities (y_pred_proba).

# Calculate the AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"\nAUC Score: {auc_score:.4f}")

Output:

AUC Score: 0.9261

An AUC of 0.9261 is excellent, indicating that the model has a very good ability to distinguish between the two classes.
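As an aside, if you prefer a cross-validated estimate rather than a single train/test split, sklearn's scorers also understand scoring='roc_auc'; a minimal sketch reusing the X, y, and model defined above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated AUC on the full dataset
cv_auc = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"Cross-validated AUC: {cv_auc.mean():.4f} (+/- {cv_auc.std():.4f})")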


Visualizing the ROC Curve

To better understand the AUC score, it's very helpful to plot the ROC curve.

Step 6: Calculate FPR and TPR

The roc_curve function calculates the FPR, TPR, and corresponding thresholds for you.

# Calculate the FPR, TPR, and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
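The thresholds array is also handy if you later need to pick an operating point; a minimal sketch using Youden's J statistic (this is an extra step, not something the AUC itself requires):

# Threshold that maximizes TPR - FPR (Youden's J statistic)
best_idx = np.argmax(tpr - fpr)
print(f"Best threshold: {thresholds[best_idx]:.4f} "
      f"(TPR={tpr[best_idx]:.3f}, FPR={fpr[best_idx]:.3f})")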

Step 7: Plot the Curve

We'll use matplotlib to create the plot.

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random classifier (AUC = 0.50)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

This will generate a plot showing the model's performance curve compared to a random guess. The larger the area under the orange curve, the better the model.
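For what it's worth, newer scikit-learn versions (1.0+) bundle these plotting steps into RocCurveDisplay, so an equivalent shortcut, assuming a recent sklearn, would be:

from sklearn.metrics import RocCurveDisplay

# Computes the ROC curve and AUC from the fitted model and test data, then plots them
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.show()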


Important Considerations and Common Pitfalls

  1. Probabilities, Not Predictions: roc_auc_score needs the probability estimates (or decision scores) for the positive class, not the hard class labels returned by model.predict(X_test). For classifiers without predict_proba, see the decision_function sketch after this list.

    • Correct: model.predict_proba(X_test)[:, 1]
    • Incorrect: model.predict(X_test)
  2. Multi-Class Classification: The standard AUC-ROC is for binary classification. For multi-class problems, you have two main strategies:

    • One-vs-Rest (OvR) / One-vs-All (OvA): Calculate the AUC for each class against all other classes, then average the results.
    • One-vs-One (OvO): Calculate the AUC for every unique pair of classes and then average the results.

    sklearn's roc_auc_score handles this automatically with the multi_class parameter.

    # Example for multi-class
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    X_multi, y_multi = make_classification(n_samples=1000, n_features=20, n_classes=3, n_informative=5, random_state=42)
    X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi, test_size=0.3, random_state=42)
    model_multi = RandomForestClassifier(random_state=42)
    model_multi.fit(X_train_multi, y_train_multi)
    # Get probabilities for all classes
    y_pred_proba_multi = model_multi.predict_proba(X_test_multi)
    # Calculate AUC using One-vs-Rest strategy
    auc_multi_ovr = roc_auc_score(y_test_multi, y_pred_proba_multi, multi_class='ovr')
    print(f"\nMulti-class AUC (OvR): {auc_multi_ovr:.4f}")
    # Calculate AUC using One-vs-One strategy
    auc_multi_ovo = roc_auc_score(y_test_multi, y_pred_proba_multi, multi_class='ovo')
    print(f"Multi-class AUC (OvO): {auc_multi_ovo:.4f}")
  3. Imbalanced Datasets: AUC works well on imbalanced datasets because it measures the model's ranking ability across all thresholds rather than at a single fixed threshold, so it is not inflated simply by predicting the majority class the way accuracy is. For heavily skewed data, the precision-recall curve (average_precision_score) is a useful complement, since ROC AUC can still look optimistic when negatives vastly outnumber positives.
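Related to pitfall 1 above: some classifiers (for example, SVC without probability=True) do not expose predict_proba at all, but roc_auc_score is equally happy with raw decision scores; a minimal sketch reusing the train/test split from the binary example:

from sklearn.svm import SVC

svm = SVC(kernel='rbf')   # no probability=True, so no predict_proba available
svm.fit(X_train, y_train)
# decision_function returns signed distances from the separating hyperplane;
# roc_auc_score only needs scores that rank positives above negatives
svm_auc = roc_auc_score(y_test, svm.decision_function(X_test))
print(f"SVM AUC from decision scores: {svm_auc:.4f}")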

Summary of Key sklearn Functions

  • sklearn.metrics.roc_auc_score(y_true, y_score): the main function for calculating AUC; takes true labels and predicted probabilities (or decision scores).
  • sklearn.metrics.roc_curve(y_true, y_score): computes the FPR, TPR, and thresholds needed to plot the ROC curve.
  • model.predict_proba(X)[:, 1]: the standard way to get the predicted probabilities for the positive class.