
How do you determine the optimal cutoff value for an ROC curve in Python?

Of course! Finding the optimal ROC cutoff (or threshold) is a common task in machine learning for binary classification. The goal is to find the threshold that best balances the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR).


Here’s a complete guide on how to do this in Python, including the concepts, code examples, and explanations.

Understanding the Concepts

  • ROC Curve (Receiver Operating Characteristic Curve): A plot that visualizes the performance of a binary classifier at all classification thresholds. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
    • TPR (Sensitivity or Recall): TP / (TP + FN) - The proportion of actual positives that are correctly identified.
    • FPR (1 - Specificity): FP / (FP + TN) - The proportion of actual negatives that are incorrectly identified as positives.
  • AUC (Area Under the Curve): The area under the ROC curve. An AUC of 0.5 means the model is no better than random guessing. An AUC of 1.0 means a perfect classifier.
  • Cutoff (Threshold): The probability value that separates the "positive" class from the "negative" class. By default, this is 0.5. If a model's predicted probability for a sample is > 0.5, it's classified as positive; otherwise, it's negative.
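These definitions can be checked on a tiny hand-made example (the labels and probabilities below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy data: 4 actual negatives and 4 actual positives with made-up scores
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.30, 0.35, 0.80, 0.40, 0.60, 0.70, 0.90])

# roc_curve sweeps the threshold across the observed scores and reports
# the (FPR, TPR) pair at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
for t, f, s in zip(thresholds, fpr, tpr):
    print(f"threshold={t:.2f}  FPR={f:.2f}  TPR={s:.2f}")

# AUC equals the fraction of (positive, negative) pairs that are ranked
# correctly: here 13 of the 16 pairs, i.e. 0.8125
print("AUC:", roc_auc_score(y_true, y_prob))
```

Walking down the printed rows shows exactly how lowering the threshold trades FPR for TPR, which is the curve the rest of this guide optimizes over.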

Why find a new cutoff? The default 0.5 threshold isn't always optimal. For example:

  • In medical diagnosis, you might want a high TPR (catch all actual sick people) and are willing to accept a higher FPR (some healthy people get a false alarm).
  • In spam detection, you might want a low FPR (avoid marking important emails as spam) and are okay with a slightly lower TPR (letting a few spam emails through).

The Main Methods for Finding the Optimal Cutoff

There are several popular methods to find the "best" cutoff. We'll cover the two most common ones.

Method 1: Youden's J Statistic

This is one of the most widely used methods. It aims to find the threshold that maximizes the difference between the True Positive Rate and the False Positive Rate.


Youden's J = TPR - FPR (equivalently, Sensitivity + Specificity - 1)

The cutoff that maximizes this statistic is chosen as the optimal one. This method is good for finding a general-purpose cutoff that balances sensitivity and specificity.

Method 2: Minimizing the Distance to the Top-Left Corner

This method finds the point on the ROC curve that is closest to the ideal point of (0, 1) (perfect classification with zero FPR and maximum TPR).

The distance is calculated using the Euclidean distance formula: Distance = sqrt((FPR - 0)² + (TPR - 1)²)

The cutoff that minimizes this distance is considered optimal.
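Both criteria can be previewed on a small hand-built ROC curve (the FPR/TPR/threshold values below are illustrative, not from a fitted model):

```python
import numpy as np

# Hand-built ROC points (FPR, TPR) and their matching thresholds --
# illustrative numbers only
fpr = np.array([0.0, 0.1, 0.2, 0.4, 1.0])
tpr = np.array([0.0, 0.5, 0.8, 0.9, 1.0])
thresholds = np.array([1.0, 0.8, 0.6, 0.4, 0.0])

# Method 1: maximize Youden's J = TPR - FPR
j = tpr - fpr
print("Youden threshold:", thresholds[np.argmax(j)])

# Method 2: minimize the Euclidean distance to the perfect corner (0, 1)
d = np.sqrt(fpr**2 + (tpr - 1)**2)
print("Distance threshold:", thresholds[np.argmin(d)])
```

On this toy curve both criteria select the same point (FPR=0.2, TPR=0.8, threshold 0.6), which matches the observation later in this guide that the two methods often agree.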


Python Implementation using scikit-learn and matplotlib

Let's walk through a full example.

Step 1: Setup and Create Sample Data

First, let's import the necessary libraries and generate some sample data with a model.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, confusion_matrix
# Generate synthetic data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    n_redundant=5,
    n_classes=2,
    random_state=42
)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Get the predicted probabilities for the positive class (class 1)
y_scores = model.predict_proba(X_test)[:, 1]

Step 2: Calculate the ROC Curve and Find Optimal Cutoffs

Now, we'll use roc_curve to get the FPR, TPR, and thresholds. Then, we'll apply our two methods to find the best thresholds.

# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
# --- Method 1: Youden's J Statistic ---
# Calculate the Youden's J statistic for each threshold
youden_j = tpr - fpr
# Find the index of the maximum Youden's J statistic
optimal_idx_youden = np.argmax(youden_j)
optimal_threshold_youden = thresholds[optimal_idx_youden]
print(f"Method 1 - Youden's J Statistic Optimal Threshold: {optimal_threshold_youden:.4f}")
# Method 1 - Youden's J Statistic Optimal Threshold: 0.5310
# --- Method 2: Minimizing Distance to (0, 1) ---
# Calculate the Euclidean distance to the top-left corner (0, 1)
distances = np.sqrt((fpr - 0)**2 + (tpr - 1)**2)
# Find the index of the minimum distance
optimal_idx_distance = np.argmin(distances)
optimal_threshold_distance = thresholds[optimal_idx_distance]
print(f"Method 2 - Min Distance to (0, 1) Optimal Threshold: {optimal_threshold_distance:.4f}")
# Method 2 - Min Distance to (0, 1) Optimal Threshold: 0.5310
# Note: In this specific case, both methods found the same threshold.
# This won't always be true.

Step 3: Visualize the Results

A plot is the best way to understand what's happening.

# Calculate the AUC for the plot
roc_auc = auc(fpr, tpr)
# Create the plot
plt.figure(figsize=(10, 7))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
# Plot the optimal points from both methods
# Method 1
plt.scatter(fpr[optimal_idx_youden], tpr[optimal_idx_youden], 
            marker='o', color='red', s=100, 
            label=f'Optimal Threshold (Youden) = {optimal_threshold_youden:.2f}')
# Method 2
plt.scatter(fpr[optimal_idx_distance], tpr[optimal_idx_distance], 
            marker='x', color='blue', s=100, 
            label=f'Optimal Threshold (Distance) = {optimal_threshold_distance:.2f}')
# Plot the random guess line
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guess')
# Formatting the plot
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

This plot will clearly show the chosen optimal points on the ROC curve.

Step 4: Apply the New Threshold and Evaluate

Let's see how the model's performance changes when we use the new optimal threshold instead of the default 0.5.

# Function to apply a threshold and return predictions
def apply_threshold(y_scores, threshold):
    return (y_scores >= threshold).astype(int)
# Get predictions with the default threshold (0.5)
y_pred_default = model.predict(X_test)
# Get predictions with the optimal threshold
y_pred_optimal = apply_threshold(y_scores, optimal_threshold_youden)
# Print confusion matrices
print("--- Confusion Matrix with Default Threshold (0.5) ---")
print(confusion_matrix(y_test, y_pred_default))
print("\n--- Confusion Matrix with Optimal Threshold ({:.4f}) ---".format(optimal_threshold_youden))
print(confusion_matrix(y_test, y_pred_optimal))

Example Output:

--- Confusion Matrix with Default Threshold (0.5) ---
[[135   9]
 [ 18 138]]
--- Confusion Matrix with Optimal Threshold (0.5310) ---
[[136   8]
 [ 19 137]]

Analysis:

  • Default Threshold (0.5): We have 135 True Negatives and 138 True Positives.
  • Optimal Threshold (0.5310): We have 136 True Negatives and 137 True Positives.
    • TN increased by 1, meaning we correctly identified one more negative sample.
    • TP decreased by 1, meaning we missed one positive sample.
    • This trade-off might be desirable if correctly identifying negatives is more important in this context. The "best" threshold depends entirely on the cost of False Positives vs. False Negatives in your specific problem.
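As a sketch of this kind of comparison, the walkthrough above can be condensed so that precision and recall at the default and Youden-optimal thresholds print side by side (the exact numbers you see depend on your scikit-learn version and the random seed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_score, recall_score

# Same synthetic setup as the walkthrough above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# Youden-optimal threshold
fpr, tpr, thr = roc_curve(y_test, scores)
t_opt = thr[np.argmax(tpr - fpr)]

# Compare precision/recall at the default and optimal thresholds
for t in (0.5, t_opt):
    pred = (scores >= t).astype(int)
    print(f"t={t:.4f}  precision={precision_score(y_test, pred):.3f}  "
          f"recall={recall_score(y_test, pred):.3f}")
```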

Advanced Method: Cost-Benefit Analysis

For a more business-oriented approach, you can assign costs to different types of errors.

  • Cost of a False Positive (C_FP): The cost of misclassifying a negative as positive.
  • Cost of a False Negative (C_FN): The cost of misclassifying a positive as negative.

The optimal threshold is the one that minimizes the total expected cost.

Total Cost = (FP × C_FP) + (FN × C_FN)

You can calculate this for each threshold and pick the one with the lowest cost.

# Example: Assigning costs
cost_fp = 10  # Cost of a false positive is high
cost_fn = 50  # Cost of a false negative is even higher
# Calculate total cost for each threshold
# For each threshold, recompute the confusion matrix to get the FP and FN
# counts, then accumulate the total cost
costs = []
for threshold in thresholds:
    y_pred_temp = apply_threshold(y_scores, threshold)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_temp).ravel()
    total_cost = (fp * cost_fp) + (fn * cost_fn)
    costs.append(total_cost)
# Find the threshold with the minimum cost
optimal_idx_cost = np.argmin(costs)
optimal_threshold_cost = thresholds[optimal_idx_cost]
print(f"\nMethod 3 - Cost-Benefit Analysis Optimal Threshold: {optimal_threshold_cost:.4f}")
# Method 3 - Cost-Benefit Analysis Optimal Threshold: 0.4269
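Recomputing a confusion matrix at every threshold is O(n) per step. Since roc_curve already returns the TPR and FPR at each threshold, the FP and FN counts can be recovered from the class sizes and the whole cost curve computed in one vectorized pass. A minimal sketch on made-up toy data (substitute your own y_test / y_scores):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores -- made up for illustration
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.4, 0.1, 0.8, 0.7, 0.3, 0.9, 0.6, 0.55, 0.15])
cost_fp, cost_fn = 10, 50

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
n_pos, n_neg = y_true.sum(), (y_true == 0).sum()

# FPR = FP / N  ->  FP = FPR * N;  TPR = TP / P  ->  FN = (1 - TPR) * P
fp = fpr * n_neg
fn = (1 - tpr) * n_pos
costs = fp * cost_fp + fn * cost_fn
print("min-cost threshold:", thresholds[np.argmin(costs)])
```

This gives the same answer as the explicit loop above but avoids rebuilding predictions and confusion matrices per threshold, which matters when the test set is large.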

Summary

  • Youden's J Statistic: maximizes TPR - FPR. It picks the point on the ROC curve farthest above the diagonal line. Best for general-purpose thresholding when you want to balance sensitivity and specificity.
  • Min Distance to (0, 1): minimizes the distance to the perfect classifier. It picks the point geometrically closest to the top-left corner of the plot. Similar to Youden's J and often yields the same or a very similar threshold.
  • Cost-Benefit Analysis: minimizes the total expected cost. It requires defining the cost of FP and FN errors, then picks the threshold with the lowest total cost. Best for real-world applications where the business impact of errors is known and quantifiable.