Of course! Here is a complete, well-commented XGBoost demo in Python, designed to be easy to follow for beginners.

This demo will walk you through the entire typical machine learning workflow:
- Setup: Installing necessary libraries.
- Data Loading: Using a built-in dataset.
- Data Preparation: Splitting data for training and testing.
- Model Training: Creating and training an XGBoost model.
- Evaluation: Making predictions and assessing model performance.
- Feature Importance: Understanding which features the model used most.
- Saving & Loading: How to save a trained model for later use.
Step 1: Prerequisites & Installation
First, you need to have Python installed. Then, you'll need to install xgboost, scikit-learn (for data splitting and evaluation metrics), matplotlib and seaborn (for plotting), and pandas (for tabulating feature importances).
Open your terminal or command prompt and run:
pip install xgboost scikit-learn matplotlib seaborn pandas
Step 2: The Complete Python Demo Code
Here is the full script. You can copy and paste this into a Python file (e.g., xgboost_demo.py) and run it.

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# --- 1. Generate and Load Data ---
# For this demo, we'll create a synthetic dataset.
# In a real-world scenario, you would load your own data here (e.g., from a CSV file).
print("--- 1. Generating Data ---")
X, y = make_classification(
    n_samples=1000,     # 1000 data points
    n_features=20,      # 20 features
    n_informative=10,   # 10 of which are useful
    n_redundant=5,      # 5 are generated from the useful ones
    n_classes=2,        # a binary classification problem
    random_state=42     # for reproducibility
)
print(f"Data shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print("-" * 30)
# --- 2. Split Data into Training and Testing Sets ---
# We split the data to train the model on one subset (training set)
# and evaluate its performance on a new, unseen subset (testing set).
print("--- 2. Splitting Data ---")
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% of data for testing
    random_state=42,  # for reproducibility
    stratify=y        # keeps the proportion of classes the same in the train/test sets
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print("-" * 30)
# --- 3. Initialize and Train the XGBoost Model ---
# We use XGBClassifier for classification tasks.
# 'objective': 'binary:logistic' is for binary classification.
# 'eval_metric': 'logloss' is a common metric for binary classification.
print("--- 3. Training the XGBoost Model ---")
xgb_classifier = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    n_estimators=100,   # number of boosting rounds (trees)
    learning_rate=0.1,  # step size shrinkage
    max_depth=5,        # maximum depth of a tree
    random_state=42
)
# Train the model
xgb_classifier.fit(X_train, y_train)
print("Model training complete.")
print("-" * 30)
# --- 4. Make Predictions ---
# Now we use the trained model to make predictions on the test set.
print("--- 4. Making Predictions ---)
y_pred_proba = xgb_classifier.predict_proba(X_test)[:, 1]  # Probabilities for the positive class (shown for reference; not used below)
y_pred = xgb_classifier.predict(X_test) # Class labels (0 or 1)
print("Predictions made.")
print("-" * 30)
# --- 5. Evaluate the Model ---
# We compare the model's predictions (y_pred) with the actual labels (y_test).
print("--- 5. Evaluating the Model ---")
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# A more detailed report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion Matrix
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Visualize the Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
print("-" * 30)
# --- 6. Visualize Feature Importance ---
# XGBoost can tell us which features were most influential in making decisions.
print("--- 6. Visualizing Feature Importance ---")
feature_importance = xgb_classifier.feature_importances_
features = [f'feature_{i}' for i in range(X.shape[1])]
# Create a DataFrame for easier plotting (pandas is imported at the top of the script)
importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importance
}).sort_values(by='Importance', ascending=False)
# Plot the top 10 most important features
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df.head(10))
plt.title('Top 10 Most Important Features')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.show()
print("-" * 30)
# --- 7. Save and Load the Model ---
# Trained models can be saved to disk and loaded later without retraining.
print("--- 7. Saving and Loading the Model ---")
# Save the model
xgb_classifier.save_model('xgb_model.json')
print("Model saved to 'xgb_model.json'")
# Create a new instance and load the model
loaded_model = xgb.XGBClassifier()
loaded_model.load_model('xgb_model.json')
print("Model loaded from 'xgb_model.json'")
# Verify that the loaded model works
loaded_predictions = loaded_model.predict(X_test)
loaded_accuracy = accuracy_score(y_test, loaded_predictions)
print(f"Accuracy of the loaded model: {loaded_accuracy:.4f}")
print("Model successfully saved and loaded.")
print("-" * 30)
Explanation of Key XGBoost Parameters
When you initialize XGBClassifier, you can tune several hyperparameters to improve performance. Here are some of the most important ones:
- n_estimators: The number of boosting rounds (i.e., the number of trees to build). More trees can lead to better performance but also increase the risk of overfitting and the training time.
- learning_rate (or eta): Shrinks the contribution of each tree after each boosting step. A lower learning rate requires more trees (n_estimators) to reach the same level of performance, but it often leads to a better, more robust model.
- max_depth: The maximum depth of a tree. Deeper trees can model more complex relationships but are more likely to overfit the training data.
- subsample: The fraction of samples used for fitting the individual base learners. If subsample=1.0, the entire training set is used for every tree. Setting it to a value like 0.8 introduces randomness, which can help prevent overfitting.
- colsample_bytree: The fraction of features used for fitting the individual base learners. Similar to subsample, this adds randomness and can improve generalization.
- gamma: The minimum loss reduction required to make a further partition on a leaf node of the tree. A higher gamma value makes the algorithm more conservative.
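As a quick illustration, here is a minimal sketch of how these parameters fit together in a single constructor call, reusing the training data from the demo. The specific values (and the name tuned_classifier) are illustrative choices for demonstration, not tuned recommendations:

# Illustrative example: an XGBClassifier using the parameters discussed above.
# The values are arbitrary for demonstration purposes.
tuned_classifier = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    n_estimators=300,      # more trees...
    learning_rate=0.05,    # ...paired with a smaller learning rate
    max_depth=4,           # shallower trees to limit overfitting
    subsample=0.8,         # each tree sees 80% of the rows
    colsample_bytree=0.8,  # each tree sees 80% of the features
    gamma=1.0,             # require a minimum loss reduction before splitting
    random_state=42
)
tuned_classifier.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, tuned_classifier.predict(X_test)):.4f}")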
Next Steps
This demo provides a solid foundation. To further your learning, you can:
- Tune Hyperparameters: Use GridSearchCV or RandomizedSearchCV from scikit-learn to find the best combination of n_estimators, learning_rate, max_depth, etc. (see the first sketch after this list).
- Try a Different Dataset: Load your own dataset using pandas.read_csv() and adapt the code to it.
- Explore XGBoost's DMatrix: For very large datasets, XGBoost has its own optimized data structure called DMatrix, which can be faster than using NumPy arrays (see the second sketch after this list).
- Regression: For a regression problem, you would use xgb.XGBRegressor and change the objective to something like reg:squarederror (see the third sketch after this list).
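First, a minimal hyperparameter-search sketch using RandomizedSearchCV, reusing the train/test split from the demo. The search space and n_iter are arbitrary illustrative choices, not recommendations:

from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space -- the ranges here are examples only.
param_distributions = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
search = RandomizedSearchCV(
    estimator=xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42),
    param_distributions=param_distributions,
    n_iter=10,           # number of random parameter combinations to try
    cv=3,                # 3-fold cross-validation
    scoring='accuracy',
    random_state=42,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")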
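Second, a sketch of the DMatrix route, which uses XGBoost's native training API instead of the scikit-learn wrapper. It reuses the train/test arrays from the demo, and the parameter values mirror the ones used above:

# Wrap the NumPy arrays in XGBoost's optimized DMatrix structure.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'binary:logistic', 'eval_metric': 'logloss', 'max_depth': 5, 'eta': 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100)

# xgb.train returns a Booster; predict() gives probabilities for the positive class.
test_probabilities = booster.predict(dtest)
test_labels = (test_probabilities > 0.5).astype(int)
print(f"Native-API accuracy: {accuracy_score(y_test, test_labels):.4f}")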
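Third, a regression sketch. It uses scikit-learn's make_regression to generate synthetic data purely for illustration; the hyperparameter values simply mirror the classification demo:

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Synthetic regression data, purely for illustration.
X_reg, y_reg = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

regressor = xgb.XGBRegressor(
    objective='reg:squarederror',  # squared-error regression objective
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42
)
regressor.fit(Xr_train, yr_train)
predictions = regressor.predict(Xr_test)
print(f"Test MSE: {mean_squared_error(yr_test, predictions):.4f}")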

