Of course! Here is a complete, well-commented XGBoost demo in Python, designed to be easy to follow for beginners.

This demo will walk you through the entire typical machine learning workflow:
- Setup: Installing necessary libraries.
- Data Loading: Using a built-in dataset.
- Data Preparation: Splitting data for training and testing.
- Model Training: Creating and training an XGBoost model.
- Evaluation: Making predictions and assessing model performance.
- Feature Importance: Understanding which features the model used most.
- Saving & Loading: How to save a trained model for later use.
Step 1: Prerequisites & Installation
First, you need to have Python installed. Then, you'll need to install xgboost, scikit-learn (for data splitting and evaluation metrics), matplotlib and seaborn (for plotting), and pandas (for tabulating feature importances).
Open your terminal or command prompt and run:
pip install xgboost scikit-learn matplotlib seaborn pandas
Step 2: The Complete Python Demo Code
Here is the full script. You can copy and paste this into a Python file (e.g., xgboost_demo.py) and run it.

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# --- 1. Generate and Load Data ---
# For this demo, we'll create a synthetic dataset.
# In a real-world scenario, you would load your own data here (e.g., from a CSV file).
print("--- 1. Generating Data ---")
X, y = make_classification(
    n_samples=1000,     # 1000 data points
    n_features=20,      # 20 features
    n_informative=10,   # 10 of which are useful
    n_redundant=5,      # 5 are generated from the useful ones
    n_classes=2,        # a binary classification problem
    random_state=42     # for reproducibility
)
print(f"Data shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print("-" * 30)
# --- 2. Split Data into Training and Testing Sets ---
# We split the data to train the model on one subset (training set)
# and evaluate its performance on a new, unseen subset (testing set).
print("--- 2. Splitting Data ---")
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% of data for testing
    random_state=42,  # for reproducibility
    stratify=y        # keeps the proportion of classes the same in the train/test sets
)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print("-" * 30)
# --- 3. Initialize and Train the XGBoost Model ---
# We use XGBClassifier for classification tasks.
# 'objective': 'binary:logistic' is for binary classification.
# 'eval_metric': 'logloss' is a common metric for binary classification.
print("--- 3. Training the XGBoost Model ---")
xgb_classifier = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    n_estimators=100,   # number of boosting rounds (trees)
    learning_rate=0.1,  # step size shrinkage
    max_depth=5,        # maximum depth of a tree
    random_state=42
)
# Train the model
xgb_classifier.fit(X_train, y_train)
print("Model training complete.")
print("-" * 30)
# --- 4. Make Predictions ---
# Now we use the trained model to make predictions on the test set.
print("--- 4. Making Predictions ---)
y_pred_proba = xgb_classifier.predict_proba(X_test)[:, 1]  # Probabilities for the positive class (shown for reference; not used below)
y_pred = xgb_classifier.predict(X_test) # Class labels (0 or 1)
print("Predictions made.")
print("-" * 30)
# --- 5. Evaluate the Model ---
# We compare the model's predictions (y_pred) with the actual labels (y_test).
print("--- 5. Evaluating the Model ---")
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# A more detailed report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Confusion Matrix
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Visualize the Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
print("-" * 30)
# --- 6. Visualize Feature Importance ---
# XGBoost can tell us which features were most influential in making decisions.
print("--- 6. Visualizing Feature Importance ---")
feature_importance = xgb_classifier.feature_importances_
features = [f'feature_{i}' for i in range(X.shape[1])]
# Create a DataFrame for easier plotting (pandas is imported at the top of the script)
importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importance
}).sort_values(by='Importance', ascending=False)
# Plot the top 10 most important features
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df.head(10))
plt.title('Top 10 Most Important Features')
plt.xlabel('Importance Score')
plt.ylabel('Feature')
plt.show()
print("-" * 30)
# --- 7. Save and Load the Model ---
# Trained models can be saved to disk and loaded later without retraining.
print("--- 7. Saving and Loading the Model ---")
# Save the model
xgb_classifier.save_model('xgb_model.json')
print("Model saved to 'xgb_model.json'")
# Create a new instance and load the model
loaded_model = xgb.XGBClassifier()
loaded_model.load_model('xgb_model.json')
print("Model loaded from 'xgb_model.json'")
# Verify that the loaded model works
loaded_predictions = loaded_model.predict(X_test)
loaded_accuracy = accuracy_score(y_test, loaded_predictions)
print(f"Accuracy of the loaded model: {loaded_accuracy:.4f}")
print("Model successfully saved and loaded.")
print("-" * 30)
Explanation of Key XGBoost Parameters
When you initialize XGBClassifier, you can tune several hyperparameters to improve performance. Here are some of the most important ones:
- n_estimators: The number of boosting rounds (i.e., the number of trees to build). More trees can lead to better performance but also increase the risk of overfitting and the training time.
- learning_rate (or eta): Shrinks the contribution of each tree after each boosting step. A lower learning rate requires more trees (n_estimators) to reach the same level of performance, but it often leads to a better, more robust model.
- max_depth: The maximum depth of a tree. Deeper trees can model more complex relationships but are more likely to overfit the training data.
- subsample: The fraction of samples used for fitting the individual base learners. If subsample=1.0, the entire training set is used for every tree. Setting it to a value like 0.8 introduces randomness, which can help prevent overfitting.
- colsample_bytree: The fraction of features used for fitting the individual base learners. Similar to subsample, this adds randomness and can improve generalization.
- gamma: The minimum loss reduction required to make a further partition on a leaf node of the tree. A higher gamma value makes the algorithm more conservative.
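As a quick illustration, here is a minimal sketch of how these parameters fit together in a single constructor call, reusing the training data from the demo. The specific values (and the name tuned_classifier) are illustrative choices for demonstration, not tuned recommendations:

# Illustrative example: an XGBClassifier using the parameters discussed above.
# The values are arbitrary for demonstration purposes.
tuned_classifier = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    n_estimators=300,      # more trees...
    learning_rate=0.05,    # ...paired with a smaller learning rate
    max_depth=4,           # shallower trees to limit overfitting
    subsample=0.8,         # each tree sees 80% of the rows
    colsample_bytree=0.8,  # each tree sees 80% of the features
    gamma=1.0,             # require a minimum loss reduction before splitting
    random_state=42
)
tuned_classifier.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, tuned_classifier.predict(X_test)):.4f}")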
Next Steps
This demo provides a solid foundation. To further your learning, you can:
- Tune Hyperparameters: Use GridSearchCV or RandomizedSearchCV from scikit-learn to find the best combination of n_estimators, learning_rate, max_depth, etc. (see the first sketch after this list).
- Try a Different Dataset: Load your own dataset using pandas.read_csv() and adapt the code to it.
- Explore XGBoost's DMatrix: For very large datasets, XGBoost has its own optimized data structure called DMatrix, which can be faster than using NumPy arrays (see the second sketch after this list).
- Regression: For a regression problem, you would use xgb.XGBRegressor and change the objective to something like reg:squarederror (see the third sketch after this list).
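First, a minimal hyperparameter-search sketch using RandomizedSearchCV, reusing the train/test split from the demo. The search space and n_iter are arbitrary illustrative choices, not recommendations:

from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space -- the ranges here are examples only.
param_distributions = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
search = RandomizedSearchCV(
    estimator=xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42),
    param_distributions=param_distributions,
    n_iter=10,           # number of random parameter combinations to try
    cv=3,                # 3-fold cross-validation
    scoring='accuracy',
    random_state=42,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")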
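Second, a sketch of the DMatrix route, which uses XGBoost's native training API instead of the scikit-learn wrapper. It reuses the train/test arrays from the demo, and the parameter values mirror the ones used above:

# Wrap the NumPy arrays in XGBoost's optimized DMatrix structure.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'binary:logistic', 'eval_metric': 'logloss', 'max_depth': 5, 'eta': 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100)

# xgb.train returns a Booster; predict() gives probabilities for the positive class.
test_probabilities = booster.predict(dtest)
test_labels = (test_probabilities > 0.5).astype(int)
print(f"Native-API accuracy: {accuracy_score(y_test, test_labels):.4f}")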
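Third, a regression sketch. It uses scikit-learn's make_regression to generate synthetic data purely for illustration; the hyperparameter values simply mirror the classification demo:

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Synthetic regression data, purely for illustration.
X_reg, y_reg = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

regressor = xgb.XGBRegressor(
    objective='reg:squarederror',  # squared-error regression objective
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42
)
regressor.fit(Xr_train, yr_train)
predictions = regressor.predict(Xr_test)
print(f"Test MSE: {mean_squared_error(yr_test, predictions):.4f}")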

