
What are the key steps and best practices for Python preprocessing?

"Python preprocessing" is a broad term, but it most commonly refers to the crucial steps of cleaning and preparing data for machine learning or data analysis. A common saying in data science is "garbage in, garbage out" (GIGO), and preprocessing is how you turn "garbage" into high-quality, usable data.


Here’s a comprehensive guide to Python preprocessing, broken down into key concepts, essential libraries, and a practical, step-by-step example.


The Core Idea: Why Preprocess?

Raw data is almost never ready for analysis or modeling. It's often messy, incomplete, and inconsistent. Preprocessing aims to:

  • Handle Missing Values: Deal with empty cells or NaNs.
  • Correct Data Types: Ensure columns are the correct type (e.g., numbers as float/int, dates as datetime).
  • Handle Categorical Data: Convert text-based categories into numbers that models can understand.
  • Scale and Normalize: Adjust the range of numerical features to prevent one feature from dominating others.
  • Handle Outliers: Identify and manage extreme data points that can skew results (see the sketch after this list).
  • Feature Engineering: Create new, more informative features from existing ones.
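
Two of these tasks, correcting data types and handling outliers, are not revisited in the step-by-step workflow below, so here is a minimal, self-contained sketch of both. The SignupDate column and the 1.5 * IQR clipping rule are illustrative assumptions, not part of the sample dataset used later.

import pandas as pd

# Hypothetical raw data: a date stored as text and one extreme income value
raw = pd.DataFrame({
    'SignupDate': ['2021-01-05', '2021-02-17', '2021-03-02', '2021-04-20'],
    'AnnualIncome': [15000, 16000, 18000, 900000],  # 900000 is an obvious outlier
})

# Correct data types: parse the text column into a proper datetime
raw['SignupDate'] = pd.to_datetime(raw['SignupDate'])

# Handle outliers: clip values outside the 1.5 * IQR "whiskers" (one common rule of thumb)
q1, q3 = raw['AnnualIncome'].quantile([0.25, 0.75])
iqr = q3 - q1
raw['AnnualIncome'] = raw['AnnualIncome'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(raw.dtypes)
print(raw)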

Essential Python Libraries for Preprocessing

You'll rarely do this from scratch. These libraries are the industry standard.

  • Pandas: The backbone of data manipulation in Python. Used for loading, cleaning, filtering, and transforming data.
  • NumPy: The fundamental package for numerical computation in Python. Pandas is built on top of it.
  • Scikit-learn: The go-to library for machine learning. It provides a robust, easy-to-use API for almost all preprocessing tasks.
  • Seaborn & Matplotlib: Primarily for data visualization, but crucial for exploratory data analysis (EDA), which informs your preprocessing decisions.

The Preprocessing Workflow: A Step-by-Step Guide

Let's walk through a typical workflow using a sample dataset. Imagine we have a CSV file customer_data.csv with information about customers.


Sample Data (customer_data.csv):

CustomerID,Age,Gender,AnnualIncome,SpendScore,HasLoyaltyCard
1,19,Male,15000,39,No
2,21,Female,15000,81,Yes
3,20,Female,16000,6,No
4,23,Female,16000,77,Yes
5,31,Male,17000,40,No
6,22,Female,18000,76,No
7,35,Female,18000,9,No
8,23,Female,19000,57,No
9,64,Female,19000,81,Yes
10,30,Female,20000,25,No
... (some rows are missing data)

Step 1: Load and Inspect the Data

First, we load the data and get a high-level overview.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
try:
    df = pd.read_csv('customer_data.csv')
except FileNotFoundError:
    # Create a sample DataFrame if file doesn't exist
    data = {
        'CustomerID': range(1, 11),
        'Age': [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
        'Gender': ['Male', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female'],
        'AnnualIncome': [15000, 15000, 16000, 16000, 17000, 18000, 18000, 19000, 19000, 20000],
        'SpendScore': [39, 81, 6, 77, 40, 76, 9, 57, 81, 25],
        'HasLoyaltyCard': ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No']
    }
    df = pd.DataFrame(data)
    # Introduce some missing values for demonstration
    df.loc[2, 'AnnualIncome'] = np.nan
    df.loc[5, 'Age'] = np.nan
# --- INSPECTION ---
print("First 5 rows:")
print(df.head())
print("\nData Info:")
df.info() # Shows column types and non-null counts
print("\nStatistical Summary:")
print(df.describe())
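
Seaborn and Matplotlib were imported above for exactly this purpose: a couple of quick EDA plots at this stage make the later decisions (which imputation statistic, which scaler) much less arbitrary. A minimal sketch using the df just loaded:

import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of income: strong skew would favour median imputation and robust scaling
sns.histplot(df['AnnualIncome'].dropna(), bins=10)
plt.title('AnnualIncome distribution')
plt.show()

# Visual map of missing values: each highlighted cell is a NaN
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing values by column')
plt.show()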

Step 2: Handle Missing Values

The df.info() output shows fewer non-null entries than rows for some columns, which means they contain NaN (Not a Number) values. We need to decide what to do with them.

  • Deletion: Remove rows or columns with missing data (simple but can lose information).
  • Imputation: Fill in the missing values with a statistic (mean, median, mode) or a constant.
# Check for missing values
print("\nMissing values before handling:")
print(df.isnull().sum())
# Strategy 1: Drop rows with any missing values (often too aggressive)
# df_dropped = df.dropna()
# Strategy 2: Impute numerical columns with the median
# The median is robust to outliers
median_income = df['AnnualIncome'].median()
df['AnnualIncome'] = df['AnnualIncome'].fillna(median_income)
# Strategy 3: Impute categorical columns with the mode (most frequent value)
mode_gender = df['Gender'].mode()[0]
df['Gender'] = df['Gender'].fillna(mode_gender)
# Check again
print("\nMissing values after handling:")
print(df.isnull().sum())
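
The manual fillna calls above are fine for a one-off analysis. When the same imputation has to be reapplied to new data (for example at prediction time), scikit-learn's SimpleImputer is the more reusable tool because it remembers the statistics it learned during fit. A minimal sketch of the equivalent imputation (the NaNs in df are already filled at this point, so this is shown only as the sklearn counterpart):

from sklearn.impute import SimpleImputer

# Median for numeric columns, most frequent value for categorical ones
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

df[['Age', 'AnnualIncome']] = num_imputer.fit_transform(df[['Age', 'AnnualIncome']])
df[['Gender']] = cat_imputer.fit_transform(df[['Gender']])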

Step 3: Handle Categorical Data

Machine learning models work with numbers, not text. We need to convert Gender and HasLoyaltyCard.

  • Label Encoding: Assigns a unique integer to each category (e.g., No=0, Yes=1). Good for ordinal data (where order matters) and simple binary columns.
  • One-Hot Encoding: Creates a new binary column for each category. Good for nominal data (where order doesn't matter).
# For 'Gender' (nominal), One-Hot Encoding is often better.
# get_dummies creates one binary column per category; with drop_first=True the first
# category ('Female') is dropped to avoid multicollinearity, leaving only 'Gender_Male'.
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
# For 'HasLoyaltyCard', a simple binary mapping works (Yes=1, No=0)
# (sklearn's LabelEncoder would achieve the same result)
df['HasLoyaltyCard'] = df['HasLoyaltyCard'].map({'No': 0, 'Yes': 1})
print("\nData after handling categorical features:")
print(df.head())
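
pd.get_dummies only knows about the categories present in the current DataFrame, which can silently produce mismatched columns when new data arrives. scikit-learn's OneHotEncoder learns the categories during fit and then encodes any later data consistently, including categories it has never seen. A minimal sketch on a small stand-alone example (sparse_output requires scikit-learn >= 1.2):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

train = pd.DataFrame({'Gender': ['Male', 'Female', 'Female']})
new = pd.DataFrame({'Gender': ['Female', 'Other']})  # 'Other' was never seen during fit

# handle_unknown='ignore' encodes unseen categories as an all-zero row instead of raising
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(train[['Gender']])

print(encoder.get_feature_names_out())   # ['Gender_Female' 'Gender_Male']
print(encoder.transform(new[['Gender']]))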

Step 4: Feature Scaling

If your numerical features are on vastly different scales (e.g., Age is 20-60, AnnualIncome is 15k-200k), algorithms like SVMs or K-Nearest Neighbors can be biased towards the larger-scale feature.

  • Standardization (StandardScaler): Rescales features to have a mean of 0 and a standard deviation of 1. Works well when data is Gaussian-like.
  • Normalization (MinMaxScaler): Rescales features to a range between 0 and 1. Good when you don't have outliers and want to bound the data.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Let's standardize 'Age' and 'AnnualIncome'
# It's good practice to scale only the features, not the target variable
# For this example, let's assume 'SpendScore' is our target.
# Select numerical features to scale
numerical_features = ['Age', 'AnnualIncome']
features_to_scale = df[numerical_features]
# Initialize the scaler
scaler = StandardScaler()
# Fit and transform the data
scaled_features = scaler.fit_transform(features_to_scale)
# Replace the original columns with the scaled versions
df[numerical_features] = scaled_features
print("\nData after scaling:")
print(df.head())
print("\nStatistical summary after scaling:")
print(df.describe())
# Notice how the mean is ~0 and std is ~1 for the scaled columns.
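
One best practice worth spelling out here: in a real project the scaler should be fit on the training split only and then applied unchanged to the test split, otherwise statistics from the test data leak into training. A minimal sketch, using the same two columns and treating SpendScore as the target (as assumed above):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    df[['Age', 'AnnualIncome']], df['SpendScore'], test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test data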

Step 5: Feature Engineering (Optional but Powerful)

This is where you create new features that might be more predictive.

# Example: Create a 'LifeStage' feature based on Age.
# NOTE: in this walkthrough 'Age' was already standardized in Step 4, so in a real
# pipeline you would engineer this feature *before* scaling (or keep a copy of the
# raw Age column) so that the thresholds below refer to actual years.
def get_life_stage(age):
    if age < 25:
        return 'Young Adult'
    elif 25 <= age < 40:
        return 'Adult'
    else:
        return 'Senior'
df['LifeStage'] = df['Age'].apply(get_life_stage)
# Now we need to one-hot encode this new categorical feature
df = pd.get_dummies(df, columns=['LifeStage'], drop_first=True)
print("\nData after feature engineering:")
print(df.head())
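
Row-by-row apply works, but for simple threshold-based binning pandas' cut does the same thing in one vectorized call. A minimal sketch on raw (unscaled) ages, which is how you would use it before Step 4:

import pandas as pd

ages = pd.Series([19, 23, 31, 35, 64])

# [0, 25) -> Young Adult, [25, 40) -> Adult, [40, inf) -> Senior
life_stage = pd.cut(ages, bins=[0, 25, 40, float('inf')],
                    labels=['Young Adult', 'Adult', 'Senior'], right=False)
print(life_stage)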

Step 6: Final Data Preparation

Now our data is clean and ready for a model. We just need to separate it into features (X) and the target variable (y).

# Define our features (X) and target (y)
# CustomerID is just an identifier, so it is not a useful feature
X = df.drop(columns=['CustomerID', 'SpendScore'])
y = df['SpendScore']
print("\nFeature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
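
From here the usual next steps are a train/test split and, as a best practice, bundling every preprocessing step into a single scikit-learn Pipeline so that exactly the same transformations are applied at training and prediction time. A minimal sketch built from the raw CSV columns (the LinearRegression at the end is only a placeholder model):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ['Age', 'AnnualIncome']
categorical_cols = ['Gender', 'HasLoyaltyCard']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols),
])

model = Pipeline([('preprocess', preprocess), ('regressor', LinearRegression())])

# With the untouched CSV loaded as raw_df, the whole workflow then becomes:
# X_train, X_test, y_train, y_test = train_test_split(
#     raw_df[numeric_cols + categorical_cols], raw_df['SpendScore'],
#     test_size=0.2, random_state=42)
# model.fit(X_train, y_train)
# print(model.score(X_test, y_test))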