Of course! "Python preprocessing" is a broad term, but it most commonly refers to the crucial steps of cleaning and preparing data for machine learning or data analysis. A common phrase in data science is "Garbage In, Garbage Out" (GIGO), and preprocessing is how you turn "garbage" into high-quality, usable data.

Here’s a comprehensive guide to Python preprocessing, broken down into key concepts, essential libraries, and a practical, step-by-step example.
The Core Idea: Why Preprocess?
Raw data is almost never ready for analysis or modeling. It's often messy, incomplete, and inconsistent. Preprocessing aims to:
- Handle Missing Values: Deal with empty cells or NaNs.
- Correct Data Types: Ensure columns are the correct type (e.g., numbers as float/int, dates as datetime).
- Handle Categorical Data: Convert text-based categories into numbers that models can understand.
- Scale and Normalize: Adjust the range of numerical features to prevent one feature from dominating others.
- Handle Outliers: Identify and manage extreme data points that can skew results (see the short sketch after this list).
- Feature Engineering: Create new, more informative features from existing ones.
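Data types and outliers are the two items in this list that don't get a dedicated step in the walkthrough below, so here is a minimal, self-contained pandas sketch of both. The DataFrame and column names are made up purely for illustration:

```python
import pandas as pd

# Hypothetical raw data: dates stored as text, numbers stored as strings
raw = pd.DataFrame({
    'signup_date': ['2021-01-03', '2021-02-14', '2021-03-01', '2021-03-20'],
    'monthly_spend': ['120.5', '98.0', '110.0', '15000.0'],  # last value is an outlier
})

# Correct data types: parse dates and cast numeric strings to float
raw['signup_date'] = pd.to_datetime(raw['signup_date'])
raw['monthly_spend'] = raw['monthly_spend'].astype(float)

# Handle outliers with the common IQR rule:
# clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = raw['monthly_spend'].quantile([0.25, 0.75])
iqr = q3 - q1
raw['monthly_spend'] = raw['monthly_spend'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(raw)
```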
Essential Python Libraries for Preprocessing
You'll rarely do this from scratch. These libraries are the industry standard.
| Library | Primary Use Case |
|---|---|
| Pandas | The backbone of data manipulation in Python. Used for loading, cleaning, filtering, and transforming data. |
| NumPy | The fundamental package for numerical computation in Python. Pandas is built on top of it. |
| Scikit-learn | The go-to library for machine learning. It provides a robust and easy-to-use API for almost all preprocessing tasks. |
| Seaborn & Matplotlib | Primarily for data visualization, but crucial for Exploratory Data Analysis (EDA), which informs your preprocessing decisions. |
The Preprocessing Workflow: A Step-by-Step Guide
Let's walk through a typical workflow using a sample dataset. Imagine we have a CSV file customer_data.csv with information about customers.

Sample Data (customer_data.csv):
CustomerID,Age,Gender,AnnualIncome,SpendScore,HasLoyaltyCard
1,19,Male,15000,39,No
2,21,Female,15000,81,Yes
3,20,Female,16000,6,No
4,23,Female,16000,77,Yes
5,31,Male,17000,40,No
6,22,Female,18000,76,No
7,35,Female,18000,9,No
8,23,Female,19000,57,No
9,64,Female,19000,81,Yes
10,30,Female,20000,25,No
... (some rows are missing data)
Step 1: Load and Inspect the Data
First, we load the data and get a high-level overview.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
try:
    df = pd.read_csv('customer_data.csv')
except FileNotFoundError:
    # Create a sample DataFrame if file doesn't exist
    data = {
        'CustomerID': range(1, 11),
        'Age': [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
        'Gender': ['Male', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female'],
        'AnnualIncome': [15000, 15000, 16000, 16000, 17000, 18000, 18000, 19000, 19000, 20000],
        'SpendScore': [39, 81, 6, 77, 40, 76, 9, 57, 81, 25],
        'HasLoyaltyCard': ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No']
    }
    df = pd.DataFrame(data)
    # Introduce some missing values for demonstration
    df.loc[2, 'AnnualIncome'] = np.nan
    df.loc[5, 'Age'] = np.nan
# --- INSPECTION ---
print("First 5 rows:")
print(df.head())
print("\nData Info:")
df.info() # Shows column types and non-null counts
print("\nStatistical Summary:")
print(df.describe())
Step 2: Handle Missing Values
The df.info() output shows non-null counts for each column; any count below the number of rows means that column contains NaN (Not a Number) values. We need to decide what to do with them.
- Deletion: Remove rows or columns with missing data (simple but can lose information).
- Imputation: Fill in the missing values with a statistic (mean, median, mode) or a constant.
# Check for missing values
print("\nMissing values before handling:")
print(df.isnull().sum())
# Strategy 1: Drop rows with any missing values (often too aggressive)
# df_dropped = df.dropna()
# Strategy 2: Impute numerical columns with the median
# The median is robust to outliers
# (plain assignment instead of inplace=True avoids pandas chained-assignment warnings)
median_income = df['AnnualIncome'].median()
df['AnnualIncome'] = df['AnnualIncome'].fillna(median_income)
df['Age'] = df['Age'].fillna(df['Age'].median())  # 'Age' also had a missing value
# Strategy 3: Impute categorical columns with the mode (most frequent value)
mode_gender = df['Gender'].mode()[0]
df['Gender'] = df['Gender'].fillna(mode_gender)
# Check again
print("\nMissing values after handling:")
print(df.isnull().sum())
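The fillna calls above are perfectly fine for a one-off analysis. In a modeling workflow you will often reach for scikit-learn's SimpleImputer instead, because it can be fitted on training data and reused on new data later. A minimal sketch that mirrors the imputation above, assuming the same df and column names:

```python
from sklearn.impute import SimpleImputer

# Median imputation for numerical columns (equivalent to the fillna calls above)
num_imputer = SimpleImputer(strategy='median')
df[['Age', 'AnnualIncome']] = num_imputer.fit_transform(df[['Age', 'AnnualIncome']])

# Most-frequent (mode) imputation for categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['Gender']] = cat_imputer.fit_transform(df[['Gender']])
```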
Step 3: Handle Categorical Data
Machine learning models work with numbers, not text. We need to convert Gender and HasLoyaltyCard.

- Label Encoding: Assigns a unique integer to each category (e.g., Male=0, Female=1). Good for ordinal data (where order matters).
- One-Hot Encoding: Creates a new binary column for each category. Good for nominal data (where order doesn't matter).
# For 'Gender' (Nominal), One-Hot Encoding is often better
# We use get_dummies to create new columns like 'Gender_Male' and 'Gender_Female'
df = pd.get_dummies(df, columns=['Gender'], drop_first=True) # drop_first avoids multicollinearity
# For 'HasLoyaltyCard', we can use simple Label Encoding (Yes=1, No=0)
# We can use sklearn's LabelEncoder or a simple map
df['HasLoyaltyCard'] = df['HasLoyaltyCard'].map({'No': 0, 'Yes': 1})
print("\nData after handling categorical features:")
print(df.head())
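The pandas approach above is the quickest route. If you need encoders that can be fitted on training data and then reapplied to unseen data, scikit-learn's OneHotEncoder and OrdinalEncoder cover the same two cases. A small self-contained sketch (the tiny raw_df below is made up to stand in for the original text columns):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Stand-in for the raw, still text-valued columns
raw_df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female'],
                       'HasLoyaltyCard': ['No', 'Yes', 'No']})

# One-hot encoding; drop='first' mirrors get_dummies(drop_first=True)
# (sparse_output is the scikit-learn >= 1.2 name; older versions call it sparse)
ohe = OneHotEncoder(drop='first', sparse_output=False)
gender_encoded = ohe.fit_transform(raw_df[['Gender']])

# Ordinal encoding with an explicit category order: No=0, Yes=1
ord_enc = OrdinalEncoder(categories=[['No', 'Yes']])
loyalty_encoded = ord_enc.fit_transform(raw_df[['HasLoyaltyCard']])

print(gender_encoded.ravel())   # [1. 0. 0.] -> Male=1, Female=0
print(loyalty_encoded.ravel())  # [0. 1. 0.]
```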
Step 4: Feature Scaling
If your numerical features are on vastly different scales (e.g., Age is 20-60, AnnualIncome is 15k-200k), algorithms like SVMs or K-Nearest Neighbors can be biased towards the larger-scale feature.
- Standardization (StandardScaler): Rescales features to have a mean of 0 and a standard deviation of 1. Works well when data is Gaussian-like.
- Normalization (MinMaxScaler): Rescales features to a range between 0 and 1. Good when you don't have outliers and want to bound the data.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Let's standardize 'Age' and 'AnnualIncome'
# It's good practice to scale only the features, not the target variable
# For this example, let's assume 'SpendScore' is our target.
# Select numerical features to scale
numerical_features = ['Age', 'AnnualIncome']
features_to_scale = df[numerical_features]
# Initialize the scaler
scaler = StandardScaler()
# Fit and transform the data
scaled_features = scaler.fit_transform(features_to_scale)
# Replace the original columns with the scaled versions
df[numerical_features] = scaled_features
print("\nData after scaling:")
print(df.head())
print("\nStatistical summary after scaling:")
print(df.describe())
# Notice how the mean is ~0 and std is ~1 for the scaled columns.
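MinMaxScaler was imported above but not used. It is a drop-in replacement for StandardScaler when you want features bounded to [0, 1]; a tiny self-contained sketch with made-up ages:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

ages = pd.DataFrame({'Age': [19, 21, 35, 64]})  # toy values for illustration
minmax = MinMaxScaler()
print(minmax.fit_transform(ages))  # smallest age maps to 0.0, largest to 1.0
```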
Step 5: Feature Engineering (Optional but Powerful)
This is where you create new features that might be more predictive.
# Example: Create a 'LifeStage' feature based on Age
def get_life_stage(age):
    if age < 25:
        return 'Young Adult'
    elif 25 <= age < 40:
        return 'Adult'
    else:
        return 'Senior'

# Age was standardized in Step 4, so recover the raw ages before applying the
# thresholds (in a real pipeline, create this feature *before* scaling)
raw_age = pd.Series(scaler.inverse_transform(df[numerical_features])[:, 0], index=df.index)
df['LifeStage'] = raw_age.apply(get_life_stage)
# Now we need to one-hot encode this new categorical feature
df = pd.get_dummies(df, columns=['LifeStage'], drop_first=True)
print("\nData after feature engineering:")
print(df.head())
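As an aside, pandas' pd.cut expresses the same age-binning logic more concisely than a hand-written function. A sketch on a made-up series of raw (unscaled) ages:

```python
import pandas as pd

ages = pd.Series([19, 23, 31, 64])  # raw ages for illustration
life_stage = pd.cut(ages, bins=[0, 25, 40, 120],
                    labels=['Young Adult', 'Adult', 'Senior'], right=False)
print(life_stage.tolist())  # ['Young Adult', 'Young Adult', 'Adult', 'Senior']
```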
Step 6: Final Data Preparation
Now our data is clean and ready for a model. We just need to separate it into features (X) and the target variable (y).
# Define our features (X) and target (y)
# 'CustomerID' is just an identifier, so it is dropped along with the target
X = df.drop(columns=['CustomerID', 'SpendScore'])
y = df['SpendScore']

print("\nFeature matrix (X):")
print(X.head())

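Once each step makes sense on its own, it is worth knowing that scikit-learn can bundle imputation, encoding, and scaling into a single object via ColumnTransformer and Pipeline, so the transformations are learned on the training split only and reapplied to the test split, which avoids data leakage. A minimal sketch, assuming the raw customer columns and SpendScore as the target (raw_df here stands for the DataFrame exactly as loaded in Step 1, before any manual cleaning):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ['Age', 'AnnualIncome']
categorical_cols = ['Gender', 'HasLoyaltyCard']

numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
    ('cat', categorical_pipe, categorical_cols),
])

# raw_df is the DataFrame as loaded in Step 1 (hypothetical name for this sketch)
# X_train, X_test, y_train, y_test = train_test_split(
#     raw_df[numeric_cols + categorical_cols], raw_df['SpendScore'],
#     test_size=0.2, random_state=42)
# X_train_prepared = preprocess.fit_transform(X_train)  # fit on the training split only
# X_test_prepared = preprocess.transform(X_test)        # reuse the fitted transformations
```

The fitted preprocess object can then be chained with an estimator inside the same Pipeline, so the whole preprocess-and-predict sequence behaves like one model.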