Python OneHotEncoder

Of course! The OneHotEncoder is a fundamental tool in Python's machine learning preprocessing toolkit. Here’s a complete guide covering what it is, why it's used, how to use it with scikit-learn, and best practices.

What is One-Hot Encoding?

One-hot encoding is a technique used to convert categorical data into a numerical format that machine learning models can understand.

The Problem: Most ML algorithms (like linear regression, SVMs, neural networks) require numerical input. If you have a categorical feature like Color with values ["Red", "Green", "Blue"], you can't just assign numbers like Red=1, Green=2, Blue=3. This would incorrectly imply a mathematical order or relationship (e.g., that Blue is "greater than" Green), which can mislead the model.

The Solution: One-hot encoding creates a new binary (0 or 1) column for each unique category in the original feature.

Example:

Let's say we have a Color feature:

Color
Red
Green
Blue
Red

After one-hot encoding, this becomes:

Color_Blue  Color_Green  Color_Red
0           0            1
0           1            0
1           0            0
0           0            1
  • A row with Color = "Red" will have a 1 in the Color_Red column and 0s in all other new columns.
  • A row with Color = "Green" will have a 1 in the Color_Green column and 0s elsewhere.

When to Use One-Hot Encoding?

It's ideal for nominal categorical features—categories that have no intrinsic order.

  • Good examples: Country (USA, Canada, Mexico), City (New York, London, Tokyo), Product Type (Electronics, Clothing, Food).
  • Bad examples (use ordinal encoding instead, e.g., OrdinalEncoder): Rank (1st, 2nd, 3rd), Education Level (High School, Bachelor's, Master's). These categories have a natural order, so mapping them to ordered integers (1, 2, 3) is appropriate; see the sketch below.
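
A minimal sketch of the contrast (the category lists here are illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordered categories: preserve the order explicitly as 0, 1, 2
education = np.array([['High School'], ["Bachelor's"], ["Master's"]])
ordinal = OrdinalEncoder(categories=[['High School', "Bachelor's", "Master's"]])
print(ordinal.fit_transform(education).ravel())   # [0. 1. 2.]

# Nominal categories: no order implied, one binary column per category
cities = np.array([['New York'], ['London'], ['Tokyo']])
onehot = OneHotEncoder(sparse_output=False)
print(onehot.fit_transform(cities))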

How to Use OneHotEncoder in Scikit-Learn

The OneHotEncoder class lives in sklearn.preprocessing. Modern versions (scikit-learn 0.20 and later) are highly flexible and, for ML pipelines, are recommended over pandas.get_dummies().

Basic Example

Let's start with a simple array.

import numpy as np
from sklearn.preprocessing import OneHotEncoder
# 1. Sample data
# Note: Scikit-learn expects 2D input, so we reshape the 1D array.
data = np.array(['Red', 'Green', 'Blue', 'Red']).reshape(-1, 1)
# 2. Initialize the encoder
# handle_unknown='ignore' is good practice to prevent errors on new categories
# in test data.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# 3. Fit and transform the data
# .fit() learns the categories (Red, Green, Blue)
# .transform() creates the one-hot encoded matrix
encoded_data = encoder.fit_transform(data)
# 4. View the result
print("Encoded Data:")
print(encoded_data)
# 5. See the new category names (feature names)
print("\nCategory Names:")
print(encoder.get_feature_names_out())

Output:

Encoded Data:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
Category Names:
['x0_Blue' 'x0_Green' 'x0_Red']

(Note: x0 is the default placeholder name for the first input column because the encoder was fitted on a plain NumPy array. If you fit on a pandas DataFrame instead, the real column names are picked up automatically (exposed as feature_names_in_), so you would see names like Color_Red.)
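
To see why handle_unknown='ignore' matters, here is a small, hedged follow-up: a category that was never seen during fit is encoded as a row of all zeros instead of raising an error (with the default 'error' it would raise a ValueError).

import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(np.array(['Red', 'Green', 'Blue', 'Red']).reshape(-1, 1))

# 'Purple' was not seen during fit, so it becomes an all-zero row
print(encoder.transform(np.array([['Green'], ['Purple']])))
# [[0. 1. 0.]
#  [0. 0. 0.]]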


Using OneHotEncoder with Pandas DataFrames (Most Common Use Case)

This is where OneHotEncoder really shines, especially when combined with ColumnTransformer.

Let's say we have a DataFrame with both numerical and categorical features.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# 1. Create a sample DataFrame
data = {
    'Age': [25, 45, 35, 50, 23],
    'City': ['New York', 'London', 'New York', 'Paris', 'London'],
    'Purchase': ['Yes', 'No', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("-" * 30)
# 2. Separate features (X) and target (y)
X = df.drop('Purchase', axis=1)
y = df['Purchase']
# 3. Identify categorical columns
categorical_cols = ['City']
# 4. Create a ColumnTransformer
# This allows us to apply different transformations to different columns.
# - 'encoder': A name we give to this step.
# - OneHotEncoder(...): The transformer to apply.
# - categorical_cols: The columns to apply it to.
# - remainder='passthrough': This is crucial! It tells the transformer
#   to leave all other columns (like 'Age') unchanged.
preprocessor = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'
)
# 5. Fit and transform the features X
X_processed = preprocessor.fit_transform(X)
# 6. View the processed data
print("Processed Data (NumPy Array):")
print(X_processed)
print("-" * 30)
# 7. Convert back to a DataFrame for better readability
# Get the new feature names from the preprocessor
new_feature_names = preprocessor.get_feature_names_out()
X_processed_df = pd.DataFrame(X_processed, columns=new_feature_names)
print("Processed DataFrame:")
print(X_processed_df)

Output:

Original DataFrame:
   Age     City Purchase
0   25  New York       Yes
1   45    London        No
2   35  New York       Yes
3   50     Paris        No
4   23    London       Yes
------------------------------
Processed Data (NumPy Array):
[[ 0.  1.  0. 25.]
 [ 1.  0.  0. 45.]
 [ 0.  1.  0. 35.]
 [ 0.  0.  1. 50.]
 [ 1.  0.  0. 23.]]
------------------------------
Processed DataFrame:
   encoder__City_London  encoder__City_New York  encoder__City_Paris  remainder__Age
0                   0.0                     1.0                  0.0            25.0
1                   1.0                     0.0                  0.0            45.0
2                   0.0                     1.0                  0.0            35.0
3                   0.0                     0.0                  1.0            50.0
4                   1.0                     0.0                  0.0            23.0
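
The imports above also bring in LogisticRegression and train_test_split, which have not been used yet. Below is a minimal, hedged sketch (the tiny toy dataset makes the scores meaningless) of how the preprocessor is typically chained with a model in a single Pipeline, so the encoding is learned only from the training data and reapplied at predict time:

from sklearn.pipeline import Pipeline

# Chain preprocessing and model into one estimator
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

model.fit(X_train, y_train)
print("Predictions:", model.predict(X_test))
print("Test accuracy:", model.score(X_test, y_test))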

Key Parameters of OneHotEncoder

  • sparse_output: Whether to return a sparse matrix or a dense NumPy array. True (the default) is memory-efficient when there are many categories; False gives an easier-to-read NumPy array.
  • handle_unknown: What to do when a category appears during transform that was not seen during fit. 'error' (the default) raises an error; 'ignore' encodes the unknown category as all zeros. 'ignore' is highly recommended for production.
  • drop: Whether to drop one of the one-hot encoded columns to avoid multicollinearity. None (the default) keeps every column; 'first' drops the first category of each feature; 'if_binary' drops the first category only for binary features; you can also pass an array naming the category to drop for each feature.
  • categories: The categories to encode. 'auto' (the default) learns them from the data; alternatively, pass a list of lists such as [['NY', 'LD', 'PR']] if you know the categories in advance.
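
A short sketch of the drop parameter (useful before linear models, where keeping all columns makes them perfectly collinear):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['Red'], ['Green'], ['Blue'], ['Red']])

# drop='first' removes the first category ('Blue', alphabetically),
# so each row is still uniquely identified by the remaining columns
encoder = OneHotEncoder(sparse_output=False, drop='first')
print(encoder.fit_transform(colors))
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]
#  [0. 1.]]
print(encoder.get_feature_names_out())   # ['x0_Green' 'x0_Red']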

OneHotEncoder vs. pandas.get_dummies()

This is a common point of confusion.

  • Primary use: OneHotEncoder is designed for machine learning pipelines; pandas.get_dummies() is best for quick data analysis and exploration.
  • Integration: OneHotEncoder works seamlessly with scikit-learn's Pipeline and ColumnTransformer; get_dummies() simply returns a new DataFrame and has no fitted state to reuse on new data.
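
The practical difference is easiest to see with a train/test split; a minimal sketch (the column values are illustrative):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'City': ['New York', 'London']})
test = pd.DataFrame({'City': ['Paris']})   # category unseen in training

# get_dummies encodes each frame independently, so the columns differ
print(pd.get_dummies(train).columns.tolist())   # ['City_London', 'City_New York']
print(pd.get_dummies(test).columns.tolist())    # ['City_Paris']

# OneHotEncoder remembers the categories from fit, keeping the output
# shape consistent; the unseen 'Paris' becomes an all-zero row
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(train)
print(encoder.transform(test))                  # [[0. 0.]]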