Python OneHotEncoder

Of course! The OneHotEncoder is a fundamental tool in Python's machine learning preprocessing toolkit. Here’s a complete guide covering what it is, why it's used, how to use it with scikit-learn, and best practices.

What is One-Hot Encoding?

One-hot encoding is a technique used to convert categorical data into a numerical format that machine learning models can understand.

The Problem: Most ML algorithms (like linear regression, SVMs, neural networks) require numerical input. If you have a categorical feature like Color with values ["Red", "Green", "Blue"], you can't just assign numbers like Red=1, Green=2, Blue=3. This would incorrectly imply a mathematical order or relationship (e.g., that Blue is "greater than" Green), which can mislead the model.

The Solution: One-hot encoding creates a new binary (0 or 1) column for each unique category in the original feature.

Example:

Let's say we have a Color feature:

Color
Red
Green
Blue
Red

After one-hot encoding, this becomes:

Color_Blue  Color_Green  Color_Red
0           0            1
0           1            0
1           0            0
0           0            1
  • A row with Color = "Red" will have a 1 in the Color_Red column and 0s in all other new columns.
  • A row with Color = "Green" will have a 1 in the Color_Green column and 0s elsewhere.

When to Use One-Hot Encoding?

It's ideal for nominal categorical features—categories that have no intrinsic order.

  • Good examples: Country (USA, Canada, Mexico), City (New York, London, Tokyo), Product Type (Electronics, Clothing, Food).
  • Bad examples (use ordinal encoding instead, e.g., OrdinalEncoder): Rank (1st, 2nd, 3rd), Education Level (High School, Bachelor's, Master's). These categories have a natural order, so mapping them to ordered integers (1, 2, 3) is appropriate; see the sketch below.
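
A minimal sketch of the contrast (the category lists here are illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordered categories: preserve the order explicitly as 0, 1, 2
education = np.array([['High School'], ["Bachelor's"], ["Master's"]])
ordinal = OrdinalEncoder(categories=[['High School', "Bachelor's", "Master's"]])
print(ordinal.fit_transform(education).ravel())   # [0. 1. 2.]

# Nominal categories: no order implied, one binary column per category
cities = np.array([['New York'], ['London'], ['Tokyo']])
onehot = OneHotEncoder(sparse_output=False)
print(onehot.fit_transform(cities))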

How to Use OneHotEncoder in Scikit-Learn

The OneHotEncoder class lives in sklearn.preprocessing. Modern versions (scikit-learn 0.20 and later) are highly flexible and, for ML pipelines, are recommended over pandas.get_dummies().

Basic Example

Let's start with a simple array.

import numpy as np
from sklearn.preprocessing import OneHotEncoder
# 1. Sample data
# Note: Scikit-learn expects 2D input, so we reshape the 1D array.
data = np.array(['Red', 'Green', 'Blue', 'Red']).reshape(-1, 1)
# 2. Initialize the encoder
# handle_unknown='ignore' is good practice to prevent errors on new categories
# in test data.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# 3. Fit and transform the data
# .fit() learns the categories (Red, Green, Blue)
# .transform() creates the one-hot encoded matrix
encoded_data = encoder.fit_transform(data)
# 4. View the result
print("Encoded Data:")
print(encoded_data)
# 5. See the new category names (feature names)
print("\nCategory Names:")
print(encoder.get_feature_names_out())

Output:

Encoded Data:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
Category Names:
['x0_Blue' 'x0_Green' 'x0_Red']

(Note: x0 is the default placeholder name for the first input column because the encoder was fitted on a plain NumPy array. If you fit on a pandas DataFrame instead, the real column names are picked up automatically (exposed as feature_names_in_), so you would see names like Color_Red.)
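
To see why handle_unknown='ignore' matters, here is a small, hedged follow-up: a category that was never seen during fit is encoded as a row of all zeros instead of raising an error (with the default 'error' it would raise a ValueError).

import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(np.array(['Red', 'Green', 'Blue', 'Red']).reshape(-1, 1))

# 'Purple' was not seen during fit, so it becomes an all-zero row
print(encoder.transform(np.array([['Green'], ['Purple']])))
# [[0. 1. 0.]
#  [0. 0. 0.]]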


Using OneHotEncoder with Pandas DataFrames (Most Common Use Case)

This is where OneHotEncoder really shines, especially when combined with ColumnTransformer.

Let's say we have a DataFrame with both numerical and categorical features.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# 1. Create a sample DataFrame
data = {
    'Age': [25, 45, 35, 50, 23],
    'City': ['New York', 'London', 'New York', 'Paris', 'London'],
    'Purchase': ['Yes', 'No', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("-" * 30)
# 2. Separate features (X) and target (y)
X = df.drop('Purchase', axis=1)
y = df['Purchase']
# 3. Identify categorical columns
categorical_cols = ['City']
# 4. Create a ColumnTransformer
# This allows us to apply different transformations to different columns.
# - 'encoder': A name we give to this step.
# - OneHotEncoder(...): The transformer to apply.
# - categorical_cols: The columns to apply it to.
# - remainder='passthrough': This is crucial! It tells the transformer
#   to leave all other columns (like 'Age') unchanged.
preprocessor = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'
)
# 5. Fit and transform the features X
X_processed = preprocessor.fit_transform(X)
# 6. View the processed data
print("Processed Data (NumPy Array):")
print(X_processed)
print("-" * 30)
# 7. Convert back to a DataFrame for better readability
# Get the new feature names from the preprocessor
new_feature_names = preprocessor.get_feature_names_out()
X_processed_df = pd.DataFrame(X_processed, columns=new_feature_names)
print("Processed DataFrame:")
print(X_processed_df)

Output:

Original DataFrame:
   Age     City Purchase
0   25  New York       Yes
1   45    London        No
2   35  New York       Yes
3   50     Paris        No
4   23    London       Yes
------------------------------
Processed Data (NumPy Array):
[[ 0.  1.  0. 25.]
 [ 1.  0.  0. 45.]
 [ 0.  1.  0. 35.]
 [ 0.  0.  1. 50.]
 [ 1.  0.  0. 23.]]
------------------------------
Processed DataFrame:
   encoder__City_London  encoder__City_New York  encoder__City_Paris  remainder__Age
0                   0.0                     1.0                  0.0            25.0
1                   1.0                     0.0                  0.0            45.0
2                   0.0                     1.0                  0.0            35.0
3                   0.0                     0.0                  1.0            50.0
4                   1.0                     0.0                  0.0            23.0
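
The imports above also bring in LogisticRegression and train_test_split, which have not been used yet. Below is a minimal, hedged sketch (the tiny toy dataset makes the scores meaningless) of how the preprocessor is typically chained with a model in a single Pipeline, so the encoding is learned only from the training data and reapplied at predict time:

from sklearn.pipeline import Pipeline

# Chain preprocessing and model into one estimator
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

model.fit(X_train, y_train)
print("Predictions:", model.predict(X_test))
print("Test accuracy:", model.score(X_test, y_test))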

Key Parameters of OneHotEncoder

  • sparse_output: Whether to return a sparse matrix or a dense NumPy array. True (the default) is memory-efficient when there are many categories; False gives an easier-to-read NumPy array.
  • handle_unknown: What to do when a category appears during transform that was not seen during fit. 'error' (the default) raises an error; 'ignore' encodes the unknown category as all zeros. 'ignore' is highly recommended for production.
  • drop: Whether to drop one of the one-hot encoded columns to avoid multicollinearity. None (the default) keeps every column; 'first' drops the first category of each feature; 'if_binary' drops the first category only for binary features; you can also pass an array naming the category to drop for each feature.
  • categories: The categories to encode. 'auto' (the default) learns them from the data; alternatively, pass a list of lists such as [['NY', 'LD', 'PR']] if you know the categories in advance.
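
A short sketch of the drop parameter (useful before linear models, where keeping all columns makes them perfectly collinear):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['Red'], ['Green'], ['Blue'], ['Red']])

# drop='first' removes the first category ('Blue', alphabetically),
# so each row is still uniquely identified by the remaining columns
encoder = OneHotEncoder(sparse_output=False, drop='first')
print(encoder.fit_transform(colors))
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]
#  [0. 1.]]
print(encoder.get_feature_names_out())   # ['x0_Green' 'x0_Red']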

OneHotEncoder vs. pandas.get_dummies()

This is a common point of confusion.

  • Primary use: OneHotEncoder is designed for machine learning pipelines; pandas.get_dummies() is best for quick data analysis and exploration.
  • Integration: OneHotEncoder works seamlessly with scikit-learn's Pipeline and ColumnTransformer; get_dummies() simply returns a new DataFrame and has no fitted state to reuse on new data.
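
The practical difference is easiest to see with a train/test split; a minimal sketch (the column values are illustrative):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'City': ['New York', 'London']})
test = pd.DataFrame({'City': ['Paris']})   # category unseen in training

# get_dummies encodes each frame independently, so the columns differ
print(pd.get_dummies(train).columns.tolist())   # ['City_London', 'City_New York']
print(pd.get_dummies(test).columns.tolist())    # ['City_Paris']

# OneHotEncoder remembers the categories from fit, keeping the output
# shape consistent; the unseen 'Paris' becomes an all-zero row
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(train)
print(encoder.transform(test))                  # [[0. 0.]]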