
How does Python's LabelEncoder implement categorical encoding?

The LabelEncoder in Python is a fundamental tool from the scikit-learn library for encoding categorical labels as integers. This is a crucial preprocessing step because most machine learning algorithms work with numbers, not text.


Let's break it down, from a simple explanation to practical examples and important considerations.

What is a LabelEncoder?

A LabelEncoder assigns a unique integer to each unique label (category) in a dataset.

Example: Given a list of fruits ['apple', 'orange', 'banana', 'apple'], a LabelEncoder transforms it into [0, 2, 1, 0]:

  • apple -> 0
  • banana -> 1
  • orange -> 2
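
A minimal sketch of this exact example, using nothing beyond the standard scikit-learn API:

from sklearn.preprocessing import LabelEncoder
fruits = ['apple', 'orange', 'banana', 'apple']
le = LabelEncoder()
print(le.fit_transform(fruits))  # [0 2 1 0]
print(le.classes_)               # ['apple' 'banana' 'orange'] -- sorted alphabetically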

Why Do We Need It?

Machine learning models are mathematical. They perform operations like matrix multiplication and gradient descent, which require numerical input. If you feed them text strings, they won't know how to process them.


LabelEncoder is the simplest way to convert these text labels into a format that an algorithm can understand.


How to Use It: A Step-by-Step Guide

First, you need to install scikit-learn if you haven't already:

pip install scikit-learn

Basic Encoding

Let's start with the most basic example: encoding a simple list of labels.

from sklearn.preprocessing import LabelEncoder
import numpy as np
# 1. Your categorical data
labels = ['red', 'green', 'blue', 'green', 'red', 'yellow']
# 2. Create an instance of LabelEncoder
le = LabelEncoder()
# 3. Fit and transform the data
# .fit() learns the unique categories
# .transform() converts the categories to numbers
encoded_labels = le.fit_transform(labels)
print("Original Labels:", labels)
print("Encoded Labels:", encoded_labels)
# Inspect the learned mapping: classes_ holds the labels in sorted order
print("Class Mapping:", {label: idx for idx, label in enumerate(le.classes_.tolist())})

Output:

Original Labels: ['red', 'green', 'blue', 'green', 'red', 'yellow']
Encoded Labels: [2 1 0 1 2 3]
Class Mapping: {'blue': 0, 'green': 1, 'red': 2, 'yellow': 3}

Notice how yellow was assigned 3. This is because LabelEncoder assigns integers in alphabetical order of the unique labels (blue, green, red, yellow).
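
Because the mapping is stored on the fitted encoder, you can keep encoding later batches consistently without refitting; reusing le from above:

# transform() reuses the mapping learned by fit(); no refitting happens here
print(le.transform(['yellow', 'blue']))  # [3 0]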

Decoding (Inverse Transform)

What if you need to convert your numbers back to the original labels? You can use the inverse_transform method.

# Using the 'le' and 'encoded_labels' from the previous example
decoded_labels = le.inverse_transform(encoded_labels)
print("Encoded Labels:", encoded_labels)
print("Decoded Labels:", decoded_labels)

Output:

Encoded Labels: [2 1 0 1 2 3]
Decoded Labels: ['red' 'green' 'blue' 'green' 'red' 'yellow']
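
In practice, inverse_transform is most useful for turning a model's integer predictions back into readable labels. A tiny illustration (the integers below stand in for hypothetical predictions):

fake_predictions = [0, 3, 2]  # pretend these came from a classifier
print(le.inverse_transform(fake_predictions))  # ['blue' 'yellow' 'red']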

Handling New, Unseen Data

This is a very important concept. What happens if you have new data during prediction that contains a category the encoder hasn't seen before?

# The encoder was fitted on ['red', 'green', 'blue', 'yellow']
new_data = ['purple', 'red', 'green']
try:
    le.transform(new_data)
except ValueError as e:
    print(f"Error: {e}")

Output:

Error: y contains previously unseen labels: ['purple']

Solution: You must handle this explicitly. Two common strategies are (a) mapping unseen labels to a dedicated "unknown" category, or (b) refitting the encoder on the combined set of old and new data. Note that refitting changes the existing integer assignments, so anything trained on the old encoding must be re-encoded and retrained.
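
A minimal sketch of the first strategy. The '<unknown>' placeholder name and the safe_transform helper are illustrative conventions, not part of scikit-learn:

# Reserve an explicit '<unknown>' class when fitting (illustrative convention)
le_unk = LabelEncoder()
le_unk.fit(labels + ['<unknown>'])
def safe_transform(encoder, values):
    # Swap any label the encoder has never seen for the placeholder first
    known = set(encoder.classes_)
    cleaned = [v if v in known else '<unknown>' for v in values]
    return encoder.transform(cleaned)
print(safe_transform(le_unk, new_data))  # [0 3 2]: 'purple' falls into the '<unknown>' slot

The second strategy, refitting on the combined data, looks like this: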

# Combine old and new data
all_data = labels + new_data
# Create a NEW encoder instance
le_new = LabelEncoder()
# Fit and transform on the combined data
encoded_all = le_new.fit_transform(all_data)
print("All Data:", all_data)
print("Encoded All Data:", encoded_all)
# Now, transforming the original 'new_data' will work
encoded_new_only = le_new.transform(new_data)
print("\nEncoded New Data Only:", encoded_new_only)

Output:

All Data: ['red', 'green', 'blue', 'green', 'red', 'yellow', 'purple', 'red', 'green']
Encoded All Data: [3 1 0 1 3 4 2 3 1]

Encoded New Data Only: [2 3 1]

Notice purple is now 2 and red is 3, because the encoding is based on the alphabetical order of the combined dataset. Every previously encoded value has shifted, which is exactly why refitting forces you to re-encode (and typically retrain on) everything downstream.


Key Limitations and When to Use It

LabelEncoder is simple, but it has a major limitation.

The Problem: Ordinal Relationship

LabelEncoder assigns integers (0, 1, 2, 3...). Some machine learning models (like Linear Regression, Logistic Regression, SVMs) might mistakenly interpret these integers as having an ordinal relationship.

For example, with ['low', 'medium', 'high'], LabelEncoder sorts alphabetically and produces high=0, low=1, medium=2, scrambling the natural order. And for genuinely unordered categories, any integer assignment invents a ranking (e.g. "red" > "blue") that does not exist. Either way, models that treat the feature numerically can learn spurious relationships, which hurts performance.
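
When a feature really is ordinal, scikit-learn's OrdinalEncoder lets you state the order explicitly instead of relying on alphabetical sorting. A minimal sketch:

from sklearn.preprocessing import OrdinalEncoder
import numpy as np
# Spell out the order so that low < medium < high survives the encoding
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
sizes = np.array(['low', 'high', 'medium', 'low']).reshape(-1, 1)
print(oe.fit_transform(sizes).ravel())  # [0. 2. 1. 0.]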

When to Use LabelEncoder:

  • Target Variable (y): It's perfectly fine and often necessary to use LabelEncoder for your target variable in classification problems (e.g., converting ['cat', 'dog', 'bird'] to [1, 2, 0]); see the round-trip sketch after this list.
  • Categorical Features with Intrinsic Order: If your categories have a natural order (ordinal data), like ['bad', 'average', 'good'], an integer encoding can be appropriate. Keep in mind, though, that LabelEncoder would sort these alphabetically (average=0, bad=1, good=2), so when the natural order matters, use OrdinalEncoder with an explicit category list, as sketched above.
  • Tree-Based Models: Models like Decision Trees, Random Forests, and Gradient Boosting are generally not affected by the arbitrary integer assignments from LabelEncoder, as they split based on thresholds and don't assume linear relationships.
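
A minimal round-trip sketch for the target-variable case; the toy features and the choice of LogisticRegression are purely illustrative:

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
X = [[0.2], [1.4], [3.1], [2.8]]
y_text = ['cat', 'dog', 'bird', 'dog']
le_y = LabelEncoder()
y = le_y.fit_transform(y_text)        # bird=0, cat=1, dog=2
clf = LogisticRegression().fit(X, y)  # the model only ever sees integers
pred = clf.predict([[1.0]])
print(le_y.inverse_transform(pred))   # decode back to a text label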

When to Avoid LabelEncoder (and use OneHotEncoder instead):

  • Nominal Features (No Intrinsic Order): For features like ['country', 'city', 'color'] where there is no order, you should use OneHotEncoder. It creates a new binary column for each category, avoiding the artificial ordinal relationship.

Example of OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
# Nominal data
data = ['USA', 'UK', 'Germany', 'USA']
# OneHotEncoder expects a 2D array: one row per sample, one column per feature
data_2d = np.array(data).reshape(-1, 1)
ohe = OneHotEncoder(sparse_output=False)  # dense output for readability (scikit-learn >= 1.2)
one_hot_encoded = ohe.fit_transform(data_2d)
# Convert to a DataFrame for clarity, naming the input feature 'country'
df = pd.DataFrame(one_hot_encoded, columns=ohe.get_feature_names_out(['country']))
print(df)

Output:

   country_Germany  country_UK  country_USA
0             0.0         0.0          1.0
1             0.0         1.0          0.0
2             1.0         0.0          0.0
3             0.0         0.0          1.0

This representation is much safer for non-tree-based models.
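
For quick exploratory work, pandas offers a one-liner that yields the same kind of table. A sketch; note that unlike a fitted OneHotEncoder, get_dummies does not remember the category set between calls, so it is less suitable for train/test pipelines:

import pandas as pd
print(pd.get_dummies(pd.Series(['USA', 'UK', 'Germany', 'USA']), prefix='country'))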


Summary: LabelEncoder vs. OneHotEncoder

Feature           | LabelEncoder                                              | OneHotEncoder
Purpose           | Encode labels into integers.                              | Encode categorical features into binary vectors.
Input             | 1D array (target or feature).                             | 2D array (features only).
Output            | 1D array of integers.                                     | 2D array of binary values (0s and 1s).
When to use       | Target variable (y) in classification; ordinal features. | Nominal features (no order) for most models.
Key advantage     | Simple, compact representation.                           | Avoids creating false ordinal relationships.
Key disadvantage  | Can imply an order where none exists.                     | High dimensionality (curse of dimensionality) with many categories.