
How can a Python DataFrame efficiently handle missing NaN values?

Of course! Working with NaN (Not a Number) values is a fundamental skill in data analysis with Python's pandas library. A DataFrame is a 2D table-like data structure, and NaN is how pandas represents missing or undefined data.


Here's a comprehensive guide covering everything you need to know about handling NaN in a pandas DataFrame.


What is NaN?

  • Definition: NaN stands for "Not a Number". It's a special floating-point value defined by the IEEE 754 standard.
  • In Pandas: Pandas uses numpy.nan (written np.nan in code) to represent missing values in numeric columns. The newer nullable dtypes use a separate missing-value sentinel, pd.NA.
  • Key Property: Any arithmetic operation with NaN results in NaN. This is a crucial concept to remember.
import pandas as pd
import numpy as np
# Example of NaN arithmetic
print(np.nan + 10)  # Output: nan
print(np.nan * 5)   # Output: nan
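One practical consequence worth noting: because NaN is a float, a column that would otherwise hold integers is upcast to float64 as soon as it contains a NaN — which is why Age prints as 25.0 in the outputs below. A quick sketch:

```python
import pandas as pd
import numpy as np

# The integers 1 and 2 are upcast to floats because NaN is a float
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64
```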

Creating a DataFrame with NaN Values

You can create a DataFrame with NaN values in several ways.

a) Directly with numpy.nan

import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, np.nan, 30, 29],
        'City': ['New York', 'London', np.nan, 'Paris'],
        'Salary': [70000, 80000, 90000, np.nan]}
df = pd.DataFrame(data)
print(df)

Output:

      Name   Age      City   Salary
0    Alice  25.0  New York  70000.0
1      Bob   NaN    London  80000.0
2  Charlie  30.0       NaN  90000.0
3    David  29.0     Paris      NaN

b) From a CSV file with missing values

When reading a CSV file, pandas automatically converts empty cells or placeholders like NA, NULL, or N/A into NaN.

# Let's assume you have a file 'data.csv' like this:
# Name,Age,City
# Alice,25,New York
# Bob,,London
# Charlie,30,
# David,29,Paris
df = pd.read_csv('data.csv')
print(df)
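If your file uses a custom placeholder that pandas does not recognize by default, you can declare it with the na_values parameter of read_csv. A small self-contained sketch (the column names and the "missing" placeholder are made up for illustration):

```python
import io
import pandas as pd

csv_text = "Name,Age\nAlice,25\nBob,missing\n"
# Tell pandas to treat the placeholder "missing" as NaN
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
print(df["Age"].isna().sum())  # 1
```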

Detecting NaN Values

You can't use == to check for NaN because np.nan == np.nan evaluates to False. Instead, use pandas' built-in methods.
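A quick demonstration of this pitfall:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False -- NaN never compares equal to itself
print(pd.isna(np.nan))   # True  -- use pd.isna() (or df.isna()) instead
```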

isna() and isnull()

These methods are identical. They return a DataFrame of booleans, where True indicates a NaN value.

print(df.isna())

Output:

    Name    Age   City  Salary
0  False  False  False   False
1  False   True  False   False
2  False  False   True   False
3  False  False  False    True

notna() and notnull()

The opposite of isna(). Returns True for non-NaN values.

Python DataFrame如何高效处理缺失值NaN?-图3
(图片来源网络,侵删)
print(df.notna())

sum() with isna()

A very common pattern is to count the number of missing values in each column.

# Count NaNs in each column
print(df.isna().sum())

Output:

Name      0
Age       1
City      1
Salary    1
dtype: int64
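The same pattern extends to other axes. A sketch (the small frame here is made up for illustration) counting the total number of NaNs in the whole frame, and per row with axis=1:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 30],
                   'City': ['New York', 'London', np.nan]})

# Total NaNs in the entire DataFrame
print(df.isna().sum().sum())   # 2

# NaNs per row instead of per column
print(df.isna().sum(axis=1))   # 0, 1, 1
```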

Handling NaN Values (The Core of Data Cleaning)

This is the most important part. Your strategy depends on the context of your data and the goal of your analysis.

Strategy 1: Remove Missing Values

Use this when the number of missing values is small and you don't want to introduce bias.

dropna()

This method removes rows or columns that contain NaN values.

  • Default: dropna() removes any row that contains at least one NaN.
df_cleaned = df.dropna()
print(df_cleaned)

Output:

    Name   Age      City   Salary
0  Alice  25.0  New York  70000.0

Note: Rows 1, 2, and 3 are all removed because each contains at least one NaN (in Age, City, and Salary respectively).

  • axis=1: To remove columns instead of rows.
df_cleaned_cols = df.dropna(axis=1)
print(df_cleaned_cols)

Output:

      Name
0    Alice
1      Bob
2  Charlie
3    David

Note: 'Age', 'City', and 'Salary' columns are gone because they contain NaNs.

  • how='all': Only remove a row/column if all its values are NaN.

  • thresh: Require a row/column to have at least thresh non-NaN values to be kept.

# Keep rows that have at least 3 non-NaN values
df_thresh = df.dropna(thresh=3)
print(df_thresh)

Output:

      Name   Age      City   Salary
0    Alice  25.0  New York  70000.0
1      Bob   NaN    London  80000.0
2  Charlie  30.0       NaN  90000.0
3    David  29.0     Paris      NaN

Note: Every row in this DataFrame has at least 3 non-NaN values, so nothing is dropped. Unlike the default dropna(), thresh=3 keeps rows 1-3 even though each contains a NaN; a row with fewer than 3 non-NaN values would be removed.
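A minimal sketch of how='all' (the tiny frame here is made up for illustration): only a row in which every value is missing gets dropped.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, np.nan],
                   'B': [4, 5, np.nan]})

# Only row 2 (where both A and B are NaN) is removed
print(df.dropna(how='all'))
```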

Strategy 2: Impute (Fill) Missing Values

Use this when you want to keep all the data points and are willing to make an educated guess for the missing values.

fillna()

This method fills NaN values with a specified value.

  • Filling with a specific scalar value (e.g., mean, median, mode, or a constant).
# Fill missing 'Age' with the mean age
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
# Fill missing 'City' with the most frequent city (mode)
mode_city = df['City'].mode()[0] # [0] because mode() can return multiple values
df['City'] = df['City'].fillna(mode_city)
# Fill missing 'Salary' with 0
df['Salary'] = df['Salary'].fillna(0)
print(df)

Output:

      Name   Age     City   Salary
0    Alice  25.0  New York  70000.0
1      Bob  28.0    London  80000.0
2  Charlie  30.0    London  90000.0
3    David  29.0    Paris      0.0
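fillna() also accepts a dict mapping column names to fill values, which lets you express several per-column fills in one call. A sketch using a small made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25.0, np.nan],
                   'Salary': [70000.0, np.nan]})

# One call: fill Age with its mean, Salary with 0
df = df.fillna({'Age': df['Age'].mean(), 'Salary': 0})
print(df)
```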
  • Forward Fill (ffill) and Backward Fill (bfill): Useful for time-series data. ffill propagates the last valid observation forward, while bfill uses the next valid observation.
# Create a new DataFrame for this example
ts_data = {'A': [1, 2, np.nan, 4, 5]}
ts_df = pd.DataFrame(ts_data)
# Forward fill
print("Forward Fill:")
print(ts_df.ffill())
# Backward fill
print("\nBackward Fill:")
print(ts_df.bfill())

Output:

Forward Fill:
     A
0  1.0
1  2.0
2  2.0  # NaN filled with the previous value (2)
3  4.0
4  5.0
Backward Fill:
     A
0  1.0
1  2.0
2  4.0  # NaN filled with the next value (4)
3  4.0
4  5.0
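Both ffill() and bfill() accept a limit parameter that caps how many consecutive NaNs get filled — useful when a long gap should stay missing rather than be padded with stale values. A small sketch (the series is made up for illustration):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# limit=1 fills at most one consecutive NaN; the second stays missing
print(s.ffill(limit=1).tolist())  # [1.0, 1.0, nan, 4.0]
```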

Special Case: Interpolation

Sometimes, the best estimate for a missing value is a value between its neighbors. interpolate() does this for you, using methods like linear, polynomial, etc.

interp_data = {'A': [1, 2, np.nan, 4, 5]}
interp_df = pd.DataFrame(interp_data)
# Default is linear interpolation
interp_df['A'] = interp_df['A'].interpolate()
print(interp_df)

Output:

     A
0  1.0
1  2.0
2  3.0  # (2 + 4) / 2 = 3
3  4.0
4  5.0

Summary Table: NaN Handling Methods

Method                   Purpose                                    Key Parameters
df.isna() / df.isnull()  Detect missing values (True where NaN)     —
df.notna() / df.notnull() Detect non-missing values                 —
df.dropna()              Remove rows/columns containing NaN         axis, how, thresh
df.fillna()              Replace NaN with a specified value         value
df.ffill() / df.bfill()  Propagate the previous/next valid value    limit
df.interpolate()         Estimate NaN from neighboring values       method