
How can a Python DataFrame efficiently handle missing NaN values?

Of course! Working with NaN (Not a Number) values is a fundamental skill in data analysis with Python's pandas library. A DataFrame is a 2D table-like data structure, and NaN is how pandas represents missing or undefined data.


Here's a comprehensive guide covering everything you need to know about handling NaN in a pandas DataFrame.


What is NaN?

  • Definition: NaN stands for "Not a Number". It's a special floating-point value defined by the IEEE 754 standard.
  • In Pandas: Pandas uses numpy.nan (written np.nan in code) to represent missing values in numeric columns. The newer nullable dtypes use a separate missing-value sentinel, pd.NA.
  • Key Property: Any arithmetic operation with NaN results in NaN. This is a crucial concept to remember.
import pandas as pd
import numpy as np
# Example of NaN arithmetic
print(np.nan + 10)  # Output: nan
print(np.nan * 5)   # Output: nan
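One practical consequence worth noting: because NaN is a float, a column that would otherwise hold integers is upcast to float64 as soon as it contains a NaN — which is why Age prints as 25.0 in the outputs below. A quick sketch:

```python
import pandas as pd
import numpy as np

# The integers 1 and 2 are upcast to floats because NaN is a float
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64
```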

Creating a DataFrame with NaN Values

You can create a DataFrame with NaN values in several ways.

a) Directly with numpy.nan

import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, np.nan, 30, 29],
        'City': ['New York', 'London', np.nan, 'Paris'],
        'Salary': [70000, 80000, 90000, np.nan]}
df = pd.DataFrame(data)
print(df)

Output:

      Name   Age      City   Salary
0    Alice  25.0  New York  70000.0
1      Bob   NaN    London  80000.0
2  Charlie  30.0       NaN  90000.0
3    David  29.0     Paris      NaN

b) From a CSV file with missing values

When reading a CSV file, pandas automatically converts empty cells or placeholders like NA, NULL, or N/A into NaN.

# Let's assume you have a file 'data.csv' like this:
# Name,Age,City
# Alice,25,New York
# Bob,,London
# Charlie,30,
# David,29,Paris
df = pd.read_csv('data.csv')
print(df)
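If your file uses a custom placeholder that pandas does not recognize by default, you can declare it with the na_values parameter of read_csv. A small self-contained sketch (the column names and the "missing" placeholder are made up for illustration):

```python
import io
import pandas as pd

csv_text = "Name,Age\nAlice,25\nBob,missing\n"
# Tell pandas to treat the placeholder "missing" as NaN
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
print(df["Age"].isna().sum())  # 1
```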

Detecting NaN Values

You can't use == to check for NaN because np.nan == np.nan evaluates to False. Instead, use pandas' built-in methods.
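A quick demonstration of this pitfall:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False -- NaN never compares equal to itself
print(pd.isna(np.nan))   # True  -- use pd.isna() (or df.isna()) instead
```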

isna() and isnull()

These methods are identical. They return a DataFrame of booleans, where True indicates a NaN value.

print(df.isna())

Output:

    Name    Age   City  Salary
0  False  False  False   False
1  False   True  False   False
2  False  False   True   False
3  False  False  False    True

notna() and notnull()

The opposite of isna(). Returns True for non-NaN values.

Python DataFrame如何高效处理缺失值NaN?-图3
(图片来源网络,侵删)
print(df.notna())

sum() with isna()

A very common pattern is to count the number of missing values in each column.

# Count NaNs in each column
print(df.isna().sum())

Output:

Name      0
Age       1
City      1
Salary    1
dtype: int64
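The same pattern extends to other axes. A sketch (the small frame here is made up for illustration) counting the total number of NaNs in the whole frame, and per row with axis=1:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 30],
                   'City': ['New York', 'London', np.nan]})

# Total NaNs in the entire DataFrame
print(df.isna().sum().sum())   # 2

# NaNs per row instead of per column
print(df.isna().sum(axis=1))   # 0, 1, 1
```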

Handling NaN Values (The Core of Data Cleaning)

This is the most important part. Your strategy depends on the context of your data and the goal of your analysis.

Strategy 1: Remove Missing Values

Use this when the number of missing values is small and you don't want to introduce bias.

dropna()

This method removes rows or columns that contain NaN values.

  • Default: dropna() removes any row that contains at least one NaN.
df_cleaned = df.dropna()
print(df_cleaned)

Output:

    Name   Age      City   Salary
0  Alice  25.0  New York  70000.0

Note: Rows 1, 2, and 3 are all removed because each contains at least one NaN (in Age, City, and Salary respectively).

  • axis=1: To remove columns instead of rows.
df_cleaned_cols = df.dropna(axis=1)
print(df_cleaned_cols)

Output:

      Name
0    Alice
1      Bob
2  Charlie
3    David

Note: 'Age', 'City', and 'Salary' columns are gone because they contain NaNs.

  • how='all': Only remove a row/column if all its values are NaN.

  • thresh: Require a row/column to have at least thresh non-NaN values to be kept.

# Keep rows that have at least 3 non-NaN values
df_thresh = df.dropna(thresh=3)
print(df_thresh)

Output:

      Name   Age      City   Salary
0    Alice  25.0  New York  70000.0
1      Bob   NaN    London  80000.0
2  Charlie  30.0       NaN  90000.0
3    David  29.0     Paris      NaN

Note: Every row in this DataFrame has at least 3 non-NaN values, so nothing is dropped. Unlike the default dropna(), thresh=3 keeps rows 1-3 even though each contains a NaN; a row with fewer than 3 non-NaN values would be removed.
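A minimal sketch of how='all' (the tiny frame here is made up for illustration): only a row in which every value is missing gets dropped.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, np.nan],
                   'B': [4, 5, np.nan]})

# Only row 2 (where both A and B are NaN) is removed
print(df.dropna(how='all'))
```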

Strategy 2: Impute (Fill) Missing Values

Use this when you want to keep all the data points and are willing to make an educated guess for the missing values.

fillna()

This method fills NaN values with a specified value.

  • Filling with a specific scalar value (e.g., mean, median, mode, or a constant).
# Fill missing 'Age' with the mean age
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
# Fill missing 'City' with the most frequent city (mode)
mode_city = df['City'].mode()[0] # [0] because mode() can return multiple values
df['City'] = df['City'].fillna(mode_city)
# Fill missing 'Salary' with 0
df['Salary'] = df['Salary'].fillna(0)
print(df)

Output:

      Name   Age     City   Salary
0    Alice  25.0  New York  70000.0
1      Bob  28.0    London  80000.0
2  Charlie  30.0    London  90000.0
3    David  29.0    Paris      0.0
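fillna() also accepts a dict mapping column names to fill values, which lets you express several per-column fills in one call. A sketch using a small made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25.0, np.nan],
                   'Salary': [70000.0, np.nan]})

# One call: fill Age with its mean, Salary with 0
df = df.fillna({'Age': df['Age'].mean(), 'Salary': 0})
print(df)
```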
  • Forward Fill (ffill) and Backward Fill (bfill): Useful for time-series data. ffill propagates the last valid observation forward, while bfill uses the next valid observation.
# Create a new DataFrame for this example
ts_data = {'A': [1, 2, np.nan, 4, 5]}
ts_df = pd.DataFrame(ts_data)
# Forward fill
print("Forward Fill:")
print(ts_df.ffill())
# Backward fill
print("\nBackward Fill:")
print(ts_df.bfill())

Output:

Forward Fill:
     A
0  1.0
1  2.0
2  2.0  # NaN filled with the previous value (2)
3  4.0
4  5.0
Backward Fill:
     A
0  1.0
1  2.0
2  4.0  # NaN filled with the next value (4)
3  4.0
4  5.0
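Both ffill() and bfill() accept a limit parameter that caps how many consecutive NaNs get filled — useful when a long gap should stay missing rather than be padded with stale values. A small sketch (the series is made up for illustration):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# limit=1 fills at most one consecutive NaN; the second stays missing
print(s.ffill(limit=1).tolist())  # [1.0, 1.0, nan, 4.0]
```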

Special Case: Interpolation

Sometimes, the best estimate for a missing value is a value between its neighbors. interpolate() does this for you, using methods like linear, polynomial, etc.

interp_data = {'A': [1, 2, np.nan, 4, 5]}
interp_df = pd.DataFrame(interp_data)
# Default is linear interpolation
interp_df['A'] = interp_df['A'].interpolate()
print(interp_df)

Output:

     A
0  1.0
1  2.0
2  3.0  # (2 + 4) / 2 = 3
3  4.0
4  5.0

Summary Table: NaN Handling Methods

Method                   Purpose                                    Key Parameters
df.isna() / df.isnull()  Detect missing values (True where NaN)     —
df.notna() / df.notnull() Detect non-missing values                 —
df.dropna()              Remove rows/columns containing NaN         axis, how, thresh
df.fillna()              Replace NaN with a specified value         value
df.ffill() / df.bfill()  Propagate the previous/next valid value    limit
df.interpolate()         Estimate NaN from neighboring values       method