杰瑞科技汇

How does Python's combine_first merge data?

The combine_first() method in pandas fills missing values in a DataFrame or Series using data from another DataFrame or Series.


Let's break it down with clear explanations and examples.

What is combine_first()?

At its core, combine_first() is used to "patch" missing values (NaN) in the calling object (let's call it A) with values from another object (B).

The key rule is: If a value in A is missing (NaN), it is replaced with the corresponding value from B. If the value in A is not missing, it is kept as is.

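The operation is performed element-wise after aligning the two objects on their index (for Series) or on their index and columns (for DataFrames). The result covers the union of the labels from both objects, so rows or columns that exist only in B also appear in the output.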
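
The patching rule can also be spelled out by hand, which may make the alignment step more concrete. This is only a minimal sketch of the idea, not how pandas implements the method internally; the names a, b, and manual are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: `a` has a gap, `b` has an extra label 'z'
a = pd.Series([1.0, np.nan], index=["x", "y"])
b = pd.Series([10.0, 20.0, 30.0], index=["x", "y", "z"])

# Step 1: align both Series on the union of their index labels
a_aligned, b_aligned = a.align(b, join="outer")

# Step 2: keep a's value where it is present, otherwise take b's value
manual = a_aligned.where(a_aligned.notna(), b_aligned)

result = a.combine_first(b)
print(result)
print(manual.equals(result))
```

For this data the hand-written version and combine_first() should agree, so the last line is expected to print True.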


combine_first() with Series

This is the simplest case. We have two Series, and we want to fill missing values in the first one using the second one.

How it Works:

  1. Alignment: The two Series are aligned by their index.
  2. Filling: For each index label, if the value in the first Series is NaN, it's replaced by the value from the second Series at that same label. If the value in the first Series is not NaN, it remains unchanged.
  3. Union: Labels that exist only in the second Series are added to the result with the second Series' values.

Example:

Let's create two Series. s1 has some missing values, and s2 has the values we want to use to fill them.

import pandas as pd
import numpy as np
# Series with missing values
s1 = pd.Series([10, 20, np.nan, 40, np.nan], index=['a', 'b', 'c', 'd', 'e'])
print("Series s1 (original):")
print(s1)
print("-" * 30)
# Series to fill the missing values from
s2 = pd.Series([100, 200, 300, 400, 500], index=['a', 'c', 'e', 'f', 'g'])
print("Series s2 (filler):")
print(s2)
print("-" * 30)
# Use combine_first to fill missing values in s1 with values from s2
s_filled = s1.combine_first(s2)
print("Series s1 after combine_first(s2):")
print(s_filled)

Output:

Series s1 (original):
a    10.0
b    20.0
c     NaN
d    40.0
e     NaN
dtype: float64
------------------------------
Series s2 (filler):
a    100
c    200
e    300
f    400
g    500
dtype: int64
------------------------------
Series s1 after combine_first(s2):
a     10.0  # Not NaN in s1, kept as 10
b     20.0  # Not NaN in s1, kept as 20
c    200.0  # Was NaN in s1, filled with 200 from s2
d     40.0  # Not NaN in s1, kept as 40
e    300.0  # Was NaN in s1, filled with 300 from s2
f    400.0  # Only in s2, added to the result
g    500.0  # Only in s2, added to the result
dtype: float64

Explanation:

  • s1['a'] is 10, so it's kept.
  • s1['b'] is 20, so it's kept.
  • s1['c'] is NaN, so it's replaced with s2['c'], which is 200.
  • s1['d'] is 40, so it's kept.
  • s1['e'] is NaN, so it's replaced with s2['e'], which is 300.

Notice that indices f and g exist only in s2. They are not dropped: combine_first() returns the union of both indexes, so f and g appear in the result with the values 400 and 500. The value s2['a'] (100) is never used, because s1['a'] is not missing.
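
Because combine_first() returns the union of both indexes, the result can be longer than s1. If only the labels of s1 are wanted, one option is to trim the result afterwards. The sketch below reuses the s1 and s2 from the example; the name trimmed is just illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([10, 20, np.nan, 40, np.nan], index=["a", "b", "c", "d", "e"])
s2 = pd.Series([100, 200, 300, 400, 500], index=["a", "c", "e", "f", "g"])

# combine_first() yields labels a..g; reindex() trims back to s1's labels
trimmed = s1.combine_first(s2).reindex(s1.index)
print(trimmed)
```

Here trimmed keeps the five labels a to e, with c and e filled from s2.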


combine_first() with DataFrames

This is where combine_first() becomes incredibly useful, especially for time-series data or data with aligned columns.

How it Works:

  1. Alignment: The two DataFrames are aligned by both their index and columns.
  2. Filling: For each cell at (row, col):
    • If the value in the first DataFrame (A) is NaN, it's replaced by the value from the second DataFrame (B) at the same (row, col).
    • If the value in A is not NaN, it is kept.
    • If a row or column exists only in B, it is added to the result with B's values, so the result covers the union of both sets of labels.
    • If a row or column exists only in A, B has nothing to offer for it, so any NaNs there remain NaN.

Example:

Imagine df1 is our primary DataFrame with some missing data, and df2 is a secondary source with data we can use to fill the gaps.

# DataFrame with missing values
df1 = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
}, index=['row1', 'row2', 'row3', 'row4'])
print("DataFrame df1 (original):")
print(df1)
print("-" * 40)
# DataFrame to fill missing values from
df2 = pd.DataFrame({
    'A': [100, 200, 300, 400],
    'B': [500, 600, 700, 800],
    'D': [13, 14, 15, 16] # Column 'D' is only in df2
}, index=['row1', 'row2', 'row3', 'row4'])
print("DataFrame df2 (filler):")
print(df2)
print("-" * 40)
# Use combine_first to fill missing values in df1 with values from df2
df_filled = df1.combine_first(df2)
print("DataFrame df1 after combine_first(df2):")
print(df_filled)

Output:

DataFrame df1 (original):
         A    B   C
row1   1.0  5.0   9
row2   2.0  NaN  10
row3   NaN  NaN  11
row4   4.0  8.0  12
----------------------------------------
DataFrame df2 (filler):
        A    B   D
row1  100  500  13
row2  200  600  14
row3  300  700  15
row4  400  800  16
----------------------------------------
DataFrame df1 after combine_first(df2):
          A      B   C   D
row1    1.0    5.0   9  13
row2    2.0  600.0  10  14
row3  300.0  700.0  11  15
row4    4.0    8.0  12  16

Explanation:

  • Column 'A': df1['A']['row3'] was NaN, so it was filled with df2['A']['row3'] which is 300. Other values in df1['A'] were kept.
  • Column 'B': df1['B']['row2'] and df1['B']['row3'] were NaN, so they were filled with 600 and 700 from df2.
  • Column 'C': This column exists only in df1 and has no NaNs, so it is unchanged.
  • Column 'D': This column exists only in df2. It is not ignored: the result contains the union of the columns of both DataFrames, so 'D' is carried over with df2's values. Whether it prints as 13 or 13.0 can depend on the pandas version.
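
The union behaviour applies to rows as well as columns. The following sketch uses two small hypothetical DataFrames (primary and backup are illustrative names, not part of the example above) to show the result growing in both directions, and one possible way to trim it back to the shape of the first DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data: `backup` has an extra row 'r3' and an extra column 'D'
primary = pd.DataFrame(
    {"A": [1.0, np.nan], "B": [np.nan, 4.0]},
    index=["r1", "r2"],
)
backup = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0], "D": [7.0, 8.0, 9.0]},
    index=["r1", "r2", "r3"],
)

# The result has rows r1..r3 and columns A, B, D
full = primary.combine_first(backup)
print(full)

# Keep only the rows and columns that `primary` started with
trimmed = full.reindex(index=primary.index, columns=primary.columns)
print(trimmed)
```

Note that primary.loc['r1', 'B'] stays NaN in both results, because backup has no column 'B' to fill it from.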

Common Use Case: Stitching Time Series Together

A classic use case for combine_first is to create a complete time series by combining two partially overlapping ones.

Let's say we have stock prices recorded by two different systems, and we want to merge them into a single, continuous series.

import pandas as pd
import numpy as np
# Create two time series with some overlapping and non-overlapping data
dates = pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04', '2025-01-05'])
# System A has data for the beginning
prices_a = pd.Series([100, 102, np.nan, np.nan, np.nan], index=dates)
print("System A Prices:")
print(prices_a)
print("-" * 30)
# System B has data for the end
prices_b = pd.Series([np.nan, np.nan, 105, 107, 110], index=dates)
print("System B Prices:")
print(prices_b)
print("-" * 30)
# Combine them to get a continuous price series
combined_prices = prices_a.combine_first(prices_b)
print("Combined Price Series:")
print(combined_prices)

Output:

System A Prices:
2025-01-01    100.0
2025-01-02    102.0
2025-01-03      NaN
2025-01-04      NaN
2025-01-05      NaN
dtype: float64
------------------------------
System B Prices:
2025-01-01      NaN
2025-01-02      NaN
2025-01-03    105.0
2025-01-04    107.0
2025-01-05    110.0
dtype: float64
------------------------------
Combined Price Series:
2025-01-01    100.0
2025-01-02    102.0
2025-01-03    105.0
2025-01-04    107.0
2025-01-05    110.0
dtype: float64

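This stitches the two time series into one continuous series. Where both systems report a value for the same date, the value from the calling Series (prices_a here) takes precedence.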
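
In the example above the two systems never report a value for the same date. When they do overlap, the order of the call decides which value wins. The sketch below uses made-up prices (sys_a and sys_b are hypothetical names) in which both systems have a value for 2025-01-02.

```python
import numpy as np
import pandas as pd

# Hypothetical data: both systems report a price on 2025-01-02
sys_a = pd.Series(
    [100.0, 102.0, np.nan],
    index=pd.date_range("2025-01-01", periods=3),
)
sys_b = pd.Series(
    [101.5, 105.0],
    index=pd.date_range("2025-01-02", periods=2),
)

# The calling Series has priority wherever it has a non-missing value
a_first = sys_a.combine_first(sys_b)
b_first = sys_b.combine_first(sys_a)

print(a_first)  # 2025-01-02 comes from sys_a (102.0)
print(b_first)  # 2025-01-02 comes from sys_b (101.5)
```

Choosing which Series calls combine_first() is therefore a way of stating which source is trusted more.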


combine_first() vs. fillna()

It's important not to confuse combine_first() with fillna().

Feature | combine_first(other) | fillna(value)
Purpose | Fill missing values using data from another pandas object (Series or DataFrame). | Fill missing values using a scalar, a dict, a Series, or a DataFrame.
Data Source | A Series or DataFrame. | A scalar, a dict, a Series, or a DataFrame (a list is not accepted).
Alignment | Aligns by index (and columns for DataFrames) with the other object. The result covers the union of the labels of both objects. | For a Series, aligns by the index of the fill value Series. For a DataFrame, fills all NaN with the same scalar or with column-specific values from a dict/Series. The result always keeps the caller's index and columns.
Flexibility | High. Can patch data from a complex secondary source. | Lower. Primarily for simple, predefined fill values.

In short: Use fillna() when you have a simple rule like "fill all NaNs with 0" or "fill column 'A' with 99 and column 'B' with -1". Use combine_first() when you have another entire dataset and you want to use its values to intelligently patch the missing spots in your primary dataset.
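
One practical difference is the shape of the result. A rough sketch, assuming two small hypothetical Series named s1 and s2: fillna() with a Series keeps the caller's index, while combine_first() also adds labels that exist only in the other Series.

```python
import numpy as np
import pandas as pd

# Hypothetical data: label 'd' exists only in s2, label 'c' only in s1
s1 = pd.Series([1.0, np.nan, np.nan], index=["a", "b", "c"])
s2 = pd.Series([20.0, 40.0], index=["b", "d"])

filled = s1.fillna(s2)          # index stays a, b, c
patched = s1.combine_first(s2)  # index becomes a, b, c, d

print(filled)
print(patched)
```

In both results 'c' stays NaN, since s2 has no value for it.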
