Of course! The combine_first() method in pandas is a powerful tool for filling missing values in a DataFrame or Series using data from another DataFrame or Series.

Let's break it down with clear explanations and examples.
What is combine_first()?
At its core, combine_first() is used to "patch" missing values (NaN) in the calling object (let's call it A) with values from another object (B).
The key rule is: If a value in A is missing (NaN), it is replaced with the corresponding value from B. If the value in A is not missing, it is kept as is.
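The operation is performed element-wise, aligning the data based on their index (for Series) or index and columns (for DataFrames). The result covers the union of the labels from A and B, so labels that exist only in B are carried into the output as well.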
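The rule above can be reproduced by hand, which may help make the alignment step concrete. The following is a minimal sketch, assuming two small made-up Series (`a` and `b` are illustrative data, not part of the examples below); it compares combine_first() with an explicit reindex followed by where().

```python
import numpy as np
import pandas as pd

# Illustrative data: 'y' is missing in a, 'w' exists only in b
a = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 40.0], index=["x", "y", "w"])

result = a.combine_first(b)

# Manual equivalent: align both Series on the union of their labels,
# then keep a's value wherever it is not NaN and fall back to b otherwise
labels = a.index.union(b.index)
a_aligned = a.reindex(labels)
b_aligned = b.reindex(labels)
manual = a_aligned.where(a_aligned.notna(), b_aligned)

print(result)
print(manual)
```

Both printouts should show the same four labels (w, x, y, z): x and z keep a's values, y is patched from b, and w comes from b alone.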

combine_first() with Series
This is the simplest case. We have two Series, and we want to fill missing values in the first one using the second one.
How it Works:
- Alignment: The two Series are aligned by their index. The result's index is the union of both indexes.
- Filling: For each index label, if the value in the first Series is NaN (or the label is absent from the first Series), the value from the second Series at that label is used. If the value in the first Series is not NaN, it remains unchanged.
Example:
Let's create two Series. s1 has some missing values, and s2 has the values we want to use to fill them.
import pandas as pd
import numpy as np
# Series with missing values
s1 = pd.Series([10, 20, np.nan, 40, np.nan], index=['a', 'b', 'c', 'd', 'e'])
print("Series s1 (original):")
print(s1)
print("-" * 30)
# Series to fill the missing values from
s2 = pd.Series([100, 200, 300, 400, 500], index=['a', 'c', 'e', 'f', 'g'])
print("Series s2 (filler):")
print(s2)
print("-" * 30)
# Use combine_first to fill missing values in s1 with values from s2
s_filled = s1.combine_first(s2)
print("Series s1 after combine_first(s2):")
print(s_filled)
Output:
Series s1 (original):
a 10.0
b 20.0
c NaN
d 40.0
e NaN
dtype: float64
------------------------------
Series s2 (filler):
a 100
c 200
e 300
f 400
g 500
dtype: int64
------------------------------
Series s1 after combine_first(s2):
a 10.0 # Not NaN in s1, kept as 10
b 20.0 # Not NaN in s1, kept as 20
c 200.0 # Was NaN in s1, filled with 200 from s2
d 40.0 # Not NaN in s1, kept as 40
e 300.0 # Was NaN in s1, filled with 300 from s2
f 400.0 # Not present in s1, taken from s2
g 500.0 # Not present in s1, taken from s2
dtype: float64
Explanation:

- s1['a'] is 10, so it's kept (s2['a'] is ignored).
- s1['b'] is 20, so it's kept.
- s1['c'] is NaN, so it's replaced with s2['c'], which is 200.
- s1['d'] is 40, so it's kept.
- s1['e'] is NaN, so it's replaced with s2['e'], which is 300.
Notice that the values in s2 for indices f and g are not dropped. Those labels don't exist in s1, but the result uses the union of both indexes, so they appear in the output with s2's values.
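Because the result of combine_first() covers the union of both indexes, labels that exist only in the second Series end up in the output. If the goal is to fill gaps without gaining new labels, one option is to trim the result back, and another is to pass the Series to fillna(). A minimal sketch reusing the s1 and s2 values from above (the names `trimmed` and `filled_only` are illustrative):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([10, 20, np.nan, 40, np.nan], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([100, 200, 300, 400, 500], index=['a', 'c', 'e', 'f', 'g'])

# Option 1: combine, then keep only the labels s1 started with
trimmed = s1.combine_first(s2).reindex(s1.index)

# Option 2: fillna() with a Series aligns on the index and never adds labels
filled_only = s1.fillna(s2)

print(trimmed)
print(filled_only)
```

Both approaches should leave only the labels a through e, with c and e patched from s2.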
combine_first() with DataFrames
This is where combine_first() becomes incredibly useful, especially for time-series data or data with aligned columns.
How it Works:
- Alignment: The two DataFrames are aligned by both their index and columns. The result covers the union of the row labels and the union of the column labels.
- Filling: For each cell at (row, col):
  - If the value in the first DataFrame (A) is NaN, it's replaced by the value from the second DataFrame (B) at the same (row, col).
  - If the value in A is not NaN, it is kept.
  - If a column or index label exists only in B, it is added to the result with B's values.
  - If a column or index label exists only in A, the NaNs in that column/row remain NaN, because B has nothing to fill them with.
Example:
Imagine df1 is our primary DataFrame with some missing data, and df2 is a secondary source with data we can use to fill the gaps.
# DataFrame with missing values
df1 = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
}, index=['row1', 'row2', 'row3', 'row4'])
print("DataFrame df1 (original):")
print(df1)
print("-" * 40)
# DataFrame to fill missing values from
df2 = pd.DataFrame({
'A': [100, 200, 300, 400],
'B': [500, 600, 700, 800],
'D': [13, 14, 15, 16] # Column 'D' is only in df2
}, index=['row1', 'row2', 'row3', 'row4'])
print("DataFrame df2 (filler):")
print(df2)
print("-" * 40)
# Use combine_first to fill missing values in df1 with values from df2
df_filled = df1.combine_first(df2)
print("DataFrame df1 after combine_first(df2):")
print(df_filled)
Output:
DataFrame df1 (original):
A B C
row1 1.0 5.0 9
row2 2.0 NaN 10
row3 NaN NaN 11
row4 4.0 8.0 12
----------------------------------------
DataFrame df2 (filler):
A B D
row1 100 500 13
row2 200 600 14
row3 300 700 15
row4 400 800 16
----------------------------------------
DataFrame df1 after combine_first(df2):
A B C D
row1 1.0 5.0 9 13
row2 2.0 600.0 10 14
row3 300.0 700.0 11 15
row4 4.0 8.0 12 16
Explanation:
- Column 'A': df1['A']['row3'] was NaN, so it was filled with df2['A']['row3'], which is 300. Other values in df1['A'] were kept.
- Column 'B': df1['B']['row2'] and df1['B']['row3'] were NaN, so they were filled with 600 and 700 from df2.
- Column 'C': No values were NaN in df1, so the entire column is unchanged.
- Column 'D': This column existed only in df2, so it was added to the result with df2's values. The result has the union of the columns of df1 and df2, not just the columns that were in df1.
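The example above uses identical row labels in both frames. As a complementary sketch, assuming two small made-up DataFrames (`left` and `right` are illustrative names and data), the following shows what happens when both the rows and the columns only partially overlap, and how to cut the result back to the caller's shape if the extra labels are not wanted.

```python
import numpy as np
import pandas as pd

# Illustrative data: 'B' exists only in left; 'D' and row 'r3' exist only in right
left = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]}, index=["r1", "r2"])
right = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0], "D": [7.0, 8.0, 9.0]}, index=["r1", "r2", "r3"]
)

patched = left.combine_first(right)

# Optional: restrict the result to the rows and columns left started with
same_shape = patched.reindex(index=left.index, columns=left.columns)

print(patched)
print(same_shape)
```

Here `patched` should have rows r1 to r3 and columns A, B and D. Column B keeps its NaN at r1 because `right` has no column B to draw from, and it is NaN at r3 because neither frame has a value there.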
Common Use Case: Stitching Time Series Together
A classic use case for combine_first is to create a complete time series by combining two partially overlapping ones.
Let's say we have stock prices recorded by two different systems, and we want to merge them into a single, continuous series.
import numpy as np
import pandas as pd
# Create two time series with some overlapping and non-overlapping data
dates = pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04', '2025-01-05'])
# System A has data for the beginning
prices_a = pd.Series([100, 102, np.nan, np.nan, np.nan], index=dates)
print("System A Prices:")
print(prices_a)
print("-" * 30)
# System B has data for the end
prices_b = pd.Series([np.nan, np.nan, 105, 107, 110], index=dates)
print("System B Prices:")
print(prices_b)
print("-" * 30)
# Combine them to get a continuous price series
combined_prices = prices_a.combine_first(prices_b)
print("Combined Price Series:")
print(combined_prices)
Output:
System A Prices:
2025-01-01 100.0
2025-01-02 102.0
2025-01-03 NaN
2025-01-04 NaN
2025-01-05 NaN
dtype: float64
------------------------------
System B Prices:
2025-01-01 NaN
2025-01-02 NaN
2025-01-03 105.0
2025-01-04 107.0
2025-01-05 110.0
dtype: float64
------------------------------
Combined Price Series:
2025-01-01 100.0
2025-01-02 102.0
2025-01-03 105.0
2025-01-04 107.0
2025-01-05 110.0
dtype: float64
This perfectly stitches the two time series together.
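In the example above the two systems never report a value for the same date. When they do, the calling Series takes precedence. A minimal sketch, assuming made-up prices and date ranges (`primary` and `secondary` are illustrative names):

```python
import numpy as np
import pandas as pd

# primary covers Jan 1-3 (with a gap on Jan 3); secondary covers Jan 2-4
primary = pd.Series(
    [100.0, 102.0, np.nan], index=pd.date_range("2025-01-01", periods=3)
)
secondary = pd.Series(
    [999.0, 105.0, 107.0], index=pd.date_range("2025-01-02", periods=3)
)

stitched = primary.combine_first(secondary)
print(stitched)
```

On 2025-01-02 both Series have a value, and the result should keep 102.0 from `primary` rather than 999.0. The gap on 2025-01-03 is filled from `secondary`, and 2025-01-04 is added because it exists only there.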
combine_first() vs. fillna()
It's important not to confuse combine_first() with fillna().
| Feature | combine_first(other) | fillna(value) |
|---|---|---|
| Purpose | Fill missing values using data from another pandas object (Series or DataFrame). | Fill missing values using a scalar value, a dictionary, a Series, or a DataFrame. |
| Data Source | A Series or DataFrame. | A scalar, a dict, a Series, or a DataFrame (a plain list is not accepted). |
| Alignment | Aligns by index (and columns for DataFrames) with the other object. The result covers the union of both sets of labels. | Keeps the caller's index and columns. For a Series, a Series fill value is aligned by index. For a DataFrame, a scalar fills every NaN, and a dict/Series supplies column-specific values. |
| Flexibility | High. Can intelligently patch data from a complex secondary source. | Lower. Primarily for simple, predefined fill values. |
In short: Use fillna() when you have a simple rule like "fill all NaNs with 0" or "fill column 'A' with 99 and column 'B' with -1". Use combine_first() when you have another entire dataset and you want to use its values to intelligently patch the missing spots in your primary dataset.
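The two situations described above can be put side by side. This is a minimal sketch, assuming a small made-up DataFrame and a second made-up source (`df`, `other`, `by_rule` and `by_source` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})

# Simple rule: fill column 'A' with 99 and column 'B' with -1
by_rule = df.fillna({"A": 99, "B": -1})

# Second dataset: patch each gap with the value at the same row and column
other = pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0], "C": [5.0, 6.0]})
by_source = df.combine_first(other)

print(by_rule)
print(by_source)
```

`by_rule` keeps the shape of `df`, while `by_source` should also pick up column C, because it exists in `other`.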
