Of course! Working with NaN (Not a Number) values is a fundamental skill in data analysis with Python's pandas library. A DataFrame is a 2D table-like data structure, and NaN is how pandas represents missing or undefined data.

Here's a comprehensive guide covering everything you need to know about handling NaN in a pandas DataFrame.
What is NaN?
- Definition: NaN stands for "Not a Number". It's a special floating-point value defined by the IEEE 754 standard.
- In Pandas: Pandas uses numpy.nan to represent missing values. (The newer nullable dtypes use pd.NA instead, but np.nan is what you'll see in ordinary float columns.)
- Key Property: Any arithmetic operation involving NaN results in NaN. This is a crucial concept to remember.
import pandas as pd
import numpy as np
# Example of NaN arithmetic
print(np.nan + 10)  # Output: nan
print(np.nan * 5)   # Output: nan
Creating a DataFrame with NaN Values
You can create a DataFrame with NaN values in several ways.
a) Directly with numpy.nan
import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, np.nan, 30, 29],
'City': ['New York', 'London', np.nan, 'Paris'],
'Salary': [70000, 80000, 90000, np.nan]}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City Salary
0 Alice 25.0 New York 70000.0
1 Bob NaN London 80000.0
2 Charlie 30.0 NaN 90000.0
3 David 29.0 Paris NaN
b) From a CSV file with missing values
When reading a CSV file, pandas automatically converts empty cells or placeholders like NA, NULL, or N/A into NaN.

# Let's assume you have a file 'data.csv' like this:
# Name,Age,City
# Alice,25,New York
# Bob,,London
# Charlie,30,
# David,29,Paris
df = pd.read_csv('data.csv')
print(df)
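To see this in action without an actual file on disk, you can feed read_csv an in-memory buffer. This is a small sketch with made-up data; note that a placeholder like "unknown" is not recognized by default and must be listed explicitly via the na_values parameter:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV; "unknown" is a custom placeholder
csv_text = """Name,Age,City
Alice,25,New York
Bob,,London
Charlie,30,N/A
David,unknown,Paris"""

# Empty cells and N/A become NaN automatically; extra
# placeholders can be declared with na_values
df = pd.read_csv(io.StringIO(csv_text), na_values=['unknown'])
print(df.isna().sum())
```

Both the empty cell and "N/A" are converted by pandas' defaults; "unknown" only becomes NaN because we asked for it.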
Detecting NaN Values
You can't use the == operator to check for NaN because np.nan == np.nan evaluates to False. Instead, use pandas' built-in methods.
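A quick demonstration of why equality checks fail and what to use instead for a single value:

```python
import numpy as np
import pandas as pd

# NaN is never equal to anything, including itself
print(np.nan == np.nan)  # False

# For a scalar, pd.isna() is the reliable check
print(pd.isna(np.nan))   # True
```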
isna() and isnull()
These methods are identical. They return a DataFrame of booleans, where True indicates a NaN value.
print(df.isna())
Output:
Name Age City Salary
0 False False False False
1 False True False False
2 False False True False
3 False False False True
notna() and notnull()
The opposite of isna(). Returns True for non-NaN values.

print(df.notna())
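The most common use of notna() is boolean filtering: keeping only the rows where a particular column has a value. A minimal sketch on the same example data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, np.nan, 30, 29]})

# Keep only the rows where 'Age' is present
has_age = df[df['Age'].notna()]
print(has_age)
```

Bob's row is filtered out because his Age is NaN.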
sum() with isna()
A very common pattern is to count the number of missing values in each column.
# Count NaNs in each column
print(df.isna().sum())
Output:
Name 0
Age 1
City 1
Salary 1
dtype: int64
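A closely related pattern uses isna().mean() instead of sum(): since True counts as 1, the mean of the boolean column is the fraction of missing values, which is often more useful than a raw count when deciding whether a column is worth keeping:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, np.nan, 30, 29],
                   'City': ['New York', 'London', np.nan, 'Paris'],
                   'Salary': [70000, 80000, 90000, np.nan]})

# Fraction of missing values per column, as a percentage
missing_pct = df.isna().mean() * 100
print(missing_pct)
```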
Handling NaN Values (The Core of Data Cleaning)
This is the most important part. Your strategy depends on the context of your data and the goal of your analysis.
Strategy 1: Remove Missing Values
Use this when the number of missing values is small and you don't want to introduce bias.
dropna()
This method removes rows or columns that contain NaN values.
- Default: dropna() removes any row that contains at least one NaN.
df_cleaned = df.dropna()
print(df_cleaned)
Output:
Name Age City Salary
0 Alice 25.0 New York 70000.0
Note: Rows 1, 2, and 3 are all gone because each contains at least one NaN ('Age' for Bob, 'City' for Charlie, 'Salary' for David). Only Alice's row is complete.
- axis=1: To remove columns instead of rows.
df_cleaned_cols = df.dropna(axis=1)
print(df_cleaned_cols)
Output:
Name
0 Alice
1 Bob
2 Charlie
3 David
Note: 'Age', 'City', and 'Salary' columns are gone because they contain NaNs.
- how='all': Only remove a row/column if all its values are NaN.
- thresh: Require a row/column to have at least thresh non-NaN values to be kept.
# Keep rows that have at least 3 non-NaN values
df_thresh = df.dropna(thresh=3)
print(df_thresh)
Output:
Name Age City Salary
0 Alice 25.0 New York 70000.0
1 Bob NaN London 80000.0
2 Charlie 30.0 NaN 90000.0
3 David 29.0 Paris NaN
Note: Every row survives because each has at least 3 non-NaN values. With thresh=4, only row 0 (the only complete row) would be kept.
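Another useful dropna() parameter is subset, which restricts the NaN check to specific columns: only rows missing a value in those columns are dropped. A short sketch on the same example data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, np.nan, 30, 29],
                   'City': ['New York', 'London', np.nan, 'Paris'],
                   'Salary': [70000, 80000, 90000, np.nan]})

# Drop a row only if 'Age' is missing; NaNs elsewhere are kept
df_subset = df.dropna(subset=['Age'])
print(df_subset)
```

Only Bob's row is removed; Charlie and David keep their rows despite missing City and Salary values.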
Strategy 2: Impute (Fill) Missing Values
Use this when you want to keep all the data points and are willing to make an educated guess for the missing values.
fillna()
This method fills NaN values with a specified value.
- Filling with a specific scalar value (e.g., mean, median, mode, or a constant).
# Fill missing 'Age' with the mean age
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
# Fill missing 'City' with the most frequent city (mode)
mode_city = df['City'].mode()[0]  # [0] because mode() can return multiple values
df['City'] = df['City'].fillna(mode_city)
# Fill missing 'Salary' with 0
df['Salary'] = df['Salary'].fillna(0)
print(df)
Output:
Name Age City Salary
0 Alice 25.0 New York 70000.0
1 Bob 28.0 London 80000.0
2 Charlie 30.0 London 90000.0
3 David 29.0 Paris 0.0
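Instead of calling fillna() once per column, you can pass a dict mapping column names to fill values and do everything in one call. A compact sketch of the same fills as above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, np.nan, 30, 29],
                   'City': ['New York', 'London', np.nan, 'Paris'],
                   'Salary': [70000, 80000, 90000, np.nan]})

# A dict applies a different fill value to each column at once;
# 'Unknown' here is an arbitrary sentinel for missing cities
df_filled = df.fillna({'Age': df['Age'].mean(),
                       'City': 'Unknown',
                       'Salary': 0})
print(df_filled)
```

Columns not mentioned in the dict are left untouched.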
- Forward Fill (ffill) and Backward Fill (bfill): Useful for time-series data. ffill propagates the last valid observation forward, while bfill uses the next valid observation.
# Create a new DataFrame for this example
ts_data = {'A': [1, 2, np.nan, 4, 5]}
ts_df = pd.DataFrame(ts_data)
# Forward fill
print("Forward Fill:")
print(ts_df.ffill())
# Backward fill
print("\nBackward Fill:")
print(ts_df.bfill())
Output:
Forward Fill:
A
0 1.0
1 2.0
2 2.0 # NaN filled with the previous value (2)
3 4.0
4 5.0
Backward Fill:
A
0 1.0
1 2.0
2 4.0 # NaN filled with the next value (4)
3 4.0
4 5.0
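One caveat with ffill is that a single value can be propagated indefinitely across a long run of NaNs, which may not be sensible for stale readings. The limit parameter caps how many consecutive NaNs one observation may fill; a small sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Each valid value may fill at most 2 consecutive NaNs;
# the third NaN in the run stays missing
filled = s.ffill(limit=2)
print(filled)
```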
Special Case: Interpolation
Sometimes, the best estimate for a missing value is a value between its neighbors. interpolate() does this for you, using methods like linear, polynomial, etc.
interp_data = {'A': [1, 2, np.nan, 4, 5]}
interp_df = pd.DataFrame(interp_data)
# Default is linear interpolation
interp_df['A'] = interp_df['A'].interpolate()
print(interp_df)
Output:
A
0 1.0
1 2.0
2 3.0 # (2 + 4) / 2 = 3
3 4.0
4 5.0
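For time-series data with irregularly spaced timestamps, interpolate(method='time') weights the estimate by the actual time gaps rather than by row position. A sketch with made-up daily readings, where one day is missing and there is a three-day jump to the next observation:

```python
import pandas as pd
import numpy as np

# Hypothetical readings with an irregular gap in the index
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
s = pd.Series([1.0, np.nan, 4.0], index=idx)

# Jan 2 is 1 day into a 4-day gap, so the fill is
# 1 + (4 - 1) * (1/4) = 1.75, not the positional midpoint 2.5
filled = s.interpolate(method='time')
print(filled)
```

Note that method='time' requires a datetime-like index.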
Summary Table: NaN Handling Methods
| Method | Purpose | Key Parameters |
|---|---|---|
| df.isna() / df.isnull() | Detect NaN values (True where missing) | — |
| df.notna() / df.notnull() | Detect non-NaN values | — |
| df.dropna() | Remove rows/columns containing NaN | axis, how, thresh, subset |
| df.fillna() | Replace NaN with a given value | value |
| df.ffill() / df.bfill() | Propagate the previous/next valid value | limit |
| df.interpolate() | Estimate NaN from neighboring values | method, limit |