Of course! The "AirPassengers" dataset is a classic time series dataset in data science, often used to demonstrate forecasting techniques. It contains the monthly number of airline passengers from 1949 to 1960.

Here’s a complete guide on how to work with this dataset in Python, covering:
- Loading the data (the standard way and from a URL).
- Basic data exploration and visualization.
- Decomposing the time series to analyze its components.
- Building a forecasting model (using the popular statsmodels library).
- Making predictions and visualizing the results.
Step 1: Prerequisites
First, you need to install the necessary Python libraries. If you don't have them, open your terminal or command prompt and run:
pip install pandas numpy matplotlib statsmodels scikit-learn requests
Step 2: Loading the Data
The most common way to get this dataset is through the statsmodels library, whose get_rdataset function downloads it from the Rdatasets collection (so an internet connection is required). Alternatively, you can load it from a public CSV URL.
Method 1: Using statsmodels (Recommended)
This is the easiest and most direct way.

import pandas as pd
import statsmodels.api as sm
# Load the AirPassengers dataset from statsmodels
# The data is returned as a pandas DataFrame
data = sm.datasets.get_rdataset("AirPassengers", "datasets").data
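# The 'time' column holds fractional years (1949.0, 1949.083, ...), not dates,
# so build a month-start DatetimeIndex directly instead of parsing it.
data.index = pd.date_range(start='1949-01-01', periods=len(data), freq='MS', name='Month')
data = data.drop(columns='time')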
# Rename the 'value' column for clarity
data.rename(columns={'value': 'Passengers'}, inplace=True)
print("First 5 rows of the dataset:")
print(data.head())
Method 2: Loading from a URL (Good for environments without statsmodels)
You can also find the dataset in CSV format online.
import pandas as pd
import io
import requests
# URL for the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
# Fetch the data from the URL
response = requests.get(url)
csv_data = response.text
# Load the data into a pandas DataFrame
data = pd.read_csv(io.StringIO(csv_data))
# The 'Month' column needs to be parsed as datetime and set as the index
data['Month'] = pd.to_datetime(data['Month'])
data = data.set_index('Month')
print("First 5 rows of the dataset:")
print(data.head())
Both methods will give you a DataFrame that looks like this:
First 5 rows of the dataset:
            Passengers
Month
1949-01-01         112
1949-02-01         118
1949-03-01         132
1949-04-01         129
1949-05-01         121
Step 3: Data Exploration and Visualization
Before modeling, it's crucial to understand the data's characteristics. A time series plot is the first step.
import matplotlib.pyplot as plt
# Set a nice style for the plots
plt.style.use('seaborn-v0_8-whitegrid')
# Plot the data
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['Passengers'], label='Monthly Passengers')
plt.title('Monthly Airline Passengers 1949-1960')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()
What this plot tells us:

- Trend: There is a clear upward trend over the years. More people are flying.
- Seasonality: There is a very strong, regular pattern that repeats every year. Passenger numbers peak in the summer months (June-August) and dip in the winter.
- Increasing Variance: The fluctuations around the trend seem to get larger as the trend increases. This is a key observation for modeling.
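The increasing variance noted above is commonly handled with a log transform. Here is a minimal sketch on synthetic data (an assumed stand-in with steady percentage growth and a proportional seasonal swing, not the real passenger counts). It suggests that the within-year spread grows on the raw scale but stays constant on the log scale:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (assumption: illustrative stand-in, not the real data)
t = np.arange(144)
idx = pd.date_range("1949-01-01", periods=144, freq="MS")
trend = 100.0 * np.exp(0.01 * t)                    # steady percentage growth
seasonal = 1.0 + 0.2 * np.sin(2 * np.pi * t / 12)   # swing proportional to level
series = pd.Series(trend * seasonal, index=idx)

# Standard deviation within each calendar year, on the raw and log scales
raw_std = series.groupby(series.index.year).std()
log_std = np.log(series).groupby(series.index.year).std()

# Compare the spread in the last year with the spread in the first year
raw_ratio = raw_std.iloc[-1] / raw_std.iloc[0]
log_ratio = log_std.iloc[-1] / log_std.iloc[0]
print(f"Raw spread, last year vs first: {raw_ratio:.2f}x")
print(f"Log spread, last year vs first: {log_ratio:.2f}x")
```

On real data the log-scale spread will not be perfectly constant, but a ratio much closer to 1 than on the raw scale is a sign that the seasonal effect is multiplicative.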
Step 4: Decomposing the Time Series
A time series can be decomposed into three parts:
- Trend: The overall direction of the series.
- Seasonality: The periodic, cyclical patterns.
- Residuals (or Noise): What's left over after removing the trend and seasonality.
We can use statsmodels to decompose the series. We'll use a 'multiplicative' model because the seasonal fluctuations appear to grow with the level of the time series.
from statsmodels.tsa.seasonal import seasonal_decompose
# Perform seasonal decomposition
# We use a period of 12 because the data is monthly and we expect yearly seasonality
decomposition = seasonal_decompose(data['Passengers'], model='multiplicative', period=12)
# Plot the decomposed components
fig = decomposition.plot()
fig.set_size_inches((12, 8))
plt.show()
What the decomposition plots show:
- Observed: The original data.
- Trend: The long-term progression of the series, smoothed out.
- Seasonal: The clear, repeating yearly pattern.
- Resid: The noise. In a good model, the residuals should look like random noise with no clear pattern.
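To make these components concrete, here is a rough sketch of a classical multiplicative decomposition done by hand with pandas on synthetic data. It approximates the idea behind seasonal_decompose (a centered moving-average trend, then averaged seasonal factors). It is not statsmodels' exact implementation, and the series and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (assumption: stand-in for the passenger data)
t = np.arange(48)
idx = pd.date_range("1949-01-01", periods=48, freq="MS")
series = pd.Series(
    np.linspace(100.0, 200.0, 48) * (1.0 + 0.2 * np.sin(2 * np.pi * t / 12)),
    index=idx,
)

# Trend: centered 2x12 moving average (NaN for the first and last 6 months)
trend = series.rolling(12).mean().rolling(2).mean().shift(-6)

# Seasonal factors: average detrended value per calendar month, scaled to mean 1
detrended = series / trend
factors = detrended.groupby(detrended.index.month).mean()
factors = factors / factors.mean()
seasonal = pd.Series(factors.loc[series.index.month].to_numpy(), index=idx)

# Residual: what remains after dividing out trend and seasonality
resid = series / (trend * seasonal)
print(resid.dropna().describe())
```

In a multiplicative decomposition the residuals fluctuate around 1 rather than 0, since the observed series is the product of the three components.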
Step 5: Building a Forecasting Model (SARIMAX)
A great model for data with both trend and seasonality is the SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables) model.
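The "Integrated" part of the name refers to differencing. As a minimal sketch on synthetic data (an assumed log-scale series with a linear trend and a fixed yearly pattern), one regular difference (d=1) followed by one seasonal difference at lag 12 (D=1, s=12) removes both components:

```python
import numpy as np
import pandas as pd

# Synthetic log-scale series (assumption: linear trend plus a fixed yearly pattern)
t = np.arange(60)
idx = pd.date_range("1949-01-01", periods=60, freq="MS")
pattern = 0.2 * np.sin(2 * np.pi * (t % 12) / 12)
log_series = pd.Series(4.6 + 0.01 * t + pattern, index=idx)

# d=1: the regular difference turns the linear trend into a constant
diff_regular = log_series.diff(1)

# D=1, s=12: the seasonal difference cancels the constant and the yearly pattern
diff_both = diff_regular.diff(12)

print("Values lost to differencing:", diff_both.isna().sum())
print("Largest remaining value:", diff_both.dropna().abs().max())
```

On real data the differenced series will not be exactly zero. What remains is the part the AR and MA terms are meant to model.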
We will perform the following steps:
- Split the data into a training set and a testing set.
- Train the model on the training data.
- Forecast on the test data.
- Evaluate the model's performance.
# Split the data into training and testing sets
# We'll use the last 12 months as our test set
train_size = len(data) - 12
train_data, test_data = data.iloc[:train_size], data.iloc[train_size:]
print(f"Training data points: {len(train_data)}")
print(f"Testing data points: {len(test_data)}")
# Build and fit the SARIMAX model
# We'll use a common set of parameters (p,d,q)(P,D,Q,s) = (1,1,1)(1,1,1,12)
# This is a good starting point, but parameters can be tuned for better results.
model = sm.tsa.SARIMAX(train_data['Passengers'],
                       order=(1, 1, 1),
                       seasonal_order=(1, 1, 1, 12),
                       enforce_stationarity=False,
                       enforce_invertibility=False)
fitted_model = model.fit(disp=False)
# Print the model summary
print(fitted_model.summary())
Step 6: Making Predictions and Evaluating the Model
Now, let's use our trained model to forecast the 12 months of our test set and see how well it performed.
# Get the predictions for the test set
# The model was fitted on the training data only, so these are out-of-sample forecasts:
# each step builds on the earlier forecasts, not on observed test values
predictions = fitted_model.predict(start=test_data.index[0], end=test_data.index[-1])
# Calculate the Root Mean Squared Error (RMSE) to evaluate the model
from sklearn.metrics import mean_squared_error
import numpy as np
rmse = np.sqrt(mean_squared_error(test_data['Passengers'], predictions))
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
# Visualize the results
plt.figure(figsize=(14, 7))
# Plot the training data
plt.plot(train_data.index, train_data['Passengers'], label='Training Data')
# Plot the test data
plt.plot(test_data.index, test_data['Passengers'], label='Actual Test Data', color='orange')
# Plot the predictions
plt.plot(test_data.index, predictions, label='Forecasted Data', color='green', linestyle='--')
plt.title('AirPassengers Forecast vs Actuals')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()
Output of the plot:
You should see the green dashed line (our forecast) tracking the orange line (actual data) closely, which indicates that the model has captured both the trend and the seasonality. The RMSE quantifies the typical size of the forecast errors in the original units (passengers), with larger errors penalized more heavily.
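An RMSE is easier to judge against a simple baseline. The sketch below runs on synthetic data, and its names are illustrative assumptions rather than part of the workflow above. It computes the RMSE of a seasonal-naive forecast, which just repeats the last 12 observed months. A SARIMAX model would be expected to beat this number to justify its extra complexity.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (assumption: illustrative stand-in, not the real data)
t = np.arange(48)
idx = pd.date_range("1949-01-01", periods=48, freq="MS")
values = (100.0 + 2.0 * t) * (1.0 + 0.2 * np.sin(2 * np.pi * t / 12))
series = pd.Series(values, index=idx)

# Hold out the last 12 months, as in the train/test split above
train, test = series.iloc[:-12], series.iloc[-12:]

# Seasonal-naive forecast: each test month repeats the same month one year earlier
naive_forecast = pd.Series(train.iloc[-12:].to_numpy(), index=test.index)

# RMSE of the baseline, in the units of the series
baseline_rmse = float(np.sqrt(np.mean((test - naive_forecast) ** 2)))
print(f"Seasonal-naive RMSE: {baseline_rmse:.2f}")
```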
Full Script (All-in-One)
Here is the complete script from start to finish for easy copying and running.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import mean_squared_error
# --- 1. Load Data ---
try:
    # Try loading from statsmodels first
    data = sm.datasets.get_rdataset("AirPassengers", "datasets").data
    # 'time' holds fractional years, so build a month-start DatetimeIndex directly
    data.index = pd.date_range(start='1949-01-01', periods=len(data), freq='MS', name='Month')
    data = data.drop(columns='time')
    data.rename(columns={'value': 'Passengers'}, inplace=True)
except Exception:
    # Fallback to URL
print("Could not load from statsmodels, loading 