How to Analyze and Forecast the AirPassengers Data in Python? - 杰瑞科技汇

The "AirPassengers" dataset is a classic time series dataset in data science, often used to demonstrate forecasting techniques. It contains monthly totals of international airline passengers (in thousands) from January 1949 to December 1960, 144 observations in all.


Here’s a complete guide on how to work with this dataset in Python, covering:

Loading the data (the standard way and from a URL).
Basic data exploration and visualization.
Decomposing the time series to analyze its components.
Building a forecasting model (using the popular statsmodels library).
Making predictions and visualizing the results.

Step 1: Prerequisites

First, you need to install the necessary Python libraries. If you don't have them, open your terminal or command prompt and run:

pip install pandas numpy matplotlib statsmodels scikit-learn requests
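To confirm the environment is ready before going further, a small check like the one below can help. It is only a sketch: the `REQUIRED` list and the `missing_packages` helper are illustrative names, and the list simply mirrors the imports used later in this guide.

```python
from importlib.util import find_spec

# Import names used in this guide; "sklearn" is the import name of scikit-learn.
REQUIRED = ["pandas", "numpy", "matplotlib", "statsmodels", "sklearn", "requests"]

def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

Anything listed as missing can be installed with pip as shown above.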

Step 2: Loading the Data

A common way to get this dataset is through the statsmodels library, whose get_rdataset helper downloads it from the Rdatasets repository (so it needs internet access). Alternatively, you can load a CSV copy from a public URL.

Method 1: Using `statsmodels` (Recommended)

This is the easiest and most direct way.


import pandas as pd
import statsmodels.api as sm
# Download the AirPassengers dataset from the Rdatasets repository
# The data is returned as a pandas DataFrame with 'time' and 'value' columns
data = sm.datasets.get_rdataset("AirPassengers", "datasets").data
# The 'time' column holds fractional years (1949.0, 1949.083, ...), which
# pd.to_datetime cannot parse as dates, so build a month-start index instead.
data.index = pd.date_range(start='1949-01-01', periods=len(data), freq='MS', name='Month')
# Keep only the passenger counts and rename the column for clarity
data = data[['value']].rename(columns={'value': 'Passengers'})
print("First 5 rows of the dataset:")
print(data.head())

Method 2: Loading from a URL (Good for environments without `statsmodels`)

You can also find the dataset in CSV format online.

import pandas as pd
import io
import requests
# URL for the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
# Fetch the data from the URL
response = requests.get(url)
csv_data = response.text
# Load the data into a pandas DataFrame
data = pd.read_csv(io.StringIO(csv_data))
# The 'Month' column needs to be parsed as datetime and set as the index
data['Month'] = pd.to_datetime(data['Month'])
data = data.set_index('Month')
print("First 5 rows of the dataset:")
print(data.head())

Both methods will give you a DataFrame that looks like this:

First 5 rows of the dataset:
            Passengers
Month                 
1949-01-01         112
1949-02-01         118
1949-03-01         132
1949-04-01         129
1949-05-01         121
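Whichever loading method you use, it is worth checking that the index really is a gap-free monthly series before modeling. The sketch below assumes the CSV layout is a `Month,Passengers` header followed by `YYYY-MM` dates, and it uses a few inline rows so it runs without a download; `check_monthly` is an illustrative helper, not part of pandas.

```python
import io
import pandas as pd

# A few rows in the assumed layout of the CSV from Method 2.
sample_csv = """Month,Passengers
1949-01,112
1949-02,118
1949-03,132
1949-04,129
1949-05,121
"""

# parse_dates + index_col build the DatetimeIndex in a single read_csv call.
df = pd.read_csv(io.StringIO(sample_csv), parse_dates=["Month"], index_col="Month")

def check_monthly(frame):
    """Basic checks: sorted index, consecutive month-start dates, no missing values."""
    expected = pd.date_range(frame.index[0], periods=len(frame), freq="MS")
    return {
        "sorted": bool(frame.index.is_monotonic_increasing),
        "no_gaps": bool((frame.index == expected).all()),
        "no_missing": not frame.isna().any().any(),
    }

print(check_monthly(df))
```

On the full dataset you would pass `data` instead of `df`; all three checks should come back true for a clean 144-row monthly series.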

Step 3: Data Exploration and Visualization

Before modeling, it's crucial to understand the data's characteristics. A time series plot is the first step.

import matplotlib.pyplot as plt
# Set a nice style for the plots
plt.style.use('seaborn-v0_8-whitegrid')
# Plot the data
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['Passengers'], label='Monthly Passengers')
plt.title('Monthly Airline Passengers 1949-1960')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()

What this plot tells us:


Trend: There is a clear upward trend over the years. More people are flying.
Seasonality: There is a very strong, regular pattern that repeats every year. Passenger numbers peak in the summer months (June-August) and dip in the winter.
Increasing Variance: The fluctuations around the trend seem to get larger as the trend increases. This is a key observation for modeling.
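The increasing-variance observation can be checked numerically. The sketch below compares the within-year spread of the first and last year, on the raw scale and on the log scale. The monthly values are hard-coded so the snippet runs offline; verify them against your loaded data before relying on them.

```python
import numpy as np
import pandas as pd

# Monthly values for 1949 and 1960, hard-coded here (check against your own copy).
y1949 = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]
y1960 = [417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432]

summary = pd.DataFrame(
    {
        "std_raw": [np.std(y1949, ddof=1), np.std(y1960, ddof=1)],
        "std_log": [np.std(np.log(y1949), ddof=1), np.std(np.log(y1960), ddof=1)],
    },
    index=[1949, 1960],
)

# How much the spread grows between the two years on each scale.
raw_ratio = summary.loc[1960, "std_raw"] / summary.loc[1949, "std_raw"]
log_ratio = summary.loc[1960, "std_log"] / summary.loc[1949, "std_log"]

print(summary.round(3))
print(f"raw spread ratio: {raw_ratio:.2f}, log spread ratio: {log_ratio:.2f}")
```

The raw spread grows several-fold while the log-scale spread changes far less, which is why a log transform or a multiplicative model is a common choice for this series.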

Step 4: Decomposing the Time Series

A time series can be decomposed into three parts:

Trend: The overall direction of the series.
Seasonality: The periodic, cyclical patterns.
Residuals (or Noise): What's left over after removing the trend and seasonality.

We can use statsmodels to decompose the series. We'll use a 'multiplicative' model because the seasonal fluctuations appear to grow with the level of the time series.

from statsmodels.tsa.seasonal import seasonal_decompose
# Perform seasonal decomposition
# We use a period of 12 because the data is monthly and we expect yearly seasonality
decomposition = seasonal_decompose(data['Passengers'], model='multiplicative', period=12)
# Plot the decomposed components
fig = decomposition.plot()
fig.set_size_inches((12, 8))
plt.show()

What the decomposition plots show:

Observed: The original data.
Trend: The long-term progression of the series, smoothed out.
Seasonal: The clear, repeating yearly pattern.
Resid: The noise. In a good model, the residuals should look like random noise with no clear pattern.
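To see what a multiplicative decomposition does under the hood, here is a minimal hand-rolled sketch of the classical procedure: estimate the trend with a centered moving average, divide it out, then average the ratios by calendar month. It runs on a synthetic, noise-free series (an assumption made so the result is easy to check) and is not meant to reproduce statsmodels output exactly.

```python
import numpy as np

period = 12
t = np.arange(120)                                   # ten years of monthly points
true_trend = 100 + 2 * t
true_seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / period)
y = true_trend * true_seasonal                       # synthetic multiplicative series

# Centered 2x12 moving average: 13 weights, half weight on the two end points.
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = np.convolve(y, weights, mode="valid")        # aligned with t[6:-6]

# Dividing by the trend leaves seasonal * residual.
half = period // 2
detrended = y[half:-half] / trend
months = t[half:-half] % period

# Average the ratios per calendar position, then rescale so the indices average 1.
seasonal = np.array([detrended[months == m].mean() for m in range(period)])
seasonal /= seasonal.mean()

print(np.round(seasonal, 3))
```

The recovered indices should sit close to the `true_seasonal` pattern; a value of 1.2 would mean that month runs about 20% above the trend.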

Step 5: Building a Forecasting Model (SARIMAX)

A great model for data with both trend and seasonality is the SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables) model.

We will perform the following steps:

Split the data into a training set and a testing set.
Train the model on the training data.
Forecast on the test data.
Evaluate the model's performance.

# Split the data into training and testing sets
# We'll use the last 12 months as our test set
train_size = len(data) - 12
train_data, test_data = data.iloc[:train_size], data.iloc[train_size:]
print(f"Training data points: {len(train_data)}")
print(f"Testing data points: {len(test_data)}")
# Build and fit the SARIMAX model
# We'll use a common set of parameters (p,d,q)(P,D,Q,s) = (1,1,1)(1,1,1,12)
# This is a good starting point, but parameters can be tuned for better results.
model = sm.tsa.SARIMAX(train_data['Passengers'], 
                        order=(1, 1, 1), 
                        seasonal_order=(1, 1, 1, 12),
                        enforce_stationarity=False,
                        enforce_invertibility=False)
fitted_model = model.fit(disp=False)
# Print the model summary
print(fitted_model.summary())
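The d=1 and D=1 entries in the order (1,1,1)(1,1,1,12) stand for one regular difference and one seasonal difference at lag 12. The sketch below shows their effect on a synthetic series with a linear trend plus a fixed monthly pattern; the `pattern` values are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=48, freq="MS")
pattern = np.array([0, 2, 5, 3, 1, 6, 9, 9, 4, 0, -3, 1])   # hypothetical monthly pattern
y = pd.Series(10 + 1.5 * np.arange(48) + np.tile(pattern, 4), index=idx)

d1 = y.diff()          # d=1: removes the linear trend, leaves month-to-month steps
d1_D1 = d1.diff(12)    # D=1, s=12: subtracts the value from the same month a year earlier

remaining = d1_D1.dropna()
print(len(remaining), remaining.abs().max())
```

With a purely linear trend and a perfectly repeating pattern nothing is left after both differences. On real data the differenced series keeps the irregular part, which is what the AR and MA terms then model.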

Step 6: Making Predictions and Evaluating the Model

Now, let's use our trained model to forecast the 12 months of our test set and see how well it performed.

# Get the predictions for the test set
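# These dates lie beyond the training sample, so this is a true multi-step forecast:
# each step builds on earlier forecasts, not on observed test values
# ('dynamic' only changes behavior for in-sample dates)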
predictions = fitted_model.predict(start=test_data.index[0], end=test_data.index[-1], dynamic=False)
# Calculate the Root Mean Squared Error (RMSE) to evaluate the model
from sklearn.metrics import mean_squared_error
import numpy as np
rmse = np.sqrt(mean_squared_error(test_data['Passengers'], predictions))
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
# Visualize the results
plt.figure(figsize=(14, 7))
# Plot the training data
plt.plot(train_data.index, train_data['Passengers'], label='Training Data')
# Plot the test data
plt.plot(test_data.index, test_data['Passengers'], label='Actual Test Data', color='orange')
# Plot the predictions
plt.plot(test_data.index, predictions, label='Forecasted Data', color='green', linestyle='--')
plt.title('AirPassengers Forecast vs Actuals')
plt.xlabel('Date')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()

Output of the plot:

If the model fits well, the green dashed line (our forecast) should track the orange line (actual data) closely, which would indicate that the model has captured both the trend and the seasonality. The RMSE value gives a quantitative measure of the typical forecast error in the original units (thousands of passengers), with large errors weighted more heavily than small ones.
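An RMSE figure is easier to judge against a simple baseline. One common choice is the seasonal naive forecast, which predicts each month of the test year with the value from the same month one year earlier. The sketch below computes its RMSE and MAPE for the 1960 test year. The 1959 and 1960 values are hard-coded so it runs offline (verify them against your loaded data), and `rmse` and `mape` are small illustrative helpers.

```python
import numpy as np

# Last training year and the test year, hard-coded here (check against your own copy).
actual_1959 = np.array([360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405])
actual_1960 = np.array([417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432])

def rmse(actual, predicted):
    """Root mean squared error, in the units of the data."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

# Seasonal naive forecast: repeat the same month from the previous year.
naive_forecast = actual_1959

baseline_rmse = rmse(actual_1960, naive_forecast)
baseline_mape = mape(actual_1960, naive_forecast)
print(f"Seasonal naive RMSE: {baseline_rmse:.2f}")
print(f"Seasonal naive MAPE: {baseline_mape:.2f}%")
```

A SARIMAX model is only earning its complexity if its test RMSE is clearly below this baseline.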

Full Script (All-in-One)

Here is the complete script from start to finish for easy copying and running.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.metrics import mean_squared_error
# --- 1. Load Data ---
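try:
    # Try loading via statsmodels first (downloads from the Rdatasets repository)
    data = sm.datasets.get_rdataset("AirPassengers", "datasets").data
    # 'time' holds fractional years, so build a month-start index instead of parsing it
    data.index = pd.date_range(start='1949-01-01', periods=len(data), freq='MS', name='Month')
    data = data[['value']].rename(columns={'value': 'Passengers'})
except Exception:
    # Fallback to URL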
    print("Could not load from statsmodels, loading
