
How do you implement Newey-West standard errors in Python?

Of course! Implementing and using the Newey-West standard errors in Python is a common task in econometrics, especially when dealing with time-series data that might have heteroskedasticity and autocorrelation.

Here’s a comprehensive guide covering:

  1. What are Newey-West Standard Errors? (The intuition)
  2. When to Use Them? (The key scenarios)
  3. Method 1: Using statsmodels (The Recommended & Easiest Way)
  4. Method 2: Manual Implementation (For deeper understanding)
  5. A Complete Practical Example

What are Newey-West Standard Errors?

In a standard linear regression, we often assume that the model's errors are:

  • Homoskedastic: The variance of the errors is constant across observations.
  • Serially Uncorrelated: The error at time t is not correlated with the error at time t-1, t-2, etc.

When these assumptions are violated (which is common in time-series data), our standard OLS standard errors become biased. This leads to incorrect t-statistics, p-values, and confidence intervals.

The Newey-West correction is a method to compute robust standard errors that account for:

  • Heteroskedasticity: Non-constant variance.
  • Autocorrelation: Correlation of the error terms with their own lags.

It achieves this by adding weighted autocovariance terms of the residuals (combined with the regressors) to the usual variance estimate, with weights that decline linearly as the lag increases. The result is more reliable inference (t-tests, F-tests) when your data violates the classical OLS assumptions.
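The declining lag weights are the Bartlett kernel weights, w_l = 1 - l/(L+1). As a quick illustration (L = 4 is just an assumed example value):

```python
# Bartlett kernel weights for an assumed maximum lag of L = 4
L = 4
weights = [1 - l / (L + 1) for l in range(1, L + 1)]
print(weights)  # approximately [0.8, 0.6, 0.4, 0.2]
```

Lag 1 gets the heaviest weight and the weights shrink to zero just past lag L, which is what keeps the resulting variance estimate positive semi-definite.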


When to Use Them?

You should strongly consider using Newey-West standard errors in the following situations:

  • Working with Time-Series Data: This is the primary use case.
  • When you suspect Autocorrelation: For example, in models of asset returns, GDP growth, or any economic variable that is persistent over time.
  • When you suspect Heteroskedasticity: Common in financial data (volatility clustering) and many economic datasets.
  • As a "safer" default: If you are unsure about the properties of your error terms, using Newey-West standard errors is a robust practice.

Method 1: Using statsmodels (Recommended)

The statsmodels library is the standard for econometrics in Python. Its regression results object has a built-in method to easily compute Newey-West standard errors.

The key argument is cov_type='HAC' (Heteroskedasticity and Autocorrelation Consistent), together with the Newey-West parameters (passed directly to get_robustcov_results(), or via cov_kwds when using .fit()).

  • maxlags: This is the most important parameter. It determines how many lags of the autocovariance are included. A common rule of thumb is floor(4*(T/100)^(2/9)), where T is the number of observations; recent versions of statsmodels apply this rule for you if you pass maxlags=None.

Step-by-Step Example:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
# --- 1. Create some sample time-series data ---
# Let's create data with a clear trend and some autocorrelated noise
np.random.seed(42)
n = 200
time = np.arange(n)
X = time + np.random.normal(0, 5, n) # A predictor with a trend
# The true relationship is Y = 2 + 1.5*X + noise
# Let's make the noise autocorrelated (violates OLS assumption)
true_error = np.zeros(n)
for i in range(1, n):
    true_error[i] = 0.5 * true_error[i-1] + np.random.normal(0, 3)
Y = 2 + 1.5 * X + true_error
# Create a pandas DataFrame
df = pd.DataFrame({'Y': Y, 'X': X})
df.index = pd.date_range(start='2025-01-01', periods=n, freq='D')
# --- 2. Run the OLS Regression ---
# We need to add a constant (intercept) to the model
X_sm = sm.add_constant(df['X'])
model = sm.OLS(df['Y'], X_sm).fit()
# --- 3. Get the standard OLS results ---
print("--- Standard OLS Results ---")
print(model.summary())
print("\n")
# --- 4. Get Newey-West Standard Errors ---
# Use the .get_robustcov_results() method
# Set cov_type='HAC' and specify the Newey-West parameters
# maxlags is crucial. A common rule is floor(4*(n/100)^(2/9))
# For n=200, this is roughly 4. Let's use 4.
nw_model = model.get_robustcov_results(cov_type='HAC', maxlags=4)
# --- 5. Print the Newey-West results ---
print("--- Newey-West HAC Results (maxlags=4) ---")
print(nw_model.summary())

Interpreting the Output:

When you compare the two summaries, you'll notice that the coefficients (const, X) are identical. This is because Newey-West does not change the estimated coefficients; it only corrects their standard errors.

Look for the changes in:

  • Std. Error: The standard errors will likely be larger (or at least different) in the Newey-West output.
  • t-statistic: This is coef / std_err. Since the denominator changed, the t-statistic will also change.
  • P>|t|: The p-values will be updated based on the new t-statistics.

The Newey-West results provide a more trustworthy basis for concluding whether the relationship between X and Y is statistically significant.


Method 2: Manual Implementation (For Understanding)

While you should almost always use statsmodels in practice, implementing Newey-West manually is a fantastic way to understand what's happening under the hood. This implementation follows the formula:

$$ \hat{\sigma}^2_{NW} = \hat{\sigma}^2_0 + \sum_{l=1}^{L} w_l \left( \hat{\sigma}^2_l + \hat{\sigma}^2_{-l} \right) $$

where $w_l = 1 - \frac{l}{L+1}$ are the weights.

def newey_west_std_errors(residuals, X, maxlags):
    """
    Calculates Newey-West standard errors manually.
    Parameters:
    residuals : array-like, The model residuals.
    X : array-like, The design matrix (with constant).
    maxlags : int, The maximum number of lags to consider.
    Returns:
    A tuple (nw_cov, se) where nw_cov is the NW covariance matrix
    of the coefficients and se is the vector of standard errors.
    """
    X = np.asarray(X)          # allow a pandas DataFrame as input
    u = np.asarray(residuals)
    T, K = X.shape
    # S_0: the heteroskedasticity-only (White) term, (1/T) * sum_t u_t^2 x_t x_t'
    S = (X * (u ** 2)[:, None]).T @ X / T
    # Add the weighted autocovariance terms for lags 1..maxlags
    for l in range(1, maxlags + 1):
        # Gamma_l = (1/T) * sum_t u_t * u_{t-l} * x_t x_{t-l}'
        gamma_l = np.zeros((K, K))
        for t in range(l, T):
            gamma_l += u[t] * u[t - l] * np.outer(X[t], X[t - l])
        gamma_l /= T
        # Bartlett (Newey-West) weight
        weight = 1 - l / (maxlags + 1)
        S += weight * (gamma_l + gamma_l.T)
    # Sandwich formula: Var(beta_hat) = T * (X'X)^{-1} S (X'X)^{-1}
    XtX_inv = np.linalg.inv(X.T @ X)
    nw_cov = T * XtX_inv @ S @ XtX_inv
    # The standard errors are the square roots of the diagonal
    se = np.sqrt(np.diag(nw_cov))
    return nw_cov, se
# --- Using the manual function with our previous model ---
ols_residuals = model.resid
X_matrix = sm.add_constant(df['X'])  # Design matrix
# Use the same maxlags as before
maxlags_manual = 4
nw_cov_manual, se_manual = newey_west_std_errors(ols_residuals, X_matrix, maxlags_manual)
print("--- Manual Newey-West Standard Errors ---")
print(f"Coefficient: {model.params['X']:.4f}")
print(f"Std. Error (Manual): {se_manual[1]:.4f}")  # Index 1 for the 'X' coefficient
print(f"Std. Error (statsmodels): {nw_model.bse[1]:.4f}")  # bse is a plain array here
# They should be very close!

A Complete Practical Example: Predicting Stock Returns

Let's use real-world data to see Newey-West in action. We'll try to predict the daily returns of a stock (e.g., Apple) using its own lagged returns.

import yfinance as yf
import pandas as pd
import statsmodels.api as sm
# Download data (auto_adjust=False keeps the 'Adj Close' column,
# which newer yfinance versions otherwise fold into 'Close')
ticker = 'AAPL'
data = yf.download(ticker, start='2025-01-01', end='2025-12-31', auto_adjust=False)
# Note: recent yfinance versions may return MultiIndex columns; if so,
# flatten them first, e.g. data.columns = data.columns.get_level_values(0)
data['Return'] = data['Adj Close'].pct_change()
# Create lagged variable
data['Lag1_Return'] = data['Return'].shift(1)
# Drop rows with NaN (the first return and the first lagged return)
df_model = data.dropna(subset=['Return', 'Lag1_Return'])
# Define X and Y
X = sm.add_constant(df_model['Lag1_Return'])
Y = df_model['Return']
# Run OLS
model = sm.OLS(Y, X).fit()
print("--- OLS on Daily Stock Returns ---")
print(model.summary())
# Run Newey-West
# For daily data, a common choice for maxlags is 5 (a week) or 22 (a month)
# Let's use 5 for this example
nw_model = model.get_robustcov_results(cov_type='HAC', maxlags=5)
print("\n--- Newey-West (maxlags=5) on Daily Stock Returns ---")
print(nw_model.summary())

What to expect here: Daily financial returns are notoriously heteroskedastic and often exhibit autocorrelation (especially at very short lags). You will almost certainly see a difference between the OLS and Newey-West standard errors. The Newey-West standard errors for the Lag1_Return coefficient will likely be larger, making the t-statistic smaller and the p-value larger. This might change your conclusion about whether past returns are a significant predictor of future returns.
