SPY Volatility Forecasting — CNN/LSTM

Feb 2026 · 6 min read · MSc AI, Kristiania University College — Advanced Machine Intelligence and Deep Learning

Python TensorFlow Keras CNN LSTM CNN-LSTM pandas NumPy time-series

The Problem

Volatility, the degree to which an asset's price fluctuates, is a central quantity in options pricing, risk management, and portfolio construction. Unlike price, it is never quoted directly: it has to be estimated from returns, and the realized value over a window is known only once that window has closed. You are always forecasting something you cannot see until after it has happened.

The standard approach, the GARCH family of models, assumes stationary statistical relationships that financial markets frequently violate. The question this project explored was whether deep learning models, particularly those designed to extract both local patterns and sequential dependencies, could produce more adaptive volatility estimates for SPY (the S&P 500 ETF), one of the most liquid equity instruments in the world.

My Approach

I worked with 33 years of daily SPY closing prices (January 1993 to February 2026 — 8,318 trading days), computing log returns as the primary signal. The target variable is forward realized volatility measured over two horizons: 5 trading days (vol_5) and 10 trading days (vol_10). This is a regression problem, not classification.
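A minimal sketch of how the log returns and forward targets could be computed with pandas. The write-up does not define realized volatility precisely, so this assumes the sample standard deviation of the next h daily log returns, not annualized. The function name and the synthetic price series are illustrative only.

```python
import numpy as np
import pandas as pd

def forward_realized_vol(close: pd.Series, horizon: int) -> pd.Series:
    """Assumed target: sample std of the next `horizon` daily log returns."""
    log_ret = np.log(close).diff()
    # rolling(h).std() at row t covers returns t-h+1..t; shifting by -h
    # relabels it so that row t holds the std of returns t+1..t+h.
    return log_ret.rolling(horizon).std().shift(-horizon)

# Synthetic prices stand in for the real SPY closes.
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300))))

log_ret = np.log(close).diff()
vol_5 = forward_realized_vol(close, 5)
vol_10 = forward_realized_vol(close, 10)
```

The last h rows of each target are NaN because their forward window runs past the end of the data; those rows have to be dropped before training.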

To structure the data for supervised learning, I used a 120-day lookback window: each sample consists of the last 120 log returns as input features, with the corresponding forward volatility as the label. This produced 8,187 usable samples. The split was strictly chronological — 70% train, 15% validation, 15% test — to prevent any lookahead leakage. Standardization was fitted on the training set only and applied forward.
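The windowing, chronological split, and train-only standardization described above might look roughly like this in NumPy. The helper names, the synthetic data, and the choice of a single scalar mean and standard deviation for scaling are assumptions, not details taken from the project code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(log_ret, target, lookback=120):
    """Pair each `lookback`-long return window with the target at its last day."""
    X = sliding_window_view(log_ret, lookback)  # row i covers days i..i+lookback-1
    y = target[lookback - 1:]                   # target aligned to each window's last day
    keep = ~np.isnan(y) & ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]

def chrono_split(X, y, train=0.70, val=0.15):
    """Split in time order, never shuffling."""
    i_train = int(len(X) * train)
    i_val = int(len(X) * (train + val))
    return ((X[:i_train], y[:i_train]),
            (X[i_train:i_val], y[i_train:i_val]),
            (X[i_val:], y[i_val:]))

# Synthetic stand-ins for the real returns and forward-volatility labels.
rng = np.random.default_rng(0)
log_ret = rng.normal(0, 0.01, 1000)
target = np.full(1000, np.nan)
target[:-5] = np.abs(rng.normal(0.01, 0.002, 995))  # last 5 days have no forward window

X, y = make_windows(log_ret, target)
(X_tr, y_tr), (X_va, y_va), (X_te, y_te) = chrono_split(X, y)

# Fit the scaling statistics on the training inputs only, then reuse them.
mu, sigma = X_tr.mean(), X_tr.std()
X_tr, X_va, X_te = [(a - mu) / sigma for a in (X_tr, X_va, X_te)]
```

Because the validation and test sets are scaled with training statistics, their means will not be exactly zero. That is expected and is the point of fitting on the training set only.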

I evaluated four models to isolate the contribution of each architectural choice: a persistence baseline, a 1D CNN, a plain LSTM, and a CNN-LSTM hybrid.

What I Built

Baseline (Persistence): The volatility computed from the last h observed log returns is carried forward as the forecast for the next h days. A deliberate floor: any trained model that cannot beat this is not worth deploying.
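One plausible reading of the persistence baseline, sketched in NumPy: the standard deviation of the most recent h returns in each lookback window is reused as the forecast. The function name and the use of the sample standard deviation (ddof=1) are assumptions.

```python
import numpy as np

def persistence_forecast(windows, horizon):
    """Carry forward the std of the last `horizon` returns in each window."""
    windows = np.asarray(windows, dtype=float)
    return windows[..., -horizon:].std(axis=-1, ddof=1)

# Four synthetic, unstandardized lookback windows of 120 returns each.
rng = np.random.default_rng(0)
windows = rng.normal(0, 0.01, size=(4, 120))

pred_5 = persistence_forecast(windows, 5)
pred_10 = persistence_forecast(windows, 10)
```

The baseline should be computed on raw returns rather than standardized inputs, so that its output is in the same units as the target.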

1D CNN: One-dimensional convolutional layers slide across the 120-day window to detect local return patterns — spikes, clusters of high variation — before pooling and passing to a dense head. CNNs are efficient at picking up local temporal motifs without needing to model the full sequence.

LSTM: The recurrent network processes the sequence step by step, maintaining a hidden state that captures longer-range dependencies. Dropout was applied to reduce overfitting on the limited training window. LSTMs are theoretically well-suited to volatility clustering — the tendency for high-volatility periods to persist.

CNN-LSTM Hybrid: Convolutional layers first extract local features from the raw sequence, then pass the compressed representation to an LSTM layer for sequential modelling. The hypothesis was that this combination would outperform either architecture alone by decomposing the learning problem into local feature extraction followed by temporal integration.
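The three trained architectures could be sketched in Keras along the following lines. Every hyperparameter here (filter count, kernel size, pooling, unit counts, dropout rate, optimizer, loss) is an assumption chosen for illustration, as is the single-output head; the write-up does not state the actual configuration or whether the two horizons shared a model.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(kind, lookback=120, n_outputs=1):
    """Build a 'cnn', 'lstm', or 'cnn_lstm' regressor; all sizes are assumed."""
    if kind not in ("cnn", "lstm", "cnn_lstm"):
        raise ValueError(f"unknown model kind: {kind}")
    model = keras.Sequential([keras.Input(shape=(lookback, 1))])
    if kind in ("cnn", "cnn_lstm"):
        # Convolution slides over the window to pick up local return patterns.
        model.add(layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=2))
    if kind == "cnn":
        # Collapse the time axis before the dense head.
        model.add(layers.GlobalAveragePooling1D())
    else:
        # The LSTM reads either the raw sequence or the pooled conv features.
        model.add(layers.LSTM(32, dropout=0.2))
    model.add(layers.Dense(16, activation="relu"))
    model.add(layers.Dense(n_outputs))  # linear output for regression
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

models = {kind: build_model(kind) for kind in ("cnn", "lstm", "cnn_lstm")}
```

Inputs need a trailing channel axis, so windows of shape (samples, 120) would be reshaped to (samples, 120, 1) before calling fit. In the hybrid, pooling halves the sequence length the LSTM has to process, which is one practical reason to put the convolution first.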

Dataset: 8,318 days
Samples: 8,187
Lookback: 120 days
Horizons: vol_5, vol_10

Results

All three deep learning models were evaluated against the persistence baseline using MAE and RMSE across both vol_5 and vol_10 targets. The CNN-LSTM hybrid achieved the strongest performance, outperforming the standalone CNN and LSTM models on both horizons. The plain LSTM performed competitively on the longer 10-day horizon, where sequential dependencies have more time to manifest. The CNN showed the fastest training convergence.
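For reference, MAE and RMSE can be computed for several models against the same test targets as below. The helper names are hypothetical and the numbers are toy values for illustration only; they are not results from the project.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def compare(y_true, predictions):
    """Return {model_name: (MAE, RMSE)} for predictions on the same targets."""
    return {name: (mae(y_true, p), rmse(y_true, p)) for name, p in predictions.items()}

# Toy values only, not project results.
y_true = np.array([0.010, 0.020, 0.030])
scores = compare(y_true, {
    "persistence": np.array([0.020, 0.020, 0.010]),
    "model": np.array([0.012, 0.019, 0.027]),
})
```

RMSE squares the errors before averaging, so it penalizes large misses such as an unforecast volatility spike more heavily than MAE does. Reporting both shows whether a model's errors are uniformly small or dominated by a few large ones.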

The 10-day horizon consistently produced higher absolute errors than the 5-day horizon across all models — forecasting further into the future compounds uncertainty, regardless of architecture.

All trained models beat the persistence baseline, confirming that the learned representations carry predictive signal beyond naive extrapolation.

Note: exact MAE/RMSE values are in the full report (AMI4100_Volatility_Forecasting_Report.pdf) — link below.

What I Learned

The most important lesson was about data leakage in time-series problems. It is easy to accidentally standardize using full-dataset statistics or shuffle samples before splitting — both of which allow future information to bleed into the training set and produce deceptively good validation metrics. Enforcing strict chronological discipline throughout is non-negotiable.

I also came to appreciate how much of a CNN's strength comes from its inductive bias toward local patterns. Volatility clustering, the empirical tendency for turbulent periods to group together, is exactly the kind of local motif a 1D CNN can detect efficiently, which may explain why it remained competitive despite being simpler than the LSTM.

Finally, this project made the case for always having a strong naive baseline. A persistence estimator is free to compute and often surprisingly hard to beat. Beating it is necessary but not sufficient — the question is always how much better and whether the additional complexity is worth it in deployment.

GitHub → Live Demo →