This repository implements the methods described in the paper "Correlation Matrix Clustering for Statistical Arbitrage Portfolios" by Cartea, Cucuringu, and Jin (2023), with significant enhancements to the original methodology.
Statistical arbitrage strategies aim to exploit temporary price deviations among similar assets. This implementation focuses on a two-step approach:
- Group similar assets: Use graph clustering algorithms to partition stocks into groups where stocks within the same group tend to co-move.
- Create arbitrage portfolios: Within each cluster, construct mean-reverting portfolios that exploit temporary deviations from the cluster's average return.
We start by calculating market residual returns for each stock, removing the market factor influence:
$$R^{res}_{i,t} = R_{i,t} - \beta_i R_{mkt,t}$$
Where:
- $R_{i,t}$ is the raw return of stock $i$ at time $t$
- $\beta_i$ is the stock's sensitivity to market movements (CAPM beta)
- $R_{mkt,t}$ is the market return at time $t$ (S&P 500/SPY ETF)
Why this approach? The one-factor CAPM adjustment aims to isolate stock-specific movements by removing market-wide effects. This is crucial because we want to identify relative value opportunities within sectors/clusters rather than taking directional market bets. By removing the market factor, we reduce the risk of our strategy being merely a disguised bet on market direction.
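The one-factor adjustment can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `data_preprocessing.py`: the function name `capm_residual_returns` is hypothetical, and it estimates one full-sample OLS beta per stock, whereas a rolling lookback window may be used in practice.

```python
import numpy as np

def capm_residual_returns(returns, market_returns):
    """Remove the market component from each stock's returns.

    returns: array of shape (T, N), one column per stock.
    market_returns: array of shape (T,).
    """
    returns = np.asarray(returns, dtype=float)
    market = np.asarray(market_returns, dtype=float)

    # OLS slope of each stock on the market: cov(R_i, R_mkt) / var(R_mkt).
    market_dev = market - market.mean()
    betas = market_dev @ (returns - returns.mean(axis=0)) / (market_dev @ market_dev)

    # Residual return: raw return minus beta times the market return.
    residuals = returns - np.outer(market, betas)
    return residuals, betas
```

By construction, the residuals of each stock have zero sample covariance with the market return over the estimation sample.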
While CAPM provides a basic market adjustment, research has shown that multiple factors drive stock returns. We extend beyond the simple one-factor model to more sophisticated approaches:
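Generalized residualization process using multiple factors:
$$R^{res}_{i,t} = R_{i,t} - \sum_{k=1}^{K} \beta_{i,k} F_{k,t}$$
Where:
- $F_{k,t}$ is the return of factor $k$ at time $t$
- $\beta_{i,k}$ is the sensitivity of stock $i$ to factor $k$
- $K$ is the number of factors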
Implemented factor models include:
- Fama-French Factors: Using the 5-factor model (Market, Size, Value, Profitability, Investment)
- PCA-Based Factors: Data-driven factors extracted from the return matrix
Why enhance with multi-factor models? Single-factor CAPM fails to account for well-documented risk premia like size, value, and momentum effects. By extending to multi-factor models:
- Improved signal isolation: We remove more systematic effects, leading to cleaner stock-specific signals
- Reduced factor exposure: The strategy becomes more market-neutral across multiple dimensions
- Better clustering accuracy: With more systematic effects removed, correlations better reflect true stock relationships
- Adaptability to changing markets: Data-driven PCA factors can identify emerging drivers of returns that traditional models might miss
The PCA approach is particularly valuable as it extracts factors directly from the data without relying on predefined economic interpretations, potentially capturing market dynamics that established models might miss.
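As a rough sketch of the PCA variant, the factors can be taken as the leading principal components of the demeaned return matrix and then projected out. The function name `pca_residual_returns` and the choice of a plain SVD are assumptions for illustration; the repository's implementation may differ.

```python
import numpy as np

def pca_residual_returns(returns, n_factors):
    """Project out the first n_factors principal components.

    returns: array of shape (T, N). The residuals are returned demeaned.
    """
    returns = np.asarray(returns, dtype=float)
    demeaned = returns - returns.mean(axis=0)

    # SVD of the demeaned returns; rows of Vt are the PCA loading vectors.
    U, s, Vt = np.linalg.svd(demeaned, full_matrices=False)
    factors = U[:, :n_factors] * s[:n_factors]   # factor returns, shape (T, K)
    loadings = Vt[:n_factors].T                  # sensitivities, shape (N, K)

    # Residual = return minus the part explained by the K factors.
    residuals = demeaned - factors @ loadings.T
    return residuals, factors, loadings
```

The residuals are orthogonal to the extracted factors, so any remaining co-movement is not explained by the first `n_factors` components.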
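We compute the correlation matrix of these residual returns over a lookback window:
$$\rho_{ij} = \frac{\sum_{t=1}^{w} (R^{res}_{i,t} - \bar{R}^{res}_i)(R^{res}_{j,t} - \bar{R}^{res}_j)}{(w-1)\,\sigma_i \sigma_j}$$
Where:
- $\bar{R}^{res}_i$ is the mean residual return of stock $i$ over the window
- $\sigma_i$ and $\sigma_j$ are the sample standard deviations of the residual returns over the window
- $w$ is the lookback window size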
Why correlation matrices? Correlation matrices provide a natural measure of co-movement between assets. Using residual returns rather than raw returns ensures that these correlations capture relationships beyond common market exposure. The correlation structure serves as the foundation for our clustering approach, helping identify natural groupings of stocks that tend to move together after accounting for systematic factors.
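A minimal sketch of this step, assuming the residual returns are stored as a `(T, N)` array with the most recent observation last. The helper name `rolling_residual_correlation` is hypothetical.

```python
import numpy as np

def rolling_residual_correlation(residuals, window):
    """Pearson correlation matrix of the last `window` residual returns."""
    residuals = np.asarray(residuals, dtype=float)
    if residuals.shape[0] < window:
        raise ValueError("not enough observations for the requested window")
    recent = residuals[-window:]
    # rowvar=False: each column is one stock, each row one time step.
    return np.corrcoef(recent, rowvar=False)
```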
The implementation includes multiple clustering methods:
- Spectral Clustering: Uses the spectrum of the graph Laplacian matrix. Since the correlation matrix may contain negative entries, we use the absolute values. The Laplacian matrix is defined as:
  $$L = D - A$$
  Where:
  - $A$ is the adjacency matrix (absolute correlation matrix)
  - $D$ is the diagonal degree matrix with $D_{ii} = \sum_{j=1}^{n} A_{ij}$
- Signed Laplacian (Random Walk): This handles signed networks directly, using:
  $$\bar{L}_{rw} = I - \bar{D}^{-1}A$$
  Where $\bar{D}_{ii} = \sum_{j=1}^{n} |A_{ij}|$
- Signed Laplacian (Symmetric): A symmetric normalized variant:
  $$\bar{L}_{sym} = I - \bar{D}^{-1/2}A\bar{D}^{-1/2}$$
- SPONGE: SPONGE (Signed Positive Over Negative Generalized Eigenproblem) decomposes the adjacency matrix into positive and negative components:
  $$A = A^+ - A^-$$
  And solves a generalized eigenvalue problem:
  $$(L^+ + \tau^- D^-, L^- + \tau^+ D^+)$$
- SPONGEsym: A symmetric variant of SPONGE using the symmetric Laplacian.
Why these clustering approaches? Financial correlation matrices present unique challenges:
- Signed graphs: Unlike many network problems, financial correlations can be negative (assets moving in opposite directions), requiring specialized methods like Signed Laplacian and SPONGE
- Different information in positive and negative correlations: Negative correlations may indicate hedging relationships just as important as positive co-movement
- Noise sensitivity: Financial data contains substantial noise, requiring robust algorithms
- Non-spherical clusters: Market sectors often have complex shapes that aren't easily captured by simpler clustering methods
SPONGE methods in particular are valuable because they handle both positive and negative correlations in a principled way, balancing attractive forces (positive correlations) and repulsive forces (negative correlations) to find optimal clusters.
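The SPONGE step can be sketched as follows. This is an illustrative version built on SciPy and scikit-learn, not the code in `clustering.py`: the function name `sponge_clusters`, the default regularization values, and the use of k-means on the `k` smallest generalized eigenvectors are assumptions. It also assumes every asset has at least one positive correlation, so that the right-hand matrix is positive definite.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def sponge_clusters(corr, k, tau_pos=1.0, tau_neg=1.0, seed=0):
    """Cluster a signed correlation matrix with a SPONGE-style embedding."""
    A = np.array(corr, dtype=float)
    np.fill_diagonal(A, 0.0)           # no self-loops

    # Split into positive and negative parts: A = A_pos - A_neg.
    A_pos = np.maximum(A, 0.0)
    A_neg = np.maximum(-A, 0.0)
    D_pos = np.diag(A_pos.sum(axis=1))
    D_neg = np.diag(A_neg.sum(axis=1))
    L_pos = D_pos - A_pos
    L_neg = D_neg - A_neg

    # Generalized eigenproblem (L+ + tau- D-) v = lambda (L- + tau+ D+) v.
    lhs = L_pos + tau_neg * D_neg
    rhs = L_neg + tau_pos * D_pos
    _, eigvecs = eigh(lhs, rhs)        # eigenvalues in ascending order

    # Embed each asset using the k smallest eigenvectors, then run k-means.
    embedding = eigvecs[:, :k]
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embedding)
```

Small eigenvalues correspond to partitions that keep positively correlated assets together while separating negatively correlated ones, which is why the smallest eigenvectors are used for the embedding.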
- Hierarchical Clustering with Dynamic Tree Cutting: Builds a dendrogram and intelligently cuts it to identify natural groupings.
- Density-Based Clustering (HDBSCAN): Identifies clusters of varying densities and handles noise points.
- Deep Embedding Clustering: Uses autoencoder neural networks to learn low-dimensional embeddings for clustering.
Why enhance with these methods? Each addresses specific limitations of traditional approaches:
- Hierarchical Clustering with Dynamic Tree Cutting:
  - Problem solved: Fixed number of clusters may not reflect natural market structure
  - Benefits: Adaptively determines cluster boundaries based on actual data structure
  - Advantage: Creates interpretable hierarchy of relationships, useful for risk monitoring
- Density-Based Clustering (HDBSCAN):
  - Problem solved: Standard methods struggle with varying cluster densities and outliers
  - Benefits: Identifies stocks with unique behavior as outliers rather than forcing them into clusters
  - Advantage: More robust to market disruptions where correlation structures temporarily break down
- Deep Embedding Clustering:
  - Problem solved: Linear methods may miss complex non-linear relationships
  - Benefits: Captures subtle patterns in return dynamics beyond simple correlations
  - Advantage: Can adapt to changing market conditions by learning relevant features directly from data
Together, these enhanced methods provide more flexible, adaptive clustering that better respects the natural structure in financial data.
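We implement two methods for choosing the number of clusters:
- Marchenko-Pastur (MP) method: the number of clusters is the number of eigenvalues of the correlation matrix that exceed the upper boundary of the MP distribution.
- Variance Explained method: select the top-k eigenvalues that together explain a specified percentage of total variance.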
Why these approaches? Determining the optimal number of clusters is a critical challenge:
- Marchenko-Pastur (MP) method:
  - Scientific foundation: Based on random matrix theory (RMT), which provides a sound theoretical framework
  - Signal vs. noise separation: Distinguishes statistically significant correlations from random noise
  - Adaptive: Accounts for the ratio of assets to time periods, crucial in high-dimensional financial data
- Variance Explained method:
  - Interpretability: Directly relates to how much of the return variation is captured
  - Practical tuning: Allows practitioners to balance complexity vs. explanatory power
  - Stability: Tends to produce more stable clusters over time
Using both methods provides complementary perspectives - the MP approach offers theoretical rigor while variance explained provides practical flexibility.
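Both rules can be sketched directly from the eigenvalues of the correlation matrix. The function names are hypothetical. The MP sketch assumes unit-variance inputs (a correlation matrix), for which the upper edge of the MP distribution is $(1 + \sqrt{N/T})^2$ with $N$ assets and $T$ observations.

```python
import numpy as np

def n_clusters_marchenko_pastur(corr, n_obs):
    """Count eigenvalues above the Marchenko-Pastur upper edge."""
    n_assets = corr.shape[0]
    lambda_plus = (1.0 + np.sqrt(n_assets / n_obs)) ** 2
    eigvals = np.linalg.eigvalsh(corr)
    return int(np.sum(eigvals > lambda_plus))

def n_clusters_variance_explained(corr, target=0.9):
    """Smallest k whose top-k eigenvalues explain at least `target` of variance."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]      # descending order
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(eigvals))
```

With few observations relative to assets, the MP edge rises and fewer eigenvalues qualify as signal, which is the adaptive behavior described above.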
Within each cluster, we:
- Calculate the mean return of all stocks
- Identify "winners" (outperformers) and "losers" (underperformers)
- Take a contrarian position: short winners, long losers
- Normalize weights to create a zero-cost portfolio
Why this strategy? The portfolio construction follows a statistical arbitrage logic:
- Mean-reversion exploitation: Research shows stock returns tend to exhibit short-term reversal within similar asset groups
- Diversification benefit: By constructing multiple independent portfolios across clusters, we reduce overall strategy risk
- Market neutrality: Zero-cost construction within clusters helps maintain neutrality to broad market movements
- Sector neutrality: By operating within clusters rather than across the entire market, we avoid inadvertent sector bets
This approach harnesses both statistical principles (mean reversion) and financial market structure (clusters representing economic relationships) to identify temporary mispricings.
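The four construction steps can be sketched as below. This is a simplified illustration rather than the logic in `portfolio_construction.py`: the function name, equal weighting within the winner and loser groups, and normalization to unit gross exposure are all assumptions.

```python
import numpy as np

def cluster_contrarian_weights(past_returns, labels, threshold=0.0):
    """Zero-cost contrarian weights built cluster by cluster.

    past_returns: shape (N,), each stock's return over the lookback window.
    labels: shape (N,), cluster label of each stock.
    """
    past_returns = np.asarray(past_returns, dtype=float)
    labels = np.asarray(labels)
    weights = np.zeros_like(past_returns)

    for cluster in np.unique(labels):
        idx = np.flatnonzero(labels == cluster)
        deviation = past_returns[idx] - past_returns[idx].mean()
        winners = idx[deviation > threshold]
        losers = idx[deviation < -threshold]
        if len(winners) == 0 or len(losers) == 0:
            continue                                # no trade in this cluster
        weights[winners] = -1.0 / len(winners)      # short the outperformers
        weights[losers] = 1.0 / len(losers)         # long the underperformers

    # Scale to unit gross exposure; long and short legs still cancel.
    gross = np.abs(weights).sum()
    return weights / gross if gross > 0 else weights
```

Because each cluster's long and short legs have equal size, the weights sum to zero within every cluster and therefore across the whole portfolio.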
We implement regime-switching models to adaptively adjust strategy parameters:
$$\theta_t = f(S_t)$$
Where:
- $\theta_t$ is the set of strategy parameters at time $t$
- $S_t$ is the detected market regime state
- $f$ is a mapping function from regimes to optimal parameters
Key adaptive parameters include:
- Lookback window length
- Rebalancing frequency
- Winner/loser classification threshold
- Correlation window size
Why dynamic parameters? Markets operate in distinct regimes with different characteristics:
- Market regime sensitivity:
  - Problem solved: Fixed parameters that work in calm markets often fail during high volatility periods
  - Approach: Hidden Markov Models (HMMs) identify distinct volatility regimes
  - Benefit: Parameters automatically adjust to prevailing conditions
- Adaptivity to changing market speeds:
  - Problem solved: Market reaction speeds vary across regimes (high volatility periods typically move faster)
  - Implementation: Shorter lookback windows during volatile periods, longer during calm markets
  - Advantage: Captures the appropriate signal horizon for current conditions
- Threshold optimization:
  - Problem solved: Fixed thresholds may be too strict during low-volatility periods and too loose during high volatility
  - Method: Dynamically scale thresholds based on recent volatility
  - Result: More consistent portfolio turnover across different market environments
This adaptive approach helps avoid the pitfall of parameters that work well in backtest but fail when market conditions change.
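The mapping from regime to parameters can be illustrated with a deliberately simple stand-in. Instead of an HMM, this sketch labels the regime with a fixed cutoff on recent realized volatility. The regime names, the cutoff, and every parameter value in `REGIME_PARAMS` are hypothetical placeholders, not the values used in `adaptive_parameters.py`.

```python
import numpy as np

# Hypothetical regime-to-parameter table (the function f in the text).
REGIME_PARAMS = {
    "calm": {"lookback_window": 10, "rebalance_period": 5, "threshold_scale": 1.0},
    "volatile": {"lookback_window": 3, "rebalance_period": 1, "threshold_scale": 1.5},
}

def detect_regime(market_returns, window=20, vol_cutoff=0.015):
    """Label the current regime from recent daily volatility (not an HMM)."""
    recent = np.asarray(market_returns, dtype=float)[-window:]
    recent_vol = np.std(recent, ddof=1)
    return "volatile" if recent_vol > vol_cutoff else "calm"

def strategy_parameters(market_returns):
    """Look up the parameter set for the detected regime."""
    return REGIME_PARAMS[detect_regime(market_returns)]
```

Replacing `detect_regime` with the most likely HMM state leaves the lookup step unchanged, which keeps regime detection and parameter choice decoupled.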
We incorporate non-price information sources to enhance clustering and signal generation:
Sentiment scores are used to adjust expected returns.
Why sentiment data?
- Forward-looking information: Captures market expectations not yet reflected in price
- Crowdsourced intelligence: Aggregates views across many market participants
- Lead indicator: Often precedes price movements, especially around significant events
- Implementation benefit: More subtle than price movements, potentially identifying opportunities earlier
Implied volatility skew and term structure provide forward-looking information.
Why options data?
- Market expectations: Contains implied forecasts of future volatility and price direction
- Institutional insight: Often reflects positioning of sophisticated investors
- Risk premium indicator: Captures market pricing of tail risks
- Advantage: Provides a market-derived probability distribution of future returns, not just point estimates
Market microstructure information is used to adjust thresholds.
Why order flow data?
- Real-time adaptation: Captures current market behavior at the microstructure level
- Supply-demand imbalances: Identifies temporary liquidity issues versus fundamental repricing
- Informed trading detection: May help distinguish noise from information-driven price movements
- Implementation value: Particularly helpful for avoiding trades against strong order flow momentum
By integrating these alternative data sources, the strategy gains multiple perspectives beyond just historical prices, potentially identifying opportunities earlier and with greater conviction.
The codebase is organized as follows:
- `data_preprocessing.py`: Handles data loading, cleaning, beta calculation, and residual returns
- `correlation_matrix.py`: Builds correlation matrices from residual returns
- `clustering.py`: Implements all clustering algorithms
- `portfolio_construction.py`: Creates and rebalances portfolios based on clustering
- `backtest.py`: Evaluates strategy performance
- `utils.py`: Utility functions for metrics and visualizations
- `main.py`: End-to-end execution
- `adaptive_parameters.py`: Implements dynamic parameter optimization based on market regimes
- `alternative_data.py`: Processes and integrates sentiment, options, and order flow data
- `portfolio_integration.py`: Manages integration with existing portfolios and capital allocation
- `visualization_plotly.py`: Interactive visualizations using Plotly
The backtesting framework:
- Prepares historical price data
- Calculates residual returns
- Periodically rebalances portfolios based on clustering
- Tracks transaction costs
- Evaluates performance metrics
The framework includes advanced backtesting techniques:
Time-series cross-validation is used to assess strategy robustness across different market regimes:
```python
validation_results = time_series_cross_validation(
    tickers, start_date, end_date,
    n_splits=5,
    backtest_params={'clustering_method': 'sponge_sym'}
)
```
Why time-series cross-validation?
- Problem addressed: Simple train-test splits don't account for regime changes and time-varying relationships
- Method: Multiple forward-testing periods with expanding windows preserve temporal order
- Benefit: More realistic assessment of how the strategy would have performed in real trading
- Advantage over alternatives: Produces distribution of results across different market periods, revealing strategy robustness
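The expanding-window scheme described above can be sketched with a small index generator. The helper name `expanding_window_splits` and its arguments are hypothetical; how `time_series_cross_validation` forms its folds internally is not shown here.

```python
import numpy as np

def expanding_window_splits(n_obs, n_splits, min_train):
    """Yield (train_idx, test_idx) pairs that preserve temporal order.

    The training window always starts at 0 and grows by one test block
    per split; each test block directly follows its training window.
    """
    test_size = (n_obs - min_train) // n_splits
    if test_size < 1:
        raise ValueError("not enough observations for the requested splits")
    for i in range(n_splits):
        train_end = min_train + i * test_size
        yield np.arange(0, train_end), np.arange(train_end, train_end + test_size)
```

Every test index is strictly later than every training index in the same split, so no future information leaks into the fit.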
Bootstrapping techniques are employed to generate confidence intervals for performance metrics:
```python
mc_results = monte_carlo_backtest(
    returns, residual_returns,
    n_simulations=1000,
    bootstrap_method='block'  # Block bootstrap preserves autocorrelation
)
```
Why Monte Carlo simulation?
- Problem addressed: Historical results represent just one realization from a distribution of possibilities
- Method: Block bootstrap resampling preserves return autocorrelation structure while generating alternative scenarios
- Benefit: Produces confidence intervals around performance metrics rather than point estimates
- Key insight gained: Reveals strategy's sensitivity to specific historical sequences versus robust patterns
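A block bootstrap confidence interval for the Sharpe ratio can be sketched as follows. This is a simplified moving-block version with hypothetical function names; the block size, the 252-day annualization, and the percentile interval are assumptions, and `monte_carlo_backtest` may work differently.

```python
import numpy as np

def block_bootstrap(returns, block_size, rng):
    """Resample a return series by concatenating randomly chosen blocks."""
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    n_blocks = int(np.ceil(n / block_size))
    # Random start positions; each block keeps its internal time ordering.
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    sample = np.concatenate([returns[s:s + block_size] for s in starts])
    return sample[:n]

def bootstrap_sharpe_interval(returns, n_simulations=1000, block_size=10,
                              level=0.95, seed=0):
    """Percentile confidence interval for the annualized Sharpe ratio."""
    rng = np.random.default_rng(seed)
    sharpes = np.empty(n_simulations)
    for i in range(n_simulations):
        sample = block_bootstrap(returns, block_size, rng)
        sharpes[i] = np.sqrt(252) * sample.mean() / sample.std(ddof=1)
    lower, upper = np.quantile(sharpes, [(1 - level) / 2, (1 + level) / 2])
    return lower, upper
```

Larger blocks preserve more of the autocorrelation structure but yield fewer distinct resamples, so the block size is a tuning choice.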
Parameters are continually re-optimized as new data arrives to maintain adaptivity:
```python
wfo_results = walk_forward_optimization(
    tickers, start_date, end_date,
    initial_window=252,  # 1 year initial window
    step=20,             # Re-optimize every 20 days
    param_grid={
        'lookback_window': [3, 5, 10, 15],
        'rebalance_period': [1, 3, 5]
    }
)
```
Why walk-forward optimization?
- Problem addressed: Static parameters optimized on full sample lead to look-ahead bias and parameter overfitting
- Method: Parameters periodically retrained on expanding window, used only on unseen data
- Benefit: Mimics real-world process of periodic strategy recalibration
- Implementation advantage: Provides insights on parameter stability across different market regimes
These advanced validation techniques provide a more comprehensive understanding of strategy performance and robustness than simple full-sample backtests.
We provide methods to integrate the statistical arbitrage strategy with existing portfolios:
- Risk Parity: Equal risk contribution across strategies
- Minimum Variance: Minimizes portfolio volatility
- Maximum Diversification: Maximizes the diversification ratio
- Risk Budgeting: Allocates according to specified risk targets
Why these allocation methods?
- Problem addressed: Naïve equal weighting doesn't account for different strategy risk profiles and correlations
- Risk Parity approach: Allocates capital to equalize risk contribution, preventing any single strategy from dominating risk
- Minimum Variance benefit: Finding the mix that minimizes overall volatility
- Maximum Diversification advantage: Optimizes for diversification potential rather than just risk reduction
- Risk Budgeting value: Allows flexible capital allocation aligned with risk tolerance
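Two of these allocation rules have simple closed forms that make the ideas concrete. The sketch below uses hypothetical helper names. It shows the unconstrained minimum-variance solution (weights may be negative) and inverse-volatility weighting, which matches risk parity only when the strategies are uncorrelated or all pairwise correlations are equal. `PortfolioAllocator` may use a numerical optimizer instead.

```python
import numpy as np

def minimum_variance_weights(cov):
    """Fully invested minimum-variance weights: proportional to inv(cov) @ 1."""
    cov = np.asarray(cov, dtype=float)
    raw = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return raw / raw.sum()

def inverse_volatility_weights(cov):
    """Weights proportional to 1 / volatility (a naive risk-parity proxy)."""
    inv_vol = 1.0 / np.sqrt(np.diag(np.asarray(cov, dtype=float)))
    return inv_vol / inv_vol.sum()

def risk_contributions(weights, cov):
    """Share of total portfolio variance attributed to each strategy."""
    weights = np.asarray(weights, dtype=float)
    cov = np.asarray(cov, dtype=float)
    portfolio_var = weights @ cov @ weights
    return weights * (cov @ weights) / portfolio_var
```

Checking `risk_contributions` after any allocation shows whether a single strategy dominates portfolio risk, which is the problem risk parity is meant to prevent.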
Hierarchical clustering is used to identify genuinely diversifying strategies:
```python
allocator = PortfolioAllocator(strategies)
clustering = allocator.cluster_strategies(threshold=0.5)
print(f"Identified {clustering['n_clusters']} strategy clusters")
```
Why cluster strategies?
- Problem addressed: Superficially different strategies may actually capture the same underlying factor
- Method: Hierarchical clustering identifies natural groupings of strategies based on return correlations
- Benefit: Ensures capital is allocated to truly diverse alpha sources
- Implementation advantage: Helps identify redundant strategies that can be eliminated or re-weighted
Measures the impact of adding statistical arbitrage to an existing portfolio:
```python
impact = calculate_information_ratio_impact(
    existing_portfolio_returns,
    stat_arb_returns,
    allocation=0.1  # 10% allocation to stat arb
)
print(f"Sharpe improvement: {impact['sharpe_improvement']:.2f}")
```
Why incremental benefit analysis?
- Problem addressed: Raw strategy metrics don't show marginal benefit to an existing portfolio
- Method: Information ratio measures excess return per unit of tracking error
- Benefit: Quantifies the value added by including the new strategy
- Key insight gained: A strategy with modest standalone metrics may still provide substantial portfolio benefit if its returns are uncorrelated with existing holdings
These portfolio integration tools help practitioners make informed allocation decisions and understand the true value added by the statistical arbitrage strategy.
We evaluate performance using:
- Annualized Return: Average yearly return of the strategy
- Sharpe Ratio: Risk-adjusted return using standard deviation as risk
  $$\text{Sharpe Ratio} = \frac{\text{Portfolio Return}}{\text{Standard Deviation}}$$
- Sortino Ratio: Risk-adjusted return focusing on downside risk
  $$\text{Sortino Ratio} = \frac{\text{Portfolio Return}}{\text{Downside Deviation}}$$
- Information Ratio: Excess return relative to a benchmark per unit of tracking error
  $$\text{Information Ratio} = \frac{\text{Strategy Return} - \text{Benchmark Return}}{\text{Tracking Error}}$$
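These metrics can be computed from a series of periodic returns roughly as follows. The function names are hypothetical and need not match `utils.py`. The sketch assumes daily returns annualized with 252 periods, an arithmetic annualized return, a zero target return for the downside deviation, and no risk-free rate, in line with the formulas above.

```python
import numpy as np

def annualized_return(returns, periods=252):
    """Arithmetic mean return scaled to one year."""
    return np.mean(returns) * periods

def sharpe_ratio(returns, periods=252):
    """Annualized mean return divided by annualized standard deviation."""
    returns = np.asarray(returns, dtype=float)
    return np.sqrt(periods) * returns.mean() / returns.std(ddof=1)

def sortino_ratio(returns, periods=252):
    """Like the Sharpe ratio, but penalizing only negative returns."""
    returns = np.asarray(returns, dtype=float)
    downside = np.sqrt(np.mean(np.minimum(returns, 0.0) ** 2))
    return np.sqrt(periods) * returns.mean() / downside

def information_ratio(returns, benchmark_returns, periods=252):
    """Annualized active return divided by annualized tracking error."""
    active = np.asarray(returns, dtype=float) - np.asarray(benchmark_returns, dtype=float)
    return np.sqrt(periods) * active.mean() / active.std(ddof=1)
```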
- Python 3.7+
- NumPy
- pandas
- SciPy
- scikit-learn
- matplotlib
- plotly
- yfinance (for data fetching)
- tensorflow (for deep clustering)
- hdbscan (for density-based clustering)
- Clone the repository
- Install dependencies: `pip install -r requirements.txt`
- Run the main script: `python main.py`
```python
# Basic backtest with default parameters
backtest = StatisticalArbitrageBacktest(
    tickers=tickers,
    start_date=start_date,
    end_date=end_date
)
backtest.run_backtest()

# Advanced backtest with dynamic parameters and alternative data
backtest = StatisticalArbitrageBacktest(
    tickers=tickers,
    start_date=start_date,
    end_date=end_date,
    factor_model='fama_french',
    use_dynamic_parameters=True,
    use_alternative_data=True
)

# Add sentiment data
sentiment_processor = SentimentDataProcessor()
sentiment_data = sentiment_processor.fetch_news_sentiment(tickers, start_date, end_date)
backtest.add_alternative_data('sentiment', sentiment_data)

# Run backtest
returns = backtest.run_backtest()
metrics = backtest.get_performance_metrics()

# Portfolio integration
allocator = PortfolioAllocator({
    'StatArb': returns['Combined'],
    'Market': market_returns,
    'Value': value_returns
})
weights = allocator.optimize_allocation(method='risk_parity')
```
You can adjust strategy parameters in `config.py`:
- Lookback window for beta calculation
- Rebalancing frequency
- Number of clusters or method for determining it
- Threshold for identifying winners/losers
- Stop-win threshold
- Factor model specifications
- Alternative data integration settings
- Cartea, Á., Cucuringu, M., & Jin, Q. (2023). Correlation Matrix Clustering for Statistical Arbitrage Portfolios.
- Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1), 1-22.
- Campello, R. J., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. Pacific-Asia conference on knowledge discovery and data mining.
- Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24(5), 719-720.
- Patton, A. J., & Timmermann, A. (2012). Portfolio sorts and tests of cross-sectional patterns in expected returns. Journal of Financial Economics, 104(1), 141-161.
- Timmermann, A. (2018). Forecasting methods in finance. Annual Review of Financial Economics, 10, 449-479.
- Plerou, V., Gopikrishnan, P., Rosenow, B., Amaral, L. A. N., Guhr, T., & Stanley, H. E. (2002). Random matrix approach to cross correlations in financial data. Physical Review E, 65(6), 066126.
- Bouchaud, J. P., & Potters, M. (2003). Theory of financial risk and derivative pricing: from statistical physics to risk management. Cambridge University Press.