correlation-analysis

Name: correlation-analysis
Author: HKUDS/Vibe-Trading
$ npx mdskill add HKUDS/Vibe-Trading/correlation-analysis
Analyzes asset relationships for pairs trading and portfolio construction using correlation and cointegration methods.
Helps identify highly correlated assets and cointegrated pairs for trading strategies and risk management.
Integrates with financial data sources and uses Python libraries like pandas, numpy, and scipy for calculations.
Ranks assets by correlation, tests for cointegration, and filters candidates to surface viable trading opportunities.
Outputs candidate pools, correlation summaries, and pair-trading signals in structured formats for further analysis.
SKILL.md
.github/skills/correlation-analysis
---
name: correlation-analysis
description: Correlation and cointegration analysis — co-movement discovery, deep return-correlation analysis, sector clustering, realized correlation, Engle-Granger / Johansen cointegration, half-life, Kalman dynamic hedge ratio, cross-market linkage analysis, and pair-trading signal generation
category: analysis
---

# Correlation and Cointegration Analysis

## Overview

Correlation analysis is a foundational tool for pairs trading, portfolio construction, and risk management. This skill covers four analysis modes (co-movement discovery / return-correlation deep dive / sector clustering / realized correlation), a full cointegration-testing framework, cross-market linkage analysis, and the complete workflow from analytics to pair-trading signals.

---

## Mode 1: Co-Movement Discovery

**Use case**: Given a target asset, scan a universe for highly correlated assets and build a candidate pool with similar industry or factor exposure, for use in pairs trading or substitute identification.

### Workflow

```
1. Pull daily return series for the target asset and N candidates
2. Compute Pearson / Spearman correlations between the target and each candidate
3. Rank by correlation in descending order and keep Top-K (usually K=10-20)
4. Run cointegration tests on the Top-K set to retain pairs with real long-run equilibrium
5. Output the candidate pool and a correlation summary
```

```python
import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr

def scan_correlated_assets(
    target_returns: pd.Series,
    universe_returns: pd.DataFrame,
    top_k: int = 20,
    min_corr: float = 0.5,
    method: str = "pearson",
) -> pd.DataFrame:
    """Scan for assets that are highly correlated with the target asset.

    Args:
        target_returns: Daily return series for the target asset
        universe_returns: Candidate-universe return matrix, columns are symbols
        top_k: Number of top candidates to return
        min_corr: Minimum absolute-correlation threshold
        method: "pearson" or "spearman"

    Returns:
        A DataFrame containing symbol / corr / p_value / rank
    """
    # Drop candidates with any missing data, then align on the target's valid dates.
    aligned = universe_returns.dropna(axis=1, how="any")
    aligned, target_aligned = aligned.align(target_returns.dropna(), join="inner", axis=0)

    results = []
    for col in aligned.columns:
        if method == "spearman":
            corr, p = spearmanr(target_aligned, aligned[col])
        else:
            corr, p = pearsonr(target_aligned, aligned[col])
        results.append({"symbol": col, "corr": corr, "p_value": p})

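    # Fix the column set up front so an empty candidate list does not raise a KeyError.
    df = pd.DataFrame(results, columns=["symbol", "corr", "p_value"])
    df = df[df["corr"].abs() >= min_corr].sort_values("corr", ascending=False)
    df = df.head(top_k).reset_index(drop=True)
    df["rank"] = range(1, len(df) + 1)
    return df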
```

**Screening guidance**:

| Correlation | Conclusion | Follow-up Action |
|---------|------|---------|
| > 0.8 | Strong same-direction co-movement | Send to the cointegration test queue |
| 0.6 - 0.8 | Moderate co-movement | Check industry / factor alignment before cointegration |
| -0.6 to 0.6 | Weak correlation | Usually unsuitable for pairs trading |
| < -0.6 | Strong inverse co-movement | Can be used in hedged portfolios, but be careful with spread direction |

---

## Mode 2: Deep Return-Correlation Analysis

**Use case**: Run a full bivariate correlation study on two assets, including multiple correlation coefficients, Beta / R², rolling correlation, and spread Z-Score.

### Core Metrics

```python
import statsmodels.api as sm
from scipy.stats import pearsonr, spearmanr, kendalltau

def bivariate_correlation_analysis(
    y: pd.Series,
    x: pd.Series,
    rolling_window: int = 60,
) -> dict:
    """Run deep correlation analysis for two assets.

    Args:
        y: Daily return series of asset A
        x: Daily return series of asset B
        rolling_window: Rolling-window length in trading days

    Returns:
        Dict of correlation statistics
    """
    # Align the two series.
    df = pd.concat([y.rename("y"), x.rename("x")], axis=1).dropna()
    y_clean, x_clean = df["y"], df["x"]

    # Static correlations.
    pearson_r, pearson_p = pearsonr(y_clean, x_clean)
    spearman_r, spearman_p = spearmanr(y_clean, x_clean)
    kendall_r, kendall_p = kendalltau(y_clean, x_clean)

    # OLS: y = α + β·x
    x_const = sm.add_constant(x_clean)
    ols = sm.OLS(y_clean, x_const).fit()
    beta = ols.params["x"]
    alpha = ols.params["const"]
    r_squared = ols.rsquared

    # Rolling Pearson correlation.
    rolling_corr = y_clean.rolling(rolling_window).corr(x_clean)

    # Spread and Z-Score using the hedge ratio.
    spread = y_clean - beta * x_clean
    spread_mean = spread.rolling(rolling_window).mean()
    spread_std = spread.rolling(rolling_window).std()
    z_score = (spread - spread_mean) / spread_std

    return {
        "pearson": {"r": round(pearson_r, 4), "p": round(pearson_p, 6)},
        "spearman": {"r": round(spearman_r, 4), "p": round(spearman_p, 6)},
        "kendall": {"r": round(kendall_r, 4), "p": round(kendall_p, 6)},
        "beta": round(beta, 4),
        "alpha": round(alpha, 6),
        "r_squared": round(r_squared, 4),
        "rolling_corr": rolling_corr,
        "spread": spread,
        "z_score": z_score,
        "spread_mean": spread_mean,
        "spread_std": spread_std,
    }
```

### Correlation-Coefficient Selection Guide

| Coefficient | Assumption | Best Use Case | Not Suitable When |
|------|------|---------|--------|
| Pearson | Linear, approximately normal | Return series | Heavy tails / many outliers |
| Spearman | Monotonic relationship | Ranking / quantile analysis, many outliers | When magnitude information matters |
| Kendall | Order consistency | Small samples, unknown distribution | Large samples due to slower computation |

**Practical rule in finance**: Usually report all three coefficients. If Pearson and Spearman differ by more than 0.1, the relationship is likely nonlinear or heavy-tailed, and Spearman should carry more weight.
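
As a rough illustration of this rule, the sketch below computes Pearson and Spearman (as the Pearson correlation of ranks) and flags a pair when the two differ by more than a tolerance. The helper name `coefficient_gap` and its `tol` parameter are illustrative assumptions, not part of the skill.

```python
import numpy as np
import pandas as pd

def coefficient_gap(y: pd.Series, x: pd.Series, tol: float = 0.1) -> dict:
    """Compare Pearson and Spearman and flag a large gap (hypothetical helper)."""
    df = pd.concat([y.rename("y"), x.rename("x")], axis=1).dropna()
    pearson = df["y"].corr(df["x"])
    # Spearman is the Pearson correlation of the ranks.
    spearman = df["y"].rank().corr(df["x"].rank())
    return {
        "pearson": float(pearson),
        "spearman": float(spearman),
        "prefer_spearman": bool(abs(pearson - spearman) > tol),
    }

# Monotonic but strongly nonlinear relationship: the ranks agree perfectly,
# while a straight line fits poorly.
x_demo = pd.Series(np.linspace(0.0, 10.0, 200))
y_demo = np.exp(x_demo)
gap_demo = coefficient_gap(y_demo, x_demo)
```

On this synthetic input Spearman is 1 while Pearson is noticeably lower, so the flag is set. Real return series will rarely be this clean.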

---

## Mode 3: Sector Clustering

**Use case**: Run hierarchical clustering on the correlation matrix of N assets to discover sector structure, check portfolio diversification, and identify similar assets.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt
import seaborn as sns

def sector_clustering(
    returns: pd.DataFrame,
    method: str = "ward",
    n_clusters: int = 5,
    figsize: tuple = (12, 10),
) -> dict:
    """Run sector clustering analysis.

    Args:
        returns: Multi-asset daily return matrix, columns are symbols
        method: Linkage method: "ward" / "complete" / "average"
        n_clusters: Target number of clusters
        figsize: Heatmap size

    Returns:
        Dict containing the correlation matrix, cluster labels, and figure objects
    """
    # 1. Correlation matrix
    corr_matrix = returns.corr(method="pearson")

    # 2. Distance matrix where distance = 1 - |correlation|
    distance_matrix = 1 - corr_matrix.abs()
    condensed = squareform(distance_matrix.values, checks=False)

    # 3. Hierarchical clustering
    linkage_matrix = linkage(condensed, method=method)
    labels = fcluster(linkage_matrix, n_clusters, criterion="maxclust")
    cluster_df = pd.DataFrame({"symbol": corr_matrix.columns, "cluster": labels})

    # 4. Heatmap sorted by cluster
    order = cluster_df.sort_values("cluster").index
    sorted_corr = corr_matrix.iloc[order, order]

    fig_heatmap, ax = plt.subplots(figsize=figsize)
    sns.heatmap(
        sorted_corr,
        cmap="RdYlGn",
        center=0,
        vmin=-1,
        vmax=1,
        annot=len(corr_matrix) <= 20,
        fmt=".2f",
        ax=ax,
        cbar_kws={"label": "Pearson correlation"},
    )
    ax.set_title(f"Correlation Heatmap ({method.upper()} clustering order)")

    # 5. Dendrogram
    fig_dendro, ax2 = plt.subplots(figsize=(figsize[0], 6))
    dendrogram(
        linkage_matrix,
        labels=list(corr_matrix.columns),
        ax=ax2,
        leaf_rotation=90,
        color_threshold=0,
    )
    ax2.set_title(f"Hierarchical Dendrogram ({method.upper()} linkage)")
    ax2.set_ylabel("Distance")

    return {
        "corr_matrix": corr_matrix,
        "cluster_labels": cluster_df,
        "linkage_matrix": linkage_matrix,
        "fig_heatmap": fig_heatmap,
        "fig_dendrogram": fig_dendro,
        "n_clusters": n_clusters,
    }
```

### Comparison of Three Linkage Methods

| Method | Feature | Best Use Case | Weakness |
|------|------|---------|------|
| Ward | Minimizes within-cluster variance, gives compact clusters | **Default recommendation**, stock-sector discovery | Works best for spherical clusters, weaker for irregular shapes |
| Complete | Uses maximum pairwise distance, conservative | When high within-cluster similarity is required | Sensitive to outliers, tends to split large clusters |
| Average | Uses average distance, compromise approach | General analysis where compactness is not the top priority | Sensitive to noise |
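
One caveat when choosing Ward: SciPy documents Ward linkage as correctly defined only for Euclidean distances, and `1 - |correlation|` is not guaranteed to be one. A common alternative is `sqrt(2 · (1 - ρ))`, which is proportional to the Euclidean distance between standardized return vectors. The sketch below uses that distance. The function name `ward_clusters_from_corr` and the synthetic data are assumptions for illustration. Unlike `1 - |ρ|`, this distance treats strongly negative correlation as far apart.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def ward_clusters_from_corr(returns: pd.DataFrame, n_clusters: int) -> pd.Series:
    """Ward clustering on the distance sqrt(2 * (1 - rho)) (hypothetical helper)."""
    corr = returns.corr().clip(-1.0, 1.0)
    dist = np.sqrt(2.0 * (1.0 - corr)).to_numpy().copy()
    np.fill_diagonal(dist, 0.0)  # remove floating-point residue on the diagonal
    condensed = squareform(dist, checks=False)
    link = linkage(condensed, method="ward")
    labels = fcluster(link, n_clusters, criterion="maxclust")
    return pd.Series(labels, index=corr.columns, name="cluster")

# Synthetic universe: two independent factors, three noisy assets on each.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=500), rng.normal(size=500)
demo_returns = pd.DataFrame({
    "A1": f1 + 0.3 * rng.normal(size=500),
    "A2": f1 + 0.3 * rng.normal(size=500),
    "A3": f1 + 0.3 * rng.normal(size=500),
    "B1": f2 + 0.3 * rng.normal(size=500),
    "B2": f2 + 0.3 * rng.normal(size=500),
    "B3": f2 + 0.3 * rng.normal(size=500),
})
demo_labels = ward_clusters_from_corr(demo_returns, n_clusters=2)
```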

---

## Mode 4: Realized Correlation

**Use case**: Compute rolling correlation time series and analyze conditional correlation by market regime (bull / bear / high-volatility) to discover how correlation evolves dynamically.

```python
def realized_correlation(
    y: pd.Series,
    x: pd.Series,
    benchmark: pd.Series,
    windows: tuple = (20, 60, 120),
    vol_window: int = 20,
    vol_threshold: float = 1.5,
) -> dict:
    """Rolling realized correlation plus regime-conditional correlation.

    Args:
        y, x: Daily return series of two assets
        benchmark: Daily return series of the benchmark index used for regime labeling
        windows: List of rolling windows in trading days
        vol_window: Volatility window
        vol_threshold: High-vol threshold as a multiple of average vol

    Returns:
        Rolling correlation series and conditional-correlation summary
    """
    df = pd.concat([y.rename("y"), x.rename("x"),
                    benchmark.rename("bm")], axis=1).dropna()

    # Rolling correlation time series.
    rolling_corrs = {}
    for w in windows:
        rolling_corrs[f"roll_{w}d"] = df["y"].rolling(w).corr(df["x"])

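    # Regime labels, assigned per day from the benchmark's own history.
    # Note: "bull" / "bear" compare each daily return with the trailing 252-day
    # mean daily return, so they mark above- / below-trend days rather than
    # multi-month market phases.
    bm_ret_252 = df["bm"].rolling(252).mean()
    bm_vol = df["bm"].rolling(vol_window).std()
    bm_vol_mean = bm_vol.rolling(252).mean()

    df["regime"] = "sideways"
    df.loc[df["bm"] > bm_ret_252, "regime"] = "bull"
    df.loc[df["bm"] < -bm_ret_252.abs(), "regime"] = "bear"
    # High volatility overrides the directional labels.
    df.loc[bm_vol > bm_vol_mean * vol_threshold, "regime"] = "high_vol"
    # Warm-up rows have no rolling statistics; keep them out of "sideways".
    df.loc[bm_ret_252.isna(), "regime"] = "warmup"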

    # Conditional correlation.
    cond_corr = {}
    for regime in ["bull", "bear", "sideways", "high_vol"]:
        mask = df["regime"] == regime
        if mask.sum() >= 30:
            r, p = pearsonr(df.loc[mask, "y"], df.loc[mask, "x"])
            cond_corr[regime] = {"corr": round(r, 4), "p": round(p, 6), "n": int(mask.sum())}
        else:
            cond_corr[regime] = {"corr": None, "p": None, "n": int(mask.sum())}

    return {
        "rolling_corrs": pd.DataFrame(rolling_corrs),
        "regime_labels": df["regime"],
        "conditional_corr": cond_corr,
    }
```

### Typical Correlation Behavior by Market Regime

| Market Regime | Equity-Equity Correlation | Equity-Bond Correlation | A-Share Characteristic |
|---------|---------|---------|--------|
| Bull | Medium (0.4-0.6) | Low or negative | Small-cap names tend to move together strongly |
| Bear | **High (0.7-0.9)** | Negative (safe-haven effect) | Broad selloff, correlation jumps sharply |
| High volatility | **Very high (0.8+)** | Negative | In crises, correlation often converges toward 1 |
| Sideways | Low (0.2-0.4) | Near zero | Stock dispersion rises, ideal for pairs trading |

---

## Cointegration Analysis

Correlation measures the degree of co-movement. Cointegration measures whether a **long-run equilibrium relationship** exists. High correlation does not guarantee cointegration, and low correlation does not rule it out.
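
A small synthetic experiment can make the distinction concrete. In the sketch below, pair A is cointegrated by construction but has weakly correlated returns, while pair B has highly correlated returns but a spread that behaves like a random walk. All parameters (noise scales, sample size, seed) are arbitrary assumptions, and lag-1 autocorrelation is used only as a crude stand-in for a formal stationarity test.

```python
import numpy as np

def lag1_autocorr(series: np.ndarray) -> float:
    """Lag-1 autocorrelation; values near 1 suggest random-walk-like behavior."""
    return float(np.corrcoef(series[:-1], series[1:])[0, 1])

rng = np.random.default_rng(42)
n = 2000

# Pair A: y = x + stationary noise, so the spread is stationary,
# but the large noise keeps the correlation of changes low.
x_a = np.cumsum(rng.normal(size=n))
y_a = x_a + rng.normal(scale=3.0, size=n)

# Pair B: changes share a common shock, but each series also accumulates
# its own shocks, so the spread drifts without an anchor.
common = rng.normal(size=n)
x_b = np.cumsum(common + 0.3 * rng.normal(size=n))
y_b = np.cumsum(common + 0.3 * rng.normal(size=n))

ret_corr_a = float(np.corrcoef(np.diff(y_a), np.diff(x_a))[0, 1])
ret_corr_b = float(np.corrcoef(np.diff(y_b), np.diff(x_b))[0, 1])
spread_ac_a = lag1_autocorr(y_a - x_a)
spread_ac_b = lag1_autocorr(y_b - x_b)
```

Pair A shows low `ret_corr_a` with a spread autocorrelation near zero. Pair B shows high `ret_corr_b` with a spread autocorrelation close to one.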

### Engle-Granger Two-Step Method

Suitable for two-variable pairs, quick and intuitive.

```python
from statsmodels.tsa.stattools import coint, adfuller
import statsmodels.api as sm
import numpy as np

def engle_granger_coint(
    y: pd.Series,
    x: pd.Series,
    significance: float = 0.05,
) -> dict:
    """Run the Engle-Granger two-step cointegration test.

    H0: No cointegration relationship exists (residuals contain a unit root).

    Args:
        y, x: Two price series. These must be non-stationary series,
            usually prices rather than returns.
        significance: Significance level

    Returns:
        Test results and spread series
    """
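    # Align the two series and drop missing observations.
    df = pd.concat([y.rename("y"), x.rename("x")], axis=1).dropna()
    y, x = df["y"], df["x"]

    # Step 1: estimate the cointegrating vector with OLS.
    x_const = sm.add_constant(x)
    ols = sm.OLS(y, x_const).fit()
    hedge_ratio = ols.params["x"]
    intercept = ols.params["const"]
    residuals = ols.resid

    # Step 2: test residual stationarity with ADF. The plain ADF p-value ignores
    # that the hedge ratio was estimated, so it is reported for reference only;
    # the decision below uses coint(), which applies Engle-Granger critical values.
    adf_res = adfuller(residuals, autolag="AIC")
    adf_stat, adf_p = adf_res[0], adf_res[1]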

    # statsmodels coint wrapper.
    coint_stat, coint_p, crit_vals = coint(y, x)

    return {
        "method": "Engle-Granger",
        "is_cointegrated": coint_p < significance,
        "coint_p": round(coint_p, 6),
        "coint_stat": round(coint_stat, 4),
        "critical_values": {"1%": crit_vals[0], "5%": crit_vals[1], "10%": crit_vals[2]},
        "hedge_ratio": round(hedge_ratio, 6),
        "intercept": round(intercept, 6),
        "spread": residuals,
        "adf_on_spread": {"stat": round(adf_stat, 4), "p": round(adf_p, 6)},
    }
```

**Note**: Engle-Granger can detect only one cointegrating vector, and the test result depends on the ordering of `y` and `x`. In practice, test both directions and keep the direction with the smaller p-value.
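
The ordering dependence comes from the first-step regression: regressing `y` on `x` and `x` on `y` gives hedge ratios that are not reciprocals of each other, so the two residual spreads differ. The sketch below checks the standard OLS identity that the product of the two slopes equals the squared correlation. The synthetic prices and the helper `ols_slope` are assumptions for illustration.

```python
import numpy as np

def ols_slope(dep: np.ndarray, indep: np.ndarray) -> float:
    """Slope of an OLS regression of dep on indep, with intercept."""
    return float(np.cov(dep, indep)[0, 1] / np.var(indep, ddof=1))

rng = np.random.default_rng(7)
x_px = 50.0 + np.cumsum(rng.normal(size=1000))
y_px = 10.0 + 1.5 * x_px + rng.normal(scale=2.0, size=1000)

beta_y_on_x = ols_slope(y_px, x_px)   # hedge ratio when y is the dependent leg
beta_x_on_y = ols_slope(x_px, y_px)   # hedge ratio when x is the dependent leg
r = float(np.corrcoef(y_px, x_px)[0, 1])

# The slopes multiply to r**2, so beta_x_on_y equals 1 / beta_y_on_x
# only when the fit is perfect.
slope_product = beta_y_on_x * beta_x_on_y
```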

### Johansen Cointegration Test for Multiple Variables

Suitable for three or more assets and for estimating the number of cointegrating vectors (rank).

```python
from statsmodels.tsa.vector_ar.vecm import coint_johansen

def johansen_coint(
    prices: pd.DataFrame,
    det_order: int = 0,
    k_ar_diff: int = 1,
) -> dict:
    """Run the Johansen cointegration test.

    Args:
        prices: Multi-asset price matrix, columns are symbols.
            The series must be non-stationary.
        det_order: Deterministic term. -1=no intercept, 0=intercept, 1=trend
        k_ar_diff: Number of lagged differences in the VAR, usually 1-5 chosen by AIC

    Returns:
        Trace-test and max-eigenvalue-test results
    """
    result = coint_johansen(prices.dropna(), det_order=det_order, k_ar_diff=k_ar_diff)
    n = prices.shape[1]

    # Trace test.
    trace_results = []
    for i in range(n):
        trace_results.append({
            "H0_rank_leq": i,
            "trace_stat": round(result.lr1[i], 4),
            "crit_10pct": result.cvt[i, 0],
            "crit_5pct": result.cvt[i, 1],
            "crit_1pct": result.cvt[i, 2],
            "reject_5pct": result.lr1[i] > result.cvt[i, 1],
        })

    # Max-eigenvalue test.
    maxeig_results = []
    for i in range(n):
        maxeig_results.append({
            "H0_rank_eq": i,
            "maxeig_stat": round(result.lr2[i], 4),
            "crit_10pct": result.cvm[i, 0],
            "crit_5pct": result.cvm[i, 1],
            "crit_1pct": result.cvm[i, 2],
            "reject_5pct": result.lr2[i] > result.cvm[i, 1],
        })

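    # Estimated rank: count consecutive rejections starting from H0: rank <= 0
    # and stop at the first hypothesis that cannot be rejected.
    rank = 0
    for r in trace_results:
        if not r["reject_5pct"]:
            break
        rank += 1

    # Cointegrating vectors (columns of evec) for the estimated rank.
    coint_vectors = pd.DataFrame(result.evec[:, :rank], index=prices.columns)

    return {
        "method": "Johansen",
        "n_coint_vectors_trace": rank,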
        "trace_test": pd.DataFrame(trace_results),
        "maxeig_test": pd.DataFrame(maxeig_results),
        "coint_vectors": coint_vectors,
        "eigenvalues": result.eig,
    }
```

**Johansen rank interpretation rules**:

```
Let k be the number of series. Start the trace test from H0: rank=0 and move upward.
The first rank that cannot be rejected is the estimated cointegration rank.

rank = 0   → no cointegration
rank = 1   → one cointegrating vector (most common, long-run equilibrium for a pair)
rank = k-1 → k-1 cointegrating vectors (system is tightly linked)
rank = k   → the series themselves are stationary, so cointegration is not needed
```
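
The sequential rule above can be sketched as a small helper. The function name `johansen_rank` and the numbers are made up for illustration; in practice the inputs would be `result.lr1` and one column of `result.cvt` from `coint_johansen`.

```python
from typing import Sequence

def johansen_rank(trace_stats: Sequence[float], crit_values: Sequence[float]) -> int:
    """Sequential rank selection: stop at the first H0 that is not rejected."""
    rank = 0
    for stat, crit in zip(trace_stats, crit_values):
        if stat <= crit:
            break
        rank += 1
    return rank

# Illustrative numbers only (not from real data), for three series.
demo_trace = [35.2, 12.1, 4.5]
demo_crit_5pct = [29.8, 15.5, 3.8]

demo_rank = johansen_rank(demo_trace, demo_crit_5pct)
# Counting every rejection ignores the sequential stopping rule and can overstate the rank.
naive_count = sum(s > c for s, c in zip(demo_trace, demo_crit_5pct))
```

Here the second hypothesis is not rejected, so the estimated rank is 1 even though the third statistic also exceeds its critical value.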

### Half-Life Calculation

Half-life measures how long a spread takes to mean-revert after deviating from equilibrium. It is a practical reference for expected holding period in pairs trading.

```python
def compute_half_life(spread: pd.Series) -> float:
    """Estimate mean-reversion half-life with OLS, in days.

    Principle:
        Estimate ΔSpread_t = c + λ·Spread_{t-1} + ε
        Half-life ≈ -ln(2) / λ, where λ must be negative for mean reversion

    Args:
        spread: Spread series, which should be stationary

    Returns:
        Half-life in trading days, or inf when λ >= 0 (no mean reversion).
    """
    spread_lag = spread.shift(1)
    delta = spread.diff()
    df = pd.concat([delta, spread_lag], axis=1).dropna()
    df.columns = ["delta", "lag"]

    x_const = sm.add_constant(df["lag"])
    ols = sm.OLS(df["delta"], x_const).fit()
    lam = ols.params["lag"]

    if lam >= 0:
        return float("inf")  # no mean reversion

    half_life = -np.log(2) / lam
    return round(half_life, 1)
```

**Half-life reference ranges**:

| Half-Life | Meaning | Trading Guidance |
|-------|------|---------|
| < 5 days | Extremely fast reversion | Intraday or overnight trading, friction cost matters |
| 5-20 days | Fast reversion | Ideal range for short-term pairs trading |
| 20-60 days | Medium-speed reversion | Medium-term holding, rolling windows 60-120 days |
| 60-180 days | Slow reversion | Long holding period, monitor cointegration stability |
| > 180 days | Near random walk | High pairs-trading risk, use cautiously |
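
As a sanity check on the estimator, the sketch below simulates an AR(1) spread with a known coefficient and compares the regression-based half-life with two closed forms: the approximation `-ln(2) / λ` with `λ = φ - 1`, and the exact discrete-time value `-ln(2) / ln(φ)`. The value `φ = 0.9`, the sample size, and the helper name are assumptions chosen for illustration.

```python
import numpy as np

def half_life_from_series(spread: np.ndarray) -> float:
    """Half-life from the slope of Δspread_t on spread_{t-1}, with intercept."""
    lag = spread[:-1]
    delta = np.diff(spread)
    lam, _intercept = np.polyfit(lag, delta, 1)  # returns [slope, intercept]
    return float("inf") if lam >= 0 else float(-np.log(2) / lam)

# Simulate spread_t = phi * spread_{t-1} + noise_t with an assumed phi.
phi = 0.9
rng = np.random.default_rng(1)
noise = rng.normal(size=5000)
spread_sim = np.zeros(5000)
for t in range(1, 5000):
    spread_sim[t] = phi * spread_sim[t - 1] + noise[t]

estimated_hl = half_life_from_series(spread_sim)
approx_hl = float(-np.log(2) / (phi - 1))    # first-order approximation
exact_hl = float(-np.log(2) / np.log(phi))   # exact for a discrete AR(1)
```

With `φ = 0.9` the two closed forms are about 6.9 and 6.6 days, both inside the 5-20 day band of the table. The gap between them widens as reversion gets faster.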

### Kalman Filter Dynamic Hedge Ratio

Static OLS hedge ratios cannot capture gradual drift in the cointegration relationship. A Kalman filter provides a continuously updated dynamic hedge ratio.

```python
import numpy as np

def kalman_hedge_ratio(
    y: pd.Series,
    x: pd.Series,
    delta: float = 1e-4,
    vt: float = 1.0,
) -> pd.DataFrame:
    """Estimate a dynamic hedge ratio with a Kalman filter.

    State equation (state θ_t = [β_t, α_t], hedge ratio and intercept):
        θ_t = θ_{t-1} + w_t,  w ~ N(0, W)
    Observation equation:
        y_t = β_t · x_t + α_t + v_t,  v ~ N(0, V)

    Args:
        y: Price series of asset A
        x: Price series of asset B
        delta: State-noise intensity. Larger means faster hedge-ratio adaptation
        vt: Observation-noise variance

    Returns:
        DataFrame containing the dynamic hedge ratio and spread
    """
    n = len(y)
    # State: [β, α] = hedge ratio + intercept
    Wt = delta / (1 - delta) * np.eye(2)
    Vt = vt

    # Initialization
    theta = np.zeros((n, 2))
    P = np.zeros((n, 2, 2))
    P[0] = np.eye(2)

    spread = np.zeros(n)
    spread[0] = float("nan")

    for t in range(1, n):
        F = np.array([x.iloc[t], 1.0])

        # Predict
        theta_pred = theta[t - 1]
        P_pred = P[t - 1] + Wt

        # Innovation
        innovation = y.iloc[t] - F @ theta_pred
        S = F @ P_pred @ F.T + Vt

        # Kalman gain
        K = P_pred @ F.T / S

        # Update
        theta[t] = theta_pred + K * innovation
        P[t] = (np.eye(2) - np.outer(K, F)) @ P_pred

        # Spread from the updated (posterior) state, so it already reflects y_t.
        spread[t] = y.iloc[t] - theta[t, 0] * x.iloc[t] - theta[t, 1]

    return pd.DataFrame({
        "hedge_ratio": theta[:, 0],
        "intercept": theta[:, 1],
        "spread": spread,
    }, index=y.index)
```

**Static vs dynamic hedge-ratio comparison**:

| Method | Strength | Weakness | Best Use Case |
|------|------|------|------|
| OLS | Simple, stable | Cannot capture time variation | Short-term stable pairs |
| Rolling OLS | Time-varying, intuitive | Window-sensitive, endpoint effect | Medium-term pairs |
| Kalman Filter | Real-time, continuous update | `delta` is harder to tune | Long-term or structurally shifting pairs |
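
The table lists rolling OLS without showing it. A minimal sketch follows, assuming the slope is computed as rolling covariance over rolling variance, which matches an OLS fit with intercept inside each window. The function name `rolling_ols_hedge_ratio` and the 60-day default are illustrative choices.

```python
import numpy as np
import pandas as pd

def rolling_ols_hedge_ratio(y: pd.Series, x: pd.Series, window: int = 60) -> pd.DataFrame:
    """Rolling OLS slope and intercept of y on x (hypothetical helper)."""
    beta = y.rolling(window).cov(x) / x.rolling(window).var()
    alpha = y.rolling(window).mean() - beta * x.rolling(window).mean()
    spread = y - beta * x - alpha
    return pd.DataFrame({"hedge_ratio": beta, "intercept": alpha, "spread": spread})

# Synthetic check: an exact linear relation should recover slope 2 and intercept 1.
rng = np.random.default_rng(3)
x_demo = pd.Series(100.0 + np.cumsum(rng.normal(size=300)))
y_demo = 2.0 * x_demo + 1.0
roll_demo = rolling_ols_hedge_ratio(y_demo, x_demo, window=60)
```

The first `window - 1` rows are NaN, which is the warm-up cost of the rolling approach.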

---

## Cross-Market Correlation

### Correlation Across China A-Share Sectors

```python
# Typical China A-share sector-correlation patterns
ASHARE_SECTOR_PATTERNS = {
    "strong_pairs_gt_0_7": [
        "Banks & insurance",
        "Baijiu & consumer staples",
        "New energy & solar",
        "Defense & aerospace",
    ],
    "medium_pairs_0_4_to_0_7": [
        "Pharma & consumer",
        "Technology & semiconductors",
        "Real estate & building materials",
    ],
    "low_or_negative_lt_0_3": [
        "Gold & technology",
        "Utilities & cyclicals",
        "Consumer & cyclicals",
    ],
}
```

### Cross-Market Linkage Analysis

```python
def cross_market_correlation(
    markets: dict,  # {"China A-shares": series, "Hong Kong": series, "crypto": series, "US": series}
    rolling_window: int = 60,
    lag_days: tuple = (0, 1, 2, 3),
) -> dict:
    """Cross-market correlation plus lead-lag analysis.

    Args:
        markets: Daily return series for each market
        rolling_window: Rolling window
        lag_days: List of lags to test

    Returns:
        Correlation matrix, lead-lag analysis, and rolling correlation
    """
    df = pd.DataFrame(markets).dropna()

    # Static correlation matrix
    static_corr = df.corr()

    # Lead-lag correlation to detect cross-market transmission
    lead_lag = {}
    mkt_names = list(markets.keys())
    for i, m1 in enumerate(mkt_names):
        for m2 in mkt_names[i + 1:]:
            pair_key = f"{m1}_{m2}"
            lead_lag[pair_key] = {}
            for lag in lag_days:
                if lag == 0:
                    r, _ = pearsonr(df[m1], df[m2])
                else:
                    # Pair m1 at day t with m2 at day t - lag, i.e. test whether m2 leads m1.
                    r, _ = pearsonr(df[m1].iloc[lag:], df[m2].iloc[:-lag])
                lead_lag[pair_key][f"lag_{lag}d"] = round(r, 4)

    # Rolling correlation
    rolling_corrs = {}
    for i, m1 in enumerate(mkt_names):
        for m2 in mkt_names[i + 1:]:
            key = f"{m1}_{m2}"
            rolling_corrs[key] = df[m1].rolling(rolling_window).corr(df[m2])

    return {
        "static_corr": static_corr,
        "lead_lag": pd.DataFrame(lead_lag).T,
        "rolling_corrs": pd.DataFrame(rolling_corrs),
    }
```
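
The lag convention is easy to get backwards. In `cross_market_correlation`, `lag_k` pairs the first market at day t with the second market at day t - k, so a high value means the second market leads. The sketch below reproduces that convention with `shift` on a synthetic pair where the follower copies the leader one day late. The helper `lagged_corr` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def lagged_corr(follower: pd.Series, leader: pd.Series, lag: int) -> float:
    """Correlation of follower at day t with leader at day t - lag."""
    return float(follower.corr(leader.shift(lag)))

rng = np.random.default_rng(5)
leader_ret = pd.Series(rng.normal(size=500))
# Synthetic follower: repeats the leader's move one day later.
follower_ret = leader_ret.shift(1)

same_day = lagged_corr(follower_ret, leader_ret, lag=0)
one_day = lagged_corr(follower_ret, leader_ret, lag=1)
```

The lag-1 correlation is 1 by construction and the same-day correlation is near zero. Swapping the arguments tests the opposite direction.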

### Empirical Cross-Market Linkage Patterns

| Market Pair | Average Correlation | Transmission Direction | Lag |
|-------|---------|---------|------|
| China A-shares ↔ Hong Kong | 0.5-0.7 | Two-way, Hong Kong slightly leads | 0-1 day |
| China A-shares ↔ U.S. equities | 0.2-0.4 | U.S. leads overnight | 1 day |
| BTC ↔ ETH | 0.7-0.9 | Highly synchronous | < 1 hour |
| China A-shares ↔ BTC | 0.0-0.2 | Mostly independent, except correlation spikes in crises | Unstable |
| U.S. equities ↔ BTC | 0.1-0.4 | U.S. leads through institutional capital flows | Within 1 day |
| RMB exchange rate ↔ China A-shares | -0.2 - 0.3 | RMB weakness → foreign outflows → China A-share weakness | 0-2 days |

### Impact of FX Factors on Cross-Market Correlation

Cross-market correlation analysis must distinguish between local-currency returns and FX-adjusted returns. Otherwise, exchange-rate moves can create spurious correlation or hide the true one.

```python
def fx_adjusted_correlation(
    foreign_price: pd.Series,   # foreign-market price, denominated in foreign currency
    domestic_price: pd.Series,  # domestic-market price
    fx_rate: pd.Series,         # domestic currency per unit of foreign currency, e.g. USD/CNY (CNY per USD)
) -> dict:
    """Cross-market correlation adjusted for FX effects.

    Args:
        foreign_price: Foreign-market price series in foreign currency
        domestic_price: Domestic-market price series in domestic currency
        fx_rate: FX series quoted as domestic-currency units per one unit of foreign currency

    Returns:
        Raw correlation vs FX-adjusted correlation
    """
    # Foreign return in domestic currency: compound the local return with the FX return.
    foreign_ret = foreign_price.pct_change()
    fx_ret = fx_rate.pct_change()
    foreign_ret_cny = (1 + foreign_ret) * (1 + fx_ret) - 1

    domestic_ret = domestic_price.pct_change()

    df = pd.concat([foreign_ret.rename("foreign_raw"),
                    foreign_ret_cny.rename("foreign_domestic"),
                    domestic_ret.rename("domestic"),
                    fx_ret.rename("fx")], axis=1).dropna()

    raw_corr, _ = pearsonr(df["foreign_raw"], df["domestic"])
    adj_corr, _ = pearsonr(df["foreign_domestic"], df["domestic"])
    fx_corr, _ = pearsonr(df["fx"], df["domestic"])

    return {
        "raw_corr_foreign_domestic": round(raw_corr, 4),
        "fx_adjusted_corr": round(adj_corr, 4),
        "fx_domestic_corr": round(fx_corr, 4),
        "fx_contribution": round(adj_corr - raw_corr, 4),
        "note": "fx_contribution > 0 means FX amplified cross-market correlation",
    }
```
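
The conversion in `fx_adjusted_correlation` compounds the two returns instead of adding them. A worked example with made-up numbers shows the size of the cross term that simple addition would drop:

```python
def domestic_currency_return(local_ret: float, fx_ret: float) -> float:
    """Compound a local-currency return with the FX return."""
    return (1.0 + local_ret) * (1.0 + fx_ret) - 1.0

local_ret = 0.02   # foreign asset gains 2% in its own currency
fx_ret = 0.01      # foreign currency gains 1% against the domestic currency

compounded = domestic_currency_return(local_ret, fx_ret)
additive = local_ret + fx_ret
cross_term = compounded - additive  # equals local_ret * fx_ret
```

For daily returns the cross term is small (2 basis points here), but it grows with the return horizon and with volatility.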

### Correlation Breakdown During Crises

During crises, equity correlation converges toward 1 and diversification breaks down. This is one of the central challenges in portfolio risk management.

```python
def correlation_breakdown_test(
    returns: pd.DataFrame,
    crisis_threshold: float = -0.02,  # one-day benchmark drop threshold for crisis days
    benchmark_col: str = None,
    window: int = 20,
) -> dict:
    """Detect jumps in correlation during crisis periods.

    Args:
        returns: Multi-asset daily return matrix
        crisis_threshold: Benchmark return below this level defines a crisis day
        benchmark_col: Benchmark column name. If None, use cross-sectional mean return
        window: Window for rolling average correlation

    Returns:
        Comparison of correlation in normal periods vs crisis periods
    """
    if benchmark_col:
        bm = returns[benchmark_col]
    else:
        bm = returns.mean(axis=1)

    crisis_mask = bm < crisis_threshold
    normal_mask = ~crisis_mask

    # Average pairwise correlation for each period
    def avg_corr(df_subset: pd.DataFrame) -> float:
        if len(df_subset) < 5:
            return float("nan")
        c = df_subset.corr()
        upper = c.where(np.triu(np.ones(c.shape), k=1).astype(bool))
        return float(upper.stack().mean())

    crisis_corr = avg_corr(returns[crisis_mask])
    normal_corr = avg_corr(returns[normal_mask])

    # Rolling average correlation to detect structural change
    rolling_avg_corr = pd.Series(dtype=float, index=returns.index)
    for i in range(window, len(returns)):
        sub = returns.iloc[i - window:i]
        rolling_avg_corr.iloc[i] = avg_corr(sub)

    return {
        "normal_avg_corr": round(normal_corr, 4),
        "crisis_avg_corr": round(crisis_corr, 4),
        "corr_jump": round(crisis_corr - normal_corr, 4),
        "crisis_days": int(crisis_mask.sum()),
        "normal_days": int(normal_mask.sum()),
        "rolling_avg_corr": rolling_avg_corr,
    }
```
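
To see what a correlation jump means for risk, the sketch below computes the volatility of an equal-weight portfolio under a uniform pairwise correlation. The inputs (20 assets, 20% asset volatility, correlations of 0.3 and 0.9) are illustrative assumptions, not estimates.

```python
import numpy as np

def equal_weight_portfolio_vol(n_assets: int, asset_vol: float, avg_corr: float) -> float:
    """Volatility of an equal-weight portfolio with uniform pairwise correlation."""
    corr = np.full((n_assets, n_assets), avg_corr)
    np.fill_diagonal(corr, 1.0)
    cov = corr * asset_vol ** 2
    weights = np.full(n_assets, 1.0 / n_assets)
    return float(np.sqrt(weights @ cov @ weights))

normal_vol = equal_weight_portfolio_vol(20, 0.20, 0.3)
crisis_vol = equal_weight_portfolio_vol(20, 0.20, 0.9)

# Closed form for comparison: sigma * sqrt(1/N + (1 - 1/N) * rho).
closed_form_normal = 0.20 * np.sqrt(1 / 20 + (1 - 1 / 20) * 0.3)
```

Under these assumptions portfolio volatility rises from about 11.6% to about 19.0%, close to the 20% of a single asset, which is the diversification breakdown described above.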

---

## Pair-Trading Signal Generation

### Full Workflow From Correlation to Signal

```
Step 1: Asset screening
  - Run scan_correlated_assets and keep candidate pairs with Pearson > 0.6
  - Run engle_granger_coint and keep pairs with p < 0.05

Step 2: Spread quality assessment
  - Run compute_half_life and keep pairs with half-life between 5 and 60 days
  - Test spread stationarity with ADF and require p < 0.05
  - Measure how often the absolute Z-Score exceeds 1.5 over the last 12 months
    to estimate trading frequency

Step 3: Hedge-ratio selection
  - Use static OLS for stable pairs
  - Use Kalman Filter for long-lived or drifting pairs

Step 4: Signal generation
  - Compute rolling Z-Score with a lookback of 2-3× half-life
  - Generate long / short / exit signals based on thresholds

Step 5: Signal monitoring
  - Recompute Z-Score daily
  - Re-run cointegration monthly to avoid broken relationships
  - Warn if half-life exceeds 2× the original value
```
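
The trading-frequency check in Step 2 can be sketched as follows. The helper name `z_exceedance_rate` and the use of 252 trading days as a proxy for 12 months are assumptions.

```python
import numpy as np
import pandas as pd

def z_exceedance_rate(z_score: pd.Series, threshold: float = 1.5, window: int = 252) -> float:
    """Share of the last `window` valid observations with |z| above threshold."""
    recent = z_score.dropna().iloc[-window:]
    if recent.empty:
        return float("nan")
    return float((recent.abs() > threshold).mean())

# Toy series: three of the seven valid observations exceed 1.5 in absolute value.
z_demo = pd.Series([0.2, -1.8, 2.1, 0.4, np.nan, -0.9, 1.6, 0.0])
rate_demo = z_exceedance_rate(z_demo, threshold=1.5, window=252)
```

A rate near zero suggests the pair rarely reaches tradable levels, while a very high rate may indicate a lookback that is too short or a spread that is not stationary.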

### Z-Score Signal Generation

```python
def generate_pair_signals(
    y_price: pd.Series,
    x_price: pd.Series,
    lookback: int = 60,
    entry_z: float = 2.0,
    exit_z: float = 0.5,
    stop_z: float = 3.5,
    use_kalman: bool = False,
) -> pd.DataFrame:
    """Generate pair-trading signals.

    Args:
        y_price, x_price: Two price series
        lookback: Rolling Z-Score lookback, usually 2-3× half-life
        entry_z: Entry threshold
        exit_z: Exit threshold, usually near mean reversion
        stop_z: Stop threshold. Crossing it suggests cointegration may have broken
        use_kalman: Whether to use a Kalman dynamic hedge ratio

    Returns:
        DataFrame containing signals, Z-Score, and positions
    """
    if use_kalman:
        kf = kalman_hedge_ratio(y_price, x_price)
        spread = kf["spread"]
    else:
        y_ret = y_price.pct_change()
        x_ret = x_price.pct_change()
        res = bivariate_correlation_analysis(y_ret, x_ret, lookback)
        hedge_ratio = abs(res["beta"])
        spread = np.log(y_price) - hedge_ratio * np.log(x_price)

    spread_mean = spread.rolling(lookback).mean()
    spread_std = spread.rolling(lookback).std()
    z_score = (spread - spread_mean) / spread_std

    # Signal state machine to avoid repeated re-entry.
    signal_y = pd.Series(0.0, index=y_price.index)
    signal_x = pd.Series(0.0, index=x_price.index)
    position = 0  # 0=flat, 1=long spread, -1=short spread

    for i in range(lookback, len(z_score)):
        z = z_score.iloc[i]
        if np.isnan(z):
            continue

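        if position == 0:
            # Enter only while the spread sits between the entry and stop bands,
            # so a stopped-out trade is not reopened on the next bar.
            if -stop_z < z < -entry_z:
                position = 1   # spread is too low: buy y, sell x
            elif entry_z < z < stop_z:
                position = -1  # spread is too high: sell y, buy x
        elif position == 1:
            # Exit on reversion, or stop out if the spread keeps falling.
            if z > -exit_z or z < -stop_z:
                position = 0
        elif position == -1:
            # Exit on reversion, or stop out if the spread keeps rising.
            if z < exit_z or z > stop_z:
                position = 0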

        signal_y.iloc[i] = 0.5 * position
        signal_x.iloc[i] = -0.5 * position

    return pd.DataFrame({
        "spread": spread,
        "z_score": z_score,
        "spread_mean": spread_mean,
        "spread_std": spread_std,
        "signal_y": signal_y,
        "signal_x": signal_x,
        "position": signal_y * 2,  # 1=long spread, -1=short spread, 0=flat
    })
```

### Z-Score Threshold Configuration Guide

| Parameter | Conservative | Standard | Aggressive | Notes |
|------|------|------|------|------|
| entry_z | 2.5 | 2.0 | 1.5 | Higher threshold means fewer trades |
| exit_z | 0.3 | 0.5 | 0.8 | Higher threshold means shorter holding periods |
| stop_z | 3.0 | 3.5 | 4.0 | Beyond this level, cointegration may have broken |
| lookback | 90 | 60 | 30 | Usually 2-3× half-life |

### Spread-Stability Monitoring

```python
def monitor_spread_health(
    spread: pd.Series,
    original_half_life: float,
    original_corr: float,
    warning_hl_multiple: float = 2.0,
    warning_corr_drop: float = 0.2,
) -> dict:
    """Monitor spread stability and judge whether cointegration still holds.

    Args:
        spread: Live spread series
        original_half_life: Half-life at entry, in days
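        original_corr: Correlation at entry (not used in this function body,
            which receives only the spread and not the two return series)
        warning_hl_multiple: Warn if half-life exceeds this multiple of the original
        warning_corr_drop: Warn if correlation drops by more than this amount
            (not used in this function body, for the same reason)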

    Returns:
        Health-status report
    """
    recent = spread.iloc[-60:] if len(spread) > 60 else spread

    current_hl = compute_half_life(recent)
    current_adf = adfuller(recent.dropna())[1]

    hl_ratio = current_hl / original_half_life if original_half_life > 0 else float("inf")

    # Cointegration health score
    health_score = 100
    warnings = []

    if current_adf > 0.10:
        health_score -= 40
        warnings.append(f"Spread ADF p={current_adf:.3f} > 0.10, stationarity has weakened")

    if hl_ratio > warning_hl_multiple:
        health_score -= 30
        warnings.append(
            f"Half-life {current_hl:.1f}d is {hl_ratio:.1f}x the original {original_half_life:.1f}d"
        )

    if current_adf > 0.20:
        health_score -= 20
        warnings.append("Spread may no longer be stationary. Re-test cointegration immediately")

    status = "healthy" if health_score >= 70 else "warning" if health_score >= 40 else "danger"

    return {
        "health_score": health_score,
        "status": status,
        "current_half_life": round(current_hl, 1),
        "hl_ratio": round(hl_ratio, 2),
        "spread_adf_p": round(current_adf, 4),
        "warnings": warnings,
        "action": "hold" if status == "healthy" else "reduce" if status == "warning" else "exit_now",
    }
```
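
`monitor_spread_health` accepts `original_corr` and `warning_corr_drop`, but it receives only the spread, so it cannot measure the current correlation itself. A companion check on the two return series could look like the sketch below. The function name `check_correlation_drop` and the 60-bar `window` default are assumptions, chosen to mirror the 60-observation window used for the spread.

```python
import pandas as pd


def check_correlation_drop(
    returns_y: pd.Series,
    returns_x: pd.Series,
    original_corr: float,
    window: int = 60,               # assumption: same window as the spread check
    warning_corr_drop: float = 0.2,
) -> dict:
    """Hypothetical helper: compare recent return correlation with the entry value."""
    # Inner join keeps only dates present in both series.
    aligned = pd.concat([returns_y, returns_x], axis=1, join="inner").dropna()
    recent = aligned.iloc[-window:]

    current_corr = float(recent.iloc[:, 0].corr(recent.iloc[:, 1]))
    corr_drop = original_corr - current_corr

    return {
        "current_corr": round(current_corr, 4),
        "corr_drop": round(corr_drop, 4),
        "warning": bool(corr_drop > warning_corr_drop),
    }
```

How a correlation warning should feed into the health score is left open here; the original function defines no penalty for it.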

---

## Visualization Templates

### Rolling-Correlation Time-Series Plot

```python
import matplotlib.pyplot as plt

def plot_rolling_correlation(
    rolling_corrs: pd.DataFrame,
    title: str = "Rolling Correlation",
    figsize: tuple = (14, 5),
) -> plt.Figure:
    """Plot rolling correlation time series across multiple windows."""
    fig, ax = plt.subplots(figsize=figsize)
    colors = ["#2196F3", "#FF9800", "#4CAF50"]
    for i, col in enumerate(rolling_corrs.columns):
        ax.plot(rolling_corrs.index, rolling_corrs[col],
                label=col, color=colors[i % len(colors)], alpha=0.8)
    ax.axhline(0, color="black", linestyle="--", linewidth=0.8)
    ax.axhline(0.6, color="green", linestyle=":", linewidth=0.8, label="high-correlation threshold (0.6)")
    ax.axhline(-0.6, color="red", linestyle=":", linewidth=0.8)
    ax.set_title(title)
    ax.set_ylabel("Correlation")
    ax.legend(loc="best")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    return fig
```

### Z-Score Signal Plot

```python
def plot_zscore_signals(
    signal_df: pd.DataFrame,
    entry_z: float = 2.0,
    stop_z: float = 3.5,
    figsize: tuple = (14, 8),
) -> plt.Figure:
    """Plot spread Z-Score and pair-trading signals."""
    fig, axes = plt.subplots(2, 1, figsize=figsize, sharex=True)

    # Top chart: spread
    axes[0].plot(signal_df["spread"], label="Spread", color="#1565C0")
    axes[0].plot(signal_df["spread_mean"], label="Mean", color="orange", linestyle="--")
    axes[0].fill_between(signal_df.index,
                         signal_df["spread_mean"] - signal_df["spread_std"],
                         signal_df["spread_mean"] + signal_df["spread_std"],
                         alpha=0.2, color="orange", label="±1σ")
    axes[0].set_title("Spread and Mean")
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Bottom chart: Z-Score + signals
    axes[1].plot(signal_df["z_score"], label="Z-Score", color="#1565C0")
    axes[1].axhline(entry_z, color="red", linestyle="--", label=f"entry threshold (±{entry_z})")
    axes[1].axhline(-entry_z, color="red", linestyle="--")
    axes[1].axhline(stop_z, color="darkred", linestyle=":", label=f"stop threshold (±{stop_z})")
    axes[1].axhline(-stop_z, color="darkred", linestyle=":")
    axes[1].axhline(0, color="black", linestyle="-", linewidth=0.8)

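    # Annotate entry / exit points from position transitions.
    # Comparing against the previous state (instead of the sign of diff)
    # keeps an exit from -1 to 0 from being drawn as a long entry, and
    # an exit from 1 to 0 from being drawn as a short entry.
    pos = signal_df["position"]
    prev_pos = pos.shift(1).fillna(0)
    long_entry = (pos == 1) & (prev_pos != 1)
    short_entry = (pos == -1) & (prev_pos != -1)
    exit_pos = (pos == 0) & (prev_pos != 0)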

    axes[1].scatter(signal_df.index[long_entry], signal_df["z_score"][long_entry],
                    color="green", marker="^", s=80, label="long spread", zorder=5)
    axes[1].scatter(signal_df.index[short_entry], signal_df["z_score"][short_entry],
                    color="red", marker="v", s=80, label="short spread", zorder=5)
    axes[1].scatter(signal_df.index[exit_pos], signal_df["z_score"][exit_pos],
                    color="gray", marker="o", s=40, label="exit", zorder=5)

    axes[1].set_title("Z-Score and Trading Signals")
    axes[1].legend(loc="best")
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    return fig
```

---

## Dependencies

```bash
pip install pandas numpy scipy statsmodels matplotlib seaborn
```

---

## Output Format

```markdown
## Correlation and Cointegration Analysis Report

### Pair: [Asset A] vs [Asset B] ([Start Date] - [End Date])

#### Correlation Statistics
| Metric | Value | Interpretation |
|------|------|------|
| Pearson r | 0.82 | Strong linear positive correlation |
| Spearman ρ | 0.80 | Consistent monotonic relationship |
| Beta (A/B) | 1.15 | Sensitivity of A to B |
| R² | 0.67 | 67% of return variance in A is explained by B |

#### Cointegration Tests
| Method | Statistic | p-value | Conclusion |
|------|--------|------|------|
| Engle-Granger | -4.12 | 0.008 | Cointegrated ** |
| Johansen trace test | 28.3 | — | 1 cointegrating vector |
| Spread ADF | -3.95 | 0.002 | Spread is stationary ** |

#### Mean-Reversion Characteristics
| Metric | Value |
|------|------|
| OLS hedge ratio | 1.23 |
| Half-life | 18.5 days |
| Suggested holding window | 10-30 days |
| Suggested lookback window | 40-60 days |

#### Conditional Correlation (Regime Analysis)
| Regime | Correlation | Sample Size |
|------|---------|--------|
| Bull | 0.76 | 312 days |
| Bear | 0.88 | 198 days |
| High volatility | 0.91 | 87 days |
| Sideways | 0.71 | 645 days |

#### Recommended Pair-Trading Signal Parameters
| Parameter | Value |
|------|-----|
| entry_z | 2.0 |
| exit_z | 0.5 |
| stop_z | 3.5 |
| lookback | 60 days |

#### Current Spread Status
| Metric | Value | Alert |
|------|-----|------|
| Current Z-Score | -2.3 | Below long-entry threshold (-2.0) |
| Health score | 85/100 | Healthy |
| Half-life (last 60 days) | 21.2 days | Normal |
```
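
The report tables can be produced from plain Python rows with a small formatter. This is a sketch only; `markdown_table` is a hypothetical helper, and it emits the same `|------|` separator style used in the template above.

```python
def markdown_table(headers: list, rows: list) -> str:
    """Hypothetical helper: render headers and rows as a pipe-delimited table."""
    lines = [
        "| " + " | ".join(str(h) for h in headers) + " |",
        "|" + "|".join(["------"] * len(headers)) + "|",
    ]
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("row length must match header length")
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)
```

For instance, `markdown_table(["Metric", "Value"], [["Half-life", "18.5 days"]])` yields a two-column table with one data row.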

---

## Notes

1. **Prices vs returns**: Use price series, which are non-stationary, for cointegration tests; use return series, which are stationary, for correlation analysis. Mixing them is the most common mistake.
2. **Data alignment**: Cross-market analysis must handle holiday mismatches with an inner join on trading dates. Do not forward-fill missing trading days: the filled prices produce artificial zero returns that distort the correlation estimate.
3. **Cointegration is not the same as high correlation**: Two series can have a return correlation (Pearson) below 0.3 and still be cointegrated in price levels, and highly correlated returns do not imply cointegration.
4. **Out-of-sample validation**: If a pair is selected using cointegration on the first N years, you must verify whether the relationship survives in later out-of-sample data to avoid overfitting.
5. **Crisis-period risk**: Correlation jumps in crises, and both legs in a pair can crash together. Stop thresholds should be tighter than in normal periods.
6. **China A-share specifics**: China A-shares contain many non-trading days due to holidays and suspensions. Date alignment is especially important in cross-market comparison.
7. **Multiple testing**: When testing N asset pairs simultaneously, use Benjamini-Hochberg FDR adjustment on p-values. Otherwise false positives will be excessive.
8. **Kalman tuning**: Tune `delta` with grid search plus out-of-sample validation. Do not rely blindly on the default value.
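
Notes 2 and 7 can be sketched in a few lines. The helper names `align_prices` and `benjamini_hochberg` are hypothetical, and the 0.05 `alpha` default is an assumption. The adjustment is written out with NumPy to show each step; statsmodels' `multipletests` with `method="fdr_bh"` provides the same correction.

```python
import numpy as np
import pandas as pd


def align_prices(price_a: pd.Series, price_b: pd.Series) -> pd.DataFrame:
    """Keep only dates on which both markets traded (inner join, no forward-fill)."""
    return pd.concat(
        [price_a.rename("a"), price_b.rename("b")], axis=1, join="inner"
    ).dropna()


def benjamini_hochberg(p_values, alpha: float = 0.05):
    """Return (adjusted p-values, reject flags) under Benjamini-Hochberg FDR control."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)

    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)

    # Enforce monotonicity from the largest p-value down, then cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)

    # Map the adjusted values back to the input order.
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted
    return adjusted, adjusted <= alpha
```

Returns for correlation analysis would then be computed from the aligned frame, for example with `align_prices(a, b).pct_change().dropna()`, and a pair would be kept only if its adjusted cointegration p-value passes the chosen `alpha`.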