BCF posterior-mean ATE highly seed-sensitive on small, confounded data #376

@asmahani

Description

Hi maintainers,

While developing our metacausal package (a causal-ML ensemble that integrates stochtree's BCF via a thin adapter alongside ~9 other causal-ML estimators), we observed unexpectedly large seed-to-seed variability in BCF's posterior-mean ATE on small, confounded data. The variability is wider than that of any other estimator in our default ensemble, and large enough that we'd appreciate your read on whether this is the expected small-data regime or a potential sampler issue worth deeper investigation.

Self-contained reproducer

Runs in ~30s; no external data, no metacausal dependency:

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from threadpoolctl import threadpool_limits
from stochtree import BCFModel


def confounded_dgp(n=445, p=8, n_seed=0):
    """Small confounded DGP roughly mimicking LaLonde-style earnings data."""
    rng = np.random.default_rng(n_seed)
    X = rng.normal(size=(n, p))
    e = 1 / (1 + np.exp(-(1.5 * X[:, 0] - 0.7 * X[:, 1])))   # confounded propensity
    T = rng.binomial(1, e).astype(float)
    tau = 1500 + 800 * X[:, 2] - 500 * X[:, 3]               # heterogeneous, mean ~1500
    rng_y = np.random.default_rng(n_seed + 1)                # independent noise stream for outcomes
    Y0 = np.maximum(rng_y.normal(loc=3000 + 1500 * X[:, 0], scale=4000, size=n), 0)
    Y = Y0 + T * tau
    return X.astype(float), T, Y.astype(float)


X, T, Y = confounded_dgp(n_seed=0)   # fixed dataset

for seed in [0, 1, 7, 13, 42, 100, 2026]:
    propensity = cross_val_predict(
        HistGradientBoostingClassifier(max_iter=200, random_state=int(seed)),
        X, T.astype(int), method="predict_proba", cv=5,
    )[:, 1]
    propensity = np.clip(propensity, 0.01, 0.99)

    m = BCFModel()
    with threadpool_limits(limits=1):
        m.sample(
            X_train=X, Z_train=T, y_train=Y,
            propensity_train=propensity,
            num_gfr=5, num_burnin=200, num_mcmc=200,
            general_params={"random_seed": int(seed) % (2**31)},
        )
    tau_hat = m.predict(X=X, Z=T, propensity=propensity,
                        type="posterior", terms="tau")
    print(f"seed={seed:>4}  BCF ATE={float(np.mean(tau_hat)):>+8.0f}")

Observed (stochtree 0.4.2, Python 3.13.3, macOS)

seed=   0  BCF ATE=   +2775
seed=   1  BCF ATE=   +3386
seed=   7  BCF ATE=   +2200
seed=  13  BCF ATE=   +1683
seed=  42  BCF ATE=     +87
seed= 100  BCF ATE=   +1821
seed=2026  BCF ATE=    -318

The dataset is fixed across runs; only the BCF MCMC seed (and the propensity-CV seed) varies. Range: ~$3,700 against a true ATE of ~$1,500.
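For concreteness, the seed-to-seed spread quoted above can be computed directly from the seven printed estimates:

```python
import numpy as np

# Posterior-mean ATEs printed above, one per seed
ates = np.array([2775, 3386, 2200, 1683, 87, 1821, -318], dtype=float)

spread = ates.max() - ates.min()   # seed-to-seed range
sd = ates.std(ddof=1)              # seed-to-seed sample SD

print(f"range = {spread:.0f}, sd = {sd:.0f}")   # range = 3704, sd = 1350
```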

The same pattern holds on the LaLonde job-training data (n=445) bundled with our package, where the per-seed BCF ATE swings from −$5,550 to +$4,085 with identical configuration. Notably, on LaLonde, going from stochtree 0.4.0 → 0.4.2 (with an unchanged metacausal call) shifted the seed=42 estimate from +$1,482 to −$5,550. The 0.4.2 release notes mention BCF propensity-handling (#334) and parametric-intercept (#326) fixes, so we believe those fixes had a real effect; the question is whether the post-fix sampler is more small-data sensitive than expected.

What we tried

  • Increasing num_burnin and num_mcmc from 200 each to 1,000 each shrinks the seed range to ~$3,500 — helpful but the variability remains substantial.
  • threadpool_limits(limits=1) around model.sample (above) — no change.
  • A separately-seeded propensity vs. sharing the BCF seed — no change.
  • Other components in our ensemble (DoubleMLIRM, CausalForestDML, several meta-learners) on the same data have seed-to-seed std in the $100-$360 range; BCF's is ~$3,300.
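One way to make the comparison in the last bullet precise: if the between-seed SD of the posterior-mean ATE greatly exceeds the typical within-run posterior SD, the runs disagree by more than their own reported uncertainty, which suggests a mixing problem rather than genuine posterior width. A minimal sketch of that check (`draws_by_seed` is a hypothetical (n_seeds, n_draws) array standing in for whatever per-draw ATE samples each BCF run returns):

```python
import numpy as np

def seed_disagreement_ratio(draws_by_seed):
    """draws_by_seed: (n_seeds, n_draws) array of per-draw ATE samples.

    Returns the between-seed SD of the posterior means divided by the
    average within-run posterior SD. A ratio well above ~1 means the
    seeds disagree by more than each run's own posterior uncertainty.
    """
    means = draws_by_seed.mean(axis=1)                 # posterior-mean ATE per seed
    between_sd = means.std(ddof=1)                     # disagreement across seeds
    within_sd = draws_by_seed.std(axis=1, ddof=1).mean()
    return between_sd / within_sd

# Illustration on synthetic draws: identical-posterior runs vs. shifted runs
rng = np.random.default_rng(0)
same = rng.normal(1500, 300, size=(7, 400))            # all runs share one posterior
shifted = same + np.linspace(-2000, 2000, 7)[:, None]  # per-run offsets, like the issue
print(seed_disagreement_ratio(same), seed_disagreement_ratio(shifted))
```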

Question

Is this magnitude of seed-sensitivity expected for BCF in the n≈400-500 / low-overlap regime, or does it warrant deeper investigation given the recent 0.4.2 sampler fixes? Happy to run additional diagnostics — multi-chain BCF, alternative propensity configurations, longer chains, anything else that would help.
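On the multi-chain point: a generic split-chain diagnostic over per-run ATE draws would make "seeds disagree" quantitative. Below is a sketch of the standard Gelman-Rubin statistic in pure NumPy (`chains` is a hypothetical (n_chains, n_draws) array of ATE draws collected from independently seeded runs, not an existing stochtree API):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for (n_chains, n_draws) draws.

    Values near 1.0 indicate the chains agree; values noticeably above
    ~1.01-1.1 indicate they have not mixed to the same distribution.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled posterior-variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
mixed = rng.normal(1500, 300, size=(4, 500))              # chains agree -> R-hat ~ 1
stuck = mixed + np.array([[-2000], [0], [1000], [3000]])  # chains disagree -> R-hat >> 1
print(gelman_rubin(mixed), gelman_rubin(stuck))
```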

Thanks for stochtree — it's a fantastic addition to the open-source causal-ML ecosystem, and our package depends on it.
