BCF posterior-mean ATE highly seed-sensitive on small, confounded data #376

@asmahani

Description

Hi maintainers,

While developing our metacausal package (a causal-ML ensemble that integrates stochtree's BCF via a thin adapter alongside ~9 other causal-ML estimators), we observed unexpectedly large seed-to-seed variability in BCF's posterior-mean ATE on small, confounded data. The variability is wider than that of any other estimator in our default ensemble, and large enough that we'd appreciate your read on whether this is the expected small-data regime or a potential sampler issue worth deeper investigation.

Self-contained reproducer

Runs in ~30s; no external data, no metacausal dependency:

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from threadpoolctl import threadpool_limits
from stochtree import BCFModel


def confounded_dgp(n=445, p=8, n_seed=0):
    """Small confounded DGP roughly mimicking LaLonde-style earnings data."""
    rng = np.random.default_rng(n_seed)
    X = rng.normal(size=(n, p))
    e = 1 / (1 + np.exp(-(1.5 * X[:, 0] - 0.7 * X[:, 1])))   # confounded propensity
    T = rng.binomial(1, e).astype(float)
    tau = 1500 + 800 * X[:, 2] - 500 * X[:, 3]               # heterogeneous, mean ~1500
    rng_y = np.random.default_rng(n_seed + 1)                # independent noise stream for outcomes
    Y0 = np.maximum(rng_y.normal(loc=3000 + 1500 * X[:, 0], scale=4000, size=n), 0)
    Y = Y0 + T * tau
    return X.astype(float), T, Y.astype(float)


X, T, Y = confounded_dgp(n_seed=0)   # fixed dataset

for seed in [0, 1, 7, 13, 42, 100, 2026]:
    propensity = cross_val_predict(
        HistGradientBoostingClassifier(max_iter=200, random_state=int(seed)),
        X, T.astype(int), method="predict_proba", cv=5,
    )[:, 1]
    propensity = np.clip(propensity, 0.01, 0.99)

    m = BCFModel()
    with threadpool_limits(limits=1):
        m.sample(
            X_train=X, Z_train=T, y_train=Y,
            propensity_train=propensity,
            num_gfr=5, num_burnin=200, num_mcmc=200,
            general_params={"random_seed": int(seed) % (2**31)},
        )
    tau_hat = m.predict(X=X, Z=T, propensity=propensity,
                        type="posterior", terms="tau")
    print(f"seed={seed:>4}  BCF ATE={float(np.mean(tau_hat)):>+8.0f}")

Observed (stochtree 0.4.2, Python 3.13.3, macOS)

seed=   0  BCF ATE=   +2775
seed=   1  BCF ATE=   +3386
seed=   7  BCF ATE=   +2200
seed=  13  BCF ATE=   +1683
seed=  42  BCF ATE=     +87
seed= 100  BCF ATE=   +1821
seed=2026  BCF ATE=    -318

The dataset is fixed across runs; only the BCF MCMC seed (and the propensity-CV seed) varies. Range: ~$3,700 against a true ATE of ~$1,500.
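For concreteness, the seed-to-seed spread quoted above can be computed directly from the seven printed estimates:

```python
import numpy as np

# Posterior-mean ATEs printed above, one per seed
ates = np.array([2775, 3386, 2200, 1683, 87, 1821, -318], dtype=float)

spread = ates.max() - ates.min()   # seed-to-seed range
sd = ates.std(ddof=1)              # seed-to-seed sample SD

print(f"range = {spread:.0f}, sd = {sd:.0f}")   # range = 3704, sd = 1350
```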

The same pattern holds on the LaLonde job-training data (n=445) bundled with our package, where the per-seed BCF ATE swings from −$5,550 to +$4,085 with identical configuration. Notably, on LaLonde, going from stochtree 0.4.0 → 0.4.2 (with an unchanged metacausal call) shifted the seed=42 estimate from +$1,482 to −$5,550. The 0.4.2 release notes mention BCF propensity-handling (#334) and parametric-intercept (#326) fixes, so we believe those fixes had a real effect; the question is whether the post-fix sampler is more small-data sensitive than expected.

What we tried

  • Increasing num_burnin and num_mcmc from 200 each to 1,000 each shrinks the seed range to ~$3,500 — helpful but the variability remains substantial.
  • threadpool_limits(limits=1) around model.sample (above) — no change.
  • A separately-seeded propensity vs. sharing the BCF seed — no change.
  • Other components in our ensemble (DoubleMLIRM, CausalForestDML, several meta-learners) on the same data have seed-to-seed std in the $100-$360 range; BCF's is ~$3,300.
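One way to make the comparison in the last bullet precise: if the between-seed SD of the posterior-mean ATE greatly exceeds the typical within-run posterior SD, the runs disagree by more than their own reported uncertainty, which suggests a mixing problem rather than genuine posterior width. A minimal sketch of that check (`draws_by_seed` is a hypothetical (n_seeds, n_draws) array standing in for whatever per-draw ATE samples each BCF run returns):

```python
import numpy as np

def seed_disagreement_ratio(draws_by_seed):
    """draws_by_seed: (n_seeds, n_draws) array of per-draw ATE samples.

    Returns the between-seed SD of the posterior means divided by the
    average within-run posterior SD. A ratio well above ~1 means the
    seeds disagree by more than each run's own posterior uncertainty.
    """
    means = draws_by_seed.mean(axis=1)                 # posterior-mean ATE per seed
    between_sd = means.std(ddof=1)                     # disagreement across seeds
    within_sd = draws_by_seed.std(axis=1, ddof=1).mean()
    return between_sd / within_sd

# Illustration on synthetic draws: identical-posterior runs vs. shifted runs
rng = np.random.default_rng(0)
same = rng.normal(1500, 300, size=(7, 400))            # all runs share one posterior
shifted = same + np.linspace(-2000, 2000, 7)[:, None]  # per-run offsets, like the issue
print(seed_disagreement_ratio(same), seed_disagreement_ratio(shifted))
```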

Question

Is this magnitude of seed-sensitivity expected for BCF in the n≈400-500 / low-overlap regime, or does it warrant deeper investigation given the recent 0.4.2 sampler fixes? Happy to run additional diagnostics — multi-chain BCF, alternative propensity configurations, longer chains, anything else that would help.
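On the multi-chain point: a generic split-chain diagnostic over per-run ATE draws would make "seeds disagree" quantitative. Below is a sketch of the standard Gelman-Rubin statistic in pure NumPy (`chains` is a hypothetical (n_chains, n_draws) array of ATE draws collected from independently seeded runs, not an existing stochtree API):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for (n_chains, n_draws) draws.

    Values near 1.0 indicate the chains agree; values noticeably above
    ~1.01-1.1 indicate they have not mixed to the same distribution.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled posterior-variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
mixed = rng.normal(1500, 300, size=(4, 500))              # chains agree -> R-hat ~ 1
stuck = mixed + np.array([[-2000], [0], [1000], [3000]])  # chains disagree -> R-hat >> 1
print(gelman_rubin(mixed), gelman_rubin(stuck))
```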

Thanks for stochtree — it's a fantastic addition to the open-source causal-ML ecosystem, and our package depends on it.
