Hi maintainers,
While developing our metacausal package (a causal-ML ensemble that integrates stochtree's BCF via a thin adapter alongside ~9 other causal-ML estimators), we observed unexpectedly large seed-to-seed variability in BCF's posterior-mean ATE on small, confounded data. The spread is wider than that of any other estimator in our default ensemble, and large enough that we'd appreciate your read on whether this is the expected small-data regime or a potential sampler issue worth deeper investigation.
## Self-contained reproducer
Runs in ~30s; no external data, no metacausal dependency:
```python
import warnings

warnings.filterwarnings("ignore")

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from threadpoolctl import threadpool_limits

from stochtree import BCFModel


def confounded_dgp(n=445, p=8, n_seed=0):
    """Small confounded DGP roughly mimicking LaLonde-style earnings data."""
    rng = np.random.default_rng(n_seed)
    X = rng.normal(size=(n, p))
    e = 1 / (1 + np.exp(-(1.5 * X[:, 0] - 0.7 * X[:, 1])))  # confounded propensity
    T = rng.binomial(1, e).astype(float)
    tau = 1500 + 800 * X[:, 2] - 500 * X[:, 3]  # heterogeneous, mean ~1500
    Y0 = np.maximum(np.random.default_rng(n_seed + 1).normal(
        loc=3000 + 1500 * X[:, 0], scale=4000, size=n), 0)
    Y = Y0 + T * tau
    return X.astype(float), T, Y.astype(float)


X, T, Y = confounded_dgp(n_seed=0)  # fixed dataset

for seed in [0, 1, 7, 13, 42, 100, 2026]:
    propensity = cross_val_predict(
        HistGradientBoostingClassifier(max_iter=200, random_state=int(seed)),
        X, T.astype(int), method="predict_proba", cv=5,
    )[:, 1]
    propensity = np.clip(propensity, 0.01, 0.99)

    m = BCFModel()
    with threadpool_limits(limits=1):
        m.sample(
            X_train=X, Z_train=T, y_train=Y,
            propensity_train=propensity,
            num_gfr=5, num_burnin=200, num_mcmc=200,
            general_params={"random_seed": int(seed) % (2**31)},
        )
    tau_hat = m.predict(X=X, Z=T, propensity=propensity,
                        type="posterior", terms="tau")
    print(f"seed={seed:>4} BCF ATE={float(np.mean(tau_hat)):>+8.0f}")
```
## Observed (stochtree 0.4.2, Python 3.13.3, macOS)

```
seed=   0 BCF ATE=   +2775
seed=   1 BCF ATE=   +3386
seed=   7 BCF ATE=   +2200
seed=  13 BCF ATE=   +1683
seed=  42 BCF ATE=     +87
seed= 100 BCF ATE=   +1821
seed=2026 BCF ATE=    -318
```
The dataset is fixed across runs; only the BCF MCMC seed (and the propensity-CV seed) varies. Range: ~$3,700 against a true ATE of ~$1,500.
The same pattern holds on the LaLonde job-training data (n=445) bundled with our package, where the per-seed BCF ATE swings from −$5,550 to +$4,085 with identical configuration. Notably, on LaLonde, going from stochtree 0.4.0 → 0.4.2 (with an unchanged metacausal call) shifted the seed=42 estimate from +$1,482 to −$5,550. The 0.4.2 release notes mention BCF propensity-handling (#334) and parametric-intercept (#326) fixes, so we believe those fixes had a real effect; the question is whether the post-fix sampler is more small-data sensitive than expected.
## What we tried

- Increasing `num_burnin` and `num_mcmc` from 200 each to 1,000 each shrinks the seed range to ~$3,500; helpful, but the variability remains substantial.
- Wrapping `model.sample` in `threadpool_limits(limits=1)` (as above): no change.
- A separately seeded propensity CV vs. sharing the BCF seed: no change.
- Other components in our ensemble (DoubleMLIRM, CausalForestDML, several meta-learners) on the same data have seed-to-seed std in the $100-$360 range; BCF's is ~$3,300.
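The std figures in the last bullet come from a simple seed-sweep harness. A minimal sketch of that harness (the `fit_ate` callable and the toy stand-in below are illustrative, not metacausal's actual API):

```python
import numpy as np

def seed_sweep(fit_ate, seeds):
    """Run fit_ate(seed) -> point ATE for each seed; return (std, range).

    fit_ate is a hypothetical wrapper: any callable mapping a seed to a
    point ATE estimate (we wrap each ensemble component the same way).
    """
    ates = np.array([fit_ate(s) for s in seeds])
    return float(ates.std(ddof=1)), float(ates.max() - ates.min())

# Toy deterministic stand-in, just to show the reporting shape:
std, spread = seed_sweep(lambda s: 1500.0 + 100.0 * (s % 3),
                         [0, 1, 7, 13, 42, 100, 2026])
print(f"seed-to-seed std={std:.0f} range={spread:.0f}")
# → seed-to-seed std=49 range=100
```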
## Question

Is this magnitude of seed-sensitivity expected for BCF in the n≈400-500 / low-overlap regime, or does it warrant deeper investigation given the recent 0.4.2 sampler fixes? Happy to run additional diagnostics: multi-chain BCF, alternative propensity configurations, longer chains, anything else that would help.
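On the multi-chain front, here is a hedged sketch of one diagnostic we could run: a basic Gelman-Rubin R-hat over per-draw ATE traces from independently seeded fits. The `chains` array below is a synthetic stand-in, not stochtree output; in practice each row would be `np.mean(tau_hat, axis=0)` from one BCF run.

```python
import numpy as np

def r_hat(chains):
    """Basic Gelman-Rubin R-hat for an (m chains, n draws) array of ATE traces."""
    chains = np.asarray(chains, dtype=float)
    _, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled posterior-variance estimate
    return float(np.sqrt(var_plus / W))

# Toy example: two well-mixed synthetic chains should give R-hat near 1.
rng = np.random.default_rng(0)
chains = rng.normal(1500, 300, size=(2, 400))
print(round(r_hat(chains), 2))  # ≈ 1.0 for well-mixed chains
```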
Thanks for stochtree — it's a fantastic addition to the open-source causal-ML ecosystem, and our package depends on it.