Three Ways to Say ‘Because’ in Survival Analysis
Data Science @ Personio
and Open Source Contributor @ PyMC
2026-06-14
Who am I?
Link to my website
Code or it didn’t Happen
The full worked example can be found on my blog
There are conflicting views of causation and the conflict matters for mediation analysis in survival contexts.
Counterfactual view: a causal claim is about probabilistic dependence under intervention.
Process view: a causal claim is about the unfolding of influence over time.
Both are intuitive, but they diverge on mediation cases. This talk builds a Bayesian dynamic path analysis in PyMC that bridges both views.
Act I — The puzzle: Pre-emption, competing risks, and why mediation splits the schools
Act II — The model: A hybrid Bayesian DPA: Pearl’s structural equations + Aalen’s time-varying mechanisms
Act III — The lesson: The lens, not just the answer. Why the macro view is primary — and the cross-world assumption it lets you escape.
The two dominant frameworks in causal inference — potential outcomes and do-calculus — share a common foundation:
“X caused Y” means Y would not have occurred without X.
Counterfactual difference-making is the shared engine of modern causal inference. Enormously powerful for total effects.
But what about decomposing effects by pathway? That’s where the test starts to strain.
Two assassins are hired. Assassin A poisons the coffee. Assassin B waits with a sniper rifle as backup.
A’s poison works. B never fires. The target dies.
Counterfactual test: “Would the target have survived if A hadn’t acted?” No: B would have fired. By that test, A made no difference.
But A clearly caused the death. The counterfactual test misses the process that did the work.
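A minimal sketch of the pre-emption case as a toy structural model. The function name `target_dies` and the boolean encoding are assumptions made for illustration; they are not part of the talk's code. It shows the but-for test reporting "no difference" for A, even though A's poison is the process that actually ran.

```python
def target_dies(a_poisons: bool, b_on_standby: bool = True) -> tuple[bool, str]:
    """Toy structural model: B fires only if A's poison has not already worked."""
    if a_poisons:
        return True, "poison"
    if b_on_standby:
        return True, "rifle"
    return False, "none"

# Actual world: A acts.
died_actual, process_actual = target_dies(a_poisons=True)
# Counterfactual world: A does not act, so the backup steps in.
died_counterfactual, process_counterfactual = target_dies(a_poisons=False)

# But-for test: did A's action change the outcome?
a_made_a_difference = died_actual != died_counterfactual

print(a_made_a_difference)                     # False
print(process_actual, process_counterfactual)  # poison rifle
```

The outcome is identical in both worlds; only the process label differs, which is the information the endpoint comparison throws away.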
This is a granularity question. At what fidelity do we locate causes?
The assassin example is a competing risks problem
Mediation is a similar structure.
Attributing credit requires process decomposition
The three schools of causal inference give three different answers — each depending on where they locate causation.
Each school answers “where does causation live?” differently — and that answer dictates how each handles mediation.
The process view doesn’t just sidestep the cross-world problem — it diagnoses it. The micro lens isn’t a stricter version of the macro lens. It’s looking for causation in the wrong place.
| | Potential Outcomes | Pearl SCM | Process (Aalen) |
|---|---|---|---|
| Cause is… | Difference between worlds | Structural mechanism | Unfolding process |
| Mediation? | Ill-defined (cross-world) | Well-defined (structural eqns) | Naturally decomposable |
| Strength | Clean identification | Full causal calculus | Temporal dynamics |
| Weakness | No mediation | Untestable structural assumptions | OLS estimator and non-generative inference |
| Granularity | Endpoint | Each equation / node | Trajectory (cumulative) |
What if one model could make the mechanism transparent and deliver both kinds of estimand?
A patient starts an exercise programme.
It reduces cardiovascular risk directly. Good.
But it also causes joint inflammation. The inflammation increases risk over time.
At the six-month endpoint, the treatment looks modestly protective.
You have no idea it’s being undermined.
To act on this, you need both answers: which process is doing the undermining, and how much benefit is being lost.
dpasurv benchmark example of mediation structures. We will replicate this example in PyMC.
The arrows don’t just exist. They strengthen and weaken over time.
Identification assumption: sequential unconfoundedness
At each time point, conditional on observed history, there are no unmeasured common causes of treatment, mediator, and hazard. The DAG justifies the path decomposition; this assumption justifies calling it causal. In practice violations may occur, but as long as they are not systematic, the decomposition can still be useful for understanding the process.
Generative AI has made one idea mainstream: a model that can sample is a model that can reason about what might have been.
The same logic applies here. The process view demands a model that generates trajectories — not just estimates coefficients.
Three technical moves make this possible: discretise, smooth, bridge.
Once time is discrete, binary events become counts capped at 1 — and counts have a natural likelihood.
\[d_{it} \sim \text{Poisson}(\Lambda_{it}), \quad \Lambda_{it} = Y_{it} \cdot \Delta t \cdot \lambda(t), \quad \lambda(t) = f(\text{linear predictor})\]
The baseline hazard becomes an explicit parameter, not a nuisance to be conditioned away.
This is what makes the model generative. We can sample from it. We can intervene on it.
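A minimal sketch of the discretisation step, assuming a softplus link and a constant linear predictor. The helper names `to_person_period` and `poisson_loglik` and the three-subject toy data are hypothetical. It expands durations into at-risk indicators and event counts per bin, then evaluates the Poisson log-likelihood with mean equal to at-risk indicator times bin width times hazard.

```python
import numpy as np

def to_person_period(durations, observed, bin_width):
    """Expand (duration, event) pairs into at-risk indicators Y and event counts d per bin."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=int)
    n_bins = int(np.ceil(durations.max() / bin_width))
    starts = np.arange(n_bins) * bin_width
    # Y[i, t] = 1 while subject i is still at risk at the start of bin t
    at_risk = (durations[:, None] > starts[None, :]).astype(int)
    # d[i, t] = 1 only in the subject's last at-risk bin, and only if the event was observed
    last_bin = at_risk.sum(axis=1) - 1
    events = np.zeros_like(at_risk)
    events[np.arange(len(durations)), last_bin] = observed
    return at_risk, events

def softplus(x):
    return np.logaddexp(0.0, x)

def poisson_loglik(events, at_risk, eta, bin_width):
    """Poisson log-likelihood with mean Y * dt * softplus(eta); log(d!) = 0 for d in {0, 1}."""
    mu = at_risk * bin_width * softplus(eta)
    # Cells with Y = 0 contribute nothing: mu = 0 and d = 0 there.
    safe_log_mu = np.log(np.where(mu > 0, mu, 1.0))
    return float(np.sum(events * safe_log_mu - mu))

# Hypothetical toy data: three subjects, 10-day bins, the second subject censored
Y, d = to_person_period(durations=[25, 12, 30], observed=[1, 0, 1], bin_width=10)
eta = np.full(Y.shape, -3.0)   # constant linear predictor, purely illustrative
ll = poisson_loglik(d, Y, eta, bin_width=10)
print(Y)
print(d)
print(ll)
```

Because every quantity in the likelihood is explicit, the same function can be re-evaluated under a different linear predictor, which is what makes intervention by simulation possible.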
We draw our sample trajectories on causal pathways with time-varying coefficients indexed to each temporal stage.
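A small sketch of how a cumulative sum of scaled innovations produces a smooth, time-indexed coefficient. This is a simplification that assumes one coefficient per time bin rather than spline basis coefficients, and the values of `n_bins` and `sigma_smooth` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 20          # hypothetical number of time bins
sigma_smooth = 0.1   # hypothetical innovation scale; smaller values give smoother curves

# Standard-normal innovations, one per time bin
raw = rng.normal(0.0, 1.0, size=n_bins)
# The cumulative sum turns independent innovations into a random walk over time
coef_t = np.cumsum(raw * sigma_smooth)

# Successive differences recover the scaled innovations, so neighbouring bins
# differ by an amount on the order of sigma_smooth instead of moving independently.
steps = np.diff(coef_t)
print(np.allclose(steps, raw[1:] * sigma_smooth))  # True
```

The non-centred form (standard-normal draws scaled afterwards) is a common choice for sampling efficiency; whether it matters here depends on the data.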
Mediator equation:
\[M_t = \beta(t) \cdot X + \rho \cdot M_{t-1} + \epsilon\]
The mediator has inertia. Its dynamics are modelled, not assumed away.
Hazard equation:
\[\lambda(t) = f\big(\alpha_0(t) + \alpha_1(t) X + \alpha_2(t) M_{t-1}\big)\]
Direct and indirect paths, both time-varying.
These are estimated simultaneously. One likelihood, one posterior.
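To make the two equations concrete, here is a forward simulation sketch in NumPy. Everything numeric is hypothetical: the coefficient curves, `rho`, `sigma_m`, the starting value of zero for the mediator, and the softplus choice for the link. It is not the fitted model, only the data-generating story the equations describe.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def simulate(x, beta_t, alpha0_t, alpha1_t, alpha2_t, rho, sigma_m, dt, rng):
    """Forward-simulate mediator paths and per-bin expected event counts for treatment vector x."""
    n, n_bins = len(x), len(beta_t)
    m_prev = np.zeros(n)                 # assumed starting value of the mediator
    mediator = np.zeros((n, n_bins))
    hazard = np.zeros((n, n_bins))
    for t in range(n_bins):
        # Hazard equation: uses the lagged mediator M_{t-1}
        hazard[:, t] = softplus(alpha0_t[t] + alpha1_t[t] * x + alpha2_t[t] * m_prev)
        # Mediator equation: M_t = beta(t) X + rho M_{t-1} + noise
        m_prev = beta_t[t] * x + rho * m_prev + rng.normal(0.0, sigma_m, size=n)
        mediator[:, t] = m_prev
    return mediator, hazard * dt

rng = np.random.default_rng(1)
n_bins = 12
params = dict(
    beta_t=np.linspace(0.0, 0.8, n_bins),    # treatment raises the mediator more over time
    alpha0_t=np.full(n_bins, -4.0),          # baseline
    alpha1_t=np.full(n_bins, -0.5),          # protective direct path
    alpha2_t=np.linspace(0.0, 1.0, n_bins),  # mediator becomes harmful later
    rho=0.7, sigma_m=0.1, dt=1.0,
)
med_treated, mu_treated = simulate(np.ones(500), rng=rng, **params)
med_control, mu_control = simulate(np.zeros(500), rng=rng, **params)

print(med_treated.mean(axis=0).round(2))
print((mu_treated.mean(axis=0) / mu_control.mean(axis=0)).round(2))
```

With these made-up curves the treated-to-control ratio starts below one and drifts upward as the mediated path switches on, which is the qualitative pattern the exercise example describes.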
import pymc as pm
import pytensor.tensor as pt

# Slide excerpt: the data arrays, the time-bin index b, and the priors on
# sigma_smooth, rho and sigma_m are not shown here.
with pm.Model() as model:
    # 1. Time-varying coefficients via random-walk splines
    beta_raw = pm.Normal("beta_raw", 0, 1,
                         dims=('splines', 'tv'))
    coef_alpha = pt.cumsum(beta_raw * sigma_smooth, axis=0)
    # (beta_t and alpha_*_t below are the per-bin coefficient curves;
    # their construction is not shown in this excerpt)

    # 2. Mediator equation (Pearl's X → M, Aalen's β(t))
    mu_m = beta_t[b] * trt + rho * med_lag
    pm.Normal("obs_m", mu=mu_m, sigma=sigma_m,
              observed=med)

    # 3. Hazard equation (Pearl's X → Y + M → Y, Aalen's α(t))
    eta = (alpha_0_t[b]
           + alpha_1_t[b] * trt       # direct path
           + alpha_2_t[b] * med_lag)  # mediated path

    # 4. Softplus link → Poisson likelihood
    # Registered as a Deterministic so it can be requested by name later
    Lambda = pm.Deterministic("Lambda", dt * pm.math.log1pexp(eta))
    pm.Poisson("obs_event", mu=Lambda,
               observed=events)

    # 5. Path decomposition: falls out of the structure
    pm.Deterministic("direct", alpha_1_t)
    pm.Deterministic("indirect", beta_t * alpha_2_t)
    pm.Deterministic("total", alpha_1_t + beta_t * alpha_2_t)

    # 6. One call estimates everything jointly
    idata = pm.sample(draws=2000, tune=2000,
                      nuts_sampler="numpyro")

Left: direct effect over time reproduces the data. Middle: indirect effect. Right: total effect.
Here we plot the time-varying path coefficients. The direct effect is protective throughout. The indirect effect is inert, then harmful after day 100. The total effect is attenuated, recovering the dpasurv pattern.
import numpy as np

# Assumes trt was registered in the model as mutable data, e.g. pm.Data("trt", ...)
with model:
    # World 1: everyone treated
    pm.set_data({'trt': np.ones(n)})
    idata_trt1 = pm.sample_posterior_predictive(
        idata, var_names=['Lambda', 'obs_m', 'obs_event'])
    # World 0: nobody treated
    pm.set_data({'trt': np.zeros(n)})
    idata_trt0 = pm.sample_posterior_predictive(
        idata, var_names=['Lambda', 'obs_m', 'obs_event'])

We use posterior predictive sampling to generate two counterfactual worlds. Same population. Same model.
The mediator evolves naturally under each intervention and we can derive time-varying hazards and survival curves for both.
Here we show the hazards propagated under each intervention and compare them to get at the relative risk estimand.
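A short sketch of how per-bin hazards from the two worlds turn into survival curves and relative measures. The hazard values and the 30-day bin width are hypothetical placeholders for posterior predictive summaries, and the helper `survival_from_hazard` is an illustrative name. It assumes a piecewise-constant hazard within each bin.

```python
import numpy as np

def survival_from_hazard(hazard, dt):
    """S(t) = exp(-cumulative hazard), with a piecewise-constant hazard per bin."""
    return np.exp(-np.cumsum(hazard * dt, axis=-1))

# Hypothetical per-bin hazards for the two worlds
dt = 30.0
hazard_trt0 = np.array([0.0020, 0.0020, 0.0020, 0.0020])   # nobody treated
hazard_trt1 = np.array([0.0012, 0.0013, 0.0016, 0.0019])   # everyone treated

surv0 = survival_from_hazard(hazard_trt0, dt)
surv1 = survival_from_hazard(hazard_trt1, dt)

hazard_ratio = hazard_trt1 / hazard_trt0   # time-varying relative hazard
risk_ratio = (1 - surv1) / (1 - surv0)     # cumulative relative risk by the end of each bin

print(hazard_ratio.round(2))
print(risk_ratio.round(2))
```

Applied draw by draw to the posterior predictive samples, the same arithmetic would give an uncertainty band around each curve rather than a single line.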
Two independent axes — estimand (what are you asking?) and lens (where do you locate causation?):
| | Macro lens (realised process) | Micro lens (point-wise modularity) |
|---|---|---|
| Shape (which path) | ✓ trajectory coefficients | ✗ category error — pre-emption shows the cause is the process, not a point-wise modular fact |
| Size (how much) | — needs structural commitment to size | g-computation, NDE/NIE |
The diagonals carry the work. Shape lives at the macro lens — and only there. Size can be computed at the micro lens, but only because we accept structural commitments as a price — not because the micro lens is the right home for causal claims.
The asymmetry
The macro lens is primary. The micro lens earns its keep only when there’s a process already identified to be sized. G-computation is parasitic on the trajectory, not equipotent with it.
Ockham’s Razor says: one cause, one effect, pare it down.
Hickam’s Dictum says: “A patient can have as many diseases as they damn well please.”
The DPA model is Hickam operationalised:
Refusing to decompose isn’t caution, it’s a commitment. It assumes all paths point the same way. When they don’t, aggressive simplification is misspecification from the start — a commitment to looking for causation where it isn’t, at the endpoint, between frozen worlds, instead of along the path the process actually took.
The assumption that makes mediation fragile:
\[Y(1, M(0)) \perp\!\!\!\perp M(1) \mid X\]
“The M → Y relationship under treatment is the same as it would be if M had been generated under control.”
The macro lens doesn’t need it. It’s a strictly weaker identification regime.
Trajectory decomposition
\[\text{total}(t) = \alpha_1(t) + \beta(t)\alpha_2(t)\]
Algebraic identity, not counterfactual subtraction. Identified by sequential within-world unconfoundedness alone. No cross-world assumption.
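A worked numeric instance of the identity, using hypothetical coefficient curves chosen only to mimic the pattern described earlier (protective direct path, indirect path switching on after day 100). With a fitted model the same arithmetic would be applied to each posterior draw.

```python
import numpy as np

days = np.arange(0, 200, 20)
# Hypothetical coefficient curves, not estimates
alpha_1 = np.full(days.shape, -0.4)      # direct path: protective throughout
beta = np.where(days > 100, 0.5, 0.0)    # treatment raises the mediator after day 100
alpha_2 = np.full(days.shape, 0.6)       # mediator raises the hazard

direct = alpha_1
indirect = beta * alpha_2
total = direct + indirect

for day, d_, i_, t_ in zip(days, direct, indirect, total):
    print(f"day {day:3d}: direct {d_:+.2f}  indirect {i_:+.2f}  total {t_:+.2f}")
```

No counterfactual world is simulated here: the three curves are functions of coefficients estimated within the observed world.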
G-computation NDE
Plug \(M(0)\) trajectory into hazard under \(X=1\).
This is a cross-world counterfactual. Pays the full identification price. The price buys you a magnitude.
The trajectory told us which path is harmful and when. Now we pay the modularity cost to answer: how much?
# Schematic: alpha_k(t) stands for posterior draws of the time-varying coefficients
# Mediator trajectories under control
m_nat = idata_trt0.posterior_predictive['obs_m']
# Hazard under treatment, with control mediator
eta_nde = (alpha_0(t) + alpha_1(t) * 1
           + alpha_2(t) * m_nat)
# Transform through the link and simulate
hazard_nde = dt * log1pexp(eta_nde)

Implication: Do not abandon the treatment. Intervene on the mediator after day 100. The NDE is a cross-world comparison, justified by modularity, but it’s what lets you subtract from the total and reveal the hidden structure.
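A self-contained sketch of the cross-world plug-in, under simplifying assumptions that are mine rather than the talk's: noise-free mediator paths, effects measured as hazard differences, a softplus link, and hypothetical coefficient curves. The helper names `mediator_path` and `hazard_path` are illustrative.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def mediator_path(x, beta_t, rho):
    """Noise-free mediator trajectory M_t = beta(t) x + rho M_{t-1}, starting from M = 0."""
    m, path = 0.0, []
    for b in beta_t:
        m = b * x + rho * m
        path.append(m)
    return np.array(path)

def hazard_path(x, mediator, alpha0_t, alpha1_t, alpha2_t):
    """Hazard with treatment x and a supplied mediator path, entering with a one-bin lag."""
    m_lag = np.concatenate([[0.0], mediator[:-1]])
    return softplus(alpha0_t + alpha1_t * x + alpha2_t * m_lag)

# Hypothetical coefficient curves over 8 time bins
n_bins = 8
beta_t = np.linspace(0.0, 0.8, n_bins)
alpha0_t = np.full(n_bins, -4.0)
alpha1_t = np.full(n_bins, -0.5)
alpha2_t = np.full(n_bins, 0.6)
rho = 0.7

m0 = mediator_path(0, beta_t, rho)   # mediator under control
m1 = mediator_path(1, beta_t, rho)   # mediator under treatment

h_00 = hazard_path(0, m0, alpha0_t, alpha1_t, alpha2_t)   # control world
h_10 = hazard_path(1, m0, alpha0_t, alpha1_t, alpha2_t)   # cross-world: X = 1 with M(0)
h_11 = hazard_path(1, m1, alpha0_t, alpha1_t, alpha2_t)   # treated world

nde = h_10 - h_00    # natural direct effect on the hazard scale
nie = h_11 - h_10    # natural indirect effect on the hazard scale
total = h_11 - h_00
print(np.allclose(nde + nie, total))  # True
```

The middle term `h_10` is the only quantity that no single world ever realises, which is where the cross-world assumption is spent.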
1. A hybrid model: Pearl’s structural equations + Aalen’s time-varying mechanisms, unified in PyMC with full posterior uncertainty.
2. Two kinds of estimand from one model: coefficients for the shape of the mechanism, g-computation for the size of the effect — including NDE/NIE.
3. Two scales, but not symmetric: the macro lens (spline-smoothed trajectories) identifies the realised process and escapes cross-world independence. The micro lens (point-wise modularity) sizes what macro has already identified.
4. The macro question is the right question; the micro answer sizes what macro identifies. Probabilistic programming is the substrate that makes both answerable — but only one is primary.
Causation lives at the macro scale.
The pre-emption puzzle showed it: ask “did A make a difference?” and you get the wrong answer. Ask “which process transmitted the influence?” and you get the right one.
The trajectory coefficients are not a coarser estimand than the cross-world NDE. They are the right estimand for the question. The NDE is a useful auxiliary — it sizes what the trajectory has already identified.
The endpoint view isn’t caution, it’s a commitment to the wrong lens. So is point-wise modularity when the question is which path did the work?
Two scales. One primary.
Probabilistic programming gives us both scales — but they aren’t symmetric. The macro lens is primary.
Resources
Three Ways to Say ‘Because’ | PyData London 2026