Hazards on the Causal Path

Three Ways to Say ‘Because’ in Survival Analysis

Nathaniel Forde

Data Science @ Personio

and Open Source Contributor @ PyMC

2026-06-14

Preliminaries

Who am I?

I’m a data scientist at Personio
- Bayesian statistician,
- Reformed philosopher and logician.
Website: https://nathanielf.github.io/


Code or It Didn’t Happen

The full worked example can be found on my blog

The Pitch

There are conflicting views of causation and the conflict matters for mediation analysis in survival contexts.

Counterfactual view: a causal claim is about probabilistic dependence under intervention.
Process view: a causal claim is about the unfolding of influence over time.

Both are intuitive, but they diverge on mediation cases. This talk builds a Bayesian dynamic path analysis in PyMC that bridges both views.

Agenda

Act I — The puzzle: Pre-emption, competing risks, and why mediation splits the schools

Act II — The model: A hybrid Bayesian DPA: Pearl’s structural equations + Aalen’s time-varying mechanisms

Act III — The lesson: The lens, not just the answer. Why the macro view is primary — and the cross-world assumption it lets you escape.

Act I: Three Ways to Say “Because”

The Counterfactual Test

The two dominant frameworks in causal inference — potential outcomes and do-calculus — share a common foundation:

“X caused Y” means Y would not have occurred without X.

Potential outcomes: \(\text{ATE} = E[Y(1)] - E[Y(0)]\) — the difference between what happens and what would have happened.
Do-calculus: \(E[Y \mid do(X=1)] - E[Y \mid do(X=0)]\) — same logic, structural implementation.
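A minimal sketch of the total-effect contrast, assuming we could see both potential outcomes for every unit (which real data never allows). The outcome values are hypothetical and exist only to make the arithmetic concrete.

```python
# Hypothetical potential outcomes for five units:
# y1 is the outcome under treatment, y0 the outcome under control.
y1 = [3.0, 5.0, 4.0, 6.0, 2.0]
y0 = [2.0, 5.0, 1.0, 4.0, 2.0]


def mean(values):
    return sum(values) / len(values)


# ATE = E[Y(1)] - E[Y(0)]
ate = mean(y1) - mean(y0)

# Equivalent view: the average of the unit-level differences.
unit_effects = [treated - control for treated, control in zip(y1, y0)]
ate_from_units = mean(unit_effects)

print(round(ate, 3), round(ate_from_units, 3))
```

Both routes give the same number because expectation is linear. Note that this single number says nothing about which pathway produced the difference.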

Counterfactual difference-making is the shared engine of modern causal inference. Enormously powerful for total effects.

But what about decomposing effects by pathway? That’s where the test starts to strain.

The Pre-emption Puzzle

Two assassins are hired. Assassin A poisons the coffee. Assassin B waits with a sniper rifle as backup.

A’s poison works. B never fires. The target dies.
Counterfactual test: “Would the target have survived if A hadn’t acted?”
- No, because B would have fired. So A fails the counterfactual test for being the cause.
But A clearly caused the death. The counterfactual test misses the process that did the work.
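The puzzle can be written as a toy structural model. This is a sketch under the assumptions of the story as told here: B fires only if the poison was never administered. The function and variable names are illustrative.

```python
def target_outcome(a_poisons, b_on_standby):
    """Return (target_dies, process_that_operated)."""
    if a_poisons:
        return True, "poison"
    if b_on_standby:
        # The backup fires only when the poison was not used.
        return True, "rifle"
    return False, "none"


actual = target_outcome(a_poisons=True, b_on_standby=True)
counterfactual = target_outcome(a_poisons=False, b_on_standby=True)

# Endpoint test: does removing A change whether the target dies?
a_makes_difference = actual[0] != counterfactual[0]

# Process test: which mechanism actually delivered the outcome?
operating_process = actual[1]

print(a_makes_difference, operating_process)
```

The endpoint comparison reports no difference, while the process trace names the poison. The two questions are answered by different parts of the same model.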

This is a granularity question. At what fidelity do we locate causes?

From Competing Risks to Mediation

The assassin example is a competing risks problem.
Mediation has a similar structure.
- A treatment works through multiple channels.
- The challenge is understanding the relative contribution of each channel to the total effect.
Attributing credit requires process decomposition.

The three schools of causal inference give three different answers — each depending on where they locate causation.

The competing risks framing is the cleaner analogy. Pre-emption in the assassin case involves redundant backup causes, which is not quite the structure of mediation. But competing risks — multiple pathways to the same event, where you need to know which one operated — maps directly onto mediation analysis. In both cases, the counterfactual test on the endpoint is insufficient: the patient dies either way in competing risks; the treatment shows “modest benefit” in mediation. You need to decompose by pathway to understand what actually happened. Whether death occurs due to a mediated harmful effect or an alternative risk pathway, it is the process decomposition that tells you where to intervene. The three views on the next slide are three responses to this challenge — each placing causation in a different home. Potential outcomes tries to express mediation decomposition as nested counterfactuals and runs into trouble. Pearl embraces structural equations to make the decomposition well-defined. Aalen models the process directly, so the decomposition is algebraic rather than counterfactual.

Three Views of Primacy

Each school answers “where does causation live?” differently — and that answer dictates how each handles mediation.

Potential Outcomes — primacy of intervention. Rubin, Holland (1986): “no causation without manipulation.” A causal claim must map to an experiment you could run. NDE requires \(Y(1, M(0))\) — two simultaneous contrary interventions, no single experiment can implement it. Mediation has no operational meaning.

Structural Causal Models — primacy of equations. Pearl (2001): causation lives in the stability of structural relations under intervention. Cross-world counterfactuals are computations over the model. Mediation is first-class — if your equations are right.

Process / Aalen — primacy of the trajectory. Causation is the unfolding of influence in time. Effects are increments to risk; the causal story is told by the path, not the endpoint. Mediation is decomposed algebraically — no cross-world counterfactual ever arises.

In causal inference and judicial sentencing it matters if you were killed by a man or a werewolf

The process view doesn’t just sidestep the cross-world problem — it diagnoses it. The micro lens isn’t a stricter version of the macro lens. It’s looking for causation in the wrong place.

Three answers to the same question — where does causation live? Potential outcomes locates it in interventions: every causal claim must correspond to an experiment. That’s principled and clean for total effects, but it rules out mediation because the NDE Y(1, M(0)) requires two simultaneous interventions no single experiment can implement. Robins and Richardson have argued this makes the NDE incoherent within the PO framework. Pearl’s SCM relocates causation in the stability of structural equations under intervention. If the equations describe the actual generating process, cross-world counterfactuals are just computations over the model. Mediation becomes first-class — but the price is commitment to untestable structural assumptions. Aalen relocates causation again, this time in the trajectory itself. Effects are increments to risk over time, not comparisons between frozen worlds. The decomposition is algebraic, not counterfactual: total = direct + indirect as an identity in the additive structure. The cross-world counterfactual never arises. This handles pre-emption naturally — the cause is the process that actually transmitted the influence, identified by its coefficient. The deeper claim, which we’ll return to: the process view doesn’t just dodge the cross-world problem, it diagnoses it. Cross-world counterfactuals are looking for causation in the wrong place.

Perspectives on Mediation

	Potential Outcomes	Pearl SCM	Process (Aalen)
Cause is…	Difference between worlds	Structural mechanism	Unfolding process
Mediation?	Ill-defined (cross-world)	Well-defined (structural eqns)	Naturally decomposable
Strength	Clean identification	Full causal calculus	Temporal dynamics
Weakness	No mediation	Untestable structural assumptions	OLS estimator and non-generative inference
Granularity	Endpoint	Each equation / node	Trajectory (cumulative)

What if one model could make the mechanism transparent and deliver both kinds of estimand?

Concrete Mediation Example

Replicating `dpasurv`

A patient starts an exercise programme.
It reduces cardiovascular risk directly. Good.
But it also causes joint inflammation. The inflammation increases risk over time.
At the six-month endpoint, the treatment looks modestly protective.
You have no idea it’s being undermined.

To act on this, you need both answers: which process is doing the undermining, and how much benefit is being lost.

The `dpasurv` benchmark example of a mediation structure, which we will replicate in PyMC.

The Structure: A Sequence of DAGs

The arrows don’t just exist. They strengthen and weaken over time.

Identification assumption: sequential unconfoundedness

At each time point, conditional on observed history, there are no unmeasured common causes of treatment, mediator, and hazard. The DAG justifies the path decomposition; this assumption justifies calling it causal. In practice violations may occur, but as long as they are not systematic the decomposition can still be useful for understanding the process.

Act II: Building the Process Model

Why a Generative Model?

Generative AI has made one idea mainstream: a model that can sample is a model that can reason about what might have been.

The same logic applies here. The process view demands a model that generates trajectories — not just estimates coefficients.

A partial likelihood (Cox) conditions out the baseline hazard, so time enters only through the ordering of events.
A full Poisson likelihood puts time back in: explicit baseline hazard, time-varying coefficients, and the ability to simulate counterfactual worlds.

Three technical moves make this possible: discretise, smooth, bridge.

The generative AI connection is not just rhetorical. The reason LLMs are useful for counterfactual reasoning — “what would the text have said if the premise were different?” — is that they model the full joint distribution and can sample from it. The same property is what makes a full Poisson likelihood more powerful than a Cox partial likelihood for causal inference. Cox trades away the baseline hazard (conditions it out) in exchange for robustness. That’s a good trade when you only need a total effect. It’s a bad trade when you need to simulate — because there’s nothing to simulate from. The Poisson bridge keeps the baseline hazard as an explicit parameter, making the model generative. The plot on the right shows what this buys: posterior samples of the time-varying direct effect coefficient, each a plausible trajectory. The posterior is a distribution over mechanisms, not just a point estimate. That is what process reasoning requires.

Discretise + Smooth

Discretise time into bins

# Each row: who was at risk, how long, what happened
df["dt"] = df["stop"] - df["start"]
bin_idx = df["bin_idx"].values
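As a rough sketch of how such rows might be produced, the helper below splits one subject's follow-up into one row per time bin. The function name, bin edges and subject values are hypothetical; only the column names (`start`, `stop`, `dt`, `bin_idx`) follow the snippet above.

```python
def to_person_period(subject_id, event_time, event, bin_edges):
    """Split one subject's follow-up into one row per time bin at risk."""
    rows = []
    for idx in range(len(bin_edges) - 1):
        start, stop = bin_edges[idx], bin_edges[idx + 1]
        if event_time <= start:
            break  # follow-up ended before this bin began
        exit_time = min(stop, event_time)
        rows.append({
            "id": subject_id,
            "bin_idx": idx,
            "start": start,
            "stop": exit_time,
            "dt": exit_time - start,
            # The event is recorded only in the bin where follow-up ends.
            "event": int(event == 1 and event_time <= stop),
        })
    return rows


bin_edges = [0, 50, 100, 150, 200]
rows = to_person_period(subject_id=1, event_time=120, event=1, bin_edges=bin_edges)
censored = to_person_period(subject_id=2, event_time=120, event=0, bin_edges=bin_edges)

for row in rows:
    print(row)
```

A subject with an event at day 120 contributes two full bins and one partial bin of 20 days. A censored subject contributes the same exposure but no event.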

Smooth with B-splines

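# Random-walk prior on spline coefficients: independent Normal increments,
# scaled by sigma_smooth (defined elsewhere in the model), then cumulatively summed
beta_raw = pm.Normal("beta_raw", 0, 1,
                     dims=('splines', 'tv'))
coef_alpha = pt.cumsum(beta_raw * sigma_smooth,
                       axis=0)
# Project one coefficient column onto the time grid through the B-spline basis
alpha_1_t = pt.dot(basis, coef_alpha[:, 1])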

Spline complexity (knot count) chosen by LOO cross-validation

B-spline basis functions and a weighted sum forming a smooth trajectory
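To make the picture concrete without PyMC, here is a plain-Python sketch of the same construction: a B-spline basis evaluated with the Cox-de Boor recursion, and a random-walk coefficient vector built by a cumulative sum. The knot positions, increments and smoothing scale are hypothetical values chosen for illustration, not those of the worked example.

```python
from itertools import accumulate


def bspline_basis(x, knots, degree):
    """Evaluate every B-spline basis function at x (Cox-de Boor recursion)."""
    n_basis = len(knots) - degree - 1
    # Degree 0: indicator functions of the half-open knot intervals.
    b = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
         for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * b[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * b[i + 1]
                     if right_den > 0 else 0.0)
            nxt.append(left + right)
        b = nxt
    return b[:n_basis]


# Hypothetical clamped cubic basis on days 0 to 200.
degree = 3
knots = [0.0] * 4 + [50.0, 100.0, 150.0] + [200.0] * 4

# Random walk: fixed stand-ins for standard-normal increments, scaled by a
# smoothing parameter and cumulatively summed.
raw_increments = [-0.6, 0.1, -0.2, 0.3, 0.4, 0.2, -0.1]
sigma_smooth = 0.5
coefs = list(accumulate(r * sigma_smooth for r in raw_increments))

# The smooth trajectory is the weighted sum of basis functions at each time.
grid = [0.0, 25.0, 50.0, 75.0, 100.0, 125.0, 150.0, 175.0, 199.0]
basis_rows = [bspline_basis(x, knots, degree) for x in grid]
alpha_t = [sum(b * c for b, c in zip(row, coefs)) for row in basis_rows]

print([round(a, 3) for a in alpha_t])
```

Because neighbouring coefficients differ only by one scaled increment, a small `sigma_smooth` forces the trajectory to change slowly. That is the sense in which the prior says adjacent time points share structure.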

The Poisson Bridge

Once time is discrete, binary events become counts capped at 1 — and counts have a natural likelihood.

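\[d_{it} \sim \text{Poisson}(\Lambda_{it}), \quad \Lambda_{it} = Y_{it} \cdot \Delta t \cdot \lambda(t), \quad \lambda(t) = f(\text{linear predictor})\]

Here \(Y_{it}\) is the at-risk indicator, \(\Delta t\) is the exposure time in the bin, and \(\lambda(t)\) is the hazard.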

The baseline hazard becomes an explicit parameter, not a nuisance to be conditioned away.

This is what makes the model generative. We can sample from it. We can intervene on it.

We draw our sample trajectories on causal pathways with time-varying coefficients indexed to each temporal stage.
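One way to see why the bridge is legitimate: for 0/1 events, the Poisson log-likelihood and the piecewise-constant-hazard survival log-likelihood differ only by a term that does not involve the hazard. The sketch below checks this numerically with a constant hazard; the rows and hazard values are hypothetical.

```python
import math


def poisson_loglik(events, exposures, hazard):
    """Poisson log-likelihood with mean = exposure * hazard in each bin."""
    total = 0.0
    for d, dt in zip(events, exposures):
        mean = dt * hazard
        total += d * math.log(mean) - mean - math.lgamma(d + 1)
    return total


def survival_loglik(events, exposures, hazard):
    """Piecewise-constant-hazard survival log-likelihood for the same rows."""
    return sum(d * math.log(hazard) - hazard * dt
               for d, dt in zip(events, exposures))


# One subject: two event-free bins, then an event after 20 days in the third.
events = [0, 0, 1]
exposures = [50.0, 50.0, 20.0]

gap_low = (poisson_loglik(events, exposures, 0.005)
           - survival_loglik(events, exposures, 0.005))
gap_high = (poisson_loglik(events, exposures, 0.02)
            - survival_loglik(events, exposures, 0.02))

print(round(gap_low, 6), round(gap_high, 6))
```

The gap is the same at both hazard values, so the two likelihoods rank every hazard identically. Inference about the hazard is therefore unchanged, while the Poisson form can be sampled from.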

The Joint Model

Mediator equation:

\[M_t = \beta(t) \cdot X + \rho \cdot M_{t-1} + \epsilon\]

The mediator has inertia. Its dynamics are modelled, not assumed away.

Hazard equation:

\[\lambda(t) = f\big(\alpha_0(t) + \alpha_1(t) X + \alpha_2(t) M_{t-1}\big)\]

Direct and indirect paths, both time-varying.

These are estimated simultaneously. One likelihood, one posterior.

This is a hybrid: structural equations in Pearl’s sense, time-varying mechanisms in Aalen’s sense. PyMC is the substrate that makes both cohabit — the probabilistic programming framework estimates the structure, tracks the process, and propagates uncertainty through both.

The joint estimation is important. Information flows between the mediator model and the hazard model. If the mediator equation fits well, the hazard model gets a cleaner signal about M’s causal contribution. You’re fitting a system, not two regressions. The hybrid nature is the key engineering insight. Pearl gives you the structural equations — identifiable paths with a causal interpretation. Aalen gives you the temporal dynamics — coefficients that evolve, mechanisms that wax and wane. Neither alone would suffice: Pearl’s SCM at a single time point can’t show you the mediator turning harmful after day 100; Aalen’s additive model alone doesn’t give you marginal counterfactual contrasts. Probabilistic programming is what unifies them: it estimates both equations jointly, lets you read the coefficients for mechanism, and lets you simulate counterfactual worlds for magnitude — all within a single model with coherent uncertainty.
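To illustrate what "generative" means for this pair of equations, the sketch below simulates one subject forward in time in plain Python. It is not the PyMC model: every parameter value is hypothetical, the coefficients are held constant across bins for brevity, and the softplus link is an assumption carried over from the model code in this talk.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simulate_subject(trt, n_bins, dt, beta, rho, sigma_m,
                     alpha_0, alpha_1, alpha_2, rng):
    """Simulate one mediator path and the bin of the event (None if censored)."""
    med_lag = 0.0
    mediator_path = []
    for t in range(n_bins):
        # Hazard equation: uses the lagged mediator M_{t-1}.
        eta = alpha_0[t] + alpha_1[t] * trt + alpha_2[t] * med_lag
        hazard = softplus(eta)
        # Probability of at least one Poisson event in this bin.
        p_event = 1.0 - math.exp(-dt * hazard)
        if rng.random() < p_event:
            return mediator_path, t
        # Mediator equation: M_t = beta(t) * X + rho * M_{t-1} + noise.
        med = beta[t] * trt + rho * med_lag + rng.gauss(0.0, sigma_m)
        mediator_path.append(med)
        med_lag = med
    return mediator_path, None


rng = random.Random(0)
n_bins = 10
path, event_bin = simulate_subject(
    trt=1, n_bins=n_bins, dt=1.0,
    beta=[0.5] * n_bins, rho=0.7, sigma_m=0.2,
    alpha_0=[-3.0] * n_bins, alpha_1=[-0.5] * n_bins, alpha_2=[0.3] * n_bins,
    rng=rng,
)
print(len(path), event_bin)
```

Changing `trt` and rerunning with the same parameters is the simulation analogue of an intervention: the mediator evolves differently, and the hazard follows.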

The Model in Code

# basis, b (presumably each row's time-bin index), trt, med, med_lag, dt, events,
# sigma_smooth, rho and sigma_m are assumed to be defined as in the full worked example.
with pm.Model() as model:
    # 1. Time-varying coefficients via random-walk splines
    beta_raw = pm.Normal("beta_raw", 0, 1,
                         dims=('splines', 'tv'))
    coef_alpha = pt.cumsum(beta_raw * sigma_smooth, axis=0)
    # beta_t, alpha_0_t, alpha_1_t, alpha_2_t are columns of coef_alpha projected
    # through the spline basis, e.g. alpha_1_t = pt.dot(basis, coef_alpha[:, 1])

    # 2. Mediator equation  (Pearl's X → M, Aalen's β(t))
    mu_m = beta_t[b] * trt + rho * med_lag
    pm.Normal("obs_m", mu=mu_m, sigma=sigma_m,
              observed=med)

    # 3. Hazard equation  (Pearl's X → Y + M → Y, Aalen's α(t))
    eta = (alpha_0_t[b]
           + alpha_1_t[b] * trt        # direct path
           + alpha_2_t[b] * med_lag)   # mediated path

    # 4. Softplus link → Poisson likelihood
    # Registered as a Deterministic so it can be requested by name later
    Lambda = pm.Deterministic("Lambda", dt * pm.math.log1pexp(eta))
    pm.Poisson("obs_event", mu=Lambda,
               observed=events)

    # 5. Path decomposition: falls out of the structure
    pm.Deterministic("direct",   alpha_1_t)
    pm.Deterministic("indirect", beta_t * alpha_2_t)
    pm.Deterministic("total",    alpha_1_t + beta_t * alpha_2_t)

    # 6. One call estimates everything jointly
    idata = pm.sample(draws=2000, tune=2000,
                      nuts_sampler="numpyro")

Walk through this slide step by step. Step 1: the random-walk splines are the Aalen contribution — they let every coefficient evolve smoothly over time. Step 2: the mediator equation is the first structural equation, Pearl’s X→M path, but with a time-varying coefficient β(t). Step 3: the hazard equation is the second structural equation — the direct path α₁(t) and the mediated path α₂(t) are both time-varying. Step 4: the softplus link ensures positive hazards, and the Poisson likelihood makes the model generative — this is the “bridge” from process to simulation. Step 5: the path decomposition is not a post-hoc calculation; it falls directly out of the model’s structure, because the equations explicitly separate the paths. Step 6: one pm.sample call estimates the entire system jointly — both equations, all time-varying coefficients, with full posterior uncertainty. This is what the probabilistic programming substrate buys you: Pearl’s structure and Aalen’s dynamics in a single inference pass.

Direct and Indirect Effects

(Latent Additive Scale)

Left: direct effect over time reproduces the data. Middle: indirect effect. Right: Total Effect

Here we plot the time-varying path coefficients. The direct effect is protective throughout. The indirect effect is inert, then harmful after day 100. The total effect is attenuated, recovering the `dpasurv` pattern.

G-Computation for the Marginal Contrasts

# trt must be registered in the model with pm.Data for set_data to work
with model:
    # World 1: everyone treated
    pm.set_data({'trt': np.ones(n)})
    idata_trt1 = pm.sample_posterior_predictive(
        idata, var_names=['Lambda', 'obs_m', 'obs_event'])

    # World 0: nobody treated
    pm.set_data({'trt': np.zeros(n)})
    idata_trt0 = pm.sample_posterior_predictive(
        idata, var_names=['Lambda', 'obs_m', 'obs_event'])

We use posterior predictive sampling to generate two counterfactual worlds. Same population. Same model.

The mediator evolves naturally under each intervention and we can derive time-varying hazards and survival curves for both.

Translating to Hazards

(Observed Scale)

Here we show the hazards propagated under each intervention, and compare them to obtain the relative risk estimand.
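A minimal sketch of that translation, assuming piecewise-constant hazards within each bin. The hazard values below are hypothetical stand-ins for one posterior draw under each intervention; in practice the same calculation would be repeated per draw to carry uncertainty through.

```python
import math
from itertools import accumulate


def survival_curve(hazards, dt):
    """S(t) = exp(-cumulative hazard), with a constant hazard in each bin."""
    cumulative = accumulate(h * dt for h in hazards)
    return [math.exp(-c) for c in cumulative]


dt = 50.0  # bin width in days
hazard_trt1 = [0.0010, 0.0010, 0.0012, 0.0014]  # everyone treated
hazard_trt0 = [0.0015, 0.0015, 0.0015, 0.0015]  # nobody treated

surv_trt1 = survival_curve(hazard_trt1, dt)
surv_trt0 = survival_curve(hazard_trt0, dt)

# Bin-by-bin hazard ratio, and the ratio of cumulative risks at the endpoint.
hazard_ratio = [h1 / h0 for h1, h0 in zip(hazard_trt1, hazard_trt0)]
risk_trt1 = 1.0 - surv_trt1[-1]
risk_trt0 = 1.0 - surv_trt0[-1]
risk_ratio = risk_trt1 / risk_trt0

print([round(r, 3) for r in hazard_ratio], round(risk_ratio, 3))
```

In these made-up numbers the hazard ratio drifts towards 1 over time while the endpoint risk ratio still looks protective, which is the kind of pattern an endpoint summary alone would not reveal.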

The Dual Output

Two independent axes — estimand (what are you asking?) and lens (where do you locate causation?):

	Macro lens (realised process)	Micro lens (point-wise modularity)
Shape (which path)	✓ trajectory coefficients	✗ category error — pre-emption shows the cause is the process, not a point-wise modular fact
Size (how much)	— needs structural commitment to size	g-computation, NDE/NIE

The diagonals carry the work. Shape lives at the macro lens — and only there. Size can be computed at the micro lens, but only because we accept structural commitments as a price — not because the micro lens is the right home for causal claims.

The asymmetry

The macro lens is primary. The micro lens earns its keep only when there’s a process already identified to be sized. G-computation is parasitic on the trajectory, not equipotent with it.

This is the deepest conceptual point in the talk. The two outputs don’t just answer different questions — they locate causation at different levels of granularity. The parametric path coefficients are macro objects: smoothed by the B-spline basis, governed by a random walk prior that says “adjacent time points share structural relationships.” The causal story is told by the cumulative trajectory, not by any single tick. This is the process view operationalised. G-computation, by contrast, requires the structural equations to hold point-wise: at each time step, alpha_2(t) relates M to the hazard the same way regardless of which intervention generated M. That’s a micro-level commitment — modularity at every tick. The splines bridge these scales. They smooth over point-wise structural detail, encoding the pragmatic attitude that fine-grained modularity violations wash out in the aggregate. LOO cross-validation across knot counts tests this empirically: if the causal conclusions are stable whether you use 6 knots or 12, you have evidence that the macro view is robust to the micro-level granularity — which is indirect evidence that point-wise violations, if any, are not systematic. This is the “inspectability” claim made concrete. You’re not just trusting modularity — you’re checking whether your answer depends on how seriously you take it at the finest scale.

Act III: What the Model Teaches Us

Hickam’s Dictum for Causal Paths

Ockham’s Razor says: one cause, one effect, pare it down.

An endpoint analysis applies Ockham to the treatment: 17% risk reduction. Done.

Hickam’s Dictum says: “A patient can have as many diseases as they damn well please.”

If treatment operates through multiple channels simultaneously, collapsing them is oversimplification.


The DPA model is Hickam operationalised:

Direct path: protective throughout
Indirect path: inert, then harmful after day 100
Total effect: attenuated — the endpoint hides the war between paths

Refusing to decompose isn’t caution, it’s a commitment. It assumes all paths point the same way. When they don’t, aggressive simplification is misspecification from the start — a commitment to looking for causation where it isn’t, at the endpoint, between frozen worlds, instead of along the path the process actually took.

Hickam’s Dictum is the medical counter to Ockham’s Razor. Ockham says prefer the simplest explanation. Hickam says that in medicine, patients often have multiple concurrent conditions, and forcing a single diagnosis can be actively harmful. The same logic applies to causal mediation: the treatment doesn’t have one effect, it has multiple effects running through different channels, and they can work in opposite directions. An endpoint analysis that reports “17% risk reduction” is Ockham applied to causal inference — it gives you one clean number but hides the fact that the direct path is giving you 25% while the mediated path is clawing back 8%. The PO school’s refusal to engage with mediation is often framed as principled caution — “the assumptions are too strong, so we won’t decompose.” But this refusal is itself a modelling choice: it treats all causal paths as interchangeable, which is wrong whenever they oppose each other. A model that collapses a 25% benefit and an 8% harm into “17% net effect” is not assumption-free — it has baked in the assumption that path heterogeneity doesn’t matter. That is misspecification, and it’s misspecification you chose at the outset by refusing to look. The DPA model takes on structural assumptions (modularity, sequential unconfoundedness), but it uses them to reveal structure that the simpler model hides. The risk of misspecification from oversimplification can be worse than the risk of misspecification from structural commitment — especially when the structural commitment is inspectable.

Cross-World Independence

The assumption that makes mediation fragile:

\[Y(1, M(0)) \perp\!\!\!\perp M(1) \mid X\]

“The M → Y relationship under treatment is the same as it would be if M had been generated under control.”

Untestable from data
Often implausible — treatment frequently changes the meaning of M

The macro lens doesn’t need it. It’s a strictly weaker identification regime.

Trajectory decomposition

\[\text{total}(t) = \alpha_1(t) + \beta(t)\alpha_2(t)\]

Algebraic identity, not counterfactual subtraction. Identified by sequential within-world unconfoundedness alone. No cross-world assumption.
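The identity can be checked by hand. The sketch below uses hypothetical coefficient values on a coarse time grid, chosen only to mimic the qualitative pattern described in this talk (direct path protective throughout, mediated path inert and then harmful after day 100).

```python
days = [0, 50, 100, 150, 200]

# Hypothetical time-varying coefficients on the additive scale.
alpha_1 = [-0.30, -0.30, -0.30, -0.30, -0.30]  # treatment -> hazard (direct)
beta = [0.50, 0.50, 0.50, 0.50, 0.50]          # treatment -> mediator
alpha_2 = [0.00, 0.00, 0.00, 0.20, 0.30]       # mediator -> hazard

direct = alpha_1
indirect = [b * a2 for b, a2 in zip(beta, alpha_2)]
total = [d + i for d, i in zip(direct, indirect)]

for day, d, i, t in zip(days, direct, indirect, total):
    print(day, round(d, 3), round(i, 3), round(t, 3))
```

Nothing here compares one world against another: the total is assembled from coefficients that all belong to the same fitted system, which is why no cross-world assumption is involved at this step.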

G-computation NDE

Plug \(M(0)\) trajectory into hazard under \(X=1\).

This is a cross-world counterfactual. Pays the full identification price. The price buys you a magnitude.

This is the sharpest philosophical move in the talk. Cross-world independence — Y(1, M(0)) independent of M(1) given X — is the assumption that makes most mediation projects fragile. It says the M-Y relationship under treatment is the same as it would be if M had been generated under control. This is untestable, often implausible because treatment usually changes the meaning of M, and it’s the assumption that has broken hundreds of mediation papers in epidemiology and the social sciences. The deep claim of this slide: the trajectory decomposition does not require this assumption. The path coefficients alpha_1(t) and beta(t)*alpha_2(t) are within-world structural coefficients, identified by sequential within-world unconfoundedness plus correct additive specification. The decomposition total = direct + indirect is an algebraic identity in the additive hazard model, not a counterfactual subtraction. So the cross-world assumption never enters. The g-computation NDE, by contrast, plugs M(0) trajectories into the X=1 hazard equation — that IS the cross-world counterfactual, and it pays the full identification price. The macro lens isn’t a coarser version of the micro lens. It’s a strictly weaker identification regime. That asymmetry — different and weaker, not just different — is what justifies treating macro as primary. Aalen, Røysland, and Strohmaier have argued this in print for the additive hazard mediation framework specifically.

NDE and the Hidden Harm

The trajectory told us which path is harmful and when. Now we pay the modularity cost to answer: how much?

# Mediator trajectories under control
m_nat = idata_trt0.posterior_predictive['obs_m']

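# Hazard under treatment (X = 1), with the control-world mediator.
# alpha_0_t, alpha_1_t, alpha_2_t: posterior draws of the time-varying
# coefficients, indexed by each row's time bin b
eta_nde = (alpha_0_t[b] + alpha_1_t[b] * 1
           + alpha_2_t[b] * m_nat)

# Transform through the softplus link, scaled by exposure, before simulating
hazard_nde = dt * np.logaddexp(0, eta_nde)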

Plug one world’s mediator into another world’s hazard — this is the cross-world counterfactual, requiring full modularity.
The price buys you a number you can subtract from the total.
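A self-contained sketch of the cross-world plug-in, under loud simplifications: the mediator follows its noise-free mean recursion rather than posterior predictive draws, effects are contrasted on the cumulative hazard scale, and every parameter value is hypothetical. It illustrates the bookkeeping, not the talk's fitted results.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mediator_mean(trt, beta, rho):
    """Noise-free mediator recursion M_t = beta(t) * trt + rho * M_{t-1}."""
    path, med_lag = [], 0.0
    for b in beta:
        med = b * trt + rho * med_lag
        path.append(med)
        med_lag = med
    return path


def cumulative_hazard(trt, mediator, alpha_0, alpha_1, alpha_2, dt):
    """Sum dt * softplus(eta) over bins, feeding in the lagged mediator."""
    total, med_lag = 0.0, 0.0
    for t in range(len(alpha_0)):
        eta = alpha_0[t] + alpha_1[t] * trt + alpha_2[t] * med_lag
        total += dt * softplus(eta)
        med_lag = mediator[t]
    return total


n = 5
alpha_0, alpha_1 = [-3.0] * n, [-0.5] * n
alpha_2 = [0.0, 0.0, 0.0, 0.4, 0.6]  # mediator turns harmful late
beta, rho, dt = [1.0] * n, 0.5, 1.0

m0 = mediator_mean(0, beta, rho)  # mediator as it would be under control
m1 = mediator_mean(1, beta, rho)  # mediator as it would be under treatment

h_00 = cumulative_hazard(0, m0, alpha_0, alpha_1, alpha_2, dt)
h_10 = cumulative_hazard(1, m0, alpha_0, alpha_1, alpha_2, dt)  # cross-world
h_11 = cumulative_hazard(1, m1, alpha_0, alpha_1, alpha_2, dt)

nde = h_10 - h_00  # direct: switch treatment, hold mediator at control
nie = h_11 - h_10  # indirect: let the mediator respond to treatment
total_effect = h_11 - h_00

print(round(nde, 4), round(nie, 4), round(total_effect, 4))
```

The middle quantity `h_10` is the one no single experiment produces. It exists only because the hazard equation is assumed to treat a given mediator value the same way whichever world generated it.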

Implication: Do not abandon the treatment. Intervene on the mediator after day 100. The NDE is a cross-world comparison — justified by modularity — but it’s what lets you subtract from the total and reveal the hidden structure.

This slide combines the NDE computation with its actionable consequence. The left column shows the code: plug control-world mediator trajectories into the treatment-world hazard equation. This IS a cross-world counterfactual — it only makes sense if alpha_2(t) acts on M the same way regardless of which intervention generated M. That’s modularity. But the payoff is concrete: the NDE gives 25% risk reduction via the direct path alone. The total effect is only 17%. The difference — the 8% the mediator claws back — is invisible to any endpoint analysis. It’s only visible because the NDE lets you subtract the direct contribution from the total. The clinical recommendation follows: keep the treatment (it works), but manage the mediator after day 100 (where the process coefficients showed it turning harmful). Shape told you where and when. Size told you how much. The NDE is the bridge — it gives you the direct effect as a marginal quantity, which you subtract from the total to reveal the indirect harm.
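The subtraction itself is simple arithmetic. The sketch below uses only the two figures quoted in this talk (25% via the direct path, 17% in total); the derived "share lost" is an illustrative extra, not a number reported in the source.

```python
# Figures quoted in the talk.
direct_reduction = 0.25  # risk reduction via the direct path alone (NDE)
total_reduction = 0.17   # risk reduction seen at the endpoint

# What the mediated path claws back, in percentage points.
clawed_back = direct_reduction - total_reduction

# Illustrative extra: the fraction of the direct benefit that is lost.
share_lost = clawed_back / direct_reduction

print(round(clawed_back, 2), round(share_lost, 2))
```

Roughly a third of the direct benefit is being undone by the mediated path, which is the case for managing the mediator rather than abandoning the treatment.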

What We Built

1. A hybrid model: Pearl’s structural equations + Aalen’s time-varying mechanisms, unified in PyMC with full posterior uncertainty.

2. Two kinds of estimand from one model: coefficients for the shape of the mechanism, g-computation for the size of the effect — including NDE/NIE.

3. Two scales, but not symmetric: the macro lens (spline-smoothed trajectories) identifies the realised process and escapes cross-world independence. The micro lens (point-wise modularity) sizes what macro has already identified.

4. The macro question is the right question; the micro answer sizes what macro identifies. Probabilistic programming is the substrate that makes both answerable — but only one is primary.

The Lesson

Causation lives at the macro scale.

The pre-emption puzzle showed it: ask “did A make a difference?” and you get the wrong answer. Ask “which process transmitted the influence?” and you get the right one.
The trajectory coefficients are not a coarser estimand than the cross-world NDE. They are the right estimand for the question. The NDE is a useful auxiliary — it sizes what the trajectory has already identified.
The endpoint view isn’t caution, it’s a commitment to the wrong lens. So is point-wise modularity when the question is which path did the work?

Two scales. One primary.

Shape (estimand) lives at the macro lens — and only there. The splines identify the realised process — which path, when, what sign.
Size (estimand) is computed at the micro lens via g-computation — but only once the macro lens has identified the process worth sizing.
LOO and sensitivity analysis test whether your answer depends on the precise model specification or is robust to violated assumptions.

Probabilistic programming gives us both scales — but they aren’t symmetric. The macro lens is primary.

Practitioner’s Takeaway

This is the lesson of the talk. The three schools disagreed because they were asking subtly different questions under the same banner of “causation” — and at different levels of granularity. The process question — which path transmitted the influence, and when? — is answered by the time-varying coefficients. These are macro objects: smoothed by the B-spline basis, governed by a random walk prior. The story is in the cumulative trajectory, not in any single time step. That is shape. The difference-making question — how much benefit does the treatment produce through each path? — is answered by g-computation, which requires micro-level modularity: at each time step the structural equations must hold under intervention. That is size. The splines are what bridge these two scales. They encode the pragmatic assumption that fine-grained structural violations — noise in whether modularity holds perfectly at tick 47 versus tick 48 — wash out in the aggregate trajectory. The random walk prior says adjacent time points share structure; the B-spline basis smooths over point-wise variation. Together they operationalise the idea that systematic confounding would show up in the trajectory, while non-systematic noise cancels. LOO cross-validation across knot counts tests this empirically: if causal conclusions are stable from 6 knots to 12, the macro story is robust to micro-level granularity. That is the unification. The process view (macro, trajectory, Aalen) and the counterfactual view (micro, point-wise, Pearl) are not rival accounts of causation. They are complementary questions at different scales, answered by the same model, bridged by the splines, and validated by the same posterior.

Thank You

Resources

Blog post: nathanielf.github.io
PyMC: pymc.io
Fosen et al. (2006), “Dynamic path analysis,” Lifetime Data Analysis
Aalen (1989), “A linear regression model for the analysis of life times,” Statistics in Medicine
Holland (1986), “Statistics and Causal Inference,” JASA
Pearl (2001), “Direct and Indirect Effects,” UAI
Robins & Richardson (2010), “Alternative graphical causal models”
