Hazards on the Causal Path

Three Ways to Say ‘Because’ in Survival Analysis

Nathaniel Forde

Data Science @ Personio

and Open Source Contributor @ PyMC

2026-06-14

Preliminaries

Who am I?

Link to my website

Code or it didn’t Happen

The full worked example can be found on my blog

The Pitch

There are conflicting views of causation and the conflict matters for mediation analysis in survival contexts.

  • Counterfactual view: a causal claim is about probabilistic dependence under intervention.

  • Process view: a causal claim is about the unfolding of influence over time.

Both are intuitive, but they diverge on mediation cases. This talk builds a Bayesian dynamic path analysis (DPA) in PyMC that bridges both views.

Agenda

Act I — The puzzle: Pre-emption, competing risks, and why mediation splits the schools

Act II — The model: A hybrid Bayesian DPA: Pearl’s structural equations + Aalen’s time-varying mechanisms

Act III — The lesson: The lens, not just the answer. Why the macro view is primary — and the cross-world assumption it lets you escape.

Act I: Three Ways to Say “Because”

The Counterfactual Test

The two dominant frameworks in causal inference — potential outcomes and do-calculus — share a common foundation:

“X caused Y” means Y would not have occurred without X.

  • Potential outcomes: \(\text{ATE} = E[Y(1)] - E[Y(0)]\) — the difference between what happens and what would have happened.
  • Do-calculus: \(E[Y \mid do(X=1)] - E[Y \mid do(X=0)]\) — same logic, structural implementation.

Counterfactual difference-making is the shared engine of modern causal inference. Enormously powerful for total effects.

But what about decomposing effects by pathway? That’s where the test starts to strain.

The Pre-emption Puzzle

Two assassins are hired. Assassin A poisons the coffee. Assassin B waits with a sniper rifle as backup.

  • A’s poison works. B never fires. The target dies.

  • Counterfactual test: “Would the target have survived if A hadn’t acted?”

    • No, because B would have fired. So A fails the counterfactual test for being the cause.
  • But A clearly caused the death. The counterfactual test misses the process that did the work.

This is a granularity question. At what fidelity do we locate causes?
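The puzzle can be written down as a tiny structural model. The sketch below is illustrative only: the function name `target_dies` and its arguments are hypothetical, and the rule "B fires only if the poison was never administered" is an assumption that encodes the backup role described above.

```python
def target_dies(a_poisons: bool, b_ready: bool = True) -> dict:
    """Toy structural model of the two-assassin story (illustrative only)."""
    poisoned = a_poisons
    # Assumption: B is a backup and fires only if the poison was never used.
    b_fires = b_ready and not poisoned
    dies = poisoned or b_fires
    return {"poisoned": poisoned, "b_fires": b_fires, "dies": dies}


actual = target_dies(a_poisons=True)
counterfactual = target_dies(a_poisons=False)

# But-for test: does flipping A change the outcome?
a_makes_a_difference = actual["dies"] != counterfactual["dies"]

# Process test: which mechanism was active in the world that happened?
active_process = "poison" if actual["poisoned"] else "rifle"

print(a_makes_a_difference, active_process)
```

The but-for comparison reports no difference, because the target dies in both worlds, while the process question still singles out the poison as the mechanism that ran.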

From Competing Risks to Mediation

  • The assassin example is a competing risks problem

  • Mediation is a similar structure.

    • A treatment works through multiple channels.
    • The challenge is understanding the relative contribution of each channel to the total effect.
  • Attributing credit requires process decomposition

The three schools of causal inference give three different answers — each depending on where they locate causation.

Three Views of Primacy

Each school answers “where does causation live?” differently — and that answer dictates how each handles mediation.

  • Potential Outcomes: primacy of intervention. Rubin, Holland (1986): “no causation without manipulation.” A causal claim must map to an experiment you could run. The natural direct effect (NDE) requires \(Y(1, M(0))\): two simultaneous contrary interventions that no single experiment can implement. On this view, mediation has no operational meaning.
  • Structural Causal Models: primacy of equations. Pearl (2001): causation lives in the stability of structural relations under intervention. Cross-world counterfactuals are computations over the model. Mediation is first-class, if your equations are right.
  • Process / Aalen: primacy of the trajectory. Causation is the unfolding of influence in time. Effects are increments to risk; the causal story is told by the path, not the endpoint. Mediation is decomposed algebraically, and no cross-world counterfactual ever arises.

In causal inference and judicial sentencing, it matters whether you were killed by a man or a werewolf.

The process view doesn’t just sidestep the cross-world problem — it diagnoses it. The micro lens isn’t a stricter version of the macro lens. It’s looking for causation in the wrong place.

Perspectives on Mediation

| | Potential Outcomes | Pearl SCM | Process (Aalen) |
|---|---|---|---|
| Cause is… | Difference between worlds | Structural mechanism | Unfolding process |
| Mediation? | Ill-defined (cross-world) | Well-defined (structural eqns) | Naturally decomposable |
| Strength | Clean identification | Full causal calculus | Temporal dynamics |
| Weakness | No mediation | Untestable structural assumptions | OLS estimator and non-generative inference |
| Granularity | Endpoint | Each equation / node | Trajectory (cumulative) |

What if one model could make the mechanism transparent and deliver both kinds of estimand?

Concrete Mediation Example

Replicating dpasurv

  • A patient starts an exercise programme.

  • It reduces cardiovascular risk directly. Good.

  • But it also causes joint inflammation. The inflammation increases risk over time.

  • At the six-month endpoint, the treatment looks modestly protective.

  • You have no idea it’s being undermined.

To act on this, you need both answers: which process is doing the undermining, and how much benefit is being lost.

The dpasurv benchmark example of a mediation structure. We will replicate this example in PyMC.

The Structure: A Sequence of DAGs

The arrows don’t just exist. They strengthen and weaken over time.

Identification assumption: sequential unconfoundedness

At each time point, conditional on observed history, there are no unmeasured common causes of treatment, mediator, and hazard. The DAG justifies the path decomposition; this assumption justifies calling it causal. In practice, violations may occur, but as long as they are not systematic the decomposition can still be useful for understanding the process.

Act II: Building the Process Model

Why a Generative Model?

Generative AI has made one idea mainstream: a model that can sample is a model that can reason about what might have been.

The same logic applies here. The process view demands a model that generates trajectories — not just estimates coefficients.

  • A partial likelihood (Cox) conditions out time. Time is in the residual.
  • A full Poisson likelihood puts time back in: explicit baseline hazard, time-varying coefficients, and the ability to simulate counterfactual worlds.

Three technical moves make this possible: discretise, smooth, bridge.

Discretise + Smooth

Discretise time into bins

# Each row: who was at risk, how long, what happened
df["dt"] = df["stop"] - df["start"]
bin_idx = df["bin_idx"].values
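To make the binning step concrete, here is a minimal sketch of how one subject's follow-up could be expanded into person-period rows. The helper `to_person_period`, the bin edges, and the example subject are all hypothetical; the column names mirror the `start`, `stop`, `dt` and `bin_idx` fields used above.

```python
def to_person_period(subject_id, followup, event, cut_points):
    """Split one subject's follow-up into rows, one per time bin at risk."""
    rows = []
    for bin_idx in range(len(cut_points) - 1):
        start, end = cut_points[bin_idx], cut_points[bin_idx + 1]
        if followup <= start:
            break  # the subject is no longer at risk in this bin
        stop = min(followup, end)
        rows.append({
            "id": subject_id,
            "bin_idx": bin_idx,
            "start": start,
            "stop": stop,
            "dt": stop - start,  # time at risk inside the bin
            # the event is recorded only in the bin where follow-up ends
            "event": int(event and followup <= end),
        })
    return rows


cuts = [0, 30, 60, 90, 120]  # hypothetical bin edges, in days
rows = to_person_period("p1", followup=75, event=True, cut_points=cuts)
for r in rows:
    print(r)
```

A subject who has the event on day 75 contributes two full bins without an event and a partial third bin that carries the event.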

Smooth with B-splines

# Random walk prior on spline coefficients
beta_raw = pm.Normal("beta_raw", 0, 1,
                     dims=('splines', 'tv'))
coef_alpha = pt.cumsum(beta_raw * sigma_smooth,
                       axis=0)
alpha_1_t = pt.dot(basis, coef_alpha[:, 1])

Smoothing splines chosen by LOO cross-validation

B-spline basis functions and a weighted sum forming a smooth trajectory
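The random-walk construction can be illustrated without PyMC. The sketch below is a simplification under stated assumptions: it uses degree-1 B-splines (hat functions) rather than whatever degree the full model uses, and the knot locations, `sigma_smooth` value, and helper names are hypothetical.

```python
import random


def hat_basis(t, knots):
    """Degree-1 B-spline (hat function) basis evaluated at time t."""
    values = []
    for k, centre in enumerate(knots):
        left = knots[k - 1] if k > 0 else centre
        right = knots[k + 1] if k < len(knots) - 1 else centre
        if t == centre:
            values.append(1.0)
        elif left < t < centre:
            values.append((t - left) / (centre - left))
        elif centre < t < right:
            values.append((right - t) / (right - centre))
        else:
            values.append(0.0)
    return values


def random_walk(innovations, sigma_smooth):
    """Cumulative sum of scaled innovations, so neighbouring coefficients stay close."""
    coefs, running = [], 0.0
    for z in innovations:
        running += sigma_smooth * z
        coefs.append(running)
    return coefs


random.seed(1)
knots = [0, 50, 100, 150, 200]                  # hypothetical knot locations (days)
beta_raw = [random.gauss(0, 1) for _ in knots]  # stands in for a Normal(0, 1) draw
coefs = random_walk(beta_raw, sigma_smooth=0.3)

# The trajectory at time t is the basis-weighted sum of the coefficients.
alpha_t = [
    sum(b * c for b, c in zip(hat_basis(t, knots), coefs))
    for t in range(0, 201, 25)
]
```

A smaller `sigma_smooth` shrinks the steps between adjacent coefficients, which is what makes the resulting trajectory smooth.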

The Poisson Bridge

Once time is discrete, binary events become counts capped at 1 — and counts have a natural likelihood.

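\[d_{it} \sim \text{Poisson}(\Lambda_{it}), \quad \Lambda_{it} = Y_{it} \cdot \Delta t \cdot \lambda(t), \quad \lambda(t) = f(\text{linear predictor})\]

Here \(Y_{it}\) indicates whether subject \(i\) is at risk in bin \(t\), \(\Delta t\) is the exposure time in the bin, and \(\lambda(t)\) is the hazard.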

The baseline hazard becomes an explicit parameter, not a nuisance to be conditioned away.

This is what makes the model generative. We can sample from it. We can intervene on it.
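As a rough illustration of the bridge, the log-likelihood contribution of each person-period row can be computed by hand. This is a sketch under assumptions: the link is taken to be softplus, and the helper names and the three example rows are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps the hazard positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def poisson_logpmf(k, mu):
    """Log of the Poisson probability of observing k events with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def row_loglik(event, dt, eta, at_risk=1):
    """Log-likelihood contribution of one person-period row."""
    if not at_risk:
        return 0.0  # rows outside the risk set contribute nothing
    mu = dt * softplus(eta)  # expected event count in the bin
    return poisson_logpmf(event, mu)


# Hypothetical rows: (event indicator, exposure time, linear predictor)
rows = [(0, 30.0, -6.0), (0, 30.0, -5.5), (1, 15.0, -5.0)]
total = sum(row_loglik(d, dt, eta) for d, dt, eta in rows)
print(total)
```

For an event-free row the contribution reduces to minus the expected count, so long exposure with no event pulls the hazard down.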

We draw our sample trajectories on causal pathways with time-varying coefficients indexed to each temporal stage.

The Joint Model

Mediator equation:

\[M_t = \beta(t) \cdot X + \rho \cdot M_{t-1} + \epsilon\]

The mediator has inertia. Its dynamics are modelled, not assumed away.

Hazard equation:

\[\lambda(t) = f\big(\alpha_0(t) + \alpha_1(t) X + \alpha_2(t) M_{t-1}\big)\]

Direct and indirect paths, both time-varying.

These are estimated simultaneously. One likelihood, one posterior.

  • This is a hybrid: structural equations in Pearl’s sense, time-varying mechanisms in Aalen’s sense. PyMC is the substrate that makes both cohabit — the probabilistic programming framework estimates the structure, tracks the process, and propagates uncertainty through both.
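To see what "generative" means for the two equations, here is a minimal forward simulation of one subject. It is a sketch, not the fitted model: every coefficient value is made up, the link is assumed to be softplus, and `simulate_subject` is a hypothetical helper.

```python
import math
import random


def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simulate_subject(x, n_bins, dt, coef, rho, sigma_m, rng):
    """Simulate one subject's mediator path and event bin (None = no event)."""
    m_prev, path = 0.0, []
    for t in range(n_bins):
        # Hazard equation: uses the lagged mediator M_{t-1}
        eta = (coef["alpha_0"][t]
               + coef["alpha_1"][t] * x
               + coef["alpha_2"][t] * m_prev)
        hazard = softplus(eta)
        # Probability of at least one event in the bin under a Poisson count
        p_event = 1.0 - math.exp(-hazard * dt)
        if rng.random() < p_event:
            return path, t
        # Mediator equation: AR(1) with a time-varying treatment effect
        m_prev = coef["beta"][t] * x + rho * m_prev + rng.gauss(0.0, sigma_m)
        path.append(m_prev)
    return path, None


n_bins = 6
coef = {  # hypothetical values, for illustration only
    "alpha_0": [-5.0] * n_bins,
    "alpha_1": [-0.5] * n_bins,
    "alpha_2": [0.0, 0.0, 0.0, 0.4, 0.4, 0.4],
    "beta": [0.0, 0.0, 0.5, 1.0, 1.0, 1.0],
}
rng = random.Random(42)
path, event_bin = simulate_subject(
    x=1, n_bins=n_bins, dt=30.0, coef=coef, rho=0.6, sigma_m=0.1, rng=rng
)
```

Calling the function repeatedly with `x=1` and `x=0` yields two populations of trajectories, which is the raw material for the contrasts discussed later in the talk.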

The Model in Code

with pm.Model() as model:
    # 1. Time-varying coefficients via random-walk splines
    beta_raw = pm.Normal("beta_raw", 0, 1,
                        dims=('splines', 'tv'))
    coef_alpha = pt.cumsum(beta_raw * sigma_smooth, axis=0)

    # 2. Mediator equation  (Pearl's X → M, Aalen's β(t))
    mu_m = beta_t[b] * trt + rho * med_lag
    pm.Normal("obs_m", mu=mu_m, sigma=sigma_m,
            observed=med)

    # 3. Hazard equation  (Pearl's X → Y + M → Y, Aalen's α(t))
    eta = (alpha_0_t[b]
        + alpha_1_t[b] * trt        # direct path
        + alpha_2_t[b] * med_lag)   # mediated path

    # 4. Softplus link → Poisson likelihood
    # Registered as a Deterministic so it can be requested by name later
    Lambda = pm.Deterministic("Lambda", dt * pm.math.log1pexp(eta))
    pm.Poisson("obs_event", mu=Lambda,
            observed=events)

    # 5. Path decomposition — falls out of the structure
    pm.Deterministic("direct",   alpha_1_t)
    pm.Deterministic("indirect", beta_t * alpha_2_t)
    pm.Deterministic("total",    alpha_1_t + beta_t * alpha_2_t)

    # 6. One call estimates everything jointly
    idata = pm.sample(draws=2000, tune=2000,
                    nuts_sampler="numpyro")

Direct and Indirect Effects

(Latent Additive Scale)

Left: direct effect over time reproduces the data. Middle: indirect effect. Right: Total Effect

Here we plot the time-varying path coefficients. The direct effect is protective throughout. The indirect effect is inert, then harmful after day 100. The total effect is attenuated, recovering the dpasurv pattern.

G-Computation for the Marginal Contrasts

# Assumes 'trt' was registered with pm.Data and 'Lambda' with
# pm.Deterministic inside `model`
with model:
    # World 1: everyone treated
    pm.set_data({'trt': np.ones(n)})
    idata_trt1 = pm.sample_posterior_predictive(
        idata, var_names=['Lambda', 'obs_m', 'obs_event'])

    # World 0: nobody treated
    pm.set_data({'trt': np.zeros(n)})
    idata_trt0 = pm.sample_posterior_predictive(
        idata, var_names=['Lambda', 'obs_m', 'obs_event'])

We use posterior predictive sampling to generate two counterfactual worlds. Same population. Same model.

The mediator evolves naturally under each intervention and we can derive time-varying hazards and survival curves for both.
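The logic of the two worlds can be sketched with fixed coefficients in place of posterior draws. This is a simplified illustration under assumptions: the mediator follows its expected path with the noise term omitted, the link is softplus, and all coefficient values and the helper `survival_under` are hypothetical.

```python
import math


def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def survival_under(x, coef, rho, dt):
    """Survival curve when treatment is set to x for everyone.

    The mediator follows its expected path under x (noise omitted for clarity).
    """
    m_prev, cum_hazard, curve = 0.0, 0.0, []
    for t in range(len(coef["beta"])):
        eta = (coef["alpha_0"][t]
               + coef["alpha_1"][t] * x
               + coef["alpha_2"][t] * m_prev)
        cum_hazard += softplus(eta) * dt
        curve.append(math.exp(-cum_hazard))
        # The mediator responds to the intervention before the next bin
        m_prev = coef["beta"][t] * x + rho * m_prev
    return curve


coef = {  # hypothetical values, for illustration only
    "alpha_0": [-5.0] * 6,
    "alpha_1": [-0.5] * 6,
    "alpha_2": [0.0, 0.0, 0.0, 0.4, 0.4, 0.4],
    "beta": [0.0, 0.0, 0.5, 1.0, 1.0, 1.0],
}
s1 = survival_under(1, coef, rho=0.6, dt=30.0)  # world 1: everyone treated
s0 = survival_under(0, coef, rho=0.6, dt=30.0)  # world 0: nobody treated

risk_difference = (1.0 - s1[-1]) - (1.0 - s0[-1])
print(risk_difference)
```

In the full model the same contrast would be computed per posterior draw, which is what gives the marginal effect its uncertainty interval.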

Translating to Hazards

(Observed Scale)

Here we show the hazards propagated under each intervention and compare them to obtain the relative-risk estimand.

The Dual Output

Two independent axes — estimand (what are you asking?) and lens (where do you locate causation?):

| | Macro lens (realised process) | Micro lens (point-wise modularity) |
|---|---|---|
| Shape (which path) | ✓ trajectory coefficients | ✗ category error: pre-emption shows the cause is the process, not a point-wise modular fact |
| Size (how much) | n/a (needs structural commitment to size) | g-computation, NDE/NIE |

The diagonals carry the work. Shape lives at the macro lens — and only there. Size can be computed at the micro lens, but only because we accept structural commitments as a price — not because the micro lens is the right home for causal claims.

The asymmetry

The macro lens is primary. The micro lens earns its keep only when there’s a process already identified to be sized. G-computation is parasitic on the trajectory, not equipotent with it.

Act III: What the Model Teaches Us

Hickam’s Dictum for Causal Paths

Ockham’s Razor says: one cause, one effect, pare it down.

  • An endpoint analysis applies Ockham to the treatment: 17% risk reduction. Done.

Hickam’s Dictum says: “A patient can have as many diseases as they damn well please.”

  • If treatment operates through multiple channels simultaneously, collapsing them is oversimplification.

The DPA model is Hickam operationalised:

  • Direct path: protective throughout
  • Indirect path: inert, then harmful after day 100
  • Total effect: attenuated — the endpoint hides the war between paths

Refusing to decompose isn’t caution, it’s a commitment. It assumes all paths point the same way. When they don’t, aggressive simplification is misspecification from the start — a commitment to looking for causation where it isn’t, at the endpoint, between frozen worlds, instead of along the path the process actually took.

Cross-World Independence

The assumption that makes mediation fragile:

\[Y(1, M(0)) \perp\!\!\!\perp M(1) \mid X\]

“The M → Y relationship under treatment is the same as it would be if M had been generated under control.”

  • Untestable from data
  • Often implausible — treatment frequently changes the meaning of M

The macro lens doesn’t need it. It’s a strictly weaker identification regime.

Trajectory decomposition

\[\text{total}(t) = \alpha_1(t) + \beta(t)\alpha_2(t)\]

Algebraic identity, not counterfactual subtraction. Identified by sequential within-world unconfoundedness alone. No cross-world assumption.
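A small worked example shows how the identity behaves bin by bin. The coefficient values below are made up for illustration and only echo the qualitative pattern described in this talk (a protective direct path, an indirect path that switches on later).

```python
# Hypothetical coefficient values per time bin (for illustration only)
alpha_1 = [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5]  # direct path: X -> hazard
beta = [0.0, 0.0, 0.5, 1.0, 1.0, 1.0]           # X -> M
alpha_2 = [0.0, 0.0, 0.0, 0.4, 0.4, 0.4]        # M -> hazard

direct = alpha_1
indirect = [b * a2 for b, a2 in zip(beta, alpha_2)]
total = [d + i for d, i in zip(direct, indirect)]

for t, (d, i, tot) in enumerate(zip(direct, indirect, total)):
    print(f"bin {t}: direct={d:+.2f} indirect={i:+.2f} total={tot:+.2f}")
```

With these numbers the total stays negative but shrinks toward zero in the later bins, so an endpoint summary would understate the direct benefit.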

G-computation NDE

Plug \(M(0)\) trajectory into hazard under \(X=1\).

This is a cross-world counterfactual. Pays the full identification price. The price buys you a magnitude.

NDE and the Hidden Harm

The trajectory told us which path is harmful and when. Now we pay the modularity cost to answer: how much?

# Mediator trajectories under control: the natural mediator M(0)
m_nat = idata_trt0.posterior_predictive['obs_m']

# Linear predictor under treatment (X = 1) with the control mediator.
# alpha_0, alpha_1, alpha_2 are the time-varying coefficients
# evaluated at each row's time bin.
eta_nde = (alpha_0 + alpha_1 * 1
           + alpha_2 * m_nat)

# Transform through the softplus link (log1pexp) and scale by exposure time
hazard_nde = dt * np.logaddexp(0, eta_nde)

  • Plug one world’s mediator into another world’s hazard: this is the cross-world counterfactual, and it requires full modularity.
  • The price buys you a number you can subtract from the total.

Implication: Do not abandon the treatment. Intervene on the mediator after day 100. The NDE is a cross-world comparison — justified by modularity — but it’s what lets you subtract from the total and reveal the hidden structure.
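The subtraction can be sketched on a toy version of the model. Everything here is an assumption made for illustration: fixed coefficients instead of posterior draws, a noise-free mediator path, a softplus link, effects measured on the cumulative-hazard scale, and hypothetical helper names.

```python
import math


def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mediator_path(x, beta, rho):
    """Expected lagged mediator M_{t-1} seen by each bin when treatment is x."""
    m_prev, lagged = 0.0, []
    for b in beta:
        lagged.append(m_prev)
        m_prev = b * x + rho * m_prev
    return lagged


def cumulative_hazard(x, m_lag, coef, dt):
    """Cumulative hazard with treatment x and a supplied mediator path."""
    return sum(
        softplus(a0 + a1 * x + a2 * m) * dt
        for a0, a1, a2, m in zip(
            coef["alpha_0"], coef["alpha_1"], coef["alpha_2"], m_lag
        )
    )


coef = {  # hypothetical values, for illustration only
    "alpha_0": [-5.0] * 6,
    "alpha_1": [-0.5] * 6,
    "alpha_2": [0.0, 0.0, 0.0, 0.4, 0.4, 0.4],
    "beta": [0.0, 0.0, 0.5, 1.0, 1.0, 1.0],
}
m0 = mediator_path(0, coef["beta"], rho=0.6)  # mediator in the control world
m1 = mediator_path(1, coef["beta"], rho=0.6)  # mediator in the treated world

h11 = cumulative_hazard(1, m1, coef, dt=30.0)  # treated, own mediator
h10 = cumulative_hazard(1, m0, coef, dt=30.0)  # cross-world: treated, control mediator
h00 = cumulative_hazard(0, m0, coef, dt=30.0)  # control, own mediator

nde = h10 - h00    # natural direct effect
nie = h11 - h10    # natural indirect effect
total = h11 - h00  # total effect
print(nde, nie, total)
```

In this toy setup the direct effect is protective and the indirect effect is harmful, so the total understates what the treatment would achieve if the mediator were held at its control path.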

What We Built

1. A hybrid model: Pearl’s structural equations + Aalen’s time-varying mechanisms, unified in PyMC with full posterior uncertainty.

2. Two kinds of estimand from one model: coefficients for the shape of the mechanism, g-computation for the size of the effect — including NDE/NIE.

3. Two scales, but not symmetric: the macro lens (spline-smoothed trajectories) identifies the realised process and escapes cross-world independence. The micro lens (point-wise modularity) sizes what macro has already identified.

4. The macro question is the right question; the micro answer sizes what macro identifies. Probabilistic programming is the substrate that makes both answerable — but only one is primary.

The Lesson

Causation lives at the macro scale.

  • The pre-emption puzzle showed it: ask “did A make a difference?” and you get the wrong answer. Ask “which process transmitted the influence?” and you get the right one.

  • The trajectory coefficients are not a coarser estimand than the cross-world NDE. They are the right estimand for the question. The NDE is a useful auxiliary — it sizes what the trajectory has already identified.

  • The endpoint view isn’t caution, it’s a commitment to the wrong lens. So is point-wise modularity when the question is which path did the work?

Two scales. One primary.

  • Shape (estimand) lives at the macro lens — and only there. The splines identify the realised process — which path, when, what sign.
  • Size (estimand) is computed at the micro lens via g-computation — but only once the macro lens has identified the process worth sizing.
  • LOO and Sensitivity Analysis test whether your answer depends on the precise model or is robust to violated assumptions.

Probabilistic programming gives us both scales — but they aren’t symmetric. The macro lens is primary.

Practitioner’s Takeaway

Thank You

Resources

  • Blog post: nathanielf.github.io
  • PyMC: pymc.io
  • Fosen et al. (2006), “Dynamic path analysis,” Lifetime Data Analysis
  • Aalen (1989), “A linear regression model for the analysis of life times,” Statistics in Medicine
  • Holland (1986), “Statistics and Causal Inference,” JASA
  • Pearl (2001), “Direct and Indirect Effects,” UAI
  • Robins & Richardson (2010), “Alternative graphical causal models”