Structural Causal Models in PyMC

Conditionalisation Strategies and Valid Causal Inference

Nathaniel Forde

Data Science @ Personio

and Open Source Contributor @ PyMC

1/11/25

The Pitch

Valid causal claims are of paramount importance in science. Structural causal models impose extra-statistical constraints on the derivation of causal claims from data.

These constraints can be understood concretely with reference to the SEM literature. We outline the development of structural causal models in the Bayesian setting highlighting the use of the do-operator on generative model graph structures.

Agenda

  • Valid Causal Structure
    • Form and Content of Causal Inference
    • Abstract Conception
    • Concrete Conception
    • The Bayesian Phrasing
    • Implementation in Code
  • Measurement Models
    • Measurement Models Abstract Conception
    • Bayesian Phrasing: Concrete Example
    • Bayesian Phrasing: Code in PyMC
    • Model Fit and Diagnostics
  • Structural Equation Models
    • Adding Structural Components
    • The Generative Model in PyMC
    • Model Comparison: CFA and SEM
  • Sensitivity Analysis and the Do-Operator
    • The Do-Calculus and Sensitivity Analysis
    • What-if Structures and the Do-Operator
    • The Do-Operator in PyMC
    • Causal Estimands under Intervention

Preliminaries

Who am I?

Code or it didn’t Happen

My Website

Valid Causal Structure

Form and Content of Causal Inference

Directed Acyclic Graphs and their Meaning

“Every proposition has a content and a form. We get the picture of the pure form if we abstract from the meaning of the single words, or symbols (so far as they have independent meanings)… By syntax in this general sense of the word I mean the rules which tell us in which connections only a word [makes] sense, thus excluding nonsensical structures.” - Wittgenstein Some Remarks on Logical Form

Sets of Admissable Graphs \[ \psi | \neg \psi | \psi \rightarrow \phi \]

Sets of Admissable Graphs and Well formed valid Fomulae

Abstract Conception

Non-Parametric Structural Diagrams

Non-parametric Structural Causal Models highlight the aspects of the Data Generating processes that threaten the valid construction of a causal claim.

\[ Y \sim f_{Y}(Z, X, U_{Y}) \] \[ Z \sim f_{Z}(X, IV, U_{Z})\] \[ X \sim f_{X}(U_{X})\] \[ IV \sim f_{IV}(U_{IV})\]

Concrete Conception

Parametric Approximation via Regression

\[ y \sim 1 + Z + u \] Regression Approximation to estimating valid coefficients in systems of simultaneous equation.

\[ y \sim 1 + \widehat{Z} + u \] \[ \widehat{Z} \sim 1 + IV + u \]

The Bayesian Phrasing

\[\begin{align*} \left( \begin{array}{cc} Y \\ Z \end{array} \right) & \sim \text{MultiNormal}(\color{green} \mu, \color{purple} \Sigma) \\ \color{green} \mu & = \left( \begin{array}{cc} \mu_{y} \\ \mu_{z} \end{array} \right) = \left( \begin{array}{cc} \beta_{00} + \color{blue} \beta_{01}Z ... \\ \beta_{10} + \beta_{11}IV ... \end{array} \right) \end{align*} \]

The treatment effect \(\color{blue}\beta_{01}\) of is the primary quantity of interest

\[ \color{purple} \Sigma = \begin{bmatrix} 1 & \color{purple} \sigma \\ \color{purple} \sigma & 1 \end{bmatrix} \] But the estimation depends on the multivariate realisation of the data

  • Even the “simple” IV design is a structural causal model.
  • The crucial component is the covariance structure of the joint-distribution and the instrument’s theoretical validity
  • The Bayesian Estimation strategy estimates the IV model by seeking a parameterisation where the potential outcomes are conditionally exchangeable.

\[ p((y_{1}, z_{1})^{T} ....(y_{q}, z_{q})^{T}) = \dfrac{p(YZ | \theta)p(\theta)}{\sum p_{i}(YZ)} \]

\[ = \dfrac{p( (y_{1}, z_{1})^{T} ....(y_{q}, z_{q})^{T}) | \Sigma , \beta)p(\Sigma , \beta) }{\sum p_{i}(YZ)} \]

Implementation in Code

PyMC model context

## PyMC model context
with pm.Model() as iv_model:
    # Initialising Regression Priors
    beta_t = pm.Normal(
        name="beta_t",
        mu=priors["mus_t"],
        sigma=priors["sigma_t"],
        dims="instruments")

    beta_z = pm.Normal(
        name="beta_z",
        mu=priors["mus_z"],
        sigma=priors["sigma_z"],
        dims="covariates")
    
    # Covariance Prior Parameters
    sd_dist = pm.Exponential.dist(priors["lkj_sd"], shape=2)
    chol, corr, sigmas = pm.LKJCholeskyCov(
        name="chol_cov",
        eta=priors["eta"],
        n=2,
        sd_dist=sd_dist,
    )
    # compute and store the covariance matrix
    pm.Deterministic(name="cov", 
    var=pt.dot(l=chol, r=chol.T))

    # --- Parameterization ---
    # focal regression
    mu_y = pm.Deterministic(name="mu_y", 
    var=pm.math.dot(X, beta_z))
    # instrumental regression
    mu_t = pm.Deterministic(name="mu_t", 
    var=pm.math.dot(Z, beta_t))
    
    # Stack regressions for MvNormal
    mu = pm.Deterministic(name="mu", 
    var=pt.stack(tensors=(mu_y, mu_t), 
    axis=1))

    # --- Likelihood ---
    pm.MvNormal(
        name="likelihood",
        mu=mu,
        chol=chol,
        observed=np.stack(arrays=(y, t), axis=1),
        shape=(X.shape[0], 2),
    )

Functional Form and the State of the World

Credible Maps

Causal Structural Models generate valid causal claims just when:

  • the mathematical model is apt to account for risks of confounding in the assumed data generating process
  • the model parameters or theoretical estimands can be properly identified and estimated with the data available.
  • the assumed data generating process is a close enough approximation of the actual process
  • The set of admissable causal claims is constrained by the substantive extra-statistical requirement to credibly map model to the world

Theory and Credibility

\[ \Bigg( \begin{array}{cc} Y \sim f_{Y}(Z, X, U_{Y}) \\ Z \sim f_{Z}(X, IV, U_{Z}) \end{array} \Bigg) \approxeq \Bigg[ \begin{array}{cc} Y \sim 1 + \widehat{Z} + u \\ \widehat{Z} \sim 1 + IV + u \end{array} \Bigg] \\ \approxeq \dfrac{p( (y_{1}, z_{1})^{T} ....(y_{q}, z_{q})^{T}) | \color{purple}\Sigma ,\color{blue}\beta \color{black} )p(\color{purple}\Sigma ,\color{blue}\beta \color{black}) }{\sum p_{i}(YZ)} \]

Measurement Models

Measurement Models in SCMs

\[ ADJUST \sim f_{ADJUST}(motiv, harm, stabi, U) \] \[ RISK \sim f_{RISK}(ppsych, ses, verbal, U)\] \[ ACHIEVE \sim f_{ACHIEVE}(read, arith, spell, U)\]

Non-parametric phrasing of SCMs under-specifies the relations of interest.

CFA models estimates the multivariate correlation structure with a focus on relations between the factors.

Measurement Models

and Factor Structures

The Confirmatory Factor Model can also be phrased in the Bayesian way.

\[p(\mathbf{x_{i}}^{T}.....\mathbf{x_{q}}^{T} | \text{Ksi}, \Psi, \tau, \Lambda) \sim Normal(\tau + \Lambda\cdot \text{Ksi}, \Psi)\]

\[ \lambda_{11} ..... \lambda_{21} ..... \lambda_{31} \in \Lambda \]

  motiv harm stabi ppsych ses verbal read arith spell
0 -7.907122 -5.075312 -3.138836 -17.800210 4.766450 -3.633360 -3.488981 -9.989121 -6.567873
1 1.751478 -4.155847 3.520752 7.009367 -6.048681 -7.693461 -4.520552 8.196238 8.778973
2 14.472570 -4.540677 4.070600 23.734260 -16.970670 -3.909941 -4.818170 7.529984 -5.688716
3 -1.165421 -5.668406 2.600437 1.493158 1.396363 21.409450 -3.138441 5.730547 -2.915676
4 -4.222899 -10.072150 -6.030737 -5.985864 -18.376400 -1.438816 -2.009742 -0.623953 -1.024624

Measurement Model Graph

Measurement Models in PyMC


coords = {
    "obs": list(range(len(df))),
    "indicators": ["motiv", "harm", "stabi", "verbal", "ses", "ppsych", "read", "arith", "spell"],
    "indicators_1": ["motiv", "harm", "stabi"],
    "indicators_2": ["verbal", "ses", "ppsych"],
    "indicators_3": ["read", "arith", "spell"],
    "latent": ["adjust", "risk", "achieve"],
}

def make_lambda(indicators, name='lambdas1', priors=[1, 10]):
    """ Takes an argument indicators which is a string in the coords dict"""
    temp_name = name + '_'
    lambdas_ = pm.Normal(temp_name, priors[0], priors[1], dims=(indicators))
    # Force a fixed scale on the factor loadings for factor 1
    lambdas_1 = pm.Deterministic(
        name, pt.set_subtensor(lambdas_[0], 1), dims=(indicators)
    )
    return lambdas_1

obs_idx = list(range(len(df)))
with pm.Model(coords=coords) as model:

    # Set up Factor Loadings
    lambdas_1 = make_lambda('indicators_1', 'lambdas1')
    lambdas_2 = make_lambda('indicators_2', 'lambdas2', priors=[1, 2])
    lambdas_3 = make_lambda('indicators_3', 'lambdas3')
    
    # Specify covariance structure between latent factors
    sd_dist = pm.Exponential.dist(1.0, shape=3)
    chol, _, _ = pm.LKJCholeskyCov("chol_cov", n=3, eta=2, sd_dist=sd_dist, compute_corr=True)
    ksi = pm.MvNormal("ksi", 0, chol=chol, dims=("obs", "latent"))

    # Construct Pseudo Observation matrix based on Factor Loadings
    m1 = ksi[obs_idx, 0] * lambdas_1[0]
    m2 = ksi[obs_idx, 0] * lambdas_1[1]
    m3 = ksi[obs_idx, 0] * lambdas_1[2]
    m4 = ksi[obs_idx, 1] * lambdas_2[0]
    m5 = ksi[obs_idx, 1] * lambdas_2[1]
    m6 = ksi[obs_idx, 1] * lambdas_2[2]
    m7 = ksi[obs_idx, 2] * lambdas_3[0]
    m8 = ksi[obs_idx, 2] * lambdas_3[1]
    m9 = ksi[obs_idx, 2] * lambdas_3[2]

    mu = pm.Deterministic("mu", pm.math.stack([m1, m2, m3, m4, m5, m6, m7, m8, m9]).T)

    ## Error Terms
    Psi = pm.InverseGamma("Psi", 2, 13, dims="indicators")

    # Likelihood
    _ = pm.Normal(
        "likelihood",
        mu,
        Psi,
        observed=df[coords['indicators']].values,
    )

Model Estimates: Factor Loadings

The weights accorded to each of the indicator metrics on each of the specified factors.

Healthy traceplots indicating reasonable exploration of the sample space

Model Fit: Covariance Structures

The residuals between the sample covariance and model predicted covariances are largely small

Model Fit: Posterior Predictive Checks

The model can successfully retrodict the observed data

Model Implications: Latent Scores

What are the relations between these Latent components?

Structural Equation Models

Adding Structural Components

# measurement model
adjust =~ motiv + harm + stabi
risk =~ verbal + ses + ppsych
achieve =~ read + arith + spell

# focal regressions
adjust ~ risk 
achieve ~ adjust + risk

\[SEM : \underbrace{\text{Ksi}^{*}}_{latent} = \underbrace{\mathbf{B}\text{Ksi}}_{structural \\ component} + \underbrace{\Lambda\cdot \text{Ksi}}_{measurement \\ component}\]

\[CFA : \underbrace{\text{Ksi}^{*}}_{latent} = \underbrace{\mathbf{0}\text{Ksi}}_{structural \\ component} + \underbrace{\Lambda\cdot \text{Ksi}}_{measurement \\ component}\]

The Structural Matrix

\[ \overbrace{\text{Ksi}}^{latent} = \begin{bmatrix} \overbrace{2.5}^{adjust} & \overbrace{3.2}^{risk} & \overbrace{7.3}^{achieve} \\ ... & ... & ... \\ ... & ... & ... \\ ... & ... \\ 3.9 & ...\end{bmatrix} \begin{bmatrix} 0 & \beta_{12} & \beta_{13} \\ \beta_{21} & 0 & \beta_{23} \\ \beta_{31} & \beta_{32} & 0 \end{bmatrix} = \underbrace{\mathbf{B}}_{Regression \\ Coefficients} \]

The Structural Matrix

\[ \overbrace{\text{Ksi}}^{latent} = \begin{bmatrix} \overbrace{2.5}^{adjust} & \overbrace{3.2}^{risk} & \overbrace{7.3}^{achieve} \\ ... & ... & ... \\ ... & ... & ... \\ ... & ... \\ 3.9 & ...\end{bmatrix} \begin{bmatrix} 0 & 0 & \color{blue}{\beta_{13}} \\ \color{blue}{\beta_{21}} & 0 & \color{blue}{\beta_{23}} \\ 0 & 0 & 0 \end{bmatrix} = \underbrace{\mathbf{B}}_{Regression \\ Coefficients} \]

The specification of the structural matrix encodes the candidate regression relations which are to be evaluated in the SEM fit.

adjust ~ risk 
achieve ~ adjust + risk

The Generative Model in PyMC


coords = {
    "obs": list(range(len(df))),
    "indicators": ["motiv", "harm", "stabi", "verbal", "ses", "ppsych", "read", "arith", "spell"],
    "indicators_1": ["motiv", "harm", "stabi"],
    "indicators_2": ["verbal", "ses", "ppsych"],
    "indicators_3": ["read", "arith", "spell"],
    "latent": ["adjust", "risk", "achieve"],
    "paths": ["risk->adjust", "risk-->achieve", "adjust-->achieve"]
    }

    obs_idx = list(range(len(df)))
    with pm.Model(coords=coords) as model:

        # Set up Factor Loadings
        lambdas_1 = make_lambda('indicators_1', 'lambdas1', priors=[1, .5]) #adjust
        lambdas_2 = make_lambda('indicators_2', 'lambdas2', priors=[1, .5]) # risk
        lambdas_3 = make_lambda('indicators_3', 'lambdas3', priors=[1, .5]) # achieve
        
        # Specify covariance structure between latent factors
        sd_dist = pm.Exponential.dist(1.0, shape=3)
        chol, _, _ = pm.LKJCholeskyCov("chol_cov", n=3, eta=2, sd_dist=sd_dist, compute_corr=True)
        ksi = pm.MvNormal("ksi", 0, chol=chol, dims=("obs", "latent"))

        ## Build Regression Components
        coefs = pm.Normal('betas', 0, .5, dims='paths')
        zeros = pt.zeros((3, 3))
        ## adjust ~ risk
        zeros1 = pt.set_subtensor(zeros[1, 0], coefs[0])
        ## achieve ~ risk + adjust
        zeros2 = pt.set_subtensor(zeros1[1, 2], coefs[1])
        coefs_ = pt.set_subtensor(zeros2[0, 2], coefs[2])
        
        structural_relations = pm.Deterministic('endogenous_structural_paths', 
        pm.math.dot(ksi, coefs_))

        # Construct Pseudo Observation matrix based on Factor Loadings
        m1 = ksi[obs_idx, 0] * lambdas_1[0] +  structural_relations[obs_idx, 0] #adjust
        m2 = ksi[obs_idx, 0] * lambdas_1[1] +  structural_relations[obs_idx, 0] #adjust
        m3 = ksi[obs_idx, 0] * lambdas_1[2] +  structural_relations[obs_idx, 0] #adjust
        m4 = ksi[obs_idx, 1] * lambdas_2[0] +  structural_relations[obs_idx, 1]  #risk
        m5 = ksi[obs_idx, 1] * lambdas_2[1] +  structural_relations[obs_idx, 1]  #risk
        m6 = ksi[obs_idx, 1] * lambdas_2[2] +  structural_relations[obs_idx, 1]  #risk
        m7 = ksi[obs_idx, 2] * lambdas_3[0] +  structural_relations[obs_idx, 2]  #achieve
        m8 = ksi[obs_idx, 2] * lambdas_3[1] +  structural_relations[obs_idx, 2]  #achieve
        m9 = ksi[obs_idx, 2] * lambdas_3[2] +  structural_relations[obs_idx, 2]  #achieve

        mu = pm.Deterministic("mu", pm.math.stack([m1, m2, m3, m4, m5, m6, m7, m8, m9]).T)


        ## Error Terms
        Psi = pm.InverseGamma("Psi", 2, 13, dims="indicators")

        # Likelihood
        _ = pm.Normal(
            "likelihood",
            mu,
            Psi,
            observed=df[coords['indicators']].values,
        )

Model Comparison

Local Model checks

Model ranked on Leave one out cross validation

SEM model achieves better performance on the local and global model checks than the CFA model

The Do-Calculus and Sensitivity Analysis

Causal Hierarchy

The graph algebra of the do-calculus sets rules on admissable structures required to warrant valid causal claims when accounting for different species of confounding.

They complement the development and analysis of SEM models, providing minimalist admissablility conditions for a causal interpretation of these structural relations.

The substantive justification of the causal claims implicit in a SEM model need to be made more explicitly by the researcher with reference to evaluation of model fit to data by comparing to competing models.

What-if Structures and the Do-Operator

“[T]ypically the causal assumptions are less established, though they should be defensible and consistent with the current state of knowledge. The analysis is done under the speculation of “what if these causal assumptions were true.” These latter analyses are useful because there are often ways of testing the model, or parts of it. These tests can be helpful in rejecting one or more of the causal assumptions, thereby revealing flaws in specification. Of course, passing these tests does not prove the validity of the causal assumptions, but it lends credibility to them.”

  • Bollen and Pearl in Eight myths about causality and structural equation models

The Do-Operator uses graph mutilation techniques to interrogate the impact of different data generating conditions including the analysis of causal claims.

Graphs Mutilation in PyMC

## Model Estimated on Data
idata_sem, model_sem = make_sem()
pm.model_to_graphviz(model_sem)

## The Do-Operator in Action
model_beta0 = do(model_sem, 
{"betas": [0, 0 , 0]}, 
prune_vars=True)
pm.model_to_graphviz(model_beta0)

Causal Estimands under Intervention

model_beta0 = do(model_sem, {"betas": [0, 0 , 0]}, prune_vars=True)
model_beta1 = do(model_sem, {"betas": [.6, .3, .7]}, prune_vars=True)

# Sample new data assuming path parameters 0: P(Y | c, do(beta=0))
idata_z0 = pm.sample_posterior_predictive(
    idata_sem,
    model=model_beta0,
    predictions=False,
    var_names=["likelihood", "betas",],
)
# Sample new data assuming substantive specification: P(Y | c, do(z=[.6, .3, .7]))
idata_z1 = pm.sample_posterior_predictive(
    idata_sem,
    model=model_beta1,
    predictions=False,
    var_names=["likelihood", "betas"],
)

Causal Estimands under Intervention

\[ E\Big[(Y| do(\beta=0)) - (Y | do(\beta \neq 0) \Big] \]

Model Fit Sensitivity under Intervention

The plots are scaled identically here between -1 and 1. Highlighting significantly worse model fit under implausibly zero’d out beta coefficients.

Conclusion

“SEM is an inference engine that takes in two inputs, qualitative causal assumptions and empirical data, and derives two logical consequences of these inputs: quantitative causal conclusions and statistical measures of fit for the testable implications of the assumptions. Failure to fit the data casts doubt on the strong causal assumptions of zero coefficients or zero covariances and guides the researcher to diagnose, or repair the structural misspecifications. Fitting the data does not “prove” the causal assumptions, but it makes them tentatively more plausible. Any such positive results need to be replicated and to withstand the criticisms of researchers who suggest other models for the same data. ”

  • Bollen and Pearl in Eight myths about causality and structural equation models