import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.datasets as dsets
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import pymc as pm
import arviz as az
import pytensor.tensor as pt

np.random.seed(42324)

import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
Principal Components Analysis is used to reconstruct a complex multivariate data set as faithfully as possible from a compressed, lower-dimensional space. Variational auto-encoders allow us to achieve more flexible reconstructions in non-linear cases. Drawing a new sample from the posterior predictive distribution of a Bayesian model similarly gives us insight into the variability of the realised data. Both methods assume a latent model of the data generating process that leverages a compressed representation of the data. They are different heuristics with different consequences for how we understand variability in the world. Both encode degrees of inductive bias through architectural specification, but of the two, only Bayesian methods do so transparently.
We find (a) a VAE architecture that can recover the latent correlation structure of a multivariate normal outcome if scaled up with enough data, and (b) a theoretically superior architecture designed to handle missing data that nevertheless breaks down on the reconstruction task. Both are then compared to a simpler multivariate normal Bayesian model in PyMC, which recovers the correlation structure far more efficiently. These methods are applied to an imputation problem for non-response in job-satisfaction surveys. We then draw some lessons about the value of the transparent inductive biases encoded in the latent factor structures of Structural Equation Models.
Reconstruction Error and Latent Representations
In applied statistical modeling, some of the hardest problems are those we can’t observe directly. Inferring latent structure from incomplete data forces us to ask: what do we believe about the world, and how do our models reflect those beliefs? The answer lies in inductive bias—the assumptions built into a model’s architecture—that guide how it reconstructs and fills in the blanks.
We discuss the idea of reconstruction loss in the context of PCA/SVD before examining two main approaches to this challenge:
Variational autoencoders (VAEs), both standard and mask-aware, which allow flexible non-linear reconstructions and can be adapted to handle missingness explicitly.
Bayesian multivariate models with explicit priors over covariance structures, which transparently encode inductive bias and naturally impute missing values through the posterior.
Each method is applied to non-response in job satisfaction surveys. Our focus is not only on reconstruction accuracy, but on the recovery of latent correlation structures—because superficial similarity is not enough. Good marginal fits can still hide deep structural errors.
Job Satisfaction Data
Job satisfaction is not one thing. It’s a bundle: how we think, how we feel, how we work. In our simulation, we model this bundle through twelve indicators—grouped into job satisfaction (JW1–JW3), well-being (UF1, UF2, FOR), dysfunctional patterns (DA1–DA3), and constructive thought strategies (EBA, ST, MI). In particular, the focus is on the relationships that EBA (evaluating beliefs and assumptions), ST (self-talk) and MI (mental imagery formation) have with the other constructs. This framing is adapted from the discussion in Vehkalahti and Everitt’s Multivariate Analysis for the Behavioral Sciences. The Python code generates data by first defining a realistic correlation matrix and scaling it by standard deviations to yield a structured covariance matrix; the parameterisation is drawn from the original study. From this, we draw synthetic observations using a multivariate normal distribution, which allows us to test estimators under controlled complexity.
Code
# Standard deviations
stds = np.array([0.939, 1.017, 0.937, 0.562, 0.760, 0.524,
                 0.585, 0.609, 0.731, 0.711, 1.124, 1.001])
n = len(stds)

# Lower triangular correlation values as a flat list
corr_values = [
    1.000,
    .668, 1.000,
    .635, .599, 1.000,
    .263, .261, .164, 1.000,
    .290, .315, .247, .486, 1.000,
    .207, .245, .231, .251, .449, 1.000,
    -.206, -.182, -.195, -.309, -.266, -.142, 1.000,
    -.280, -.241, -.238, -.344, -.305, -.230, .753, 1.000,
    -.258, -.244, -.185, -.255, -.255, -.215, .554, .587, 1.000,
    .080, .096, .094, -.017, .151, .141, -.074, -.111, .016, 1.000,
    .061, .028, -.035, -.058, -.051, -.003, -.040, -.040, -.018, .284, 1.000,
    .113, .174, .059, .063, .138, .044, -.119, -.073, -.084, .563, .379, 1.000
]

# Fill correlation matrix
corr_matrix = np.zeros((n, n))
idx = 0
for i in range(n):
    for j in range(i+1):
        corr_matrix[i, j] = corr_values[idx]
        corr_matrix[j, i] = corr_values[idx]
        idx += 1

# Covariance matrix: Sigma = D * R * D
cov_matrix = np.outer(stds, stds) * corr_matrix
#cov_matrix_test = np.dot(np.dot(np.diag(stds), corr_matrix), np.diag(stds))

FEATURE_COLUMNS = ["JW1", "JW2", "JW3", "UF1", "UF2", "FOR", "DA1", "DA2", "DA3", "EBA", "ST", "MI"]
corr_df = pd.DataFrame(corr_matrix, columns=FEATURE_COLUMNS)
cov_df = pd.DataFrame(cov_matrix, columns=FEATURE_COLUMNS)


def make_sample(cov_matrix, size, columns, missing_frac=0.0, impute=False):
    sample_df = pd.DataFrame(np.random.multivariate_normal([0]*12, cov_matrix, size=size), columns=FEATURE_COLUMNS)
    if missing_frac > 0.0:
        total_values = sample_df.size
        num_nans = int(total_values * missing_frac)

        # Choose random flat indices
        nan_indices = np.random.choice(total_values, num_nans, replace=False)

        # Convert flat indices to (row, col)
        rows, cols = np.unravel_index(nan_indices, sample_df.shape)

        # Set the values to NaN
        sample_df.values[rows, cols] = np.nan

        if impute:
            sample_df.fillna(sample_df.mean(axis=0), inplace=True)

    return sample_df


sample_df = make_sample(cov_matrix, 263, FEATURE_COLUMNS)
sample_df.head(10)
|  | JW1 | JW2 | JW3 | UF1 | UF2 | FOR | DA1 | DA2 | DA3 | EBA | ST | MI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.578412 | 3.565673 | 1.976356 | 0.261307 | -0.162994 | 0.425867 | -0.341104 | -0.663246 | -1.898728 | 0.412019 | 0.167105 | 1.179567 |
1 | 0.147594 | 0.316841 | 1.244429 | -0.566278 | 0.157724 | 0.084468 | 0.625595 | 0.590472 | 0.663543 | -0.392881 | 0.434897 | -0.379236 |
2 | 0.799763 | 0.277303 | -0.361947 | 0.703534 | 0.326682 | -0.503207 | 0.558374 | 0.313461 | 0.005397 | 0.373729 | -0.487022 | 1.029725 |
3 | -1.245506 | -1.338533 | -0.845951 | -0.700084 | -0.324149 | 0.159880 | 0.205925 | 0.402143 | 1.305458 | 0.987969 | 1.313665 | 0.171655 |
4 | 0.066713 | 0.958231 | 0.697802 | 0.071260 | -0.102972 | 0.186527 | -0.015224 | 0.147267 | 0.674598 | -0.719964 | -0.845735 | -0.203553 |
5 | -0.324724 | 0.189098 | 0.508612 | 0.078070 | 0.239252 | -0.556651 | 0.560134 | 0.180378 | -0.813426 | -0.731620 | -1.727260 | 1.150983 |
6 | -0.190670 | -0.169900 | -0.798084 | 0.247005 | -0.475737 | 0.325712 | 0.892692 | -0.094400 | -0.648839 | -0.306447 | -0.484535 | 0.358890 |
7 | -1.042066 | -1.583407 | -2.318488 | -0.073663 | -0.476409 | 0.235916 | 0.038662 | -0.154278 | 0.301485 | 0.881227 | 1.128888 | 1.524227 |
8 | -0.818276 | -0.596287 | -1.432412 | 1.094289 | -0.194693 | -0.699425 | -0.652304 | -0.281707 | -0.135852 | -0.930961 | 0.236916 | -0.257751 |
9 | -1.167284 | -1.860362 | -0.947581 | -0.767971 | -1.306440 | -0.266554 | 0.019500 | 0.711873 | 1.820465 | -1.407315 | -0.165145 | -2.395606 |
A stringent test of any model for the job satisfaction process is whether it recovers the observed correlation structure. The code extracts pairwise correlations from the simulated data and visualizes them using a heatmap. Strong ties—positive or negative—appear as vivid blocks, offering a visual audit of relational intensity. This pattern matters. While it does not imply causation, it does offer a benchmark: if a model can’t replicate these patterns, it’s likely missing structure.
Code
data = sample_df.corr()


def get_srmr(resids):
    # Standardised root mean square residual over the lower triangle (p = 12 indicators)
    srmr = np.sqrt(np.sum(np.tril(resids).flatten()**2) / (12 * (12 + 1) / 2))
    return srmr
def plot_heatmap(data, title="Correlation Matrix", vmin=-.2, vmax=.2, ax=None, figsize=(10, 6), colorbar=True, cmap='magma'):
    data_matrix = data.values
    if ax is None:
        fig, ax = plt.subplots(figsize=figsize)
    im = ax.imshow(data, cmap=cmap, vmin=vmin, vmax=vmax)

    for i in range(data_matrix.shape[0]):
        for j in range(data_matrix.shape[1]):
            text = ax.text(
                j, i, f"{data_matrix[i, j]:.2f}",                      # x, y coordinates and text to display
                ha="center", va="center",                              # center alignment
                color="white" if data_matrix[i, j] < 0.5 else "black"  # contrast color
            )

    ax.set_title(title)
    ax.set_xticks(np.arange(data.shape[1]))
    ax.set_xticklabels(data.columns)
    ax.set_yticks(np.arange(data.shape[0]))
    ax.set_yticklabels(data.index)
    if colorbar:
        plt.colorbar(im)


plot_heatmap(data, vmin=-1, vmax=1, cmap='coolwarm')
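The get_srmr helper above is intended as the standardised root mean square residual (SRMR): the root of the average squared discrepancy between reference and reconstructed correlations, taken over the lower triangle of the residual matrix. For \(p = 12\) indicators this is

\[ \text{SRMR} = \sqrt{\frac{\sum_{i \leq j} \big(r_{ij} - \hat{r}_{ij}\big)^{2}}{p(p+1)/2}} \]

where \(r_{ij}\) are the reference correlations and \(\hat{r}_{ij}\) the reconstructed ones.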
Singular Value Decomposition and Reconstruction Error
Multivariate systems often lie on lower-dimensional structures. Principal Component Analysis (PCA), via Singular Value Decomposition (SVD), helps us explore these spaces. Any data matrix \(X\) can be decomposed as \(X = U\Sigma V^{T}\) where \(\Sigma\) contains singular values ordered by importance. Truncating this expansion at rank \(k\) reconstructs an approximation of the original. The code below shows that even with only 2 or 5 components, much of the structure is retained. As we increase \(k\) the reconstruction error falls—revealing the data’s latent geometry.
X = make_sample(cov_matrix, 100, columns=FEATURE_COLUMNS)
U, S, VT = np.linalg.svd(X, full_matrices=False)
SVD decomposes a matrix into interpretable parts: directions of variation, scaled by importance. The reconstruction plots show that a handful of components (principal axes) capture most of the variation in the job satisfaction data. This supports the idea of latent representations: the surface complexity of our variables arises from simpler underlying factors. Variational autoencoders (VAEs) and Bayesian posterior predictive sampling attempt similar compression and recovery strategies, but embed them within generative models. As we’ll see, their reconstruction quality also reflects the strength and focus of their latent encoding.
Code
ranks = [2, 5, 12]
reconstructions = []
for k in ranks:
    X_k = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]
    reconstructions.append(X_k)

# Plot original and reconstructed matrices
fig, axes = plt.subplots(1, len(ranks) + 1, figsize=(10, 15))
axes[0].imshow(X, cmap='viridis')
axes[0].set_title("Original")
axes[0].axis("off")

for ax, k, X_k in zip(axes[1:], ranks, reconstructions):
    ax.imshow(X_k, cmap='viridis')
    ax.set_title(f"Rank {k}")
    ax.axis("off")

plt.suptitle("Reconstruction of Data Using SVD \n various truncation options", fontsize=12, x=.5, y=1.01)
plt.tight_layout()
plt.show()
SVD shows us that high-dimensional data can often be compressed with little loss. But it’s linear and fixed: it doesn’t learn a generalisable representation. Variational autoencoders generalize this idea. Like PCA, they seek a latent representation, but they do so through non-linear mappings learned from data. An autoencoder has two parts: an encoder, which compresses data into a latent code, and a decoder, which reconstructs it. Unlike PCA, this process is flexible and adapts to the data distribution, not just its linear axes. The variational auto-encoder (VAE) extends this idea by placing probability distributions over these latent codes, allowing us to sample and generate new data points.
Variational Auto-Encoders
The NumericVAE class below captures this structure. The encoder maps observed data to the parameters of a distribution over latent variables \(z\), specifically a Gaussian with learned mean and variance: \(q(z|x) = N(\mu(x), \sigma^{2}(x))\). Using the reparameterization trick, we sample \(z \sim q(z| x)\) and decode it into a predicted distribution over the observed variables, \(p(x |z)\). This makes the model generative: not only can it reconstruct known data, it can synthesize new, plausible examples by sampling from the latent prior \(p(z) = N(0, I)\) and decoding. The result is a flexible, probabilistic architecture that mimics and extends the SVD idea of structure through compression.
class NumericVAE(nn.Module):
    def __init__(self, n_features, hidden_dim=64, latent_dim=8):
        super().__init__()
        # ---------- ENCODER ----------
        # First layer: compress input features into a hidden representation
        self.fc1 = nn.Linear(n_features, hidden_dim)

        # Latent space parameters (q(z|x)): mean and log-variance
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # μ(x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)   # log(σ^2(x))

        # ---------- DECODER ----------
        # First layer: map latent variable z back into hidden representation
        self.fc2 = nn.Linear(latent_dim, hidden_dim)

        # Output distribution parameters for reconstruction p(x|z)
        # For numeric data, we predict both mean and log-variance per feature
        self.fc_out_mu = nn.Linear(hidden_dim, n_features)       # μ_x(z)
        self.fc_out_logvar = nn.Linear(hidden_dim, n_features)   # log(σ^2_x(z))

    # ENCODER forward pass: input x -> latent mean, log-variance
    def encode(self, x):
        h = F.relu(self.fc1(x))        # Hidden layer with ReLU
        mu = self.fc_mu(h)             # Latent mean vector
        logvar = self.fc_logvar(h)     # Latent log-variance vector
        return mu, logvar

    # Reparameterization trick: sample z = μ + σ * ε (ε ~ N(0,1))
    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)  # σ = exp(0.5 * logvar)
        eps = torch.randn_like(std)    # ε ~ N(0, I)
        return mu + eps * std          # z = μ + σ * ε

    # DECODER forward pass: latent z -> reconstructed mean, log-variance
    def decode(self, z):
        h = F.relu(self.fc2(z))                  # Hidden layer with ReLU
        recon_mu = self.fc_out_mu(h)             # Mean of reconstructed features
        recon_logvar = self.fc_out_logvar(h)     # Log-variance of reconstructed features
        return recon_mu, recon_logvar

    # Full forward pass: input x -> reconstructed (mean, logvar), latent params
    def forward(self, x):
        mu, logvar = self.encode(x)                # q(z|x)
        z = self.reparameterize(mu, logvar)        # Sample z from q(z|x)
        recon_mu, recon_logvar = self.decode(z)    # p(x|z)
        return (recon_mu, recon_logvar), mu, logvar

    # Sample new synthetic data: z ~ N(0,I), decode to x
    def generate(self, n_samples=100):
        self.eval()
        with torch.no_grad():
            # Sample z from standard normal prior
            z = torch.randn(n_samples, self.fc_mu.out_features)

            # Decode to get reconstruction distribution parameters
            cont_mu, cont_logvar = self.decode(z)

            # Sample from reconstructed Gaussian: μ_x + σ_x * ε
            return cont_mu + torch.exp(0.5 * cont_logvar) * torch.randn_like(cont_mu)
This generative ability connects VAEs to Bayesian modeling. In Bayesian inference, posterior predictive sampling draws new observations by integrating over uncertainty in parameters. VAEs achieve something similar by sampling from a learned latent distribution and propagating it through the decoder. Both approaches simulate data from a model conditioned on what has been learned. But where Bayesian models emphasize interpretability and prior structure, VAEs emphasize flexibility and scalability with “deep” latent structure. In the next section, we use this NumericVAE to see how well it reconstructs and generates data compared to posterior predictive draws from a Bayesian multivariate normal model.
Variational autoencoders are trained by pairing their generative structure with a principled loss. The encoder maps input data to a distribution over latent variables; the decoder maps sampled latents back to data. This two-step process is trained end-to-end by minimizing a loss that balances two goals: accurate reconstruction and regularization of the latent space. The vae_loss function captures this trade-off. It computes a reconstruction loss using the negative log-likelihood of a Gaussian, plus a penalty—the Kullback-Leibler (KL) divergence—between the approximate posterior \(q(z | x)\) and the prior \(p(z)\).
def vae_loss(recon_mu, recon_logvar, x, mu, logvar):
    # Reconstruction loss: Gaussian log likelihood
    recon_var = torch.exp(recon_logvar)
    recon_nll = 0.5 * (torch.log(2 * torch.pi * recon_var) + (x - recon_mu) ** 2 / recon_var)
    recon_loss = recon_nll.sum(dim=1).mean()   # sum over features, mean over batch

    # KL divergence: D_KL(q(z|x) || p(z)) where p(z)=N(0,I)
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    kl_loss = kl_div.mean()

    return recon_loss + kl_loss, recon_loss, kl_loss
The reconstruction loss assumes the decoder outputs a Gaussian distribution for each input dimension. The term recon_nll \(= \frac{1}{2}\left(\log(2\pi\sigma^{2}) + \frac{(x-\mu)^{2}}{\sigma^{2}}\right)\) measures how surprising the observed data \(x\) would be under the predicted mean and variance. It penalizes poor fits and rewards precise reconstructions. This is summed across features and averaged across the batch. High-fidelity reconstruction requires the decoder to learn how each latent variable shapes the original space, making the VAE a powerful model for estimating high-dimensional latent representations that emit full probability distributions over inputs.
The KL divergence term acts as a regularizer. It measures how far the encoder’s learned distribution \(q(z∣x)\) is from the standard normal prior \(p(z)\). Without this term, the model could overfit—memorizing reconstructions without learning a useful latent structure. By keeping the latent codes close to \(N(0,I)\), the model remains generative: we can sample from the prior and decode into valid synthetic data. In short, KL divergence ensures generalization and interpretability, while reconstruction loss ensures fidelity. Their sum is what guides learning in the VAE.
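Both pieces of the loss have closed forms that can be cross-checked against torch.distributions; a minimal sketch on toy tensors (none of these values come from the survey data):

import torch
from torch.distributions import Normal, kl_divergence

x = torch.tensor([0.7, -1.1])
recon_mu, recon_logvar = torch.tensor([0.5, -1.0]), torch.tensor([0.2, -0.3])
mu, logvar = torch.tensor([0.3, -0.2]), torch.tensor([0.1, -0.5])

# Reconstruction term: 0.5 * (log(2*pi*sigma^2) + (x - mu)^2 / sigma^2) is the Gaussian NLL
recon_var = torch.exp(recon_logvar)
nll_manual = 0.5 * (torch.log(2 * torch.pi * recon_var) + (x - recon_mu) ** 2 / recon_var)
nll_lib = -Normal(recon_mu, torch.exp(0.5 * recon_logvar)).log_prob(x)

# KL term: closed form for a diagonal Gaussian against N(0, I)
kl_manual = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
kl_lib = kl_divergence(Normal(mu, torch.exp(0.5 * logvar)), Normal(torch.zeros(2), torch.ones(2)))

print(torch.allclose(nll_manual, nll_lib), torch.allclose(kl_manual, kl_lib))  # True True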
Training VAE models
At the core of training deep learning models is backpropagation - an algorithm that computes gradients of a loss function with respect to each model parameter. PyTorch handles this automatically. We separate the data into batches and compute the loss for each batch; calling loss.backward() then computes these gradients via the chain rule. The optimizer then adjusts the weights with optimizer.step(), moving the model parameters in the direction that reduces the loss. Each training iteration refines the model, gradually improving its ability to encode and reconstruct the data. The train_vae function applies this cycle across epochs, using PyTorch’s autograd system to manage the entire computation graph.
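To make the cycle concrete, here is a minimal sketch of a single update on a toy batch, assuming the NumericVAE class and vae_loss function defined above (the batch is random data, not the survey sample):

# One optimisation step on a toy batch of 32 rows with 12 features
x = torch.randn(32, 12)
vae_demo = NumericVAE(n_features=12)
opt = torch.optim.Adam(vae_demo.parameters(), lr=1e-3)

opt.zero_grad()                                      # clear gradients from any previous step
(recon_mu, recon_logvar), mu, logvar = vae_demo(x)   # forward pass through encoder and decoder
loss, _, _ = vae_loss(recon_mu, recon_logvar, x, mu, logvar)
loss.backward()                                      # autograd: gradients of the loss w.r.t. every parameter
opt.step()                                           # move the weights in the direction that reduces the loss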
The prep_data_vae function creates simulated data, adds optional missingness, and splits it into training and test sets. Each training set—of size 500, 1,000, or 10,000—is passed to a new VAE instance for fitting. More data provides better coverage of the latent space, allowing the model to generalize more confidently. With small data, the model risks memorizing noise. With large data, KL regularization helps the decoder learn smooth, generalizable mappings from latent codes to input reconstructions. The training runs below test this sensitivity empirically, keeping architecture fixed and varying only the sample size.
Code
def prep_data_vae(sample_size=1000, missing_frac=0.0, impute=False):
    sample_df = make_sample(cov_matrix=cov_matrix, size=sample_size, columns=FEATURE_COLUMNS,
                            missing_frac=missing_frac, impute=impute)

    X_train, X_test = train_test_split(sample_df.values, test_size=0.2, random_state=890)

    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)

    train_loader = torch.utils.data.DataLoader(X_train, batch_size=32, shuffle=True)
    test_loader = torch.utils.data.DataLoader(X_test, batch_size=32)
    return train_loader, test_loader
Training a variational autoencoder means adjusting a large array of weights to minimize reconstruction error and regularize the latent space. The train_vae function loops over batches from the training set, computes the loss, and updates the model via backpropagation. Each batch yields predicted distributions over features, from which we compute both the reconstruction loss and the KL divergence. These are averaged and logged for both training and held-out test data. Early stopping halts training when improvements plateau, preventing overfitting while preserving the best model seen so far.
def train_vae(vae, optimizer, train, test, patience=10, wait=0, n_epochs=1000):
    best_loss = float('inf')
    losses = []

    for epoch in range(n_epochs):
        vae.train()
        train_loss = 0.0
        for batch in train:
            optimizer.zero_grad()
            (recon_mu, recon_logvar), mu, logvar = vae(batch)
            loss, recon_loss, kl_loss = vae_loss(recon_mu, recon_logvar, batch, mu, logvar)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * batch.size(0)

        avg_train_loss = train_loss / train.dataset.shape[0]

        # --- Test Loop ---
        vae.eval()
        test_loss = 0.0
        with torch.no_grad():
            for batch in test:
                (recon_mu, recon_logvar), mu, logvar = vae(batch)
                loss, _, _ = vae_loss(recon_mu, recon_logvar, batch, mu, logvar)
                test_loss += loss.item() * batch.size(0)
        avg_test_loss = test_loss / test.dataset.shape[0]

        print(f"Epoch {epoch+1}/{n_epochs} | Train Loss: {avg_train_loss:.4f} | Test Loss: {avg_test_loss:.4f}")

        # Early stopping on the held-out loss
        if avg_test_loss < best_loss - 1e-4:
            best_loss, wait = avg_test_loss, 0
            best_state = vae.state_dict()
        else:
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                # restore best
                vae.load_state_dict(best_state)
                break

        losses.append([avg_train_loss, avg_test_loss, best_loss])

    return vae, pd.DataFrame(losses, columns=['train_loss', 'test_loss', 'best_loss'])
We are now in a position to fit the same model architecture to different samples of the data. Training on 10,000 samples, for instance, challenges the model to encode the full covariance structure of our job satisfaction data while retaining generative flexibility. This training routine sets the stage for comparing VAEs to Bayesian models, where posterior predictive sampling plays a similar generative role—but with different assumptions and computational tradeoffs.
train_500, test_500 = prep_data_vae(500)
vae = NumericVAE(n_features=train_500.dataset.shape[1], hidden_dim=64)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
vae_fit_500, losses_df_500 = train_vae(vae, optimizer, train_500, test_500)

train_1000, test_1000 = prep_data_vae(1000)
vae = NumericVAE(n_features=train_1000.dataset.shape[1], hidden_dim=64)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
vae_fit_1000, losses_df_1000 = train_vae(vae, optimizer, train_1000, test_1000)

train_10_000, test_10_000 = prep_data_vae(10_000)
vae = NumericVAE(n_features=train_10_000.dataset.shape[1], hidden_dim=64)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
vae_fit_10_000, losses_df_10_000 = train_vae(vae, optimizer, train_10_000, test_10_000)
Training and test losses tell us how well the model fits the data and how well it generalizes. The plots show these losses over training epochs for datasets of size 500, 1,000, and 10,000. A falling training loss indicates the model is learning to reconstruct the data. If test loss follows the same trend, generalization is improving too. If test loss flattens or rises while training loss keeps falling, the model may be overfitting—memorizing rather than learning.
Code
def plot_metric_recovery(fit_vae, test_dataset):
    recon_df = pd.DataFrame(fit_vae.generate(test_dataset.shape[0]), columns=FEATURE_COLUMNS)
    test_df = pd.DataFrame(test_dataset, columns=FEATURE_COLUMNS)

    fig, axs = plt.subplots(6, 2, figsize=(8, 20))
    axs = axs.flatten()
    for col, ax in zip(FEATURE_COLUMNS, axs):
        ax.hist(recon_df[col], color='red', alpha=0.5, ec='grey', label=f'Reconstructed Metric {col}')
        ax.hist(test_df[col], color='blue', alpha=0.5, ec='black', label=f'Test Metric {col}')
        ax.legend()


fig, axs = plt.subplots(1, 3, figsize=(8, 6))
axs = axs.flatten()
losses_df_500[['train_loss', 'test_loss']].plot(ax=axs[0])
losses_df_1000[['train_loss', 'test_loss']].plot(ax=axs[1])
losses_df_10_000[['train_loss', 'test_loss']].plot(ax=axs[2])

axs[0].set_title("Training and Test Losses \n 500 observations");
axs[1].set_title("Training and Test Losses \n 1000 observations");
axs[2].set_title("Training and Test Losses \n 10_000 observations");
These curves also validate our early stopping rule. When test loss plateaus, training halts—preserving the model that best balances fidelity and generalization. The plots confirm that this dynamic works across scales. In sum, training and test loss curves are not just diagnostics of performance—they’re windows into the model’s learning process and how data scale affects estimation.
We can also see that the models are able to reconstruct the marginal metric distributions.
plot_metric_recovery(vae_fit_10_000, test_10_000.dataset)
To evaluate how well the VAE learns the joint structure of the data and not just feature-wise accuracy, we compare the correlation matrices of the reconstructed data. We know the ground truth and ask: does the VAE, when generating synthetic data, reproduce the correct dependency structure? For each bootstrap iteration, we sample new data from the trained VAE and compute its correlation matrix. We subtract this from the true correlation matrix, storing the residuals. Averaging across 1,000 samples gives us the expected discrepancy between true and learned structure.
This bootstrapping approach is key. Any one synthetic sample from the VAE might show idiosyncratic deviations due to randomness in the generative process. By repeating the process many times, we average out this noise and obtain a stable, expected residual. This makes the evaluation fair: it judges the model based on the distribution it has learned, not on any one lucky—or unlucky—draw. A good model will show consistently small residuals, especially along the strongest correlations of the test set.
def bootstrap_residuals(vae_fit, X_test, n_boot=1000):
    recons = []
    srmrs = []
    resid_array = np.zeros((n_boot, 12, 12))
    for i in range(n_boot):
        recon_data = vae_fit.generate(n_samples=len(X_test))
        reconstructed_df = pd.DataFrame(recon_data, columns=FEATURE_COLUMNS)
        resid = pd.DataFrame(corr_matrix, columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS) - reconstructed_df.corr()
        srmr = get_srmr(resid)
        srmrs.append(srmr)
        resid_array[i] = resid.values
        recons.append(reconstructed_df)

    avg_resid = resid_array.mean(axis=0)
    bootstrapped_resids = pd.DataFrame(avg_resid, columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS)
    return bootstrapped_resids, srmrs
bootstrapped_resids_500, srmrs_500 = bootstrap_residuals(vae_fit_500, pd.DataFrame(test_500.dataset, columns=FEATURE_COLUMNS))

bootstrapped_resids_1000, srmrs_1000 = bootstrap_residuals(vae_fit_1000, pd.DataFrame(test_1000.dataset, columns=FEATURE_COLUMNS))

bootstrapped_resids_10_000, srmrs_10_000 = bootstrap_residuals(vae_fit_10_000, pd.DataFrame(test_10_000.dataset, columns=FEATURE_COLUMNS))

fig, axs = plt.subplots(3, 1, figsize=(10, 20))
axs = axs.flatten()
plot_heatmap(bootstrapped_resids_500, title=f"""Expected Correlation Residuals for 500 observations \n Under 1000 Bootstrapped Reconstructions. SRMR: {np.mean(srmrs_500).round(2)}""", ax=axs[0], colorbar=True, vmin=-.25, vmax=.25)

plot_heatmap(bootstrapped_resids_1000, title=f"""Expected Correlation Residuals for 1000 observations \n Under 1000 Bootstrapped Reconstructions. SRMR: {np.mean(srmrs_1000).round(2)}""", ax=axs[1], colorbar=True, vmin=-.25, vmax=.25)

plot_heatmap(bootstrapped_resids_10_000, title=f"""Expected Correlation Residuals for 10,000 observations \n Under 1000 Bootstrapped Reconstructions. SRMR: {np.mean(srmrs_10_000).round(2)}""", ax=axs[2], colorbar=True, vmin=-.25, vmax=.25)
The resulting heatmaps visualize the average residuals across correlation pairs. With 500 training samples, patterns are weaker and more erratic. At 1,000, the VAE begins to capture more of the joint structure. With 10,000, the residuals shrink further, especially near key variable clusters. This confirms that the VAE learns not only to reconstruct inputs but to replicate their internal relationships—an essential feature for any model claiming to approximate the data-generating process.
Missing Data and Inductive Bias
Missing data is a common challenge in survey research, especially in job satisfaction studies. Respondents often skip questions—sometimes at random, sometimes in patterns related to stress, disengagement, or privacy concerns. This creates non-response bias, where the absence of data is itself informative. Any imputation method must account not only for what’s missing, but why it’s missing. Ignoring this can distort correlation estimates and lead to misleading inferences.
sample_df_missing = sample_df.copy()

# Randomly pick 5% of the total elements
mask_remove = np.random.rand(*sample_df_missing.shape) < 0.05

# Set those elements to NaN
sample_df_missing[mask_remove] = np.nan
sample_df_missing.head()
|  | JW1 | JW2 | JW3 | UF1 | UF2 | FOR | DA1 | DA2 | DA3 | EBA | ST | MI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.578412 | 3.565673 | 1.976356 | 0.261307 | -0.162994 | 0.425867 | -0.341104 | -0.663246 | -1.898728 | 0.412019 | 0.167105 | 1.179567 |
1 | 0.147594 | 0.316841 | 1.244429 | -0.566278 | 0.157724 | 0.084468 | NaN | 0.590472 | 0.663543 | -0.392881 | 0.434897 | -0.379236 |
2 | 0.799763 | 0.277303 | -0.361947 | 0.703534 | 0.326682 | -0.503207 | 0.558374 | 0.313461 | 0.005397 | 0.373729 | -0.487022 | 1.029725 |
3 | -1.245506 | -1.338533 | -0.845951 | -0.700084 | -0.324149 | 0.159880 | 0.205925 | 0.402143 | 1.305458 | 0.987969 | 1.313665 | 0.171655 |
4 | 0.066713 | 0.958231 | 0.697802 | 0.071260 | -0.102972 | 0.186527 | -0.015224 | 0.147267 | 0.674598 | NaN | -0.845735 | -0.203553 |
In the code above, we simulate this problem by introducing 5% random missingness into the dataset. This models a common real-world scenario: partial responses from otherwise complete surveys. While this version assumes missing at random (MAR), in practice, missingness may depend on latent traits—precisely what our models aim to capture. A naive mean imputation might smooth over the problem, but risks losing structural information or introducing artificial correlations. We want our deep learning architecture to handle missing data directly, so we adapt the data preparation into a prep_data_vae_missing function that accommodates missing values.
Code
class MissingDataDataset(Dataset):
    def __init__(self, x, mask):
        # x and mask are tensors of same shape
        self.x = x
        self.mask = mask

    def __len__(self):
        return self.x.shape[0]

    def __getitem__(self, idx):
        return self.x[idx], self.mask[idx]


def prep_data_vae_missing(sample_size=1000, batch_size=32, missing_frac=0.2, impute=False):
    sample_df = make_sample(cov_matrix=cov_matrix, size=sample_size, columns=FEATURE_COLUMNS,
                            missing_frac=missing_frac, impute=impute)

    X_train, X_test = train_test_split(sample_df.values, test_size=0.2, random_state=890)

    # Mask: 1=observed, 0=missing
    mask_train = ~pd.DataFrame(X_train).isna()
    mask_test = ~pd.DataFrame(X_test).isna()

    # Tensors (keep NaNs for missing values)
    x_train_tensor = torch.tensor(X_train, dtype=torch.float32)
    mask_train_tensor = torch.tensor(mask_train.values, dtype=torch.float32)

    x_test_tensor = torch.tensor(X_test, dtype=torch.float32)
    mask_test_tensor = torch.tensor(mask_test.values, dtype=torch.float32)

    train_dataset = MissingDataDataset(x_train_tensor, mask_train_tensor)
    test_dataset = MissingDataDataset(x_test_tensor, mask_test_tensor)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    return train_loader, test_loader
Deep learning models, including VAEs, come with inductive biases—assumptions about the structure of the data that guide learning in the face of ambiguity. These biases can be helpful or harmful. In the next section, we explore how our VAE behaves when trained and applied to incomplete data. This will show both the flexibility of generative models and the risks of overconfidence and poor generalisation.
To handle missing data, we extend the VAE by allowing it to learn how to impute. The NumericVAE_missing model introduces a vector of learnable parameters—self.missing_embeddings—one per feature. During training, missing values in the input \(x\) are replaced with these parameters using torch.where, giving the model freedom to discover plausible replacements. Unlike mean imputation, this strategy is dynamic: the imputation values are tuned as part of the model’s optimization, guided by the same loss function that trains the encoder and decoder.
class NumericVAE_missing(nn.Module):
    def __init__(self, n_features, hidden_dim=64, latent_dim=24, dropout_rate=0.2):
        super().__init__()
        self.n_features = n_features
        self.dropout = nn.Dropout(p=dropout_rate)

        # ---------- Learnable Imputation ----------
        # One learnable parameter per feature for missing values
        self.missing_embeddings = nn.Parameter(torch.zeros(n_features))

        # ---------- ENCODER ----------
        self.fc1_x = nn.Linear(n_features, hidden_dim)
        # Stronger mask encoder: 2-layer MLP
        self.fc1_mask = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        # Combine feature and mask embeddings
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # ---------- DECODER ----------
        self.fc2 = nn.Linear(latent_dim, hidden_dim)
        self.fc_out_mu = nn.Linear(hidden_dim, n_features)
        self.fc_out_logvar = nn.Linear(hidden_dim, n_features)

    def encode(self, x, mask):
        # Impute missing values with learnable parameters
        x_filled = torch.where(
            torch.isnan(x),
            self.missing_embeddings.expand_as(x),
            x
        )

        # Encode features and mask separately
        h_x = F.silu(self.fc1_x(x_filled))
        h_x = self.dropout(h_x)
        h_mask = self.fc1_mask(mask)

        # Combine embeddings
        h = h_x + h_mask

        # Latent space
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.silu(self.fc2(z))
        h = self.dropout(h)  # optional: decoder dropout
        recon_mu = self.fc_out_mu(h)
        recon_logvar = torch.clamp(self.fc_out_logvar(h), -5, 5)
        return recon_mu, recon_logvar

    def forward(self, x, mask):
        mu, logvar = self.encode(x, mask)
        z = self.reparameterize(mu, logvar)
        recon_mu, recon_logvar = self.decode(z)
        return (recon_mu, recon_logvar), mu, logvar

    def generate(self, n_samples=100):
        self.eval()
        with torch.no_grad():
            z = torch.randn(n_samples, self.fc_mu.out_features)
            recon_mu, recon_logvar = self.decode(z)
            return recon_mu + torch.exp(0.5 * recon_logvar) * torch.randn_like(recon_mu)
The model also needs to know what was missing. For this, we pass a binary mask indicating which values are observed. This mask is processed through a two-layer MLP (self.fc1_mask), creating a learned representation of the missingness pattern. The encoded mask is then added to the encoded (and imputed) features. This design allows the encoder to distinguish between genuine low values and missing entries—and to condition the latent representation not just on the data, but on its absence.
Aside from this masking logic, the rest of the model mirrors a standard VAE: it uses a reparameterized latent space, decodes into means and variances for each feature, and generates new samples by drawing from the latent prior. Dropout layers add regularization. The overall effect is to give the model flexibility in the face of partial data. It can learn both how to fill in what’s missing and how to encode uncertainty about those guesses. This approach injects an inductive bias that missingness matters—and that learning should account for it directly.
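A minimal sketch of the imputation mechanism on a toy row (the missing_embeddings tensor here stands in for the model’s self.missing_embeddings attribute):

# How torch.where swaps NaNs for the learnable per-feature embeddings
toy = torch.tensor([[1.0, float('nan'), -0.5]])
mask = (~torch.isnan(toy)).float()                    # 1 = observed, 0 = missing
missing_embeddings = nn.Parameter(torch.zeros(3))     # one trainable value per feature
filled = torch.where(torch.isnan(toy), missing_embeddings.expand_as(toy), toy)
print(mask)     # tensor([[1., 0., 1.]])
print(filled)   # the NaN is replaced by the embedding value (0.0 at initialisation)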
def vae_loss_missing(recon_mu, recon_logvar, x, mu, logvar, mask):
    # Clamp log-variance for numerical stability
    recon_logvar = torch.clamp(recon_logvar, min=-5.0, max=5.0)
    recon_var = torch.exp(recon_logvar)

    # Only use observed entries
    masked_x = torch.where(mask.bool(), x, torch.zeros_like(x))
    masked_diff_sq = ((masked_x - recon_mu) ** 2) * mask
    masked_logvar = recon_logvar * mask
    masked_var = recon_var * mask

    # Gaussian NLL: only computed on observed entries
    recon_nll = 0.5 * (
        torch.log(2 * torch.pi * masked_var + 1e-8) + masked_diff_sq / (masked_var + 1e-8)
    )

    # Reduce over features and average over batch
    obs_counts = mask.sum(dim=1).clamp(min=1)
    recon_loss = (recon_nll.sum(dim=1) / obs_counts).mean()

    # KL divergence
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    kl_loss = kl_div.mean()

    return recon_loss, kl_loss
To adapt the VAE’s loss function for missing data, we restrict reconstruction error to observed entries only. The function vae_loss_missing computes a masked negative log-likelihood (NLL), treating missing values as undefined and excluding them from the loss. This is done by elementwise multiplying the squared error and log-variance terms by the binary mask. The mean squared error is thus scaled only over the available data, avoiding any implicit assumptions about the unobserved values. This approach lets the model learn from incomplete input without penalizing it for what it can’t see.
The KL divergence term remains unchanged, regularizing the latent distribution regardless of missingness. However, the reconstruction term is adjusted for each sample by dividing by the count of observed entries (obs_counts), avoiding bias due to varying amounts of missing data across rows. Numerical stability is enforced via clamping and small epsilon terms. Together, these changes ensure the objective function reflects uncertainty where it should and remains focused on the parts of the data the model can justifiably be judged on.
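The per-row normalisation is easiest to see on a toy mask; a minimal sketch, assuming each observed cell contributes one unit of loss:

# Rows with different amounts of missingness become comparable after dividing
# by the number of observed entries.
mask = torch.tensor([[1., 1., 0.],
                     [1., 0., 0.]])
per_cell_nll = torch.ones_like(mask) * mask     # pretend every observed cell contributes 1.0
obs_counts = mask.sum(dim=1).clamp(min=1)       # 2 observed entries in row 0, 1 in row 1
print(per_cell_nll.sum(dim=1) / obs_counts)     # tensor([1., 1.])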
The training loop remains largely unchanged, but we fit two versions of the model architecture on samples of 5,000 and 25,000 data points respectively.
def fit_vae_missing(vae_missing, train_loader, test_loader, optimizer, patience=10, wait=0, n_epochs=1000):
    best_loss = float('inf')
    losses = []

    for epoch in range(n_epochs):
        vae_missing.train()
        train_loss = 0
        for x_batch, mask_batch in train_loader:
            optimizer.zero_grad()
            (recon_mu, recon_logvar), mu, logvar = vae_missing(x_batch, mask_batch)
            recon_loss, kl_loss = vae_loss_missing(recon_mu, recon_logvar, x_batch, mu, logvar, mask_batch)
            loss = recon_loss + kl_loss
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * x_batch.size(0)

        avg_train_loss = train_loss / len(train_loader.dataset)

        # --- Validation ---
        vae_missing.eval()
        test_loss = 0.0
        with torch.no_grad():
            for x_batch, mask_batch in test_loader:
                (recon_mu, recon_logvar), mu, logvar = vae_missing(x_batch, mask_batch)
                recon_loss, kl_loss = vae_loss_missing(recon_mu, recon_logvar, x_batch, mu, logvar, mask_batch)
                loss = recon_loss + kl_loss
                test_loss += loss.item() * x_batch.size(0)
        avg_test_loss = test_loss / len(test_loader.dataset)

        print(f"Epoch {epoch+1}/{n_epochs} | Train: {avg_train_loss:.4f} | Test: {avg_test_loss:.4f}")

        # Early stopping
        if avg_test_loss < best_loss - 1e-4:
            best_loss, wait = avg_test_loss, 0
            best_state = vae_missing.state_dict()
        else:
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                # restore best
                vae_missing.load_state_dict(best_state)
                break

        losses.append([avg_train_loss, avg_test_loss, best_loss])

    losses_df = pd.DataFrame(losses, columns=['train_loss', 'test_loss', 'best_loss'])

    return vae_missing, losses_df
train_loader_5000, test_loader_5000 = prep_data_vae_missing(5000, batch_size=32)
vae_missing_5000 = NumericVAE_missing(n_features=next(iter(train_loader_5000))[0].shape[1], latent_dim=12)
optimizer = optim.Adam(vae_missing_5000.parameters(), lr=1e-4)

fit_vae_missing_5000, losses_df_missing_5000 = fit_vae_missing(vae_missing_5000, train_loader_5000, test_loader_5000, optimizer)

train_loader_25_000, test_loader_25_000 = prep_data_vae_missing(25_000, batch_size=32)
vae_missing_25_000 = NumericVAE_missing(n_features=next(iter(train_loader_25_000))[0].shape[1], latent_dim=12)
optimizer = optim.Adam(vae_missing_25_000.parameters(), lr=1e-4)

fit_vae_missing_25_000, losses_df_missing_25_000 = fit_vae_missing(vae_missing_25_000, train_loader_25_000, test_loader_25_000, optimizer)
Plot Metric Reconstruction
The histograms below show how well the VAE reconstructs the marginal distributions of individual survey metrics. For each variable, we overlay the reconstructed values (in red) with the test set values (in blue). Visually, there’s strong overlap across most dimensions, suggesting the model has learned to approximate the correct univariate distributions—even with missing data. This kind of marginal recovery is encouraging, especially in low-data regimes or when imputing incomplete responses.
Code
plot_metric_recovery(fit_vae_missing_5000, test_loader_5000.dataset.x)
However, strong metric-wise recovery does not imply that the model has captured the joint structure of the data. Each histogram tells us how accurate the model is on a single variable, but ignores how variables co-vary—a crucial element in survey analysis where latent constructs often manifest as correlated responses. It’s entirely possible to match marginal distributions while failing to reproduce the inter-variable dependencies that define the data’s true geometry.
For this reason, histogram overlap is best viewed as a necessary but insufficient indicator of VAE reconstruction quality. It confirms that the decoder isn’t collapsing or hallucinating values, but it doesn’t validate the model’s deeper understanding of the data-generating process. In the next step, we’ll examine the reconstructed correlation structure—where, despite strong marginal performance, the VAE fails to capture the multivariate dependencies accurately.
Plot Correlation Reconstruction
The heatmaps below visualize the residuals between the observed and reconstructed correlation matrices; a more stringent test of reconstruction fidelity than individual metric distributions. While the marginal histograms previously gave the impression of strong model performance, the correlation residuals tell a different story. There are consistent and structured deviations between the original and reconstructed relationships, particularly off-diagonal. These residuals are not random noise; they reveal systematic failure to capture the joint dependencies among survey metrics.
bootstrapped_resids_5000, srmrs_5000 = bootstrap_residuals(fit_vae_missing_5000, pd.DataFrame(test_loader_5000.dataset.x, columns=FEATURE_COLUMNS))

bootstrapped_resids_25_000, srmrs_25_000 = bootstrap_residuals(fit_vae_missing_25_000, pd.DataFrame(test_loader_25_000.dataset.x, columns=FEATURE_COLUMNS))

fig, axs = plt.subplots(2, 1, figsize=(8, 15))
axs = axs.flatten()
plot_heatmap(bootstrapped_resids_5000, title=f"""Expected Correlation Residuals for 5000 observations \n Under 1000 Bootstrapped Reconstructions. SRMR: {np.mean(srmrs_5000).round(2)}""", ax=axs[0], colorbar=True, vmin=-.25, vmax=.25)

plot_heatmap(bootstrapped_resids_25_000, title=f"""Expected Correlation Residuals for 25,000 observations \n Under 1000 Bootstrapped Reconstructions SRMR: {np.mean(srmrs_25_000).round(2)}""", ax=axs[1], colorbar=True, vmin=-.25, vmax=.25)
Notably, the magnitude and pattern of these residuals remain remarkably stable across both 5,000 and 25,000 observations. If the problem were due to data scarcity, we would expect improvements with more data. Instead, this invariance points to the model architecture itself—specifically, to the inductive biases embedded in the structure of the VAE. Despite being trained on rich and complete data, the model consistently struggles to recover the true correlational geometry.
This reflects a broader challenge in deep learning. Complex architectures like VAEs encode implicit assumptions about data structure through their layers and non-linearities. These assumptions are rarely transparent. In this case, the interaction between the encoder’s masking strategy, the learnable imputations, and the decoder’s reconstruction logic appears to flatten or distort inter-variable relationships in ways that simple performance metrics can’t detect. It’s a cautionary example of how black-box models can fail silently—even when outputs look convincing on the surface.
Simple Imputation and Simple Variational Encoders
Despite the sophistication of our masked VAE architecture, the earlier results showed it struggled to recover the true correlation structure of the data. In contrast, here we demonstrate that a simpler strategy—mean imputation followed by a standard VAE—performs substantially better.
Rather than explicitly modeling missingness, we use straightforward mean imputation, replacing each missing entry with the feature-wise mean from the training set. This naive approach ignores the potential structure in the missing data, but it has one key advantage: it preserves the full input dimensionality without introducing additional model components or parameters.
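Concretely, this is the feature-wise fillna already wired into make_sample(..., impute=True); a minimal sketch of that step in isolation:

# Mean imputation: replace each NaN with the column mean
df_missing = make_sample(cov_matrix, 1000, FEATURE_COLUMNS, missing_frac=0.2)
df_imputed = df_missing.fillna(df_missing.mean(axis=0))
print(df_missing.isna().sum().sum(), df_imputed.isna().sum().sum())   # all NaNs filled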
We then train a standard VAE on this imputed data. The correlation residual heatmap below shows that, unlike the masked VAE, this model accurately recovers the underlying correlation structure, even with 20% missing data. The off-diagonal residuals are smaller and less structured, indicating that the joint metric relationships are better preserved.
train_loader_imputed, test_loader_imputed = prep_data_vae(sample_size=10_000, missing_frac=0.2, impute=True)

vae = NumericVAE(n_features=train_loader_imputed.dataset.shape[1], hidden_dim=64)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
vae_fit_10_000_imputed, losses_df_10_000_imputed = train_vae(vae, optimizer, train_loader_imputed, test_loader_imputed)

bootstrapped_resids_10_000_imputed, srmrs_10_000_imputed = bootstrap_residuals(vae_fit_10_000_imputed, pd.DataFrame(test_loader_imputed.dataset, columns=FEATURE_COLUMNS))
Code
plot_heatmap(bootstrapped_resids_10_000_imputed, title=f"""Expected Correlation Residuals for 10,000 observations \n Under 1000 Bootstrapped Reconstructions SRMR: {np.mean(srmrs_10_000_imputed).round(2)}""", colorbar=True, vmin=-.25, vmax=.25);
This contrast highlights a recurring lesson in applied modeling: simple can outperform complex - especially when the complexity introduces new sources of opaque inductive bias. The masked VAE encoded specific assumptions about missingness and reconstruction that, despite being theoretically appealing, failed to generalize. In contrast, the imputation-plus-VAE approach leverages a simpler inductive bias: that missing values can be smoothed over without disrupting the overall dependency structure. In this context, that assumption proved to be not only adequate but effective.
Bayesian Inference
In contrast to the VAE, where inductive bias emerges from the architecture and training dynamics, the Bayesian model defines its assumptions transparently through the prior. Here, we specify a multivariate normal likelihood with a mean vector and a covariance matrix drawn from an LKJ prior—a structured, interpretable distribution over valid correlation matrices. The model focuses directly on estimating the joint structure we care about, rather than reconstructing data through a bottleneck.
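Before fitting, it helps to see what the LKJ prior actually implies; a minimal sketch (a small 3-dimensional case with an assumed eta, not part of the main analysis):

# Prior draws of a correlation matrix under the LKJ prior
with pm.Model() as lkj_demo:
    chol, corr, sds = pm.LKJCholeskyCov("chol", n=3, eta=1.0, sd_dist=pm.HalfNormal.dist(1))
    lkj_prior = pm.sample_prior_predictive()

# With eta = 1 each off-diagonal correlation is roughly uniform on (-1, 1);
# larger eta values concentrate the prior around the identity matrix.
corr_01 = lkj_prior.prior["chol_corr"].values[..., 0, 1]
print(corr_01.mean().round(2), corr_01.std().round(2))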
This PyMC model embodies the Bayesian workflow: first, we define a prior over parameters (here, the mean and correlation structure of features); second, we update this prior in light of observed data via Bayes’ rule, yielding the posterior; and finally, we generate samples from the posterior predictive distribution, which reflects the uncertainty in both parameter estimates and future data. These steps—prior → posterior → predictive sampling—form a disciplined approach to inference that naturally captures parameter uncertainty.
def make_pymc_model(sample_df):
    coords = {'features': FEATURE_COLUMNS,
              'features1': FEATURE_COLUMNS,
              'obs': range(len(sample_df))}

    with pm.Model(coords=coords) as model:
        # Priors
        mus = pm.Normal("mus", 0, 1, dims='features')
        chol, _, _ = pm.LKJCholeskyCov("chol", n=12, eta=1.0, sd_dist=pm.HalfNormal.dist(1))
        cov = pm.Deterministic('cov', pm.math.dot(chol, chol.T), dims=('features', 'features1'))

        pm.MvNormal('likelihood', mus, cov=cov, observed=sample_df.values, dims=('obs', 'features'))

        idata = pm.sample_prior_predictive()
        idata.extend(pm.sample(random_seed=120))
        pm.sample_posterior_predictive(idata, extend_inferencedata=True)

    return idata, model


sample_df = make_sample(cov_matrix, 500, FEATURE_COLUMNS)

idata, model = make_pymc_model(sample_df)
While the VAE also performs generative modeling, it does so with less targeted control over latent dependencies. Its inductive bias is entangled in neural architecture choices—nonlinearities, hidden dimensions, initialization schemes—making its behavior harder to interpret. The Bayesian model, by contrast, isolates the structure of interest in the prior and expresses uncertainty explicitly through the posterior. This cleaner separation of modeling goals is a strength: for structured estimation tasks like recovering correlation matrices, simpler Bayesian formulations often outperform complex, less interpretable alternatives.
pm.model_to_graphviz(model)
An additional advantage of the Bayesian approach is that the explicit inductive bias encoded in the prior helps stabilize estimation in low-data regimes. Because we constrain the model to estimate only correlation structures consistent with the LKJ prior—i.e., valid, regularized correlation matrices—it becomes possible to recover the essential structure of the data even with a small sample size. In this case, using just 500 observations, the model still reconstructs the underlying correlation matrix with notable accuracy. This is a powerful demonstration of the bias–variance trade-off in action: by introducing structured assumptions through the prior, we reduce variance in our estimates and avoid overfitting, which would be more likely in a more flexible, under-constrained model like the VAE.
expected_corr = pd.DataFrame(az.summary(idata, var_names=['chol_corr'])['mean'].values.reshape((12, 12)), columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS)

resids = pd.DataFrame(corr_matrix, columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS) - expected_corr

srmr_bayes_500 = get_srmr(resids)
plot_heatmap(resids, title=f"Expected Correlation Residuals \n on the Posterior Correlation Estimate. SRMR: {np.mean(srmr_bayes_500).round(2)}")
Missing Data
One of the most elegant features of Bayesian modeling is its native ability to handle missing data through implicit imputation. In our probabilistic model, we never need to manually fill in missing values or mask them—PyMC automatically marginalizes over the missing entries in the observed data during sampling. The model treats these missing values as additional unknowns and estimates them jointly with the other parameters using the same prior and likelihood structure. This is particularly valuable in survey data settings, such as job satisfaction responses, where non-response bias is common and must be accounted for directly.
sample_df_missing = make_sample(cov_matrix, 500, FEATURE_COLUMNS, missing_frac=0.1)

idata_missing, model_missing = make_pymc_model(sample_df_missing)
Our example shows that, even with 10% missingness in a modest-sized dataset of only 500 observations, the posterior over the correlation matrix remains well-calibrated, accurately recovering the underlying covariance structure. This demonstrates the strength of Bayesian inference: its inductive bias (in this case via the LKJ prior) not only makes estimation more robust but also provides principled uncertainty quantification for both model parameters and imputed values, all within a unified framework.
pm.model_to_graphviz(model_missing)
The model structure automatically separates the observed from the missing cells and imputes the latter as required.
expected_corr = pd.DataFrame(az.summary(idata_missing, var_names=['chol_corr'])['mean'].values.reshape((12, 12)), columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS)

resids = pd.DataFrame(corr_matrix, columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS) - expected_corr

srmr_bayes_500_missing = get_srmr(resids)
plot_heatmap(resids, title=f"Expected Correlation Residuals \n on the Posterior Correlation Estimate. SRMR: {np.mean(srmr_bayes_500_missing).round(2)}")
Deep Latent Structures with SEMs
We’ve seen how these kinds of models handle the estimation of latent correlation structure, but this only scratches the surface. Most work on job satisfaction suggests a much richer causal structure between the variables in our data set than we recover with a simple correlation analysis. Houghton and Jinkerson’s initial work on “Constructive Thought Strategies and Job Satisfaction: A Preliminary Examination” suggested that the latent causal structure was compellingly modeled with the following SEM structure.
Here they express the idea that the effects of constructive thought strategies are doubly mediated by an individual’s dysfunctional thought processes and their sense of subjective well-being. We develop the SEM model here, but the general point is that complex causal processes require transparent and expressive models for their assessment. For background on SEM modelling with PyMC see here
Code
sample_df = make_sample(cov_matrix, 263, FEATURE_COLUMNS)

coords = {
    "obs": list(range(len(sample_df))),
    "indicators": FEATURE_COLUMNS,
    "indicators_1": ["JW1", "JW2", "JW3"],  # job satisfaction
    "indicators_2": ["UF1", "UF2", "FOR"],  # well being
    "indicators_3": ["DA1", "DA2", "DA3"],  # dysfunction
    "indicators_4": ["EBA", "ST", "MI"],    # constructive thought strategies
    "latent": ["satisfaction", "well being", "dysfunctional", "constructive"],
    "latent1": ["satisfaction", "well being", "dysfunctional", "constructive"],
    "paths": ["dysfunctional ~ constructive", "well being ~ dysfunctional", "well being ~ constructive",
              "satisfaction ~ well being", "satisfaction ~ dysfunction", "satisfaction ~ constructive"],
    "sd_params": [i + '_sd' for i in FEATURE_COLUMNS],
    "corr_params": ['UF1 ~~ FOR']
}


def make_lambda(indicators, name='lambdas1', priors=[1, 10]):
    """ Takes an argument indicators which is a string in the coords dict"""
    temp_name = name + '_'
    lambdas_ = pm.Normal(temp_name, priors[0], priors[1], dims=(indicators))
    # Force a fixed scale on the factor loadings for factor 1
    lambdas_1 = pm.Deterministic(
        name, pt.set_subtensor(lambdas_[0], 1), dims=(indicators)
    )
    return lambdas_1


def make_B(priors=[0, .5]):
    coefs = pm.Normal('mu_betas', [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], dims='paths')

    zeros = pt.zeros((4, 4))
    ## dysfunctional ~ constructive
    zeros = pt.set_subtensor(zeros[3, 2], coefs[0])
    ## well being ~ dysfunctional
    zeros = pt.set_subtensor(zeros[2, 1], coefs[1])
    ## well being ~ constructive
    zeros = pt.set_subtensor(zeros[3, 1], coefs[2])
    ## satisfaction ~ well being
    zeros = pt.set_subtensor(zeros[1, 0], coefs[3])
    ## satisfaction ~ dysfunction
    zeros = pt.set_subtensor(zeros[2, 0], coefs[4])
    ## satisfaction ~ constructive
    coefs_ = pt.set_subtensor(zeros[3, 0], coefs[5])
    return coefs_


def make_Psi(indicators, name='Psi_cov'):
    """ Takes an argument indicators which is a string in the coords dict"""
    temp_name = name + '_'
    n = len(coords[indicators])
    cov_params = pm.InverseGamma(temp_name, 3, 4, dims='sd_params')
    r = pt.zeros((n, n))
    beta_params = pm.Beta(temp_name + 'beta', 1, 1, dims='corr_params')
    for i in range(len(coords[indicators])):
        r = pt.set_subtensor(r[i, i], 1)
    # UF1 ~~ FOR
    r = pt.set_subtensor(r[3, 5], beta_params[0])
    s = pt.diag(cov_params)
    cov = (s @ r) @ pt.transpose(s @ r)
    r = pm.Deterministic('Psi_corr', r)
    cov = pm.Deterministic('Psi_cov', cov)

    return cov


obs_idx = list(range(len(sample_df)))
observed_data = sample_df[coords['indicators']].values
conditional = False

with pm.Model(coords=coords) as sem_model_clean:
    # --- Factor loadings ---
    lambdas_1 = make_lambda('indicators_1', 'lambdas1', priors=[1, .5])
    lambdas_2 = make_lambda('indicators_2', 'lambdas2', priors=[1, .5])
    lambdas_3 = make_lambda('indicators_3', 'lambdas3', priors=[1, .5])
    lambdas_4 = make_lambda('indicators_4', 'lambdas4', priors=[1, .5])

    Lambda = pt.zeros((12, 4))
    Lambda = pt.set_subtensor(Lambda[0:3, 0], lambdas_1)
    Lambda = pt.set_subtensor(Lambda[3:6, 1], lambdas_2)
    Lambda = pt.set_subtensor(Lambda[6:9, 2], lambdas_3)
    Lambda = pt.set_subtensor(Lambda[9:12, 3], lambdas_4)
    Lambda = pm.Deterministic('Lambda', Lambda)

    # --- Structural coefficients matrix B ---
    coefs_ = make_B()
    coefs_ = pm.Deterministic('B', coefs_, dims=('latent', 'latent1'))

    # Identity matrix for latent_dim
    latent_dim = len(coords['latent'])
    I = pt.eye(latent_dim)

    # Structural errors covariance (Psi_zeta)
    chol_latent, corr_latent, _ = pm.LKJCholeskyCov(
        "chol_latent", n=latent_dim, eta=2, sd_dist=pm.HalfNormal.dist(1.0)
    )
    Psi_zeta = pm.Deterministic("Psi_zeta", chol_latent.dot(chol_latent.T))

    # Structural errors for latent variables (exogenous residuals)
    zeta = pm.MvNormal("zeta", mu=pt.zeros(latent_dim), chol=chol_latent, shape=(len(sample_df), latent_dim))

    # Solve (I - B) * ksi.T = zeta.T => ksi = ((I-B)^-1) * zeta
    lhs = I - coefs_ + 1e-8 * pt.eye(latent_dim)  # (latent_dim, latent_dim)
    lhs = pm.Deterministic('I-B', lhs)
    ksi_T = pt.slinalg.solve(lhs, zeta.T)         # solve for ksi.T shape (latent_dim, n_obs)
    ksi = pm.Deterministic("ksi", ksi_T.T)        # transpose back (n_obs, latent_dim)

    # --- Residual covariance for observed variables ---
    Psi = make_Psi('indicators')

    if conditional:
        mu = pt.dot(ksi, Lambda.T)  # (n_obs, n_indicators)

        # --- Observed data likelihood ---
        _ = pm.MvNormal("likelihood", mu=mu, cov=Psi, observed=observed_data)

    else:
        inv_lhs = pt.slinalg.solve(lhs, pt.eye(latent_dim))
        inv_lhs = pm.Deterministic('solve_I-B', inv_lhs, dims=('latent', 'latent1'))
        Sigma_y = pm.Deterministic('Sigma_y', Lambda.dot(inv_lhs).dot(Psi_zeta).dot(inv_lhs.T).dot(Lambda.T) + Psi)
        _ = pm.MvNormal("likelihood", mu=0, cov=Sigma_y, observed=observed_data)
The model structure is now more complicated than the simple multivariate outcomes we’ve seen before - but this complexity is tied to an expressive structure and deliberately coded relations.
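In the marginal (unconditional) branch of the model above, the latent factors are integrated out analytically. With B as constructed in `make_B`, the implied covariance of the indicators is assembled exactly as in the `Sigma_y` line:

$$
\Sigma_y = \Lambda (I - B)^{-1} \Psi_{\zeta} (I - B)^{-\top} \Lambda^{\top} + \Psi
$$

where $\Psi_{\zeta}$ is the structural error covariance from the LKJ prior and $\Psi$ is the residual covariance built in `make_Psi` as $(SR)(SR)^{\top}$, with $S$ a diagonal matrix of residual scales and $R$ an identity-like matrix carrying the single free UF1 ~~ FOR entry.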
pm.model_to_graphviz(sem_model_clean)
In the Lavaan syntax we have something like:
# measurement part
CTS =~ EBA + ST + MI
DTP =~ DA1 + DA2 + DA3
SWB =~ UF1 + UF2 + FOR
JS  =~ JW1 + JW2 + JW3

# error covariance
UF1 ~~ FOR

# structural part
DTP ~ CTS
SWB ~ CTS + DTP
JS  ~ CTS + DTP + SWB
where we have explicitly coded the indicator variables in our data set as loading onto one of four distinct factors. These factors are in turn placed into a set of regression relationships, and we allow a residual covariance between two of our indicator variables. We can now sample from this structure and estimate the factor loadings and path coefficients at the same time.
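Before sampling, a small, purely illustrative sanity check (a sketch, not part of the model itself) can make the layout of B explicit: `make_B` stores each path coefficient with the regression predictor on the row and the target on the column, which is also how the effect heatmaps are labelled later.

```python
import pandas as pd

# Hypothetical helper: spell out which cell of the 4x4 B matrix each path occupies.
# The (row, col) pairs below mirror the pt.set_subtensor calls in make_B:
# rows index the regression predictor, columns index the target variable.
latent = ["satisfaction", "well being", "dysfunctional", "constructive"]
path_cells = {
    "dysfunctional ~ constructive": (3, 2),
    "well being ~ dysfunctional": (2, 1),
    "well being ~ constructive": (3, 1),
    "satisfaction ~ well being": (1, 0),
    "satisfaction ~ dysfunction": (2, 0),
    "satisfaction ~ constructive": (3, 0),
}

B_layout = pd.DataFrame("", index=latent, columns=latent)
for path, (row, col) in path_cells.items():
    B_layout.iloc[row, col] = path
print(B_layout)
```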
with sem_model_clean:
    idata = pm.sample_prior_predictive()
    idata.extend(pm.sample(chains=4, tune=2000, draws=10_000, target_accept=.95, random_seed=232))
We plot the estimated parameters here. We never explicitly encoded the linear relationships in our simulation routine, but even so, it seems we can recover the expected positive impact of constructive thought strategies on job satisfaction. These estimates align well with the original results reported in Vehkalahti and Everitt’s Multivariate Analysis for the Behavioral Sciences.
ax = az.plot_forest(idata, figsize=(8, 8),
                    var_names=['lambdas1', 'lambdas2', 'lambdas3', 'lambdas4', 'mu_betas'],
                    combined=True, r_hat=True);
ax[0].axvline(0, linestyle='--', color='k')
ax[0].set_title("Structural Parameter Estimates from SEM model");

az.summary(idata, var_names=['lambdas1', 'lambdas2', 'lambdas3', 'lambdas4', 'mu_betas'])
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
lambdas1[JW1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | NaN | 40000.0 | 40000.0 | NaN |
lambdas1[JW2] | 1.030 | 0.093 | 0.861 | 1.209 | 0.001 | 0.000 | 16702.0 | 22989.0 | 1.00 |
lambdas1[JW3] | 0.941 | 0.085 | 0.783 | 1.101 | 0.001 | 0.000 | 19206.0 | 25428.0 | 1.00 |
lambdas2[UF1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | NaN | 40000.0 | 40000.0 | NaN |
lambdas2[UF2] | 2.070 | 0.232 | 1.644 | 2.512 | 0.002 | 0.001 | 10599.0 | 19707.0 | 1.00 |
lambdas2[FOR] | 1.016 | 0.146 | 0.747 | 1.292 | 0.001 | 0.001 | 12431.0 | 20468.0 | 1.00 |
lambdas3[DA1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | NaN | 40000.0 | 40000.0 | NaN |
lambdas3[DA2] | 1.147 | 0.077 | 1.005 | 1.294 | 0.001 | 0.000 | 22948.0 | 27973.0 | 1.00 |
lambdas3[DA3] | 1.013 | 0.095 | 0.841 | 1.197 | 0.000 | 0.000 | 36920.0 | 31202.0 | 1.00 |
lambdas4[EBA] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | NaN | 40000.0 | 40000.0 | NaN |
lambdas4[ST] | 1.063 | 0.136 | 0.802 | 1.316 | 0.001 | 0.001 | 29345.0 | 28238.0 | 1.00 |
lambdas4[MI] | 1.546 | 0.155 | 1.255 | 1.835 | 0.001 | 0.001 | 15054.0 | 20925.0 | 1.00 |
mu_betas[dysfunctional ~ constructive] | -0.044 | 0.563 | -1.079 | 1.043 | 0.017 | 0.011 | 1051.0 | 2277.0 | 1.01 |
mu_betas[well being ~ dysfunctional] | -0.401 | 0.631 | -1.593 | 0.806 | 0.017 | 0.011 | 1458.0 | 2466.0 | 1.00 |
mu_betas[well being ~ constructive] | 0.306 | 0.746 | -1.138 | 1.678 | 0.016 | 0.008 | 2171.0 | 4796.0 | 1.00 |
mu_betas[satisfaction ~ well being] | 0.140 | 0.202 | -0.255 | 0.524 | 0.008 | 0.005 | 605.0 | 1215.0 | 1.01 |
mu_betas[satisfaction ~ dysfunction] | -0.123 | 0.321 | -0.765 | 0.454 | 0.011 | 0.008 | 800.0 | 1517.0 | 1.01 |
mu_betas[satisfaction ~ constructive] | 0.063 | 0.402 | -0.701 | 0.824 | 0.013 | 0.009 | 1000.0 | 1741.0 | 1.01 |
and the chains look to have sampled the parameter space well.
az.plot_trace(idata, var_names=['lambdas1', 'lambdas2', 'lambdas3', 'lambdas4', 'mu_betas'], figsize=(8, 10));
plt.tight_layout();
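For a numerical complement to the visual inspection, something like the following works (a minimal sketch, assuming the `idata` object from the sampling step above; the same diagnostics already appear in the summary table):

```python
# Numerical convergence checks: r_hat near 1 and large effective sample sizes
# indicate the chains have mixed well for the structural path coefficients.
print(az.rhat(idata, var_names=["mu_betas"]))
print(az.ess(idata, var_names=["mu_betas"]))
```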
Now we can sample the posterior predictive distribution for the multivariate outcome.
with sem_model_clean:
    idata.extend(pm.sample_posterior_predictive(idata))
Finally, we plot the reconstruction of the correlation structure based on the posterior predictive distribution of the structural equation model. The results are pretty compelling. Recall now what we’ve done. We’ve encoded a causal DAG structure in our model specification and were able to recover a plausible reconstruction of the multivariate correlations while estimating structural parameters of interest.
def sample_ppc_calc_resids(idata, n_samples=300):
    resid_array = np.zeros((n_samples, 12, 12))
    for i in range(n_samples):
        ppc_corr = pd.DataFrame(idata['posterior_predictive']['likelihood'].stack(sample=["chain", "draw"])[:, :, i], columns=FEATURE_COLUMNS).corr()
        resid = ppc_corr - pd.DataFrame(corr_matrix, columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS)
        resid_array[i] = resid.values

    average_resids = pd.DataFrame(resid_array.mean(axis=0), columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS)
    return average_resids

average_resids = sample_ppc_calc_resids(idata)
srmr = get_srmr(average_resids)
plot_heatmap(average_resids, title=f"Reconstructed Correlation Structure from SEM model \n SRMR: {srmr.round(2)}")
We can additionally pull out the direct, indirect and total effects of the mediated predictor relations.
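The decomposition used in the code below follows from the Neumann series of the structural part: since $(I - B)^{-1} = I + B + B^2 + \dots$, the direct effects are the entries of $B$, the total effects are the off-diagonal entries of $(I - B)^{-1}$ (the `solve_I-B` deterministic above), and the indirect effects are their difference,

$$
\text{Indirect} = (I - B)^{-1} - I - B,
$$

which off the diagonal equals $\text{Total} - \text{Direct}$ because the identity only contributes on the diagonal.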
Code
fig, axs = plt.subplots(1, 3, figsize=(9, 4))
axs = axs.flatten()

total_effect = az.summary(idata, var_names=['solve_I-B'])['mean'].values.reshape((4, 4))

direct_effect = az.summary(idata, var_names=['B'])['mean'].values.reshape((4, 4))

i, j = np.triu_indices(total_effect.reshape((4, 4)).shape[0])
total_effect[i, j] = np.nan
direct_effect[i, j] = np.nan

indirect_effect = total_effect - direct_effect

plot_heatmap(pd.DataFrame(direct_effect, columns=coords['latent'], index=coords['latent']), ax=axs[0], vmax=1, vmin=-1, colorbar=False, title="Direct Effects", cmap='viridis')
plt.tight_layout()
axs[0].set_xticklabels(axs[0].get_xticklabels(), rotation=65)
axs[0].set_xlabel("Target Variable")
axs[0].set_ylabel("Regression Predictor")

plot_heatmap(pd.DataFrame(indirect_effect, columns=coords['latent'], index=coords['latent']), ax=axs[1], vmax=1, vmin=-1, colorbar=False, title="In-Direct Effects", cmap='viridis')
plt.tight_layout()
axs[1].set_yticklabels([])
axs[1].set_xticklabels(axs[1].get_xticklabels(), rotation=65)
axs[1].set_xlabel("Target Variable")

plot_heatmap(pd.DataFrame(total_effect.reshape((4, 4)), columns=coords['latent'], index=coords['latent']), ax=axs[2], vmax=1, vmin=-1, colorbar=False, title="Total Effects", cmap='viridis')
plt.tight_layout()
axs[2].set_yticklabels([])
axs[2].set_xticklabels(axs[2].get_xticklabels(), rotation=65)
axs[2].set_xlabel("Target Variable");
Let’s recall all we’ve just seen. We started with a specific covariance structure describing the relationships between 12 indicator metrics related to measures of job satisfaction. We assessed two VAE architectures to determine whether their latent structure was rich enough to recover the correlation structure of those variables. We found one architecture which could recover the latent structure and another which broke in a subtle way. These models were then compared to a vanilla multivariate Gaussian Bayesian model, with and without missing data. In both cases we saw that the Bayesian model specification was more efficient at recovering the latent structure. Finally, we asked whether we could further infer the deep structure of the latent correlations using an SEM architecture. This latent factor structure was able to recover the observed correlation structure and further decompose that matrix into a series of regression relations. This decomposition offers more insight into the additive effects of constructive thought processes on job satisfaction and well-being: your own well-being is the most decisive factor in predicting job satisfaction, and dysfunctional thought patterns can have a negative effect on it.
Conclusion and Key Takeaways
Most complex phenomena have a complex data generating process. For a statistical model to retain plausible fidelity to the phenomena, we should transparently encode as much complexity as required in our models. This, as we’ve seen, is not completely trivial.
Inductive Bias Shapes Learning: The assumptions built into models, whether explicit priors or implicit architectures, directly influence what patterns are captured and how well they generalize.
Latent Spaces Are Tricky: Heuristics in latent space can be elusive; interpretability and validation demand careful scrutiny beyond surface-level metrics.
Simplicity Often Wins: Straightforward imputation paired with simpler models can outperform complex architectures when it comes to reconstructing key data structures like correlations.
Bayesian Frameworks Offer Clarity: Encoding domain knowledge through priors leads to cleaner, more targeted models that perform well even with limited data.
Implicit Handling of Missing Data: Bayesian inference naturally accommodates uncertainty and missingness without complex architectural tweaks.
Iterative Model Development: Effective modeling is a continuous process of refining assumptions, balancing complexity, and aligning with the data’s underlying structure.
Our exploration reveals a fundamental truth in statistical modeling: building models is not just about fitting data, but about embedding assumptions—inductive biases—that shape how we interpret and generalize from incomplete information. Complex models with rich latent spaces can mislead if their structure obscures what they truly learn. Simpler, well-targeted approaches, especially those grounded in principled Bayesian inference, often provide clearer, more reliable insight.
Ultimately, model development is an iterative art of balancing flexibility with structure, complexity with interpretability, and data with prior knowledge. There is a notorious pattern of empirical trial and error in model development with deep learning models. This can seem opaque and frustrating, but it is more profitably seen, alongside the contemporary Bayesian model development workflow, as an exploration versus exploitation trade-off in which we systematically assess the implications of model specification. Mastering this balance is key to robust inference, meaningful imputation, and trustworthy prediction in the face of uncertainty.
Citation
@online{forde2025,
author = {Forde, Nathaniel},
title = {Heuristics in {Latent} {Space:} {VAEs} and {Bayesian}
{Inference}},
date = {2025-07-25},
langid = {en},
abstract = {Missing data imposes two costs: the expense of collecting
new observations and the risk of distorting the underlying structure
when we construct plausible replacements. In job satisfaction
surveys, non-response patterns can conceal meaningful latent
relationships. This study compares statistical and machine learning
approaches for imputation, focusing on their ability to preserve
these latent structures while producing realistic reconstructions.}
}