library(lavaan)
library(dplyr)
library(reticulate)
library(marginaleffects)
library(modelsummary)
library(ggplot2)
library(tidyr)
library(egg)
library(lme4)
library(semPlot)
library(tinytable)
library(kableExtra)
library(reshape2)
reticulate::py_run_string("import pymc as pm")
options(rstudio.python.installationPath = "/Users/nathanielforde/mambaforge/envs")
options("modelsummary_factory_default" = "tinytable")
options(repr.plot.width=15, repr.plot.height=8)
knitr::knit_engines$set(python = reticulate::eng_python)
options(scipen=999)
set.seed(130)
“In all fiction, when a man is faced with alternatives he chooses one at the expense of the others. In the almost unfathomable Ts’ui Pen, he chooses – simultaneously – all of them. He thus creates various futures, various times which start others that will in their turn branch out and bifurcate in other times” - Borges in The Garden of Forking Paths
Measurement and Measurement Constructs
Science is easier when there is a recipe. When there is some procedure to adopt, routine to ape or track to follow, you can out-source the responsibility for methodological justification. One egregious pattern in this vein tries to mask implausible nonsense with the mathematical gloss of “statistical significance”. Seen from 1000 feet, this is misguided but not surprising. Lip-service is paid to the idea of scientific method and we absolve ourselves of the requirement for principled justification, substituting substantive argument with a facsimile of evidence.
Evidence should be marshaled in service to argument. But it’s an absurd pretense to claim that data speaks for itself in this argument. Good and compelling argumentation is at the heart of any sound inference. It is a necessary obligation if you expect anyone to make any decision on the strength of evidence you provide. Procedures and routine tests offer only a facsimile of sound argument and p-values are a poor substitute for narrative cogency, logical consequence and substantive derivation.
Data is found, gathered or (maybe even) carefully curated. In all cases there is need for a defensive posture and an argument that the data is fit-for-purpose. Nowhere is this more clear than in psychometrics, where the data is often derived from a strategically constructed survey aimed at a particular target phenomenon: some intuited, but not yet measured, concept that arguably plays a role in human action, motivation or sentiment. The relative “fuzziness” of the subject matter in psychometrics has had a catalyzing effect on the methodological rigour sought in the science. Survey designs are agonized over for correct tone and rhythm of sentence structure. Measurement scales are double-checked for reliability and correctness. The literature is consulted and questions are refined. Analysis steps are justified and tested under a wealth of modelling routines. Model architectures are defined and refined to better express the hypothesized structures in the data-generating process. We will see how such due diligence leads to powerful and expressive models that grant us tractability on thorny questions of human affect.
Structure of Presentation
We aim to follow an escalating series of model complexity, highlighting the variety of choices that go into modeling psychometric survey data, ultimately concluding by demonstrating how to fit appropriately complex models in PyMC.
Exploratory Data Visualization
- Pulling out views to try and make the multivariate survey response seem a bit more concrete.
Exploratory Regression Models
- Highlighting the nature of the representation that is used when we fit this data into a simple regression context. We also discuss how a model is a tool for exploring the data as much as it is a statement about the nature of the relations in our data-set.
Interlude on Model Design and Conditional Independence
- A brief digression on what we’re aiming to do when we fit a model to this kind of data. In particular, how we need richer structures to articulate the models that can actually answer questions of interest.
Expressive Structure: Building the Model we Need
Confirmatory Factor Models in lavaan
Structural Equation Models in lavaan
Bayesian Confirmatory Factor Models in PyMC
Bayesian Structural Equation Models in PyMC
Throughout we’ll draw on Levy and Mislevy’s Bayesian Psychometric Modeling to highlight how good model design seeks structures which justify claim(s) of conditional independence; thereby supporting the argument to compelling inference.
The Data
Our data is borrowed from work by Boris Mayer and Andrew Ellis found here. They demonstrate CFA and SEM modelling with lavaan. We’ll load up their data.
df = read.csv('sem_data.csv')
df$ls_sum <- df$ls_p1 + df$ls_p2 + df$ls_p3
df$ls_mean <- rowMeans(df[c('ls_p1', 'ls_p2', 'ls_p3')])
head(df) |> kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
ID | region | gender | age | se_acad_p1 | se_acad_p2 | se_acad_p3 | se_social_p1 | se_social_p2 | se_social_p3 | sup_friends_p1 | sup_friends_p2 | sup_friends_p3 | sup_parents_p1 | sup_parents_p2 | sup_parents_p3 | ls_p1 | ls_p2 | ls_p3 | ls_sum | ls_mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | west | female | 13 | 4.857143 | 5.571429 | 4.500000 | 5.80 | 5.500000 | 5.40 | 6.5 | 6.5 | 7.0 | 7.0 | 7.0 | 6.0 | 5.333333 | 6.75 | 5.50 | 17.58333 | 5.861111 |
2 | west | male | 14 | 4.571429 | 4.285714 | 4.666667 | 5.00 | 5.500000 | 4.80 | 4.5 | 4.5 | 5.5 | 5.0 | 6.0 | 4.5 | 4.333333 | 5.00 | 4.50 | 13.83333 | 4.611111 |
10 | west | female | 14 | 4.142857 | 6.142857 | 5.333333 | 5.20 | 4.666667 | 6.00 | 4.0 | 4.5 | 3.5 | 7.0 | 7.0 | 6.5 | 6.333333 | 5.50 | 4.00 | 15.83333 | 5.277778 |
11 | west | female | 14 | 5.000000 | 5.428571 | 4.833333 | 6.40 | 5.833333 | 6.40 | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 | 4.333333 | 6.50 | 6.25 | 17.08333 | 5.694444 |
12 | west | female | 14 | 5.166667 | 5.600000 | 4.800000 | 5.25 | 5.400000 | 5.25 | 7.0 | 7.0 | 7.0 | 6.5 | 6.5 | 7.0 | 5.666667 | 6.00 | 5.75 | 17.41667 | 5.805556 |
14 | west | male | 14 | 4.857143 | 4.857143 | 4.166667 | 5.20 | 5.000000 | 4.20 | 5.5 | 6.5 | 7.0 | 6.5 | 6.5 | 6.5 | 5.000000 | 5.50 | 5.50 | 16.00000 | 5.333333 |
We have survey responses from ~300 individuals who have answered questions regarding their upbringing, self-efficacy and reported life-satisfaction. The hypothetical dependency structure in this life-satisfaction data-set posits a moderated relationship between scores related to life-satisfaction, parental and family support and self-efficacy. It is not a trivial task to design a survey that can elicit answers plausibly mapped to each of these “factors” or themes, never mind finding a model of their relationship that can inform us as to the relative impact of each on life-satisfaction outcomes.
We will try to show some of the subtle aspects of these relationships as we go.
We can thematically cluster our survey questions along these lines because the source survey deliberately targeted each of the underlying themes in an effort to pin down the relations between these hard-to-measure constructs. We are assuming that our measurement scales are well designed and that the mapping of themes is appropriate. Here we can see how the high level summary statistics show the variation across these measures.
#| code-fold: true
#| code-summary: "Show the code"
datasummary_skim(df)|>
style_tt(
i = 15:17,
j = 1:1,
background = "#20AACC",
color = "white",
italic = TRUE) |>
style_tt(
i = 18:19,
j = 1:1,
background = "#2888A0",
color = "white",
italic = TRUE) |>
style_tt(
i = 2:14,
j = 1:1,
background = "#17C2AD",
color = "white",
italic = TRUE)
Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
---|---|---|---|---|---|---|---|---|
ID | 283 | 0 | 187.9 | 106.3 | 1.0 | 201.0 | 367.0 | |
age | 5 | 0 | 14.7 | 0.8 | 13.0 | 15.0 | 17.0 | |
se_acad_p1 | 32 | 0 | 5.2 | 0.8 | 3.1 | 5.1 | 7.0 | |
se_acad_p2 | 36 | 0 | 5.3 | 0.7 | 3.1 | 5.4 | 7.0 | |
se_acad_p3 | 29 | 0 | 5.2 | 0.8 | 2.8 | 5.2 | 7.0 | |
se_social_p1 | 24 | 0 | 5.3 | 0.8 | 1.8 | 5.4 | 7.0 | |
se_social_p2 | 27 | 0 | 5.5 | 0.7 | 2.7 | 5.5 | 7.0 | |
se_social_p3 | 31 | 0 | 5.4 | 0.8 | 3.0 | 5.5 | 7.0 | |
sup_friends_p1 | 13 | 0 | 5.8 | 1.1 | 1.0 | 6.0 | 7.0 | |
sup_friends_p2 | 10 | 0 | 6.0 | 0.9 | 2.5 | 6.0 | 7.0 | |
sup_friends_p3 | 13 | 0 | 6.0 | 1.1 | 1.0 | 6.0 | 7.0 | |
sup_parents_p1 | 11 | 0 | 6.0 | 1.1 | 1.0 | 6.0 | 7.0 | |
sup_parents_p2 | 11 | 0 | 5.9 | 1.1 | 2.0 | 6.0 | 7.0 | |
sup_parents_p3 | 13 | 0 | 5.7 | 1.1 | 1.0 | 6.0 | 7.0 | |
ls_p1 | 15 | 0 | 5.2 | 0.9 | 2.0 | 5.3 | 7.0 | |
ls_p2 | 21 | 0 | 5.8 | 0.7 | 2.5 | 5.8 | 7.0 | |
ls_p3 | 22 | 0 | 5.2 | 0.9 | 2.0 | 5.2 | 7.0 | |
ls_sum | 98 | 0 | 16.2 | 2.1 | 8.7 | 16.4 | 20.8 | |
ls_mean | 97 | 0 | 5.4 | 0.7 | 2.9 | 5.5 | 6.9 | |
N | % | |||||||
region | east | 142 | 50.2 | |||||
west | 141 | 49.8 | ||||||
gender | female | 132 | 46.6 | |||||
male | 151 | 53.4 |
Note how we’ve distinguished between the “outcome” metrics and the “driver” metrics. Such a distinction may seem trivial, but it is only possible because we bring substantive knowledge to bear on the problem in the design of our survey and the postulation of the theoretical construct. Practically our data is just a multivariate outcome vector. It’s theory that offers a structural understanding of how these metrics reflect a dependence. There is a large range of possible interacting effects and patterns of realization for our life-satisfaction scores. The true “drivers” of satisfaction may be quite different than the hypothesised structure. It is this open question that we’re aiming to uncover in the analysis of our survey data. Does our data support the view that parental support is essential to life-satisfaction? To what degree? How do we tease out these implications precisely?
Sample Covariances and Correlations
To start, let’s plot the relations amongst our various indicator scores.
drivers = c('se_acad_p1', 'se_acad_p2', 'se_acad_p3', 'se_social_p1', 'se_social_p2', 'se_social_p3', 'sup_friends_p1', 'sup_friends_p2', 'sup_friends_p3', 'sup_parents_p1', 'sup_parents_p2', 'sup_parents_p3', 'ls_p1', 'ls_p2', 'ls_p3')
plot_heatmap <- function(df, title="Sample Covariances", subtitle="Observed Measures") {
  heat_df = df |> as.matrix() |> melt()
  colnames(heat_df) <- c("x", "y", "value")
  g <- heat_df |> mutate(value = round(value, 2)) |> ggplot(aes(x = x, y = y, fill = value)) +
    geom_tile() + geom_text(aes(label = value), color = "black", size = 4) +
    scale_fill_gradient2(
      high = 'dodgerblue4',
      mid = 'white',
      low = 'firebrick2'
    ) + theme(axis.text.x = element_text(angle=45)) + ggtitle(title, subtitle)
  g
}

g1 = plot_heatmap(cov(df[, drivers]))

g2 = plot_heatmap(cor(df[, drivers]), "Sample Correlations")

plot <- ggarrange(g1, g2, ncol=1, nrow=2); plot
Note the relatively strong correlations between measures of parental support and the life-satisfaction outcome ls_p3. Similarly, the social self-efficacy scores seem well correlated with the secondary life satisfaction indicator ls_p2. These observed correlations merit some further investigation. We can also plot the pairs of scatter plots to “dig deeper”. What kind of correlation holds between these scores? Are any driven by extreme outliers? This is what we’re looking for in a pair plot. Should we cull the outliers? Leave them in? We leave them in.
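Before deciding, it can help to quantify what “extreme” means. Here is a minimal sketch, not part of the original analysis, that flags responses more than three standard deviations from an indicator’s mean; the 3-SD cut-off is an arbitrary convention and the substantive call about culling remains ours to make.

# Illustrative outlier check: flag responses more than 3 SDs from each indicator's mean
outlier_flags <- sapply(df[, drivers], function(x) abs((x - mean(x)) / sd(x)) > 3)
colSums(outlier_flags)                                   # extreme responses per indicator
head(df[rowSums(outlier_flags) > 0, c("ID", drivers)])   # the offending rows, if any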
Pair Plots of Multivariate Outcome
df <- df |>
  mutate(id = row_number())

# Prepare data to be plotted on the x axis
x_vars <- pivot_longer(data = df,
                       cols = se_acad_p1:ls_p3,
                       names_to = "variable_x",
                       values_to = "x")

# Prepare data to be plotted on the y axis
y_vars <- pivot_longer(data = df,
                       cols = se_acad_p1:ls_p3,
                       names_to = "variable_y",
                       values_to = "y")

# Join data for x and y axes and make plot
full_join(x_vars, y_vars,
          by = c("id"),
          relationship = "many-to-many") |>
  ggplot() +
  aes(x = x, y = y) +
  geom_point() + geom_smooth(method='lm') +
  facet_grid(c("variable_x", "variable_y")) + ggtitle("Pair Plot of Indicator Metrics",
                                                      "Comparing Against Life Satisfaction Scores")
`geom_smooth()` using formula = 'y ~ x'
The scatter plots among the highly correlated variables in the heat-map do seem to exhibit some kind of linear relationship with aspects of the life-satisfaction scores. We now turn to modelling these relationships to tease out some kind of inferential summary of those relationships.
Exploratory Regression Models
To model these effects we make use of the aggregated sum and mean scores to express the relationships between these indicator metrics and life-satisfaction. We fit a variety of models, each one escalating in the number of indicator metrics we incorporate into our model of the life-satisfaction outcome. This side-steps the multivariate nature of the hypothesised constructs and crudely amalgamates the indicator metrics. This may be more or less justified depending on how similar in theme the three outcome questions ls_p1, ls_p2, ls_p3 are in nature. We’ve dodged the question of thematic unity as all our metrics “load equally” on the outcome variable here and we let our regression model sort out the weighting.
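As a rough check on that thematic unity (an illustrative aside, not in the original analysis), we can compute Cronbach’s alpha for the three life-satisfaction items directly from its textbook formula; values around 0.7 or higher are conventionally read as acceptable internal consistency.

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the summed scale)
ls_items <- df[, c('ls_p1', 'ls_p2', 'ls_p3')]
k <- ncol(ls_items)
alpha_ls <- (k / (k - 1)) * (1 - sum(apply(ls_items, 2, var)) / var(rowSums(ls_items)))
alpha_ls

A low value here would warn us that the sum and mean scores are averaging over items that do not really hang together.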
= " ls_sum ~ se_acad_p1 + se_social_p1 + sup_friends_p1 + sup_parents_p1"
formula_sum_1st = " ls_mean ~ se_acad_p1 + se_social_p1 + sup_friends_p1 + sup_parents_p1"
formula_mean_1st
= " ls_sum ~ se_acad_p1 + se_acad_p2 + se_social_p1 + se_social_p2 +
formula_sum_12 sup_friends_p1 + sup_friends_p2 + sup_parents_p1 + sup_parents_p2"
= " ls_mean ~ se_acad_p1 + se_acad_p2 + se_social_p1 + se_social_p2 +
formula_mean_12 sup_friends_p1 + sup_friends_p2 + sup_parents_p1 + sup_parents_p2"
= " ls_sum ~ se_acad_p1 + se_acad_p2 + se_acad_p3 + se_social_p1 + se_social_p2 + se_social_p3 + sup_friends_p1 + sup_friends_p2 + sup_friends_p3 + sup_parents_p1 + sup_parents_p2 + sup_parents_p3"
formula_sum = " ls_mean ~ se_acad_p1 + se_acad_p2 + se_acad_p3 + se_social_p1 + se_social_p2 + se_social_p3 + sup_friends_p1 + sup_friends_p2 + sup_friends_p3 + sup_parents_p1 + sup_parents_p2 + sup_parents_p3"
formula_mean = lm(formula_sum, df)
mod_sum = lm(formula_sum_1st, df)
mod_sum_1st = lm(formula_sum_12, df)
mod_sum_12 = lm(formula_mean, df)
mod_mean = lm(formula_mean_1st, df)
mod_mean_1st = lm(formula_mean_12, df)
mod_mean_12
<- function(x) {
norm - mean(x)) / sd(x)
(x
}
<- as.data.frame(lapply(df[c(5:19)], norm))
df_norm
$ls_sum <- df$ls_sum
df_norm$ls_mean <- df$ls_mean
df_norm
= lm(formula_sum, df_norm)
mod_sum_norm = lm(formula_mean, df_norm)
mod_mean_norm
= list(
models "Outcome: sum_score" = list("model_sum_1st_factors" = mod_sum_1st,
"model_sum_1st_2nd_factors" = mod_sum_12,
"model_sum_score" = mod_sum,
"model_sum_score_norm" = mod_sum_norm
),"Outcome: mean_score" = list(
"model_mean_1st_factors" = mod_mean_1st,
"model_mean_1st_2nd_factors" = mod_mean_12,
"model_mean_score"= mod_mean,
"model_mean_score_norm" = mod_mean_norm
) )
The classical presentation of regression models reports the coefficient weights accorded to each of the input variables. We present these models to highlight that the manner in which we represent our theoretical constructs has ramifications for the interpretation of the data generating process. In particular, note how different degrees of significance are accorded to the different variables depending on which alternate variables are included.
#| code-fold: true
#| code-summary: "Show the code"
modelsummary(models, stars=TRUE, shape ="cbind") |>
style_tt(
i = 2:25,
j = 1:1,
background = "#17C2AD",
color = "white",
italic = TRUE)
Outcome: sum_score | Outcome: mean_score | |||||||
---|---|---|---|---|---|---|---|---|
model_sum_1st_factors | model_sum_1st_2nd_factors | model_sum_score | model_sum_score_norm | model_mean_1st_factors | model_mean_1st_2nd_factors | model_mean_score | model_mean_score_norm | |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||||||||
(Intercept) | 5.118*** | 2.644** | 2.094* | 16.192*** | 1.706*** | 0.881** | 0.698* | 5.397*** |
(0.907) | (0.985) | (0.954) | (0.089) | (0.302) | (0.328) | (0.318) | (0.030) | |
se_acad_p1 | 0.242 | -0.034 | -0.208 | -0.165 | 0.081 | -0.011 | -0.069 | -0.055 |
(0.147) | (0.180) | (0.192) | (0.153) | (0.049) | (0.060) | (0.064) | (0.051) | |
se_social_p1 | 1.088*** | 0.501* | 0.355+ | 0.269+ | 0.363*** | 0.167* | 0.118+ | 0.090+ |
(0.162) | (0.204) | (0.200) | (0.152) | (0.054) | (0.068) | (0.067) | (0.051) | |
sup_friends_p1 | 0.125 | -0.224+ | -0.272+ | -0.307+ | 0.042 | -0.075+ | -0.091+ | -0.102+ |
(0.088) | (0.133) | (0.150) | (0.169) | (0.029) | (0.044) | (0.050) | (0.056) | |
sup_parents_p1 | 0.561*** | 0.238+ | 0.072 | 0.077 | 0.187*** | 0.079+ | 0.024 | 0.026 |
(0.100) | (0.141) | (0.143) | (0.154) | (0.033) | (0.047) | (0.048) | (0.051) | |
se_acad_p2 | 0.448* | 0.327 | 0.224 | 0.149* | 0.109 | 0.075 | ||
(0.197) | (0.202) | (0.139) | (0.066) | (0.067) | (0.046) | |||
se_social_p2 | 0.756*** | 0.509* | 0.356* | 0.252*** | 0.170* | 0.119* | ||
(0.213) | (0.219) | (0.153) | (0.071) | (0.073) | (0.051) | |||
sup_friends_p2 | 0.369* | 0.331* | 0.313* | 0.123* | 0.110* | 0.104* | ||
(0.157) | (0.160) | (0.151) | (0.052) | (0.053) | (0.050) | |||
sup_parents_p2 | 0.370** | 0.118 | 0.129 | 0.123** | 0.039 | 0.043 | ||
(0.138) | (0.144) | (0.157) | (0.046) | (0.048) | (0.052) | |||
se_acad_p3 | 0.153 | 0.123 | 0.051 | 0.041 | ||||
(0.174) | (0.140) | (0.058) | (0.047) | |||||
se_social_p3 | 0.443** | 0.354** | 0.148** | 0.118** | ||||
(0.161) | (0.129) | (0.054) | (0.043) | |||||
sup_friends_p3 | 0.165 | 0.181 | 0.055 | 0.060 | ||||
(0.130) | (0.143) | (0.043) | (0.048) | |||||
sup_parents_p3 | 0.526*** | 0.602*** | 0.175*** | 0.201*** | ||||
(0.126) | (0.144) | (0.042) | (0.048) | |||||
Num.Obs. | 283 | 283 | 283 | 283 | 283 | 283 | 283 | 283 |
R2 | 0.399 | 0.467 | 0.517 | 0.517 | 0.399 | 0.467 | 0.517 | 0.517 |
R2 Adj. | 0.390 | 0.451 | 0.496 | 0.496 | 0.390 | 0.451 | 0.496 | 0.496 |
AIC | 1090.9 | 1064.7 | 1044.7 | 1044.7 | 469.1 | 442.9 | 422.9 | 422.9 |
BIC | 1112.8 | 1101.2 | 1095.7 | 1095.7 | 491.0 | 479.4 | 473.9 | 473.9 |
Log.Lik. | -539.455 | -522.373 | -508.341 | -508.341 | -228.548 | -211.466 | -197.434 | -197.434 |
RMSE | 1.63 | 1.53 | 1.46 | 1.46 | 0.54 | 0.51 | 0.49 | 0.49 |
We can similarly plot the coefficient values and their uncertainty, highlighting how the representation or scaling of the variables impacts the scale of the coefficient weights and therefore the surety of any subsequent claims.
models_sum = list(
  "model_sum_1st_factors" = mod_sum_1st,
  "model_sum_1st_2nd_factors" = mod_sum_12,
  "model_sum_score" = mod_sum,
  "model_sum_score_norm" = mod_sum_norm
)

models_mean = list(
  "model_mean_1st_factors" = mod_mean_1st,
  "model_mean_1st_2nd_factors" = mod_mean_12,
  "model_mean_score" = mod_mean,
  "model_mean_score_norm" = mod_mean_norm
)

g1 = modelplot(models_sum, coef_omit = 'Intercept') + geom_vline(xintercept = 0, linetype="dotted",
      color = "black") + ggtitle("Comparing Sum Model Parameter Estimates", "Across Covariates")

g2 = modelplot(models_mean, coef_omit = 'Intercept') + geom_vline(xintercept = 0, linetype="dotted",
      color = "black") + ggtitle("Comparing Mean Model Parameter Estimates", "Across Covariates")

plot <- ggarrange(g1, g2, ncol=2, nrow=1); plot
Significant Coefficients?
An alternative lens on these figures highlights the statistical significance of the coefficients. But again, these criteria are much abused. Significance at what level? Conditional on which representation? Which composite outcome score? It should matter if thematically similar relationships are highlighted as we vary the input variables. This is a kind of check on the hypothesised dependency structure - on what cluster of variables does our model place the most weight?
g1 = modelplot(mod_mean, coef_omit = 'Intercept') + aes(color = ifelse(p.value < 0.001, "Significant at 0.001", "Not significant at 0.001")) +
  scale_color_manual(values = c("grey", "blue")) + ggtitle("Significance of Coefficient Values", "Mean Model at Different Levels")

g2 = modelplot(mod_mean, coef_omit = 'Intercept') + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant at 0.05")) +
  scale_color_manual(values = c("grey", "black"))

plot <- ggarrange(g1, g2, ncol=2, nrow=1); plot
Aggregate Driver Scores
Perhaps we play with the feature representation and increase the proportion of significant indicators. Can we now tell a more definitive story about how parental support and social self-efficacy are determinants of self-reported life-satisfaction scores? Let’s focus here on the sum score representation and add interaction effects.
df$se_acad_mean <- rowMeans(df[c('se_acad_p1', 'se_acad_p2', 'se_acad_p3')])
df$se_social_mean <- rowMeans(df[c('se_social_p1', 'se_social_p2', 'se_social_p3')])
df$sup_friends_mean <- rowMeans(df[c('sup_friends_p1', 'sup_friends_p2', 'sup_friends_p3')])
df$sup_parents_mean <- rowMeans(df[c('sup_parents_p1', 'sup_parents_p2', 'sup_parents_p3')])

formula_parcel_sum = "ls_sum ~ se_acad_mean + se_social_mean +
                      sup_friends_mean + sup_parents_mean " #sup_parents_mean*se_social_mean"
formula_parcel_sum_inter = "ls_sum ~ se_acad_mean + se_social_mean +
                            sup_friends_mean + sup_parents_mean + sup_parents_mean*se_social_mean"

mod_sum_parcel = lm(formula_parcel_sum, df)
mod_sum_inter_parcel = lm(formula_parcel_sum_inter, df)

models_parcel = list("model_sum_score" = mod_sum_parcel,
                     "model_sum_inter_score" = mod_sum_inter_parcel)
modelsummary(models_parcel, stars=TRUE)
model_sum_score | model_sum_inter_score | |
---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
(Intercept) | 2.728** | -6.103+ |
(0.931) | (3.356) | |
se_acad_mean | 0.307+ | 0.370* |
(0.158) | (0.158) | |
se_social_mean | 1.269*** | 2.859*** |
(0.175) | (0.606) | |
sup_friends_mean | 0.124 | 0.183+ |
(0.097) | (0.099) | |
sup_parents_mean | 0.726*** | 2.242*** |
(0.099) | (0.562) | |
se_social_mean × sup_parents_mean | -0.292** | |
(0.107) | ||
Num.Obs. | 283 | 283 |
R2 | 0.489 | 0.503 |
R2 Adj. | 0.482 | 0.494 |
AIC | 1044.6 | 1039.0 |
BIC | 1066.4 | 1064.5 |
Log.Lik. | -516.288 | -512.513 |
RMSE | 1.50 | 1.48 |
What does definitive mean here? Is it as simple as more significant coefficients? Marginally better performance measures?
g1 = modelplot(mod_sum_parcel, coef_omit = 'Intercept') + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant at 0.05")) +
  scale_color_manual(values = c("grey", "blue"), guide=FALSE) + ggtitle("Significance of Coefficient Values", "At Different Levels for Sum and Mean Scores Life Satisfaction ")

g2 = modelplot(mod_sum_inter_parcel, coef_omit = 'Intercept') + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant at 0.05")) +
  scale_color_manual(values = c("grey", "blue"))

plot <- ggarrange(g1, g2, ncol=2, nrow=1); plot
Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
ggplot2 3.3.4.
ℹ Please use "none" instead.
ℹ The deprecated feature was likely used in the modelsummary package.
Please report the issue at
<https://github.com/vincentarelbundock/modelsummary/issues/>.
This kind of brinkmanship is brittle. Any one of these kinds of choices can be justified, but more often than not it results from a suspect exploratory process: steps down a “garden of forking paths” seeking some kind of story to justify an analysis or promote a conclusion. This post-hoc “seeking” is just bad science, undermining the significance claims that accrue to reliable procedures. It warps the nature of the testing procedure by corrupting the assumed consistency of repeatable trials. The guarantees of statistical significance attach to a conclusion just when the procedure is imagined replicable and repeated under identical conditions. By exploring the different representations and criteria of narrative adequacy we break those guarantees. The false precision of appeals to null hypothesis statistical testing hides the breadth of this discovery process while leveraging the rhetoric of theoretical guarantees that apply only in an alternative possible world.
Exploratory and Confirmatory Modes
One of the things that psychometrics has pioneered well is the distinction between exploratory and confirmatory models. This distinction, when made explicit, partially guards against the abuse of inferential integrity we see in more common work-flows. We may fit a model to our data, we may tweak or adjust it in any way we care to - this is exploratory work; the point is that the model design is then validated on unseen data. But additionally, models are often opaque - you may (as above) have improved some measure of model fit, or changed the parameter weighting accorded to an observed feature, but what does that mean? Exploration of model architectures, design choices and feature creation is just how we come to understand the meaning of a model specification. Even in the simple case of regression we’ve seen how adding an interaction term changes the interpretability of a model. How then are we to stand behind uncertainty estimates accorded to parameter weights when we barely intuit the implications of a model design?
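A minimal sketch of that confirmatory discipline, assuming we are willing to sacrifice half of our fairly small sample: pick a specification on an exploration split, then score it once against a holdout. The split proportion, seed and variable names below are arbitrary choices for illustration, not part of the original analysis.

# Illustrative only: explore on one half of the data, confirm on the other
set.seed(99)
explore_idx <- sample(seq_len(nrow(df)), size = floor(nrow(df) / 2))
df_explore <- df[explore_idx, ]
df_confirm <- df[-explore_idx, ]

mod_explore <- lm(formula_parcel_sum_inter, data = df_explore)   # freely tweaked on this half
holdout_rmse <- sqrt(mean((df_confirm$ls_sum - predict(mod_explore, newdata = df_confirm))^2))
holdout_rmse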
Marginal Effects
The answer is not to rely on intuition, but to push forward and test the tangible implications of a fitted model. A model is a hypothesis which should be subjected to stringent tests. We should subject the logical consequences of the design to appropriate scrutiny. We understand the implications and relative plausibility of any model in terms of the predicted outcomes more easily than we understand the subtle interaction effects expressed in parameter movements. As such we should adopt this view in our evaluation of a model fit too.
Consider how we do this using Vincent Arel-Bundock’s wonderful marginaleffects package, passing a grid of new values through to our fitted model. These implications are a test of model plausibility too.
pred <- predictions(mod_sum_parcel, newdata = datagrid(sup_parents_mean = 1:10, se_social_mean = 1:10))
pred1 <- predictions(mod_sum_inter_parcel, newdata = datagrid(sup_parents_mean = 1:10, se_social_mean = 1:10))

pred1 |> tail(10) |> kableExtra::kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
rowid | estimate | std.error | statistic | p.value | s.value | conf.low | conf.high | se_acad_mean | sup_friends_mean | sup_parents_mean | se_social_mean | ls_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
91 | 91 | 19.27615 | 2.2881204 | 8.424449 | 0 | 54.61499 | 14.79152 | 23.76079 | 5.237686 | 5.929329 | 10 | 1 | 17.58333 |
92 | 92 | 19.21698 | 1.7840261 | 10.771691 | 0 | 87.46457 | 15.72035 | 22.71361 | 5.237686 | 5.929329 | 10 | 2 | 17.58333 |
93 | 93 | 19.15780 | 1.2888397 | 14.864380 | 0 | 163.60757 | 16.63172 | 21.68388 | 5.237686 | 5.929329 | 10 | 3 | 17.58333 |
94 | 94 | 19.09863 | 0.8188846 | 23.322730 | 0 | 397.24885 | 17.49364 | 20.70361 | 5.237686 | 5.929329 | 10 | 4 | 17.58333 |
95 | 95 | 19.03945 | 0.4595015 | 41.435010 | 0 | Inf | 18.13884 | 19.94006 | 5.237686 | 5.929329 | 10 | 5 | 17.58333 |
96 | 96 | 18.98027 | 0.5318049 | 35.690291 | 0 | 924.33454 | 17.93795 | 20.02259 | 5.237686 | 5.929329 | 10 | 6 | 17.58333 |
97 | 97 | 18.92110 | 0.9410614 | 20.106123 | 0 | 296.26806 | 17.07665 | 20.76554 | 5.237686 | 5.929329 | 10 | 7 | 17.58333 |
98 | 98 | 18.86192 | 1.4210849 | 13.272902 | 0 | 131.14398 | 16.07665 | 21.64720 | 5.237686 | 5.929329 | 10 | 8 | 17.58333 |
99 | 99 | 18.80274 | 1.9194980 | 9.795657 | 0 | 72.84938 | 15.04060 | 22.56489 | 5.237686 | 5.929329 | 10 | 9 | 17.58333 |
100 | 100 | 18.74357 | 2.4249884 | 7.729343 | 0 | 46.39459 | 13.99068 | 23.49646 | 5.237686 | 5.929329 | 10 | 10 | 17.58333 |
We will see here how the impact of changes in se_social_mean is much reduced when sup_parents_mean is held fixed at high values (9, 10) when our model allows for an interaction effect.
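To put a number on that moderation (an illustrative check rather than part of the original workflow), we can ask marginaleffects for the slope of se_social_mean at a low and a high value of sup_parents_mean in the interaction model; the two grid values below are arbitrary.

# Slope of se_social_mean at low vs high parental support in the interaction model
slopes(mod_sum_inter_parcel,
       variables = "se_social_mean",
       newdata = datagrid(sup_parents_mean = c(2, 9)))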
pred |> head(10) |> kableExtra::kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
rowid | estimate | std.error | statistic | p.value | s.value | conf.low | conf.high | se_acad_mean | sup_friends_mean | sup_parents_mean | se_social_mean | ls_sum |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 7.065559 | 0.7860786 | 8.988362 | 0 | 61.78928 | 5.524873 | 8.606245 | 5.237686 | 5.929329 | 1 | 1 | 17.58333 |
2 | 8.334837 | 0.6546017 | 12.732685 | 0 | 120.95075 | 7.051842 | 9.617833 | 5.237686 | 5.929329 | 1 | 2 | 17.58333 |
3 | 9.604115 | 0.5480173 | 17.525206 | 0 | 226.01129 | 8.530021 | 10.678209 | 5.237686 | 5.929329 | 1 | 3 | 17.58333 |
4 | 10.873394 | 0.4830922 | 22.507905 | 0 | 370.25976 | 9.926550 | 11.820237 | 5.237686 | 5.929329 | 1 | 4 | 17.58333 |
5 | 12.142672 | 0.4771468 | 25.448505 | 0 | 472.16118 | 11.207481 | 13.077862 | 5.237686 | 5.929329 | 1 | 5 | 17.58333 |
6 | 13.411950 | 0.5321613 | 25.202790 | 0 | 463.16951 | 12.368933 | 14.454967 | 5.237686 | 5.929329 | 1 | 6 | 17.58333 |
7 | 14.681228 | 0.6324223 | 23.214278 | 0 | 393.60150 | 13.441703 | 15.920753 | 5.237686 | 5.929329 | 1 | 7 | 17.58333 |
8 | 15.950506 | 0.7602342 | 20.981043 | 0 | 322.26019 | 14.460474 | 17.440538 | 5.237686 | 5.929329 | 1 | 8 | 17.58333 |
9 | 17.219784 | 0.9039855 | 19.048739 | 0 | 266.32548 | 15.448005 | 18.991563 | 5.237686 | 5.929329 | 1 | 9 | 17.58333 |
10 | 18.489062 | 1.0571941 | 17.488806 | 0 | 225.08895 | 16.417000 | 20.561124 | 5.237686 | 5.929329 | 1 | 10 | 17.58333 |
This modifying effect is not as evident in the simpler model.
Regression Marginal Effects
We can see this more clearly with a plot of the marginal effects.
g = plot_predictions(mod_sum_parcel, condition = c("se_social_mean", "sup_parents_mean"), type = "response") + ggtitle("Counterfactual Shift of Outcome: se_social_mean", "Holding all else Fixed: Simple Model")

g1 = plot_predictions(mod_sum_inter_parcel, condition = c("se_social_mean", "sup_parents_mean"), type = "response") + ggtitle("Counterfactual Shift of Outcome: se_social_mean", "Holding all else Fixed Interaction Model")

plot <- ggarrange(g, g1, ncol=1, nrow=2); plot
Here we’ve pulled out some of the implications in terms of the outcome predictions and see how the interaction effect modifies the additive effect downwards at the upper end of the se_social_mean scale. That is to say the y-axis recounts the aggregate ls_sum outcome variable and the x-axis varies values for se_social_mean while holding the other covariates fixed at their mean level. The differences we see in the two plots are due to the extra effect the interaction term has on our prediction scale. The significance and practical impact of this interaction term should make us think harder about the manner in which social self-efficacy and parental support are related.
Models with Controls
The garden of forking paths presents itself within any set of covariates. How do we represent their effects? Which interactions are meaningful? How do we argue for one model design over another? The questionable paths are multiplied when we begin to consider additional covariates and group effects. But additional covariates help structure our expectations too. Yes, you can cut and chop your way through the garden to find some spurious correlation, but more plausibly you can bring in structurally important variables which helpfully moderate the outcomes based on our understanding of the data generating process.
Let’s assess the question by allowing the model to account for differences in region. You might imagine other demographic features which contribute to varying life-satisfaction scores and it would be a further model refinement if we could collect the associated measures to augment our analysis. For now we focus on region.
= "ls_sum ~ sup_parents_p1 + sup_parents_p2 + sup_parents_p3 + sup_friends_p1 + sup_friends_p2 + sup_friends_p3 + se_acad_p1 + se_acad_p2 + se_acad_p3 +
formula_no_grp_sum se_social_p1 + se_social_p2 + se_social_p3"
= "ls_sum ~ sup_parents_p1 + sup_parents_p2 + sup_parents_p3 + sup_friends_p1 + sup_friends_p2 + sup_friends_p3 + se_acad_p1 + se_acad_p2 + se_acad_p3 +
formula_grp_sum se_social_p1 + se_social_p2 + se_social_p3 + factor(region)"
<- lm(formula_no_grp_sum , data = df)
no_grp_sum_fit <- lm(formula_grp_sum, data = df)
grp_sum_fit
= modelplot(no_grp_sum_fit, re.form=NA) + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant at 0.001")) +
g1 scale_color_manual(values = c("grey", "blue"), guide='none') + ggtitle("Significance of Coefficient Values", "No Group Effects Model")
= modelplot(grp_sum_fit, re.form=NA) + aes(color = ifelse(p.value < 0.05, "Significant at 0.05", "Not significant 0.05")) +
g2 scale_color_manual(values = c("grey", "blue")) + ggtitle("Significance of Coefficient Values", "Group Effects Model")
<- ggarrange(g1,g2, ncol=2, nrow=1); plot
We can see here how the additional factor variable is reported to be significant conditional on a model specification. However the intercept is no longer well identified.
modelsummary(list("No Group Effects Fit"= no_grp_sum_fit,
"Group Effects Fit"= grp_sum_fit),
stars = TRUE) |>
style_tt(
i = 2:25,
j = 1:1,
background = "#17C2AD",
color = "white",
italic = TRUE)
No Group Effects Fit | Group Effects Fit | |
---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
(Intercept) | 2.094* | 1.979+ |
(0.954) | (1.041) | |
sup_parents_p1 | 0.072 | 0.066 |
(0.143) | (0.145) | |
sup_parents_p2 | 0.118 | 0.122 |
(0.144) | (0.145) | |
sup_parents_p3 | 0.526*** | 0.529*** |
(0.126) | (0.126) | |
sup_friends_p1 | -0.272+ | -0.272+ |
(0.150) | (0.150) | |
sup_friends_p2 | 0.331* | 0.332* |
(0.160) | (0.160) | |
sup_friends_p3 | 0.165 | 0.166 |
(0.130) | (0.131) | |
se_acad_p1 | -0.208 | -0.212 |
(0.192) | (0.193) | |
se_acad_p2 | 0.327 | 0.328 |
(0.202) | (0.202) | |
se_acad_p3 | 0.153 | 0.169 |
(0.174) | (0.184) | |
se_social_p1 | 0.355+ | 0.345+ |
(0.200) | (0.204) | |
se_social_p2 | 0.509* | 0.517* |
(0.219) | (0.221) | |
se_social_p3 | 0.443** | 0.446** |
(0.161) | (0.161) | |
factor(region)west | 0.056 | |
(0.202) | ||
Num.Obs. | 283 | 283 |
R2 | 0.517 | 0.517 |
R2 Adj. | 0.496 | 0.494 |
AIC | 1044.7 | 1046.6 |
BIC | 1095.7 | 1101.3 |
Log.Lik. | -508.341 | -508.300 |
RMSE | 1.46 | 1.46 |
Again, the cleanest way to interpret the implications of these specifications is to derive the conditional marginal effects and assess these for plausibility.
Group Marginal Effects
We see here the different levels expected for different regional responses for changes in each of these input variables.
g = plot_predictions(grp_sum_fit, condition = c("sup_parents_p3", "region"), type = "response") + ggtitle("Counterfactual Shift of Outcome: sup_parents_p3", "Holding all else Fixed")

g1 = plot_predictions(grp_sum_fit, condition = c("sup_friends_p1", "region"), type = "response") + ggtitle("Counterfactual Shift of Outcome: sup_friends_p1", "Holding all else Fixed")

g2 = plot_predictions(grp_sum_fit, condition = c("se_acad_p1", "region"),
                      type = "response") + ggtitle("Counterfactual Shift of Outcome: se_acad_p1", "Holding all else Fixed")

plot <- ggarrange(g, g1, g2, ncol=1, nrow=3); plot
In this case the differences are slight, but the point here is just to consider how aspects of the data generating process can be modified systematically by broadly demographic features. In an ideal circumstance a probability-sample of survey respondents should mitigate the need for additional control variables, insofar as a probability sample should ensure exchangeability.
Interlude: Model Design and Conditional Exchangeability
Modelling is nearly always about contrasts and questions of meaningful differences. What we seek to do when we build a model is to find a structure that enables “fair” or “justified” inferences about these contrasts. This is maybe most palpably brought home when we consider cases of causal inference. Here we want to define a meaningful causal estimand - some contrast between treatment group and control group where we can be confident that the two groups are suitably “representative” so that the observed differences represent solely the effect of the treatment. We are saying that conditional on our model design we consider the potential outcomes to be exchangeable. Another way of putting it is that we have controlled for all aspects of systematic difference between the control and treatment group, and this warrants the causal interpretation of the contrast between treatment and control.
But as we’ve seen above the cleanest way to interpret almost any regression model is to understand the model design in terms of the marginal effects on the outcome scale. These are just contrasts. All statistical models are, in some sense, focused on finding a structure that licenses an inference about some contrast of interest between the levels of some observed variable and the implied outcomes. In this way we want to include as much structure in our model that would induce the status of conditional exchangeability between the units of study across such group contrasts. This notion of conditional exchangeability is inherently an epistemic notion. We believe that our model is apt to induce the status of conditional exchangeability such that the people in our survey have no systematic difference which biases the measured differences. The implied differences in some marginal effect (while holding all else fixed) is a “fair” representation of the same counterfactual adjustment in the population or sub-population defined by the covariate profile \(X\). The model implied difference is a proxy for the result of possible future interventions and as such merits our attention in policy design.
One point of caution here is that we ought not to be too quick when interpreting model coefficients as causal effects. If we have not designed our model to isolate the causal or generalisable effect, while making appropriate adjustments for risks of confounding, we will end up conflating biased estimates with true effects. In particular the view here is that we ought to be deliberate in how we structure our models to license the plausibility of the exchangeability claim with respect to a relationship of interest. By De Finetti’s theorem, a distribution over an exchangeable sequence of variables can be expressed as a mixture of conditionally independent variables:
\[ p(x_{1}, \ldots, x_{m}) = \int \prod_{i=1}^{m} p(x_{i} \mid \theta) \, p(\theta) \, d\theta \]
So if we specify the conditional distribution correctly, we recover the conditions that warrant inference with a well designed model. The mixing distribution is just over the vector of parameters \(\boldsymbol{\theta}\) upon which we condition our model. Mislevy and Levy highlight that this important observation has implications for model development by quoting Judea Pearl:
[C]onditional independence is not a grace of nature for which we must wait passively, but rather a psychological necessity which we satisfy by organising our knowledge in a specific way. An important tool in such an organisation is the identification of intermediate variables that induce conditional independence among observables; if such variables are not in our vocabulary, we create them. In medical diagnosis, for instance, when some symptoms directly influence one another, the medical profession invents a name for that interaction (e.g. “syndrome”, “complication”, “pathological state”) and treats it as a new auxiliary variable that induces conditional independence.” - Pearl quoted in Bayesian Psychometric Modeling p61
The organisation of our data into meaningful categories or clusters is common in machine learning, but similarly in psychometrics we see the prevalence of factor models, which are more explicitly structural models where we posit some unifying theme to the cluster of measures. In each tradition we are seeking conditional independence by allowing the model to account for these dependencies in how we structure the model. Pearl famously goes on to work out conditions in the data-generating processes that lead to confounded causal inference stemming from inadequate conditionalisation strategies. Assessing different independence structures for our model represents a fruitful aspect of walking down through the garden of forking paths. There is then a difference between the exploratory seeking for some pre-determined story and an experimental seeking of clarity. With this in mind we now turn to types of modeling architecture that open up new expressive possibilities and offer a way to think about the relationships between observed data and latent factors that drive the observed outcomes. This greater expressive power offers different routes to inducing patterns of conditional exchangeability and different risks too.
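A small simulation (purely illustrative, with made-up loadings) makes Pearl’s point concrete: indicators that share a latent cause are correlated marginally, but become nearly uncorrelated once we condition on that cause.

# Two indicators driven by one latent factor: marginal vs conditional correlation
set.seed(42)
n <- 1000
eta <- rnorm(n)                        # the latent factor
x1 <- 0.8 * eta + rnorm(n, sd = 0.5)   # indicator 1
x2 <- 0.7 * eta + rnorm(n, sd = 0.5)   # indicator 2
cor(x1, x2)                                     # substantial marginal correlation
cor(resid(lm(x1 ~ eta)), resid(lm(x2 ~ eta)))   # roughly zero once eta is conditioned on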
Expressive Structure: Building the Model we Need
In some sense the modeling we’ve seen so far is simply inadequate to express our theory of the case. The hypothetical structure with which we started abstracted over the measures in our data. The theory talks of LIFE SATISFACTION and PARENTAL SUPPORT as grand holistic constructs, yet in our regression model(s) we have loosely coupled our measurements and the intended target via indirect but hopefully indicative measures. We can do better if we’re more explicit about how these manifest metrics ladder up to appropriate latent constructs. These latent constructs are theoretical posits we use to structure the nature of the inference and express the thematic unity between certain of our indicator variables. The metaphysically conservative theoretician may balk at such models, but it is practically akin to the usage of cluster variables in machine learning.
Confirmatory Factor Analysis
We will illustrate the details of confirmatory factor modelling using the lavaan
framework. The focus is mostly on the mechanics of how these models are estimated using maximum likelihood techniques before illustrating the salient differences of Bayesian estimation.
First recall that the idea of confirmatory factor analysis is that there are some latent constructs which determine our data generating process. In our survey we’ve already clustered our questions by themes so it makes sense to extend this idea to posit latent constructs mapped to each of these themes. In the jargon of structural equation models this is called the measurement model i.e. a model which aims to “simply” predict/retrodict the measured outcomes before we try anything more sophisticated.
<- "
model_measurement # Measurement model
SUP_Parents =~ sup_parents_p1 + sup_parents_p2 + sup_parents_p3
SUP_Friends =~ sup_friends_p1 + sup_friends_p2 + sup_friends_p3
SE_Academic =~ se_acad_p1 + se_acad_p2 + se_acad_p3
SE_Social =~ se_social_p1 + se_social_p2 + se_social_p3
LS =~ ls_p1 + ls_p2 + ls_p3
"
<- "
model_measurement1 # Measurement model
SUP_Parents =~ b1*sup_parents_p1 + b2*sup_parents_p2 + b3*sup_parents_p3
SUP_Friends =~ a1*sup_friends_p1 + a2*sup_friends_p2 + a3*sup_friends_p3
SE_Academic =~ c1*se_acad_p1 + c2*se_acad_p2 + c3*se_acad_p3
SE_Social =~ se_social_p1 + se_social_p2 + se_social_p3
LS =~ ls_p1 + ls_p2 + ls_p3
a1 == a2
a1 == a3
b1 == b2
b1 == b3
c1 == c2
c1 == c3
"
<- cfa(model_measurement, data = df)
fit_mod <- cfa(model_measurement1, data = df) fit_mod_1
In the above syntax we have specified two slightly different measurement models. In each case we allow that the questions of our survey are mapped to an appropriate latent factor e.g. LS =~ ls_p1 + ls_p2 + ls_p3. The “=~” syntax denotes a “measured by” relationship in which the goal is to estimate how each of the observed measurements loads on the latent factor. In the first model we have allowed each of the factor loadings to be estimated freely, but in the second we have forced equal weights on the SUP_Friends construct. A benefit of this framework is that we do not have to resort to crude aggregations like sum-scores or mean-scores over the outcome variables; we can allow that they vary freely and let the model estimate the multivariate relationships between the observed variables and these latent constructs.
If we plot the estimated parameters as before we’ll see some additional parameters reported.
= list("full_measurement_model" = fit_mod,
cfa_models "measurement_model_reduced" = fit_mod_1)
modelplot(cfa_models)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_segment()`).
Here there are two distinct types of parameters: (i) the factor loadings accorded (=~) to the individual observed metrics and (ii) the covariances (~~) between the latent constructs. We can further report the extent of the model fit summaries.
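If the plot is hard to parse, the same split can be pulled straight from lavaan’s parameter table by filtering on the operator column, a small illustrative helper using the dplyr verbs already loaded.

# Loadings (=~) and latent covariances (~~) from the fitted measurement model
parameterEstimates(fit_mod, standardized = TRUE) |> filter(op == "=~") |> head()
parameterEstimates(fit_mod, standardized = TRUE) |> filter(op == "~~") |> head()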
summary(fit_mod, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-18 ended normally after 49 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 40
Number of observations 283
Model Test User Model:
Test statistic 223.992
Degrees of freedom 80
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 2696.489
Degrees of freedom 105
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.944
Tucker-Lewis Index (TLI) 0.927
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -4285.972
Loglikelihood unrestricted model (H1) -4173.976
Akaike (AIC) 8651.944
Bayesian (BIC) 8797.761
Sample-size adjusted Bayesian (SABIC) 8670.921
Root Mean Square Error of Approximation:
RMSEA 0.080
90 Percent confidence interval - lower 0.067
90 Percent confidence interval - upper 0.092
P-value H_0: RMSEA <= 0.050 0.000
P-value H_0: RMSEA >= 0.080 0.500
Standardized Root Mean Square Residual:
SRMR 0.056
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
SUP_Parents =~
sup_parents_p1 1.000 0.935 0.873
sup_parents_p2 1.036 0.056 18.613 0.000 0.969 0.887
sup_parents_p3 0.996 0.059 16.754 0.000 0.932 0.816
SUP_Friends =~
sup_friends_p1 1.000 1.021 0.906
sup_friends_p2 0.792 0.043 18.463 0.000 0.809 0.857
sup_friends_p3 0.891 0.050 17.741 0.000 0.910 0.831
SE_Academic =~
se_acad_p1 1.000 0.695 0.878
se_acad_p2 0.809 0.050 16.290 0.000 0.562 0.820
se_acad_p3 0.955 0.058 16.500 0.000 0.664 0.829
SE_Social =~
se_social_p1 1.000 0.638 0.843
se_social_p2 0.967 0.056 17.248 0.000 0.617 0.885
se_social_p3 0.928 0.067 13.880 0.000 0.592 0.741
LS =~
ls_p1 1.000 0.667 0.718
ls_p2 0.778 0.074 10.498 0.000 0.519 0.712
ls_p3 0.968 0.090 10.730 0.000 0.645 0.732
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
SUP_Parents ~~
SUP_Friends 0.132 0.064 2.073 0.038 0.138 0.138
SE_Academic 0.218 0.046 4.727 0.000 0.336 0.336
SE_Social 0.282 0.045 6.224 0.000 0.472 0.472
LS 0.405 0.057 7.132 0.000 0.650 0.650
SUP_Friends ~~
SE_Academic 0.071 0.047 1.493 0.136 0.100 0.100
SE_Social 0.196 0.046 4.281 0.000 0.301 0.301
LS 0.175 0.051 3.445 0.001 0.257 0.257
SE_Academic ~~
SE_Social 0.271 0.036 7.493 0.000 0.611 0.611
LS 0.238 0.039 6.065 0.000 0.514 0.514
SE_Social ~~
LS 0.321 0.042 7.659 0.000 0.755 0.755
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.sup_parents_p1 0.273 0.037 7.358 0.000 0.273 0.238
.sup_parents_p2 0.255 0.038 6.738 0.000 0.255 0.213
.sup_parents_p3 0.437 0.048 9.201 0.000 0.437 0.335
.sup_friends_p1 0.227 0.040 5.656 0.000 0.227 0.179
.sup_friends_p2 0.238 0.030 7.936 0.000 0.238 0.266
.sup_friends_p3 0.371 0.042 8.809 0.000 0.371 0.310
.se_acad_p1 0.144 0.022 6.593 0.000 0.144 0.229
.se_acad_p2 0.153 0.018 8.621 0.000 0.153 0.327
.se_acad_p3 0.200 0.024 8.372 0.000 0.200 0.313
.se_social_p1 0.166 0.020 8.134 0.000 0.166 0.290
.se_social_p2 0.106 0.016 6.542 0.000 0.106 0.217
.se_social_p3 0.288 0.028 10.132 0.000 0.288 0.451
.ls_p1 0.417 0.045 9.233 0.000 0.417 0.484
.ls_p2 0.261 0.028 9.321 0.000 0.261 0.492
.ls_p3 0.362 0.040 9.005 0.000 0.362 0.465
SUP_Parents 0.875 0.098 8.910 0.000 1.000 1.000
SUP_Friends 1.042 0.111 9.407 0.000 1.000 1.000
SE_Academic 0.483 0.054 8.880 0.000 1.000 1.000
SE_Social 0.407 0.048 8.403 0.000 1.000 1.000
LS 0.444 0.069 6.394 0.000 1.000 1.000
Note how in addition to the individual parameter estimates the summaries highlight various measures of global model fit. These model fit statistics are important for evaluating alternative ways of parameterising our models. The number of parameters is a real concern in the maximum likelihood approaches to estimating these models. Too many parameters and we can easily overfit to the particular sample data. This stems in part from the limitations of the optimization goal in the traditional CFA framework - we are intent on optimizing model parameters to recover a compelling estimate based on the observed covariance matrix. Once we have more parameters than there are points in the covariance matrix the model is free to overfit considerably. This can then be checked as a measure of local model fit and may highlight infelicities or suspicious convergence between the true data generating process and the learned model.
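The arithmetic behind that concern is simple. As a back-of-the-envelope check using the counts reported in the summary above: with p observed indicators, the sample covariance matrix supplies p(p + 1)/2 unique pieces of information, and the model cannot usefully estimate more free parameters than that.

# Unique covariance entries versus free parameters for our 15 indicators
p <- length(drivers)    # 15 observed indicators
p * (p + 1) / 2         # 120 unique variances and covariances
p * (p + 1) / 2 - 40    # the 40 free parameters leave 80 degrees of freedom, as reported above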
g1 = plot_heatmap(cov(df[, drivers]))

g2 = plot_heatmap(data.frame(fitted(fit_mod)$cov)[drivers, drivers], title="Model Implied Covariances", "Fitted Values")

resids = cov(df[, drivers]) - data.frame(fitted(fit_mod)$cov)[drivers, drivers]

g3 = plot_heatmap(resids, title="Residuals", "Fitted Values versus Observed Sample Covariance")

plot <- ggarrange(g1, g2, g3, ncol=1, nrow=3); plot
Summary Global Fit Measures
We can also compare models based on their global measures of model fit giving some indication of whether parameter specifications improve or reduce fidelity with the true data generating process.
summary_df = cbind(fitMeasures(fit_mod, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea", "srmr")),
                   fitMeasures(fit_mod_1, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea", "srmr")))
colnames(summary_df) = c('Full Measurement Model', 'Reduced Measurement Model')

summary_df |> kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
Full Measurement Model | Reduced Measurement Model | |
---|---|---|
chisq | 223.9922306 | 256.0287010 |
baseline.chisq | 2696.4887420 | 2696.4887420 |
cfi | 0.9444365 | 0.9343896 |
aic | 8651.9435210 | 8671.9799914 |
bic | 8797.7613969 | 8795.9251859 |
rmsea | 0.0797501 | 0.0835831 |
srmr | 0.0558656 | 0.0625710 |
There is a wealth of metrics associated with CFA model fit and it can be hard to see the forest for the trees.
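If you want the full menu rather than a curated subset, lavaan will happily report every index it computes; a quick, purely illustrative way to scan them:

# The full set of fit measures for the measurement model
round(fitMeasures(fit_mod), 3)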
Visualizing the Relationships
One of the better ways to visualize these models is to use the semPlot package. Here we can plot all the parameter estimates in one graph. Following convention, the rectangular boxes represent observed measures. Oval or circular objects represent the latent constructs. The self-directed arrow on each node is the variance of that measure. The two-way arrows between nodes represent the covariance between those two nodes. The single headed arrows from the latent construct to the indicator variables denote the factor loading of the variable on the construct. For instance SUP_F -> sp_f_3 is set at 0.89.
semPlot::semPaths(fit_mod, whatLabels = 'est', intercepts = FALSE)
For instance, in this plot you can see that for each latent construct one of the indicator variables has its factor loading set to 1. This is a mathematical requirement, which we’ll see below is used to ensure identifiability of the parameters, akin to setting a reference category in categorical regression. In the plot you can “read-off” the covariances between our constructs e.g. the covariance between LS and SUP_P is 0.36, the largest value amongst the set of covariances.
Comparing Models
We can use a variety of chi-squared tests to evaluate the goodness of fit for our models. If we pass in each model individually we perform a test comparing our model to the saturated model. The Chi-Squared test compares the model-implied variance-covariance matrix (expected) to the variance-covariance matrix computed from the actual data (observed). The null hypothesis for the Chi-Squared Goodness-of-Fit test is that the model fits the data perfectly, meaning that there is no significant difference between the observed and model-implied variance-covariance matrices. The goal is to see if the differences between these matrices are large enough that we can reject the null.
lavTestLRT(fit_mod)
Chi-Squared Test Statistic (unscaled)
Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)
Saturated 0 0.00
Model 80 8651.9 8797.8 223.99 223.99 80 0.000000000000001443 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Passing in just the one model, we can reject the null hypothesis that the saturated model’s (perfect fit) variance-covariance matrix and the candidate model’s variance-covariance matrix are drawn from the same distribution. Comparing between our two model fits we also reject that these two models are drawn from the same distribution.
lavTestLRT(fit_mod, fit_mod_1)
Chi-Squared Difference Test
Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit_mod 80 8651.9 8797.8 223.99
fit_mod_1 86 8672.0 8795.9 256.03 32.036 0.12383 6 0.00001606 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test also reports a number of other fit indices and the degrees of freedom available to each model. These are important because the Chi-Squared test is overly sensitive in large-sample data and the model adequacy is a multi-dimensional question.
Structural Equation Models
So far so good. We have a confirmatory factor measurement model. We’ve structured it so that we can make inferences about the correlations and covariances between 5 latent constructs of independent interest. We’ve calibrated the model fit statistics by ensuring the model can reasonably recover the observed variance-covariance structure. But what about our dependency relations between constructs? We can evaluate these too! Adding regressions to our model allows us to express these relationships and then recover summary statistics of the same.
<- "
model # Measurement model
SUP_Parents =~ sup_parents_p1 + sup_parents_p2 + sup_parents_p3
SUP_Friends =~ sup_friends_p1 + sup_friends_p2 + sup_friends_p3
SE_Academic =~ se_acad_p1 + se_acad_p2 + se_acad_p3
SE_Social =~ se_social_p1 + se_social_p2 + se_social_p3
LS =~ ls_p1 + ls_p2 + ls_p3
# Structural model
# Regressions
SE_Academic ~ SUP_Parents + SUP_Friends
SE_Social ~ SUP_Parents + SUP_Friends
LS ~ SE_Academic + SE_Social + SUP_Parents + SUP_Friends
# Residual covariances
SE_Academic ~~ SE_Social
"
<- sem(model, data = df) fit_mod_sem
This model structure adds a set of directional arrows between the latent constructs.
semPlot::semPaths(fit_mod_sem, whatLabels = 'est', intercepts = FALSE)
The relative uncertainty in each estimate:
modelplot(fit_mod_sem)
Compare this structure against the previous and simpler measurement model and we observe a puzzling phenomenon. The models report identical measures of fit! Despite the extra structure in the SEM model the optimization routine has returned identical parameter specifications.
summary_df = cbind(fitMeasures(fit_mod, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea", "srmr")),
                   fitMeasures(fit_mod_sem, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea", "srmr")),
                   fitMeasures(fit_mod_1, c("chisq", "baseline.chisq", "cfi", "aic", "bic", "rmsea", "srmr")))
colnames(summary_df) = c('Measurement Model', 'SEM Model', 'Reduced Measurement Model')

summary_df |> kable() |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
Measurement Model | SEM Model | Reduced Measurement Model | |
---|---|---|---|
chisq | 223.9922306 | 223.9922306 | 256.0287010 |
baseline.chisq | 2696.4887420 | 2696.4887420 | 2696.4887420 |
cfi | 0.9444365 | 0.9444365 | 0.9343896 |
aic | 8651.9435210 | 8651.9435210 | 8671.9799914 |
bic | 8797.7613969 | 8797.7613969 | 8795.9251859 |
rmsea | 0.0797501 | 0.0797501 | 0.0835831 |
srmr | 0.0558656 | 0.0558655 | 0.0625710 |
The issue here is that the models have the same degrees of freedom which suggests in some sense we have already saturated our model fit and are unable to evaluate further parameter estimates.
lavTestLRT(fit_mod_sem, fit_mod)
Warning: lavaan->lavTestLRT():
some models have the same degrees of freedom
Chi-Squared Difference Test
Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit_mod_sem 80 8651.9 8797.8 223.99
fit_mod 80 8651.9 8797.8 223.99 0.0000000043218 0 0
lavTestLRT(fit_mod_sem, fit_mod_1)
Chi-Squared Difference Test
Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit_mod_sem 80 8651.9 8797.8 223.99
fit_mod_1 86 8672.0 8795.9 256.03 32.036 0.12383 6 0.00001606 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see this similarly in the plotted residuals which are identical across the models despite meaningful structural differences.
heat_df = data.frame(resid(fit_mod)$cov)
heat_df = heat_df |> as.matrix() |> melt()
colnames(heat_df) <- c("x", "y", "value")

g1 = heat_df |> mutate(value = round(value, 2)) |> ggplot(aes(x = x, y = y, fill = value)) +
  geom_tile() + geom_text(aes(label = value), color = "black", size = 4) +
  scale_fill_gradient2(
    high = 'dodgerblue4',
    mid = 'white',
    low = 'firebrick2'
  ) + theme(axis.text.x = element_text(angle=45)) + ggtitle("Residuals of the Sample Covariances and Model Implied Covariances", "A Visual Check of Structural-Model fit")

heat_df = data.frame(resid(fit_mod_sem)$cov)
heat_df = heat_df |> as.matrix() |> melt()
colnames(heat_df) <- c("x", "y", "value")

g2 = heat_df |> mutate(value = round(value, 2)) |> ggplot(aes(x = x, y = y, fill = value)) +
  geom_tile() + geom_text(aes(label = value), color = "black", size = 4) +
  scale_fill_gradient2(
    high = 'dodgerblue4',
    mid = 'white',
    low = 'firebrick2'
  ) + theme(axis.text.x = element_text(angle=45)) + ggtitle("Residuals of the Sample Covariances and Model Implied Covariances", "A Visual Check of Measurement-Model fit")

plot <- ggarrange(g1, g2, ncol=1, nrow=2); plot
This is a genuine limitation in the expressive power of SEM models when they are fit with maximum likelihood by optimizing for fidelity to the sample variance-covariance matrix. The model structures we can identify, and therefore evaluate, are constrained by that optimisation goal and by the degrees of freedom available to the optimiser. Next we'll show how the Bayesian approach to estimating these models frees us from this limitation.
Confirmatory Factor Models with PyMC
So far we've seen CFA and SEM models specified using a kind of formula syntax. This is useful, but here we switch to the distributional perspective, which is more directly relevant to the Bayesian phrasing of the CFA model. We'll unpack the components of the distributional view on a simpler data set first, for illustrative purposes, and then switch back to the Life Satisfaction question.
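Before we see the code, it helps to fix the notation of that distributional view (spelled out here for orientation; the symbols correspond directly to the parameters named in the PyMC models below). For each respondent i we posit

\[ x_{i} = \tau + \Lambda \xi_{i} + \epsilon_{i}, \qquad \xi_{i} \sim \text{MvNormal}(\kappa, \Phi), \qquad \epsilon_{i} \sim \text{Normal}(0, \Psi) \]

where x is the vector of observed indicator scores, tau the indicator intercepts, Lambda the matrix of factor loadings, xi the latent factor scores with covariance Phi, and Psi the indicator-specific residual scales. These map onto the `tau`, `lambdas`, `ksi`, `chol_cov` and `Psi` variables in the code that follows.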
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pytensor import tensor as pt
import arviz as az
import networkx as nx
np.random.seed(150)

df_p = pd.read_csv('IIS.dat', sep='\s+')
df_p.head()
PI AD IGC FI FC
0 4.00 3.38 4.67 2.6 3.2
1 2.57 3.00 3.50 2.4 2.8
2 2.29 3.29 4.83 2.0 3.4
3 2.43 3.63 4.33 3.6 3.8
4 3.00 4.00 4.83 3.4 3.8
We have here a data-set drawn from Mislevy and Levy which records measures of students' collegiate experiences along multiple dimensions. The data represents the averages of the `PI` (peer interaction), `AD` (academic development), `IGC` (goal commitment), `FI` (faculty interaction) and `FC` (faculty concern) items for 500 students. We will fit a 2-factor Bayesian CFA model to this data set - one factor for the faculty items and the other for the student self-report items. First consider the code and then we'll step through it in more detail.
coords = {'obs': list(range(len(df_p))),
          'indicators': ['PI', 'AD', 'IGC', 'FI', 'FC'],
          'indicators_1': ['PI', 'AD', 'IGC'],
          'indicators_2': ['FI', 'FC'],
          'latent': ['Student', 'Faculty']
          }

obs_idx = list(range(len(df_p)))
with pm.Model(coords=coords) as model:

    Psi = pm.InverseGamma('Psi', 5, 10, dims='indicators')
    lambdas_ = pm.Normal('lambdas_1', 1, 10, dims=('indicators_1'))
    # Force a fixed scale on the factor loadings for factor 1
    lambdas_1 = pm.Deterministic('lambdas1', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_1'))
    lambdas_ = pm.Normal('lambdas_2', 1, 10, dims=('indicators_2'))
    # Force a fixed scale on the factor loadings for factor 2
    lambdas_2 = pm.Deterministic('lambdas2', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_2'))
    tau = pm.Normal('tau', 3, 10, dims='indicators')
    # Specify covariance structure between latent factors
    kappa = 0
    sd_dist = pm.Exponential.dist(1.0, shape=2)
    chol, _, _ = pm.LKJCholeskyCov('chol_cov', n=2, eta=2,
                                   sd_dist=sd_dist, compute_corr=True)
    ksi = pm.MvNormal('ksi', kappa, chol=chol, dims=('obs', 'latent'))

    # Construct Observation matrix
    m1 = tau[0] + ksi[obs_idx, 0]*lambdas_1[0]
    m2 = tau[1] + ksi[obs_idx, 0]*lambdas_1[1]
    m3 = tau[2] + ksi[obs_idx, 0]*lambdas_1[2]
    m4 = tau[3] + ksi[obs_idx, 1]*lambdas_2[0]
    m5 = tau[4] + ksi[obs_idx, 1]*lambdas_2[1]

    mu = pm.Deterministic('mu', pm.math.stack([m1, m2, m3, m4, m5]).T)
    _ = pm.Normal('likelihood', mu, Psi, observed=df_p.values)

    idata = pm.sample(nuts_sampler='numpyro', target_accept=.95,
                      idata_kwargs={"log_likelihood": True})
    idata.extend(pm.sample_posterior_predictive(idata))

summary_df = az.summary(idata, var_names=['lambdas1', 'lambdas2', 'tau', 'Psi', 'ksi', 'chol_cov_corr'], coords={'obs': [0, 7]})
The model specification here has a Normal likelihood term which takes as observed data the matrix of student responses. However, we construct the input for that likelihood so as to posit two latent factor variables, `ksi[obs_idx, 0]` and `ksi[obs_idx, 1]` respectively. These factors are scaled by the factor loading terms (e.g. the `lambdas_1` specification) to create a regression equation `m1` which predicts the indicator variable `PI`. The latent `ksi` terms are drawn from a multivariate normal prior, the loadings from weakly informative Normal priors, and both are fed forward into our pseudo outcome matrix `mu`. In this way the latent factors are tied to the observed data matrix. The process of Bayesian updating then calibrates their values against the observed data and revises our priors on the lambda and ksi parameterisations, reflecting what the model learns about the interactions between the latent constructs. Crucially, we impose a covariance structure on the `ksi` terms to reflect the correlations and covariances between these constructs of interest, and we give the factor loadings a definite scale by fixing the first indicator loading to 1 for each factor.
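That identification step is easy to miss inside the larger model, so here is a minimal sketch of just the fixed-scale trick in isolation (a standalone illustration, not part of the original model code):

import pymc as pm
import pytensor.tensor as pt

with pm.Model(coords={'indicators_1': ['PI', 'AD', 'IGC']}) as demo:
    # Free loadings drawn from a weakly informative prior.
    lambdas_ = pm.Normal('lambdas_1', 1, 10, dims='indicators_1')
    # Overwrite the first entry with the constant 1: the factor inherits the scale
    # of its first indicator and only the remaining loadings are actually learned.
    lambdas_1 = pm.Deterministic('lambdas1',
                                 pt.set_subtensor(lambdas_[0], 1),
                                 dims='indicators_1')

The prior on the first element still exists in the model, but nothing downstream uses it; only the deterministic `lambdas1` enters the likelihood.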
In a Bayesian approach to the estimation of these models we are not limited by the degrees of freedom available to an optimisation routine. Instead we aim to estimate the posterior distribution using MCMC, drawing from well-specified priors on each of the quantities listed above. In a picture the structure is like this:
We are conditioning on the observed data and seek to estimate sound parameterisations of the `ksi`, `lambda` and `tau` terms that go into building our `mu` structure. The distribution of the observables conditional on the latent variables takes the form

\[ p(x_{1}, \ldots, x_{n} \mid \xi, \Psi, \tau, \Lambda) \sim \text{Normal}(\tau + \Lambda \cdot \xi, \Psi) \]

and the marginal distribution of the observables is derived by integrating over the distribution of the latent variables. For more details on the precise derivation of this likelihood term we recommend the discussion in the book. It is this same likelihood specification that is used in maximum likelihood fits of these models, and the Normal assumption has proved both viable and computationally tractable. The model yields summary statistics for each of these parameters.
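As a point of orientation before reading the table of summaries: carrying out that integration over the latent factors gives the familiar implied moments of the observed indicators (a standard result, stated here rather than derived):

\[ E[x] = \tau, \qquad \text{Cov}(x) = \Lambda \Phi \Lambda^{T} + \text{diag}(\Psi^{2}) \]

where Phi is the covariance of the latent `ksi` terms and Psi is the indicator-specific residual scale from the code above. This is exactly the matrix that the earlier maximum likelihood fits matched against the sample variance-covariance matrix.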
py$summary_df |> kable() |> kable_classic(full_width = F, html_font = "Cambria")
|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
lambdas1[PI] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
lambdas1[AD] | 0.904 | 0.062 | 0.788 | 1.022 | 0.004 | 0.003 | 279 | 594 | 1.02 |
lambdas1[IGC] | 0.538 | 0.046 | 0.457 | 0.630 | 0.002 | 0.002 | 405 | 948 | 1.01 |
lambdas2[FI] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
lambdas2[FC] | 0.981 | 0.057 | 0.878 | 1.087 | 0.002 | 0.002 | 606 | 1087 | 1.00 |
tau[PI] | 3.333 | 0.036 | 3.269 | 3.405 | 0.001 | 0.001 | 712 | 1505 | 1.01 |
tau[AD] | 3.897 | 0.026 | 3.848 | 3.946 | 0.001 | 0.001 | 471 | 1120 | 1.01 |
tau[IGC] | 4.596 | 0.020 | 4.558 | 4.636 | 0.001 | 0.001 | 708 | 1302 | 1.01 |
tau[FI] | 3.033 | 0.039 | 2.951 | 3.098 | 0.002 | 0.001 | 607 | 1286 | 1.01 |
tau[FC] | 3.712 | 0.034 | 3.650 | 3.780 | 0.002 | 0.001 | 488 | 1058 | 1.01 |
Psi[PI] | 0.611 | 0.024 | 0.564 | 0.652 | 0.001 | 0.000 | 1558 | 2805 | 1.00 |
Psi[AD] | 0.316 | 0.020 | 0.279 | 0.352 | 0.001 | 0.001 | 654 | 1670 | 1.01 |
Psi[IGC] | 0.355 | 0.014 | 0.329 | 0.379 | 0.000 | 0.000 | 2974 | 2704 | 1.00 |
Psi[FI] | 0.569 | 0.025 | 0.519 | 0.615 | 0.001 | 0.000 | 1447 | 2467 | 1.00 |
Psi[FC] | 0.420 | 0.026 | 0.372 | 0.468 | 0.001 | 0.001 | 720 | 1261 | 1.00 |
ksi[0, Student] | -0.230 | 0.223 | -0.666 | 0.180 | 0.004 | 0.003 | 3840 | 3014 | 1.00 |
ksi[0, Faculty] | -0.370 | 0.275 | -0.890 | 0.156 | 0.004 | 0.003 | 4335 | 3051 | 1.00 |
ksi[7, Student] | 0.894 | 0.228 | 0.454 | 1.304 | 0.004 | 0.003 | 2847 | 2782 | 1.00 |
ksi[7, Faculty] | 0.876 | 0.275 | 0.387 | 1.433 | 0.004 | 0.003 | 3791 | 2812 | 1.00 |
chol_cov_corr[0, 0] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
chol_cov_corr[0, 1] | 0.849 | 0.028 | 0.794 | 0.899 | 0.001 | 0.001 | 415 | 615 | 1.01 |
chol_cov_corr[1, 0] | 0.849 | 0.028 | 0.794 | 0.899 | 0.001 | 0.001 | 415 | 615 | 1.01 |
chol_cov_corr[1, 1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 3785 | 3584 | 1.00 |
We can additionally check the diagnostic trace plots to ensure that the model has sampled well even across the relatively complex structure.
az.plot_trace(idata, var_names=['lambdas1', 'lambdas2', 'tau', 'Psi', 'ksi']);
/Users/nathanielforde/mambaforge/envs/pymc_causal/lib/python3.11/site-packages/arviz/stats/density_utils.py:487: UserWarning: Your data appears to have a single value or no finite values
warnings.warn("Your data appears to have a single value or no finite values")
/Users/nathanielforde/mambaforge/envs/pymc_causal/lib/python3.11/site-packages/arviz/stats/density_utils.py:487: UserWarning: Your data appears to have a single value or no finite values
warnings.warn("Your data appears to have a single value or no finite values")
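The warnings themselves are expected here: they are almost certainly triggered by the loadings pinned to 1, which are constant across draws and so trip ArviZ's density estimator without signalling anything wrong with the sampler. If a plot-free check is preferred, the same convergence information can be summarised numerically (an optional extra check, not part of the original workflow):

import arviz as az

# Worst-case split R-hat per parameter group; values close to 1 indicate the chains agree.
# The loadings fixed at 1 will show NaN, exactly as in the summary table above.
print(az.rhat(idata, var_names=['lambdas1', 'lambdas2', 'tau', 'Psi']).max())

# Smallest bulk effective sample size for the same parameters.
print(az.ess(idata, var_names=['lambdas1', 'lambdas2', 'tau', 'Psi']).min())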
These diagnostic plots and parameter estimates closely match the figures reported by Levy and Mislevy, so we won't dwell much further on this model, except to say that the basic structure here is expanded upon as we apply more factor structures and SEM-like structures to our original Life-Satisfaction data-set.
Factoring for Life Satisfaction
Now we want to fit the life-satisfaction measurement model using `PyMC`. This has fundamentally the same structure as the simple CFA model we've seen above. However, we now have 5 factors, each measured by three indicators, and so three factor loadings per factor to estimate. We will fit this model and then show how to pull out some of the model-fit statistics that are relevant to assessing the model.
df = pd.read_csv('sem_data.csv')
drivers = ['se_acad_p1', 'se_acad_p2',
           'se_acad_p3', 'se_social_p1', 'se_social_p2', 'se_social_p3',
           'sup_friends_p1', 'sup_friends_p2', 'sup_friends_p3', 'sup_parents_p1',
           'sup_parents_p2', 'sup_parents_p3', 'ls_p1', 'ls_p2', 'ls_p3']

coords = {'obs': list(range(len(df))),
          'indicators': drivers,
          'indicators_1': ['se_acad_p1', 'se_acad_p2', 'se_acad_p3'],
          'indicators_2': ['se_social_p1', 'se_social_p2', 'se_social_p3'],
          'indicators_3': ['sup_friends_p1', 'sup_friends_p2', 'sup_friends_p3'],
          'indicators_4': ['sup_parents_p1', 'sup_parents_p2', 'sup_parents_p3'],
          'indicators_5': ['ls_p1', 'ls_p2', 'ls_p3'],
          'latent': ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS'],
          'latent1': ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']
          }

obs_idx = list(range(len(df)))
with pm.Model(coords=coords) as model:

    Psi = pm.InverseGamma('Psi', 5, 10, dims='indicators')
    lambdas_ = pm.Normal('lambdas_1', 1, 10, dims=('indicators_1'))
    lambdas_1 = pm.Deterministic('lambdas1', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_1'))
    lambdas_ = pm.Normal('lambdas_2', 1, 10, dims=('indicators_2'))
    lambdas_2 = pm.Deterministic('lambdas2', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_2'))
    lambdas_ = pm.Normal('lambdas_3', 1, 10, dims=('indicators_3'))
    lambdas_3 = pm.Deterministic('lambdas3', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_3'))
    lambdas_ = pm.Normal('lambdas_4', 1, 10, dims=('indicators_4'))
    lambdas_4 = pm.Deterministic('lambdas4', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_4'))
    lambdas_ = pm.Normal('lambdas_5', 1, 10, dims=('indicators_5'))
    lambdas_5 = pm.Deterministic('lambdas5', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_5'))
    tau = pm.Normal('tau', 3, 10, dims='indicators')
    kappa = 0
    sd_dist = pm.Exponential.dist(1.0, shape=5)
    chol, _, _ = pm.LKJCholeskyCov('chol_cov', n=5, eta=2,
                                   sd_dist=sd_dist, compute_corr=True)
    cov = pm.Deterministic("cov", chol.dot(chol.T), dims=('latent', 'latent1'))
    ksi = pm.MvNormal('ksi', kappa, chol=chol, dims=('obs', 'latent'))

    m0 = tau[0] + ksi[obs_idx, 0]*lambdas_1[0]
    m1 = tau[1] + ksi[obs_idx, 0]*lambdas_1[1]
    m2 = tau[2] + ksi[obs_idx, 0]*lambdas_1[2]
    m3 = tau[3] + ksi[obs_idx, 1]*lambdas_2[0]
    m4 = tau[4] + ksi[obs_idx, 1]*lambdas_2[1]
    m5 = tau[5] + ksi[obs_idx, 1]*lambdas_2[2]
    m6 = tau[6] + ksi[obs_idx, 2]*lambdas_3[0]
    m7 = tau[7] + ksi[obs_idx, 2]*lambdas_3[1]
    m8 = tau[8] + ksi[obs_idx, 2]*lambdas_3[2]
    m9 = tau[9] + ksi[obs_idx, 3]*lambdas_4[0]
    m10 = tau[10] + ksi[obs_idx, 3]*lambdas_4[1]
    m11 = tau[11] + ksi[obs_idx, 3]*lambdas_4[2]
    m12 = tau[12] + ksi[obs_idx, 4]*lambdas_5[0]
    m13 = tau[13] + ksi[obs_idx, 4]*lambdas_5[1]
    m14 = tau[14] + ksi[obs_idx, 4]*lambdas_5[2]

    mu = pm.Deterministic('mu', pm.math.stack([m0, m1, m2, m3, m4, m5, m6, m7,
                                               m8, m9, m10, m11, m12, m13, m14]).T)
    _ = pm.Normal('likelihood', mu, Psi, observed=df[drivers].values)

    idata = pm.sample(nuts_sampler='numpyro', target_accept=.95, tune=1000,
                      idata_kwargs={"log_likelihood": True}, random_seed=100)
    idata.extend(pm.sample_posterior_predictive(idata))

summary_df1 = az.summary(idata, var_names=['lambdas1', 'lambdas2', 'lambdas3', 'lambdas4', 'lambdas5', 'tau', 'Psi'])

cov_df = pd.DataFrame(az.extract(idata['posterior'])['cov'].mean(axis=2))
cov_df.index = ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']
cov_df.columns = ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']

correlation_df = pd.DataFrame(az.extract(idata['posterior'])['chol_cov_corr'].mean(axis=2))
correlation_df.index = ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']
correlation_df.columns = ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P', 'LS']

factor_loadings = pd.DataFrame(az.summary(idata, var_names=['lambdas1', 'lambdas2', 'lambdas3', 'lambdas4', 'lambdas5'])['mean']).reset_index()
factor_loadings['factor'] = factor_loadings['index'].str.split('[', expand=True)[0]
factor_loadings.columns = ['factor_loading', 'factor_loading_weight', 'factor']
factor_loadings['factor_loading_weight_sq'] = factor_loadings['factor_loading_weight']**2
factor_loadings['sum_sq_loadings'] = factor_loadings.groupby('factor')['factor_loading_weight_sq'].transform(sum)
Here we've calculated a bunch of summary statistics which we will highlight below, but first observe how analogous the model structure is to the simple CFA model above.
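One quick way to see that analogy is to render the model graph itself (an optional aside, not in the original code; `pm.model_to_graphviz` requires the graphviz library to be installed):

# Plate notation for the five-factor model defined above: the five loading blocks,
# tau and Psi all feed the deterministic mu, which parameterises the likelihood.
pm.model_to_graphviz(model)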
Next we'll report the posterior summary statistics for each of the key parameters.
py$summary_df1 |> kable() |> kable_classic(full_width = F, html_font = "Cambria")
|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
lambdas1[se_acad_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
lambdas1[se_acad_p2] | 0.817 | 0.052 | 0.720 | 0.915 | 0.001 | 0.001 | 1193 | 2008 | 1.00 |
lambdas1[se_acad_p3] | 0.967 | 0.060 | 0.854 | 1.076 | 0.002 | 0.001 | 1286 | 2037 | 1.00 |
lambdas2[se_social_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
lambdas2[se_social_p2] | 0.965 | 0.058 | 0.856 | 1.071 | 0.002 | 0.002 | 757 | 1634 | 1.00 |
lambdas2[se_social_p3] | 0.941 | 0.072 | 0.805 | 1.076 | 0.002 | 0.002 | 878 | 1580 | 1.00 |
lambdas3[sup_friends_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
lambdas3[sup_friends_p2] | 0.802 | 0.044 | 0.720 | 0.887 | 0.001 | 0.001 | 1045 | 1701 | 1.00 |
lambdas3[sup_friends_p3] | 0.905 | 0.053 | 0.805 | 1.006 | 0.002 | 0.001 | 1235 | 2150 | 1.00 |
lambdas4[sup_parents_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
lambdas4[sup_parents_p2] | 1.040 | 0.059 | 0.931 | 1.150 | 0.002 | 0.002 | 758 | 1383 | 1.00 |
lambdas4[sup_parents_p3] | 1.010 | 0.064 | 0.898 | 1.137 | 0.002 | 0.001 | 1051 | 1840 | 1.00 |
lambdas5[ls_p1] | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 4000 | 4000 | NaN |
lambdas5[ls_p2] | 0.791 | 0.085 | 0.627 | 0.944 | 0.004 | 0.003 | 541 | 1074 | 1.00 |
lambdas5[ls_p3] | 0.990 | 0.103 | 0.806 | 1.187 | 0.004 | 0.003 | 543 | 878 | 1.00 |
tau[se_acad_p1] | 5.153 | 0.044 | 5.069 | 5.234 | 0.002 | 0.001 | 488 | 1151 | 1.01 |
tau[se_acad_p2] | 5.345 | 0.039 | 5.271 | 5.414 | 0.002 | 0.001 | 528 | 1033 | 1.01 |
tau[se_acad_p3] | 5.209 | 0.045 | 5.127 | 5.297 | 0.002 | 0.001 | 526 | 1290 | 1.01 |
tau[se_social_p1] | 5.286 | 0.042 | 5.208 | 5.366 | 0.002 | 0.002 | 380 | 743 | 1.01 |
tau[se_social_p2] | 5.473 | 0.039 | 5.397 | 5.544 | 0.002 | 0.001 | 363 | 742 | 1.01 |
tau[se_social_p3] | 5.437 | 0.045 | 5.351 | 5.522 | 0.002 | 0.001 | 492 | 982 | 1.00 |
tau[sup_friends_p1] | 5.782 | 0.068 | 5.651 | 5.904 | 0.004 | 0.003 | 333 | 763 | 1.01 |
tau[sup_friends_p2] | 6.007 | 0.057 | 5.909 | 6.125 | 0.003 | 0.002 | 397 | 872 | 1.00 |
tau[sup_friends_p3] | 5.987 | 0.066 | 5.864 | 6.115 | 0.003 | 0.002 | 385 | 890 | 1.01 |
tau[sup_parents_p1] | 5.973 | 0.061 | 5.858 | 6.085 | 0.003 | 0.002 | 427 | 1059 | 1.00 |
tau[sup_parents_p2] | 5.925 | 0.062 | 5.807 | 6.040 | 0.003 | 0.002 | 394 | 924 | 1.01 |
tau[sup_parents_p3] | 5.716 | 0.066 | 5.596 | 5.840 | 0.003 | 0.002 | 470 | 1294 | 1.00 |
tau[ls_p1] | 5.188 | 0.053 | 5.092 | 5.289 | 0.002 | 0.001 | 654 | 1378 | 1.00 |
tau[ls_p2] | 5.775 | 0.041 | 5.693 | 5.849 | 0.002 | 0.001 | 716 | 1596 | 1.00 |
tau[ls_p3] | 5.219 | 0.051 | 5.121 | 5.314 | 0.002 | 0.001 | 666 | 1609 | 1.00 |
Psi[se_acad_p1] | 0.412 | 0.028 | 0.359 | 0.465 | 0.001 | 0.001 | 1278 | 1740 | 1.00 |
Psi[se_acad_p2] | 0.413 | 0.024 | 0.367 | 0.456 | 0.001 | 0.000 | 2170 | 2268 | 1.00 |
Psi[se_acad_p3] | 0.468 | 0.027 | 0.418 | 0.519 | 0.001 | 0.000 | 1844 | 2408 | 1.00 |
Psi[se_social_p1] | 0.431 | 0.026 | 0.381 | 0.477 | 0.001 | 0.000 | 1382 | 2219 | 1.00 |
Psi[se_social_p2] | 0.361 | 0.025 | 0.314 | 0.405 | 0.001 | 0.000 | 1486 | 2135 | 1.00 |
Psi[se_social_p3] | 0.553 | 0.029 | 0.500 | 0.606 | 0.001 | 0.000 | 2594 | 2803 | 1.00 |
Psi[sup_friends_p1] | 0.517 | 0.040 | 0.439 | 0.587 | 0.001 | 0.001 | 866 | 1739 | 1.00 |
Psi[sup_friends_p2] | 0.508 | 0.031 | 0.454 | 0.568 | 0.001 | 0.001 | 1420 | 1985 | 1.00 |
Psi[sup_friends_p3] | 0.625 | 0.036 | 0.562 | 0.694 | 0.001 | 0.001 | 2090 | 2329 | 1.00 |
Psi[sup_parents_p1] | 0.550 | 0.035 | 0.485 | 0.615 | 0.001 | 0.001 | 1530 | 2075 | 1.00 |
Psi[sup_parents_p2] | 0.536 | 0.038 | 0.465 | 0.605 | 0.001 | 0.001 | 1192 | 2078 | 1.01 |
Psi[sup_parents_p3] | 0.675 | 0.038 | 0.602 | 0.745 | 0.001 | 0.001 | 2089 | 2371 | 1.00 |
Psi[ls_p1] | 0.671 | 0.038 | 0.603 | 0.744 | 0.001 | 0.001 | 1045 | 2387 | 1.00 |
Psi[ls_p2] | 0.534 | 0.030 | 0.477 | 0.591 | 0.001 | 0.001 | 1409 | 2472 | 1.00 |
Psi[ls_p3] | 0.622 | 0.035 | 0.554 | 0.688 | 0.001 | 0.001 | 1597 | 2393 | 1.00 |
Using a forest plot it's easier to see how the uncertainty attaches to some of these estimates.
fig, ax = plt.subplots(figsize=(15, 8))
az.plot_forest(idata, var_names=['lambdas1', 'lambdas2', 'lambdas3', 'lambdas4', 'lambdas5'], combined=True, ax=ax);
ax.set_title("Factor Loadings for each of the Five Factors");
We can also pull out the covariances between our latent constructs, highlighting the key covariance relations between the support and self-efficacy measures and the life-satisfaction outcome.
py$cov_df |> kable(caption= "Covariances Amongst Latent Factors", digits=2) |> kable_styling() %>% kable_classic(full_width = F, html_font = "Cambria") |> row_spec(5, color = "red")
|  | SE_ACAD | SE_SOCIAL | SUP_F | SUP_P | LS |
|---|---|---|---|---|---|
SE_ACAD | 0.47 | 0.26 | 0.06 | 0.20 | 0.22 |
SE_SOCIAL | 0.26 | 0.39 | 0.19 | 0.26 | 0.30 |
SUP_F | 0.06 | 0.19 | 1.03 | 0.12 | 0.16 |
SUP_P | 0.20 | 0.26 | 0.12 | 0.86 | 0.38 |
LS | 0.22 | 0.30 | 0.16 | 0.38 | 0.42 |
One benefit of fitting a multivariate normal relationship between these factors is that we can translate the covariance relations into the implied correlation patterns.
py$correlation_df |> kable(caption= "Correlations Amongst Latent Factors", digits=2) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria") |> row_spec(5, color = "red")
|  | SE_ACAD | SE_SOCIAL | SUP_F | SUP_P | LS |
|---|---|---|---|---|---|
SE_ACAD | 1.00 | 0.60 | 0.09 | 0.32 | 0.50 |
SE_SOCIAL | 0.60 | 1.00 | 0.29 | 0.45 | 0.75 |
SUP_F | 0.09 | 0.29 | 1.00 | 0.12 | 0.25 |
SUP_P | 0.32 | 0.45 | 0.12 | 1.00 | 0.64 |
LS | 0.50 | 0.75 | 0.25 | 0.64 | 1.00 |
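The translation is nothing more than rescaling each covariance by the product of the corresponding marginal standard deviations. As a rough sanity check we can recover (approximately) the correlation table from the posterior-mean covariances above (a small illustration, not part of the original code; averaging draws before rescaling is why it is only approximate):

import numpy as np
import pandas as pd

# corr_ij = cov_ij / (sd_i * sd_j), applied to the posterior-mean covariance matrix
sds = np.sqrt(np.diag(cov_df.values))
corr_check = pd.DataFrame(cov_df.values / np.outer(sds, sds),
                          index=cov_df.index, columns=cov_df.columns)
print(corr_check.round(2))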
Another lens on the factors is the relative factor strength, which highlights where the majority of substantive variation exists in our data-set.
py$factor_loadings[c('factor', 'factor_loading', 'factor_loading_weight', 'factor_loading_weight_sq', 'sum_sq_loadings')] |> kable(caption= "Factor Loadings and Sum of Squared loadings", digits=2) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria") |> column_spec(5, bold=TRUE, color = ifelse(py$factor_loadings$sum_sq_loadings > 2.8, "red", "black"))
factor | factor_loading | factor_loading_weight | factor_loading_weight_sq | sum_sq_loadings |
---|---|---|---|---|
lambdas1 | lambdas1[se_acad_p1] | 1.00 | 1.00 | 2.60 |
lambdas1 | lambdas1[se_acad_p2] | 0.82 | 0.67 | 2.60 |
lambdas1 | lambdas1[se_acad_p3] | 0.97 | 0.94 | 2.60 |
lambdas2 | lambdas2[se_social_p1] | 1.00 | 1.00 | 2.82 |
lambdas2 | lambdas2[se_social_p2] | 0.96 | 0.93 | 2.82 |
lambdas2 | lambdas2[se_social_p3] | 0.94 | 0.89 | 2.82 |
lambdas3 | lambdas3[sup_friends_p1] | 1.00 | 1.00 | 2.46 |
lambdas3 | lambdas3[sup_friends_p2] | 0.80 | 0.64 | 2.46 |
lambdas3 | lambdas3[sup_friends_p3] | 0.90 | 0.82 | 2.46 |
lambdas4 | lambdas4[sup_parents_p1] | 1.00 | 1.00 | 3.10 |
lambdas4 | lambdas4[sup_parents_p2] | 1.04 | 1.08 | 3.10 |
lambdas4 | lambdas4[sup_parents_p3] | 1.01 | 1.02 | 3.10 |
lambdas5 | lambdas5[ls_p1] | 1.00 | 1.00 | 2.61 |
lambdas5 | lambdas5[ls_p2] | 0.79 | 0.63 | 2.61 |
lambdas5 | lambdas5[ls_p3] | 0.99 | 0.98 | 2.61 |
So we can evaluate similar relationships at the level of summary statistics. But more powerfully, we've built a generative model of the process yielding our observed data, and as such we can evaluate the fidelity of the model's predictive output using posterior predictive checks.
def make_ppc(idata, samples=100):
    fig, axs = plt.subplots(5, 3, figsize=(20, 20))
    axs = axs.flatten()
    for i in range(15):
        for j in range(samples):
            temp = az.extract(idata['posterior_predictive'].sel({'likelihood_dim_3': i}))['likelihood'].values[:, j]
            temp = pd.DataFrame(temp, columns=['likelihood'])
            if j == 0:
                axs[i].hist(df[drivers[i]], alpha=0.3, ec='black', bins=20, label='Observed Scores')
                axs[i].hist(temp['likelihood'], color='purple', alpha=0.1, bins=20, label='Predicted Scores')
            else:
                axs[i].hist(df[drivers[i]], alpha=0.3, ec='black', bins=20)
                axs[i].hist(temp['likelihood'], color='purple', alpha=0.1, bins=20)
        axs[i].set_title(f"Posterior Predictive Checks {drivers[i]}")
        axs[i].legend();
    plt.show()

make_ppc(idata, samples=100)
In this way we can see how plausibly our model captures the patterns in the observed data at the level of individual predictions. But additionally we can assess the posterior predictive fit at the level of summary statistics, by defining residuals between the sample covariance matrix and the model-implied covariance matrix computed from draws of the posterior predictive distribution.
def get_posterior_resids(idata, samples=100, metric='cov'):
    resids = []
    for i in range(100):
        if metric == 'cov':
            model_cov = pd.DataFrame(az.extract(idata['posterior_predictive'])['likelihood'][:, :, i]).cov()
            obs_cov = df[drivers].cov()
        else:
            model_cov = pd.DataFrame(az.extract(idata['posterior_predictive'])['likelihood'][:, :, i]).corr()
            obs_cov = df[drivers].corr()
        model_cov.index = obs_cov.index
        model_cov.columns = obs_cov.columns
        residuals = model_cov - obs_cov
        resids.append(residuals.values.flatten())
    residuals_posterior = pd.DataFrame(pd.DataFrame(resids).mean().values.reshape(15, 15))
    residuals_posterior.index = obs_cov.index
    residuals_posterior.columns = obs_cov.index
    return residuals_posterior

residuals_posterior_cov = get_posterior_resids(idata, 2500)
residuals_posterior_corr = get_posterior_resids(idata, 2500, metric='corr')
Then we can plot the familiar heat-maps from above, highlighting the comparable accuracy of the Bayesian model fit.
plot_heatmap(py$residuals_posterior_cov, "Residuals of the Sample Covariances and Model Implied Covariances", "A Visual Check of Bayesian Measurement Model fit")
plot_heatmap(py$residuals_posterior_corr, "Residuals of the Sample Correlations and Model Implied Correlations", "A Visual Check of Bayesian Measurement Model fit")
This process highlights one major difference in how the Bayesian CFA model is free to fit the observed data. The parameterisation of the model is calibrated through the likelihood mechanism, but the posterior samples provide important information that is not recoverable by optimizing point estimates for the measurement and structural parameters in the MLE setting. In the Bayesian setting we have access to the latent factor scores for each of the individual responses, giving us a means of assessing outliers on the latent scales. For instance:
fig, axs = plt.subplots(1, 3, figsize=(10, 10))
axs = axs.flatten()
ax = axs[0]
ax1 = axs[1]
ax2 = axs[2]
az.plot_forest(idata, var_names=['ksi'], combined=True, ax=ax, coords={'latent': ['SUP_P']});
az.plot_forest(idata, var_names=['ksi'], combined=True, ax=ax1, colors="forestgreen", coords={'latent': ['SE_SOCIAL']});
az.plot_forest(idata, var_names=['ksi'], combined=True, ax=ax2, colors="slateblue", coords={'latent': ['LS']});
ax.set_yticklabels([]);
ax.set_xlabel("SUP_P");
ax1.set_yticklabels([]);
ax1.set_xlabel("SE_SOCIAL");
ax2.set_yticklabels([]);
ax2.set_xlabel("LS");
ax.axvline(-2, color='red');
ax1.axvline(-2, color='red');
ax2.axvline(-2, color='red');
ax.set_title("Individual Parental Support Metric \n On Latent Factor SUP_P");
ax1.set_title("Individual Social Self Efficacy \n On Latent Factor SE_SOCIAL");
ax2.set_title("Individual Life Satisfaction Metric \n On Latent Factor LS");
plt.show();
So while we are fitting a confirmatory factor model, the edges between exploratory and confirmatory analysis blur in this setting. Just because we have a hypothesized factor structure we wish to evaluate does not mean the implications of such structures are crystal clear. Here we need to check whether the individually reported scores render plausible judgments about the relative position of each survey respondent on these latent scales. These are the implications of our model fit that would help us confirm the plausibility of the hypothesized factors, but additionally they offer new options for exploring our data and categorizing our observations. This is a common experience in contemporary Bayesian workflow, which recommends iterative model refinement and evaluation.
“A key part of Bayesian workflow is adapting the model to the data at hand and the questions of interest. The model does not exist in isolation and is not specified from the outside; it emerges from engagement with the application and the available data” - Gelman et al in Bayesian Workflow
This is the lens that matters. Yes, iterative refinement and model checking can lead to models which are over-fit to the sample data. This is always a risk, but an explicit modeling workflow highlights the prior designs and exposes the sequence of model choices to stringent tests and evaluative consideration. The choices that determine the trajectory of our refinements should represent a process of learning and discovery that uncovers more questions of interest as we expand the capacity of our models.
Structural Equation Modelling in PyMC
In the context of structural equation modeling our hypothetical factor structure and dependency relations are even more specific than in the simple "measurement model" we've just seen. A structural equation model incorporates a measurement model as a sub-component and adds regression components between the latent and manifest variables, as required to evaluate candidate hypotheses regarding the structure of the data-generating process. The range of possibilities we can encode here is vast, but candidate model architectures are constrained by our theory of the case and the demands of explanatory coherence.
SEM and Indirect Effects
With the life-satisfaction data-set we've been looking at, we want to evaluate the indirect and direct effects of parental and peer support on life-satisfaction outcomes. To do so we need to add a dependency structure to our data-generating process and fit the regressions which would inform us as to the relative importance of these types of support as mediated through our self-efficacy scores.
def make_indirect_sem(priors):

    coords = {'obs': list(range(len(df))),
              'indicators': drivers,
              'indicators_1': ['se_acad_p1', 'se_acad_p2', 'se_acad_p3'],
              'indicators_2': ['se_social_p1', 'se_social_p2', 'se_social_p3'],
              'indicators_3': ['sup_friends_p1', 'sup_friends_p2', 'sup_friends_p3'],
              'indicators_4': ['sup_parents_p1', 'sup_parents_p2', 'sup_parents_p3'],
              'indicators_5': ['ls_p1', 'ls_p2', 'ls_p3'],
              'latent': ['SUP_F', 'SUP_P'],
              'latent1': ['SUP_F', 'SUP_P'],
              'latent_regression': ['SUP_F->SE_ACAD', 'SUP_P->SE_ACAD', 'SUP_F->SE_SOC', 'SUP_P->SE_SOC'],
              'regression': ['SE_ACAD', 'SE_SOCIAL', 'SUP_F', 'SUP_P']
              }

    obs_idx = list(range(len(df)))
    with pm.Model(coords=coords) as model:

        Psi = pm.InverseGamma('Psi', 5, 10, dims='indicators')
        lambdas_ = pm.Normal('lambdas_1', priors['lambda'][0], priors['lambda'][1], dims=('indicators_1'))
        lambdas_1 = pm.Deterministic('lambdas1', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_1'))
        lambdas_ = pm.Normal('lambdas_2', priors['lambda'][0], priors['lambda'][1], dims=('indicators_2'))
        lambdas_2 = pm.Deterministic('lambdas2', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_2'))
        lambdas_ = pm.Normal('lambdas_3', priors['lambda'][0], priors['lambda'][1], dims=('indicators_3'))
        lambdas_3 = pm.Deterministic('lambdas3', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_3'))
        lambdas_ = pm.Normal('lambdas_4', priors['lambda'][0], priors['lambda'][1], dims=('indicators_4'))
        lambdas_4 = pm.Deterministic('lambdas4', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_4'))
        lambdas_ = pm.Normal('lambdas_5', priors['lambda'][0], priors['lambda'][1], dims=('indicators_5'))
        lambdas_5 = pm.Deterministic('lambdas5', pt.set_subtensor(lambdas_[0], 1), dims=('indicators_5'))
        tau = pm.Normal('tau', 3, 10, dims='indicators')
        kappa = 0
        sd_dist = pm.Exponential.dist(1.0, shape=2)
        chol, _, _ = pm.LKJCholeskyCov('chol_cov', n=2, eta=priors['eta'],
                                       sd_dist=sd_dist, compute_corr=True)
        cov = pm.Deterministic("cov", chol.dot(chol.T), dims=('latent', 'latent1'))
        ksi = pm.MvNormal('ksi', kappa, chol=chol, dims=('obs', 'latent'))

        # Regression Components
        beta_r = pm.Normal('beta_r', 0, 0.5, dims='latent_regression')
        beta_r2 = pm.Normal('beta_r2', 0, 1, dims='regression')
        resid_chol, _, _ = pm.LKJCholeskyCov('resid_chol', n=2, eta=priors['eta'],
                                             sd_dist=sd_dist, compute_corr=True)
        _ = pm.Deterministic("resid_cov", chol.dot(chol.T))
        sigmas_resid = pm.MvNormal('sigmas_resid', kappa, chol=resid_chol)

        # SE_ACAD ~ SUP_FRIENDS + SUP_PARENTS
        regression_se_acad = pm.Normal('regr_se_acad', beta_r[0]*ksi[obs_idx, 0] + beta_r[1]*ksi[obs_idx, 1], sigmas_resid[0])
        # SE_SOCIAL ~ SUP_FRIENDS + SUP_PARENTS
        regression_se_social = pm.Normal('regr_se_social', beta_r[2]*ksi[obs_idx, 0] + beta_r[3]*ksi[obs_idx, 1], sigmas_resid[1])

        # LS ~ SE_ACAD + SE_SOCIAL + SUP_FRIEND + SUP_PARENTS
        regression = pm.Normal('regr', beta_r2[0]*regression_se_acad + beta_r2[1]*regression_se_social +
                               beta_r2[2]*ksi[obs_idx, 0] + beta_r2[3]*ksi[obs_idx, 1], 1)

        m0 = tau[0] + regression_se_acad*lambdas_1[0]
        m1 = tau[1] + regression_se_acad*lambdas_1[1]
        m2 = tau[2] + regression_se_acad*lambdas_1[2]
        m3 = tau[3] + regression_se_social*lambdas_2[0]
        m4 = tau[4] + regression_se_social*lambdas_2[1]
        m5 = tau[5] + regression_se_social*lambdas_2[2]
        m6 = tau[6] + ksi[obs_idx, 0]*lambdas_3[0]
        m7 = tau[7] + ksi[obs_idx, 0]*lambdas_3[1]
        m8 = tau[8] + ksi[obs_idx, 0]*lambdas_3[2]
        m9 = tau[9] + ksi[obs_idx, 1]*lambdas_4[0]
        m10 = tau[10] + ksi[obs_idx, 1]*lambdas_4[1]
        m11 = tau[11] + ksi[obs_idx, 1]*lambdas_4[2]
        m12 = tau[12] + regression*lambdas_5[0]
        m13 = tau[13] + regression*lambdas_5[1]
        m14 = tau[14] + regression*lambdas_5[2]

        mu = pm.Deterministic('mu', pm.math.stack([m0, m1, m2, m3, m4, m5, m6, m7,
                                                   m8, m9, m10, m11, m12, m13, m14]).T)
        _ = pm.Normal('likelihood', mu, Psi, observed=df[drivers].values)

        idata = pm.sample(10_000, chains=4, nuts_sampler='numpyro', target_accept=.99, tune=2000,
                          idata_kwargs={"log_likelihood": True}, random_seed=110)
        idata.extend(pm.sample_posterior_predictive(idata))

    return model, idata


model2, idata2 = make_indirect_sem(priors={'eta': 2, 'lambda': [1, 1]})
There is quite a bit of extra structure now in our model. This structure articulates the path-dependence of the self-efficacy constructs on the support constructs. We've specified these as separate regressions with coefficient values for each of the support constructs, but additionally we need to ensure there is a correlation between the implied self-efficacy scores, so we've wrapped the regression equations in two normal distributions with correlated residual terms. In this manner we preserve the multivariate relationships and allow the model to place weight on this prior if there is a strong co-determination effect. It's perhaps easier to see in a picture than to think through the code directly:
The main point to emphasize here is the expressive nature of the modelling paradigm. Our ability to construct and estimate different chains of dependence allows us to test and evaluate the direct and indirect effects due to the different “drivers” of life-satisfaction.
As before, we can still run posterior predictive checks to evaluate granular-level performance against the observed data points.
make_ppc(idata2, 100)
But additionally we can pull out the regression coefficient summaries and report on the direct and indirect effects of our constructs.
summary_df = az.summary(idata2, var_names=['beta_r', 'beta_r2'])


def calculate_effects(summary_df, var='SUP_P'):
    # Indirect Paths
    ## VAR -> SE_SOC -> LS
    indirect_parent_soc = summary_df.loc[f'beta_r[{var}->SE_SOC]']['mean']*summary_df.loc['beta_r2[SE_SOCIAL]']['mean']

    ## VAR -> SE_ACAD -> LS
    indirect_parent_acad = summary_df.loc[f'beta_r[{var}->SE_ACAD]']['mean']*summary_df.loc['beta_r2[SE_ACAD]']['mean']

    ## Total Indirect Effects
    total_indirect = indirect_parent_soc + indirect_parent_acad

    ## Total Effects
    total_effect = total_indirect + summary_df.loc[f'beta_r2[{var}]']['mean']

    return pd.DataFrame([[indirect_parent_soc, indirect_parent_acad, total_indirect, total_effect]],
                        columns=[f'{var} -> SE_SOC ->LS', f'{var} -> SE_ACAD ->LS', f'Total Indirect Effects {var}', f'Total Effects {var}']
                        )


indirect_p = calculate_effects(summary_df, 'SUP_P')
indirect_f = calculate_effects(summary_df, 'SUP_F')

residuals_posterior_cov = get_posterior_resids(idata2, 2500)
residuals_posterior_corr = get_posterior_resids(idata2, 2500, 'corr')
The regression effects seem well sampled and show the strength of relationships between social self-efficacy scores and reported life-satisfaction.
py$summary_df |> kable(caption= "Regression Coefficients Amongst Latent Factors", digits=2) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
beta_r[SUP_F->SE_ACAD] | 0.05 | 0.04 | -0.04 | 0.13 | 0 | 0 | 35872 | 29560 | 1 |
beta_r[SUP_P->SE_ACAD] | 0.26 | 0.05 | 0.17 | 0.36 | 0 | 0 | 22718 | 26763 | 1 |
beta_r[SUP_F->SE_SOC] | 0.16 | 0.04 | 0.08 | 0.22 | 0 | 0 | 27895 | 29498 | 1 |
beta_r[SUP_P->SE_SOC] | 0.31 | 0.04 | 0.23 | 0.40 | 0 | 0 | 13525 | 23161 | 1 |
beta_r2[SE_ACAD] | 0.17 | 0.12 | -0.06 | 0.38 | 0 | 0 | 28858 | 29141 | 1 |
beta_r2[SE_SOCIAL] | 0.58 | 0.15 | 0.30 | 0.85 | 0 | 0 | 23839 | 28409 | 1 |
beta_r2[SUP_F] | 0.02 | 0.07 | -0.12 | 0.15 | 0 | 0 | 32759 | 30795 | 1 |
beta_r2[SUP_P] | 0.27 | 0.09 | 0.11 | 0.44 | 0 | 0 | 25497 | 30201 | 1 |
But now we can also pull out estimates of the indirect and total effects of parental and peer support on the outcome of life-satisfaction using the multiplicative path-tracing rules.
py$indirect_p |> kable(caption= "Total and Indirect Effects: Parental Support", digits=2) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
SUP_P -> SE_SOC ->LS | SUP_P -> SE_ACAD ->LS | Total Indirect Effects SUP_P | Total Effects SUP_P |
---|---|---|---|
0.18 | 0.04 | 0.22 | 0.5 |
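The arithmetic behind these entries is just the path-tracing product and sum of the posterior-mean coefficients reported above. For parental support:

\[ \text{SUP\_P} \rightarrow \text{SE\_SOC} \rightarrow \text{LS}: \; 0.31 \times 0.58 \approx 0.18, \qquad \text{SUP\_P} \rightarrow \text{SE\_ACAD} \rightarrow \text{LS}: \; 0.26 \times 0.17 \approx 0.04 \]

\[ \text{Total Indirect} \approx 0.18 + 0.04 = 0.22, \qquad \text{Total} \approx 0.22 + 0.27 \approx 0.50 \]

where 0.27 is the direct SUP_P path on LS from the coefficient table.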
py$indirect_f |> kable(caption= "Total and Indirect Effects: Friend Support", digits=4) |> kable_styling() |> kable_classic(full_width = F, html_font = "Cambria")
SUP_F -> SE_SOC ->LS | SUP_F -> SE_ACAD ->LS | Total Indirect Effects SUP_F | Total Effects SUP_F |
---|---|---|---|
0.0896 | 0.0076 | 0.0972 | 0.1152 |
Note that we've defined these effects as point estimates based on the posterior-mean realizations of the regression coefficients. We could instead have defined these quantities in the model context and sampled their posterior distributions too; however, the model was already complex enough that we kept it simpler for clarity of exposition.
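A minimal sketch of what that would look like, were we to want full posterior distributions for the effects (hypothetical additions to `make_indirect_sem`, with the variable names chosen here purely for illustration):

# Inside the `with pm.Model(...)` block of make_indirect_sem, after beta_r and
# beta_r2 are defined, the path-tracing products can be tracked draw by draw.
# Indices follow the coords: beta_r[1] is SUP_P->SE_ACAD, beta_r[3] is SUP_P->SE_SOC;
# beta_r2[0] is SE_ACAD->LS, beta_r2[1] is SE_SOCIAL->LS, beta_r2[3] is the direct SUP_P->LS path.
indirect_sup_p = pm.Deterministic('indirect_SUP_P',
                                  beta_r[3]*beta_r2[1] + beta_r[1]*beta_r2[0])
total_sup_p = pm.Deterministic('total_SUP_P', indirect_sup_p + beta_r2[3])

# az.summary(idata2, var_names=['indirect_SUP_P', 'total_SUP_P']) would then report
# posterior means and credible intervals rather than a single point estimate.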
Finally we can plot the residual variance-covariance matrix for the Bayesian estimate of the structural equation model we hypothesized.
plot_heatmap(py$residuals_posterior_cov, "Residuals of the Sample Covariances and Model Implied Covariances", "A Visual Check of Bayesian Structural Model fit")
The overall impression seems quite reasonable, but it is notably distinct from the fit of the simpler measurement model. In this way the Bayesian estimation strategy is not as constrained as the MLE variance-covariance estimation strategy, and we can learn about meaningful interactions in a richer, more theoretically compelling model while still calibrating against the observed data. This is borne out by global model comparisons too.
compare_df = az.compare({'Measurement Model': idata, 'Structural Model': idata2})
az.plot_compare(compare_df);
plt.show()
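For the tabular counterpart of that plot, the comparison object can be printed directly; it is an ordinary pandas DataFrame carrying the expected log predictive density estimates and model weights behind the ranking (exact column names depend on the ArviZ version):

print(compare_df)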
The structural model is marginally worse by measures of global fit to the data, but this summary skips over the variety of questions we can answer with this new structure. The broad fidelity to observed data seen in the posterior predictive checks and the variance covariance residuals suggests that we ought to be willing to take this hit to global performance if we can answer questions left unanswered by the measurement model.
Conclusion: Build the Model with Lasers!
“Model experiments offer less inferential power to learn about the world. But the possibility of model experiments to surprise, that is, to produce results that are unexpected against the background of existing knowledge and understanding, remains […] important … The surprising results of model experiments lead not to the discovery of new phenomena in the real world, but the recognition of new things in the world of the model and thence to the development of new categories of things and new concepts and ideas …” - Mary Morgan in The World in the Model
At the outset of this analysis we stated the hypothetical structure of interest. This suggests a purely confirmatory project, but in practice much of the feasibility of this analysis is path-dependent and constrained by the viability of the various model architectures with respect to the data. It's easy to tell a story of increasingly sophisticated models and ever more powerful, more insightful inferential claims. Yet this narrative is compelling only in so far as there is a justifiable sequence of steps through the ascent of model complexity. The increase in complexity is offset by a rationale at each step. Our final model encompasses aspects of our initial modeling in an expansionary escalation that preserves the insight of the earliest editions but improves them with our burgeoning understanding of the computational and theoretical dimensions of our problem.
To recap: we worked through a series of simple regression models seeking to tease out the weighted importance of various factors on life-satisfaction outcomes. These models fairly consistently highlighted the relative importance of metrics related to parental support and social self-efficacy. We also saw a significant modification effect for the interaction term. All of this suggests we've picked out a meaningful set of metrics which ought to support inferences about the phenomena of interest. But how do we parcel those metrics in such a way as to support clean and compelling statements about the influence of each on reported life-satisfaction? Here we sought to leverage CFA and SEM models to clarify the structure of this influence and the particular pathways by which parental support drives these outcomes. Our MLE estimation strategy breaks down when trying to estimate mediation effects in a complex SEM structure.
These latter models were then re-articulated as a species of Bayesian hierarchical model which aims to induce conditional independence amongst the thematically similar metrics by grouping each under a common factor. Which is to say, we assume that the joint distribution of outcomes is exchangeable and we can fit a conditional mixture to represent that joint distribution. We then showed how these conditional structures can be further adapted to answer questions about direct and indirect causation that bear precisely on the hypothetical structure we declared to be our target at the outset. Yet it's not enough to fit a model and claim to be done. You have to proffer a defense that the estimated model is both plausible and recovers the observed data well. The very process of iterative Bayesian model development encodes these checks and balances. Iterative model refinement is no guide to which questions should be asked of a model, yet questions occur as we explore the posterior distribution. Inspiration strikes when you have samples to hand and can test the implications of some idle thought. Perhaps some parameterisation yields poor model fit and a re-parameterisation (a shift in some focal parameter's prior) recovers better performance and prompts a new theoretical lens on your problem. While not a panacea for all forms of statistical misstep, this workflow is a positive, rich and rewarding methodological practice. The iterative model workflow is a natural seeking of the "right" conditionalisation to inform our decisions. Our theory is rich, and maybe even a tad risqué for conservative tastes, but it articulates the latent unobservable constructs we believe to operate in reality. In the end we make an inference to the best explanation, by an argument that covers theoretical nuance, computational burden and points of practical compromise. But we are making an argument. Yes, we've walked down the garden of forking paths, but our choices were deliberate and we suffer no regret.
Citation
@online{forde,
author = {Nathaniel Forde},
title = {Measurement, {Latent} {Factors} and the {Garden} of {Forking}
{Paths}},
langid = {en}
}