Stratum-Specific Effect Modification
Data Science @ Personio
and Open Source Contributor @ PyMC
1/1/24
Opt-out policies in SaaS customer surveys risk biasing survey-derived summaries of sentiment.
Multilevel Regression with Post-stratification (MRP) is a corrective technique that intelligently re-weights those summaries to better reflect the true population distribution of sentiment.
In which we illustrate how regression models automate stratum-specific effect modification.
\[\hat{y}_{i} = \alpha + \beta_{1}X_{1} + \dots + \beta_{n}X_{n}\]
Assume \(y_{i} = \hat{y}_{i} + \epsilon\) where \(E(\epsilon) = 0\)
\[ E[y \mid X = x] = \alpha + \beta_{1}X_{1} + \dots + \beta_{n}X_{n}\]
\[ y_{i} \sim Normal(\hat{y}_{i}, \sigma) \]
m0 = smf.ols('np.log(hwage) ~ job + educ', data=df).fit()
m1 = smf.ols('np.log(hwage) ~ job + educ + male', data=df).fit()

# predict() expects named covariate columns, not a bare list of values
new = pd.DataFrame({'job': ['software_engineer'], 'educ': ['college'], 'male': [1]})
pred = m0.predict(new)
pred1 = m1.predict(new)
diff = pred - pred1
As we add covariates, we multiply the combinatorial branches that define the available strata across our population of interest.
A fitted regression model then allows us to explore the conditional probabilities along each of these branches.
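The combinatorial growth of strata can be made concrete: each additional covariate multiplies the number of distinct cells. A minimal sketch (the covariate names and levels here are hypothetical):

```python
from itertools import product

# Hypothetical covariate levels: each added covariate multiplies
# the number of distinct strata in the population.
levels = {
    "job": ["software_engineer", "teacher", "nurse"],
    "educ": ["high_school", "college", "graduate"],
    "male": [0, 1],
}

strata = list(product(*levels.values()))
print(len(strata))  # 3 * 3 * 2 = 18 distinct strata
```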
reg = bmb.Model("Outcome ~ 1 + Treatment", df)
results = reg.fit()

# The ':' interaction term lets the treatment effect vary by risk stratum
reg_strata = bmb.Model(
    "Outcome ~ 1 + Treatment + Risk_Strata + Treatment:Risk_Strata", df
)
results_strata = reg_strata.fit()
bmb.interpret.plot_predictions(reg, results, conditional=["Treatment"])
bmb.interpret.plot_predictions(reg_strata, results_strata, conditional=["Treatment"])
Regression automates the manual re-weighting that would otherwise be needed to account for differing risk across the strata of our population.
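The manual re-weighting that regression automates can be shown in a toy calculation (all numbers below are made up for illustration): stratum-specific means are combined using population shares rather than the skewed sample shares.

```python
import pandas as pd

# Toy data: outcome rates differ by risk stratum, and the sample
# over-represents the high-risk stratum relative to the population.
sample = pd.DataFrame({
    "risk_strata": ["low"] * 20 + ["high"] * 80,
    "outcome":     [0.1] * 20 + [0.5] * 80,
})

# Naive summary inherits the skewed sampling of strata
naive = sample["outcome"].mean()  # 0.42

# Manual re-weighting: stratum means combined by *population* shares
population_share = {"low": 0.7, "high": 0.3}
stratum_means = sample.groupby("risk_strata")["outcome"].mean()
adjusted = sum(stratum_means[s] * w for s, w in population_share.items())  # 0.22
```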
In which we discuss how sample bias can corrupt the inferences drawn from even theoretically sound regression models.
We examine a comprehensive YouGov poll on whether employers should cover abortion in their insurance plans.
We select a biased subsample.
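A biased subsample can be constructed deliberately by over-weighting one demographic when sampling. A sketch under assumed column names (`male`, `abortion`) and synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)

# Synthetic stand-in for the full poll frame; 'male' is a 0/1 indicator
poll = pd.DataFrame({
    "male": rng.integers(0, 2, size=5000),
    "abortion": rng.integers(0, 2, size=5000),
})

# Deliberately biased subsample: men are drawn at five times the rate
# of women, so any naive summary of sentiment over-weights male responses.
weights = np.where(poll["male"] == 1, 5.0, 1.0)
biased = poll.sample(n=1000, weights=weights, random_state=100)
```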
“In conventional sampling theory, the only scenario considered is essentially that of ‘drawing from an urn’, and the only probabilities that arise are those that presuppose the contents of the ‘urn’ or the ‘population’ already known, and seek to predict what ‘data’ we are likely to get as a result. …It was our use of probability theory as logic that has enabled us to do so easily what was impossible for those who thought of probability as a physical phenomenon associated with ‘randomness’. Quite the opposite; we have thought of probability distributions as carriers of information.” — Edwin Jaynes, *Probability Theory: The Logic of Science*, pp. 88 & 117
We fit a preliminary model to investigate the interactions across demographic splits, specifying a binomial likelihood with a logit link.
formula = """ p(abortion, n) ~ C(state) + C(eth) + C(edu) + male + repvote"""
base_model = bmb.Model(formula, model_df, family="binomial")
result = base_model.fit(
    random_seed=100,
    target_accept=0.95,
    idata_kwargs={"log_likelihood": True},
)
fig, ax = bmb.interpret.plot_comparisons(
    model=base_model,
    idata=result,
    contrast={"eth": ["Black", "White"]},
    conditional=["age", "edu"],
    comparison_type="diff",
    subplot_kwargs={"main": "age", "group": "edu"},
    fig_kwargs={"figsize": (12, 5), "sharey": True},
    legend=True,
)
ax[0].set_title("Comparison of Difference in Ethnicity \n within Age and Educational Strata");
In which we fit our full model to the biased data.
\[Pr(y_i = 1) = logit^{-1}\Bigg( \alpha_{\rm s[i]}^{\rm state} + \alpha_{\rm a[i]}^{\rm age} + \alpha_{\rm r[i]}^{\rm eth} + \alpha_{\rm e[i]}^{\rm edu} \\ + \beta^{\rm male} \cdot {\rm Male}_{\rm i} + \alpha_{\rm g[i], r[i]}^{\rm male.eth} + \alpha_{\rm e[i], a[i]}^{\rm edu.age} + \alpha_{\rm e[i], r[i]}^{\rm edu.eth} \Bigg)\]
We allow stratum-specific intercept terms for each level of the demographic categories and for their interaction effects.
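One way the specification above can map onto a Bambi formula is with group-level intercepts per demographic category plus the pairwise interactions named in the equation; this is a sketch assuming the same column names as the base model, not necessarily the exact formula fitted here:

```python
# Sketch of the full-model formula: group-level (partially pooled)
# intercepts for each demographic category and their interactions.
formula = """p(abortion, n) ~ male + repvote
    + (1 | state) + (1 | age) + (1 | eth) + (1 | edu)
    + (1 | male:eth) + (1 | edu:age) + (1 | edu:eth)"""
```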
In which we use the fitted model to predict rates of support across the population and adjust the predicted values by the relative weight each stratum occupies in the population.
estimates = []
## The base model posterior fitted on the biased sample
abortion_posterior_base = az.extract(result)["p(abortion, n)_mean"]
## The posterior updated with national level figures
abortion_posterior_mrp = az.extract(result_adjust)["p(abortion, n)_mean"]

## Adjust the predictions at the state level
for s in new_data["state"].unique():
    idx = new_data.index[new_data["state"] == s].tolist()
    weights = new_data.iloc[idx]["state_percent"]
    predicted_mrp = (
        (abortion_posterior_mrp[idx].mean(dim="sample") * weights).sum().item()
    )
    predicted_mrp_lb = (
        (abortion_posterior_mrp[idx].quantile(0.025, dim="sample") * weights)
        .sum()
        .item()
    )
    predicted_mrp_ub = (
        (abortion_posterior_mrp[idx].quantile(0.975, dim="sample") * weights)
        .sum()
        .item()
    )
    predicted = abortion_posterior_base[idx].mean().item()
    base_lb = abortion_posterior_base[idx].quantile(0.025).item()
    base_ub = abortion_posterior_base[idx].quantile(0.975).item()
    estimates.append(
        [s, predicted, base_lb, base_ub, predicted_mrp, predicted_mrp_ub, predicted_mrp_lb]
    )
state_predicted = pd.DataFrame(
    estimates,
    columns=["state", "base_expected", "base_lb", "base_ub",
             "mrp_adjusted", "mrp_ub", "mrp_lb"],
)
state_predicted = (
    state_predicted.merge(cces_all_df.groupby("state")[["abortion"]].mean().reset_index())
    .sort_values("mrp_adjusted")
    .rename({"abortion": "census_share"}, axis=1)
)
state_predicted.head()
Derived state-level predictions from the biased sample alongside the corrected values.
“The IID condition is a mathematical specification of what Hume called the uniformity of nature. To say that nature is uniform means that whatever circumstances holds for the observed, the same circumstances will continue to hold for the unobserved. This is what Hume required for the possibility of inductive reasoning”
Post-Stratification Weighting
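At its core, the post-stratification adjustment is just a weighted sum: cell-level estimates combined by the census share of each cell. A minimal numeric sketch with made-up rates and shares:

```python
# Minimal numeric sketch of post-stratification weighting:
# cell-level estimates combined by the census share of each cell.
cell_estimates = {"18-29": 0.62, "30-49": 0.51, "50+": 0.40}   # made-up rates
census_shares  = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}   # shares sum to 1

poststratified = sum(cell_estimates[c] * census_shares[c] for c in cell_estimates)
# 0.62*0.20 + 0.51*0.35 + 0.40*0.45 = 0.4825
```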