import pandas as pd
import matplotlib.pyplot as plt
import bambi as bmb
import seaborn as sns
from pygam.datasets import mcycle
from pygam import LinearGAM, s, f, GAM, l, utils
import numpy as np
import arviz as az
import pymc as pm
import scipy as sp
from patsy import bs as bs_patsy, dmatrix
import pytensor.tensor as pt
random_seed = 100

import warnings
warnings.simplefilter('ignore')
GAMs and GPs: Flexibility and Calibration
Flexible spline models risk overfitting and need to be carefully calibrated against real data. Hierarchical models impose structure and aid calibration. We combine the two.
Theoretical Posits and Scientific Commitments
The scientific enterprise must abstract away details of reality to provide insight into the structure of reality. Statistical models of real mechanisms are cobbled together from a variety of abstractions; probability distributions and functional relations are conjoined and then calibrated against observed data. We are building on disparate foundations. Good abstractions aim at modelling the structures or real patterns that matter; it doesn’t matter whether those abstractions are themselves reflective of reality.
[T]here is no way to establish [a] fully secured, … starting point of the sciences. There is no tabula rasa. We are like sailors who have to rebuild their ship on the open sea, without ever being able to dismantle it in dry-dock and reconstruct it from its best components. - Otto Neurath
This is an apt metaphor for spline models in particular - like these ships built at sea, spline models are constructed from ad-hoc materials fished from the vast ocean of possible models. They are linear approximations invoked to ride the next wave form.
In this blog post we’re going to dive into the modelling of non-linear functions and explore some of the tooling available in the Python ecosystem. We’ll start by looking into Generalised Additive Models with splines in pyGAM
before proceeding to look at how Bayesian versions of spline modelling compare to Gaussian processes in Bambi
and PyMC
. Finally we will show how hierarchical Bayesian models avoid some of the issues of overfitting seen in simpler spline models.
Our interest in these models stems from their flexibility to approximate functions of arbitrary complexity. We’ll see how the methods work in the case of a relatively straightforward toy example and then we’ll apply each of the methods to deriving insights into the functional form of insurance loss curves. In this application we adapt a data set discussed in Mick Cooney’s Stan
case study to demonstrate the power of hierarchical spline models. Throughout we’ll draw on the discussion of these methods in Osvaldo Martin’s “Bayesian Analysis with Python” for practical details on implementing these models.
Generalised Additive models
Nonlinear functions can be approximated with linearly additive combinations of component features. Before delving into the practicalities we’ll quickly note some of the theoretical background. The canonical reference for these additive models is Simon Wood’s “Generalised Additive Models: An Introduction with R”, which outlines in some detail the theoretical background of splines and univariate smoothers. The book stresses the trade-offs between the flexibility of splines and the need for cross-validation and penalised estimation methods in spline-based modelling. Spline models ape the complexities of esoteric functions with regression-like formulas, combining subordinate functions of observed variables to predict another:
\[y \sim f(x_1) + f(x_2) ...\]
The structure of the individual \(f\) functions is approximated. So we are approximating \(y\) by adding a series of input approximations \(\sum f_{i}\). In R these penalised model fits can be achieved in mgcv
which incorporates a Wilkinson-like formula syntax for model specification: y ~ s(x) + s(x1)
. The closest implementation in Python is available in PyGAM
and we will adopt this package to illustrate spline-based smoothing.
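To make the additive-basis idea concrete, here is a minimal sketch (not PyGAM’s internals) of how a univariate smooth reduces to an ordinary linear model once we expand the input into a B-spline basis; the toy x and y values are invented purely for illustration.

import numpy as np
from patsy import dmatrix

rng = np.random.default_rng(0)
x_toy = np.linspace(0, 10, 200)
y_toy = np.sin(x_toy) + rng.normal(0, 0.3, size=len(x_toy))

# expand x into a cubic B-spline basis with three interior knots
B = np.asarray(dmatrix("bs(x, knots=(2.5, 5, 7.5), degree=3, include_intercept=True) - 1",
                       {"x": x_toy}))

# the "smooth" f(x) is just B @ beta: an ordinary least-squares fit on the basis columns
beta, *_ = np.linalg.lstsq(B, y_toy, rcond=None)
f_hat = B @ beta   # fitted non-linear function of x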
PyGAM and Penalised Fits
Let’s first look at an example data set on which to demonstrate univariate smoothing patterns using penalised splines. We’ll initially look simply at the function calls before going “under the hood”.
Spline models can be optimised by fitting penalties of differing strength over a varying number of splines. Splines are used to construct non-linear functions of the input variable which are combined additively in our model equation. The penalties used in the optimisation routines constrain how “wiggly” these constructions will be. The number of splines determines how precise our approximation will be.
X, y = mcycle(return_X_y=True)

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(X, y)
ax.set_ylabel("Acceleration")
ax.set_xlabel("Time Step")
ax.set_title("Crash Test Dummy Acceleration \n Simulated Motorcycle Crash", fontsize=20);
Now we fit a number of different models to account for the herky-jerky nature of the data generating process. We vary the parameterisations to see how the number of splines and the strength of the penalty help account for the variation in \(y\) over the support of \(X\).
gam1 = LinearGAM(s(0, n_splines=5)).fit(X, y)
gam2 = LinearGAM(s(0, n_splines=7)).fit(X, y)
gam3 = LinearGAM(s(0, n_splines=10)).fit(X, y)
gam4 = LinearGAM(s(0, n_splines=15)).fit(X, y)
gam5 = LinearGAM(s(0, lam=.1)).fit(X, y)
gam6 = LinearGAM(s(0, lam=.5)).fit(X, y)
gam7 = LinearGAM(s(0, lam=5)).fit(X, y)
gam8 = LinearGAM(s(0, lam=15)).fit(X, y)
def plot_fit(gam, X, y, ax, t, c1='b', c2='r'):
    XX = gam.generate_X_grid(term=0, n=500)
    ax.plot(XX, gam.predict(XX), color=c2, linestyle='--')
    ax.plot(XX, gam.prediction_intervals(XX, width=.95), color=c1, ls='--')
    ax.scatter(X, y, facecolor='gray', edgecolors='none')
    ax.set_title(f"""95% prediction interval with {t} \n LL: {gam.statistics_['loglikelihood']}""");
fig, axs = plt.subplots(4, 2, figsize=(10, 20))
axs = axs.flatten()
titles = ['5_splines', '7_splines', '10_splines', '15_splines',
          'lam=.1', 'lam=.5', 'lam=5', 'lam=15']
gs = [gam1, gam2, gam3, gam4, gam5, gam6, gam7, gam8]
for ax, g, t in zip(axs, gs, titles):
    plot_fit(g, X, y, ax, t)
Here we’ve seen the PyGAM
package applied to fitting our model to the data. In the formula specification we see y ~ s(i)
where i
denotes the index of the column variable in the X data.
Over the range of the x-axis we can see how the conditional expectation is more or less well fit to the data depending on how the penalties and the complexity of the model are specified. The art and science of developing a GAM model is finding the constraints on the approximation that help capture the observed patterns but do not excessively overfit to the observed data.
Optimising The Parameter Setting
The manner of the linear combination achieved by GAMs is constrained by the optimisation goal in their model fit. We can see from the model summary what is going on under the hood. For a given model specification the summary will report a number of model-fit statistics such as the log-likelihood and the AIC.
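Under the hood the penalised fit is (roughly) minimising a trade-off between fidelity to the data and the “wiggliness” of the spline coefficients, where \(B\) is the spline basis, \(S\) a roughness penalty matrix and \(\lambda\) the smoothing parameter explored by the grid search:
\[\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; ||y - B\beta||^{2} + \lambda \beta^{T}S\beta\]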
## Naive Model manually specified splines
gam_raw = LinearGAM(s(0, n_splines=10), fit_intercept=False).fit(X, y)
print("log_likelihood:", gam_raw.statistics_['loglikelihood'])
print("AIC:", gam_raw.statistics_['AIC'])
log_likelihood: -1042.2807399926558
AIC: 2097.7407642114244
The question then becomes what changes are induced in the model as we seek to optimise these model fit statistics.
## model optimised
gam = LinearGAM(s(0), fit_intercept=False)

gam.gridsearch(X, y)
gam.summary()
0% (0 of 11) | | Elapsed Time: 0:00:00 ETA: --:--:--
9% (1 of 11) |## | Elapsed Time: 0:00:00 ETA: 0:00:00
100% (11 of 11) |########################| Elapsed Time: 0:00:00 Time: 0:00:00
LinearGAM
=============================================== ==========================================================
Distribution: NormalDist Effective DoF: 11.8135
Link Function: IdentityLink Log Likelihood: -952.2409
Number of Samples: 133 AIC: 1930.1088
AICc: 1933.0789
GCV: 609.3811
Scale: 512.7965
Pseudo R-Squared: 0.7984
==========================================================================================================
Feature Function Lambda Rank EDoF P > x Sig. Code
================================= ==================== ============ ============ ============ ============
s(0) [0.2512] 20 11.8 1.11e-16 ***
==========================================================================================================
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
which can cause p-values to appear significant when they are not.
WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
known smoothing parameters, but when smoothing parameters have been estimated, the p-values
are typically lower than they should be, meaning that the tests reject the null too readily.
Fortunately, this routine can be performed directly and results in the following differences between the naive and optimised model.
Plot the GAM fits
fig, ax = plt.subplots(figsize=(10, 6))
plot_fit(gam_raw, X, y, ax, "Unoptimised Fit", c1='orange', c2='green')
plot_fit(gam, X, y, ax, "Optimised Fit")
This is all well and good! We’ve seen an approach to modelling that can capture eccentric patterns in raw data. But how does it work and why should we care? If you’re familiar with the nomenclature of machine learning, you should think of spline modelling as a variety of feature creation. It generates “synthetic” features over the range of the observed variable. These synthetic features are the splines in question. Model building here too then relies on feature selection and principles for model-comparison.
Digression on the usage of “Splines”
The history of the term “spline” is related to the history of draughtsmanship. Historically, splines were thin strips of flexible wood or plastic that could be bent or shaped around weights or “knot” points to trace a curve over the space of a “numberline”. The elastic nature of the spline material allowed it to be bent around the knot points of curvature, expressing a smooth, continuous bend.
The mathematical technique apes these properties by defining a curve in an analogous way. We specify “knots” to carve up the support of the random variable \(X\) into portions that require different weighting schemes to represent the outcome \(y\) in each partition of the support.
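As a toy illustration of the partitioning idea (and not the basis PyGAM itself constructs), a degree-0 “spline” basis is nothing more than a set of indicator columns marking which knot-delimited partition each observation falls into; the values below are invented for the example.

import numpy as np

x = np.linspace(0, 10, 11)            # hypothetical support of X
knots = np.array([2.5, 5.0, 7.5])     # interior knot points
partition = np.digitize(x, knots)     # which segment each x falls into
basis0 = np.eye(len(knots) + 1)[partition]
# each row has a single 1: the observation "belongs" to exactly one partition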
Extracting the Splines
Returning to our model, we can extract the spline features used by PyGAM
by invoking the following commands. We first identify the knot points and create the B-spline basis appropriate for the variable \(X\).
knot_edges = utils.gen_edge_knots(X, dtype='numerical')
knots = np.linspace(knot_edges[0], knot_edges[-1], len(gam.coef_))

splines = utils.b_spline_basis(X, edge_knots=knot_edges, sparse=False)

splines_df = pd.DataFrame(splines, columns=[f'basis_{i}' for i in range(len(gam.coef_))])

splines_df.head(10)
basis_0 | basis_1 | basis_2 | basis_3 | basis_4 | basis_5 | basis_6 | basis_7 | basis_8 | basis_9 | basis_10 | basis_11 | basis_12 | basis_13 | basis_14 | basis_15 | basis_16 | basis_17 | basis_18 | basis_19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.166667 | 0.666667 | 0.166667 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.132997 | 0.661606 | 0.205334 | 0.000063 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.059688 | 0.594827 | 0.341426 | 0.004059 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.030095 | 0.518726 | 0.437481 | 0.013698 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.012374 | 0.428013 | 0.527144 | 0.032470 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.000000 | 0.040337 | 0.551431 | 0.399315 | 0.008917 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
6 | 0.000000 | 0.018232 | 0.465467 | 0.492630 | 0.023671 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7 | 0.000000 | 0.011137 | 0.418489 | 0.535407 | 0.034967 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
8 | 0.000000 | 0.000014 | 0.189310 | 0.664817 | 0.145859 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
9 | 0.000000 | 0.000000 | 0.120914 | 0.656897 | 0.222015 | 0.000174 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
These spline features range over the extent of the covariate space \(X\), defining “partitions” of the space. The model “learns” to capture the shape of the outcome variable \(y\) by figuring out how to weight the different portions of this spline basis matrix, i.e. the linear combination of this basis matrix with the derived coefficients is a model of our outcome variable.
Note how each row is 0 everywhere except within the columns that represent a partition of \(X\). Additionally, each row sums to unity. These properties are important because they ensure that any weighted combination of this basis represents the outcome variable in a controlled and quite granular manner. The more splines we use the more control we have over the representation, but we also risk overfitting.
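Both properties are easy to check directly, and we can confirm that the weighted combination of the basis tracks PyGAM’s fitted values (exact agreement depends on the rebuilt basis matching the one PyGAM used internally):

# rows of the basis should sum to (approximately) one
print(splines_df.sum(axis=1).describe())

# the weighted combination of the basis should (approximately) reproduce the fitted curve
print(np.allclose(splines_df.values @ gam.coef_, gam.predict(X)))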
Plotting the Weighted Spline
ax = splines_df.dot(gam.coef_).plot(title='Weighted Splines', label='Weighted Combination of Spline Basis', figsize=(10, 6))
ax.set_ylabel("Acceleration")
ax.set_xlabel("Time Steps")
ax.legend();
In this manner we can see how the specification of a spline basis can help us model eccentric curves and waves in an outcome space. Spline features are like ad-hoc joists structuring our ship’s rigging. Features filling the gaps in our theory, they act like duct-tape binding the pieces together. If we hold this part down there, and jiggle another part here, our theory stays afloat. Spline features exemplify the contorted nature of empirically informed theory construction.
Next we’ll see how to more directly work with the specification of basis splines, passing these feature matrices into Bambi models.
Bayesian Splines with bambi
Spline models can be built and assessed using Bayesian approaches to quantify the uncertainty in the constructed function. Under the hood Bambi
makes use of the ‘formulae’ package to allow for formula-like syntax to specify spline basis terms and mixed effect model terms. We leverage this to streamline the specification of spline models and demonstrate different kinds of spline-features.
knots_6 = np.linspace(0, np.max(X), 6+2)[1:-1]
knots_10 = np.linspace(0, np.max(X), 10+2)[1:-1]
knots_20 = np.linspace(0, np.max(X), 20+2)[1:-1]

df = pd.DataFrame({'X': X.flatten(), 'y': y})
formula1 = 'bs(X, degree=0, knots=knots_6)'
formula2 = 'bs(X, degree=1, knots=knots_6, intercept=False)'
formula3 = 'bs(X, degree=3, knots=knots_6, intercept=False)'
formula4 = 'bs(X, degree=3, knots=knots_10, intercept=False)'
formula5 = 'bs(X, degree=3, knots=knots_20, intercept=False)'
model_spline1 = bmb.Model(f"y ~ {formula1}", df)
model_spline2 = bmb.Model(f"y ~ {formula2}", df)
model_spline3 = bmb.Model(f"y ~ {formula3}", df)
model_spline4 = bmb.Model(f"y ~ {formula4}", df)
model_spline5 = bmb.Model(f"y ~ {formula5}", df)
model_spline5
Formula: y ~ bs(X, degree=3, knots=knots_20, intercept=False)
Family: gaussian
Link: mu = identity
Observations: 133
Priors:
target = mu
Common-level effects
Intercept ~ Normal(mu: -25.5459, sigma: 222.6623)
bs(X, degree=3, knots=knots_20, intercept=False) ~ Normal(mu: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], sigma: [1407.0172 1448.7263 1394.069 877.5692
912.3295 679.8904 507.1974
696.2455 860.7155 666.7815 665.863 932.178 883.0419 740.3925
1010.3524 914.7406 923.9545 1218.535 1355.2211 1420.6241 1363.5314
2403.9764 1393.2356])
Auxiliary parameters
sigma ~ HalfStudentT(nu: 4.0, sigma: 48.14)
As you can see here we are specifying a range of different degrees of spline basis. The different degrees correspond to the smoothness of the overlapping splines. The degree=0
splines specify a piecewise constant basis, i.e. 0 or 1 within each region of the partition. But we can add more degrees to get more flexible representations of the space.
model_spline5.build()
model_spline5.graph()
The differences here will become clearer as we plot the various spline basis matrices below, but the main thought is that there are different degrees of “smoothness” to the linear basis of the spline features.
Plot the Spline Basis
The functions below extract the basis specification from each model and plot the basis design for an increasingly complex series of spline basis matrices.
def plot_spline_basis(basis, X, ax, title="Spline Basis"):
    df = (
        pd.DataFrame(basis)
        .assign(X=X)
        .melt("X", var_name="basis_idx", value_name="y")
    )
    for idx in df.basis_idx.unique():
        d = df[df.basis_idx == idx]
        ax.plot(d["X"], d["y"])
    ax.set_title(title)
    return ax

def plot_knots(knots, ax):
    for knot in knots:
        ax.axvline(knot, color="0.1", alpha=0.4)
    return ax
fig, axs = plt.subplots(5, 1, figsize=(9, 20))
axs = axs.flatten()

B1 = model_spline1.response_component.design.common[formula1]
plot_spline_basis(B1, df["X"].values, ax=axs[0], title="Piecewise Constant Basis")
plot_knots(knots_6, axs[0]);

B2 = model_spline2.response_component.design.common[formula2]
ax = plot_spline_basis(B2, df["X"].values, axs[1],
                       title="Piecewise Linear Basis")
plot_knots(knots_6, axs[1]);

B3 = model_spline3.response_component.design.common[formula3]
ax = plot_spline_basis(B3, df["X"].values, axs[2],
                       title="Cubic Spline Basis (6 Knots)")
plot_knots(knots_6, axs[2]);

B4 = model_spline4.response_component.design.common[formula4]
ax = plot_spline_basis(B4, df["X"].values, axs[3],
                       title="Cubic Spline Basis (10 Knots)")
plot_knots(knots_10, axs[3]);

B5 = model_spline5.response_component.design.common[formula5]
ax = plot_spline_basis(B5, df["X"].values, axs[4],
                       title="Cubic Spline Basis (20 Knots)")
plot_knots(knots_20, axs[4]);
Here we’ve seen the nature of the modelling splines we’ll fit to the data. By “fit” we mean estimate a set of linear coefficients that we can use to additively combine these spline features creating a smooth linear representation of the outcome.
Fit the Individual Spline Models
We now combine the spline components within a linear model fit, pulling and tying them together in such a way as to ape the shape of the observed sequence.
idata_spline1 = model_spline1.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})

idata_spline2 = model_spline2.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})

idata_spline3 = model_spline3.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})

idata_spline4 = model_spline4.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})

idata_spline5 = model_spline5.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})
Finally we can plot the fits achieved by each of the models and compare how the different spline smooths contribute to approximating the outcome variable.
Plot the Weighted Mean
def plot_weighted_splines(B, idata, formula, ax, knots):
    posterior_stacked = az.extract(idata)
    wp = posterior_stacked[formula].mean("sample").values

    plot_spline_basis(B * wp.T, df["X"].values, ax)
    ax.plot(df.X.values, np.dot(B, wp.T), color="black", lw=3, label='Weighted Splines')
    plot_knots(knots, ax)
    ax.legend();
fig, axs = plt.subplots(5, 1, figsize=(10, 20))
axs = axs.flatten()

plot_weighted_splines(B1, idata_spline1, formula1, axs[0], knots_6)
plot_weighted_splines(B2, idata_spline2, formula2, axs[1], knots_6)
plot_weighted_splines(B3, idata_spline3, formula3, axs[2], knots_6)
plot_weighted_splines(B4, idata_spline4, formula4, axs[3], knots_10)
plot_weighted_splines(B5, idata_spline5, formula5, axs[4], knots_20)
Here we can see how the models with increasingly complex splines are more exactly able to fit the herky-jerky trajectory of the outcome variable in each interval. The fewer the intervals, the less flexibility available to the model.
Compare Model Fits
As before we can evaluate these model fits and compare them based on leave-one-out cross validation scores and information theoretic complexity measures.
fig, axs = plt.subplots(1, 2, figsize=(10, 7))
axs = axs.flatten()
models_dict = {"piecewise_constant": idata_spline1, "piecewise_linear": idata_spline2, "cubic_bspline": idata_spline3, "cubic_bspline_10": idata_spline4,
               "cubic_bspline_20": idata_spline5}
df_compare = az.compare(models_dict)
az.plot_compare(df_compare, ax=axs[0])
axs[0].get_legend().remove()
az.plot_compare(az.compare(models_dict, 'waic'), ax=axs[1])
axs[1].set_yticklabels([])
df_compare
rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale | |
---|---|---|---|---|---|---|---|---|---|
cubic_bspline_10 | 0 | -612.572745 | 11.646898 | 0.000000 | 9.011087e-01 | 9.646678 | 0.000000 | True | log |
cubic_bspline_20 | 1 | -620.709337 | 19.669201 | 8.136592 | 1.360171e-13 | 9.600116 | 2.074742 | True | log |
cubic_bspline | 2 | -634.647180 | 8.703519 | 22.074435 | 6.196578e-14 | 8.915728 | 6.642850 | True | log |
piecewise_constant | 3 | -643.781042 | 6.981098 | 31.208296 | 5.299668e-02 | 9.770740 | 8.728072 | False | log |
piecewise_linear | 4 | -647.016885 | 5.987587 | 34.444140 | 4.589467e-02 | 7.914787 | 8.116163 | False | log |
Here we see that the extra complexity of using 20 splines leads to slightly worse performance measures than the less complex but seemingly adequate 10 splines. In other words, it’s starting to overfit to the data at 20. This is the price of flexibility and a sign of a model unlikely to generalise well on out-of-sample data.
= pd.DataFrame({"X": np.linspace(df.X.min() - 5, df.X.max() + 5, num=500)})
new_data
=new_data,
model_spline4.predict(idata_spline4, data='pps', inplace=True)
kind
idata_spline4
idata_spline4 now holds an ArviZ InferenceData object with posterior, posterior_predictive (over the 500 new X points), log_likelihood, sample_stats and observed_data groups.
Next we plot the posterior predictive distribution of our observed variable and compare it against the observed data. Additionally we plot the 89% and 50% HDIs.
ax = az.plot_hdi(new_data['X'], idata_spline4['posterior_predictive']['y'], fill_kwargs={'alpha': 0.2, 'color':'firebrick'}, hdi_prob=0.89, figsize=(10, 8))

az.plot_hdi(new_data['X'], idata_spline4['posterior_predictive']['y'], fill_kwargs={'alpha': 0.8, 'color':'firebrick'}, hdi_prob=0.5)

y_mean = idata_spline4['posterior_predictive']['y'].mean(dim=('chain', 'draw'))

ax.plot(new_data['X'], y_mean, label='Expected posterior predictive', color='k')
ax.set_xlabel("Time Point")
ax.set_ylabel("Acceleration")

ax.scatter(df['X'], df['y'], label='Observed Datapoints')

ax.legend()

ax.set_title("Posterior Predictive Distribution \n Based on 10 Knots");
This represents a good, clean model fit to the observed data using univariate spline smoothers. However, it’s clear that the uncertainty spikes massively for data points even slightly outside the training data. Next we’ll see an alternative approach to modelling this outcome variable using approximate Gaussian processes.
Approximate Gaussian processes
Gaussian process models allow us to express our beliefs about the structure of a function and calibrate these beliefs against the observed data. The topic of Gaussian processes is rich and complex, too rich to be fairly covered in this blog post, so we’ll just say that we’re using a method designed for function approximation that draws samples from a multivariate normal distribution under a range of different covariance relationships. The properties of these covariance relations determine the shape and fluidity of the function realisations. We will use the HSGP approximation of Gaussian process models.
Gaussian process models can be somewhat interrogated by examining different combinations of covariance relationships with priors over the parameters governing the covariance of a sequence of points. For example, consider the following parameterisations.
fig, axs = plt.subplots(2, 2, figsize=(9, 10))
axs = axs.flatten()

def plot_cov_draws(ax1, ax2, lengthscale=3, sigma=13):
    cov = sigma**2 * pm.gp.cov.ExpQuad(1, lengthscale)
    X = np.linspace(0, 60, 200)[:, None]
    K = cov(X).eval()

    sns.heatmap(pd.DataFrame(K), center=0, xticklabels=[], yticklabels=[], ax=ax1, cmap='crest');
    ax1.set_title(f"Covariance Length Scale {lengthscale}")
    ax2.plot(
        X,
        pm.draw(
            pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=K.shape[0]), draws=10, random_seed=random_seed
        ).T, color='blue', alpha=0.5
    )
    ax2.set_title(f"Samples from the GP prior \n lengthscale: {lengthscale}, sigma: {sigma}")
    plt.ylabel("y")
    plt.xlabel("X");

plot_cov_draws(axs[0], axs[1])
plot_cov_draws(axs[2], axs[3], lengthscale=10, sigma=13)
We’ve specified the range of X to reflect the support of the acceleration example and allowed the draws to be informed by a covariance function we have parameterised using the Exponentiated Quadratic kernel. We’ve additionally tweaked the lengthscale to demonstrate how proximity in the sequence informs the kernel and determines the shape of the covariance structure.
\[k(x, x') = \mathrm{exp}\left[ -\frac{(x - x')^2}{2 \ell^2} \right]\]
The plot on the left highlights the importance of the lengthscale parameter over the sequence. The wider the central shading, the more pronounced the correlation among more points in the sequence. The patterns on the right show a good range of “wiggliness”, suggesting they should be flexible enough to capture the shape of the acceleration data. These are the constraints of structure we can impose on the theoretical realisations of our model. If we can calibrate the model and derive posterior parameters against the observed data, we can learn the probable shape of our functional relationship.
Priors on Gaussian Processes
Consider the following specification for the priors:
fig, axs = plt.subplots(1, 2, figsize=(9, 6))
axs = axs.flatten()
axs[0].hist(pm.draw(pm.InverseGamma.dist(mu=1, sigma=1), 1000), ec='black', bins=30);
axs[0].set_title("Priors for Lengthscale \n in ExpQuad Kernel")
axs[1].hist(pm.draw(pm.Exponential.dist(lam=1), 1000), ec='black', bins=30);
axs[1].set_title("Priors for Amplitude \n in ExpQuad Kernel")
We use these to specify priors on the Hilbert space approximation of Gaussian processes available in the Bambi
package.
prior_hsgp = {
    "sigma": bmb.Prior("Exponential", lam=1),       # amplitude
    "ell": bmb.Prior("InverseGamma", mu=1, sigma=1) # lengthscale
}
# This is the dictionary we pass to Bambi
priors = {
    "hsgp(X, m=10, c=1)": prior_hsgp,
    "sigma": bmb.Prior("HalfNormal", sigma=4)
}
model_hsgp = bmb.Model("y ~ 0 + hsgp(X, m=10, c=1)", df, priors=priors)
model_hsgp
Formula: y ~ 0 + hsgp(X, m=10, c=1)
Family: gaussian
Link: mu = identity
Observations: 133
Priors:
target = mu
HSGP contributions
hsgp(X, m=10, c=1)
cov: ExpQuad
sigma ~ Exponential(lam: 1.0)
ell ~ InverseGamma(mu: 1.0, sigma: 1.0)
Auxiliary parameters
sigma ~ HalfNormal(sigma: 4.0)
Here we’ve set m=10
to determine the number of basis vectors used in the Hilbert space approximation. The idea differs in detail from the spline-based approximations we’ve seen, but it’s perhaps useful to think of the process in the same vein. Here again we have a theory of the world expressed as a function of opaque components jerry-rigged for modelling some phenomena.
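For intuition, the same approximation can be written directly in PyMC with pm.gp.HSGP. The following is a minimal sketch, not the model Bambi builds for us, and the boundary factor c=1.5 is an illustrative choice:

with pm.Model() as hsgp_sketch:
    ell = pm.InverseGamma("ell", mu=1, sigma=1)      # lengthscale prior
    eta = pm.Exponential("eta", lam=1)               # amplitude prior
    cov = eta**2 * pm.gp.cov.ExpQuad(1, ls=ell)

    # m basis vectors control the fidelity of the Hilbert space approximation
    gp = pm.gp.HSGP(m=[10], c=1.5, cov_func=cov)
    f = gp.prior("f", X=X.reshape(-1, 1))            # X, y from the mcycle data above

    sigma = pm.HalfNormal("sigma", 4)
    pm.Normal("y_obs", mu=f, sigma=sigma, observed=y)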
idata_hsgp = model_hsgp.fit(inference_method="nuts_numpyro", target_accept=0.95, random_seed=121195,
                            idata_kwargs={"log_likelihood": True})
print(idata_hsgp.sample_stats["diverging"].sum().to_numpy())
Compiling...
Compilation time = 0:00:01.138923
Sampling...
Sampling time = 0:00:01.870798
Transforming variables...
Transformation time = 0:00:00.176689
Computing Log Likelihood...
Log Likelihood time = 0:00:00.144039
0
This model fits and the sampling seems to have worked well.
={"layout": "constrained"}, figsize=(9, 15)); az.plot_trace(idata_hsgp, backend_kwargs
The lengthscale and sigma parameters we have learned by calibrating our priors against the data. The degree to which these parameters are meaningful depends a little on how familiar you are with covariance matrix kernels and their properties, so we won’t dwell on the point here.
az.summary(idata_hsgp, var_names=['hsgp(X, m=10, c=1)_ell', 'hsgp(X, m=10, c=1)_sigma', 'y_sigma', 'hsgp(X, m=10, c=1)_weights'])
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
hsgp(X, m=10, c=1)_ell | 3.301 | 0.666 | 2.099 | 4.567 | 0.016 | 0.011 | 1763.0 | 1815.0 | 1.0 |
hsgp(X, m=10, c=1)_sigma | 23.675 | 2.679 | 18.925 | 28.847 | 0.069 | 0.049 | 1497.0 | 2167.0 | 1.0 |
y_sigma | 20.589 | 1.121 | 18.590 | 22.751 | 0.019 | 0.014 | 3440.0 | 2485.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[0] | -131.588 | 13.784 | -157.019 | -105.189 | 0.216 | 0.154 | 4071.0 | 3020.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[1] | -94.391 | 18.010 | -125.921 | -58.020 | 0.363 | 0.256 | 2500.0 | 2876.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[2] | 106.121 | 20.717 | 69.627 | 147.247 | 0.458 | 0.324 | 2046.0 | 2563.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[3] | 149.289 | 22.154 | 106.508 | 188.107 | 0.521 | 0.368 | 1807.0 | 2467.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[4] | -83.399 | 22.728 | -124.812 | -40.406 | 0.539 | 0.383 | 1776.0 | 2517.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[5] | -125.322 | 23.273 | -169.346 | -82.187 | 0.531 | 0.377 | 1917.0 | 2940.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[6] | 37.835 | 21.153 | -1.712 | 77.574 | 0.512 | 0.362 | 1713.0 | 2577.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[7] | 94.574 | 20.745 | 53.167 | 132.182 | 0.423 | 0.299 | 2393.0 | 3105.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[8] | -1.494 | 17.184 | -31.835 | 33.063 | 0.361 | 0.255 | 2264.0 | 2853.0 | 1.0 |
hsgp(X, m=10, c=1)_weights[9] | -45.442 | 15.955 | -76.649 | -16.636 | 0.289 | 0.208 | 3041.0 | 3183.0 | 1.0 |
But again we can sample from the posterior predictive distribution of the outcome variable
model_hsgp.predict(idata_hsgp, data=new_data,
                   kind='pps', inplace=True)

idata_hsgp
idata_hsgp now holds an ArviZ InferenceData object with posterior, posterior_predictive (over the 500 new X points), log_likelihood, sample_stats and observed_data groups.
and plot the model fit to see if it can recover the observed data.
ax = az.plot_hdi(new_data['X'], idata_hsgp['posterior_predictive']['y'], fill_kwargs={'alpha': 0.2, 'color':'firebrick'}, figsize=(9, 8))

az.plot_hdi(new_data['X'], idata_hsgp['posterior_predictive']['y'], fill_kwargs={'alpha': 0.8, 'color':'firebrick'}, hdi_prob=0.5)

y_mean = idata_hsgp['posterior_predictive']['y'].mean(dim=('chain', 'draw'))

ax.plot(new_data['X'], y_mean, label='Expected posterior predictive', color='k')

ax.scatter(df['X'], df['y'], label='Observed Datapoints')

ax.legend()

ax.set_title("Posterior Predictive Distribution \n Based on HSGP approximation");
Here again we’ve allowed the model to extrapolate beyond the observed data but we achieve far more reasonable extrapolation than with the Spline model. And we can compare other performance metrics versus the spline models to see that by the aggregate performance measures our HSGP model seems to come out on top too.
= {"piecewise_constant": idata_spline1, "piecewise_linear": idata_spline2, "cubic_bspline": idata_spline3, "cubic_bspline_10": idata_spline4,
models_dict "cubic_bspline_15": idata_spline5, 'hsgp': idata_hsgp}
= az.compare(models_dict)
df_compare df_compare
rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale | |
---|---|---|---|---|---|---|---|---|---|
hsgp | 0 | -608.819807 | 9.958864 | 0.000000 | 0.872617 | 12.009745 | 0.000000 | False | log |
cubic_bspline_10 | 1 | -612.578311 | 11.652463 | 3.758504 | 0.000000 | 9.646544 | 3.125929 | True | log |
cubic_bspline_15 | 2 | -620.709337 | 19.669201 | 11.889531 | 0.000000 | 9.600116 | 3.836813 | True | log |
cubic_bspline | 3 | -634.647180 | 8.703519 | 25.827373 | 0.000000 | 8.915728 | 7.666315 | True | log |
piecewise_constant | 4 | -643.781042 | 6.981098 | 34.961235 | 0.069309 | 9.770740 | 9.490854 | False | log |
piecewise_linear | 5 | -647.016885 | 5.987587 | 38.197078 | 0.058073 | 7.914787 | 9.204140 | False | log |
For a deeper dive on HSGP you might consult Juan Orduz’s work here
Bayesian Penalised Splines (Smoothing Priors)
Priors can be used in the same way that regularisation techniques are deployed in machine learning. They serve as a kind of extra model parameter used to effect a variable selection routine. In the case of splines we are creating these synthetic features, and this can be seen as generating a feature selection problem. So far we have discussed priors as a means of imposing structure based on our knowledge of the problem, but we can also use priors to automate feature selection over spline features.
Here we fit two spline models to the crash-test data set, but we use 60 knot points to generate our spline features. This is wildly excessive for the data set in question. We show how to use a non-centered parameterisation with a “smoothing” trick to induce a good fit despite the excessive features. In this way we’re removing the burden of manually specifying the spline structure. This is akin to the penalisation methods used in PyGAM
but with a Bayesian flavour.
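The “smoothing” trick is a random-walk prior on the spline coefficients: each coefficient is its neighbour plus a small scaled innovation, which is exactly what the cumulative sum of scaled offsets in the model code below implements,
\[a_{j} = a_{j-1} + \sigma_{a}\delta_{j}, \qquad \delta_{j} \sim \text{Normal}(0, 1),\]
so that neighbouring spline weights are pulled toward one another and the fitted curve is smoothed.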
max_dev = len(y)
num_knots = 60
knot_list = np.linspace(0, np.max(X), num_knots)[2:-2]
dev_periods = np.linspace(0, np.max(X), len(y))

Bx = dmatrix(
    "bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=False)",
    {"dev_periods": dev_periods, "knots": knot_list},
)

Bx
def penalised_model(Bx, penalised=True):
    with pm.Model() as model:
        basis = pm.MutableData('basis', np.asfortranarray(Bx))

        if penalised:
            sigma_a = pm.HalfCauchy('sigma_a', 5.)
            a0 = pm.Normal('a0', 0., 10.)
            delta_a = pm.Normal('delta_a', 0., 1., shape=num_knots)
            a = pm.Deterministic('a', a0 + (sigma_a * delta_a).cumsum())
        else:
            b = pm.Normal('delta_a', 0., 25,
                          shape=num_knots)
            a0 = pm.Normal('a0', 0, 10)
            a = pm.Deterministic('a', a0 + b)

        sigma = pm.HalfCauchy('sigma', 5.)
        obs = pm.Normal('obs', pm.math.dot(basis, a), sigma, observed=y)
        idata = pm.sample(target_accept=.95, idata_kwargs={"log_likelihood": True})

    return model, idata

model_raw, idata_raw = penalised_model(Bx, penalised=False)
model_penalised, idata_penalised = penalised_model(Bx, penalised=True)
The model structure is as follows:
pm.model_to_graphviz(model_penalised)
This parameterisation trick is owed to work by Austin Rochford and Adrian Seyboldt discussed here. We can see how the two parameterisations induce quite distinct values for the spline coefficients.
fig, ax = plt.subplots(figsize=(9, 20))
az.plot_forest([idata_raw, idata_penalised], var_names=['a'], combined=True, ax=ax, model_names=['raw', 'penalised'])
ax.fill_betweenx(range(160), -25, 25, alpha=0.2, color='red');
This results in markedly different posterior predictive fits. The un-smoothed spline fit overfits to the small partitions of the spline features, whereas the smoothing spline model achieves a reasonable fit, recovering the smooth structure of the data generating process.
B_df = pd.DataFrame(Bx)
weights = az.summary(idata_penalised, var_names='a')['mean']
fig, ax = plt.subplots(figsize=(8, 7))
processed = np.array([np.dot(B_df, az.extract(idata_penalised['posterior']['a'])['a'].to_numpy()[:, i]) for i in range(4000)])
az.plot_hdi(range(len(y)), processed, ax=ax, fill_kwargs={'alpha': 0.4, 'color':'firebrick'})
az.plot_hdi(range(len(y)), processed, ax=ax, fill_kwargs={'alpha': 0.8, 'color':'firebrick'}, hdi_prob=.50)
weights_raw = az.summary(idata_raw, var_names='a')['mean']
ax.plot(range(len(y)), np.dot(B_df, weights), color='black', label='Cubic_Bsplines_60_Penalised')
ax.plot(range(len(y)), np.dot(B_df, weights_raw), color='blue', linestyle='--', label='Cubic_Bsplines_60_unpenalised')
ax.set_title("Bayesian Automatic Regularisation \n Smoothing Priors over 60 Knots")
ax.legend();
This is reflected in the model performance measures as well.
= {"piecewise_constant": idata_spline1, "piecewise_linear": idata_spline2, "cubic_bspline": idata_spline3, "cubic_bspline_10": idata_spline4,
models_dict "cubic_bspline_20": idata_spline5, 'hsgp': idata_hsgp, 'spline_penalised_bayes': idata_penalised, 'cubic_bspline_60': idata_raw}
= az.compare(models_dict)
df_compare df_compare
rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale | |
---|---|---|---|---|---|---|---|---|---|
hsgp | 0 | -608.819807 | 9.958864 | 0.000000 | 0.764385 | 12.009745 | 0.000000 | False | log |
spline_penalised_bayes | 1 | -610.258217 | 19.573938 | 1.438411 | 0.115567 | 11.444307 | 2.822797 | True | log |
cubic_bspline_10 | 2 | -612.578311 | 11.652463 | 3.758504 | 0.000000 | 9.646544 | 3.125929 | True | log |
cubic_bspline_20 | 3 | -620.709337 | 19.669201 | 11.889531 | 0.000000 | 9.600116 | 3.836813 | True | log |
cubic_bspline | 4 | -634.647180 | 8.703519 | 25.827373 | 0.000000 | 8.915728 | 7.666315 | True | log |
cubic_bspline_60 | 5 | -642.070674 | 24.951194 | 33.250867 | 0.000667 | 9.135971 | 8.313174 | False | log |
piecewise_constant | 6 | -643.781042 | 6.981098 | 34.961235 | 0.061876 | 9.770740 | 9.490854 | False | log |
piecewise_linear | 7 | -647.016885 | 5.987587 | 38.197078 | 0.057506 | 7.914787 | 9.204140 | False | log |
This functionality of Bayesian priors is not unique to spline models. We can mirror Lasso regression models with double-exponential prior specifications. Ridge regression is provably a species of regression model with Normal priors pulled toward 0. Other alternatives such as horseshoe and spike-and-slab specifications offer a rich range of Bayesian approaches to variable selection with sparse data.
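As a flavour of this, a minimal sketch of a “Bayesian Lasso” style regression simply swaps Normal priors for Laplace (double-exponential) priors on the coefficients; the feature matrix and outcome here are invented for illustration.

rng = np.random.default_rng(1)
X_feat = rng.normal(size=(100, 5))                 # hypothetical feature matrix
y_feat = X_feat[:, 0] - 2 * X_feat[:, 2] + rng.normal(0, 1, 100)

with pm.Model() as bayes_lasso:
    # Laplace priors mirror the L1 penalty of the Lasso
    beta = pm.Laplace("beta", mu=0, b=1, shape=5)
    intercept = pm.Normal("intercept", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu=intercept + pm.math.dot(X_feat, beta), sigma=sigma, observed=y_feat)
    idata_lasso = pm.sample()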
Recap
So far we’ve seen how we can use splines and Gaussian processes to model highly eccentric functional relationships where the function could be approximated with a univariate smoothing routine. These are two distinct abstractions which seem adequately fit to the world, but demand very different ways of thinking about the underlying reality. This is no fundamental contradiction in so far as the world admits many descriptions.
[K]nowledge develops in a multiplicity of theories, each with its limited utility … These theories overlap very considerably in their logical laws and in much else, but that they add up to an integrated and consistent whole is only a worthy ideal and not a prerequisite of scientific progress … Let reconciliations proceed. - W.V.O. Quine in Word and Object
Another observation in a similar vein is that penalised spline models are provably equivalent to hierarchical regression (random effects) models. This is striking because the characters of these types of model seem diametrically opposed. With spline models you jerry-rig your features to an optimisation goal; with hierarchical models you tend to impose theoretical structure to induce shrinkage. It’s hard to see how this would work. With a penalised spline model are you inducing a hierarchy of latent features you can’t name? Should you even try to translate between the two!? How can we leverage our prior knowledge of the data generating process to add structure to spline model predictions?
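One standard way to see the connection is the Bayesian reading of the penalty itself: a quadratic roughness penalty on the spline coefficients corresponds (up to normalisation, and with \(S^{-}\) a generalised inverse of the penalty matrix) to a zero-mean Gaussian prior whose covariance shrinks as the smoothing parameter grows,
\[\lambda \beta^{T}S\beta \quad \Longleftrightarrow \quad \beta \sim \text{Normal}\left(0, \; \frac{\sigma^{2}}{\lambda}S^{-}\right),\]
which is precisely the form of a random-effects prior in a hierarchical regression.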
The abstract components of our model are less graspable than their predictive performance, yet the qualitative character of theory building differs markedly between these abstractions, and aesthetics and explainability feature heavily in model preference. Next we’ll add another layer to our abstractions and show how hierarchical modelling over spline fits can be used to extract insight into the data generating process over a family of curves. In particular we’ll focus on the development of insurance loss curves and how to constrain our predictions within reasonable bounds. This is an apt problem for spline modelling in that the process is repeatable annually, but the domain of the support, i.e. development in time, is fixed year-on-year, so the extrapolation problem is muted.
Insurance Loss Curves: Hierarchical Spline Models
Hierarchical models of spline fits allow us to characterise and sample from fits over a family of curves while improving out-of-sample predictive accuracy. We draw on the car insurance losses data set discussed in Mick Cooney’s Stan case study, but we simplify things for ourselves considerably by focusing on two types of loss and ensuring that each year under consideration has equal observations of the accruing losses.
loss_df = pd.read_csv('ppauto_pos.csv')
loss_df = loss_df[(loss_df['GRCODE'].isin([43, 353])) & (loss_df['DevelopmentYear'] < 1998)]

loss_df = loss_df[['GRCODE', 'AccidentYear', 'DevelopmentYear', 'DevelopmentLag', 'EarnedPremDIR_B', 'CumPaidLoss_B']]

loss_df.columns = ['grcode', 'acc_year', 'dev_year', 'dev_lag', 'premium', 'cum_loss']
loss_df['lr'] = loss_df['cum_loss'] / loss_df['premium']
loss_df = loss_df[(loss_df['acc_year'] <= 1992) & (loss_df['dev_lag'] <= 6)].reset_index(drop=True)

loss_df['year_code'] = loss_df['acc_year'].astype(str) + '_' + loss_df['grcode'].astype(str)
loss_df.sort_values(by=['year_code', 'acc_year', 'dev_lag'], inplace=True)
loss_df['standardised_premium'] = (loss_df.assign(mean_premium = np.mean(loss_df['premium']))
                                   .assign(std_premium = np.std(loss_df['premium']))
                                   .apply(lambda x: (x['mean_premium'] - x['premium']) / x['std_premium'], axis=1)
                                   )

loss_df.head(12)
grcode | acc_year | dev_year | dev_lag | premium | cum_loss | lr | year_code | standardised_premium | |
---|---|---|---|---|---|---|---|---|---|
30 | 353 | 1988 | 1988 | 1 | 18793 | 4339 | 0.230884 | 1988_353 | -0.347453 |
31 | 353 | 1988 | 1989 | 2 | 18793 | 9617 | 0.511733 | 1988_353 | -0.347453 |
32 | 353 | 1988 | 1990 | 3 | 18793 | 11584 | 0.616400 | 1988_353 | -0.347453 |
33 | 353 | 1988 | 1991 | 4 | 18793 | 12001 | 0.638589 | 1988_353 | -0.347453 |
34 | 353 | 1988 | 1992 | 5 | 18793 | 12640 | 0.672591 | 1988_353 | -0.347453 |
35 | 353 | 1988 | 1993 | 6 | 18793 | 12966 | 0.689938 | 1988_353 | -0.347453 |
0 | 43 | 1988 | 1988 | 1 | 957 | 133 | 0.138976 | 1988_43 | 1.722344 |
1 | 43 | 1988 | 1989 | 2 | 957 | 333 | 0.347962 | 1988_43 | 1.722344 |
2 | 43 | 1988 | 1990 | 3 | 957 | 431 | 0.450366 | 1988_43 | 1.722344 |
3 | 43 | 1988 | 1991 | 4 | 957 | 570 | 0.595611 | 1988_43 | 1.722344 |
4 | 43 | 1988 | 1992 | 5 | 957 | 615 | 0.642633 | 1988_43 | 1.722344 |
5 | 43 | 1988 | 1993 | 6 | 957 | 615 | 0.642633 | 1988_43 | 1.722344 |
Plot the Loss Curves
Here we have plotted the developing loss curves from two different coded insurance products.
pivot = loss_df.pivot(index=['dev_lag'], columns=['grcode', 'acc_year'], values='lr')
fig, axs = plt.subplots(1, 2, figsize=(9, 7))
pivot.plot(figsize=(10, 6), ax=axs[0])
axs[0].set_title("Loss Ratios by Year");
for c in pivot.columns:
    if 43 in c:
        color = 'red'
    else:
        color = 'grey'
    axs[1].plot(pivot[c], color=color, label=c)
axs[1].legend()
axs[1].set_title("Loss Ratio by Group");
We want to model these curves collectively as instances of draws from a distribution of loss curves. To do so we will specify a PyMC
hierarchical (mixed) spline model, with a spline basis for the global hyperparameters beta_g
and individual parameters for each curve. Here we define a convenience function to generate the basis splines.
def make_basis_splines(num_knots=3, max_dev=7):
    knot_list = np.linspace(0, max_dev, num_knots+2)[1:-1]
    dev_periods = np.arange(1, max_dev, 1)

    Bi = dmatrix(
        "bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True) - 1",
        {"dev_periods": dev_periods, "knots": knot_list},
    )

    Bg = dmatrix(
        "bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True) - 1",
        {"dev_periods": dev_periods, "knots": knot_list})

    return Bi, Bg

Bi, Bg = make_basis_splines()
Bg
DesignMatrix with shape (6, 7)
Columns:
['bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[0]',
'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[1]',
'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[2]',
'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[3]',
'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[4]',
'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[5]',
'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[6]']
Terms:
'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)' (columns 0:7)
(to view full data, use np.asarray(this_obj))
Next we specify a model-maker function to create the various pooled, unpooled and hierarchical (mixed) models of the insurance curve data. Note that even though we’re specifying a hierarchical model, we have not specified a hierarchy over the insurance codes; instead we have added this as a “fixed” effect feature in our regression model. The idea is that this fixed effect will capture the differences in expected baseline per group code of insurance product.
def make_model(loss_df, num_knots=3, max_dev=7, model_type='mixed'):
    Bi, Bg = make_basis_splines(num_knots, max_dev)
    observed = loss_df['lr'].values
    uniques, unique_codes = pd.factorize(loss_df['year_code'])
    coords = {'years': unique_codes,
              'splines': list(range(Bi.shape[1])),
              'measurement': list(range(6)),
              'obs': uniques
              }
    with pm.Model(coords=coords) as sp_insur:
        basis_g = pm.MutableData('Bg', np.asfortranarray(Bg))

        tau = pm.HalfCauchy('tau', 1)
        ## Global Hierarchical Spline Terms
        beta_g = pm.Normal("beta_g", mu=0, sigma=tau,
                           dims='splines')
        mu_g = pm.Deterministic("mu_g", pm.math.dot(basis_g, beta_g), dims='measurement')

        ## Individual or Year Specific Spline Modifications
        if model_type in ['mixed', 'unpooled']:
            sigma = pm.HalfCauchy('sigma_i', 1)
            basis_i = pm.MutableData('Bi', np.asfortranarray(Bi))
            beta = pm.Normal("beta", mu=0, sigma=sigma, dims=('splines', 'years'))
            mui = pm.Deterministic("mui", pm.math.dot(basis_i, beta), dims=('measurement', 'years'))

        ## Features
        prem = pm.MutableData('prem', loss_df['standardised_premium'].values)
        grcode = pm.MutableData('grcode', loss_df['grcode'] == 43)

        beta_prem = pm.Normal('beta_prem', 0, 1)
        beta_grcode = pm.Normal('beta_grcode', 0, 1)
        mu_prem = beta_prem*prem
        mu_grcode = beta_grcode*grcode

        ## Likelihood
        sigma = pm.TruncatedNormal("sigma", 1, lower=0.005)
        if model_type == 'mixed':
            mu = pm.Deterministic('mu', mu_grcode + mu_prem + (mu_g.T + mui.T).ravel(), dims='obs')
            lr_likelihood = pm.Normal("lr", mu, sigma, observed=observed, dims=('obs'))
        elif model_type == 'pooled':
            lr_likelihood = pm.Normal("lr", np.repeat(mu_g, len(unique_codes)), sigma, observed=observed, dims='obs')
        elif model_type == 'unpooled':
            lr_likelihood = pm.Normal("lr", mui.T.ravel(), sigma, observed=observed, dims=('obs'))

        ## Sampling
        idata_sp_insur = pm.sample(2000, return_inferencedata=True, target_accept=.99,
                                   idata_kwargs={"log_likelihood": True})
        idata_sp_insur = pm.sample_posterior_predictive(
            idata_sp_insur, extend_inferencedata=True)

    return idata_sp_insur, sp_insur

idata_sp_insur_unpooled, sp_insur_unpooled = make_model(loss_df, model_type='unpooled')
idata_sp_insur_pooled, sp_insur_pooled = make_model(loss_df, model_type='pooled')
idata_sp_insur_mixed, sp_insur_mixed = make_model(loss_df, model_type='mixed')
The model structure can be seen more clearly in this graph. The important features to note are the spline components and the manner of their combination. We are postulating a kind of hierarchy of development curves. We specify the “global” curve over the development steps and allow that each distinct year is a realisation of a curve that can be thought of as modifying the profile of the “global” curve, as mui
is combined with mu_g
. These structures are then combined into a more traditional regression and fed into our likelihood term.
pm.model_to_graphviz(sp_insur_mixed)
We can extract the effect of the different grcodes and examine the baseline and annual spline-related coefficients.
summary = az.summary(idata_sp_insur_mixed, var_names=['sigma', 'beta_grcode', 'beta_prem', 'beta_g', 'beta'])

summary
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
sigma | 0.014 | 0.007 | 0.005 | 0.027 | 0.001 | 0.000 | 153.0 | 348.0 | 1.04 |
beta_grcode | 0.231 | 0.057 | 0.123 | 0.337 | 0.001 | 0.001 | 4479.0 | 4712.0 | 1.00 |
beta_prem | -0.031 | 0.029 | -0.086 | 0.023 | 0.000 | 0.000 | 4884.0 | 5002.0 | 1.00 |
beta_g[0] | 0.132 | 0.069 | 0.006 | 0.265 | 0.001 | 0.001 | 3185.0 | 3812.0 | 1.00 |
beta_g[1] | 0.070 | 0.395 | -0.651 | 0.841 | 0.006 | 0.006 | 4464.0 | 3851.0 | 1.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
beta[6, 1990_43] | 0.547 | 0.073 | 0.412 | 0.686 | 0.001 | 0.001 | 2884.0 | 4276.0 | 1.00 |
beta[6, 1991_353] | -0.098 | 0.070 | -0.225 | 0.034 | 0.001 | 0.001 | 2873.0 | 4230.0 | 1.00 |
beta[6, 1991_43] | 0.115 | 0.073 | -0.016 | 0.257 | 0.001 | 0.001 | 3073.0 | 4496.0 | 1.00 |
beta[6, 1992_353] | -0.115 | 0.070 | -0.250 | 0.012 | 0.001 | 0.001 | 2861.0 | 4547.0 | 1.00 |
beta[6, 1992_43] | -0.033 | 0.091 | -0.212 | 0.135 | 0.002 | 0.001 | 3726.0 | 4461.0 | 1.00 |
80 rows × 9 columns
Again we can compare the performance metrics of the various models.
compare_df = az.compare({'unpooled': idata_sp_insur_unpooled,
                         'pooled': idata_sp_insur_pooled,
                         'mixed': idata_sp_insur_mixed})
az.plot_compare(compare_df)
compare_df
rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale | |
---|---|---|---|---|---|---|---|---|---|
mixed | 0 | 134.697068 | 61.393405 | 0.000000 | 1.000000e+00 | 1.382933 | 0.000000 | True | log |
unpooled | 1 | 88.378689 | 73.735810 | 46.318379 | 0.000000e+00 | 1.267429 | 1.520694 | True | log |
pooled | 2 | -9.288780 | 5.995182 | 143.985848 | 6.050367e-10 | 5.414548 | 5.020442 | False | log |
The hierarchical spline model is by far the best fit to our data. What does it represent? Yes, splines are function approximation tools. Sure, extrapolation is dangerous, but in this domain, where the range of the development curves is constrained, they seem to offer an elegant way of articulating the structure of the risk profiles for various products.
Plot the Posterior Predictive Checks
We can check how well the model can recapture the observed data.
def plot_ppc_splines(idata):
    fig, axs = plt.subplots(2, 5, figsize=(20, 10), sharey=True)
    axs = axs.flatten()
    dev_periods = np.arange(1, 7, 1)
    uniques, unique_codes = pd.factorize(loss_df['year_code'])
    for y, c in zip(unique_codes, range(10)):
        az.plot_hdi(dev_periods, idata['posterior_predictive']['lr'].sel(obs=c), color='firebrick', ax=axs[c], fill_kwargs={'alpha': 0.2}, hdi_prob=.89)
        az.plot_hdi(dev_periods, idata['posterior_predictive']['lr'].sel(obs=c), color='firebrick', ax=axs[c], hdi_prob=0.5)
        axs[c].scatter(dev_periods, loss_df[(loss_df['year_code'] == y)]['lr'], color='k', label='Actual Loss Ratio')
        axs[c].plot(dev_periods, loss_df[(loss_df['year_code'] == y)]['lr'], color='k')
        axs[c].set_title(f"PPC 89% and 50% HDI: {y}")
        axs[c].set_ylabel("Loss Ratio")
        axs[c].legend();

plot_ppc_splines(idata_sp_insur_mixed);
Plot the Hierarchical Components
In the following plot we similarly show how the model recaptures the observed data, but additionally we decompose the structure of the model in each case and extract baseline forecasts, which would be our guide to future loss-ratio development curves in lieu of any other information.
mu_g = idata_sp_insur_mixed.posterior.stack(draws=("chain", "draw"))["mu_g"]
mu_i = idata_sp_insur_mixed.posterior.stack(draws=("chain", "draw"))["mui"]
mu = idata_sp_insur_mixed.posterior.stack(draws=("chain", "draw"))['mu']
beta_grcode = idata_sp_insur_mixed.posterior.stack(draws=("chain", "draw"))['beta_grcode']

dev_periods = np.arange(1, 7, 1)
uniques, unique_codes = pd.factorize(loss_df['year_code'])

mosaic = """
ABCDE
FGHIJ
KKKKK
"""
fig, axs = plt.subplot_mosaic(mosaic, sharey=True,
                              figsize=(20, 15))
axs = [axs[k] for k in axs.keys()]

mu_g_mean = mu_g.mean(dim='draws')
for y, c in zip(unique_codes, range(10)):
    group_effect = 0
    if '43' in y:
        group_effect = beta_grcode.mean().item()
    mu_i_mean = mu_i.sel(years=y).mean(dim='draws')
    axs[c].plot(dev_periods, group_effect + mu_g_mean.values + mu_i_mean.values, label='Combined + E(grp_effect)', color='purple', linewidth=3.5)
    axs[c].plot(dev_periods, group_effect + mu_g_mean.values, label='E(Hierarchical Baseline)', color='red', linestyle='--')
    axs[c].plot(dev_periods, group_effect + mu_i_mean.values, label='E(Year Specific Adjustment)', color='blue', linestyle='--')
    axs[c].scatter(dev_periods, loss_df[(loss_df['year_code'] == y)]['lr'], color='k', label='Actual Loss Ratio')
    az.plot_hdi(dev_periods, mu.sel(obs=c).T, ax=axs[c], color='firebrick', fill_kwargs={'alpha': 0.2})
    az.plot_hdi(dev_periods, mu.sel(obs=c).T, ax=axs[c], color='firebrick', fill_kwargs={'alpha': 0.5}, hdi_prob=.50)
    axs[c].set_title(f"Components for Year {y}")
    axs[c].set_ylabel("Loss Ratio")
    if (c == 0):
        axs[c].legend()

axs[10].plot(dev_periods, mu_g_mean.values, label='E(Hierarchical Baseline)', color='black')
axs[10].plot(dev_periods, mu_g_mean.values + group_effect, label='E(Hierarchical Baseline) + E(grp_effect)', color='black', linestyle='--')
az.plot_hdi(dev_periods, mu_g.T.values, color='slateblue', ax=axs[10], fill_kwargs={'alpha': 0.2})
az.plot_hdi(dev_periods, mu_g.T.values + group_effect, color='magenta', ax=axs[10], fill_kwargs={'alpha': 0.2})
az.plot_hdi(dev_periods, mu_g.T.values, color='slateblue', ax=axs[10], hdi_prob=.5)
az.plot_hdi(dev_periods, mu_g.T.values + group_effect, color='magenta', ax=axs[10], hdi_prob=.5)
axs[10].set_title("Baseline Forecast Loss Ratio")
axs[10].legend();
This plot is a bit clunky, because we’re mixing expectations and posterior distributions over the parameters. The point is just to highlight the “compositional” structure of our model.
Predicting Next Year’s Losses
A better way to interrogate the implications of the model is to “push” different data forward through the posterior predictive distribution and derive a kind of ceteris paribus rule for the accrual of losses.
with sp_insur_mixed:
    pm.set_data({'grcode': np.ones(len(loss_df)),
                 })
    idata_43 = pm.sample_posterior_predictive(idata_sp_insur_mixed, var_names=['lr'], extend_inferencedata=True)

with sp_insur_mixed:
    pm.set_data({'grcode': np.zeros(len(loss_df))})
    idata_353 = pm.sample_posterior_predictive(idata_sp_insur_mixed, var_names=['lr'], extend_inferencedata=True)

idata_353
Sampling: [lr]
Sampling: [lr]
arviz.InferenceData (4 chains × 2000 draws) with groups:
    > posterior: beta_g (chain, draw, splines), beta (chain, draw, splines, years), beta_prem, beta_grcode, tau, sigma_i, sigma, mu_g (chain, draw, measurement), mui (chain, draw, measurement, years), mu (chain, draw, obs)
    > posterior_predictive: lr (chain, draw, obs)
    > log_likelihood: lr (chain, draw, obs)
    > sample_stats: lp, energy, acceptance_rate, tree_depth, diverging, ...
    > observed_data: lr (obs)
    > constant_data: Bg (6 × 7), Bi (6 × 7), prem (60), grcode (60)
Even here, though, we want to average the curves over the specific years in the data and abstract a view of the model’s implications under different counterfactual settings. Put another way, we want a view of the average development curve between years, not within a year. If we want to predict losses for a novel year we need some way to aggregate across the annual trajectories. Here we define a helper function to effect this step.
def get_posterior_predictive_curve(idata, prem=2, grcode=1):
    weighted_splines = [np.dot(np.asfortranarray(Bi), az.extract(idata['posterior']['beta'])['beta'].values[:, :, i]) for i in range(2000)]

    weighted_splines_1 = [np.dot(np.asfortranarray(Bg), az.extract(idata['posterior']['beta_g'])['beta_g'].values[:, i]) for i in range(2000)]

    beta_grcode = az.extract(idata['posterior']['beta_grcode'])['beta_grcode']
    beta_prem = az.extract(idata['posterior']['beta_prem'])['beta_prem']
    df1 = pd.DataFrame([beta_prem.values[i]*prem + beta_grcode.values[i]*grcode for i in range(2000)]).T

    ## How do we average over the years to get
    ## a view of a new development period?
    #df = pd.concat([pd.DataFrame(weighted_splines_1[i].T + weighted_splines[i].T).mean() for i in range(1000)], axis=1)

    ## Sample a random group for each draw from the posterior
    df = pd.concat([pd.DataFrame((weighted_splines_1[i].T + weighted_splines[i].T)[np.random.choice(list(range(10))), :]) for i in range(2000)], axis=1)

    ## Average a random subset of groups
    #df = pd.concat([pd.DataFrame((weighted_splines_1[i].T + weighted_splines[i].T)[np.random.choice(list(range(10)), 5), :]).mean() for i in range(2000)], axis=1)

    df = df1.iloc[0].values + df

    return df

pred_df_1 = get_posterior_predictive_curve(idata_43, prem=1, grcode=1)
pred_df_0 = get_posterior_predictive_curve(idata_353, prem=1, grcode=0)
fig, ax = plt.subplots(figsize=(9, 7), sharey=True)

az.plot_hdi(range(6), pred_df_0.values.T, ax=ax, color='slateblue', smooth=False, fill_kwargs={'alpha': 0.2})
az.plot_hdi(range(6), pred_df_1.values.T, ax=ax, color='firebrick', smooth=False, fill_kwargs={'alpha': 0.2})
az.plot_hdi(range(6), pred_df_0.values.T, ax=ax, color='slateblue', smooth=False, fill_kwargs={'alpha': 0.5}, hdi_prob=.5)
az.plot_hdi(range(6), pred_df_1.values.T, ax=ax, color='firebrick', smooth=False, fill_kwargs={'alpha': 0.5}, hdi_prob=.5)

ax.plot(pred_df_0.mean(axis=1), linestyle='-', color='k', linewidth=4, label='grcode 353 prem 1')
ax.plot(pred_df_1.mean(axis=1), linestyle='--', color='grey', linewidth=4, label='grcode 43 prem 1')

ax.set_title("Posterior Samples of the Trajectories \n Under different Counterfactual settings")
ax.set_ylabel("Loss Ratio")
ax.set_xlabel("Development Period")
ax.legend();
Note how we’ve varied the predictor inputs to showcase the expected realisation under different product codes. In this manner we can construct a hierarchical spline model informed by the historic trajectories of the development curves but with predictions modulated by more recent information.
Multiple Smoother Regression Models
Non-linear spline relationships can be additively combined across multiple variables in simple and hierarchical regressions. To demonstrate how spline modelling can be further adapted to the multiple-regression setting we use the PISA data set discussed in this case study from Michael Clark. We’ll see how crucial it is to carefully assess the implications of these model fits.
The data set has been constructed using average Science scores by country from the Programme for International Student Assessment (PISA) 2006, along with GNI per capita (Purchasing Power Parity, 2005 dollars), Educational Index, Health Index, and Human Development Index from UN data. We want to model the overall outcome score as a function of these broad demographic features.
pisa_df = pd.read_csv("https://raw.githubusercontent.com/m-clark/generalized-additive-models/master/data/pisasci2006.csv")

pisa_df.head()
| | Country | Overall | Issues | Explain | Evidence | Interest | Support | Income | Health | Edu | HDI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | NaN | NaN | NaN | NaN | NaN | NaN | 0.599 | 0.886 | 0.716 | 0.724 |
| 1 | Argentina | 391.0 | 395.0 | 386.0 | 385.0 | 567.0 | 506.0 | 0.678 | 0.868 | 0.786 | 0.773 |
| 2 | Australia | 527.0 | 535.0 | 520.0 | 531.0 | 465.0 | 487.0 | 0.826 | 0.965 | 0.978 | 0.920 |
| 3 | Austria | 511.0 | 505.0 | 516.0 | 505.0 | 507.0 | 515.0 | 0.835 | 0.944 | 0.824 | 0.866 |
| 4 | Azerbaijan | 382.0 | 353.0 | 412.0 | 344.0 | 612.0 | 542.0 | 0.566 | 0.780 | NaN | NaN |
The relationships displayed between each of these measures are not obviously linear, and as such could plausibly benefit from being modelled with splines.
g = sns.pairplot(pisa_df[['Overall', 'Income', 'Support', 'Health', 'Edu']], kind="reg", height=1.5)
g.fig.suptitle("Pair Plot of Complex Relations", y=1.05);
We define three models to contrast their implications. Note how we have to define a separate spline basis for each of the covariates; here we create the knots that define our basis on each covariate.
knots_income = np.linspace(np.min(pisa_df['Income']), np.max(pisa_df['Income']), 5+2)[1:-1]
knots_edu = np.linspace(np.min(pisa_df['Edu']), np.max(pisa_df['Edu']), 5+2)[1:-1]
knots_health = np.linspace(np.min(pisa_df['Health']), np.max(pisa_df['Health']), 5+2)[1:-1]

knots_income1 = np.linspace(np.min(pisa_df['Income']), np.max(pisa_df['Income']), 3+2)[1:-1]
knots_edu1 = np.linspace(np.min(pisa_df['Edu']), np.max(pisa_df['Edu']), 3+2)[1:-1]
knots_health1 = np.linspace(np.min(pisa_df['Health']), np.max(pisa_df['Health']), 3+2)[1:-1]
Now we initialise these models.
formula = "Overall ~ Income + Edu + Health"
base_model = bmb.Model(formula, pisa_df, dropna=True)

formula_spline = """Overall ~ bs(Income, degree=3, knots=knots_income) + bs(Edu, degree=3, knots=knots_edu) + bs(Health, degree=3, knots=knots_health) """

formula_spline1 = """Overall ~ (1 | Country) + bs(Income, degree=3, knots=knots_income1) + bs(Edu, degree=3, knots=knots_edu1) + bs(Health, degree=3, knots=knots_health1) """

spline_model = bmb.Model(formula_spline, pisa_df, dropna=True)
spline_model1 = bmb.Model(formula_spline1, pisa_df, dropna=True)

spline_model1
Automatically removing 13/65 rows from the dataset.
Automatically removing 13/65 rows from the dataset.
Automatically removing 13/65 rows from the dataset.
Formula: Overall ~ (1 | Country) + bs(Income, degree=3, knots=knots_income1) + bs(Edu, degree=3, knots=knots_edu1) + bs(Health, degree=3, knots=knots_health1)
Family: gaussian
Link: mu = identity
Observations: 52
Priors:
target = mu
Common-level effects
Intercept ~ Normal(mu: 471.1538, sigma: 517.9979)
bs(Income, degree=3, knots=knots_income1) ~ Normal(mu: [0. 0. 0. 0. 0. 0.], sigma: [1630.0781
872.587 532.485 674.4864 570.5563 779.4582])
bs(Edu, degree=3, knots=knots_edu1) ~ Normal(mu: [0. 0. 0. 0. 0. 0.], sigma: [965.2606 734.261
559.679 590.0573 684.2515 649.6795])
bs(Health, degree=3, knots=knots_health1) ~ Normal(mu: [0. 0. 0. 0. 0. 0.], sigma: [1842.3891
815.7797 534.7546 671.1905 544.161 802.929 ])
Group-level effects
1|Country ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 517.9979))
Auxiliary parameters
sigma ~ HalfStudentT(nu: 4.0, sigma: 53.4654)
spline_model.build()
spline_model.graph()
base_idata = base_model.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})
spline_idata = spline_model.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True}, target_accept=.95)
spline_idata1 = spline_model1.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True}, target_accept=.99)
We can compare the simple regression approach to the spline based regression in the now usual way.
compare_df = az.compare({'spline': spline_idata, 'raw': base_idata, 'spline_hierarchy': spline_idata1})

compare_df
| | rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale |
|---|---|---|---|---|---|---|---|---|---|
| spline_hierarchy | 0 | -244.691895 | 38.487159 | 0.000000 | 1.000000e+00 | 3.901855 | 0.000000 | True | log |
| spline | 1 | -254.101306 | 20.218465 | 9.409411 | 0.000000e+00 | 4.724572 | 2.840914 | True | log |
| raw | 2 | -263.061455 | 9.617888 | 18.369559 | 1.054712e-13 | 10.083921 | 9.376156 | True | log |
The coefficient comparisons are harder
az.summary(base_idata)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 118.438 | 81.996 | -41.166 | 266.033 | 1.522 | 1.086 | 2915.0 | 2986.0 | 1.0 |
| Income | 181.955 | 89.547 | 17.413 | 353.598 | 1.913 | 1.353 | 2196.0 | 2432.0 | 1.0 |
| Edu | 234.125 | 59.153 | 120.739 | 340.343 | 1.017 | 0.740 | 3414.0 | 2518.0 | 1.0 |
| Health | 30.395 | 140.966 | -223.699 | 298.258 | 2.852 | 2.153 | 2450.0 | 2598.0 | 1.0 |
| Overall_sigma | 34.234 | 3.620 | 27.882 | 41.175 | 0.060 | 0.043 | 3757.0 | 2879.0 | 1.0 |
since it’s less clear what the spline coefficient terms mean.
az.summary(spline_idata)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 225.689 | 205.515 | -133.222 | 630.021 | 6.036 | 4.292 | 1163.0 | 1653.0 | 1.0 |
| bs(Income, degree=3, knots=knots_income)[0] | -355.159 | 1613.343 | -3460.788 | 2642.535 | 56.357 | 39.865 | 819.0 | 1504.0 | 1.0 |
| bs(Income, degree=3, knots=knots_income)[1] | 194.905 | 122.719 | -42.478 | 423.960 | 2.343 | 1.728 | 2748.0 | 2719.0 | 1.0 |
| bs(Income, degree=3, knots=knots_income)[2] | 48.748 | 123.638 | -182.012 | 272.863 | 4.065 | 2.875 | 923.0 | 1431.0 | 1.0 |
| bs(Income, degree=3, knots=knots_income)[3] | 68.533 | 98.242 | -118.858 | 250.496 | 3.369 | 2.383 | 853.0 | 1416.0 | 1.0 |
| bs(Income, degree=3, knots=knots_income)[4] | 210.290 | 103.892 | 17.347 | 406.753 | 3.626 | 2.565 | 825.0 | 1404.0 | 1.0 |
| bs(Income, degree=3, knots=knots_income)[5] | 222.546 | 105.553 | 31.163 | 427.878 | 3.589 | 2.539 | 866.0 | 1442.0 | 1.0 |
| bs(Income, degree=3, knots=knots_income)[6] | 202.395 | 105.402 | 0.453 | 394.110 | 3.629 | 2.567 | 846.0 | 1507.0 | 1.0 |
| bs(Income, degree=3, knots=knots_income)[7] | 81.592 | 103.313 | -106.154 | 279.937 | 3.641 | 2.575 | 805.0 | 1390.0 | 1.0 |
| bs(Edu, degree=3, knots=knots_edu)[0] | 145.558 | 287.518 | -410.399 | 654.146 | 8.525 | 6.029 | 1142.0 | 1509.0 | 1.0 |
| bs(Edu, degree=3, knots=knots_edu)[1] | 47.574 | 193.029 | -323.504 | 392.406 | 5.595 | 3.957 | 1197.0 | 1656.0 | 1.0 |
| bs(Edu, degree=3, knots=knots_edu)[2] | 71.806 | 211.848 | -312.889 | 469.050 | 6.246 | 4.417 | 1156.0 | 1647.0 | 1.0 |
| bs(Edu, degree=3, knots=knots_edu)[3] | 147.390 | 200.622 | -224.182 | 518.214 | 5.972 | 4.224 | 1134.0 | 1603.0 | 1.0 |
| bs(Edu, degree=3, knots=knots_edu)[4] | 70.472 | 203.119 | -314.815 | 427.662 | 6.018 | 4.256 | 1144.0 | 1535.0 | 1.0 |
| bs(Edu, degree=3, knots=knots_edu)[5] | 121.297 | 202.750 | -264.164 | 485.547 | 5.998 | 4.243 | 1147.0 | 1575.0 | 1.0 |
| bs(Edu, degree=3, knots=knots_edu)[6] | 78.372 | 206.089 | -296.582 | 457.669 | 6.012 | 4.252 | 1178.0 | 1713.0 | 1.0 |
| bs(Edu, degree=3, knots=knots_edu)[7] | 124.157 | 203.070 | -258.062 | 490.727 | 6.008 | 4.249 | 1148.0 | 1532.0 | 1.0 |
| bs(Health, degree=3, knots=knots_health)[0] | 235.831 | 885.656 | -1490.479 | 1862.011 | 31.104 | 22.002 | 813.0 | 1482.0 | 1.0 |
| bs(Health, degree=3, knots=knots_health)[1] | 162.805 | 389.636 | -552.198 | 913.839 | 7.756 | 6.545 | 2529.0 | 2422.0 | 1.0 |
| bs(Health, degree=3, knots=knots_health)[2] | 14.967 | 120.574 | -202.130 | 248.327 | 3.854 | 2.726 | 979.0 | 1657.0 | 1.0 |
| bs(Health, degree=3, knots=knots_health)[3] | 25.213 | 100.498 | -170.867 | 204.686 | 3.411 | 2.413 | 871.0 | 1555.0 | 1.0 |
| bs(Health, degree=3, knots=knots_health)[4] | -21.315 | 102.473 | -223.005 | 158.665 | 3.577 | 2.530 | 822.0 | 1468.0 | 1.0 |
| bs(Health, degree=3, knots=knots_health)[5] | -19.449 | 101.418 | -213.704 | 165.809 | 3.541 | 2.505 | 823.0 | 1444.0 | 1.0 |
| bs(Health, degree=3, knots=knots_health)[6] | -55.820 | 100.895 | -252.216 | 126.193 | 3.547 | 2.509 | 811.0 | 1347.0 | 1.0 |
| bs(Health, degree=3, knots=knots_health)[7] | 0.840 | 101.892 | -196.002 | 186.216 | 3.543 | 2.506 | 828.0 | 1469.0 | 1.0 |
| Overall_sigma | 24.766 | 3.303 | 18.816 | 30.884 | 0.085 | 0.060 | 1491.0 | 2068.0 | 1.0 |
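One way to build some intuition for these otherwise opaque coefficients is to scale each basis column by its posterior-mean coefficient and look at the local "bumps" each one controls. The following is an illustrative sketch only (not part of the original analysis): it reuses the design matrix stored on the Bambi model and assumes spline_model.data holds the rows used to build that design.

# Sketch: scale each Income basis column by its posterior-mean coefficient (illustrative only)
term = 'bs(Income, degree=3, knots=knots_income)'
B_income = np.asarray(spline_model.response_component.design.common[term])    # (n_obs, 8) basis matrix
coef_means = spline_idata.posterior[term].mean(dim=("chain", "draw")).values  # (8,) posterior-mean coefficients

income_obs = spline_model.data['Income'].values   # assumes .data holds the rows behind the design matrix
order = np.argsort(income_obs)

fig, ax = plt.subplots(figsize=(9, 5))
for j in range(B_income.shape[1]):
    # each curve is one basis function weighted by its coefficient
    ax.plot(income_obs[order], B_income[order, j] * coef_means[j], alpha=0.7)
ax.plot(income_obs[order], (B_income @ coef_means)[order], color='k', linewidth=3,
        label='Summed Income smooth')
ax.set_xlabel("Income")
ax.set_title("Income basis columns scaled by posterior-mean coefficients")
ax.legend();

Read this way, an individual coefficient only says how much one localised bump is pushed up or down; it is the sum of the weighted basis columns that carries the interpretable shape.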
We can also check that the model seems to recover the observed data well.
Plotting Posterior Predictive Checks
base_model.predict(base_idata, kind='pps')
spline_model.predict(spline_idata, kind='pps')
spline_model1.predict(spline_idata1, kind='pps')

fig, axs = plt.subplots(1, 3, figsize=(20, 6))
axs = axs.flatten()
az.plot_ppc(base_idata, ax=axs[0])
az.plot_ppc(spline_idata, ax=axs[1]);
az.plot_ppc(spline_idata1, ax=axs[2]);
axs[0].set_xlabel('')
axs[0].set_title("PPC: Regression Model");
axs[1].set_title("PPC: Spline Model");
axs[2].set_title("PPC: Hierarchical Spline Model");
Next we’ll dig into the spline basis features, decompose the predicted outcome, and show how the outcome varies for levels of the input variables. The Bambi interpret package offers some functionality to assess the conditional predictions for varying values of those inputs.
Model Interpretability Plots in Bambi
We want to highlight the differences between the hierarchical and non-hierarchical multivariable spline models here. The shrinkage effects of hierarchically modelling the country intercepts are clearly evident: they helpfully constrain the poor extrapolation effects of the simpler spline model.
fig, axs = plt.subplots(3, 2, figsize=(9, 14))
axs = axs.flatten()

bmb.interpret.plot_predictions(
    spline_model, spline_idata, "Income", ax=axs[0]
)

axs[0].set_title("Non-Hierarchical Income")
axs[2].set_title("Non-Hierarchical Edu")
axs[4].set_title("Non-Hierarchical Health")

axs[1].set_title("Hierarchical Income")
axs[3].set_title("Hierarchical Edu")
axs[5].set_title("Hierarchical Health")

bmb.interpret.plot_predictions(
    spline_model, spline_idata, "Edu", ax=axs[2]
)
bmb.interpret.plot_predictions(
    spline_model, spline_idata, "Health", ax=axs[4]
);

bmb.interpret.plot_predictions(
    spline_model1, spline_idata1, "Income",
    sample_new_groups=True, ax=axs[1]
)
bmb.interpret.plot_predictions(
    spline_model1, spline_idata1, "Edu", sample_new_groups=True, ax=axs[3]
)
bmb.interpret.plot_predictions(
    spline_model1, spline_idata1, "Health", sample_new_groups=True, ax=axs[5]
);
We can pull these types of values out into a table. Note here how we ask for predictions based on varying values of the Edu and Income variables while keeping the Health variable fixed at its mean value.
summary_df = bmb.interpret.predictions(
    spline_model,
    spline_idata,
    conditional={
        "Edu": list(np.linspace(0.5, .95, 10)),
        "Income": list(np.linspace(0.4, .95, 10)),
    },
)
summary_df

summary_df1 = bmb.interpret.predictions(
    spline_model1,
    spline_idata1,
    conditional={
        "Edu": list(np.linspace(0.5, .95, 10)),
        "Income": list(np.linspace(0.4, .95, 10)),
    },
    sample_new_groups=True
)
summary_df1
| | Edu | Income | Country | Health | estimate | lower_3.0% | upper_97.0% |
|---|---|---|---|---|---|---|---|
| 0 | 0.50 | 0.400000 | Albania | 0.885672 | 102.686176 | -587.761967 | 750.373783 |
| 1 | 0.50 | 0.461111 | Albania | 0.885672 | 451.070276 | -37.688251 | 905.442110 |
| 2 | 0.50 | 0.522222 | Albania | 0.885672 | 376.852378 | -161.138067 | 945.914437 |
| 3 | 0.50 | 0.583333 | Albania | 0.885672 | 283.859330 | -357.146695 | 933.659941 |
| 4 | 0.50 | 0.644444 | Albania | 0.885672 | 288.608418 | -345.493652 | 920.643943 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.95 | 0.705556 | Albania | 0.885672 | 460.351293 | 413.197965 | 508.270287 |
| 96 | 0.95 | 0.766667 | Albania | 0.885672 | 516.885593 | 474.706704 | 561.845324 |
| 97 | 0.95 | 0.827778 | Albania | 0.885672 | 552.375825 | 506.285465 | 603.102537 |
| 98 | 0.95 | 0.888889 | Albania | 0.885672 | 499.921356 | 455.457677 | 550.896029 |
| 99 | 0.95 | 0.950000 | Albania | 0.885672 | 229.655207 | 49.386349 | 412.317948 |

100 rows × 7 columns
We can then plot these results based on the conditional realisations and see some interesting behaviour at the implausible realisations. This should make us somewhat wary of using this model to extrapolate too far beyond the observable range of the data.
g = sns.relplot(data=summary_df, x="Income", y="estimate", hue="Edu")
g.fig.suptitle("Marginal Predictions of the Outcome Variable \n conditional on counterfactual values for Edu and Income", y=1.05);

g = sns.relplot(data=summary_df1, x="Income", y="estimate", hue="Edu")
g.fig.suptitle("Hierarchical Marginal Predictions of the Outcome Variable \n conditional on counterfactual values for Edu and Income", y=1.05);
However, we can see here how the hierarchical component shrinks the predicted values back towards a reasonable range, constraining the poor extrapolation of the more naive spline model. This demonstrates something of the interplay between in-sample approximation and out-of-sample generalisation we saw above in the case of the insurance curve development. The additional structure insists on commensurate realisations under different counterfactual settings, reflecting the fact that countries do not vary so radically.
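As a quick, rough check of that shrinkage (a small sketch, not part of the original analysis), we can compare the spread of the estimate column in the two prediction tables:

# Compare the spread of conditional predictions from the non-hierarchical and hierarchical spline models
spread = pd.DataFrame({
    'non_hierarchical': summary_df['estimate'].describe(),
    'hierarchical': summary_df1['estimate'].describe(),
})
print(spread.loc[['mean', 'std', 'min', 'max']])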
The Spline Component Contributions
Finally, we’ll pull out the component contributions of each variable and see how they combine additively. This will also serve as a kind of posterior predictive check for each country, as we can show the degree to which posterior draws from each component sum to a plausible mirror of the observed data.
Bincome = spline_model.response_component.design.common['bs(Income, degree=3, knots=knots_income)']
income_coefs = az.extract(spline_idata['posterior']['bs(Income, degree=3, knots=knots_income)'])['bs(Income, degree=3, knots=knots_income)']

Bedu = spline_model.response_component.design.common['bs(Edu, degree=3, knots=knots_edu)']
edu_coefs = az.extract(spline_idata['posterior']['bs(Edu, degree=3, knots=knots_edu)'])['bs(Edu, degree=3, knots=knots_edu)']

Bhealth = spline_model.response_component.design.common['bs(Health, degree=3, knots=knots_health)']
health_coefs = az.extract(spline_idata['posterior']['bs(Health, degree=3, knots=knots_health)'])['bs(Health, degree=3, knots=knots_health)']

income = np.dot(Bincome, income_coefs).T
edu = np.dot(Bedu, edu_coefs).T
health = np.dot(Bhealth, health_coefs).T

intercept = az.extract(spline_idata['posterior']['Intercept'])['Intercept'].values

fig, ax = plt.subplots(figsize=(10, 7))
for i in range(100):
    if i == 1:
        ax.plot(income[i], label='Income Component', color='red')
        ax.plot(edu[i], label='Edu Component', color='blue')
        ax.plot(health[i], label='Health Component', color='darkgreen')
        ax.plot(intercept[i] + income[i] + edu[i] + health[i], label='Combined Components', color='purple')
    else:
        ax.plot(income[i], alpha=0.1, color='red')
        ax.plot(edu[i], alpha=0.1, color='blue')
        ax.plot(health[i], alpha=0.1, color='darkgreen')
        ax.plot(intercept[i] + income[i] + edu[i] + health[i], color='purple', alpha=0.3)

ax.scatter(range(len(spline_idata['observed_data']['Overall'])), spline_idata['observed_data']['Overall'], label='Observed', color='grey', s=56, ec='black')
ax.set_title("PISA Outcomes \n Additive Spline Components", fontsize=20)
ax.legend();
ax.set_ylabel("Overall Score", fontsize=12)
ax.set_xticklabels(pisa_df.dropna(axis=0).reset_index()['Country'], fontsize=12);
We can see here how the combined components borrow the structure of the outcome variable primarily from the income component. The health component contributes close to zero additively, while the uncertainty in the education component varies wildly. But the blue education component is used primarily as a scaling contribution that adds to the level of the outcome variable rather than distorting its shape. These initially opaque synthetic features of splines can offer deep insight into the structure of our data generating process when read correctly.
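As a rough numerical complement to that reading (again a sketch, not from the original analysis), we can summarise the average level and the across-country range of each posterior-mean component:

# Summarise the level and range of each posterior-mean component contribution
for name, comp in [('Income', income), ('Edu', edu), ('Health', health)]:
    comp_mean = comp.mean(axis=0)  # posterior-mean contribution per country
    print(f"{name}: mean level {comp_mean.mean():.1f}, range across countries {np.ptp(comp_mean):.1f}")

A component with a small range but a large mean level shifts the outcome without shaping it, which is the pattern described above for the education term.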
Conclusion
We’ve seen the application of splines as univariate smoothers to approximate wiggly curves of arbitrary shape. We’ve also tried Gaussian process approximations of the same univariate functions. The flexibility of both methods is a strength, but we need to be careful because splines have a tendency to overfit individual curves. As such we have shown that we can incorporate spline basis modelling in a hierarchical Bayesian model, recover more compelling posterior predictive checks, and derive predictions from the mixed variant of the hierarchical model that help us understand the implications of the data for generic forecasts of insurance loss curves. Finally we showed how splines can be used additively to model non-linear functions of multiple covariates. These are a powerful tool for interrogating complex non-linear relationships, but they offer interpolation of functions within a well-understood domain. Appealing to these models for extrapolation needs to be done carefully.
We can, I think, draw a broad moral from this presentation. Observed phenomena admit description with a wide variety of mathematical abstractions. The “correctness” of the abstraction is never the right question. The manner in which one abstraction distils the structure of reality is not fundamentally any better than the manner in which an alternative scheme does so. Predictive success achieved with one abstraction can be traded off for interpretive tractability in another. What matters is the task at hand - solving the problem. Sometimes problems can be solved with new information, but mostly they are resolved by re-arranging assumptions and phrasing the problem anew - seeing through a different lens, pulling the door handle where we previously pushed. Splines, hierarchical splines and Gaussian process models all attempt to approximate an answer by rephrasing the question. The more of these tools we have, the better prepared we are to tackle novel problems and discern the pertinent structure in the world.
Citation
@online{forde2024,
author = {Nathaniel Forde},
title = {GAMs and {GPs:} {Flexibility} and {Calibration}},
date = {2024-04-07},
langid = {en}
}