GAMs and GPs: Flexibility and Calibration

probability
generalised additive models
gaussian processes
Published

April 7, 2024

import pandas as pd
import matplotlib.pyplot as plt
import bambi as bmb
import seaborn as sns
from pygam.datasets import mcycle
from pygam import LinearGAM, s, f, GAM, l, utils
import numpy as np
import arviz as az
import pymc as pm


random_seed = 100

import warnings

warnings.simplefilter('ignore')


Flexible spline models risk overfitting and need to be carefully calibrated against real data.

Theoretical Posits and Scientific Commitments

The scientific enterprise must abstract away details of reality to provide insight into the structure of reality. Statistical models of reality are cobbled together from a variety of abstractions; probability distributions and functional relations conjoined and calibrated against observed data. We are building on shifting foundations. Good abstractions aim at modelling the structures or real patterns that matter.

There is no way to establish fully secured, … starting point of the sciences. There is no tabula rasa. We are like sailors who have to rebuild their ship on the open sea, without ever being able to dismantle it in dry-dock and reconstruct it from its best components. - Otto Neurath

Spline models are like the planks of wood we use to construct the ship and ride the next wave. In this blog post we’re going to dive into the modelling of non-linear functions and explore some of the tooling available in the Python ecosystem. We’ll start by looking at Generalised Additive Models with splines in pyGAM, before turning to Bayesian versions of spline modelling and comparing splines to Gaussian processes in Bambi and PyMC. Finally, we’ll show how hierarchical Bayesian models avoid some of the issues of overfitting seen in simpler spline models.

Our interest in these models stems from their flexibility to approximate functions of arbitrary complexity. We’ll see how the methods work on a relatively straightforward toy example, and then we’ll apply each of the methods to deriving insights into the functional form of insurance loss curves. In this application we adapt a data set discussed in Mick Cooney’s Stan case study to demonstrate the power of hierarchical spline models. Throughout we’ll draw on the discussion of these methods in Osvaldo Martin’s “Bayesian Analysis with Python” for practical details on implementing these models.

All of these methods need to be assessed with respect to their in-sample model fit and their out-of-sample performance. How can we best calibrate the model fits to perform reasonably well out of sample? How do the spline constructs reflect an underlying reality, and how can our posits aid generalisation?

Generalised Additive models

The canonical reference for GAMs is Simon Wood’s “Generalised Additive Models: An Introduction with R”, which outlines in some detail the theoretical background of splines and univariate smoothers. The book stresses the trade-off between the flexibility of splines and the need for cross-validation and penalised estimation methods in spline-based modelling.
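In broad strokes, a GAM replaces the linear predictor of a GLM with a sum of smooth functions of the covariates:

\[ g(\mathbb{E}[y]) = \alpha + f_{1}(x_{1}) + f_{2}(x_{2}) + \dots + f_{k}(x_{k}) \]

where \(g\) is a link function and each \(f_{j}\) is a smooth function built from a spline basis and estimated subject to a “wiggliness” penalty.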

In R these penalised model fits can be achieved with mgcv, which incorporates a Wilkinson-like formula syntax for model specification: y ~ s(x) + s(x1). The closest implementation in Python is available in pyGAM, and we will adopt this package to illustrate the application of smoothing terms and penalties.
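To give a rough sense of how the formula style carries over: pyGAM terms are positional rather than name-based, so column indices stand in for variable names. A minimal sketch (the toy X and y here are ours, purely for illustration):

# Hypothetical two-covariate example: s(0) and s(1) refer to the
# first and second columns of X, so this is roughly the analogue
# of the R formula y ~ s(x) + s(x1).
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

gam_toy = LinearGAM(s(0) + s(1)).fit(X_toy, y_toy)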

PyGAM and Penalised Fits

In the next code block we load an example data set on which to demonstrate univariate smoothing using penalised splines. These models can be optimised by fitting penalties of differing strength over a varying number of splines. Splines are used to construct non-linear functions of the input variable, which are combined additively in our model equation. The penalties used in the optimisation routines constrain how “wiggly” these constructions will be.

X, y = mcycle(return_X_y=True)

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(X, y)
ax.set_ylabel("Acceleration")
ax.set_xlabel("Time Step")
ax.set_title("Crash Test Dummy Acceleration \n Simulated Motorcycle Crash", fontsize=20)
Text(0.5, 1.0, 'Crash Test Dummy Acceleration \n Simulated Motorcycle Crash')

Next we fit a number of different models to account for the herky-jerky nature of the data generating process. We vary the parameterisations to see how the number of splines and the strength of the penalty help account for the variation in \(y\) over the support of \(X\).

gam1 = LinearGAM(s(0, n_splines=5)).fit(X, y)
gam2 = LinearGAM(s(0, n_splines=7)).fit(X, y)
gam3 = LinearGAM(s(0, n_splines=10)).fit(X, y)
gam4 = LinearGAM(s(0, n_splines=15)).fit(X, y)
gam5 = LinearGAM(s(0, lam=.1)).fit(X, y)
gam6 = LinearGAM(s(0, lam=.5)).fit(X, y)
gam7 = LinearGAM(s(0, lam=5)).fit(X, y)
gam8 = LinearGAM(s(0, lam=15)).fit(X, y)


def plot_fit(gam, X, y, ax, t, c1='b', c2='r'):
    XX = gam.generate_X_grid(term=0, n=500)

    ax.plot(XX, gam.predict(XX), color=c2, linestyle='--')
    ax.plot(XX, gam.prediction_intervals(XX, width=.95), color=c1, ls='--')

    ax.scatter(X, y, facecolor='gray', edgecolors='none')
    ax.set_title(f"""95% prediction interval with {t} \n LL: {gam.statistics_['loglikelihood']}""");

fig, axs = plt.subplots(4,2, figsize=(10, 20))
axs = axs.flatten()
titles = ['5_splines', '7_splines', '10_splines', '15_splines',
'lam=.1', 'lam=.5', 'lam=5', 'lam=15']
gs = [gam1, gam2, gam3, gam4, gam5, gam6, gam7, gam8]
for ax, g, t in zip(axs, gs, titles):
    plot_fit(g, X, y, ax, t)

Here we’ve seen the PyGAM package applied to fitting our model to the data. In the formula specification we see y ~ s(i) where i denotes the index of the column variable in the X data.

Over the range of the x-axis we can see how the conditional expectation is more or less well fit to the data, depending on how the penalties and the complexity of the model are specified.

Optimising The Parameter Setting

We can see from the model summary what is going on under the hood. For a given model specification the summary will report a number of model-fit statistics such as the log-likelihood and the AIC.

## Naive Model manually specified splines
gam_raw = LinearGAM(s(0, n_splines=10), fit_intercept=False).fit(X, y)
print("log_likelihood:", gam_raw.statistics_['loglikelihood'])
print("AIC:", gam_raw.statistics_['AIC'])
log_likelihood: -1042.2807399926558
AIC: 2097.7407642114244

The question then becomes what changes are induced in the model as we seek to optimise these model fit statistics.

## model optimised
gam = LinearGAM(s(0),  fit_intercept=False)
gam.gridsearch(X, y)
gam.summary()
LinearGAM                                                                                                 
=============================================== ==========================================================
Distribution:                        NormalDist Effective DoF:                                     11.8135
Link Function:                     IdentityLink Log Likelihood:                                  -952.2409
Number of Samples:                          133 AIC:                                             1930.1088
                                                AICc:                                            1933.0789
                                                GCV:                                              609.3811
                                                Scale:                                            512.7965
                                                Pseudo R-Squared:                                   0.7984
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [0.2512]             20           11.8         1.11e-16     ***         
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.

Fortunately, this routine can be performed directly and results in the following differences between the naive and optimised model.

Plot the GAM fits

fig, ax = plt.subplots(figsize=(10, 6))
plot_fit(gam_raw, X, y, ax, "Unoptimised Fit", c1='orange', c2='green')
plot_fit(gam, X, y, ax, "Optimised Fit")

This is all well and good. We’ve seen an approach to modelling that can capture eccentric patterns in raw data. But how does it work and why should we care?

If you’re familiar with the nomenclature of machine learning, you should think of spline modelling as a variety of feature creation. It generates “synthetic” features over the range of the observed variable. These synthetic features are the splines in question.

Digression on the usage of “Splines”

The history of the term “spline” is related to the history of draughtsmanship. Historically, splines were thin strips of flexible wood or plastic that could be bent or shaped around weighted “knot” points to express a traceable curve over the space of a “numberline”. The elastic nature of the spline material allowed it to be bent around the knot points of curvature, expressing a smooth or continuous bend.

The mathematical technique apes these properties by defining a curve in an analogous way. We specify “knots” to carve up the support of the random variable \(X\) into portions that require different weighting schemes to represent the outcome \(y\).

spline

Extracting the Splines

We can extract the spline features used in the PyGAM by invoking the following commands to first identify the knot points and create the b-spline basis appropriate for the variable \(X\).

knot_edges = utils.gen_edge_knots(X, dtype='numerical')
knots = np.linspace(knot_edges[0], knot_edges[-1], len(gam.coef_))

splines = utils.b_spline_basis(X, edge_knots=knot_edges, sparse=False)

splines_df = pd.DataFrame(splines, columns=[f'basis_{i}' for i in range(len(gam.coef_))])

splines_df.head(10)
basis_0 basis_1 basis_2 basis_3 basis_4 basis_5 basis_6 basis_7 basis_8 basis_9 basis_10 basis_11 basis_12 basis_13 basis_14 basis_15 basis_16 basis_17 basis_18 basis_19
0 0.166667 0.666667 0.166667 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.132997 0.661606 0.205334 0.000063 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.059688 0.594827 0.341426 0.004059 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.030095 0.518726 0.437481 0.013698 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.012374 0.428013 0.527144 0.032470 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.000000 0.040337 0.551431 0.399315 0.008917 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.000000 0.018232 0.465467 0.492630 0.023671 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.000000 0.011137 0.418489 0.535407 0.034967 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.000000 0.000014 0.189310 0.664817 0.145859 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.000000 0.000000 0.120914 0.656897 0.222015 0.000174 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

These spline features range over the extent of the covariate space \(X\), defining “partitions” of the space. The model “learns” to capture the shape of the outcome variable \(y\) by figuring out how to weight the different portions of this spline basis matrix, i.e. the linear combination of this basis matrix with the derived coefficients is a model of our outcome variable.

Note how each row is 0 everywhere except within the columns that represent a partition of \(X\). Additionally, each row sums to unity. These properties are important because they ensure that any weighted combination of this basis represents the outcome variable in a controlled and quite granular manner. The more splines we use, the more control we have over the representation.
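As a quick sanity check on both properties, a minimal sketch using the splines_df built above (the second line should print True if the basis forms a partition of unity):

# Rows are non-zero only within a local window of columns,
# and each row's entries should sum to one.
print((splines_df > 0).sum(axis=1).max())
print(np.allclose(splines_df.sum(axis=1), 1.0))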

Plotting the Weighted Spline

ax = splines_df.dot(gam.coef_).plot(title='Weighted Splines', label='Weighted Combination of Spline Basis', figsize=(10, 6))
ax.set_ylabel("Acceleration")
ax.set_xlabel("Time Steps")
ax.legend();

In this manner we can see how the specification of a spline basis can help us model eccentric curves in an outcome space. They are like duct-tape tying down our rigging, filling the gaps in our theory. If we hold this part down there, and jiggle another part here, our theory stays afloat. Spline constructions exemplify the contorted nature of empirically informed theory construction.

Next we’ll see how to more directly work with the specification of basis splines, passing these feature matrices into Bambi models.

Bayesian Splines with bambi

Under the hood Bambi makes use of the ‘formulae’ package to allow for formula-like syntax to specify spline basis terms and mixed effect model terms. We leverage this to streamline the specification of spline models.

knots_6 = np.linspace(0, np.max(X), 6+2)[1:-1]
knots_10 = np.linspace(0, np.max(X), 10+2)[1:-1]
knots_15 = np.linspace(0, np.max(X), 15+2)[1:-1]

df = pd.DataFrame({'X': X.flatten(), 'y': y})
formula1 = 'bs(X, degree=0, knots=knots_6)'
formula2 = 'bs(X, degree=1, knots=knots_6, intercept=False)'
formula3 = 'bs(X, degree=3, knots=knots_6, intercept=False)'
formula4 = 'bs(X, degree=3, knots=knots_10, intercept=False)'
formula5 = 'bs(X, degree=3, knots=knots_15, intercept=False)'
model_spline1 = bmb.Model(f"y ~ {formula1}", df)
model_spline2 = bmb.Model(f"y ~ {formula2}", df)
model_spline3 = bmb.Model(f"y ~ {formula3}", df)
model_spline4 = bmb.Model(f"y ~ {formula4}", df)
model_spline5 = bmb.Model(f"y ~ {formula5}", df)
model_spline5
       Formula: y ~ bs(X, degree=3, knots=knots_15, intercept=False)
        Family: gaussian
          Link: mu = identity
  Observations: 133
        Priors: 
    target = mu
        Common-level effects
            Intercept ~ Normal(mu: -25.5459, sigma: 226.0233)
            bs(X, degree=3, knots=knots_15, intercept=False) ~ Normal(mu: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
                0. 0. 0. 0. 0. 0. 0.], sigma: [1309.9059 1516.6254  914.7421  803.2567  526.1027  542.4563
                739.4915
              590.0538  692.802   786.9096  707.6947  852.6529  792.6311 1046.3175
             1246.4855 1367.9865 1595.1739 1391.2518])
        
        Auxiliary parameters
            sigma ~ HalfStudentT(nu: 4.0, sigma: 48.14)

As you can see, we are specifying a range of different degrees for the spline basis. The different degrees correspond to the smoothness of the overlapping splines. The degree=0 splines specify a piecewise constant basis, i.e. 0 or 1 within each region of the partition, but we can add more degrees to get more flexible representations of the space.

model_spline5.build()
model_spline5.graph()

This will become clearer as we plot the various spline basis matrices below.

Plot the Spline Basis

The functions below extract the basis specification from each model and plot the basis design for an increasingly complex series of spline basis matrices.

def plot_spline_basis(basis, X, ax, title="Spline Basis"):
    df = (
        pd.DataFrame(basis)
        .assign(X=X)
        .melt("X", var_name="basis_idx", value_name="y")
    )


    for idx in df.basis_idx.unique():
        d = df[df.basis_idx == idx]
        ax.plot(d["X"], d["y"])
    
    ax.set_title(title)
    return ax

def plot_knots(knots, ax):
    for knot in knots:
        ax.axvline(knot, color="0.1", alpha=0.4)
    return ax



fig, axs = plt.subplots(5, 1, figsize=(9, 20))
axs = axs.flatten()
B1 = model_spline1.response_component.design.common[formula1]
plot_spline_basis(B1, df["X"].values, ax=axs[0], title="Piecewise Constant Basis")
plot_knots(knots_6, axs[0]);

B2 = model_spline2.response_component.design.common[formula2]
ax = plot_spline_basis(B2, df["X"].values, axs[1], 
title="Piecewise Linear Basis")
plot_knots(knots_6, axs[1]);

B3 = model_spline3.response_component.design.common[formula3]
ax = plot_spline_basis(B3, df["X"].values, axs[2], 
title="Cubic Spline Basis (6 Knots)")
plot_knots(knots_6, axs[2]);

B4 = model_spline4.response_component.design.common[formula4]
ax = plot_spline_basis(B4, df["X"].values, axs[3], 
title="Cubic Spline Basis (10 Knots)")
plot_knots(knots_10, axs[3]);


B5 = model_spline5.response_component.design.common[formula5]
ax = plot_spline_basis(B5, df["X"].values, axs[4], 
title="Cubic Spline Basis (15 Knots)")
plot_knots(knots_15, axs[4]);

Now that we’ve seen the nature of the spline bases, we’ll fit each model to the data and plot how the weighted spline matrices are able to represent the raw data.

Fit the Individual Spline Models

We now combine the spline components within a linear model fit, pulling and tying them together in such a way as to ape the shape of the observed sequence.

idata_spline1 = model_spline1.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})

idata_spline2 = model_spline2.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})

idata_spline3 = model_spline3.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})

idata_spline4 = model_spline4.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})

idata_spline5 = model_spline5.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})

Plotting the fits

Plot the Weighted Mean

def plot_weighted_splines(B, idata, formula, ax, knots):
    posterior_stacked = az.extract(idata)
    wp = posterior_stacked[formula].mean("sample").values

    plot_spline_basis(B * wp.T, df["X"].values, ax)
    ax.plot(df.X.values, np.dot(B, wp.T), color="black", lw=3, label='Weighted Splines')
    plot_knots(knots, ax);
    ax.legend()



fig, axs = plt.subplots(5, 1, figsize=(10, 20))
axs = axs.flatten()

plot_weighted_splines(B1, idata_spline1, formula1, axs[0], knots_6)
plot_weighted_splines(B2, idata_spline2, formula2, axs[1], knots_6)
plot_weighted_splines(B3, idata_spline3, formula3, axs[2], knots_6)
plot_weighted_splines(B4, idata_spline4, formula4, axs[3], knots_10)
plot_weighted_splines(B5, idata_spline5, formula5, axs[4], knots_15)

Here we can see how the models with increasingly complex splines are more exactly able to fit the herky jerky trajectory of the outcome variable in each interval. The fewer the intervals, the less flexibility available to the model.

Compare Model Fits

As before we can evaluate these model fits and compare them based on leave-one-out cross validation scores and information theoretic complexity measures.

fig, axs = plt.subplots(1, 2, figsize=(10, 7))
axs = axs.flatten()
models_dict = {"piecewise_constant": idata_spline1, "piecewise_linear": idata_spline2, "cubic_bspline": idata_spline3, "cubic_bspline_10": idata_spline4, 
"cubic_bspline_15": idata_spline5}
df_compare = az.compare(models_dict)
az.plot_compare(df_compare, ax=axs[0])
axs[0].get_legend().remove()
az.plot_compare(az.compare(models_dict, 'waic'), ax=axs[1])
axs[1].set_yticklabels([])
df_compare
rank elpd_loo p_loo elpd_diff weight se dse warning scale
cubic_bspline_10 0 -612.572745 11.646898 0.000000 9.011073e-01 9.646678 0.000000 True log
cubic_bspline_15 1 -616.333085 15.666243 3.760339 5.207000e-15 9.841471 1.475865 True log
cubic_bspline 2 -634.647180 8.703519 22.074435 2.722831e-15 8.915728 6.642850 True log
piecewise_constant 3 -643.781042 6.981098 31.208296 5.299660e-02 9.770740 8.728072 False log
piecewise_linear 4 -647.016885 5.987587 34.444140 4.589608e-02 7.914787 8.116163 False log

Here we see that the extra complexity of using 15 splines leads to slightly worse performance measures than the less complex but seemingly adequate 10 splines.

new_data = pd.DataFrame({"X": np.linspace(df.X.min(), df.X.max(), num=500)})
    
model_spline4.predict(idata_spline4, data=new_data, 
kind='pps', inplace=True)

idata_spline4
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:                                               (chain: 4,
                                                                 draw: 1000,
                                                                 bs(X, degree=3, knots=knots_10, intercept=False)_dim: 13,
                                                                 y_obs: 500)
      Coordinates:
        * chain                                                 (chain) int64 0 1 2 3
        * draw                                                  (draw) int64 0 ... 999
        * bs(X, degree=3, knots=knots_10, intercept=False)_dim  (bs(X, degree=3, knots=knots_10, intercept=False)_dim) int64 ...
        * y_obs                                                 (y_obs) int64 0 ......
      Data variables:
          Intercept                                             (chain, draw) float64 ...
          bs(X, degree=3, knots=knots_10, intercept=False)      (chain, draw, bs(X, degree=3, knots=knots_10, intercept=False)_dim) float64 ...
          y_sigma                                               (chain, draw) float64 ...
          y_mean                                                (chain, draw, y_obs) float64 ...
      Attributes:
          created_at:                  2024-05-10T07:33:28.311908
          arviz_version:               0.17.0
          inference_library:           pymc
          inference_library_version:   5.10.3
          sampling_time:               2.5449118614196777
          tuning_steps:                1000
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

    • <xarray.Dataset>
      Dimensions:  (chain: 4, draw: 1000, y_obs: 500)
      Coordinates:
        * chain    (chain) int64 0 1 2 3
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
        * y_obs    (y_obs) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
      Data variables:
          y        (chain, draw, y_obs) float64 -39.47 -30.09 43.27 ... -29.69 -54.1
      Attributes:
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

    • <xarray.Dataset>
      Dimensions:  (chain: 4, draw: 1000, y_obs: 133)
      Coordinates:
        * chain    (chain) int64 0 1 2 3
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
        * y_obs    (y_obs) int64 0 1 2 3 4 5 6 7 8 ... 125 126 127 128 129 130 131 132
      Data variables:
          y        (chain, draw, y_obs) float64 -4.94 -4.349 -4.059 ... -4.232 -5.027
      Attributes:
          created_at:                  2024-05-10T07:33:28.417367
          arviz_version:               0.17.0
          inference_library:           pymc
          inference_library_version:   5.10.3
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

    • <xarray.Dataset>
      Dimensions:                (chain: 4, draw: 1000)
      Coordinates:
        * chain                  (chain) int64 0 1 2 3
        * draw                   (draw) int64 0 1 2 3 4 5 ... 994 995 996 997 998 999
      Data variables: (12/17)
          tree_depth             (chain, draw) int64 4 4 4 4 3 4 4 5 ... 4 5 4 3 4 4 4
          acceptance_rate        (chain, draw) float64 0.9519 0.8589 ... 0.7224 0.9861
          diverging              (chain, draw) bool False False False ... False False
          lp                     (chain, draw) float64 -721.3 -715.6 ... -710.3 -711.0
          reached_max_treedepth  (chain, draw) bool False False False ... False False
          max_energy_error       (chain, draw) float64 -1.01 0.7776 ... 0.8487 -0.3605
          ...                     ...
          energy                 (chain, draw) float64 728.6 729.2 ... 725.7 714.8
          perf_counter_start     (chain, draw) float64 2.053e+06 ... 2.053e+06
          index_in_trajectory    (chain, draw) int64 -7 -13 -14 -3 6 ... -3 4 -4 7 8
          perf_counter_diff      (chain, draw) float64 0.000742 ... 0.0007362
          energy_error           (chain, draw) float64 -0.2612 -0.1061 ... -0.1877
          step_size              (chain, draw) float64 0.2833 0.2833 ... 0.3204 0.3204
      Attributes:
          created_at:                  2024-05-10T07:33:28.319056
          arviz_version:               0.17.0
          inference_library:           pymc
          inference_library_version:   5.10.3
          sampling_time:               2.5449118614196777
          tuning_steps:                1000
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

    • <xarray.Dataset>
      Dimensions:  (y_obs: 133)
      Coordinates:
        * y_obs    (y_obs) int64 0 1 2 3 4 5 6 7 8 ... 125 126 127 128 129 130 131 132
      Data variables:
          y        (y_obs) float64 0.0 -1.3 -2.7 0.0 -2.7 ... -2.7 10.7 -2.7 10.7
      Attributes:
          created_at:                  2024-05-10T07:33:28.321732
          arviz_version:               0.17.0
          inference_library:           pymc
          inference_library_version:   5.10.3
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

Next we plot the posterior predictive distribution of our observed variable and compare against the observed data. Additionally we plot the 89% and 50% HDIs.

ax = az.plot_hdi(new_data['X'], idata_spline4['posterior_predictive']['y'], fill_kwargs={'alpha': 0.2, 'color':'firebrick'}, hdi_prob=0.89, figsize=(10, 8))

az.plot_hdi(new_data['X'], idata_spline4['posterior_predictive']['y'], fill_kwargs={'alpha': 0.8, 'color':'firebrick'}, hdi_prob=0.5)

y_mean = idata_spline4['posterior_predictive']['y'].mean(dim=('chain', 'draw'))

ax.plot(new_data['X'], y_mean, label='Expected posterior predictive', color='k')
ax.set_xlabel("Time Point")
ax.set_ylabel("Acceleration")

ax.scatter(df['X'], df['y'], label='Observed Datapoints')

ax.legend()

ax.set_title("Posterior Predictive Distribution \n Based on 10 Knots");

This represents a good, clean model fit to the observed data using univariate spline smoothers. Next we’ll see an alternative approach to modelling this outcome variable using approximate Gaussian processes.

Approximate Gaussian processes

The topic of Gaussian processes is rich and detailed, too rich to be covered fairly in this blog post, so we’ll just say that we’re using a method designed for function approximation that works by drawing samples from a multivariate normal distribution under a range of different covariance relationships. We are in fact using the HSGP approximation to Gaussian processes.

Nevertheless, the Gaussian process prior can be interrogated somewhat intuitively by defining different combinations of covariance relationships, with priors over the parameters governing the covariance of a sequence of points. For example, consider the following parameterisations.

fig, axs = plt.subplots(2, 2, figsize=(9, 10))
axs = axs.flatten()

def plot_cov_draws(ax1, ax2, lengthscale=3, sigma=13):
    cov = sigma**2 * pm.gp.cov.ExpQuad(1, lengthscale)
    X = np.linspace(0, 60, 200)[:, None]
    K = cov(X).eval()

    sns.heatmap(pd.DataFrame(K), center=0, xticklabels=[], yticklabels=[], ax=ax1, cmap='crest');
    ax1.set_title(f"Covariance Length Scale {lengthscale}")
    ax2.plot(
        X,
        pm.draw(
            pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=K.shape[0]), draws=10, random_seed=random_seed
        ).T, color='blue', alpha=0.5
    )
    ax2.set_title(f"Samples from the GP prior \n lengthscale: {lengthscale}, sigma: {sigma}")
    plt.ylabel("y")
    plt.xlabel("X");

plot_cov_draws(axs[0], axs[1])
plot_cov_draws(axs[2], axs[3], lengthscale=10, sigma=13)

We’ve specified the range of X to reflect the support of the acceleration example and allowed the draws to be informed by a covariance function we have parameterised using the Exponentiated Quadratic kernel:

\[k(x, x') = \mathrm{exp}\left[ -\frac{(x - x')^2}{2 \ell^2} \right]\]

The plots on the left highlight the importance of the lengthscale parameter over the sequence: the wider the central shading, the more pronounced the correlation among points in the sequence. The draws on the right show enough “wiggliness” that they should be flexible enough to capture the shape of the acceleration curve, if we can calibrate the model and derive posterior parameters against the observed data.
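For intuition, the kernel itself is easy to write down directly. A minimal numpy sketch (the expquad helper is ours) that should agree, up to the sigma**2 amplitude we multiplied in above, with pm.gp.cov.ExpQuad:

def expquad(x, xp, lengthscale=3.0):
    # k(x, x') = exp(-(x - x')^2 / (2 * lengthscale^2))
    return np.exp(-((x - xp) ** 2) / (2 * lengthscale ** 2))

X_grid = np.linspace(0, 60, 200)
K_manual = expquad(X_grid[:, None], X_grid[None, :], lengthscale=3.0)
# K_manual should match pm.gp.cov.ExpQuad(1, 3)(X_grid[:, None]).eval()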

Priors on Gaussian Processes

Consider the following specification for the priors

fig, axs = plt.subplots(1, 2, figsize=(9, 6))
axs = axs.flatten()
axs[0].hist(pm.draw(pm.InverseGamma.dist(mu=1, sigma=1), 1000), ec='black', bins=30);
axs[0].set_title("Priors for Lengthscale \n in ExpQuad Kernel")
axs[1].hist(pm.draw(pm.Exponential.dist(lam=1), 1000), ec='black', bins=30);
axs[1].set_title("Priors for Amplitude \n in ExpQuad Kernel")
Text(0.5, 1.0, 'Priors for Amplitude \n in ExpQuad Kernel')

We use these to specify priors on the Hilbert space approximation of Gaussian processes available in the Bambi package.

prior_hsgp = {
    "sigma": bmb.Prior("Exponential", lam=1), # amplitude
    "ell": bmb.Prior("InverseGamma", mu=1, sigma=1) # lengthscale
}

# This is the dictionary we pass to Bambi
priors = {
    "hsgp(X, m=10, c=1)": prior_hsgp,
    "sigma": bmb.Prior("HalfNormal", sigma=4)
}
model_hsgp = bmb.Model("y ~ 0 + hsgp(X, m=10, c=1)", df, priors=priors)
model_hsgp
       Formula: y ~ 0 + hsgp(X, m=10, c=1)
        Family: gaussian
          Link: mu = identity
  Observations: 133
        Priors: 
    target = mu
        HSGP contributions
            hsgp(X, m=10, c=1)
                cov: ExpQuad
                sigma ~ Exponential(lam: 1.0)
                ell ~ InverseGamma(mu: 1.0, sigma: 1.0)
        
        Auxiliary parameters
            sigma ~ HalfNormal(sigma: 4.0)

Here we’ve set m=10 to determine the number of basis vectors used in the Hilbert space approximation. The idea differs in detail from the spline-based approximations we’ve seen, but it’s perhaps useful to think of the process in the same vein. Here again we have a theory of the world expressed as a function of opaque components jerry-rigged for modelling some phenomena.
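To give a rough flavour of what those basis vectors are doing, the Hilbert space approximation (following Solin and Särkkä, and Riutort-Mayol et al.) represents the GP as a weighted sum of m fixed basis functions, approximately:

\[ f(x) \approx \sum_{j=1}^{m} \sqrt{S_{\theta}\left(\sqrt{\lambda_{j}}\right)} \, \phi_{j}(x) \, \beta_{j}, \qquad \beta_{j} \sim \mathcal{N}(0, 1) \]

where the \(\phi_{j}\) are sine eigenfunctions defined on an interval expanded by the factor c, the \(\lambda_{j}\) are their eigenvalues, and \(S_{\theta}\) is the spectral density of the ExpQuad kernel evaluated at our sigma and ell parameters. As with the splines, the fit ultimately amounts to learning weights on a fixed set of basis functions.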

idata = model_hsgp.fit(inference_method="nuts_numpyro",target_accept=0.95, random_seed=121195, 
idata_kwargs={"log_likelihood": True})
print(idata.sample_stats["diverging"].sum().to_numpy())
Compiling...
Compilation time = 0:00:00.946200
Sampling...
Sampling time = 0:00:01.765245
Transforming variables...
Transformation time = 0:00:00.173278
Computing Log Likelihood...
Log Likelihood time = 0:00:00.143399
0

This model fits and the sampling seems to have worked well.

az.plot_trace(idata, backend_kwargs={"layout": "constrained"}, figsize=(9, 15));

The lengthscale and sigma parameters have been learned by calibrating our priors against the data. The degree to which these parameters are meaningful depends a little on how familiar you are with covariance kernels and their properties, so we won’t dwell on the point here.

az.summary(idata, var_names=['hsgp(X, m=10, c=1)_ell', 'hsgp(X, m=10, c=1)_sigma', 'y_sigma', 'hsgp(X, m=10, c=1)_weights'])
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
hsgp(X, m=10, c=1)_ell 3.301 0.666 2.099 4.567 0.016 0.011 1763.0 1815.0 1.0
hsgp(X, m=10, c=1)_sigma 23.675 2.679 18.925 28.847 0.069 0.049 1497.0 2167.0 1.0
y_sigma 20.589 1.121 18.590 22.751 0.019 0.014 3440.0 2485.0 1.0
hsgp(X, m=10, c=1)_weights[0] -131.588 13.784 -157.019 -105.189 0.216 0.154 4071.0 3020.0 1.0
hsgp(X, m=10, c=1)_weights[1] -94.391 18.010 -125.921 -58.020 0.363 0.256 2500.0 2876.0 1.0
hsgp(X, m=10, c=1)_weights[2] 106.121 20.717 69.627 147.247 0.458 0.324 2046.0 2563.0 1.0
hsgp(X, m=10, c=1)_weights[3] 149.289 22.154 106.508 188.107 0.521 0.368 1807.0 2467.0 1.0
hsgp(X, m=10, c=1)_weights[4] -83.399 22.728 -124.812 -40.406 0.539 0.383 1776.0 2517.0 1.0
hsgp(X, m=10, c=1)_weights[5] -125.322 23.273 -169.346 -82.187 0.531 0.377 1917.0 2940.0 1.0
hsgp(X, m=10, c=1)_weights[6] 37.835 21.153 -1.712 77.574 0.512 0.362 1713.0 2577.0 1.0
hsgp(X, m=10, c=1)_weights[7] 94.574 20.745 53.167 132.182 0.423 0.299 2393.0 3105.0 1.0
hsgp(X, m=10, c=1)_weights[8] -1.494 17.184 -31.835 33.063 0.361 0.255 2264.0 2853.0 1.0
hsgp(X, m=10, c=1)_weights[9] -45.442 15.955 -76.649 -16.636 0.289 0.208 3041.0 3183.0 1.0

But again we can sample from the posterior predictive distribution of the outcome variable

model_hsgp.predict(idata, data=new_data, 
kind='pps', inplace=True)

idata
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:                         (chain: 4, draw: 1000,
                                           hsgp(X, m=10, c=1)_weights_dim: 10,
                                           y_obs: 500)
      Coordinates:
        * chain                           (chain) int64 0 1 2 3
        * draw                            (draw) int64 0 1 2 3 4 ... 996 997 998 999
        * hsgp(X, m=10, c=1)_weights_dim  (hsgp(X, m=10, c=1)_weights_dim) int64 0 ...
        * y_obs                           (y_obs) int64 0 1 2 3 4 ... 496 497 498 499
      Data variables:
          hsgp(X, m=10, c=1)_weights_raw  (chain, draw, hsgp(X, m=10, c=1)_weights_dim) float64 ...
          y_sigma                         (chain, draw) float64 20.86 19.86 ... 21.17
          hsgp(X, m=10, c=1)_sigma        (chain, draw) float64 26.85 24.0 ... 18.06
          hsgp(X, m=10, c=1)_ell          (chain, draw) float64 3.861 3.509 ... 3.76
          hsgp(X, m=10, c=1)_weights      (chain, draw, hsgp(X, m=10, c=1)_weights_dim) float64 ...
          y_mean                          (chain, draw, y_obs) float64 -0.1748 ... ...
          hsgp(X, m=10, c=1)              (chain, draw, y_obs) float64 -0.1748 ... ...
      Attributes:
          created_at:                  2024-05-10T07:33:42.866443
          arviz_version:               0.17.0
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

    • <xarray.Dataset>
      Dimensions:  (chain: 4, draw: 1000, y_obs: 500)
      Coordinates:
        * chain    (chain) int64 0 1 2 3
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
        * y_obs    (y_obs) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
      Data variables:
          y        (chain, draw, y_obs) float64 -6.307 22.12 30.33 ... 8.099 -5.211
      Attributes:
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

    • <xarray.Dataset>
      Dimensions:  (chain: 4, draw: 1000, y_obs: 133)
      Coordinates:
        * chain    (chain) int64 0 1 2 3
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
        * y_obs    (y_obs) int64 0 1 2 3 4 5 6 7 8 ... 125 126 127 128 129 130 131 132
      Data variables:
          y        (chain, draw, y_obs) float64 -3.957 -3.957 -3.957 ... -4.013 -4.099
      Attributes:
          created_at:                  2024-05-10T07:33:42.869825
          arviz_version:               0.17.0
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

    • <xarray.Dataset>
      Dimensions:          (chain: 4, draw: 1000)
      Coordinates:
        * chain            (chain) int64 0 1 2 3
        * draw             (draw) int64 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999
      Data variables:
          acceptance_rate  (chain, draw) float64 0.9484 0.989 0.9772 ... 0.9905 0.9283
          step_size        (chain, draw) float64 0.1631 0.1631 ... 0.1827 0.1827
          diverging        (chain, draw) bool False False False ... False False False
          energy           (chain, draw) float64 671.4 665.4 671.0 ... 670.8 674.0
          n_steps          (chain, draw) int64 31 31 31 15 15 31 ... 15 31 31 15 15 31
          tree_depth       (chain, draw) int64 5 5 5 4 4 5 5 4 4 ... 4 5 5 4 5 5 4 4 5
          lp               (chain, draw) float64 661.7 661.3 665.7 ... 665.6 668.8
      Attributes:
          created_at:                  2024-05-10T07:33:42.868840
          arviz_version:               0.17.0
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

    • <xarray.Dataset>
      Dimensions:  (y_obs: 133)
      Coordinates:
        * y_obs    (y_obs) int64 0 1 2 3 4 5 6 7 8 ... 125 126 127 128 129 130 131 132
      Data variables:
          y        (y_obs) float64 0.0 -1.3 -2.7 0.0 -2.7 ... -2.7 10.7 -2.7 10.7
      Attributes:
          created_at:                  2024-05-10T07:33:42.870094
          arviz_version:               0.17.0
          inference_library:           numpyro
          inference_library_version:   0.13.2
          sampling_time:               1.765245
          modeling_interface:          bambi
          modeling_interface_version:  0.13.0

and plot the model fit to see if it can recover the observed data.

ax = az.plot_hdi(new_data['X'], idata['posterior_predictive']['y'], fill_kwargs={'alpha': 0.2, 'color':'firebrick'}, figsize=(9, 8))

az.plot_hdi(new_data['X'], idata['posterior_predictive']['y'], fill_kwargs={'alpha': 0.8, 'color':'firebrick'}, hdi_prob=0.5)

y_mean = idata['posterior_predictive']['y'].mean(dim=('chain', 'draw'))

ax.plot(new_data['X'], y_mean, label='Expected posterior predictive', color='k')

ax.scatter(df['X'], df['y'], label='Observed Datapoints')

ax.legend()

ax.set_title("Posterior Predictive Distribution \n Based on HSGP approximation")
Text(0.5, 1.0, 'Posterior Predictive Distribution \n Based on HSGP approximation')

And we can compare against the spline models to see that, by the aggregate performance measures, our HSGP model seems to come out on top.

models_dict = {"piecewise_constant": idata_spline1, "piecewise_linear": idata_spline2, "cubic_bspline": idata_spline3, "cubic_bspline_10": idata_spline4, 
"cubic_bspline_15": idata_spline5, 'hsgp': idata}
df_compare = az.compare(models_dict)
df_compare
rank elpd_loo p_loo elpd_diff weight se dse warning scale
hsgp 0 -608.819770 9.958827 0.000000 0.872617 12.009698 0.000000 False log
cubic_bspline_10 1 -612.586590 11.660742 3.766820 0.000000 9.647058 3.128421 True log
cubic_bspline_15 2 -616.333085 15.666243 7.513315 0.000000 9.841471 3.413170 True log
cubic_bspline 3 -634.647180 8.703519 25.827410 0.000000 8.915728 7.666272 True log
piecewise_constant 4 -643.781042 6.981098 34.961272 0.069309 9.770740 9.490817 False log
piecewise_linear 5 -647.016885 5.987587 38.197115 0.058073 7.914787 9.204094 False log

Recap

So far we’ve seen how we can use splines and Gaussian processes to model highly eccentric functional relationships, where the function could be approximated with a univariate smoothing routine. These are two distinct abstractions which seem adequately fit to the world, but which demand very different ways of thinking about the underlying reality. This is no fundamental contradiction insofar as the world admits many descriptions.

[K]nowledge develops in a multiplicity of theories, each with its limited utility … These theories overlap very considerably in their logical laws and in much else, but that they add up to an integrated and consistent whole is only a worthy ideal and not a prerequisite of scientific progress … Let reconciliations proceed. - W.V.O. Quine in Word and Object

Another observation in a similar vein is that penalised spline models are provably equivalent to hierarchical regression (random effects) models. This is striking because the characters of these types of model seem diametrically opposed. With spline models you jerry-rig your features to an optimisation goal; with hierarchical models you tend to impose theoretical structure to induce shrinkage. It’s hard to see how this works: with a penalised spline model are you inducing a hierarchy of latent features you can’t name? Should you even try to translate between the two!? The abstract components of our models are less graspable than their predictive performance. The qualitative character of theory building differs markedly across these abstractions, and aesthetics and explainability feature heavily in model preference.
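The equivalence is easier to see written down. A sketch of the standard argument: the penalised spline fit solves

\[ \hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; \lVert y - B\beta \rVert^{2} + \lambda \beta^{T} S \beta \]

which is, up to constants, the posterior mode of a model with likelihood \(y \sim \mathcal{N}(B\beta, \sigma^{2}I)\) and a mean-zero Gaussian prior \(\beta \sim \mathcal{N}(0, \sigma^{2}(\lambda S)^{-})\) on the spline coefficients, i.e. exactly the kind of shrinkage prior we place on group-level coefficients in a hierarchical regression. The smoothing penalty plays the role of a prior variance.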

Next we’ll step up our abstractions and show how to use hierarchical modelling over spline fits to extract insight into the data generating process over a family of curves. In particular we’ll focus on the development of insurance loss curves.

Insurance Loss Curves: Hierarchical Spline Models

We draw on the car insurance losses data set discussed in Mick Cooney’s Stan case study, but we simplify things considerably by focusing on two types of loss and ensuring that each year under consideration has an equal number of observations of the accruing losses.

loss_df = pd.read_csv('ppauto_pos.csv')
loss_df = loss_df[(loss_df['GRCODE'].isin([43, 353])) & (loss_df['DevelopmentYear'] < 1998)]

loss_df = loss_df[['GRCODE', 'AccidentYear', 'DevelopmentYear', 'DevelopmentLag', 'EarnedPremDIR_B', 'CumPaidLoss_B']]


loss_df.columns = ['grcode', 'acc_year', 'dev_year', 'dev_lag', 'premium', 'cum_loss']
loss_df['lr'] = loss_df['cum_loss'] / loss_df['premium']
loss_df = loss_df[(loss_df['acc_year'] <= 1992) & (loss_df['dev_lag'] <= 6)].reset_index(drop=True)

loss_df['year_code'] = loss_df['acc_year'].astype(str) + '_' + loss_df['grcode'].astype(str)
loss_df.sort_values(by=['year_code', 'acc_year', 'dev_lag'], inplace=True)
loss_df['standardised_premium'] = (loss_df.assign(mean_premium = np.mean(loss_df['premium']))
.assign(std_premium = np.std(loss_df['premium']))
.apply(lambda x: (x['mean_premium'] - x['premium']) /x['std_premium'], axis=1)
)

loss_df.head(12)
grcode acc_year dev_year dev_lag premium cum_loss lr year_code standardised_premium
30 353 1988 1988 1 18793 4339 0.230884 1988_353 -0.347453
31 353 1988 1989 2 18793 9617 0.511733 1988_353 -0.347453
32 353 1988 1990 3 18793 11584 0.616400 1988_353 -0.347453
33 353 1988 1991 4 18793 12001 0.638589 1988_353 -0.347453
34 353 1988 1992 5 18793 12640 0.672591 1988_353 -0.347453
35 353 1988 1993 6 18793 12966 0.689938 1988_353 -0.347453
0 43 1988 1988 1 957 133 0.138976 1988_43 1.722344
1 43 1988 1989 2 957 333 0.347962 1988_43 1.722344
2 43 1988 1990 3 957 431 0.450366 1988_43 1.722344
3 43 1988 1991 4 957 570 0.595611 1988_43 1.722344
4 43 1988 1992 5 957 615 0.642633 1988_43 1.722344
5 43 1988 1993 6 957 615 0.642633 1988_43 1.722344

Plot the Loss Curves

Here we have plotted the developing loss curves from two different coded insurance products.

pivot = loss_df.pivot(index=['dev_lag'], columns=['grcode', 'acc_year'], values='lr')
fig, axs = plt.subplots(1, 2, figsize=(9, 7))
pivot.plot(figsize=(10, 6), ax=axs[0])
axs[0].set_title("Loss Ratios by Year");
for c in pivot.columns:
    if 43 in c:
        color='red'
    else: 
        color='grey'
    axs[1].plot(pivot[c], color=color, label=c)
axs[1].legend()
axs[1].set_title("Loss Ratio by Group")
Text(0.5, 1.0, 'Loss Ratio by Group')

We want to model these curves collectively, as draws from a distribution of loss curves. To do so we will specify a PyMC hierarchical (mixed) spline model, with a spline basis weighted by global parameters beta_g and by individual parameters for each curve, as sketched below.
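Roughly, the intended model for the loss ratio of curve \(i\) at development lag \(t\) is (notation ours):

\[ lr_{i,t} = \beta_{grcode} \cdot grcode_{i} + \beta_{prem} \cdot prem_{i} + B_{g}(t)\beta_{g} + B_{i}(t)\beta_{i} + \epsilon_{i,t} \]

where \(grcode_{i}\) is an indicator for group code 43, \(prem_{i}\) is the standardised premium, \(B_{g}\) is a shared spline basis with global weights \(\beta_{g}\), \(B_{i}\) is the same basis with year-specific weights \(\beta_{i}\) shrunk towards zero, and \(\epsilon_{i,t}\) is Gaussian noise. Below we define a convenience function to generate the basis splines.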

from patsy import bs as bs_patsy, dmatrix

def make_basis_splines(num_knots=3, max_dev=7):
    knot_list = np.linspace(0, max_dev, num_knots+2)[1:-1]
    dev_periods = np.arange(1, max_dev, 1)

    Bi = dmatrix(
        "bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True) - 1",
        {"dev_periods": dev_periods, "knots": knot_list},
    )

    Bg = dmatrix(
        "bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True) - 1",
        {"dev_periods": dev_periods, "knots": knot_list})


    return Bi, Bg

Bi, Bg = make_basis_splines()
Bg
DesignMatrix with shape (6, 7)
  Columns:
    ['bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[0]',
     'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[1]',
     'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[2]',
     'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[3]',
     'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[4]',
     'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[5]',
     'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)[6]']
  Terms:
    'bs_patsy(dev_periods, knots=knots, degree=3, include_intercept=True)' (columns 0:7)
  (to view full data, use np.asarray(this_obj))

Next we specify a model-maker function to create the various pooled, unpooled and hierarchical (mixed) models of the insurance curve data. Note that even though we’re specifying a hierarchical model, we have not specified a hierarchy over the insurance codes; instead we have added this as a “fixed” effect feature in our regression model. The idea is that this fixed effect will capture the differences in expected baseline per group code of insurance product.

def make_model(loss_df, num_knots=3, max_dev=7, model_type='mixed'):
    Bi, Bg = make_basis_splines(num_knots, max_dev)
    observed = loss_df['lr'].values
    uniques, unique_codes = pd.factorize(loss_df['year_code'])
    coords= {'years': unique_codes, 
            'splines': list(range(Bi.shape[1])) ,
            'measurement': list(range(6)), 
            'obs': uniques
            }

    with pm.Model(coords=coords) as sp_insur:
        basis_g = pm.MutableData('Bg', np.asfortranarray(Bg))

        tau = pm.HalfCauchy('tau', 1)
        ## Global Hierarchical Spline Terms
        beta_g = pm.Normal("beta_g", mu=0, sigma=tau, 
        dims='splines')
        mu_g = pm.Deterministic("mu_g", pm.math.dot(basis_g, beta_g), dims='measurement')

        ## Individual or Year Specific Spline Modifications
        if model_type in ['mixed', 'unpooled']:
            sigma = pm.HalfCauchy('sigma_i', 1)
            basis_i = pm.MutableData('Bi', np.asfortranarray(Bi))
            beta = pm.Normal("beta", mu=0, sigma=sigma, dims=('splines', 'years'))
            mui = pm.Deterministic("mui", pm.math.dot(basis_i, beta), dims=('measurement', 'years'))
        
        ## Features
        prem = pm.MutableData('prem', loss_df['standardised_premium'].values)
        grcode = pm.MutableData('grcode', loss_df['grcode'] == 43)

        beta_prem = pm.Normal('beta_prem', 0, 1)
        beta_grcode = pm.Normal('beta_grcode', 0, 1)
        mu_prem = beta_prem*prem
        mu_grcode = beta_grcode*grcode

        ## Likelihood
        sigma = pm.TruncatedNormal("sigma", 1, lower=0.005)
        if model_type == 'mixed':
            mu = pm.Deterministic('mu',  mu_grcode + mu_prem + (mu_g.T + mui.T).ravel(), dims='obs')
            lr_likelihood = pm.Normal("lr", mu, sigma, observed=observed, dims=('obs'))
        elif model_type == 'pooled': 
             lr_likelihood = pm.Normal("lr",  np.repeat(mu_g, len(unique_codes)), sigma, observed=observed, dims='obs')
        elif model_type == 'unpooled':
            lr_likelihood = pm.Normal("lr",  mui.T.ravel(), sigma, observed=observed, dims=('obs'))


        ## Sampling
        idata_sp_insur = pm.sample(2000, return_inferencedata=True, target_accept=.99,
        idata_kwargs={"log_likelihood": True})
        idata_sp_insur = pm.sample_posterior_predictive(
            idata_sp_insur,extend_inferencedata=True)

    return idata_sp_insur, sp_insur


idata_sp_insur_unpooled, sp_insur_unpooled = make_model(loss_df, model_type='unpooled')
idata_sp_insur_pooled, sp_insur_pooled = make_model(loss_df, model_type='pooled')
idata_sp_insur_mixed, sp_insur_mixed = make_model(loss_df, model_type='mixed')

The model structure can be seen more clearly in this graph

pm.model_to_graphviz(sp_insur_mixed)

We can extract the effect of the different grcodes and examine the baseline and annual spline-related coefficients.

summary = az.summary(idata_sp_insur_mixed, var_names=['sigma', 'beta_grcode', 'beta_prem', 'beta_g', 'beta'])

summary
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
sigma 0.014 0.008 0.005 0.029 0.001 0.000 164.0 322.0 1.02
beta_grcode 0.230 0.058 0.122 0.338 0.001 0.001 3007.0 3978.0 1.00
beta_prem -0.030 0.029 -0.084 0.025 0.000 0.000 3845.0 4440.0 1.00
beta_g[0] 0.137 0.068 0.009 0.262 0.002 0.001 1925.0 3352.0 1.00
beta_g[1] 0.055 0.398 -0.705 0.808 0.008 0.007 2883.0 2916.0 1.00
... ... ... ... ... ... ... ... ... ...
beta[6, 1990_43] 0.543 0.070 0.413 0.680 0.002 0.001 2142.0 3457.0 1.00
beta[6, 1991_353] -0.102 0.069 -0.232 0.028 0.002 0.001 1931.0 3092.0 1.00
beta[6, 1991_43] 0.113 0.071 -0.018 0.249 0.001 0.001 2263.0 3589.0 1.00
beta[6, 1992_353] -0.119 0.068 -0.242 0.017 0.002 0.001 1914.0 3022.0 1.00
beta[6, 1992_43] -0.033 0.090 -0.203 0.135 0.002 0.001 2781.0 3949.0 1.00

80 rows × 9 columns

Again we can compare the performance metrics of the various models.

compare_df = az.compare({'unpooled': idata_sp_insur_unpooled, 
            'pooled': idata_sp_insur_pooled, 
            'mixed': idata_sp_insur_mixed})

az.plot_compare(compare_df)
compare_df
rank elpd_loo p_loo elpd_diff weight se dse warning scale
mixed 0 132.942577 61.070097 0.000000 1.000000e+00 1.412555 0.000000 True log
unpooled 1 75.837950 75.540595 57.104626 1.081351e-10 1.624247 1.997727 True log
pooled 2 -9.212462 5.872277 142.155039 0.000000e+00 5.359659 5.071260 False log

Plot the Posterior Predictive Checks

We can check how well the model can recapture the observed data.

def plot_ppc_splines(idata):
    fig, axs = plt.subplots(5, 2, figsize=(9, 20), sharey=True)
    axs = axs.flatten()
    dev_periods = np.arange(1, 7, 1)
    uniques, unique_codes = pd.factorize(loss_df['year_code'])
    for y, c in zip(unique_codes, range(10)):
        az.plot_hdi(dev_periods, idata['posterior_predictive']['lr'].sel(obs=c), color='firebrick', ax=axs[c], fill_kwargs={'alpha': 0.2}, hdi_prob=.89)
        az.plot_hdi(dev_periods, idata['posterior_predictive']['lr'].sel(obs=c), color='firebrick', ax=axs[c], hdi_prob=0.5)
        axs[c].scatter(dev_periods, loss_df[(loss_df['year_code'] == y)]['lr'], color='k', label='Actual Loss Ratio')
        axs[c].plot(dev_periods, loss_df[(loss_df['year_code'] == y)]['lr'], color='k')
        axs[c].set_title(f"PPC 89% and 50% HDI: {y}")
        axs[c].set_ylabel("Loss Ratio")
        axs[c].legend();

plot_ppc_splines(idata_sp_insur_mixed)

Plot the Hierarchical Components

In the following plot we show similarly how to recapture the observed data, but additionally we decompose the structure of the model in each case and extract baseline forecasts which would be our guide to future loss-ratio development curves in lieu of any other information.

mu_g = idata_sp_insur_mixed.posterior.stack(draws=("chain", "draw"))["mu_g"]

mu_i = idata_sp_insur_mixed.posterior.stack(draws=("chain", "draw"))["mui"]

mu  = idata_sp_insur_mixed.posterior.stack(draws=("chain", "draw"))['mu']

beta_grcode = idata_sp_insur_mixed.posterior.stack(draws=("chain", "draw"))['beta_grcode']

dev_periods = np.arange(1, 7, 1)
uniques, unique_codes = pd.factorize(loss_df['year_code'])

mosaic = """
         AB
         CD
         EF
         GH
         IJ
         KK
"""
fig, axs = plt.subplot_mosaic(mosaic, sharey=True, 
figsize=(9, 20))
axs = [axs[k] for k in axs.keys()] 

mu_g_mean = mu_g.mean(dim='draws')
for y, c in zip(unique_codes, range(10)):
    group_effect = 0
    if '43' in y: 
        group_effect = beta_grcode.mean().item()
    mu_i_mean = mu_i.sel(years=y).mean(dim='draws')
    axs[c].plot(dev_periods, group_effect + mu_g_mean.values + mu_i_mean.values, label='Combined + E(grp_effect)', color='purple', linewidth=3.5)
    axs[c].plot(dev_periods, group_effect + mu_g_mean.values, label='E(Hierarchical Baseline)', color='red', linestyle='--')
    axs[c].plot(dev_periods,  group_effect + mu_i_mean.values, label='E(Year Specific Adjustment term)', color='blue', linestyle='--')
    axs[c].scatter(dev_periods, loss_df[(loss_df['year_code'] == y)]['lr'], color='k', label='Actual Loss Ratio')
    az.plot_hdi(dev_periods,mu.sel(obs=c).T  , ax=axs[c], color='firebrick', fill_kwargs={'alpha': 0.2})
    az.plot_hdi(dev_periods, mu.sel(obs=c).T , ax=axs[c], color='firebrick', fill_kwargs={'alpha': 0.5}, hdi_prob=.50)
    axs[c].set_title(f"Components for Year {y}")
    axs[c].set_ylabel("Loss Ratio")
    if (c == 0) or (c == 1):
         axs[c].legend()

axs[10].plot(dev_periods, mu_g_mean.values, label='E(Hierarchical Baseline)', color='black')
axs[10].plot(dev_periods, mu_g_mean.values + group_effect, label='E(Hierarchical Baseline) + E(grp_effect)', color='black', linestyle='--')
az.plot_hdi(dev_periods, mu_g.T.values, color='slateblue', ax=axs[10], fill_kwargs={'alpha': 0.2})
az.plot_hdi(dev_periods, mu_g.T.values + group_effect, color='magenta', ax=axs[10], fill_kwargs={'alpha': 0.2})
az.plot_hdi(dev_periods, mu_g.T.values, color='slateblue', ax=axs[10], hdi_prob=.5)
az.plot_hdi(dev_periods, mu_g.T.values  + group_effect, color='magenta', ax=axs[10], hdi_prob=.5)
axs[10].set_title("Baseline Forecast Loss Ratio")
axs[10].legend();

This plot is a bit clunky because we’re mixing expectations and posterior distributions over the parameters. The point is just to highlight the “compositional” structure of our model. A better way to interrogate the implications of the model is to “push” different data forward through the posterior predictive distribution and derive a kind of ceteris paribus rule for the accrual of losses.

with sp_insur_mixed: 
    pm.set_data({'grcode': np.ones(len(loss_df)), 
    })
    idata_43 = pm.sample_posterior_predictive(idata_sp_insur_mixed, var_names=['lr'], extend_inferencedata =True)

with sp_insur_mixed: 
    pm.set_data({'grcode': np.zeros(len(loss_df))})
    idata_353 = pm.sample_posterior_predictive(idata_sp_insur_mixed, var_names=['lr'], extend_inferencedata=True)

idata_353
Sampling: [lr]
Sampling: [lr]
arviz.InferenceData with the usual groups: a posterior over 4 chains × 2000 draws holding the spline weights (beta_g, beta), the covariate effects (beta_prem, beta_grcode), the scale parameters (tau, sigma_i, sigma) and the derived means (mu_g, mui, mu); posterior-predictive draws of lr for the 60 observations; the sampler statistics; the observed lr values; and the constant data containing the spline bases Bg and Bi plus the prem and grcode covariates.
Even here, though, we want to average the curves over the specific years in the data and abstract a view of the model’s implications under different counterfactual settings. Put another way, we want a view of the average development curve between years, not within a year. Here we define a helper function to effect this step.

def get_posterior_predictive_curve(idata, prem=2, grcode=1):
    ## Weight the individual-year spline basis (Bi) by the first 2,000 stacked posterior draws of beta
    weighted_splines = [np.dot(np.asfortranarray(Bi), az.extract(idata['posterior']['beta'])['beta'].values[:, :, i]) for i in range(2000)]

    ## Weight the group-level spline basis (Bg) by the corresponding draws of beta_g
    weighted_splines_1 = [np.dot(np.asfortranarray(Bg), az.extract(idata['posterior']['beta_g'])['beta_g'].values[:, i]) for i in range(2000)]

    beta_grcode = az.extract(idata['posterior']['beta_grcode'])['beta_grcode']

    beta_prem = az.extract(idata['posterior']['beta_prem'])['beta_prem']
    df1 = pd.DataFrame([beta_prem.values[i]*prem + beta_grcode.values[i]*grcode for i in range(2000)]).T


    ## Crucial step: collapse the year dimension to get a view of the
    ## development period between years rather than within a year.
    ## One option is to average over the years for each draw:
    #df = pd.concat([pd.DataFrame(weighted_splines_1[i].T + weighted_splines[i].T).mean() for i in range(1000)], axis=1)

    ## Here we instead sample a random year for each posterior draw,
    ## which propagates the between-year variation into the curve
    df = pd.concat([pd.DataFrame((weighted_splines_1[i].T + weighted_splines[i].T)[np.random.choice(list(range(10))), :]) for i in range(2000)], axis=1)

    ## Or average a random subset of the years per draw
    #df = pd.concat([pd.DataFrame((weighted_splines_1[i].T + weighted_splines[i].T)[np.random.choice(list(range(10)), 5), :]).mean() for i in range(2000)], axis=1)

    ## Shift each curve by the premium and group-code effects
    df = df1.iloc[0].values + df

    return df

pred_df_1 = get_posterior_predictive_curve(idata_43, prem=1, grcode=1)
pred_df_0 = get_posterior_predictive_curve(idata_353, prem=1, grcode=0) 

fig, ax = plt.subplots(figsize=(9, 7), sharey=True)

az.plot_hdi(range(6), pred_df_0.values.T, ax=ax, color='slateblue', fill_kwargs={'alpha': 0.2})
az.plot_hdi(range(6), pred_df_1.values.T, ax=ax, color='firebrick', fill_kwargs={'alpha': 0.2})
az.plot_hdi(range(6), pred_df_0.values.T, ax=ax, color='slateblue', fill_kwargs={'alpha': 0.5}, hdi_prob=.5)
az.plot_hdi(range(6), pred_df_1.values.T, ax=ax, color='firebrick', fill_kwargs={'alpha': 0.5}, hdi_prob=.5)
ax.plot(pred_df_0.mean(axis=1), linestyle='-', color='k', linewidth=4, label='grcode 353 prem 1')

ax.plot(pred_df_1.mean(axis=1), linestyle='--', color='grey', linewidth=4, label='grcode 43 prem 1')

ax.set_title("Posterior Samples of the Trajectories \n Under different Counterfactual settings")
ax.set_ylabel("Loss Ratio")
ax.set_xlabel("Development Period")
ax.legend();
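
As a rough numeric companion to the plot, we can difference the sampled trajectories at the final development period. This is only indicative: the helper samples a random accident year per posterior draw, so the paired differences mix between-year variation with the grcode effect.

# Posterior draws of the final-development-period loss ratio under each setting
final_43 = pred_df_1.iloc[-1].values
final_353 = pred_df_0.iloc[-1].values
gap = final_43 - final_353

# Mean gap and a rough 94% interval
print(np.round(gap.mean(), 3), np.round(np.quantile(gap, [0.03, 0.97]), 3))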

Multiple Smoother Regression Models

To demonstrate how spline modelling can be adapted to the multiple-regression setting, we use the PISA data set discussed in this case study from Michael Clark.

The data set has been constructed using average Science scores by country from the Programme for International Student Assessment (PISA) 2006, along with GNI per capita (Purchasing Power Parity, 2005 dollars), Educational Index, Health Index, and Human Development Index from UN data. We want to model the overall outcome score as a function of these broad demographic features.

pisa_df = pd.read_csv("https://raw.githubusercontent.com/m-clark/generalized-additive-models/master/data/pisasci2006.csv")

pisa_df.head()
Country Overall Issues Explain Evidence Interest Support Income Health Edu HDI
0 Albania NaN NaN NaN NaN NaN NaN 0.599 0.886 0.716 0.724
1 Argentina 391.0 395.0 386.0 385.0 567.0 506.0 0.678 0.868 0.786 0.773
2 Australia 527.0 535.0 520.0 531.0 465.0 487.0 0.826 0.965 0.978 0.920
3 Austria 511.0 505.0 516.0 505.0 507.0 515.0 0.835 0.944 0.824 0.866
4 Azerbaijan 382.0 353.0 412.0 344.0 612.0 542.0 0.566 0.780 NaN NaN

The relationships between these measures are not obviously linear, and so could plausibly benefit from being modelled with splines.

g = sns.pairplot(pisa_df[['Overall', 'Income', 'Support', 'Health', 'Edu']],  kind="reg")
g.fig.suptitle("Pair Plot of Complex Relations", y=1.05);

We define three models to contrast the implications. Note how we have to define a separate spline basis for each of the covariates, so we first create the knots that anchor the basis on each one (a short sketch of what such a basis looks like follows the knot definitions).

knots_income = np.linspace(np.min(pisa_df['Income']), np.max(pisa_df['Income']), 5+2)[1:-1]

knots_edu = np.linspace(np.min(pisa_df['Edu']), np.max(pisa_df['Edu']), 5+2)[1:-1]

knots_health = np.linspace(np.min(pisa_df['Health']), np.max(pisa_df['Health']), 5+2)[1:-1]

knots_income1 = np.linspace(np.min(pisa_df['Income']), np.max(pisa_df['Income']), 3+2)[1:-1]

knots_edu1 = np.linspace(np.min(pisa_df['Edu']), np.max(pisa_df['Edu']), 3+2)[1:-1]

knots_health1 = np.linspace(np.min(pisa_df['Health']), np.max(pisa_df['Health']), 3+2)[1:-1]
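
To make the basis construction concrete, here is a minimal sketch of the cubic B-spline basis implied by the Income knots, assuming patsy is available. Bambi’s formula backend applies its own, broadly comparable bs() transform, so the exact numerical conventions may differ slightly; the point is just the shape of the design matrix.

from patsy import dmatrix

# Cubic B-spline basis for Income over the five interior knots defined above.
# With no intercept column this gives len(knots) + degree = 5 + 3 = 8 basis functions,
# matching the eight bs(Income, ...) coefficients in the spline model summary below.
B_income = dmatrix(
    "bs(Income, degree=3, knots=knots_income) - 1",
    data=pisa_df.dropna(),  # the same complete-case rows the Bambi models are fit on
    return_type="dataframe",
)
B_income.shape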

Now we initialise the three models.

formula = "Overall ~ Income + Edu + Health"
base_model = bmb.Model(formula, pisa_df, dropna=True)

formula_spline = """Overall ~ bs(Income, degree=3, knots=knots_income) + bs(Edu, degree=3, knots=knots_edu) + bs(Health, degree=3, knots=knots_health) """

formula_spline1 = """Overall ~ bs(Income, degree=3, knots=knots_income1) + bs(Edu, degree=3, knots=knots_edu1) + bs(Health, degree=3, knots=knots_health1) + (1 | Country) """

spline_model = bmb.Model(formula_spline, pisa_df, dropna=True)

spline_model1 = bmb.Model(formula_spline1, pisa_df, dropna=True)

spline_model1
Automatically removing 13/65 rows from the dataset.
Automatically removing 13/65 rows from the dataset.
Automatically removing 13/65 rows from the dataset.
       Formula: Overall ~ bs(Income, degree=3, knots=knots_income1) + bs(Edu, degree=3, knots=knots_edu1) + bs(Health, degree=3, knots=knots_health1) + (1 | Country) 
        Family: gaussian
          Link: mu = identity
  Observations: 52
        Priors: 
    target = mu
        Common-level effects
            Intercept ~ Normal(mu: 471.1538, sigma: 517.9979)
            bs(Income, degree=3, knots=knots_income1) ~ Normal(mu: [0. 0. 0. 0. 0. 0.], sigma: [1630.0781
                872.587   532.485   674.4864  570.5563  779.4582])
            bs(Edu, degree=3, knots=knots_edu1) ~ Normal(mu: [0. 0. 0. 0. 0. 0.], sigma: [965.2606 734.261
                559.679  590.0573 684.2515 649.6795])
            bs(Health, degree=3, knots=knots_health1) ~ Normal(mu: [0. 0. 0. 0. 0. 0.], sigma: [1842.3891
                815.7797  534.7546  671.1905  544.161   802.929 ])
        
        
        Group-level effects
            1|Country ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 517.9979))
        
        Auxiliary parameters
            sigma ~ HalfStudentT(nu: 4.0, sigma: 53.4654)
spline_model.build()
spline_model.graph()

base_idata = base_model.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True})
spline_idata = spline_model.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True}, target_accept=.95)
spline_idata1 = spline_model1.fit(random_seed=random_seed, idata_kwargs={"log_likelihood": True}, target_accept=.99)

We can compare the simple regression approach to the spline-based regressions in the now usual way.

compare_df = az.compare({'spline': spline_idata, 'raw': base_idata, 'spline_hierarchy':spline_idata1})

compare_df
rank elpd_loo p_loo elpd_diff weight se dse warning scale
spline_hierarchy 0 -244.691895 38.487159 0.000000 1.000000e+00 3.901855 0.000000 True log
spline 1 -254.101306 20.218465 9.409411 0.000000e+00 4.724572 2.840914 True log
raw 2 -263.061455 9.617888 18.369559 1.054712e-13 10.083921 9.376156 True log
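
The same ranking can be eyeballed directly; a one-line sketch using ArviZ’s plot_compare on the comparison table (the figure size is an arbitrary choice):

# Visualise the ELPD ranking and standard errors from az.compare
az.plot_compare(compare_df, figsize=(8, 4));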

Comparing the coefficients is harder

az.summary(base_idata)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
Intercept 118.438 81.996 -41.166 266.033 1.522 1.086 2915.0 2986.0 1.0
Income 181.955 89.547 17.413 353.598 1.913 1.353 2196.0 2432.0 1.0
Edu 234.125 59.153 120.739 340.343 1.017 0.740 3414.0 2518.0 1.0
Health 30.395 140.966 -223.699 298.258 2.852 2.153 2450.0 2598.0 1.0
Overall_sigma 34.234 3.620 27.882 41.175 0.060 0.043 3757.0 2879.0 1.0

since it’s less clear what the individual spline coefficient terms mean. We make them more interpretable below by projecting them back through their spline bases.

az.summary(spline_idata)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
Intercept 225.689 205.515 -133.222 630.021 6.036 4.292 1163.0 1653.0 1.0
bs(Income, degree=3, knots=knots_income)[0] -355.159 1613.343 -3460.788 2642.535 56.357 39.865 819.0 1504.0 1.0
bs(Income, degree=3, knots=knots_income)[1] 194.905 122.719 -42.478 423.960 2.343 1.728 2748.0 2719.0 1.0
bs(Income, degree=3, knots=knots_income)[2] 48.748 123.638 -182.012 272.863 4.065 2.875 923.0 1431.0 1.0
bs(Income, degree=3, knots=knots_income)[3] 68.533 98.242 -118.858 250.496 3.369 2.383 853.0 1416.0 1.0
bs(Income, degree=3, knots=knots_income)[4] 210.290 103.892 17.347 406.753 3.626 2.565 825.0 1404.0 1.0
bs(Income, degree=3, knots=knots_income)[5] 222.546 105.553 31.163 427.878 3.589 2.539 866.0 1442.0 1.0
bs(Income, degree=3, knots=knots_income)[6] 202.395 105.402 0.453 394.110 3.629 2.567 846.0 1507.0 1.0
bs(Income, degree=3, knots=knots_income)[7] 81.592 103.313 -106.154 279.937 3.641 2.575 805.0 1390.0 1.0
bs(Edu, degree=3, knots=knots_edu)[0] 145.558 287.518 -410.399 654.146 8.525 6.029 1142.0 1509.0 1.0
bs(Edu, degree=3, knots=knots_edu)[1] 47.574 193.029 -323.504 392.406 5.595 3.957 1197.0 1656.0 1.0
bs(Edu, degree=3, knots=knots_edu)[2] 71.806 211.848 -312.889 469.050 6.246 4.417 1156.0 1647.0 1.0
bs(Edu, degree=3, knots=knots_edu)[3] 147.390 200.622 -224.182 518.214 5.972 4.224 1134.0 1603.0 1.0
bs(Edu, degree=3, knots=knots_edu)[4] 70.472 203.119 -314.815 427.662 6.018 4.256 1144.0 1535.0 1.0
bs(Edu, degree=3, knots=knots_edu)[5] 121.297 202.750 -264.164 485.547 5.998 4.243 1147.0 1575.0 1.0
bs(Edu, degree=3, knots=knots_edu)[6] 78.372 206.089 -296.582 457.669 6.012 4.252 1178.0 1713.0 1.0
bs(Edu, degree=3, knots=knots_edu)[7] 124.157 203.070 -258.062 490.727 6.008 4.249 1148.0 1532.0 1.0
bs(Health, degree=3, knots=knots_health)[0] 235.831 885.656 -1490.479 1862.011 31.104 22.002 813.0 1482.0 1.0
bs(Health, degree=3, knots=knots_health)[1] 162.805 389.636 -552.198 913.839 7.756 6.545 2529.0 2422.0 1.0
bs(Health, degree=3, knots=knots_health)[2] 14.967 120.574 -202.130 248.327 3.854 2.726 979.0 1657.0 1.0
bs(Health, degree=3, knots=knots_health)[3] 25.213 100.498 -170.867 204.686 3.411 2.413 871.0 1555.0 1.0
bs(Health, degree=3, knots=knots_health)[4] -21.315 102.473 -223.005 158.665 3.577 2.530 822.0 1468.0 1.0
bs(Health, degree=3, knots=knots_health)[5] -19.449 101.418 -213.704 165.809 3.541 2.505 823.0 1444.0 1.0
bs(Health, degree=3, knots=knots_health)[6] -55.820 100.895 -252.216 126.193 3.547 2.509 811.0 1347.0 1.0
bs(Health, degree=3, knots=knots_health)[7] 0.840 101.892 -196.002 186.216 3.543 2.506 828.0 1469.0 1.0
Overall_sigma 24.766 3.303 18.816 30.884 0.085 0.060 1491.0 2068.0 1.0

We can also check how well each model recovers the observed data.

Plotting Posterior Predictive Checks

base_model.predict(base_idata, kind='pps')
spline_model.predict(spline_idata, kind='pps')
spline_model1.predict(spline_idata1, kind='pps')


fig, axs = plt.subplots(3, 1, figsize=(8, 15))
axs = axs.flatten()
az.plot_ppc(base_idata, ax=axs[0])
az.plot_ppc(spline_idata, ax=axs[1]);
az.plot_ppc(spline_idata1, ax=axs[2]);
axs[0].set_xlabel('')
axs[1].set_title("PPC: Spline Model");
axs[0].set_title("PPC: Regression Model");
axs[2].set_title("PPC: Hierarchical Spline Model");

Next we’ll dig into the spline basis features, decompose the predicted outcome, and show how it varies with the input variables. Bambi’s interpret module offers functionality for assessing conditional predictions across varying values of those inputs.

Model Interpretability Plots in Bambi

In particular we want to highlight the differences between the hierarchical and non-hierarchical multivariable spline models. The shrinkage effect of hierarchically modelling country intercepts is clearly evident and constrains the poor extrapolation behaviour of the simpler spline model.

fig, axs = plt.subplots(3, 2, figsize=(9, 14))
axs = axs.flatten()
bmb.interpret.plot_predictions(
    spline_model, spline_idata, "Income", ax=axs[0]
)

axs[0].set_title("Non-Hierarchical Income")
axs[2].set_title("Non-Hierarchical Edu")
axs[4].set_title("Non-Hierarchical Health")

axs[1].set_title("Hierarchical Income")
axs[3].set_title("Hierarchical Edu")
axs[5].set_title("Hierarchical Health")


bmb.interpret.plot_predictions(
    spline_model, spline_idata, "Edu", ax=axs[2]
)

bmb.interpret.plot_predictions(
    spline_model, spline_idata, "Health", ax=axs[4]
);

bmb.interpret.plot_predictions(
    spline_model1, spline_idata1, "Income", 
    sample_new_groups=True, ax=axs[1]
)

bmb.interpret.plot_predictions(
    spline_model1, spline_idata1, "Edu", sample_new_groups=True, ax=axs[3]
)

bmb.interpret.plot_predictions(
    spline_model1, spline_idata1, "Health",sample_new_groups=True, ax=axs[5]
);

We can pull these kinds of values out into a table. Note how we ask for predictions over varying values of the Edu and Income variables while keeping the Health variable fixed at its mean value.

summary_df = bmb.interpret.predictions(
    spline_model,
    spline_idata,
    conditional={
        "Edu": list(np.linspace(0.5, .98, 10)),
        "Income": list(np.linspace(0.4, .98, 10)),
        },
)
summary_df


summary_df1 = bmb.interpret.predictions(
    spline_model1,
    spline_idata1,
    conditional={
        "Edu": list(np.linspace(0.5, .98, 10)),
        "Income": list(np.linspace(0.4, .98, 10)),
        },
    sample_new_groups=True
)
summary_df1
Edu Income Country Health estimate lower_3.0% upper_97.0%
0 0.50 0.400000 Albania 0.885672 102.194259 -585.226936 756.059580
1 0.50 0.464444 Albania 0.885672 453.513671 -16.308350 927.170592
2 0.50 0.528889 Albania 0.885672 361.060621 -205.392062 932.651476
3 0.50 0.593333 Albania 0.885672 278.760394 -435.335249 858.714421
4 0.50 0.657778 Albania 0.885672 297.051995 -383.341065 866.221271
... ... ... ... ... ... ... ...
95 0.98 0.722222 Albania 0.885672 490.100671 443.246687 531.212584
96 0.98 0.786667 Albania 0.885672 545.999095 494.369197 593.351721
97 0.98 0.851111 Albania 0.885672 563.221217 503.657229 613.498251
98 0.98 0.915556 Albania 0.885672 430.521364 370.907881 499.685228
99 0.98 0.980000 Albania 0.885672 -10.062307 -359.500015 385.556273

100 rows × 7 columns

We can then plot these results based on the conditional realisations and see some interesting behaviour at the implausible realisations. This should make us somewhat wary of using this model to extrapolate too far beyond the observable range of the data.

g = sns.relplot(data=summary_df, x="Income", y="estimate", hue="Edu")

g.fig.suptitle("Marginal Predictions of the Outcome Variable \n conditional on counterfactual values for Edu and Income", y=1.05);


g = sns.relplot(data=summary_df1, x="Income", y="estimate", hue="Edu")

g.fig.suptitle("Hierarchical Marginal Predictions of the Outcome Variable \n conditional on counterfactual values for Edu and Income", y=1.05);

We can see here how the hierarchical component shrinks the predicted values back towards a reasonable range, constraining the poor extrapolation of the more naive spline model. This demonstrates something of the interplay between in-sample approximation and out-of-sample generalisation that we saw above in the case of the insurance loss development curves.

The Spline Component Contributions

Finally, we’ll pull out the component contributions of each variable and see how they combine additively. This also serves as a kind of posterior predictive check for each country, since we can show the degree to which posterior draws from each component sum to a plausible mirror of the observed data.

Bincome = spline_model.response_component.design.common['bs(Income, degree=3, knots=knots_income)']

income_coefs = az.extract(spline_idata['posterior']['bs(Income, degree=3, knots=knots_income)'])['bs(Income, degree=3, knots=knots_income)']

Bedu = spline_model.response_component.design.common['bs(Edu, degree=3, knots=knots_edu)']

edu_coefs = az.extract(spline_idata['posterior']['bs(Edu, degree=3, knots=knots_edu)'])['bs(Edu, degree=3, knots=knots_edu)']


Bhealth = spline_model.response_component.design.common['bs(Health, degree=3, knots=knots_health)']

health_coefs = az.extract(spline_idata['posterior']['bs(Health, degree=3, knots=knots_health)'])['bs(Health, degree=3, knots=knots_health)']

income = np.dot(Bincome, income_coefs).T 
edu = np.dot(Bedu, edu_coefs).T
health = np.dot(Bhealth, health_coefs).T

intercept = az.extract(spline_idata['posterior']['Intercept'])['Intercept'].values

fig, ax = plt.subplots(figsize=(10, 7))
for i in range(100):
    if i == 1:
        ax.plot(income[i], label='Income Component', color='red')
        ax.plot(edu[i], label='Edu Component', color='blue')
        ax.plot(health[i], label='Health Component', color='darkgreen')
        ax.plot(intercept[i] + income[i] + edu[i] + health[i], label='Combined Components', color='purple')
    else: 
        ax.plot(income[i], alpha=0.1, color='red')
        ax.plot(edu[i], alpha=0.1, color='blue')
        ax.plot(health[i], alpha=0.1, color='darkgreen')
        ax.plot(intercept[i] + income[i] + edu[i] + health[i], color='purple', alpha=0.3)

ax.scatter(range(len(spline_idata['observed_data']['Overall'])), spline_idata['observed_data']['Overall'], label='Observed', color='grey', s=56, ec='black')
ax.set_title("Additive Spline Components", fontsize=20)
ax.legend()
# Label the x-axis with the country names, setting the tick positions explicitly
# so the labels line up with the 52 complete-case observations
countries = pisa_df.dropna(axis=0).reset_index()['Country']
ax.set_xticks(range(len(countries)))
ax.set_xticklabels(countries, rotation=90, fontsize=7);

We can see here how the combined components inherit the structure of the outcome variable primarily from the income component. The health component contributes close to zero, while the education component is far more uncertain; it acts mainly as a scaling term that shifts the level of the outcome rather than changing its shape.
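
A rough numeric check of this reading is to tabulate the posterior-mean contribution of each additive component across the countries, reusing the arrays computed above (a sketch only; the rounding is arbitrary):

# Posterior-mean partial contribution of each spline component, per country
component_means = pd.DataFrame({
    'Income': income.mean(axis=0),
    'Edu': edu.mean(axis=0),
    'Health': health.mean(axis=0),
})

# Spread of each component's contribution across countries
component_means.describe().round(1)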

Conclusion

We’ve seen the application of splines as univariate smoothers to approximate wiggly curves of arbitrary shape, and we’ve also tried Gaussian process approximations of the same univariate functions. The flexibility of both methods is a strength, but we need to be careful because splines have a tendency to over-fit individual curves. We showed that we can incorporate spline basis modelling in a hierarchical Bayesian model, recover more compelling posterior predictive checks, and derive predictions from the mixed variant of the hierarchical model that help us understand what the data imply for generic forecasts of insurance loss curves. Finally, we showed how splines can be used additively to model non-linear functions of multiple covariates. They are a powerful tool for interrogating complex non-linear relationships, but they offer interpolation within a well-understood domain; appealing to these models for extrapolation needs to be done carefully.

Citation

BibTeX citation:
@online{forde2024,
  author = {Nathaniel Forde},
  title = {GAMs and {GPs:} {Flexibility} and {Calibration}},
  date = {2024-04-07},
  langid = {en}
}
For attribution, please cite this work as:
Nathaniel Forde. 2024. “GAMs and GPs: Flexibility and Calibration.” April 7, 2024.