Bayesian Structural Timeseries - Forecasting

Wordsworth once spoke forlornly of the hope that the “child [could be] the father of the man”. This desire for predictable continuity exists wherever disruption is rife. Continuous trajectories can be smooth everywhere, or suddenly staggered and staccato. No matter how apparently inevitable, each and every timeseries is vulnerable to such structural breaks. Childhood, or the past more generally, was likely more tumultuous than you recall in retrospect. The question is always whether the past predicts the present.

Characterising the Past to Predict the Future

We want to show how Bayesian structural time series with autoregressive components can be modeled in PyMC and used to predict future, unobserved data. How these kinds of models can flexibly incorporate structural assumptions and project future outcomes is only sparsely covered in the PyMC documentation, so hopefully recording the full modeling and prediction loop here is useful for you. The broad idea of the technique is captured in the Wordsworth quote above. Bayesian structural timeseries modeling assumes that the past is a guide to the future: if we can characterise the processes which governed the data-generating process in the past, they should serve as a guide to the future.

We’ll show how to fit and predict autoregressive models in PyMC, and then how to adjust them to capture trend and seasonality when those components complicate the picture.
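
Throughout we’ll lean on a second order autoregressive process with a constant term; in the notation of the simulation below this is

$$ y_t = c + \phi_1 \, y_{t-1} + \phi_2 \, y_{t-2} + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma) $$

where $c$ is the intercept, $\phi_1, \phi_2$ are the lag coefficients and $\epsilon_t$ is Gaussian noise.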

Generate Autoregressive Data

import numpy as np

# Random generator used for the simulated noise (a seeded generator is assumed in the original setup)
rng = np.random.default_rng()


def simulate_ar(intercept, coef1, coef2, noise=0.3, *, warmup=10, steps=200):
    # We sample some extra warmup steps, to let the AR process stabilize
    draws = np.zeros(warmup+steps)
    # Initialize first draws at intercept
    draws[:2] = intercept
    for step in range(2, warmup+steps):
        draws[step] = (
            intercept 
            + coef1 * draws[step-1]
            + coef2 * draws[step-2]
            + rng.normal(0, noise)
        )
    # Discard the warmup draws
    return draws[warmup:]


ar1_data = simulate_ar(10, -0.9, 0)

Autoregressive data

Specifying the Simple Autoregressive Model

We define the function so that it takes a set of priors over the parameters in the model. Crucially, we also set the data inputs t and y to be mutable data objects; this latter step is required to facilitate prediction in the next step. The AR implementation in PyMC requires the specification of the number of steps to project forward. Here we define the AR process as a latent process feeding a Gaussian likelihood, and we make it dependent on the mutable data objects and the number of observations in the data. In this way, when we seek to predict on new data, the AR process steps are projected forward appropriately.

priors = {'coefs': {'mu': 0, 'size': 2},
          'sigma': 1, 
          'init': {'mu': 8, 'sigma': 1, 'size': 1}
          }


def make_latent_AR_model(ar_data, priors, prediction_steps=250, full_sample=True, samples=2000):
    # Create an empty model first so we can attach a mutable coordinate before adding variables
    with pm.Model() as AR:
        pass

    t_data = list(range(len(ar_data)))
    AR.add_coord('obs_id', t_data, mutable=True)

    with AR:
        ## Data containers to enable prediction
        t = pm.MutableData("t", t_data, dims="obs_id")
        y = pm.MutableData("y", ar_data, dims="obs_id")
        # The first coefficient will be the constant term
        coefs = pm.Normal("coefs", priors['coefs']['mu'], size=priors['coefs']['size'])
        sigma = pm.HalfNormal('sigma', priors['sigma'])
        # We need one init variable for each lag, hence size is variable too
        init = pm.Normal.dist(priors['init']['mu'], priors['init']['sigma'], size=priors['init']['size'])
        # Steps of the AR model minus the lags required
        ar1 = pm.AR("ar", coefs, sigma=sigma, init_dist=init,
                    constant=True, steps=t.shape[0]-(priors['coefs']['size']-1))
        
        # The Likelihood
        outcome = pm.Normal('likelihood', mu=ar1, sigma=sigma, observed=y)
        ## Sampling
        idata_ar = pm.sample_prior_predictive()
        if full_sample:
            idata_ar.extend(pm.sampling_jax.sample_blackjax_nuts(samples, random_seed=100, target_accept=0.95))
            idata_ar.extend(pm.sample_posterior_predictive(idata_ar))
        else:
            return idata_ar, None, None

    n = prediction_steps - ar_data.shape[0]

    with AR:
        # update values of predictors
        # pass in nans for the predicted steps
        pm.set_data(new_data={"t": list(range(prediction_steps)),
                              "y": np.concatenate([ar_data, np.array([np.nan for i in range(n)])])},
                    coords={'obs_id': list(range(prediction_steps))})
        # use the updated values and predict outcomes and probabilities:
        idata_preds = pm.sample_posterior_predictive(
            idata_ar, var_names=["likelihood"], predictions=True, random_seed=100
        )

    return idata_ar, idata_preds, AR

We can see the model structure more clearly in the plate notation. One thing to note about the autoregressive implementation in PyMC is that it can take any number of lagged terms, so long as the data contains enough points to accommodate them. Here we’ve only set an AR(1) model with an intercept term, but in principle we can be more flexible.
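
For reference, a minimal sketch of how the model might be fitted and its plate diagram rendered; the variable names here are assumptions, and PyMC's model_to_graphviz does the rendering:

idata_ar, idata_preds, AR = make_latent_AR_model(ar1_data, priors)
pm.model_to_graphviz(AR)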

model_graph

Fitting the model recovers the parameters as defined:
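
A quick way to verify this, sketched here with ArviZ (the exact view used for the figure below is an assumption):

az.summary(idata_ar, var_names=["coefs", "sigma"])
az.plot_trace(idata_ar, var_names=["coefs", "sigma"], figsize=(10, 6))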

params

Once this model has been fitted to the data we can inspect how the model generates a posterior predictive distribution and, further, how it imputes the values for the future states of the process. Credit to Juan Camilo Orduz for the plotting trick I’m using below to illustrate the percentile bands of the predicted distributions.

def plot_model_fits(idata, idata_preds):
    palette = "viridis_r"
    cmap = plt.get_cmap(palette)
    percs = np.linspace(51, 99, 100)
    colors = (percs - np.min(percs)) / (np.max(percs) - np.min(percs))

    fig, axs = plt.subplots(2, 1, sharex=False, figsize=(10, 15))
    axs = axs.flatten()
    for i, p in enumerate(percs[::-1]):
        upper = np.percentile(
            az.extract_dataset(idata, group="posterior_predictive", num_samples=100)[
                "likelihood"
            ], p, axis=1
        )
        lower = np.percentile(
            az.extract_dataset(idata, group="posterior_predictive", num_samples=100)[
                "likelihood"
            ], 100 - p, axis=1
        )
        color_val = colors[i]
        axs[0].fill_between(
            x=idata['constant_data']['t'],
            y1=upper.flatten(),
            y2=lower.flatten(),
            color=cmap(color_val),
            alpha=0.1,
        )

    axs[0].plot(az.extract_dataset(idata, group="posterior_predictive", num_samples=100)["likelihood"].median(axis=1), color='cyan', label='Median Prediction')

    axs[0].scatter(x=idata['constant_data']['t'], y=idata['constant_data']['y'], color="k")
    axs[0].set_title("AutoRegressive Posterior Predictive Fit \n Shaded by Probability Percentiles", fontsize=20)

    for i, p in enumerate(percs[::-1]):
        upper = np.percentile(
            az.extract_dataset(idata_preds, group="predictions", num_samples=100)[
                "likelihood"
            ], p, axis=1
        )
        lower = np.percentile(
            az.extract_dataset(idata_preds, group="predictions", num_samples=100)[
                "likelihood"
            ], 100 - p, axis=1
        )
        color_val = colors[i]
        axs[1].fill_between(
            x=idata_preds['predictions_constant_data']['t'],
            y1=upper.flatten(),
            y2=lower.flatten(),
            color=cmap(color_val),
            alpha=0.1,
        )

    axs[1].plot(
            az.extract_dataset(idata_preds, group="predictions", num_samples=100)[
                "likelihood"
            ].median(axis=1),
            color="cyan"
        )
    axs[1].scatter(x=idata['constant_data']['t'], y=idata['constant_data']['y'], color="k")
    axs[1].set_title("Predictions Plotted", fontsize=20)
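
Calling this helper on the two InferenceData objects returned earlier (assuming the variable names from the fitting sketch above) produces the figures below:

plot_model_fits(idata_ar, idata_preds)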

The posterior predictive distribution seems to fit the observed data well. We plot the posterior predictive distribution with the historic data overlaid, and further plot the same distribution with imputed future steps.

predictions

Adding a Trend to the Data

Next we want to see how these predictions change if we add a trend to the data.

y_t = -0.3 + np.arange(200)*-0.2 + np.random.normal(0, 10, 200)
y_t = y_t  + ar1_data

trend

If we naively try to fit the model above to our data now, we see that it struggles to capture the structure of the data. This can be accommodated if we add a structural trend component to our model. One word of caution: the combination of the autoregressive terms and the trend component needs to be managed carefully, since without carefully chosen priors it can produce explosive forecasts well beyond plausible limits. Fortunately this can be checked using prior and posterior predictive checks.
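
As a rough sketch (the variable names here are assumptions), the misspecified fit shown below comes from simply reusing the first model factory on the trended series:

idata_naive, preds_naive, _ = make_latent_AR_model(y_t, priors)
plot_model_fits(idata_naive, preds_naive)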

misspecified

Specifying the Trend Model

Here we add a linear trend and a saturated version of that trend alongside the autoregressive component. This necessitates additional priors for the $\alpha$ and $\beta$ parameters of the linear trend.
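
Written out, the mean function implemented below combines the components as

$$ \text{trend}_t = \alpha + \beta t, \qquad \text{sat}_t = \frac{1 - e^{-\lambda \, \text{trend}_t}}{1 + e^{-\lambda \, \text{trend}_t}}, \qquad \mu_t = \text{ar}_t + \text{sat}_t \cdot \text{trend}_t $$

with likelihood $y_t \sim N(\mu_t, \sigma)$.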

priors = {'coefs': {'mu': 0, 'sigma': .1, 'size': 2},
          'sigma': 1, 
          'init': {'mu': 0, 'sigma': .1, 'size': 1},
          'alpha': {'mu':0, 'sigma': .1}, 
          'beta': {'mu': -0.5, 'sigma': .1},
          'ar_sigma': {'mu': .1}
          }

import aesara.tensor as at  # pytensor.tensor on more recent PyMC versions


def logistic_saturation(x, lam: float = 0.5):
    """Logistic saturation transformation."""
    return (1 - at.exp(-lam * x)) / (1 + at.exp(-lam * x))

def make_trend_AR_model(ar_data, priors, prediction_steps=250, full_sample=True, samples=2000):
    with pm.Model() as AR:
        pass

    t_data = list(range(len(ar_data)))
    AR.add_coord('obs_id', t_data, mutable=True)

    with AR:
        ## Data containers to enable prediction
        t = pm.MutableData("t", t_data, dims="obs_id")
        y = pm.MutableData("y", ar_data, dims="obs_id")
        # The first coefficient will be the constant term
        coefs = pm.Normal("coefs", priors['coefs']['mu'], priors['coefs']['sigma'], size=priors['coefs']['size'])
        ar_sigma = pm.HalfNormal('ar_sigma', priors['ar_sigma']['mu'])
        sigma = pm.HalfNormal('sigma', priors['sigma'])
        # We need one init variable for each lag, hence size is variable too
        init = pm.Normal.dist(priors['init']['mu'], priors['init']['sigma'], size=priors['init']['size'])
        # Steps of the AR model minus the lags required
        ar1 = pm.AR("ar", coefs, sigma=ar_sigma, init_dist=init,
                    constant=True, steps=t.shape[0]-(priors['coefs']['size']-1))

        ## Priors for the linear trend component
        alpha = pm.Normal('alpha', priors['alpha']['mu'], priors['alpha']['sigma'])
        beta = pm.Normal('beta', priors['beta']['mu'], priors['beta']['sigma'])
        trend = pm.Deterministic('trend', alpha + beta*t, dims='obs_id')

        # Saturation
        lam = pm.Normal('lam', mu=-0.4, sigma=.1)
        saturation = pm.Deterministic('saturation', logistic_saturation(trend, lam), dims='obs_id')
        mu = ar1 + saturation*trend

        outcome = pm.Normal('likelihood', mu=mu, sigma=sigma, observed=y)
        
        ## Sampling
        idata_ar = pm.sample_prior_predictive()
        if full_sample:
            idata_ar.extend(pm.sampling_jax.sample_blackjax_nuts(samples, random_seed=100, target_accept=0.95))
            idata_ar.extend(pm.sample_posterior_predictive(idata_ar))
        else:
            return idata_ar, None, None

    n = prediction_steps - ar_data.shape[0]

    with AR:
        # update values of predictors
        # pass in nans for the predicted steps
        pm.set_data(new_data={"t": list(range(prediction_steps)), 
                            "y": np.concatenate([ar_data, np.array([np.nan for i in range(n)])])}, 
                            coords={'obs_id': list(range(prediction_steps))})
        # use the updated values and predict outcomes and probabilities:
        idata_preds = pm.sample_posterior_predictive(idata_ar,
        var_names=["likelihood"],predictions=True,random_seed=100
        )

    return idata_ar, idata_preds, AR

The structure above can again be seen more clearly with the plate notation:

ar_trend_model

The predictions of this model are far superior to those of the misspecified model above. We correctly recover the directionality of the series.

ar_trend_predictions

We can again validate the performance by comparing the models with a leave-one-out cross-validation methodology:
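
A minimal sketch of that comparison with ArviZ, assuming pointwise log-likelihood values have been stored on both InferenceData objects (for example by sampling with idata_kwargs={"log_likelihood": True}) and that the trend model’s results are held in idata_trend:

compare_df = az.compare({"ar_only": idata_ar, "ar_trend": idata_trend}, ic="loo")
az.plot_compare(compare_df)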

compare_trend

Adding Seasonality to the Data

If we allow that the data exhibits a seasonal component, we can further augment our model by defining a linear combination of sine and cosine functions in the manner of Fourier terms. The linear combination of these should help recover the shape of any regular oscillation we expect in our timeseries.
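
With a period of 7 and $K$ Fourier orders (as in the code below), the seasonal component is

$$ \text{seasonality}_t = \sum_{k=1}^{K} \left[ a_k \sin\!\left(\tfrac{2\pi k t}{7}\right) + b_k \cos\!\left(\tfrac{2\pi k t}{7}\right) \right] $$

where the $a_k, b_k$ weights are the $\beta_{fourier}$ coefficients given priors in the model.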

t_data = list(range(200))
n_order = 10
periods = (np.array(t_data) / 7)

fourier_features = pd.DataFrame(
    {
        f"{func}_order_{order}": getattr(np, func)(2 * np.pi * periods * order)
        for order in range(1, n_order + 1)
        for func in ("sin", "cos")
    }
)


y_t_s = y_t + 20*fourier_features['sin_order_1']

ar_t_s

Specifying the Trend and Seasonal Model

The major adjustment for this model is the manner in which we need to feed in additional data for the Fourier terms in the prediction step. Similar adjustments need to be made for any additional features we might specify in the future. For example, if we added a one-hot encoded feature for day of the week, we’d have to supply the appropriate flags for each day of the week represented in any of the n prediction steps.
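
As a purely hypothetical sketch of that pattern (none of these names appear in the model below), such a day-of-week feature could be built over the full prediction horizon in the same way we extend the Fourier terms:

# Hypothetical day-of-week one-hot feature spanning the prediction horizon
dates = pd.date_range("2022-01-01", periods=250, freq="D")  # 250 = prediction_steps; start date is arbitrary
dow_flags = pd.get_dummies(dates.day_name())
# These flags would then be swapped in via pm.set_data alongside the Fourier terms,
# e.g. pm.set_data({"dow_terms": dow_flags.to_numpy().T, ...})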

priors = {'coefs': {'mu': 0, 'sigma': .1, 'size': 2},
          'sigma': 1, 
          'init': {'mu': 0, 'sigma': .1, 'size': 1},
          'alpha': {'mu':0, 'sigma': .1}, 
          'beta': {'mu': -0.5, 'sigma': .1},
          'ar_sigma': {'mu': .1}, 
          'beta_fourier': {'mu': 0, 'sigma':0.05},
          'lam': {'mu': 0, 'sigma': 0.1}
          }
        

def make_trend_seasonal_AR_model(ar_data, ff, priors, prediction_steps=250, full_sample=True, samples=2000, jax=True):
    with pm.Model() as AR:
        pass
    
    ff = ff.to_numpy().T
    t_data = list(range(len(ar_data)))
    AR.add_coord('obs_id', t_data, mutable=True)
    AR.add_coord('fourier_features', np.arange(len(ff)), mutable=True)

    with AR:
        ## Data containers to enable prediction
        t = pm.MutableData("t", t_data, dims="obs_id")
        y = pm.MutableData("y", ar_data, dims="obs_id")
        # The first coefficient will be the constant term
        sigma = pm.HalfNormal('sigma', priors['sigma'])
        coefs = pm.Normal("coefs", priors['coefs']['mu'], priors['coefs']['sigma'], size=priors['coefs']['size'])
        ar_sigma = pm.HalfNormal('ar_sigma', priors['ar_sigma']['mu'])
        # We need one init variable for each lag, hence size is variable too
        init = pm.Normal.dist(priors['init']['mu'], priors['init']['sigma'], size=priors['init']['size'])
        # Steps of the AR model minus the lags required
        ar1 = pm.AR("ar", coefs, sigma=ar_sigma, init_dist=init,
                    constant=True, steps=t.shape[0]-(priors['coefs']['size']-1))

        ## Priors for the linear trend component
        alpha = pm.Normal('alpha', priors['alpha']['mu'], priors['alpha']['sigma'])
        beta = pm.Normal('beta', priors['beta']['mu'], priors['beta']['sigma'])
        trend = pm.Deterministic('trend', alpha + beta*t, dims='obs_id')

        # Saturation
        lam = pm.Normal('lam', mu=priors['lam']['mu'], sigma=priors['lam']['sigma'])
        saturation = pm.Deterministic('saturation', logistic_saturation(trend, lam), dims='obs_id')

        # Seasonality
        β_fourier = pm.Normal("β_fourier", mu=priors['beta_fourier']['mu'], sigma=priors['beta_fourier']['sigma'], dims="fourier_features")
        fourier_terms = pm.MutableData('fourier_terms', ff)
        seasonality = pm.Deterministic(
            "seasonality", pm.math.dot(β_fourier, fourier_terms), dims='obs_id'
        )

        mu =  saturation*trend + seasonality  + ar1

        outcome = pm.TruncatedNormal('likelihood', mu=mu, sigma=sigma, observed=y, lower=0)
        
        ## Sampling
        idata_ar = pm.sample_prior_predictive()
        if full_sample:
            if jax: 
                idata_ar.extend(pm.sampling_jax.sample_blackjax_nuts(samples, random_seed=100, target_accept=0.99))
            else:
                idata_ar.extend(pm.sample(samples, random_seed=100, target_accept=0.99, cores=1))
            idata_ar.extend(pm.sample_posterior_predictive(idata_ar))
        else:
            return idata_ar, None, None

    n = prediction_steps - ar_data.shape[0]
    n_order = 10
    periods = (np.arange(prediction_steps) / 7)

    fourier_features_new = pd.DataFrame(
        {
            f"{func}_order_{order}": getattr(np, func)(2 * np.pi * periods * order)
            for order in range(1, n_order + 1)
            for func in ("sin", "cos")
        }
    )

    with AR:
        # update values of predictors
        # pass in nans for the predicted steps
        pm.set_data(new_data={"t": list(range(prediction_steps)), 
                             "fourier_terms": fourier_features_new.to_numpy().T,
                             "y": np.concatenate([ar_data, np.array([np.nan for i in range(n)])])}, 
                    coords={'obs_id': list(range(prediction_steps))})
        # use the updated values and predict outcomes and probabilities:
        idata_preds = pm.sample_posterior_predictive(idata_ar,
        var_names=["likelihood"],predictions=True,random_seed=100
        )

    return idata_ar, idata_preds, AR

Again, the model plate representation is a cleaner view of the structure.

ar_trend_seasonal

Comparing the predictions against the model with just the AR and trend components, we see an improvement based on the leave-one-out cross-validation criterion,

compare

although the broad pattern of the projection is not too dissimilar to the forecast of the trend model, due to the already highly volatile nature of our autoregressive process:

ar_trend_seasonal_preds

Conclusion

We’ve shown here how to use PyMC to specify Bayesian structural timeseries models and how incorporating different structural components can result in improved model fit. We’ve seen how to fit, and predict new observations from, complicated combinations of structural components built from the distributions available in PyMC.

However, we’ve not delved into how structural breaks can disrupt any of these components of the model, or how modelling a changepoint or changepoints explicitly can help identify such step changes in the series. We might also have discussed including a Gaussian random walk term if we wanted to allow for changing variance in the distribution over time. We can be more or less opinionated about which components we include in these models, since we’re working with the basic distributional building blocks. There are benefits to using smart-default packages like Prophet, but it’s harder to adapt those defaults to your specific problem. I hope I’ve encouraged you to look into using PyMC; it’s a great tool which ties model building to narrative decisions about the data generating process. That transparency and flexibility makes for more compelling data science.

 