R Tutorial

Material heavily influenced by Nicholas J Clark’s excellent course on Ecological forecasting with mvgam and brms

Installation

Check if your installation worked

cmdstan_version()

If this returns a version number like "2.32.2" then things are working properly.

Introduction

So far we’ve developed relatively simple time-series models
Linear time-series dependance
Linear responses to environmental factors
Normally distributed errors
No model of observation error
No interactions between species
Can’t handle missing data
Most of these are violated in ecological systems
So start fitting more complex models

mvgam

Models like this are not trivial to fit
Use STAN
Uses MCMC to explore parameter space to fit the model using Bayesian methods
Typically requires learning a separate language - STAN is it’s own language
This lets you write arbitrarily complex models, but really needs a course in Bayesian methods
So, we’re going to use an R package called mvgam to implement our models
We’re going to use it because it’s the simplest way to make complex time-series model in R

library(mvgam)
library(dplyr)

Data

Data on the population dynamics of the Desert Pocket Mouse

pp_data <- read.csv("content/data/pp_abundance_timeseries.csv")

mvgam requires it’s own specific data format and doesn’t currently work with tsibbles
Requires a time variable be present in the data to index temporal observations starting at 1.
Our newmoonnumber indexes the monthly samples so convert it to a start at 1
Also requires a a series variable, which needs to be a factor
Helps when analyzing multiple time series at once (e.g., multiple species)

pp_data <- read.csv("content/data/pp_abundance_timeseries.csv") |>
  mutate(time = newmoonnumber - min(newmoonnumber) + 1) |>
  mutate(series = as.factor('PP')) |>
  select(time, series, abundance, mintemp, cool_precip)

124 months of data
Reserve 24 for testing

data_train <- filter(pp_data, time <= 100) 
data_test <- filter(pp_data, time > 100)

Simply time-series models in mvgam

mvgam comes with a time-series visualization function like we had in fable

plot_mvgam_series(data = pp_data, y = 'abundance')

Start by building something similar to what we’ve done before with the number of desert pocket mice
Model has an exogenous driver - minimum temperature
An autoregressive component
Gaussian error

$$y_t = c + \beta_1 x_{1,t} + \beta_2 y_{t-1} + \mathcal{N}(0,\sigma^{2})$$

Which can also be written as

$$y_t = \mathcal{N}(\mu,\sigma^{2})$$ $$u_t = c + \beta_1 x_{1,t} + \beta_2 y_{t-1}$$

Fit using the mvgam() function
Follows a Base R model structure so start with the model formula
Instead of including the AR component in the model we add a separate trend_model argument
We’ll use "AR1"
Specify the error family = gaussian()
Then we can specify the data for fitting the model and the data for making/evaluating forecasts

baseline_model = mvgam(abundance ~ mintemp,
                       trend_model = "AR1",
                       family = gaussian(),
                       data = data_train)

Bayesian model fitting

That’s a lot of output for fitting a model
What’s going on?
It is difficult to fit more complex time-series models
One way to fit them is using Bayesian methods
These methods iteratively search parameter space for the best parameter values
Using something called Markov Chain Monte Carlo (MCMC)
Draw 2D parameter search on board
The "Iteration" lines are telling us that the model is working it’s way through this process
The "Warmup" lines are iterations that are used to get in the right parameter space but then thrown away
The "Sampling" lines are iterations that serve as samples for each of the parameters we are fitting
Having multiple samples gives us the uncertainty in the parameters by looking at how variable those values are
The different "chains" are because we typically go through this process multiple times to check than we are converging to the right values
We can look at the result of this fitting process using mcmc_plot

mcmc_plot(baseline_model, type = "trace", variable = c("mintemp", "ar1[1]", "sigma[1]"))

The red lines at the bottom are providing the same information as the warnings when we fit the model
Something isn’t quite right
The model isn’t converged yet
We could try running it longer
But part of what’s going on here is that we’re using a poorly specified model given the data because we’ve assumed Gaussian errors
So let’s look at the model & forecast and then keep going
We can look at the model structure using summary()

summary(baseline_model)

If we plot the model we’ll get some diagnostic plots

plot(baseline_model)

The residuals are normally distributed
And the residual autocorrelation isn’t significant

Bayesian model forecasting

So let’s look at the forecasts
To make forecasts for Bayesian models we include data for y that is NA
This tells the model that we don’t know the values and therefore the model estimates them as part of the fitting process
Handy because it means these models can handle missing data
To make a true forecast one NA is added to the end of y for each time step we want to forecast
To hindcast the values for y that are part of the test set are replaced with NA
We can do this in mvgam using the forecast() function

baseline_forecast = forecast(baseline_model, newdata = data_test)

This is slow because it is refitting the model
We can then plot the forecast

plot(baseline_forecast)

These forecasts look similar to those we generated using fable
Prediction intervals are regularly negative because we’ve assumed normally distributed error
Actual counts can only be non-negative integers: 0, 1, 2…

Modeling count data

To model count data explicitly we need observations that are integers
We can use use Poisson error to accomplish this
Instead of $N(\mu, \sigma^2)$

$$y_t = \mathrm{Pois}(\lambda_t)$$

The Poisson distribution has one parameter $\lambda$
Which is both the mean and the variance
It generates only integer draws based on a mean

rpois(n = 10, lambda = 5)
hist(rpois(n = 1000, lambda = 5))

If the mean is 1.5 sometimes you’ll draw a 1, sometimes a 2, sometimes a 0, etc.
$\lambda_t$ can be a decimal

hist(rpois(n = 1000, lambda = 4.5))

We could expect an average of 4.5 rodents based on the environment even though we can only observe an integer number
But $\lambda_t$ does have to be positive because we can’t reasonably expect to see negative rodents

hist(rpois(n = 1000, lambda = -4.5))

To handle this we use a log link function to give us only positive values of $\mu_t$
The log link means that instead of modeling $\mu_t$ directly we model $log(\mu_t)$

$$y_t = \mathrm{Pois}(\lambda_t)$$ $$\mathrm{log(\lambda_t)} = c + \beta_1 x_{1,t} + \beta_2 y_{t-1}$$

poisson_model = mvgam(abundance ~ mintemp,
                      trend_model = "AR1",
                      family = poisson(link = "log"),
                      data = data_train)

No more warnings at the end
Look at the model

summary(poisson_model)

plot(poisson_model)

Model structure looks OK, be we still have some residual seasonal lag we haven’t captured
Now let’s look at the forecast

poisson_forecast = forecast(poisson_model, newdata = data_test)
plot(poisson_forecast)

Now all of our predictions are positive!
But while the point estimates seem reasonable the uncertainties see really large

Visualizing environmental drivers

As our models get more complicate it’s important to make sure we understand what they are doing
We can visualize the environmental component of the model using type = pterms
pterms is short for “parametric terms”
Which is what we have since we modeled a linear relationship with a fixed slope

plot(poisson_model, type = 'pterms')

This shows a linear relationship
But that is the relationship between log(abundance) and mintemp
So let’s look at the actual relationship

plot_predictions(poisson_model, condition = "mintemp")

Since log(abundance) is linearly related to temperature
This makes the response to untransformed abundance exponential
That doesn’t feel right
Could be part of the reason for our really large upper prediction intervals

Non-linear responses

Linear relationships between abundance and the environment are unlikely to hold over reasonable environmental gradients
Typically we think of species as have some optimal environmental value
With decreasing performance as you move away from that environmental value in both directions
Draw standard performance response curve
We can model using in mvgam by using a Generalized Additive Model that fits a smoother to the relationship

poisson_gam_model = mvgam(abundance ~ s(mintemp),
                          trend_model = "AR1",
                          family = poisson(link = "log"),
                          data = data_train)

Visualizing the environmental component of the model shows that it is now non-linear

plot(poisson_gam_model, type = 'smooths')

It doesn’t have an optimal value, but it does saturate
This means that once it’s warm enough then temperature stops influencing abundance
This is the smooth on log(abundance)
Look at the relationship with raw abundance

plot_predictions(poisson_gam_model, condition = "mintemp")

The GAM now gives us a more realistic response to temperature
When it’s cold abundance is always near zero
Even if gets a little less cold
Abundance then increases with temperature
But starts to asymptote at higher temperatures
How does this influence the forecasts

poisson_gam_forecast = forecast(poisson_gam_model, newdata = data_test)
plot(poisson_gam_forecast)

The extremely wide prediction intervals have been reduced substantially
If we look at the diagnostics

plot(poisson_gam_model)

We can also see that the residual season autocorrelation is improved

State space models

To add an explicit observation model we use the

state_space_model = mvgam(abundance ~ 1,
                          trend_formula = ~ s(mintemp, bs = 'bs', k = 5),
                          trend_model = "AR1",
                          family = poisson(link = "log"),
                          data = data_train)
state_space_forecast = forecast(state_space_model, newdata = data_test)
plot(state_space_forecast)

Some notes on Bayesian forecasting

Since forecasting and model fitting are all part of the same process in Bayesian models
We can also just incorporate the test data directly in the model fitting step

poisson_gam_model = mvgam(abundance ~ s(mintemp, bs = 'bs', k = 5),
                      trend_model = "AR1",
                      family = poisson(link = "log"),
                      data = data_train,
                      newdat = data_test)

And then plot forecasts directly from the model object

plot(poisson_gam_model, type = "forecast")

If you want to know why we’re using mvgam
Here’s what the code to build this model directly in STAN looks like

code(poisson_gam_model)

Evaluation and comparison

To evaluate the forecasts from mvgam we can use the score function

scores <- score(poisson_gam_forecast)

The output shows us two things we’ve seen before
Whether or not each observed value falls within the prediction interval
With a default prediction interval of 90%
And the continuous rank probability score
Other scores are available by changing the optional score argument
We can also change the prediction interval using the interval_width argument

scores <- score(poisson_gam_forecast, interval_width = 0.5)

If we want to calculate the coverage we can sum

sum(scores$PP$in_interval) / nrow(scores$PP)

Last updated on Nov 16, 2023

Edit this page