R Tutorial

Material heavily influenced by Nicholas J Clark’s excellent course on Ecological forecasting with mvgam and brms

Installation

  • Check if your installation worked
cmdstan_version()

If this returns a version number like "2.32.2" then things are working properly.

Introduction

  • So far we’ve developed relatively simple time-series models
  • Linear time-series dependance
  • Linear responses to environmental factors
  • Normally distributed errors
  • No model of observation error
  • No interactions between species
  • Can’t handle missing data
  • Most of these are violated in ecological systems
  • So start fitting more complex models

mvgam

  • Models like this are not trivial to fit
  • Use STAN
  • Uses MCMC to explore parameter space to fit the model using Bayesian methods
  • Typically requires learning a separate language - STAN is it’s own language
  • This lets you write arbitrarily complex models, but really needs a course in Bayesian methods
  • So, we’re going to use an R package called mvgam to implement our models
  • We’re going to use it because it’s the simplest way to make complex time-series model in R
library(mvgam)
library(dplyr)

Data

  • Data on the population dynamics of the Desert Pocket Mouse
pp_data <- read.csv("content/data/pp_abundance_timeseries.csv") 
  • mvgam requires it’s own specific data format and doesn’t currently work with tsibbles
  • Requires a time variable be present in the data to index temporal observations starting at 1.
  • Our newmoonnumber indexes the monthly samples so convert it to a start at 1
  • Also requires a a series variable, which needs to be a factor
  • Helps when analyzing multiple time series at once (e.g., multiple species)
pp_data <- read.csv("content/data/pp_abundance_timeseries.csv") |>
  mutate(time = newmoonnumber - min(newmoonnumber) + 1) |>
  mutate(series = as.factor('PP')) |>
  select(time, series, abundance, mintemp, cool_precip)
  • 124 months of data
  • Reserve 24 for testing
data_train <- filter(pp_data, time <= 100) 
data_test <- filter(pp_data, time > 100)

Simply time-series models in mvgam

  • mvgam comes with a time-series visualization function like we had in fable
plot_mvgam_series(data = pp_data, y = 'abundance')
  • Start by building something similar to what we’ve done before with the number of desert pocket mice
  • Model has an exogenous driver - minimum temperature
  • An autoregressive component
  • Gaussian error
$$y_t = c + \beta_1 x_{1,t} + \beta_2 y_{t-1} + \mathcal{N}(0,\sigma^{2})$$
  • Which can also be written as
$$y_t = \mathcal{N}(\mu,\sigma^{2})$$ $$u_t = c + \beta_1 x_{1,t} + \beta_2 y_{t-1}$$
  • Fit using the mvgam() function
  • Follows a Base R model structure so start with the model formula
  • Instead of including the AR component in the model we add a separate trend_model argument
  • We’ll use "AR1"
  • Specify the error family = gaussian()
  • Then we can specify the data for fitting the model and the data for making/evaluating forecasts
baseline_model = mvgam(abundance ~ mintemp,
                       trend_model = "AR1",
                       family = gaussian(),
                       data = data_train)

Bayesian model fitting

  • That’s a lot of output for fitting a model
  • What’s going on?
  • It is difficult to fit more complex time-series models
  • One way to fit them is using Bayesian methods
  • These methods iteratively search parameter space for the best parameter values
  • Using something called Markov Chain Monte Carlo (MCMC)
  • Draw 2D parameter search on board
  • The "Iteration" lines are telling us that the model is working it’s way through this process
  • The "Warmup" lines are iterations that are used to get in the right parameter space but then thrown away
  • The "Sampling" lines are iterations that serve as samples for each of the parameters we are fitting
  • Having multiple samples gives us the uncertainty in the parameters by looking at how variable those values are
  • The different "chains" are because we typically go through this process multiple times to check than we are converging to the right values
  • We can look at the result of this fitting process using mcmc_plot
mcmc_plot(baseline_model, type = "trace", variable = c("mintemp", "ar1[1]", "sigma[1]"))
  • The red lines at the bottom are providing the same information as the warnings when we fit the model

  • Something isn’t quite right

  • The model isn’t converged yet

  • We could try running it longer

  • But part of what’s going on here is that we’re using a poorly specified model given the data because we’ve assumed Gaussian errors

  • So let’s look at the model & forecast and then keep going

  • We can look at the model structure using summary()

summary(baseline_model)
  • If we plot the model we’ll get some diagnostic plots
plot(baseline_model)
  • The residuals are normally distributed
  • And the residual autocorrelation isn’t significant

Bayesian model forecasting

  • So let’s look at the forecasts

  • To make forecasts for Bayesian models we include data for y that is NA

  • This tells the model that we don’t know the values and therefore the model estimates them as part of the fitting process

  • Handy because it means these models can handle missing data

  • To make a true forecast one NA is added to the end of y for each time step we want to forecast

  • To hindcast the values for y that are part of the test set are replaced with NA

  • We can do this in mvgam using the forecast() function

baseline_forecast = forecast(baseline_model, newdata = data_test)
  • This is slow because it is refitting the model
  • We can then plot the forecast
plot(baseline_forecast)
  • These forecasts look similar to those we generated using fable
  • Prediction intervals are regularly negative because we’ve assumed normally distributed error
  • Actual counts can only be non-negative integers: 0, 1, 2…

Modeling count data

  • To model count data explicitly we need observations that are integers
  • We can use use Poisson error to accomplish this
  • Instead of $N(\mu, \sigma^2)$
$$y_t = \mathrm{Pois}(\lambda_t)$$
  • The Poisson distribution has one parameter $\lambda$
  • Which is both the mean and the variance
  • It generates only integer draws based on a mean
rpois(n = 10, lambda = 5)
hist(rpois(n = 1000, lambda = 5))
  • If the mean is 1.5 sometimes you’ll draw a 1, sometimes a 2, sometimes a 0, etc.

  • $\lambda_t$ can be a decimal

hist(rpois(n = 1000, lambda = 4.5))
  • We could expect an average of 4.5 rodents based on the environment even though we can only observe an integer number
  • But $\lambda_t$ does have to be positive because we can’t reasonably expect to see negative rodents
hist(rpois(n = 1000, lambda = -4.5))
  • To handle this we use a log link function to give us only positive values of $\mu_t$
  • The log link means that instead of modeling $\mu_t$ directly we model $log(\mu_t)$
$$y_t = \mathrm{Pois}(\lambda_t)$$ $$\mathrm{log(\lambda_t)} = c + \beta_1 x_{1,t} + \beta_2 y_{t-1}$$
poisson_model = mvgam(abundance ~ mintemp,
                      trend_model = "AR1",
                      family = poisson(link = "log"),
                      data = data_train)
  • No more warnings at the end
  • Look at the model
summary(poisson_model)
plot(poisson_model)
  • Model structure looks OK, be we still have some residual seasonal lag we haven’t captured

  • Now let’s look at the forecast

poisson_forecast = forecast(poisson_model, newdata = data_test)
plot(poisson_forecast)
  • Now all of our predictions are positive!
  • But while the point estimates seem reasonable the uncertainties see really large

Visualizing environmental drivers

  • As our models get more complicate it’s important to make sure we understand what they are doing
  • We can visualize the environmental component of the model using type = pterms
  • pterms is short for “parametric terms”
  • Which is what we have since we modeled a linear relationship with a fixed slope
plot(poisson_model, type = 'pterms')
  • This shows a linear relationship
  • But that is the relationship between log(abundance) and mintemp
  • So let’s look at the actual relationship
plot_predictions(poisson_model, condition = "mintemp")
  • Since log(abundance) is linearly related to temperature
  • This makes the response to untransformed abundance exponential
  • That doesn’t feel right
  • Could be part of the reason for our really large upper prediction intervals

Non-linear responses

  • Linear relationships between abundance and the environment are unlikely to hold over reasonable environmental gradients
  • Typically we think of species as have some optimal environmental value
  • With decreasing performance as you move away from that environmental value in both directions
  • Draw standard performance response curve
  • We can model using in mvgam by using a Generalized Additive Model that fits a smoother to the relationship
poisson_gam_model = mvgam(abundance ~ s(mintemp),
                          trend_model = "AR1",
                          family = poisson(link = "log"),
                          data = data_train)
  • Visualizing the environmental component of the model shows that it is now non-linear
plot(poisson_gam_model, type = 'smooths')
  • It doesn’t have an optimal value, but it does saturate
  • This means that once it’s warm enough then temperature stops influencing abundance
  • This is the smooth on log(abundance)
  • Look at the relationship with raw abundance
plot_predictions(poisson_gam_model, condition = "mintemp")
  • The GAM now gives us a more realistic response to temperature

  • When it’s cold abundance is always near zero

  • Even if gets a little less cold

  • Abundance then increases with temperature

  • But starts to asymptote at higher temperatures

  • How does this influence the forecasts

poisson_gam_forecast = forecast(poisson_gam_model, newdata = data_test)
plot(poisson_gam_forecast)
  • The extremely wide prediction intervals have been reduced substantially
  • If we look at the diagnostics
plot(poisson_gam_model)
  • We can also see that the residual season autocorrelation is improved

State space models

  • To add an explicit observation model we use the
state_space_model = mvgam(abundance ~ 1,
                          trend_formula = ~ s(mintemp, bs = 'bs', k = 5),
                          trend_model = "AR1",
                          family = poisson(link = "log"),
                          data = data_train)
state_space_forecast = forecast(state_space_model, newdata = data_test)
plot(state_space_forecast)

Some notes on Bayesian forecasting

  • Since forecasting and model fitting are all part of the same process in Bayesian models
  • We can also just incorporate the test data directly in the model fitting step
poisson_gam_model = mvgam(abundance ~ s(mintemp, bs = 'bs', k = 5),
                      trend_model = "AR1",
                      family = poisson(link = "log"),
                      data = data_train,
                      newdat = data_test)
  • And then plot forecasts directly from the model object
plot(poisson_gam_model, type = "forecast")
  • If you want to know why we’re using mvgam
  • Here’s what the code to build this model directly in STAN looks like
code(poisson_gam_model)

Evaluation and comparison

  • To evaluate the forecasts from mvgam we can use the score function
scores <- score(poisson_gam_forecast)
  • The output shows us two things we’ve seen before
  • Whether or not each observed value falls within the prediction interval
  • With a default prediction interval of 90%
  • And the continuous rank probability score
  • Other scores are available by changing the optional score argument
  • We can also change the prediction interval using the interval_width argument
scores <- score(poisson_gam_forecast, interval_width = 0.5)
  • If we want to calculate the coverage we can sum
sum(scores$PP$in_interval) / nrow(scores$PP)
Previous