R Tutorial Part 1

Adapted from the state space modeling activity in Michael Dietze’s excellent Ecological Forecasting book

JAGS needs to be installed (https://sourceforge.net/projects/mcmc-jags/files/), along with the rjags R package

Video Tutorial

Text Tutorial

State space models

  • Time-series model
  • Only first order autoregressive component
  • Separately model
    • the process model - how the system evolves in time or space
    • the observation model - observation error or indirect observations
  • Estimates the true value of the underlying latent state variables
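The separate process and observation models can be illustrated by simulating a toy state space series in base R (all values here are invented for illustration, not taken from the flu data):

```r
## Toy state space simulation: a random walk latent state x,
## observed through Gaussian noise as y
set.seed(42)
n <- 100
x <- numeric(n)
x[1] <- 10
for (t in 2:n) {
  x[t] <- x[t - 1] + rnorm(1, 0, 0.5)   # process model: random walk
}
y <- x + rnorm(n, 0, 1)                 # observation model: Gaussian error
plot(y, type = "p", ylab = "Value")     # noisy observations
lines(x, lwd = 2)                       # true latent state
```

Fitting a state space model amounts to recovering the smooth x curve from the noisy y points.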


gflu = read.csv("https://raw.githubusercontent.com/EcoForecast/EF_Activities/master/data/gflu_data.txt", skip=11)
time = as.Date(gflu$Date)
y = gflu$Florida
plot(time,y,type='l',ylab="Flu Index",lwd=2,log='y')


Draw on board while walking through models

y_t-1    y_t    y_t+1   Observation model
  |       |       |
x_t-1 -> x_t -> x_t+1   Process model

Process model

  • What is actually happening in the system
  • First order autoregressive component

x_t+1 = f(x_t) + e_t

  • Simple linear model is AR1:

x_t+1 = b0 + b1 * x_t + e_t
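As a sanity check on the AR1 form, base R’s arima() can recover b1 from simulated data (the coefficient values below are made up for the sketch):

```r
## Simulate an AR1 series and recover b1 with arima()
set.seed(1)
b0 <- 2; b1 <- 0.8
x <- numeric(500)
x[1] <- b0 / (1 - b1)                    # start near the stationary mean
for (t in 2:500) {
  x[t] <- b0 + b1 * x[t - 1] + rnorm(1)  # x_t+1 = b0 + b1 * x_t + e_t
}
fit <- arima(x, order = c(1, 0, 0))      # AR1 fit
coef(fit)["ar1"]                         # estimate of b1, close to 0.8
## note: arima's "intercept" is the series mean b0/(1-b1), not b0 itself
```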

Observation model

  • Google searches aren’t perfect measures of the number of flu cases (the quantity the process model describes and the thing we actually care about)
  • So model this imperfect observation

y_t = x_t + e_t

  • Can be much more complicated


  • Models like this are not trivial to fit
  • Use JAGS (Just Another Gibbs Sampler)
  • Gibbs samplers are a way of exploring parameter space to fit the model using Bayesian methods.
  • The rjags library lets R call JAGS.
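As a toy illustration of the idea (not part of the flu model), a Gibbs sampler alternates draws from each parameter’s full conditional distribution. Here is a minimal base-R sketch for the mean and precision of Gaussian data, with invented data and conjugate priors:

```r
## Toy Gibbs sampler for the mean (mu) and precision (tau) of Gaussian data
set.seed(7)
y <- rnorm(200, mean = 5, sd = 2)
n <- length(y)
mu <- 0; tau <- 1                        # starting values
mu0 <- 0; t0 <- 1e-4                     # vague normal prior on mu
a <- 0.1; r <- 0.1                       # vague gamma prior on tau
draws <- matrix(NA, 2000, 2, dimnames = list(NULL, c("mu", "tau")))
for (i in 1:2000) {
  ## draw mu from its full conditional (normal)
  prec <- t0 + n * tau
  mu <- rnorm(1, (t0 * mu0 + tau * sum(y)) / prec, sqrt(1 / prec))
  ## draw tau from its full conditional (gamma)
  tau <- rgamma(1, a + n / 2, r + sum((y - mu)^2) / 2)
  draws[i, ] <- c(mu, tau)
}
colMeans(draws[-(1:500), ])              # posterior means after burn-in
```

JAGS automates exactly this kind of conditional sampling for arbitrary models, which is why we only have to write down the model, not the sampler.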

JAGS Model

  • JAGS code to describe the model
  • Store as string in R
  • Three components
    • data/error model
      • relates observed data (y) to latent variable (x)
      • Gaussian obs error
    • process model
      • relates state of the system at t to the state at t-1
      • random walk (x_t = x_t-1 + e_t)
    • priors
  • Bayesian methods need priors, or starting points for model fitting
RandomWalk = "
model{
  #### Data Model
  for (t in 1:n){
    y[t] ~ dnorm(x[t], tau_obs)
  }
  #### Process Model
  for (t in 2:n){
    x[t] ~ dnorm(x[t-1], tau_proc)
  }
  #### Priors
  x[1] ~ dnorm(x_ic, tau_ic)
  tau_obs ~ dgamma(a_obs, r_obs)
  tau_proc ~ dgamma(a_proc, r_proc)
}
"
  • Data and priors as a list
data <- list(y = y, n = length(y),
             x_ic = y[1], tau_ic = 0.01,  ## example weakly informative values
             a_obs = 1, r_obs = 1,        ## for the priors; adjust as needed
             a_proc = 1, r_proc = 1)
  • Starting point of parameters
init <- list(list(tau_proc=1/var(diff(y)),tau_obs=5/var(y)))
  • Normally would want several chains with different starting positions to avoid local minima
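A sketch of dispersed starting points, extending the bootstrap idea used for the single init above (the stand-in y vector here just makes the sketch self-contained; in the script y is the flu index):

```r
## Chain-specific starting values from bootstrap resamples of the data
set.seed(123)
y <- rlnorm(100)                         # stand-in data for the sketch
nchain <- 3
init <- list()
for (i in 1:nchain) {
  y.samp <- sample(y, length(y), replace = TRUE)       # bootstrap resample
  init[[i]] <- list(tau_proc = 1 / var(diff(y.samp)),  # dispersed starts
                    tau_obs = 5 / var(y.samp))
}
```

Passing this list with n.chains = nchain to jags.model would run three chains from different starting positions.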

  • Send to JAGS

j.model   <- jags.model (file = textConnection(RandomWalk),
                         data = data,
                         inits = init,
                         n.chains = 1)
  • Burn in
jags.out   <- coda.samples (model = j.model,
                            variable.names = c("tau_proc","tau_obs"),
                            n.iter = 10000)
  • Sample from MCMC with full vector of X’s
  • This run continues sampling from where the previous coda.samples call ended, so it discards the burn-in samples
jags.out   <- coda.samples (model = j.model,
                            variable.names = c("x","tau_proc","tau_obs"),
                            n.iter = 10000)
  • Visualize
  • Convert the output into a matrix & drop parameters
out <- as.matrix(jags.out)
xs <- out[, 3:ncol(out)]  ## first two columns are tau_obs and tau_proc
  • Point predictions are averages across MCMC samples
predictions <- colMeans(xs)
plot(time, predictions, type = "l")
  • And this looks very similar to the observed dynamics of y

  • Add prediction intervals as range containing 95% of MCMC samples

ci <- apply(xs, 2, quantile, c(0.025, 0.975))
lines(time, ci[1,], lty = "dashed", col = "blue")
lines(time, ci[2,], lty = "dashed", col = "blue")
  • These are very narrow prediction intervals, so the model appears to be very confident
  • But it’s important to keep in mind that when fitting the value of x at time t, the model has access to the value of y at time t
  • When a y value is present it isn’t being estimated; it’s just the observed value
  • So, will this model forecast well?


  • To make forecasts using a JAGS model we include NA values for y
  • This tells the model that we don’t know the values and therefore the model estimates them as part of the fitting process
  • To make a true forecast we would add one NA to the end of y for each time step we wanted to forecast
  • To hindcast or backcast, we replace the values of y that are part of the test set with NA
  • We’ll hindcast, so to do this we’ll replace the last year of y values with NA and then compare the final year of data to our predictions

Make these changes at top of script and rerun

data$y[(length(y)-51):length(y)] = NA
jags.out   <- coda.samples (model = j.model,
                            variable.names = c("y","tau_proc","tau_obs"),
                            n.iter = 10000)
  • We can see from plotting the predictions that the forecast doesn’t look promising
  • Without the observed data to influence the estimates of x[t] the model predicts little change over the forecast year
  • We can directly compare this to the empirical data by adding it to the plot
lines(time, y)
  • So the point estimates don’t perform well
  • This raises the question of whether the model accurately predicts that it is uncertain when making forecasts
  • Plotting the prediction intervals suggests that it does
  • They quickly expand toward zero and the upper limits of the data
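One way to quantify how well the point estimates perform is a simple error score over the held-out year. A minimal RMSE sketch with toy stand-in vectors (the real comparison would use predictions and the withheld y values from the script):

```r
## Root mean squared error between point predictions and held-out observations
predictions <- c(2.0, 2.1, 2.2, 2.1)   # toy stand-ins for illustration
y_holdout   <- c(2.5, 3.0, 2.8, 2.2)
rmse <- sqrt(mean((predictions - y_holdout)^2))
rmse
```

Comparing this RMSE to that of a naive baseline (e.g. repeating the last observed value) would make "doesn’t perform well" concrete.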