# R Tutorial

Material influenced by the state space modeling activity from Michael Dietz’s excellent Ecological Forecasting book and Nicholas J Clark’s course on Ecological forecasting with mvgam and brms*

## Installation

### Pre-installation Windows

Before starting installation on Windows you will need to install RTools.

If you don’t already have it installed:

- Follow the link
- Select your appropriate version of R
- Download the installer
- Run it with the defaults

### Installation

```
install.packages(c("brms", "dplyr", "gratia", "ggplot2",
"marginaleffects", "tidybayes", "zoo",
"viridis", "remotes"))
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
remotes::install_github('nicholasjclark/mvgam', force = TRUE)
```

#### macOS and Linux

```
library(cmdstanr)
check_cmdstan_toolchain()
install_cmdstan()
cmdstan_version()
```

#### Windows

```
library(cmdstanr)
check_cmdstan_toolchain(fix = TRUE)
install_cmdstan()
cmdstan_version()
```

If these returns a version number like `"2.32.2"`

then things are working properly.

## Text Tutorial

- So far we’ve developed relatively simple time-series models
- Linear time-series dependance
- Linear responses to environmental factors
- Normally distributed errors
- No model of observation error
- Most of these are violated in ecological systems
- So start fitting more complex models

### Data

- Data on the population dynamics of the Desert Pocket Mouse

### Model

Draw on board while walking through models

```
y_t-1 y_t y_t+1
| | |
x_t-1 -> x_t -> x_t+1 Process model
```

#### Process model

- What is actually happening in the system
- First order autoregressive component

x_t+1 = f(x_t) + e_t

- Simple linear model is AR1:

x_t+1 = b0 + b1 * x_t + e_t

#### Observation model

- Counts of rodents in traps aren’t perfect measures of the number of rodents (which are what should be changing in the process model and what we care about)
- So model this imperfect observation

y_t = Pois(x_t)

- Can be much more complicated

### mvgam

- Models like this are not trivial to fit
- Use [STAN](http://mcmc-jags.sourceforge.net)
- Uses MCMC to explore parameter space to fit the model using Bayesian methods
- Typically requires learning a separate language - STAN is it’s own language
- This lets you write arbitrarily complex models, but really needs a course in Bayesian methods
- So, we’re going to use an R package called
`mvgam`

to implement our models - We’re going to use it because it’s the simplest way to make complex time-series model in R

```
library(mvgam)
library(dplyr)
```

- mvgam requires that we modify our data a bit
- Requires a
`time`

variable be present in the data to index temporal observations starting at 1. - Our
`newmoonnumber`

indexes the monthly samples so convert it to a start at 1 - Also requires a a `series`` variable, which needs to be a factor
- Needed for analyzing multiple time series at once (e.g., multiple species)

```
pp_data <- read.csv("content/data/pp_abundance_timeseries.csv") |>
mutate(time = newmoonnumber - (min(newmoonnumber)) + 1) |>
mutate(series = as.factor('PP')) |>
select(time, series, abundance, mintemp, cool_precip)
```

- We have 124 months of data so reserve 24 for testing

```
data_train <- filter(pp_data, time <= 100)
data_test <- filter(pp_data, time > 100)
```

### Simply time-series models in mvgam

- Start by building something similar to what we’ve done before with the number of desert pocket mice
- Model has an exogenous driver - minimum temperature
- An autoregressive component
- Gaussian error
- Fit using the
`mvgam()`

function - Follows a Base R model structure so start with the model
- Instead of including the AR component in the model we add a separate
`trend_model`

argument - We’ll use
`"AR1"`

- Specify the error
`family = gaussian()`

- Then we can specify the data for fitting the model and the data for making/evaluating forecasts

```
baseline_model = mvgam(abundance ~ mintemp,
trend_model = "AR1",
family = gaussian(),
data = data_train,
newdata = data_test)
```

#### Bayesian model fitting

- That’s a lot of output for fitting a model
- What’s going on?
- It is difficult to fit complex models like the state space models we’re building towards
- One way to fit these more complex models is using Bayesian methods
- These methods iteratively search parameter space for the best parameter values
- Using something called Markov Chain Monte Carlo (MCMC)
*Draw 2D parameter search on board*- The
`"Iteration"`

lines are telling us that the model is working it’s way through this process - The different “chains” are because we typically go through this process multiple times to check than we are converging to the right values
- We can look at the result of this fitting process using
`mcmc_plot`

```
mcmc_plot(baseline_model, type = "trace", variable = c("mintemp", "ar1[1]", "sigma[1]"))
```

- The red lines at the bottom are providing the same information as the warnings when we fit the model
- Something isn’t quite right
- The model isn’t converged yet
- We could try running it longer
- But part of what’s going on here is that the errors aren’t really Gaussian
- So let’s look at the forecast and then move on

```
plot(baseline_model, type = "forecast")
```

- Remember our model looks like this

- Which can also be written as

- Prediction intervals are regularly negative because we’ve assumed normally distributed error
- Actual counts can only be non-negative integers: 0, 1, 2…

### Better distributions

- Let’s use Poisson error structure and a log link function to give us only integer predictions

- The Poisson distribution generates only integer draws based on a mean
- If the mean is 1.5 sometimes you’ll draw a 1, sometimes a 2, etc.
- The log transformation of $\mu$ ensures that $\mu$ is positive

```
poisson_model = mvgam(abundance ~ mintemp,
trend_model = "AR1",
family = poisson(link = "log"),
data = data_train,
newdata = data_test)
```

- No more warnings at the end
- Look at the model

```
summary(poisson_model)
```

- And the forecast

```
plot(poisson_model, type = "forecast", newdata = data_test)
```

- Now all of our predictions are positive!

### State space models

Time-series model

Only first order autoregressive component

Separately model

- the process model - how the system evolves in time or space
- the observation model - observation error or indirect observations

Estimates the true value of the underlying

**latent**state variablesState space model of AR1 + rain w/Poisson error

```
state_space_model = mvgam(abundance ~ 1,
trend_formula = ~ mintemp,
trend_model = "AR1",
family = poisson(link = "log"),
data = data_train,
newdata = data_test)
plot(state_space_model, type = "forecast")
```

```
output = state_space_model$model_output
```

Normally would want several chains with different starting positions to avoid local minima

Send to JAGS

```
j.model <- jags.model (file = textConnection(RandomWalk),
data = data,
inits = init,
n.chains = 1)
```

- Burn in

```
jags.out <- coda.samples (model = j.model,
variable.names = c("tau_proc","tau_obs"),
n.iter = 10000)
plot(jags.out)
```

- Sample from MCMC with full vector of X’s
- This starts sampling from the point were the previous run of
`coda.samples`

ends so it gets rid of the burn-in samples

```
jags.out <- coda.samples (model = j.model,
variable.names = c("x","tau_proc","tau_obs"),
n.iter = 10000)
```

- Visualize
- Convert the output into a matrix & drop parameters

```
out <- as.matrix(jags.out)
xs <- out[,3:ncol(out)]
```

- Point predictions are averages across MCMC samples

```
predictions <- colMeans(xs)
plot(time, predictions, type = "l")
```

And this looks very similar to the observed dynamics of y

Add prediction intervals as range containing 95% of MCMC samples

```
ci <- apply(xs, 2, quantile, c(0.025, 0.975))
lines(time, ci[1,], lty = "dashed", col = "blue")
lines(time, ci[2,], lty = "dashed", col = "blue")
```

- These are very narrow prediction intervals, so the model appears to be very confident
- But it’s important to keep in mind that when fitting the value of
`x`

at time`t`

, the model has access to the value of`y`

at time`t`

- And the
`y`

is present it isn’t being estimated, it’s just the observed value - So, will this model forecast well?

### Forecasting

- To make forecasts using a JAGS model we include data for
`y`

that is`NA`

- This tells the model that we don’t know the values and therefore the model estimates them as part of the fitting process
- To make a true forecast we would add one
`NA`

to the end of`y`

for each time step we wanted to forecast - To hindcast or backcast like we replace the values for
`y`

that are part of the test set with`NA`

- We’ll hindcast, so to do this we’ll replace the last year of
`y`

values with`NA`

and then compare the final year of data to our predictions

Make these changes at top of script and rerun

```
data$y[(length(y)-51):length(y)] = NA
jags.out <- coda.samples (model = j.model,
variable.names = c("y","tau_proc","tau_obs"),
n.iter = 10000)
```

- We can see from plotting the predictions that the forecast doesn’t look promising
- Without the observed data to influence the estimates of
`x[t]`

the model predicts little change over the forecast year - We can directly compare this to the empirical data by adding it to the plot

```
lines(time, y)
```

- So the point estimates don’t perform well
- This raises the question of whether the model accurately predicts that it is uncertain when making forecasts
- Plotting the prediction intervals suggests that it does
- They very quickly expand towards zero and the upper limits of the data