# R Tutorial

*Material heavily influenced by Nicholas J Clark’s excellent course on
Ecological forecasting with mvgam and brms*

## Installation
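
If `mvgam` and the Stan backend aren't set up yet, one way to install them (a sketch assuming the CRAN release of `mvgam` and the `cmdstanr` backend; see each package's documentation for current instructions) is:

```
# Install mvgam from CRAN and the cmdstanr interface to Stan,
# then compile the CmdStan backend itself
install.packages("mvgam")
install.packages("cmdstanr",
                 repos = c("https://stan-dev.r-universe.dev", getOption("repos")))
cmdstanr::install_cmdstan()
```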

- Check if your installation worked

```
library(cmdstanr)  # cmdstan_version() is provided by the cmdstanr package
cmdstan_version()
```

If this returns a version number like `"2.32.2"` then things are working properly.

## Introduction

- So far we've developed relatively simple time-series models
  - Linear time-series dependence
  - Linear responses to environmental factors
  - Normally distributed errors
  - No model of observation error
  - No interactions between species
  - Can't handle missing data
- Most of these assumptions are violated in ecological systems
- So we'll start fitting more complex models

## mvgam

- Models like this are not trivial to fit
- Use STAN
- Uses MCMC to explore parameter space to fit the model using Bayesian methods
- Typically requires learning a separate language - STAN is its own language
- This lets you write arbitrarily complex models, but really needs a course in Bayesian methods
- So, we're going to use an R package called `mvgam` to implement our models
- We're going to use it because it's the simplest way to build complex time-series models in R

```
library(mvgam)
library(dplyr)
```

## Data

- Data on the population dynamics of the Desert Pocket Mouse

```
pp_data <- read.csv("content/data/pp_abundance_timeseries.csv")
```

- mvgam requires its own specific data format and doesn't currently work with tsibbles
- Requires a `time` variable be present in the data to index temporal observations starting at 1
- Our `newmoonnumber` indexes the monthly samples so convert it to start at 1
- Also requires a `series` variable, which needs to be a factor
- Helps when analyzing multiple time series at once (e.g., multiple species)

```
pp_data <- read.csv("content/data/pp_abundance_timeseries.csv") |>
  mutate(time = newmoonnumber - min(newmoonnumber) + 1) |>
  mutate(series = as.factor('PP')) |>
  select(time, series, abundance, mintemp, cool_precip)
```

- 124 months of data
- Reserve 24 for testing

```
data_train <- filter(pp_data, time <= 100)
data_test <- filter(pp_data, time > 100)
```
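
As a quick sanity check (assuming the file really contains the 124 consecutive monthly samples described above), we can confirm how the observations were split:

```
range(pp_data$time)  # should run from 1 to 124
nrow(data_train)     # 100 months used for fitting
nrow(data_test)      # 24 months held out for evaluating forecasts
```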

## Simple time-series models in mvgam

- mvgam comes with a time-series visualization function like we had in `fable`

```
plot_mvgam_series(data = pp_data, y = 'abundance')
```

- Start by building something similar to what we’ve done before with the number of desert pocket mice
- Model has an exogenous driver - minimum temperature
- An autoregressive component
- Gaussian error

$$y_t = \alpha + \beta \, \text{mintemp}_t + z_t, \qquad z_t = \phi \, z_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \text{Normal}(0, \sigma^2)$$

- Which can also be written as

$$y_t \sim \text{Normal}(\alpha + \beta \, \text{mintemp}_t + \phi \, z_{t-1}, \sigma^2)$$

- Fit using the `mvgam()` function
- Follows a Base R model structure so start with the model formula
- Instead of including the AR component in the model we add a separate `trend_model` argument
- We'll use `"AR1"`
- Specify the error with `family = gaussian()`
- Then we can specify the data for fitting the model and the data for making/evaluating forecasts

```
baseline_model = mvgam(abundance ~ mintemp,
                       trend_model = "AR1",
                       family = gaussian(),
                       data = data_train)
```

### Bayesian model fitting

- That’s a lot of output for fitting a model
- What’s going on?
- It is difficult to fit more complex time-series models
- One way to fit them is using Bayesian methods
- These methods iteratively search parameter space for the best parameter values
- Using something called Markov Chain Monte Carlo (MCMC)

*Draw 2D parameter search on board*

- The `"Iteration"` lines are telling us that the model is working its way through this process
- The `"Warmup"` lines are iterations that are used to get into the right parameter space but are then thrown away
- The `"Sampling"` lines are iterations that serve as samples for each of the parameters we are fitting
- Having multiple samples gives us the uncertainty in the parameters by looking at how variable those values are
- The different `"chains"` are because we typically go through this process multiple times to check that we are converging to the right values
- We can look at the result of this fitting process using `mcmc_plot`

```
mcmc_plot(baseline_model, type = "trace", variable = c("mintemp", "ar1[1]", "sigma[1]"))
```

- The red lines at the bottom are providing the same information as the warnings when we fit the model
- Something isn't quite right
- The model hasn't converged yet
- We could try running it longer (see the sketch below)
- But part of what's going on here is that we're using a poorly specified model given the data because we've assumed Gaussian errors
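
If we did want to run the sampler for longer, the chains and iteration counts can be set when fitting the model; this is only a sketch, and the argument names used here (`chains`, `burnin`, `samples`) should be checked against `?mvgam`:

```
# Sketch: same baseline model, but with more warmup and sampling iterations
longer_baseline_model = mvgam(abundance ~ mintemp,
                              trend_model = "AR1",
                              family = gaussian(),
                              data = data_train,
                              chains = 4,     # number of independent MCMC chains
                              burnin = 1000,  # warmup iterations (discarded)
                              samples = 1000) # sampling iterations kept per chain
```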

- So let's look at the model & forecast and then keep going
- We can look at the model structure using `summary()`

```
summary(baseline_model)
```

- If we plot the model we’ll get some diagnostic plots

```
plot(baseline_model)
```

- The residuals are normally distributed
- And the residual autocorrelation isn’t significant

### Bayesian model forecasting

- So let's look at the forecasts
- To make forecasts for Bayesian models we include data for `y` that is `NA`
- This tells the model that we don't know the values and therefore the model estimates them as part of the fitting process
- Handy because it means these models can handle missing data
- To make a true forecast one `NA` is added to the end of `y` for each time step we want to forecast (a sketch of what that data could look like follows below)
- To hindcast, the values for `y` that are part of the test set are replaced with `NA`
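
Conceptually, the forecast data just pairs known covariate values with an unknown response; the `future_data` object below is a hypothetical illustration of that idea, not something we need to build ourselves before calling `forecast()`:

```
# Hypothetical illustration: covariates for the forecast horizon are known,
# but the response is marked as missing so the model estimates it
future_data <- data_test |>
  mutate(abundance = NA)
head(future_data)
```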

- We can do this in mvgam using the `forecast()` function

```
baseline_forecast = forecast(baseline_model, newdata = data_test)
```

- This is slow because it is refitting the model
- We can then plot the forecast

```
plot(baseline_forecast)
```

- These forecasts look similar to those we generated using `fable`

- Prediction intervals are regularly negative because we’ve assumed normally distributed error
- Actual counts can only be non-negative integers: 0, 1, 2…

## Modeling count data

- To model count data explicitly we need observations that are integers
- We can use Poisson error to accomplish this
- Instead of $N(\mu, \sigma^2)$ we use $\text{Poisson}(\lambda)$

- The Poisson distribution has one parameter $\lambda$
- Which is both the mean and the variance
- It generates only integer draws based on a mean

```
rpois(n = 10, lambda = 5)
hist(rpois(n = 1000, lambda = 5))
```
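
A quick way to see the mean-equals-variance property is to compare both summaries on a large sample of draws:

```
draws <- rpois(n = 100000, lambda = 5)
mean(draws)  # close to 5
var(draws)   # also close to 5 - the mean and the variance are both lambda
```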

If the mean is 1.5 sometimes you’ll draw a 1, sometimes a 2, sometimes a 0, etc.

$\lambda_t$ can be a decimal

```
hist(rpois(n = 1000, lambda = 4.5))
```

- We could expect an average of 4.5 rodents based on the environment even though we can only observe an integer number
- But $\lambda_t$ does have to be positive because we can’t reasonably expect to see negative rodents

```
hist(rpois(n = 1000, lambda = -4.5))  # this fails: rpois() cannot draw from a negative mean
```

- To handle this we use a log link function to give us only positive values of $\lambda_t$
- The log link means that instead of modeling $\lambda_t$ directly we model $\log(\lambda_t)$
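
A tiny sketch of what the link buys us: whatever values the linear predictor takes on the log scale, exponentiating always maps them back to a positive mean.

```
# Hypothetical values of log(lambda_t) from a linear predictor
eta <- c(-2, 0, 1.5)
exp(eta)  # the implied lambda_t values are all positive
```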

```
poisson_model = mvgam(abundance ~ mintemp,
                      trend_model = "AR1",
                      family = poisson(link = "log"),
                      data = data_train)
```

- No more warnings at the end
- Look at the model

```
summary(poisson_model)
```

```
plot(poisson_model)
```

Model structure looks OK, but we still have some residual seasonal lag we haven't captured

Now let’s look at the forecast

```
poisson_forecast = forecast(poisson_model, newdata = data_test)
plot(poisson_forecast)
```

- Now all of our predictions are positive!
- But while the point estimates seem reasonable the uncertainties seem really large

### Visualizing environmental drivers

- As our models get more complicated it's important to make sure we understand what they are doing
- We can visualize the environmental component of the model using `type = 'pterms'`
- `pterms` is short for “parametric terms”
- Which is what we have since we modeled a linear relationship with a fixed slope

```
plot(poisson_model, type = 'pterms')
```

- This shows a linear relationship
- But that is the relationship between `log(abundance)` and `mintemp`
- So let's look at the actual relationship

```
plot_predictions(poisson_model, condition = "mintemp")
```

- Since `log(abundance)` is linearly related to temperature
- This makes the response of untransformed abundance exponential
- That doesn’t feel right
- Could be part of the reason for our really large upper prediction intervals

## Non-linear responses

- Linear relationships between abundance and the environment are unlikely to hold over reasonable environmental gradients
- Typically we think of species as having some optimal environmental value
- With decreasing performance as you move away from that environmental value in both directions

*Draw standard performance response curve*

- We can model this in `mvgam` by using a Generalized Additive Model that fits a smoother to the relationship

```
poisson_gam_model = mvgam(abundance ~ s(mintemp),
                          trend_model = "AR1",
                          family = poisson(link = "log"),
                          data = data_train)
```

- Visualizing the environmental component of the model shows that it is now non-linear

```
plot(poisson_gam_model, type = 'smooths')
```

- It doesn’t have an optimal value, but it does saturate
- This means that once it’s warm enough then temperature stops influencing abundance
- This is the smooth on `log(abundance)`
- Look at the relationship with raw abundance

```
plot_predictions(poisson_gam_model, condition = "mintemp")
```

- The GAM now gives us a more realistic response to temperature
- When it's cold abundance is always near zero
- Even if it gets a little less cold
- Abundance then increases with temperature
- But starts to asymptote at higher temperatures
- How does this influence the forecasts?

```
poisson_gam_forecast = forecast(poisson_gam_model, newdata = data_test)
plot(poisson_gam_forecast)
```

- The extremely wide prediction intervals have been reduced substantially
- If we look at the diagnostics

```
plot(poisson_gam_model)
```

- We can also see that the residual seasonal autocorrelation is improved

## State space models

- To add an explicit observation model we use the `trend_formula` argument
- Covariates in `trend_formula` act on the latent process, while the observation formula (here just an intercept, `abundance ~ 1`) describes how that process is observed

```
state_space_model = mvgam(abundance ~ 1,
                          trend_formula = ~ s(mintemp, bs = 'bs', k = 5),
                          trend_model = "AR1",
                          family = poisson(link = "log"),
                          data = data_train)
state_space_forecast = forecast(state_space_model, newdata = data_test)
plot(state_space_forecast)
```
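
To see the latent process the state-space model has estimated, separately from the observations, the trend component can be plotted directly (assuming `type = 'trend'` is supported by the installed version of `mvgam`):

```
plot(state_space_model, type = 'trend')
```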

## Some notes on Bayesian forecasting

- Since forecasting and model fitting are all part of the same process in Bayesian models
- We can also just incorporate the test data directly in the model fitting step

```
poisson_gam_model = mvgam(abundance ~ s(mintemp, bs = 'bs', k = 5),
                          trend_model = "AR1",
                          family = poisson(link = "log"),
                          data = data_train,
                          newdata = data_test)
```

- And then plot forecasts directly from the model object

```
plot(poisson_gam_model, type = "forecast")
```

- If you want to know why we're using `mvgam`
- Here's what the code to build this model directly in STAN looks like

```
code(poisson_gam_model)
```

## Evaluation and comparison

- To evaluate the forecasts from `mvgam` we can use the `score()` function

```
scores <- score(poisson_gam_forecast)
```

- The output shows us two things we’ve seen before
- Whether or not each observed value falls within the prediction interval
- With a default prediction interval of 90%
- And the continuous rank probability score
- Other scores are available by changing the optional `score` argument
- We can also change the prediction interval using the `interval_width` argument

```
scores <- score(poisson_gam_forecast, interval_width = 0.5)
```

- If we want to calculate the coverage we can sum the number of observations that fall within the interval and divide by the total number of forecasts

```
sum(scores$PP$in_interval) / nrow(scores$PP)
```
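
For comparing models it is often handy to also collapse the forecast scores to a single number; assuming the per-horizon CRPS values sit in a column named `score` (check with `str(scores$PP)`), the mean score over the forecast horizon could be computed as:

```
mean(scores$PP$score, na.rm = TRUE)
```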