R Tutorial
Objectives
We’ll model and forecast abundance data for Dipodomys ordii.
# package dependencies
library(ggplot2)
library(portalr)
library(rEDM)
Data
- Download the data
rodent_data = abundance()
do_data = data.frame(period = rodent_data$period, abundance = rodent_data$DO)
ggplot(do_data, aes(x = period, y = abundance)) +
geom_line()
Theory
- Multi-species system
- Population dynamics model would have ~10 state variables (abundance of each species)
- Each at t time steps
- If deterministic those 10 values at time t+1 are fully determined by the values at t
- Can write this as:
- Could assume some parametric shape (e.g., logistic growth):
with parameters $r_1, r_2, \alpha_{1,2}, \alpha_{2,1}$.
- But we might get the form wrong
- And there are typically way more than 2 species even if we’re only studying 2
Empirical Dynamic Modeling
- Model arbitrarily complex F
- Can capture the complexity of a multiple species/state system in a single species/state time-series
- Does this based on Taken’s Theorem
- Reconstruct a shadow of the real system from single time-series
In other words, instead of relying on: $$x_i(t+1) = F_i\left(x_1(t), x_2(t), \dots, x_d(t)\right)$$
the system dynamics can be represented as a function of a single variable and its lags: $$x_i(t+1) = G_i\left(x_i(t), x_i(t-1), \dots, x_i(t-(E-1))\right)$$
- $E$ is the embedding dimension which defines how far back in time we go
Usage
- Need to estimate $G_i$ from the data
- Typically think of this as fitting an equation, but $G$ is arbitrarily complex
- And since we’re focusing on prediction we don’t need the equation, we just need to know what is going to happen next
- And we can get this prediction by looking through the existing time-series and finding periods that look like the current period (draw example)
- These historical periods should reflect the same dynamics of $G$ and therefore the same state variables in $F$
- Which should mean that what happens next in these periods should match what happens next now.
- Use the simplest, simplex projection
- Weighted nearest-neighbors approximation:
- Have value of $x$ and its lags at time $t$. Then we want a prediction of $x(t+1) = G\left(x(t), x(t-1), \dots, x(t - (E-1))\right)$.
- We look for $j = 1..k$ nearest neighbors in the observed time series such that $$\begin{multline} \langle x(t), x(t-1), \dots, x(t - (E-1))\rangle \\ \approx \langle x(n_j), x(n_j-1), \dots, x(n_j - (E-1))\rangle \end{multline}$$
- We then suppose that $x(t+1) \approx x(n_j+1)$.
- Use a distance function to judge how similar $\langle x(t), x(t-1), \dots, x(t - (E-1))\rangle$ is to $\langle x(n_j), x(n_j-1), \dots, x(n_j - (E-1))\rangle$
- Estimating $x(t+1)$ as a weighted average of the $x(n_j+1)$ values with weighting determined by the distances.
Determining Embedding Dimension
- Need to know $E$, how many lags to use for determining if time-series is similar
- Split the data to reserve some for forecasting
n <- nrow(do_data)
lib <- c(1, floor(2/3 * n)) # indices for the first 2/3 of the time series
pred <- c(floor(2/3 * n) + 1, n) # indices for the final 1/3 of the time series
- Fit 10 different embedding dimensions (1:10) see how well they work
simplex(do_data, # input data (for data.frames, uses 2nd column)
lib = lib, pred = lib, # which portions of the data to train and predict
E = 1:10) # embedding dimensions to try
- Output shows how well the model “predicted” on the given data
- Focus on measures of fit/error:
- rho (correlation between observed and predicted values, higher is better)
- mae (mean absolute error, lower is better)
- rmse (root mean squared error, lower is better)
- $E$ of 4 or 5 is optimal
- Use 4 for simpler model
- Use to forecast the remaining 1/3 of the data.
Forecasts
- Similar code for forecasting for the last 1/3 of the time series
- To store predicted values add
stats_only
argument
output <- simplex(do_data,
lib = lib, pred = pred, # predict on last 1/3
E = 4,
stats_only = FALSE) # return predictions, too
- Output is a data.frame with a list column for the predictions
predictions <- output$model_output[[1]]
str(predictions)
- Plot the predictions against the original data
ggplot(do_data, aes(x = period, y = abundance)) +
geom_line() +
geom_line(data = predictions, mapping = aes(x = period, y = Predictions), color = 'blue')
- Also have estimate of the prediction uncertainty in
Pred_Variance
- Variance of the prediction
- Plot a 95% prediction interval use $\pm 2 * SD$
ggplot(do_data, aes(x = period, y = abundance)) +
geom_line() +
geom_line(data = predictions, mapping = aes(x = period, y = Predictions), color = 'blue') +
geom_ribbon(data = predictions,
mapping = aes(x = period,
y = Predictions,
ymax = Predictions + 2 * sqrt(Pred_Variance),
ymin = Predictions - 2 * sqrt(Pred_Variance)),
fill = 'blue',
alpha = 0.2)
- This is a single time step forecast
- For multi-step forecasts we can take these one step ahead forecasts along with their uncertainties and increment forward for multi-step forecasts as you’ll see later in the week in a different context.
- Or we can train the model to make forecasts the number of steps ahead that we want
- We do this using
tp
output <- simplex(do_data,
lib = lib, pred = pred, # predict on last 1/3
E = 10, # predicting 6 steps ahead is different model;
# selecting a different E is recommended, and
# a higher E for a more complex model is sensible
tp = 6,
stats_only = FALSE) # return predictions, too
predictions <- output$model_output[[1]]
ggplot(do_data, aes(x = period, y = abundance)) +
geom_line() +
geom_line(data = predictions, mapping = aes(x = period, y = Predictions), color = 'blue')
- The model performs less well the longer the “forecast horizon” or distance into the future we ask it to predict, which is a common feature of real world forecasts.