R Tutorial

Objectives

We’ll model and forecast abundance data for Dipodomys ordii.

# package dependencies
library(ggplot2)
library(portalr)
library(rEDM)

Data

  • Download the data
rodent_data <- abundance()
do_data <- data.frame(period = rodent_data$period, abundance = rodent_data$DO)

ggplot(do_data, aes(x = period, y = abundance)) +
  geom_line()

Theory

  • Multi-species system
  • Population dynamics model would have ~10 state variables (abundance of each species)
  • Each observed at t time steps
  • If the system is deterministic, those ~10 values at time t+1 are fully determined by the values at time t
  • Can write this as:
$$\bar{x}(t+1) = F \left(\bar{x}(t)\right)$$
  • Could assume some parametric shape (e.g., logistic growth; a simulation sketch of this model follows this list):
$$\begin{align*} x_1(t+1) &= r_1 x_1(t) \left[1 - x_1(t) - \alpha_{1,2} x_2(t)\right]\\ x_2(t+1) &= r_2 x_2(t) \left[1 - x_2(t) - \alpha_{2,1} x_1(t)\right] \end{align*}$$

with parameters $r_1, r_2, \alpha_{1,2}, \alpha_{2,1}$.

  • But we might get the form wrong
  • And there are typically way more than 2 species even if we’re only studying 2
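
To make this concrete, here is a minimal simulation sketch of the two-species model written out above. The parameter values, initial abundances, and the helper name simulate_two_species are arbitrary choices for illustration, not estimates from the rodent data.

# minimal sketch: simulate the two-species logistic competition model above
# (parameter values and initial abundances are arbitrary, for illustration only)
simulate_two_species <- function(n_steps = 100,
                                 r = c(3.8, 3.5),       # r_1, r_2
                                 alpha = c(0.02, 0.1),  # alpha_{1,2}, alpha_{2,1}
                                 x0 = c(0.4, 0.2)) {
  x <- matrix(NA_real_, nrow = n_steps, ncol = 2)
  x[1, ] <- x0
  for (t in 1:(n_steps - 1)) {
    x[t + 1, 1] <- r[1] * x[t, 1] * (1 - x[t, 1] - alpha[1] * x[t, 2])
    x[t + 1, 2] <- r[2] * x[t, 2] * (1 - x[t, 2] - alpha[2] * x[t, 1])
  }
  data.frame(t = 1:n_steps, x1 = x[, 1], x2 = x[, 2])
}

sim <- simulate_two_species()
ggplot(sim, aes(x = t, y = x1)) +
  geom_line()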

Empirical Dynamic Modeling

  • Model an arbitrarily complex $F$
  • Can capture the complexity of a multi-species/state system in a single species/state time series
  • Does this based on Takens' Theorem
    • Reconstruct a shadow of the real system from a single time series

In other words, instead of relying on: $$x_i(t+1) = F_i\left(x_1(t), x_2(t), \dots, x_d(t)\right)$$

the system dynamics can be represented as a function of a single variable and its lags: $$x_i(t+1) = G_i\left(x_i(t), x_i(t-1), \dots, x_i(t-(E-1))\right)$$

  • $E$ is the embedding dimension which defines how far back in time we go
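
To see what these lag coordinates look like, here is a minimal sketch that builds the E-dimensional lag vectors from the abundance series using base R's embed(); E = 3 is just an example value, not the dimension we select below.

# minimal sketch: construct E-dimensional lag vectors from the abundance series
# (E = 3 is an arbitrary example value, not the one selected below)
E_example <- 3
lag_vectors <- embed(do_data$abundance, E_example)  # row t is <x(t), x(t-1), x(t-2)>
colnames(lag_vectors) <- c("x(t)", "x(t-1)", "x(t-2)")
head(lag_vectors)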

Usage

  • Need to estimate $G_i$ from the data
  • Typically think of this as fitting an equation, but $G$ is arbitrarily complex
  • And since we’re focusing on prediction we don’t need the equation, we just need to know what is going to happen next
  • And we can get this prediction by looking through the existing time-series and finding periods that look like the current period (draw example)
  • These historical periods should reflect the same dynamics of $G$ and therefore the same state variables in $F$
  • Which should mean that what happens next in these periods should match what happens next now.
  • Use the simplest method, simplex projection
  • Weighted nearest-neighbors approximation:
  1. We have the value of $x$ and its lags at time $t$, and we want a prediction of $x(t+1) = G\left(x(t), x(t-1), \dots, x(t - (E-1))\right)$.
  2. We look for the $k$ nearest neighbors, at times $n_j$ for $j = 1, \dots, k$, in the observed time series such that $$\begin{multline} \langle x(t), x(t-1), \dots, x(t - (E-1))\rangle \\ \approx \langle x(n_j), x(n_j-1), \dots, x(n_j - (E-1))\rangle \end{multline}$$
  3. We then suppose that $x(t+1) \approx x(n_j+1)$.
  • We use a distance function to judge how similar $\langle x(t), x(t-1), \dots, x(t - (E-1))\rangle$ is to $\langle x(n_j), x(n_j-1), \dots, x(n_j - (E-1))\rangle$
  • We estimate $x(t+1)$ as a weighted average of the $x(n_j+1)$ values, with weights determined by the distances (sketched below)
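
The sketch below is a simplified, hand-rolled version of this idea rather than rEDM's implementation: it assumes a complete series with no missing values and uses Euclidean distance, k = E + 1 neighbors, and exponential weighting by the distance to the nearest neighbor.

# minimal sketch of simplex projection (simplified; not rEDM's implementation)
# predict x(t+1) from the E most recent values by finding similar historical
# lag vectors and averaging what happened next, weighted by distance
simplex_step <- function(x, E = 4, k = E + 1) {
  n <- length(x)
  # lag vector for the current state: <x(n), x(n-1), ..., x(n-E+1)>
  current <- x[n:(n - E + 1)]
  # historical lag vectors <x(t), ..., x(t-E+1)> whose "next" value is known
  history <- t(sapply(E:(n - 1), function(t) x[t:(t - E + 1)]))
  nexts <- x[(E + 1):n]
  # Euclidean distance from the current state to each historical state
  dists <- sqrt(colSums((t(history) - current)^2))
  nn <- order(dists)[1:k]                    # k nearest neighbors
  # exponential weighting by distance to the nearest neighbor
  w <- exp(-dists[nn] / max(dists[nn][1], .Machine$double.eps))
  sum(w * nexts[nn]) / sum(w)                # weighted average of what came next
}

simplex_step(do_data$abundance, E = 4)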

Determining Embedding Dimension

  • Need to know $E$, how many lags to use for determining whether parts of the time series are similar
  • Split the data to reserve some for forecasting
n <- nrow(do_data)
lib <- c(1, floor(2/3 * n))      # indices for the first 2/3 of the time series
pred <- c(floor(2/3 * n) + 1, n) # indices for the final 1/3 of the time series
  • Fit 10 different embedding dimensions (1:10) and see how well they work
simplex(do_data,                 # input data (for data.frames, uses 2nd column)
        lib = lib, pred = lib,   # train and predict within the first 2/3 (in-sample)
        E = 1:10)                # embedding dimensions to try
  • Output shows how well the model “predicted” on the given data
  • Focus on measures of fit/error:
    • rho (correlation between observed and predicted values, higher is better)
    • mae (mean absolute error, lower is better)
    • rmse (root mean squared error, lower is better)
  • $E$ of 4 or 5 is optimal (see the plotting sketch after this list)
  • Use 4 for a simpler model
  • Use this to forecast the remaining 1/3 of the data.
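
Before moving on to forecasting, a quick way to visualize the choice of E is to plot forecast skill against embedding dimension. This sketch assumes, as the output above suggests, that simplex() returns a data.frame with E and rho columns.

# a sketch: plot forecast skill (rho) against embedding dimension
# (assumes the simplex() output above is a data.frame with E and rho columns)
simplex_results <- simplex(do_data, lib = lib, pred = lib, E = 1:10)
ggplot(simplex_results, aes(x = E, y = rho)) +
  geom_line() +
  geom_point()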

Forecasts

  • The code for forecasting the last 1/3 of the time series is similar
  • To store the predicted values, set the stats_only argument to FALSE
output <- simplex(do_data,
                  lib = lib, pred = pred,  # predict on last 1/3
                  E = 4, 
                  stats_only = FALSE)      # return predictions, too
  • Output is a data.frame with a list column for the predictions
predictions <- output$model_output[[1]]
str(predictions)
  • Plot the predictions against the original data
ggplot(do_data, aes(x = period, y = abundance)) +
  geom_line() +
  geom_line(data = predictions, mapping = aes(x = period, y = Predictions), color = 'blue')
  • We also have an estimate of the prediction uncertainty in Pred_Variance
  • This is the variance of the prediction
  • To plot a 95% prediction interval, use $\pm 2 \times SD$, i.e., Predictions $\pm$ 2 * sqrt(Pred_Variance)
ggplot(do_data, aes(x = period, y = abundance)) +
  geom_line() +
  geom_line(data = predictions, mapping = aes(x = period, y = Predictions), color = 'blue') +
  geom_ribbon(data = predictions,
              mapping = aes(x = period,
                            y = Predictions,
                            ymax = Predictions + 2 * sqrt(Pred_Variance),
                            ymin = Predictions - 2 * sqrt(Pred_Variance)),
              fill = 'blue',
              alpha = 0.2)
  • This is a single-time-step forecast
  • For multi-step forecasts we can take these one-step-ahead forecasts, along with their uncertainties, and iterate them forward, as you’ll see later in the week in a different context
  • Or we can train the model to make forecasts the number of steps ahead that we want
  • We do this using the tp argument
output <- simplex(do_data,
                  lib = lib, pred = pred,  # predict on last 1/3
                  E = 10,                  # predicting 6 steps ahead is a different model;
                                           # selecting a different E is recommended, and
                                           # a higher E for a more complex model is sensible
                  tp = 6,                  # forecast 6 time steps ahead
                  stats_only = FALSE)      # return predictions, too

predictions <- output$model_output[[1]]

ggplot(do_data, aes(x = period, y = abundance)) +
  geom_line() +
  geom_line(data = predictions, mapping = aes(x = period, y = Predictions), color = 'blue')
  • The model performs less well the longer the “forecast horizon”, i.e., the farther into the future we ask it to predict, which is a common feature of real-world forecasts.