# 1 Abstract

This document is a quick introduction to “empirical dynamic modeling” (EDM) and the rEDM package. For more detailed information, please see the rEDM-tutorial vignette included in the package [link].

# 2 Theory

Suppose that we have a system with $$d$$ state variables: $$x_1, x_2, \dots, x_d$$. We can represent the state at a specific time, $$t$$, as a vector with these $$d$$ components: $$\bar{x}(t) = \langle x_1(t), x_2(t), \dots, x_d(t) \rangle$$.

We generally assume that the system is deterministic, such that the current state fully determines future states: $\bar{x}(t+1) = F \left(\bar{x}(t)\right)$ though we may not know what this function $$F$$ actually is.

## 2.1 Example

In model systems, we usually know what $$F$$ is, since we have differential or difference equations that describe the system behavior. For example, the coupled logistic map equations are: \begin{align*} x_1(t+1) &= r_1 x_1(t) \left[1 - x_1(t) - \alpha_{1,2} x_2(t)\right]\\ x_2(t+1) &= r_2 x_2(t) \left[1 - x_2(t) - \alpha_{2,1} x_1(t)\right] \end{align*} with parameters $$r_1, r_2, \alpha_{1,2}, \alpha_{2,1}$$.
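As a quick illustration, the coupled logistic map can be simulated with a few lines of R. The parameter values and initial conditions below are arbitrary choices for demonstration, not values from the text:

```r
# Simulate the coupled logistic map; parameters and initial conditions
# are illustrative only.
r1 <- 3.8; r2 <- 3.5; a12 <- 0.02; a21 <- 0.1

n_steps <- 100
x1 <- numeric(n_steps); x2 <- numeric(n_steps)
x1[1] <- 0.4; x2[1] <- 0.2

for (t in 1:(n_steps - 1)) {
  x1[t + 1] <- r1 * x1[t] * (1 - x1[t] - a12 * x2[t])
  x2[t + 1] <- r2 * x2[t] * (1 - x2[t] - a21 * x1[t])
}

head(cbind(x1, x2))
```

Iterating the difference equations like this is exactly "knowing $$F$$": given the current state, the next state is fully determined.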

## 2.2 Time Delay Embedding (Delay Coordinate Embedding)

We would naturally expect that to predict the future of any component of the system, e.g. $$x_i$$, we would need to know all the other $$x_j$$ values. Takens’ Theorem shows that this is not strictly necessary: under certain conditions, a single time series is sufficient.

In other words, instead of relying on: $x_i(t+1) = F_i\left(x_1(t), x_2(t), \dots, x_d(t)\right)$

the system dynamics can be represented as a function of a single variable and its lags: $x_i(t+1) = G_i\left(x_i(t), x_i(t-1), \dots, x_i(t-(E-1))\right)$

Note that we use $$E$$ coordinates instead of $$d$$ and that we do not make any claims about how the functions $$F_i$$ and $$G_i$$ are related.
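To make the lag-vector idea concrete, here is a small helper (hypothetical, for illustration only; rEDM constructs embeddings internally) that builds the delay-coordinate vectors from a single time series:

```r
# Each row of the result is one delay vector
# <x(t), x(t-1), ..., x(t-(E-1))> from the reconstructed state space.
make_embedding <- function(x, E) {
  n <- length(x)
  sapply(0:(E - 1), function(lag) x[(E - lag):(n - lag)])
}

x <- sin(seq(0, 10, by = 0.5))
emb <- make_embedding(x, E = 3)
head(emb)   # each row is one point in the reconstructed space
```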

## 2.3 Usage

Since we generally don’t know the form of $$G_i$$, we infer it from the data. The rEDM package provides several methods to do so, but here we focus on simplex projection.

Simplex projection uses a weighted nearest-neighbors approximation to estimate $$G_i$$. That’s kind of a fancy way of saying the following:

1. Suppose we have the value of $$x$$ and its lags at time $$s$$. Then we want a prediction of $$x(s+1) = G\left(x(s), x(s-1), \dots, x(s - (E-1))\right)$$.

2. We look for $$j = 1..k$$ nearest neighbors in the observed time series of $$x$$ such that $$\langle x(s), x(s-1), \dots, x(s - (E-1))\rangle \approx \langle x(n_j), x(n_j-1), \dots, x(n_j - (E-1))\rangle$$.

3. We then suppose that $$x(s+1) \approx x(n_j+1)$$.

Mathematically, this occurs by using some distance function to judge how similar $$\langle x(s), x(s-1), \dots, x(s - (E-1))\rangle$$ is to $$\langle x(n_j), x(n_j-1), \dots, x(n_j - (E-1))\rangle$$ and estimating $$x(s+1)$$ as a weighted average of the $$x(n_j+1)$$ values with weighting determined by the distances.
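The three steps above can be sketched directly in R. This is a stripped-down illustration (Euclidean distance, $$E+1$$ neighbors, exponential weighting), not the rEDM implementation:

```r
# One simplex-projection prediction for the query time s, illustrative only.
simplex_step <- function(x, E, s) {
  # library: all times t with a full set of lags and a known successor
  idx <- E:(length(x) - 1)
  lib <- t(sapply(idx, function(t) x[t:(t - E + 1)]))
  query <- x[s:(s - E + 1)]

  # distance from the query vector to every library vector
  d <- sqrt(rowSums((lib - matrix(query, nrow(lib), E, byrow = TRUE))^2))
  d[idx == s] <- Inf                       # exclude the query itself

  nn <- order(d)[1:(E + 1)]                # E + 1 nearest neighbors
  w  <- exp(-d[nn] / max(d[nn][1], 1e-6))  # weight by relative distance

  sum(w * x[idx[nn] + 1]) / sum(w)         # weighted average of successors
}

x <- sin(seq(0, 20, by = 0.1))
simplex_step(x, E = 3, s = 100)            # close to the true x[101]
```

Because the sine series revisits nearly identical delay vectors every period, the neighbors’ successors all sit near the true next value, and the weighted average lands close to it.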

# 3 Applications

These principles of EDM can be used for several different applications. These include:

- univariate forecasting (predicting future behavior using lags of a single time series)
- multivariate forecasting (predicting future behavior using lags of multiple time series)
- characterizing various properties of time series (e.g. predictability and “nonlinearity”)
- identifying causal relationships between time series

Here, we’ll just look at univariate forecasting, though there are examples of the others in the package documentation (see the vignettes on CRAN).

# 4 Univariate Forecasting

## 4.1 Data

We’ll start by looking at sunspot data. Note that there is also monthly data, but it seems quite a bit noisier:

```r
dat <- data.frame(yr = as.numeric(time(sunspot.year)),
                  sunspot_count = as.numeric(sunspot.year))

plot(dat$yr, dat$sunspot_count, type = "l")
```

## 4.2 Determining E

From the above sections, you may recall that one of the parameters is the embedding dimension, E. Since we ultimately want to test forecast skill, we use only the first 2/3 of the data to select E. There are several other parameters that we can set, but the defaults work fine for our case:

```r
library(rEDM)                    # load the package

n <- NROW(dat)
lib <- c(1, floor(2/3 * n))      # indices for the first 2/3 of the time series
pred <- c(floor(2/3 * n) + 1, n) # indices for the final 1/3 of the time series

simplex(dat,                     # input data (for data.frames, uses 2nd column)
        lib = lib, pred = lib,   # which portions of the data to train and predict
        E = 1:10)                # embedding dimensions to try
```
```
## Warning in model$run(): Found overlap between lib and pred. Enabling
## cross-validation with exclusion radius = 0.
##     E tau tp nn num_pred       rho      mae     rmse perc         p_val
## 1   1   1  1  2      191 0.6991334 19.02584 26.30722    1  8.621117e-33
## 2   2   1  1  3      190 0.9074985 10.73279 14.73790    1  2.037040e-95
## 3   3   1  1  4      189 0.9195079 10.16759 13.93505    1 4.910681e-104
## 4   4   1  1  5      188 0.9195163 10.46335 14.00166    1 1.703957e-103
## 5   5   1  1  6      187 0.9176281 10.92784 14.17733    1 2.032115e-101
## 6   6   1  1  7      186 0.9160703 11.21311 14.28820    1  1.170578e-99
## 7   7   1  1  8      185 0.9082384 11.42164 14.98126    1  1.974748e-93
## 8   8   1  1  9      184 0.8911547 11.98804 16.38659    1  1.675719e-82
## 9   9   1  1 10      183 0.8708673 12.95866 17.75854    1  3.249362e-72
## 10 10   1  1 11      182 0.8561815 13.38768 18.57338    1  6.248548e-66
##    const_pred_num_pred const_pred_rho const_pred_mae const_pred_rmse
## 1                  191      0.8181284       16.30681        21.10918
## 2                  190      0.8171769       16.36105        21.16018
## 3                  189      0.8164772       16.42116        21.21297
## 4                  188      0.8160786       16.47128        21.26318
## 5                  187      0.8162301       16.48984        21.29875
## 6                  186      0.8171832       16.46022        21.29492
## 7                  185      0.8188354       16.39243        21.24568
## 8                  184      0.8186807       16.43261        21.29300
## 9                  183      0.8181869       16.46776        21.33830
## 10                 182      0.8171689       16.54725        21.39633
##    const_pred_perc  const_p_val
## 1                1 2.020297e-56
## 2                1 7.301909e-56
## 3                1 2.220877e-55
## 4                1 5.537661e-55
## 5                1 9.715865e-55
## 6                1 1.027021e-54
## 7                1 6.966389e-55
## 8                1 1.498706e-54
## 9                1 3.987699e-54
## 10               1 1.463741e-53
```

The output is a data.frame describing how well the model “predicted” on the given data. The columns are various statistical outputs. We’ll focus just on the measures of model performance:

- rho (correlation between observed and predicted values)
- mae (mean absolute error)
- rmse (root mean squared error)

It looks like E = 3 or 4 is optimal. Since we generally want a simpler model, if possible, let’s go with E = 3 to forecast the remaining 1/3 of the data.
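Reading the best E off the printed table works, but the choice can also be made programmatically. A minimal sketch, using a toy results table shaped like the rho column above (not real rEDM output); the tolerance of 0.005 is an arbitrary choice that encodes the “prefer the simpler model” rule:

```r
# Toy E-vs-rho table for illustration; the 0.005 tolerance is arbitrary.
res <- data.frame(E   = 1:5,
                  rho = c(0.699, 0.907, 0.920, 0.919, 0.918))

# smallest E whose rho is within the tolerance of the best rho
best_E <- min(res$E[res$rho >= max(res$rho) - 0.005])
best_E
```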
## 4.3 Forecasts

We can use similar code to produce forecasts for the last 1/3 of the time series. Since we want the outputs in addition to the performance statistics, we add an additional argument to the function call:

```r
output <- simplex(dat,
                  lib = lib, pred = pred,    # predict on last 1/3
                  E = 3,
                  stats_only = FALSE)        # return predictions, too
```

Here the output is again a data.frame with a list column for the predictions:

```r
predictions <- output$model_output[[1]]
str(predictions)
```

```
## 'data.frame':    97 obs. of  4 variables:
##  $ time    : num  1893 1894 1895 1896 1897 ...
##  $ obs     : num  85.1 78 64 41.8 26.2 26.7 12.1 9.5 2.7 5 ...
##  $ pred    : num  NaN NaN 49.3 45 36.2 ...
##  $ pred_var: num  NaN NaN 24 40.6 88.2 ...
```

Let’s plot the predictions against the original data:

```r
plot(dat$yr, dat$sunspot_count, type = "l")
lines(predictions$time, predictions$pred, col = "blue", lty = 2)
```

Since we also have an estimate of the prediction uncertainty, let’s include that as well:

```r
plot(dat$yr, dat$sunspot_count, type = "l")
lines(predictions$time, predictions$pred, col = "blue", lty = 2)
polygon(c(predictions$time, rev(predictions$time)),
        c(predictions$pred - sqrt(predictions$pred_var),
          rev(predictions$pred + sqrt(predictions$pred_var))),
        col = rgb(0, 0, 1, 0.5), border = NA)
```
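Finally, the same skill measures reported during E-selection (rho, rmse) can be computed by hand on the held-out data. The helper below is a hypothetical sketch that works on any data.frame with obs and pred columns like the one above; the data passed to it here are toy values, not the sunspot forecasts:

```r
# Forecast skill from a predictions-style data.frame; with the real
# output you would call skill(predictions).
skill <- function(p) {
  ok <- is.finite(p$obs) & is.finite(p$pred)   # drop leading NaN rows
  c(rho  = cor(p$obs[ok], p$pred[ok]),
    rmse = sqrt(mean((p$obs[ok] - p$pred[ok])^2)))
}

skill(data.frame(obs = c(1, 2, 3, 4), pred = c(NaN, 2.1, 2.9, 4.2)))
```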