R Tutorial

Video Tutorial

Text Tutorial

  • Species Distribution Models attempt to model where species are expect to occur
  • Can be based on lots of things, but often just environmental variables
  • This demo will build a simple version version using a linear model and two predictors

Load some packages

  • dismo is the major package for SDM in R
  • We’ll also load ggplot2 for plotting and dplyr for data manipulation


  • We need two kinds of data for SDMs

  • Locations for where a species occurs

    • And ideally where it is absent
    • When no absences make fake absences called “background” or “pseudo absences”
  • Spatial data on predictor variables

    • Most commonly temp & precip based
    • But that is mostly convenience based, not biology based
    • To use the SDM for forecasting we need both current and future values
    • So we need forecasts for our predictor variables to make forecasts for species distributions
  • For location data we’ll use the Breeding Bird Survey of North America data on the Hooded Warbler

  • For environmental data we’ll use minimum temperature and annual precipitation

  • Fore forecasts of environmental data we’ll use CMIP5 50 year forecasts (CMIP = coupled model intercomparison project)

  • Download data from https://course.naturecast.org/data/sdm_data.zip and unzip it into your working directory

  • The hooded warbler data is in a csv file

  • Load it using read.csv()

hooded_warb_data = read.csv("hooded_warb_locations.csv")
  • We can see that this file has information on locations as latitude & longitude plus whether or not the species is present at that location

  • The environmental data is stored in raster stacks in .grd files

  • Load those using the rast(), which is part of the terra package

env_data_current = rast("env_current.grd")
env_data_forecast = rast("env_forecast.grd")

Determine environment where species is (and isn’t)

  • Now we need to combine the data to get information environment where a species occur and where it doesn’t
  • Combine the two data types
  • To start we’ll extract the environmental data at the locations that have been sampled
  • To do this we get just the position data for the locations by selecting the lon and lat columns
  • And then use the extract() function from the raster package
  • It takes a raster that we want to extract data from as the first argument and the point locations to extract as the second argument
  • Then we need to attach these extracted environmental values back to our presence absence data using bind_rows()
hooded_warb_data = env_data_current |>
  extract(select(hooded_warb_data, lon, lat)) |>
  • Now we can look at where the species occurs in 2D environmental space
ggplot(hooded_warb_data, aes(x = tmin, y = precip, color = present)) +
  • We want our SDM to find regions of climate space where the species is likely to occur
  • Visually we can see that this species is associated with high temps and precips

Modeling species distribution

  • Many different ways to model the probability of species presence
  • Look at one of the simplest - Generalized linear modeling
  • Specifically we’ll build a multivariate logistic regression
  • Do this using the glm() function
  • First argument is the model, where we want the present column to depend on tmin and precip
  • Second argument is the link function, which gives us our logistic regression
  • Third argument is the data
logistic_regr_model <- glm(present ~ tmin + precip,
                           family = binomial(link = "logit"),
                           data = hooded_warb_data)

Plot the model predictions

  • To make predictions and forecasts from this model we use the overloaded predict function from the terra package
  • Arguments
    • Raster of environmental conditions
    • Model
    • type = "response" to get probabilities
predictions <- predict(env_data_current, logistic_regr_model, type = "response")
present_loc <- select(filter(hooded_warb_data, present == 1), lon, lat)
plot(predictions, ext = extent(-140, -50, 25, 60))
points(present_loc, pch='+', cex = 0.5)
  • This shows us the probability that each species will be present across spatial locations
  • Sometimes we we want to identify the locations where we expect the species to occur instead
  • To do this we can only show predictions that are greater than some threshold
  • So, we could say that if our model says there is a 50% chance of a species being present that it is present
plot(predictions > 0.5, ext = extent(-140, -50, 25, 60))
points(present, pch='+', cex = 0.5)
  • Different thresholds often have better characteristics depending on what we value
  • Might want to make sure that all locations that a species could possibly occur are included
  • E.g., > 25% probability of occurring
plot(predictions > 0.25, ext = extent(-140, -50, 25, 60))
points(present, pch='+', cex = 0.5)

Evaluate the model performance

  • When we evaluate model performance we often do so across many thresholds

  • To evaluate model performance for binary classification problems, like our presence vs absence model, we often use Receiver Operating Characteristic or ROC curves

  • These are plots of false positives (cases where we predict a species is present but it isn’t) on the x axis

  • Against true positives (cases where we predict a species is present and it is present) on the y axis

  • We use different thresholds on the predicted probability to say whether a species is predicted to be present or absent

  • So we might say that if the probability is at least 10% we’ll call that present

  • That will give us a pair of values for true and false positives

  • Then do the same thing for 20%, 30%, 40%, and so on which gives us a curve

  • The 1:1 line on this graph is what we would get if we randomly assigned each point to be present or absent

  • So we definitely want a model that is above that line

  • Let’s make our own ROC curve

  • Split the data in into presences and absences

presence_data = filter(hooded_warb_data, present == 1)
absence_data = filter(hooded_warb_data, present == 0)
  • The run the evaluate() function from dismo
  • It takes 3 arguments: presence data, absence data, and the model
evaluation <- evaluate(presence_data, absence_data, logistic_regr_model)
  • We can then plot the ROC curve
plot(evaluation, 'ROC')
  • Looks good, but it is important to know that this is biased when using absences from large scales

Automatically choosing thresholds

  • We can then use this evaluation to automatically choose thresholds
  • One common approach is to make sure our model predicts approximately the right number of presences
  • Can choose this automatically using threshold()
  • First argument is our evaluation object
  • Second argument allows us to specific what is important to us
  • We’ll use 'prevalence' which predicts approximately the right number of presencences
thresh <- threshold(evaluation, stat = 'prevalence')
plot(predictions > thresh, ext = extent(-140, -50, 25, 60))
points(present, pch='+', cex = 0.5)

Make forecasts

  • To make predictions for the future we use the exact same approach, but using forecasts for future environmental conditions
  • We’ll use the CMIP5 predictions for 50 years from now that we loaded at the beginning of the lesson
forecasts <- predict(env_data_forecast, logistic_regr_model, type = "response")
plot(forecasts, ext = extent(-140, -50, 25, 60))
plot(forecasts > tr, ext = extent(-140, -50, 25, 60))
  • If we want to see the changes that are expected to occur we can plot the difference between the current predictions and the future forecasts
plot(forecasts - predictions, ext = extent(-140, -50, 25, 60))