R Tutorial

Video Tutorial

Text Tutorial

Species Distribution Models attempt to model where species are expect to occur
Can be based on lots of things, but often just environmental variables
This demo will build a simple version version using a linear model and two predictors

Load some packages

dismo is the major package for SDM in R
We’ll also load ggplot2 for plotting and dplyr for data manipulation

library(dismo)
library(ggplot2)
library(dplyr)
library(terra)

Data

We need two kinds of data for SDMs
Locations for where a species occurs
- And ideally where it is absent
- When no absences make fake absences called “background” or “pseudo absences”
Spatial data on predictor variables
- Most commonly temp & precip based
- But that is mostly convenience based, not biology based
- To use the SDM for forecasting we need both current and future values
- So we need forecasts for our predictor variables to make forecasts for species distributions
For location data we’ll use the Breeding Bird Survey of North America data on the Hooded Warbler
For environmental data we’ll use minimum temperature and annual precipitation
Fore forecasts of environmental data we’ll use CMIP5 50 year forecasts (CMIP = coupled model intercomparison project)
Download data from https://course.naturecast.org/data/sdm_data.zip and unzip it into your working directory
The hooded warbler data is in a csv file
Load it using read.csv()

hooded_warb_data = read.csv("hooded_warb_locations.csv")
head(hooded_warb_data)

We can see that this file has information on locations as latitude & longitude plus whether or not the species is present at that location
The environmental data is stored in raster stacks in .grd files
Load those using the rast(), which is part of the terra package

env_data_current = rast("env_current.grd")
env_data_forecast = rast("env_forecast.grd")
plot(env_data_current$tmin)
plot(env_data_current$precip)

Determine environment where species is (and isn’t)

Now we need to combine the data to get information environment where a species occur and where it doesn’t
Combine the two data types
To start we’ll extract the environmental data at the locations that have been sampled
To do this we get just the position data for the locations by selecting the lon and lat columns
And then use the extract() function from the raster package
It takes a raster that we want to extract data from as the first argument and the point locations to extract as the second argument
Then we need to attach these extracted environmental values back to our presence absence data using bind_rows()

hooded_warb_data = env_data_current |>
  terra::extract(dplyr::select(hooded_warb_data, lon, lat)) |>
  bind_cols(hooded_warb_data)

Now we can look at where the species occurs in 2D environmental space

ggplot(hooded_warb_data, aes(x = tmin, y = precip, color = present)) +
  geom_point()

We want our SDM to find regions of climate space where the species is likely to occur
Visually we can see that this species is associated with high temps and precips

Modeling species distribution

Many different ways to model the probability of species presence
Look at one of the simplest - Generalized linear modeling
Specifically we’ll build a multivariate logistic regression
Do this using the glm() function
First argument is the model, where we want the present column to depend on tmin and precip
Second argument is the link function, which gives us our logistic regression
Third argument is the data

logistic_regr_model <- glm(present ~ tmin + precip,
                           family = binomial(link = "logit"),
                           data = hooded_warb_data)
summary(logistic_regr_model)

Plot the model predictions

To make predictions and forecasts from this model we use the overloaded predict function from the terra package
Arguments
- Raster of environmental conditions
- Model
- type = "response" to get probabilities

predictions <- predict(env_data_current, logistic_regr_model, type = "response")
present_loc <- dplyr::select(filter(hooded_warb_data, present == 1), lon, lat)
plot(predictions, ext = extent(-140, -50, 25, 60))
points(present_loc, pch='+', cex = 0.5)

This shows us the probability that each species will be present across spatial locations
Sometimes we we want to identify the locations where we expect the species to occur instead
To do this we can only show predictions that are greater than some threshold
So, we could say that if our model says there is a 50% chance of a species being present that it is present

plot(predictions > 0.5, ext = extent(-140, -50, 25, 60))
points(present_loc, pch='+', cex = 0.5)

Different thresholds often have better characteristics depending on what we value
Might want to make sure that all locations that a species could possibly occur are included
E.g., > 25% probability of occurring

plot(predictions > 0.25, ext = extent(-140, -50, 25, 60))
points(present_loc, pch='+', cex = 0.5)

Evaluate the model performance

When we evaluate model performance we often do so across many thresholds
To evaluate model performance for binary classification problems, like our presence vs absence model, we often use Receiver Operating Characteristic or ROC curves
These are plots of false positives (cases where we predict a species is present but it isn’t) on the x axis
Against true positives (cases where we predict a species is present and it is present) on the y axis
We use different thresholds on the predicted probability to say whether a species is predicted to be present or absent
So we might say that if the probability is at least 10% we’ll call that present
That will give us a pair of values for true and false positives
Then do the same thing for 20%, 30%, 40%, and so on which gives us a curve
The 1:1 line on this graph is what we would get if we randomly assigned each point to be present or absent
So we definitely want a model that is above that line
Let’s make our own ROC curve
Split the data in into presences and absences

presence_data = filter(hooded_warb_data, present == 1)
absence_data = filter(hooded_warb_data, present == 0)

The run the evaluate() function from dismo
It takes 3 arguments: presence data, absence data, and the model

evaluation <- evaluate(presence_data, absence_data, logistic_regr_model)

We can then plot the ROC curve

plot(evaluation, 'ROC')

Looks good, but it is important to know that this is biased when using absences from large scales

Automatically choosing thresholds

We can then use this evaluation to automatically choose thresholds
One common approach is to make sure our model predicts approximately the right number of presences
Can choose this automatically using threshold()
First argument is our evaluation object
Second argument allows us to specific what is important to us
We’ll use 'prevalence' which predicts approximately the right number of presencences

thresh <- threshold(evaluation, stat = 'prevalence')
plot(predictions > thresh, ext = extent(-140, -50, 25, 60))
points(present_loc, pch='+', cex = 0.5)

Make forecasts

To make predictions for the future we use the exact same approach, but using forecasts for future environmental conditions
We’ll use the CMIP5 predictions for 50 years from now that we loaded at the beginning of the lesson

forecasts <- predict(env_data_forecast, logistic_regr_model, type = "response")
plot(forecasts, ext = extent(-140, -50, 25, 60))
plot(forecasts > tr, ext = extent(-140, -50, 25, 60))

If we want to see the changes that are expected to occur we can plot the difference between the current predictions and the future forecasts

plot(forecasts - predictions, ext = extent(-140, -50, 25, 60))

Last updated on Feb 6, 2026

Edit this page