R Tutorial
Video Tutorial
Text Tutorial
- Species Distribution Models attempt to model where species are expect to occur
- Can be based on lots of things, but often just environmental variables
- This demo will build a simple version version using a linear model and two predictors
Load some packages
dismo
is the major package for SDM in R- We’ll also load
ggplot2
for plotting anddplyr
for data manipulation
library(dismo)
library(ggplot2)
library(dplyr)
library(terra)
Data
We need two kinds of data for SDMs
Locations for where a species occurs
- And ideally where it is absent
- When no absences make fake absences called “background” or “pseudo absences”
Spatial data on predictor variables
- Most commonly temp & precip based
- But that is mostly convenience based, not biology based
- To use the SDM for forecasting we need both current and future values
- So we need forecasts for our predictor variables to make forecasts for species distributions
For location data we’ll use the Breeding Bird Survey of North America data on the Hooded Warbler
For environmental data we’ll use minimum temperature and annual precipitation
Fore forecasts of environmental data we’ll use CMIP5 50 year forecasts (CMIP = coupled model intercomparison project)
Download data from https://course.naturecast.org/data/sdm_data.zip and unzip it into your working directory
The hooded warbler data is in a csv file
Load it using
read.csv()
hooded_warb_data = read.csv("hooded_warb_locations.csv")
head(hooded_warb_data)
We can see that this file has information on locations as latitude & longitude plus whether or not the species is present at that location
The environmental data is stored in raster stacks in
.grd
filesLoad those using the
rast()
, which is part of theterra
package
env_data_current = rast("env_current.grd")
env_data_forecast = rast("env_forecast.grd")
plot(env_data_current$tmin)
plot(env_data_current$precip)
Determine environment where species is (and isn’t)
- Now we need to combine the data to get information environment where a species occur and where it doesn’t
- Combine the two data types
- To start we’ll extract the environmental data at the locations that have been sampled
- To do this we get just the position data for the locations by selecting the
lon
andlat
columns - And then use the
extract()
function from the raster package - It takes a raster that we want to extract data from as the first argument and the point locations to extract as the second argument
- Then we need to attach these extracted environmental values back to our presence absence data using
bind_rows()
hooded_warb_data = env_data_current |>
extract(select(hooded_warb_data, lon, lat)) |>
bind_cols(hooded_warb_data)
- Now we can look at where the species occurs in 2D environmental space
ggplot(hooded_warb_data, aes(x = tmin, y = precip, color = present)) +
geom_point()
- We want our SDM to find regions of climate space where the species is likely to occur
- Visually we can see that this species is associated with high temps and precips
Modeling species distribution
- Many different ways to model the probability of species presence
- Look at one of the simplest - Generalized linear modeling
- Specifically we’ll build a multivariate logistic regression
- Do this using the
glm()
function - First argument is the model, where we want the
present
column to depend ontmin
andprecip
- Second argument is the link function, which gives us our logistic regression
- Third argument is the data
logistic_regr_model <- glm(present ~ tmin + precip,
family = binomial(link = "logit"),
data = hooded_warb_data)
summary(logistic_regr_model)
Plot the model predictions
- To make predictions and forecasts from this model we use the overloaded
predict
function from theterra
package - Arguments
- Raster of environmental conditions
- Model
type = "response"
to get probabilities
predictions <- predict(env_data_current, logistic_regr_model, type = "response")
present_loc <- select(filter(hooded_warb_data, present == 1), lon, lat)
plot(predictions, ext = extent(-140, -50, 25, 60))
points(present_loc, pch='+', cex = 0.5)
- This shows us the probability that each species will be present across spatial locations
- Sometimes we we want to identify the locations where we expect the species to occur instead
- To do this we can only show predictions that are greater than some threshold
- So, we could say that if our model says there is a 50% chance of a species being present that it is present
plot(predictions > 0.5, ext = extent(-140, -50, 25, 60))
points(present, pch='+', cex = 0.5)
- Different thresholds often have better characteristics depending on what we value
- Might want to make sure that all locations that a species could possibly occur are included
- E.g., > 25% probability of occurring
plot(predictions > 0.25, ext = extent(-140, -50, 25, 60))
points(present, pch='+', cex = 0.5)
Evaluate the model performance
When we evaluate model performance we often do so across many thresholds
To evaluate model performance for binary classification problems, like our presence vs absence model, we often use Receiver Operating Characteristic or ROC curves
These are plots of false positives (cases where we predict a species is present but it isn’t) on the x axis
Against true positives (cases where we predict a species is present and it is present) on the y axis
We use different thresholds on the predicted probability to say whether a species is predicted to be present or absent
So we might say that if the probability is at least 10% we’ll call that present
That will give us a pair of values for true and false positives
Then do the same thing for 20%, 30%, 40%, and so on which gives us a curve
The 1:1 line on this graph is what we would get if we randomly assigned each point to be present or absent
So we definitely want a model that is above that line
Let’s make our own ROC curve
Split the data in into presences and absences
presence_data = filter(hooded_warb_data, present == 1)
absence_data = filter(hooded_warb_data, present == 0)
- The run the
evaluate()
function fromdismo
- It takes 3 arguments: presence data, absence data, and the model
evaluation <- evaluate(presence_data, absence_data, logistic_regr_model)
- We can then plot the ROC curve
plot(evaluation, 'ROC')
- Looks good, but it is important to know that this is biased when using absences from large scales
Automatically choosing thresholds
- We can then use this
evaluation
to automatically choose thresholds - One common approach is to make sure our model predicts approximately the right number of presences
- Can choose this automatically using
threshold()
- First argument is our
evaluation
object - Second argument allows us to specific what is important to us
- We’ll use
'prevalence'
which predicts approximately the right number of presencences
thresh <- threshold(evaluation, stat = 'prevalence')
plot(predictions > thresh, ext = extent(-140, -50, 25, 60))
points(present, pch='+', cex = 0.5)
Make forecasts
- To make predictions for the future we use the exact same approach, but using forecasts for future environmental conditions
- We’ll use the CMIP5 predictions for 50 years from now that we loaded at the beginning of the lesson
forecasts <- predict(env_data_forecast, logistic_regr_model, type = "response")
plot(forecasts, ext = extent(-140, -50, 25, 60))
plot(forecasts > tr, ext = extent(-140, -50, 25, 60))
- If we want to see the changes that are expected to occur we can plot the difference between the current predictions and the future forecasts
plot(forecasts - predictions, ext = extent(-140, -50, 25, 60))