Inverse Distance weighting with R

Prepare the data for analysis

First, we download the spatial data:

  1. The GeoDatabase containing all shapefiles for the study area
  2. The CABIN data for Hamilton Harbour

Import the data in R.

library(sf)
# The RDS file is an R object and can be read with the following command:
hamilton_habitat_desc <- readRDS("../data/post_process/hamilton_habitat.rds")
lake <- st_read("../data/raw/hamilton_harbor.gdb", layer="waterbody_2")
## Reading layer `waterbody_2' from data source `/home/steve/Documents/Git/PRStats/PR-GSFE01/data/raw/hamilton_harbor.gdb' using driver `OpenFileGDB'
## Simple feature collection with 45 features and 32 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -8899166 ymin: 5349406 xmax: -8878953 ymax: 5360658
## epsg (SRID):    3857
## proj4string:    +proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs
# Making sure both are on the same CRS
st_crs(lake) == st_crs(hamilton_habitat_desc)
## [1] FALSE
hamilton_habitat_desc <- st_transform(hamilton_habitat_desc, st_crs(lake))

Filter the variable total phosphorus (ppm). Because, the data contains replicates (several years and samples), we want to average all measurements of total phosphorus by site.

library(dplyr)

hamilton_avg_tp <- hamilton_habitat_desc %>% 
  filter(Variable == "TP") %>% 
  group_by(Site.Site) %>% 
  summarise(avg_tp=mean(Value.Valeur)) 

We perform a visual check of the mean variable.

library(mapview)
mapview(hamilton_avg_tp, zcol='avg_tp', cex = 'avg_tp' )   

We need to transform the sf spatial object to sp, because the gstat package doesn’t support sf class.

hml_avg_tp <- as(hamilton_avg_tp, "Spatial")
lake <- as(lake, "Spatial")

The data is now ready for analysis.

Fit the Inverse Distance Weight (IDW)

We need to load the needed libraries for the analysis.

library(raster)
library(gstat)

We fit the IDW model on the data. We can easily play with the smoothing paramater c with set = list(idp = 5) in the gstat function.

gs <- gstat(formula=avg_tp~1, locations=hml_avg_tp)
gs_5 <- gstat(formula=avg_tp~1, locations=hml_avg_tp, set = list(idp = 5))

Build the predictions map

Then, when the model is fitted, we want to make a predictions map. We create the reference grid for model projection (using the extent of the lake).

r_ref <- raster(lake,res=100)

And we project the model on the reference grid

idw <- interpolate(r_ref, gs)
## [inverse distance weighted interpolation]
plot(idw)

As you can see, the model estimates the values for the entire extent of the reference grid. We can mask those values using the following command:

idw_mask <- mask(idw, lake)
plot(idw_mask)

We can then compare with the map predictions for c = 5

idw5 <- interpolate(r_ref, gs_5)
## [inverse distance weighted interpolation]
idw5_mask <- mask(idw5, lake)

par(mfrow=c(1,2))
plot(idw_mask, main = "c = 2")
plot(idw5_mask, main = "c = 5", legend = FALSE)

Exercice

  1. Construct an inverse distance weighting map using another variable
  2. Construct an inverse distance weighting map using your own data

Evaluate model quality

The choice of the power function parameter (through idp) might be subjective. Let’s perform quality evaluation through with Leave-one-out validation routine (LOOCV) with \(c = 2\).

IDW_out <- vector(length = length(hml_avg_tp))
for (i in 1:length(hml_avg_tp)) {
  IDW_out[i] <- idw(avg_tp ~ 1, hml_avg_tp[-i,], hml_avg_tp[i,], idp = 2, debug.level = 0)$var1.pred
}

We then can assess the model quality by comparing the observed versus the predicted data

plot(IDW_out ~ hml_avg_tp$avg_tp, asp=1, xlab="Observed", ylab="Predicted", pch=16,
      col=rgb(0,0,0,0.5))
abline(lm(IDW_out ~ hml_avg_tp$avg_tp), col="red", lw=2,lty=2)
abline(0,1)

Model quality statistics

Quick reminder of the statistics to use

Using the vector IDW_out, we can compute the mean of squared prediction error (MSPE) for the IDW model (with \(c = 2\))

mean((hml_avg_tp$avg_tp - IDW_out)^2)
## [1] 542716.3

Finding the optimum \(c\) value

Define the cost function

optim() is general-purpose optimization function. optim() will cover the parameter space (here \(c\)) in order to minimize the mean squared prediction error (MSPE). Once the minimum MSPE found, optim returns the best corresponding \(c\) value. For this purpose, you need to provide a cost function to optim. This function cost_fun takes the \(c\) parameter as an argument and compute the MSPE with the Leave-one-out validation routine (LOOCV).

cost_fun <- function(const){
  IDW_out <- vector(length = length(hml_avg_tp))
  for (i in 1:length(hml_avg_tp)) {
    IDW_out[i] <- idw(avg_tp ~ 1, hml_avg_tp[-i,], hml_avg_tp[i,], idp = const, debug.level = 0)$var1.pred
  }
  # Return the MSPE for the c value 
  return(mean((hml_avg_tp$avg_tp - IDW_out)^2))
}

# Set the initial c parameter
init_c <- 2
# Lauch optim function
optim(init_c, cost_fun, method = "Brent", lower = 1, upper = 3)
## $par
## [1] 1.179947
## 
## $value
## [1] 527253.7
## 
## $counts
## function gradient 
##       NA       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

\(K\)-fold cross-validation

  • Step 1: Randomly split the data into \(K\) roughly equal-size group (or components or folds)

Declaring the number of groups (nfold), and attach each points to a specific group (k).

nfolds <- 5
k <- sample(1:nfolds,length(hml_avg_tp), replace = TRUE)
k
##  [1] 1 4 1 4 4 5 1 5 2 1 3 4 3 1 3 1 1 1 2 1 5 4 3 5 2 1 3 3 3 1 2 2 5 1 2
## [36] 1 3 3 3 5 2 5 2 1 3 5 5 1 1 3 1 5 4 3 2 1 2 1 3 2 2 3 1 2 3 2 2 5 5 4
## [71] 3 1 5 1 4 4 3 5 5 2 3 2 1
sf_hml <- st_as_sf(hml_avg_tp)
sf_hml$group <- k
mapview(sf_hml, zcol='group')
  • Step 2: Hold out one group (the \(k^\text{th}\) group) and train the model on the remaining data. Usually, using 5 or 10 groups works well. The resulting model can be defined as \(\widehat{Z}^{(-k)}_i\) for \(i = 1, \dots, m_k\) where \(m_k\) is the number of samples in the \(k^\text{th}\) group.
  • Step 3: Use a metric to evaluate the quality of the model (MSPE here)
  • Step 4: Repeat steps 2 and 3 by holding out another group until all groups have been considered.
  • Step 5: Evaluate the quality of the cross valiation using a \(K\)-fold cross-validation score

\[\text{CV}_{(K)} = \frac{1}{K} \sum_{k=1}^K \text{MSPE}_k\]

The cross-validation score is particularly useful if there is an interest in comparing different model formulation.

Step 2 to 5

# Declare c parameter
const <- 2

# vector storing the MSPE value pour each fold
vec_MSPE <- rep(NA, nfolds)

# Looping over groups
for (i in 1:nfolds) {
  # Create training and validation set
  valid <- hml_avg_tp[k==i,]
  train <- hml_avg_tp[k!=i,]

  # Perform the IDW model on the training set
  m <- gstat(formula=avg_tp~1, locations=train, set=list(idp = const))
  preds <- predict(m, newdata=valid, debug = 0)$var1.pred

  # Assign the value to the MSPE vector
  vec_MSPE[i] <- mean((preds - valid$avg_tp)^2) 
}

# Compute mean MSPE
cv_k <- mean(vec_MSPE)

Exercice

  1. Find the best value of \(c\) using K-fold cross validation by changing the value of \(c\) by increments of 0.1 Between \(c\) values of 1, and 3.

  2. How different is the “best” cross-validation value compared to the one obtained using optim. Why do you think this is different.

Weighted Gaussian kernel

# install.packages("spatialEco")
library("spatialEco")

# Include some data into the raster (if you do not it will not work !)
r_ref[1:length(r_ref)]<- 0 

# Weighted gaussian kernel
wgk <- sp.kde(x= hml_avg_tp,
              y = hml_avg_tp$avg_tp,
              bw = 1000,
              newdata = r_ref,
              standardize = TRUE,
              scale.factor = 10000)
## 
##  calculating weighted kde
# Clean results for projection
wgk_mask <- mask(wgk, lake)

# Draw map
mapview(wgk_mask)

Exercice

Use a K-fold cross-validation to choose the best bandwitch value. Decide on - The increment to use for the bandwitch - The number of folds (\(K\)) to use in the cross-validation

By making these choices, you make a tradeoff between computation time and parameter estimation quality.