First, we download the spatial data:
Import the data in R.
library(sf)
# The RDS file is an R object and can be read with the following command:
hamilton_habitat_desc <- readRDS("../data/post_process/hamilton_habitat.rds")
lake <- st_read("../data/raw/hamilton_harbor.gdb", layer="waterbody_2")
## Reading layer `waterbody_2' from data source `/home/steve/Documents/Git/PRStats/PR-GSFE01/data/raw/hamilton_harbor.gdb' using driver `OpenFileGDB'
## Simple feature collection with 45 features and 32 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -8899166 ymin: 5349406 xmax: -8878953 ymax: 5360658
## epsg (SRID): 3857
## proj4string: +proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs
## [1] FALSE
Filter the variable total phosphorus (ppm). Because, the data contains replicates (several years and samples), we want to average all measurements of total phosphorus by site.
library(dplyr)
hamilton_avg_tp <- hamilton_habitat_desc %>%
filter(Variable == "TP") %>%
group_by(Site.Site) %>%
summarise(avg_tp=mean(Value.Valeur))
We perform a visual check of the mean variable.
We need to transform the sf
spatial object to sp
, because the gstat
package doesn’t support sf
class.
The data is now ready for analysis.
We need to load the needed libraries for the analysis.
We fit the IDW model on the data. We can easily play with the smoothing paramater c
with set = list(idp = 5)
in the gstat
function.
gs <- gstat(formula=avg_tp~1, locations=hml_avg_tp)
gs_5 <- gstat(formula=avg_tp~1, locations=hml_avg_tp, set = list(idp = 5))
Then, when the model is fitted, we want to make a predictions map. We create the reference grid for model projection (using the extent of the lake
).
And we project the model on the reference grid
## [inverse distance weighted interpolation]
As you can see, the model estimates the values for the entire extent of the reference grid. We can mask those values using the following command:
We can then compare with the map predictions for c = 5
## [inverse distance weighted interpolation]
idw5_mask <- mask(idw5, lake)
par(mfrow=c(1,2))
plot(idw_mask, main = "c = 2")
plot(idw5_mask, main = "c = 5", legend = FALSE)
The choice of the power function parameter (through idp
) might be subjective. Let’s perform quality evaluation through with Leave-one-out validation routine (LOOCV) with \(c = 2\).
IDW_out <- vector(length = length(hml_avg_tp))
for (i in 1:length(hml_avg_tp)) {
IDW_out[i] <- idw(avg_tp ~ 1, hml_avg_tp[-i,], hml_avg_tp[i,], idp = 2, debug.level = 0)$var1.pred
}
We then can assess the model quality by comparing the observed versus the predicted data
plot(IDW_out ~ hml_avg_tp$avg_tp, asp=1, xlab="Observed", ylab="Predicted", pch=16,
col=rgb(0,0,0,0.5))
abline(lm(IDW_out ~ hml_avg_tp$avg_tp), col="red", lw=2,lty=2)
abline(0,1)
optim()
is general-purpose optimization function. optim()
will cover the parameter space (here \(c\)) in order to minimize the mean squared prediction error (MSPE). Once the minimum MSPE found, optim
returns the best corresponding \(c\) value. For this purpose, you need to provide a cost function to optim
. This function cost_fun
takes the \(c\) parameter as an argument and compute the MSPE with the Leave-one-out validation routine (LOOCV).
cost_fun <- function(const){
IDW_out <- vector(length = length(hml_avg_tp))
for (i in 1:length(hml_avg_tp)) {
IDW_out[i] <- idw(avg_tp ~ 1, hml_avg_tp[-i,], hml_avg_tp[i,], idp = const, debug.level = 0)$var1.pred
}
# Return the MSPE for the c value
return(mean((hml_avg_tp$avg_tp - IDW_out)^2))
}
# Set the initial c parameter
init_c <- 2
# Lauch optim function
optim(init_c, cost_fun, method = "Brent", lower = 1, upper = 3)
## $par
## [1] 1.179947
##
## $value
## [1] 527253.7
##
## $counts
## function gradient
## NA NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Declaring the number of groups (nfold
), and attach each points to a specific group (k
).
## [1] 1 4 1 4 4 5 1 5 2 1 3 4 3 1 3 1 1 1 2 1 5 4 3 5 2 1 3 3 3 1 2 2 5 1 2
## [36] 1 3 3 3 5 2 5 2 1 3 5 5 1 1 3 1 5 4 3 2 1 2 1 3 2 2 3 1 2 3 2 2 5 5 4
## [71] 3 1 5 1 4 4 3 5 5 2 3 2 1
\[\text{CV}_{(K)} = \frac{1}{K} \sum_{k=1}^K \text{MSPE}_k\]
The cross-validation score is particularly useful if there is an interest in comparing different model formulation.
# Declare c parameter
const <- 2
# vector storing the MSPE value pour each fold
vec_MSPE <- rep(NA, nfolds)
# Looping over groups
for (i in 1:nfolds) {
# Create training and validation set
valid <- hml_avg_tp[k==i,]
train <- hml_avg_tp[k!=i,]
# Perform the IDW model on the training set
m <- gstat(formula=avg_tp~1, locations=train, set=list(idp = const))
preds <- predict(m, newdata=valid, debug = 0)$var1.pred
# Assign the value to the MSPE vector
vec_MSPE[i] <- mean((preds - valid$avg_tp)^2)
}
# Compute mean MSPE
cv_k <- mean(vec_MSPE)
Find the best value of \(c\) using K-fold cross validation by changing the value of \(c\) by increments of 0.1 Between \(c\) values of 1, and 3.
How different is the “best” cross-validation value compared to the one obtained using optim
. Why do you think this is different.
# install.packages("spatialEco")
library("spatialEco")
# Include some data into the raster (if you do not it will not work !)
r_ref[1:length(r_ref)]<- 0
# Weighted gaussian kernel
wgk <- sp.kde(x= hml_avg_tp,
y = hml_avg_tp$avg_tp,
bw = 1000,
newdata = r_ref,
standardize = TRUE,
scale.factor = 10000)
##
## calculating weighted kde
Use a K-fold cross-validation to choose the best bandwitch value. Decide on - The increment to use for the bandwitch - The number of folds (\(K\)) to use in the cross-validation
By making these choices, you make a tradeoff between computation time and parameter estimation quality.