I have several large GAMs to fit (lots of data, various residual distributions, and random-effect smooths). Currently I use foreach with a cluster of processors from the parallel package to run each model, via the gam function, on its own processor. This is an improvement over fitting the models serially, but I am wondering whether it is possible to use parallelization within the bam function itself while still running each model through foreach. Ideally, the results should be as close as possible to those of the original gam function. Should I use bam's cluster argument or its nthreads argument?
Example code using the cluster argument would be (my actual code differs from this, but the idea is the same):
library(parallel)
library(doParallel)
library(foreach)
library(mgcv)
nprocs <- 16
cl <- makeCluster(nprocs)
registerDoParallel(cl)
results <- foreach(i = 1:6, .packages = "mgcv") %dopar% {
  # pass the same cluster on to bam for within-model parallelization
  gam_model <- bam(n ~ te(lon, lat) + te(year, month) + s(vessel, bs = "re"),
                   data = D, family = "tw", cluster = cl)
  summary(gam_model)
}
parallel::stopCluster(cl)
Alternatively, I could try something like:
library(parallel)
library(doParallel)
library(foreach)
library(mgcv)
nprocs <- 16
cl <- makeCluster(nprocs)
registerDoParallel(cl)
nthreads <- floor(nprocs / 6)  # threads available to each of the 6 models
results <- foreach(i = 1:6, .packages = "mgcv") %dopar% {
  gam_model <- bam(n ~ te(lon, lat) + te(year, month) + s(vessel, bs = "re"),
                   data = D, family = "tw", nthreads = nthreads, discrete = TRUE)
  summary(gam_model)
}
parallel::stopCluster(cl)
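A quick sanity check on the core arithmetic behind the nthreads variant (the variable names here are just for illustration): with 16 processors shared across 6 simultaneously running models, each bam call can be given floor(16/6) = 2 threads without oversubscribing the machine.

```r
nprocs   <- 16                       # total processors available
nmodels  <- 6                        # models fitted simultaneously by foreach
nthreads <- floor(nprocs / nmodels)  # threads each bam call may use: 2
total_threads <- nmodels * nthreads  # 12 cores busy at peak, 4 left idle
stopifnot(total_threads <= nprocs)   # no oversubscription
```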
Which, if either, of these is the correct approach to parallelizing the bam execution inside a foreach() %dopar% loop? My fundamental difficulty is that I do not understand how foreach and bam will interact (e.g., how computations will be dispatched to the available processors), or what the practical differences are between the cluster and nthreads arguments.
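A third variant I have considered, sketched below with toy data standing in for my real D (the data, cluster sizes, and explicit tw() family are assumptions for illustration only), is to give each foreach worker its own small inner cluster for bam, rather than reusing the outer cl from inside the workers:

```r
library(parallel)
library(doParallel)
library(foreach)
library(mgcv)

# Toy data with the same structure as my real D (illustration only)
set.seed(1)
D <- data.frame(
  lon    = runif(500), lat = runif(500),
  year   = sample(2000:2010, 500, replace = TRUE),
  month  = sample(1:12, 500, replace = TRUE),
  vessel = factor(sample(letters[1:10], 500, replace = TRUE))
)
D$n <- rpois(500, 5)

outer_cl <- makeCluster(2)  # one outer worker per model (scaled down here)
registerDoParallel(outer_cl)

results <- foreach(i = 1:2, .packages = c("mgcv", "parallel")) %dopar% {
  # each worker creates and tears down its *own* cluster for bam
  inner_cl <- makeCluster(2)
  m <- bam(n ~ te(lon, lat) + te(year, month) + s(vessel, bs = "re"),
           data = D, family = tw(), cluster = inner_cl)
  stopCluster(inner_cl)
  summary(m)
}

stopCluster(outer_cl)
```

I do not know whether spawning a cluster inside each worker like this is sound practice, which is part of what I am asking.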