
I have several large GAM models to fit (lots of data, various distributions for residuals and random effect smooths). Currently, I use foreach with a cluster of processors based on the parallel package to run each of the models using the gam function on an individual processor. This is an improvement over evaluating the models serially, but I am wondering if it is possible to use parallelization within the bam function itself while at the same time running each model using foreach. Ideally, I would like the results of the model runs to be as similar as possible to those of the original gam function. Should I use the cluster argument to bam or the nthreads argument?

Example code using the cluster argument would be (my actual code differs from this, but the idea is the same):

library(parallel)
library(doParallel)
library(foreach)
library(mgcv)

nprocs <- 16
cl <- makeCluster(nprocs)
registerDoParallel(cl)

results <- foreach(i = 1:6, .packages = "mgcv") %dopar% {
  gam_model <- bam(n ~ te(lon, lat) + te(year, month) + s(vessel, bs = "re"),
                   data = D, family = "tw", cluster = cl)
  summary(gam_model)
}

parallel::stopCluster(cl)  

Alternatively, I could try something like:

library(parallel)
library(doParallel)
library(foreach)
library(mgcv)

nprocs <- 16
cl <- makeCluster(nprocs)
registerDoParallel(cl)

nthreads <- floor(nprocs / 6)

results <- foreach(i = 1:6, .packages = "mgcv") %dopar% {
  gam_model <- bam(n ~ te(lon, lat) + te(year, month) + s(vessel, bs = "re"),
                   data = D, family = "tw", nthreads = nthreads, discrete = TRUE)
  summary(gam_model)
}

parallel::stopCluster(cl)  

Which, if either, of these is the best/correct approach to parallelizing the bam execution while inside a foreach() %dopar%? The fundamental sticking point is that I do not understand how foreach and bam will interact (e.g., how calculations will be dispatched to the available processors), or what the differences will be if I use cluster versus nthreads.

2 Answers


I see you have found something that works for you, but some observations to consider:

You're not really comparing like for like there; the nthreads version requires discrete = TRUE, which uses a different, faster, but slightly more approximate algorithm than the one used with discrete = FALSE (i.e. when cluster is used).

All else equal, bam() will run much faster with discrete = TRUE. But it is also the method that will be least like the gam() fit because of the discretization. Often the differences are small, but if your criterion is similarity to the gam() fits, discrete = TRUE might not be the best option - I have seen important differences between models fitted with discrete = TRUE and discrete = FALSE with a Tweedie family, for example (although not often).
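The size of that discretization effect is easy to check directly. A minimal sketch on simulated data from mgcv's gamSim() (toy data, not the vessel model in the question), comparing both bam() variants against a gam() reference fit:

```r
library(mgcv)

set.seed(1)
## gamSim(1) generates the standard four-term example data from ?gamSim
dat <- gamSim(1, n = 2000, dist = "normal", scale = 2)

m_gam      <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")
m_exact    <- bam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, discrete = FALSE)
m_discrete <- bam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, discrete = TRUE)

## Largest difference in fitted values relative to the gam() reference;
## expect the discrete = TRUE fit to deviate (slightly) more
max(abs(fitted(m_exact)    - fitted(m_gam)))
max(abs(fitted(m_discrete) - fitted(m_gam)))
```

The same comparison on your real data and family (e.g. Tweedie) is the reliable way to decide whether the discretized fit is close enough for your purposes.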

Rather than fitting 6 models at a time, such that you only get about a 1.5x speed-up per model (with two threads each), have you tried/timed running fewer models at once but giving more threads to each model? Which is quicker will depend on how much time is actually spent in the multithreaded code versus single-threaded code for your models. If the models are all accessing different data, getting that data into and out of the caches could become more of a bottleneck when you run multiple models at once.

Note Simon's points about memory bandwidth often being the constraint on parallel performance with GAMs. Have you tried fitting the models in sequence, but allowing them to use all the processors for threads? For some very big model fits, I achieved good performance by fitting models sequentially, but using a lot of threads per model fit.
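A minimal sketch of that sequential strategy, again on gamSim() toy data, with a list of small formulas standing in for the six real models (detectCores() stands in for the 16-20 cores in the question):

```r
library(mgcv)

## Give each fit (almost) the whole machine, leaving one core free
nthreads <- max(1L, parallel::detectCores() - 1L)

set.seed(2)
dat <- gamSim(1, n = 4000, scale = 2)

## Toy formulas standing in for the six real models
forms <- list(y ~ s(x0) + s(x1),
              y ~ s(x1) + s(x2),
              y ~ s(x0) + s(x2))

## Sequential outer loop: no foreach/cluster, every fit gets all the threads
results <- lapply(forms, function(f)
  summary(bam(f, data = dat, nthreads = nthreads, discrete = TRUE)))
```

Timing this against the foreach version on your own data is the only way to know which wins, since it depends on how well each individual fit scales with threads.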

Finally, with nthreads, make sure you aren't also running a multithreaded BLAS on the system where this is all being done. If the BLAS is multithreaded, using nthreads = c(8, 1) will use 8 OMP threads where it really matters (the inner products) and just 1 thread for the other BLAS calls (where those BLAS calls will already be using multiple threads because of the multithreaded BLAS).
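A sketch of that two-element nthreads form on toy data (c(2, 1) here stands in for the c(8, 1) above, to keep the example modest):

```r
library(mgcv)

set.seed(3)
dat <- gamSim(1, n = 2000, scale = 2)

## Two-element nthreads with discrete = TRUE:
## nthreads[1] OpenMP threads for the inner-product accumulation,
## nthreads[2] threads for the remaining (BLAS-backed) steps, so a
## multithreaded BLAS does not oversubscribe the cores
m <- bam(y ~ s(x0) + s(x1) + s(x2), data = dat,
         discrete = TRUE, nthreads = c(2, 1))
```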


Comments

Thanks for these helpful thoughts. The runs with discrete=TRUE have finished, but I haven't analyzed them yet. I take your point about maybe running sequentially but giving each model lots of threads (or the entire cluster). I thought about that, but as I wasn't sure how many threads a typical bam run can use, I decided that at least with foreach I knew that part was parallelizable. Note that in my real case I actually had 5 models and 20 cores, so each model got 4 threads, which seemed like a usable amount for bam, and they ran quite quickly.
Note that on the cluster where I am running these models, I got the following warning message many times: "OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option." Does this relate to your comments on BLAS?
You should check what flavour of OpenBLAS is installed on the cluster; there are several flavours, single-threaded, pthreads, openmp threads, different precisions. Check out the warning in the Changelog of mgcv about versions of OpenBLAS that are not thread safe. Also try running mgcv::blas.thread.test on the cluster to see if it detects any issues before you start fitting models. And also read ?blas.thread.test to see if things there help you decide what to use.
You may not be in control of this, but you might be able to install a different OpenBLAS - the defaults on many Linux flavours don't include OpenMP support, defaulting instead to pthreads.
I just went over the results of the discrete=TRUE runs. They look generally quite similar to the gam models, but they seem to fit the data worse. I also noticed that one of the models has a coefficient covariance matrix with a degenerate Cholesky factorisation. I am trying the sequential approach with the cluster argument.
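The blas.thread.test() check suggested in the comments above can be run on the cluster before any model fitting; a thread-unsafe OpenBLAS build tends to fail or hang here rather than partway through an analysis:

```r
library(mgcv)

## Stress-test the linked BLAS from multiple threads;
## see ?blas.thread.test for how to interpret failures
blas.thread.test(n = 1000, nt = 2)
```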

I tried both of the solutions above, and the approach using discrete=TRUE and nthreads appears to be the correct one. The approach using the cluster argument fails with an error about being unable to make a connection, presumably because the cluster is already occupied by the foreach call. The nthreads solution also ran considerably faster than fitting the model with the gam function, presumably in part due to the parallel processing.

Therefore, the solution that worked for me is:

library(parallel)
library(doParallel)
library(foreach)
library(mgcv)

nprocs <- 16
cl <- makeCluster(nprocs)
registerDoParallel(cl)

nthreads <- floor(nprocs / 6)

results <- foreach(i = 1:6, .packages = "mgcv") %dopar% {
  gam_model <- bam(n ~ te(lon, lat) + te(year, month) + s(vessel, bs = "re"),
                   data = D, family = "tw", nthreads = nthreads, discrete = TRUE)
  summary(gam_model)
}

parallel::stopCluster(cl) 

