Programs can be of different types:
Compute bound, if they mainly use CPU power (more cores can help)
Memory bound, if the bottlenecks are memory allocation and copying/duplicating of objects (the large-memory nodes on Kebnekaise can help)
Implicit parallelism is included in some packages; one only needs to set the number of workers (threads)
Explicit parallelism requires the user to write the proper parallelization instructions (Rmpi, for instance); a short sketch follows this list
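As a minimal sketch of explicit parallelism (not part of the original material; the toy function and the choice of 4 workers are only illustrative), the parallel package that ships with R lets the user decide explicitly which computations run on several workers:

library(parallel)                      # ships with base R
f <- function(i) mean(rnorm(1e6))      # toy task: one independent simulation per call
res <- mclapply(1:8, f, mc.cores = 4)  # explicit: the user chooses what to parallelize and
                                       # with how many workers; fork-based, so on Windows
                                       # use parLapply() with a cluster instead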
This loop can be easily parallelized:
for(i in 1:100){
    b[i] <- 4
    a[i] <- 2*b[i] + 1
}
but this one cannot, because of the dependency on values of the b vector computed in previous iterations:
for(i in 1:100){
    b[i] <- 4
    a[i] <- 2*b[i-1] + 1
}
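For instance, the independent loop above could be rewritten with the foreach/doParallel machinery presented later in this material; this is only a sketch, and the two workers are an arbitrary choice:

library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
a <- foreach(i = 1:100, .combine = c) %dopar% {   # iterations are independent
    b <- 4
    2*b + 1
}
stopCluster(cl)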
There are several versions of R installed on Kebnekaise:

ml spider R
# Versions:
#   R/3.3.1
#   R/3.4.4-X11-20180131
#   R/3.5.1-Python-2.7.15
#   R/3.5.1
#   R/3.6.0
#   R/3.6.2
#   R/4.0.0

ml spider R/3.6.0    # search for the modules needed by R
# You will need to load all module(s) on any one of the lines below
# before the "R/3.6.0" module is available to load.
#   GCC/8.2.0-2.31.1  OpenMPI/3.1.3
R --help
# Usage: R [options] [< infile] [> outfile]
#    or: R CMD command [arguments]
#
# Start R, a system for statistical computation and graphics, with the
# specified options, or invoke an R tool via the 'R CMD' interface.
#
# Options:
#   -h, --help            Print short help message and exit
#   --version             Print version info and exit
#   --encoding=ENC        Specify encoding to be used for stdin
#   --encoding ENC
#   RHOME                 Print path to R home directory and exit
#   --save                Do save workspace at the end of the session
#   --no-save             Don't save it
#   --no-environ          Don't read the site and user environment files
If your job script was edited on Windows, convert its line endings with

dos2unix job.sh

before submitting your job.
One can use the job array option in SLURM to run independent instances of a program:

#!/bin/bash
#SBATCH -A Project_ID
# Asking for 12 min.
#SBATCH -t 00:12:00
#SBATCH --array=1-28
# Writing the output and error files
#SBATCH --output=Array_test.%A_%a.out
#SBATCH --error=Array_test.%A_%a.error

ml GCC/8.2.0-2.31.1 OpenMPI/3.1.3
ml R/3.6.0

R --no-save --no-restore -f script.R
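Inside script.R, each array task can select its own piece of work through the SLURM_ARRAY_TASK_ID environment variable. The following is only a sketch; the seeding scheme, workload, and file names are illustrative and not part of the original example:

# script.R (sketch)
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))   # 1..28 for the array above
set.seed(task_id)                                           # e.g. a different seed per task
res <- mean(rnorm(1e6))                                     # placeholder workload
saveRDS(res, file = paste0("result_", task_id, ".rds"))     # one output file per task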
squeue -a -u username   lists your jobs in the queue
projinfo                displays the project's usage
Some libraries, such as BLAS/LAPACK, have an implicit parallelization layer that can be activated by setting the number of threads.
On Kebnekaise the OpenBLAS libraries are available and can use implicit parallelism:
sessionInfo()
# R version 3.6.0 (2019-04-26)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 16.04.6 LTS
#
# Matrix products: default
# BLAS/LAPACK: /cvmfs/.../8.2.0-2.31.1/OpenBLAS/0.3.5/lib/libopenblas_haswellp-r0.3.5.so
The number of BLAS threads can be set with the RhpcBLASctl package:
library(RhpcBLASctl)
n <- 5000; nsim <- 3                # matrix size n x n; nr. of independent simulations
set.seed(123); summa <- 0; x <- 0
blas_set_num_threads(1)             # set the number of threads
for (i in 1:nsim) {
    m <- matrix(rnorm(n^2), n); a <- crossprod(m)   # random matrix, symmetrized
    timing <- system.time({
        x <- eigen(a, symmetric = TRUE, only.values = TRUE)$values[1]   # compute eigenvalues
    })[3]
    summa <- summa + timing
}
times <- summa/nsim
cat(c("Computation of eig. random matrix 5000x5000 (sec): ", times, "\n"))
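To see the effect of the implicit parallelism, the same kind of measurement can be repeated for a few thread counts. This is only a sketch; the matrix size and the thread counts are arbitrary choices:

library(RhpcBLASctl)
for (nt in c(1, 2, 4)) {
    blas_set_num_threads(nt)                            # change the nr. of BLAS threads
    a <- crossprod(matrix(rnorm(4000^2), 4000))         # symmetric test matrix
    t_el <- system.time(eigen(a, symmetric = TRUE, only.values = TRUE))[3]
    cat("threads:", nt, " elapsed (sec):", t_el, "\n")
}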
Other packages (doParallel, parallel, doMC, doMPI, doFuture) follow a common pattern of instructions to use parallel capabilities:
library("package-name") cl <- makeCluster(NumberofCores) register_cluster(cl) ... #code to be run in parallel mode stopCluster(cl)
The foreach package is used for executing loops (here sequentially, with %do%):
library(foreach)
library(iterators)                                # provides icount()
x <- iris[which(iris[,5] != "setosa"), c(1,5)]    # example data and nr. of replicates,
trials <- 10000                                   # as in the doParallel vignette's bootstrap example
r <- foreach(icount(trials), .combine = cbind) %do% {
    ind <- sample(100, 100, replace = TRUE)
    result1 <- glm(x[ind,2] ~ x[ind,1], family = binomial(logit))
    coefficients(result1)
}
doParallel is a parallel backend package for executing code in parallel mode:
library(doParallel)      # using the "doParallel" package
cl <- makeCluster(2)
registerDoParallel(cl)
getDoParWorkers()        # this line tells the nr. of workers
## [1] 2
getDoParName()           # this line tells the type of cluster
## [1] "doParallelSNOW"
stopCluster(cl)
doParallel can be used to execute foreach loops in parallel:
library(doParallel)      # using the "doParallel" package
cl <- makeCluster(2)
registerDoParallel(cl)
r <- foreach(icount(trials), .combine = cbind) %dopar% {
    ind <- sample(100, 100, replace = TRUE)
    result1 <- glm(x[ind,2] ~ x[ind,1], family = binomial(logit))
    coefficients(result1)
}
stopCluster(cl)
Send slices of data to the workers with the parallel package:

library(parallel)                    # using the "parallel" package
detectCores()
P <- detectCores(logical = FALSE)    # only physical cores
# mydata (data frame with columns one, two, three) and N (its nr. of rows) are
# defined as in the memory-profiling example further below
myfunc <- function(id) {             # function to sum by rows
    arguments <- mydata[id, ]
    arguments$one + arguments$two + arguments$three
}
cl <- makeCluster(P)                 # distribute the work across cores
clusterExport(cl, "mydata")
res <- clusterApply(cl, 1:N, fun = myfunc)
stopCluster(cl)
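The same computation can also be expressed with the higher-level parSapply wrapper, which returns a vector instead of a list; a sketch under the same assumptions (mydata, N, P and myfunc as above):

cl <- makeCluster(P)
clusterExport(cl, "mydata")
res_vec <- parSapply(cl, 1:N, FUN = myfunc)   # simplified (vector) result
stopCluster(cl)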
The doParallel package is used in the following example to compute the eigenvalues of matrices growing in size:
library(doParallel)
my_eigen <- function(x) {
    n <- x*800
    m <- matrix(runif(n^2), n, n)
    m[lower.tri(m)] <- t(m)[lower.tri(m)]    # symmetrize the matrix
    d <- diag(eigen(m)$values)
}
cl <- makeCluster(4)
registerDoParallel(cl)
system.time(
    res1 <- foreach(n = 1:6) %dopar% my_eigen(n)
)[3]
stopCluster(cl)
# Elapsed
# 211.25
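For comparison, the serial time for the same problem sizes can be measured with a plain lapply; comparing the two elapsed times is exactly the kind of profiling recommended at the end of this material (sketch):

system.time(
    res_serial <- lapply(1:6, my_eigen)   # serial baseline, to compare with res1
)[3]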
We can also use the future package, which runs on several operating systems and supports asynchronous calculations:
library(future)
plan(multisession, workers = 4)      # optionally: plan(multisession, gc = TRUE, workers = 4)
par_future <- function(x) {
    ft <- lapply(x, function(x) future(my_eigen(x)))   # creating futures
    get_ft <- lapply(ft, value)                        # get futures
}
x <- 1:6
system.time(
    res2 <- par_future(x)
)[3]
# Elapsed
# 210.17
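The same pattern can be written more compactly with the future.apply package (not used in the material above); a minimal sketch:

library(future.apply)                 # provides future_lapply()
plan(multisession, workers = 4)
system.time(
    res_fa <- future_lapply(1:6, my_eigen)
)[3]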
The following simulations res1 and res2 do not give reproducible results:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
set.seed(1)
res1 <- foreach(n = rep(2, 3), .combine = rbind) %dopar% rnorm(n)
set.seed(1)
res2 <- foreach(n = rep(2, 3), .combine = rbind) %dopar% rnorm(n)
stopCluster(cl)
identical(res1, res2)
## [1] FALSE
For reproducible parallel simulations, an RNG package such as doRNG is recommended:
library(doRNG)
cl <- makeCluster(2)
registerDoParallel(cl)
registerDoRNG(1)
res3 <- foreach(n = rep(2, 3), .combine = rbind) %dopar% rnorm(n)
set.seed(1)
res4 <- foreach(n = rep(2, 3), .combine = rbind) %dopar% rnorm(n)
stopCluster(cl)
identical(res3, res4)
## [1] TRUE
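If the parallel package is used directly (without foreach), reproducible streams can be obtained with clusterSetRNGStream; a sketch (the seed value is arbitrary):

library(parallel)
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 1)    # reproducible L'Ecuyer-CMRG streams on the workers
r1 <- parSapply(cl, rep(2, 3), rnorm)
clusterSetRNGStream(cl, iseed = 1)    # reset the streams
r2 <- parSapply(cl, rep(2, 3), rnorm)
stopCluster(cl)
identical(r1, r2)                     # TRUE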
Memory profiling is crucial when using parallel packages. Suppose we have a data frame mydata that will be processed with the clusterApply function:
gcinfo(TRUE)     # activate gc reporting
N <- 5000000
mydata <- data.frame(one = 1.0*seq(N), two = 2.0*seq(N), three = 3.0*seq(N))
# ...
# Garbage collection 23 = 14+2+7 (level 0) ...
# 43.5 Mbytes of cons cells used (66%)
# 130.4 Mbytes of vectors used (65%)
gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells   572516  30.6    1233268  65.9  1233268  65.9
# Vcells 16492769 125.9   26338917 201.0 19085502 145.7
Then, we use a function to process the data frame row by row and distribute the work across the cores:

library(parallel)                    # using the "parallel" package
detectCores()
P <- detectCores(logical = FALSE)    # only physical cores
myfunc <- function(id) {             # function to sum by rows
    arguments <- mydata[id, ]
    arguments$one + arguments$two + arguments$three
}

cl <- makeCluster(P)                 # distribute the work across cores
clusterExport(cl, "mydata")
res <- clusterApply(cl, 1:N, fun = myfunc)
stopCluster(cl)
# ...
# Garbage collection 1196 = 1128+50+18 (level 0) ...
# 312.5 Mbytes of cons cells used (60%)
# 206.5 Mbytes of vectors used (59%)
gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  5850436 312.5    9776540 522.2  9776540 522.2
# Vcells 27062930 206.5   45804848 349.5 42982557 328.0
Note that the memory usage grows considerably (compare the two gc() outputs) and the time to execute myfunc in parallel mode increases drastically, since mydata is copied to every worker and each of the N rows becomes a separate task.
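A common remedy (not covered in the material above) is to send one chunk of row indices per worker instead of one task per row, for example with parallel::splitIndices; a sketch under the same assumptions:

library(parallel)
P  <- detectCores(logical = FALSE)
cl <- makeCluster(P)
clusterExport(cl, "mydata")
chunks <- splitIndices(N, P)                           # one block of row indices per worker
res_chunked <- clusterApply(cl, chunks, fun = myfunc)  # P tasks instead of N tasks
res <- unlist(res_chunked)                             # same values, far less scheduling overhead
stopCluster(cl)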
If you run your script on multiple cores, you can monitor the CPU and memory usage in real time with the following command in a terminal:

job-usage job_ID

Then copy and paste the URL it prints into your local web browser.
Login nodes and RStudio should be used only for lightweight tasks; for anything else, use the SLURM batch system (sbatch script).
Programs can be compute bound or memory bound
Some packages, for instance the linear algebra ones, already include implicit parallelism
It is good practice to do a profiling analysis (time vs. nr. of cores) to request the optimal nr. of cores in your batch script
Monitor the behavior of your batch job with the job-usage tool