Introduction to Kebnekaise
Modules and toolchains
You need to load the correct toolchain before compiling your code on Kebnekaise.
The available modules are listed using the ml avail command:
$ ml avail
------------------------- /hpc2n/eb/modules/all/Core --------------------------
Bison/3.0.5 fosscuda/2020a
Bison/3.3.2 fosscuda/2020b (D)
Bison/3.5.3 gaussian/16.C.01-AVX2
Bison/3.7.1 (D) gcccuda/2019b
CUDA/8.0.61 gcccuda/2020a
CUDA/10.1.243 (D) gcccuda/2020b (D)
...
The list shows the modules you can load directly; it may therefore change depending on which modules you have already loaded.
To see all modules, including those that require other modules to be loaded first, use the ml spider command. Many application software packages fall into this category.
You can find more information about a particular module using the ml spider <module> command:
$ ml spider MATLAB
---------------------------------------------------------------------------
MATLAB: MATLAB/2019b.Update2
---------------------------------------------------------------------------
Description:
MATLAB is a high-level language and interactive environment that
enables you to perform computationally intensive tasks faster than
with traditional programming languages such as C, C++, and Fortran.
This module can be loaded directly: module load MATLAB/2019b.Update2
Help:
Description
===========
MATLAB is a high-level language and interactive environment
that enables you to perform computationally intensive tasks faster than with
traditional programming languages such as C, C++, and Fortran.
More information
================
- Homepage: http://www.mathworks.com/products/matlab
You can load the module using the ml <module> command:
$ ml MATLAB/2019b.Update2
You can list the loaded modules using the ml command:
$ ml
Currently Loaded Modules:
1) snicenvironment (S) 7) libevent/2.1.11 13) PMIx/3.0.2
2) systemdefault (S) 8) numactl/2.0.12 14) impi/2018.4.274
3) GCCcore/8.2.0 9) XZ/5.2.4 15) imkl/2019.1.144
4) zlib/1.2.11 10) libxml2/2.9.8 16) intel/2019a
5) binutils/2.31.1 11) libpciaccess/0.14 17) MATLAB/2019b.Update2
6) iccifort/2019.1.144 12) hwloc/1.11.11
Where:
S: Module is Sticky, requires --force to unload or purge
You can unload all modules using the ml purge command:
$ ml purge
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) systemdefault 2) snicenvironment
Note that the ml purge command warns that two modules were not unloaded. This is normal and you should NOT force unload them.
Compile C code
Once the correct toolchain (foss/2020b) has been loaded, we can compile C source files (*.c) with the GNU compiler:
$ gcc -o <binary name> <sources> -Wall
The -Wall flag causes the compiler to print additional warnings.
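For example, a minimal sketch of the full sequence, assuming a C source file named hello.c in the current directory (the file and binary names are placeholders):
$ ml purge
$ ml foss/2020b
$ gcc -o hello hello.c -Wall
The resulting hello binary can then be run on a compute node with srun, as described in the Submitting jobs section below.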
Compile CUDA code
Once the correct toolchain (fosscuda/2020b) has been loaded, we can compile CUDA source files (*.cu) with the nvcc compiler:
$ nvcc -o <binary name> <sources> -Xcompiler="-Wall"
This passes the -Wall flag to g++, which causes the compiler to print extra warnings.
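Similarly, a minimal sketch assuming a CUDA source file named hello.cu (the file and binary names are placeholders):
$ ml purge
$ ml fosscuda/2020b
$ nvcc -o hello_gpu hello.cu -Xcompiler="-Wall"
The resulting binary must be run on a node with a GPU; see the --gres example in the Submitting jobs section below.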
Course project and reservation
During the course, you can use the course reservations (snic2021-22-272-cpu-day[1|2|3] and snic2021-22-272-gpu-day[1|2|3]) to get faster access to the compute nodes. The reservations are valid between 9:00 and 13:00 on each of the three course days (10-12 May 2021). Note that capitalization matters for reservation names!
| Day | CPU only | CPU + GPU |
|---|---|---|
| Monday | snic2021-22-272-cpu-day1 | snic2021-22-272-gpu-day1 |
| Tuesday | snic2021-22-272-cpu-day2 | snic2021-22-272-gpu-day2 |
| Wednesday | snic2021-22-272-cpu-day3 | snic2021-22-272-gpu-day3 |
Note that jobs that are submitted using a reservation are not scheduled outside the reservation time window.
You can, however, submit jobs without the reservation as long as you are a member of an active project.
The course project SNIC2021-22-272 is valid until 2021-06-01.
Submitting jobs
Jobs are submitted using the srun command:
$ srun --account=<account> --ntasks=<task count> --time=<time> <command>
This places the command into the batch queue. The three arguments are the project number, the number of tasks, and the requested time allocation. For example, the following command prints the uptime of the allocated compute node:
$ srun --account=SNIC2021-22-272 --ntasks=1 --time=00:00:15 uptime
srun: job 12727702 queued and waiting for resources
srun: job 12727702 has been allocated resources
11:53:43 up 5 days, 1:23, 0 users, load average: 23,11, 23,20, 23,27
Note that we are using the course project, the number of tasks is set to one, and we are requesting 15 seconds.
When the reservation is valid, you can specify it using the --reservation=<reservation> argument:
$ srun --account=SNIC2021-22-272 --reservation=snic2021-22-272-cpu-day1 --ntasks=1 --time=00:00:15 uptime
11:58:43 up 6 days, 1:23, 0 users, load average: 23,11, 22,20, 21,27
where N in dayN is either 1, 2, or 3, and cpu can be replaced with gpu if you are running a GPU job.
We could submit multiple tasks using the --ntasks=<task count> argument:
$ srun --account=SNIC2021-22-272 --reservation=snic2021-22-272-cpu-day1 --ntasks=4 --time=00:00:15 uname -n
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
Note that all tasks run on the same node.
We could request multiple CPU cores for each task using the --cpus-per-task=<cpu count> argument:
$ srun --account=SNIC2021-22-272 --reservation=snic2021-22-272-cpu-day1 --ntasks=4 --cpus-per-task=14 --time=00:00:15 uname -n
b-cn0935.hpc2n.umu.se
b-cn0935.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
If you want to measure performance, it is advisable to request exclusive access to the compute nodes (--exclusive):
$ srun --account=SNIC2021-22-272 --reservation=snic2021-22-272-cpu-day1 --ntasks=4 --cpus-per-task=14 --exclusive --time=00:00:15 uname -n
b-cn0935.hpc2n.umu.se
b-cn0935.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
Finally, we could request a single Nvidia Tesla V100 GPU and 14 CPU cores using the --gres=gpu:v100:1,gpuexcl argument:
$ srun --account=SNIC2021-22-272 --reservation=snic2021-22-272-gpu-day1 --ntasks=1 --gres=gpu:v100:1,gpuexcl --time=00:00:15 nvidia-smi
Wed Apr 21 12:59:15 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67 Driver Version: 460.67 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:58:00.0 Off | 0 |
| N/A 33C P0 26W / 250W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
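Putting the pieces together, a sketch of how a CUDA binary (the hypothetical hello_gpu from the compilation example above) could be run on a GPU node:
$ srun --account=SNIC2021-22-272 --reservation=snic2021-22-272-gpu-day1 --ntasks=1 --gres=gpu:v100:1,gpuexcl --time=00:01:00 ./hello_gpu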
Aliases
In order to save time, you can create an alias for a command:
$ alias <alias>="<command>"
For example:
$ alias run_full="srun --account=SNIC2021-22-272 --reservation=snic2021-22-272-cpu-day1 --ntasks=1 --cpus-per-task=28 --time=00:05:00"
$ run_full uname -n
b-cn0932.hpc2n.umu.se
Batch files
It is often more convenient to write the commands into a batch file.
For example, we could write the following to a file called batch.sh:
#!/bin/bash
#SBATCH --account=SNIC2021-22-272
#SBATCH --reservation=snic2021-22-272-cpu-day1
#SBATCH --ntasks=1
#SBATCH --time=00:00:15

ml purge
ml foss/2020b

uname -n
Note that the same arguments that were earlier passed to the srun command are now given as #SBATCH comments.
It is highly advisable to purge all loaded modules and re-load the required ones, since the job inherits the environment from which it was submitted.
The batch file is submitted using the sbatch <batch file> command:
$ sbatch batch.sh
Submitted batch job 12728675
By default, the output is directed to the file slurm-<job_id>.out, where <job_id> is the job id returned by the sbatch command:
$ cat slurm-12728675.out
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) systemdefault 2) snicenvironment
b-cn0102.hpc2n.umu.se
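For GPU jobs, the corresponding batch file would load the CUDA toolchain and add the GPU request. A minimal sketch, assuming the hypothetical hello_gpu binary from the compilation example above (the requested time is a placeholder as well):
#!/bin/bash
#SBATCH --account=SNIC2021-22-272
#SBATCH --reservation=snic2021-22-272-gpu-day1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:v100:1,gpuexcl
#SBATCH --time=00:01:00

ml purge
ml fosscuda/2020b

./hello_gpu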
Job queue
You can investigate the job queue with the squeue command:
$ squeue -u $USER
If you want an estimate of when the job will start running, you can give the squeue command the --start argument.
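For example, to see the estimated start times of your own pending jobs:
$ squeue -u $USER --start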
You can cancel a job with the scancel command:
$ scancel <job_id>