Introduction to Kebnekaise

Objectives

  • Learn how to load the necessary modules on Kebnekaise.

  • Learn how to compile C code on Kebnekaise.

  • Learn how to submit jobs to the batch queue.

  • Learn how to use the course project and reservations.

Modules and toolchains

You need to load the correct toolchain before compiling your code on Kebnekaise.

The available modules are listed using the ml avail command:

$ ml avail
------------------------- /hpc2n/eb/modules/all/Core --------------------------
Bison/3.0.5                        fosscuda/2020a
Bison/3.3.2                        fosscuda/2020b        (D)
Bison/3.5.3                        gaussian/16.C.01-AVX2
Bison/3.7.1                (D)     gcccuda/2019b
CUDA/8.0.61                        gcccuda/2020a
CUDA/10.1.243              (D)     gcccuda/2020b         (D)
...

The list shows only the modules you can load directly, so it may change depending on which modules you have already loaded.

In order to see all modules, including those that require other modules to be loaded first, use the command ml spider. Many application packages fall into this category.

You can find more information regarding a particular module using the ml spider <module> command:

$ ml spider MATLAB

---------------------------------------------------------------------------
MATLAB: MATLAB/2019b.Update2
---------------------------------------------------------------------------
    Description:
    MATLAB is a high-level language and interactive environment that
    enables you to perform computationally intensive tasks faster than
    with traditional programming languages such as C, C++, and Fortran.


    This module can be loaded directly: module load MATLAB/2019b.Update2

    Help:
    Description
    ===========
    MATLAB is a high-level language and interactive environment
    that enables you to perform computationally intensive tasks faster than with
    traditional programming languages such as C, C++, and Fortran.


    More information
    ================
    - Homepage: http://www.mathworks.com/products/matlab

You can load the module using the ml <module> command:

$ ml MATLAB/2019b.Update2

You can list loaded modules using the ml command:

$ ml

Currently Loaded Modules:
 1) snicenvironment     (S)   7) libevent/2.1.11    13) PMIx/3.0.2
 2) systemdefault       (S)   8) numactl/2.0.12     14) impi/2018.4.274
 3) GCCcore/8.2.0             9) XZ/5.2.4           15) imkl/2019.1.144
 4) zlib/1.2.11              10) libxml2/2.9.8      16) intel/2019a
 5) binutils/2.31.1          11) libpciaccess/0.14  17) MATLAB/2019b.Update2
 6) iccifort/2019.1.144      12) hwloc/1.11.11

Where:
 S:  Module is Sticky, requires --force to unload or purge

You can unload all modules using the ml purge command:

$ ml purge
The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) systemdefault   2) snicenvironment

Note that the ml purge command will warn that two modules were not unloaded. This is normal and you should NOT force unload them.

Exercise

  1. Load the FOSS toolchain for source code compilation:

    $ ml purge
    $ ml foss/2020b

    The foss module loads the GNU compiler toolchain.

  2. Investigate which modules were loaded.

  3. Purge all modules.

  4. Find the latest FOSS toolchain (foss). Investigate the loaded modules. Purge all modules.

Compile C code

Once the correct toolchain (foss) has been loaded, we can compile C source files (*.c) with the GNU compiler:

$ gcc -o <binary name> <sources> -Wall

The -Wall flag makes the compiler print additional warnings.

Exercise

Compile the following “Hello world” program:

#include <stdio.h>

int main() {
    printf("Hello world!\n");
    return 0;
}

Course project

You can request to be a member of the course project hpc2n202w-xyz, where the letters need to be substituted by the actual numerical values for the project.

Submitting jobs

The jobs are submitted using the srun command:

$ srun --account=<account> --ntasks=<task count> --time=<time> <command>

This places the command into the batch queue. The three arguments are the project account, the number of tasks, and the requested time allocation. For example, the following command prints the uptime of the allocated compute node:

$ srun --account=hpc2n202w-xyz --ntasks=1 --time=00:00:15 uptime
srun: job 12727702 queued and waiting for resources
srun: job 12727702 has been allocated resources
 11:53:43 up 5 days,  1:23,  0 users,  load average: 23,11, 23,20, 23,27

Note that we are using the course project, the number of tasks is set to one, and we are requesting 15 seconds.

We could submit multiple tasks using the --ntasks=<task count> argument:

$ srun --account=hpc2n202w-xyz --ntasks=4 --time=00:00:15 uname -n
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se

Note that all tasks are running on the same node. We could request multiple CPU cores for each task using the --cpus-per-task=<cpu count> argument:

$ srun --account=hpc2n202w-xyz --ntasks=4 --cpus-per-task=14 --time=00:00:15 uname -n
b-cn0935.hpc2n.umu.se
b-cn0935.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se

If you want to measure performance, it is advisable to request exclusive access to the compute nodes (--exclusive):

$ srun --account=hpc2n202w-xyz --ntasks=4 --cpus-per-task=14 --exclusive --time=00:00:15 uname -n
b-cn0935.hpc2n.umu.se
b-cn0935.hpc2n.umu.se
b-cn0932.hpc2n.umu.se
b-cn0932.hpc2n.umu.se

Exercise

Run both “Hello world” programs on the compute nodes.

Aliases

In order to save time, you can create an alias for a command:

$ alias <alias>="<command>"

For example:

$ alias run_full="srun --account=hpc2n202w-xyz --ntasks=1 --cpus-per-task=28 --time=00:05:00"
$ run_full uname -n
b-cn0932.hpc2n.umu.se

Batch files

It is often more convenient to write the commands into a batch file. For example, we could write the following to a file called batch.sh:

#!/bin/bash
#SBATCH --account=hpc2n202w-xyz
#SBATCH --ntasks=1
#SBATCH --time=00:00:15

ml purge
ml foss/2020b

uname -n

Note that the same arguments that were earlier passed to the srun command are now given as #SBATCH comment lines. It is highly advisable to purge all loaded modules and re-load the required modules inside the batch file, as the job inherits the environment from which it was submitted. The batch file is submitted using the sbatch <batch file> command:

$ sbatch batch.sh
Submitted batch job 12728675

By default, the output is directed to the file slurm-<job_id>.out, where <job_id> is the job id returned by the sbatch command:

$ cat slurm-12728675.out
The following modules were not unloaded:
 (Use "module --force purge" to unload all):

 1) systemdefault   2) snicenvironment
b-cn0102.hpc2n.umu.se

Exercise

Write two batch files that run both “Hello world” programs on the compute nodes.

Job queue

You can investigate the job queue with the squeue command:

$ squeue -u $USER

If you want an estimate for when the job will start running, you can give the squeue command the argument --start.

You can cancel a job with the scancel command:

$ scancel <job_id>

What is High Performance Computing?

High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business. (insideHPC.com)

What does this mean?
  • Aggregating computing power
    • Kebnekaise: 602 nodes in 15 racks totalling 19288 cores

    • Your laptop: 4 cores

  • Higher performance
    • Kebnekaise: 728,000 billion arithmetic operations per second

    • Your laptop: 200 billion arithmetic operations per second

  • Solve large problems
    • Time: The time required to form a solution to the problem is very long.

    • Memory: The solution of the problem requires a lot of memory and/or storage.

../_images/hpc.png

Memory models

When it comes to the memory layout, (super)computers can be divided into two primary categories:

Shared memory:

A single memory space for all data:

  • Everyone can access the same data.

  • Straightforward to use.

../_images/sm.png
Distributed memory:

Multiple distinct memory spaces for the data:

  • Everyone has direct access only to the local data.

  • Requires communication and data transfers.

../_images/dm.png

Computing clusters and supercomputers are generally distributed memory machines:

../_images/memory.png

Programming models

The programming model changes when we aim for extra performance and/or memory:

Single-core:

Matlab, Python, C, Fortran, …

  • Single stream of operations (thread).

  • Single pool of data.

../_images/single-core.png
Multi-core:

Vectorized Matlab, pthreads, OpenMP

  • Multiple streams of operations (multiple threads).

  • Single pool of data.

  • Extra challenges:

    • Work distribution.

    • Coordination (synchronization, etc).

../_images/multi-core.png
Distributed memory:

MPI, …

  • Multiple streams of operations (multiple processes).

  • Multiple pools of data.

  • Extra challenges:

    • Work distribution.

    • Coordination (synchronization, etc).

    • Data distribution.

    • Communication and data transfers.

../_images/distributed-memory.png
Accelerators / GPUs:

CUDA, OpenCL, OpenACC, OpenMP, …

  • Single/multiple streams of operations on the host device.

  • Many lightweight streams of operations on the accelerator.

  • Multiple pools of data on multiple layers.

  • Extra challenges:

    • Work distribution.

    • Coordination (synchronization, etc).

    • Data distribution across multiple memory spaces.

    • Communication and data transfers.

../_images/gpu.png
Hybrid:

MPI + OpenMP, OpenMP + CUDA, MPI + CUDA, …

  • Combines the benefits and the downsides of several programming models.

../_images/hybrid.png
Task-based:

OpenMP tasks

  • Does task-based programming count as a separate programming model?

Functions and data dependencies

Imagine the following computer program:

#include <stdio.h>

void function1(int a, int b) {
    printf("The sum is %d.\n", a + b);
}

void function2(int b) {
    printf("The sum is %d.\n", 10 + b);
}

int main() {
    int a = 10, b = 7;
    function1(a, b);
    function2(b);
    return 0;
}

The program consists of two functions, function1 and function2, that are called one after another from the main function. The first function reads the variables a and b, and the second function reads the variable b:

../_images/functions_nodep.png

The program prints the line The sum is 17. twice. The key observation is that the two function calls are independent of each other. More importantly, the two functions can be executed in parallel:

../_images/functions_nodep_parallel.png

Let us modify the program slightly:

#include <stdio.h>

void function1(int a, int *b) {
    printf("The sum is %d.\n", a + *b);
    *b += 3;
}

void function2(int b) {
    printf("The sum is %d.\n", 10 + b);
}

int main() {
    int a = 10, b = 7;
    function1(a, &b);
    function2(b);
    return 0;
}

This time the function function1 modifies the variable b:

../_images/functions_dep.png

Therefore, the two function calls are not independent of each other, and changing their order would change the printed lines: in the order given, the program prints The sum is 17. followed by The sum is 20. Furthermore, executing the two functions in parallel would lead to an undefined result, as the execution order would be arbitrary.

We could say that in this particular context, the function function2 is dependent on the function function1. That is, the function function1 must be executed completely before the function function2 can be executed:

../_images/functions_dep_explicit.png

However, this data dependency exists only when these two functions are called in this particular sequence using these particular arguments. In a different context, this particular data dependency does not exist. We can therefore conclude that the data dependencies are separate from the function definitions.