OpenMP task basics (part 1)

Objectives

  • Learn the basic terminology.

  • Learn about the task construct.

  • Learn more about data sharing rules.

  • Learn how to construct a task graph in a tree-based manner.

  • Learn how to construct a task graph in a centralised manner.

  • Learn how to wait for tasks to complete their execution.

Important

The Kebnekaise login and compute nodes have the OMP_NUM_THREADS environment variable set to 1 by default.

  1. If you are using the Kebnekaise login nodes to experiment with OpenMP, then it is important to set the OMP_NUM_THREADS environment variable to some reasonable value:

    $ export OMP_NUM_THREADS=8
    

    Please note that you are not allowed to run long computations on the login nodes!

  2. If you are using the Kebnekaise compute nodes to experiment with OpenMP, then either unset the OMP_NUM_THREADS environment variable:

    $ unset OMP_NUM_THREADS
    

    or, if you specified the --cpus-per-task=<cpu count> SLURM argument, set the OMP_NUM_THREADS environment variable to the number of CPU cores available for the task:

    $ export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    

Terminology

Let us go through some basic terminology:

Task

A specific instance of executable code and its data environment that the OpenMP implementation can schedule for execution by threads.

Task region

A region consisting of all code encountered during the execution of a task.

Implicit task

A task generated by an implicit parallel region or generated when a parallel construct is encountered during execution.

Initial task

An implicit task associated with an implicit parallel region.

Explicit task

A task that is not an implicit task.

We can see that each implicit parallel region or parallel construct creates a set of implicit tasks. Each implicit task is assigned to a different thread in the created team. That is, you have already done task-based programming!

Binding thread set

The set of threads that are affected by, or provide the context for, the execution of a region.

Binding implicit task

The implicit task of the current thread team assigned to the encountering thread.

Most OpenMP constructs have a binding thread set that defines the threads involved in the construct. We can also see that the current team is bound to a binding implicit task. These terms will help us to understand where the tasks are executed.

We also need to understand how tasks are related to each other:

Current task

For a given thread, the task corresponding to the task region in which it is executing.

Child task

A task is a child task of its generating task region.

Sibling tasks

Tasks that are child tasks of the same task region.

Descendent task

A task that is the child task of a task region or of one of its descendent task regions.

We can see that tasks can generate child tasks and these child tasks can generate more child tasks.

Finally, we can see that OpenMP tasks can have dependences between them:

Task dependence

An ordering relation between two sibling tasks: the dependent task and a previously generated predecessor task. The task dependence is fulfilled when the predecessor task has completed.

Dependent task

A task that because of a task dependence cannot be executed until its predecessor tasks have completed.

Task construct

The simplest way to create an explicit task in OpenMP is the task construct:

#pragma omp task [clause[ [,] clause] ... ] new-line
    structured-block

The thread that encounters the task construct creates an explicit task from the structured block. The encountering thread

  • may execute the task immediately or

  • defer its execution to one of the other threads in the team.

A task is always bound to the innermost enclosing parallel region. If a task construct is encountered outside a parallel region, then the structured block is executed immediately by the encountering thread.

The task construct accepts a set of clauses:

if([ task :] scalar-expression)
final(scalar-expression)
untied
default(shared | none)
mergeable
private(list)
firstprivate(list)
shared(list)
in_reduction(reduction-identifier : list)
depend([depend-modifier,] dependence-type : locator-list)
priority(priority-value)
allocate([allocator :] list)
affinity([aff-modifier :] locator-list)
detach(event-handle)

We can already recognise some of the clauses. For example, the if clause can be used to enable/disable the creation of the corresponding task, and the default, private, firstprivate, and shared clauses are used to control the data sharing rules. It should be noted that some of these clauses behave slightly differently when compared to the traditional OpenMP constructs.

Let us return to the earlier “Hello world” program:

#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp task
        printf("Hello world!\n");
    }
    return 0;
}

Note that the task pragma is inside a parallel construct. Each thread in the team

  • encounters the task construct,

  • creates the corresponding task and

  • either executes the task immediately or defers its execution to one of the other threads in the team:

[Figure: task creation and scheduling within the thread team]

Therefore, the number of tasks, and hence the number of lines printed, is the same as the number of threads in the team:

$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hello world!
Hello world!
...
Hello world!

Data sharing rules

We must again begin by discussing the data sharing rules. Let us reconsider an earlier example:

#include <stdio.h>

int main() {
    int number = 1;
    #pragma omp parallel
    {
        #pragma omp task
        {
            printf("I think the number is %d.\n", number);
            number++;
        }
    }
    return 0;
}

The output of the program is not that surprising:

$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
I think the number is 1.
I think the number is 2.
I think the number is 2.
I think the number is 3.
...

That is, variables declared outside the parallel construct are still shared by default. This is consistent with the three basic data sharing rules.

If we move the variable number inside the parallel construct, then the variable becomes firstprivate by default:

#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        int number = 1;
        #pragma omp task
        {
            printf("I think the number is %d.\n", number);
            number++;
        }
    }
    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
I think the number is 1.
I think the number is 1.
I think the number is 1.
I think the number is 1.
...

The value of the variable is copied when the task is created.

Challenge

Modify the following program such that it uses explicit data sharing rules and the incrementation (number++) is protected with a critical construct:

#include <stdio.h>

int main() {
    int number = 1;
    #pragma omp parallel
    {
        #pragma omp task
        {
            printf("I think the number is %d.\n", number);
            number++;
        }
    }
    return 0;
}

Note that the atomic construct is usually a better approach when protecting a scalar update operation.

Hint: You may want to write the value of the number variable to a private variable.

Single construct

In the earlier examples, each thread in the team created a task. This is sometimes very convenient as the need for new tasks might arise gradually. However, it is more likely that we want to generate the task graph in a centralised manner, i.e. only one thread should generate the tasks. This can be accomplished by combining the parallel and single constructs:

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single nowait
    {
        #pragma omp task
        printf("Hello world!\n");
    }
    return 0;
}

The nowait clause removes the redundant barrier from the end of the single construct.

$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hello world!

Note that the binding thread set for a single region is the current team. That is, any thread in the current team can execute the task.

Challenge

Write a program that creates 10 tasks. Each task should print the thread number of the calling thread.

Hint: From the omp.h header file:

int omp_get_thread_num (void);

Barrier construct

It is sometimes necessary to wait until all earlier tasks have been executed. This can be accomplished with the barrier construct:

#pragma omp barrier new-line

All threads in the team must reach the barrier and complete all explicit tasks bound to the parallel region before they are allowed to continue execution beyond the barrier:

#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp task
        printf("Hello.\n");

        #pragma omp barrier

        #pragma omp task
        printf("Goodbye.\n");
    }

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hello.
Hello.
...
Hello.
Goodbye.
Goodbye.
Goodbye.
...

Child tasks and taskwait construct

A task can create new child tasks:

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        {
            #pragma omp task
            printf("Hello.\n");

            printf("Hi.\n");
        }

        printf("Hej.\n");
    }

    printf("Goodbye.\n");

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hej.
Hi.
Hello.
Goodbye.
$ ./my_program
Hej.
Hello.
Hi.
Goodbye.

Note that child tasks are executed separately from the generating tasks. In particular, it is possible that a child task gets executed after the generating task has finished. We can use the taskwait construct to wait on the completion of child tasks of the generating task:

#pragma omp taskwait [clause[ [,] clause] ... ] new-line

This allows us to enforce an execution order:

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        {
            #pragma omp task
            printf("Hello.\n");

            #pragma omp taskwait

            printf("Hi.\n");
        }

        printf("Hej.\n");
    }

    printf("Goodbye.\n");

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hej.
Hello.
Hi.
Goodbye.

Challenge

Parallelize the following code using child tasks:

#include <stdio.h>

int fib(int n)
{
    if (n < 2)
        return n;

    int i, j;
    i = fib(n-1);
    j = fib(n-2);

    return i + j;
}

int main() {
    #pragma omp parallel
    #pragma omp single
    printf("fib(10) = %d\n", fib(10));

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
fib(10) = 55

Hint: Remember, variables defined inside a parallel construct are firstprivate for a task region by default. Also, you must wait for certain tasks.