OpenMP task basics (part 2)

Objectives

  • Learn about taskloop and taskgroup constructs.

  • Learn about depend, priority, untied, mergeable, and final clauses.

Task loop

It is very common that a set of tasks is created inside a loop:

#pragma omp parallel
#pragma omp single
{
    for (int i = 0; i < n; i++) {
        #pragma omp task
        task_implementation(data[i]);
    }
}

This is not always convenient. For example, the resulting task granularity can be too fine, and we may want to combine several loop iterations into a single task:

#pragma omp parallel
#pragma omp single
{
    int per_task = 10;
    for (int i = 0; i < n; i += per_task) {
        #pragma omp task
        {
            // The i+j < n check handles the final, partial chunk.
            for (int j = 0; j < per_task && i+j < n; j++)
                task_implementation(data[i+j]);
        }
    }
}

The resulting code is more complicated and easy to implement incorrectly. Fortunately, the taskloop construct can be used to solve this issue:

#pragma omp taskloop [clause[[,] clause] ...] new-line
    for-loops

The construct specifies that the iterations of the loops will be executed in parallel using tasks. We can thus write the earlier example in a much shorter form:

#pragma omp parallel
#pragma omp single
{
    #pragma omp taskloop
    for (int i = 0; i < n; i++)
        task_implementation(data[i]);
}

Unless otherwise specified, the number of iterations assigned to each task is decided by the OpenMP implementation. We can change this behaviour with clauses:

if([ taskloop :] scalar-expression)
shared(list)
private(list)
firstprivate(list)
lastprivate(list)
reduction([default ,]reduction-identifier : list)
in_reduction(reduction-identifier : list)
default(shared | none)
grainsize(grain-size)
num_tasks(num-tasks)
collapse(n)
final(scalar-expr)
priority(priority-value)
untied
mergeable
nogroup
allocate([allocator :] list)

In particular, the grainsize clause sets the approximate number of iterations assigned to each task, and the num_tasks clause sets the number of generated tasks.
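
For example, the manual chunking from the beginning of this section could be written as follows. This is a minimal sketch; the values 10 and 4 are arbitrary choices:

#pragma omp parallel
#pragma omp single
{
    // Roughly 10 iterations per task; an implementation may assign
    // anything between grain-size and (2 * grain-size) - 1 iterations.
    #pragma omp taskloop grainsize(10)
    for (int i = 0; i < n; i++)
        task_implementation(data[i]);

    // Alternatively, split the iterations into (at most) four tasks.
    #pragma omp taskloop num_tasks(4)
    for (int i = 0; i < n; i++)
        task_implementation(data[i]);
}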

Task group

The taskwait construct specifies that the current task region is suspended until all child tasks of the current task have completed. However, it does not wait for the descendants of those child tasks:

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        {
            #pragma omp task
            printf("Hello.\n");

            printf("Hi.\n");
        }

        #pragma omp taskwait

        printf("Goodbye.\n");
    }

    return 0;
}

In the above example, the task that prints the Hello line is a descendant of the child task that prints the Hi line. As we can see, it is possible that this descendant gets executed after the taskwait region:

$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hi.
Goodbye.
Hello.

Tasks and their descendant tasks can be synchronized by enclosing them in a taskgroup region. The taskgroup construct specifies a wait on the completion of the child tasks of the current task and their descendant tasks:

#pragma omp taskgroup [clause[[,] clause] ...] new-line
    structured-block

Note that the taskgroup construct is not a standalone construct. Instead, we must enclose the task-generating region within it. All tasks generated inside a taskgroup region are waited for at the end of the region.
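
For illustration, a minimal sketch of the general pattern (do_work and do_more_work are hypothetical placeholders):

#pragma omp taskgroup
{
    #pragma omp task
    {
        #pragma omp task
        do_work();          // a descendant task, also waited for

        do_more_work();
    }
}   // all tasks generated above, and their descendants, have completed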

Challenge

Modify the following code such that the two tasks are enclosed inside a taskgroup region:

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        {
            #pragma omp task
            printf("Hello.\n");

            printf("Hi.\n");
        }

        printf("Goodbye.\n");
    }

    return 0;
}

Depend clause

Up to this point, we have only discussed tasks that are either mutually independent or related to each other because they are generated in a nested manner. In an earlier lecture, we talked about task dependencies. The depend clause was introduced for the task construct in OpenMP 4.0, and later versions have extended it to further task-generating constructs:

depend([depend-modifier,]dependence-type : locator-list)

where dependence-type is one of the following:

in
out
inout
mutexinoutset
depobj

The most relevant of these are the following:

in

Input variable(s).

out

Output variable(s).

inout

Input and output variable(s).

The locator-list argument lists all involved variables: var1, var2, ..., varN. A construct can have multiple depend clauses, one for each dependence-type. The depend clause is much more powerful than this, but during this course we are going to use only the basic functionality.
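
For example, a single task can carry several dependence types at once. In this sketch, the variables a, b, and c and the function combine are hypothetical:

#pragma omp task depend(in: a, b) depend(out: c)
c = combine(a, b);   // runs only after the tasks that produce a and b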

As an example, consider the following ill-defined program:

#include <stdio.h>

int main() {
    int number;

    #pragma omp parallel
    #pragma omp single nowait
    {
        #pragma omp task
        number = 1;

        #pragma omp task
        {
            printf("I think the number is %d\n", number);
            number++;
        }

        #pragma omp task
        printf("I think the final number is %d\n", number);
    }

    return 0;
}

As expected, the result is not well-defined:

$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
I think the number is 1
I think the final number is 1
$ ./my_program
I think the final number is 1
I think the number is 1

We can fix the issue by defining input and output variables for each task:

#include <stdio.h>

int main() {
    int number;

    #pragma omp parallel
    #pragma omp single nowait
    {
        #pragma omp task depend(out: number)
        number = 1;

        #pragma omp task depend(inout: number)
        {
            printf("I think the number is %d\n", number);
            number++;
        }

        #pragma omp task depend(in: number)
        printf("I think the final number is %d\n", number);
    }

    return 0;
}

That is,

  • the first task is going to write into the variable number,

  • the second task is going to read and write from/into the variable number, and

  • the third task is going to read from the variable number.

These clauses force the OpenMP implementation to execute the tasks in an order that respects the induced task dependencies:

$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
I think the number is 1
I think the final number is 2

Challenge

Parallelize the following program using tasks:

#include <stdio.h>

#define N 15

int main() {
    int fib_numbers[N];

    fib_numbers[0] = 1;
    fib_numbers[1] = 1;

    for (int i = 2; i < N; i++) {
        fib_numbers[i] = fib_numbers[i-1] + fib_numbers[i-2];
    }

    printf("The Fibonacci numbers are:");
    for (int i = 0; i < N; i++)
        printf(" %d", fib_numbers[i]);
    printf("\n");

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
The Fibonacci numbers are: 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610

Hint: locator-list can contain array elements.

Priority clause

As discussed during an earlier lecture, we can give each task a priority in an attempt to help the runtime system schedule the tasks in a better order.

In OpenMP, the priority is given using the priority(priority-value) clause. The priority-value is a non-negative integer expression, and a higher value implies a higher priority. You can query the maximum priority with the omp_get_max_task_priority() function.
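
As a minimal sketch, assuming omp.h and stdio.h are included and critical_path_task and other_task are hypothetical placeholders:

// Priority values above the maximum are treated as the maximum. The
// maximum is controlled by the OMP_MAX_TASK_PRIORITY environment
// variable and defaults to 0.
printf("Maximum task priority: %d\n", omp_get_max_task_priority());

#pragma omp task priority(5)   // a hint: schedule this task earlier
critical_path_task();

#pragma omp task               // the default priority is 0
other_task();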

Untied clause and taskyield construct

Threads are allowed to suspend the current task region at a task scheduling point in order to execute a different task. A task scheduling point can occur

  • during the generation of an explicit task,

  • at the point immediately following the generation of an explicit task,

  • after the point of completion of the structured block associated with a task,

  • in a taskyield region,

  • in a taskwait region,

  • at the end of a taskgroup region,

  • in an implicit barrier region, and

  • in an explicit barrier region.

In particular, we can use the taskyield construct to force a task scheduling point.

#pragma omp taskyield new-line

Note that the above list of task scheduling points is not complete.
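
As a sketch of a typical use case, a task that must wait for a lock can yield the thread to other tasks in the meantime (wait_for_lock is a hypothetical helper):

#include <omp.h>

void wait_for_lock(omp_lock_t *lock) {
    while (!omp_test_lock(lock)) {
        // Let the thread work on other tasks while this task waits.
        #pragma omp taskyield
    }
}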

By default, task regions are tied to the initially assigned thread, and only that thread can later resume the execution of a suspended task region. This behaviour can be changed with the untied clause, in which case any thread is allowed to resume the execution.
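
For example (long_running_task is a hypothetical placeholder):

#pragma omp task untied
long_running_task();   // any thread may resume this task if it is suspended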

Mergeable and final clauses

The mergeable and final clauses can be used to reduce the number of generated tasks in deeply nested task-generation trees.

First, we need more terminology:

Undeferred task

A task for which execution is not deferred with respect to its generating task region. That is, its generating task region is suspended until execution of the structured block associated with the undeferred task is completed.

Included task

A task for which execution is sequentially included in the generating task region. That is, an included task is undeferred and executed by the encountering thread.

Merged task

A task for which the data environment, inclusive of ICVs, is the same as that of its generating task region.

Mergeable task

A task that may be a merged task if it is an undeferred task or an included task.

Final task

A task that forces all of its child tasks to become final and included tasks.

The mergeable clause indicates that the task is mergeable.

The final(scalar-expression) clause indicates that the task is final if scalar-expression evaluates to true.
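
As a minimal sketch that combines both clauses, consider a recursive task tree with a depth cutoff. The node_t type, the functions, and the cutoff value 8 are hypothetical:

void process(node_t *node, int depth) {
    if (node == NULL)
        return;

    // Once depth reaches the cutoff, the generated task is final: all
    // of its descendants become included tasks, which avoids further
    // task-creation overhead. The mergeable clause additionally allows
    // the implementation to merge the task's data environment with
    // that of its generating task region.
    #pragma omp task final(depth >= 8) mergeable
    process(node->left, depth + 1);

    #pragma omp task final(depth >= 8) mergeable
    process(node->right, depth + 1);

    do_work(node);
}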