The Task Directive

Objectives

  • Learn the basic terminology.

  • Learn about the task construct.

  • Learn more about data sharing rules.

  • Learn how to construct a task graph in a tree-based manner.

  • Learn how to construct a task graph in a centralised manner.

  • Learn how to wait for tasks to complete their execution.

Beyond Regular Loops

Regular vs. Irregular Structures

So far we’ve discussed regular structures:

  • Loops with known start and end

  • Fortran array constructs

Many problems have irregular structures:

  • Recursive algorithms

  • Linked lists

  • Loops with unknown end (e.g., while loops)

  • Divide and conquer algorithms

  • And more…

Note

Depending on the details, irregular structures might still be parallelizable using tasks.

The Task Construct

The task construct provides a flexible way to express parallelism for irregular problems.

Key Concept

Tasks allow you to create units of work that can be executed by any thread in the team, now or later.

The Task Directive in Fortran

Syntax

!$omp task [clauses]
    code body
!$omp end task

What It Does

Creates an “explicit task” from:

  • Code body

  • Data environment at that point

  • Place inside a parallel region

Execution

The task may execute:

  • When: Now or later

  • By whom: Encountering thread or other thread

The thread that encounters the task construct creates an explicit task from the structured block. The encountering thread

  • may execute the task immediately or

  • defer its execution to one of the other threads in the team.

A task is always bound to the innermost parallel region. If a task construct is encountered outside a parallel region, then the structured block is executed immediately by the encountering thread.
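As a minimal sketch illustrating the last point, the task construct below is encountered outside of any parallel region, so the structured block is executed immediately by the initial thread:

#include <stdio.h>

int main() {
    // No enclosing parallel region: the structured block is
    // executed immediately by the encountering (initial) thread.
    #pragma omp task
    printf("Executed immediately by the initial thread.\n");

    return 0;
}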

The task construct accepts a set of clauses:

if([ task :] scalar-expression)
final(scalar-expression)
untied
default(shared | none)
mergeable
private(list)
firstprivate(list)
shared(list)
in_reduction(reduction-identifier : list)
depend([depend-modifier,] dependence-type : locator-list)
priority(priority-value)
allocate([allocator :] list)
affinity([aff-modifier :] locator-list)
detach(event-handle)

We can already recognise some of the clauses. For example, the if clause can be used to enable/disable the creation of the corresponding task, and the default, private, firstprivate, and shared clauses are used to control the data sharing rules. It should be noted that some of these clauses behave slightly differently when compared to the traditional OpenMP constructs.

Let us return to the earlier “Hello world” program:

#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp task
        printf("Hello world!\n");
    }
    return 0;
}

Note that the task pragma is inside a parallel construct. Each thread in the team

  • encounters the task construct,

  • creates the corresponding task and

  • either executes the task immediately or defers its execution to one of the other threads in the team:

[Figure: each thread encounters the task construct and the resulting tasks are executed immediately or deferred]

Therefore, the number of tasks, and hence the number of lines printed, equals the number of threads in the team:

$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hello world!
Hello world!
...
Hello world!

The Task Directive in C

Syntax

#pragma omp task [clauses]
{
    code body
}

What It Does

Creates an “explicit task” from:

  • Code body

  • Data environment at that point

  • Place inside a parallel region

Execution

The task may execute:

  • When: Now or later

  • By whom: Encountering thread or other thread

Allowed Data Sharing Attributes for Tasks

Available Attributes

private:

  • Data is private to the task

firstprivate:

  • Data is private to the task

  • Data initialized when task directive is encountered

shared:

  • Data is shared

  • Only way to return a result from a task! (See the sketch after this list.)

default:

  • Fortran: shared | private | firstprivate | none

  • C: shared | none
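A minimal sketch of returning a result through a shared variable; the implicit barrier at the end of the parallel region guarantees that the task has finished before the result is read:

#include <stdio.h>

int main() {
    int result = 0; // shared: the task stores its result here

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(result)
        result = 42;
    } // implicit barrier: all tasks complete before result is read

    printf("result = %d\n", result);
    return 0;
}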

Data Sharing Without a default Clause

When no default is declared on a task directive:

Default Rules

If variable is shared by all implicit tasks in the current team:

Variable is: shared

Otherwise:

Variable is: firstprivate

Recommendation

Important

Use default(none) to be explicit about data sharing!

Example: Task Execution Flow

Consider this code:

code block 1
!$omp task
    code block 2
!$omp end task
code block 3

Thread Encountering This Code

  1. Executes “code block 1”

  2. Creates a task for “code block 2”

  3. May:

    • Execute the task for “code block 2”

    • Pick up another task

    • Continue with “code block 3”

  4. At some point: Has to execute code block 3

No Control Over

Warning

  • Who executes code block 2

  • When code block 2 is finished

Exercise

Use two tasks, one printing the string Hello and the other printing World. Use 4 threads.

Let us consider the data sharing rules with the following example:

#include <stdio.h>

int main() {
    int number = 1;
    #pragma omp parallel
    {
        #pragma omp task
        {
            printf("I think the number is %d.\n", number);
            number++;
        }
    }
    return 0;
}

The output of the program is not that surprising:

$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
I think the number is 1.
I think the number is 2.
I think the number is 2.
I think the number is 3.
...

That is, variables declared outside the parallel construct are still shared by default. This is consistent with the three basic data sharing rules.

If we move the variable number inside the parallel construct, then the variable becomes firstprivate by default:

#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        int number = 1;
        #pragma omp task
        {
            printf("I think the number is %d.\n", number);
            number++;
        }
    }
    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
I think the number is 1.
I think the number is 1.
I think the number is 1.
I think the number is 1.
...

The value of the variable is copied when the task is created.
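As a minimal sketch of this copy semantics, the original variable is modified after the task construct, yet the task still prints the value captured at task creation (the variable is firstprivate by default here):

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        int number = 1;

        #pragma omp task // number is firstprivate by default
        printf("I think the number is %d.\n", number); // always prints 1

        number = 42; // does not affect the task's private copy
    }
    return 0;
}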

Exercise

Replace the fixme strings with the data-sharing attributes of the variables when the default rules are used. (Exercise inspired by an example from Christian Terboven.)

#include <stdio.h>
#include <omp.h>

static int first = 1;  // File scope (static global)

int main() {
    // Variables in main scope
    int second = 2;
    int third = 3;

    #pragma omp parallel private(second)
    {
        // Variable in parallel region scope
        int fourth = 4;

        #pragma omp task
        {
            // Variable in task scope
            int fifth = 5;

            // Data scoping in this task:
            // "first":  *fixme*    (static global)
            // "second": *fixme*    (private in parallel, becomes firstprivate in task)
            // "third":  *fixme*    (declared outside parallel)
            // "fourth": *fixme*    (declared in parallel region)
            // "fifth":  *fixme*    (declared in task)

            printf("1=%d, 2=%d, 3=%d, 4=%d, 5=%d\n",
                first, second, third, fourth, fifth);
        }
    }
    return 0;
}

Now, use default(none) and replace the fixme strings with the appropriate variable names so that the behavior is the same as in the previous code:

#pragma omp parallel private(*fixme*) default(none) shared(*fixme*, *fixme*)
{
    int fourth = 4;

    #pragma omp task default(none) firstprivate(*fixme*,*fixme*) shared(*fixme*, *fixme*)
    {
        int fifth = 5;
        printf("1=%d, 2=%d, 3=%d, 4=%d, 5=%d\n",
            first, second, third, fourth, fifth);
    }
}

Single construct

In the earlier examples, each thread in the team created a task. This is sometimes very convenient, as the need for new tasks might arise gradually. However, it is more likely that we want to generate the task graph in a centralised manner, i.e., only one thread should generate the tasks. This can be accomplished by combining the parallel and single constructs:

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single nowait
    {
        #pragma omp task
        printf("Hello world!\n");
    }
    return 0;
}

The nowait clause removes the redundant barrier from the end of the single construct.

$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hello world!

Note that the binding thread set for a single region is the current team. That is, any thread in the current team can execute the task.

Exercise

Write a program that creates 10 tasks. Each task should print the number of the thread that executes it.

Hint: From the omp.h header file:

int omp_get_thread_num (void);

Exercise

Inspect and run the following code to see the behavior of shared, private, and firstprivate variables in the context of tasks. Notice that one can use atomic updates together with tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 13a-sharedprivate-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void process_task(int id, int *shared_data) {
    printf("Task %d: shared_data = %d\n", id, *shared_data);
}

int main() {
    int shared_data = 100;  // variable that is shared across tasks
    int private_data = 200; // variable that is private to each task

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 4; i++) {
                int firstprivate_data = i * 10; // firstprivate variable: unique for each task but initialized with i * 10

                #pragma omp task shared(shared_data) private(private_data) firstprivate(firstprivate_data)
                {
                    // here we modify the private_data variable, which is unique to each task
                    private_data = i * 100;
                    printf("Task %d: private_data = %d, firstprivate_data = %d\n", i, private_data, firstprivate_data);

                    // Calling a function with shared data
                    process_task(i, &shared_data);

                    // An atomic update protects shared_data from concurrent increments
                    #pragma omp atomic
                    shared_data += 1;
                }
            }
        }
    }

    printf("Final value of shared_data: %d\n", shared_data);

    return 0;
}

Exercise

Inspect and run the following code to see the behavior of firstprivate variables in the context of tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 13b-firstprivate-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void task_example() {
    int n = 5; // a shared variable with an initial value

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 5; i++) {
                #pragma omp task firstprivate(n) // a variable may not appear in both private and firstprivate
                {
                    // n is private to each task, initialized with its value at task creation
                    n += i; // Modify n for this task
                    printf("Task %d: n = %d\n", i, n);
                }
            }
        }
    }
}

int main() {
    task_example();
    return 0;
}

Exercise

The following function takes an integer as input and prints the number of the thread that executes it. Use tasks to parallelize a for loop where each task executes this function:

void process_element(int i) {
    // do some work for each element by printing out the thread ID executing this work
    printf("Processing element %d in thread %d\n", i, omp_get_thread_num());
}

Allowing Suspension of Current Task

taskyield Construct

At a taskyield construct, the current task can be suspended to execute a different task.

Syntax

Fortran:

!$omp taskyield

C:

#pragma omp taskyield

Use Case

Allows the runtime to schedule other tasks while waiting for resources or dependencies.

Exercise

Inspect and run the following code to see the behavior of taskyield in the context of tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 17-taskyield-openmp.c -lm
// ml GCCcore/12.3.0 Clang/16.0.6
// clang -O3 -fopenmp -o test.x 17-taskyield-openmp.c
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

void something_useful(int id, int random_time) {
    sleep(random_time); // Simulate work
    printf("Task %d: did something useful on thread %d\n", id, omp_get_thread_num());
}

void something_critical(int id, int random_time) {
    sleep(random_time); // Simulate longer work
    printf("Task %d: did something critical on thread %d\n", id, omp_get_thread_num());
}

void foo(omp_lock_t *lock, int n) {
    srand(1999); // seed the random number generator
    int min = 1, max = 10;
    for (int i = 0; i < n; i++) {
        int random_time1 = rand() % (max - min + 1) + min;
        int random_time2 = rand() % (max - min + 1) + min;
        #pragma omp task
        {
            something_useful(i, random_time1);

            // Use taskyield to allow other tasks to execute while waiting
            while (!omp_test_lock(lock)) {
                #pragma omp taskyield
                sleep(5);
            }

            something_critical(i, random_time2);
            omp_unset_lock(lock);
        }
    }
}

int main() {
    int n = 10; // Number of tasks
    omp_lock_t lock;
    omp_init_lock(&lock);

    #pragma omp parallel
    {
        #pragma omp single
        foo(&lock, n);
    }

    omp_destroy_lock(&lock);
    return 0;
}

Child tasks and taskwait construct

A task can create new child tasks:

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        {
            #pragma omp task
            printf("Hello.\n");

            printf("Hi.\n");
        }

        printf("Hej.\n");
    }

    printf("Goodbye.\n");

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hej.
Hi.
Hello.
Goodbye.
$ ./my_program
Hej.
Hello.
Hi.
Goodbye.

Note that child tasks are executed separately from the generating tasks. In particular, it is possible that a child task gets executed after the generating task has finished. We can use the taskwait construct to wait on the completion of child tasks of the generating task:

#pragma omp taskwait [clause[ [,] clause] ... ] new-line

This allows us to enforce an execution order:

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        {
            #pragma omp task
            printf("Hello.\n");

            #pragma omp taskwait

            printf("Hi.\n");
        }

        printf("Hej.\n");
    }

    printf("Goodbye.\n");

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hej.
Hello.
Hi.
Goodbye.

Controlling When Tasks Finish

taskwait Directive

!$omp taskwait

Purpose:

  • Ensures child tasks have completed

  • Does not consider grandchildren, etc.

barrier Directive

!$omp barrier

Purpose:

  • Ensures all tasks in the innermost parallel region have finished

Note

Instead of waiting, a thread can execute tasks generated elsewhere.

All threads in the team must reach the barrier and complete all explicit tasks bound to the parallel region before they are allowed to continue execution beyond the barrier:

#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp task
        printf("Hello.\n");

        #pragma omp barrier

        #pragma omp task
        printf("Goodbye.\n");
    }

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hello.
Hello.
...
Hello.
Goodbye.
Goodbye.
Goodbye.
...

Exercise

Inspect and run the following code to see the behavior of taskwait and barrier in the context of tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 19-taskwait-barrier-openmp.c -lm
// ml GCCcore/12.3.0 Clang/16.0.6
// clang -O3 -fopenmp -o test.x 19-taskwait-barrier-openmp.c
#include <stdio.h>
#include <omp.h>
#include <unistd.h>

void process_task(int task_id) {
    printf("task %d started by thread %d\n", task_id, omp_get_thread_num());
    #pragma omp taskyield
    // do some work
    sleep(3);
    printf("task %d completed by thread %d\n", task_id, omp_get_thread_num());
}

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        {
            // create an initial set of tasks
            for (int i = 0; i < 3; i++) {
                #pragma omp task
                process_task(i);
            }

            // wait for all child tasks created up to now
            #pragma omp taskwait
            printf("All child tasks completed, but grandchild tasks may still be running.\n");

            // Second set of tasks
            for (int i = 3; i < 6; i++) {
                #pragma omp task
                process_task(i);
            }
        }
        // A barrier synchronizes all threads in the parallel region
        #pragma omp barrier
        printf("All threads have reached the barrier.\n");
    }

    return 0;
}

taskgroup: Controlling Descendant Tasks (OpenMP 4.0)

A taskgroup construct defines a region with an implied task scheduling point at the end.

The current task is suspended until all descendant tasks (including grandchildren, etc.) have completed.

Fortran Syntax

!$omp taskgroup
do i = 1, n
    !$omp task ...
        call processing(...)
    !$omp end task
end do
!$omp end taskgroup  ! Waits for all tasks, including
                      ! tasks generated in processing

C Syntax

#pragma omp taskgroup
{
    for (int i = 0; i < n; i++)
    {
        #pragma omp task ...
        {
            processing(...);
        }
    }
}  // Waits for all tasks, including
   // tasks generated in processing
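As a minimal runnable sketch, mirroring the earlier taskwait example, the taskgroup below waits for both the child task and the grandchild task before the final line is printed:

#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            #pragma omp task
            {
                #pragma omp task // grandchild task
                printf("Hello.\n");

                printf("Hi.\n");
            }
        } // waits for the child task and the grandchild task

        printf("Goodbye.\n"); // always printed after Hello and Hi
    }
    return 0;
}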

Controlling Task Creation

The Overhead Problem

Creating a task incurs significant overhead:

  • Requires significant work inside the task to pay off

  • Too many small tasks can hurt performance

Solution: if Clause

Use the if clause to control task creation:

!$OMP task if(level .lt. 10) ...
    ...
!$OMP end task

If the expression evaluates to .false.:

  • Encountering thread executes the code body directly (undeferred task)

  • No task creation overhead

Exercise

Inspect and run the following code to see the behavior of the if clause in the context of tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 15-ifclause-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void process_task(int task_id) {
    printf("Processing iteration task %d in thread %d\n", task_id, omp_get_thread_num());
}

int main() {
    int n = 5; // Threshold for deciding if the task is executed synchronously or asynchronously

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 10; i++) {
                // task creation depends on the value of "i" and the "if" condition
                #pragma omp task if(i > n) // Tasks with i > n are deferred; others run in the current thread
                {
                    process_task(i);

                    // child tasks behave as regular tasks (can be deferred)
                    #pragma omp task
                    {
                        printf("Nested task for %d in thread %d\n", i, omp_get_thread_num());
                    }
                }
            }
        }
    }

    return 0;
}

Final Tasks

A task can carry a final clause to control task generation in descendants.

Syntax

!$OMP task final(level .gt. 30) ...
    ...
!$OMP end task

If the expression evaluates to .true.:

  • All encountered tasks within this task will be:

    • Included (executed immediately by encountering thread)

    • Final (any tasks they generate are in turn included and final)

Use Case

Prevents excessive task creation in deep recursion by serializing once a certain depth is reached.

Exercise

Inspect and run the following code to see the behavior of the final clause in the context of tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 16-finalclause-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void process_task(int task_id) {
    printf("Processing iteration task %d in thread %d\n", task_id, omp_get_thread_num());
}

int main() {
    int n = 5; // Threshold for the final clause

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 10; i++) {
                // when i <= n, the task is final: its child tasks are included and final
                // when i > n, the task is not final and its child tasks are regular tasks
                #pragma omp task final(i <= n)
                {
                    process_task(i);

                    #pragma omp task
                    {
                        printf("Nested task for %d in thread %d\n", i, omp_get_thread_num());
                        #pragma omp task
                        {
                            printf("Subnested task for %d in thread %d\n", i, omp_get_thread_num());
                        }
                    }
                }
            }
        }
    }
    return 0;
}

Mergeable Tasks

A task can be declared as mergeable for optimization.

Syntax

!$omp task mergeable ...
#pragma omp task mergeable ...

For an undeferred or included task, the implementation may:

  • Use the data environment of the generating task (including internal control variables)

  • Optimize by merging task execution

Use Case

Often used with final clause for optimization at deep recursion levels.
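A minimal sketch of this combination (the recursion and the depth threshold 8 are arbitrary assumptions): near the bottom of the recursion the generated tasks become final, so their descendants are included, and the mergeable clause permits the implementation to merge such tasks into the generating task's environment:

#include <stdio.h>
#include <omp.h>

void walk(int depth) {
    if (depth == 0) {
        printf("Reached the bottom on thread %d\n", omp_get_thread_num());
        return;
    }
    // Tasks generated below the threshold are final: everything they
    // generate is included, and mergeable allows merging with the
    // generating task's data environment.
    #pragma omp task final(depth < 8) mergeable firstprivate(depth)
    walk(depth - 1);
}

int main() {
    #pragma omp parallel
    #pragma omp single
    walk(16);
    return 0;
}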

Exercise

Inspect and run the following code to see the behavior of the mergeable clause in the context of tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 18b-mergeable-task-openmp.c -lm
// ml GCCcore/12.3.0 Clang/16.0.6
// clang -O3 -fopenmp -o test.x 18b-mergeable-task-openmp.c
#include <omp.h>
#include <stdio.h>

void process_task(int task_id) {
    printf("Processing task %d on thread %d\n", task_id, omp_get_thread_num());
}

int main() {
    int n = 10; // nr. of iterations

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < n; i++) {
                // tasks with the mergeable clause
                #pragma omp task mergeable
                {
                    process_task(i);
                }
            }
        }
    }

    return 0;
}

Task Scheduling Points

Threads may switch to a different task at a task scheduling point.

Task Scheduling Points Are

  1. Immediately after generation of an explicit task

  2. At the point of completion of a task

  3. At taskwait or taskyield

  4. At barrier (explicit or implicit)

  5. At the end of taskgroup

Untied Tasks (Advanced)

Warning

Untied tasks (not covered in this course) may switch at any point.

  • Be careful with critical regions and locks

  • Example: task may switch out of critical region → deadlock!

Case Study 1: Recursive Fibonacci

Fibonacci Numbers

Mathematical series:

\[\begin{split}F_0 &= 0 \\ F_1 &= 1 \\ F_n &= F_{n-1} + F_{n-2}\end{split}\]

First numbers in series: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …

Note

Recursive implementation: not efficient for computation, but good for demonstrating task parallelism!

Recursion Tree

             n
          /     \
       n-1       n-2
      /   \     /   \
   n-2   n-3  n-3  n-4
  / \    / \  / \  / \
...  ... ... ... ... ...

Serial Fibonacci Implementation (C)

unsigned long long recursive_fib(int in) {
    unsigned long long fibnum, sub1, sub2;

    if (in > 1) {
        sub1 = recursive_fib(in - 1);
        sub2 = recursive_fib(in - 2);
        fibnum = sub1 + sub2;
    } else {
        fibnum = in;
    }
    return fibnum;
}

Parallel Version: Attempt 1 (C)

Adding One Task

unsigned long long recursive_fib(int in) {
    unsigned long long fibnum, sub1, sub2;

    if (in > 1) {
        #pragma omp task shared(sub1) firstprivate(in)
        {
            sub1 = recursive_fib(in - 1);
        }
        sub2 = recursive_fib(in - 2);
        fibnum = sub1 + sub2;
    } else {
        fibnum = in;
    }
    return fibnum;
}

Data Sharing

  • sub1 is shared - declared inside function, must share to return result

  • in is firstprivate - initialized at task creation

Parallel Version: Attempt 2 (C)

Adding Second Task

unsigned long long recursive_fib(int in) {
    unsigned long long fibnum, sub1, sub2;

    if (in > 1) {
        #pragma omp task shared(sub1) firstprivate(in)
        {
            sub1 = recursive_fib(in - 1);
        }
        #pragma omp task shared(sub2) firstprivate(in)
        {
            sub2 = recursive_fib(in - 2);
        }
        fibnum = sub1 + sub2;
    } else {
        fibnum = in;
    }
    return fibnum;
}

Problem: Need to have sub1 and sub2 ready before computing fibnum!

Parallel Version: Final Solution (C)

Adding taskwait

unsigned long long recursive_fib(int in) {
    unsigned long long fibnum, sub1, sub2;

    if (in > 1) {
        #pragma omp task shared(sub1) firstprivate(in)
        {
            sub1 = recursive_fib(in - 1);
        }
        #pragma omp task shared(sub2) firstprivate(in)
        {
            sub2 = recursive_fib(in - 2);
        }
        #pragma omp taskwait
        fibnum = sub1 + sub2;
    } else {
        fibnum = in;
    }
    return fibnum;
}

Solution

  • taskwait waits for the 2 tasks above

  • Recursion takes care of grandchildren automatically

Calling the Parallel Fibonacci

Original Serial Code

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    unsigned long long fibres;
    int input;

    printf("Enter input: ");
    scanf("%d", &input);
    fibres = recursive_fib(input);
    printf("Fibonacci number %d is: %llu\n", input, fibres);

    return 0;
}

Attempt: Starting Parallel Region

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    unsigned long long fibres;
    int input;

    printf("Enter input: ");
    scanf("%d", &input);

    #pragma omp parallel shared(input, fibres) default(none)
    {
        fibres = recursive_fib(input);
    }

    printf("Fibonacci number %d is: %llu\n", input, fibres);

    return 0;
}

Problem: Each thread starts Fibonacci calculation! Multiple redundant computations.

Solution: Using single Construct

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    unsigned long long fibres;
    int input;

    printf("Enter input: ");
    scanf("%d", &input);

    #pragma omp parallel shared(input, fibres) default(none)
    {
        #pragma omp single
        {
            fibres = recursive_fib(input);
        }
    }

    printf("Fibonacci number %d is: %llu\n", input, fibres);

    return 0;
}

single ensures only one thread starts the recursion, but all threads can help execute tasks.

Performance: Fibonacci Number 40

How is the performance of the OpenMP tasks implementation?

Discussion: Fibonacci Performance

Key Findings

Task Overhead:

  • Creating tasks has significant overhead

  • Need sufficient work per task to justify overhead

Optimization Strategies:

  1. if clause: Prevents task creation for small problem sizes. Try adding this clause to the code above:

#pragma omp task shared(sub1) firstprivate(in) if(in > 30)
{
    sub1 = recursive_fib(in - 1);
}
#pragma omp task shared(sub2) firstprivate(in) if(in > 30)
{
    sub2 = recursive_fib(in - 2);
}
  2. Limit task creation: Only create tasks at higher recursion levels (see the sketch after this list)

  3. Balance: Between parallelism and overhead
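As a hedged sketch of these strategies, the following variant combines one task per recursive call (Attempt 1) with a taskwait and an if clause; the cut-off value 30 is an assumption to be tuned:

unsigned long long recursive_fib(int in) {
    unsigned long long fibnum, sub1, sub2;

    if (in > 1) {
        // One task per call; no task creation below the cut-off
        #pragma omp task shared(sub1) firstprivate(in) if(in > 30)
        {
            sub1 = recursive_fib(in - 1);
        }
        sub2 = recursive_fib(in - 2); // done by the encountering thread
        #pragma omp taskwait
        fibnum = sub1 + sub2;
    } else {
        fibnum = in;
    }
    return fibnum;
}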

Benchmark Setup

Hardware: Intel E5-2650 v3

Test: Computing Fibonacci(40)

Compilers: gfortran 6.3, ifort 17.1

Results Summary

[Figure: performance results for computing Fibonacci(40) with OpenMP tasks]

Key Observations

Both compilers show similar patterns:

  • Naive implementation (2 tasks per iteration): Poor performance

  • Using if clause (no tasks for low values): Helps significantly

  • 1 task per iteration: Helps even more

  • Problem: Too little work per task

  • Solution: Limit the number of tasks created

Hardware Details

Test System:

  • 2 sockets per server

  • Intel E5-2650 v3

  • 10 cores per processor

Compilers:

  • gfortran: Version 6.3 with thread binding

  • Intel ifort: Version 17.1 with thread binding

Case Study 2: Self-Refining Recursive Integrator

Mesh Refinement Concept

Codes employing irregular grids benefit from dynamic grid refinement/coarsening:

Example: Fluid dynamics

  • Refine grid where eddy develops

  • Coarsen when eddy vanishes

Case Study Application

A self-refining integrator for a 1D function.

Basic Algorithm

Integration with Adaptive Refinement

  1. Evaluate function at 5 regularly spaced points in interval

  2. Estimate integral using two methods:

    • Polygon using all 5 points (accurate)

    • Polygon using only 3 points (first, center, last) (coarse)

  3. Check difference between the two integrals:

    • Compare to threshold * interval length

  4. Decision:

    • If accurate: Add contribution to accumulator

    • If not accurate:

      • Split interval into two pieces

      • Run integrator on both pieces (recursion)

Implementation: Parallel Region

accumulator = 0.0D0

!$OMP parallel default(none) &
!$OMP shared(accumulator) &
!$OMP shared(startv, stopv, unit_err, gen_num)
    !$OMP single
        call rec_eval_shared_update( &
            startv, stopv, unit_err, gen_num)
    !$OMP end single
!$OMP end parallel

Key Design Decisions

Shared variable accumulator:

  • Declared as module variable

  • Used to accumulate results

single construct:

  • Starts the recursive integrator once

  • Implied barrier ensures all tasks are finished

Recursive subroutine:

  • rec_eval_shared_update

Implementation: Task Startup

!$OMP task shared(accumulator) firstprivate(my_start, my_stop) &
!$OMP default(none) firstprivate(my_gen, u_err) &
!$OMP if(task_start)
    call rec_eval_shared_update( &
        my_start, 0.5_dpr * (my_start + my_stop), u_err, my_gen)
!$OMP end task

!$OMP task shared(accumulator) firstprivate(my_start, my_stop) &
!$OMP default(none) firstprivate(my_gen, u_err) &
!$OMP if(task_start)
    call rec_eval_shared_update( &
        0.5_dpr * (my_start + my_stop), my_stop, u_err, my_gen)
!$OMP end task

Recursion Strategy

  • Split interval in half

  • Create two tasks for sub-intervals

  • Each task may recursively subdivide further

Implementation: Result Accumulation

Three Approaches

1. Shared variable with atomic update:

!$omp atomic update
accumulator = accumulator + contribution

2. Threadprivate variables:

  • Thread executing task adds to its threadprivate copy

  • After barrier (implied in end single): atomic update of threadprivate data into shared variable

Remarks:

Warning

Be careful with threadprivate and task scheduling points!

  • Value can be changed after scheduling point

  • threadprivate isn’t private to the task

3. Task reductions (OpenMP 5.0):

OpenMP 5.0 adds reduction clauses for tasks.
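A minimal sketch of the mechanism, assuming a compiler with OpenMP 5.0 support: the task_reduction clause on a taskgroup defines the reduction scope, and each participating task opts in with in_reduction:

#include <stdio.h>
#include <omp.h>

int main() {
    int sum = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+: sum)
        {
            for (int i = 1; i <= 100; i++) {
                #pragma omp task in_reduction(+: sum) firstprivate(i)
                sum += i;
            }
        } // reduction result is available after the taskgroup

        printf("sum = %d\n", sum); // prints 5050
    }
    return 0;
}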

Test Function

Mathematical Function

\[f(x) = \sin^2(10000x) \cdot \sin^4(x)\]

Properties

  • Highly oscillatory (due to sin(10000x) term)

  • Requires adaptive refinement

  • Samples more densely where function varies rapidly

Sampling Pattern

The integrator samples most densely where the function oscillates most rapidly, demonstrating the effectiveness of adaptive refinement.

Performance Results: Integrator

Configuration

  • Task started every 5th generation

  • Two accumulation strategies tested

[Figure: performance results for the self-refining integrator]

Key Findings

Atomic updates:

  • Poor results (millions of atomic operations)

Threadprivate accumulation:

  • Satisfactory results

  • Efficient utilization of up to 128 cores

Compiler Comparison: Integrator

Note

GCC shows inferior scalability beyond 20 cores compared to Intel compiler.

Task Dependencies (OpenMP 4.0)

Declare explicit dependencies between tasks to control execution order.

Syntax

Fortran:

!$omp task depend(type : list)

C:

#pragma omp task depend(type : list)

Dependency Types

in:

  • Task depends on all previous siblings with out or inout dependency on one or more list items

out, inout:

  • Task depends on all previous siblings with in, out, or inout dependency on one or more list items

List Format

The list contains variables, which may include array sections.

Example: Task Dependencies

#pragma omp task depend(out: a)
task_function_1(&a);

#pragma omp task depend(in: a)
task_function_2(a);

#pragma omp task depend(in: a)
task_function_3(a);

Execution Order

  1. Wait for task_function_1 to finish (it writes a)

  2. Then execute task_function_2 and task_function_3 in any order on any thread (both read a)

Benefits

  • Explicit control over task ordering

  • Runtime can optimize scheduling within dependency constraints

  • More flexible than barriers

Exercise

Inspect and run the following code to see the behavior of dependencies in the context of tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 14-lifecycle-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void taskA(int *a) {
    *a += 1;
    printf("Executing taskA, a=%d\n", *a);
}

void taskB(int *a, int *b) {
    *a += 1;
    *b += 1;
    printf("Executing taskB, a=%d  b=%d\n", *a, *b);
}

void taskC(int *a, int *b, int *c) {
    *a += 1;
    *b += 1;
    *c += 1;
    printf("Executing taskC, a=%d  b=%d  c=%d\n", *a, *b, *c);
}

int main() {

    int a = 0;
    int b = 0;
    int c = 0;

    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task depend(out : a)
            taskA(&a);  // taskA runs independently

            #pragma omp task depend(in : a) depend(out : b)
            taskB(&a, &b);  // taskB does not start until taskA completes
                            // (its in dependence on "a" matches taskA's
                            // out dependence)

            #pragma omp task depend(in : b)
            taskC(&a, &b, &c);  // taskC does not start until taskB completes

        }
    }
    return 0;
}

Exercise

In the previous implementation of the Fibonacci generator code, we noticed that the values of Fibonacci numbers are computed several times during recursion. This situation can be avoided by using task dependencies, as in the following graph, which shows the dependencies that we can create explicitly for Fibonacci number 6:

fib[6] depends on fib[5] and fib[4]
fib[5] depends on fib[4] and fib[3]
fib[4] depends on fib[3] and fib[2]
fib[3] depends on fib[2] and fib[1]
fib[2] depends on fib[1] and fib[0]

Base values:
fib[1] = 1
fib[0] = 0

Write a code for the Fibonacci numbers generator but now using task dependencies.

taskloop Construct (OpenMP 4.5)

The taskloop construct distributes loop iterations onto tasks.

Similarity

Similar to the loop construct (omp for/omp do), but creates tasks instead of directly distributing iterations.

Default Behavior

By default, taskloop implies a taskgroup.

taskloop: Basic Syntax

Fortran

!$OMP taskloop default(none) shared(…) private(…)
do i = 1, N
    ...
enddo

C

#pragma omp taskloop default(none) shared(…) private(…)
for (i = 0; i < N; i++)
{
    ...
}

Clauses for taskloop

Standard Clauses

Clauses introduced previously that work with taskloop:

  • if(scalar-expr)

  • shared

  • private

  • firstprivate

  • lastprivate

  • default

  • collapse

  • final(scalar-expr)

Important Differences

Warning

  • No reduction clause for taskloop

  • Use nogroup clause to remove the implied taskgroup

Additional Construct

There is also a taskloop simd construct for vectorization.

Exercise

Inspect and run the following code to see the behavior of taskloop in the context of tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 20a-taskloop-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void process_iteration(int i) {
    printf("Processing iteration %d in thread %d\n", i, omp_get_thread_num());
}

int main() {
    int n = 10; // Number of iterations

    #pragma omp parallel
    {
        // Use taskloop to distribute iterations as tasks
        #pragma omp single
        {
            #pragma omp taskloop
            for (int i = 0; i < n; i++) {
                process_iteration(i);
            }
        }
    }

    return 0;
}

Exercise

Inspect and run the following code to see the behavior of taskloop in the context of tasks.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 20b-taskloop-openmp.c -lm
#include <stdio.h>
#include <omp.h>

int main() {
    int n = 2000; // Number of iterations
    int sumR = 0; // Summation
    #pragma omp parallel shared(sumR)
    {
        // Use taskloop to distribute iterations as tasks
        #pragma omp single
        {
            #pragma omp taskloop
            for (int i = 0; i < n; i++) {
                #pragma omp atomic
                sumR += 1;
            }
        }
    }

    printf("Total summation = %d \n", sumR);

    return 0;
}

Controlling Number of Tasks Created

Control the granularity of task creation to balance parallelism and overhead.

Clauses

Use grainsize or num_tasks (only one allowed):

grainsize Clause

!$omp taskloop grainsize(100)

Controls number of loop iterations per task:

  • Each task gets between grainsize and 2*grainsize iterations

num_tasks Clause

!$omp taskloop num_tasks(10)

Specifies the number of tasks to create.

Additional Restrictions

Final number of tasks may be affected by iteration count and other factors.
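A minimal sketch contrasting the two clauses (the values 100 and 10, and the array update, are arbitrary assumptions):

#include <stdio.h>

#define N 1000

int main() {
    static double a[N];

    #pragma omp parallel
    #pragma omp single
    {
        // grainsize: each task receives between 100 and 200
        // consecutive iterations
        #pragma omp taskloop grainsize(100)
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        // num_tasks: create (about) 10 tasks; the implied taskgroup
        // of the first taskloop has already completed at this point
        #pragma omp taskloop num_tasks(10)
        for (int i = 0; i < N; i++)
            a[i] = a[i] + 1.0;
    }

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}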

Summary

Task Construct:

  • Flexible parallelism for irregular problems

  • Tasks can be created dynamically

  • Execution by any thread, now or later

Task Scheduling:

  • Task scheduling points

  • taskwait: wait for child tasks

  • taskgroup: wait for all descendants

  • taskyield: allow suspension

Task Completion:

  • Multiple mechanisms to control when tasks finish

  • Dependencies between tasks (OpenMP 4.0)

Performance Aspects:

  • Task creation has overhead

  • Need a decent amount of work per task

  • Use if and final clauses to control task generation

  • Balance between parallelism and overhead

Case Studies:

  • Recursive Fibonacci: demonstrated task basics

  • Self-refining integrator: demonstrated adaptive algorithms

Advanced Features:

  • taskloop (OpenMP 4.5): distribute loops onto tasks

  • Task dependencies: explicit ordering

  • Various accumulation strategies for results