The Task Directive
Objectives
Learn the basic terminology.
Learn about the task construct.
Learn more about data sharing rules.
Learn how to construct a task graph in a tree-based manner.
Learn how to construct a task graph in a centralised manner.
Learn how to wait for tasks to complete their execution.
Beyond Regular Loops
Regular vs. Irregular Structures
So far we’ve discussed regular structures:
Loops with known start and end
Fortran array constructs
Many problems have irregular structures:
Recursive algorithms
Linked lists
Loops with unknown end (e.g., while loops)
Divide and conquer algorithms
And more…
Note
Depending on the details, irregular structures might still be parallelizable using tasks.
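As a preview of the rest of this section, here is a minimal sketch of how tasks can parallelize pointer chasing over a linked list, a loop whose length is unknown in advance (the node type and the process function are hypothetical):

#include <stdio.h>

// a hypothetical singly linked list
typedef struct node { int value; struct node *next; } node_t;

void process(node_t *p) { printf("value = %d\n", p->value); }

int main() {
    // build a short list: 0 -> 1 -> 2
    node_t nodes[3] = {{0, &nodes[1]}, {1, &nodes[2]}, {2, NULL}};

    #pragma omp parallel
    #pragma omp single
    {
        // the loop length is unknown in advance: create one task per node
        for (node_t *p = &nodes[0]; p != NULL; p = p->next) {
            #pragma omp task firstprivate(p)
            process(p);
        }
    }
    return 0;
}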
The Task Construct
The task construct provides a flexible way to express parallelism for irregular problems.
Key Concept
Tasks allow you to create units of work that can be executed by any thread in the team, now or later.
The Task Directive in Fortran
Syntax
!$omp task [clauses]
code body
!$omp end task
What It Does
Creates an “explicit task” from:
Code body
Data environment at that point
Place inside a parallel region
Execution
The task may execute:
When: Now or later
By whom: Encountering thread or other thread
The thread that encounters the task construct creates an explicit task from the structured block.
The encountering thread
may execute the task immediately or
defer its execution to one of the other threads in the team.
A task is always bound to the innermost parallel region. If a task construct is encountered outside a parallel region, then the structured block is executed immediately by the encountering thread.
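A minimal sketch of the second case, assuming the program is compiled with OpenMP enabled: without an enclosing parallel region, the structured block runs immediately in the encountering thread:

#include <stdio.h>
#include <omp.h>

int main() {
    // no enclosing parallel region: the task region is executed
    // immediately by the encountering (initial) thread
    #pragma omp task
    printf("Executed immediately by thread %d.\n", omp_get_thread_num());
    return 0;
}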
The task construct accepts a set of clauses:
if([ task :] scalar-expression)
final(scalar-expression)
untied
default(shared | none)
mergeable
private(list)
firstprivate(list)
shared(list)
in_reduction(reduction-identifier : list)
depend([depend-modifier,] dependence-type : locator-list)
priority(priority-value)
allocate([allocator :] list)
affinity([aff-modifier :] locator-list)
detach(event-handle)
We can already recognise some of the clauses.
For example, the if clause can be used to enable/disable the creation of the corresponding task, and the default, private, firstprivate, and shared clauses are used to control the data sharing rules.
It should be noted that some of these clauses behave slightly differently when compared to the traditional OpenMP constructs.
Let us return to the earlier “Hello world” program:
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp task
        printf("Hello world!\n");
    }
    return 0;
}
Note that the task pragma is inside a parallel construct.
Each thread in the team
encounters the task construct,
creates the corresponding task and
either executes the task immediately or defers its execution to one of the other threads in the team.
Therefore, the number of tasks, and thus the number of printed lines, equals the number of threads in the team:
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hello world!
Hello world!
...
Hello world!
The Task Directive in C
Syntax
#pragma omp task [clauses]
{
code body
}
What It Does
Creates an “explicit task” from:
Code body
Data environment at that point
Place inside a parallel region
Execution
The task may execute:
When: Now or later
By whom: Encountering thread or other thread
Allowed Data Sharing Attributes for Tasks
Available Attributes
private:
Data is private to the task
firstprivate:
Data is private to the task
Data initialized when task directive is encountered
shared:
Data is shared
Only way to return a result from a task (see the sketch after this list)!
default:
Fortran: shared | private | firstprivate | none
C: shared | none
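To illustrate why shared matters, here is a minimal sketch of returning a result from a task through a shared variable (compute is a hypothetical work function; the taskwait construct, covered later in this section, ensures the task has finished before the result is read):

#include <stdio.h>

int compute() { return 42; }  // hypothetical work function

int main() {
    int result = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(result)
        result = compute();   // shared: the result is visible outside the task
        #pragma omp taskwait  // wait for the task before reading the result
        printf("result = %d\n", result);
    }
    return 0;
}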
Data Sharing Without a default Clause
When no default is declared on a task directive:
Default Rules
If variable is shared by all implicit tasks in the current team:
Variable is: shared
Otherwise:
Variable is: firstprivate
Recommendation
Important
Use default(none) to be explicit about data sharing!
Example: Task Execution Flow
Consider this code:
code block 1
!$omp task
code block 2
!$omp end task
code block 3
Thread Encountering This Code
Executes “code block 1”
Creates a task for “code block 2”
May:
Execute the task for “code block 2”
Pick up another task
Continue with “code block 3”
At some point: Has to execute code block 3
No Control Over
Warning
Who executes code block 2
When code block 2 is finished
Exercise
Use two tasks, one printing the string Hello and the other the string World. Use 4 threads.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 12a-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel num_threads(4)
    {
        #pragma omp task
        { printf("Hello\n"); }

        #pragma omp task
        { printf("World\n"); }
    }
    return 0;
}
Let us consider the data sharing rules with the following example:
#include <stdio.h>

int main() {
    int number = 1;
    #pragma omp parallel
    {
        #pragma omp task
        {
            printf("I think the number is %d.\n", number);
            number++;
        }
    }
    return 0;
}
The output of the program is not that surprising:
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
I think the number is 1.
I think the number is 2.
I think the number is 2.
I think the number is 3.
...
That is, variables declared outside the parallel construct are still shared by default.
This is consistent with the three basic data sharing rules.
If we move the variable number inside the parallel construct, then the variable becomes firstprivate by default:
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        int number = 1;
        #pragma omp task
        {
            printf("I think the number is %d.\n", number);
            number++;
        }
    }
    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
I think the number is 1.
I think the number is 1.
I think the number is 1.
I think the number is 1.
...
The value of the variable is copied when the task is created.
Exercise
Replace the fixme strings with the data-sharing attributes of the variables in the code when the defaults are used. (Exercise inspired by an example of Christian Terboven.)
#include <stdio.h>
#include <omp.h>

static int first = 1; // File scope (static global)

int main() {
    // Variables in main scope
    int second = 2;
    int third = 3;

    #pragma omp parallel private(second)
    {
        // Variable in parallel region scope
        int fourth = 4;

        #pragma omp task
        {
            // Variable in task scope
            int fifth = 5;

            // Data scoping in this task:
            // "first": *fixme* (static global)
            // "second": *fixme* (private in parallel, becomes firstprivate in task)
            // "third": *fixme* (declared outside parallel)
            // "fourth": *fixme* (declared in parallel region)
            // "fifth": *fixme* (declared in task)

            printf("1=%d, 2=%d, 3=%d, 4=%d, 5=%d\n",
                   first, second, third, fourth, fifth);
        }
    }
    return 0;
}
Now, use default(none) and write the data-sharing attributes of the variables in the fixme strings to obtain the same behavior as in the previous code:
#pragma omp parallel private(*fixme*) default(none) shared(*fixme*, *fixme*)
{
    int fourth = 4;

    #pragma omp task default(none) firstprivate(*fixme*, *fixme*) shared(*fixme*, *fixme*)
    {
        int fifth = 5;
        printf("1=%d, 2=%d, 3=%d, 4=%d, 5=%d\n",
               first, second, third, fourth, fifth);
    }
}
Solution
With the default data-sharing rules, the variables are scoped as follows:
#include <stdio.h>
#include <omp.h>

static int first = 1; // File scope (static global)

int main() {
    // Variables in main scope
    int second = 2;
    int third = 3;

    #pragma omp parallel private(second)
    {
        // Variable in parallel region scope
        int fourth = 4;

        #pragma omp task
        {
            // Variable in task scope
            int fifth = 5;

            // Data scoping in this task:
            // "first": shared (static global)
            // "second": firstprivate (private in parallel, becomes firstprivate in task)
            // "third": shared (declared outside parallel)
            // "fourth": firstprivate (declared in parallel region)
            // "fifth": private (declared in task)

            printf("1=%d, 2=%d, 3=%d, 4=%d, 5=%d\n",
                   first, second, third, fourth, fifth);
        }
    }
    return 0;
}
And when using default(none), the variables need to be declared as follows to obtain the same behavior as in the previous code:
#pragma omp parallel private(second) default(none) shared(first, third)
{
    int fourth = 4;

    #pragma omp task default(none) firstprivate(second, fourth) shared(first, third)
    {
        int fifth = 5;
        printf("1=%d, 2=%d, 3=%d, 4=%d, 5=%d\n",
               first, second, third, fourth, fifth);
    }
}
Single construct
In the earlier examples, each thread in the team created a task.
This is sometimes very convenient as the need for new tasks might arise gradually.
However, it is more likely that we want to generate the task graph in a centralised manner, i.e. only one thread should generate the tasks.
This can be accomplished by combining the parallel and single constructs:
#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single nowait
    {
        #pragma omp task
        printf("Hello world!\n");
    }
    return 0;
}
The nowait clause removes the redundant barrier from the end of the single construct.
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hello world!
Note that the binding thread set for a single region is the current team.
That is, any thread in the current team can execute the task.
Exercise
Write a program that creates 10 tasks. Each task should print the number of the thread that executes it.
Hint: From the omp.h header file:
int omp_get_thread_num (void);
Solution
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    #pragma omp single nowait
    {
        for (int i = 0; i < 10; i++) {
            #pragma omp task
            printf("I am thread no. %d.\n", omp_get_thread_num());
        }
    }
    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
I am thread no. 5.
I am thread no. 13.
I am thread no. 6.
I am thread no. 8.
I am thread no. 4.
I am thread no. 7.
I am thread no. 15.
I am thread no. 12.
I am thread no. 0.
I am thread no. 9.
Exercise
Inspect and run the following code to see the behavior of shared, private, and firstprivate variables in the context of tasks. Notice that one can use atomic updates together with tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 13a-sharedprivate-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void process_task(int id, int *shared_data) {
    printf("Task %d: shared_data = %d\n", id, *shared_data);
}

int main() {
    int shared_data = 100;  // variable that is shared across tasks
    int private_data = 200; // variable that is private to each task

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 4; i++) {
                int firstprivate_data = i * 10; // firstprivate variable: unique for each task but initialized with i * 10

                #pragma omp task shared(shared_data) private(private_data) firstprivate(firstprivate_data)
                {
                    // here we modify the private_data variable, which is unique to each task
                    private_data = i * 100;
                    printf("Task %d: private_data = %d, firstprivate_data = %d\n", i, private_data, firstprivate_data);

                    // Calling a function with shared data
                    process_task(i, &shared_data);

                    // Use an atomic update (or a critical section) to safely modify shared_data
                    #pragma omp atomic
                    shared_data += 1;
                }
            }
        }
    }

    printf("Final value of shared_data: %d\n", shared_data);

    return 0;
}
Exercise
Inspect and run the following code to see the behavior of firstprivate variables in the context of tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 13b-firstprivate-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void task_example() {
    int n = 5; // a shared variable with an initial value

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 5; i++) {
                #pragma omp task firstprivate(n)
                {
                    // n is private to each task, initialized with its value at task creation
                    n += i; // Modify n for this task
                    printf("Task %d: n = %d\n", i, n);
                }
            }
        }
    }
}

int main() {
    task_example();
    return 0;
}
Exercise
The following function takes an integer as input and prints the number of the thread that handles the execution of the function. Use tasks to parallelize a for loop where each task executes this function:
void process_element(int i) {
    // do some work for each element by printing the thread ID executing this work
    printf("Processing element %d in thread %d\n", i, omp_get_thread_num());
}
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 12c-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void process_element(int i) {
    // do some work for each element by printing the thread ID executing this work
    printf("Processing element %d in thread %d\n", i, omp_get_thread_num());
}

int main() {
    int n = 10;

    #pragma omp parallel
    {
        // single directive ensures that only one thread generates tasks
        #pragma omp single
        {
            for (int i = 0; i < n; i++) {
                // Create a new task for each array element
                #pragma omp task
                {
                    process_element(i);
                }
            }
            printf("Generating tasks \n");
        }
    }

    return 0;
}
Allowing Suspension of Current Task
taskyield Construct
At a taskyield construct, the current task can be suspended to execute a different task.
Syntax
Fortran:
!$omp taskyield
C:
#pragma omp taskyield
Use Case
Allows the runtime to schedule other tasks while waiting for resources or dependencies.
Exercise
Inspect and run the following code to see the behavior of taskyield in the context of tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 17-taskyield-openmp.c -lm
// ml GCCcore/12.3.0 Clang/16.0.6
// clang -O3 -fopenmp -o test.x 17-taskyield-openmp.c
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

void something_useful(int id, int random_time) {
    sleep(random_time); // Simulate work
    printf("Task %d: did something useful on thread %d\n", id, omp_get_thread_num());
}

void something_critical(int id, int random_time) {
    sleep(random_time); // Simulate longer work
    printf("Task %d: did something critical on thread %d\n", id, omp_get_thread_num());
}

void foo(omp_lock_t *lock, int n) {
    srand(1999); // seed the random number generator
    int min = 1, max = 10;
    for (int i = 0; i < n; i++) {
        int random_time1 = rand() % (max - min + 1) + min;
        int random_time2 = rand() % (max - min + 1) + min;
        #pragma omp task
        {
            something_useful(i, random_time1);

            // Use taskyield to allow other tasks to execute while waiting
            while (!omp_test_lock(lock)) {
                #pragma omp taskyield
                sleep(5);
            }

            something_critical(i, random_time2);
            omp_unset_lock(lock);
        }
    }
}

int main() {
    int n = 10; // Number of tasks
    omp_lock_t lock;
    omp_init_lock(&lock);

    #pragma omp parallel
    {
        #pragma omp single
        foo(&lock, n);
    }

    omp_destroy_lock(&lock);
    return 0;
}
Child tasks and taskwait construct
A task can create new child tasks:
#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        {
            #pragma omp task
            printf("Hello.\n");

            printf("Hi.\n");
        }

        printf("Hej.\n");
    }

    printf("Goodbye.\n");

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hej.
Hi.
Hello.
Goodbye.
$ ./my_program
Hej.
Hello.
Hi.
Goodbye.
Note that child tasks are executed separately from the generating tasks.
In particular, it is possible that a child task gets executed after the generating task has finished.
We can use the taskwait construct to wait on the completion of child tasks of the generating task:
#pragma omp taskwait [clause[ [,] clause] ... ] new-line
This allows us to enforce an execution order:
#include <stdio.h>

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        {
            #pragma omp task
            printf("Hello.\n");

            #pragma omp taskwait

            printf("Hi.\n");
        }

        printf("Hej.\n");
    }

    printf("Goodbye.\n");

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hej.
Hello.
Hi.
Goodbye.
Controlling When Tasks Finish
taskwait Directive
!$omp taskwait
Purpose:
Ensures child tasks have completed
Does not consider grandchildren, etc.
barrier Directive
!$omp barrier
Purpose:
Ensures all tasks in the innermost parallel region have finished
Note
Instead of waiting, a thread can execute tasks generated elsewhere.
All threads in the team must reach the barrier and complete all explicit tasks bound to the parallel region before they are allowed to continue execution beyond the barrier:
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp task
        printf("Hello.\n");

        #pragma omp barrier

        #pragma omp task
        printf("Goodbye.\n");
    }

    return 0;
}
$ gcc -o my_program my_program.c -Wall -fopenmp
$ ./my_program
Hello.
Hello.
...
Hello.
Goodbye.
Goodbye.
Goodbye.
...
Exercise
Inspect and run the following code to see the behavior of taskwait and barrier in the context of tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 19-taskwait-barrier-openmp.c -lm
// ml GCCcore/12.3.0 Clang/16.0.6
// clang -O3 -fopenmp -o test.x 19-taskwait-barrier-openmp.c
#include <stdio.h>
#include <omp.h>
#include <unistd.h>

void process_task(int task_id) {
    printf("task %d started by thread %d\n", task_id, omp_get_thread_num());
    #pragma omp taskyield
    // do some work
    sleep(3);
    printf("task %d completed by thread %d\n", task_id, omp_get_thread_num());
}

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        {
            // create an initial set of tasks
            for (int i = 0; i < 3; i++) {
                #pragma omp task
                process_task(i);
            }

            // wait for all child tasks created up to now
            #pragma omp taskwait
            printf("All tasks at this level completed, but grandchild tasks may still be running.\n");

            // Second set of tasks
            for (int i = 3; i < 6; i++) {
                #pragma omp task
                process_task(i);
            }
        }
        // A barrier synchronizes all threads in the parallel region
        #pragma omp barrier
        printf("All threads have reached the barrier.\n");
    }

    return 0;
}
taskgroup: Controlling Descendant Tasks (OpenMP 4.0)
A taskgroup construct defines a region with an implied task scheduling point at the end.
The current task is suspended until all descendant tasks (including grandchildren, etc.) have completed.
Fortran Syntax
!$omp taskgroup
do i = 1, n
!$omp task ...
call processing(...)
!$omp end task
end do
!$omp end taskgroup ! Waits for all tasks, including
! tasks generated in processing
C Syntax
#pragma omp taskgroup
{
for (int i = 0; i < n; i++)
{
#pragma omp task ...
{
processing(...);
}
}
} // Waits for all tasks, including
// tasks generated in processing
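A minimal runnable sketch of the C form, with a hypothetical processing function that spawns a child task of its own to show that descendants are awaited as well:

#include <stdio.h>
#include <omp.h>

void processing(int i) {
    // hypothetical work that spawns a child task of its own
    #pragma omp task
    printf("child of task %d on thread %d\n", i, omp_get_thread_num());
}

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            for (int i = 0; i < 4; i++) {
                #pragma omp task
                processing(i);
            }
        } // waits for the four tasks and all of their descendants
        printf("All descendant tasks have completed.\n");
    }
    return 0;
}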
Controlling Task Creation
The Overhead Problem
Creating a task incurs significant overhead:
Requires significant work inside the task to pay off
Too many small tasks can hurt performance
Solution: if Clause
Use the if clause to control task creation:
!$OMP task if(level .lt. 10) ...
...
!$OMP end task
If the expression evaluates to .false.:
Encountering thread executes code body directly (included task)
No task creation overhead
Exercise
Inspect and run the following code to see the behavior of the if clause in the context of tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 15-ifclause-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void process_task(int task_id) {
    printf("Processing iteration task %d in thread %d\n", task_id, omp_get_thread_num());
}

int main() {
    int n = 5; // Threshold for deciding if the task is executed synchronously or asynchronously

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 10; i++) {
                // task creation depends on the value of "i" via the "if" clause
                #pragma omp task if(i > n) // Tasks with i > n are deferred; others run in the current thread
                {
                    process_task(i);

                    // child tasks behave as regular tasks (can be deferred)
                    #pragma omp task
                    {
                        printf("Nested task for %d in thread %d\n", i, omp_get_thread_num());
                    }
                }
            }
        }
    }

    return 0;
}
Final Tasks
A task can carry a final clause to control task generation in descendants.
Syntax
!$OMP task final(level .gt. 30) ...
...
!$OMP end task
If the expression evaluates to .true.:
All encountered tasks within this task will be:
Included (executed immediately by encountering thread)
Final (any tasks they generate are in turn included and final)
Use Case
Prevents excessive task creation in deep recursion by serializing once a certain depth is reached.
Exercise
Inspect and run the following code to see the behavior of the final clause in the context of tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 16-finalclause-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void process_task(int task_id) {
    printf("Processing iteration task %d in thread %d\n", task_id, omp_get_thread_num());
}

int main() {
    int n = 5; // Threshold for the final clause

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 10; i++) {
                // when i <= n, the task is final and its child tasks are included and final
                // when i > n, the task is not final and its child tasks are regular tasks
                #pragma omp task final(i <= n)
                {
                    process_task(i);

                    #pragma omp task
                    {
                        printf("Nested task for %d in thread %d\n", i, omp_get_thread_num());
                        #pragma omp task
                        {
                            printf("Subnested task for %d in thread %d\n", i, omp_get_thread_num());
                        }
                    }
                }
            }
        }
    }
    return 0;
}
Mergeable Tasks
A task can be declared as mergeable for optimization.
Syntax
!$omp task mergeable ...
#pragma omp task mergeable ...
For an undeferred or included task, the implementation may:
Use the data environment of the generating task (including internal control variables)
Optimize by merging task execution
Use Case
Often used with final clause for optimization at deep recursion levels.
Exercise
Inspect and run the following code to see the behavior of the mergeable clause in the context of tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 18b-mergeable-task-openmp.c -lm
// ml GCCcore/12.3.0 Clang/16.0.6
// clang -O3 -fopenmp -o test.x 18b-mergeable-task-openmp.c
#include <omp.h>
#include <stdio.h>

void process_task(int task_id) {
    printf("Processing task %d on thread %d\n", task_id, omp_get_thread_num());
}

int main() {
    int n = 10; // nr. of iterations

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < n; i++) {
                // tasks with the mergeable clause
                #pragma omp task mergeable
                {
                    process_task(i);
                }
            }
        }
    }

    return 0;
}
Task Scheduling Points
Threads may switch to a different task at a task scheduling point.
Task Scheduling Points Are
Immediately after generation of an explicit task
After point of completion of a task
At taskwait or taskyield
At barrier (explicit or implicit)
At the end of taskgroup
Untied Tasks (Advanced)
Warning
Untied tasks (not covered in this course) may switch at any point.
Be careful with critical regions and locks
Example: a task may switch out of a critical region → deadlock!
Case Study 1: Recursive Fibonacci
Fibonacci Numbers
Mathematical series: F(0) = 0, F(1) = 1, F(n) = F(n-1) + F(n-2) for n > 1
First numbers in series: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …
Note
Recursive implementation: not efficient for computation, but good for demonstrating task parallelism!
Recursion Tree
n
/ \
n-1 n-2
/ \ / \
n-2 n-3 n-3 n-4
/ \ / \ / \ / \
... ... ... ... ... ...
Serial Fibonacci Implementation (C)
unsigned long long recursive_fib(int in) {
unsigned long long fibnum, sub1, sub2;
if (in > 1) {
sub1 = recursive_fib(in - 1);
sub2 = recursive_fib(in - 2);
fibnum = sub1 + sub2;
} else {
fibnum = in;
}
return fibnum;
}
Parallel Version: Attempt 1 (C)
Adding One Task
unsigned long long recursive_fib(int in) {
unsigned long long fibnum, sub1, sub2;
if (in > 1) {
#pragma omp task shared(sub1) firstprivate(in)
{
sub1 = recursive_fib(in - 1);
}
sub2 = recursive_fib(in - 2);
fibnum = sub1 + sub2;
} else {
fibnum = in;
}
return fibnum;
}
Data Sharing
sub1 is shared: declared inside function, must share to return result
in is firstprivate: initialized at task creation
Parallel Version: Attempt 2 (C)
Adding Second Task
unsigned long long recursive_fib(int in) {
unsigned long long fibnum, sub1, sub2;
if (in > 1) {
#pragma omp task shared(sub1) firstprivate(in)
{
sub1 = recursive_fib(in - 1);
}
#pragma omp task shared(sub2) firstprivate(in)
{
sub2 = recursive_fib(in - 2);
}
fibnum = sub1 + sub2;
} else {
fibnum = in;
}
return fibnum;
}
Problem: Need to have sub1 and sub2 ready before computing fibnum!
Parallel Version: Final Solution (C)
Adding taskwait
unsigned long long recursive_fib(int in) {
unsigned long long fibnum, sub1, sub2;
if (in > 1) {
#pragma omp task shared(sub1) firstprivate(in)
{
sub1 = recursive_fib(in - 1);
}
#pragma omp task shared(sub2) firstprivate(in)
{
sub2 = recursive_fib(in - 2);
}
#pragma omp taskwait
fibnum = sub1 + sub2;
} else {
fibnum = in;
}
return fibnum;
}
Solution
taskwait waits for the 2 tasks above
Recursion takes care of grandchildren automatically
Calling the Parallel Fibonacci
Original Serial Code
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
unsigned long long fibres;
int input;
printf("Enter input: ");
scanf("%d", &input);
fibres = recursive_fib(input);
printf("Fibonacci number %d is: %llu\n", input, fibres);
return 0;
}
Attempt: Starting Parallel Region
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char *argv[]) {
unsigned long long fibres;
int input;
printf("Enter input: ");
scanf("%d", &input);
#pragma omp parallel shared(input, fibres) default(none)
{
fibres = recursive_fib(input);
}
printf("Fibonacci number %d is: %llu\n", input, fibres);
return 0;
}
Problem: Each thread starts Fibonacci calculation! Multiple redundant computations.
Solution: Using single Construct
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char *argv[]) {
unsigned long long fibres;
int input;
printf("Enter input: ");
scanf("%d", &input);
#pragma omp parallel shared(input, fibres) default(none)
{
#pragma omp single
{
fibres = recursive_fib(input);
}
}
printf("Fibonacci number %d is: %llu\n", input, fibres);
return 0;
}
single ensures only one thread starts the recursion, but all threads can help execute tasks.
Serial Fibonacci Implementation (Fortran)
recursive function recursive_fib(in) result(fibnum)
integer, intent(in) :: in
integer(lint) :: fibnum, sub1, sub2
if (in .gt. 1) then
sub1 = recursive_fib(in - 1)
sub2 = recursive_fib(in - 2)
fibnum = sub1 + sub2
else
fibnum = in
endif
end function recursive_fib
Parallel Version: Attempt 1 (Fortran)
Adding One Task
recursive function recursive_fib(in) result(fibnum)
integer, intent(in) :: in
integer(lint) :: fibnum, sub1, sub2
if (in .gt. 1) then
!$OMP task shared(sub1) firstprivate(in)
sub1 = recursive_fib(in - 1)
!$OMP end task
sub2 = recursive_fib(in - 2)
fibnum = sub1 + sub2
else
fibnum = in
endif
end function recursive_fib
Data Sharing
sub1 is shared: declared inside function, must share to return result
in is firstprivate: initialized at task creation
Parallel Version: Attempt 2 (Fortran)
Adding Second Task
recursive function recursive_fib(in) result(fibnum)
integer, intent(in) :: in
integer(lint) :: fibnum, sub1, sub2
if (in .gt. 1) then
!$OMP task shared(sub1) firstprivate(in)
sub1 = recursive_fib(in - 1)
!$OMP end task
!$OMP task shared(sub2) firstprivate(in)
sub2 = recursive_fib(in - 2)
!$OMP end task
fibnum = sub1 + sub2
else
fibnum = in
endif
end function recursive_fib
Problem: Need to have sub1 and sub2 ready before computing fibnum!
Parallel Version: Final Solution (Fortran)
Adding taskwait
recursive function recursive_fib(in) result(fibnum)
integer, intent(in) :: in
integer(lint) :: fibnum, sub1, sub2
if (in .gt. 1) then
!$OMP task shared(sub1) firstprivate(in)
sub1 = recursive_fib(in - 1)
!$OMP end task
!$OMP task shared(sub2) firstprivate(in)
sub2 = recursive_fib(in - 2)
!$OMP end task
!$OMP taskwait
fibnum = sub1 + sub2
else
fibnum = in
endif
end function recursive_fib
Solution
taskwait waits for the 2 tasks above
Recursion takes care of grandchildren automatically
Calling the Parallel Fibonacci
Original Serial Code
program fibonacci
!$ use omp_lib
integer, parameter :: lint = selected_int_kind(10)
integer(lint) :: fibres
integer :: input
read (*,*) input
fibres = recursive_fib(input)
print *, "Fibonacci number", input, " is:", fibres
end program fibonacci
Attempt: Starting Parallel Region
program fibonacci
!$ use omp_lib
integer, parameter :: lint = selected_int_kind(10)
integer(lint) :: fibres
integer :: input
read (*,*) input
!$OMP parallel shared(input, fibres) default(none)
fibres = recursive_fib(input)
!$OMP end parallel
print *, "Fibonacci number", input, " is:", fibres
end program fibonacci
Problem: Each thread starts Fibonacci calculation! Multiple redundant computations.
Solution: Using single Construct
program fibonacci
!$ use omp_lib
integer, parameter :: lint = selected_int_kind(10)
integer(lint) :: fibres
integer :: input
read (*,*) input
!$OMP parallel shared(input, fibres) default(none)
!$OMP single
fibres = recursive_fib(input)
!$OMP end single
!$OMP end parallel
print *, "Fibonacci number", input, " is:", fibres
end program fibonacci
single ensures only one thread starts the recursion, but all threads can help execute tasks.
Performance: Fibonacci Number 40
How is the performance of the OpenMP tasks implementation?
Discussion: Fibonacci Performance
Key Findings
Task Overhead:
Creating tasks has significant overhead
Need sufficient work per task to justify overhead
Optimization Strategies:
if clause: Prevents task creation for small problem sizes. Try adding this clause to the code above:
#pragma omp task shared(sub1) firstprivate(in) if(in > 30)
{
sub1 = recursive_fib(in - 1);
}
#pragma omp task shared(sub2) firstprivate(in) if(in > 30)
{
sub2 = recursive_fib(in - 2);
}
Limit task creation: Only create tasks at higher recursion levels (see the sketch after this list)
Balance: Between parallelism and overhead
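Combining these strategies, here is a hedged sketch of the Fibonacci function that creates one task per recursive call and stops creating tasks below a cutoff, falling back to a serial version (the cutoff value 30 is an assumption and should be tuned for the machine at hand):

// serial version used below the cutoff
unsigned long long serial_fib(int in) {
    return (in > 1) ? serial_fib(in - 1) + serial_fib(in - 2) : in;
}

unsigned long long recursive_fib(int in) {
    unsigned long long sub1, sub2;
    if (in <= 30)                  // small subproblem: no tasks at all
        return serial_fib(in);
    #pragma omp task shared(sub1) firstprivate(in)
    sub1 = recursive_fib(in - 1);
    sub2 = recursive_fib(in - 2);  // one task per call is often enough
    #pragma omp taskwait
    return sub1 + sub2;
}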
Benchmark Setup
Hardware: Intel E5-2650 v3
Test: Computing Fibonacci(40)
Compilers: gfortran 6.3, ifort 17.1
Results Summary
Key Observations
Both compilers show similar patterns:
Naive implementation (2 tasks per recursive call): Poor performance
Using if clause (no tasks for small inputs): Helps significantly
1 task per recursive call: Helps even more
Problem: Too little work per task
Solution: Limit the number of tasks created
Hardware Details
Test System:
2 sockets per server
Intel E5-2650 v3
10 cores per processor
Compilers:
gfortran: Version 6.3 with thread binding
Intel ifort: Version 17.1 with thread binding
Case Study 2: Self-Refining Recursive Integrator
Mesh Refinement Concept
Codes employing irregular grids benefit from dynamic grid refinement/coarsening:
Example: Fluid dynamics
Refine grid where eddy develops
Coarsen when eddy vanishes
Case Study Application
Self-refining integrator for 1D function.
Basic Algorithm
Integration with Adaptive Refinement
Evaluate function at 5 regularly spaced points in interval
Estimate integral using two methods:
Polygon using all 5 points (accurate)
Polygon using only 3 points (first, center, last) (coarse)
Check difference between the two integrals:
Compare to threshold * interval length
Decision:
If accurate: Add contribution to accumulator
If not accurate:
Split interval into two pieces
Run integrator on both pieces (recursion)
Implementation: Parallel Region
accumulator = 0.0D0
!$OMP parallel default(none) &
!$OMP shared(accumulator) &
!$OMP shared(startv, stopv, unit_err, gen_num)
!$OMP single
call rec_eval_shared_update( &
startv, stopv, unit_err, gen_num)
!$OMP end single
!$OMP end parallel
Key Design Decisions
Shared variable accumulator:
Declared as module variable
Used to accumulate results
single construct:
Starts the recursive integrator once
Implied barrier ensures all tasks are finished
Recursive subroutine:
rec_eval_shared_update
Implementation: Task Startup
!$OMP task shared(accumulator) firstprivate(my_start, my_stop) &
!$OMP default(none) firstprivate(my_gen, u_err) &
!$OMP if(task_start)
call rec_eval_shared_update( &
my_start, 0.5_dpr * (my_start + my_stop), u_err, my_gen)
!$OMP end task
!$OMP task shared(accumulator) firstprivate(my_start, my_stop) &
!$OMP default(none) firstprivate(my_gen, u_err) &
!$OMP if(task_start)
call rec_eval_shared_update( &
0.5_dpr * (my_start + my_stop), my_stop, u_err, my_gen)
!$OMP end task
Recursion Strategy
Split interval in half
Create two tasks for sub-intervals
Each task may recursively subdivide further
Implementation: Result Accumulation
Three Approaches
1. Shared variable with atomic update:
!$omp atomic update
accumulator = accumulator + contribution
2. Threadprivate variables:
Thread executing task adds to its threadprivate copy
After barrier (implied in
end single): atomic update of threadprivate data into shared variable
Remarks:
Warning
Be careful with threadprivate and task scheduling points!
Value can be changed after scheduling point
threadprivate isn’t private to the task
OpenMP 5.0:
OpenMP 5.0 has reduction constructs for tasks.
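A minimal C sketch of the threadprivate strategy, assuming tied tasks and a hypothetical piece function; every thread folds its partial sum into the shared variable exactly once, after the implied barrier:

#include <stdio.h>
#include <omp.h>

double accumulator = 0.0;  // shared result
double my_sum = 0.0;       // per-thread partial sum
#pragma omp threadprivate(my_sum)

// hypothetical contribution of one subinterval
double piece(int i) { return 1.0 / (1.0 + i); }

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 1000; i++) {
                #pragma omp task firstprivate(i)
                my_sum += piece(i); // executing thread updates its own copy
            }
        } // implied barrier: all tasks have completed here
        // one atomic update per thread instead of one per task
        #pragma omp atomic update
        accumulator += my_sum;
    }
    printf("accumulator = %f\n", accumulator);
    return 0;
}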
Test Function
Mathematical Function
Properties
Highly oscillatory (due to sin(10000x) term)
Requires adaptive refinement
Samples more densely where function varies rapidly
Sampling Pattern
The integrator samples most densely where the function oscillates most rapidly, demonstrating the effectiveness of adaptive refinement.
Performance Results: Integrator
Configuration
Task started every 5th generation
Two accumulation strategies tested
Key Findings
Atomic updates:
Poor results (millions of atomic operations)
Threadprivate accumulation:
Satisfactory results
Efficient utilization of up to 128 cores
Compiler Comparison: Integrator
Note
GCC shows inferior scalability beyond 20 cores compared to Intel compiler.
Task Dependencies (OpenMP 4.0)
Declare explicit dependencies between tasks to control execution order.
Syntax
Fortran:
!$omp task depend(type : list)
C:
#pragma omp task depend(type : list)
Dependency Types
in:
Task depends on all previous siblings with an out or inout dependency on one or more list items
out, inout:
Task depends on all previous siblings with an in, out, or inout dependency on one or more list items
List Format
The list contains variables, which may include array sections.
Example: Task Dependencies
#pragma omp task depend(out: a)
task_function_1(&a);
#pragma omp task depend(in: a)
task_function_2(a);
#pragma omp task depend(in: a)
task_function_3(a);
Execution Order
Wait for task_function_1 to finish (it writes a)
Then execute task_function_2 and task_function_3 in any order on any thread (both read a)
Benefits
Explicit control over task ordering
Runtime can optimize scheduling within dependency constraints
More flexible than barriers
Exercise
Inspect and run the following code to see the behavior of dependencies in the context of tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 14-lifecycle-task-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void taskA(int *a) {
    *a += 1;
    printf("Executing taskA, a=%d\n", *a);
}

void taskB(int *a, int *b) {
    *a += 1;
    *b += 1;
    printf("Executing taskB, a=%d b=%d\n", *a, *b);
}

void taskC(int *a, int *b, int *c) {
    *a += 1;
    *b += 1;
    *c += 1;
    printf("Executing taskC, a=%d b=%d c=%d\n", *a, *b, *c);
}

int main() {

    int a = 0;
    int b = 0;
    int c = 0;

    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task depend(out : a)
            taskA(&a); // taskA runs independently

            #pragma omp task depend(in : a) depend(out : b)
            taskB(&a, &b); // taskB is deferred until taskA completes;
                           // once its dependences are satisfied, it is
                           // ready to be scheduled

            #pragma omp task depend(in : b)
            taskC(&a, &b, &c); // taskC is deferred until taskB completes
        }
    }
    return 0;
}
Exercise
In the previous implementation of the Fibonacci generator, we noticed that the values of Fibonacci numbers are computed several times during the recursion. This can be avoided by using task dependencies, as in the following graph, which shows the dependencies that we can create explicitly for Fibonacci number 6:
fib[6]
depends on fib[5] and fib[4]
|
v
fib[5]
depends on fib[4] and fib[3]
|
v
fib[4]
depends on fib[3] and fib[2]
/ \
v v
fib[3] fib[2]
depends on fib[2] and fib[1] depends on fib[1] and fib[0]
| \ | \
v v v v
fib[2] fib[1] fib[1] fib[0]
depends on fib[1] and fib[0]
| \
v v
fib[1] fib[0]
Base values:
fib[1] = 1
fib[0] = 0
Write a code for the Fibonacci numbers generator but now using task dependencies.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=4
// gcc -O3 -march=native -fopenmp -o test.x fibonacci_task_dependencies.c -lm
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
    int n = (argc > 1) ? atoi(argv[1]) : 40;
    unsigned long long *fib = malloc((n+1) * sizeof(unsigned long long));

    #pragma omp parallel
    {
        #pragma omp single
        {
            // Base cases
            #pragma omp task depend(out: fib[0])
            { fib[0] = 0; }

            #pragma omp task depend(out: fib[1])
            { fib[1] = 1; }

            // Create tasks for F(2) .. F(n)
            for (int i = 2; i <= n; i++) {
                #pragma omp task depend(in: fib[i-1], fib[i-2]) depend(out: fib[i])
                {
                    fib[i] = fib[i-1] + fib[i-2];
                }
            }

            // Wait until fib[n] is available
            #pragma omp taskwait
            printf("fib(%d) = %llu\n", n, fib[n]);
        }
    }

    free(fib);
    return 0;
}
taskloop Construct (OpenMP 4.5)
The taskloop construct distributes loop iterations onto tasks.
Similarity
Similar to the loop construct (omp for/omp do), but creates tasks instead of directly distributing iterations.
Default Behavior
By default, taskloop implies a taskgroup.
taskloop: Basic Syntax
Fortran
!$OMP taskloop default(none) shared(…) private(…)
do i = 1, N
...
enddo
C
#pragma omp taskloop default(none) shared(…) private(…)
for (i = 0; i < N; i++)
{
...
}
Clauses for taskloop
Standard Clauses
Clauses introduced previously that work with taskloop:
if(scalar-expr)
shared
private
firstprivate
lastprivate
default
collapse
final(scalar-expr)
Important Differences
Warning
No reduction clause for taskloop
Use the nogroup clause to remove the implied taskgroup
Additional Construct
There is also a taskloop simd construct for vectorization.
Exercise
Inspect and run the following code to see the behavior of taskloop in the context of tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 20a-taskloop-openmp.c -lm
#include <stdio.h>
#include <omp.h>

void process_iteration(int i) {
    printf("Processing iteration %d in thread %d\n", i, omp_get_thread_num());
}

int main() {
    int n = 10; // Number of iterations

    #pragma omp parallel
    {
        // Use taskloop to distribute iterations as tasks
        #pragma omp single
        {
            #pragma omp taskloop
            for (int i = 0; i < n; i++) {
                process_iteration(i);
            }
        }
    }

    return 0;
}
Exercise
Inspect and run the following code to see the behavior of taskloop in the context of tasks.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 20b-taskloop-openmp.c -lm
#include <stdio.h>
#include <omp.h>

int main() {
    int n = 2000; // Number of iterations
    int sumR = 0; // Summation
    #pragma omp parallel shared(sumR)
    {
        // Use taskloop to distribute iterations as tasks
        #pragma omp single
        {
            #pragma omp taskloop
            for (int i = 0; i < n; i++) {
                #pragma omp atomic
                sumR += 1;
            }
        }
    }

    printf("Total summation = %d \n", sumR);

    return 0;
}
Controlling Number of Tasks Created
Control the granularity of task creation to balance parallelism and overhead.
Clauses
Use grainsize or num_tasks (only one allowed):
grainsize Clause
!$omp taskloop grainsize(100)
Controls number of loop iterations per task:
Each task gets between grainsize and 2*grainsize iterations
num_tasks Clause
!$omp taskloop num_tasks(10)
Specifies the number of tasks to create.
Additional Restrictions
Final number of tasks may be affected by iteration count and other factors.
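A minimal sketch of grainsize in C (the numbers are arbitrary assumptions): with n = 1000 and grainsize(100), each generated task covers between 100 and 199 iterations:

#include <stdio.h>
#include <omp.h>

int main() {
    int n = 1000;
    #pragma omp parallel
    #pragma omp single
    {
        // each generated task executes between 100 and 199 iterations
        #pragma omp taskloop grainsize(100)
        for (int i = 0; i < n; i++) {
            if (i % 250 == 0) // print only a few samples
                printf("iteration %d on thread %d\n", i, omp_get_thread_num());
        }
    }
    return 0;
}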
Summary
Task Construct:
Flexible parallelism for irregular problems
Tasks can be created dynamically
Execution by any thread, now or later
Task Scheduling:
Task scheduling points
taskwait: wait for child tasks
taskgroup: wait for all descendants
taskyield: allow suspension
Task Completion:
Multiple mechanisms to control when tasks finish
Dependencies between tasks (OpenMP 4.0)
Performance Aspects:
Task creation has overhead
Need decent amount of work per task
Use if and final clauses to control task generation
Balance between parallelism and overhead
Case Studies:
Recursive Fibonacci: demonstrated task basics
Self-refining integrator: demonstrated adaptive algorithms
Advanced Features:
taskloop (OpenMP 4.5): distribute loops onto tasks
Task dependencies: explicit ordering
Various accumulation strategies for results