The Task Directive
Beyond Regular Loops
Regular vs. Irregular Structures
So far we’ve discussed regular structures:
Loops with known start and end
Fortran array constructs
Many problems have irregular structures:
Recursive algorithms
Linked lists
Loops with unknown end (e.g., while loops)
Divide-and-conquer algorithms
And more…
Note
Depending on the details, irregular structures might still be parallelizable using tasks.
The Task Construct
The task construct provides a flexible way to express parallelism for irregular problems.
Key Concept
Tasks allow you to create units of work that can be executed by any thread in the team, now or later.
The Task Directive in Fortran
Syntax
!$omp task [clauses]
code body
!$omp end task
What It Does
Creates an “explicit task” from:
Code body
Data environment at that point
Must be placed inside a parallel region
Execution
The task may execute:
When: Now or later
By whom: Encountering thread or other thread
The Task Directive in C
Syntax
#pragma omp task [clauses]
{
code body
}
What It Does
Creates an “explicit task” from:
Code body
Data environment at that point
Must be placed inside a parallel region
Execution
The task may execute:
When: Now or later
By whom: Encountering thread or other thread
Allowed Data Sharing Attributes for Tasks
Available Attributes
private:
Data is private to the task
firstprivate:
Data is private to the task
Data initialized when task directive is encountered
shared:
Data is shared
Only way to return a result from a task!
default:
Fortran:
shared | private | firstprivate | none
C:
shared | none
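To make the attributes concrete, here is a minimal C sketch (the variable names are illustrative, not from the course): a shared variable carries the result out of the task, and a firstprivate variable is captured by value at task creation. The taskwait directive used here is introduced below.

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int x = 1;       /* captured by value when the task is created */
  int result = 0;  /* shared: the only way to return a result */

  #pragma omp parallel shared(result) firstprivate(x) default(none)
  #pragma omp single
  {
    #pragma omp task shared(result) firstprivate(x) default(none)
    result = x + 41;          /* writes through the shared variable */
    #pragma omp taskwait      /* result is valid only after the task completes */
    printf("result = %d\n", result);
  }
  return 0;
}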
Data Sharing Without a default Clause
When no default clause is given on a task directive:
Default Rules
If variable is shared by all implicit tasks in the current team:
Variable is: shared
Otherwise:
Variable is: firstprivate
Recommendation
Important
Use default(none) to be explicit about data sharing!
Example: Task Execution Flow
Consider this code:
code block 1
!$omp task
code block 2
!$omp end task
code block 3
Thread Encountering This Code
Executes “code block 1”
Creates a task for “code block 2”
May:
Execute the task for “code block 2”
Pick up another task
Continue with “code block 3”
At some point: Has to execute code block 3
No Control Over
Warning
Who executes code block 2
When code block 2 is finished
Controlling When Tasks Finish
taskwait Directive
!$omp taskwait
Purpose:
Ensures child tasks have completed
Does not consider grandchildren, etc.
barrier Directive
!$omp barrier
Purpose:
Ensures all tasks in the innermost parallel region have finished
Note
Instead of waiting, a thread can execute tasks generated elsewhere.
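A minimal C sketch of the difference (the work function is a stand-in): taskwait covers the child task, but not the grandchild generated inside it.

#include <stdio.h>
#include <omp.h>

void work(const char *who) { printf("%s done\n", who); }

int main(void)
{
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task          /* child */
    {
      #pragma omp task        /* grandchild: NOT covered by the taskwait */
      work("grandchild");
      work("child");
    }
    #pragma omp taskwait      /* guarantees only that the child has finished */
    printf("the grandchild may still be running here\n");
  }                           /* implicit barrier: now all tasks are done */
  return 0;
}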
Allowing Suspension of Current Task
taskyield Construct
At a taskyield construct, the current task can be suspended to execute a different task.
Syntax
Fortran:
!$omp taskyield
C:
#pragma omp taskyield
Use Case
Allows the runtime to schedule other tasks while waiting for resources or dependencies.
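A common illustration (adapted from the OpenMP examples document; the function name is our choice): spin on a lock, but offer the runtime a chance to execute other tasks on each failed attempt.

#include <omp.h>

void busy_wait_for_lock(omp_lock_t *lock)
{
  while (!omp_test_lock(lock)) {  /* try to acquire without blocking */
    #pragma omp taskyield         /* meanwhile, the thread may run other tasks */
  }
}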
taskgroup: Controlling Descendant Tasks (OpenMP 4.0)
A taskgroup construct defines a region with an implied task scheduling point at the end.
The current task is suspended until all descendant tasks (including grandchildren, etc.) have completed.
Fortran Syntax
!$omp taskgroup
do i = 1, n
  !$omp task ...
  call processing(...)
  !$omp end task
end do
!$omp end taskgroup ! Waits for all tasks, including
                    ! tasks generated in processing
C Syntax
#pragma omp taskgroup
{
  for (int i = 0; i < n; i++)
  {
    #pragma omp task ...
    {
      processing(...);
    }
  }
} // Waits for all tasks, including
  // tasks generated in processing
Controlling Task Creation
The Overhead Problem
Creating a task incurs significant overhead:
A task needs to contain a significant amount of work to pay off
Too many small tasks can hurt performance
Solution: if Clause
Use the if clause to control task creation:
!$OMP task if(level .lt. 10) ...
...
!$OMP end task
If the expression evaluates to .false.:
Encountering thread executes code body directly (included task)
No task creation overhead
Final Tasks
A task can carry a final clause to control task generation in its descendants.
Syntax
!$OMP task final(level .gt. 30) ...
...
!$OMP end task
If the expression evaluates to .true.:
All encountered tasks within this task will be:
Included (executed immediately by encountering thread)
Final (they also cannot generate new tasks)
Use Case
Prevents excessive task creation in deep recursion by serializing once a certain depth is reached.
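A C sketch of the idea, under assumed names (node_t, do_work, and the depth cutoff of 30 are illustrative): once final triggers, the whole subtree runs serially in the encountering thread.

#include <stddef.h>

typedef struct node { struct node *left, *right; int value; } node_t;

void do_work(node_t *n) { n->value *= 2; }  /* placeholder work */

void walk(node_t *n, int depth)
{
  if (n == NULL) return;
  /* beyond depth 30 the task is included (run immediately) and final
     (its descendants generate no new tasks) */
  #pragma omp task final(depth > 30) firstprivate(n, depth)
  {
    do_work(n);
    walk(n->left,  depth + 1);
    walk(n->right, depth + 1);
  }
}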
Mergeable Tasks
A task can be declared as mergeable for optimization.
Syntax
!$omp task mergeable ...
#pragma omp task mergeable ...
For an undeferred or included task, the implementation may:
Use the data environment of the generating task (including internal control variables)
Optimize by merging task execution
Use Case
Often used together with the final clause for optimization at deep recursion levels.
Task Scheduling Points
Threads may switch to a different task at a task scheduling point.
Task Scheduling Points Are
Immediately after generation of an explicit task
At the point of completion of a task
At taskwait or taskyield
At barrier (explicit or implicit)
At the end of taskgroup
Untied Tasks (Advanced)
Warning
Untied tasks (not covered in this course) may switch at any point.
Be careful with critical regions and locks
Example: a task may switch out of a critical region → deadlock!
Case Study 1: Recursive Fibonacci
Fibonacci Numbers
Mathematical series: F(n) = F(n-1) + F(n-2), with F(0) = 0 and F(1) = 1
First numbers in the series: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …
Note
Recursive implementation: not efficient for computation, but good for demonstrating task parallelism!
Serial Fibonacci Implementation (Fortran)
recursive function recursive_fib(in) result(fibnum)
  integer, intent(in) :: in
  integer(lint) :: fibnum, sub1, sub2
  if (in .gt. 1) then
    sub1 = recursive_fib(in - 1)
    sub2 = recursive_fib(in - 2)
    fibnum = sub1 + sub2
  else
    fibnum = in
  endif
end function recursive_fib
Recursion Tree
n
/ \
n-1 n-2
/ \ / \
n-2 n-3 n-3 n-4
/ \ / \ / \ / \
... ... ... ... ... ...
Parallel Version: Attempt 1 (Fortran)
Adding One Task
recursive function recursive_fib(in) result(fibnum)
  integer, intent(in) :: in
  integer(lint) :: fibnum, sub1, sub2
  if (in .gt. 1) then
    !$OMP task shared(sub1) firstprivate(in)
    sub1 = recursive_fib(in - 1)
    !$OMP end task
    sub2 = recursive_fib(in - 2)
    fibnum = sub1 + sub2
  else
    fibnum = in
  endif
end function recursive_fib
Data Sharing
sub1 is shared - declared inside the function; must be shared to return the result
in is firstprivate - initialized at task creation
Parallel Version: Attempt 2 (Fortran)
Adding Second Task
recursive function recursive_fib(in) result(fibnum)
  integer, intent(in) :: in
  integer(lint) :: fibnum, sub1, sub2
  if (in .gt. 1) then
    !$OMP task shared(sub1) firstprivate(in)
    sub1 = recursive_fib(in - 1)
    !$OMP end task
    !$OMP task shared(sub2) firstprivate(in)
    sub2 = recursive_fib(in - 2)
    !$OMP end task
    fibnum = sub1 + sub2
  else
    fibnum = in
  endif
end function recursive_fib
Danger
Problem: sub1 and sub2 must be ready before computing fibnum!
Parallel Version: Final Solution (Fortran)
Adding taskwait
recursive function recursive_fib(in) result(fibnum)
  integer, intent(in) :: in
  integer(lint) :: fibnum, sub1, sub2
  if (in .gt. 1) then
    !$OMP task shared(sub1) firstprivate(in)
    sub1 = recursive_fib(in - 1)
    !$OMP end task
    !$OMP task shared(sub2) firstprivate(in)
    sub2 = recursive_fib(in - 2)
    !$OMP end task
    !$OMP taskwait
    fibnum = sub1 + sub2
  else
    fibnum = in
  endif
end function recursive_fib
Solution
taskwait waits for the two tasks above
Recursion takes care of grandchildren automatically
Calling the Parallel Fibonacci
Original Serial Code
program fibonacci
  !$ use omp_lib
  integer, parameter :: lint = selected_int_kind(10)
  integer(lint) :: fibres
  integer :: input
  read (*,*) input
  fibres = recursive_fib(input)
  print *, "Fibonacci number", input, " is:", fibres
end program fibonacci
Attempt: Starting Parallel Region
program fibonacci
  !$ use omp_lib
  integer, parameter :: lint = selected_int_kind(10)
  integer(lint) :: fibres
  integer :: input
  read (*,*) input
  !$OMP parallel shared(input, fibres) default(none)
  fibres = recursive_fib(input)
  !$OMP end parallel
  print *, "Fibonacci number", input, " is:", fibres
end program fibonacci
Danger
Problem: Every thread starts its own Fibonacci calculation! Redundant computation, plus racing writes to the shared fibres.
Solution: Using single Construct
program fibonacci
  !$ use omp_lib
  integer, parameter :: lint = selected_int_kind(10)
  integer(lint) :: fibres
  integer :: input
  read (*,*) input
  !$OMP parallel shared(input, fibres) default(none)
  !$OMP single
  fibres = recursive_fib(input)
  !$OMP end single
  !$OMP end parallel
  print *, "Fibonacci number", input, " is:", fibres
end program fibonacci
Note
single ensures only one thread starts the recursion, but all threads can help execute tasks.
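For reference, a C transcription of the same program is sketched below (the long result type and the I/O are our choices, not from the course):

#include <stdio.h>
#include <omp.h>

long fib(int n)
{
  long sub1, sub2;
  if (n <= 1) return n;
  #pragma omp task shared(sub1) firstprivate(n) default(none)
  sub1 = fib(n - 1);
  #pragma omp task shared(sub2) firstprivate(n) default(none)
  sub2 = fib(n - 2);
  #pragma omp taskwait          /* both children must finish before the sum */
  return sub1 + sub2;
}

int main(void)
{
  int input = 0;
  long fibres = 0;
  if (scanf("%d", &input) != 1) return 1;
  #pragma omp parallel shared(input, fibres) default(none)
  #pragma omp single            /* one thread starts the recursion... */
  fibres = fib(input);          /* ...all threads help execute the tasks */
  printf("Fibonacci number %d is: %ld\n", input, fibres);
  return 0;
}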
Performance: Fibonacci Number 40
Benchmark Setup
Hardware: Intel E5-2650 v3
Test: Computing Fibonacci(40)
Compilers: gfortran 6.3, ifort 17.1
Results Summary
Time (seconds) - logarithmic scale
1000 ┤ ■ Naive (2 tasks/iteration)
│
100 ┤ □ serial 10 (if clause, cutoff=10)
│ ○ serial 30 (if clause, cutoff=30)
10 ┤ ● serial 10, 1 task/iteration
│ △ serial 30, 1 task/iteration
1 ┤
└─────┴─────┴─────┴─────┴─────
1 2 5 10 20 Cores
Key Observations
Both compilers show similar patterns:
Naive implementation (2 tasks per iteration): Poor performance
Using if clause (no tasks for low values): Helps significantly
1 task per iteration: Helps even more
Problem: Too little work per task
Solution: Limit the number of tasks created
Discussion: Fibonacci Performance
Key Findings
Task Overhead:
Creating tasks has significant overhead
Need sufficient work per task to justify overhead
Optimization Strategies:
if clause: Prevents task creation for small problem sizes
Limit task creation: Only create tasks at higher recursion levels
Balance: Between parallelism and overhead
Hardware Details
Test System:
2 sockets per server
Intel E5-2650 v3
10 cores per processor
Compilers:
gfortran: Version 6.3 with thread binding
Intel ifort: Version 17.1 with thread binding
Case Study 2: Self-Refining Recursive Integrator
Mesh Refinement Concept
Codes employing irregular grids benefit from dynamic grid refinement/coarsening:
Example: Fluid dynamics
Refine grid where eddy develops
Coarsen when eddy vanishes
Case Study Application
Self-refining integrator for 1D function.
Basic Algorithm
Integration with Adaptive Refinement
Evaluate function at 5 regularly spaced points in interval
Estimate integral using two methods:
Polygon using all 5 points (accurate)
Polygon using only 3 points (first, center, last) (coarse)
Check difference between the two integrals:
Compare to threshold × interval length
Decision:
If accurate: Add contribution to accumulator
If not accurate:
Split interval into two pieces
Run integrator on both pieces (recursion)
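Before parallelizing, here is a serial C sketch of this algorithm (the integrand f, the helper names, and the trapezoid rule are placeholders, not the course code):

#include <math.h>

static double f(double x) { return sin(10000.0 * x); }  /* placeholder integrand */

static double trap(double xl, double yl, double xr, double yr)
{
  return 0.5 * (xr - xl) * (yl + yr);  /* area of one trapezoid */
}

double integrate(double a, double b, double threshold)
{
  double h = 0.25 * (b - a);
  double x[5], y[5];
  for (int i = 0; i < 5; i++) { x[i] = a + i * h; y[i] = f(x[i]); }

  /* accurate estimate: polygon through all 5 points */
  double fine = 0.0;
  for (int i = 0; i < 4; i++) fine += trap(x[i], y[i], x[i+1], y[i+1]);

  /* coarse estimate: polygon through first, center, last point */
  double coarse = trap(x[0], y[0], x[2], y[2]) + trap(x[2], y[2], x[4], y[4]);

  if (fabs(fine - coarse) <= threshold * (b - a))
    return fine;                      /* accurate enough: accept */

  /* otherwise split the interval and recurse on both halves */
  return integrate(a, x[2], threshold) + integrate(x[2], b, threshold);
}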
Implementation: Parallel Region
accumulator = 0.0D0
!$OMP parallel default(none) &
!$OMP   shared(accumulator) &
!$OMP   shared(startv, stopv, unit_err, gen_num)
!$OMP single
call rec_eval_shared_update( &
     startv, stopv, unit_err, gen_num)
!$OMP end single
!$OMP end parallel
Key Design Decisions
Shared variable accumulator:
Declared as module variable
Used to accumulate results
single construct:
Starts the recursive integrator once
Implied barrier ensures all tasks are finished
Recursive subroutine:
rec_eval_shared_update
Implementation: Task Startup
!$OMP task shared(accumulator) firstprivate(my_start, my_stop) &
!$OMP   default(none) firstprivate(my_gen, u_err) &
!$OMP   if(task_start)
call rec_eval_shared_update( &
     my_start, 0.5_dpr * (my_start + my_stop), u_err, my_gen)
!$OMP end task
!$OMP task shared(accumulator) firstprivate(my_start, my_stop) &
!$OMP   default(none) firstprivate(my_gen, u_err) &
!$OMP   if(task_start)
call rec_eval_shared_update( &
     0.5_dpr * (my_start + my_stop), my_stop, u_err, my_gen)
!$OMP end task
Recursion Strategy
Split interval in half
Create two tasks for sub-intervals
Each task may recursively subdivide further
Implementation: Result Accumulation
Three Approaches
1. Shared variable with atomic update:
!$omp atomic update
accumulator = accumulator + contribution
2. Threadprivate variables:
Thread executing task adds to its threadprivate copy
After the barrier (implied at end single): atomic update of the threadprivate data into the shared variable
Remarks:
Warning
Be careful with threadprivate and task scheduling points!
The thread's copy can be changed by another task at a scheduling point
threadprivate data is private to the thread, not to the task
3. OpenMP 5.0 task reductions:
OpenMP 5.0 adds reduction support for tasks (task_reduction on taskgroup, in_reduction on task).
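A C sketch of the threadprivate strategy (variable names and the dummy work are assumptions): each thread accumulates into its own copy while executing tasks, and only one atomic update per thread touches the shared variable.

#include <stdio.h>
#include <omp.h>

static double shared_total = 0.0;
static double my_part = 0.0;          /* one copy per thread */
#pragma omp threadprivate(my_part)

int main(void)
{
  #pragma omp parallel
  {
    #pragma omp single
    for (int i = 1; i <= 1000; i++) {
      #pragma omp task firstprivate(i)
      my_part += 1.0 / (double)i;     /* executing thread adds to ITS copy */
    }
    /* implied barrier at end single: all tasks are done here */
    #pragma omp atomic update
    shared_total += my_part;          /* one atomic add per thread */
  }
  printf("total = %f\n", shared_total);
  return 0;
}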
Test Function
Mathematical Function
Properties
Highly oscillatory (due to sin(10000x) term)
Requires adaptive refinement
Samples more densely where function varies rapidly
Sampling Pattern
The integrator samples most densely where the function oscillates most rapidly, demonstrating the effectiveness of adaptive refinement.
Performance Results: Integrator
Configuration
Task started every 5th generation
Two accumulation strategies tested
Results
Speedup
30 ┤
│ ○ threadprivate accumulation
25 ┤ ○
│ ○
20 ┤ ○
│ ○
15 ┤ ○
│ ○
10 ┤ ○ ■ atomic updates
│ ○ ■
5 ┤○ ■
│■
0 └─────┴─────┴─────┴─────┴─────┴─────
1 10 20 40 80 128 Cores
Key Findings
Atomic updates:
Poor results (millions of atomic operations)
Threadprivate accumulation:
Satisfactory results
Efficient utilization of up to 128 cores
Compiler Comparison: Integrator
Speedup
25 ┤
│ ● Intel ifort
20 ┤ ●
│ ●
15 ┤ ●
│ ●
10 ┤ ● ○ GCC gfortran
│ ● ○
5 ┤ ● ○
│● ○
0 └─────┴─────┴─────┴─────
1 10 20 40 Cores
Observation
Note
GCC shows inferior scalability beyond 20 cores compared to Intel compiler.
Task Dependencies (OpenMP 4.0)
Declare explicit dependencies between tasks to control execution order.
Syntax
Fortran:
!$omp task depend(type : list)
C:
#pragma omp task depend(type : list)
Dependency Types
in:
Task depends on all previously generated sibling tasks with an out or inout dependency on one or more items in the list
out, inout:
Task depends on all previously generated sibling tasks with an in, out, or inout dependency on one or more items in the list
List Format
The list contains variables, which may include array sections.
Example: Task Dependencies
#pragma omp task depend(out: a)
task_function_1(&a);
#pragma omp task depend(in: a)
task_function_2(a);
#pragma omp task depend(in: a)
task_function_3(a);
Execution Order
Wait for task_function_1 to finish (it writes a)
Then execute task_function_2 and task_function_3 in any order on any thread (both read a)
Benefits
Explicit control over task ordering
Runtime can optimize scheduling within dependency constraints
More flexible than barriers
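A self-contained version of the example above, with hypothetical task functions filled in:

#include <stdio.h>
#include <omp.h>

void task_function_1(int *a) { *a = 42; }
void task_function_2(int a)  { printf("2 sees %d\n", a); }
void task_function_3(int a)  { printf("3 sees %d\n", a); }

int main(void)
{
  int a = 0;
  #pragma omp parallel shared(a) default(none)
  #pragma omp single
  {
    #pragma omp task depend(out: a) shared(a) default(none)
    task_function_1(&a);
    #pragma omp task depend(in: a) shared(a) default(none)
    task_function_2(a);
    #pragma omp task depend(in: a) shared(a) default(none)
    task_function_3(a);
  }   /* implicit barrier at end single waits for all three tasks */
  return 0;
}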
taskloop Construct (OpenMP 4.5)
The taskloop construct distributes loop iterations onto tasks.
Similarity
Similar to the loop constructs (omp for / omp do), but creates tasks instead of directly distributing iterations.
Default Behavior
By default, taskloop implies a taskgroup.
taskloop: Basic Syntax
Fortran
!$OMP taskloop default(none) shared(…) private(…)
do i = 1, N
...
enddo
C
#pragma omp taskloop default(none) shared(…) private(…)
for (i = 0; i < N; i++)
{
...
}
Clauses for taskloop
Standard Clauses
Clauses introduced previously that also work with taskloop:
if(scalar-expr)
shared
private
firstprivate
lastprivate
default
collapse
final(scalar-expr)
Important Differences
Warning
No reduction clause for taskloop
Use the nogroup clause to remove the implied taskgroup
Additional Construct
There is also a taskloop simd construct for vectorization.
Controlling Number of Tasks Created
Control the granularity of task creation to balance parallelism and overhead.
Clauses
Use grainsize or num_tasks (only one of the two is allowed):
grainsize Clause
!$omp taskloop grainsize(100)
Controls number of loop iterations per task:
Each task gets between grainsize and 2*grainsize iterations
num_tasks Clause
!$omp taskloop num_tasks(10)
Specifies the number of tasks to create.
Additional Restrictions
Final number of tasks may be affected by iteration count and other factors.
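A minimal C sketch combining taskloop with grainsize (the array size and contents are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
  static double a[N];

  #pragma omp parallel shared(a) default(none)
  #pragma omp single                  /* one thread creates the tasks */
  {
    /* iterations are packaged into tasks of roughly 100..200 iterations */
    #pragma omp taskloop grainsize(100) shared(a) default(none)
    for (int i = 0; i < N; i++)
      a[i] = 2.0 * i;
    /* the implied taskgroup has already waited for all taskloop tasks here */
  }

  printf("a[N-1] = %f\n", a[N - 1]);
  return 0;
}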
Summary
Task Construct:
Flexible parallelism for irregular problems
Tasks can be created dynamically
Execution by any thread, now or later
Task Scheduling:
Task scheduling points
taskwait: wait for child tasks
taskgroup: wait for all descendants
taskyield: allow suspension
Task Completion:
Multiple mechanisms to control when tasks finish
Dependencies between tasks (OpenMP 4.0)
Performance Aspects:
Task creation has overhead
Need decent amount of work per task
Use if and final clauses to control task generation
Balance between parallelism and overhead
Case Studies:
Recursive Fibonacci: demonstrated task basics
Self-refining integrator: demonstrated adaptive algorithms
Advanced Features:
taskloop (OpenMP 4.5): distribute loops onto tasks
Task dependencies: explicit ordering
Various accumulation strategies for results