Worksharing and Scheduling
Objectives
Loop construct
Workshare construct
Sections construct
Worksharing constructs allow easy distribution of work onto threads:
Loop Construct:
Easy distribution of loops onto threads
Avoiding load imbalance using the schedule clause
Workshare Construct (Fortran):
Parallelization of Fortran array syntax
Sections Construct:
Distributing independent code blocks
Modern alternative: task construct
Distributing Loops
Distributing large loops is a typical target for OpenMP parallelization.
Traditional Approach
As seen in previous lectures, loop distribution can be accomplished manually but requires management code:
Number of iterations per thread (may be unequal)
Starting index of current thread
Final index of current thread
OpenMP Loop Construct
OpenMP offers the “loop construct” to ease loop parallelization:
Convenience: reduces code complexity
Maintainability: cleaner, more readable code
The Loop Construct Features
Distributes the following loop (C/C++/Fortran) onto threads
Makes the iteration variable automatically private
Determines automatically (without management code):
Number of iterations per thread
Start index of current thread
Final index of current thread
Flushes registers to memory at exit, unless nowait is specified (see the sketch after this list)
Note: No flush on entry!
Offers mechanisms to balance the load for various situations
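As a minimal C sketch of the nowait behaviour mentioned in the list above (the arrays a and b and their sizes are illustrative, not from the course material): removing the implied barrier after the first loop is safe here only because the second loop does not touch anything the first loop writes.

#include <stdio.h>
#define N 1000

int main(void)
{
    double a[N], b[N];

    #pragma omp parallel shared(a, b)
    {
        #pragma omp for nowait      /* no implied barrier (and no flush) at the end of this loop */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        #pragma omp for             /* implied barrier and flush at the end of this loop */
        for (int i = 0; i < N; i++)
            b[i] = -1.0 * i;
    }

    printf("a[1] = %f, b[1] = %f\n", a[1], b[1]);
    return 0;
}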
Loop Construct in Fortran
Works on Fortran standard-compliant do-construct:
Not supported: do while
Not supported: do without loop control
!$omp parallel &
!$omp shared(…) &
!$omp private(…)
!$omp do
do i = 1, N
loop-body
end do
!$omp end parallel
Note
!$omp end do is optional (not required).
Example: Vector Norm - Manual Loop Management (Fortran)
norm = 0.0D0
!$omp parallel default(none) &
!$omp shared(vect, norm) private(myNum, i, lNorm)
lNorm = 0.0D0
myNum = vleng / omp_get_num_threads() ! local size
do i = 1 + myNum * omp_get_thread_num(), &
myNum * (1 + omp_get_thread_num())
lNorm = lNorm + vect(i) * vect(i)
enddo
!$omp atomic update
norm = norm + lNorm
!$omp end parallel
norm = sqrt(norm)
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Note
This version requires explicit calculation of loop bounds for each thread.
Example: Vector Norm - Loop Construct (Fortran)
norm = 0.0d0
!$omp parallel default(none) &
!$omp shared(vect, norm) private(i, lNorm)
lNorm = 0.0d0
!$omp do
do i = 1, vleng ! same as serial case
lNorm = lNorm + vect(i) * vect(i)
enddo
!$omp atomic update
norm = norm + lNorm
!$omp end parallel
norm = sqrt(norm)
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Important
The loop bounds are the same as in the serial case. OpenMP handles the distribution automatically.
Loop Construct in C
The loop construct in C is limited to “canonical” loops:
First Argument (Initialization):
Assignment to:
int
pointer
random-access-iterator type (C++)
Second Argument (Condition):
Comparison using: <=, <, >, >=
Third Argument (Increment):
i++, ++i, i--, --i
i += inc, i -= inc
i = i + inc, i = inc + i, i = i - inc
Additional Requirements:
All bounds and increments must be loop-invariant.
#pragma omp parallel \
shared(…) \
private(…)
{
#pragma omp for
for (i = 0; i < N; i++)
{
loop-body
}
}
Example: Vector Norm - Manual Loop Management (C)
norm = 0.0;
#pragma omp parallel default(none) \
shared(vect, norm) private(myNum, i, lNorm)
{
lNorm = 0.0;
myNum = vleng / omp_get_num_threads(); // local size
for (i = myNum * omp_get_thread_num();
i < myNum * (1 + omp_get_thread_num()); i++)
lNorm += vect[i] * vect[i];
#pragma omp atomic update
norm += lNorm;
}
norm = sqrt(norm);
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Note
This version requires explicit calculation of loop bounds for each thread.
Example: Vector Norm - Loop Construct (C)
norm = 0.0;
#pragma omp parallel default(none) \
shared(vect, norm) private(i, lNorm)
{
lNorm = 0.0;
#pragma omp for
for (i = 0; i < vleng; i++) // same as serial case
lNorm += vect[i] * vect[i];
#pragma omp atomic update
norm += lNorm;
}
norm = sqrt(norm);
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Important
The loop bounds are the same as in the serial case. OpenMP handles the distribution automatically.
Parallel Loop Construct
Parallel Loop Construct in Fortran
When a parallel region contains only a loop construct, you can use a shorthand:
!$omp parallel do
do i = 1, N
loop-body
enddo ! parallel region ends here!
Note
!$omp end parallel do is not required (optional)
Features of the parallel region and the normal loop construct apply similarly
Parallel Loop Construct in C
When a parallel region contains only a loop construct, you can use a shorthand:
#pragma omp parallel for
for (int i = 0; i < N; i++)
{
loop-body
} // parallel region & loop construct end here!
Note
Features of parallel region and normal loop construct apply similarly.
Loop Reordering and Data Dependency
Order of Execution
In a parallel loop, iterations are executed in a different order than in serial code (and possibly concurrently).
Data Dependency Requirement
A correct result is only obtained if the current iteration is independent of previous iterations (no data dependency).
Handling Data Dependencies
If data dependency exists:
Modify/change the algorithm
Serialize relevant part of the loop using special OpenMP features (covered later in course)
Execute loop serially
Example with Dependency
Problem (has dependency):
a[0] = 0;
for (i = 1; i < N; i++)
a[i] = a[i-1] + i;
Possible Fix (algorithm change):
for (i = 0; i < N; i++)
a[i] = 0.5 * i * (i + 1);
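For completeness, a hedged sketch of the serialization option listed above, using the ordered clause (one of the OpenMP features covered later in the course); the value of N is an illustrative choice. The result is correct, but the ordered region forces iterations to run one after another, so little performance is gained.

#include <stdio.h>
#define N 100

int main(void)
{
    int a[N];
    a[0] = 0;

    #pragma omp parallel for ordered
    for (int i = 1; i < N; i++) {
        #pragma omp ordered         /* this statement executes in sequential iteration order */
        a[i] = a[i - 1] + i;
    }

    printf("a[N-1] = %d\n", a[N - 1]);
    return 0;
}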
Warning
Always verify that loop iterations are independent before parallelizing!
Scheduling Loop Iterations
Work Per Loop Iteration
Previous examples assumed the same amount of work for each loop iteration. This is not always the case.
Examples of Uneven Work
Summing over triangular area:
for (i = 0; i < N; i++)
for (j = 0; j < i + 1; j++)
// work here
Loop body iterates until required accuracy is achieved
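A hedged sketch of the second case (the array x, the tolerance, and the Newton iteration are illustrative choices): the inner while loop runs until the required accuracy is reached, so the cost per iteration of i varies.

#include <math.h>
#include <stdio.h>
#define N 1000

int main(void)
{
    double x[N];
    const double tol = 1e-10;

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double r = 1.0;                          /* initial guess for sqrt(i + 1) */
        while (fabs(r * r - (i + 1)) > tol)
            r = 0.5 * (r + (i + 1) / r);         /* Newton step; iteration count varies with i */
        x[i] = r;
    }

    printf("x[N-1] = %f (expected %f)\n", x[N - 1], sqrt((double)N));
    return 0;
}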
Load Imbalance Problem
Uneven work distribution often causes load imbalance:
Some threads finish while others still work
Results in poor performance
Note
Dealing with such problems is typically easier in shared memory than in distributed memory programming.
Schedule Clause
To help load balance in a loop construct, use the schedule clause:
schedule(kind, [chunk_size])
Default schedule is implementation-dependent (OpenMP 3.0). Choices for kind:
static, dynamic, guided, auto, runtime
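A minimal C sketch of the clause attached to a loop construct (the kind dynamic, the chunk size 4, and the array a are arbitrary choices for this example):

#include <stdio.h>
#define N 64

int main(void)
{
    double a[N];

    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;                 /* iterations are handed out to threads in chunks of 4 */

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}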
Static Scheduling
Divide iteration count into chunks of equal size
Last chunk may be smaller if needed
Thread assignment uses “round robin” distribution
Default Chunk Size
The default chunk size is (approximately) the iteration count divided by the number of threads, i.e. one chunk per thread.
Performance
Static scheduling has the least overhead compared to other schedules.
Visual Representation
Default static schedule (≈n/4 per thread):
Thread 0: [===============]
Thread 1: [===============]
Thread 2: [===============]
Thread 3: [===============]
Static schedule with chunk size:
T0 T1 T2 T3 T0 T1 T2 T3 T0 T1 T2 ...
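A small sketch that makes the round-robin assignment visible by printing which thread executes which iteration (chunk size 3 and 12 iterations are illustrative; the output order is not deterministic, but the mapping is):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel for schedule(static, 3)
    for (int i = 0; i < 12; i++)
        printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());

    return 0;
}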
Example: Summation Over Triangular Area (Static)
!$omp parallel do &
!$omp private(i, j) shared(a) &
!$omp schedule(static, 100)
do j = 1, 1200
do i = j + 1, 1200
a(i,j) = func(i,j)
a(j,i) = -a(i,j)
enddo
enddo
Performance Comparison
Default static: maximum 7/16 of work area per thread
Static with chunk=100: maximum 5/16 of work area per thread
Trade-offs
Smaller chunks: better load balance
More chunks: larger overhead
Dynamic Scheduling
Loop is split into work packages of chunk_size iterations
Each thread requests a new work package once done with the current one
Default
chunk_size: 1 iteration
When to Use
Use dynamic scheduling when:
Work per iteration varies significantly
The pattern of work is unpredictable
Performance
Better load balance than static (for uneven work)
Higher overhead than static due to runtime work distribution
Example: Summation Over Triangular Area (Dynamic)
!$omp parallel do &
!$omp private(i, j) shared(a) &
!$omp schedule(dynamic, 100)
do j = 1, 1200
do i = j + 1, 1200
a(i,j) = func(i,j)
a(j,i) = -a(i,j)
enddo
enddo
Performance Comparison
Default static: maximum 7/16 of work area per thread
Dynamic with chunk=100: maximum ≈0.27 of work area per thread
Trade-offs
Better balance than static scheduling
Larger overhead than static scheduling
Guided Scheduling
Similar to dynamic, but with adaptive chunk sizes:
Threads request new work packages once done
Work package size is proportional to:
(number of unassigned iterations) / (number of threads)
Package size never smaller than chunk_size (unless it is the last package)
Default chunk_size = 1
Purpose
The idea is to prevent expensive work packages at the end of the loop.
Performance
Starts with large chunks (low overhead)
Gradually decreases chunk size (better balance toward the end)
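A hedged C version of the triangular-area example with guided scheduling; func() is a placeholder for the real workload, and the problem size and chunk size mirror the Fortran examples above.

#include <stdio.h>
#define N 1200

static double a[N][N];                                          /* shared result matrix */

static double func(int i, int j) { return (double)(i + j); }    /* placeholder workload */

int main(void)
{
    #pragma omp parallel for schedule(guided, 100)
    for (int j = 0; j < N; j++)
        for (int i = j + 1; i < N; i++) {
            a[i][j] = func(i, j);
            a[j][i] = -a[i][j];
        }

    printf("a[N-1][0] = %f\n", a[N - 1][0]);
    return 0;
}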
Schedules: Auto and Runtime
Auto Schedule
For auto, the implementation decides the scheduling strategy.
!$omp parallel do schedule(auto)
Runtime Schedule
For runtime, the schedule can be controlled at runtime:
Method 1: Using Function (OpenMP 3.0)
omp_set_schedule(omp_sched_static, 10);
Method 2: Using Environment Variable
Bash:
export OMP_SCHEDULE="guided,4"
C-shell:
setenv OMP_SCHEDULE "guided,4"
Warning
Do not specify chunk_size with auto or runtime in the directive itself.
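A minimal sketch of a loop that uses schedule(runtime): the actual schedule is taken from the run-time setting, which is initialized from OMP_SCHEDULE and can be changed with omp_set_schedule() as shown (the dynamic kind and chunk size 10 are illustrative choices).

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Change the run-time schedule from inside the program;
       equivalent to setting OMP_SCHEDULE="dynamic,10" before the run. */
    omp_set_schedule(omp_sched_dynamic, 10);

    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < 40; i++)
        printf("iteration %2d on thread %d\n", i, omp_get_thread_num());

    return 0;
}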
Multiple Loop Parallelization
Simple Example with Nested Loops
Consider this nested loop structure:
do j = 1, 3
do i = 1, 4
a(i,j) = expensiveFunc(i,j)
enddo
enddo
There are three basic options to parallelize nested loops. Which one is best depends on the specific situation.
Option 1: Distribute Outer Loop (Fortran)
!$omp parallel do
do j = 1, 3
do i = 1, 4
a(i,j) = expensiveFunc(i,j)
enddo
enddo
Distributes the j-loop
At most 3 work packages
When to Use
Use when the outer loop has sufficient iterations for good load balance.
Option 2: Distribute Inner Loop (Fortran)
!$omp parallel private(j)
do j = 1, 3
!$omp do
do i = 1, 4
a(i,j) = expensiveFunc(i,j)
enddo
!$omp end do
enddo
!$omp end parallel
Distributes the i-loop
Now four work packages
Opening the parallel region before the j-loop (rather than inside it) gives better performance
Requires i to be private (automatic for the loop variable)
Starts the loop construct 3 times
May cause more cache line conflicts when writing to a
Option 3: Collapse Clause (Fortran)
!$omp parallel
!$omp do collapse(2)
do j = 1, 3
do i = 1, 4
a(i,j) = expensiveFunc(i,j)
enddo
enddo
!$omp end parallel
Use the collapse clause to specify the number of loops to collapse
Available since OpenMP 3.0
Distributes both loops by creating a single combined loop
Schedules as specified (default in this case)
Now: 12 work packages
May cause more cache line conflicts when writing to a
Benefits
Maximum parallelism exposure
Best for cases where individual loops have few iterations
Option 1: Distribute Outer Loop (C)
#pragma omp parallel for
for (int i = 0; i < 3; i++)
{
for (int j = 0; j < 4; j++)
{
a[i][j] = expensiveFunc(i,j);
}
}
Distributes the i-loop
At most 3 work packages
When to Use
Use when the outer loop has sufficient iterations for good load balance.
Option 2: Distribute Inner Loop (C)
#pragma omp parallel
{
for (int i = 0; i < 3; i++)
{
#pragma omp for
for (int j = 0; j < 4; j++)
{
a[i][j] = expensiveFunc(i,j);
}
}
}
Distributes the j-loop
Now four work packages
Opening the parallel region before the i-loop (rather than inside it) gives better performance
Requires i to be private (automatic here, since it is declared inside the parallel region)
Starts the loop construct 3 times
May cause more cache line conflicts when writing to a
Option 3: Collapse Clause (C)
#pragma omp parallel
#pragma omp for collapse(2)
for (int i = 0; i < 3; i++)
{
for (int j = 0; j < 4; j++)
{
a[i][j] = expensiveFunc(i,j);
}
}
Use the collapse clause to specify the number of loops to collapse
Available since OpenMP 3.0
Distributes both loops by creating a single combined loop
Schedules as specified (default in this case)
Now: 12 work packages
May cause more cache line conflicts when writing to a
Benefits
Maximum parallelism exposure
Best for cases where individual loops have few iterations
Sections
Sections Construct
The sections construct allows parallelization when code blocks can be executed independently of each other, for example:
Initialization of multiple data structures
Different tasks executing different code
Considerations
Mismatch between blocks and threads:
Individual threads might execute multiple code blocks
Not every thread necessarily gets a code block
Danger of load imbalance:
Code blocks may have different amounts of work
Mismatch between number of blocks and number of threads
Real-World Application
Example from research: “Acceleration of Semiempirical QM/MM methods,” JCTC, 13, 3525-3536 (2017)
Example: Sections Construct (C)
#pragma omp parallel shared(a, b, N, M)
{
#pragma omp sections
{
#pragma omp section
{
for (int i = 0; i < N; i++)
a[i] = i;
}
#pragma omp section
{
for (int i = 0; i < M; i++)
b[i] = initBmatrix(i, M);
}
}
}
Alternative: Parallel Sections
#pragma omp parallel sections shared(a, b, N, M)
{
#pragma omp section
{
for (int i = 0; i < N; i++)
a[i] = i;
}
#pragma omp section
{
for (int i = 0; i < M; i++)
b[i] = initBmatrix(i, M);
}
}
Note
The parallel sections directive combines the parallel and sections directives for convenience.
Exercise
Parallelize a for loop with 20 iterations by evenly dividing the iterations among the available threads (manual loop management). Use a variable var1 to count the total number of iterations executed in the loop. Hint: an atomic operation can help to protect var1.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 6a-forworksharing-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{

    int var1, var2, myWork;
    int n = 20; // number of iterations
    var1 = 0;

#pragma omp parallel private(myWork, var2) shared(var1)
    {
#ifdef _OPENMP
        // The purpose of this code is to add 1 to var1 20 times
        var2 = 0;                            // private counter must be initialized
        myWork = n / omp_get_num_threads();  // iterations per thread

        for (int i = myWork * omp_get_thread_num();
             i < myWork * (1 + omp_get_thread_num()); i++)
            var2 += 1;

        #pragma omp atomic update
        var1 += var2;

#else
        printf("Serial code!\n");
#endif
    }

    printf("var1 = %i \n", var1);

    return 0;
}
Exercise
Parallelize the previous for loop code, but this time using the loop construct. Hint: an atomic operation can help to protect var1.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 6b-forworksharing-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{

    int var1;
    int n = 20; // number of iterations
    var1 = 0;

#pragma omp parallel
    {
#ifdef _OPENMP
        // The purpose of this code is to add 1 to var1 20 times
        #pragma omp for
        for (int i = 0; i < n; i++)
        {
            #pragma omp atomic update
            var1 += 1;
        }
#else
        printf("Serial code!\n");
#endif
    }

    printf("var1 = %i \n", var1);

    return 0;
}
Exercise
Use a critical region to protect a variable in a parallel loop that accumulates a sum over the iterations. Is this loop parallelization efficient? Why?
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 8-critical-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{

    int sum = 0;

#ifdef _OPENMP
    #pragma omp parallel for
    for (int i = 0; i < 10; i++) {
        #pragma omp critical
        {
            sum += i;
            printf("Thread %d added %d to sum, current sum = %d\n",
                   omp_get_thread_num(), i, sum);
        }
    }
#else
    printf("Serial code!\n");
#endif

    printf("Final sum = %d\n", sum);

    return 0;
}
Exercise
Create a sections construct in a parallel region containing two blocks of code, and print out the thread ID in each block.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 6c-sectionsworksharing-openmp.c -lm
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel num_threads(8) // We have a total of 8 threads
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                printf("Section 1 executed by thread %d\n", omp_get_thread_num());
            }
            #pragma omp section
            {
                printf("Section 2 executed by thread %d\n", omp_get_thread_num());
            }
        }
    }
    return 0;
}
Exercise
Create two nested parallel regions with 2 threads each and print out the thread IDs of both the inner and the outer threads.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 11-nested-openmp.c -lm
#include <stdio.h>
#include <omp.h>

int main() {
    // enabling/disabling nested parallelism
    omp_set_nested(1);

    #pragma omp parallel num_threads(2)
    {
        int outer_tid = omp_get_thread_num();
        printf("Outer thread %d in outer parallel region\n", outer_tid);

        #pragma omp parallel num_threads(2)
        {
            int inner_tid = omp_get_thread_num();
            printf("  Inner thread %d in inner parallel region but outer thread %d\n", inner_tid, outer_tid);
        }
    }

    return 0;
}
Summary
This guide covered the following OpenMP worksharing constructs:
OpenMP Loop Construct
Easy distribution of standard do/for loops
The schedule clause deals with many cases of load imbalance
Schedule types: static, dynamic, guided, auto, runtime
collapse clause for nested loops (OpenMP 3.0+)
OpenMP Workshare Construct (Fortran)
Distributes Fortran array syntax statements
Supports FORALL and WHERE
Handles user-defined ELEMENTAL functions
OpenMP Sections Construct
Distributes independent code blocks on different threads
Useful for heterogeneous parallel tasks
Consider load balance issues