Worksharing and Scheduling
Objectives
Loop construct
Workshare construct
Sections construct
Worksharing constructs allow easy distribution of work onto threads:
Loop Construct:
Easy distribution of loops onto threads
Avoiding load imbalance using the schedule clause
Workshare Construct (Fortran):
Parallelization of Fortran array syntax
Sections Construct:
Distributing independent code blocks
Modern alternative: task construct
Distributing Loops
Distributing large loops is a typical target for OpenMP parallelization.
Traditional Approach
As seen in previous lectures, loop distribution can be accomplished manually but requires management code:
Number of iterations per thread (may be unequal)
Starting index of current thread
Final index of current thread
OpenMP Loop Construct
OpenMP offers the “loop construct” to ease loop parallelization:
Convenience: reduces code complexity
Maintainability: cleaner, more readable code
The Loop Construct Features
Distributes the following loop (C/C++/Fortran) onto threads
Makes the iteration variable automatically private
Determines automatically (without management code):
Number of iterations per thread
Start index of current thread
Final index of current thread
Flushes registers to memory at exit, unless nowait is specified (see the sketch after this list)
Note: No flush on entry!
Offers mechanisms to balance the load for various situations
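As a minimal C sketch of the nowait behaviour mentioned in the list above (the arrays a and b and their sizes are illustrative, not from the course material): removing the implied barrier after the first loop is safe here only because the second loop does not touch anything the first loop writes.

#include <stdio.h>
#define N 1000

int main(void)
{
    double a[N], b[N];

    #pragma omp parallel shared(a, b)
    {
        #pragma omp for nowait      /* no implied barrier (and no flush) at the end of this loop */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        #pragma omp for             /* implied barrier and flush at the end of this loop */
        for (int i = 0; i < N; i++)
            b[i] = -1.0 * i;
    }

    printf("a[1] = %f, b[1] = %f\n", a[1], b[1]);
    return 0;
}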
Loop Construct in Fortran
Works on Fortran standard-compliant do-construct:
Not supported: do while
Not supported: do without loop control
!$omp parallel &
!$omp shared(…) &
!$omp private(…)
!$omp do
do i = 1, N
loop-body
end do
!$omp end parallel
Note
!$omp end do is optional (not required).
Example: Vector Norm - Manual Loop Management (Fortran)
norm = 0.0D0
!$omp parallel default(none) &
!$omp shared(vect, norm) private(myNum, i, lNorm)
lNorm = 0.0D0
myNum = vleng / omp_get_num_threads() ! local size
do i = 1 + myNum * omp_get_thread_num(), &
myNum * (1 + omp_get_thread_num())
lNorm = lNorm + vect(i) * vect(i)
enddo
!$omp atomic update
norm = norm + lNorm
!$omp end parallel
norm = sqrt(norm)
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Note
This version requires explicit calculation of loop bounds for each thread.
Example: Vector Norm - Loop Construct (Fortran)
norm = 0.0d0
!$omp parallel default(none) &
!$omp shared(vect, norm) private(i, lNorm)
lNorm = 0.0d0
!$omp do
do i = 1, vleng ! same as serial case
lNorm = lNorm + vect(i) * vect(i)
enddo
!$omp atomic update
norm = norm + lNorm
!$omp end parallel
norm = sqrt(norm)
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Important
The loop bounds are the same as in the serial case. OpenMP handles the distribution automatically.
Loop Construct in C
The loop construct in C is limited to “canonical” loops:
First Argument (Initialization):
Assignment to:
int
pointer
random-access-iterator type (C++)
Second Argument (Condition):
Comparison using: <=, <, >, >=
Third Argument (Increment):
i++, ++i, i--, --i
i += inc, i -= inc
i = i + inc, i = inc + i, i = i - inc
Additional Requirements:
All bounds and increments must be loop-invariant.
#pragma omp parallel \
shared(…) \
private(…)
{
#pragma omp for
for (i = 0; i < N; i++)
{
loop-body
}
}
Example: Vector Norm - Manual Loop Management (C)
norm = 0.0;
#pragma omp parallel default(none) \
shared(vect, norm) private(myNum, i, lNorm)
{
lNorm = 0.0;
myNum = vleng / omp_get_num_threads(); // local size
for (i = myNum * omp_get_thread_num();
i < myNum * (1 + omp_get_thread_num()); i++)
lNorm += vect[i] * vect[i];
#pragma omp atomic update
norm += lNorm;
}
norm = sqrt(norm);
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Note
This version requires explicit calculation of loop bounds for each thread.
Example: Vector Norm - Loop Construct (C)
norm = 0.0;
#pragma omp parallel default(none) \
shared(vect, norm) private(i, lNorm)
{
lNorm = 0.0;
#pragma omp for
for (i = 0; i < vleng; i++) // same as serial case
lNorm += vect[i] * vect[i];
#pragma omp atomic update
norm += lNorm;
}
norm = sqrt(norm);
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Important
The loop bounds are the same as in the serial case. OpenMP handles the distribution automatically.
Parallel Loop Construct
Parallel Loop Construct in Fortran
When a parallel region contains only a loop construct, you can use a shorthand:
!$omp parallel do
do i = 1, N
loop-body
enddo ! parallel region ends here!
Note
!$omp end parallel do is not required (optional)
Features of the parallel region and the normal loop construct apply similarly
Parallel Loop Construct in C
When a parallel region contains only a loop construct, you can use a shorthand:
#pragma omp parallel for
for (int i = 0; i < N; i++)
{
loop-body
} // parallel region & loop construct end here!
Note
Features of parallel region and normal loop construct apply similarly.
Loop Reordering and Data Dependency
Order of Execution
In a parallel loop, iterations are executed in a different order than in serial code (and possibly concurrently).
Data Dependency Requirement
A correct result is only obtained if the current iteration is independent of previous iterations (no data dependency).
Handling Data Dependencies
If data dependency exists:
Modify/change the algorithm
Serialize relevant part of the loop using special OpenMP features (covered later in course)
Execute loop serially
Example with Dependency
Problem (has dependency):
a[0] = 0;
for (i = 1; i < N; i++)
a[i] = a[i-1] + i;
Possible Fix (algorithm change):
for (i = 0; i < N; i++)
a[i] = 0.5 * i * (i + 1);
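For completeness, a hedged sketch of the serialization option listed above, using the ordered clause (one of the OpenMP features covered later in the course); the value of N is an illustrative choice. The result is correct, but the ordered region forces iterations to run one after another, so little performance is gained.

#include <stdio.h>
#define N 100

int main(void)
{
    int a[N];
    a[0] = 0;

    #pragma omp parallel for ordered
    for (int i = 1; i < N; i++) {
        #pragma omp ordered         /* this statement executes in sequential iteration order */
        a[i] = a[i - 1] + i;
    }

    printf("a[N-1] = %d\n", a[N - 1]);
    return 0;
}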
Warning
Always verify that loop iterations are independent before parallelizing!
Scheduling Loop Iterations
Work Per Loop Iteration
Previous examples assumed the same amount of work for each loop iteration. This is not always the case.
Examples of Uneven Work
Summing over triangular area:
for (i = 0; i < N; i++)
for (j = 0; j < i + 1; j++)
// work here
Loop body iterates until required accuracy is achieved
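A hedged sketch of the second case (the array x, the tolerance, and the Newton iteration are illustrative choices): the inner while loop runs until the required accuracy is reached, so the cost per iteration of i varies.

#include <math.h>
#include <stdio.h>
#define N 1000

int main(void)
{
    double x[N];
    const double tol = 1e-10;

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double r = 1.0;                          /* initial guess for sqrt(i + 1) */
        while (fabs(r * r - (i + 1)) > tol)
            r = 0.5 * (r + (i + 1) / r);         /* Newton step; iteration count varies with i */
        x[i] = r;
    }

    printf("x[N-1] = %f (expected %f)\n", x[N - 1], sqrt((double)N));
    return 0;
}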
Load Imbalance Problem
Uneven work distribution often causes load imbalance:
Some threads finish while others still work
Results in poor performance
Note
Dealing with such problems is typically easier in shared memory than in distributed memory programming.
Schedule Clause
To help load balance in a loop construct, use the schedule clause:
schedule(kind, [chunk_size])
Default schedule is implementation-dependent (OpenMP 3.0). Choices for kind:
static, dynamic, guided, auto, runtime
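A minimal C sketch of the clause attached to a loop construct (the kind dynamic, the chunk size 4, and the array a are arbitrary choices for this example):

#include <stdio.h>
#define N 64

int main(void)
{
    double a[N];

    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;                 /* iterations are handed out to threads in chunks of 4 */

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}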
Static Scheduling
Divide iteration count into chunks of equal size
Last chunk may be smaller if needed
Thread assignment uses “round robin” distribution
Default Chunk Size
The default chunk size is (approximately) the iteration count divided by the number of threads, i.e. one chunk per thread.
Performance
Static scheduling has the least overhead compared to other schedules.
Visual Representation
Default static schedule (≈n/4 per thread):
Thread 0: [===============]
Thread 1: [===============]
Thread 2: [===============]
Thread 3: [===============]
Static schedule with chunk size:
T0 T1 T2 T3 T0 T1 T2 T3 T0 T1 T2 ...
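A small sketch that makes the round-robin assignment visible by printing which thread executes which iteration (chunk size 3 and 12 iterations are illustrative; the output order is not deterministic, but the mapping is):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel for schedule(static, 3)
    for (int i = 0; i < 12; i++)
        printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());

    return 0;
}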
Example: Summation Over Triangular Area (Static)
!$omp parallel do &
!$omp private(i, j) shared(a) &
!$omp schedule(static, 100)
do j = 1, 1200
do i = j + 1, 1200
a(i,j) = func(i,j)
a(j,i) = -a(i,j)
enddo
enddo
Performance Comparison
Default static: maximum 7/16 of work area per thread
Static with chunk=100: maximum 5/16 of work area per thread
Trade-offs
Smaller chunks: better load balance
More chunks: larger overhead
Dynamic Scheduling
Loop is split into work packages of chunk_size iterations
Each thread requests a new work package once done with the current one
Default
chunk_size: 1 iteration
When to Use
Use dynamic scheduling when:
Work per iteration varies significantly
The pattern of work is unpredictable
Performance
Better load balance than static (for uneven work)
Higher overhead than static due to runtime work distribution
Example: Summation Over Triangular Area (Dynamic)
!$omp parallel do &
!$omp private(i, j) shared(a) &
!$omp schedule(dynamic, 100)
do j = 1, 1200
do i = j + 1, 1200
a(i,j) = func(i,j)
a(j,i) = -a(i,j)
enddo
enddo
Performance Comparison
Default static: maximum 7/16 of work area per thread
Dynamic with chunk=100: maximum ≈0.27 of work area per thread
Trade-offs
Better balance than static scheduling
Larger overhead than static scheduling
Guided Scheduling
Similar to dynamic, but with adaptive chunk sizes:
Threads request new work packages once done
Work package size is proportional to:
(number of unassigned iterations) / (number of threads)
Package size never smaller than chunk_size (unless it is the last package)
Default chunk_size = 1
Purpose
The idea is to prevent expensive work packages at the end of the loop.
Performance
Starts with large chunks (low overhead)
Gradually decreases chunk size (better balance toward the end)
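A hedged C version of the triangular-area example with guided scheduling; func() is a placeholder for the real workload, and the problem size and chunk size mirror the Fortran examples above.

#include <stdio.h>
#define N 1200

static double a[N][N];                                          /* shared result matrix */

static double func(int i, int j) { return (double)(i + j); }    /* placeholder workload */

int main(void)
{
    #pragma omp parallel for schedule(guided, 100)
    for (int j = 0; j < N; j++)
        for (int i = j + 1; i < N; i++) {
            a[i][j] = func(i, j);
            a[j][i] = -a[i][j];
        }

    printf("a[N-1][0] = %f\n", a[N - 1][0]);
    return 0;
}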
Schedules: Auto and Runtime
Auto Schedule
For auto, the implementation decides the scheduling strategy.
!$omp parallel do schedule(auto)
Runtime Schedule
For runtime, the schedule can be controlled at runtime:
Method 1: Using Function (OpenMP 3.0)
omp_set_schedule(omp_sched_static, 10);
Method 2: Using Environment Variable
Bash:
export OMP_SCHEDULE="guided,4"
C-shell:
setenv OMP_SCHEDULE "guided,4"
Warning
Do not specify chunk_size with auto or runtime in the directive itself.
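A minimal sketch of a loop that uses schedule(runtime): the actual schedule is taken from the run-time setting, which is initialized from OMP_SCHEDULE and can be changed with omp_set_schedule() as shown (the dynamic kind and chunk size 10 are illustrative choices).

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Change the run-time schedule from inside the program;
       equivalent to setting OMP_SCHEDULE="dynamic,10" before the run. */
    omp_set_schedule(omp_sched_dynamic, 10);

    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < 40; i++)
        printf("iteration %2d on thread %d\n", i, omp_get_thread_num());

    return 0;
}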
Multiple Loop Parallelization
Simple Example with Nested Loops
Consider this nested loop structure:
do j = 1, 3
do i = 1, 4
a(i,j) = expensiveFunc(i,j)
enddo
enddo
There are three basic options to parallelize nested loops. Which one is best depends on the specific situation.
Option 1: Distribute Outer Loop (Fortran)
!$omp parallel do
do j = 1, 3
do i = 1, 4
a(i,j) = expensiveFunc(i,j)
enddo
enddo
Distributes the j-loop
At most 3 work packages
When to Use
Use when the outer loop has sufficient iterations for good load balance.
Option 2: Distribute Inner Loop (Fortran)
!$omp parallel private(j)
do j = 1, 3
!$omp do
do i = 1, 4
a(i,j) = expensiveFunc(i,j)
enddo
!$omp end do
enddo
!$omp end parallel
Distributes the i-loop
Now four work packages
Opening the parallel region before the j-loop (rather than inside it) gives better performance
Requires i to be private (automatic for the loop variable)
Starts the loop construct 3 times
May cause more cache line conflicts when writing to a
Option 3: Collapse Clause (Fortran)
!$omp parallel
!$omp do collapse(2)
do j = 1, 3
do i = 1, 4
a(i,j) = expensiveFunc(i,j)
enddo
enddo
!$omp end parallel
Use the collapse clause to specify the number of loops to collapse
Available since OpenMP 3.0
Distributes both loops by creating a single combined loop
Schedules as specified (default in this case)
Now: 12 work packages
May cause more cache line conflicts when writing to a
Benefits
Maximum parallelism exposure
Best for cases where individual loops have few iterations
Option 1: Distribute Outer Loop (C)
#pragma omp parallel for
for (int i = 0; i < 3; i++)
{
for (int j = 0; j < 4; j++)
{
a[i][j] = expensiveFunc(i,j);
}
}
Distributes the i-loop
At most 3 work packages
When to Use
Use when the outer loop has sufficient iterations for good load balance.
Option 2: Distribute Inner Loop (C)
#pragma omp parallel
{
for (int i = 0; i < 3; i++)
{
#pragma omp for
for (int j = 0; j < 4; j++)
{
a[i][j] = expensiveFunc(i,j);
}
}
}
Distributes the j-loop
Now four work packages
Opening the parallel region before the i-loop (rather than inside it) gives better performance
Requires i to be private (automatic here, since it is declared inside the parallel region)
Starts the loop construct 3 times
May cause more cache line conflicts when writing to a
Option 3: Collapse Clause (C)
#pragma omp parallel
#pragma omp for collapse(2)
for (int i = 0; i < 3; i++)
{
for (int j = 0; j < 4; j++)
{
a[i][j] = expensiveFunc(i,j);
}
}
Use the collapse clause to specify the number of loops to collapse
Available since OpenMP 3.0
Distributes both loops by creating a single combined loop
Schedules as specified (default in this case)
Now: 12 work packages
May cause more cache line conflicts when writing to a
Benefits
Maximum parallelism exposure
Best for cases where individual loops have few iterations
Sections
Sections Construct
The sections construct allows parallelization when code blocks can be executed independently of each other, for example:
Initialization of multiple data structures
Different tasks executing different code
Considerations
Mismatch between blocks and threads:
Individual threads might execute multiple code blocks
Not every thread necessarily gets a code block
Danger of load imbalance:
Code blocks may have different amounts of work
Mismatch between number of blocks and number of threads
Real-World Application
Example from research: “Acceleration of Semiempirical QM/MM methods,” JCTC, 13, 3525-3536 (2017)
Example: Sections Construct (C)
#pragma omp parallel shared(a, b, N, M)
{
#pragma omp sections
{
#pragma omp section
{
for (int i = 0; i < N; i++)
a[i] = i;
}
#pragma omp section
{
for (int i = 0; i < M; i++)
b[i] = initBmatrix(i, M);
}
}
}
Alternative: Parallel Sections
#pragma omp parallel sections shared(a, b, N, M)
{
#pragma omp section
{
for (int i = 0; i < N; i++)
a[i] = i;
}
#pragma omp section
{
for (int i = 0; i < M; i++)
b[i] = initBmatrix(i, M);
}
}
Note
The parallel sections directive combines the parallel and sections directives for convenience.
Exercise
Parallelize a for loop with 20 iterations by evenly dividing the iterations among the available threads (manual loop management). Use a variable var1 to count the total number of iterations executed in the loop. Hint: an atomic operation can help to protect var1.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 6a-forworksharing-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{

    int var1, var2, myWork;
    int n = 20; // number of iterations
    var1 = 0;

#pragma omp parallel private(myWork, var2) shared(var1)
    {
#ifdef _OPENMP
        // The purpose of this code is to add 1 to var1 20 times
        var2 = 0;                            // private counter must be initialized
        myWork = n / omp_get_num_threads();  // iterations per thread

        for (int i = myWork * omp_get_thread_num();
             i < myWork * (1 + omp_get_thread_num()); i++)
            var2 += 1;

        #pragma omp atomic update
        var1 += var2;

#else
        printf("Serial code!\n");
#endif
    }

    printf("var1 = %i \n", var1);

    return 0;
}
Exercise
Parallelize the previous for loop code, but this time using the loop construct. Hint: an atomic operation can help to protect var1.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 6b-forworksharing-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{

    int var1;
    int n = 20; // number of iterations
    var1 = 0;

#pragma omp parallel
    {
#ifdef _OPENMP
        // The purpose of this code is to add 1 to var1 20 times
        #pragma omp for
        for (int i = 0; i < n; i++)
        {
            #pragma omp atomic update
            var1 += 1;
        }
#else
        printf("Serial code!\n");
#endif
    }

    printf("var1 = %i \n", var1);

    return 0;
}
Exercise
Use a critical region to protect a variable in a parallel loop that accumulates a sum over the iterations. Is this loop parallelization efficient? Why?
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 8-critical-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{

    int sum = 0;

#ifdef _OPENMP
    #pragma omp parallel for
    for (int i = 0; i < 10; i++) {
        #pragma omp critical
        {
            sum += i;
            printf("Thread %d added %d to sum, current sum = %d\n",
                   omp_get_thread_num(), i, sum);
        }
    }
#else
    printf("Serial code!\n");
#endif

    printf("Final sum = %d\n", sum);

    return 0;
}
Exercise
Create a sections construct in a parallel region containing two blocks of code, and print out the thread ID in each block.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 6c-sectionsworksharing-openmp.c -lm
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel num_threads(8) // We have a total of 8 threads
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                printf("Section 1 executed by thread %d\n", omp_get_thread_num());
            }
            #pragma omp section
            {
                printf("Section 2 executed by thread %d\n", omp_get_thread_num());
            }
        }
    }
    return 0;
}
Exercise
Create two nested parallel regions with 2 threads each and print out the thread IDs of both the inner and the outer threads.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 11-nested-openmp.c -lm
#include <stdio.h>
#include <omp.h>

int main() {
    // enabling/disabling nested parallelism
    omp_set_nested(1);

    #pragma omp parallel num_threads(2)
    {
        int outer_tid = omp_get_thread_num();
        printf("Outer thread %d in outer parallel region\n", outer_tid);

        #pragma omp parallel num_threads(2)
        {
            int inner_tid = omp_get_thread_num();
            printf("  Inner thread %d in inner parallel region but outer thread %d\n", inner_tid, outer_tid);
        }
    }

    return 0;
}
Summary
This guide covered the following OpenMP worksharing constructs:
OpenMP Loop Construct
Easy distribution of standard do/for loops
The schedule clause deals with many cases of load imbalance
Schedule types: static, dynamic, guided, auto, runtime
collapse clause for nested loops (OpenMP 3.0+)
OpenMP Workshare Construct (Fortran)
Distributes Fortran array syntax statements
Supports FORALL and WHERE
Handles user-defined ELEMENTAL functions
OpenMP Sections Construct
Distributes independent code blocks on different threads
Useful for heterogeneous parallel tasks
Consider load balance issues