More on Worksharing

Work for Single Threads

Single Construct

The single construct is a worksharing construct placed inside a parallel region.

As the name suggests, a single thread executes the region.

  • Not specified which thread executes the region

  • Other threads wait at an implicit barrier at the end

  • Useful for operations that should be done only once

Use Cases

Guard when writing to shared variables:

  • Enforcing a single write

Guard I/O operations:

  • Writing to stdout or file (single write)

  • Reading from stdin or file (data read once)

Starting tasks:

  • Task creation (covered later in course; brief preview below)
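
As a brief preview (tasks are covered in detail later in the course), a hedged C sketch of the pattern, one thread creates tasks while all threads execute them; the loop bound of 8 is an illustrative value:

#pragma omp parallel
{
    #pragma omp single
    {
        // One thread creates the tasks ...
        for (int i = 0; i < 8; i++)
        {
            #pragma omp task firstprivate(i)
            printf("task %i run by thread %i\n", i, omp_get_thread_num());
        }
    }  // ... and at the implied barrier all threads help execute them
}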

Example: single Construct (Fortran)

!$omp parallel shared(a, b, n) private(i)
    !$omp single
    a = omp_get_num_threads()
    !$omp end single  ! implied barrier, required!

    !$omp do
    do i = 1, n
        b(i) = a
    enddo
!$omp end parallel

Note

The barrier after single ensures that a is set before any thread uses it in the loop.

Example: single Construct (C)

#pragma omp parallel shared(a, b, n) private(i)
{
    #pragma omp single
    {
        a = omp_get_num_threads();
    }  // implied barrier, required!

    #pragma omp for
    for (i = 0; i < n; i++)
        b[i] = a;
}

Note

The barrier after single ensures that a is set before any thread uses it in the loop.

Master Construct

Similar to single, but with important differences. (Note: OpenMP 5.1 deprecates master in favor of the equivalent masked construct; the behavior described here is unchanged.)

Key Differences from single

Execution:

  • Work is always done on the master thread (thread 0)

  • Deterministic behavior

Synchronization:

  • No implied barrier/synchronization

  • More lightweight than single if barrier is not needed

When to Use

Use master when:

  • You specifically need thread 0 to do the work

  • You don’t need synchronization afterward

  • Performance is critical and barrier overhead should be avoided
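
A minimal C sketch of these differences, assuming a shared progress flag named stage (illustrative only):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int stage = 0;  // shared progress flag (illustrative)

    #pragma omp parallel shared(stage)
    {
        #pragma omp master
        {
            stage = 1;  // always executed by thread 0
        }
        // No implied barrier: other threads may pass this point before
        // the master is done. If they must see stage == 1, an explicit
        // barrier is required:
        #pragma omp barrier

        printf("Thread %i sees stage = %i\n",
               omp_get_thread_num(), stage);
    }
    return 0;
}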

Ordered Construct

Execute part of a loop body in sequential order.

Warning

Significant performance penalty! Requires enough other parallel work to amortize the overhead.

How It Works

  1. Thread working on first iteration enters the ordered region, others wait

  2. When done, thread for second iteration enters

  3. And so on, in sequential order

Requirements

  • ordered clause must also be specified on the loop construct (omp for/omp do)

  • At most one ordered region may be executed per loop iteration

Use Cases

  • Ordered printing from parallel loops

  • Debugging (e.g., data races)

Example: Ordered Construct

#pragma omp parallel default(none) shared(b)
{
    #pragma omp for ordered schedule(dynamic, 1)
    for (int i = 0; i < PSIZE; i++)
    {
        b[i] = expensiveFunction(i);

        #pragma omp ordered
        printf("b[%3i] = %4i\n", i, b[i]);
    }
}
  • The computation expensiveFunction(i) happens in parallel

  • The printf statements execute in sequential order (i=0, 1, 2, …)

  • This ensures ordered output despite parallel execution

Clauses for Parallel Construct

if Clause

The if clause can be specified on the parallel construct.

If the condition evaluates to false:

  • No parallel region is started

  • Code executes serially

  • Useful for runtime evaluation (e.g., loop count too small to benefit from parallelization)

Syntax

!$omp parallel if (condition)
#pragma omp parallel if (condition)

Example: if Clause (Fortran)

integer :: n = 20

!$omp parallel if (n > 5) shared(n)
    !$omp single
    print *, "The n is: ", n
    !$omp end single

    print *, "Hello, I am thread", &
             omp_get_thread_num(), " of", &
             omp_get_num_threads()
!$omp end parallel
  • If n > 5: parallel region with multiple threads

  • If n <= 5: serial execution with single thread

Example: if Clause (C)

int n = 20;

#pragma omp parallel if (n > 5) shared(n)
{
    #pragma omp single
    printf("The n is %i\n", n);

    printf("Hello, I am thread %i of %i\n",
           omp_get_thread_num(),
           omp_get_num_threads());
}
  • If n > 5: parallel region with multiple threads

  • If n <= 5: serial execution with single thread

Clause: num_threads

The num_threads clause specifies the number of threads to start in a parallel region.

Syntax

C:

int nthread = 3;
#pragma omp parallel num_threads(nthread)

Fortran:

integer :: nthread = 3
!$omp parallel num_threads(nthread)

Note

This overrides the default thread count (e.g., set via OMP_NUM_THREADS or omp_set_num_threads) for this specific parallel region.
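
A small, self-contained C sketch; the default of 8 threads and the override of 3 are illustrative values:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(8);  // default for subsequent parallel regions

    #pragma omp parallel num_threads(3)  // overrides the default above
    {
        #pragma omp single
        printf("This region runs with %i threads\n",
               omp_get_num_threads());
    }
    return 0;
}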

Keeping Memory Consistent

OpenMP: Relaxed Memory Model

OpenMP uses a relaxed memory model for performance.

Threads are allowed to have their “own temporary view” of memory:

  • Not required to be consistent with main memory

  • Data may be in registers or cache, invisible to other threads

Programmer Responsibility

Important

Whether the temporary view actually diverges from main memory is a “may be” that depends on the hardware, but for portable code the programmer must assume that it does.

Scope for Data Races

Without proper synchronization:

  • Memory modified by other threads may not be in temporary view

  • Own changes may not be visible to other threads

Ensuring Memory Consistency: flush

Use flush to ensure memory consistency across threads.

What flush Does

Writes modifications to memory:

  • Modifications in temporary view are written to memory system

  • Guaranteed to be visible to other threads

Discards temporary view:

  • Temporary view gets discarded

  • Next access needs to read from memory subsystem

  • Ensures modifications from other threads are “known”

Prevents reordering:

  • No reordering of memory access and flush

Example: Without flush (Problem)

integer :: i
integer, dimension(4) :: b
b = (/ 3, 4, 5, 6 /)

!$OMP parallel &
!$OMP shared(b), private(i)
    i = omp_get_thread_num() + 1
    b(i) = b(i) + i
    b(i+1) = b(i+1) + 1
!$OMP end parallel

Memory Behavior (3 threads)

Initial:     [3, 4, 5, 6]

Thread 0: i=1
  b(1) = 3 + 1 = 4
  b(2) = 4 + 1 = 5    (but may read stale value!)

Thread 1: i=2
  b(2) = 4 + 2 = 6    (conflict!)
  b(3) = 5 + 1 = 6

Thread 2: i=3
  b(3) = 5 + 3 = 8    (conflict!)
  b(4) = 6 + 1 = 7

Result: [4, 6, 8, 7]  ← Not what we want!

Warning

Without synchronization, threads may read stale values and overwrite each other’s changes.

Example: With barrier (Solution)

integer :: i
integer, dimension(4) :: b
b = (/ 3, 4, 5, 6 /)

!$OMP parallel &
!$OMP shared(b), private(i)
    i = omp_get_thread_num() + 1
    b(i) = b(i) + i
    !$OMP barrier
    b(i+1) = b(i+1) + 1
!$OMP end parallel

Memory Behavior (3 threads)

Initial:     [3, 4, 5, 6]

Phase 1 (before barrier):
  Thread 0: b(1) = 4
  Thread 1: b(2) = 6
  Thread 2: b(3) = 8

Result after phase 1: [4, 6, 8, 6]

BARRIER (flush to memory)

Phase 2 (after barrier):
  Thread 0: b(2) = 6 + 1 = 7
  Thread 1: b(3) = 8 + 1 = 9
  Thread 2: b(4) = 6 + 1 = 7

Final result: [4, 7, 9, 7]  ← Correct!

Note

The barrier ensures all writes from phase 1 are visible before phase 2 begins.

Sequence Required for Data Visibility

For data to be visible on another thread, the following sequence is required:

  1. First thread writes to shared memory

  2. First thread flush - change goes into memory system

  3. Second thread flush - discard local temporary view

  4. Second thread reads - gets updated value from memory

Important Notes

Important

  • A flush doesn’t “push” data to other threads

  • Fixing data races typically also requires synchronization

  • Implied flushes are often sufficient

Explicit Flush

You can issue an explicit flush:

Fortran:

!$OMP flush

C:

#pragma omp flush
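
A hedged C sketch of the four-step visibility sequence from the previous section, combining explicit flushes with a hand-rolled ready flag (the names data and flag are illustrative; production code would usually rely on implied flushes or seq_cst atomics instead):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double data = 0.0;  // payload (illustrative)
    int    flag = 0;    // "data is ready" signal (illustrative)

    // Assumes at least two threads, so both sections run concurrently.
    #pragma omp parallel sections num_threads(2) shared(data, flag)
    {
        #pragma omp section
        {   // producer: steps 1 and 2
            data = 42.0;
            #pragma omp flush       // write the change to the memory system
            #pragma omp atomic write
            flag = 1;
        }
        #pragma omp section
        {   // consumer: steps 3 and 4
            int ready = 0;
            while (!ready) {        // spin until the producer signals;
                #pragma omp atomic read
                ready = flag;       // atomic keeps flag itself coherent
            }
            #pragma omp flush       // discard the stale temporary view
            printf("data = %f\n", data);  // reads the updated value
        }
    }
    return 0;
}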

Implicit Barriers and Data Flushes

OpenMP automatically performs barriers and flushes at specific points.

Constructs with Barrier and Flush

At barrier:

  • !$omp barrier / #pragma omp barrier (flush)

Start and end:

  • parallel region (barrier & flush)

  • critical region (flush)

  • ordered region (flush)

End only:

  • Loop constructs (for/do) (barrier & flush)

  • single (barrier & flush)

  • workshare (barrier & flush)

  • sections (barrier & flush)

Note

No barrier or flush at the start of loop, single, workshare, or sections!

Other Operations

  • Various locking operations (flush)

  • Start and end of atomic flush only the “protected” variable

    • Use seq_cst on atomic to include “global” flush

No barrier or flush associated with master construct!

Memory Reorder: Out-of-Order Execution

Problem Scenario

Consider this code:

...
A(5) = 3.0
!$omp atomic write
matrix_set = 1
...

Potential Problems

  1. No guarantee A(5) is in memory:

    • Value might still be in registers/cache

  2. No guarantee order is maintained:

    • Optimizing compiler might reorder:

    matrix_set = 1
    ...
    A(5) = 3.0
    

Warning

Another thread might see matrix_set = 1 but read an old value of A(5)!

Fix: Using flush to Prevent Reordering

...
A(5) = 3.0
!$omp flush
!$omp atomic write
matrix_set = 1
...

What the flush Does

  1. Ensures modified A is in memory:

    • All threads can see the updated value

  2. Prohibits reordering of memory accesses:

    • Compiler and hardware cannot move matrix_set = 1 before the flush

    • Guarantees A(5) is written before matrix_set is set
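
Alternatively, as noted under “Implicit Barriers and Data Flushes”, a seq_cst atomic includes the “global” flush, so in OpenMP 5.0 and later the explicit flush can be folded into the atomic. A hedged C sketch of the same pattern (a and matrix_set assumed shared):

a[5] = 3.0;
#pragma omp atomic write seq_cst  // seq_cst implies the flush and
matrix_set = 1;                   // forbids reordering with the write to a[5]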

Clause: nowait

Barriers have performance implications. The implied barrier of a construct may not be required for correctness.

Removing Barriers

Specifying nowait:

  • In C: on the construct itself

  • In Fortran: on the end construct directive

This suppresses the implied barrier (including flush).

When to Use

Use nowait when:

  • Threads don’t need to wait for each other

  • No data dependencies between constructs

  • You want to improve performance by allowing threads to continue immediately

Example: Tensor Product (C)

#pragma omp parallel shared(a, b, t, n, m)
{
    #pragma omp for nowait
    for (int i = 0; i < n; i++)
        a[i] = funcA(i);  // no barrier needed!

    #pragma omp for
    for (int j = 0; j < m; j++)
        b[j] = funcB(j);  // barrier needed!

    #pragma omp for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            t[i][j] = a[i] * b[j];  // reads all of b: loop 2's barrier is required!
}
  • First loop initializes a with nowait - threads can continue immediately

  • Second loop initializes b - implicit barrier ensures all threads finish before tensor product

  • Third loop uses both a and b - needs both to be complete

Example: Adding Vectors (Fortran)

!$omp parallel shared(a, b, t, n)
    !$omp do
    do i = 1, n
        a(i) = sin(real(i))
    enddo
    !$omp end do nowait  ! no barrier here!

    !$omp do
    do j = 1, n
        b(j) = cos(real(j))
    enddo
    !$omp end do  ! barrier here!

    !$omp do
    do i = 1, n
        t(i) = a(i) + b(i)
    enddo
!$omp end parallel

Note

Demo code: fusing these into a single loop would perform better.

  • First loop fills a - can proceed without waiting

  • Second loop fills b - implicit barrier before final loop

  • Third loop needs both a and b complete

Example: Adding Vectors (C)

#pragma omp parallel shared(a, b, t, n)
{
    #pragma omp for nowait
    for (int i = 0; i < n; i++)
        a[i] = sin((double)i);  // no barrier here!

    #pragma omp for
    for (int j = 0; j < n; j++)
        b[j] = cos((double)j);  // barrier needed!

    #pragma omp for
    for (int i = 0; i < n; i++)
        t[i] = a[i] + b[i];
}

Note

Demo code: fusing these into a single loop would perform better.

  • First loop fills a - can proceed without waiting

  • Second loop fills b - implicit barrier before final loop

  • Third loop needs both a and b complete

Performance Impact of nowait

Benchmark Setup

Hardware:

  • Dual socket, quad-core Intel Xeon E5520 (2.26 GHz)

Compilers tested:

  • PGI 10.9

  • GCC 4.4

  • Intel 12.0

Problem:

  • Vector addition example with n = 1000

  • Time measured in microseconds (μs)

  • Tested with 4, 6, and 8 threads

Results

Threads    Savings from nowait
-------    -------------------
4-8        0.6 - 1.3 μs

Performance Chart

Time (μs)
  40 ┤                                    ■ PGI wait
     │                                    □ PGI nowait
  35 ┤                                    ● GNU wait
     │                                    ○ GNU nowait
  30 ┤                                    ▲ Intel wait
     │                                    △ Intel nowait
  25 ┤     ■
     │     □     ■
  20 ┤     ●     □     ■
     │     ○     ●     □
  15 ┤     ▲     ○     ●
     │     △     ▲     ○
  10 ┤           △     ▲
     │                 △
   0 └─────┴─────┴─────┴─────
        4     6     8   Threads

Note

Even small savings (0.6-1.3 μs) can add up in frequently executed code.

Specialty of Static Schedule

When specifying a static schedule with:

  • Same iteration count

  • Same chunk size (or default)

  • Loops bound to same parallel region

Guarantee:

You can safely assume the same thread works on the same iteration in all loops.

Can use nowait even with data dependencies between loops!

Important

This only works with static scheduling. Other schedules don’t guarantee iteration-to-thread mapping.

Example: Static Schedule with Dependencies (Fortran)

!$omp parallel shared(a, b, t, n)
    !$omp do schedule(static)
    do i = 1, n
        a(i) = sin(real(i))
    enddo
    !$omp end do nowait  ! no barrier here!

    !$omp do schedule(static)
    do j = 1, n
        b(j) = cos(real(j))
    enddo
    !$omp end do nowait  ! no barrier here!

    !$omp do schedule(static)
    do i = 1, n
        t(i) = a(i) + b(i)
    enddo
    !$omp end do nowait  ! no barrier here!
!$omp end parallel

Important

The static schedule is crucial! Each thread processes the same indices in all three loops.

Example: Static Schedule with Dependencies (C)

#pragma omp parallel shared(a, b, t, n)
{
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < n; i++)
        a[i] = sin((double)i);  // no barrier here!

    #pragma omp for schedule(static) nowait
    for (int j = 0; j < n; j++)
        b[j] = cos((double)j);  // no barrier here!

    #pragma omp for schedule(static)
    for (int i = 0; i < n; i++)
        t[i] = a[i] + b[i];
}

Important

The static schedule is crucial! Each thread processes the same indices in all three loops.

Why This Works

With static scheduling:

  • Thread 0 always processes indices 0 to n/num_threads-1

  • Thread 1 always processes indices n/num_threads to 2*n/num_threads-1

  • And so on…

Each thread only reads values it wrote, so no race conditions occur!

Orphan Directives

“Orphan” directives are OpenMP directives that appear inside functions/subroutines called from within a parallel region, rather than directly inside the parallel region.

Thread Safety Assumption

Calling subroutines and functions inside a parallel region is legal, assuming thread safety.

What Can Be Orphaned

The called procedures may contain:

  • Worksharing constructs (for, do, sections)

  • Synchronization constructs (barrier, critical, etc.)

Example: Orphan Directive (C)

Main Function

#pragma omp parallel shared(v, vl) reduction(+:nm)
{
    vectorinit(v, vl);
    nm = vectornorm(v, vl);
}

Called Function with Orphan Directive

void vectorinit(double* vdata, int leng)
{
    #pragma omp for
    for (int i = 0; i < leng; i++)
    {
        vdata[i] = i;
    }
    return;
}
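
The main function also calls vectornorm, which the original does not show. A plausible companion, consistent with the reduction(+:nm) clause on the parallel region, might look like this hedged sketch; each thread returns its partial sum of squares, and the square root (if desired) would be taken after the region:

// Hypothetical companion function (not shown in the original)
double vectornorm(double* vdata, int leng)
{
    double partial = 0.0;  // private to each thread's call frame

    // Orphaned loop: binds to the enclosing parallel region. The implied
    // barrier of vectorinit's loop guarantees vdata is complete here.
    #pragma omp for
    for (int i = 0; i < leng; i++)
        partial += vdata[i] * vdata[i];

    // Each thread returns its own partial sum of squares;
    // reduction(+:nm) on the parallel region combines them.
    return partial;
}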

Note

The #pragma omp for directive is “orphaned” - it’s not directly inside the parallel region but binds to the active parallel region when called.

Example: Orphan Directive (Fortran)

Main Program

!$omp parallel shared(v, vl) reduction(+:nm)
    call vectorinit(v, vl)
    nm = vectornorm(v, vl)
!$omp end parallel

Subroutine with Orphan Directive

subroutine vectorinit(vdata, leng)
    double precision, dimension(leng) :: vdata
    integer :: leng, i

    !$omp do
    do i = 1, leng
        vdata(i) = i
    enddo
end subroutine vectorinit

Note

The !$omp do directive is “orphaned” - it’s not directly inside the parallel region but binds to the active parallel region when called.

Performance Impact of Orphaning

Benchmark Setup

Test: Vector initialization and norm calculation

Vector length: 40,000

Hardware: Xeon E5-2650 v3

Compilers: GCC 4.9.3, ICC 16.0

Configurations Tested

  1. parallel for in each function (no orphaning)

  2. Orphaned for in each function

  3. Orphaned for nowait in each function

Results

Time (ms)
0.06 ┤
     │                          ■ gcc: parallel for
0.05 ┤                          □ gcc: orphaned for
     │                          ○ gcc: orphaned for nowait
0.04 ┤  ■                       ● icc: parallel for
     │     ■                    ▲ icc: orphaned for
0.03 ┤        □                 △ icc: orphaned for nowait
     │        ■  □
0.02 ┤           ○  ■  □  ○
     │              ●  ▲  △
0.01 ┤
     │
   0 └─────┴─────┴─────┴─────┴─────
        2     4     6     8    10  Cores

Key Observations

  • Orphaned directives perform better than creating new parallel regions

  • Using nowait provides additional performance gains

  • Starting/closing parallel regions is very expensive

Discussion of Orphan Directives

Advantages:

Reduces need for code restructuring:

  • Can parallelize existing functions without major changes

Allows for longer parallel regions:

  • Starting/closing parallel regions is very expensive

  • One long parallel region is more efficient than many short ones

Better performance:

  • As shown in benchmarks, avoids parallel region overhead

Potential Issues:

Warning

Problem: Routine with orphan directive called outside parallel region

If a function with an orphaned directive is called from serial code, the directive may have no effect or cause unexpected behavior.

Best Practices:

  • Document functions that contain orphan directives

  • Consider adding checks for parallel context if needed (see the sketch below)

  • Design functions to work correctly both inside and outside parallel regions
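
A hedged sketch of such a parallel-context check, using the standard runtime query omp_in_parallel():

#include <omp.h>

void vectorinit(double* vdata, int leng)
{
    if (omp_in_parallel()) {
        // Inside a parallel region: the orphaned loop shares the work.
        #pragma omp for
        for (int i = 0; i < leng; i++)
            vdata[i] = i;
    } else {
        // Serial caller: a plain loop avoids any surprises.
        for (int i = 0; i < leng; i++)
            vdata[i] = i;
    }
}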

Summary

This guide covered advanced worksharing concepts in OpenMP:

Constructs

  • single construct: Execute code on one thread (with barrier)

  • master construct: Execute code on master thread (no barrier)

  • ordered construct: Execute loop iterations in sequential order

Clauses

  • if clause: Conditional parallelization

  • num_threads clause: Control thread count

  • nowait clause: Remove implicit barriers for performance

Memory Consistency

  • flush: Ensure memory consistency across threads

  • Implicit barriers and flushes: Automatic synchronization points

  • Memory reordering: Understanding and preventing issues

Advanced Techniques

  • Static schedule specialty: Using nowait with dependencies

  • Orphan directives: Worksharing constructs in called functions

Performance Considerations

  • Balance between synchronization overhead and correctness

  • Strategic use of nowait can improve performance

  • Orphan directives reduce parallel region overhead