More on Worksharing
Work for Single Threads
Single Construct
The single construct is a worksharing construct placed inside a parallel region.
As the name suggests, a single thread executes the region.
It is not specified which thread executes the region
Other threads wait at an implicit barrier at the end
Useful for operations that should be done only once
Use Cases
Guard when writing to shared variables:
Enforcing a single write
Guard I/O operations:
Writing to stdout or file (single write)
Reading from stdin or file (data read once)
Starting tasks:
Task creation (covered later in course)
Example: single Construct (Fortran)
!$omp parallel shared(a, b, n) private(i)
!$omp single
a = omp_get_num_threads()
!$omp end single ! implied barrier, required!
!$omp do
do i = 1, n
b(i) = a
enddo
!$omp end parallel
Note
The barrier after single ensures that a is set before any thread uses it in the loop.
Example: single Construct (C)
#pragma omp parallel shared(a, b, n) private(i)
{
#pragma omp single
{
a = omp_get_num_threads();
} // implied barrier, required!
#pragma omp for
for (i = 0; i < n; i++)
b[i] = a;
}
Master Construct
Similar to single, but with specific differences.
Key Differences from single
Execution:
Work is always done on the master thread (thread 0)
Deterministic behavior
Synchronization:
No implied barrier/synchronization
More lightweight than single if the barrier is not needed
When to Use
Use master when (see the sketch after this list):
You specifically need thread 0 to do the work
You don’t need synchronization afterward
Performance is critical and barrier overhead should be avoided
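A minimal sketch of these points (reusing the variable a from the single example; the explicit barrier is an assumption about what the surrounding code needs, since master provides none):

#pragma omp parallel shared(a)
{
    #pragma omp master
    a = omp_get_num_threads(); // only thread 0 executes this
    // no implied barrier: without the explicit barrier below,
    // other threads might read a before thread 0 has set it
    #pragma omp barrier
    // all threads can now safely use a
}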
Ordered Construct
Execute part of a loop body in sequential order.
Warning
Significant performance penalty! Requires enough other parallel work to pay the overhead.
How It Works
The thread working on the first iteration enters the ordered region, while the others wait
When it finishes, the thread handling the second iteration enters
And so on, in sequential order
Requirements
The ordered clause must also be specified on the loop construct (omp for/omp do)
No more than one ordered region per thread and iteration
Use Cases
Ordered printing from parallel loops
Debugging (e.g., data races)
Example: Ordered Construct
#pragma omp parallel default(none) shared(b)
{
#pragma omp for ordered schedule(dynamic, 1)
for (int i = 0; i < PSIZE; i++)
{
b[i] = expensiveFunction(i);
#pragma omp ordered
printf("b[%3i] = %4i\n", i, b[i]);
}
}
The computation expensiveFunction(i) happens in parallel
The printf statements execute in sequential order (i = 0, 1, 2, …)
This ensures ordered output despite parallel execution
Clauses for Parallel Construct
if Clause
The if clause can be specified on the parallel construct.
If the condition evaluates to false:
No parallel region is started
Code executes serially
Useful for runtime evaluation (e.g., loop count too small to benefit from parallelization)
Syntax
!$omp parallel if (condition)
#pragma omp parallel if (condition)
Example: if Clause (Fortran)
integer :: n = 20
!$omp parallel if (n > 5) shared(n)
!$omp single
print *, "The n is: ", n
!$omp end single
print *, "Hello, I am thread", &
omp_get_thread_num(), " of", &
omp_get_num_threads()
!$omp end parallel
If n > 5: parallel region with multiple threads
If n <= 5: serial execution with a single thread
Example: if Clause (C)
int n = 20;
#pragma omp parallel if (n > 5) shared(n)
{
#pragma omp single
printf("The n is %i\n", n);
printf("Hello, I am thread %i of %i\n",
omp_get_thread_num(),
omp_get_num_threads());
}
Clause: num_threads
The num_threads clause specifies the number of threads to start in a parallel region.
Syntax
C:
int nthread = 3;
#pragma omp parallel num_threads(nthread)
Fortran:
integer :: nthread = 3
!$omp parallel num_threads(nthread)
Note
This overrides the default thread count and environment variables for this specific parallel region.
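A minimal runnable sketch of this override (assuming the runtime can actually supply three threads; with dynamic thread adjustment the team may be smaller):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nthread = 3;
    #pragma omp parallel num_threads(nthread)
    {
        #pragma omp single
        printf("Team size: %d\n", omp_get_num_threads()); // 3, regardless of OMP_NUM_THREADS
    }
    return 0;
}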
Keeping Memory Consistent
OpenMP: Relaxed Memory Model
OpenMP uses a relaxed memory model for performance.
Threads are allowed to have their “own temporary view” of memory:
Not required to be consistent with main memory
Data may be in registers or cache, invisible to other threads
Programmer Responsibility
Important
This inconsistency is only a "may be" on any given hardware, but for portability the programmer must assume it is the case.
Scope for Data Races
Without proper synchronization:
Memory modified by other threads may not be in temporary view
Own changes may not be visible to other threads
Ensuring Memory Consistency: flush
Use flush to ensure memory consistency across threads.
What flush Does
Writes modifications to memory:
Modifications in temporary view are written to memory system
Guaranteed to be visible to other threads
Discards temporary view:
Temporary view gets discarded
Next access needs to read from memory subsystem
Ensures modifications from other threads are “known”
Prevents reordering:
No reordering of memory access and flush
Example: Without flush (Problem)
integer :: i
integer, dimension(4) :: b
b = (/ 3, 4, 5, 6 /)
!$OMP parallel &
!$OMP shared(b), private(i)
i = omp_get_thread_num() + 1
b(i) = b(i) + i
b(i+1) = b(i+1) + 1
!$OMP end parallel
Memory Behavior (3 threads)
Initial: [3, 4, 5, 6]
Thread 0: i=1
b(1) = 3 + 1 = 4
b(2) = 4 + 1 = 5 (but may read stale value!)
Thread 1: i=2
b(2) = 4 + 2 = 6 (conflict!)
b(3) = 5 + 1 = 6
Thread 2: i=3
b(3) = 5 + 3 = 8 (conflict!)
b(4) = 6 + 1 = 7
Result: [4, 6, 8, 7] ← Not what we want!
Warning
Without synchronization, threads may read stale values and overwrite each other’s changes.
Example: With barrier (Solution)
integer :: i
integer, dimension(4) :: b
b = (/ 3, 4, 5, 6 /)
!$OMP parallel &
!$OMP shared(b), private(i)
i = omp_get_thread_num() + 1
b(i) = b(i) + i
!$OMP barrier
b(i+1) = b(i+1) + 1
!$OMP end parallel
Memory Behavior (3 threads)
Initial: [3, 4, 5, 6]
Phase 1 (before barrier):
Thread 0: b(1) = 4
Thread 1: b(2) = 6
Thread 2: b(3) = 8
Result after phase 1: [4, 6, 8, 6]
BARRIER (flush to memory)
Phase 2 (after barrier):
Thread 0: b(2) = 6 + 1 = 7
Thread 1: b(3) = 8 + 1 = 9
Thread 2: b(4) = 6 + 1 = 7
Final result: [4, 7, 9, 7] ← Correct!
Note
The barrier ensures all writes from phase 1 are visible before phase 2 begins.
Sequence Required for Data Visibility
For data to be visible on another thread, the following sequence is required (a code sketch follows the list):
First thread writes to shared memory
First thread flush - change goes into memory system
Second thread flush - discard local temporary view
Second thread reads - gets updated value from memory
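A C sketch of the four steps (the variables data and ready are hypothetical; the plain spin on ready is for illustration only, real code should use atomics for the flag, as in the reordering example later in this section):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, ready = 0;
    #pragma omp parallel num_threads(2) shared(data, ready)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;             // 1. first thread writes to shared memory
            #pragma omp flush      // 2. change goes into the memory system
            ready = 1;
            #pragma omp flush
        } else {
            int r = 0;
            while (!r) {
                #pragma omp flush  // 3. discard local temporary view
                r = ready;
            }
            #pragma omp flush
            printf("data = %d\n", data); // 4. read gets the updated value
        }
    }
    return 0;
}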
Important Notes
Important
A flush doesn’t “push” data to other threads
Fixing data races typically also requires synchronization
Implied flushes are often sufficient
Explicit Flush
You can issue an explicit flush:
Fortran:
!$OMP flush
C:
#pragma omp flush
Implicit Barriers and Data Flushes
OpenMP automatically performs barriers and flushes at specific points.
Constructs with Barrier and Flush
At barrier:
!$omp barrier / #pragma omp barrier (flush)
Start and end of constructs:
parallel region (barrier & flush)
Start and end:
critical region (flush)
ordered region (flush)
End only:
Loop constructs (for/do) (barrier & flush)
single (barrier & flush)
workshare (barrier & flush)
sections (barrier & flush)
Note
No barrier or flush at the start of loop, single, workshare, or sections!
Other Operations
Various locking operations (flush)
Start and end of atomic flushes the "protected" variable
Use seq_cst on atomic to include a "global" flush
No barrier or flush is associated with the master construct!
Memory Reorder: Out-of-Order Execution
Problem Scenario
Consider this code:
...
A(5) = 3.0
!$omp atomic write
matrix_set = 1
...
Potential Problems
No guarantee A(5) is in memory:
Value might still be in registers/cache
No guarantee order is maintained:
Optimizing compiler might reorder:
matrix_set = 1 ... A(5) = 3.0
Warning
Another thread might see matrix_set = 1 but read an old value of A(5)!
Fix: Using flush to Prevent Reordering
...
A(5) = 3.0
!$omp flush
!$omp atomic write
matrix_set = 1
...
What the flush Does
Ensures modified A is in memory:
All threads can see the updated value
Prohibits reordering of memory accesses:
The compiler and hardware cannot move matrix_set = 1 before the flush
Guarantees A(5) is written before matrix_set is set
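The reading side needs the mirror-image sequence. A hedged C sketch of both sides of the scenario (A and matrix_set translated from the Fortran example above; index and types are illustrative):

#include <stdio.h>
#include <omp.h>

double A[10];
int matrix_set = 0;

int main(void)
{
    #pragma omp parallel num_threads(2) shared(A, matrix_set)
    {
        if (omp_get_thread_num() == 0) {
            A[5] = 3.0;
            #pragma omp flush            // A reaches memory; no reordering past here
            #pragma omp atomic write
            matrix_set = 1;              // flag set only after A is visible
        } else {
            int set = 0;
            while (!set) {
                #pragma omp atomic read
                set = matrix_set;        // atomic access flushes the flag itself
            }
            #pragma omp flush            // discard temporary view before reading A
            printf("A[5] = %f\n", A[5]); // guaranteed to see 3.0
        }
    }
    return 0;
}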
Clause: nowait
Barriers have performance implications. The implied barrier of a construct may not be required for correctness.
Removing Barriers
Specifying nowait:
In C: on the construct itself
In Fortran: on the end construct directive
This suppresses the implied barrier (including flush).
When to Use
Use nowait when:
Threads don’t need to wait for each other
No data dependencies between constructs
You want to improve performance by allowing threads to continue immediately
Example: Tensor Product (C)
#pragma omp parallel shared(a, b, t, n, m)
{
#pragma omp for nowait
for (int i = 0; i < n; i++)
a[i] = funcA(i); // no barrier needed!
#pragma omp for
for (int j = 0; j < m; j++)
b[j] = funcB(j); // barrier needed!
#pragma omp for
for (int i = 0; i < n; i++)
for (int j = 0; j < m; j++)
t[i][j] = a[i] * b[j]; // without the barrier above, this access to b would be bad!
}
First loop initializes a with nowait - threads can continue immediately
Second loop initializes b - the implicit barrier ensures all threads finish before the tensor product
Third loop uses both a and b - needs both to be complete
Example: Adding Vectors (Fortran)
!$omp parallel shared(a, b, t, n)
!$omp do
do i = 1, n
a(i) = sin(real(i))
enddo
!$omp end do nowait ! no barrier here!
!$omp do
do j = 1, n
b(j) = cos(real(j))
enddo
!$omp end do ! barrier here!
!$omp do
do i = 1, n
t(i) = a(i) + b(i)
enddo
!$omp end parallel
Note
Demo code - a single loop would help performance.
First loop fills a - can proceed without waiting
Second loop fills b - implicit barrier before the final loop
Third loop needs both a and b complete
Example: Adding Vectors (C)
#pragma omp parallel shared(a, b, t, n)
{
#pragma omp for nowait
for (int i = 0; i < n; i++)
a[i] = sin((double)i); // no barrier here!
#pragma omp for
for (int j = 0; j < n; j++)
b[j] = cos((double)j); // barrier needed!
#pragma omp for
for (int i = 0; i < n; i++)
t[i] = a[i] + b[i];
}
Performance Impact of nowait
Benchmark Setup
Hardware:
Dual socket, quad-core Intel Xeon E5520 (2.26 GHz)
Compilers tested:
PGI 10.9
GCC 4.4
Intel 12.0
Problem:
Vector addition example with n = 1000
Time measured in microseconds (μs)
Tested with 4, 6, and 8 threads
Results
Threads   Savings from nowait
-------   -------------------
4-8       0.6 - 1.3 μs
Performance Chart
Time (μs)
40 ┤ ■ PGI wait
│ □ PGI nowait
35 ┤ ● GNU wait
│ ○ GNU nowait
30 ┤ ▲ Intel wait
│ △ Intel nowait
25 ┤ ■
│ □ ■
20 ┤ ● □ ■
│ ○ ● □
15 ┤ ▲ ○ ●
│ △ ▲ ○
10 ┤ △ ▲
│ △
0 └─────┴─────┴─────┴─────
4 6 8 Threads
Note
Even small savings (0.6-1.3 μs) can add up in frequently executed code.
Specialty of Static Schedule
When specifying a static schedule with:
Same iteration count
Same chunk size (or default)
Loops bound to same parallel region
Guarantee:
You can safely assume the same thread works on the same iteration in all loops.
Can use nowait even with data dependencies between loops!
Important
This only works with static scheduling. Other schedules don’t guarantee iteration-to-thread mapping.
Example: Static Schedule with Dependencies (Fortran)
!$omp parallel shared(a, b, t, n)
!$omp do schedule(static)
do i = 1, n
a(i) = sin(real(i))
enddo
!$omp end do nowait ! no barrier here!
!$omp do schedule(static)
do j = 1, n
b(j) = cos(real(j))
enddo
!$omp end do nowait ! no barrier here!
!$omp do schedule(static)
do i = 1, n
t(i) = a(i) + b(i)
enddo
!$omp end do nowait ! no barrier here!
!$omp end parallel
Important
The static schedule is crucial! Each thread processes the same indices in all three loops.
Example: Static Schedule with Dependencies (C)
#pragma omp parallel shared(a, b, t, n)
{
#pragma omp for schedule(static) nowait
for (int i = 0; i < n; i++)
a[i] = sin((double)i); // no barrier here!
#pragma omp for schedule(static) nowait
for (int j = 0; j < n; j++)
b[j] = cos((double)j); // no barrier here!
#pragma omp for schedule(static)
for (int i = 0; i < n; i++)
t[i] = a[i] + b[i];
}
Why This Works
With static scheduling:
Thread 0 always processes indices 0 to n/num_threads-1
Thread 1 always processes indices n/num_threads to 2*n/num_threads-1
And so on…
Each thread only reads values it wrote, so no race conditions occur!
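A quick way to observe the guarantee (a hypothetical demo, not part of the original material) is to record which thread executes each index in two static loops; the two recordings always match:

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void)
{
    int owner1[N], owner2[N];
    #pragma omp parallel shared(owner1, owner2)
    {
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < N; i++)
            owner1[i] = omp_get_thread_num(); // who handled i in loop 1
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            owner2[i] = omp_get_thread_num(); // who handled i in loop 2
    }
    for (int i = 0; i < N; i++)               // owner1[i] == owner2[i] for every i
        printf("i=%d: loop1 thread %d, loop2 thread %d\n", i, owner1[i], owner2[i]);
    return 0;
}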
Orphan Directives
“Orphan” directives are OpenMP directives that appear inside functions/subroutines called from within a parallel region, rather than directly inside the parallel region.
Thread Safety Assumption
Calling subroutines and functions inside a parallel region is legal, assuming thread safety.
What Can Be Orphaned
The called procedures may contain:
Worksharing constructs (for, do, sections)
Synchronization constructs (barrier, critical, etc.)
Example: Orphan Directive (C)
Main Function
#pragma omp parallel shared(v, vl) reduction(+:nm)
{
vectorinit(v, vl);
nm = vectornorm(v, vl);
}
Called Function with Orphan Directive
void vectorinit(double* vdata, int leng)
{
#pragma omp for
for (int i = 0; i < leng; i++)
{
vdata[i] = i;
}
return;
}
Note
The #pragma omp for directive is "orphaned" - it is not directly inside the parallel region, but binds to the active parallel region when called.
Example: Orphan Directive (Fortran)
Main Program
!$omp parallel shared(v, vl) reduction(+:nm)
call vectorinit(v, vl)
nm = vectornorm(v, vl)
!$omp end parallel
Subroutine with Orphan Directive
subroutine vectorinit(vdata, leng)
double precision, dimension(leng) :: vdata
integer :: leng, i
!$omp do
do i = 1, leng
vdata(i) = i
enddo
end subroutine vectorinit
Note
The !$omp do directive is "orphaned" - it is not directly inside the parallel region, but binds to the active parallel region when called.
Performance Impact of Orphaning
Benchmark Setup
Test: Vector initialization and norm calculation
Vector length: 40,000
Hardware: Xeon E5-2650 v3
Compilers: GCC 4.9.3, ICC 16.0
Configurations Tested
parallel for in each function (no orphaning)
Orphaned for in each function
Orphaned for nowait in each function
Results
Time (ms)
0.06 ┤
│ ■ gcc: parallel for
0.05 ┤ □ gcc: orphaned for
│ ○ gcc: orphaned for nowait
0.04 ┤ ■ ● icc: parallel for
│ ■ ▲ icc: orphaned for
0.03 ┤ □ △ icc: orphaned for nowait
│ ■ □
0.02 ┤ ○ ■ □ ○
│ ● ▲ △
0.01 ┤
│
0 └─────┴─────┴─────┴─────┴─────
2 4 6 8 10 Cores
Key Observations
Orphaned directives perform better than creating new parallel regions
Using nowait provides additional performance gains
Starting/closing parallel regions is very expensive
Discussion of Orphan Directives
Advantages:
Reduces need for code restructuring:
Can parallelize existing functions without major changes
Allows for longer parallel regions:
Starting/closing parallel regions is very expensive
One long parallel region is more efficient than many short ones
Better performance:
As shown in benchmarks, avoids parallel region overhead
Potential Issues:
Warning
Problem: Routine with orphan directive called outside parallel region
If a function with an orphaned directive is called from serial code, the directive may have no effect or cause unexpected behavior.
Best Practices:
Document functions that contain orphan directives
Consider adding checks for parallel context if needed
Design functions to work correctly both inside and outside parallel regions
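One possible pattern for such a check (a sketch using the standard omp_in_parallel() query; strictly, an orphaned worksharing construct encountered outside a parallel region simply executes on the encountering thread, but an explicit branch documents the intent):

#include <omp.h>

void vectorinit(double* vdata, int leng)
{
    if (omp_in_parallel()) {
        // called from within a parallel region:
        // orphaned worksharing, iterations are shared among the team
        #pragma omp for
        for (int i = 0; i < leng; i++)
            vdata[i] = i;
    } else {
        // serial caller: plain loop, no directive needed
        for (int i = 0; i < leng; i++)
            vdata[i] = i;
    }
}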
Summary
This guide covered advanced worksharing concepts in OpenMP:
Constructs
single construct: Execute code on one thread (with barrier)
master construct: Execute code on master thread (no barrier)
ordered construct: Execute loop iterations in sequential order
Clauses
if clause: Conditional parallelization
num_threads clause: Control thread count
nowait clause: Remove implicit barriers for performance
Memory Consistency
flush: Ensure memory consistency across threads
Implicit barriers and flushes: Automatic synchronization points
Memory reordering: Understanding and preventing issues
Advanced Techniques
Static schedule specialty: Using nowait with dependencies
Orphan directives: Worksharing constructs in called functions
Performance Considerations
Balance between synchronization overhead and correctness
Strategic use of nowait can improve performance
Orphan directives reduce parallel region overhead