Data for Parallel Regions
This guide covers:
Data:
What is private data
What is shared data
How to control which is which
Race conditions:
Basic constructs to avoid data races
Private Data
Private data is typically used for control variables, including:
Thread identification
Loop indices
Variables internal to the algorithm
Default Private Variables
Most variables declared inside a parallel region are private by default:
Variables declared inside the block (C/C++)
Local variables of subroutines/functions called from inside the parallel region
Exceptions
The following are NOT private by default:
static (C/C++) or save (Fortran) variables
File scope variables (C/C++) or COMMON blocks (Fortran)
Variables passed by reference inherit their data-sharing attribute
Warning
In Fortran, special care is needed with COMMON and EQUIVALENCE statements.
Example: Memory Movements for Private Data (Fortran)
integer :: b
b = 5
!$OMP parallel &
!$OMP private(b)
b = omp_get_thread_num()
b = b + 3
!$OMP end parallel
b = 7
Memory Layout
Main Memory: b = 5
Thread 0: b = 0 → b = 3
Thread 1: b = 1 → b = 4
Thread 2: b = 2 → b = 5
Thread 3: b = 3 → b = 6
Main Memory: b = 7
Note
Each thread has its own copy of b, and changes do not affect the original value in main memory.
Example: Memory Movements for Private Data (C)
int b;
b = 5;
#pragma omp parallel \
private(b)
{
b = omp_get_thread_num();
b += 3;
}
b = 7;
Memory Layout
Main Memory: b = 5
Thread 0: b = 0 → b = 3
Thread 1: b = 1 → b = 4
Thread 2: b = 2 → b = 5
Thread 3: b = 3 → b = 6
Main Memory: b = 7
Note
Each thread has its own copy of b, and changes do not affect the original value in main memory.
Shared Data
Majority of the data in parallel programs
Typically large data structures (e.g., arrays)
Properties
Keeps its value on entry to parallel region
Keeps its value on exit from parallel region
Every thread can access (read and/or write) the data
Safety Considerations
Safe scenario:
Multiple threads only read the data
Dangerous scenario:
Multiple threads access the same memory location
At least one of these is a write access
This easily results in a race condition
Example: Vector Initialization (Fortran)
integer, parameter :: vleng = 120
integer :: vect(vleng), myNum, start, fin, i
!$omp parallel shared(vect) &
!$omp private(myNum, start, fin, i)
myNum = vleng / omp_get_num_threads()
start = 1 + omp_get_thread_num() * myNum
fin = (omp_get_thread_num() + 1) * myNum
do i = start, fin
vect(i) = 4 * i ! threads write different elements
enddo
!$omp end parallel
Mathematical notation: \(v_i = 4i\)
Note
This is safe because each thread writes to different elements of the shared array.
Example: Vector Initialization (C)
const int vleng = 120;
int vect[vleng], myNum, start, fin, i;
#pragma omp parallel shared(vect) \
private(myNum, start, fin, i)
{
myNum = vleng / omp_get_num_threads();
start = omp_get_thread_num() * myNum;
fin = start + myNum;
for (i = start; i < fin; i++)
vect[i] = 4 * i; // threads write different elements
}
Mathematical notation: \(v_i = 4i\)
Note
This is safe because each thread writes to different elements of the shared array.
Example: Write Conflict for Shared Data (Fortran)
integer :: a, b
a = 5
!$OMP parallel &
!$OMP shared(a, b)
b = a + omp_get_thread_num()
print *, "updated"
print *, "my b:", b
!$OMP end parallel
Memory Behavior
Main Memory: a = 5
All threads read: a = 5
Thread 0: b = 5
Thread 1: b = 6
Thread 2: b = 7
Thread 3: b = 8
Final b value: RANDOM (could be 5, 6, 7, or 8)
Warning
Final b value is random/unpredictable
Individual threads might print b before it has its final value
This is a race condition
Example: Write Conflict for Shared Data (C)
int a, b;
a = 5;
#pragma omp parallel \
shared(a, b)
{
b = a + omp_get_thread_num();
printf("updated\n");
printf("my b: %i\n", b);
}
Memory Behavior
Main Memory: a = 5
All threads read: a = 5
Thread 0: b = 5
Thread 1: b = 6
Thread 2: b = 7
Thread 3: b = 8
Final b value: RANDOM (could be 5, 6, 7, or 8)
Warning
Final b value is random/unpredictable
Individual threads might print b before it has its final value
This is a race condition
Default Clause
The default clause can be used on a parallel or task construct to determine the data sharing of implicitly determined variables.
In C:
default(shared | none)
In Fortran:
default(shared | none | private | firstprivate)
For parallel constructs, if no default clause is supplied, default(shared) applies.
Recommendation
Important
Using default(none) is typically a good idea!
With default(none), all variables accessed in the parallel region must be explicitly declared as shared, private, etc.
Fixing Data Races
OpenMP provides several constructs to avoid data races:
barrier - synchronization point
critical - mutual exclusion region
atomic - lightweight protection for simple operations
Note
These constructs impact code performance, but we have no interest in “fast garbage”!
Barrier and Synchronization
The Barrier Construct
Fortran:
!$omp barrier
C:
#pragma omp barrier
Behavior
All threads wait for the last one to arrive at the barrier
Registers are flushed to the memory system
All threads must have the barrier in their line of execution
Warning
If not all threads reach the barrier, a deadlock will occur!
Visual Representation
Thread 0: A ──────┐
Thread 1: A ──────┤
Thread 2: A ──────┼── BARRIER ──┬── B
Thread 3: A ──────┘ ├── B
├── B
└── B
Time ─────────────────────────>
Example: Avoiding a Data Race in Matrix Transpose (Fortran)
!$omp parallel default(none) &
!$omp private(mysize, tid, i, j) shared(matrix, mtrans, nsize)
tid = omp_get_thread_num()
mysize = nsize / omp_get_num_threads()
do j = 1 + tid*mysize, (tid+1)*mysize
do i = 1, nsize
matrix(i,j) = 1000.0 * j + i
enddo
enddo
!$omp barrier
do j = 1 + tid*mysize, (tid+1)*mysize
do i = 1, nsize
mtrans(i,j) = matrix(j,i)
enddo
enddo
!$omp end parallel
Note
The barrier ensures that all threads complete writing to matrix
before any thread begins reading from it for the transpose operation.
Example: Avoiding a Data Race in Matrix Transpose (C)
#pragma omp parallel default(none) \
private(mysize, tid, i, j) shared(matrix, mtrans, nsize)
{
tid = omp_get_thread_num();
mysize = nsize / omp_get_num_threads();
for (i = tid*mysize; i < (tid+1)*mysize; i++)
for (j = 0; j < nsize; j++)
matrix[i][j] = 1000.0 * j + i;
#pragma omp barrier
for (i = tid*mysize; i < (tid+1)*mysize; i++)
for (j = 0; j < nsize; j++)
mtrans[i][j] = matrix[j][i];
}
Note
The barrier ensures that all threads complete writing to matrix
before any thread begins reading from it for the transpose operation.
Critical Regions
Critical regions protect updates of shared memory locations by ensuring only one thread executes the critical region at a time.
Syntax in C
#pragma omp critical (name)
{
code-block
}
Syntax in Fortran
!$omp critical (name)
code-block
!$omp end critical (name)
Name is optional:
If named: only one thread in all regions with the same name
If unnamed: only one thread in all unnamed regions
Implies a register flush at entrance and exit
Useful to execute non-thread-safe functions
Performance penalty due to serialization
Example: Use of Critical Region (Fortran)
Computing a sum with critical section:
sum = 0.0_dpr
!$omp parallel default(none) &
!$omp shared(sum) private(tid, cont)
tid = omp_get_thread_num()
cont = func(tid)
!$omp critical (exp_up)
sum = sum + cont
print *, tid, ": c=", cont, " s=", sum
!$omp end critical (exp_up)
!$omp end parallel
Mathematical notation: \(\sum_{k=0}^{n-1} e^k\)
Note
The critical region ensures that only one thread updates sum
at a time, preventing race conditions.
Example: Use of Critical Region (C)
Computing a sum with critical section:
sum = 0.0;
#pragma omp parallel default(none) \
shared(sum) private(tid, cont)
{
tid = omp_get_thread_num();
cont = func(tid);
#pragma omp critical (exp_up)
{
sum += cont;
printf("%i: c=%f s=%f\n", tid, cont, sum);
}
}
Mathematical notation: \(\sum_{k=0}^{n-1} e^k\)
Note
The critical region ensures that only one thread updates sum
at a time, preventing race conditions.
Atomic Operations
atomic is a lightweight alternative to critical for simple cases.
Works with simple statements only
Can use special hardware instructions if they exist
Flushes the “protected” variable on entry and exit
Much more efficient than critical for simple operations
Versions (from OpenMP 3.1)
Four different versions:
read - atomic read operation
write - atomic write operation
update - atomic update operation
capture - atomic update with capture of old/new value
OpenMP 4.0 Enhancement
Adding seq_cst to atomic flushes all variables:
Important for controlling instruction reordering
Example use case: implementing a lock
Atomic Read
Protects only the reading of a scalar intrinsic variable.
Fortran Syntax
!$omp atomic read
v = x
C Syntax
#pragma omp atomic read
v = x;
Protects only the reading of scalar variable x
Flushes x on entry and exit
Atomic Write
Protects only the writing of a scalar intrinsic variable.
Fortran Syntax
!$omp atomic write
x = expr
C Syntax
#pragma omp atomic write
x = expr;
Example Expressions
x = 5;
x = v;
x = func(a);
Warning
Protects only the writing of x
No protection for evaluation of expr on the right-hand side
Flushes x on entry and exit
Atomic Update
Protects the update of a variable in simple arithmetic operations.
Note
atomic update was the only atomic operation prior to OpenMP 3.1. The update keyword is optional for backward compatibility.
Only protects the update of the variable, not function calls on the right-hand side
Works with simple statements only
Can use special hardware instructions if available
Flushes the updated variable on entry and exit
Example
x += func(a);
x = x + func(a);
Warning
The evaluation of func(a) is NOT protected. Use critical if protection is needed!
Atomic Update: Fortran Statements
Examples
!$omp atomic update
x = x + 1
!$omp atomic update
x = x + f(a)
Warning
The evaluation of f(a) is NOT protected. Use critical if needed!
Allowed Operations
x = x operator expr
x = expr operator x
x = intr_proc(x, expr_list)
x = intr_proc(expr_list, x)
Where:
x is a scalar variable of intrinsic type
operator is one of: +, *, -, /, .AND., .OR., .EQV., .NEQV.
intr_proc is one of: MAX, MIN, IAND, IOR, IEOR
Note
The update keyword is optional for consistency with older OpenMP standards.
Example: Vector Norm (Fortran)
norm = 0.0D0
!$omp parallel default(none) &
!$omp shared(vect, norm) private(myNum, i, lNorm)
lNorm = 0.0D0
myNum = vleng / omp_get_num_threads() ! local size
do i = 1 + myNum * omp_get_thread_num(), &
myNum * (1 + omp_get_thread_num())
lNorm = lNorm + vect(i) * vect(i)
enddo
!$omp atomic update
norm = norm + lNorm
!$omp end parallel
norm = sqrt(norm)
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Note
Each thread computes a local sum (lNorm), then atomically adds it to the global norm.
Atomic Update: C Statements
Examples
#pragma omp atomic update
x++;
#pragma omp atomic update
x += f(a);
Warning
The evaluation of f(a) is NOT protected. Use critical if needed!
Allowed Operations
x binop= expr;
x++;
++x;
x--;
--x;
x = x binop expr;
Where:
x is an lvalue expression of scalar type
binop is one of: +, *, -, /, &, ^, |, <<, >>
Note
The update keyword is optional for consistency with older OpenMP standards.
Example: Vector Norm (C)
norm = 0.0;
#pragma omp parallel default(none) \
shared(vect, norm, vleng) private(myNum, i, lNorm)
{
lNorm = 0.0;
myNum = vleng / omp_get_num_threads(); // local size
for (i = myNum * omp_get_thread_num();
i < myNum * (1 + omp_get_thread_num()); i++)
lNorm += vect[i] * vect[i];
#pragma omp atomic update
norm += lNorm;
} // synchronize at end parallel
norm = sqrt(norm);
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Note
Each thread computes a local sum (lNorm), then atomically adds it to the global norm.
Atomic Capture
Atomic capture allows you to:
Update a shared variable atomically
Keep a thread-private copy of either (but not both):
The old value before update
The new value after update
Restrictions apply to the allowed statement forms.
Atomic Capture: C Statements
#pragma omp atomic capture
statement_or_structured_block
Allowed Statements (OpenMP 4.0)
v = x++;
v = x--;
v = ++x;
v = --x;
v = x binop= expr;
v = x = x binop expr;
v = x = expr binop x;
Allowed Structured Blocks
{v = x; x binop= expr;}
{x binop= expr; v = x;}
{v = x; x = x binop expr;}
{v = x; x = expr binop x;}
{x = x binop expr; v = x;}
{x = expr binop x; v = x;}
{v = x; x = expr;}
{v = x; x++;}
{v = x; ++x;}
{++x; v = x;}
{x++; v = x;}
{v = x; x--;}
{v = x; --x;}
{--x; v = x;}
{x--; v = x;}
Atomic Capture: Fortran Statements
Syntax Form 1
!$omp atomic capture
update-statement
capture-statement
!$omp end atomic
Syntax Form 2
!$omp atomic capture
capture-statement
update-statement
!$omp end atomic
Allowed Update Statements
x = x operator expr
x = expr operator x
x = intr_proc(x, expr_list)
x = intr_proc(expr_list, x)
Allowed Capture Statements
v = x
Summary
This guide covered the following OpenMP concepts:
Data Management:
Private data: each thread has its own copy
Shared data: accessible by all threads
Controlling data attributes with clauses
Preventing Race Conditions:
barrier: synchronization point for all threads
critical: mutual exclusion for code regions
atomic: lightweight protection for simple operations (read, write, update, capture)
Parallelization Strategies:
Examples demonstrated various approaches to parallel data management
Techniques for avoiding data races while maintaining performance