More on Private Data

Objectives

This guide covers special versions of private data:

firstprivate - initialization of private variables
lastprivate - capturing values from last iteration
reduction - parallel reductions (sums, products, etc.)
threadprivate - privatizing global storage
User-defined reductions (OpenMP 4.0+)

Private and Shared Data: Review

In a parallel region, data can be either shared or private.

Shared Data

Value unchanged on entry to parallel region
Survives after end of parallel region

Private Data

Each thread has its own private copy
Normally uninitialized at beginning of parallel region
Contents typically lost when parallel region finishes
However: Connection to values before/after is often needed

Memory Layout

Main Memory: [Shared data]

Thread 0: [Private T0]
Thread 1: [Private T1]
Thread 2: [Private T2]
Thread 3: [Private T3]

Clause: firstprivate

Problem

Private variables are not initialized by default.

Solution: firstprivate Clause

The firstprivate clause:

Declares variable(s) as private
Initializes each private copy with the value prior to the construct

Fortran Example

integer :: lsum = 10

!$omp parallel &
!$omp firstprivate(lsum)
    lsum = lsum + omp_get_thread_num()
    print *, lsum
!$omp end parallel

C Example

int lsum = 10;

#pragma omp parallel \
    firstprivate(lsum)
{
    lsum += omp_get_thread_num();
    printf("%i\n", lsum);
}

Expected Output

With 4 threads:

Thread 0: 10
Thread 1: 11
Thread 2: 12
Thread 3: 13

Example: Vector Norm with private

Fortran Version

norm = 0.0

!$omp parallel default(none) &
!$omp shared(vect, norm) private(i, lNorm)
    lNorm = 0.0

    !$omp do
    do i = 0, vleng
        lNorm = lNorm + vect(i)**2
    enddo

    !$omp atomic update
    norm = norm + lNorm
!$omp end parallel

norm = sqrt(norm)

C Version

norm = 0.0;

#pragma omp parallel default(none) \
    shared(vect, norm) private(i, lNorm)
{
    lNorm = 0.0;

    #pragma omp for
    for (i = 0; i < vleng; i++)
        lNorm += vect[i] * vect[i];

    #pragma omp atomic update
    norm += lNorm;
}

norm = sqrt(norm);

Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)

Note

lNorm must be explicitly initialized to 0.0 inside the parallel region.

Example: Vector Norm with firstprivate

Fortran Version

norm = 0.0
lNorm = 0.0

!$omp parallel default(none) &
!$omp shared(vect, norm) private(i) firstprivate(lNorm)
    !$omp do
    do i = 0, vleng
        lNorm = lNorm + vect(i)**2
    enddo

    !$omp atomic update
    norm = norm + lNorm
!$omp end parallel

norm = sqrt(norm)

C Version

norm = 0.0;
lNorm = 0.0;

#pragma omp parallel default(none) \
    shared(vect, norm) private(i) firstprivate(lNorm)
{
    #pragma omp for
    for (i = 0; i < vleng; i++)
        lNorm += vect[i] * vect[i];

    #pragma omp atomic update
    norm += lNorm;
}

norm = sqrt(norm);

Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)

Important

With firstprivate, lNorm is automatically initialized to 0.0 from the master thread’s value.

Clause: lastprivate

Purpose

The lastprivate clause:

Used with loop and sections constructs
Variable is private during execution
At the end: assigns value from last iteration or section
Undefined if not set in last iteration/section

Combined Usage

Variables can be both firstprivate and lastprivate.

Fortran Example

integer :: i, a

!$omp parallel do &
!$omp lastprivate(a)
do i = 1, 100
    a = i + 1
    call func(a)
enddo

print *, "a=", a
! This prints: a=101

C Example

int i, a;

#pragma omp parallel for \
    lastprivate(a)
for (i = 0; i < 100; i++)
{
    a = i + 1;
    func(a);
}

printf("a=%i\n", a);
// This prints: a=100

Note

The value from the sequentially last iteration is assigned back to the original variable.

Reduction Variables

Reductions of private variables are frequently needed:

Averages of array values
Scalar products
Sum, product, minimum, maximum operations

Previous Approach

We’ve done this before (e.g., vector norm example) using atomic to protect the update.

Better Approach: Reduction Clause

For a reduction, we specify:

Operation: e.g., addition, multiplication, OR, AND, etc.
One or more variables
A construct can have more than one reduction

Behavior of Reduction

reduction(operator : variable_list)

How It Works

Variables specified in reduction:

Each thread gets a private copy
Private copies are initialized with default values matching the operator
At the end of the construct (e.g., parallel region):
- Value prior to construct is combined with private copies
- Using the specified operator for combining values
- New combined value is available after the construct

Example: Memory Movements for Reduction (C)

int b;
b = 5;

#pragma omp parallel \
    reduction(+:b)
{
    b += omp_get_thread_num();
}

printf("%i\n", b);

Memory Behavior

Main Memory: b = 5

Thread 0: b = 0 → b = 0
Thread 1: b = 0 → b = 1
Thread 2: b = 0 → b = 2
Thread 3: b = 0 → b = 3

Final: 5 + 0 + 1 + 2 + 3 = 11

Output: 11

Note

Each thread’s private copy is initialized to 0 (identity for addition), then combined at the end.

Example: Memory Movements for Reduction (Fortran)

integer :: b

b = 5

!$omp parallel &
!$omp reduction(+:b)
    b = b + omp_get_thread_num()
!$omp end parallel

print *, b

Memory Behavior

Main Memory: b = 5

Thread 0: b = 0 → b = 0
Thread 1: b = 0 → b = 1
Thread 2: b = 0 → b = 2
Thread 3: b = 0 → b = 3

Final: 5 + 0 + 1 + 2 + 3 = 11

Output: 11

Note

Each thread’s private copy is initialized to 0 (identity for addition), then combined at the end.

Example: Vector Norm with atomic update

Fortran Version

norm = 0.0
lNorm = 0.0

!$omp parallel default(none) &
!$omp shared(vect, norm) private(i) firstprivate(lNorm)
    !$omp do
    do i = 1, vleng
        lNorm = lNorm + vect(i)**2  ! private copy
    enddo

    !$omp atomic update
    norm = norm + lNorm
!$omp end parallel  ! combine copies

norm = sqrt(norm)  ! master copy

C Version

norm = 0.0;
lNorm = 0.0;

#pragma omp parallel default(none) \
    shared(vect, norm) private(i) firstprivate(lNorm)
{
    #pragma omp for
    for (i = 0; i < vleng; i++)
        lNorm += vect[i] * vect[i];

    #pragma omp atomic update
    norm += lNorm;
}

norm = sqrt(norm);

Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)

Example: Vector Norm with reduction

Fortran Version

norm = 0.0  ! master copy
! lNorm gone

!$omp parallel default(none) &
!$omp shared(vect) reduction(+:norm) private(i)
    !$omp do  ! private copy = 0
    do i = 1, vleng
        norm = norm + vect(i)**2  ! private copy
    enddo
!$omp end parallel  ! combine copies

norm = sqrt(norm)  ! master copy

C Version

norm = 0.0;  // master copy
// lNorm gone!

#pragma omp parallel default(none) \
    shared(vect) reduction(+:norm) private(i)
{  // private copy: 0
    #pragma omp for
    for (i = 0; i < vleng; i++)
        norm += vect[i] * vect[i];  // private copy
}  // combine copies

norm = sqrt(norm);  // master copy

Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)

Important

No need for lNorm variable or atomic directive. The reduction clause handles everything automatically.

Example: Vector Norm with reduction (Simplified)

Fortran Version

norm = 0.0  ! master copy

!$omp parallel do default(none) &
!$omp shared(vect) reduction(+:norm)
do i = 1, vleng
    norm = norm + vect(i)**2  ! private copy
enddo
!$omp end parallel do

norm = sqrt(norm)  ! master copy

C Version

norm = 0.0;  // master copy

#pragma omp parallel for default(none) \
    shared(vect) reduction(+:norm)
for (i = 0; i < vleng; i++)
    norm += vect[i] * vect[i];  // private copy

norm = sqrt(norm);  // master copy

Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)

Note

Using parallel do/parallel for makes the code even more concise.

Supported Operators: Fortran (OpenMP 3.0)

Name	Symbol	Initial Value of Local Copy
add	`+`	0
multiply	`*`	1
subtract	`-`	0
logical AND	`.and.`	`.true.`
logical OR	`.or.`	`.false.`
EQUIVALENCE	`.eqv.`	`.true.`
NON-EQUIV.	`.neqv.`	`.false.`
maximum	`max`	smallest representable number
minimum	`min`	largest representable number
bitwise AND	`iand`	all bits on
bitwise OR	`ior`	0
bitwise XOR	`ieor`	0

Supported Operators: C (OpenMP 3.0)

Name	Symbol	Initial Value of Local Copy
add	`+`	0
multiply	`*`	1
subtract	`-`	0
bitwise AND	`&`	`~0`
bitwise OR	`\|`	0
bitwise XOR	`^`	0
logical AND	`&&`	1
logical OR	`\|\|`	0

Restrictions on Reduction

Important Limitations

C/C++:

Arrays are unsupported as reduction variables
No pointer or reference types

Fortran:

ALLOCATABLE arrays must be allocated at the beginning of construct
Must not be deallocated during construct
No Fortran pointers or assumed-size arrays

Order of Execution

Warning

No order of threads is specified!

Repeated runs are typically not bit-identical
This is common in parallel computing
This is technically a race condition, which is typically tolerated

OpenMP 4.0 Enhancement

OpenMP 4.0 allows you to declare your own custom reductions.

User-Defined Reductions

Allows definition of custom reduction operations.

Use Cases

Particularly useful with derived data types:

C/C++: struct
Fortran: type

Requirements

You need to provide:

Combiner: Combines thread-private results to final result
Initializer: Initializes private contributions at outset

Case Study: Maximum Value and Its Position

Given a large array:

Determine the maximum value
Find the location (index) of the maximum in the array

Parallelization Strategy:

Assign a portion of array to each thread
Each thread determines maximum and position in its part
Use user-defined reduction to determine final result

User-Defined Reduction in Fortran

Step 1: Define the Data Type

type :: mx_s
    real :: value
    integer :: index
end type

Step 2: Declare the Reduction

!$omp declare reduction(maxloc: mx_s: &
!$omp mx_combine(omp_out, omp_in)) &
!$omp initializer(mx_init(omp_priv, omp_orig))

The operation can be triggered by the name maxloc
Utilizes subroutine mx_combine and mx_init
Acts on objects of type mx_s

The Initializer in Fortran

Can be a subroutine or assignment statement (here: subroutine).

Special Variables:

omp_priv: reference to variable to be initialized
omp_orig: reference to original variable prior to construct

Example Implementation

Initialize from value prior to construct:

subroutine mx_init(priv, orig)
    type(mx_s), intent(out) :: priv
    type(mx_s), intent(in) :: orig

    priv%value = orig%value
    priv%index = orig%index
end subroutine mx_init

The Combiner in Fortran

Can be a subroutine or assignment statement (here: subroutine).

Special Variables:

omp_in: reference to contribution from thread
omp_out: reference to combined result

Example Implementation

Replace if contribution is larger:

subroutine mx_combine(out, in)
    type(mx_s), intent(inout) :: out
    type(mx_s), intent(in) :: in

    if (out%value < in%value) then
        out%value = in%value
        out%index = in%index
    endif
end subroutine mx_combine

Using User-Defined Reduction in Fortran

mx%value = val(1)
mx%index = 1

!$omp parallel do reduction(maxloc: mx)
do i = 2, count
    if (mx%value < val(i)) then
        mx%value = val(i)
        mx%index = i
    endif
enddo

Easily readable code
Similar to what one would do in serial programming
Abstracts away the parallel complexity

User-Defined Reduction in C

Step 1: Define the Data Type

struct mx_s {
    float value;
    int index;
};

Step 2: Declare the Reduction

#pragma omp declare reduction(maxloc: \
    struct mx_s: mx_combine(&omp_out, &omp_in)) \
    initializer(mx_init(&omp_priv, &omp_orig))

The operation can be triggered by the name maxloc
Utilizes functions mx_combine and mx_init
Acts on objects of type struct mx_s

The Initializer in C

An expression (here: implemented with a function).

Special Variables:

omp_priv: reference to variable to be initialized
omp_orig: reference to original variable prior to construct

Example Implementation

Initialize from value prior to construct:

void mx_init(struct mx_s *priv, struct mx_s *orig)
{
    priv->value = orig->value;
    priv->index = orig->index;
}

The Combiner in C

An expression (here: implemented with a function).

Special Variables:

omp_in: reference to contribution from thread
omp_out: reference to combined result

Example Implementation

Replace if contribution is larger:

void mx_combine(struct mx_s *out, struct mx_s *in)
{
    if (out->value < in->value) {
        out->value = in->value;
        out->index = in->index;
    }
}

Using User-Defined Reduction in C

mx->value = val[0];
mx->index = 0;

#pragma omp parallel for reduction(maxloc: mx)
for (i = 1; i < count; i++) {
    if (mx.value < val[i])
    {
        mx.value = val[i];
        mx.index = i;
    }
}

Easily readable code
Similar to what one would do in serial programming
Abstracts away the parallel complexity

Declaring a Reduction Operation: Syntax Summary

C Syntax

#pragma omp declare reduction (reduction-identifier : \
    typename-list : combiner) [initializer-clause] new-line

Fortran Syntax

!$omp declare reduction(reduction-identifier : &
!$omp type-list : combiner) [initializer-clause]

Components

reduction-identifier: Name for your reduction
typename-list/type-list: Data types the reduction applies to
combiner: Function/subroutine to combine values
initializer-clause: Optional initialization specification

Dealing with Global Storage

By default, global storage is shared among all threads.

Examples of Global Storage

C/C++:

File scope variables
static variables

Fortran:

COMMON blocks
Module data
Variables with save attribute

This default behavior is not always what is needed.

Directive: threadprivate in C

The threadprivate directive makes global storage private to each thread.

int g_var = 1;
#pragma omp threadprivate(g_var)

int main()
{
    g_var = 4;

    #pragma omp parallel
    {
        printf("%d\n", g_var);
    }

    return 0;
}

Each thread gets a private copy
Outside parallel region: modifications affect master’s copy

Example Output

With 4 threads:

Thread 0 (master): 4
Thread 1: 1
Thread 2: 1
Thread 3: 1

Directive: threadprivate in Fortran

The threadprivate directive makes global storage private to each thread.

module gmod
    integer :: g_var = 1
    !$omp threadprivate(g_var)
end module gmod

program example
    use gmod

    g_var = 4

    !$omp parallel
    print *, g_var
    !$omp end parallel
end program example

Each thread gets a private copy
Outside parallel region: modifications affect master’s copy

Example Output

With 4 threads:

Thread 0 (master): 4
Thread 1: 1
Thread 2: 1
Thread 3: 1

Clause: copyin

The copyin clause initializes threadprivate data from the master thread.

C Example

int g_var = 1;
#pragma omp threadprivate(g_var)

int main()
{
    g_var = 4;

    #pragma omp parallel \
        copyin(g_var)
    {
        printf("%d\n", g_var);
    }

    return 0;
}

Fortran Example

module gmod
    integer :: g_var = 1
    !$omp threadprivate(g_var)
end module gmod

program example
    use gmod

    g_var = 4

    !$omp parallel copyin(g_var)
    print *, g_var
    !$omp end parallel
end program example

Output

With 4 threads, all threads print: 4

More on threadprivate

Data Persistence

threadprivate data remains unchanged between parallel regions if:

Neither region is nested inside another parallel region
Both regions have the same thread count
Internal variable dyn-var is false in both regions
- Use function omp_set_dynamic to control this

Fortran COMMON Blocks

In Fortran, you can make a COMMON block threadprivate:

integer :: a, b, c
COMMON /abccom/ a, b, c
!$OMP threadprivate(/abccom/)

Exercise

In a previous exercise, we parallelized a for loop which had 20 iterations by evenly divinding the number of interations among the available threads. An variable was used to store the number of iterations in the loop and an atomic operation protected the data from race conditions. Rewrite this code but now use the reduction operation.

Solution

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 6b-forworksharing-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{

int i,var1;
int n = 20; // number of iterations
var1 = 0;

#pragma omp parallel
    {
#ifdef _OPENMP
    // The purpose of this code is to add 1 to var1 20 times
#pragma omp for reduction(+:var1)
    for ( int i = 0; i < n; i++)
        var1 += 1;

#else
    printf("Serial code!\n");
#endif
    }

    printf("var1 =  %i \n", var1);

return 0;
}

Exercise

In the following code, monitor the values of the variables at the different stages of the runtime.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 5b-datascope-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{

int var1, var2, var3;   // Three variables
var1 = 1;
var2 = 2;
var3 = 3;

#pragma omp parallel firstprivate(var1,var2) shared(var3)
    {

#ifdef _OPENMP
    printf("var1 =  %i , var2 = %i , var3 = %i \n",var1,var2,var3);
    var1 = 10;
    var2 = 20;
    var3 = 30;
#else
    printf("Serial code!\n");
#endif
    }

    printf("var1 =  %i , var2 = %i , var3 = %i \n",var1,var2,var3);


int x = 0; // variable to hold the value from the last iteration

#pragma omp parallel for lastprivate(x)
for (int i = 0; i < 10; i++) {
    x = i; // x is private to each thread, but will retain value from the last iteration
    printf("Thread %d: i = %d, x = %d\n", omp_get_thread_num(), i, x);
}

printf("After the loop, x = %d\n", x); // x has the value from the last iteration (n - 1)


return 0;
}

Exercise

In the following code, monitor the values of the variable counter at the different stages of the runtime.

// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 7-threadprivate-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// declare a global variable
int counter;

// this variable is private to each thread
#pragma omp threadprivate(counter)

int main()
{

counter = 0;

#pragma omp parallel
{
#ifdef _OPENMP
    int thread_id = omp_get_thread_num();

    // Each thread sets its private copy of 'counter'
    counter = thread_id * 10;
    printf("Thread %d: counter = %d\n", thread_id, counter);

    // sync all threads
    #pragma omp barrier

    // Modify the thread-private variable
    counter += 5;
    printf("Thread %d after modification: counter = %d\n", thread_id, counter);
#else
    printf("Serial code!\n");
#endif
    }

// Outside the parallel region, the main thread's 'counter' value is unaffected
    printf("In main thread, counter = %d\n", counter);

#pragma omp parallel
{
#ifdef _OPENMP
    int thread_id = omp_get_thread_num();

    // print the value of 'counter' in another parallel region
    printf("Thread %d: counter = %d in the second parallel region\n", thread_id, counter);

#else
    printf("Serial code!\n");
#endif
    }

return 0;
}

Summary

This guide covered special private variables in OpenMP:

Special Private Variable Types

firstprivate: Initialization of private variables from master thread
lastprivate: Set value of private variable to value of last loop iteration or last section at end of construct
reduction: Calculating sums, products, etc. in parallel
threadprivate: Privatize global storage

User-Defined Reductions

Available in OpenMP 4.0+
Useful for complex data types
Requires combiner and initializer functions

When to Use Standard Constructs

The above constructs handle standard situations. For special cases, use:

Explicit initialization of private variables from shared variables
atomic/critical for writes to shared variables