More on Private Data
Objectives
This guide covers special versions of private data:
firstprivate - initialization of private variables
lastprivate - capturing values from last iteration
reduction - parallel reductions (sums, products, etc.)
threadprivate - privatizing global storage
User-defined reductions (OpenMP 4.0+)
Clause: firstprivate
Problem
Private variables are not initialized by default.
Solution: firstprivate Clause
The firstprivate clause:
Declares variable(s) as private
Initializes each private copy with the value prior to the construct
Fortran Example
integer :: lsum = 10
!$omp parallel &
!$omp firstprivate(lsum)
lsum = lsum + omp_get_thread_num()
print *, lsum
!$omp end parallel
C Example
int lsum = 10;
#pragma omp parallel \
firstprivate(lsum)
{
lsum += omp_get_thread_num();
printf("%i\n", lsum);
}
Expected Output
With 4 threads (thread labels added for clarity; the lines may appear in any order):
Thread 0: 10
Thread 1: 11
Thread 2: 12
Thread 3: 13
Example: Vector Norm with private
Fortran Version
norm = 0.0
!$omp parallel default(none) &
!$omp shared(vect, norm) private(i, lNorm)
lNorm = 0.0
!$omp do
do i = 1, vleng
lNorm = lNorm + vect(i)**2
enddo
!$omp atomic update
norm = norm + lNorm
!$omp end parallel
norm = sqrt(norm)
C Version
norm = 0.0;
#pragma omp parallel default(none) \
shared(vect, norm) private(i, lNorm)
{
lNorm = 0.0;
#pragma omp for
for (i = 0; i < vleng; i++)
lNorm += vect[i] * vect[i];
#pragma omp atomic update
norm += lNorm;
}
norm = sqrt(norm);
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Note
lNorm must be explicitly initialized to 0.0 inside the parallel region.
Example: Vector Norm with firstprivate
Fortran Version
norm = 0.0
lNorm = 0.0
!$omp parallel default(none) &
!$omp shared(vect, norm) private(i) firstprivate(lNorm)
!$omp do
do i = 1, vleng
lNorm = lNorm + vect(i)**2
enddo
!$omp atomic update
norm = norm + lNorm
!$omp end parallel
norm = sqrt(norm)
C Version
norm = 0.0;
lNorm = 0.0;
#pragma omp parallel default(none) \
shared(vect, norm) private(i) firstprivate(lNorm)
{
#pragma omp for
for (i = 0; i < vleng; i++)
lNorm += vect[i] * vect[i];
#pragma omp atomic update
norm += lNorm;
}
norm = sqrt(norm);
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Important
With firstprivate, each thread’s copy of lNorm is automatically initialized to 0.0, the value the master thread assigned before the parallel region.
Clause: lastprivate
Purpose
The lastprivate clause:
Used with loop and sections constructs
Variable is private during execution
At the end of the construct: the original variable is assigned the value from the sequentially last iteration or section
The value is undefined if the variable is not set in the last iteration/section
Combined Usage
Variables can be both firstprivate and lastprivate.
Fortran Example
integer :: i, a
!$omp parallel do &
!$omp lastprivate(a)
do i = 1, 100
a = i + 1
call func(a)
enddo
print *, "a=", a
! This prints: a=101
C Example
int i, a;
#pragma omp parallel for \
lastprivate(a)
for (i = 0; i < 100; i++)
{
a = i + 1;
func(a);
}
printf("a=%i\n", a);
// This prints: a=100
Note
The value from the sequentially last iteration is assigned back to the original variable.
Reduction Variables
Reductions of private variables are frequently needed:
Averages of array values
Scalar products
Sum, product, minimum, maximum operations
Previous Approach
We’ve done this before (e.g., vector norm example) using atomic to protect the update.
Better Approach: Reduction Clause
For a reduction, we specify:
Operation: e.g., addition, multiplication, OR, AND, etc.
One or more variables
A construct can have more than one reduction
Behavior of Reduction
reduction(operator : variable_list)
How It Works
Variables specified in reduction:
Each thread gets a private copy
Private copies are initialized with default values matching the operator
At the end of the construct (e.g., parallel region):
Value prior to construct is combined with private copies
Using the specified operator for combining values
New combined value is available after the construct
Example: Memory Movements for Reduction (C)
int b;
b = 5;
#pragma omp parallel \
reduction(+:b)
{
b += omp_get_thread_num();
}
printf("%i\n", b);
Memory Behavior
Main Memory: b = 5
Thread 0: b = 0 → b = 0
Thread 1: b = 0 → b = 1
Thread 2: b = 0 → b = 2
Thread 3: b = 0 → b = 3
Final: 5 + 0 + 1 + 2 + 3 = 11
Output: 11
Note
Each thread’s private copy is initialized to 0 (identity for addition), then combined at the end.
Example: Memory Movements for Reduction (Fortran)
integer :: b
b = 5
!$omp parallel &
!$omp reduction(+:b)
b = b + omp_get_thread_num()
!$omp end parallel
print *, b
Memory Behavior
Main Memory: b = 5
Thread 0: b = 0 → b = 0
Thread 1: b = 0 → b = 1
Thread 2: b = 0 → b = 2
Thread 3: b = 0 → b = 3
Final: 5 + 0 + 1 + 2 + 3 = 11
Output: 11
Note
Each thread’s private copy is initialized to 0 (identity for addition), then combined at the end.
Example: Vector Norm with atomic update
Fortran Version
norm = 0.0
lNorm = 0.0
!$omp parallel default(none) &
!$omp shared(vect, norm) private(i) firstprivate(lNorm)
!$omp do
do i = 1, vleng
lNorm = lNorm + vect(i)**2 ! private copy
enddo
!$omp atomic update
norm = norm + lNorm ! combine copies
!$omp end parallel
norm = sqrt(norm) ! master copy
C Version
norm = 0.0;
lNorm = 0.0;
#pragma omp parallel default(none) \
shared(vect, norm) private(i) firstprivate(lNorm)
{
#pragma omp for
for (i = 0; i < vleng; i++)
lNorm += vect[i] * vect[i];
#pragma omp atomic update
norm += lNorm;
}
norm = sqrt(norm);
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Example: Vector Norm with reduction
Fortran Version
norm = 0.0 ! master copy
! lNorm gone
!$omp parallel default(none) &
!$omp shared(vect) reduction(+:norm) private(i)
!$omp do ! private copy = 0
do i = 1, vleng
norm = norm + vect(i)**2 ! private copy
enddo
!$omp end parallel ! combine copies
norm = sqrt(norm) ! master copy
C Version
norm = 0.0; // master copy
// lNorm gone!
#pragma omp parallel default(none) \
shared(vect) reduction(+:norm) private(i)
{ // private copy: 0
#pragma omp for
for (i = 0; i < vleng; i++)
norm += vect[i] * vect[i]; // private copy
} // combine copies
norm = sqrt(norm); // master copy
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Important
No need for lNorm variable or atomic directive. The reduction clause handles everything automatically.
Example: Vector Norm with reduction (Simplified)
Fortran Version
norm = 0.0 ! master copy
!$omp parallel do default(none) &
!$omp shared(vect) reduction(+:norm)
do i = 1, vleng
norm = norm + vect(i)**2 ! private copy
enddo
!$omp end parallel do
norm = sqrt(norm) ! master copy
C Version
norm = 0.0; // master copy
#pragma omp parallel for default(none) \
shared(vect) reduction(+:norm)
for (i = 0; i < vleng; i++)
norm += vect[i] * vect[i]; // private copy
norm = sqrt(norm); // master copy
Mathematical notation: \(\sqrt{\sum_i v(i) \cdot v(i)}\)
Note
Using parallel do/parallel for makes the code even more concise.
Supported Operators: Fortran (OpenMP 3.0)
| Name | Symbol | Initial Value of Local Copy |
|---|---|---|
| add | + | 0 |
| multiply | * | 1 |
| subtract | - | 0 |
| logical AND | .and. | .true. |
| logical OR | .or. | .false. |
| EQUIVALENCE | .eqv. | .true. |
| NON-EQUIV. | .neqv. | .false. |
| maximum | max | smallest representable number |
| minimum | min | largest representable number |
| bitwise AND | iand | all bits on |
| bitwise OR | ior | 0 |
| bitwise XOR | ieor | 0 |
Supported Operators: C (OpenMP 3.0)
| Name | Symbol | Initial Value of Local Copy |
|---|---|---|
| add | + | 0 |
| multiply | * | 1 |
| subtract | - | 0 |
| bitwise AND | & | ~0 (all bits on) |
| bitwise OR | \| | 0 |
| bitwise XOR | ^ | 0 |
| logical AND | && | 1 |
| logical OR | \|\| | 0 |
Restrictions on Reduction
Important Limitations
C/C++:
Arrays are unsupported as reduction variables
No pointer or reference types
Fortran:
ALLOCATABLE arrays must be allocated at the beginning of the construct
Must not be deallocated during the construct
No Fortran pointers or assumed-size arrays
Order of Execution
Warning
No order is specified in which the threads' contributions are combined!
Repeated runs are typically not bit-identical, since floating-point rounding depends on that order
This is common in parallel computing
This is technically a race condition, which is typically tolerated
OpenMP 4.0 Enhancement
OpenMP 4.0 allows you to declare your own custom reductions.
User-Defined Reductions
Allows definition of custom reduction operations.
Use Cases
Particularly useful with derived data types:
C/C++:
struct
Fortran:
type
Requirements
You need to provide:
Combiner: Combines thread-private results to final result
Initializer: Initializes private contributions at outset
Case Study: Maximum Value and Its Position
Given a large array:
Determine the maximum value
Find the location (index) of the maximum in the array
Parallelization Strategy:
Assign a portion of array to each thread
Each thread determines maximum and position in its part
Use user-defined reduction to determine final result
User-Defined Reduction in Fortran
Step 1: Define the Data Type
type :: mx_s
real :: value
integer :: index
end type
Step 2: Declare the Reduction
!$omp declare reduction(maxloc: mx_s: &
!$omp mx_combine(omp_out, omp_in)) &
!$omp initializer(mx_init(omp_priv, omp_orig))
The operation can be triggered by the name maxloc
Utilizes subroutines mx_combine and mx_init
Acts on objects of type mx_s
The Initializer in Fortran
Can be a subroutine or assignment statement (here: subroutine).
Special Variables:
omp_priv: reference to variable to be initialized
omp_orig: reference to original variable prior to construct
Example Implementation
Initialize from value prior to construct:
subroutine mx_init(priv, orig)
type(mx_s), intent(out) :: priv
type(mx_s), intent(in) :: orig
priv%value = orig%value
priv%index = orig%index
end subroutine mx_init
The Combiner in Fortran
Can be a subroutine or assignment statement (here: subroutine).
Special Variables:
omp_in: reference to contribution from thread
omp_out: reference to combined result
Example Implementation
Replace if contribution is larger:
subroutine mx_combine(out, in)
type(mx_s), intent(inout) :: out
type(mx_s), intent(in) :: in
if (out%value < in%value) then
out%value = in%value
out%index = in%index
endif
end subroutine mx_combine
Using User-Defined Reduction in Fortran
mx%value = val(1)
mx%index = 1
!$omp parallel do reduction(maxloc: mx)
do i = 2, count
if (mx%value < val(i)) then
mx%value = val(i)
mx%index = i
endif
enddo
Easily readable code
Similar to what one would do in serial programming
Abstracts away the parallel complexity
User-Defined Reduction in C
Step 1: Define the Data Type
struct mx_s {
float value;
int index;
};
Step 2: Declare the Reduction
#pragma omp declare reduction(maxloc: \
struct mx_s: mx_combine(&omp_out, &omp_in)) \
initializer(mx_init(&omp_priv, &omp_orig))
The operation can be triggered by the name maxloc
Utilizes functions mx_combine and mx_init
Acts on objects of type struct mx_s
The Initializer in C
An expression (here: implemented with a function).
Special Variables:
omp_priv: reference to variable to be initialized
omp_orig: reference to original variable prior to construct
Example Implementation
Initialize from value prior to construct:
void mx_init(struct mx_s *priv, struct mx_s *orig)
{
priv->value = orig->value;
priv->index = orig->index;
}
The Combiner in C
An expression (here: implemented with a function).
Special Variables:
omp_in: reference to contribution from thread
omp_out: reference to combined result
Example Implementation
Replace if contribution is larger:
void mx_combine(struct mx_s *out, struct mx_s *in)
{
if (out->value < in->value) {
out->value = in->value;
out->index = in->index;
}
}
Using User-Defined Reduction in C
mx.value = val[0];
mx.index = 0;
#pragma omp parallel for reduction(maxloc: mx)
for (i = 1; i < count; i++) {
if (mx.value < val[i])
{
mx.value = val[i];
mx.index = i;
}
}
Easily readable code
Similar to what one would do in serial programming
Abstracts away the parallel complexity
Declaring a Reduction Operation: Syntax Summary
C Syntax
#pragma omp declare reduction (reduction-identifier : \
typename-list : combiner) [initializer-clause] new-line
Fortran Syntax
!$omp declare reduction(reduction-identifier : &
!$omp type-list : combiner) [initializer-clause]
Components
reduction-identifier: Name for your reduction
typename-list/type-list: Data types the reduction applies to
combiner: Expression (C/C++) or assignment statement/subroutine call (Fortran) that combines values
initializer-clause: Optional initialization specification
Dealing with Global Storage
By default, global storage is shared among all threads.
Examples of Global Storage
C/C++:
File scope variables
static variables
Fortran:
COMMON blocks
Module data
Variables with save attribute
This default behavior is not always what is needed.
Directive: threadprivate in C
The threadprivate directive makes global storage private to each thread.
#include <stdio.h>

int g_var = 1;
#pragma omp threadprivate(g_var)
int main()
{
g_var = 4;
#pragma omp parallel
{
printf("%d\n", g_var);
}
return 0;
}
Each thread gets a private copy
Outside parallel region: modifications affect master’s copy
Example Output
With 4 threads:
Thread 0 (master): 4
Thread 1: 1
Thread 2: 1
Thread 3: 1
Directive: threadprivate in Fortran
The threadprivate directive makes global storage private to each thread.
module gmod
integer :: g_var = 1
!$omp threadprivate(g_var)
end module gmod
program example
use gmod
g_var = 4
!$omp parallel
print *, g_var
!$omp end parallel
end program example
Each thread gets a private copy
Outside parallel region: modifications affect master’s copy
Example Output
With 4 threads:
Thread 0 (master): 4
Thread 1: 1
Thread 2: 1
Thread 3: 1
Clause: copyin
The copyin clause initializes threadprivate data from the master thread.
C Example
#include <stdio.h>

int g_var = 1;
#pragma omp threadprivate(g_var)
int main()
{
g_var = 4;
#pragma omp parallel \
copyin(g_var)
{
printf("%d\n", g_var);
}
return 0;
}
Fortran Example
module gmod
integer :: g_var = 1
!$omp threadprivate(g_var)
end module gmod
program example
use gmod
g_var = 4
!$omp parallel copyin(g_var)
print *, g_var
!$omp end parallel
end program example
Output
With 4 threads, all threads print: 4
More on threadprivate
Data Persistence
threadprivate data remains unchanged between parallel regions if:
Neither region is nested inside another parallel region
Both regions have the same thread count
Internal variable dyn-var is false in both regions
Use function omp_set_dynamic to control this
Fortran COMMON Blocks
In Fortran, you can make a COMMON block threadprivate:
integer :: a, b, c
COMMON /abccom/ a, b, c
!$OMP threadprivate(/abccom/)
Exercise
In a previous exercise, we parallelized a for loop with 20 iterations by evenly dividing the iterations among the available threads. A variable was used to count the iterations executed in the loop, and an atomic operation protected it from race conditions. Rewrite this code, but now use a reduction instead.
Solution
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 6b-forworksharing-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{

    int var1;
    int n = 20; // number of iterations
    var1 = 0;

#pragma omp parallel
    {
#ifdef _OPENMP
        // The purpose of this code is to add 1 to var1 20 times
#pragma omp for reduction(+:var1)
        for (int i = 0; i < n; i++)
            var1 += 1;

#else
        printf("Serial code!\n");
#endif
    }

    printf("var1 = %i \n", var1);

    return 0;
}
Exercise
In the following code, monitor the values of the variables at the different stages of the runtime.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 5b-datascope-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0 // serial fallback used by the lastprivate loop
#endif

int main()
{

    int var1, var2, var3; // Three variables
    var1 = 1;
    var2 = 2;
    var3 = 3;

#pragma omp parallel firstprivate(var1,var2) shared(var3)
    {

#ifdef _OPENMP
        printf("var1 = %i , var2 = %i , var3 = %i \n",var1,var2,var3);
        var1 = 10;
        var2 = 20;
        var3 = 30;
#else
        printf("Serial code!\n");
#endif
    }

    printf("var1 = %i , var2 = %i , var3 = %i \n",var1,var2,var3);


    int x = 0; // variable to hold the value from the last iteration

#pragma omp parallel for lastprivate(x)
    for (int i = 0; i < 10; i++) {
        x = i; // x is private to each thread, but will retain value from the last iteration
        printf("Thread %d: i = %d, x = %d\n", omp_get_thread_num(), i, x);
    }

    printf("After the loop, x = %d\n", x); // x has the value from the last iteration (i = 9)


    return 0;
}
Exercise
In the following code, monitor the values of the variable counter at the different stages of the runtime.
// On cluster Kebnekaise
// ml foss
// export OMP_NUM_THREADS=1
// gcc -O3 -march=native -fopenmp -o test.x 7-threadprivate-openmp.c -lm
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// declare a global variable
int counter;

// this variable is private to each thread
#pragma omp threadprivate(counter)

int main()
{

    counter = 0;

#pragma omp parallel
    {
#ifdef _OPENMP
        int thread_id = omp_get_thread_num();

        // Each thread sets its private copy of 'counter'
        counter = thread_id * 10;
        printf("Thread %d: counter = %d\n", thread_id, counter);

        // sync all threads
        #pragma omp barrier

        // Modify the thread-private variable
        counter += 5;
        printf("Thread %d after modification: counter = %d\n", thread_id, counter);
#else
        printf("Serial code!\n");
#endif
    }

    // Outside the parallel region, 'counter' refers to the master thread's copy,
    // which thread 0 modified inside the region
    printf("In main thread, counter = %d\n", counter);

#pragma omp parallel
    {
#ifdef _OPENMP
        int thread_id = omp_get_thread_num();

        // print the value of 'counter' in another parallel region
        printf("Thread %d: counter = %d in the second parallel region\n", thread_id, counter);

#else
        printf("Serial code!\n");
#endif
    }

    return 0;
}
Summary
This guide covered special private variables in OpenMP:
Special Private Variable Types
firstprivate: Initialization of private variables from master thread
lastprivate: Set value of private variable to value of last loop iteration or last section at end of construct
reduction: Calculating sums, products, etc. in parallel
threadprivate: Privatize global storage
User-Defined Reductions
Available in OpenMP 4.0+
Useful for complex data types
Requires combiner and initializer functions
When to Use Standard Constructs
The above constructs handle standard situations. For special cases, use:
Explicit initialization of private variables from shared variables
atomic/critical for writes to shared variables
