Profiling analysis with VTune and Advisor

Objectives

Learn to use the profiling tools VTune (for OpenMP codes) and Advisor (for Roofline analysis)

VTune Workflow

Load your compiler tool: ml foss
Copy/paste the following C code, which contains an OpenMP parallel implementation (at this point you are not expected to understand the OpenMP directives):

// On cluster Kebnekaise
// ml foss
// gcc -O3 -march=native -g -fopenmp -o test.x fibonacci_recursion_omp_tasking.c -lm
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

unsigned long long fibbonacci(int n) {
    if(n < 2)
        return n;
    else {
        unsigned long long left, right;     //shared variables
        if (n<40)
        {
            left = fibbonacci(n-1);
            right = fibbonacci(n-2);
            return left+right;
        }
        else {
            #pragma omp task shared(left) firstprivate(n)
            left = fibbonacci(n-1);
            #pragma omp task shared(right) firstprivate(n)
            right = fibbonacci(n-2);

            #pragma omp taskwait //sync tasks
            return left + right;
        }
    }
}

int main(int argc, char *argv[]) {
    int n;
    int i;

    if(argc > 1)
        n = atoi(argv[1]);
    else
    {
        printf("Give n : "); scanf("%d", &n);
    }

    omp_set_dynamic(0);

    #pragma omp parallel shared(n)
    {
        #pragma omp single
        printf("F(%d) = %llu\n",n,fibbonacci(n));
    }

}
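
Before profiling, it can help to confirm that the program prints the right value. The following is a minimal sketch of an iterative reference, not part of the course code: the function name `fib_iterative` is a hypothetical helper introduced here for illustration. For n = 56 it returns 225851433717, so the program above would be expected to print F(56) = 225851433717.

```c
#include <assert.h>

/* Hypothetical helper (not in the course code): iterative Fibonacci,
   used only as a reference value to check the recursive program. */
unsigned long long fib_iterative(int n) {
    /* F(93) is the largest Fibonacci number that fits in 64 unsigned bits. */
    assert(n >= 0 && n <= 93);

    if (n == 0)
        return 0;

    unsigned long long prev = 0;  /* F(i-1) */
    unsigned long long curr = 1;  /* F(i)   */
    for (int i = 2; i <= n; ++i) {
        unsigned long long next = prev + curr;
        prev = curr;
        curr = next;
    }
    return curr;
}
```

Calling `fib_iterative(56)` from a small test program and comparing the result with the output of test.x is enough for this check; the loop needs only n-1 additions, so it finishes immediately.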

Copy/paste the following batch script job_vtune.sh to submit the job to Kebnekaise’s batch queue:

#!/bin/bash
#SBATCH -A hpc2n202X-XYZ
#SBATCH -N 1
#SBATCH -c 10
#SBATCH --time=00:10:00
#SBATCH --mail-type=END
#SBATCH -C skylake


export OMP_NUM_THREADS=10

# Load VTUNE
ml VTune/2021.6.0
# Load foss
ml foss

vtune -collect hotspots -app-working-dir /path-to-your-folder -- /path-to-your-folder/executable list-of-arguments

Compile your code

gcc -O3 -march=native -g -fopenmp -o test.x fibonacci_recursion_omp_tasking.c -lm

Fix the paths in the job_vtune.sh script so that they point to the directory where you built the executable test.x.
Also correct the project ID. Then, submit the job with sbatch job_vtune.sh.
In this script, the number of threads is set to 10; with this setting, the run for the Fibonacci number 56 takes about 2 min.
Once the job finishes, load the VTune module on the terminal with ml VTune/2021.6.0 and launch the GUI with vtune-gui. Then, load the r*hs project:

If you don’t see a project, go to Open Result and choose the r*hs project and then the *.vtune file.

Intel’s tutorial:

You can then see the different types of results for this hotspots analysis:

Advisor Workflow

Step 1: Compile Your Code

Load your compiler tool:
```
ml foss
```
Use this code:

// On cluster Kebnekaise
// ml foss
// gcc -O3 -march=native -g -o test.x fibonacci_recursion.c -lm
#include <stdio.h>
#include <stdlib.h>

unsigned long long fibbonacci(int n) {
    if(n == 0){
        return 0;
    } else if(n == 1) {
        return 1;
    } else {
        return (fibbonacci(n-1) + fibbonacci(n-2));
    }
}

int main(int argc, char *argv[]) {
    int n;
    int i;

    if(argc > 1)
        n = atoi(argv[1]);
    else
    {
        printf("Give n : "); scanf("%d", &n);
    }

    printf("%llu ",fibbonacci(n));
}

and compile it:

gcc -O3 -march=native -g -o test.x fibonacci_recursion.c -lm

Copy/paste the following batch script job_advisor.sh. Fix the path so that it points to the directory where you obtained the executable test.x, and correct the project ID:

#!/bin/bash
#SBATCH -A hpc2n202X-XYZ
#SBATCH -c 1
#SBATCH --time=00:10:00
#SBATCH --mail-type=END
#SBATCH -C skylake

# Load Intel Advisor tool
ml Advisor/2023.2.0
# Load foss
ml foss

advisor --collect=roofline --project-dir=./advi_results -- ./executable list-of-arguments

Submit the job with the standard command: sbatch job_advisor.sh

Note: This script for the Fibonacci number 50 takes approximately 6 minutes.
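
The long run time follows from the exponential growth of the naive recursion. As a rough illustration, the sketch below counts how many function calls the recursion makes for a given n; `naive_call_count` is a hypothetical helper introduced here, not part of the course code. For n = 50 it gives 40730022147 calls, roughly 4 x 10^10. The optimized binary may execute fewer actual calls, since a compiler at -O3 is free to transform the recursion, so treat the number as an order-of-magnitude guide.

```c
#include <assert.h>

/* Hypothetical helper: number of calls made by the naive recursion
   fib(n) = fib(n-1) + fib(n-2), counting the top-level call.
   Recurrence: calls(n) = calls(n-1) + calls(n-2) + 1,
   with calls(0) = calls(1) = 1. */
unsigned long long naive_call_count(int n) {
    /* Conservative upper bound so the count stays within 64 bits. */
    assert(n >= 0 && n <= 90);

    unsigned long long prev = 1;  /* calls(i-1) */
    unsigned long long curr = 1;  /* calls(i)   */
    for (int i = 2; i <= n; ++i) {
        unsigned long long next = prev + curr + 1;
        prev = curr;
        curr = next;
    }
    return curr;
}
```

Each increment of n multiplies the count by about 1.6, which is why small changes in the chosen Fibonacci number change the job time substantially.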

Step 2: View Results with Advisor GUI

Once the job finishes:

Load the Advisor module on the terminal:
```
ml Advisor/2023.2.0
```
Launch the GUI:
```
advisor-gui
```
Go to Open Project

Find the advi_results folder

Choose the advi_results.advixeproj file

Click Show Results

Measuring Code Performance

Performance Metrics Formula

Floating Point Operations per second (FLOPS):

\[\text{FLOPS} = \frac{\text{Nr.FLOP}}{1\text{sec}} = \frac{\text{Nr.FLOP}}{\text{Byte}} \times \frac{\text{Byte}}{\text{sec}}\]

\[= \text{Arithmetic Intensity (AI)} \times \text{Bandwidth (BW)}\]
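
As a worked illustration of this product, consider the update \(y_i \leftarrow a\,x_i + y_i\) in double precision. The kernel and the bandwidth value below are assumptions chosen for easy arithmetic, not measurements from Kebnekaise. Each element needs 2 FLOP (one multiplication, one addition) and, ignoring cache reuse, moves 24 bytes (two 8-byte loads and one 8-byte store):

\[\text{AI} = \frac{2\ \text{FLOP}}{24\ \text{Byte}} = \frac{1}{12}\ \frac{\text{FLOP}}{\text{Byte}}\]

With an assumed bandwidth of 120 GByte/sec, the formula gives

\[\text{FLOPS} = \frac{1}{12}\ \frac{\text{FLOP}}{\text{Byte}} \times 120\ \frac{\text{GByte}}{\text{sec}} = 10\ \text{GFLOPS}\]

This value is what the memory system allows; the processor's peak FLOPS caps it for kernels with a high AI.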

Roofline Model

The roofline model visualizes performance bottlenecks:

X-axis: Log(AI) - Arithmetic Intensity
Y-axis: Log(FLOPS) - Floating Point Operations per Second
Diagonal line: Represents memory bandwidth constraint
Horizontal line: Represents peak FLOPS (compute capability)

Performance Regions:

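Memory Bound: Performance limited by bandwidth (AI to the left of the point where the diagonal and horizontal lines intersect, so the kernel sits under the diagonal line)
Compute Bound: Performance limited by computation capability (AI to the right of that intersection, so the kernel sits under the horizontal line)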
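
The two lines of the roofline can be combined into a single expression: attainable performance is the smaller of the peak FLOPS and AI times bandwidth. The sketch below illustrates this under simplifying assumptions (a single bandwidth and a single peak, whereas Advisor draws several roofs for different cache levels and instruction types). All function and type names are hypothetical and introduced only for this example.

```c
#include <assert.h>

typedef enum { MEMORY_BOUND, COMPUTE_BOUND } roofline_region;

/* Attainable performance in GFLOPS: min(peak, AI * BW).
   ai              : arithmetic intensity in FLOP/Byte
   bw_gbyte_per_s  : memory bandwidth in GByte/sec
   peak_gflops     : peak compute capability in GFLOPS */
double roofline_attainable_gflops(double ai, double bw_gbyte_per_s,
                                  double peak_gflops) {
    assert(ai > 0.0 && bw_gbyte_per_s > 0.0 && peak_gflops > 0.0);
    double bandwidth_limit = ai * bw_gbyte_per_s;  /* diagonal line */
    return bandwidth_limit < peak_gflops ? bandwidth_limit : peak_gflops;
}

/* AI at which the diagonal line meets the horizontal line. */
double roofline_ridge_point(double bw_gbyte_per_s, double peak_gflops) {
    assert(bw_gbyte_per_s > 0.0);
    return peak_gflops / bw_gbyte_per_s;
}

/* Kernels with AI below the ridge point are limited by bandwidth. */
roofline_region roofline_classify(double ai, double bw_gbyte_per_s,
                                  double peak_gflops) {
    return ai < roofline_ridge_point(bw_gbyte_per_s, peak_gflops)
               ? MEMORY_BOUND
               : COMPUTE_BOUND;
}
```

For example, with an assumed bandwidth of 100 GByte/sec and an assumed peak of 1000 GFLOPS, the intersection lies at AI = 10 FLOP/Byte: a kernel with AI = 0.5 is memory bound and can reach at most 50 GFLOPS, while a kernel with AI = 16 is compute bound and is capped at 1000 GFLOPS.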

More details: https://www.telesens.co/2018/07/26/understanding-roofline-charts/

Understanding Roofline Analysis Results

The Roofline analysis provides insights into code performance:

Code Analytics Section

In the Code Analytics view, you can see:

GFLOPS = Giga floating point operations per second
GINTOPS = Giga integer operations per second

These metrics show the number of operations per second (floating point or integers) for expensive functions in your code.
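
Both metrics are rates: an operation count divided by the elapsed time and scaled by 10^9. The sketch below shows only this unit conversion; `giga_ops_per_second` is a hypothetical helper, and how Advisor counts the operations internally is not covered here.

```c
#include <assert.h>

/* Hypothetical helper: convert an operation count and an elapsed time
   into giga operations per second (GFLOPS for floating point
   operations, GINTOPS for integer operations). */
double giga_ops_per_second(unsigned long long operations,
                           double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return (double)operations / elapsed_seconds / 1.0e9;
}
```

For instance, 4 x 10^9 integer operations completed in 2 seconds correspond to 2 GINTOPS.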

Key Metrics Displayed

Performance: Operations per second for each function
Arithmetic Intensity: Ratio of compute operations to bytes of memory traffic (operations per byte)
Elapsed Time: Time spent in each function
Top Functions: Most computationally expensive functions

The visualization helps identify:

Whether your code is memory-bound or compute-bound
Optimization opportunities
Performance bottlenecks in specific functions

Exercise

Use the code provided in the VTune section and run the code with 8 threads for the Fibonacci number 56. Use the VTune GUI to obtain the Elapsed Time, and the Top Hotspots in the Summary tab.

Go to the Bottom-up tab and look at the Effective Time by Utilization of the functions. Which one has Poor utilization?

In the plot at the bottom one can see the CPU Utilization for each Thread. The CPU Time shows when the threads are doing some work; otherwise they are idle. Ideally, the plot would show brown bars denoting fully occupied threads. How does this plot look for the present code? Does the behavior of the individual threads explain the Effective Time by Utilization above?

Exercise

Use the code example in the Advisor section and collect results for the Fibonacci number 50. Then, in the Advisor GUI, obtain the GFLOPS and GINTOPS in the Code Analytics tab.

In the Source tab, do you see the part of the code that could be improved?

Summary

Intel Advisor’s Roofline analysis helps you:

Understand performance characteristics of your code
Identify optimization opportunities
Determine if your code is limited by memory bandwidth or compute capability
Focus optimization efforts on the most impactful areas