Profiling analysis with VTune and Advisor
Objectives
Learn the profiling tool VTune for OpenMP codes
VTune Workflow
Load your compiler tool:
ml foss
Copy/paste the following C code that contains an OpenMP parallel implementation (at this point you are not expected to understand the OpenMP directives):
// On cluster Kebnekaise
// ml foss
// gcc -O3 -march=native -g -fopenmp -o test.x fibonacci_recursion_omp_tasking.c -lm
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
unsigned long long fibbonacci(int n) {
    if (n < 2)
        return n;
    else {
        unsigned long long left, right; // shared between the two tasks
        if (n < 40) {
            // small subproblems: plain recursion, task overhead not worth it
            left = fibbonacci(n-1);
            right = fibbonacci(n-2);
            return left + right;
        }
        else {
            #pragma omp task shared(left) firstprivate(n)
            left = fibbonacci(n-1);
            #pragma omp task shared(right) firstprivate(n)
            right = fibbonacci(n-2);
            #pragma omp taskwait // wait for both tasks to complete
            return left + right;
        }
    }
}
int main(int argc, char *argv[]) {
    int n;
    if (argc > 1)
        n = atoi(argv[1]);
    else {
        printf("Give n : ");
        scanf("%d", &n);
    }
    omp_set_dynamic(0); // keep the thread count fixed
    #pragma omp parallel shared(n)
    {
        #pragma omp single
        printf("F(%d) = %llu\n", n, fibbonacci(n));
    }
    return 0;
}
Copy/paste the following batch script job_vtune.sh to send the job to Kebnekaise’s batch queue:
#!/bin/bash
#SBATCH -A hpc2n202X-XYZ
#SBATCH -N 1
#SBATCH -c 10
#SBATCH --time=00:10:00
#SBATCH --mail-type=END
#SBATCH -C skylake
export OMP_NUM_THREADS=10
# Load VTUNE
ml VTune/2021.6.0
# Load foss
ml foss
vtune -collect hotspots --app-working-dir=/path-to-your-folder -- /path-to-your-folder/executable list-of-arguments
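If you want a first look at the results before opening the GUI, vtune can also print text reports from a result directory on the command line. The r000hs name below follows VTune's default result-directory naming for a hotspots collection; your directory may be named differently.

```shell
# Summary of the collection (elapsed time, top hotspots)
vtune -report summary -result-dir r000hs

# Per-function hotspots, sorted by CPU time
vtune -report hotspots -result-dir r000hs
```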
Compile your code
gcc -O3 -march=native -g -fopenmp -o test.x fibonacci_recursion_omp_tasking.c -lm
Fix the paths in the job_vtune.sh script to point to the directory containing the executable test.x.
Also correct the project ID. Then, submit the job with sbatch job_vtune.sh.
In this script, the number of threads is set to 10; for the Fibonacci number 56 the run takes about 2 minutes.
Once the job finishes, load the VTune module on the terminal:
ml VTune/2021.6.0
and launch the GUI: vtune-gui. Then, load the r*hs project.
If you don’t see a project, go to Open Result, choose the r*hs project, and then the *.vtune file.
You can then see the different types of results for this hotspots analysis:
Advisor Workflow
Step 1: Compile Your Code
Load your compiler tool:
ml foss
Use this code:
// On cluster Kebnekaise
// ml foss
// gcc -O3 -march=native -g -o test.x fibonacci_recursion.c -lm
#include <stdio.h>
#include <stdlib.h>
unsigned long long fibbonacci(int n) {
    if (n == 0) {
        return 0;
    } else if (n == 1) {
        return 1;
    } else {
        return fibbonacci(n-1) + fibbonacci(n-2);
    }
}
int main(int argc, char *argv[]) {
    int n;
    if (argc > 1)
        n = atoi(argv[1]);
    else {
        printf("Give n : ");
        scanf("%d", &n);
    }
    printf("%llu\n", fibbonacci(n));
    return 0;
}
and compile it:
gcc -O3 -march=native -g -o test.x fibonacci_recursion.c -lm
Fix the paths in the job_advisor.sh script to point to the directory containing the executable test.x. Also correct the project ID. Submit the job:
#!/bin/bash
#SBATCH -A hpc2n202X-XYZ
#SBATCH -c 1
#SBATCH --time=00:10:00
#SBATCH --mail-type=END
#SBATCH -C skylake
# Load Intel Advisor tool
ml Advisor/2023.2.0
# Load foss
ml foss
advisor --collect=roofline --project-dir=./advi_results -- ./executable list-of-arguments
with the standard command: sbatch job_advisor.sh
Note: This script for the Fibonacci number 50 takes approximately 6 minutes.
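As with VTune, you can inspect Advisor results without the GUI. Recent Advisor versions can print a text report of the survey analysis or export the roofline chart as a standalone HTML file from the project directory; the commands below assume the advi_results project directory created by the batch script above.

```shell
# Text report of the survey analysis (top time-consuming loops/functions)
advisor --report=survey --project-dir=./advi_results

# Export an interactive roofline chart to a standalone HTML file
advisor --report=roofline --project-dir=./advi_results --report-output=./roofline.html
```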
Step 2: View Results with Advisor GUI
Once the job finishes:
Load the Advisor module on the terminal:
ml Advisor/2023.2.0
Launch the GUI:
advisor-gui
Go to Open Project
Find the advi_results folder
Choose the advi_results.advixeproj file
Click Show Results
Measuring Code Performance
Performance Metrics Formula
Floating Point Operations per second (FLOPS):
FLOPS = (number of floating-point operations) / (execution time in seconds)
These are usually reported as GFLOPS (10^9 FLOPS).
Roofline Model
The roofline model visualizes performance bottlenecks:
X-axis: Log(AI) - Arithmetic Intensity
Y-axis: Log(FLOPS) - Floating Point Operations per Second
Diagonal line: Represents memory bandwidth constraint
Horizontal line: Represents peak FLOPS (compute capability)
Performance Regions:
Memory Bound: Performance limited by memory bandwidth (points under the sloped part of the roofline, left of the ridge point)
Compute Bound: Performance limited by computation capability (points under the horizontal ceiling, right of the ridge point)
More details: https://www.telesens.co/2018/07/26/understanding-roofline-charts/
Understanding Roofline Analysis Results
The Roofline analysis provides insights into code performance:
Code Analytics Section
In the Code Analytics view, you can see:
GFLOPS = Giga floating point operations per second
GINTOPS = Giga integer operations per second
These metrics show the number of operations per second (floating point or integers) for expensive functions in your code.
Key Metrics Displayed
Performance: Operations per second for each function
Arithmetic Intensity: Ratio of compute operations (FLOPs) to bytes moved to/from memory
Elapsed Time: Time spent in each function
Top Functions: Most computationally expensive functions
The visualization helps identify:
Whether your code is memory-bound or compute-bound
Optimization opportunities
Performance bottlenecks in specific functions
Exercise
Use the code provided in the VTune section and run the code with 8 threads for the Fibonacci number 56. Use the VTune GUI to obtain the Elapsed Time, and the Top Hotspots in the Summary tab.
Go to the Bottom-up tab and look at the Effective Time by Utilization of the functions. Which one shows Poor utilization?
In the plot at the bottom one can see the CPU Utilization for each thread. The CPU Time shows when the threads are doing some work; otherwise they are idle. Ideally, the plot would show brown bars denoting fully occupied threads. How does this plot look for the present code? Does the behavior of the individual threads explain the Effective Time by Utilization above?
Exercise
Use the code example in the Advisor section and collect results for the Fibonacci number 50. Then, in the Advisor GUI, obtain the GFLOPS and GINTOPS in the Code Analytics tab.
In the Source tab, do you see the part of the code that could be improved?
Summary
Intel Advisor’s Roofline analysis helps you:
Understand performance characteristics of your code
Identify optimization opportunities
Determine if your code is limited by memory bandwidth or compute capability
Focus optimization efforts on the most impactful areas