Collector

The sams-collector runs on the compute node and collects information about the running jobs. The collector uses three types of modules: a pidfinder, samplers, and outputs.

The pidfinder module finds the process IDs (PIDs) that belong to a job.

The sampler modules get the PIDs from the pidfinder and collect metrics about the processes.

The output modules store the results of the samplers.

The collector needs to know which Slurm job it is collecting information about; provide the job ID with the --jobid command-line option. In Slurm this must be the JobIDRaw, not the job ID with the job array extension (NNNNNN_A).
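As a rough illustration of the distinction, a wrapper script could refuse array-style IDs before starting the collector. The helper name `is_raw_jobid` is hypothetical and not part of SAMS; it only assumes that a raw job ID is purely numeric, while the array form contains an underscore.

```shell
#!/bin/sh
# Hypothetical helper (not part of SAMS): succeed only if the argument
# looks like a raw job ID, i.e. is non-empty and purely numeric.
is_raw_jobid() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;  # empty, or contains a non-digit such as "_"
        *)           return 0 ;;
    esac
}

# A raw ID passes the check, the array form (NNNNNN_A) does not.
is_raw_jobid "123456"   && echo "123456 looks like a raw job ID"
is_raw_jobid "123456_7" || echo "123456_7 is an array-style job ID"
```

If only the array form is at hand, sacct can show both forms side by side through its JobID and JobIDRaw format fields.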

Configuration

Key                          Description
pid_finder_update_interval   The number of seconds to wait before looking for new PIDs.
pid_finder                   Name of the plugin that finds PIDs.
samplers                     A list of plugins that sample metrics about the PIDs.
outputs                      A list of plugins that store the metrics from the samplers.

Here is an example configuration file.

---
sams-collector:  
  pid_finder_update_interval: 30
  pid_finder: sams.pidfinder.Slurm
  samplers:
    - sams.sampler.Core
    - sams.sampler.Software
    - sams.sampler.SlurmInfo
  outputs:
    - sams.output.File

  umask: '077' # only used in daemon mode.
  logfile: /var/log/sams-collector.%(jobid)s.%(node)s.log
  loglevel: ERROR

sams.pidfinder.Slurm:
  grace_period: 600

sams.sampler.SlurmInfo:
  sampler_interval: 30

sams.sampler.Software:
  sampler_interval: 30

sams.output.File:
  base_path: /var/spool/softwareaccounting/data
  file_pattern: "%(jobid)s.%(node)s.json"
  jobid_hash_size: 1000

Invoking from Slurm

In the Slurm prolog, start the collector:

sams-collector.py --config=/path/config.yaml --jobid=$SLURM_JOB_ID --daemon \
  --pidfile=/var/run/sams-collector.$SLURM_JOB_ID

The sams-collector needs to run as root.

In the Slurm epilog, signal the collector with kill -HUP. If the HUP signal is never sent, the collector exits after 10 minutes without active processes.
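The epilog step could look roughly like the sketch below. It assumes the pidfile path given to --pidfile in the prolog command above; the function name `stop_collector` is hypothetical, and the checks around the kill are defensive additions rather than something SAMS requires.

```shell
#!/bin/sh
# Hypothetical epilog fragment: send HUP to the collector named in a pidfile.
stop_collector() {
    pidfile="$1"
    # Nothing to do if the collector never wrote a pidfile.
    [ -r "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    # Ignore an empty or non-numeric pidfile instead of signalling blindly.
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;
    esac
    # Signal only if the process still exists.
    if kill -0 "$pid" 2>/dev/null; then
        kill -HUP "$pid"
    fi
    return 0
}

# Slurm sets SLURM_JOB_ID in the epilog; skip the call when it is unset.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    stop_collector "/var/run/sams-collector.${SLURM_JOB_ID}"
fi
```

Returning 0 in every branch keeps a missing or stale pidfile from turning into an epilog failure.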

Using Systemd

The collector can also be started and stopped through systemd, using a template unit whose instance name (%i) is the job ID.

Create the file /etc/systemd/system/softwareaccounting@.service with the following content:

[Unit]
Description=Software Accounting (%i)

[Service]
PIDFile=/var/run/softwareaccounting.%i.pid
ExecStart=/opt/softwareaccounting/bin/sams-collector.py --jobid=%i --config=/etc/slurm/softwareaccounting.yaml
KillSignal=SIGHUP
KillMode=process

To start the accounting process, run

systemctl start softwareaccounting@${SLURM_JOB_ID}.service

in the Slurm prolog, and

systemctl stop softwareaccounting@${SLURM_JOB_ID}.service

in the Slurm epilog.
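To make the mapping between job ID and unit name explicit, here is a small sketch. The helper `sams_unit` is hypothetical; it only builds the instance name that systemd substitutes for %i in the unit file.

```shell
#!/bin/sh
# Hypothetical helper: build the systemd instance unit name for a job ID.
sams_unit() {
    printf 'softwareaccounting@%s.service\n' "$1"
}

# For job 123456 systemd replaces %i with 123456, so the unit file's
# --jobid=%i becomes --jobid=123456.
sams_unit 123456

# In the prolog and epilog this could be used as:
#   systemctl start "$(sams_unit "$SLURM_JOB_ID")"
#   systemctl stop  "$(sams_unit "$SLURM_JOB_ID")"
```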