NOTE: The GPU resources are primarily owned by the
imacm
andetechnik
groups. All other groups are allowed to submit a limited number of jobs with a lower priority to increase utilization. We may have to block access to GPU resources by other groups in times of high demand. You can check your own group affiliation withsshare -U
.
NOTE: There is a general GPU tutorial available at hpc-wiki.info/hpc/GPU_Tutorial.
Slurm: Using GPUs
This guide gives a brief overview of our available GPU resources. Feel free to contact us if there are problems or if you expect something to behave differently.
Hardware
There are 5 identical GPU nodes available, called gpu21[001-005]
. Each node has the following specs:
- 8 GPUs: NVidia HGX A100
- 128 Cores: Two sockets with AMD EPYC 7763 64-Core Processors each
- 2TB Memory, resulting in 16GB per core
- 14TB of disk space at
/data/
. Please clean up your data when you are done
Certain cores are associated to a certain GPU and can check the affinity between them with nvidia-smi
:
$ srun -N1 -p gpu -A <yourGroupAccount>_gpu -n1 --cpus-per-task 128 --gpus 8 --gpus-per-task 8 nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 SYS 48-63 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 SYS 48-63 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 PXB 16-31 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 PXB 16-31 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS 112-127 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS 112-127 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS 80-95 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS 80-95 5
mlx5_0 SYS SYS PXB PXB SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NOTE: As a consequence of the CPU <-> GPU association, you should always request exactly 16 CPU cores per GPU. This is already the default, so you don’t have to request the number of CPU cores yourself.
Job Submission and Accounting
The GPU nodes are available in their own gpu
partition:
$ scontrol show partition gpu
PartitionName=gpu
AllowGroups=ALL AllowAccounts=astro_gpu,cobra_gpu,fugg_gpu,imacm_gpu,lrz_gpu,modellierung_gpu,ops_gpu,optimierung_gpu,risiko_gpu,stroemung_gpu,whep_gpu,zim_gpu AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=gpu21[001-005]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=640 TotalNodes=5 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=16000 MaxMemPerCPU=32000
As you can see in AllowAccounts
, we only allow a special purpose Slurm account with a suffix of _gpu
to submit jobs to the node. Every user is associated to their groups account and a corresponding _gpu
account. This way we can have separate fairshare configurations between CPU resources and GPU resources, which might be necessary to prioritize research groups with a larger demand for GPU resources. You can check the current share distribution with sshare
yourself.
Slurm treats GPUs as consumable resources and you must specify how many GPUs you would like to request for your job with --gpus 8
. Additionally you should decide how many tasks (e.g. processes, option -n
) you plan to run on a given node and how many --gpus-per-task
and --cpus-per-task
you require. All of these options depend on the scaling behavior of you code and how many CPUs per GPU you need.
To summarize, whenever you intend to submit a job with GPU resources, consider the following options:
-p gpu
to submit to the gpu partition-A <yourGroupAccount>_gpu
to use your groups special purpose account for the fairshare evaluation- Number of nodes, e.g.
-N1
, and number of tasks, e.g.-n1
--gpus <N>
, where N is the number of GPUs you require for your job (up to 8 per node)--gpus-per-task <N>
to set the number of GPUs to be used in a single process/task (up to 8)--cpus-per-task <N>
to set the number of CPU cores in a single task (up to 128 per node)
Here is an example job script, submitting 8 processes, each with 16 cores and one GPU to a single node:
#!/bin/sh
#SBATCH --job-name=gputests
#SBATCH --partition=gpu
#SBATCH --account=<groupname>_gpu
#SBATCH -N 1
#SBATCH --ntasks 8
#SBATCH --cpus-per-task 16 # Always request 16 cores per GPU. This line can be omitted
#SBATCH --gpus-per-task 1
#SBATCH --time=0-01:00:00
#SBATCH -o %x-%j.out
srun -n8 nvidia-smi topo -m
Which GPUs is my Job Using?
When you use only a couple of available GPUs on a system, you might wonder which. When submitting your jobs through Slurm, your job will have the CUDA_VISIBLE_DEVICES
environment variable, which lists your GPU IDs, e.g.:
user@gpu21005:~$ env | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=2,6
The GPU IDs range from 0 to 7, and match the GPU IDs reported via the nvidia-smi
tool or in our monitoring system.
You may add echo $CUDA_VISIBLE_DEVICES
to your Slurm job script in order to see which GPUs each job is using.
Software
Some basic packages are installed on all GPU nodes:
# yum list installed | grep nvidia
Loaded plugins: fastestmirror, nvidia
cuda-drivers.x86_64 495.29.05-1 @local-centos7.9-x86_64--install-repos-nvidia
kmod-nvidia-latest-dkms.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-driver-latest-dkms.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-driver-latest-dkms-NVML.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-driver-latest-dkms-NvFBCOpenGL.x86_64
3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-driver-latest-dkms-cuda.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-driver-latest-dkms-cuda-libs.x86_64
3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-driver-latest-dkms-devel.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-driver-latest-dkms-libs.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-fabric-manager.x86_64 495.29.05-1 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-libXNVCtrl.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-libXNVCtrl-devel.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-modprobe-latest-dkms.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-persistenced-latest-dkms.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-settings.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
nvidia-xconfig-latest-dkms.x86_64 3:495.29.05-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
yum-plugin-nvidia.noarch 0.5-1.el7 @local-centos7.9-x86_64--install-repos-nvidia
Additionally, you can try some of the available modules with GPU related features:
CUDA/11.4.2
NVHPC/21.7
: The successor of the PGI compilers- Various debuggers and profilers, e.g.
TotalView/2021.3.9
andARMForge/21.1.1
, as well as Intel parallel studio XE TensorFlow/2.6.0
In all of these cases you would run module load 2021a <module/version>
in your sbatch
job script or your interactive salloc
/srun
allocation to gain access to the corresponding tools. The modules have not been tested in this context yet, so please get in touch if something does not work as expected.
Stop GPU Jobs with Low Utilization
You may want to make sure that your GPU job is canceled, if it is not utilizing the GPUs well enough. With this approach you stop early in case of bugs or other issues, which improves the availability of GPU resources for everyone.
Following the example of a dedicated PLEIADES user, you could create a shell script monitor_gpu_usage.sh
:
#! /usr/bin/env bash
# Specify the idle time threshold in seconds (e.g., 1800 seconds for 30 minutes)
IDLE_THRESHOLD=${IDLE_THRESHOLD:-1800} # Default to 30 minutes if not set
# Function to check GPU usage
check_gpu_usage() {
local gpu_usage
# Use CUDA_VISIBLE_DEVICES if set, otherwise fall back to GPU_DEVICE_ORDINAL
local gpu_list=${CUDA_VISIBLE_DEVICES:-$GPU_DEVICE_ORDINAL}
if [ -z "$gpu_list" ]; then
echo "No GPUs allocated to this job"
return 1
fi
gpu_usage=$(nvidia-smi --id=$(echo $gpu_list | tr ',' ' ') --query-gpu=utilization.gpu --format=csv,noheader,nounits)
# If all GPUs are below the threshold (e.g., 5%), return 1
for usage in $gpu_usage; do
if [ "$usage" -gt 5 ]; then
return 0
fi
done
return 1
}
# Record the start time
start_time=$(date +%s)
while true; do
sleep 60 # Check every 60 seconds
if check_gpu_usage; then
# If GPUs are not idle, update the start time
start_time=$(date +%s)
else
# Calculate the elapsed time since GPUs became idle
current_time=$(date +%s)
elapsed_time=$((current_time - start_time))
if [ "$elapsed_time" -ge "$IDLE_THRESHOLD" ]; then
# If the elapsed time exceeds the threshold, exit the job
echo "GPUs have been idle (below 5% utilization) for more than $((IDLE_THRESHOLD / 60)) minutes. Terminating job."
scancel $SLURM_JOB_ID
exit 0
fi
fi
done
You have to tune the $IDLE_THRESHOLD
to your workload though! Downloading or preparing large datasets and similar preparation tasks naturally have low GPU utilization.
In your Slurm job, you could start the script in background and kill it after your main workload is done:
#! /usr/bin/env bash
#SBATCH -p gpu
#SBATCH ...
# Do preparation tasks with low GPU utilization here
# e.g. Download or prepare some dataset
bash monitor_gpu_usage.sh & # Start monitor script in background
MONITOR_PID=$! # Remember PID of monitor script
# Put your work here
some_command # main workload
# Anything you wish to do
kill $MONITOR_PID # Kill monitoring if main workload is done