User documentation for the PLEIADES cluster at the University of Wuppertal
Note: Your files on BeeGFS are not backed up! Make sure to regularly store important results and data on another device. In case of a catastrophic failure, we won’t be able to restore the data.
Many and frequent accesses to files on `/beegfs` can produce significant load on the metadata servers. As a consequence, the responsiveness of the shared filesystem degrades for all users. Because of this, there are a couple of things you should avoid when working on `/beegfs`:
- Avoid listing directories with many files, either explicitly via `ls` or implicitly through any `readdir` operation in your program or programming language. Each lookup results in locked operations on the metadata server. This can also happen if you frequently check for file status in your job scripts.
- Avoid using `/beegfs` as your working directory for frequent file I/O in your job. Please consider using the local `/tmp` storage instead; every worker node is equipped with fast 2TB SSDs for exactly this purpose (see the sketch after this list).
- Avoid storing many small files on `/beegfs`. If possible, pack them into a single `.tar` archive, since one large file is much easier for a parallel filesystem to digest than many small ones.
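To illustrate the last two points, here is a minimal sketch of a job script that stages packed input from `/beegfs` to the node-local `/tmp`, does its frequent I/O there, and packs the results back; all paths, file names, and the program call are hypothetical placeholders:

```bash
#!/bin/bash
#SBATCH -p short
#SBATCH -n 1
#SBATCH -t 60

# Hypothetical locations -- adapt to your own directory layout.
INPUT=/beegfs/$USER/data/input.tar
RESULTS=/beegfs/$USER/results

# Stage the packed input to the node-local SSD and unpack it there.
WORKDIR=$(mktemp -d /tmp/job_${SLURM_JOB_ID}_XXXX)
cd "$WORKDIR"
cp "$INPUT" .
tar -xf input.tar

# Do the frequent file I/O on /tmp, not on /beegfs.
./my_analysis input/    # hypothetical program

# Pack the (potentially many small) output files into one archive
# before copying them back to the parallel filesystem.
tar -cf "results_${SLURM_JOB_ID}.tar" output/
cp "results_${SLURM_JOB_ID}.tar" "$RESULTS"/

# Clean up the node-local scratch space.
cd /
rm -rf "$WORKDIR"
```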
If your job is aborted for any reason, you probably left files in `/tmp` that you want to rescue or remove.
Right now, the best approach is to book an interactive shell on the node in question and resolve it manually, via:

```bash
srun -p short -w wn21053 -n1 -t 60 --pty /bin/bash
```
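Inside that shell, rescuing and cleaning up typically amounts to something like the following; the directory names are hypothetical examples:

```bash
# Copy the leftovers back to the shared filesystem, then free the local SSD.
mkdir -p /beegfs/$USER/rescued
cp -a /tmp/my_job_output /beegfs/$USER/rescued/
rm -rf /tmp/my_job_output
```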
Every job submission in Slurm introduces some overhead to the batch system. If you have many short jobs of the same kind, e.g. 2000 x 30 minutes, you should combine your workload into fewer submission scripts or consider using Slurm's job arrays. This way you bundle all of these jobs into a single submission, but can still treat the items individually as array tasks.
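For instance, a minimal sketch of such a job array, assuming a hypothetical `process_item` program that selects its work item by index:

```bash
#!/bin/bash
#SBATCH -p short
#SBATCH -t 30                # time limit per array task
#SBATCH --array=0-1999       # 2000 tasks in a single submission
# (e.g. --array=0-1999%100 would limit how many run at once)

# Each array task receives its own index via SLURM_ARRAY_TASK_ID.
./process_item --index "$SLURM_ARRAY_TASK_ID"   # hypothetical program
```

Submitted once with `sbatch`, this counts as a single submission whose array tasks are scheduled individually.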
Please try to estimate a maximum execution time and set your job time limits accordingly.
- Set your job time limit with `-t`, shorter than the partition's maximum time.
- Use a job dependency (`sbatch -d afterany:<jobid>`) on the original job to continue processing in a follow-up job.
- The `long` partition allows for longer jobs, but there are a couple of risks involved.
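As an illustration, a long workload can be split into chunks that fit the partition limit and are chained with dependencies; the script name and job ID below are hypothetical:

```bash
# First chunk, with a time limit below the partition maximum:
sbatch -t 08:00:00 chunk.sh
# -> Submitted batch job 123456

# Follow-up chunk that starts only after the first one has finished:
sbatch -t 08:00:00 -d afterany:123456 chunk.sh
```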
The login nodes, especially fugg1 and 2, are mostly intended for job submissions. If you expect to move larger amounts of data, e.g. to a local computer, consider submitting a job that moves the data from a worker node to your system. This way, you can shift the workload away from login nodes.
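A minimal sketch of such a data-moving job, assuming the worker node can reach your machine via SSH; the host name and paths are placeholders:

```bash
#!/bin/bash
#SBATCH -p short
#SBATCH -n 1
#SBATCH -t 120

# rsync runs on a worker node, keeping the login nodes responsive.
rsync -av /beegfs/$USER/results/ user@my.workstation.example:/data/results/
```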