Your files on BeeGFS are not backed up! Make sure to regularly copy important results and data to another device. In case of a catastrophic failure, we will not be able to restore your data.
Many frequent accesses to files on `/beegfs` can put significant load on the metadata servers. As a consequence, the responsiveness of the shared filesystem degrades for all users. Because of this, there are a few things you should avoid when working on `/beegfs`:
Avoid frequent directory listings, either explicitly via `ls` or implicitly through any `readdir` operation in your program or programming language. Every lookup results in a locked operation on the metadata server. This can happen, for example, if you frequently check for file status in your job scripts.
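Where a job script needs to wait for an output file, polling one known path with `test -e` (a single stat) avoids scanning the whole directory with `ls` (a `readdir` on the metadata server). A minimal sketch — the helper name and its defaults are ours, not a cluster tool:

```shell
# Poll a single known path instead of listing the directory.
# wait_for_file PATH [INTERVAL_SECONDS] [MAX_TRIES]
wait_for_file() {
    local path=$1 interval=${2:-60} max_tries=${3:-60}
    local i=0
    until [ -e "$path" ]; do
        i=$((i + 1))
        if [ "$i" -ge "$max_tries" ]; then
            return 1   # gave up waiting
        fi
        sleep "$interval"
    done
    return 0
}
```

In a job script this could replace an `ls | grep` polling loop, e.g. `wait_for_file /beegfs/$USER/run/result.dat` (a hypothetical path) with a relaxed polling interval.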
Avoid using `/beegfs` as your working directory for frequent file I/O in your jobs. Please consider using the local `/tmp` storage instead: every worker node is equipped with a fast 2 TB SSD for exactly this purpose.
Bundle many small files into a single `.tar` file, since one large file is easier for a parallel filesystem to digest than many small files.
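Both points can be combined in a job script that stages a `.tar` archive to the node-local SSD, works there, and packs the results back. A sketch under assumed paths — the archive names and the workload are placeholders:

```shell
#!/bin/bash
#SBATCH -p short
#SBATCH -t 60

# Hypothetical input and output archives on the shared filesystem
INPUT=/beegfs/$USER/input.tar
OUTPUT=/beegfs/$USER/results.tar

# Stage the input to the fast local SSD: extracting one large archive
# keeps the metadata load on /beegfs low
WORKDIR=$(mktemp -d /tmp/job.XXXXXX)
tar -xf "$INPUT" -C "$WORKDIR"

# ... run the actual workload inside $WORKDIR ...

# Pack the results into a single archive and copy it back
tar -cf "$OUTPUT" -C "$WORKDIR" .
rm -rf "$WORKDIR"
```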
If your job is aborted for any reason, you may have left files in `/tmp` on the worker node that you want to rescue or remove.
Right now, the best approach is to book an interactive shell on the node in question and resolve it manually, e.g. via:

```shell
srun -p short -w wn21053 -n1 -t 60 --pty /bin/bash
```
Every job submission in Slurm introduces some overhead in the batch system. If you have many short jobs of the same kind, e.g. 2000 x 30 minutes, you should combine your workload into fewer submission scripts or consider using Slurm's job arrays. This way you bundle all of these jobs into a single submission, but can still treat the items individually as array tasks.
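As a sketch, the 2000 short jobs from the example above could be bundled into one job array; the script name and input file layout are hypothetical, and site limits on array size may apply:

```shell
#!/bin/bash
#SBATCH -p short
#SBATCH -t 30
#SBATCH --array=0-1999

# One submission, 2000 array tasks; each task gets its own
# SLURM_ARRAY_TASK_ID and processes one item of the workload
./process_item "input_${SLURM_ARRAY_TASK_ID}.dat"
```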
Please try to estimate a maximum execution time and set your job time limits accordingly: set an explicit limit with `-t`, shorter than the partition's maximum time, and use a job dependency, e.g. `sbatch -d afterany:<jobid>`, on the original job to continue processing afterwards.
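Splitting a long-running workload into time-limited chunks could then look like this; `part1.sh` and `part2.sh` are hypothetical submission scripts:

```shell
# Submit the first chunk with an explicit time limit below the
# partition maximum; --parsable prints only the job id
jobid=$(sbatch --parsable -t 04:00:00 part1.sh)

# Continue once the first job has finished, whatever its outcome
sbatch -d afterany:"$jobid" part2.sh
```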
The `long` partition allows for longer jobs, but there are a couple of risks involved:
The login nodes, especially fugg1 and fugg2, are mostly intended for job submission. If you expect to move larger amounts of data, e.g. to a local computer, consider submitting a job that moves the data from a worker node to your system. This way, you shift the workload away from the login nodes.
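A minimal sketch of such a transfer job, assuming the worker nodes can reach your machine via SSH; the hostname and all paths are placeholders:

```shell
#!/bin/bash
#SBATCH -p short
#SBATCH -t 120

# Push results from the shared filesystem to a local computer;
# this runs on a worker node, keeping the login nodes free
rsync -av "/beegfs/$USER/results/" user@my-machine.example.org:backup/results/
```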