File Systems: Best Practices
Note:
Your files on BeeGFS are not backed up! Make sure to regularly store important results and data on another device. In case of a catastrophic failure or event, we won’t be able to restore the data.
BeeGFS and /tmp
Many and frequent accesses to files on `/beegfs` can put significant load on the metadata servers. As a consequence, the responsiveness of the shared filesystem degrades for all users. For this reason, there are a few things you should avoid when working on `/beegfs`:
- Too many files in a single directory. Each directory is managed by one of the metadata servers, and having very many files (e.g. 1000+) in one directory can result in unbalanced blocking operations if the directory is used in your jobs.
- Frequent lookups in a directory, e.g. through `ls` or implicitly through any `readdir` operation in your program or programming language. Each lookup results in a locked operation on the metadata server. This can happen if you frequently check for file status in your job scripts.
- Starting many short-running processes (seconds) with software installed on BeeGFS. Each process creates a new data stream to read the program data, which can overwhelm the storage system.
- Using `/beegfs` as your working directory for frequent file I/O in your job. Please consider using the local `/tmp` storage instead; every worker node is equipped with fast 2TB SSDs for exactly this purpose (see the sketch after this list).
  - Afterwards you can transfer your results to permanent storage on `/beegfs`.
  - If you want to store logfiles etc., consider packing everything into a `.tar` file, since a single large file is easier for a parallel filesystem to digest than many small files.
  - If you use `/tmp` in your jobs, please make sure that you clean up the directories you created. Also consider what happens to these files if your job gets canceled or crashes.
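As a rough illustration of this workflow, the sketch below shows a job script that stages input data from `/beegfs` onto the local SSD, runs the actual work under `/tmp`, packs the logfiles into a single `.tar` archive, and copies the results back to permanent storage. All paths and the program name (`my_simulation`) are placeholders; adapt them to your own project and the partitions available to you.

```bash
#!/bin/bash
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Placeholder paths - adapt them to your own project layout.
INPUT=/beegfs/$USER/project/input
RESULTS=/beegfs/$USER/project/results
WORKDIR=/tmp/$USER/$SLURM_JOB_ID   # job-specific scratch directory on the local SSD

mkdir -p "$WORKDIR" "$RESULTS"

# Stage the input data onto the local SSD once, instead of reading it
# repeatedly from the parallel filesystem during the job.
cp -r "$INPUT" "$WORKDIR/"

cd "$WORKDIR"
my_simulation --input input --output output --log logs/   # placeholder program

# Pack the many small logfiles into one archive before moving them to BeeGFS.
tar -czf logs.tar.gz logs/

# Transfer results and the packed logs back to permanent storage on /beegfs.
cp -r output logs.tar.gz "$RESULTS/"

# Clean up the scratch directory so the local disk does not fill up.
rm -rf "$WORKDIR"
```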
We have a job script example that automatically cleans up the `/tmp` directory at the end of a job.
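If you want the cleanup to also run when a command in your script fails, a `trap` on `EXIT` is one possible pattern. The following is only a minimal sketch under that assumption, not the exact example script mentioned above; `my_program` and the result paths are placeholders.

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:30:00

WORKDIR=/tmp/$USER/$SLURM_JOB_ID
mkdir -p "$WORKDIR"

# Remove the scratch directory whenever the script exits, including after a
# failing command. A hard kill (e.g. when the time limit expires) may still
# leave files behind - see below for cleaning them up manually.
trap 'rm -rf "$WORKDIR"' EXIT

cd "$WORKDIR"
my_program                              # placeholder for the actual work
cp -r results /beegfs/$USER/project/    # copy results back before the trap fires
```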
If you left files in `/tmp` that you want to rescue or remove manually, the best approach is to book an interactive shell on the node via:
srun -p short -w wn21053 -n1 -t 60 --pty /bin/bash
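Here `-p short` selects the partition, `-w wn21053` pins the session to a specific node (replace it with the worker node your job actually ran on), `-n1` requests a single task, `-t 60` limits the session to 60 minutes, and `--pty /bin/bash` starts an interactive shell on that node.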