During the Spartan HPC maintenance window on 2019-07-08 and 2019-07-09, we ran extensive IO500 benchmarking on our CephFS cluster. Here are the details.
## Why IO500?
- It is the most recognised IO benchmark for HPC systems, popularised by the Supercomputing Conference (http://supercomputing.org/). Every popular HPC filesystem has entries in its database, which makes comparisons easier.
- It runs under the same job scheduling system as the HPC cluster itself, to mimic real HPC job IO as closely as possible (a minimal submission sketch follows this list).
- For descriptions of the individual benchmarks, see https://github.com/VI4IO/io-500-dev/tree/master/doc#io500-individual-benchmarks.
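Since IO500 is launched like any other MPI job, a run looks roughly like the Slurm submission below. This is a minimal sketch only: the node counts, walltime, module name and paths are placeholders rather than our actual configuration, and io500.sh itself is configured separately inside the io-500-dev checkout.

```bash
#!/bin/bash
# Minimal sketch only: node counts, walltime, module name and paths are
# placeholders, not our actual configuration.
#SBATCH --job-name=io500
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --time=04:00:00

# Load an MPI stack; the module name differs per site.
module load OpenMPI

# Run from a checkout of the io-500-dev repository. io500.sh is edited
# beforehand to set the MPI launcher and the CephFS working directory.
cd /path/to/io-500-dev
./io500.sh
```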
This is our first time running IO500, so we have not yet been able to tune our clusters enough for an official submission, nor for optimal performance. What we learned from this run should make both possible in the next maintenance window.
## Infrastructure details
### Networking
...
...
We monitor Ceph cluster load via Prometheus/Graphite/Grafana, and here are some of the graphs from the benchmark runs.
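As a rough illustration of the kind of query sitting behind those dashboards (the Prometheus endpoint and the metric name here are assumptions; exact metric names depend on whether the ceph-mgr prometheus module or a separate exporter scrapes the cluster), aggregate OSD write throughput can be spot-checked directly against the Prometheus HTTP API:

```bash
# Spot-check aggregate OSD write throughput over the last 5 minutes.
# The endpoint and the metric name 'ceph_osd_op_w_in_bytes' are placeholders:
# adjust both for the Prometheus setup and Ceph exporter actually in use.
curl -sG 'http://prometheus.example.org:9090/api/v1/query' \
     --data-urlencode 'query=sum(rate(ceph_osd_op_w_in_bytes[5m]))'
```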
## What we learned from this
### Compute jobs on NLSAS
Although the cluster and the NLSAS pool coped fine, based on the load we observed we do not recommend running big compute jobs directly against the NLSAS pool; it should be reserved for long-term storage only. Compute jobs should use faster scratch storage, e.g. SSD or NVMe.
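In practice that means staging data onto scratch, computing there, and copying results back to the long-term filesystem afterwards. The sketch below shows that pattern in a job script; the /data and /scratch paths and the run_simulation binary are purely illustrative, not Spartan-specific.

```bash
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --time=02:00:00

# Illustrative layout only: /data stands for the long-term (NLSAS-backed)
# filesystem, /scratch for the faster SSD/NVMe-backed pool.
SCRATCH_DIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH_DIR"

# Stage input in, do the IO-heavy work on scratch, then stage results out.
cp -r /data/$USER/myproject/input "$SCRATCH_DIR/"
cd "$SCRATCH_DIR"
./run_simulation input/        # hypothetical application binary
cp -r results /data/$USER/myproject/

# Clean up so scratch stays scratch.
rm -rf "$SCRATCH_DIR"
```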