During the 2019-07-[08-09] maintenance window for Spartan HPC, we ran extensive benchmarking with IO500 on our CephFS cluster. Here are the details.
Why IO500?
- It is the most recognised IO benchmark for HPC systems, popularised by the Supercomputing Conference (http://supercomputing.org/). Every popular HPC filesystem has entries in its database, which makes comparisons easier.
- It uses the same job scheduling system as your HPC cluster, to mimic real HPC job IO as closely as possible.
- For descriptions of the individual benchmarks, see https://github.com/VI4IO/io-500-dev/tree/master/doc#io500-individual-benchmarks
This is our first time running IO500, so we have not been able to tune our clusters enough for an official submission, nor for optimal performance. What we learned here will enable us to do both in the next maintenance window.
Infrastructure details
Networking
- Mellanox leaf switches SN2100 for Ceph nodes
- Mellanox leaf switches SN2700 for client gpgpu nodes
- Mellanox spine switches SN2700
- 2x100G from leaf to spine, 4x100G between spines
Ceph cluster
RHEL 7.6, kernel-lt elrepo 4.4.135-1.el7.elrepo.x86_64, Mellanox OFED 4.3-3.0.2.1
- mon[1-5]: 1x10-cores Xeon v4 2.4GHz, 64GB of RAM, 2x25Gbe Mellanox
- mds[1-3]: 2 active, 1 standby, each is: 1x6-cores Xeon v4 3.4GHz, 512GB of RAM, 2x25Gbe Mellanox
- NLSAS data pool:
- 36 OSD nodes, 16 drives each (576 drives in total), mix of 8TB and 10TB NLSAS drives.
- Each node has 1xNVMe card Intel P3700 or Optane 900P for WAL (2GB) and RocksDB (10GB) per OSD.
- 1x10-cores Xeon v4 2.4GHz, 128GB of RAM, 2x25Gbe Mellanox
- Replicated, 3 copies (see the pool-creation sketch after this list)
- Fullness: ~60%
- SSD data pool:
- 16 OSD nodes, 8 Sandisk BSSD 8TB drives each over 12Gb SAS (IF150 unit), 128 drives in total
- Each node has 2x NVMe cards (Optane 900P) for WAL (4GB) and RocksDB (40GB) per OSD
- 2x16-cores Xeon v4 2.6GHz, 128GB of RAM, 2x25Gbe Mellanox
- Erasure coded 4+2 (4 data + 2 coding chunks)
- Fullness: ~73%
- Metadata pool:
- On 10 of the 16 SSD OSD nodes
- Each node has 1x NVMe (Optane 900p 480GB) partitioned into 4, each becomes an OSD (so 40 NVMe OSDs in total)
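As a rough illustration (not our exact commands), pools with these properties can be created along the following lines; the pool names, PG counts and device paths are placeholders:

```
# Replicated NLSAS data pool, 3 copies
ceph osd pool create cephfs_data_nlsas 4096 4096 replicated
ceph osd pool set cephfs_data_nlsas size 3

# Erasure-coded 4+2 SSD data pool; EC overwrites must be enabled for CephFS use
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create cephfs_data_ssd 1024 1024 erasure ec42
ceph osd pool set cephfs_data_ssd allow_ec_overwrites true

# Per-OSD WAL/DB on NVMe: the WAL/DB sizes above come from the partition sizes
# handed to ceph-volume when each OSD is created (device paths are placeholders)
ceph-volume lvm create --bluestore --data /dev/sdb \
    --block.wal /dev/nvme0n1p1 --block.db /dev/nvme0n1p2
```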
Client nodes:
We use 2, 4, 10 and 32 nodes from our gpgpu cluster. Each node has the following specs:
spartan-gpgpu:
- 2x12-cores Xeon v4 2.2GHz
- 128GB of RAM
- 1x100Gbe Mellanox
- 4x Tesla P100s (not actually used in IO500 benchmarks)
Nodes are spread across as many racks (one switch per rack) as possible, up to 6 racks.
Compiling IO500
We run IO500 via Spartan's Slurm scheduler and compile it using Spartan's environment modules.
ssh login@spartan
<create benchmark directories on NLSAS and Sandisk SSD pools>
cd <benchmark directory>
module load OpenMPI/3.1.3-GCC-6.2.0-ucx Autoconf/2.69-GCC-6.2.0 Automake/1.15-GCC-6.2.0
git clone https://github.com/VI4IO/io-500-dev
cd io-500-dev
./utilities/prepare.sh
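A quick sanity check before queueing real runs; the bin/ paths are assumed from the io-500-dev defaults, and the ior smoke-test flags are illustrative only:

```
# Binaries that prepare.sh is expected to have built
ls bin/ior bin/mdtest bin/pfind

# Tiny ior smoke test to confirm MPI and the filesystem mount are working
mpirun -np 2 ./bin/ior -a POSIX -w -r -t 1m -b 16m -o <benchmark directory>/ior_smoke
```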
Preparing scripts
Slurm
We run the IO500 benchmark through Slurm. The Slurm script looks like this:
mytest_10n_16t_ssd.sh
#!/bin/bash
#SBATCH -p debug
#SBATCH -w spartan-gpgpu[001-002,013-014,024-025,035-036,047-048]
#SBATCH --ntasks=160
#SBATCH --tasks-per-node=16
#SBATCH --mem=100G
#SBATCH --cpus-per-task=1
#SBATCH --time=06:00:00
module load OpenMPI/3.1.3-GCC-6.2.0-ucx
./io500_10n_16t_ssd.sh
It is submitted by running sbatch mytest_10n_16t_ssd.sh, and the output of IO500 is captured in slurm-${job_id}.out in the same directory.
io500_10n_16t_ssd.sh is a copy of the provided io500.sh with the parameters modified as officially instructed; details below.
Tests
Each test has the following IO500 runs:
io500_run_ior_easy="True" # does the write phase and enables the subsequent read
io500_run_md_easy="True" # does the creat phase and enables the subsequent stat
io500_run_ior_hard="True" # does the write phase and enables the subsequent read
io500_run_md_hard="True" # does the creat phase and enables the subsequent read
io500_run_find="True"
io500_run_ior_easy_read="True"
io500_run_md_easy_stat="True"
io500_run_ior_hard_read="True"
io500_run_md_hard_stat="True"
io500_run_md_hard_read="True"
io500_run_md_easy_delete="True" # turn this off if you want to just run find by itself
io500_run_md_hard_delete="True" # turn this off if you want to just run find by itself
io500_run_mdreal="False" # this one is optional
io500_cleanup_workdir="False" # this flag is currently ignored. You'll need to clean up your data files manually if you want to.
io500_stonewall_timer=300 # Stonewalling timer, stop with wearout after 300s with default test, set to 0, if you never want to abort...
io500_find_mpi="True"
Note: we use the parallel MPI find command.
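For reference, the find stanza in io500.sh looks roughly like this (exact arguments can differ between io-500-dev versions):

```
io500_find_mpi="True"
io500_find_cmd="$PWD/bin/pfind"
io500_find_cmd_args="-s $io500_stonewall_timer -r $io500_result_dir/pfind_results"
```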
We run each test twice: once on NLSAS and once on SSD. The tests are listed below (a sketch of how these values map onto io500.sh variables follows the table):
Tests | 2n1t | 4n8t | 10n16t | 32n1t |
---|---|---|---|---|
Clients | 2 | 4 | 10 | 32 |
Threads per client | 1 | 8 | 16 | 1 |
mpirun args | -np 2 | -np 32 | -np 160 | -np 32 |
ior_easy_size per t | 200G | 20G | 20G | 200G |
ior_easy_size total | 400G | 640G | 3.2T | 6.4T |
ior_easy bs | 1M | 1M | 1M | 1M |
mdtest easy files per t | 600K | 12.5K | 12.5K | 600K |
mdtest easy files total | 1200K | 400K | 2000K | 19200K |
ior hard writes per t | 100K | 10K | 25K | 100K |
ior hard writes total | 200K | 320K | 4000K | 3200K |
mdtest hard files per t | 500K | 62.5K | 100K | 500K |
mdtest hard files total | 1000K | 2000K | 16000K | 16000K |
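As a hedged sketch, the 10n16t column above maps onto io500.sh variables roughly as follows; the variable names come from io500.sh, while the exact parameter strings and the workdir path are illustrative:

```
io500_mpirun="mpirun"
io500_mpiargs="-np 160"
io500_workdir=<benchmark directory>/io500.$timestamp   # NLSAS or SSD directory
io500_ior_easy_params="-t 1m -b 20g -F"                # 1M transfers, 20G per task
io500_mdtest_easy_files_per_proc=12500
io500_ior_hard_writes_per_proc=25000
io500_mdtest_hard_files_per_proc=100000
```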
Why these tests?
- 2n1t was chosen to fit a simple MPI job.
- 4n8t was requested by HPC team to match one of their bigger jobs.
- 10n16t matches certain known IO500 submissions from vendors.
- 32n1t is to match some non-IO500 benchmarks done by vendors.
Results
Tests | 2n1t-nlsas | 2n1t-ssd | 4n8t-nlsas | 4n8t-ssd | 10n16t-nlsas | 10n16t-ssd | 32n1t |
---|---|---|---|---|---|---|---|
ior_easy_write GB/s | 2.687 | 2.756 | 6.150 | 11.541 | 6.283 | 11.897 | n/a (2) |
mdtest_easy_write kiops | 5.304 | 4.146 | 12.446 | 9.387 | 6.872 | 6.220 | n/a |
ior_hard_write GB/s | 0.008 | 0.036 | 0.049 | 0.366 | 0.156 | 0.953 | n/a |
mdtest_hard_write kiops | 5.088 | 4.067 | 6.260 | 4.743 | 6.762 | 5.570 | n/a |
find kiops | 63.520 | 126.030 | 108.610 | 105.110 | 131.290 | 99.810 | n/a |
ior_easy_read GB/s | 2.605 | 2.435 | 15.398 | 18.598 | 41.064 | 22.847 | n/a |
mdtest_easy_stat kiops | 13.112 | 11.833 | 15.977 | 15.485 | 18.632 | 19.441 | n/a |
ior_hard_read GB/s | 0.013 | 0.062 | 0.111 | 0.908 | 0.359 | 3.890 | n/a |
mdtest_hard_stat kiops | 10.462 | 11.996 | 23.456 | 19.114 | 15.759 | 18.722 | n/a |
mdtest_easy_del kiops | 2.856 | 2.341 | 5.345 | 3.983 | 4.451 | 4.463 | n/a |
mdtest_hard_read kiops | n/a (1) | 1.392 | 2.560 | 4.242 | 3.958 | 5.038 | n/a |
mdtest_hard_del kiops | n/a (1) | 2.673 | 4.160 | 4.310 | 4.577 | 4.491 | n/a |
- (1) We hit a Ceph MDS hiccup with a "client failing to release caps" error and killed the Slurm job as it was taking too long.
- (2) The 32n1t SSD run put too high a load on the SSD pool, perhaps because it has too few storage nodes relative to the clients, combined with the big mismatch in network speed (100G on the clients vs 25G on the storage nodes), and it crashed two storage nodes. We did not have time to run 32n1t on NLSAS.
Note: metadata performance is mostly a function of the NVMe metadata pool and does not really reflect the differences between NLSAS and SSD. Metadata performance in the SSD-pool benchmarks can actually be lower than in the NLSAS benchmarks because the metadata OSDs sit on the same servers as the SSD OSDs, so the load from the SSD-pool benchmarks affects metadata performance more.
System loads
During testing, we observed and also monitored the system loads. Here are some highlights:
Slow request storm
The ior easy writes put a lot of load on our NLSAS OSDs, which created a storm of slow requests. At the worst point they affected every single NLSAS OSD and piled up like this: "90817 slow requests are blocked > 32 sec". However, they cleared up as soon as the test neared its end and did not cause any lasting harm.
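A simple way to keep an eye on such a storm while a test is running (standard Ceph CLI commands, interval arbitrary):

```
# ceph health detail lists the blocked-request count and the OSDs involved
watch -n 10 "ceph health detail | grep -i -e 'slow request' -e blocked | head -n 20"
```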
MDS requests
Our MDS nodes got hit really hard during the metadata tests. The 10n16t benchmarks put the biggest load we have ever seen on them, e.g.:
+------+--------+----------------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+----------------+---------------+-------+-------+
| 0 | active | mds3-ceph2-qh2 | Reqs: 19.8k/s | 2542k | 2519k |
| 1 | active | mds2-ceph2-qh2 | Reqs: 14.4k/s | 6751k | 6751k |
+------+--------+----------------+---------------+-------+-------+
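The table above is the MDS section of the ceph fs status output; a simple way to keep it on screen during a run is:

```
watch -n 5 ceph fs status
```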
Peak throughputs in ceph
Glancing at ceph status during the tests, we could see some big numbers, such as this during the SSD 10n16t write test:
io:
client: 24.3GiB/s wr, 0op/s rd, 24.93kop/s wr
This was against an averaged final result of 11.897 GB/s. The read test that averaged 41.5 GB/s must have had an even higher peak; unfortunately, the Grafana chart that tracks this did not collect any metrics during that benchmark window, possibly due to the high load.
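As a crude fallback for the next round, in case Grafana misses another window, simply logging the client io line from ceph status would at least capture the peaks (sketch only; interval and log path are arbitrary):

```
while true; do
    echo "$(date -Is) $(ceph -s | grep 'client:')" >> ceph_client_io.log
    sleep 5
done
```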
Grafana charts
We monitor ceph cluster loads via Prometheus/Graphite/Grafana, and here are some screenshots of the cluster during the tests.
Overall view of the whole 8+ hour window of IO500 benchmarking. (The overall combined network throughput may not be accurate due to sampling-time mismatches between nodes.)
- 2n1t_nlsas
- 2n1t_ssd
- 4n8t_nlsas
- 4n8t_ssd
- 10n16t_nlsas
- 10n16t_ssd
- 32n1t_ssd (the run that crashed)
What we learned from this
Compute jobs on NLSAS
Although the cluster and the NLSAS pool coped fine, in light of these results we do not recommend running big jobs directly on the NLSAS pool, which should be reserved for long-term storage only. Compute jobs should use faster scratch storage, e.g. SSD or NVMe.
Networking mismatch
The client nodes we chose all have 1x100Gbe networking, which could be too much for the storage nodes on 2x25Gbe. The switches the client nodes sit on are also faster.
Misconfigurations
As first-time users of IO500, we did not configure the parameters optimally, and we also made a few typos that resulted in workloads that did not quite match what we were after, e.g. too many files for the 32-node test.
A lot of the IO500 tests require a 300s minimum runtime for an official submission. Without knowing how long our cluster takes to run through each test, it is hard to predict what parameters to give IO500; it takes multiple runs of the same test to make sure all results are valid for submission. We can rectify this in the next maintenance window, when we aim to run the IO500 tests again; a rough sizing check is sketched below.
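Using the 10n16t NLSAS write result above: for the write phase to still be running when the 300s stonewall fires, each task needs at least aggregate_write_bandwidth * 300 / ntasks of data.

```
echo "6.283 * 300 / 160" | bc -l    # ~11.8 GB per task; we used 20G, which leaves headroom
```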
OSD loads
The NLSAS OSD host loads were fairly similar to each other during each test. However, the Sandisk SSD OSD hosts, which carry our EC SSD pool, showed wildly varying loads: during the 10n16t test a couple were hitting load averages of 110-120 while the rest sat between 25 and 50. During the 32n1t test, which crashed two SSD hosts, the pattern was similar but even higher. This might be an effect of EC, which is very CPU intensive, but we did not expect some hosts to be hit so much harder than the rest; we expected all hosts to have similarly high loads. A simple way to sample this spread is shown below.
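For example, with placeholder hostnames standing in for our SSD OSD hosts:

```
for h in ceph-osd-ssd-{01..16}; do
    printf '%s: ' "$h"
    ssh "$h" 'cut -d" " -f1 /proc/loadavg'   # 1-minute load average
done
```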
Metadata performance
Metadata performance is primarily dictated by our 10 NVMe disks, which we carved into 40 NVMe OSDs. These NVMe OSDs are co-located on the same hosts as the SSD OSDs, which is likely the main reason why, once the job load gets high enough, metadata kiops are better during the NLSAS tests than during the SSD tests: the higher CPU, memory and network load on the SSD hosts starts to affect NVMe metadata OSD performance.
One lesson to take from this is that, ideally, a big CephFS cluster serving HPC should have dedicated metadata OSD hosts, e.g. 5 servers with NVMe disks; one way of steering the metadata pool onto specific devices is sketched at the end of this section.
In production, however, we have not yet run into IO-intensive jobs like our IO500 benchmarks, so this weakness has never been exposed.
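Short of dedicated hosts, a CRUSH device-class rule can at least pin the metadata pool to NVMe-class OSDs; a minimal sketch with placeholder pool and rule names:

```
# Create a replicated CRUSH rule restricted to OSDs of device class "nvme",
# then point the CephFS metadata pool at it
ceph osd crush rule create-replicated meta-nvme default host nvme
ceph osd pool set cephfs_metadata crush_rule meta-nvme
```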