During the 2019-07-[08-09] maintenance window for Spartan HPC, we ran extensive benchmarking with IO500 on our CephFS cluster. Here are the details.
Why IO500?
- It is the most recognised IO benchmark for HPC systems, popularised by the Supercomputing Conference (http://supercomputing.org/). Every popular HPC filesystem has entries in its database, which makes comparisons easier.
- It uses the same job scheduling system as your HPC cluster, to mimic real HPC job IO as closely as possible.
- For descriptions of the individual benchmarks, see https://github.com/VI4IO/io-500-dev/tree/master/doc#io500-individual-benchmarks
This is our first time running IO500, so we have not been able to tune our clusters enough for an official submission, nor for optimal performance. What we learned here will enable us to do both in the next maintenance window.
Infrastructure details
Networking
- Mellanox leaf switches SN2100 for Ceph nodes
- Mellanox leaf switches SN2700 for client gpgpu nodes
- Mellanox spine switches SN2700
- 2x100G from leaf to spine, 4x100G between spines
Ceph cluster
RHEL 7.6, kernel-lt elrepo 4.4.135-1.el7.elrepo.x86_64, Mellanox OFED 4.3-3.0.2.1
- mon[1-5]: 1x10-cores Xeon v4 2.4GHz, 64GB of RAM, 2x25Gbe Mellanox
- mds[1-3]: 2 active, 1 standby, each is: 1x6-cores Xeon v4 3.4GHz, 512GB of RAM, 2x25Gbe Mellanox
- NLSAS data pool:
- 36 OSD nodes, 16 drives each (576 drives in total), mix of 8TB and 10TB NLSAS drives.
- Each node has 1xNVMe card Intel P3700 or Optane 900P for WAL (2GB) and RocksDB (10GB) per OSD.
- 1x10-cores Xeon v4 2.4GHz, 128GB of RAM, 2x25Gbe Mellanox
- Replicated, 3 copies (see the pool-creation sketch after this list)
- Fullness: ~60%
- SSD data pool:
- 16 OSD nodes, 8 Sandisk BSSD 8TB drives each over 12Gb SAS (IF150 unit), 128 drives in total
- Each node has 2x NVMe cards (Optane 900P) for WAL (4GB) and RocksDB (40GB) per OSD
- 2x16-cores Xeon v4 2.6GHz, 128GB of RAM, 2x25Gbe Mellanox
- Erasure coded 4+2 (4 data + 2 coding chunks)
- Fullness: ~73%
- Metadata pool:
- On 10 of the 16 SSD OSD nodes
- Each node has 1x NVMe (Optane 900p 480GB) partitioned into 4, each becomes an OSD (so 40 NVMe OSDs in total)
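As a rough illustration (not our exact commands), pools with these properties can be created along the following lines; the pool names, PG counts and device paths are placeholders:

```
# Replicated NLSAS data pool, 3 copies
ceph osd pool create cephfs_data_nlsas 4096 4096 replicated
ceph osd pool set cephfs_data_nlsas size 3

# Erasure-coded 4+2 SSD data pool; EC overwrites must be enabled for CephFS use
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create cephfs_data_ssd 1024 1024 erasure ec42
ceph osd pool set cephfs_data_ssd allow_ec_overwrites true

# Per-OSD WAL/DB on NVMe: the WAL/DB sizes above come from the partition sizes
# handed to ceph-volume when each OSD is created (device paths are placeholders)
ceph-volume lvm create --bluestore --data /dev/sdb \
    --block.wal /dev/nvme0n1p1 --block.db /dev/nvme0n1p2
```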
Client nodes:
We use 2, 4, 10 and 32 nodes from our gpgpu cluster. Each node has the following specs:
spartan-gpgpu:
- 2x12-cores Xeon v4 2.2GHz
- 128GB of RAM
- 1x100Gbe Mellanox
- 4x Tesla P100s (not actually used in IO500 benchmarks)
Nodes are spread across as many racks (one switch per rack) as possible, up to 6 racks.
Compiling IO500
We run IO500 via Spartan's Slurm scheduler and compile it using Spartan's environment modules.
ssh login@spartan
<create benchmark directories on NLSAS and Sandisk SSD pools>
cd <benchmark directory>
module load OpenMPI/3.1.3-GCC-6.2.0-ucx Autoconf/2.69-GCC-6.2.0 Automake/1.15-GCC-6.2.0
git clone https://github.com/VI4IO/io-500-dev
cd io-500-dev
./utilities/prepare.sh
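A quick sanity check before queueing real runs; the bin/ paths are assumed from the io-500-dev defaults, and the ior smoke-test flags are illustrative only:

```
# Binaries that prepare.sh is expected to have built
ls bin/ior bin/mdtest bin/pfind

# Tiny ior smoke test to confirm MPI and the filesystem mount are working
mpirun -np 2 ./bin/ior -a POSIX -w -r -t 1m -b 16m -o <benchmark directory>/ior_smoke
```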
Preparing scripts
Slurm
We run the IO500 benchmark through Slurm. The Slurm script looks like this:
mytest_10n_16t_ssd.sh
#!/bin/bash
#SBATCH -p debug
#SBATCH -w spartan-gpgpu[001-002,013-014,024-025,035-036,047-048]
#SBATCH --ntasks=160
#SBATCH --tasks-per-node=16
#SBATCH --mem=100G
#SBATCH --cpus-per-task=1
#SBATCH --time=06:00:00
module load OpenMPI/3.1.3-GCC-6.2.0-ucx
./io500_10n_16t_ssd.sh
It is submitted by running sbatch mytest_10n_16t_ssd.sh, and the output of IO500 is captured in slurm-${job_id}.out in the same directory.
io500_10n_16t_ssd.sh is a copy of the provided io500.sh with the parameters modified as officially instructed; details below.
Tests
Each test has the following IO500 runs:
io500_run_ior_easy="True" # does the write phase and enables the subsequent read
io500_run_md_easy="True" # does the creat phase and enables the subsequent stat
io500_run_ior_hard="True" # does the write phase and enables the subsequent read
io500_run_md_hard="True" # does the creat phase and enables the subsequent read
io500_run_find="True"
io500_run_ior_easy_read="True"
io500_run_md_easy_stat="True"
io500_run_ior_hard_read="True"
io500_run_md_hard_stat="True"
io500_run_md_hard_read="True"
io500_run_md_easy_delete="True" # turn this off if you want to just run find by itself
io500_run_md_hard_delete="True" # turn this off if you want to just run find by itself
io500_run_mdreal="False" # this one is optional
io500_cleanup_workdir="False" # this flag is currently ignored. You'll need to clean up your data files manually if you want to.
io500_stonewall_timer=300 # Stonewalling timer, stop with wearout after 300s with default test, set to 0, if you never want to abort...
io500_find_mpi="True"
Note: we use the parallel MPI find command.
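For reference, the find stanza in io500.sh looks roughly like this (exact arguments can differ between io-500-dev versions):

```
io500_find_mpi="True"
io500_find_cmd="$PWD/bin/pfind"
io500_find_cmd_args="-s $io500_stonewall_timer -r $io500_result_dir/pfind_results"
```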
We run each test twice: once on NLSAS and once on SSD. The tests are listed below (a sketch of how these values map onto io500.sh variables follows the table):
Tests | 2n1t | 4n8t | 10n16t | 32n1t |
---|---|---|---|---|
Clients | 2 | 4 | 10 | 32 |
Threads per client | 1 | 8 | 16 | 1 |
mpirun args | -np 2 | -np 32 | -np 160 | -np 32 |
ior_easy_size per t | 200G | 20G | 20G | 200G |
ior_easy_size total | 400G | 640G | 3.2T | 6.4T |
ior_easy bs | 1M | 1M | 1M | 1M |
mdtest easy files per t | 600K | 12.5K | 12.5K | 600K |
mdtest easy files total | 1200K | 400K | 2000K | 19200K |
ior hard writes per t | 100K | 10K | 25K | 100K |
ior hard writes total | 200K | 320K | 4000K | 3200K |
mdtest hard files per t | 500K | 62.5K | 100K | 500K |
mdtest hard files total | 1000K | 2000K | 16000K | 16000K |
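As a hedged sketch, the 10n16t column above maps onto io500.sh variables roughly as follows; the variable names come from io500.sh, while the exact parameter strings and the workdir path are illustrative:

```
io500_mpirun="mpirun"
io500_mpiargs="-np 160"
io500_workdir=<benchmark directory>/io500.$timestamp   # NLSAS or SSD directory
io500_ior_easy_params="-t 1m -b 20g -F"                # 1M transfers, 20G per task
io500_mdtest_easy_files_per_proc=12500
io500_ior_hard_writes_per_proc=25000
io500_mdtest_hard_files_per_proc=100000
```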
Why these tests?
- 2n1t was chosen to fit a simple MPI job.
- 4n8t was requested by HPC team to match one of their bigger jobs.
- 10n16t matches certain known IO500 submissions from vendors.
- 32n1t is to match some non-IO500 benchmarks done by vendors.
Results
Tests | 2n1t-nlsas | 2n1t-ssd | 4n8t-nlsas | 4n8t-ssd | 10n16t-nlsas | 10n16t-ssd | 32n1t |
---|---|---|---|---|---|---|---|
ior_easy_write GB/s | 2.687 | 2.756 | 6.150 | 11.541 | 6.283 | 11.897 | n/a (2) |
mdtest_easy_write kiops | 5.304 | 4.146 | 12.446 | 9.387 | 6.872 | 6.220 | n/a |
ior_hard_write GB/s | 0.008 | 0.036 | 0.049 | 0.366 | 0.156 | 0.953 | n/a |
mdtest_hard_write kiops | 5.088 | 4.067 | 6.260 | 4.743 | 6.762 | 5.570 | n/a |
find kiops | 63.520 | 126.030 | 108.610 | 105.110 | 131.290 | 99.810 | n/a |
ior_easy_read GB/s | 2.605 | 2.435 | 15.398 | 18.598 | 41.064 | 22.847 | n/a |
mdtest_easy_stat kiops | 13.112 | 11.833 | 15.977 | 15.485 | 18.632 | 19.441 | n/a |
ior_hard_read GB/s | 0.013 | 0.062 | 0.111 | 0.908 | 0.359 | 3.890 | n/a |
mdtest_hard_stat kiops | 10.462 | 11.996 | 23.456 | 19.114 | 15.759 | 18.722 | n/a |
mdtest_easy_del kiops | 2.856 | 2.341 | 5.345 | 3.983 | 4.451 | 4.463 | n/a |
mdtest_hard_read kiops | n/a (1) | 1.392 | 2.560 | 4.242 | 3.958 | 5.038 | n/a |
mdtest_hard_del kiops | n/a (1) | 2.673 | 4.160 | 4.310 | 4.577 | 4.491 | n/a |
- (1) We hit a Ceph MDS hiccup with a "client failing to release caps" error and killed the Slurm job as it was taking too long.
- (2) The 32n1t SSD run put too high a load on the SSD pool, perhaps because it has too few storage nodes relative to the clients, combined with the big mismatch in network speed (100G on the clients vs 25G on the storage nodes), and it crashed two storage nodes. We did not have time to run 32n1t on NLSAS.
Note: metadata performance is mostly a function of the NVMe metadata pool and does not really reflect the differences between NLSAS and SSD. Metadata performance in the SSD-pool benchmarks can actually be lower than in the NLSAS benchmarks because the metadata OSDs sit on the same servers as the SSD OSDs, so the load from the SSD-pool benchmarks affects metadata performance more.
System loads
During testing, we observed and also monitored the system loads. Here are some highlights:
Slow request storm
The ior easy writes put a lot of load on our NLSAS OSDs, which created a storm of slow requests. At the worst point they affected every single NLSAS OSD and piled up like this: "90817 slow requests are blocked > 32 sec". However, they cleared up as soon as the test neared its end and did not cause any lasting harm.
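A simple way to keep an eye on such a storm while a test is running (standard Ceph CLI commands, interval arbitrary):

```
# ceph health detail lists the blocked-request count and the OSDs involved
watch -n 10 "ceph health detail | grep -i -e 'slow request' -e blocked | head -n 20"
```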
MDS requests
Our MDS nodes got hit really hard during the metadata tests. The 10n16t benchmarks put the biggest load we have ever seen on them, e.g.:
+------+--------+----------------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+----------------+---------------+-------+-------+
| 0 | active | mds3-ceph2-qh2 | Reqs: 19.8k/s | 2542k | 2519k |
| 1 | active | mds2-ceph2-qh2 | Reqs: 14.4k/s | 6751k | 6751k |
+------+--------+----------------+---------------+-------+-------+
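The table above is the MDS section of the ceph fs status output; a simple way to keep it on screen during a run is:

```
watch -n 5 ceph fs status
```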
Peak throughputs in ceph
Glancing at ceph status during the tests, we could see some big numbers, such as this during the SSD 10n16t write test:
io:
client: 24.3GiB/s wr, 0op/s rd, 24.93kop/s wr
This was against an averaged final result of 11.897 GB/s. The read test that averaged 41.5 GB/s must have had an even higher peak; unfortunately, the Grafana chart that tracks this did not collect any metrics during that benchmark window, possibly due to the high load.
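As a crude fallback for the next round, in case Grafana misses another window, simply logging the client io line from ceph status would at least capture the peaks (sketch only; interval and log path are arbitrary):

```
while true; do
    echo "$(date -Is) $(ceph -s | grep 'client:')" >> ceph_client_io.log
    sleep 5
done
```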
Grafana charts
We monitor ceph cluster loads via Prometheus/Graphite/Grafana, and here are some screenshots of the cluster during the tests.
Overall view of the whole 8+ hour window of IO500 benchmarking. (The overall combined network throughput may not be accurate due to sampling-time mismatches between nodes.)
- 2n1t_nlsas
- 2n1t_ssd
- 4n8t_nlsas
- 4n8t_ssd
- 10n16t_nlsas
- 10n16t_ssd
- 32n1t_ssd (the run that crashed)
What we learned from this
Compute jobs on NLSAS
Although the cluster and the NLSAS pool coped fine, in light of these results we do not recommend running big jobs directly on the NLSAS pool, which should be reserved for long-term storage only. Compute jobs should use faster scratch storage, e.g. SSD or NVMe.
Networking mismatch
The client nodes we chose all have 1x100Gbe networking, which could be too much for the storage nodes on 2x25Gbe. The switches the client nodes sit on are also faster.
Misconfigurations
As first-time users of IO500, we did not configure the parameters optimally, and we also made a few typos that resulted in workloads that did not quite match what we were after, e.g. too many files for the 32-node test.
A lot of the IO500 tests require a 300s minimum runtime for an official submission. Without knowing how long our cluster takes to run through each test, it is hard to predict what parameters to give IO500; it takes multiple runs of the same test to make sure all results are valid for submission. We can rectify this in the next maintenance window, when we aim to run the IO500 tests again; a rough sizing check is sketched below.
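Using the 10n16t NLSAS write result above: for the write phase to still be running when the 300s stonewall fires, each task needs at least aggregate_write_bandwidth * 300 / ntasks of data.

```
echo "6.283 * 300 / 160" | bc -l    # ~11.8 GB per task; we used 20G, which leaves headroom
```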
OSD loads
The NLSAS OSD host loads were fairly similar to each other during each test. However, the Sandisk SSD OSD hosts, which carry our EC SSD pool, showed wildly varying loads: during the 10n16t test a couple were hitting load averages of 110-120 while the rest sat between 25 and 50. During the 32n1t test, which crashed two SSD hosts, the pattern was similar but even higher. This might be an effect of EC, which is very CPU intensive, but we did not expect some hosts to be hit so much harder than the rest; we expected all hosts to have similarly high loads. A simple way to sample this spread is shown below.
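For example, with placeholder hostnames standing in for our SSD OSD hosts:

```
for h in ceph-osd-ssd-{01..16}; do
    printf '%s: ' "$h"
    ssh "$h" 'cut -d" " -f1 /proc/loadavg'   # 1-minute load average
done
```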
Metadata performance
Metadata performance is primarily dictated by our 10 NVMe disks, which we carved into 40 NVMe OSDs. These NVMe OSDs are co-located on the same hosts as the SSD OSDs, which is likely the main reason why, once the job load gets high enough, metadata kiops are better during the NLSAS tests than during the SSD tests: the higher CPU, memory and network load on the SSD hosts starts to affect NVMe metadata OSD performance.
One lesson to take from this is that, ideally, a big CephFS cluster serving HPC should have dedicated metadata OSD hosts, e.g. 5 servers with NVMe disks; one way of steering the metadata pool onto specific devices is sketched at the end of this section.
In production, however, we have not yet run into IO-intensive jobs like our IO500 benchmarks, so this weakness has never been exposed.
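Short of dedicated hosts, a CRUSH device-class rule can at least pin the metadata pool to NVMe-class OSDs; a minimal sketch with placeholder pool and rule names:

```
# Create a replicated CRUSH rule restricted to OSDs of device class "nvme",
# then point the CephFS metadata pool at it
ceph osd crush rule create-replicated meta-nvme default host nvme
ceph osd pool set cephfs_metadata crush_rule meta-nvme
```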