During the 2019-07-08/09 maintenance window for Spartan HPC, we ran extensive benchmarking with IO500 on our CephFS cluster. Here are the details.
Infrastructure details
Networking
- Mellanox SN2100 leaf switches for Ceph nodes
- Mellanox SN2700 leaf switches for client gpgpu nodes
- Mellanox SN2700 spine switches
- 2x100G from each leaf to the spines, 4x100G between spines
Ceph cluster
RHEL 7.6, kernel-lt elrepo 4.4.135-1.el7.elrepo.x86_64, Mellanox OFED 4.3-3.0.2.1
- mon[1-5]: 1x 10-core Xeon v4 2.4GHz, 64GB of RAM, 2x25GbE Mellanox
- mds[1-3]: 2 active, 1 standby; each has 1x 6-core Xeon v4 3.4GHz, 512GB of RAM, 2x25GbE Mellanox
- NLSAS data pool:
- 36 OSD nodes, 16 drives each (576 drives in total), mix of 8TB and 10TB NLSAS drives.
- Each node has 1x NVMe card (Intel P3700 or Optane 900P) providing the WAL (2GB) and RocksDB (10GB) for each OSD.
- 1x 10-core Xeon v4 2.4GHz, 128GB of RAM, 2x25GbE Mellanox
- 3x replication
- Fullness: ~60%
- SSD data pool:
- 16 OSD nodes, 8 Sandisk BSSD 8TB drives each over 12Gb SAS (IF150 unit), 128 drives in total
- Each node has 2x NVMe cards (Optane 900P) providing the WAL (4GB) and RocksDB (40GB) for each OSD
- 2x 16-core Xeon v4 2.6GHz, 128GB of RAM, 2x25GbE Mellanox
- Erasure coded, 4+2 (k=4, m=2)
- Fullness: ~73%
- Metadata pool:
- On 10 of the 16 SSD OSD nodes
- Each node has 1x NVMe (Optane 900P 480GB) partitioned into 4, each partition becoming an OSD (40 NVMe OSDs in total)
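For reference, a pool layout along these lines can be expressed with standard Ceph commands. The sketch below is illustrative only: the pool names, CRUSH rule names, PG counts and device classes are assumptions, not our exact configuration.

# Replicated 3x data pool on the NLSAS (hdd class) OSDs -- names and PG counts are assumptions
ceph osd crush rule create-replicated nlsas_rule default host hdd
ceph osd pool create cephfs_data_nlsas 4096 4096 replicated nlsas_rule
ceph osd pool set cephfs_data_nlsas size 3

# EC 4+2 data pool on the SSD OSDs
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host crush-device-class=ssd
ceph osd pool create cephfs_data_ssd 1024 1024 erasure ec42
ceph osd pool set cephfs_data_ssd allow_ec_overwrites true   # required for CephFS data on EC pools

# Replicated metadata pool on the NVMe OSDs, plus two active MDS ranks
ceph osd crush rule create-replicated nvme_rule default host nvme
ceph osd pool create cephfs_metadata 512 512 replicated nvme_rule
ceph fs set cephfs max_mds 2

Both data pools can then be attached to the filesystem (ceph fs add_data_pool cephfs cephfs_data_ssd) and selected per directory, as sketched in the next section.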
Client nodes:
We use 2, 4, 10 and 32 nodes from our gpgpu cluster. Each node has the following specs:
spartan-gpgpu:
- 2x 12-core Xeon v4 2.2GHz
- 128GB of RAM
- 1x100GbE Mellanox
- 4x Tesla P100s (not actually used in IO500 benchmarks)
Nodes are spread across as many racks (one switch per rack) as possible, up to 6 racks.
Compiling IO500
We run IO500 through Slurm on Spartan and compile it using Spartan's modules.
ssh login@spartan
<create benchmark directories on NLSAS and Sandisk SSD pools>
cd <benchmark directory>
module load OpenMPI/3.1.3-GCC-6.2.0-ucx Autoconf/2.69-GCC-6.2.0 Automake/1.15-GCC-6.2.0
git clone https://github.com/VI4IO/io-500-dev
cd io-500-dev
./utilities/prepare.sh
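The placeholder above is where the benchmark directories are created on the NLSAS and SSD pools. One common way to do this on CephFS (not necessarily exactly what we did) is to pin each directory to a data pool via the layout xattr; the mount point and pool names here are assumptions:

# Hypothetical mount point and pool names
mkdir -p /mnt/cephfs/io500-nlsas /mnt/cephfs/io500-ssd
setfattr -n ceph.dir.layout.pool -v cephfs_data_nlsas /mnt/cephfs/io500-nlsas
setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/io500-ssd

New files created under each directory then land in the corresponding data pool.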
Preparing scripts
Slurm
We run the IO500 benchmark through Slurm. The Slurm script looks like this:
mytest_10n_16t_ssd.sh
#!/bin/bash
#SBATCH -p debug
#SBATCH -w spartan-gpgpu[001-002,013-014,024-025,035-036,047-048]
#SBATCH --ntasks=160
#SBATCH --tasks-per-node=16
#SBATCH --mem=100G
#SBATCH --cpus-per-task=1
#SBATCH --time=06:00:00
module load OpenMPI/3.1.3-GCC-6.2.0-ucx
./io500_10n_16t_ssd.sh
It is submitted by running sbatch mytest_10n_16t_ssd.sh, and the output of IO500 is captured in slurm-${job_id}.out in the same directory.
io500_10n_16t_ssd.sh is a copy of the provided io500.sh with parameters modified as officially instructed; details below.
Tests
Each test has the following IO500 runs:
io500_run_ior_easy="True" # does the write phase and enables the subsequent read
io500_run_md_easy="True" # does the creat phase and enables the subsequent stat
io500_run_ior_hard="True" # does the write phase and enables the subsequent read
io500_run_md_hard="True" # does the creat phase and enables the subsequent read
io500_run_find="True"
io500_run_ior_easy_read="True"
io500_run_md_easy_stat="True"
io500_run_ior_hard_read="True"
io500_run_md_hard_stat="True"
io500_run_md_hard_read="True"
io500_run_md_easy_delete="True" # turn this off if you want to just run find by itself
io500_run_md_hard_delete="True" # turn this off if you want to just run find by itself
io500_run_mdreal="False" # this one is optional
io500_cleanup_workdir="False" # this flag is currently ignored. You'll need to clean up your data files manually if you want to.
io500_stonewall_timer=300 # Stonewalling timer, stop with wearout after 300s with default test, set to 0, if you never want to abort...
io500_find_mpi="True"
Note: we use the parallel MPI find command.
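For completeness, the stock io500.sh wires up the parallel find roughly like this (the exact lines may differ between io-500-dev revisions):

io500_find_mpi="True"
io500_find_cmd="$PWD/bin/pfind"   # pfind is built by utilities/prepare.sh
io500_find_cmd_args=""            # extra pfind arguments, left at the default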
We run each test twice, once on NLSAS, once on SSD. The tests are:
Tests | 2n1t | 4n8t | 10n16t | 32n1t |
---|---|---|---|---|
Clients | 2 | 4 | 10 | 32 |
Threads per client | 1 | 8 | 16 | 1 |
mpirun args | -np 2 | -np 32 | -np 160 | -np 32 |
ior_easy_size per t | 200G | 20G | 20G | 200G |
ior_easy_size total | 400G | 640G | 3.2T | 6.4T |
ior_easy bs | 1M | 1M | 1M | 1M |
mdtest easy files per t | 600K | 12.5K | 12.5K | 600K |
mdtest easy files total | 1200K | 400K | 2000K | 19200K |
ior hard writes per t | 100K | 10K | 25K | 100K |
ior hard writes total | 200K | 320K | 4000K | 3200K |
mdtest hard files per t | 500K | 62.5K | 100K | 500K |
mdtest hard files total | 1000K | 2000K | 16000K | 16000K |
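As an example of how these figures map onto io500.sh, the 10n16t run corresponds roughly to the following variables. The values are derived from the table above; treat them as an approximation of our scripts, not a verbatim copy.

io500_mpirun="mpirun"
io500_mpiargs="-np 160"                 # 10 nodes x 16 tasks per node

io500_ior_easy_size=20480               # 20G per task, in MiB
io500_ior_easy_params="-t 1m -b ${io500_ior_easy_size}m -F"   # 1M transfers, file per process
io500_mdtest_easy_files_per_proc=12500
io500_ior_hard_writes_per_proc=25000
io500_mdtest_hard_files_per_proc=100000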
Why these tests?
- 2n1t was chosen to fit a simple MPI job.
- 4n8t was requested by HPC team to match one of their bigger jobs.
- 10n16t is to match certain known IO500 submissions from vendors.
- 32n1t is to match some non-IO500 benchmarks done by vendors.
Results
Tests | 2n1t-nlsas | 2n1t-ssd | 4n8t-nlsas | 4n8t-ssd | 10n16t-nlsas | 10n16t-ssd | 32n1t |
---|---|---|---|---|---|---|---|
ior_easy_write GB/s | 2.687 | 2.756 | 6.150 | 11.541 | 6.283 | 11.897 | n/a (2) |
mdtest_easy_write kiops | 5.304 | 4.146 | 12.446 | 9.387 | 6.872 | 6.220 | n/a |
ior_hard_write GB/s | 0.008 | 0.036 | 0.049 | 0.366 | 0.156 | 0.953 | n/a |
mdtest_hard_write kiops | 5.088 | 4.067 | 6.260 | 4.743 | 6.762 | 5.570 | n/a |
find kiops | 63.520 | 126.030 | 108.610 | 105.110 | 131.290 | 99.810 | n/a |
ior_easy_read GB/s | 2.605 | 2.435 | 15.398 | 18.598 | 41.064 | 22.847 | n/a |
mdtest_easy_stat kiops | 13.112 | 11.833 | 15.977 | 15.485 | 18.632 | 19.441 | n/a |
ior_hard_read GB/s | 0.013 | 0.062 | 0.111 | 0.908 | 0.359 | 3.890 | n/a |
mdtest_hard_stat kiops | 10.462 | 11.996 | 23.456 | 19.114 | 15.759 | 18.722 | n/a |
mdtest_easy_del kiops | 2.856 | 2.341 | 5.345 | 3.983 | 4.451 | 4.463 | n/a |
mdtest_hard_read kiops | n/a (1) | 1.392 | 2.560 | 4.242 | 3.958 | 5.038 | n/a |
mdtest_hard_del kiops | n/a (1) | 2.673 | 4.160 | 4.310 | 4.577 | 4.491 | n/a |
- (1) We hit a Ceph MDS hiccup with a "client failing to release caps" error and killed the Slurm job as it was taking too long.
- (2) The 32n1t SSD run put too high a load on the SSD pool, probably because it has too few storage nodes relative to the number of clients, plus the big mismatch in network speed (100GbE on the clients vs 25GbE on the storage nodes), and it crashed 2 storage nodes. We did not have time to run 32n1t on NLSAS.
Note: the metadata performance mostly reflects the NVMe metadata pool, and does not really show the difference between the NLSAS and SSD data pools.
System loads
During testing we monitored the system load. Here are some highlights:
Slow request storm
The ior easy writes put a lot of load on our NLSAS OSDs, which created a storm of slow requests. At worst they affected every single NLSAS OSD and piled up like this: 90817 slow requests are blocked > 32 sec. However, they cleared up as soon as the test neared its end and did not cause any lasting harm.
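For anyone wanting to watch this kind of event, the slow-request counts above are visible through the usual Ceph status commands (illustrative, not a transcript of our session):

ceph -s              # overall health; slow requests appear as a health warning
ceph health detail   # shows which OSDs the slow requests are blocked on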
MDS requests
Our MDS nodes got hit really hard during the metadata tests. The 10n16t benchmarks put the biggest load we had ever seen on them, e.g.:
+------+--------+----------------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+----------------+---------------+-------+-------+
| 0 | active | mds3-ceph2-qh2 | Reqs: 19.8k/s | 2542k | 2519k |
| 1 | active | mds2-ceph2-qh2 | Reqs: 14.4k/s | 6751k | 6751k |
+------+--------+----------------+---------------+-------+-------+
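The table above is the output of ceph fs status; the dns and inos columns are the dentries and inodes held in each MDS rank's cache. Per-daemon counters can also be followed live, for example:

ceph fs status            # MDS ranks, request rates, cache sizes
ceph daemonperf mds.<id>  # rolling per-second counters for a single MDS (run on its host)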