During the 2019-07-08/09 maintenance window for Spartan HPC, we ran extensive benchmarking with IO500 on our CephFS cluster. Here are the details.
Infrastructure details
Networking
- Mellanox SN2100 leaf switches for Ceph nodes
- Mellanox SN2700 leaf switches for client gpgpu nodes
- Mellanox SN2700 spine switches
- 2x100G from each leaf to the spines, 4x100G between spines
Ceph cluster
RHEL 7.6, kernel-lt elrepo 4.4.135-1.el7.elrepo.x86_64, Mellanox OFED 4.3-3.0.2.1
- mon[1-5]: 1x 10-core Xeon v4 2.4GHz, 64GB of RAM, 2x25GbE Mellanox
- mds[1-3]: 2 active, 1 standby; each has 1x 6-core Xeon v4 3.4GHz, 512GB of RAM, 2x25GbE Mellanox
- NLSAS data pool:
- 36 OSD nodes, 16 drives each (576 drives in total), mix of 8TB and 10TB NLSAS drives.
- Each node has 1x NVMe card (Intel P3700 or Optane 900P) providing the WAL (2GB) and RocksDB (10GB) for each OSD.
- 1x 10-core Xeon v4 2.4GHz, 128GB of RAM, 2x25GbE Mellanox
- 3x replication
- Fullness: ~60%
- SSD data pool:
- 16 OSD nodes, 8 Sandisk BSSD 8TB drives each over 12Gb SAS (IF150 unit), 128 drives in total
- Each node has 2x NVMe cards (Optane 900P) providing the WAL (4GB) and RocksDB (40GB) for each OSD
- 2x 16-core Xeon v4 2.6GHz, 128GB of RAM, 2x25GbE Mellanox
- Erasure coded, 4+2 (k=4, m=2)
- Fullness: ~73%
- Metadata pool:
- On 10 of the 16 SSD OSD nodes
- Each node has 1x NVMe (Optane 900P 480GB) partitioned into 4, each partition becoming an OSD (40 NVMe OSDs in total)
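For reference, a pool layout along these lines can be expressed with standard Ceph commands. The sketch below is illustrative only: the pool names, CRUSH rule names, PG counts and device classes are assumptions, not our exact configuration.

# Replicated 3x data pool on the NLSAS (hdd class) OSDs -- names and PG counts are assumptions
ceph osd crush rule create-replicated nlsas_rule default host hdd
ceph osd pool create cephfs_data_nlsas 4096 4096 replicated nlsas_rule
ceph osd pool set cephfs_data_nlsas size 3

# EC 4+2 data pool on the SSD OSDs
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host crush-device-class=ssd
ceph osd pool create cephfs_data_ssd 1024 1024 erasure ec42
ceph osd pool set cephfs_data_ssd allow_ec_overwrites true   # required for CephFS data on EC pools

# Replicated metadata pool on the NVMe OSDs, plus two active MDS ranks
ceph osd crush rule create-replicated nvme_rule default host nvme
ceph osd pool create cephfs_metadata 512 512 replicated nvme_rule
ceph fs set cephfs max_mds 2

Both data pools can then be attached to the filesystem (ceph fs add_data_pool cephfs cephfs_data_ssd) and selected per directory, as sketched in the next section.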
Client nodes:
We use 2, 4, 10 and 32 nodes from our gpgpu cluster. Each node has the following specs:
spartan-gpgpu:
- 2x 12-core Xeon v4 2.2GHz
- 128GB of RAM
- 1x100GbE Mellanox
- 4x Tesla P100s (not actually used in IO500 benchmarks)
Nodes are spread across as many racks (one switch per rack) as possible, up to 6 racks.
Compiling IO500
We run IO500 through Slurm on Spartan and compile it using Spartan's modules.
ssh login@spartan
<create benchmark directories on NLSAS and Sandisk SSD pools>
cd <benchmark directory>
module load OpenMPI/3.1.3-GCC-6.2.0-ucx Autoconf/2.69-GCC-6.2.0 Automake/1.15-GCC-6.2.0
git clone https://github.com/VI4IO/io-500-dev
cd io-500-dev
./utilities/prepare.sh
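The placeholder above is where the benchmark directories are created on the NLSAS and SSD pools. One common way to do this on CephFS (not necessarily exactly what we did) is to pin each directory to a data pool via the layout xattr; the mount point and pool names here are assumptions:

# Hypothetical mount point and pool names
mkdir -p /mnt/cephfs/io500-nlsas /mnt/cephfs/io500-ssd
setfattr -n ceph.dir.layout.pool -v cephfs_data_nlsas /mnt/cephfs/io500-nlsas
setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/io500-ssd

New files created under each directory then land in the corresponding data pool.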
Preparing scripts
Slurm
We run the IO500 benchmark through Slurm. The Slurm script looks like this:
mytest_10n_16t_ssd.sh
#!/bin/bash
#SBATCH -p debug
#SBATCH -w spartan-gpgpu[001-002,013-014,024-025,035-036,047-048]
#SBATCH --ntasks=160
#SBATCH --tasks-per-node=16
#SBATCH --mem=100G
#SBATCH --cpus-per-task=1
#SBATCH --time=06:00:00
module load OpenMPI/3.1.3-GCC-6.2.0-ucx
./io500_10n_16t_ssd.sh
It is submitted by running sbatch mytest_10n_16t_ssd.sh, and the output of IO500 is captured in slurm-${job_id}.out in the same directory.
io500_10n_16t_ssd.sh is a copy of the provided io500.sh with parameters modified as officially instructed; details below.
Tests
Each test has the following IO500 runs:
io500_run_ior_easy="True" # does the write phase and enables the subsequent read
io500_run_md_easy="True" # does the creat phase and enables the subsequent stat
io500_run_ior_hard="True" # does the write phase and enables the subsequent read
io500_run_md_hard="True" # does the creat phase and enables the subsequent read
io500_run_find="True"
io500_run_ior_easy_read="True"
io500_run_md_easy_stat="True"
io500_run_ior_hard_read="True"
io500_run_md_hard_stat="True"
io500_run_md_hard_read="True"
io500_run_md_easy_delete="True" # turn this off if you want to just run find by itself
io500_run_md_hard_delete="True" # turn this off if you want to just run find by itself
io500_run_mdreal="False" # this one is optional
io500_cleanup_workdir="False" # this flag is currently ignored. You'll need to clean up your data files manually if you want to.
io500_stonewall_timer=300 # Stonewalling timer, stop with wearout after 300s with default test, set to 0, if you never want to abort...
io500_find_mpi="True"
Note: we use the parallel MPI find command.
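For completeness, the stock io500.sh wires up the parallel find roughly like this (the exact lines may differ between io-500-dev revisions):

io500_find_mpi="True"
io500_find_cmd="$PWD/bin/pfind"   # pfind is built by utilities/prepare.sh
io500_find_cmd_args=""            # extra pfind arguments, left at the default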
We run each test twice, once on NLSAS, once on SSD. The tests are:
Tests | 2n1t | 4n8t | 10n16t | 32n1t |
---|---|---|---|---|
Clients | 2 | 4 | 10 | 32 |
Threads per client | 1 | 8 | 16 | 1 |
mpirun args | -np 2 | -np 32 | -np 160 | -np 32 |
ior_easy_size per t | 200G | 20G | 20G | 200G |
ior_easy_size total | 400G | 640G | 3.2T | 6.4T |
ior_easy bs | 1M | 1M | 1M | 1M |
mdtest easy files per t | 600K | 12.5K | 12.5K | 600K |
mdtest easy files total | 1200K | 400K | 2000K | 19200K |
ior hard writes per t | 100K | 10K | 25K | 100K |
ior hard writes total | 200K | 320K | 4000K | 3200K |
mdtest hard files per t | 500K | 62.5K | 100K | 500K |
mdtest hard files total | 1000K | 2000K | 16000K | 16000K |
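As an example of how these figures map onto io500.sh, the 10n16t run corresponds roughly to the following variables. The values are derived from the table above; treat them as an approximation of our scripts, not a verbatim copy.

io500_mpirun="mpirun"
io500_mpiargs="-np 160"                 # 10 nodes x 16 tasks per node

io500_ior_easy_size=20480               # 20G per task, in MiB
io500_ior_easy_params="-t 1m -b ${io500_ior_easy_size}m -F"   # 1M transfers, file per process
io500_mdtest_easy_files_per_proc=12500
io500_ior_hard_writes_per_proc=25000
io500_mdtest_hard_files_per_proc=100000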
Why these tests?
- 2n1t was chosen to fit a simple MPI job.
- 4n8t was requested by HPC team to match one of their bigger jobs.
- 10n16t is to match certain known IO500 submissions from vendors.
- 32n1t is to match some non-IO500 benchmarks done by vendors.
Results
Tests | 2n1t-nlsas | 2n1t-ssd | 4n8t-nlsas | 4n8t-ssd | 10n16t-nlsas | 10n16t-ssd | 32n1t |
---|---|---|---|---|---|---|---|
ior_easy_write GB/s | 2.687 | 2.756 | 6.150 | 11.541 | 6.283 | 11.897 | n/a (2) |
mdtest_easy_write kiops | 5.304 | 4.146 | 12.446 | 9.387 | 6.872 | 6.220 | n/a |
ior_hard_write GB/s | 0.008 | 0.036 | 0.049 | 0.366 | 0.156 | 0.953 | n/a |
mdtest_hard_write kiops | 5.088 | 4.067 | 6.260 | 4.743 | 6.762 | 5.570 | n/a |
find kiops | 63.520 | 126.030 | 108.610 | 105.110 | 131.290 | 99.810 | n/a |
ior_easy_read GB/s | 2.605 | 2.435 | 15.398 | 18.598 | 41.064 | 22.847 | n/a |
mdtest_easy_stat kiops | 13.112 | 11.833 | 15.977 | 15.485 | 18.632 | 19.441 | n/a |
ior_hard_read GB/s | 0.013 | 0.062 | 0.111 | 0.908 | 0.359 | 3.890 | n/a |
mdtest_hard_stat kiops | 10.462 | 11.996 | 23.456 | 19.114 | 15.759 | 18.722 | n/a |
mdtest_easy_del kiops | 2.856 | 2.341 | 5.345 | 3.983 | 4.451 | 4.463 | n/a |
mdtest_hard_read kiops | n/a (1) | 1.392 | 2.560 | 4.242 | 3.958 | 5.038 | n/a |
mdtest_hard_del kiops | n/a (1) | 2.673 | 4.160 | 4.310 | 4.577 | 4.491 | n/a |
- (1) We hit a Ceph MDS hiccup with a "client failing to release caps" error and killed the Slurm job as it was taking too long.
- (2) The 32n1t SSD run put too high a load on the SSD pool, probably because it has too few storage nodes relative to the number of clients, plus the big mismatch in network speed (100GbE on the clients vs 25GbE on the storage nodes), and it crashed 2 storage nodes. We did not have time to run 32n1t on NLSAS.
Note: the metadata performance mostly reflects the NVMe metadata pool, and does not really show the difference between the NLSAS and SSD data pools.
System loads
During testing we monitored the system load. Here are some highlights:
Slow request storm
The ior easy writes put a lot of load on our NLSAS OSDs, which created a storm of slow requests. At worst they affected every single NLSAS OSD and piled up like this: 90817 slow requests are blocked > 32 sec. However, they cleared up as soon as the test neared its end and did not cause any lasting harm.
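For anyone wanting to watch this kind of event, the slow-request counts above are visible through the usual Ceph status commands (illustrative, not a transcript of our session):

ceph -s              # overall health; slow requests appear as a health warning
ceph health detail   # shows which OSDs the slow requests are blocked on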
MDS requests
Our MDS nodes got hit really hard during the metadata tests. The 10n16t benchmarks put the biggest load we had ever seen on them, e.g.:
+------+--------+----------------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+----------------+---------------+-------+-------+
| 0 | active | mds3-ceph2-qh2 | Reqs: 19.8k/s | 2542k | 2519k |
| 1 | active | mds2-ceph2-qh2 | Reqs: 14.4k/s | 6751k | 6751k |
+------+--------+----------------+---------------+-------+-------+
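The table above is the output of ceph fs status; the dns and inos columns are the dentries and inodes held in each MDS rank's cache. Per-daemon counters can also be followed live, for example:

ceph fs status            # MDS ranks, request rates, cache sizes
ceph daemonperf mds.<id>  # rolling per-second counters for a single MDS (run on its host)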