+------+--------+----------------+---------------+-------+-------+
```
### Peak throughputs in ceph
Glancing at `ceph status` during the tests, we could see some big numbers, such as this snapshot from the SSD 10n16t write test:
```bash
io:
client: 24.3GiB/s wr, 0op/s rd, 24.93kop/s wr
```
That is a peak of 24.3 GiB/s (roughly 26 GB/s) against an averaged final result of 11.897 GB/s for the same test.

The read test that finished with an average throughput of 41.5GB/s must have had an even higher peak. Unfortunately, the Grafana chart that tracks this somehow failed to collect any metrics during the whole benchmark window, possibly due to the high load.
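
To avoid missing peaks like this next time, a simple polling loop against `ceph status` can record them independently of the dashboards. A minimal sketch, assuming a recent Ceph release where `ceph status -f json` reports client IO under `pgmap` (field names may differ between versions):

```bash
#!/usr/bin/env bash
# Poll ceph status once a second and keep track of the highest client write
# throughput seen; stop with Ctrl-C. The pgmap.write_bytes_sec field is taken
# from recent Ceph JSON output and may be absent while the cluster is idle.
peak=0
while sleep 1; do
    wr=$(ceph status -f json | jq '.pgmap.write_bytes_sec // 0')
    if (( wr > peak )); then
        peak=$wr
        printf 'new peak: %.2f GiB/s wr\n' "$(echo "$peak / 2^30" | bc -l)"
    fi
done
```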
### Grafana charts
We monitor ceph cluster loads via Prometheus/Graphite/Grafana, and here are some screenshots of the cluster during the tests.
### Networking mismatch
The client nodes we chose all have 1x100GbE networking, which could simply be too much for the storage nodes on 2x25GbE: a single client can in theory push 100Gb/s, twice what a single storage node can take in across both of its 25GbE links. The switches the client nodes sit on are also faster.
### Misconfigurations
As first-time users of IO500, we did not configure the parameters optimally, and we also made a few typos that produced workloads which did not quite match what we were after, e.g. too many files for the 32-node test.

Many IO500 tests require a minimum runtime of 300s for an official submission. Without knowing how long our cluster takes to run through each test, it is hard to predict what parameters to give IO500, so the same test has to be run multiple times to make sure all results are valid for official submission. We can rectify this in the next maintenance window, when we aim to run IO500 again.
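
For the next attempt, the rough idea is to keep one config file per node count, leave the stonewall timer at the official 300s, and only adjust the per-rank file counts. A sketch of what that could look like; the section and option names follow the config-minimal.ini bundled with IO500 (double-check them against the version in use), and the paths, host file and mdtest file count are purely illustrative:

```bash
# Write a per-run config; option names as in io500's bundled config-minimal.ini.
cat > config-10n16t.ini <<'EOF'
[global]
# Illustrative CephFS scratch directory for the benchmark files.
datadir = /mnt/cephfs/io500-scratch

[debug]
# Must be 300 (or left unset) for a result that is valid for official submission.
stonewall-time = 300

[mdtest-easy]
# Files created per rank -- the knob we got wrong on the 32-node run.
n = 400000
EOF

# 10 client nodes x 16 tasks = 160 ranks, matching the 10n16t test.
mpirun -np 160 --hostfile clients.txt ./io500 config-10n16t.ini
```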
### OSD loads
The NLSAS OSD host loads were fairly similar to each other during each test. The Sandisk SSD OSD hosts, which carry our EC SSD pool, however, showed wildly varying loads: during the 10n16t test a couple were hitting load averages of 110-120 while the rest sat between 25 and 50, and during the 32n1t test, which crashed two SSD hosts, the pattern was similar but the numbers even higher. This might be an effect of erasure coding, which is very CPU-intensive, but we did not expect some hosts to be hit so much harder than the rest; we expected all of them to sit at similarly high loads.
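
One thing worth checking before the next run is how evenly the EC pool's placement groups, and in particular their primaries, are spread across the SSD OSDs, since the primary OSD does the encoding work and an uneven primary distribution would concentrate CPU load on a few hosts. A quick sketch; the pool name is illustrative and the jq path assumes the JSON layout of recent Ceph releases:

```bash
# Per-OSD PG counts and utilisation, rolled up per host.
ceph osd df tree

# Count how many PGs of the EC SSD data pool each OSD is primary for.
# ".pg_stats // ." copes with both the newer wrapped and older bare-array JSON.
ceph pg ls-by-pool cephfs_ssd_data -f json \
  | jq -r '(.pg_stats // .)[].acting_primary' \
  | sort -n | uniq -c | sort -rn | head
```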
### Metadata performance
Metadata performance is primarily dictated by our 10x NVMe disks, which we carved into 40 NVMe OSDs. These NVMe OSDs are co-located on the same hosts as the SSD OSDs, which is likely the main reason the metadata kiops are better in the NLSAS tests than in the SSD tests once the job load gets much higher: the higher CPU, memory and network load on the SSD hosts starts to affect the NVMe metadata OSD performance.

One lesson from this is that a big CephFS cluster serving HPC should ideally have dedicated metadata OSD hosts, e.g. 5x servers with NVMe disks.
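
For reference, tying the metadata pool to the NVMe device class is just a CRUSH rule (a sketch; rule and pool names are illustrative), but a CRUSH rule only controls placement. It cannot shield the NVMe OSDs from CPU, memory and network contention on a shared host, which is exactly why dedicated metadata hosts would help:

```bash
# Replicated CRUSH rule that only selects OSDs of device class "nvme",
# with host as the failure domain.
ceph osd crush rule create-replicated meta-nvme default host nvme

# Point the CephFS metadata pool at that rule.
ceph osd pool set cephfs_metadata crush_rule meta-nvme
```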
In production, however, we have not yet run into IO-intensive jobs like our IO500 benchmarks, so this weakness has never been exposed.