Changes

Linh Vu · 92d50950
--- a/HPC-IO500-2019-07-08.md
+++ b/HPC-IO500-2019-07-08.md
@@ -165,8 +165,6 @@ During testing, we observed and also monitored the system loads. Here are some h

 IO easy writes put a lot of load on our NLSAS OSDs, which created a storm of slow requests. At worst, they affected every single NLSAS OSD, and piled up like this: `90817 slow requests are blocked > 32 sec`. However, they cleared up as soon as the test neared its end, and did not cause any harmful effect. 

-Although the cluster and NLSAS pool coped fine, due to this, we do not recommend running big jobs directly on the NLSAS pool, which should be reserved for long term storage only. Compute jobs should be on faster scratch storage e.g SSD or NVMe. 
-
 ### MDS requests

 Our MDS nodes got hit really hard during the metadata tests. The 10n16t benchmarks put the biggest load we had ever seen on them, e.g:
@@ -180,4 +178,46 @@ Our MDS nodes got hit really hard during the metadata tests. The 10n16t benchmar
 +------+--------+----------------+---------------+-------+-------+
 ```

+### Grafana charts
+
+We monitor Ceph cluster load via Prometheus/Graphite/Grafana. The screenshots below show the cluster during the tests.
+
+**Overall load across the whole 8+ hour window of IO500 benchmarking**
+
+![ceph_load_combined_io500](uploads/7d0f4b091b9f5ae19c4f5a24e0a6966b/ceph_load_combined_io500.png)
+
+**Overall combined network throughput** *(may not be accurate because sampling times differ between nodes)*
+
+![network_throughput_io500](uploads/0ee736d4d02b1c80852e28118229dd3f/network_throughput_io500.png)
+
+**2n1t_nlsas**
+
+![ceph_load_2n1t_nlsas](uploads/2b743a2420d6cf3792996f7eac7935ce/ceph_load_2n1t_nlsas.png)
+
+**2n1t_ssd**
+
+![ceph_load_2n1t_ssd](uploads/1ff84a3894fb6d3aafb7b172d3790578/ceph_load_2n1t_ssd.png)
+
+**4n8t_nlsas**
+
+![ceph_load_4n8t_nlsas](uploads/3acfe795612fb3d4906ac7005b94c480/ceph_load_4n8t_nlsas.png)
+
+**4n8t_ssd**
+
+![ceph_load_4n8t_ssd](uploads/9d3086ceca97ca1559cb6043981595ad/ceph_load_4n8t_ssd.png)
+
+**10n16t_nlsas**
+
+![ceph_load_10n16t_nlsas](uploads/9d923c76725b33ddec94898d30527f9a/ceph_load_10n16t_nlsas.png)
+
+**10n16t_ssd**
+
+![ceph_load_10n16t_ssd](uploads/cb51fecff37fbd2724fa92976e462825/ceph_load_10n16t_ssd.png)
+
+**32n1t_ssd** *(this is the run that crashed)*
+
+![ceph_load_32n1t_ssd](uploads/77de285eb72f54897de24de40971f4ff/ceph_load_32n1t_ssd.png)
+
+## What we learned from this
+
+Although the cluster and the NLSAS pool coped fine, the storm of slow requests during the IO easy writes means we do not recommend running big jobs directly on the NLSAS pool, which should be reserved for long-term storage only. Compute jobs should run on faster scratch storage, e.g. SSD or NVMe.