Recently, I wrote about being wrong. Specifically germane to this post, I wrote:
Being wrong means I probably had some gross misunderstanding of a system’s architecture or the operation of a program, and that means I have an opportunity to learn more about it, and hopefully to be able to guide my decisions.
I was involved in an incident where I thought that some EC2 instances had reached disk saturation. That mistaken belief was primarily driven by three factors: referencing an outdated man page, tunnel vision, and missing the units on a graph.
My work has some i4i-family EC2 instances, which have native NVMe drives. As in, locally attached to the host, not over EBS. If you haven’t personally experienced the performance difference, it is massive. IOPS are essentially unlimited (not exactly, but at the 3.75 TB/device level, you get 400K/220K Read/Write – presumably with 4K blocks, but it doesn’t specify), though the data is ephemeral and does not survive a stop/start. This is a fair trade, if you design your system for the eventuality that the hardware may fail. Ours have 4x such drives, in a RAID0 stripe, so we have about 1.6M/880K Read/Write IOPS.
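For reference, a quick way to confirm the drive count and the stripe from the instance itself (a sketch; it assumes the stripe is an mdraid array, and /dev/md0 is a guess at its name):
# List the locally attached NVMe devices (requires nvme-cli); instance store disks show up as nvme*n1
sudo nvme list
# If the stripe is mdraid (an assumption; an LVM stripe would look different), show the array and its members
cat /proc/mdstat
sudo mdadm --detail /dev/md0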
Some queries were taking an absurd amount of time to complete, and running iostat on the host, I noticed that the reported %util for the array was hovering between 99 and 100%. I had also previously noticed that a large number of blocks were being dirtied by other queries, and that buffer waits were being reported. My thought was that the queries dirtying blocks were forcing this other large query to flush and refill the buffer pool, which was saturating the disks. iowait was quite low, but that’s known to be deceptive, so I ignored it. In hindsight, had I paid closer attention to the r/s and w/s values, I’d have found that they were far below the device limits.
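Those per-device rates are right there in the same extended output. A minimal way to watch them (a sketch, assuming sysstat’s iostat, an mdraid array named md0, and devices named nvme0n1 through nvme3n1):
# Extended device stats in MB, refreshed every second, for the array and its member drives
iostat -xm 1 md0 nvme0n1 nvme1n1 nvme2n1 nvme3n1
# Compare r/s + w/s per underlying device against the rated 400K/220K read/write IOPS;
# %util alone will not tell you whether an NVMe device is actually saturated.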
Had I instead referenced man7.org for iostat (or, you know, just run man iostat on the machine), I would’ve found this:
Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
Annoyingly, I know this, and had I not gotten tunnel vision on what I was sure was the issue, I might have caught myself. Similarly, had I paid closer attention to the AWS-generated graphs, I would’ve seen that they report in ops/minute, not ops/second. Alas, that wasn’t the case. A good reminder to take a deep breath, think about the metrics you’re seeing, and ask if they support your hypothesis.
In order to fully disabuse myself of this notion (and to run an experiment), I opted to utilize the NVMe drives in my homelab. In my Proxmox compute cluster, I have three Samsung 983 DCT 1.92 TB drives, one in each server. This is distributed via Ceph in replication mode, and is connected via Mellanox ConnectX-3 Pro NICs in a mesh, which provides 56 Gbps over InfiniBand. As I don’t yet have RoCE working, I’m using IPoIB, which is much more CPU-intensive; I can get about 20 Gbps before saturating a CPU core. Still, that’s not too shabby compared to the 1 Gbps I had before. The drives themselves are rated at 580K/52K IOPS random read/write (it’s a read-optimized model), or 3.4/2.2 GB/s sequential read/write. So, I have a known maximum.
Ceph of course has its own overhead, but since all I really wanted to know was whether I had misinterpreted the output, that didn’t matter. A simple test was set up by creating a test pool in Ceph and, in that pool, creating two rbd images. They were then mapped to /dev/image{01,02}, exposing them to the kernel as normal block devices. A normal ext4 filesystem was created on each, they were mounted, and they were then benched with the built-in rbd bench toolset. During this, each device was monitored with iostat. I first tested one image singly, then both simultaneously. Since both images are backed by the same underlying devices, this should show how iostat reports utilization.
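The setup, roughly (a sketch rather than a transcript; image sizes and mount points are illustrative, and rbd map normally hands back /dev/rbdN nodes, which is what I use below):
# Create a throwaway pool and initialize it for RBD
ceph osd pool create testbench
rbd pool init testbench
# Create and map two images (size is illustrative); rbd map prints the device node, e.g. /dev/rbd0
rbd create --size 20G testbench/image01
rbd create --size 20G testbench/image02
sudo rbd map testbench/image01
sudo rbd map testbench/image02
# Put an ext4 filesystem on each and mount them
sudo mkfs.ext4 /dev/rbd0
sudo mkfs.ext4 /dev/rbd1
sudo mkdir -p /mnt/image01 /mnt/image02
sudo mount /dev/rbd0 /mnt/image01
sudo mount /dev/rbd1 /mnt/image02
# Watch the backing NVMe drive in one terminal while the bench runs in another
iostat -xm 1 nvme0n1
sudo rbd bench --io-type write image01 --pool=testbench --io-size 8192 --io-total 8GB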
Output during a single test run (peak w/s window selected):
avg-cpu: %user %nice %system %iowait %steal %idle
15.5% 0.0% 8.0% 0.5% 0.0% 75.9%
r/s rkB/s rrqm/s %rrqm r_await rareq-sz Device
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme0n1
w/s wkB/s wrqm/s %wrqm w_await wareq-sz Device
16597.00 617.4M 10328.00 38.4% 0.14 38.1k nvme0n1
d/s dkB/s drqm/s %drqm d_await dareq-sz Device
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme0n1
f/s f_await aqu-sz %util Device
0.00 0.00 2.28 96.0% nvme0n1
sgarland@dell01-pve:~$ sudo rbd bench --io-type write image01 --pool=testbench --io-size 8192 --io-total 8GB
bench type write io_size 8192 io_threads 16 bytes 8589934592 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 103152 103581 809 MiB/s
2 199008 99710 779 MiB/s
3 287488 95961.3 750 MiB/s
4 361552 90481.2 707 MiB/s
5 454848 91044.3 711 MiB/s
6 553520 90072.3 704 MiB/s
7 656064 91409.9 714 MiB/s
8 766800 95861 749 MiB/s
9 880320 103752 811 MiB/s
10 998896 108808 850 MiB/s
elapsed: 10 ops: 1048576 ops/sec: 100360 bytes/sec: 784 MiB/s
Output with two simultaneous runs (peak w/s window selected):
avg-cpu: %user %nice %system %iowait %steal %idle
19.2% 0.0% 8.6% 0.4% 0.0% 71.8%
r/s rkB/s rrqm/s %rrqm r_await rareq-sz Device
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme0n1
w/s wkB/s wrqm/s %wrqm w_await wareq-sz Device
21357.00 591.9M 9520.00 30.8% 0.07 28.4k nvme0n1
d/s dkB/s drqm/s %drqm d_await dareq-sz Device
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme0n1
f/s f_await aqu-sz %util Device
0.00 0.00 1.49 98.0% nvme0n1
sgarland@dell01-pve:~$ sudo rbd bench --io-type write image01 --pool=testbench --io-size 8192 --io-total 8GB
bench type write io_size 8192 io_threads 16 bytes 8589934592 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 68848 69139.6 540 MiB/s
2 139104 69698.4 545 MiB/s
3 218880 73061.7 571 MiB/s
4 300704 75254.2 588 MiB/s
5 382704 76542.9 598 MiB/s
6 464544 79138.1 618 MiB/s
7 543344 80846.8 632 MiB/s
8 614624 79147.7 618 MiB/s
9 689072 77672.5 607 MiB/s
10 761792 75877.2 593 MiB/s
11 836736 74437.3 582 MiB/s
12 911088 73547.7 575 MiB/s
13 988256 74725.3 584 MiB/s
elapsed: 13 ops: 1048576 ops/sec: 75523 bytes/sec: 590 MiB/s
sgarland@dell01-pve:~$ sudo rbd bench --io-type write image02 --pool=testbench --io-size 8192 --io-total 8GB
bench type write io_size 8192 io_threads 16 bytes 8589934592 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 45952 46151.9 361 MiB/s
2 95120 47662.6 372 MiB/s
3 141968 47390.5 370 MiB/s
4 191536 47935.2 374 MiB/s
5 240320 48105 376 MiB/s
6 290560 48920.9 382 MiB/s
7 340224 49020.1 383 MiB/s
8 395408 50687.3 396 MiB/s
9 453680 52428 410 MiB/s
10 510992 54047.1 422 MiB/s
11 567488 55384.8 433 MiB/s
12 630288 58012 453 MiB/s
13 684544 57826.4 452 MiB/s
14 744768 58216.8 455 MiB/s
15 812784 60454.2 472 MiB/s
16 912272 68955.8 539 MiB/s
17 1029440 79829.2 624 MiB/s
elapsed: 17 ops: 1048576 ops/sec: 61048 bytes/sec: 477 MiB/s
You can clearly see that iowait was relatively low for both runs, and even though there were ~30% more writes/sec at peak during the simultaneous run, %util is effectively at maximum in both cases (96% vs. 98%). The same held true for reads. Thus, %util is indeed fairly meaningless for devices capable of serving requests in parallel, like NVMe drives. At least I learned something.
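As a back-of-the-envelope check using the figures above: at the simultaneous-run peak, the drive was writing ~592 MB/s against its rated 2.2 GB/s of sequential write bandwidth, yet iostat still reported 98% utilization.
# Peak device-level write throughput vs. the rated 2.2 GB/s sequential write of the 983 DCT
awk 'BEGIN { printf "%.0f%% of rated sequential write bandwidth\n", 591.9 / 2200 * 100 }'   # prints 27%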