Ceph is a distributed storage system. That severely understates its complexity and capabilities. Its documentation is extensive, and if you intend to try Ceph out, I strongly recommend you spend some time thoroughly reading it. All of it. I spent about two weeks on and off studying it, along with Red Hat’s docs (which are much the same, but different enough to be worth reading), various Reddit posts, random Medium blogs, etc. I’ll attempt a simple architecture explanation here, but again, I highly recommend reading the source documentation.

  • Controlled, Scalable, Decentralized Placement of Replicated Data (CRUSH): No, the acronym doesn’t line up. This is how Ceph decides where data is stored, and in my eyes it’s the coolest part of Ceph. Each chunk of data is mapped to an OSD algorithmically, and the algorithm can account for factors like rack, network topology, and physical location (e.g. datacenter). Because placement is computed rather than looked up, every client can work out where its requested data lives and query that location directly, instead of relying on a central server. (See the pool sketch after this list for how this shows up in practice.)
  • Manager (MGR): Tracks metrics and the current state of the cluster.
  • Metadata Server (MDS): Stores metadata for the Ceph File System, which is a POSIX-compliant layer atop Ceph’s Object Store.
  • Monitor (MON): Maintains maps of the cluster state, other daemons, and the CRUSH map.
  • Object Storage Daemon (OSD): Stores the actual data (typically one OSD per disk) and handles its replication, recovery, and monitoring.

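To make CRUSH slightly more concrete: for most setups, the knobs you actually touch are the failure domain and the replica count on a pool. Here’s a minimal sketch of a replicated pool as Rook (a Kubernetes operator I get to below) expresses it; the names are just examples.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool        # example name
  namespace: rook-ceph
spec:
  # CRUSH places the three replicas of each object in different
  # failure domains - "host" here, but "rack" or "datacenter"
  # also work if your CRUSH map describes them.
  failureDomain: host
  replicated:
    size: 3
```
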
Got all that? Probably not - go read the source docs.

I then ran a test on DigitalOcean, recreating my infrastructure as best I could. I enabled Ceph using Proxmox’s GUI, and to my surprise, it just worked. Seriously - it abstracts all of the complexity and difficulty away, and works flawlessly. Proxmox of course has its own docs, which I highly recommend you read as well, but in general know that it works quite well. Thinking myself competent, I decided to try Rook/Ceph next. Rook is a Kubernetes storage orchestrator, and can also handle NFS and Cassandra, should you be interested.
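
The heart of a Rook deployment is a CephCluster custom resource, which the operator reconciles into mons, mgrs, and OSDs. Stripped down, a cluster definition looks roughly like this - treat it as a sketch from memory rather than a copy of my manifest, and substitute whatever Ceph image is current:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17   # any supported release tag
  dataDirHostPath: /var/lib/rook   # where mons and OSDs keep local state
  mon:
    count: 3                       # one monitor per node on a 3-node cluster
    allowMultiplePerNode: false
  dashboard:
    enabled: true
  storage:
    useAllNodes: true
    useAllDevices: true            # convenient, but see the disk adventures below
```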

After bungling its Helm chart and realizing I forgot to add my override values, I tried uninstalling it. It seemed to be hanging, so I thought, “might as well delete the namespace, that’ll kill it.” Big mistake. As it turns out, Rook has an entire page on tearing down a Ceph cluster. tl;dr if you don’t do it in the right order, you’ll be left with dangling finalizers and leftover signatures on your disks.
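
The gist of the teardown order, as I understand it: delete the things consuming storage first, set a cleanup policy on the CephCluster so the operator sanitizes dataDirHostPath and the disks, delete the CephCluster, and only then remove the operator and its namespace. The cleanup policy looks roughly like this - I’m reciting from memory, so check the docs for your Rook version:

```yaml
# Patched into the CephCluster spec *before* deleting it.
spec:
  cleanupPolicy:
    confirmation: yes-really-destroy-data
    sanitizeDisks:
      method: quick      # wipe metadata only; "complete" scrubs the whole disk
      dataSource: zero
      iteration: 1
```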

With that fixed, I moved on to new mistakes, like trying to use regexes for OSDs. You can tell Rook to exclude or include a given disk, which is useful if you’re not building a pure Ceph cluster and have other applications installed on your disks. Rook allows regexes in deviceFilter, but not in devices - a fact I missed. I was trying to get it to ignore /dev/sdb for reasons which are now lost, but it may have been that the node’s Proxmox install decided that its root was /dev/sdb (friends don’t let friends use friendly names; insist on /dev/disk/by-id), and I was trying to avoid having to wipe it yet again. I vaguely recall considering using mknod to swap major/minor numbers, which would have been entertaining, and probably also a failure. Anyway, the OSD provisioner logs showed it attempting to use /dev/sd[^b] literally, and failing. I then failed to learn from my lessons with GlusterFS, and detached a disk via Proxmox while trying to fix something. This made its OSD very upset, and since I had dropped to 1/3 availability, recovery would have been irritating, so I nuked everything and started over.
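
For reference, the distinction lives in the CephCluster’s storage section, roughly like this (field names are Rook’s; the node name and patterns are made up):

```yaml
spec:
  storage:
    useAllNodes: false
    useAllDevices: false
    # Option 1: deviceFilter takes a regular expression
    # (this matches /dev/sdc, /dev/sdd, and /dev/sde on every node):
    deviceFilter: "^sd[c-e]"
    # Option 2: per-node devices take literal names or paths only;
    # a regex like "sd[^b]" here is used verbatim and fails.
    nodes:
      - name: node1            # hypothetical node name
        devices:
          - name: sdc
```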

I then failed with partitions. Ceph wants either a blank partition (i.e. no filesystem) or a raw disk; given a raw disk, it creates an LVM volume group on it and uses that. I was trying to avoid LVM - again, for reasons which are now lost, other than finding LVM an annoying extra layer - so I had created /dev/sdb1 on the Proxmox nodes and tried passing the partition in. Understandably, Ceph viewed that as /dev/sdb (from its perspective, you just gave it a device), and then freaked out upon finding partition information. Bizarrely, if I passed in /dev/sdb from the node, Ceph refused to see its partition, but still found the partition signature, and refused to continue. tl;dr just pass in a raw disk and let it use LVM.
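
What eventually worked was handing Rook whole, unpartitioned disks; combine that with the by-id advice above and the node entry ends up something like this (the by-id path is obviously invented):

```yaml
spec:
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: node1
        devices:
          # A raw disk with no partition table; ceph-volume puts LVM on it.
          # Using /dev/disk/by-id/... means a reboot can't silently reshuffle
          # which device is sda and which is sdb.
          - name: /dev/disk/by-id/ata-EXAMPLE_SERIAL_123
```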

After a weekend of work, I eventually got everything running: StatefulSets dynamically provisioned their PVCs, reads and writes worked, and everything was monitored.
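
The dynamic-provisioning part is plain Kubernetes once Rook’s StorageClass exists; what I was testing with looked roughly like the sketch below (names and sizes are placeholders, and rook-ceph-block is just the StorageClass name used in Rook’s examples):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo
spec:
  serviceName: demo
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: app
          image: nginx:alpine
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: rook-ceph-block   # each replica gets its own RBD-backed PVC
        resources:
          requests:
            storage: 1Gi
```

I declared this a success, and wiped it out. Why? Glad you asked.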

  • Ceph has pretty steep hardware requirements. While you certainly don’t have to meet the production-ready recommendations from Ceph or Red Hat, you can’t exactly skimp, either. For a 3-node cluster in HA, I estimated I needed about 10 vCPUs and 16 GB of RAM per node for it to run well. Also, it really prefers 10 GbE, which I don’t yet have. Since my nodes have 32 vCPUs each (2x E5-2650v2, 8C/16T), and at the time 32 GB of RAM (since upgraded to 64 GB), this seemed like too much overhead. (Rook does let you cap what each daemon requests; see the sketch after this list.)
  • Ceph gets upset about clock skew. Per a user on Reddit, this was because I set my CPUs for maximum energy efficiency in BIOS, allowing idling of cores, and higher C-states. Apparently, you have to disable all power savings to prevent this. I love my Dell R620s, and enjoy their low power consumption (~85-90W at idle, no spinning disks), so this wasn’t an attractive option to me.
  • Ceph’s dashboard would just… fail. I never figured out why. MetalLB wasn’t reporting any issues, and the MGR daemon (which handles the dashboard) similarly didn’t report anything. It would just refuse to load sometimes, and then sometimes fix itself. I’m sure with more troubleshooting, I could have figured it out, but this was kind of the straw that broke the camel’s back.
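
On the hardware point: Rook exposes per-daemon resource requests and limits on the CephCluster, so you can bound the overhead - but multiply sensible per-daemon numbers by three mons, a mgr or two, and one OSD per disk, and it still adds up quickly. The numbers below are illustrative, not a recommendation:

```yaml
spec:
  resources:
    mon:
      requests:
        cpu: "1"
        memory: 1Gi
    mgr:
      requests:
        cpu: "1"
        memory: 1Gi
    osd:
      requests:
        cpu: "2"
        memory: 4Gi    # OSDs are the memory-hungry part (see osd_memory_target)
```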

So what did I wind up with? Longhorn! More on that in a future post.