K8s Misadventures, Pt. 1 - Distributed Systems are Easy

It’s finally happening - I’m splitting compute and storage up, and using k8s to do so. I bought two more Dell R620s, bringing my total to three - behold, quorum. This will be a multi-part series, focusing on different things I did wrong.

I want HA, or as close as I can get. I know I still have SPOFs with power (aside from my UPS), internet, and switching, but my real concern is the ability to drop a node and not notice. I often find myself wanting to tweak something at an inopportune time, i.e. during the day, and this disrupts the household internet. A capital crime, to be sure. In order to accomplish HA without having a truly ludicrous number of physical servers, I need three servers. Since I don’t need to dedicate any of them strictly as control planes, and since the R620 is dual-socket, I’ll install Proxmox as the hypervisor and run a control plane and worker on each node.
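
Why three? Both corosync and etcd (which backs the k8s control plane) need a strict majority of voters to stay up. A minimal sketch of that arithmetic follows; the function names are my own for illustration, not from any tool.

```python
def quorum_size(voters: int) -> int:
    """Smallest strict majority of `voters`."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority still remains."""
    return voters - quorum_size(voters)


for n in (1, 2, 3, 4, 5):
    print(f"{n} voters: quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
```

Two nodes tolerate no failures at all, and four tolerate no more than three do, which is why three is the smallest useful number.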

For drives, I decided on (heavily-used) Intel S3500s, 300 GB each. These are enterprise-level SSDs, which is important as they have Power Loss Protection - essentially, capacitors on the drive that keep it powered just long enough after power loss to ensure in-flight writes are completed. This matters for Ceph, something I intend[ed] to use. Specifically, it matters because Ceph doesn’t report a write as being complete until all replicas have done so, and it ensures this is the case by calling fsync after each write. Drives with PLP can immediately report back as being complete, safe in the knowledge that even with a power loss, they will complete the write. Drives without PLP must wait for the data to actually reach flash. I’m not sure how much of a performance hit this is in homelab use scenarios, but as it turns out, small used SSDs are quite cheap on eBay ($30-40 each), so why not? I bought nine, three per node. My intent was to use one for Proxmox and its VMs, one for a Ceph OSD, and one for a Ceph MDS.
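
To get a rough feel for what a sync after every write costs on a given drive, something like the following could be used. It is only an illustration of the write-then-fsync pattern, not how Ceph performs its I/O, and the block size and count are arbitrary assumptions. Note that the default temp directory may live on tmpfs; pass dir= to TemporaryDirectory to point it at the drive you actually care about.

```python
import os
import tempfile
import time


def time_writes(path: str, count: int, size: int, sync_each: bool) -> float:
    """Write `count` blocks of `size` bytes to `path`; return elapsed seconds."""
    block = b"\0" * size
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            os.write(fd, block)
            if sync_each:
                # Block until the OS says the device has accepted the data.
                os.fsync(fd)
    finally:
        os.close(fd)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "fsync-test.bin")
    buffered = time_writes(target, 200, 4096, sync_each=False)
    synced = time_writes(target, 200, 4096, sync_each=True)
    print(f"buffered: {buffered:.4f}s  fsync per write: {synced:.4f}s")
```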

For the Proxmox cluster, I bought a small PoE-powered UniFi switch, and hooked eth4 on each physical node up to it (eth1/2 are 10G SFP+, eth3/4 are 1 GbE). This was for corosync, which Proxmox claims needs its own network. The latency seems to be about the same between this dedicated switch and my main one, so who knows if it matters at this scale. Still, the switch was quite cheap, so I don’t particularly care. It has its own VLAN, and a rule in the router to drop all traffic to or from it that isn’t within its subnet.

I want[ed] distributed ephemeral storage. Why? I have no idea, in retrospect. I thought it was important and not an anti-pattern. I believe I was wrong. I settled on GlusterFS for this purpose. While I think Gluster is a fine project, and is admittedly easy to set up, it wound up being an unnecessary abstraction. One wrinkle was that I wanted Gluster to share the boot SSD with Proxmox, and the Proxmox installer simply takes your entire drive, without partitioning options. However, since it’s Debian underneath, they also allow you to add their repos and convert an existing Debian installation into Proxmox. So I did this, configuring a 32 GB partition for Proxmox, and the remaining ~268 GB as a separate partition for Gluster. This worked fine, and I imported the Gluster brick into Proxmox as storage. I then created VMs using this storage, two per node.

For the OS, I wanted to use Talos. Talos is an ephemeral and immutable OS that is designed solely to run k8s. It doesn’t even have shell access. It’s also a snap to set up - write some YAML, download their CLI tool, boot the VMs with the ISO attached, and apply the YAML via their tool. Boom, cluster. It works amazingly well, and if you’re doing bare metal installation, they have a side project called Sidero that will automatically bootstrap new nodes, making it even easier. I wasn’t, so I had to manually create the cluster. Still an easy process, and before too long I had three control planes and three amd64 workers, plus my RPi4 as an arm64 worker (because why not?).
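
For a sense of what "apply the YAML via their tool" looks like, here is a dry-run sketch that only prints the talosctl commands rather than running them. The cluster name, the IPs (from the 192.0.2.0/24 documentation range), and the helper function are hypothetical, and it assumes the default file names that talosctl gen config writes (controlplane.yaml, worker.yaml, talosconfig). Check the Talos docs for your version before trusting the flags.

```python
import shlex


def talos_bootstrap_plan(cluster, endpoint_ip, control_planes, workers):
    """Return the talosctl invocations (as argv lists) for a manual bootstrap."""
    # Generate controlplane.yaml, worker.yaml and talosconfig.
    cmds = [["talosctl", "gen", "config", cluster, f"https://{endpoint_ip}:6443"]]
    # Nodes booted from the ISO sit in maintenance mode, hence --insecure.
    for ip in control_planes:
        cmds.append(["talosctl", "apply-config", "--insecure",
                     "--nodes", ip, "--file", "controlplane.yaml"])
    for ip in workers:
        cmds.append(["talosctl", "apply-config", "--insecure",
                     "--nodes", ip, "--file", "worker.yaml"])
    # Bootstrap etcd on exactly one control plane, then fetch a kubeconfig.
    first = control_planes[0]
    for action in ("bootstrap", "kubeconfig"):
        cmds.append(["talosctl", "--talosconfig", "talosconfig", action,
                     "--nodes", first, "--endpoints", first])
    return cmds


plan = talos_bootstrap_plan(
    "homelab",  # hypothetical cluster name
    "192.0.2.10",
    control_planes=["192.0.2.10", "192.0.2.11", "192.0.2.12"],
    workers=["192.0.2.20", "192.0.2.21", "192.0.2.22"],
)
for argv in plan:
    print(shlex.join(argv))
```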

As it turns out, distributed file systems get cranky if you do things like remove a disk, unplug a network cable, etc. Strange things would then happen to Talos, namely it losing its GRUB entries. While I don’t intend to drop nodes often, the ability to do so is a key priority for this project. I’m not precisely sure of the failure mechanism, only that if the VMs were installed to local storage, doing the same chaos testing resulted in no unexpected ill effects to the cluster. So, remove Gluster, delete its partition, expand /dev/sda1 to take up the entire disk. This worked fine, and I was back to the beginning.

In the next part, I’ll discuss Rook / Ceph, which is a large enough topic to get its own post.