I may have mentioned this before, but I have two nearly identical Supermicro 2Us. One is the main server that hosts VMs under Proxmox to do various tasks - a NAS, a Docker host, jumpbox, etc. The other came about because I discovered that its motherboard hardware revision level was too old to support Xeon v2 (Ivy Bridge), even with an updated BIOS. I wanted the lower thermals that v2s offered, so I bought a newer motherboard, and then got another 2U case to throw the old motherboard and CPU into.
The intent was that it would wake up once a day, ingest ZFS snapshots from my NAS, and shut off. Also, in case of hardware failure, it was a warm-ish spare that could handle everything if needed. This intent was foiled, first by laziness, then by lack of space (my main zpool quickly eclipsed the size that a 3-wide RAIDZ1 of 14 TB drives could handle), then by a strange hardware quirk - failure to boot.
I use NVMe drives on PCIe adapters, in order to free up the drive bays for data. The speed boost doesn’t hurt, either. This requires a modded BIOS to load the NVMe module, but beyond that, it’s been a breeze. The second server, for reasons which are utterly unclear, routinely doesn’t find the NVMe drive at boot. It then goes through its list of boot options, ending on network boot, and hangs, having found nothing. This makes WOL difficult, since a magic packet can only tell the machine to power on; it can’t reboot a machine that’s already on and hung (that I’m aware of). For reasons that are even more unclear, it generally finds the drive after 3-4 reboots. I’d blame the adapter or drive, but it seems to be perfectly steady once running, with nary a complaint logged.
The answer is Python. Well, and some Bash. Also SOPS. And cron.
The general flow of it is thus: first, ascertain whether SOPS has loaded the secrets into the shell’s environment. If not, there’s no point in continuing, so bail out. If it has, check whether the first arg was “start” or “stop.” If the former, enter a while loop that calls a function which returns True if and only if the NAS VM (which auto-boots once the hypervisor is up) accepts a connection on 22/TCP. That function issues IPMI Power On, waits 180 seconds, then attempts to open the socket. If it succeeds, it closes the socket and returns True, the while loop ends, and the script exits. If it fails, it returns False, so the while loop issues an IPMI Reset and calls the function again. Once the script has exited, the parent Bash script calls pyznap to send snapshots, and then calls the Python script again with “stop” as the first arg. The Python script again checks for secrets in its environment, then issues an ACPI Shutdown command over IPMI, at which point everything is done. All of this is kicked off by a cronjob that runs daily.
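For the curious, that flow can be sketched roughly like this. It is an illustration, not the actual script: the environment variable names `IPMI_HOST`, `IPMI_USER`, and `NAS_HOST` are made up, the `max_attempts` cap is an addition so the loop can't spin forever, and shelling out to `ipmitool` is an assumption about how the IPMI commands get sent. (`IPMI_PASSWORD` is the variable that `ipmitool -E` reads the password from.) For brevity, the sketch folds the function and its while loop into a single `wake()`.

```python
import os
import socket
import subprocess
import sys
import time
from typing import Callable, Sequence

# Hypothetical names for the secrets SOPS is expected to have exported.
REQUIRED_ENV = ("IPMI_HOST", "IPMI_USER", "IPMI_PASSWORD", "NAS_HOST")


def ipmi_power(command: str) -> None:
    """Send a chassis power command ("on", "reset", "soft") via ipmitool."""
    subprocess.run(
        ["ipmitool", "-I", "lanplus",
         "-H", os.environ["IPMI_HOST"],
         "-U", os.environ["IPMI_USER"],
         "-E",  # take the password from $IPMI_PASSWORD
         "chassis", "power", command],
        check=True,
    )


def port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wake(power: Callable[[str], None],
         is_ready: Callable[[], bool],
         wait: float = 180,
         sleep: Callable[[float], None] = time.sleep,
         max_attempts: int = 10) -> bool:
    """Power on, wait, probe; reset and try again until ready or out of attempts."""
    for attempt in range(max_attempts):
        power("on" if attempt == 0 else "reset")
        sleep(wait)
        if is_ready():
            return True
    return False


def main(argv: Sequence[str]) -> int:
    missing = [name for name in REQUIRED_ENV if name not in os.environ]
    if missing:
        print(f"missing secrets: {', '.join(missing)}", file=sys.stderr)
        return 1
    action = argv[1] if len(argv) > 1 else ""
    if action == "start":
        ready = wake(ipmi_power, lambda: port_open(os.environ["NAS_HOST"]))
        return 0 if ready else 1
    if action == "stop":
        ipmi_power("soft")  # ACPI soft shutdown
        return 0
    print("usage: wake.py start|stop", file=sys.stderr)
    return 2
```

Wiring it up is a matter of `sys.exit(main(sys.argv))` at the bottom of the file. Passing `power`, `is_ready`, and `sleep` in as arguments is purely so the retry logic can be exercised without a real BMC or a three-minute wait.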
You may notice a manual WOL magic packet (fine, it’s just a frame…) creation in the script. For reasons which are unclear to me (see the statement below about not currently caring), I got PermissionErrors. Duh, you might say, you’re trying to bind to a privileged port as an unprivileged user. Yes, but the same thing happened when executing the script as root, and when the port was set to an unprivileged one. Manually sending a packet with netcat worked fine. I spent a few minutes looking into it, but since I knew I already had to use IPMI to control the reset and shutdown, I left the code in for historical reasons and moved on. In general, if you find yourself directly interacting with sockets in Python, you’re either creating a client/server chat system for a class where the professor took pity on everyone and didn’t insist on C, or you don’t know that Twisted exists and you hate your life.
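For reference, a magic packet is just six `0xFF` bytes followed by the target MAC address repeated sixteen times, usually sent as a UDP broadcast. Here is a minimal sketch of building and sending one; the function names, the default broadcast address, and port 9 are illustrative choices, not taken from the script. One thing worth knowing: on Linux, sending to a broadcast address from a socket without `SO_BROADCAST` set fails with `EACCES`, which Python surfaces as a `PermissionError`, root or not. Whether that was the culprit here, I can't say.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a WOL payload: 6 x 0xFF, then the 6-byte MAC repeated 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    return b"\xff" * 6 + raw * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Without this option, sendto() to a broadcast address is refused.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

Calling `send_wol("00:11:22:33:44:55")` (a placeholder MAC) is all it takes; no bind and no privileged port are needed on the sending side.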
All in all, I’m pretty pleased with this horrible band-aid. I don’t know what’s behind the Mystery of the Sometimes Missing Drive, nor do I (currently) care to troubleshoot it, and writing this scratched a few itches. First, how do you wake up a computer remotely? There are multiple ways: WOL and IPMI, for two. How do you ensure you woke it up? How do you know that it’s ready to do useful work? This is analogous to liveness and readiness probes in Kubernetes, but only at startup, and k8s has a probe for that as well (the startup probe). Finally, the annoying detail of encrypted secrets was pleasantly easy to solve. While one day I’d like to get Vault set up, SOPS is dead-easy to use.
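On the "ready to do useful work" question: a port accepting connections is a decent proxy, but a slightly stricter check is to wait for the SSH server to send its identification string, which begins with `SSH-`. The sketch below does that. It is not what the script above does, and the name `ssh_ready` is made up for illustration.

```python
import socket


def ssh_ready(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if the peer answers with an SSH identification string."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # sshd speaks first; a healthy server sends e.g. b"SSH-2.0-...".
            return sock.recv(256).startswith(b"SSH-")
    except OSError:
        # Refused, unreachable, or timed out: not ready yet.
        return False
```

This distinguishes "something is listening on 22" from "sshd is actually talking," which matters if anything in front of the VM accepts connections before the service behind it is up.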