Sometimes, Things Just Don’t Work Out As Planned.

Unfortunately, it’s quite annoying when they don’t.

Back in 2020, I made a blogpost about setting up my great new home lab, full of grandeur and excitement about a new build with more storage. I even said “It’s a fun adventure that’s just getting started!”. Oh, young Q, you had no idea what you were in for, but present Q does, so let’s talk about it.


Bad Decisions Lead to Bad Outcomes, Informed Bad Decisions Lead to Regret

Let’s do a quick recap of the decisions that were made in the initial setup:

Operating System: Debian 10

This decision wasn’t as bad as it could’ve been; if I had gone with something like CentOS, it would’ve been a really bad time. The biggest problem with Debian was also its greatest strength: the stability of the platform. Within the first few months, there were a few things that I wanted to run but was unable to, due to the older kernel included with Debian 10. Eventually, this did get upgraded to Debian 11 and the kinks were worked out, and all was happy again in kernel space…

Orchestration System: Docker Swarm

This was another decision that, ultimately, I was relatively happy with. Organizing things was easy since you could use Docker Compose files to set things up, and overall maintenance was fine, if a bit tedious at times. Setting up new services was a breeze, and honestly, there wasn’t too much to complain about other than the networking features being lacking on the provided defaults.
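To give a sense of what that workflow looked like, here’s a minimal, hypothetical Compose file of the kind I’d hand to Swarm; the service, image tag, and volume names are placeholders for illustration, not my actual stack.

```yaml
# docker-compose.yml — hypothetical Swarm stack definition (placeholder values)
version: "3.8"

services:
  plex:
    image: plexinc/pms-docker:latest   # placeholder tag
    ports:
      - "32400:32400"
    volumes:
      - media:/data                    # named volume holding the media library
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

volumes:
  media:
```

Deploying or updating a stack like that is a single docker stack deploy -c docker-compose.yml media, which is most of why the day-to-day upkeep stayed tolerable.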

Storage System: Ceph

Can I cry? Is that OK? Ceph was, admittedly, probably the best of the worst options for my setup. The trouble stemmed from the decision to run it across two machines instead of the three or more that are recommended, which led to very “sketchy” situations whenever there was a drive failure. The nail in the coffin was that after I moved to Denver (in December), I had enough drives fail that I lost everything.

Shortly after rebuilding everything, I started getting read locks on Plex lasting anywhere from seconds to tens of seconds, preventing it from operating correctly and causing multiple instances of database corruption. I probably spent a work week (~40 hours) of personal time trying to get things stable again, attempting numerous fixes, but unfortunately, it just wasn’t meant to be.

Ironically, I’m still using Ceph; however, it’s now being managed by Rook, and only on a single machine. More on this later.


So, What Now?

Excellent question, my scholarly friend! It’s a home lab, so experimentation! Adventure! Risk! Failure?!?

After getting the read lockups repeatedly, I shut everything down, screamed to myself, and went on vacation to Seattle to get some drinks with my friends. That gave me some time to sit down and plan out my next moves. I really enjoyed having my own setup for media, backups, and the like, so I wasn’t going to throw in the towel; I just needed something more manageable. Something that wouldn’t eat into my free time as much, so I could play games with my fiancé.

The Shortlist

I chatted with some friends to get anecdotal references and experiences, and perused /r/homelab, to build out my total list of options and narrow down the ones I wanted to invest time into testing. Speaking of timelines: this testing was done in mid/late January, so things may have changed since then; make your own shortlist!

In no particular order:

  • TrueNAS SCALE
  • Proxmox
  • k3os (and k3s)

My rating criteria were as follows:

  • Setup Experience
  • Scalability
  • Maintenance Overhead
  • User Experience
  • Compatibility

TrueNAS SCALE

I’m always a sucker for trying new things, and TrueNAS SCALE is definitely new. It’s a product, currently in development, from the same team who built TrueNAS CORE, which is generally regarded as one of the best choices for any custom-built home NAS.

SCALE is not CORE, and I can’t stress that enough. It’s an almost-complete ground-up rewrite of TrueNAS that adds several new and advanced features: clustered storage via GlusterFS, scalable containers via Kubernetes, and overall a much more powerful product. As most know, though, more power means more complexity to manage, and for SCALE that complexity is why it was very quickly removed from the shortlist.

It’s not complete. Things like building out a GlusterFS cluster were noticeably missing from their documentation, and doing it without jumping through hoops required using their TrueCommand product, which has a circular dependency: it needs a TrueNAS system that’s already set up.

I think the thing that really drove me away was the inability to run Kubernetes on more than one node when I was doing my sample testing. Even setting things up was non-trivial, as I had to configure everything over the CLI while everything else lived in the provided WebUI. It just wasn’t the easy experience I went in expecting.

Maybe SCALE will become (or already has become) more than what I was hoping for, and I really hope it does, because the project is so close to what I want for my cluster.

Proxmox

Honestly, I tried Proxmox “for the meme”. I had used it years ago and, truthfully, not much has changed since then, except for the major addition of built-in Ceph management.

Sweet! I can run Ceph inside Proxmox, and it’ll handle the management and configuration options for me. Again, expectation led to disappointment. I got the cluster up on two machines, and then I had a handful of OSDs that wouldn’t come back after a reboot.

I had Proxmox running for about a day before I quit.

k3os (and k3s)

For quite a few people, k3os is a relatively unknown ‘thing’. It’s built around k3s, a distribution of Kubernetes that strips out a lot of the complex pieces most people don’t use and packs the rest into a single binary. k3s is actually phenomenal for running local dev environments, since you always have a cluster nearby for testing deployments of applications.

k3os, meanwhile, is a full-blown operating system built from the same ideal: it strips out everything you don’t need to run k3s, including things like a package manager and most local control of the machine. Other than the networking config, almost everything about the machines can be controlled via an operator pre-installed into the k3s cluster.
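For flavor, here’s roughly what a k3os machine config looks like. This is a hedged sketch based on my memory of the config.yaml format in the k3os README; the hostname, SSH key, token, and label are all placeholders.

```yaml
# config.yaml — hypothetical k3os machine config (placeholder values)
hostname: lab-node-01

ssh_authorized_keys:
  - "ssh-ed25519 AAAA... user@example"   # placeholder public key

k3os:
  token: "REPLACE_ME"                    # cluster token other nodes would join with
  labels:
    storage.example.com/rook: "true"     # hypothetical node label
  k3s_args:
    - server
    - "--disable=servicelb"              # example of passing flags straight to k3s
```

That one file (plus the networking config) is essentially the whole machine, which is about the amount of state I want to be responsible for.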

So, this was the most promising candidate thus far, but it was missing the converged storage solution that both Proxmox and TrueNAS SCALE brought to the table. Everything is plug-and-play on k3s, so I ended up choosing Rook for storage, and I only wiped my root disk once.
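As a rough illustration, below is a trimmed-down, hypothetical CephCluster resource of the kind Rook consumes for a single-node setup. The image tag, host path, and device filter are placeholders, and a real deployment also needs the Rook operator and CRDs installed first.

```yaml
# cluster.yaml — hypothetical single-node Rook CephCluster (placeholder values)
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16    # placeholder Ceph release
  dataDirHostPath: /var/lib/rook
  mon:
    count: 1                        # a lone monitor only makes sense with one node
    allowMultiplePerNode: true
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "^sd[b-d]"        # placeholder: only claim these disks
```

Pool definitions on top of this also need their replication or failure domain relaxed, since there’s only one node to spread data across; it is very much a “please have backups” configuration.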

There were some limitations that I had to work around, most notably the lack of simple support for VLANs for the management traffic, but that was solved by adjusting a few routing policies.


Nobody Is Allowed To Be Happy, Including Myself.

I ended up choosing k3os. Surprise, surprise. Mostly because of the flexibility of the platform and the fact that updates are a single kubectl apply away. I ended up configuring most of my media, networking, monitoring, and other stacks again, but I never actually added another machine to the cluster. I’m pretty sure that when I do, it’s going to break things hard, and I’m just not up for that at the moment.
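For the curious, the “single kubectl apply” refers to upgrade plans handled by the system-upgrade-controller that k3os bundles. The manifest below is a hedged sketch of one; the namespace, service account, and node label are reconstructed from memory and may not match the real k3os defaults, so check the project docs before copying it.

```yaml
# k3os-upgrade-plan.yaml — sketch of an automated upgrade Plan (labeled details are assumptions)
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3os-latest
  namespace: k3os-system             # assumption: namespace used by the bundled controller
spec:
  concurrency: 1                     # upgrade one node at a time
  channel: https://github.com/rancher/k3os/releases/latest
  serviceAccountName: k3os-upgrade   # assumption: bundled service account name
  nodeSelector:
    matchExpressions:
      - key: k3os.io/upgrade         # assumption: opt-in label on nodes
        operator: Exists
  upgrade:
    image: rancher/k3os              # image that performs the upgrade on each node
```

Once something like that is applied, new releases roll out on their own, which is exactly the low-touch maintenance I was after.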

So, currently, I’m running a single-node k3os cluster, where I’m pretty sure I’ve got a shaky house of cards waiting for me if I breathe on it wrong…

In other news, I’ve gotten really into networking recently, and I’m already working on the blogpost!