Another year, another rebuild.

The eternal dilemma of having a “home lab”.

  • “Why do I do this?”
  • “When am I done?”
  • “Who thought this was a good idea?”
  • “What caused me to do this?”
  • “Where do I even put this hardware?”
  • and more questions I won’t be answering.

Where did we come from, where did we go?

Following a similar format to last time, we’ll recap the decisions that were made in the last post.

(FAIL) Operating System: k3os

k3os was a great decision at the time I picked it out. Unfortunately, almost immediately after I switched to it, Rancher, the company that built k3os and k3s, was acquired by SUSE, and k3os stopped receiving support.

This was particularly painful as, you know, I had just switched to it… When I needed to upgrade k3os for something a few months later, the inability to get kernel patches and software support stung.

As I’m writing this post, the GitHub repository sits at 204 open issues and hasn’t seen a commit on its master branch since 5 days after my blog post.

(PASS) Storage System: Rook/Ceph

Rook was an excellent choice here, especially after adding additional machines to the cluster (which I’ll cover later): it allowed for great flexibility and ease of use thanks to the operator’s integration with Kubernetes. In my case, migrating from k3os to Talos (again, later) was a breeze, as I just kubectl apply-ed my way forward to the same cluster configuration, then used the “external” Ceph configuration to migrate the data between the clusters.
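For the curious, the “external” mode boils down to pointing a second Rook operator at an existing Ceph cluster instead of letting it manage its own daemons. A rough sketch of what that looks like (the names and namespace here are my assumptions for illustration, not verbatim from my configs):

```yaml
# Sketch of a CephCluster in "external" mode: Rook consumes an
# existing Ceph cluster rather than deploying mons/OSDs itself.
# Connection details (mon endpoints, keys) are imported separately
# as Secrets/ConfigMaps before this resource is applied.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph-external       # hypothetical name
  namespace: rook-ceph-external  # hypothetical namespace
spec:
  external:
    enable: true
  crashCollector:
    disable: true
  # No mon/storage sections: the external cluster owns the daemons.
```

With this in place, the new cluster can mount the old cluster’s pools while data gets shuffled across.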

There were a few other options around at the time I was migrating, but nothing really worked quite as well. So, if it ain’t broke, don’t fix it: I left it well alone aside from tuning some values to enable RWX (ReadWriteMany) volumes.
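In Rook’s world, RWX comes from CephFS rather than RBD: once a CephFilesystem resource and a CephFS-backed StorageClass exist, ReadWriteMany is just an access mode on the claim. A minimal sketch (the claim and StorageClass names are made up):

```yaml
# Sketch: a PVC requesting a ReadWriteMany volume from a
# CephFS-backed StorageClass, so multiple pods can mount it at once.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-media               # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany                # the whole point: many pods, one volume
  resources:
    requests:
      storage: 50Gi
  storageClassName: ceph-filesystem  # hypothetical CephFS StorageClass
```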

(PASS) Orchestration: Kubernetes

Kubernetes can be thought of in two different ways:

  • A Specification
  • A Distribution

As a specification, Kubernetes describes how certain components can be mixed together and what the expected, consistent result should be. However, there’s a bit of a mix-up, because there’s also the “distribution” side, which is the inner workings of how “Kubernetes” actually translates API calls into the desired state. Common distributions of Kubernetes include the official implementation, generally referred to as “Kubernetes” or the “vanilla” distribution, but there’s also Rancher/SUSE’s “k3s”, VMware’s “Tanzu”, and Google’s “Google Kubernetes Engine” (aka GKE).

In the previous blog post, I glossed over this distinction. But each distribution of Kubernetes has its own opinions on how to approach certain components of the ecosystem: GKE builds in native support for Google’s load balancers, k3s swaps etcd storage for SQLite, and VMware… makes you pay more for it? (Full disclosure: I’ve never actually used Tanzu.)

In the previous case, k3s (as bundled with k3os) was fine, especially when getting started with running it on metal with the inclusion of sane defaults for CNI, Load Balancers, etc.

What’s next?

Well, in an effort to avoid having to rebuild things again, some more business-minded thinking has put different values forward for picking replacements for tools like k3os, notably: “Is the tool a core product of the business building it?”

Asking this specific question means I should have a smaller chance of having to rebuild my cluster every year, as well as generally better documentation and community support.

Skipping all the dramatic fanfare of picking one, I settled on Talos Linux: another Kubernetes operating system with the same focus on security and statelessness that k3os had. The major differentiator is that it runs “vanilla” Kubernetes, not k3s. This means that I need to pick out the pieces I want in order to actually use the cluster.

The “Stack”

So, what does my Kubernetes cluster look like?

  • MetalLB for load balancing
  • Pomerium for authenticated access
  • Rook/Ceph for storage
  • Prometheus for monitoring

There’s a lot of small reasons why each component was chosen, from familiarity to familiarity… Well, mostly just the one. From my time working as a Technical Solutions Engineer (TSE) at Google, I had experience with tons of different customer environments using Kubernetes, and more specifically GKE. Drawing on that experience for myself means that I’d already “hit” some of the pitfalls of other technologies and understood where their strengths and weaknesses were.

MetalLB is fantastic for small “bare-metal” clusters, where direct access and full Layer 2 compliance is almost guaranteed. It has two modes of operation: Layer 2, where it intercepts and responds to ARP requests, and BGP, where it connects to upstream routers and publishes routes using, gasp, BGP.
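In Layer 2 mode, the configuration boils down to handing MetalLB a pool of spare addresses on the local subnet and telling it to advertise them via ARP. A sketch using MetalLB’s CRD-based config (pool name and address range are examples, not my actual network):

```yaml
# Sketch: MetalLB Layer 2 setup. Services of type LoadBalancer get an
# IP from this pool, and MetalLB answers ARP for it on the LAN.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: homelab-pool             # hypothetical name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250  # example range; adjust to your subnet
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: homelab-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - homelab-pool
```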

Pomerium is a bit more of a stylistic choice: I enjoy having complete external access to my cluster, being able to reach the API and other less-secure applications, but with the assurance that there’s an authentication layer before any request even reaches the backend.
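The shape of a Pomerium route is roughly “external hostname in, identity check in the middle, plain HTTP backend out”. A sketch of a single route (hostnames, service address, and the allowed email are all made up for illustration; treat the exact policy syntax as an assumption against your Pomerium version):

```yaml
# Sketch of one Pomerium route: requests to the public hostname are
# authenticated against the identity provider before being proxied
# to an otherwise-unprotected in-cluster service.
routes:
  - from: https://app.example.com
    to: http://app.default.svc.cluster.local:8080
    policy:
      - allow:
          or:
            - email:
                is: me@example.com  # hypothetical allowed user
```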

Rook (and Ceph) are holdovers from the previous iteration of the cluster, mostly because switching to another distributed storage system would increase the risk of data loss (again), as well as mean learning the intricacies of an increasingly annoying part of the cluster. I’ve got storage, and I just want containers to use it. Rook does a pretty bang-up job, so no major complaints.

Prometheus is the gold standard. Much better people have written much better articles about much better configurations than mine.

As for the applications running on it, it’s mostly personal tidbits with various monitoring and backup software. There might be more, who knows?


There’s the update, for now anyways. Got any cool suggestions? Shoot me a DM over Mastodon/Fediverse, @[email protected].