Bad Ideas Are Born out of Necessity

So recently, I got a new Dell R720xd (from TechMikeNY) to complement the Dell C2100 that I’ve had for a few years. The new server was promptly christened “Irminsul” (following the C2100 as “Yggdrasil”, a “World Tree”), and I began a full disk test to make sure everything was spotless. Unfortunately, one disk ended at 99.97% with bad sectors, so a quick email to Mike (seriously, shout out to the team over there) and we were off to the races.

Side Note:

The servers we’ll be working with are both running in the same rack, with the same OS (Debian 10). Doing these things is not recommended and you may brick something.

Small Aside: Dell’s RAID Card Sucks

Irminsul (the R720xd) came with a Mini-PERC card, with Dell’s firmware on it. For some dumbass reason, they decided that no, you’re not smart enough to handle all of the drives in software. This limits us on two things: performance and driver selection. I won’t go into it all here, but Fohdeesha has a great writeup on their website about this, and about how to cross-flash the card. 10000x thanks to them.

How does one distribute data across several (read: two) nodes?

Short Answer

If you have fewer than 3, don’t. If you do, do so wisely.

Long Answer

The question then turns to “What do I use to distribute data?”, and the answer is a bit tricky, because it depends on your use case.

Your options (including, but not limited to, and in no particular order):

And I could go into the specifics on each one but there are plenty of places that have, and this doesn’t need to be another one.

Personally, I went with Ceph. Not because of a technical reason, but because a good amount of my friends and co-workers are using it as well. So if I’ve got a question, I can just ask them. There’s some nice things like Prometheus reporting, and some other monitoring stuff, but that wasn’t the decision maker.

Alright, so we’ve got a new server, and a distributed storage solution, how do we do things? Not easily. Ceph wasn’t built for being run on one node, much less two. But here we are, and I know enough to be dangerous.

A small note

I haven’t actually got my second Ceph node online yet, so all of the following information is confirmed to work on just one node. I’ve got to replace a NIC and a RAID card in Yggdrasil before I can add it in, or otherwise risk data loss. I have, however, added Swarm (mentioned later) on it to give it some extra oomph.

Installing Ceph

For the Ceph installation, we’re going to be using the new cephadm tool.

Pre requisites:

Following along with the “Deploying a new Ceph cluster” documentation was almost perfect. The only weird thing was that the key that apt uses was invalid, but that was avoided by using the following method:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add - 

That will download the key again, though you’ll have to remove the .asc from /etc/apt/trusted.gpg.d. Otherwise, just make sure you apt update, and everything else there should run smoothly.

Creating your Ceph Pools

Creating a default pool is easy. Just ceph osd pool create cephfs_data and ceph osd pool create cephfs_meta and we’ve got the two pools we need for CephFS. Unfortunately, I’m a special snowflake, and I don’t like Ceph’s default of replicated pools, so I’ll instead opt for an erasure coded pool for the data pool. Think of the difference between them as “replicated pools” being RAID1, and “erasure coded” pools being RAID5.
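To put rough numbers on the RAID analogy, here is a small sketch of the capacity math. It assumes Ceph’s default of three copies for replicated pools and the k=10, m=1 erasure profile used in the steps below; usable_pct is a made-up helper name, not a Ceph command.

```shell
# usable_pct K M
# Print the percentage of raw capacity that ends up holding real data
# when every object is stored as K data shards plus M redundancy shards.
usable_pct() {
  awk -v k="$1" -v m="$2" 'BEGIN { printf "%.1f\n", 100 * k / (k + m) }'
}

usable_pct 1 2    # 3-way replication: 1 real copy + 2 extra copies, prints 33.3
usable_pct 10 1   # erasure coded, k=10 m=1, prints 90.9
```

The trade-off is the usual one: the erasure coded pool wastes far less space, but with m=1 it only survives the loss of a single shard.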

Adding another layer of complexity, I want to shard the data a bit more effectively, as I’ll be hosting large files on this cluster, so modifying the erasure profile is going to be happening as well.

  1. Create a new erasure profile, with the parameters wanted
    • I’m using 10 data shards with 1 coding shard, so k=10, m=1
    • Additionally, the “failure domain” will be on the OSD (Read: Disk) level.
    • ceph osd erasure-code-profile set data-pool-profile \
      k=10 \
      m=1 \
      crush-failure-domain=osd
      
  2. Create a new data pool with the new erasure profile
    • ceph osd pool create cephfs_data erasure data-pool-profile
      
  3. Enable Erasure-code overwrites
    • A requirement for CephFS
    • ceph osd pool set cephfs_data allow_ec_overwrites true
      
  4. Create a new metadata pool
    • The CephFS metadata pool cannot be a erasure pool.
    • ceph osd pool create cephfs_meta
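
One thing worth checking before relying on this layout: with crush-failure-domain=osd, each of the k+m shards has to land on a different OSD, so a k=10, m=1 profile needs at least 11 OSDs before the pool can go healthy. A minimal sketch of that check follows; enough_osds is a hypothetical helper, and the OSD count would come from ceph osd ls on a running cluster.

```shell
# enough_osds K M OSD_COUNT
# Succeeds when there is at least one OSD per shard (K data + M coding).
enough_osds() {
  [ "$3" -ge $(( $1 + $2 )) ]
}

# Example, to be run on a node with the ceph CLI configured:
#   enough_osds 10 1 "$(ceph osd ls | wc -l)" && echo "enough OSDs for k=10, m=1"
```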
      

Actually make your Metadata pool usable

Great, so we’re all set up to get going with CephFS, but unfortunately the cluster won’t allow I/O because the cephfs_meta pool isn’t healthy. Why? Because it’s a replicated pool, and the default CRUSH map says that the failure domain for the pool is the host. Our issue is that we don’t have another node available yet. We could figure out how to add one, but my other server has data I want to get into Ceph first.

What’s the solution? Not an easy one. We have to modify the default CRUSH map, which requires some specialized tooling. CRUSH maps are actually compiled into a different format that Ceph can use directly. Is it recommended to do this? Nope, but we’re running Ceph on two nodes, so caution be damned.

Step 0: Make sure you have crushtool installed; if you don’t, run cephadm install ceph.

  1. Retrieve the current (compiled) CRUSH Map

    • ceph osd getcrushmap -o compiled-crush.map
      
  2. Decompile the compiled-crush.map

    • crushtool -d compiled-crush.map -o crush.map
      
  3. Modify crush.map to change the failure domain

    • Find the line for step chooseleaf firstn 0 type host
    • Change it to the required domain, in our case osd
  4. Compile the crush.map

    • crushtool -c crush.map -o new-compiled-crush.map
      
  5. Configure Ceph with new-compiled-crush.map

    • ceph osd setcrushmap -i new-compiled-crush.map
      

After #5, you should see your cephfs_meta pool go healthy, and you’re ready to create a new CephFS mount.
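
If you would rather not hand-edit the decompiled map, step 3 can be scripted. This is only a sketch: to_osd_domain is a hypothetical helper, and it assumes the rule line appears exactly as quoted above (give or take indentation). It rewrites every matching rule, which is what we want on a single node, but it is worth checking with a diff before compiling.

```shell
# to_osd_domain
# Read a decompiled CRUSH map on stdin and write it to stdout with the
# chooseleaf failure domain changed from host to osd. Other lines pass through.
to_osd_domain() {
  sed 's/^\([[:space:]]*step chooseleaf firstn 0 type\) host$/\1 osd/'
}

# Example usage between the decompile and compile steps:
#   to_osd_domain < crush.map > crush-osd.map
#   diff crush.map crush-osd.map
#   crushtool -c crush-osd.map -o new-compiled-crush.map
```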

Create a new CephFS

Alrighty, so now we’ve got a cluster, two healthy pools, and nothing to show for it. Let’s get on with CephFS.

I’m not going to cover this in detail, as the documentation about setting up a CephFS is actually pretty decent. It’s mostly ceph fs new <name> cephfs_meta cephfs_data and then mounting it on the actual system where you want it using mount -t ceph :/ <mnt directory> -o fs=<name>,name=<CephX User>. In my case, I’ve actually got several FSs (which is experimental, of course), for Docker volumes and raw media storage. The usage of these will be covered in the next part.
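
To make the mount template a bit more concrete, here is a sketch that fills it in and prints the resulting commands rather than running them. The filesystem names docker and media and the CephX user admin are placeholders I picked for illustration; the mount points are the ones the compose files later in this post bind into. print_mount_cmd is a made-up helper.

```shell
# print_mount_cmd FS_NAME MOUNT_DIR CEPHX_USER
# Print (do not run) the mount command for one CephFS filesystem.
print_mount_cmd() {
  printf 'mount -t ceph :/ %s -o fs=%s,name=%s\n' "$2" "$1" "$3"
}

print_mount_cmd docker /mnt/cephfs_docker admin
print_mount_cmd media /mnt/cephfs_media admin
```

Printing first makes it easy to eyeball the options before piping the line to a root shell.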

Spooky Scary Docker Swarm

A while ago, Docker Swarm was the laughing stock of the part of the tech community I was in, mostly because it has a similar level of complexity to Kubernetes. This kept me from trying it for quite some time, but I’ve been hearing different things through the grapevine, so I decided to give it another try. Considering my current infrastructure, it’s not a terrible thing to get started with.

As a note, yes, I know that it’s now called “Swarm mode”, but I’m going to keep calling it Swarm. Don’t @ me.

Creating a Swarm

Surprisingly, Docker Swarm is pretty simple to get started with once you have Docker installed (which is required for Ceph anyways):

docker swarm init --advertise-addr <MANAGER-IP>

Ta-dah! You’ve configured your first instance in your swarm. The STDOUT of the command will give you another command, docker swarm join --token <token> <MANAGER-IP>:2377, to run on any other nodes, as well as some instructions on how to set up another manager node, if that’s what floats your boat.
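
If you lose that output, the join commands can be printed again from a manager. A minimal sketch using the join-token subcommand; show_join_commands is just a made-up wrapper name.

```shell
# show_join_commands
# Print the full "docker swarm join ..." line for each role.
# Must be run on a node that is already a swarm manager.
show_join_commands() {
  docker swarm join-token worker
  docker swarm join-token manager
}

# Example:
#   show_join_commands
#   docker node ls    # afterwards, confirm that the new node shows up
```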

Managing the Swarm

Swarm breaks things down into a few different components:

  • Tasks
    • a definition of setting up a container
  • Services
    • a collection of Tasks

In my case, I was setting up Plex, and the supporting services (read: tasks) around it. That’s easy to set up using a docker-compose.yml file. I’m planning on getting into detail on my actual setup in another blog post, but for now:

version: "3"
services:
######
## Plex
######
  plex:
    image: plexinc/pms-docker:plexpass
    restart: unless-stopped
    environment:
      - TZ=${TZ}
      - PLEX_UID=${PUID}
      - PLEX_GID=${PGID}
    volumes:
      - /mnt/cephfs_docker/config/plex:/config
      - /mnt/cephfs_media/:/data
    networks:
      media-internal:
        aliases:
          - plex
    ports:
      - "32400:32400/tcp"
      - "3005:3005/tcp"
      - "8324:8324/tcp"
      - "32469:32469/tcp"
      - "1900:1900/udp"
      - "32410:32410/udp"
      - "32412:32412/udp"
      - "32413:32413/udp"
      - "32414:32414/udp"
networks:
  default:
    external:
      name: media-internal
  media-internal:
    external: true

This is the compose file that we’re going to be working with. So how do we deploy things onto the swarm?

Shipping things with Portainer

Portainer is a UI for managing Docker instances, and even Kubernetes. Personally, I like it as a way to take a peek at services without having to break open another shell. It even works on mobile! There is some installation documentation on their website; however, there are a few tweaks that I made to their compose file.

version: '3.2'

services:
  agent:
    image: portainer/agent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - agent_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

  portainer:
    image: portainer/portainer-ce
    command: -H tcp://tasks.agent:9001 --tlsskipverify
    ports:
      - "9000:9000"
      - "8000:8000"
    volumes:
      - /mnt/cephfs_docker:/data
    networks:
      - agent_network
      - internal
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

networks:
  agent_network:
    driver: overlay
    attachable: true
  internal:
    external: true

Some things to note with this, compared to the one found in the documentation:

  • Removed the Volume mount, and replaced with a Bind to a CephFS.
    • As my swarm has two nodes in it, I’ve actually set my second node as a manager as well.
    • If the task were to get scheduled on another node, the default volume mount is only local to that machine. This means that it would start from scratch and want me to configure a user. Not quite what I wanted there.
  • Added the internal network, which handles communication between my reverse proxy and Portainer.
    • This will be revisited in the next edition of this post.

To deploy Portainer, just save the above YAML file as portainer.yml and then run docker stack deploy --compose-file=portainer.yml portainer (the last argument is the stack name; pick whatever you like). This will get the agent out into the swarm and set up the actual maintenance instance as well. Once it’s up, you’ll be able to access and start configuring things through port 9000.
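
One gotcha: both compose files in this post mark a network as external (internal for Portainer, media-internal for Plex), and an external network has to exist before the stack is deployed. Here is a sketch of creating them, assuming they are meant to be attachable overlay networks; create_overlay is a hypothetical helper and the stack name portainer is my own choice.

```shell
# create_overlay NAME
# Create an attachable, swarm-wide overlay network that stacks can share.
create_overlay() {
  docker network create --driver overlay --attachable "$1"
}

# Example, run once on a manager before deploying:
#   create_overlay internal
#   create_overlay media-internal
#   docker stack deploy --compose-file=portainer.yml portainer
#   docker stack services portainer    # check that replicas come up
```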

The End (for now)

And that’s my cluster thus far. This blog post was mostly for me to keep track of everything that I’ve done, so I don’t have to go find it from all the different sources again. It’s a fun adventure that’s just getting started!

Stay tuned for the next installment, where I’ll be setting up all of the services and doing some security-first changes.

Q.