Distribute ALL THE THINGS!
Bad Ideas Are Born out of Necessity⌗
So recently, I got a new Dell R720xd (from TechMikeNY) to complement my Dell C2100 that I’ve had for a few years. The new server was promptly christened “Irminsul” (following the C2100 as “Yggdrasil”, a “World Tree”), and I began a full disk test to make sure everything was spotless. Unfortunately, one disk ended the test at 99.97% with bad sectors, so a quick email to Mike (seriously, shout out to the team over there) and we were off to the races.
Side Note:⌗
The servers we’ll be working with are both running in the same rack, with the same OS (Debian 10). Doing these things is not recommended, and you may brick something.
Small Aside: Dell’s RAID Card Sucks⌗
Irminsul (the R720xd) came with a Mini-PERC card with Dell’s firmware on it. For some dumbass reason, they decided that no, you’re not smart enough to handle all of the drives in software. This limits us in two ways: performance and driver selection. I won’t go into it all here, but Fohdeesha has a great writeup about this, and how to cross-flash the card, on their website. 10000x thanks to them.
How does one distribute data across several (read: two) nodes?⌗
Short Answer⌗
If you have fewer than three nodes, don’t; if you do anyway, do so wisely.
Long Answer⌗
The question then turns to “What do I use to distribute data?”, and the answer is a bit tricky: it depends on your use case.
Your options (including, but not limited to, and in no particular order):
And I could go into the specifics on each one, but there are plenty of places that have, and this doesn’t need to be another one.
Personally, I went with Ceph. Not because of a technical reason, but because a good number of my friends and co-workers are using it as well, so if I’ve got a question, I can just ask them. There are some nice things like Prometheus reporting and other monitoring stuff, but that wasn’t the deciding factor.
Alright, so we’ve got a new server and a distributed storage solution; how do we do things? Not easily. Ceph wasn’t built to be run on one node, much less two. But here we are, and I know enough to be dangerous.
A small note⌗
I haven’t actually got my second Ceph node online yet, so all of the following information is confirmed to work on just one node. I’ve got to replace a NIC and a RAID card in Yggdrasil before I can add it in, or otherwise risk data loss. I have, however, added Swarm (mentioned later) to it to give things some extra oomph.
Installing Ceph⌗
For the Ceph installation, we’re going to be using the new cephadm tool.
Prerequisites:
- Docker
- NTPd
- LVM (lvm2)
Following along with the “Deploying a new Ceph cluster” documentation went almost perfectly.
The only weird thing was that the key apt uses was invalid, but that was avoided by using the following method:
wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
That will download the key again, though you’ll have to remove the .asc from /etc/apt/trusted.gpg.d. Otherwise, just make sure you apt update, and everything else there should run smoothly.
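For reference, the happy path from that documentation boils down to roughly the following. This is a sketch based on the Octopus-era docs, so the release name and the monitor IP are placeholders you’ll want to adjust for your own setup:

# grab the standalone cephadm script and make it executable
curl --silent --remote-name https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
chmod +x cephadm

# add the Ceph apt repo, then install the packaged cephadm and CLI tools
./cephadm add-repo --release octopus
./cephadm install

# bootstrap a single-node cluster; the IP is this host's own address
sudo cephadm bootstrap --mon-ip <MON-IP>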
Creating your Ceph Pools⌗
Creating a default pool is easy. Just ceph osd pool create cephfs_data and ceph osd pool create cephfs_meta and we’ve got the two pools we need for CephFS. Unfortunately, I’m a special snowflake, and I don’t like Ceph’s default of replicated pools, so I’ll instead opt for an erasure-coded pool for the data pool. Think of the difference between them as “replicated pools” being RAID 1 and “erasure coded” being RAID 5.
Adding another layer of complexity, I want to shard the data a bit more effectively, as I’ll be hosting large files on this cluster, so modifying the erasure profile is going to happen as well.
- Create a new erasure profile with the parameters wanted
  - I’m using 10 data shards with 1 coding shard, so k=10, m=1 (a stripe spans 11 OSDs and survives the loss of any single one).
  - Additionally, the “failure domain” will be on the OSD (read: disk) level.
  - ceph osd erasure-code-profile set data-pool-profile k=10 m=1 crush-failure-domain=osd
- Create a new data pool with the new erasure profile
  - ceph osd pool create cephfs_data erasure data-pool-profile
- Enable erasure-code overwrites
  - A requirement for CephFS
  - ceph osd pool set cephfs_data allow_ec_overwrites true
- Create a new metadata pool
  - The CephFS metadata pool cannot be an erasure-coded pool.
  - ceph osd pool create cephfs_meta
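Not something the steps above strictly need, but if you want to double-check what you just created, Ceph will echo it back:

# show the erasure profile as Ceph stored it
ceph osd erasure-code-profile get data-pool-profile

# list both pools along with their replication/EC settings
ceph osd pool ls detail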
Actually make your Metadata pool usable⌗
Great, so we’re all set up for getting set up for CephFS, but unfortunately the cluster won’t allow I/O because the cephfs_meta pool isn’t healthy. Why? Because it’s a replicated pool, and the default CRUSH map says that our failure domain for the pool is a host failure. Unfortunately, our issue is that we don’t have another node available yet. We could figure out how to add one, but my other server has data I want to get into Ceph first.
What’s the solution? Not an easy one. We have to modify the default CRUSH map, which requires some specialized tooling: CRUSH maps are actually compiled into a different format that Ceph can use directly. Is it recommended to do this? Nope, but we’re running Ceph on two nodes, so caution be damned.
Step 0: Make sure you have crushtool installed; if you don’t, cephadm install ceph.
1. Retrieve the current (compiled) CRUSH map
   - ceph osd getcrushmap -o compiled-crush.map
2. Decompile compiled-crush.map
   - crushtool -d compiled-crush.map -o crush.map
3. Modify crush.map to change the failure domain (see the snippet after this list)
   - Find the line step chooseleaf firstn 0 type host
   - Change it to the required domain, in our case osd
4. Compile crush.map
   - crushtool -c crush.map -o new-compiled-crush.map
5. Configure Ceph with new-compiled-crush.map
   - ceph osd setcrushmap -i new-compiled-crush.map
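For reference, the stanza you’re hunting for in crush.map looks something like this after the edit (the rule name and id can differ between clusters; only the chooseleaf line changes):

rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type osd    # was "type host"
    step emit
}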
After #5, you should see your cephfs_meta pool go healthy, and you’re ready to create a new CephFS mount.
Create a new CephFS⌗
Alrighty, so now we’ve got a cluster, two healthy pools, and nothing to show for it. Let’s get on with CephFS.
I’m not going to cover this in detail, as the documentation about setting up a CephFS is actually pretty decent. It’s mostly ceph fs new <name> cephfs_meta cephfs_data and then mounting it on the actual system where you want it using mount -t ceph :/ <mnt directory> -o fs=<name>,name=<CephX user>. In my case, I’ve actually got several FSs (which is experimental, of course), for Docker volumes and raw media storage. The usage of such will be covered in the next part.
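Stitched together, it looks roughly like this for one filesystem. The filesystem name media and the CephX user client.media are just examples I’m using here, not anything the docs mandate:

# create the filesystem from the two pools made earlier
ceph fs new media cephfs_meta cephfs_data

# create a CephX user that can read/write the whole FS, and stash its keyring
ceph fs authorize media client.media / rw | sudo tee /etc/ceph/ceph.client.media.keyring

# mount it; the kernel client picks the key up from /etc/ceph
sudo mkdir -p /mnt/cephfs_media
sudo mount -t ceph :/ /mnt/cephfs_media -o name=media,fs=media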
Spooky Scary Docker Swarm⌗
A while ago, Docker Swarm was the laughing stock of the part of the tech community I was in, mostly because it has a similar level of complexity to Kubernetes. That kept me from trying it for quite some time, but I’ve been hearing different things through the grapevine, so I decided to give it another try, and considering my current infrastructure, it’s not a terrible thing to get started with.
As a note, yes, I know that it’s now called “Swarm mode”, but I’m going to keep calling it Swarm. Don’t @ me.
Creating a Swarm⌗
Surprisingly, Docker Swarm is pretty simple to get started with once you have Docker installed (which is required for Ceph anyways):
docker swarm init --advertise-addr <MANAGER-IP>
Ta-dah! You’ve configured the first instance in your swarm. The STDOUT of the command will give you another command, docker swarm join <token>, to run on any other nodes, as well as some instructions on how to set up another manager node, if that’s what floats your boat.
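Since I want the second box to be a manager too, the extra steps look roughly like this (node-name details aside):

# print a join command that uses a *manager* token instead of a worker one
docker swarm join-token manager

# run the printed "docker swarm join --token ... <ip>:2377" line on the other node,
# then confirm both nodes are in and check their roles
docker node ls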
Managing the Swarm⌗
Swarm breaks things down into a few different components:
- Tasks: a definition of setting up a container
- Services: a collection of Tasks
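As a quick illustration (a throwaway service that has nothing to do with the media stack):

# one service, two replicas: Swarm schedules two tasks (containers) for it
docker service create --name hello --replicas 2 nginx:alpine

# each row here is one task, pinned to whichever node Swarm picked
docker service ps hello

# clean up
docker service rm hello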
In my case, I was setting up Plex and the supporting services (read: tasks) around it, which is easy to do using a docker-compose.yml file. I’m planning on going into detail on my actual setup in another blog post, but for now:
version: "3"
services:
######
## Plex
######
plex:
image: plexinc/pms-docker:plexpass
restart: unless-stopped
environment:
- TZ=${TZ}
- PLEX_UID=${PUID}
- PLEX_GID=${PGID}
volumes:
- /mnt/cephfs_docker/config/plex:/config
- /mnt/cephfs_media/:/data
networks:
media-internal:
aliases:
- plex
ports:
- "32400:32400/tcp"
- "3005:3005/tcp"
- "8324:8324/tcp"
- "32469:32469/tcp"
- "1900:1900/udp"
- "32410:32410/udp"
- "32412:32412/udp"
- "32413:32413/udp"
- "32414:32414/udp"
networks:
default:
external:
name: media-internal
media-internal:
external: true
This is the compose file that we’re going to be working with. So how do we deploy things onto the swarm?
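One way to do it straight from the CLI (Portainer, below, gives you a friendlier way) is a plain docker stack deploy. A quick sketch, assuming the file above is saved as media.yml and that I name the stack media:

# the overlay network declared as "external: true" has to exist before the stack does
docker network create --driver overlay --attachable media-internal

# deploy (or later, update) the stack; "media" becomes the prefix for its services
docker stack deploy --compose-file=media.yml media

# see where the plex task actually landed
docker service ps media_plex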
Shipping things with Portainer⌗
Portainer is a UI for managing Docker instances, and even Kubernetes. Personally, I like it as a way to take a peek at services without having to break open another shell. It even works on mobile! There is some installation documentation on their website; however, there are a few tweaks that I made to their compose file.
version: '3.2'
services:
  agent:
    image: portainer/agent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - agent_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]
  portainer:
    image: portainer/portainer-ce
    command: -H tcp://tasks.agent:9001 --tlsskipverify
    ports:
      - "9000:9000"
      - "8000:8000"
    volumes:
      - /mnt/cephfs_docker:/data
    networks:
      - agent_network
      - internal
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]
networks:
  agent_network:
    driver: overlay
    attachable: true
  internal:
    external: true
Some things to note with this, compared to the one found in the documentation:
- Removed the volume mount and replaced it with a bind to a CephFS.
  - As my swarm has two nodes in it, I’ve actually set my second node as a manager as well.
  - If the pod were to get scheduled on the other node, the default volume mount is only local to the machine. This means Portainer would start from scratch and want me to configure a user again. Not quite what I wanted there.
- Added the internal network, which handles communication between my reverse proxy and Portainer.
  - This will be revisited in the next edition of this post.
To deploy Portainer, just save the above YAML file as portainer.yml and then run docker stack deploy --compose-file=portainer.yml portainer (the trailing argument is the stack name). This will get the agent out onto every node in the swarm and set up the actual management instance as well. Once it’s up, you’ll be able to access and start configuring things through port 9000.
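A quick way to confirm everything landed where it should (service names get prefixed with whatever stack name you passed, so portainer_agent and portainer_portainer here):

# one agent task per node (it's a global service), one portainer task on a manager
docker stack ps portainer

# tail the server's logs if port 9000 isn't answering
docker service logs --tail 20 portainer_portainer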
The End (for now)⌗
And that’s my cluster thus far. This blog post was mostly for me to keep track of everything that I’ve done, so I don’t have to go find it from all the different sources again. It’s a fun adventure that’s just getting started!
Stay tuned for the next installment, where I’ll be setting up all of the services and doing some security-first changes.
Q.