I want to practice a little Kubernetes ML-Ops (machine learning operations) at home, just for the fun of it. With so much stuff lying around, especially SBCs (single board computers) and so much great open source software available, that should be possible.
A number of people have done similar things, but this won’t be a step-by-step tutorial, more of a series of articles for my own documentation, a lot of ranting about the state of things (everything is defective by default), and a number of lessons learned (the hard way).
Hardware
So let's see what we have:
- An Odroid H2+ with 16GB RAM for the control plane VMs, running Fedora 35 Server
- Two NVIDIA Jetson Nano 4GB Developer Kits
- Two NVIDIA Jetson Nano 2GB Developer Kits (thanks to global supply shortages)
- Two Raspberry Pi 4 with 4GB RAM
- Two Raspberry Pi 4 with 8GB RAM
- A 10-port Cisco gigabit managed switch
- Piles of assorted cables, power supplies, cases, stuff, etc.
Note that there is no BOM (bill of materials) or cost breakdown, because most of these things have been lying around for ages, and are repurposed from various other projects. The only things I bought specifically for this project are the two Jetson Nano 2GB Developer Kits, because the other two seemed so lonely, and there was still space in the case.
Choices and Decisions
I will use “vanilla” Kubernetes and kubeadm for the deployment. Not the best choice, since K3s performs better on SBCs and is easier to deploy. There is also Talos OS, apparently very good, and lots of systems to automate deployment, like Typhoon or kubespray. But this is my personal learning project, so I chose the systems I either know or want/need to learn (a rough kubeadm config sketch follows the list below):
- Fedora CoreOS for the nodes, because that’s similar to what OpenShift uses (RHCOS), and I teach OpenShift. Plus, I like the idea of having an immutable OS. Apparently it works on the Raspberry Pi 4, so we’ll see how that goes. The Jetsons will have to use some kind of Ubuntu, thanks to the proprietary NVIDIA blobs.
- Ansible for deployment and automation, as much as possible. Because I teach Ansible, and love it.
- Weave Net for the network layer. Simply because.
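To make the kubeadm part a bit more concrete: bootstrapping would go through a config file roughly like the sketch below. Every value in it (version, endpoint name, subnet) is a placeholder, not a decision I have already made.

```yaml
# kubeadm-config.yaml -- a rough sketch, all values are placeholders
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: "v1.23.0"               # whatever is current by the time I get there
controlPlaneEndpoint: "k8s-api.lan:6443"   # shared endpoint in front of the three control plane nodes
networking:
  podSubnet: "10.32.0.0/12"                # Weave Net's default pod range
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd                      # matches the systemd cgroup setup on Fedora CoreOS
```

The first control plane node would then get `kubeadm init --config kubeadm-config.yaml --upload-certs`, the other two join with `kubeadm join`, and Weave Net gets applied on top of the running control plane with `kubectl apply -f` and its manifest.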
Planning and Timeframe
There is no set timeframe, this is a hobby project. But I do have a rough plan:
- Get the master nodes working. So set up an FCOS deployment system on the master machine, and try to automate that as much as possible (a minimal Butane sketch follows this list).
- Get a basic 3-node k8s deployment up, with only the control plane.
- Get the Jetsons up and running. The provided OS (L4T, based on ancient Ubuntu) is a big mess, so there are a bunch of challenges ahead.
- Try to run Fedora CoreOS on the Pis. No idea how that will go.
- Automate everything (see the Ansible sketch below).
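For the FCOS part of the plan: Fedora CoreOS is provisioned through Ignition, which in turn is generated from a Butane config. A minimal sketch of what one node config could look like, with a made-up hostname and key:

```yaml
# cp01.bu -- minimal Butane sketch for one node; hostname and SSH key are placeholders
variant: fcos
version: 1.4.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - "ssh-ed25519 AAAA... homelab"    # placeholder public key
storage:
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: cp01
```

That gets converted with `butane --pretty --strict cp01.bu > cp01.ign` and handed to the installer.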
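And since the whole point is automation, rendering those per-node configs is exactly the kind of chore I want Ansible to do for me. A hypothetical playbook sketch; the template name, build directory, and host list are all made up:

```yaml
# render-ignition.yml -- hypothetical sketch, not the final automation
- name: Render Ignition configs for the FCOS nodes
  hosts: localhost
  gather_facts: false
  vars:
    nodes: [cp01, cp02, cp03]              # placeholder node names
  tasks:
    - name: Template one Butane config per node
      ansible.builtin.template:
        src: node.bu.j2                    # hypothetical Jinja2 template (hostname, SSH key, ...)
        dest: "build/{{ item }}.bu"
      loop: "{{ nodes }}"

    - name: Convert Butane to Ignition
      ansible.builtin.shell:
        cmd: "butane --pretty --strict build/{{ item }}.bu > build/{{ item }}.ign"
      loop: "{{ nodes }}"
```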
The Future
Once the cluster as such is running (hopefully one day), I can start playing around with automation and GitOps in a machine learning context. Well, the decade is still long…