Building a Kubernetes ML Cluster on SBCs - Part 0

I want to practice a little Kubernetes ML-Ops (machine learning operations) at home, just for the fun of it. With so much stuff lying around, especially SBCs (single board computers), and so much great open source software available, that should be possible.

A number of people have done similar things, but this won’t be a step-by-step tutorial; it’s more a series of articles for my own documentation, a lot of ranting about the state of things (everything is defective by default), and a number of lessons learned (the hard way).

Hardware

So let’s see what we have:

Note that there is no BOM (bill of materials) or cost breakdown, because most of these things have been lying around for ages, and are repurposed from various other projects. The only things I bought specifically for this project are the two Jetson 2GB Developer Kits, because the other two seemed so lonely, and there was still space in the case.

Choices and Decisions

I will use “vanilla” Kubernetes and kubeadm for the deployment. Not the best choice, because K3s performs better on SBCs and is easier to deploy. There is also Talos OS, apparently very good, and lots of systems to automate deployment, like Typhoon or Kubespray. But this is my personal learning project, so I choose the systems I either know or want/need to learn (a rough sketch of how they fit together follows the list):

  • Fedora CoreOS for the nodes, because that’s similar to what OpenShift uses (RHCOS), and I teach OpenShift. Plus, I like the idea of having an immutable OS. Apparently it works on the Raspberry Pi 4, so we’ll see how that goes. The Jetsons will have to use some kind of Ubuntu, thanks to the proprietary NVIDIA blobs.
  • Ansible for deployment and automation, as much as possible. Because I teach Ansible, and love it.
  • Weave Net for the network layer. Simply because.
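
To get a feel for how these choices fit together, here is a rough sketch of an Ansible play that bootstraps the first control-plane node with kubeadm and then applies the Weave Net manifest. The group name, API endpoint, pod CIDR and the weave_manifest_url variable are placeholders for illustration, not final decisions.

    # Rough sketch only: bring up the first control-plane node with kubeadm,
    # then apply Weave Net. All names, addresses and URLs below are placeholders.
    - name: Bootstrap first control-plane node
      hosts: control_plane[0]
      become: true
      vars:
        weave_manifest_url: "https://cloud.weave.works/k8s/net"  # placeholder URL
      tasks:
        - name: Run kubeadm init (skipped if the node is already initialized)
          ansible.builtin.command: >
            kubeadm init
            --control-plane-endpoint=k8s-api.home.lan:6443
            --pod-network-cidr=10.32.0.0/12
          args:
            creates: /etc/kubernetes/admin.conf

        - name: Apply the Weave Net manifest as the pod network
          ansible.builtin.command: >
            kubectl --kubeconfig /etc/kubernetes/admin.conf
            apply -f {{ weave_manifest_url }}
          register: weave_apply
          changed_when: "'created' in weave_apply.stdout"

Idempotence is handled crudely here via creates:; a proper role would also manage join tokens for the remaining control-plane and worker nodes.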

Planning and Timeframe

There is no set timeframe; this is a hobby project. But I do have a rough plan:

  1. Get the master nodes working: set up an FCOS deployment system on the master machine, and try to automate that as much as possible (a first sketch of the node configuration follows after this list).
  2. Get a basic 3-node k8s deployment up, with only the control plane.
  3. Get the Jetsons up and running. The provided OS (L4T, based on ancient Ubuntu) is a big mess, so there are a bunch of challenges ahead.
  4. Try to run Fedora CoreOS on the Pis. No idea how that will go.
  5. Automate everything.
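
For step 1, the plan is to provision the FCOS nodes from Ignition files generated out of Butane sources. A minimal sketch of what one node’s Butane config might look like, assuming the stock core user; the SSH key and the hostname master-01 are placeholders:

    # Minimal Butane sketch for one FCOS node (all values are placeholders).
    variant: fcos
    version: 1.4.0
    passwd:
      users:
        - name: core
          ssh_authorized_keys:
            - ssh-ed25519 AAAA...  # placeholder public key
    storage:
      files:
        - path: /etc/hostname
          mode: 0644
          contents:
            inline: master-01  # placeholder hostname

This would get transpiled with something like butane --pretty --strict master-01.bu > master-01.ign and handed to coreos-installer during installation.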

The Future

Once the cluster as such is running (hopefully one day), I can start playing around with automation and GitOps in a machine learning context. Well, the decade is still long…

