
Introduction

Welcome!

This guide is the one I wish I had when I set out to provision my Kubernetes cluster. It can be very difficult as a beginner to navigate a landscape where everything is modular, there are several options for even the most basic things like storage and networking, and everyone just says “it depends” with no further context when asked which one to use. Sometimes it’s useful to just have someone tell you what to use, so this is me, telling you what I use. You may like it, you may not, but hopefully it can give you something to go on.

I recommend taking this slow. Build a lab with virtual machines or unused PCs first and walk through this manual step by step. There is a lot involved in setting up your first Kube cluster, so please do yourself a favor and don’t skip sections until you get to the maintenance guide. If there is information you believe to be missing that can’t be found easily online, feel free to open an issue on the GitHub repo. I will only be covering the stack that I use; if you’d like to use your own stack, feel free to fork this repo. The point of this manual is to provide a set of good technologies that will fulfill the requirements of most homelab environments, not to explore every option in detail. Once the manual is complete, I would like to add a section that explores commonly-used components and how they differ, but I will only be covering the installation process for my chosen stack.

Work in progress

Please note that this guide is not finished. I am actively writing it to document how my new cluster is set up for disaster recovery purposes. There will be many sections that just say “TODO”. Please keep this in mind when reading.

Stack & Justification

TODO: comparisons

| Component | Chosen Technology | Required for basic operation |
|---|---|---|
| Operating System (OS) | Talos Linux | ✅️ |
| Container Runtime Interface (CRI) | Containerd | ✅️ |
| Container Network Interface (CNI) | Calico | ✅️ |
| Load Balancer | MetalLB | ❌️ (recommended) |
| Container Storage Interface (CSI) | Rook (Ceph) | ✅️ |
| Certificate management | CertManager | ❌️ (recommended) |
| Ingress / Gateway API controller | Traefik | ❌️ (recommended) |
| GitOps | FluxCD | ❌️ (recommended) |
| Postgres databases | Cloud-Native Postgres (CNPG) | ❌️ |
| Virtual machine management | KubeVirt | ❌️ |

Required Skills

Kubernetes is a beast, and should not be the first thing you go for when learning about server administration or cloud environments. This guide assumes you already have a solid foundation in the following areas:

  • Git
  • Linux System Administration
    • CLI
    • Disk management
    • Package management
    • Virtual machines
    • Certificate management (acme.sh, certbot, letsencrypt, or similar technologies)
  • Linux Containers (one of Docker, Podman, etc)
  • Networking Fundamentals
    • IP addressing
    • Subnetting
    • VLANs
    • Firewalls
    • Routing
    • DHCP
    • ARP

What to do if you’re not ready

You may be able to get by without expertise in some of these areas, but expect to do a lot of Googling and YouTube-watching. Covering all of these areas is out of scope for this manual, as it would balloon out of control and no longer be useful for me. I would recommend at least taking a Linux+ course (even if you don’t get the cert) before attempting to start this journey. It will help you immensely. It should give you at least a shallow set of knowledge on all of these areas and prepare you well for Kubernetes.

I recommend Shawn Powers’ Linux+ video courses on YouTube and CBTNuggets.

Initial Bringup

I’ll be using Talos Linux, because having provisioned my own cluster with kubeadm, it’s what I wish I used to begin with. This section goes over the bring-up process for Talos Linux using my chosen Kubernetes stack.

Generate base configs

Official Docs

First, you’ll need to generate the base configurations for Talos. To do this, cd to a directory where you are comfortable storing secrets and run the following commands:

talosctl gen secrets -o secrets.yaml
talosctl gen config --with-secrets secrets.yaml <cluster_name> https://<kubernetes_endpoint>:6443
talosctl config merge ./talosconfig
talosctl config endpoint <kubernetes_endpoint ...>

Get an install image

TODO

Start nodes

Talos seems to be massively overcomplicated for network configuration. It’s probably best to stick to DHCP with static leases for now…

First, bring up the nodes with the appropriate install images. Once you see the Linux boot logs, you can remove the install drive and move on to the next node.

Once the node is booted, you can get its mac address with the following command:

talosctl get links --insecure --nodes <node_ip>

This may help with DHCP static lease configuration. It’s not amazing that this has to be done after the node has already received a lease, but whatever… It would be nice if they displayed the MAC ANYWHERE on the dashboard in maintenance mode…

Create patch files

Next, you’ll want to create a patch file for each node. This provides important information to the installer that may or may not be specific to that node, such as the data disk, the system schematic (for adding extra drivers, etc), and overrides for some default components.

Disks

Use the following command to list disks on the node:

talosctl get disks --insecure --nodes <node_ip>

Schematic

Use the image factory at the following link to acquire an “Initial Installation” image URL:

Talos Image Factory

CNI Override

Also, if you don’t like Flannel and want to use a CNI capable of actually isolating pods/namespaces and controlling traffic, make sure to include cluster.network.cni.name = none, as shown in my example below. Flannel is great for a demonstration, but doesn’t include any kind of network policy management, so you may want to use something like Calico instead.

Example

Use this information to create a configuration file with a .yaml extension similar to this one:

machine:
  install:
    disk: /dev/vda
    image: factory.talos.dev/metal-installer-secureboot/d65015d8cb6aeafd3607f403cf96d63c5e1d9d16cda42709dc42c5c1e85f1929:v1.12.1
cluster:
  network:
    cni:
      name: none

Add pull-through cache mirrors (optional)

If you have a pull-through cache set up (most probably to mitigate docker.io’s rate limiting errors), you can add the following config to your patch files to ensure each node is set up to use your cache (duplicate for each registry):

---
apiVersion: v1alpha1
kind: RegistryMirrorConfig
name: docker.io
endpoints:
    - url: https://<your_domain>/v2/<cache_namespace>
      overridePath: true

This can be used with JFrog Artifactory or Harbor’s pull-through caches; the cache_namespace piece can be removed if you’re using a cache that doesn’t serve images under a subdirectory.

Apply config to nodes

Next, you’ll want to apply your configurations to the nodes. You can do this using the following command:

talosctl apply-config --insecure --file <base_config> --config-patch @<node_patch_file> --nodes <node_ip>

The base config is either controlplane.yaml or worker.yaml, depending on the node’s role. The node_patch_files are the patch files you created in the previous step; I recommend having one for each node.
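For example, pushing the control-plane base config plus a node-specific patch to a node in maintenance mode might look like this (the IP and patch file name are hypothetical; the @ prefix tells talosctl to read the patch from a file rather than treating the argument as an inline patch):

```shell
talosctl apply-config --insecure \
  --file controlplane.yaml \
  --config-patch @cp1-patch.yaml \
  --nodes 192.168.1.21
```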

Bootstrap

Now you can bootstrap the cluster using the following command:

talosctl bootstrap --nodes <control_plane_ip>

This will prompt Talos to set up etcd and bring up the cluster.
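While everything converges, you can keep an eye on progress from the Talos side. These are standard talosctl commands (point them at whichever control plane node you bootstrapped):

```shell
# Live dashboard with logs and resource usage for the node
talosctl dashboard --nodes <control_plane_ip>

# One-shot health check that waits for etcd, the Kubernetes API
# server, and the kubelets to become healthy
talosctl health --nodes <control_plane_ip>
```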

Add cluster to kubectl contexts & monitor cluster bring-up

Next, you’ll want to access the Kubernetes API of the cluster and check on the progress of cluster bring-up. Use the following command to add the Talos cluster to your kubectl contexts:

talosctl kubeconfig --nodes <control_plane_ip>

Now, you can use the following command to see the nodes in the cluster:

kubectl get nodes

If you don’t see all of your nodes (including the control plane), try again. It may take a little bit for them all to appear.

Conclusion

Congratulations! You have a cluster! Here’s a brief summary of what we just did:

  • Download Talos Linux and flash it to a USB drive
  • Boot Talos Linux on all nodes
  • Generate certificates for the Talos & Kubernetes API servers
  • Write patch files for each node based on information we retrieved from the CLI
  • Install Talos into the nodes
  • Configure kubectl to control the Talos cluster

If you opted not to use Flannel, you won’t have any networking just yet. Next, you’ll install the Calico CNI to provide your networking stack, and MetalLB to provide support for LoadBalancer services, using gratuitous ARP to steer traffic to service IPs.

Tools

Managing a Kubernetes cluster can be complicated and difficult. Fortunately, Kubernetes is all about automation, and we have a variety of tools at our disposal to help with all this. In this section, we will briefly touch on a few of them that get used throughout this guide. There may be more tools in other sections that are specific to those sections, but this page will walk you through some of the common ones you’ll use frequently.

Helm

First up is Helm. Helm is to Kubernetes what apt is to Debian Linux. It collects sets of resources, allows customization, and keeps track of them, allowing you to easily add, remove, and update without having to worry about things like cleaning up garbage from old package versions.

We won’t be interacting with Helm directly much in this guide (just for installing the CNI and for rendering templates), because there’s another tool called FluxCD that will allow you to track your manifests, Helm charts, etc. from a Git repository.

Basic Concepts

Helm actually has pretty good documentation, so I won’t go into much detail here. The one thing you really shouldn’t miss is that customizations to a Helm chart are done using a values.yaml file. Each chart defines its own default values, which it uses within a series of template files that merge in your overrides to generate a Kubernetes manifest (a collection of resources defined in YAML).
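As a purely hypothetical illustration (the chart and value names here are invented), an override file might look like:

```yaml
# values.yaml -- overrides for a hypothetical chart
replicaCount: 3
image:
  repository: nginx
  tag: "1.27"
```

A template inside the chart could then reference these as {{ .Values.replicaCount }} and {{ .Values.image.repository }}. You can preview the final rendered manifest without installing anything using helm template <chart> -f values.yaml.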

Installing Helm

Helm has instructions for installation at the following link:

Using Helm

Helm has a concept of “repos”, similar to apt, which allow people to host their own collections of Kubernetes manifests as Helm charts. The general process for installing a Helm chart is as follows:

helm repo add <repo_name> <url>
helm show values <repo_name>/<chart_name> > values.yaml
helm install --namespace <namespace_name> <arbitrary_name> <repo_name>/<chart_name> -f values.yaml

Krew

Next, we have Krew. Krew is a “plugin manager” for kubectl. Some frameworks that you install in Kubernetes have a lot going on, or wrap some kind of pre-existing technology, and require jumping through some extra hoops to interact with them.

This is where kubectl plugins come in. They are a way to extend the kubectl CLI with additional functionality. For example, in a later section, you’ll be installing the rook-ceph kubectl plugin through krew to interact with the Ceph CLI via Kubernetes. In another section, you’ll be installing the virt kubectl plugin to interact with KubeVirt for managing your virtual machines. Krew is a handy package manager for installing these plugins.

Installation

Krew actually has an installation guide that’s short and sweet, so you can just follow the instructions here:

Krew Installation Guide

Using Krew

Official Quickstart Guide

Krew is pretty easy to use, and usually when you need it, the documentation of whatever you’re working on will tell you how to use it. Here’s a brief overview though just in case.

Update package cache:

kubectl krew update

Search for a package:

kubectl krew search <keyword>

Install a package:

kubectl krew install <package>

Upgrade packages:

kubectl krew upgrade

Uninstall a package:

kubectl krew uninstall <package>

Networking Setup

For networking, we will be using Calico for our CNI and MetalLB for our LoadBalancer service manager.

Calico Setup

First, we’ll set up our CNI. If you’ve opted to use Flannel, you can skip this section. Otherwise, go ahead and install the Tigera Operator using Helm.

First, create your values.yml file for Tigera Operator:

installation:
  cni:
    type: Calico
  calicoNetwork:
    bgp: Disabled
    ipPools:
      - cidr: 10.244.0.0/16
        encapsulation: VXLAN

Next, install the operator:

kubectl apply -f- << EOF
---
apiVersion: v1
kind: Namespace
metadata:
  name: tigera-operator
  labels:
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged
EOF
helm repo add projectcalico https://docs.tigera.io/calico/charts
helm install \
  --create-namespace \
  --namespace tigera-operator \
  --version v3.31.3 \
  -f values.yml \
  calico \
  projectcalico/tigera-operator

This will configure Calico to use 10.244.0.0/16 as the range for IPAM (IP address management), and to use VXLAN, a type of overlay network that allows pod traffic to safely cross your existing network when travelling between nodes.
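Once the operator finishes reconciling, you can verify that the CNI came up; something like the following should show the Calico components available and your nodes transitioning to Ready:

```shell
# Operator status summary (apiserver/calico should report Available)
kubectl get tigerastatus

# calico-node should be Running on every node
kubectl get pods -n calico-system

# Nodes become Ready once the CNI is functional
kubectl get nodes
```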

MetalLB Setup

Finally, we’ll set up MetalLB. This will provide the LoadBalancer service implementation.

You’ll need:

  • The names of the interfaces to use for advertising address changes
  • A range of IP addresses that MetalLB can take exclusive control of

First, create the metallb-system namespace with extra privileges:

# kubectl apply -f <file>
---
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
  labels:
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged

Then install MetalLB with Helm:

helm repo add metallb https://metallb.github.io/metallb
helm install --namespace metallb-system metallb metallb/metallb

And finally, add your configuration. Make sure to add your network interfaces and IP pools:

---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: simple-services-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
  - simple-services
  interfaces:
  - eno1
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: simple-services
  namespace: metallb-system
spec:
  addresses:
  - x.x.x.x-x.x.x.x

FluxCD (TODO)

FluxCD isn’t required for a Kubernetes cluster, but I strongly recommend it. I haven’t personally used this particular offering before (my experience is with ArgoCD), so this section may have some rough edges until the manual is finished.

I have chosen to use FluxCD for my cluster rebuild project because of its low-dependency operating style. It seems to follow the UNIX philosophy of doing one thing and doing it well. It has no UI and no auth system, just a synchronization controller that ensures your Kube cluster is synced with your Git project. After installing it, I realized it doesn’t even use any PVCs, which means it can manage your Rook Ceph installation as well.

To install FluxCD, pick a provider for your Git repository and follow their documentation linked below:

FluxCD Bootstrap Guide

FluxCD also has a “Getting Started” guide that will take you through some more details past bootstrap:

FluxCD Getting Started

The rest of this guide will assume you’ve gone through FluxCD’s getting started guide and have at least done the exercise.

Overview

FluxCD adds custom resource definitions (CRDs) to Kubernetes and manages deployment of resources based on these CRDs. It regularly polls data sources, checks for updates, and redeploys them automatically.

NOTE: The examples here are to give you a basic understanding of how FluxCD deploys resources. They are not intended for you to follow along.

Managing Helm Charts
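TODO. For a taste, though: Flux manages Helm charts with a HelmRepository source plus a HelmRelease that references it. A sketch (using the MetalLB chart from earlier; API versions and fields may differ between Flux versions, so treat this as illustrative):

```yaml
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: metallb
  namespace: flux-system
spec:
  url: https://metallb.github.io/metallb
  interval: 1h
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: metallb
  namespace: metallb-system
spec:
  interval: 10m
  chart:
    spec:
      chart: metallb
      sourceRef:
        kind: HelmRepository
        name: metallb
        namespace: flux-system
  # Chart overrides go here, same as a values.yaml file
  values: {}
```

Once these are committed to the repository Flux watches, the source and helm controllers fetch the chart and keep the release reconciled automatically.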

Rook Ceph

Rook is a management overlay for Ceph that can deploy it in a Kubernetes cluster. If you don’t know what Ceph is, I highly recommend watching some of the videos on it by the guys at 45drives.

The elevator pitch: “It’s like RAID, but it joins multiple servers instead of just drives.”

The more nuanced explanation:

  • Imagine if you could take every piece of data you want to store and break it down into manageable byte-sized pieces.
  • Now imagine that you could replicate each of these pieces of data a number of times to create redundancy, so if you lose one of them, you have replicas to pull from, and when all is functioning normally, you could run periodic checks to make sure that all of your replicas actually have the same data. If two replicas say the data is “a”, but one says it’s “b”, then chances are that “b” was supposed to be an “a”. This solves both availability and integrity problems (like bitrot).
  • Now imagine that you could build an algorithm that takes into account the number of drives you have, the size of each one, the number of hosts you have, which drives are in which hosts, as well as your entire datacenter hierarchy, and uses that to determine which drive a piece of data should go to in order to maintain a certain amount of redundancy across any failure domain of your choice.
  • Now imagine that each drive had its own server that you could communicate with directly, and there was a server to ensure that not only are your redundancy requirements enforced when the data is created, but also as requirements and resource availability change.
  • Now imagine that you could build interfaces on top of this data storage strategy which exposes a filesystem, a block device, and an S3-compatible object store.

That’s Ceph.

It’s big, and it’s complicated, but it is a true marvel of engineering that solves an extremely complicated problem in a way that any application can interface with and have no idea that anything has changed. One of my personal biggest use cases for it is CephFS. CephFS is the only open source network filesystem that I know of which is fully POSIX compliant, and it’s built on a rock-solid data redundancy system that has never fully failed on me except when I got paranoid and forced it to do something stupid.

I highly recommend learning to manage Ceph properly, but what’s even nicer about all this is that unless you’re doing anything crazy like me, you likely won’t have to do anything more than initiate automated repair for occasional data corruption after heavy use. I’ll get into some more Ceph management essentials in a later section of the book, but Rook takes care of most of it for you.

Side note: If Ceph tells you that your PGs are inconsistent, don’t freak out. It knows what it’s doing, and if you instruct it to repair, it almost always will, but it takes some time, so stay calm and give it space to work. I’ll go over repairing data in the maintenance guide.

Installation

We will be using Helm to install Rook. I am writing this manual off of the official docs which can be found here:

Ceph Operator Helm Chart

[Aside] Why Helm?

I recommend using the Helm repo instead of applying the manifest files directly because helm can help you manage resources after they’ve been deployed. This is greatly helpful for things like upgrades where resources may be added, modified, or removed, and can be difficult to keep track of by hand.

Normally, I would recommend tracking everything in a GitOps platform like FluxCD, but since Ceph provides the CSI that FluxCD needs in order to run properly, we unfortunately cannot use FluxCD yet.

Install Rook Ceph Operator

First, add the Rook Helm chart repository.

helm repo add rook-release https://charts.rook.io/release

Next, create your values.yml file. I personally like the default config, so I won’t be creating one. However, you can use this section of the docs to determine the variables you want to use. Note that the . characters used in the parameter names represent nested values. For example, if you want to set the crds.enabled parameter to false, you’d use the following yaml:

crds:
  enabled: false

If you choose to customize any values, make sure to add -f values.yml to the end of your install command.

According to the Talos Linux documentation, the default configuration does not allow privileged pods. You’ll need to create the namespace first to allow them:

# kubectl apply -f <file>
---
apiVersion: v1
kind: Namespace
metadata:
  name: rook-ceph
  labels:
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged

Ceph needs privileged pods to effectively manage disks on the host.

Once that’s done, install the Rook Ceph Operator Helm chart using the following command:

helm install --namespace rook-ceph rook-ceph rook-release/rook-ceph

Install Rook Ceph Cluster

Now that the Rook Ceph operator has been installed, we need to add a CephCluster resource in order to get it to provision a Ceph Cluster. This is because the operator is not itself a cluster. It is an automated interface for managing clusters, and is actually capable of managing more than one. Therefore, we need to give it a cluster definition for it to build one. You can find details on this process at the following link:

CephCluster CRD Documentation

CephCluster Full Example

I’ll be creating a Ceph cluster that spans every node in the Kubernetes cluster and consumes all unpartitioned disks.

# kubectl apply -f <file>
---
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 1
    allowMultiplePerNode: false
    modules:
      # List of modules to optionally enable or disable.
      # Note the "dashboard" and "monitoring" modules are already configured by
      # other settings in the cluster CR.
      # I recommend the "rook" module to inform the dashboard that Ceph
      # resources are configured by Kubernetes manifests.
      # I also recommend the "nfs" module as it will provide easy configuration
      # of NFS exports via the dashboard. This is very useful for things like
      # pre-seeding PVCs, data export, and troubleshooting.
      - name: rook
        enabled: true
      - name: nfs
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  storage:
    useAllNodes: true
    useAllDevices: true
    config:
      encryptedDevice: "true"
  disruptionManagement:
    managePodBudgets: true
  # This can be turned on to help with OSD removal. Since OSDs will be
  # automatically marked "out" if they are offline for too long, I recommend
  # keeping this off except when you need it.
  removeOSDsIfOutAndSafeToRemove: false

Install rook-ceph Krew Plugin

Much of Ceph’s administration post-install happens via CLI, so you’ll want to make sure you have it. You can either deploy the “toolbox” container (the official docs go over this), or you can use the kubectl plugin. I recommend the plugin for simplicity. Assuming you installed krew from the tools section, you can get the rook-ceph plugin using the following command:

kubectl krew install rook-ceph

Now you can run ceph commands like so:

# `ceph status` becomes...
kubectl rook-ceph ceph status

I recommend adding aliases to the rc file for whatever shell you use:

alias ceph='kubectl rook-ceph ceph'
alias rbd='kubectl rook-ceph rbd'
alias rados='kubectl rook-ceph rados'
alias radosgw-admin='kubectl rook-ceph radosgw-admin'

This will cover most commands you might find in the official Ceph documentation. Ordinarily, you’d run these commands from a Ceph host, but since Rook is provisioning everything for us and we don’t necessarily have direct access, it’s a lot easier to use the plugin. This will run your commands in the “operator” container and attach stdio as if it were running locally.

Add CephFS Filesystem

CephFS is the interface you’re primarily going to want to use for a Homelab since it provides the greatest degree of flexibility if all you need is a POSIX filesystem for application data.

It provides support for the ReadWriteMany and ReadOnlyMany PVC modes, which is extremely useful for things like Plex/Emby/Jellyfin libraries and other file shares, as well as quasi, mostly-stateless things like the Home Assistant voice pipeline components, which, apart from caching models, don’t share any state and could easily be replicated.

Prior Considerations

CephFS is very flexible and can be configured in a number of ways. Personally, due to certain operational requirements (and the simplicity of doing so), I like to keep all of my storage mechanisms in the same CephFS filesystem. I’ll get to why this is important in a moment.

Rook supports two main device classes, ssd and hdd, and auto-detects which one each OSD belongs to during OSD preparation. CephFS supports storing data on devices of multiple classes, but THESE MUST BE DETERMINED UPFRONT! If you fail to specify your device classes upfront, you’ll have to resort to manually editing your CRUSH map, and even then, the operator will have no idea what’s going on and will be extremely annoying to deal with. Alternatively, you could just deal with your data being striped across both SSDs and HDDs, but this will likely yield a weird, sub-par experience. In summary, PLEASE specify your device classes, even if you only have one right now. You’ll thank me later…

I recommend always storing metadata on ssd storage classes. This is a heavy random I/O operation, which is what SSDs were designed for, as opposed to spinning disk drives which are better for bulk storage and sequential I/O. If you have enough space, I also recommend making this your default storage medium altogether, since you’ll usually have more control over the provisioning of mass storage volumes than database volumes, and databases do a lot of random I/O.

CephFS Definition

Here is the definition I use. I use ec-2-1 for massive data that I’d like to be highly available, but could be taken down and restored from a backup if it needs to be. I then have two data pools: one for ssd, and one for hdd with 3 replicas for normal storage.

# kubectl apply -f <file>
---
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: kubefs
  namespace: rook-ceph # namespace:cluster
spec:
  # The metadata pool spec
  metadataPool:
    deviceClass: ssd
    replicated:
      # You need at least three OSDs on different nodes for this config to work
      size: 3
  # The list of data pool specs
  dataPools:
    - name: replicate-3-ssd
      deviceClass: ssd
      replicated:
        size: 3
    - name: replicate-3-hdd
      deviceClass: hdd
      replicated:
        size: 3
    # You need at least three OSDs on different nodes for this config to work
    - name: ec-2-1-hdd
      deviceClass: hdd
      erasureCoded:
        dataChunks: 2
        codingChunks: 1
      parameters:
        compression_mode: none
  # Whether to preserve filesystem after CephFilesystem CRD deletion
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true
    # The affinity rules to apply to the mds deployment
    placement:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: filesystem
                  operator: In
                  values:
                    - kubefs
            topologyKey: kubernetes.io/hostname
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-mds
              # topologyKey: */zone can be used to spread MDS across different AZ
              topologyKey: topology.kubernetes.io/zone
    annotations:
    labels:
      filesystem: kubefs
    resources:

Add Storage Classes

Finally, you’ll need to add a storage class for each pool you want to use for PVC provisioning. For my configuration above, I add the following storage classes (all but one hidden for brevity):

# kubectl apply -f <file>
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-kubefs-replicate-3-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.cephfs.csi.ceph.com # csi-provisioner-name
parameters:
  # matches the name of the CephCluster resource
  clusterID: rook-ceph
  # matches the name of the CephFilesystem resource
  fsName: kubefs
  # Rook prefixes data pool names with the filesystem name, so this matches
  # the "replicate-3-ssd" dataPool of the "kubefs" CephFilesystem
  pool: kubefs-replicate-3-ssd

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

  mounter: kernel
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-kubefs-replicate-3-hdd
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: kubefs
  pool: kubefs-replicate-3-hdd

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

  mounter: kernel
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-kubefs-ec-2-1-hdd
provisioner: rook-ceph.cephfs.csi.ceph.com # csi-provisioner-name
parameters:
  clusterID: rook-ceph
  fsName: kubefs
  pool: kubefs-ec-2-1-hdd

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

  mounter: kernel
reclaimPolicy: Delete
allowVolumeExpansion: true

Testing the Filesystem

To test your brand new filesystem, create a PersistentVolumeClaim and see what happens:

# kubectl apply -f <file>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-test
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

Then look at it using this command:

kubectl get pvc

If you see something like this with status=Bound:

NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
cephfs-pvc-test   Bound    pvc-e2a72895-e95f-4a69-a042-ecfe9480c5aa   1Gi        RWX            rook-kubefs-replicate-3-ssd   <unset>                 4s

Congratulations! You’ve successfully set up CephFS as your container storage interface! In the next section, you’ll learn (among other things) how to set up an NFS server to access this new volume. If you decide not to set up NFS, you can go ahead and delete the PVC with kubectl delete pvc cephfs-pvc-test. Otherwise, continue to the next section.

Rook Ceph Continued…

There are more moving parts to Ceph that you’ll probably want to set up, but aren’t strictly needed. I’ll document those here. Feel free to skip anything in this section that you don’t care about.

Snapshot Classes

For some backup utilities, or even for manual backup, you’ll probably want to be able to create point-in-time snapshots of your PVCs with COW (copy-on-write) semantics. This is something that Ceph natively supports using the hidden (even from ls -al) directories called .snap that exist in every directory in the filesystem. However, it takes a bit more setup to do this “the Kubernetes way”.
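For context, this is what the native mechanism looks like from a host with CephFS mounted directly (the mount point and directory names below are hypothetical, purely for illustration):

```shell
# Assuming CephFS is mounted at /mnt/cephfs (hypothetical path).
# Creating a directory under .snap takes a COW snapshot of that directory tree:
mkdir /mnt/cephfs/mydata/.snap/before-upgrade

# The snapshot's contents are browsable read-only:
ls /mnt/cephfs/mydata/.snap/before-upgrade

# Removing the directory deletes the snapshot:
rmdir /mnt/cephfs/mydata/.snap/before-upgrade
```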

First, we need to install the optional “external-snapshotter” manifests to help Kubernetes understand what a snapshot is and how to manage snapshot resources. These manifests are official resources that are agnostic to the CSI driver used, but do not always come with Kubernetes by default.

# Install snapshot CRDs
kubectl kustomize https://github.com/kubernetes-csi/external-snapshotter/client/config/crd | kubectl create -f -
# Install the CSI-agnostic snapshot controller
kubectl -n kube-system kustomize https://github.com/kubernetes-csi/external-snapshotter/deploy/kubernetes/snapshot-controller | kubectl create -f -

Now that the snapshotter is installed, the following YAML file will create a VolumeSnapshotClass that will tell Kubernetes how to take these point-in-time snapshots:

# kubectl apply -f <file>
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: rook-fs-snap
driver: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph # name of the CephCluster resource
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
deletionPolicy: Delete

Then, you’ll be able to create snapshots of PVCs using a manifest like this:

---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: cephfs-pvc-test-snap
  namespace: default
spec:
  volumeSnapshotClassName: rook-fs-snap
  source:
    persistentVolumeClaimName: cephfs-pvc-test

… and create a new PVC to use that snapshot like this:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-test-snap-view
  namespace: default
spec:
  dataSource:
    name: cephfs-pvc-test-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi # must be at least as large as the source PVC
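Both resources can be checked the same way as ordinary PVCs. Assuming the names from the manifests above:

```shell
# The snapshot is usable once the READYTOUSE column shows true:
kubectl get volumesnapshot cephfs-pvc-test-snap

# The restored PVC should reach Bound like any other:
kubectl get pvc cephfs-pvc-test-snap-view
```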

Block Storage

If you plan on running virtual machines, have specific filesystem requirements, etc., you’ll likely want to set up block storage. The following YAML will set up a Ceph RBD pool backed by SSD storage:

# kubectl apply -f <file>
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: vm-storage-ssd
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
    requireSafeReplicaSize: true
  deviceClass: ssd

This one will set up a block-based storage class to consume the CephBlockPool:

# kubectl apply -f <file>
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vm-storage-ssd
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  # clusterID is the namespace where the rook cluster is running
  clusterID: rook-ceph
  # Ceph pool into which the RBD image shall be created
  pool: vm-storage-ssd

  # RBD image format. Defaults to "2".
  imageFormat: "2"

  # For 5.4 or later kernels:
  imageFeatures: layering,fast-diff,object-map,deep-flatten,exclusive-lock

  # The secrets contain Ceph admin credentials.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

  # Specify the filesystem type of the volume. If not specified, csi-provisioner
  # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
  # in hyperconverged settings where the volume is mounted on the same node as the osds.
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete

# Optional: allow dynamic resizing of PVCs.
# Only ext3, ext4, and xfs volumes can currently be resized, as in Kubernetes itself.
allowVolumeExpansion: true
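Consuming the new class works just like the CephFS test PVC, with one difference: an RBD image is a block device, so use ReadWriteOnce instead of ReadWriteMany, since only one node can safely mount it at a time. A minimal sketch, assuming the storage class name above (the PVC name is arbitrary):

```yaml
# kubectl apply -f <file>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc-test
  namespace: default
spec:
  storageClassName: vm-storage-ssd
  accessModes:
    - ReadWriteOnce # block devices mount on a single node
  resources:
    requests:
      storage: 1Gi
```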

NFS Server

The following YAML will provision the Ceph NFS gateway, which will allow you to mount anything in CephFS via NFS. This is exceptionally useful for troubleshooting deployments, as it’s much easier to port forward, does not require authentication (except IP address-based access controls), and doesn’t require any special drivers.

# kubectl apply -f <file>
---
apiVersion: ceph.rook.io/v1
kind: CephNFS
metadata:
  name: nfs-server
  namespace: rook-ceph
spec:
  server:
    active: 1

Please keep in mind that the usual NFS limitations apply. NFS is NOT a POSIX-compliant filesystem, and if your use case requires POSIX compliance, you’re going to have a bad time. This is simply a means of accessing your PVCs as ordinary document stores. File locks may not be respected, and I/O on an open file will fail if that file is deleted. That sounds like intuitive behavior, but it’s not how POSIX filesystems work: normally a process holding a file open can keep reading and writing it after deletion, and production software genuinely relies on that.

You’ll also need a service to access this NFS server. Here is an example:

# kubectl apply -f <file>
---
apiVersion: v1
kind: Service
metadata:
  name: nfs-server
  namespace: rook-ceph
spec:
  ports:
    - name: nfs
      port: 2049
  type: LoadBalancer
  loadBalancerIP: 10.3.0.192 # Set this to your IP address
  externalTrafficPolicy: Local
  selector:
    # Use the name of the CephNFS here
    ceph_nfs: nfs-server

    # It is safest to send clients to a single NFS server instance. Instance "a" always exists.
    # ^-- Please pay attention to this!!!
    #     NFS can get weird in high availability deployments!!!
    #     I'm tired of cleaning up messes caused by poor NFS deployments...
    instance: a

Now, to add an NFS share for the cephfs-pvc-test PVC we created earlier, run the following in your shell:

# Get the path to the named PVC within the CephFS cluster:
pvc_name="cephfs-pvc-test"
pv_name="$(kubectl get pvc "$pvc_name" --template '{{index . "spec" "volumeName"}}')"
subvol_path="$(kubectl get pv "$pv_name" --template '{{index . "spec" "csi" "volumeAttributes" "subvolumePath"}}')"

# Create an NFS export for the PVC path:
ceph nfs export create cephfs nfs-server /test kubefs "$subvol_path"

Assuming you got no errors, you can mount the share with a command like this:

sudo mount -t nfs -o nfsvers=4.1,proto=tcp <nfs_service_ip>:/test /path/to/your/mountpoint

Now to clean up, run the following commands:

sudo umount /path/to/your/mountpoint
ceph nfs export rm nfs-server /test
kubectl delete pvc cephfs-pvc-test

Dashboard Access

You can add a Service for accessing the dashboard outside the cluster as documented in the official docs. If you don’t want to expose the dashboard outside the cluster network, I recommend using an alias like this:

Linux (wl-clipboard):

alias rook-dash='kubectl get secret -n rook-ceph rook-ceph-dashboard-password --template '\''{{index . "data" "password"}}'\'' | base64 -d | wl-copy; echo '\''Copied password to clipboard'\''; kubectl port-forward -n rook-ceph service/rook-ceph-mgr-dashboard --address 127.0.0.1 8443:8443 & xdg-open '\''https://127.0.0.1:8443/'\''; fg'

Linux (xclip):

alias rook-dash='kubectl get secret -n rook-ceph rook-ceph-dashboard-password --template '\''{{index . "data" "password"}}'\'' | base64 -d | xclip -selection clipboard; echo '\''Copied password to clipboard'\''; kubectl port-forward -n rook-ceph service/rook-ceph-mgr-dashboard --address 127.0.0.1 8443:8443 & xdg-open '\''https://127.0.0.1:8443/'\''; fg'

Mac:

alias rook-dash='kubectl get secret -n rook-ceph rook-ceph-dashboard-password --template '\''{{index . "data" "password"}}'\'' | base64 -d | pbcopy; echo '\''Copied password to clipboard'\''; kubectl port-forward -n rook-ceph service/rook-ceph-mgr-dashboard --address 127.0.0.1 8443:8443 & open '\''https://127.0.0.1:8443/'\''; fg'

When run, this alias will pull the auto-generated admin secret from the cluster and copy it to your clipboard, then start a port forward and open the dashboard in your default browser.
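If the alias looks opaque, the password step is just a base64 decode of the secret’s data field. Here is that step in isolation, with a dummy value standing in for what kubectl actually returns:

```shell
# Dummy value for illustration only; the real value comes from the
# rook-ceph-dashboard-password secret via the kubectl template above.
encoded='aHVudGVyMg=='
password="$(printf '%s' "$encoded" | base64 -d)"
echo "$password"   # prints: hunter2
```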