Introduction
Welcome!
This guide is the one I wish I had when I set out to provision my Kubernetes cluster. It can be very difficult as a beginner to navigate the landscape, where everything is modular, there are several options for even the most basic things like storage and networking, and everyone just says “it depends” with no further context when asked which one to use. Sometimes, it’s useful to just have someone tell you what to use, so this is me, telling you what I use. You may like it, you may not, but hopefully it can give you something to go on.
I recommend taking this slow. Build a lab with virtual machines or unused PCs first and walk through this manual step by step. There is a lot involved in setting up your first Kube cluster, so please do yourself a favor and don’t skip sections until you get to the maintenance guide. If there is information you believe to be missing that can’t be found easily online, feel free to open an issue on the GitHub repo. I will only be covering the stack that I use. If you’d like to use your own stack, feel free to fork this repo, but the point of this is to give a set of good technologies that will fulfill the requirements of most homelab environments, not to explore every option in detail. Once the manual is complete, I would like to add a section that explores commonly-used components and how they differ, but I will only be covering the installation process for my chosen stack.
Work in progress
Please note that this guide is not finished. I am actively writing it to document how my new cluster is set up for disaster recovery purposes. There will be many sections that just say “TODO”. Please keep this in mind when reading.
Stack & Justification
TODO: comparisons
| Component | Chosen Technology | Required for basic operation |
|---|---|---|
| Operating System (OS) | Talos Linux | ✅️ |
| Container Runtime Interface (CRI) | Containerd | ✅️ |
| Container Network Interface (CNI) | Calico | ✅️ |
| Load Balancer | MetalLB | ❌️ (recommended) |
| Container Storage Interface (CSI) | Rook (Ceph) | ✅️ |
| Certificate management | cert-manager | ❌️ (recommended) |
| Ingress / Gateway API controller | Traefik | ❌️ (recommended) |
| GitOps | FluxCD | ❌️ (recommended) |
| Postgres databases | CloudNativePG (CNPG) | ❌️ |
| Virtual machine management | KubeVirt | ❌️ |
Required Skills
Kubernetes is a beast, and should not be the first thing you go for when learning about server administration or cloud environments. This guide assumes you already have a solid foundation in the following areas:
- Git
- Linux System Administration
- CLI
- Disk management
- Package management
- Virtual machines
- Certificate management (acme.sh, certbot, letsencrypt, or similar technologies)
- Linux Containers (one of Docker, Podman, etc)
- Networking Fundamentals
- IP addressing
- Subnetting
- VLANs
- Firewalls
- Routing
- DHCP
- ARP
What to do if you’re not ready
You may be able to get by without expertise in some of these areas, but expect to do a lot of Googling and YouTube-watching. Covering all of these areas is out of scope for this manual, as it would balloon out of control and no longer be useful for me. I would recommend at least taking a Linux+ course (even if you don’t get the cert) before attempting to start this journey. It will help you immensely. It should give you at least a shallow set of knowledge on all of these areas and prepare you well for Kubernetes.
I recommend Shawn Powers’ Linux+ video courses on YouTube and CBTNuggets.
Initial Bringup
I’ll be using Talos Linux, because having provisioned my own cluster with kubeadm, it’s what I wish I used to begin with. This section goes over the bring-up process for Talos Linux using my chosen Kubernetes stack.
Generate base configs
First, you’ll need to generate the base configurations for Talos. To do this, cd to a directory where you are comfortable storing secrets and run the following commands:
talosctl gen secrets -o secrets.yaml
talosctl gen config --with-secrets secrets.yaml <cluster_name> https://<kubernetes_endpoint>:6443
talosctl config merge ./talosconfig
talosctl config endpoint <kubernetes_endpoint ...>
Get an install image
TODO
Start nodes
Talos seems to be massively overcomplicated for network configuration. It’s probably best to stick to DHCP with static leases for now…
First, bring up the nodes with appropriate install images. Once you see the Linux boot logs, you can remove the install drive and move on to the next node.
Once the node is booted, you can get its mac address with the following command:
talosctl get links --insecure --nodes <node_ip>
This may help with DHCP static lease configuration. Not amazing that it has to be done after it gets a lease already, but whatever… It would be nice if they displayed the MAC ANYWHERE on the dashboard in maintenance mode…
Create patch files
Next, you’ll want to create a patch file for each node. This provides important information to the installer that may or may not be specific to that node, such as the data disk, the system schematic (for adding extra drivers, etc), and overrides for some default components.
Disks
Use the following command to list disks on the node:
talosctl get disks --insecure --nodes <node_ip>
Schematic
Use the image factory at the following link to acquire an “Initial Installation” image URL:
CNI Override
Also, if you don’t like Flannel and want to use a CNI capable of actually isolating pods/namespaces and controlling traffic, make sure to set cluster.network.cni.name to none, as shown in my example below. Flannel is great for a demonstration, but doesn’t include any kind of network policy management, so you may want to use something like Calico instead.
Example
Use this information to create a configuration file with a .yaml extension
similar to this one:
machine:
  install:
    disk: /dev/vda
    image: factory.talos.dev/metal-installer-secureboot/d65015d8cb6aeafd3607f403cf96d63c5e1d9d16cda42709dc42c5c1e85f1929:v1.12.1
cluster:
  network:
    cni:
      name: none
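A concrete reason to care about network policy support: a standard Kubernetes NetworkPolicy like the one below (namespace name is a placeholder) has no effect under Flannel, but is enforced once a policy-capable CNI such as Calico is installed.

```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app    # placeholder namespace
spec:
  podSelector: {}      # selects every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules listed, so all inbound traffic is denied
```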
Add pull-through cache mirrors (optional)
If you have a pull-through cache set up (most probably to mitigate docker.io’s rate limiting errors), you can add the following config to your patch files to ensure each node is set up to use your cache (duplicate for each registry):
---
apiVersion: v1alpha1
kind: RegistryMirrorConfig
name: docker.io
endpoints:
  - url: https://<your_domain>/v2/<cache_namespace>
    overridePath: true
This can be used with JFrog Artifactory or Harbor’s pull-through caches, or the cache_namespace piece can be removed if you’re using a cache that doesn’t serve them under a subdirectory.
Apply config to nodes
Next, you’ll want to apply your configurations to the nodes. You can do this using the following command:
talosctl apply-config --insecure --file <base_config> --config-patch <node_patch_file> --nodes <node_ip>
Example base_configs are: controlplane.yaml, worker.yaml. The
node_patch_files are the patch files you created in the previous step. I
recommend having one for each node.
Bootstrap
Now you can bootstrap the cluster using the following command:
talosctl bootstrap --nodes <control_plane_ip>
This will prompt Talos to set up etcd and bring up the cluster.
Add cluster to kubectl contexts & monitor cluster bring-up
Next, you’ll want to access the Kubernetes API of the cluster and check on the progress of cluster bring-up. Use the following command to add the Talos cluster to your kubectl contexts:
talosctl kubeconfig --nodes <control_plane_ip>
Now, you can use the following command to see the nodes in the cluster:
kubectl get nodes
If you don’t see all of your nodes (including the control plane), try again. It may take a little bit for them all to appear.
Conclusion
Congratulations! You have a cluster! Here’s a brief summary of what we just did:
- Download Talos Linux and flash it to a USB drive
- Boot Talos Linux on all nodes
- Generate certificates for the Talos & Kubernetes API servers
- Write patch files for each node based on information we retrieved from the CLI
- Install Talos into the nodes
- Configure kubectl to control the Talos cluster
If you opted not to use Flannel, you won’t have any networking just yet. Next,
you’ll install Calico CNI to provide your networking stack, and MetalLB to
provide support for LoadBalancer services using gratuitous ARP to route
packets.
Tools
Managing a Kubernetes cluster can be complicated and difficult. Fortunately, Kubernetes is all about automation, and we have a variety of tools at our disposal to help with all this. In this section, we will briefly touch on a few of them that get used throughout this guide. There may be more tools in other sections that are specific to those sections, but this page will walk you through some of the common ones you’ll use frequently.
Helm
First up is Helm. Helm is to Kubernetes what apt is to Debian Linux. It collects sets of resources, allows customization, and keeps track of them, allowing you to easily add, remove, and update without having to worry about things like cleaning up garbage from old package versions.
We won’t be interacting with Helm directly much in this guide - just for installing the CNI and for rendering templates - because there’s another tool called FluxCD that will allow you to track your manifests, Helm charts, etc from a git repository.
Basic Concepts
Helm actually has pretty good documentation, so I won’t go into much detail
here. The one thing you really shouldn’t miss is that customizations to a Helm
chart are done using a values.yaml file. Each chart defines its own that it
uses within a series of template files that pull in your values to generate a
Kubernetes manifest (collection of resources defined in yaml).
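As a toy illustration of that mechanism (not from any real chart; the value and file names are hypothetical), a chart ships defaults in values.yaml and pulls them into its templates via Go templating:

```yaml
# values.yaml -- hypothetical chart defaults
replicaCount: 2

---
# templates/deployment.yaml (excerpt) -- references the value above
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: {{ .Values.replicaCount }}
```

Passing your own values.yaml with `-f`, or a single value with `--set replicaCount=3`, overrides the chart default at install or upgrade time.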
Installing Helm
Helm has instructions for installation at the following link:
Using Helm
Helm has a concept of “repos”, similar to apt, which allow people to host their own collections of kubernetes manifests as Helm charts. The general process for installing a helm chart is as follows:
helm repo add <repo_name> <url>
helm show values <repo_name>/<chart_name> > values.yaml
helm install --namespace <namespace_name> <arbitrary_name> <repo_name>/<chart_name> -f values.yaml
Krew
Next, we have Krew. Krew is a “plugin manager” for kubectl. Some frameworks that you install in Kubernetes have a lot going on, or wrap some kind of pre-existing technology, and require jumping through some extra hoops to interact with them.
This is where kubectl plugins come in. They are a way to extend the kubectl CLI
with additional functionality. For example, in a later section, you’ll be
installing the rook-ceph kubectl plugin through krew to interact with the
Ceph CLI via Kubernetes. In another section, you’ll be installing the virt
kubectl plugin to interact with KubeVirt for managing your virtual machines.
Krew is a handy package manager for installing these plugins.
Installation
Krew actually has an installation guide that’s short and sweet, so you can just follow the instructions here:
Using Krew
Krew is pretty easy to use, and usually when you need it, the documentation of whatever you’re working on will tell you how to use it. Here’s a brief overview though just in case.
Update package cache:
kubectl krew update
Search for a package:
kubectl krew search <keyword>
Install a package:
kubectl krew install <package>
Upgrade packages:
kubectl krew upgrade
Uninstall a package:
kubectl krew uninstall <package>
Networking Setup
For networking, we will be using Calico for our CNI and MetalLB for our
LoadBalancer service manager.
Calico Setup
First, we’ll set up our CNI. If you’ve opted to use Flannel, you can skip this section. Otherwise, go ahead and install the Tigera Operator using Helm.
First, create your values.yml file for Tigera Operator:
installation:
  cni:
    type: Calico
  calicoNetwork:
    bgp: Disabled
    ipPools:
      - cidr: 10.244.0.0/16
        encapsulation: VXLAN
Next, install the operator:
kubectl apply -f- << EOF
---
apiVersion: v1
kind: Namespace
metadata:
  name: tigera-operator
  labels:
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged
EOF
helm repo add projectcalico https://docs.tigera.io/calico/charts
helm install \
  --create-namespace \
  --namespace tigera-operator \
  --version v3.31.3 \
  -f values.yml \
  calico \
  projectcalico/tigera-operator
This will configure Calico to use 10.244.0.0/16 as the range for IPAM (IP address management), and to use VXLAN, a type of overlay network that allows pod traffic to safely cross your existing network when travelling between nodes.
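For a feel of how far that pool stretches, note that Calico’s IPAM hands each node address blocks (a /26, i.e. 64 addresses, by default). A quick sanity check with Python’s standard ipaddress module:

```python
import ipaddress

# The pod network pool configured in values.yml above.
pool = ipaddress.ip_network("10.244.0.0/16")

# Calico IPAM allocates per-node blocks; the default block size is /26.
blocks = list(pool.subnets(new_prefix=26))

print(pool.num_addresses)  # 65536 pod IPs in total
print(len(blocks))         # 1024 blocks of 64 addresses each
print(blocks[0])           # 10.244.0.0/26 -- the first block handed out
```

So even with several blocks per node, a /16 pool comfortably covers a homelab-sized cluster.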
MetalLB Setup
Finally, we’ll set up MetalLB. This will provide the LoadBalancer service
implementation.
You’ll need:
- The names of the interfaces to use for advertising address changes
- A range of IP addresses that MetalLB can take exclusive control of
First, create the metallb-system namespace with extra privileges:
# kubectl apply -f <file>
---
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
  labels:
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged
Then install MetalLB with Helm:
helm repo add metallb https://metallb.github.io/metallb
helm install --namespace metallb-system metallb metallb/metallb
And finally, add your configuration. Make sure to add your network interfaces and IP pools:
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: simple-services-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - simple-services
  interfaces:
    - eno1
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: simple-services
  namespace: metallb-system
spec:
  addresses:
    - x.x.x.x-x.x.x.x
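To verify MetalLB is handing out addresses, you can create a throwaway LoadBalancer Service (the name, selector, and ports here are placeholders) and check that it receives an EXTERNAL-IP from your pool:

```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: lb-test           # placeholder name
  namespace: default
spec:
  type: LoadBalancer      # MetalLB should assign an IP from the "simple-services" pool
  selector:
    app: my-app           # placeholder; must match your pods' labels
  ports:
    - port: 80
      targetPort: 8080
```

After applying, `kubectl get svc lb-test` should show an address from your configured range under EXTERNAL-IP.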
FluxCD (TODO)
FluxCD isn’t required for a Kubernetes cluster, but I strongly recommend it. I haven’t personally used this particular offering before (my experience is with ArgoCD), so this section may have some rough edges until the manual is finished.
I have chosen to use FluxCD for my cluster rebuild project because of its low-dependency operating style. It seems to follow the UNIX philosophy of doing one thing and doing it well. It has no UI, no auth system, just a synchronization controller that ensures your Kube cluster is synced with your Git project. After installing it, I realized it doesn’t even have any PVCs, which means it can be used for your Rook Ceph installation as well.
To install FluxCD, pick a provider for your Git repository and follow their documentation linked below:
FluxCD also has a “Getting Started” guide that will take you through some more details past bootstrap:
The rest of this guide will assume you’ve gone through FluxCD’s getting started guide and have at least done the exercise.
Overview
FluxCD adds custom resource definitions (CRDs) to Kubernetes and manages deployment of resources based on these CRDs. It regularly polls data sources, checks for updates, and redeploys them automatically.
NOTE: The examples here are to give you a basic understanding of how FluxCD deploys resources. They are not intended for you to follow along.
Managing Helm Charts
Rook Ceph
Rook is a management overlay for Ceph that can deploy it in a Kubernetes cluster. If you don’t know what Ceph is, I highly recommend watching some of the videos on it by the guys at 45drives.
The elevator pitch: “It’s like RAID, but it joins multiple servers instead of just drives.”
The more nuanced explanation:
- Imagine if you could take every piece of data you want to store and break it down into manageable byte-sized pieces.
- Now imagine that you could replicate each of these pieces of data a number of times to create redundancy, so if you lose one of them, you have replicas to pull from, and when all is functioning normally, you could run periodic checks to make sure that all of your replicas actually have the same data. If two replicas say the data is “a”, but one says it’s “b”, then chances are that “b” was supposed to be an “a”. This solves both availability and integrity problems (like bitrot).
- Now imagine that you could build an algorithm that takes into account the number of drives you have, the size of each one, the number of hosts you have, which drives were in which hosts, as well as your entire datacenter hierarchy, and use that to determine to which drive a piece of data should go to maintain a certain amount of redundancy across any failure domain of your choice.
- Now imagine that each drive had its own server that you could communicate with directly, and there was a server to ensure that not only are your redundancy requirements enforced when the data is created, but also as requirements and resource availability change.
- Now imagine that you could build interfaces on top of this data storage strategy which exposes a filesystem, a block device, and an S3-compatible object store.
That’s Ceph.
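The placement idea above (every client hashing its way to the same drives, with no central lookup table) can be sketched with rendezvous hashing. To be clear, this is a toy illustration, not Ceph’s actual CRUSH algorithm:

```python
import hashlib

def score(object_name: str, drive: str) -> int:
    """Deterministic pseudo-random score for an (object, drive) pair."""
    digest = hashlib.sha256(f"{object_name}:{drive}".encode()).hexdigest()
    return int(digest, 16)

def place(object_name: str, drives: list[str], replicas: int) -> list[str]:
    """Pick the `replicas` highest-scoring drives for this object.
    Every client computes the same answer independently."""
    ranked = sorted(drives, key=lambda d: score(object_name, d), reverse=True)
    return ranked[:replicas]

drives = ["node1/sda", "node1/sdb", "node2/sda", "node3/sda"]
print(place("my-object", drives, replicas=3))  # same 3 drives on every call, on every client
```

A real CRUSH map additionally walks the datacenter hierarchy so that replicas land in distinct failure domains (e.g. never two copies on node1); this sketch skips that entirely.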
It’s big, and it’s complicated, but it is a true marvel of engineering that solves an extremely complicated problem in a way that any application can interface with and have no idea that anything has changed. One of my personal biggest use cases for it is CephFS. CephFS is the only open source network filesystem that I know of which is fully POSIX compliant, and it’s built on a rock-solid data redundancy system that has never fully failed on me except when I got paranoid and forced it to do something stupid.
I highly recommend learning to manage Ceph properly, but what’s even nicer about all this is that unless you’re doing anything crazy like me, you likely won’t have to do anything more than initiate automated repair for occasional data corruption after heavy use. I’ll get into some more Ceph management essentials in a later section of the book, but Rook takes care of most of it for you.
Side note: If Ceph tells you that your PGs are inconsistent, don’t freak out. It knows what it’s doing, and if you instruct it to repair, it almost always will, but it takes some time, so stay calm and give it space to work. I’ll go over repairing data in the maintenance guide.
Installation
We will be using Helm to install Rook. I am writing this manual off of the official docs which can be found here:
[Aside] Why Helm?
I recommend using the Helm repo instead of applying the manifest files directly because helm can help you manage resources after they’ve been deployed. This is greatly helpful for things like upgrades where resources may be added, modified, or removed, and can be difficult to keep track of by hand.
Normally, I would recommend tracking everything in a GitOps platform like FluxCD, but since FluxCD hasn’t been set up yet at this point in the build, we’ll manage the Rook charts with Helm directly for now.
Install Rook Ceph Operator
First, add the Rook Helm chart repository.
helm repo add rook-release https://charts.rook.io/release
Next, create your values.yml file. I personally like the default config, so
I won’t be creating one. However, you can use
this section of the docs
to determine the variables you want to use. Note that the . characters used
in the parameter names represent nested values. For example, if you want to set
the crds.enabled parameter to false, you’d use the following yaml:
crds:
  enabled: false
If you choose to customize any values, make sure to add -f values.yml to the
end of your install command.
According to the Talos Linux documentation, the default configuration does not allow privileged pods. You’ll need to create the namespace first to allow them:
# kubectl apply -f <file>
---
apiVersion: v1
kind: Namespace
metadata:
  name: rook-ceph
  labels:
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged
Ceph needs privileged pods to effectively manage disks on the host.
Once that’s done, install the Rook Ceph Operator Helm chart using the following command:
helm install --namespace rook-ceph rook-ceph rook-release/rook-ceph
Install Rook Ceph Cluster
Now that the Rook Ceph operator has been installed, we need to add a
CephCluster resource in order to get it to provision a Ceph Cluster. This is
because the operator is not itself a cluster. It is an automated interface for
managing clusters, and is actually capable of managing more than one.
Therefore, we need to give it a cluster definition for it to build one. You can
find details on this process at the following link:
I’ll be creating a Ceph cluster that spans the whole Kubernetes cluster and consumes all unpartitioned disks.
# kubectl apply -f <file>
---
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2.3
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 1
    allowMultiplePerNode: false
    modules:
      # List of modules to optionally enable or disable.
      # Note the "dashboard" and "monitoring" modules are already configured by
      # other settings in the cluster CR.
      # I recommend the "rook" module to inform the dashboard that Ceph
      # resources are configured by Kubernetes manifests.
      # I also recommend the "nfs" module as it will provide easy configuration
      # of NFS exports via the dashboard. This is very useful for things like
      # pre-seeding PVCs, data export, and troubleshooting.
      - name: rook
        enabled: true
      - name: nfs
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  storage:
    useAllNodes: true
    useAllDevices: true
    config:
      encryptedDevice: "true"
  disruptionManagement:
    managePodBudgets: true
  # This can be turned on to help with OSD removal. Since OSDs will be
  # automatically marked "out" if they are offline for too long, I recommend
  # keeping this off except when you need it.
  removeOSDsIfOutAndSafeToRemove: false
Install rook-ceph Krew Plugin
Much of Ceph’s administration post-install happens via CLI, so you’ll want to
make sure you have it. You can either deploy the “toolbox” container (the
official docs go over this), or you can use the kubectl plugin. I recommend the
plugin for simplicity. Assuming you installed krew from the tools section,
you can get the rook-ceph plugin using the following command:
kubectl krew install rook-ceph
Now you can run ceph commands like so:
# `ceph status` becomes...
kubectl rook-ceph ceph status
I recommend adding aliases to the rc file for whatever shell you use:
alias ceph='kubectl rook-ceph ceph'
alias rbd='kubectl rook-ceph rbd'
alias rados='kubectl rook-ceph rados'
alias radosgw-admin='kubectl rook-ceph radosgw-admin'
This will cover most commands you might find in the official Ceph documentation. Ordinarily, you’d run these commands from a Ceph host, but since Rook is provisioning everything for us and we don’t necessarily have direct access, it’s a lot easier to use the plugin. This will run your commands in the “operator” container and attach stdio as if it were running locally.
Add CephFS Filesystem
CephFS is the interface you’re primarily going to want to use for a Homelab since it provides the greatest degree of flexibility if all you need is a POSIX filesystem for application data.
It provides support for the ReadWriteMany and ReadOnlyMany PVC modes, which
is extremely useful for things like Plex/Emby/Jellyfin libraries and other file
shares, as well as quasi-stateless things like the Home Assistant voice
pipeline components, which, apart from caching models, don’t share any state
and could easily be replicated.
Prior Considerations
CephFS is very flexible and can be configured in a number of ways. Personally, due to certain operational requirements (and the simplicity of doing so), I like to keep all of my storage mechanisms in the same CephFS filesystem. I’ll get to why this is important in a moment.
Rook supports 2 main device classes, and auto-detects which one each OSD
belongs to during OSD preparation. These classes are ssd and hdd. CephFS
supports storing data on devices of multiple classes, but THESE MUST BE
DETERMINED UPFRONT! If you fail to specify your device classes upfront,
you’ll have to resort to manual editing of your CRUSH map, and even then, the
Ceph operator will have no idea what’s going on and be extremely annoying to
deal with. Alternatively, you could just deal with your data being striped
across both SSDs and HDDs, but this will likely yield a weird and sub-par
experience. In summary, PLEASE specify your device classes, even if you
only have one right now. You’ll thank me later…
I recommend always storing metadata on ssd storage classes. This is a heavy random I/O operation, which is what SSDs were designed for, as opposed to spinning disk drives which are better for bulk storage and sequential I/O. If you have enough space, I also recommend making this your default storage medium altogether, since you’ll usually have more control over the provisioning of mass storage volumes than database volumes, and databases do a lot of random I/O.
CephFS Definition
Here is the definition I use. I use ec-2-1 for massive data that I’d like to be highly available, but could be taken down and restored from a backup if it needs to be. I then have two data pools: one for ssd, and one for hdd with 3 replicas for normal storage.
# kubectl apply -f <file>
---
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: kubefs
  namespace: rook-ceph # namespace:cluster
spec:
  # The metadata pool spec
  metadataPool:
    deviceClass: ssd
    replicated:
      # You need at least three OSDs on different nodes for this config to work
      size: 3
  # The list of data pool specs
  dataPools:
    - name: replicate-3-ssd
      deviceClass: ssd
      replicated:
        size: 3
    - name: replicate-3-hdd
      deviceClass: hdd
      replicated:
        size: 3
    # You need at least three OSDs on different nodes for this config to work
    - name: ec-2-1-hdd
      deviceClass: hdd
      erasureCoded:
        dataChunks: 2
        codingChunks: 1
      parameters:
        compression_mode: none
  # Whether to preserve filesystem after CephFilesystem CRD deletion
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true
    # The affinity rules to apply to the mds deployment
    placement:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: filesystem
                  operator: In
                  values:
                    - kubefs
            topologyKey: kubernetes.io/hostname
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-mds
              # topologyKey: */zone can be used to spread MDS across different AZ
              topologyKey: topology.kubernetes.io/zone
    annotations:
    labels:
      filesystem: kubefs
    resources:
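For intuition on the ec-2-1 pool above: 2+1 erasure coding splits each object into two data chunks plus one parity chunk, and any single lost chunk can be rebuilt from the other two. With exactly one parity chunk this reduces to simple XOR (Ceph’s real erasure code plugins are more general); a toy sketch:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode_2_1(data: bytes) -> tuple[bytes, bytes, bytes]:
    """Split into 2 data chunks and compute 1 XOR parity chunk."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even length
    half = len(data) // 2
    d1, d2 = data[:half], data[half:]
    return d1, d2, xor(d1, d2)

def recover_chunk(survivor_a: bytes, survivor_b: bytes) -> bytes:
    """Any single lost chunk equals the XOR of the two surviving chunks."""
    return xor(survivor_a, survivor_b)

d1, d2, parity = encode_2_1(b"hello world!")
assert recover_chunk(d2, parity) == d1  # rebuild a lost data chunk
assert recover_chunk(d1, parity) == d2
```

The trade-off versus the replicated pools: ec-2-1 stores 1.5x the data size instead of 3x, at the cost of only tolerating one lost chunk per object.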
Add Storage Classes
Finally, you’ll need to add a storage class for each pool you want to use for PVC provisioning. For my configuration above, I add the following storage classes (all but one hidden for brevity):
# kubectl apply -f <file>
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-kubefs-replicate-3-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.cephfs.csi.ceph.com # csi-provisioner-name
parameters:
  # matches the name of the CephCluster resource
  clusterID: rook-ceph
  # matches the name of the CephFilesystem resource
  fsName: kubefs
  # matches the name of a dataPool object within the CephFilesystem resource
  pool: kubefs-replicate-3-ssd
  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  mounter: kernel
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-kubefs-replicate-3-hdd
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: kubefs
  pool: kubefs-replicate-3-hdd
  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  mounter: kernel
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-kubefs-ec-2-1-hdd
provisioner: rook-ceph.cephfs.csi.ceph.com # csi-provisioner-name
parameters:
  clusterID: rook-ceph
  fsName: kubefs
  pool: kubefs-ec-2-1-hdd
  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  mounter: kernel
reclaimPolicy: Delete
allowVolumeExpansion: true
Testing the Filesystem
To test your brand new filesystem, create a PersistentVolumeClaim and see
what happens:
# kubectl apply -f <file>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-test
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
Then look at it using this command:
kubectl get pvc
If you see something like this with status=Bound:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
cephfs-pvc-test Bound pvc-e2a72895-e95f-4a69-a042-ecfe9480c5aa 1Gi RWX rook-kubefs-replicate-3-ssd <unset> 4s
Congratulations! You’ve successfully set up CephFS as your container storage
interface! Now, in the next section, you’ll learn (among other things) how to
set up an NFS server to access this new volume. If you decide not to set up
NFS, you can go ahead and delete the PVC with
kubectl delete pvc cephfs-pvc-test. Otherwise, continue to the next section.
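If you’d like to see data actually land on the volume before deleting it, a throwaway pod (sketch; the pod name and image are arbitrary) can mount the claim and write to it:

```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-pvc-test-pod   # arbitrary name
  namespace: default
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sh", "-c", "echo hello > /data/hello.txt && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: cephfs-pvc-test
```

You can then `kubectl exec` into the pod and confirm /data/hello.txt exists; remember to delete the pod before deleting the PVC.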
Rook Ceph Continued…
There are more moving parts to Ceph that you’ll probably want to set up, but aren’t strictly needed. I’ll document those here. Feel free to skip anything in this section that you don’t care about.
Snapshot Classes
For some backup utilities, or even for manual backup, you’ll probably want to
be able to create point-in-time snapshots of your PVCs with COW (copy-on-write)
semantics. This is something that Ceph natively supports using the hidden (even
from ls -al) directories called .snap that exist in every directory in the
filesystem. However, it takes a bit more setup to do this “the Kubernetes way”.
First, we need to install the optional “external-snapshotter” manifests to help Kubernetes understand what a snapshot is and how to manage snapshot resources. These manifests are official resources that are agnostic to the CSI driver used, but do not always come with Kubernetes by default.
# Install snapshot CRDs
kubectl kustomize https://github.com/kubernetes-csi/external-snapshotter/client/config/crd | kubectl create -f -
# Install the CSI-agnostic snapshot controller
kubectl -n kube-system kustomize https://github.com/kubernetes-csi/external-snapshotter/deploy/kubernetes/snapshot-controller | kubectl create -f -
Now that the snapshotter is installed, the following YAML file will create a
VolumeSnapshotClass that will tell Kubernetes how to take these point-in-time
snapshots:
# kubectl apply -f <file>
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: rook-fs-snap
driver: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph # name of the CephCluster resource
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
deletionPolicy: Delete
Then, you’ll be able to create snapshots of PVCs using a manifest like this:
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: cephfs-pvc-test-snap
  namespace: default
spec:
  volumeSnapshotClassName: rook-fs-snap
  source:
    persistentVolumeClaimName: cephfs-pvc-test
… and create a new PVC to use that snapshot like this:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-test-snap-view
  namespace: default
spec:
  storageClassName: rook-kubefs-replicate-3-ssd
  dataSource:
    name: cephfs-pvc-test-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi # must be at least as large as the source PVC
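The restore will only bind once the snapshot reports readyToUse, so it’s worth checking before you create the new PVC. Using the names from the manifests above:

```shell
# Check the snapshot's status (READYTOUSE column; usually near-instant
# for CephFS, since .snap snapshots are cheap COW operations):
kubectl -n default get volumesnapshot cephfs-pvc-test-snap

# Or block until it's ready:
kubectl -n default wait --for=jsonpath='{.status.readyToUse}'=true \
  volumesnapshot/cephfs-pvc-test-snap --timeout=60s
```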
Block Storage
If you plan on running virtual machines, have specific filesystem requirements, etc., you’ll likely want to set up block storage. The following YAML will set up a Ceph RBD pool backed by SSD storage:
# kubectl apply -f <file>
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: vm-storage-ssd
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
    requireSafeReplicaSize: true
  deviceClass: ssd
This one will set up a block-based storage class to consume the
CephBlockPool:
# kubectl apply -f <file>
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vm-storage-ssd
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  # clusterID is the namespace where the rook cluster is running
  clusterID: rook-ceph
  # Ceph pool into which the RBD image shall be created
  pool: vm-storage-ssd
  # RBD image format. Defaults to "2".
  imageFormat: "2"
  # For 5.4 or later kernels:
  imageFeatures: layering,fast-diff,object-map,deep-flatten,exclusive-lock
  # The secrets contain Ceph admin credentials.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  # Specify the filesystem type of the volume. If not specified, csi-provisioner
  # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
  # in hyperconverged settings where the volume is mounted on the same node as the OSDs.
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
# Optional, if you want to add dynamic resize for PVC.
# For now only ext3, ext4, xfs resize support provided, like in Kubernetes itself.
allowVolumeExpansion: true
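As a quick usage sketch (the PVC name here is hypothetical), a claim against this class for a VM disk might look like the following. Setting volumeMode: Block hands the consumer a raw device instead of a mounted filesystem, which is what most VM workloads want:

```yaml
# kubectl apply -f <file>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-disk-test # hypothetical name
  namespace: default
spec:
  storageClassName: vm-storage-ssd
  volumeMode: Block # raw device for a VM; omit to get a mounted ext4 filesystem
  accessModes:
    - ReadWriteOnce # plain RBD volumes are single-writer
  resources:
    requests:
      storage: 10Gi
```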
NFS Server
The following YAML will provision the Ceph NFS gateway, which will allow you to mount anything in CephFS via NFS. This is exceptionally useful for troubleshooting deployments, as it’s much easier to port-forward, does not require authentication (beyond IP address-based access controls), and doesn’t require any special drivers.
# kubectl apply -f <file>
---
apiVersion: ceph.rook.io/v1
kind: CephNFS
metadata:
  name: nfs-server
  namespace: rook-ceph
spec:
  server:
    active: 1
Please keep in mind that the usual NFS limitations apply. NFS is NOT a POSIX-compliant filesystem, and if your use case requires POSIX compliance, you’re going to have a bad time. This is simply a means of accessing your PVCs as ordinary document stores. File locks may not be respected, and I/O on an open file will fail if that file is deleted (this sounds like intuitive behavior, but it is not how most local filesystems work, where a deleted file remains usable until the last handle is closed, and the difference can cause real problems in production).
You’ll also need a service to access this NFS server. Here is an example:
# kubectl apply -f <file>
---
apiVersion: v1
kind: Service
metadata:
  name: nfs-server
  namespace: rook-ceph
spec:
  ports:
    - name: nfs
      port: 2049
  type: LoadBalancer
  loadBalancerIP: 10.3.0.192 # Set this to your IP address
  externalTrafficPolicy: Local
  selector:
    # Use the name of the CephNFS here
    ceph_nfs: nfs-server
    # It is safest to send clients to a single NFS server instance. Instance "a" always exists.
    # ^-- Please pay attention to this!!!
    # NFS can get weird in high-availability deployments!!!
    # I'm tired of cleaning up messes caused by poor NFS deployments...
    instance: a
Now, to add an NFS share for the cephfs-pvc-test PVC we created earlier, run the following in your shell:
# Get the path to the named PVC within the CephFS cluster:
pvc_name="cephfs-pvc-test"
pv_name="$(kubectl get pvc "$pvc_name" --template '{{index . "spec" "volumeName"}}')"
subvol_path="$(kubectl get pv "$pv_name" --template '{{index . "spec" "csi" "volumeAttributes" "subvolumePath"}}')"
# Create an NFS export for the PVC path:
ceph nfs export create cephfs nfs-server /test kubefs "$subvol_path"
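If you want to confirm the export took, the NFS module can list and describe exports. Run these from the same place you ran the create command (wherever you have the ceph CLI wired up to this cluster):

```shell
# List the pseudo-paths exported by this NFS cluster:
ceph nfs export ls nfs-server

# Show the full export spec, including the backing CephFS path:
ceph nfs export info nfs-server /test
```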
Assuming you got no errors, you can mount the share with a command like this:
sudo mount -t nfs -o nfsvers=4.1,proto=tcp <nfs_service_ip>:/test /path/to/your/mountpoint
Now to clean up, run the following commands:
sudo umount /path/to/your/mountpoint
ceph nfs export rm nfs-server /test
kubectl delete pvc cephfs-pvc-test
Dashboard Access
You can add a Service for accessing the dashboard outside the cluster as documented in the official docs. If you don’t want to expose the dashboard outside the cluster network, I recommend using an alias like this:
Linux (wl-clipboard):
alias rook-dash='kubectl get secret -n rook-ceph rook-ceph-dashboard-password --template '\''{{index . "data" "password"}}'\'' | base64 -d | wl-copy; echo '\''Copied password to clipboard'\''; kubectl port-forward -n rook-ceph service/rook-ceph-mgr-dashboard --address 127.0.0.1 8443:8443 & xdg-open '\''https://127.0.0.1:8443/'\''; fg'
Linux (xclip):
alias rook-dash='kubectl get secret -n rook-ceph rook-ceph-dashboard-password --template '\''{{index . "data" "password"}}'\'' | base64 -d | xclip -selection clipboard; echo '\''Copied password to clipboard'\''; kubectl port-forward -n rook-ceph service/rook-ceph-mgr-dashboard --address 127.0.0.1 8443:8443 & xdg-open '\''https://127.0.0.1:8443/'\''; fg'
Mac:
alias rook-dash='kubectl get secret -n rook-ceph rook-ceph-dashboard-password --template '\''{{index . "data" "password"}}'\'' | base64 -d | pbcopy; echo '\''Copied password to clipboard'\''; kubectl port-forward -n rook-ceph service/rook-ceph-mgr-dashboard --address 127.0.0.1 8443:8443 & open '\''https://127.0.0.1:8443/'\''; fg'
When run, this alias will pull the auto-generated admin secret from the cluster and copy it to your clipboard, then start a port forward and open the dashboard in your default browser.
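The only non-obvious step in the pipeline is the decode: Kubernetes stores Secret values base64-encoded, so the alias has to pipe the extracted value through base64 -d before copying it. The decode step in isolation, on a stand-in value:

```shell
# Secret data is base64-encoded at rest; "aHVudGVyMg==" stands in for
# the value the alias extracts from .data.password:
encoded="aHVudGVyMg=="
printf '%s' "$encoded" | base64 -d
# → hunter2
```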