8 February 2024
In this article, we'll look at how to implement workload isolation in a Kubernetes cluster using taints, tolerations, and node affinities. We'll take a closer look at how taints and tolerations work and see common pitfalls to avoid when using them to isolate workloads.
What is pod scheduling?
When a pod is created through the Kubernetes API, a scheduler component is responsible for assigning it to a node in the cluster. Once the pod is assigned, the node is responsible for running the containers defined in the pod specification.
The way a scheduler chooses a node for a pod depends on its implementation and configuration. For instance, the default Kubernetes scheduler kube-scheduler proceeds in two phases:
- a filtering phase, where it selects the nodes that match the pod's scheduling constraints (such as its node selector)
- and a scoring phase, where it decides which of the filtered nodes is the best candidate for the pod.
For each pod (the same applies to deployments, stateful sets, etc.) we can define the set of nodes to which the pod can theoretically be assigned, now or later if the scheduler has to place it again (when a node fails, for instance). These nodes are referred to as the pod's feasible nodes.
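For instance, a pod that specifies a node selector can only be assigned to nodes carrying the matching label, so only those nodes are feasible for it. Here is a minimal sketch (the disktype: ssd label is purely illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # Only nodes labeled disktype=ssd pass the filtering phase:
  # they form this pod's set of feasible nodes
  nodeSelector:
    disktype: ssd
  containers:
    - name: app
      image: nginx:1.25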
Schematic view of feasible nodes in a Kubernetes cluster
Why does workload isolation matter for my Kubernetes cluster security?
Even if containers are somehow “isolated” from the host where they run, nodes can still be compromised from a running pod. Indeed, the “isolation” provided by containers depends heavily on the workload's configuration and on the absence of kernel vulnerabilities. There are many well-documented techniques on the internet for taking advantage of pod misconfigurations.
When a node is compromised, an attacker can access the credentials and data used by all the pods running on the node. Therefore, it is important to ensure that pods with heterogeneous security requirements have distinct feasible nodes.
In this article, we will consider a use case where we have two kinds of pods: exposed pods, which may be accessible from the internet and are more likely to be compromised; and sensitive pods, which we want to protect. In our use case, we want the intersection between the feasible nodes of sensitive pods and those of exposed pods to be empty: this is what we call workload isolation.
Overlapping feasible node sets may present a security risk
Bonus: How to retrieve a service account token from a compromised node?
Once you have compromised a Kubernetes node during a pentest, you can request a service account token using the kubelet's identity. Indeed, the kubectl tool has a command that lets you ask for a service account token! The only requirement is to bind the token to a pod that is assigned to the node.
For example, if the pod my-pod is running with the my-service-account service account:
# assuming pod id is 8411fd6b-22a2-42ed-9cfd-1ad6254dbe44
# you can find yours by getting the pod object from the API
$ kubectl create token \
--bound-object-kind Pod \
--bound-object-name my-pod \
--bound-object-uid 8411fd6b-22a2-42ed-9cfd-1ad6254dbe44 \
my-service-account
eyJhbGciOiJSUzI1NiIsImtpZCI6Inc.....
After that, you can use the token with the --token flag of the kubectl tool, and the requests will be performed using the service account identity:
$ kubectl --token eyJhbGciOiJSUzI1NiIsImtpZCI6Inc..... get secrets
When performing the same command for a pod that is not assigned to the node, you should get the following error:
$ kubectl create token \
--bound-object-kind Pod \
--bound-object-name my-pod-not-on-my-node \
--bound-object-uid 8411fd6b-22a2-42ed-9cfd-1ad6254dbe44 \
my-other-service-account
error: failed to create token: serviceaccounts "my-other-service-account" is forbidden: User "system:node:other-node" cannot create resource "serviceaccounts/token" in API group "" in the namespace "default": no relationship found between node 'other-node' and this object
By now you should be convinced of the need for workload isolation on your Kubernetes cluster: let's see how we can achieve it!
How to isolate workloads on a Kubernetes cluster?
To isolate workloads on a Kubernetes cluster we can take advantage of two mechanisms:
- Taints and tolerations
- Node affinity (or node selector)
We will see that we need to use both to design robust isolation. But first, let's see what these two mechanisms are about.
Taints and tolerations
Taints allow a node to repel a set of pods.
The definition above is taken from the Kubernetes documentation, and it is admirably concise. By using taints, we can restrict what can run on a node. In our case, we can use them to dedicate nodes to our sensitive workloads and isolate them within our cluster.
If we define a taint on these nodes, we are guaranteed that only pods that tolerate this taint can be scheduled on them.
BUT
By using taints and tolerations, we are NOT guaranteed that our sensitive pods will be scheduled on our dedicated nodes. They can be, but they can also be scheduled on nodes without any taint. As the Kubernetes documentation explains: taints repel pods, they do not attract them.
You could rely on taints and tolerations alone if all your node pools were tainted with mutually exclusive taints (e.g. a node must be either an exposed node or a sensitive node, but never both). I do not recommend relying only on this mechanism, as it breaks as soon as an untainted node is added to the cluster.
By using taints and tolerations only, we may still have overlapping feasible node sets
If we want an effective taint for our use case, we will need to use the NoExecute effect so that we are guaranteed that no pod without the adequate toleration can run on the node. For more details on how to implement taints and tolerations, you can see a previous article by Kimelyne on the Padok blog.
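As an illustration, here is a minimal sketch of what this could look like, assuming a hypothetical dedicated=sensitive taint (the key and value are examples of my own, not a convention):
# Taint on the dedicated node: pods without a matching toleration cannot run there
# (equivalent to: kubectl taint nodes sensitive-node-1 dedicated=sensitive:NoExecute)
apiVersion: v1
kind: Node
metadata:
  name: sensitive-node-1
spec:
  taints:
    - key: dedicated
      value: sensitive
      effect: NoExecute
---
# Matching toleration on the sensitive pod: it is allowed (but not forced)
# to run on the tainted node
apiVersion: v1
kind: Pod
metadata:
  name: sensitive-pod
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: sensitive
      effect: NoExecute
  containers:
    - name: app
      image: registry.example.com/sensitive-app:1.0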
Bonus: If you want to convince yourself rather than just trust me (which I recommend), you can inspect any pod in a Kubernetes 1.18 or later cluster that has not been customized too much (this is the default behavior) and see that it has default tolerations! (And they do not prevent it from being scheduled on nodes without the specified taints.)
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
Indeed, the taint-based eviction mechanism relies on the fact that pods will not tolerate certain taints, making nodes reject them. By default, pods are configured to tolerate the unreachable and not-ready taints for 5 minutes: they'll be evicted from their node 5 minutes after the node controller has added the not-ready or unreachable taint with the NoExecute effect.
Node affinity
Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement).
Once again, the Kubernetes documentation is pretty explicit. By using node affinities we can force our sensitive and exposed pods to be scheduled on their respective dedicated nodes.
To define node affinity rules, we need to create target labels on our dedicated nodes so that they can be selected by our pods. For workload isolation purposes, it is recommended to use labels with the node-restriction.kubernetes.io/ prefix, as they cannot be modified by the kubelet's credentials if a node is compromised (provided the NodeRestriction admission plugin is enabled on your cluster).
Both a node affinity with a requiredDuringSchedulingIgnoredDuringExecution rule and a node selector will match our isolation needs. For more details on how to implement node affinity rules, you can refer to the node affinity and node selector documentation.
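As a sketch, assuming a dedicated node labeled with a hypothetical node-restriction.kubernetes.io/workload=sensitive label, the hard affinity rule could look like this:
# Label on the dedicated node (set by the cluster administrator, not the kubelet)
apiVersion: v1
kind: Node
metadata:
  name: sensitive-node-1
  labels:
    node-restriction.kubernetes.io/workload: sensitive
---
# Hard node affinity on the sensitive pod: it can only be scheduled
# on nodes carrying the label above
apiVersion: v1
kind: Pod
metadata:
  name: sensitive-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-restriction.kubernetes.io/workload
                operator: In
                values:
                  - sensitive
  containers:
    - name: app
      image: registry.example.com/sensitive-app:1.0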
However, node affinity alone does not guarantee workload isolation. It could if ALL pods used mutually exclusive node affinities or selectors, but once again, I do not recommend it.
Node affinities allow us to restrict our feasible nodes, but we may still have issues with pods that do not specify affinity rules
Combining taints, tolerations, and node affinities to achieve workload isolation
By combining both systems we can end up with a robust workload isolation mechanism:
- Taints and tolerations ensure we can have a dedicated node for our sensitive pods where other workloads cannot run.
- Node affinities (or node selectors) ensure that our sensitive pods are only scheduled on our dedicated nodes and not on other nodes.
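Putting both mechanisms together, a minimal sketch for a sensitive pod could look like this (reusing the hypothetical dedicated=sensitive taint and node-restriction.kubernetes.io/workload=sensitive label from the examples above):
apiVersion: v1
kind: Pod
metadata:
  name: sensitive-pod
spec:
  # Toleration: allows the pod to run on the tainted dedicated nodes
  tolerations:
    - key: dedicated
      operator: Equal
      value: sensitive
      effect: NoExecute
  # Hard node affinity: forces the pod to run only on the dedicated nodes
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-restriction.kubernetes.io/workload
                operator: In
                values:
                  - sensitive
  containers:
    - name: app
      image: registry.example.com/sensitive-app:1.0
Exposed pods carry neither the toleration nor the affinity rule: the taint keeps them off the dedicated nodes, while the affinity rule keeps sensitive pods off every other node, so the two feasible node sets no longer overlap.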
At long last, workload isolation!
Workload isolation at the node level greatly improves the security level of your cluster, but you'll still have a couple of points to watch out for:
- It does not protect your cluster against Kubernetes RBAC misconfigurations: if the service accounts attached to your exposed pods are cluster admins, workload isolation won’t prevent sensitive pods from being compromised.
- If daemon sets are required to run on all nodes, you should pay particular attention to the rights associated with their service accounts.
- It may impact your FinOps performance, and you'll need to size your nodes properly to avoid extra costs on your cluster.
Generally speaking, you should put effort into securing your Kubernetes workloads before implementing defense-in-depth with workload isolation.
Conclusion
I hope this article has helped you better understand taints and tolerations and how to combine them with affinities to create a robust workload isolation system. In a future article, we will look in more depth at how to exploit isolation flaws on Kubernetes clusters as an attacker. Stay tuned!