Kubernetes and Resilience (Pod Topology Spread Constraints)

Pierre RAFFA · AWS Tip · May 16, 2023


Topology Spread Constraints is a Kubernetes feature that allows you to specify how pods should be spread across nodes based on certain rules or constraints. This is useful for ensuring high availability and fault tolerance of applications running on Kubernetes clusters.

For example, to ensure that:

  • pods belonging to a particular Deployment or StatefulSet are spread across nodes in different availability zones, so that the application remains available even if one zone goes down.
  • pods belonging to a particular service are spread across multiple nodes, to limit the blast radius of an outage.

This feature helps ensure that applications are distributed across multiple nodes in a way that maximizes availability and resilience while minimizing latency and other performance issues.

Spread constraint definition

We can define one or multiple topologySpreadConstraints entries to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across our cluster.

topologySpreadConstraints:
- maxSkew: <integer>
  minDomains: <integer>              # optional; beta since v1.25
  topologyKey: <string>
  whenUnsatisfiable: <string>
  labelSelector: <object>
  matchLabelKeys: <list>             # optional; beta since v1.27
  nodeAffinityPolicy: [Honor|Ignore] # optional; beta since v1.26
  nodeTaintsPolicy: [Honor|Ignore]   # optional; beta since v1.26

maxSkew:
Describes the degree to which pods may be unevenly distributed. Its value must be greater than zero. Its semantics differ according to the value of whenUnsatisfiable.

whenUnsatisfiable:

  • DoNotSchedule: maxSkew defines the maximum permitted difference between the number of matching pods in the target topology and the global minimum (the minimum number of matching pods in an eligible domain, or zero if the number of eligible domains is less than minDomains). For example, if we have 3 zones with 2, 2 and 1 matching pods respectively and maxSkew is set to 1, then the global minimum is 1 and an incoming pod can only be scheduled into the zone that currently has 1 pod.
  • ScheduleAnyway: the scheduler still schedules the pod, but gives higher precedence to topologies that would help reduce the skew.

topologyKey:
Specifies the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. We call each instance of a topology (in other words, a <key, value> pair) a domain. The scheduler will try to put a balanced number of pods into each domain. Also, we define an eligible domain as a domain whose nodes meet the requirements of nodeAffinityPolicy and nodeTaintsPolicy.
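On an EKS cluster, a quick way to see which zone domain each node belongs to is to list the nodes with that label (a small sketch, assuming the well-known topology labels are set by the cloud provider):

# Nodes sharing the same value in the ZONE column belong to the same domain.
kubectl get nodes -L topology.kubernetes.io/zone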

Note that, when using Karpenter, the three topologyKey values it supports are the following (the capacity-type one is illustrated in the sketch below):

  • topology.kubernetes.io/zone
  • kubernetes.io/hostname
  • karpenter.sh/capacity-type
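For instance, karpenter.sh/capacity-type can be used to spread replicas across spot and on-demand capacity, so that a spot interruption cannot take out every replica. A minimal sketch, assuming the same app: my-app labels as in the tests below:

topologySpreadConstraints:
- topologyKey: "karpenter.sh/capacity-type"
  maxSkew: 1
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: my-app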

For more info: Pod Topology Spread Constraints | Kubernetes

Let’s see this feature in action…

Test method

The cluster is configured to run on 3 AZs in AWS.
The cluster has an autoscaling process in place.
5 instances will be running at the beginning.
All instances are on-demand.

For each test, a script will automate the capacity change of the application.
A delay of 70s will be applied between each increase in the number of replicas, to give a new node some time to join the cluster.

#!/bin/bash
for i in {2..21}
do
  echo "Scale to $i replicas"
  kubectl scale deployment my-app --replicas $i
  sleep 70
done

The output will list the pods ordered by creation date, plus their zone (AZ) and the current pod distribution across the AZs; a sketch of a command that can produce such a listing is shown just below.
For instance, the pod distribution 121 means:
- 1 pod in zone a
- 2 pods in zone b
- 1 pod in zone c
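The article doesn't show the listing script itself; here is a minimal sketch that produces a similar listing (not the author's script, assuming the pods carry the app: my-app label):

#!/bin/bash
# List the pods of the deployment, sorted by creation time,
# together with the node they run on and that node's zone label.
kubectl get pods -l app=my-app \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers |
while read -r pod node; do
  zone=$(kubectl get node "$node" \
    -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
  echo "$pod $node $zone"
done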

Test 1

topology.kubernetes.io/zone : DoNotSchedule

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: my-app
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
        resources:
          requests:
            cpu: 0.5
            memory: 1Gi
          limits:
            cpu: 0.5
            memory: 1Gi
      topologySpreadConstraints:
      - topologyKey: "topology.kubernetes.io/zone"
        maxSkew: 1
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app

Pod               Age   Node                            Zone        abc(zones)
my-app-xxx-zr5w4 68m ip-192-168-32-251.us-east-2... us-east-2b
my-app-xxx-gpmf8 23m ip-192-168-11-154.us-east-2... us-east-2a
my-app-xxx-d84s2 22m ip-192-168-71-72.us-east-2... us-east-2c 111
my-app-xxx-h8xj4 21m ip-192-168-11-154.us-east-2... us-east-2a
my-app-xxx-qgvzw 20m ip-192-168-42-238.us-east-2... us-east-2b
my-app-xxx-nxj8l 18m ip-192-168-65-28.us-east-2... us-east-2c 222
my-app-xxx-bf8j7 17m ip-192-168-65-28.us-east-2... us-east-2c
my-app-xxx-gcztt 16m ip-192-168-13-90.us-east-2... us-east-2a
my-app-xxx-rfr59 15m ip-192-168-47-214.us-east-2... us-east-2b 333
my-app-xxx-m9pnp 14m ip-192-168-13-90.us-east-2... us-east-2a
my-app-xxx-d6zps 13m ip-192-168-47-214.us-east-2... us-east-2b
my-app-xxx-ncsmk 11m ip-192-168-65-28.us-east-2... us-east-2c 444
my-app-xxx-mdgbm 10m ip-192-168-47-214.us-east-2... us-east-2b
my-app-xxx-c9vj2 9m28s ip-192-168-13-90.us-east-2... us-east-2a
my-app-xxx-ckdcv 8m17s ip-192-168-78-7.us-east-2... us-east-2c 555
my-app-xxx-h6wgr 7m6s ip-192-168-78-7.us-east-2... us-east-2c
my-app-xxx-4ht8j 5m55s ip-192-168-9-101.us-east-2... us-east-2a
my-app-xxx-2n5gw 4m44s ip-192-168-37-71.us-east-2... us-east-2b 666
my-app-xxx-w8td7 3m33s ip-192-168-37-71.us-east-2... us-east-2b
my-app-xxx-vp276 2m22s ip-192-168-9-101.us-east-2... us-east-2a
my-app-xxx-8wz8x 71s ip-192-168-78-7.us-east-2... us-east-2c 777
(Chart: Pod AZ distribution)

The result is as expected. The pods are evenly distributed across AZs 👍
The skew never exceeded 1.
The pods are spread across 10 different nodes.

Test 2

topology.kubernetes.io/zone : ScheduleAnyway

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: my-app
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
        resources:
          requests:
            cpu: 0.5
            memory: 1Gi
          limits:
            cpu: 0.5
            memory: 1Gi
      topologySpreadConstraints:
      - topologyKey: "topology.kubernetes.io/zone"
        maxSkew: 1
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-app

Pod               Age   Node                            Zone       abc(zones)
my-app-xxx-vlszw 23m ip-192-168-33-85.us-east-2... us-east-2b
my-app-xxx-wzstb 22m ip-192-168-15-54.us-east-2... us-east-2a
my-app-xxx-lbp9p 21m ip-192-168-78-7.us-east-2... us-east-2c 333
my-app-xxx-x9m9c 21m ip-192-168-71-72.us-east-2... us-east-2c
my-app-xxx-dg2zt 20m ip-192-168-0-84.us-east-2... us-east-2a
my-app-xxx-bgxjl 19m ip-192-168-45-100.us-east-2... us-east-2b 333
my-app-xxx-sxqdw 17m ip-192-168-45-100.us-east-2... us-east-2b
my-app-xxx-gslwz 16m ip-192-168-45-100.us-east-2... us-east-2b 353 (skew 2)
my-app-xxx-klrlk 15m ip-192-168-68-142.us-east-2... us-east-2c 354 (skew 2)
my-app-xxx-m89wp 14m ip-192-168-68-142.us-east-2... us-east-2c 355 (skew 2)
my-app-xxx-n2vf9 13m ip-192-168-68-142.us-east-2... us-east-2c 356 (skew 3)
my-app-xxx-xpxrv 11m ip-192-168-11-127.us-east-2... us-east-2a 456 (skew 2)
my-app-xxx-fcgp5 10m ip-192-168-11-127.us-east-2... us-east-2a 556
my-app-xxx-r96l2 9m33s ip-192-168-11-127.us-east-2... us-east-2a 656
my-app-xxx-gcvmt 8m22s ip-192-168-41-12.us-east-2... us-east-2b 666
my-app-xxx-xxhvs 7m11s ip-192-168-41-12.us-east-2... us-east-2b 676
my-app-xxx-8ptt9 6m ip-192-168-41-12.us-east-2... us-east-2b 686 (skew 2)
my-app-xxx-nss4m 4m48s ip-192-168-7-88.us-east-2... us-east-2a 786 (skew 2)
my-app-xxx-c7kls 3m37s ip-192-168-7-88.us-east-2... us-east-2a 886 (skew 2)
my-app-xxx-pxrwr 2m26s ip-192-168-7-88.us-east-2... us-east-2a 986 (skew 3)
my-app-xxx-xrvl5 75s ip-192-168-73-177.us-east-2... us-east-2c 987 (skew 2)

At the beginning of the test, 5 nodes already exist, which is why the scheduler could place the first pods evenly across AZs.
As soon as new nodes need to be provisioned, the skew reaches 2 or 3.

Once ip-192-168-45-100.us-east-2.compute.internal has been provisioned, the scheduler keeps targeting this node until it reaches its full capacity, followed by ip-192-168-68-142.us-east-2.compute.internal, and so on.

It looks like the scheduler prioritises a kind of bin-packing placement strategy over the topology constraints.
To make sure that's the case, let's run the same test with larger instances.

Pod               Age    Node                          Zone       abc(zones)
my-app-xxx-dzqf8 26m ip-192-168-76-158.us-east-2.. us-east-2c
my-app-xxx-tx5vc 23m ip-192-168-46-123.us-east-2.. us-east-2b
my-app-xxx-trfv8 22m ip-192-168-8-231.us-east-2... us-east-2a 111
my-app-xxx-cbzhh 21m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-7r26p 20m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-k7zlb 18m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-bldtp 17m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-lwn7c 16m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-4h9fq 15m ip-192-168-8-231.us-east-2... us-east-2a 711 (skew 6)
my-app-xxx-hrvbr 14m ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-bzs75 13m ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-xfcp4 11m ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-vtdp6 10m ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-qs5ml 9m28s ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-9x9zd 8m17s ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-gwxsh 7m6s ip-192-168-45-86.us-east-2... us-east-2b 781 (skew 6)
my-app-xxx-c4hv2 5m55s ip-192-168-67-38.us-east-2... us-east-2c
my-app-xxx-6fjvd 4m44s ip-192-168-67-38.us-east-2... us-east-2c
my-app-xxx-tq6lc 3m33s ip-192-168-67-38.us-east-2... us-east-2c
my-app-xxx-qsbpp 2m22s ip-192-168-67-38.us-east-2... us-east-2c
my-app-xxx-rgbl8 71s ip-192-168-67-38.us-east-2... us-east-2c 786 (skew 2)

We can see here that the allocated resources for node ip-192-168-8-231.us-east-2.compute.internal almost reached full capacity before another instance was provisioned.
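The Allocated resources section below presumably comes from describing that node:

kubectl describe node ip-192-168-8-231.us-east-2.compute.internal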

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       3705m (94%)   3900m (99%)
  memory    7318Mi (49%)  7986Mi (53%)
(Chart: Pod AZ distribution)

Test 3

topology.kubernetes.io/zone : DoNotSchedule
kubernetes.io/hostname : DoNotSchedule

We can define a list of constraints as well.
The scheduler will try to respect these conditions:

  • The pods are evenly distributed across the AZs.
  • Only one pod of the service sits on a given node.

The pods are spread across 21 different nodes (+100% compared to Test 1).

This setup improves the high availability of the service by limiting the blast radius of an outage at the node level, but it could lead to a larger bill.
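The full manifest isn't repeated in the article; the constraints block presumably looks like the one from Test 1 with a second, hostname-level entry added:

topologySpreadConstraints:
- topologyKey: "topology.kubernetes.io/zone"
  maxSkew: 1
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app
- topologyKey: "kubernetes.io/hostname"
  maxSkew: 1
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app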

Pod               Age    Node                          Zone        abc(zones)
my-app-xxx-zhghv 26m ip-192-168-77-62.us-east-2.... us-east-2c
my-app-xxx-fwflh 23m ip-192-168-35-250.us-east-2... us-east-2b
my-app-xxx-7pjc7 22m ip-192-168-3-29.us-east-2.c... us-east-2a 111
my-app-xxx-9qkmc 21m ip-192-168-35-199.us-east-2... us-east-2b
my-app-xxx-q8r8c 20m ip-192-168-73-121.us-east-2... us-east-2c
my-app-xxx-h4l4q 18m ip-192-168-7-123.us-east-2.... us-east-2a 222
my-app-xxx-b8tcp 17m ip-192-168-15-161.us-east-2... us-east-2a
my-app-xxx-8m6hc 16m ip-192-168-72-211.us-east-2... us-east-2c
my-app-xxx-cgj25 15m ip-192-168-34-100.us-east-2... us-east-2b 333
my-app-xxx-56h9c 14m ip-192-168-37-144.us-east-2... us-east-2b
my-app-xxx-kpsjt 13m ip-192-168-69-178.us-east-2... us-east-2c
my-app-xxx-l7b86 11m ip-192-168-2-67.us-east-2.c... us-east-2a 444
my-app-xxx-q4krz 10m ip-192-168-43-34.us-east-2.... us-east-2b
my-app-xxx-8hc8f 9m29s ip-192-168-8-203.us-east-2.... us-east-2a
my-app-xxx-zqfbv 8m17s ip-192-168-73-95.us-east-2.... us-east-2c 555
my-app-xxx-d5wnl 7m6s ip-192-168-67-207.us-east-2... us-east-2c
my-app-xxx-nq2vp 5m55s ip-192-168-37-87.us-east-2.... us-east-2b
my-app-xxx-xscmb 4m44s ip-192-168-5-126.us-east-2.... us-east-2a 666
my-app-xxx-kx64d 3m33s ip-192-168-77-59.us-east-2.... us-east-2c
my-app-xxx-zw5xc 2m22s ip-192-168-13-228.us-east-2... us-east-2a
my-app-xxx-fv6lx 71s ip-192-168-39-223.us-east-2... us-east-2b 777
(Chart: Pod AZ distribution)

Test 4

topology.kubernetes.io/zone : DoNotSchedule
kubernetes.io/hostname : ScheduleAnyway

The pods are spread across 10 different nodes. Same result as Test 1.
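Again, a sketch of the constraints presumably used here: identical to Test 3, except that the hostname constraint is relaxed to ScheduleAnyway:

topologySpreadConstraints:
- topologyKey: "topology.kubernetes.io/zone"
  maxSkew: 1
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app
- topologyKey: "kubernetes.io/hostname"
  maxSkew: 1
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: my-app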


Pod               Age    Node                          Zone       abc(zones)
my-app-xxx-7ttx7 24m ip-192-168-9-27.us-east-2.... us-east-2a
my-app-xxx-nskkc 23m ip-192-168-64-38.us-east-2... us-east-2c
my-app-xxx-f8xjn 22m ip-192-168-42-228.us-east-2.. us-east-2b 111
my-app-xxx-nzdws 21m ip-192-168-72-206.us-east-2.. us-east-2c
my-app-xxx-bsz7z 20m ip-192-168-42-228.us-east-2.. us-east-2b
my-app-xxx-g8t5c 18m ip-192-168-9-27.us-east-2.c.. us-east-2a 222
my-app-xxx-c2rmc 17m ip-192-168-42-228.us-east-2.. us-east-2b
my-app-xxx-pc8jx 16m ip-192-168-68-140.us-east-2.. us-east-2c
my-app-xxx-whg27 15m ip-192-168-9-115.us-east-2... us-east-2a 333
my-app-xxx-t5jzx 14m ip-192-168-68-140.us-east-2.. us-east-2c
my-app-xxx-9lrtd 13m ip-192-168-9-115.us-east-2... us-east-2a
my-app-xxx-p7m94 11m ip-192-168-42-39.us-east-2... us-east-2b 444
my-app-xxx-hk8fg 10m ip-192-168-42-39.us-east-2... us-east-2b
my-app-xxx-4cmct 9m29s ip-192-168-9-115.us-east-2... us-east-2a
my-app-xxx-cw27c 8m17s ip-192-168-68-140.us-east-2.. us-east-2c 555
my-app-xxx-8cp7m 7m6s ip-192-168-42-39.us-east-2... us-east-2b
my-app-xxx-skdfb 5m55s ip-192-168-64-28.us-east-2... us-east-2c
my-app-xxx-59vmb 4m44s ip-192-168-14-63.us-east-2... us-east-2a 666
my-app-xxx-r65jg 3m33s ip-192-168-64-28.us-east-2... us-east-2c
my-app-xxx-fjgjm 2m22s ip-192-168-14-63.us-east-2... us-east-2a
my-app-xxx-zrb6s 71s ip-192-168-36-47.us-east-2... us-east-2b 777
(Chart: Pod AZ distribution)

But….

Topology spread constraints do not guarantee that maxSkew will be respected, even with whenUnsatisfiable: DoNotSchedule.
It happened a few times during the tests when the delay between each increase in the number of replicas was a bit too low, such as 30s.
When this situation happens, the scheduler manages to mitigate the skew in the following steps, eventually bringing it back to 1.

The reason is that the scheduler decides the placement of a pod only at pod creation. It won't re-evaluate the current state and won't deschedule a pod to preserve maxSkew.

The way to strictly run the pods of a service on separate nodes is to use podAffinity/podAntiAffinity instead.
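For reference, a minimal sketch of such a rule (assuming the same app: my-app labels): with a required podAntiAffinity on the hostname topology, the scheduler refuses to place two matching pods on the same node.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-app
      topologyKey: "kubernetes.io/hostname"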

Here is an example of the skew temporarily not being respected, with topology.kubernetes.io/zone : DoNotSchedule and a delay of 30s between each increase.

Pod               Age    Node                          Zone        abc(zones)
my-app-xxx-sh7g6 10m ip-192-168-74-106.us-east-2.. us-east-2c
my-app-xxx-z6wzg 10m ip-192-168-0-245.us-east-2... us-east-2a
my-app-xxx-nbdrj 9m51s ip-192-168-34-118.us-east-2.. us-east-2b 111
my-app-xxx-xff7w 9m20s ip-192-168-34-36.us-east-2... us-east-2b
my-app-xxx-c9sk7 8m49s ip-192-168-79-68.us-east-2... us-east-2c
my-app-xxx-pz54k 8m17s ip-192-168-10-53.us-east-2... us-east-2a 222
my-app-xxx-sqzxl 7m46s ip-192-168-14-172.us-east-2.. us-east-2a
my-app-xxx-lzdjq 7m15s ip-192-168-35-36.us-east-2... us-east-2b
my-app-xxx-d6kds 6m44s ip-192-168-35-36.us-east-2... us-east-2b 342 (skew 2)
my-app-xxx-qj9tv 6m13s ip-192-168-65-153.us-east-2.. us-east-2c 343
my-app-xxx-qgnk4 5m42s ip-192-168-65-153.us-east-2.. us-east-2c 344
my-app-xxx-x5724 5m11s ip-192-168-9-121.us-east-2... us-east-2a 444
my-app-xxx-8bj6q 4m40s ip-192-168-65-153.us-east-2.. us-east-2c
my-app-xxx-bxjdl 4m9s ip-192-168-9-121.us-east-2... us-east-2a
my-app-xxx-ptbf8 3m38s ip-192-168-35-36.us-east-2... us-east-2b 555
my-app-xxx-rjgkl 3m6s ip-192-168-9-121.us-east-2... us-east-2a
my-app-xxx-zqgm7 2m35s ip-192-168-69-34.us-east-2... us-east-2c
my-app-xxx-76zxb 2m4s ip-192-168-45-149.us-east-2.. us-east-2b 666
my-app-xxx-4nrq2 93s ip-192-168-69-34.us-east-2... us-east-2c
my-app-xxx-fxms8 62s ip-192-168-45-149.us-east-2.. us-east-2b
my-app-xxx-trgj7 31s ip-192-168-5-47.us-east-2.c.. us-east-2a 777

And another example of the skew temporarily not being respected, with topology.kubernetes.io/zone : DoNotSchedule, spot instances, and a delay of 70s between each increase.

Pod               Age    Node                            Zone       abc(zones)
my-app-xxx-gbswx 27m ip-192-168-3-107.us-east-2... us-east-2a
my-app-xxx-xxgsw 23m ip-192-168-44-31.us-east-2... us-east-2b
my-app-xxx-nfpbh 22m ip-192-168-66-238.us-east-2.. us-east-2c 111
my-app-xxx-t4dfn 21m ip-192-168-66-238.us-east-2.. us-east-2c
my-app-xxx-4vj65 20m ip-192-168-44-31.us-east-2... us-east-2b
my-app-xxx-972l9 18m ip-192-168-3-15.us-east-2.c.. us-east-2a 222
my-app-xxx-wfh8z 16m ip-192-168-36-14.us-east-2... us-east-2b
my-app-xxx-w92vb 15m ip-192-168-2-166.us-east-2... us-east-2a
my-app-xxx-dd2rz 14m ip-192-168-2-166.us-east-2... us-east-2a 432 (skew 2)
my-app-xxx-2gk8s 13m ip-192-168-78-117.us-east-2.. us-east-2c 433
my-app-xxx-w7v74 11m ip-192-168-46-160.us-east-2.. us-east-2b 443
my-app-xxx-s6n8h 10m ip-192-168-78-117.us-east-2.. us-east-2c 444
my-app-xxx-s8xjw 10m ip-192-168-46-160.us-east-2.. us-east-2b
my-app-xxx-jqcrb 9m29s ip-192-168-2-166.us-east-2... us-east-2a
my-app-xxx-twfpq 8m18s ip-192-168-76-130.us-east-2.. us-east-2c 555
my-app-xxx-fd4gv 7m6s ip-192-168-76-130.us-east-2.. us-east-2c
my-app-xxx-mwmpz 5m55s ip-192-168-9-107.us-east-2... us-east-2a
my-app-xxx-vvmhm 4m44s ip-192-168-46-79.us-east-2... us-east-2b 666
my-app-xxx-5pjzc 3m33s ip-192-168-9-107.us-east-2... us-east-2a
my-app-xxx-lw7fm 2m22s ip-192-168-46-79.us-east-2... us-east-2b
my-app-xxx-79d7c 71s ip-192-168-78-53.us-east-2... us-east-2c 777

Sources:

Pod Topology Spread Constraints | Kubernetes
Scheduling | Karpenter
The Ultimate Guide to Kubernetes Pod Topology Spread Constraints | mby.io
