Kubernetes and Resilience (Pod Topology Spread Constraints)

Pierre RAFFA · AWS Tip · May 16, 2023


Topology Spread Constraints is a Kubernetes feature that allows you to specify how pods should be spread across nodes based on certain rules or constraints. This is useful for ensuring high availability and fault tolerance of applications running on Kubernetes clusters.

For example, to ensure that:

  • pods belonging to a particular Deployment or StatefulSet are spread across nodes in different availability zones, so that the application remains available even if one zone goes down.
  • pods belonging to a particular service are spread across multiple nodes, to limit the blast radius of an outage.

This feature helps ensure that applications are distributed across multiple nodes in a way that maximizes availability and resilience while minimizing latency and other performance issues.

Spread constraint definition

We can define one or multiple topologySpreadConstraints entries to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across our cluster.

topologySpreadConstraints:
- maxSkew: <integer>
  minDomains: <integer>              # optional; beta since v1.25
  topologyKey: <string>
  whenUnsatisfiable: <string>
  labelSelector: <object>
  matchLabelKeys: <list>             # optional; beta since v1.27
  nodeAffinityPolicy: [Honor|Ignore] # optional; beta since v1.26
  nodeTaintsPolicy: [Honor|Ignore]   # optional; beta since v1.26

maxSkew:
Describes the degree to which pods may be unevenly distributed. Its value must be greater than zero. Its semantics differ according to the value of whenUnsatisfiable.

whenUnsatisfiable:

  • DoNotSchedule: maxSkew defines the maximum permitted difference between the number of matching pods in the target topology and the global minimum (the minimum number of matching pods in an eligible domain, or zero if the number of eligible domains is less than minDomains). For example, if we have 3 zones with 2, 2 and 1 matching pods respectively and maxSkew is set to 1, then the global minimum is 1 and an incoming pod can only be scheduled into the zone that currently has 1 pod.
  • ScheduleAnyway: the scheduler still schedules the pod, but gives higher precedence to topologies that would help reduce the skew.

topologyKey:
Specifies the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. We call each instance of a topology (in other words, a <key, value> pair) a domain. The scheduler will try to put a balanced number of pods into each domain. Also, we define an eligible domain as a domain whose nodes meet the requirements of nodeAffinityPolicy and nodeTaintsPolicy.
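On an EKS cluster, a quick way to see which zone domain each node belongs to is to list the nodes with that label (a small sketch, assuming the well-known topology labels are set by the cloud provider):

# Nodes sharing the same value in the ZONE column belong to the same domain.
kubectl get nodes -L topology.kubernetes.io/zone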

Note that, when using Karpenter, the three topologyKey values it supports are the following (the capacity-type one is illustrated in the sketch below):

  • topology.kubernetes.io/zone
  • kubernetes.io/hostname
  • karpenter.sh/capacity-type
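For instance, karpenter.sh/capacity-type can be used to spread replicas across spot and on-demand capacity, so that a spot interruption cannot take out every replica. A minimal sketch, assuming the same app: my-app labels as in the tests below:

topologySpreadConstraints:
- topologyKey: "karpenter.sh/capacity-type"
  maxSkew: 1
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: my-app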

For more info: Pod Topology Spread Constraints | Kubernetes

Let’s see this feature in action…

Test method

The cluster is configured to run on 3 AZs in AWS.
The cluster has an autoscaling process in place.
5 instances will be running at the beginning.
All instances are on-demand.

For each test, a script will automate the capacity change of the application.
A delay of 70s will be applied between each increase in the number of replicas, to give a new node some time to join the cluster.

#!/bin/bash
for i in {2..21}
do
  echo "Scale to $i replicas"
  kubectl scale deployment my-app --replicas $i
  sleep 70
done

The output will list the pods ordered by creation date, plus their zone (AZ) and the current pod distribution across the AZs; a sketch of a command that can produce such a listing is shown just below.
For instance, the pod distribution 121 means:
- 1 pod in zone a
- 2 pods in zone b
- 1 pod in zone c
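The article doesn't show the listing script itself; here is a minimal sketch that produces a similar listing (not the author's script, assuming the pods carry the app: my-app label):

#!/bin/bash
# List the pods of the deployment, sorted by creation time,
# together with the node they run on and that node's zone label.
kubectl get pods -l app=my-app \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers |
while read -r pod node; do
  zone=$(kubectl get node "$node" \
    -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
  echo "$pod $node $zone"
done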

Test 1

topology.kubernetes.io/zone : DoNotSchedule

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: my-app
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
        resources:
          requests:
            cpu: 0.5
            memory: 1Gi
          limits:
            cpu: 0.5
            memory: 1Gi
      topologySpreadConstraints:
      - topologyKey: "topology.kubernetes.io/zone"
        maxSkew: 1
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app

Pod               Age   Node                            Zone        abc(zones)
my-app-xxx-zr5w4 68m ip-192-168-32-251.us-east-2... us-east-2b
my-app-xxx-gpmf8 23m ip-192-168-11-154.us-east-2... us-east-2a
my-app-xxx-d84s2 22m ip-192-168-71-72.us-east-2... us-east-2c 111
my-app-xxx-h8xj4 21m ip-192-168-11-154.us-east-2... us-east-2a
my-app-xxx-qgvzw 20m ip-192-168-42-238.us-east-2... us-east-2b
my-app-xxx-nxj8l 18m ip-192-168-65-28.us-east-2... us-east-2c 222
my-app-xxx-bf8j7 17m ip-192-168-65-28.us-east-2... us-east-2c
my-app-xxx-gcztt 16m ip-192-168-13-90.us-east-2... us-east-2a
my-app-xxx-rfr59 15m ip-192-168-47-214.us-east-2... us-east-2b 333
my-app-xxx-m9pnp 14m ip-192-168-13-90.us-east-2... us-east-2a
my-app-xxx-d6zps 13m ip-192-168-47-214.us-east-2... us-east-2b
my-app-xxx-ncsmk 11m ip-192-168-65-28.us-east-2... us-east-2c 444
my-app-xxx-mdgbm 10m ip-192-168-47-214.us-east-2... us-east-2b
my-app-xxx-c9vj2 9m28s ip-192-168-13-90.us-east-2... us-east-2a
my-app-xxx-ckdcv 8m17s ip-192-168-78-7.us-east-2... us-east-2c 555
my-app-xxx-h6wgr 7m6s ip-192-168-78-7.us-east-2... us-east-2c
my-app-xxx-4ht8j 5m55s ip-192-168-9-101.us-east-2... us-east-2a
my-app-xxx-2n5gw 4m44s ip-192-168-37-71.us-east-2... us-east-2b 666
my-app-xxx-w8td7 3m33s ip-192-168-37-71.us-east-2... us-east-2b
my-app-xxx-vp276 2m22s ip-192-168-9-101.us-east-2... us-east-2a
my-app-xxx-8wz8x 71s ip-192-168-78-7.us-east-2... us-east-2c 777
(Chart: Pod AZ distribution)

The result is as expected. The pods are evenly distributed across AZs 👍
The skew never exceeded 1.
The pods are spread across 10 different nodes.

Test 2

topology.kubernetes.io/zone : ScheduleAnyway

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: my-app
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
        resources:
          requests:
            cpu: 0.5
            memory: 1Gi
          limits:
            cpu: 0.5
            memory: 1Gi
      topologySpreadConstraints:
      - topologyKey: "topology.kubernetes.io/zone"
        maxSkew: 1
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-app

Pod               Age   Node                            Zone       abc(zones)
my-app-xxx-vlszw 23m ip-192-168-33-85.us-east-2... us-east-2b
my-app-xxx-wzstb 22m ip-192-168-15-54.us-east-2... us-east-2a
my-app-xxx-lbp9p 21m ip-192-168-78-7.us-east-2... us-east-2c 333
my-app-xxx-x9m9c 21m ip-192-168-71-72.us-east-2... us-east-2c
my-app-xxx-dg2zt 20m ip-192-168-0-84.us-east-2... us-east-2a
my-app-xxx-bgxjl 19m ip-192-168-45-100.us-east-2... us-east-2b 333
my-app-xxx-sxqdw 17m ip-192-168-45-100.us-east-2... us-east-2b
my-app-xxx-gslwz 16m ip-192-168-45-100.us-east-2... us-east-2b 353 (skew 2)
my-app-xxx-klrlk 15m ip-192-168-68-142.us-east-2... us-east-2c 354 (skew 2)
my-app-xxx-m89wp 14m ip-192-168-68-142.us-east-2... us-east-2c 355 (skew 2)
my-app-xxx-n2vf9 13m ip-192-168-68-142.us-east-2... us-east-2c 356 (skew 3)
my-app-xxx-xpxrv 11m ip-192-168-11-127.us-east-2... us-east-2a 456 (skew 2)
my-app-xxx-fcgp5 10m ip-192-168-11-127.us-east-2... us-east-2a 556
my-app-xxx-r96l2 9m33s ip-192-168-11-127.us-east-2... us-east-2a 656
my-app-xxx-gcvmt 8m22s ip-192-168-41-12.us-east-2... us-east-2b 666
my-app-xxx-xxhvs 7m11s ip-192-168-41-12.us-east-2... us-east-2b 676
my-app-xxx-8ptt9 6m ip-192-168-41-12.us-east-2... us-east-2b 686 (skew 2)
my-app-xxx-nss4m 4m48s ip-192-168-7-88.us-east-2... us-east-2a 786 (skew 2)
my-app-xxx-c7kls 3m37s ip-192-168-7-88.us-east-2... us-east-2a 886 (skew 2)
my-app-xxx-pxrwr 2m26s ip-192-168-7-88.us-east-2... us-east-2a 986 (skew 3)
my-app-xxx-xrvl5 75s ip-192-168-73-177.us-east-2... us-east-2c 987 (skew 2)

At the beginning of the test, 5 nodes already exist, which is why the scheduler could place the first pods evenly across AZs.
As soon as new nodes need to be provisioned, the skew reaches 2 or 3.

Once ip-192-168-45-100.us-east-2.compute.internal has been provisioned, the scheduler keeps targeting this node until it reaches its full capacity, followed by ip-192-168-68-142.us-east-2.compute.internal, and so on.

It looks like the scheduler prioritises a kind of bin-packing placement strategy over the topology constraints.
To make sure that's the case, let's run the same test with larger instances.

Pod               Age    Node                          Zone       abc(zones)
my-app-xxx-dzqf8 26m ip-192-168-76-158.us-east-2.. us-east-2c
my-app-xxx-tx5vc 23m ip-192-168-46-123.us-east-2.. us-east-2b
my-app-xxx-trfv8 22m ip-192-168-8-231.us-east-2... us-east-2a 111
my-app-xxx-cbzhh 21m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-7r26p 20m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-k7zlb 18m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-bldtp 17m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-lwn7c 16m ip-192-168-8-231.us-east-2... us-east-2a
my-app-xxx-4h9fq 15m ip-192-168-8-231.us-east-2... us-east-2a 711 (skew 6)
my-app-xxx-hrvbr 14m ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-bzs75 13m ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-xfcp4 11m ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-vtdp6 10m ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-qs5ml 9m28s ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-9x9zd 8m17s ip-192-168-45-86.us-east-2... us-east-2b
my-app-xxx-gwxsh 7m6s ip-192-168-45-86.us-east-2... us-east-2b 781 (skew 6)
my-app-xxx-c4hv2 5m55s ip-192-168-67-38.us-east-2... us-east-2c
my-app-xxx-6fjvd 4m44s ip-192-168-67-38.us-east-2... us-east-2c
my-app-xxx-tq6lc 3m33s ip-192-168-67-38.us-east-2... us-east-2c
my-app-xxx-qsbpp 2m22s ip-192-168-67-38.us-east-2... us-east-2c
my-app-xxx-rgbl8 71s ip-192-168-67-38.us-east-2... us-east-2c 786 (skew 2)

We can see here that the allocated resources for node ip-192-168-8-231.us-east-2.compute.internal almost reached full capacity before another instance was provisioned.
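The Allocated resources section below presumably comes from describing that node:

kubectl describe node ip-192-168-8-231.us-east-2.compute.internal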

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       3705m (94%)   3900m (99%)
  memory    7318Mi (49%)  7986Mi (53%)
(Chart: Pod AZ distribution)

Test 3

topology.kubernetes.io/zone : DoNotSchedule
kubernetes.io/hostname : DoNotSchedule

We can define a list of constraints as well.
The scheduler will try to respect these conditions:

  • The pods are evenly distributed across the AZs.
  • Only one pod of the service sits on a given node.

The pods are spread across 21 different nodes (+100% compared to Test 1).

This setup improves the high availability of the service by limiting the blast radius of an outage at the node level, but it could lead to a larger bill.
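The full manifest isn't repeated in the article; the constraints block presumably looks like the one from Test 1 with a second, hostname-level entry added:

topologySpreadConstraints:
- topologyKey: "topology.kubernetes.io/zone"
  maxSkew: 1
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app
- topologyKey: "kubernetes.io/hostname"
  maxSkew: 1
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app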

Pod               Age    Node                          Zone        abc(zones)
my-app-xxx-zhghv 26m ip-192-168-77-62.us-east-2.... us-east-2c
my-app-xxx-fwflh 23m ip-192-168-35-250.us-east-2... us-east-2b
my-app-xxx-7pjc7 22m ip-192-168-3-29.us-east-2.c... us-east-2a 111
my-app-xxx-9qkmc 21m ip-192-168-35-199.us-east-2... us-east-2b
my-app-xxx-q8r8c 20m ip-192-168-73-121.us-east-2... us-east-2c
my-app-xxx-h4l4q 18m ip-192-168-7-123.us-east-2.... us-east-2a 222
my-app-xxx-b8tcp 17m ip-192-168-15-161.us-east-2... us-east-2a
my-app-xxx-8m6hc 16m ip-192-168-72-211.us-east-2... us-east-2c
my-app-xxx-cgj25 15m ip-192-168-34-100.us-east-2... us-east-2b 333
my-app-xxx-56h9c 14m ip-192-168-37-144.us-east-2... us-east-2b
my-app-xxx-kpsjt 13m ip-192-168-69-178.us-east-2... us-east-2c
my-app-xxx-l7b86 11m ip-192-168-2-67.us-east-2.c... us-east-2a 444
my-app-xxx-q4krz 10m ip-192-168-43-34.us-east-2.... us-east-2b
my-app-xxx-8hc8f 9m29s ip-192-168-8-203.us-east-2.... us-east-2a
my-app-xxx-zqfbv 8m17s ip-192-168-73-95.us-east-2.... us-east-2c 555
my-app-xxx-d5wnl 7m6s ip-192-168-67-207.us-east-2... us-east-2c
my-app-xxx-nq2vp 5m55s ip-192-168-37-87.us-east-2.... us-east-2b
my-app-xxx-xscmb 4m44s ip-192-168-5-126.us-east-2.... us-east-2a 666
my-app-xxx-kx64d 3m33s ip-192-168-77-59.us-east-2.... us-east-2c
my-app-xxx-zw5xc 2m22s ip-192-168-13-228.us-east-2... us-east-2a
my-app-xxx-fv6lx 71s ip-192-168-39-223.us-east-2... us-east-2b 777
(Chart: Pod AZ distribution)

Test 4

topology.kubernetes.io/zone : DoNotSchedule
kubernetes.io/hostname : ScheduleAnyway

The pods are spread across 10 different nodes. Same result as Test 1.
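Again, a sketch of the constraints presumably used here: identical to Test 3, except that the hostname constraint is relaxed to ScheduleAnyway:

topologySpreadConstraints:
- topologyKey: "topology.kubernetes.io/zone"
  maxSkew: 1
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app
- topologyKey: "kubernetes.io/hostname"
  maxSkew: 1
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: my-app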


Pod               Age    Node                          Zone       abc(zones)
my-app-xxx-7ttx7 24m ip-192-168-9-27.us-east-2.... us-east-2a
my-app-xxx-nskkc 23m ip-192-168-64-38.us-east-2... us-east-2c
my-app-xxx-f8xjn 22m ip-192-168-42-228.us-east-2.. us-east-2b 111
my-app-xxx-nzdws 21m ip-192-168-72-206.us-east-2.. us-east-2c
my-app-xxx-bsz7z 20m ip-192-168-42-228.us-east-2.. us-east-2b
my-app-xxx-g8t5c 18m ip-192-168-9-27.us-east-2.c.. us-east-2a 222
my-app-xxx-c2rmc 17m ip-192-168-42-228.us-east-2.. us-east-2b
my-app-xxx-pc8jx 16m ip-192-168-68-140.us-east-2.. us-east-2c
my-app-xxx-whg27 15m ip-192-168-9-115.us-east-2... us-east-2a 333
my-app-xxx-t5jzx 14m ip-192-168-68-140.us-east-2.. us-east-2c
my-app-xxx-9lrtd 13m ip-192-168-9-115.us-east-2... us-east-2a
my-app-xxx-p7m94 11m ip-192-168-42-39.us-east-2... us-east-2b 444
my-app-xxx-hk8fg 10m ip-192-168-42-39.us-east-2... us-east-2b
my-app-xxx-4cmct 9m29s ip-192-168-9-115.us-east-2... us-east-2a
my-app-xxx-cw27c 8m17s ip-192-168-68-140.us-east-2.. us-east-2c 555
my-app-xxx-8cp7m 7m6s ip-192-168-42-39.us-east-2... us-east-2b
my-app-xxx-skdfb 5m55s ip-192-168-64-28.us-east-2... us-east-2c
my-app-xxx-59vmb 4m44s ip-192-168-14-63.us-east-2... us-east-2a 666
my-app-xxx-r65jg 3m33s ip-192-168-64-28.us-east-2... us-east-2c
my-app-xxx-fjgjm 2m22s ip-192-168-14-63.us-east-2... us-east-2a
my-app-xxx-zrb6s 71s ip-192-168-36-47.us-east-2... us-east-2b 777
(Chart: Pod AZ distribution)

But….

Topology spread constraints do not guarantee that maxSkew will be respected, even with whenUnsatisfiable: DoNotSchedule.
It happened a few times during the tests when the delay between each increase in the number of replicas was a bit too low, such as 30s.
When this situation happens, the scheduler manages to mitigate the skew in the following steps, eventually bringing it back to 1.

The reason is that the scheduler decides the placement of a pod only at pod creation. It won't re-evaluate the current state and won't deschedule a pod to preserve maxSkew.

The way to strictly run the pods of a service on separate nodes is to use podAffinity/podAntiAffinity instead.
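For reference, a minimal sketch of such a rule (assuming the same app: my-app labels): with a required podAntiAffinity on the hostname topology, the scheduler refuses to place two matching pods on the same node.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-app
      topologyKey: "kubernetes.io/hostname"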

Here is an example of the skew temporarily not being respected, with topology.kubernetes.io/zone : DoNotSchedule and a delay of 30s between each increase.

Pod               Age    Node                          Zone        abc(zones)
my-app-xxx-sh7g6 10m ip-192-168-74-106.us-east-2.. us-east-2c
my-app-xxx-z6wzg 10m ip-192-168-0-245.us-east-2... us-east-2a
my-app-xxx-nbdrj 9m51s ip-192-168-34-118.us-east-2.. us-east-2b 111
my-app-xxx-xff7w 9m20s ip-192-168-34-36.us-east-2... us-east-2b
my-app-xxx-c9sk7 8m49s ip-192-168-79-68.us-east-2... us-east-2c
my-app-xxx-pz54k 8m17s ip-192-168-10-53.us-east-2... us-east-2a 222
my-app-xxx-sqzxl 7m46s ip-192-168-14-172.us-east-2.. us-east-2a
my-app-xxx-lzdjq 7m15s ip-192-168-35-36.us-east-2... us-east-2b
my-app-xxx-d6kds 6m44s ip-192-168-35-36.us-east-2... us-east-2b 342 (skew 2)
my-app-xxx-qj9tv 6m13s ip-192-168-65-153.us-east-2.. us-east-2c 343
my-app-xxx-qgnk4 5m42s ip-192-168-65-153.us-east-2.. us-east-2c 344
my-app-xxx-x5724 5m11s ip-192-168-9-121.us-east-2... us-east-2a 444
my-app-xxx-8bj6q 4m40s ip-192-168-65-153.us-east-2.. us-east-2c
my-app-xxx-bxjdl 4m9s ip-192-168-9-121.us-east-2... us-east-2a
my-app-xxx-ptbf8 3m38s ip-192-168-35-36.us-east-2... us-east-2b 555
my-app-xxx-rjgkl 3m6s ip-192-168-9-121.us-east-2... us-east-2a
my-app-xxx-zqgm7 2m35s ip-192-168-69-34.us-east-2... us-east-2c
my-app-xxx-76zxb 2m4s ip-192-168-45-149.us-east-2.. us-east-2b 666
my-app-xxx-4nrq2 93s ip-192-168-69-34.us-east-2... us-east-2c
my-app-xxx-fxms8 62s ip-192-168-45-149.us-east-2.. us-east-2b
my-app-xxx-trgj7 31s ip-192-168-5-47.us-east-2.c.. us-east-2a 777

And another example of the skew temporarily not being respected, with topology.kubernetes.io/zone : DoNotSchedule, spot instances, and a delay of 70s between each increase.

Pod               Age    Node                            Zone       abc(zones)
my-app-xxx-gbswx 27m ip-192-168-3-107.us-east-2... us-east-2a
my-app-xxx-xxgsw 23m ip-192-168-44-31.us-east-2... us-east-2b
my-app-xxx-nfpbh 22m ip-192-168-66-238.us-east-2.. us-east-2c 111
my-app-xxx-t4dfn 21m ip-192-168-66-238.us-east-2.. us-east-2c
my-app-xxx-4vj65 20m ip-192-168-44-31.us-east-2... us-east-2b
my-app-xxx-972l9 18m ip-192-168-3-15.us-east-2.c.. us-east-2a 222
my-app-xxx-wfh8z 16m ip-192-168-36-14.us-east-2... us-east-2b
my-app-xxx-w92vb 15m ip-192-168-2-166.us-east-2... us-east-2a
my-app-xxx-dd2rz 14m ip-192-168-2-166.us-east-2... us-east-2a 432 (skew 2)
my-app-xxx-2gk8s 13m ip-192-168-78-117.us-east-2.. us-east-2c 433
my-app-xxx-w7v74 11m ip-192-168-46-160.us-east-2.. us-east-2b 443
my-app-xxx-s6n8h 10m ip-192-168-78-117.us-east-2.. us-east-2c 444
my-app-xxx-s8xjw 10m ip-192-168-46-160.us-east-2.. us-east-2b
my-app-xxx-jqcrb 9m29s ip-192-168-2-166.us-east-2... us-east-2a
my-app-xxx-twfpq 8m18s ip-192-168-76-130.us-east-2.. us-east-2c 555
my-app-xxx-fd4gv 7m6s ip-192-168-76-130.us-east-2.. us-east-2c
my-app-xxx-mwmpz 5m55s ip-192-168-9-107.us-east-2... us-east-2a
my-app-xxx-vvmhm 4m44s ip-192-168-46-79.us-east-2... us-east-2b 666
my-app-xxx-5pjzc 3m33s ip-192-168-9-107.us-east-2... us-east-2a
my-app-xxx-lw7fm 2m22s ip-192-168-46-79.us-east-2... us-east-2b
my-app-xxx-79d7c 71s ip-192-168-78-53.us-east-2... us-east-2c 777

Sources:

Pod Topology Spread Constraints | Kubernetes
Scheduling | Karpenter
The Ultimate Guide to Kubernetes Pod Topology Spread Constraints | mby.io
