ECS Cluster Auto Scaling

Introduction

Previously, in order to autoscale ECS and the ASG, we had to set up a scaling policy separately in ECS and in the ASG.
It is clearly not trivial to set them up so that the autoscaling process behaves as expected.
These policies behave totally independently, which results in unpredictable behaviour (instances created too early or too late, etc.).

Amazon ECS cluster auto scaling (CAS) enables you to have more control over how you scale the Amazon EC2 instances within a cluster.

Once enabled by creating a capacity provider, Amazon ECS manages the capacity of the selected Auto Scaling group.
Basically, you just have to focus on running your tasks, and ECS will ensure the ASG scales as needed.
Compared to the previous configuration explained above, CAS links ECS and ASG.

CAS Representation

In the diagram below, on the left, there is the ECS Service with a Service Autoscaling Policy set up the way you want (Step Scaling, Target Tracking…).
The behaviour is the same at this point. The ECS Service sends some metrics to CloudWatch (1), which can trigger an alarm to add more tasks to the ECS Service (2).

With the capacity provider set up and targeting the ASG (3), a new metric in the AWS/ECS/ManagedScaling namespace is created by AWS.
This metric is a calculation explained a bit later and can trigger an alarm to change the capacity of the ASG accordingly (4).

Note that 2 alarms based on CapacityProviderReservation and 1 ASG scaling policy will be automatically generated by AWS and should not be edited.

How is this metric calculated?

Once configured, a metric called CapacityProviderReservation is available in the AWS/ECS/ManagedScaling namespace.

CapacityProviderReservation is the metric that controls the scaling of the ASG: it represents the ratio of how big the ASG needs to be relative to how big it actually is.
In other words, it defines the percentage of instances that should be running tasks; the rest are spare instances.
For example, a target value of 10 means that the scaling policy will adjust N (within the limits available) so that about 90% of your ASG’s instances will not be running any tasks, regardless of how many tasks you run.

To understand how this is defined, let’s call:

  • N, the current number of EC2 instances
  • M, the number of EC2 instances needed to fit the current number of tasks + provisioning tasks

CapacityProviderReservation = M / N * 100
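The metric and the instance count it implies can be sketched in a few lines of Python. This is just an illustration of the arithmetic above, not the actual CAS implementation, and the function names are mine:

```python
import math

def capacity_provider_reservation(m, n):
    """CapacityProviderReservation = M / N * 100."""
    return m / n * 100

def desired_instances(m, target):
    """Smallest N such that M / N * 100 <= target, i.e. roughly how far
    the target-tracking policy would grow the ASG for a given M."""
    return math.ceil(m * 100 / target)

# With a 90% target and M = 2, the ASG settles at N = 3;
# with a 50% target it settles at N = 4.
```

These two numbers match the (2b) steps of the load tests further down.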

Let’s see CapacityProviderReservation in action

A bit of context…
In these tests, the ASG runs instances which have 2 vCPUs.
Each of these instances can run 2 tasks (1 task per vCPU).
The ECS Service task placement strategy is set to binpack (CPU).
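Given this setup, M follows directly from the task count. A quick sketch, assuming every task reserves 1 vCPU as in these tests (daemon tasks and other reservations are ignored):

```python
import math

def needed_instances(task_count, tasks_per_instance=2):
    """M: instances needed to fit all current + provisioning tasks.
    Here each 2-vCPU instance fits 2 one-vCPU tasks."""
    return math.ceil(task_count / tasks_per_instance)

# 4 tasks fit on 2 instances; a 5th task forces a 3rd instance.
```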

The EC2 instances will be represented in grey squares and ECS tasks in orange squares.

The load test will be executed in 4 stages using k6:

  • 10min with 80 VUs
  • 10min with 160 VUs
  • 10min with 280 VUs
  • 10min with 500 VUs

During the 1st load test (CapacityProviderReservation 90%)

Service Autoscaling Policy: Target Tracking set to 50% for CPUUtilization
CapacityProviderReservation: 90%

(50% is used for test purposes only)

(1) Initial state

2 instances running but only 1 contains some tasks.
M = 1 | N = 2 | C = 1 / 2 * 100 = 50%

(2a) 1st stage: Alarm is triggered and Service Autoscaling needs to add more tasks

Once the tasks are added to the instance, 2 instances will be running and 2 will contain some tasks.
M = 2 | N = 2 | C = 2 / 2 * 100 = 100%

At this point, the alarm for CapacityProviderReservation will switch to ALARM state.
CAS will calculate how many instances to add to fit all these tasks.

(2b) New instance created

Once the instance is created, 3 instances will be running and 2 will contain some tasks.

M = 2 | N = 3 | C = 2 / 3 * 100 = 66.66%

(3a) 4th stage: Alarm remains in ALARM state and Service Autoscaling needs to add more tasks

Before the provisioning tasks are added to the instances, 8 instances will be running and 6 will contain some tasks.
M = 6 | N = 8 | C = 6 / 8 * 100 = 75%

Once the tasks are added to the instances, 8 instances will be running and 8 will contain some tasks.
M = 8 | N = 8 | C = 8 / 8 * 100 = 100%

At this point, the alarm for CapacityProviderReservation will switch to ALARM state.
CAS will calculate how many instances to add to fit all these tasks.

(3b) New instance created

Once the instance is created, 10 instances will be running and 9 will contain some tasks.

M = 9 | N = 10 | C = 9 / 10 * 100 = 90%

Conclusion:

During the ramp-up and with the CapacityProviderReservation target set to 90%, the ASG keeps 1 instance as headroom at all times.
Obviously there is a cost, but this is helpful to handle CPUUtilization spikes, as provisioning tasks can run on this instance straight away.

After the load test 1

During the scale-down process, you may notice some tasks are terminated, then recreated straight after.
The reason is that there is no guarantee that the instance running no tasks will be the one selected for termination.

For this reason, there is the option of having ECS dynamically manage instance termination protection on your behalf.
If enabled for a capacity provider, ECS will protect an instance from scale-in if it is running at least one non-daemon task.
Note that this does NOT prevent a Spot Instance from being reclaimed, or the instance being terminated manually.
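For reference, managed termination protection is enabled on the capacity provider itself. Below is a boto3 sketch of the request parameters; the provider name and ASG ARN are placeholders, and the actual API call is left commented out:

```python
# Parameters for ecs.create_capacity_provider (boto3).
# "my-capacity-provider" and the ASG ARN below are placeholders.
params = {
    "name": "my-capacity-provider",
    "autoScalingGroupProvider": {
        "autoScalingGroupArn": "arn:aws:autoscaling:eu-west-1:123456789012:"
                               "autoScalingGroup:uuid:autoScalingGroupName/my-asg",
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 90,  # the CapacityProviderReservation target
        },
        # Requires "instance protection from scale-in" on the ASG itself
        "managedTerminationProtection": "ENABLED",
    },
}

# import boto3
# boto3.client("ecs").create_capacity_provider(**params)
```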

During the 2nd load test (CapacityProviderReservation 50%)

Service Autoscaling Policy: Target Tracking set to 50% for CPUUtilization
CapacityProviderReservation: 50%

(1) Initial state

2 instances running but only 1 contains some tasks.
M = 1 | N = 2 | C = 1 / 2 * 100 = 50%

(2a) 2nd stage: Alarm is triggered and Service Autoscaling needs to add more tasks

Once the tasks are added to the instance, 2 instances will be running and 2 will contain some tasks.
M = 2 | N = 2 | C = 2 / 2 * 100 = 100%

At this point, the alarm for CapacityProviderReservation will switch to ALARM state.
CAS will calculate how many instances to add to fit all these tasks.

(2b) New instances created

Once the instances are created, 4 instances will be running and 2 will contain some tasks.

M = 2 | N = 4 | C = 2 / 4 * 100 = 50%

(3a) 4th stage: Alarm remains in ALARM state and Service Autoscaling needs to add more tasks

Before the provisioning tasks are added to the instances, 12 instances will be running and 5 will contain some tasks.
M = 5 | N = 12 | C = 5 / 12 * 100 = 41.67%

Once the tasks are added to the instances, 12 instances will be running and 7 will contain some tasks.
M = 7 | N = 12 | C = 7 / 12 * 100 = 58.33%

At this point, the alarm for CapacityProviderReservation will switch to ALARM state.
CAS will calculate how many instances to add to fit all these tasks.

(3b) New instances created

Once the instances are created, 18 instances will be running and 8 will contain some tasks.

M = 8 | N = 18 | C = 8 / 18 * 100 = 44.44%

Conclusion:

During the ramp-up and with the CapacityProviderReservation target set to 50%, the ASG keeps about 50% spare capacity at all times.
Here, this is definitely not cost-effective, but it can handle higher spikes more easily.
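The headroom each target keeps can be compared directly from the formula. A small sketch of that arithmetic (function name is mine):

```python
import math

def headroom(m, target):
    """Spare instances once the ASG settles at the smallest N
    with M / N * 100 <= target."""
    n = math.ceil(m * 100 / target)
    return n - m

# A 90% target keeps about 1 spare instance regardless of M;
# a 50% target roughly doubles the fleet.
```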

And now the behaviour with other options…

These options have an impact on CAS behaviour and might not be desired.

CAS option: Protected from scale-in option

This option is available only if the ASG also has the “instance protection from scale-in” option enabled.

As mentioned above at the end of the load test, some tasks were terminated, then recreated.
During an ASG scale-down, there is no guarantee that an instance with no tasks will be selected for termination.
In that case, the tasks are terminated first, followed by the selected instance.
But AWS has to satisfy the desired number of tasks specified at the ECS Service level, so AWS will then spin some tasks back up.

To prevent this behaviour, set the “Protected from scale-in” option to enabled, and this is what you should see:

ECS Service Option: Task Placement set to spread

(1) State at the end of the load test

20 instances are running but only 10 contain some tasks.
M = 10 | N = 20 | C = 10 / 20 * 100 = 50%

And finally…

AWS recommends that you create a new empty Auto Scaling group to use with a capacity provider rather than using an existing one.
If you use an existing Auto Scaling group, any Amazon EC2 instances associated with the group that were already running and registered to an Amazon ECS cluster prior to the Auto Scaling group being used to create a capacity provider may not be properly registered with the capacity provider.
This may cause issues when using the capacity provider in a capacity provider strategy.
The DescribeContainerInstances API can confirm whether a container instance is associated with a capacity provider or not.

This is very important information from AWS, as it implies creating another ASG and brings a bit of complexity to avoid any downtime.
However, after a few tests on existing ASGs, we haven’t noticed any issue or any instances incorrectly registered with the capacity provider.
To make sure all instances are registered, we ran the command:

aws ecs describe-container-instances --region {your_region} --cluster {your_cluster} --container-instances {your_container_instances}

You can see in the output that there is a new field, capacityProviderName, which confirms the instances are correctly registered:
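The same check can be scripted against the describe-container-instances response. The field names below match the boto3 output shape; the boto3 call itself is left commented out so the snippet stays self-contained:

```python
def unregistered_instances(response):
    """ARNs of container instances missing the capacityProviderName field."""
    return [
        ci["containerInstanceArn"]
        for ci in response["containerInstances"]
        if not ci.get("capacityProviderName")
    ]

# import boto3
# response = boto3.client("ecs").describe_container_instances(
#     cluster="your_cluster", containerInstances=["..."])
```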

Issues during CAS setup:

When using managed scaling, the ASG must not have any scaling policies attached to it other than the ones Amazon ECS creates, otherwise the Amazon ECS created scaling plans will receive an ActiveWithProblems error.
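A quick way to audit this is to list the ASG’s scaling policies and flag any that ECS did not create. In my tests the ECS-managed policy name carries an ECSManagedAutoScalingPolicy prefix, but that naming is not a documented contract, so verify it in your own account:

```python
def foreign_policies(policies, ecs_prefix="ECSManagedAutoScalingPolicy"):
    """Names of scaling policies on the ASG that ECS did not create."""
    return [p["PolicyName"] for p in policies
            if not p["PolicyName"].startswith(ecs_prefix)]

# import boto3
# policies = boto3.client("autoscaling").describe_policies(
#     AutoScalingGroupName="your_asg")["ScalingPolicies"]
```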

Make sure you use a recent ECS agent version. You can find the version in your ECS cluster, under ECS Instances.

I hope you now have a better understanding of how CAS works, how the related metric is calculated, and how to set it up to fit your needs.

Lead Backend / SRE Engineer
