Description
We have an autoscaling policy to reduce the instance count when the site is not heavily loaded, but we don't have one to increase capacity when it is.
This is causing downtime. We always have worker churn within the app tasks, and as the load gets higher, so does the chance that all 4 workers will be down at the same time. When that happens the health check fails and ECS replaces the task, but since we only have one EC2 instance, it can't start a new task before stopping the old one. So we end up with the service down entirely while it makes the switch. The switch usually takes a little more than an hour, which is longer than it should, so it seems like we might also be waiting on some sort of timeout or cooldown.
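To put a rough number on the "all 4 workers down at once" risk, here is a back-of-the-envelope sketch. It assumes each worker is down independently for some fraction of the time, which is a simplification: under real load, worker failures are probably correlated, so the true risk is likely higher. The function name and the sample fractions are made up for illustration and are not measurements from our service.

```python
def p_all_down(p_worker_down: float, workers: int = 4) -> float:
    """Chance that every worker is down at the same moment.

    ASSUMPTION: workers fail independently, each being down a fraction
    `p_worker_down` of the time. Correlated failures would make this worse.
    """
    if not 0.0 <= p_worker_down <= 1.0:
        raise ValueError("p_worker_down must be between 0 and 1")
    return p_worker_down ** workers


if __name__ == "__main__":
    # Hypothetical per-worker downtime fractions, for illustration only.
    for p in (0.05, 0.10, 0.20, 0.40):
        print(f"per-worker down {p:.2f} -> all 4 down {p_all_down(p):.6f}")
```

Under this model the risk grows with the fourth power of the per-worker downtime, so doubling how often a single worker is down makes a full outage 16 times more likely. That matches what we see: fine at low load, and failing health checks once load climbs.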
Increasing the instance count when the load is high, and keeping the desired task count in ECS permanently high, should make it so that a new instance will come up and, hopefully, a new task will be running on it by the time the existing task gets killed due to health check failure. Or possibly it would reduce the chances of health check failure in the first place by taking some of the load. In any case, it seems worth doing.
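As a starting point, a scale-out policy could look something like the sketch below. It assumes the instances are in an EC2 Auto Scaling group and that we'd manage the policy with boto3; if the infrastructure is defined elsewhere (Terraform, CloudFormation), the same parameters would go there instead. The group name, policy name, and 60% CPU target are placeholders, not values from our setup. `DisableScaleIn` is set so that this policy only adds capacity and leaves scale-in to the policy we already have.

```python
from typing import Any, Dict


def scale_out_policy_params(asg_name: str, target_cpu_percent: float = 60.0) -> Dict[str, Any]:
    """Build keyword arguments for the Auto Scaling `put_scaling_policy` call.

    ASSUMPTIONS: the policy name and CPU target are illustrative placeholders.
    """
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "scale-out-on-high-cpu",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": target_cpu_percent,
            # Only scale out here; the existing policy handles scale-in.
            "DisableScaleIn": True,
        },
    }


def apply_policy(autoscaling_client: Any, params: Dict[str, Any]) -> Any:
    """Send the policy using a boto3 Auto Scaling client (or a test double)."""
    return autoscaling_client.put_scaling_policy(**params)
```

Usage would be along the lines of `apply_policy(boto3.client("autoscaling"), scale_out_policy_params("our-asg-name"))`. Note that the group's maximum size also has to be greater than 1, otherwise the policy has no room to add an instance.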