Better avoid and handle AWS API throttling #459

Open
wants to merge 5 commits into
base: main

@nocnokneo commented Jun 11, 2025

The current ECS implementation has a couple of issues with AWS API throttling that are especially problematic when starting a large cluster (over ~100 workers):

  • API calls are not used efficiently. For example, each Worker makes its own call to ecs.list_account_settings to determine whether taskLongArnFormat is enabled, rather than doing this once at the ECSCluster level. This PR optimizes some of the easy cases like this. The most notable one NOT addressed is ecs.describe_tasks, which each worker polls while waiting to come up.
  • Throttle backoff handling is not implemented for all API calls that need it, and none of the existing handlers use jittered backoff. As a result, many workers back off at exactly the same rate and all make their retry requests at the same time. The solution implemented here is to leverage the backoff support built into boto3, which uses jittered exponential backoff.
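To illustrate why jitter matters for the second point, here is a minimal, stdlib-only sketch. It is not the code in this PR and not boto3's internal implementation; the helper names `fixed_delay` and `full_jitter_delay`, the 1-second base, and the 20-second cap are assumptions chosen for illustration. It compares a plain capped exponential delay with a "full jitter" variant across 100 simulated workers that were throttled at the same moment.

```python
import random


def fixed_delay(attempt, base=1.0, cap=20.0):
    """Capped exponential delay with no jitter (hypothetical helper)."""
    return min(cap, base * (2 ** attempt))


def full_jitter_delay(attempt, base=1.0, cap=20.0, rng=random):
    """Draw a delay uniformly from [0, capped exponential delay]."""
    return rng.uniform(0.0, fixed_delay(attempt, base, cap))


# Simulate 100 workers that all hit a throttling error on the same attempt.
rng = random.Random(0)  # seeded only to make the illustration repeatable
workers = 100
attempt = 3

fixed = {fixed_delay(attempt) for _ in range(workers)}
jittered = {full_jitter_delay(attempt, rng=rng) for _ in range(workers)}

# Without jitter every worker picks the same delay and retries in lockstep.
print("distinct delays without jitter:", len(fixed))
# With jitter the retries are spread across the whole backoff window.
print("distinct delays with jitter:", len(jittered))
```

Without jitter, all workers compute the identical delay, so the retry burst simply reappears a few seconds later. With jitter, the same number of retries is spread over the interval, which is the behavior this PR relies on the client library to provide.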

References:

Comment on lines -175 to -189
while True:
    try:
        [self.task] = (
            await ecs.describe_tasks(
                cluster=self.cluster_arn, tasks=[self.task_arn]
            )
        )["tasks"]
    except ClientError as e:
        if e.response["Error"]["Code"] == "ThrottlingException":
            wait_duration = min(wait_duration * 2, 20)
        else:
            raise
    else:
        break
    await asyncio.sleep(wait_duration)
Member

Am I right in understanding that we are removing our retry logic here and leveraging the built-in retries in aiobotocore?

Author

Correct. Sorry, I put this draft PR up as a placeholder and hadn't yet added a detailed description. Done now.
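For readers unfamiliar with the built-in retries being discussed, a rough sketch of how they are typically enabled follows. The `retries` option of `botocore.config.Config` and its `max_attempts` and `mode` keys are part of botocore; the specific values, and the names `RETRY_SETTINGS` and `make_client_config`, are assumptions for illustration and are not taken from this PR's diff.

```python
# Hypothetical values: the settings this PR actually uses are not shown in
# this thread. botocore accepts the modes "legacy", "standard", and "adaptive".
RETRY_SETTINGS = {"max_attempts": 10, "mode": "adaptive"}


def make_client_config():
    """Build a client config that turns on botocore's built-in retry handling."""
    # Imported lazily so this sketch can be loaded without botocore installed.
    from botocore.config import Config

    return Config(retries=RETRY_SETTINGS)
```

The resulting object would be passed as the `config=` argument when the ECS client is created, so throttled calls such as describe_tasks are retried by the client library instead of by a hand-written loop like the one removed above.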

@nocnokneo marked this pull request as ready for review June 12, 2025 15:23