Proposal: New implementation of ECSCluster #330

@kinghuang

I use ECSCluster heavily, but I find it has lots of small implementation issues that make it hard to reliably run large Dask clusters. Here's a laundry list of some of the problems I've run into.

  • Task names are always derived from the cluster name. In a shared ECS cluster, task definitions overlap between multiple ECSCluster instances, making it impossible to tell which tasks belong to which Dask cluster.
  • API rate limits are not handled properly. Combined with parsing logs for addresses (related: "Use ECS API to set Worker/Scheduler address instead of parsing logs" #313 and "RuntimeWarning: get_log_events rate limit exceeded" #121), large clusters are hard to instantiate reliably because the worker IP addresses can't be found.
  • ECSCluster directly instantiates tasks without using a service. There's no good way to do placement strategies like binpack.
  • Exited tasks are not handled and rescheduled. When workers run on spot instances, the Dask cluster can gradually lose workers as spot instances come and go.
  • There's no way to configure capacity providers for tasks.
  • There's no way to configure different subnets, environment variables, etc. for schedulers vs workers.
  • There's no way to configure the log driver to something other than awslogs.
  • Scaling a cluster while a previous scale is still in progress sometimes fails.
  • Too many IAM permissions are required, even when using pre-existing ECS clusters and resources.
  • Deprovisioning of tasks for both workers and schedulers is not clean (related: "ECSCluster does not de-provision tasks after failing to connect to scheduler" #262).
  • Closing the client and cluster objects results in dangling hooks.
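
To illustrate the rate-limit item above, here is a minimal sketch of retrying a throttled call with exponential backoff and jitter. It is not how ECSCluster currently behaves and not a proposed final design. `ThrottlingError` and `call_with_backoff` are hypothetical names; in real code the exception would likely be a `botocore.exceptions.ClientError` whose error code indicates throttling.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for an AWS API rate-limit error."""


def call_with_backoff(func, *, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call func(), retrying on ThrottlingError.

    The wait before retry k (0-based) is a random fraction of
    min(max_delay, base_delay * 2**k), i.e. exponential backoff with jitter.
    The last failed attempt re-raises instead of sleeping.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)


if __name__ == "__main__":
    state = {"calls": 0}

    def describe_tasks():
        # Hypothetical API call that is throttled on its first two attempts.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ThrottlingError("rate exceeded")
        return ["task-a", "task-b"]

    print(call_with_backoff(describe_tasks, base_delay=0.01))
```

The jitter keeps many workers that were throttled at the same moment from retrying in lockstep. botocore also ships its own retry handling, configurable through `botocore.config.Config(retries={...})`, which may be enough on its own for some calls.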

Rather than trying to morph the existing ECSCluster class, would this project be open to a completely new implementation (ECSCluster2)? I anticipate that API changes will be required (i.e., to the arguments of ECSCluster). I'm willing to tackle this myself.
