
Add a proposal for integrating volume limits into cluster autoscaler #5031


Open
gnufied wants to merge 13 commits into master from add-csinode-limit-autoscaler

Conversation

gnufied
Member

@gnufied gnufied commented Jan 9, 2025

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jan 9, 2025
@sftim

sftim commented Jan 12, 2025

This looks like a draft to me @gnufied

/retitle [WIP] Add a proposal for integrating volume limits into cluster autoscaler

Feel free to amend the PR title if that's not correct.

@k8s-ci-robot k8s-ci-robot changed the title Add a proposal for integrating volume limits into cluster autoscaler [WIP] Add a proposal for integrating volume limits into cluster autoscaler Jan 12, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 12, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 12, 2025
@gnufied gnufied force-pushed the add-csinode-limit-autoscaler branch from e387396 to 0c63a15 on June 10, 2025 20:24
@gnufied gnufied changed the title [WIP] Add a proposal for integrating volume limits into cluster autoscaler Add a proposal for integrating volume limits into cluster autoscaler Jun 10, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2025
@gnufied
Member Author

gnufied commented Jun 10, 2025

/remove-lifecycle-rotten

@xing-yang xing-yang removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 10, 2025
Member

@drewhagen drewhagen left a comment


Hello @gnufied @dobsonj 👋, 1.34 Enhancements team here.

A friendly heads up that these fields should be ready to be updated to satisfy the requirement for the upcoming Enhancements Freeze. Please see more details in my comment on the KEP issue.

Thanks!

@gnufied
Member Author

gnufied commented Jun 11, 2025

/assign @kannon92 @jsafrane


- Since a node typically does not come up with a CSI driver already installed, too many pods often get scheduled onto a node that may not be able to support them in the first place. This leaves a bunch of pods stuck.

Once cluster-autoscaler is aware of CSI volume attach limits, we can fix Kubernetes' built-in scheduler to not schedule pods to nodes that don't have the CSI driver installed.
Contributor


This motivation is confusing to me.

You want cluster-autoscaler to be aware of volumes that can be attached to a node.

Why are you mentioning CSI driver on a node not being present? That seems to be unrelated to the number of volumes?

Contributor


Okay, you are calling out two different problems and targeting them together.

Consider including folks who also work outside the SIG or subproject.
-->

## Design Details
Contributor


Reading this design detail, I am thinking of #5328.

Member Author


hmm, it does have some overlap but that KEP is IMO different and I don't see how it integrates with autoscaler yet (maybe they need to integrate those node capabilities with autoscaler at some point).


#### Alpha

- All of the planned code changes for alpha will be done in cluster-autoscaler and not in k/k repository.
Contributor


"not in k/k repository"?

So this first alpha stage is only cluster-autoscaler and your k/k PR will not be in scope for alpha?

Member Author


So, the change in the main k/k repository is pretty small and can only be enabled once the autoscaler changes have gone GA. In fact, because of some backward-compatibility concerns, we have to be careful not to enable the scheduler changes sooner.

I have covered this in the Notes and Caveats section.

CRI or CNI may require updating that component before the kubelet.
-->

## Production Readiness Review Questionnaire
Contributor


Given the new requirements on KEPs (promoting to beta should require the feature to be functionally complete), I ask that you fill out the PRR as much as you can.


We also propose that, if a given node is not reporting any installed CSI drivers, we do not schedule pods that need CSI volumes to that node.

The proposed change is small and a draft PR is available here - https://github.com/kubernetes/kubernetes/pull/130702
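For illustration, the crux of that scheduler-side check could be as small as the Go sketch below; the package, function name, and placement are assumptions, not the draft PR's actual code:

```go
package csilimits

import (
	storagev1 "k8s.io/api/storage/v1"
)

// hasRegisteredCSIDrivers reports whether a node has published at least
// one CSI driver via its CSINode object. A nil CSINode, or an empty
// driver list, means driver registration has not happened yet, and the
// proposal is to keep pods that need CSI volumes off such a node.
func hasRegisteredCSIDrivers(csiNode *storagev1.CSINode) bool {
	return csiNode != nil && len(csiNode.Spec.Drivers) > 0
}
```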
Contributor


This PR does not reference the feature gate you mention here?

Will this work be gated by the same feature gate or are these two separate features?

Member Author

@gnufied gnufied Jun 11, 2025


Yes, that is just a POC. First we want to fix this in cluster-autoscaler before we target kube-scheduler; the crux of the code will not change. It will be a pretty small change in the main k/k repository.

Even if applying deprecation policies, they may still surprise some users.
-->

### Monitoring Requirements
Contributor


try to answer these.

### Scalability

<!--
For alpha, this section is encouraged: reviewers should consider these questions
Contributor


Please fill this section out.

Contributor

@kannon92 kannon92 left a comment


Please try to fill out the PRR as much as you can.

@kannon92
Contributor

kannon92 commented Jun 12, 2025

/hold

Since this is a storage PR, I think a hold is sufficient to get ACK/Approval from autoscaling and scheduling.

Please unhold once you have a review from autoscaling and scheduling.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2025
@gnufied gnufied force-pushed the add-csinode-limit-autoscaler branch from 203a91c to e71d1d6 on June 12, 2025 20:22
@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Jun 16, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied
Once this PR has been reviewed and has the lgtm label, please assign deads2k, gjtempleton for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@deads2k deads2k left a comment


An order dependent design makes disablement and re-enablement hard. What can we do to avoid an order dependent design?

Comment on lines 326 to 327
For the case of newer scheduler running with older autoscaler, we propose that `VolumeLimitScaling` feature
should be disabled in the scheduler.
Contributor


this coordination sounds ripe for human error

Member Author


I have pushed an update that clarifies this scheduler and autoscaler coupling. PTAL.


###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. This will simply cause old behaviour to be restored.
Contributor


When the feature is disabled, what removes the existing labels? Do they remain? If they remain, what order should the various processes be updated?

Member Author


Which label are you thinking of? The GPULabel-like label that this proposes is one way of solving it, but that label can be removed after the node is considered "ready".

But that is not the only way of solving this problem. Once #5233 is implemented (even if only alpha), we could start to have one true mechanism for node readiness.


###### What happens if we reenable the feature if it was previously rolled back?

It should work fine.
Contributor


similar question about rollout order. The statement above about new schedulers having trouble suggests this needs consideration.

Member Author


I have pushed an update that clarifies this scheduler and autoscaler coupling. PTAL.


It should work fine.

###### Are there any tests for feature enablement/disablement?
Contributor


if we have an order dependent feature, the unit testing needs to be thorough. It is far better to have a feature that "just works" regardless of the order.


@towca towca left a comment


Reviewing from SIG Autoscaling. We need to figure out the finer details about the changes in CA, but the overall approach seems correct and implementable. Not sure if the plan is still to get this into the 1.34 cycle; I'm in favor if so.


### Notes/Constraints/Caveats (Optional)

Scheduler changes can only be merged after the cluster-autoscaler changes have gone GA and there are no remaining concerns about the scheduler changes.

Is this a strict requirement, or could they both be done in the same release behind the same feature gate? Wouldn't using a single feature gate solve problems related to coordinating CA/scheduler versions in the cluster and upgrades/downgrades?


### Scaling a node-group that already has one or more nodes.

1. We propose a label similar to GPULabel, added to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.
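A hedged Go sketch of how such a label could gate readiness, by analogy with the GPULabel mechanism; the label key and helper name are hypothetical:

```go
package readiness

import (
	v1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

// csiDriverLabel is a hypothetical label key; as with GPULabel, the real
// key would be defined per cloud provider.
const csiDriverLabel = "example.com/expected-csi-driver"

// csiDriverReady reports whether a node that advertises an expected CSI
// driver via the label has actually registered that driver in its
// CSINode. Nodes without the label are unaffected.
func csiDriverReady(node *v1.Node, csiNode *storagev1.CSINode) bool {
	want, ok := node.Labels[csiDriverLabel]
	if !ok {
		return true // node does not expect a CSI driver
	}
	if csiNode == nil {
		return false // driver registration has not started yet
	}
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == want {
			return true
		}
	}
	return false
}
```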

Two points about reusing the GPULabel pattern:

  1. It's a pretty limited mechanism that we had trouble extending for DRA. We can reuse it as a stop-gap if necessary, but IMO integrating CSI drivers with the Node Readiness Gates proposal should be the long-term solution.
  2. If the scale-from-0-nodes logic is reliably solved, it can be used for correctly determining Node readiness. A Node can be compared to the scale-from-0 template of its NodeGroup, and considered unready until all CSI drivers present on the template are present on the Node. This is what we did for DRA: Handle node readiness for DRA after a scale-up autoscaler#8109.

IMO relying on the scale-from-0 logic until the long-term solution lands would be better than adding separate labels just for readiness.
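A minimal sketch of the comparison described in point 2 above, assuming both the scale-from-0 template and the real Node expose CSINode objects (helper name hypothetical):

```go
package readiness

import (
	storagev1 "k8s.io/api/storage/v1"
)

// csiReadyAgainstTemplate reports whether every CSI driver present on the
// NodeGroup's scale-from-0 template CSINode is also registered on the
// real node's CSINode, mirroring the DRA approach in autoscaler#8109.
func csiReadyAgainstTemplate(template, node *storagev1.CSINode) bool {
	if template == nil {
		return true // template expects no CSI drivers
	}
	registered := make(map[string]bool)
	if node != nil {
		for _, d := range node.Spec.Drivers {
			registered[d.Name] = true
		}
	}
	for _, d := range template.Spec.Drivers {
		if !registered[d.Name] {
			return false
		}
	}
	return true
}
```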


3. We propose that, when saving `ClusterState`, we capture and add `csiDrivers` information to all existing nodes.
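One possible shape for this, sketched below; the extended signature is an assumption rather than settled design:

```go
package snapshot

import (
	v1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

// Hypothetical extension of the snapshot API: CSINodes are captured
// alongside Nodes and scheduled Pods so that scheduler-filter
// simulations can see each node's per-driver attach limits.
type ClusterSnapshot interface {
	SetClusterState(nodes []*v1.Node, scheduledPods []*v1.Pod, csiNodes []*storagev1.CSINode) error
}
```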

Can we be more specific here? Does this mean listing CSINodes in StaticAutoscaler.RunOnce() and extending ClusterSnapshot.SetClusterState() to take a list of CSINodes?


3. We propose that, when saving `ClusterState`, we capture and add `csiDrivers` information to all existing nodes.

4. We propose that, when getting `nodeInfosForGroups`, the returned nodeInfo map also contains CSI driver information, which can be used later for scheduling decisions.

Again, can you be more specific here? How does TemplateNodeInfoProvider get the CSI driver information? How do we provide it with CSINodes for existing Nodes? How do we "sanitize" a CSINode when creating a template? I assume we just copy the allocatable, but I'd include it here for completeness.
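Under that assumption (copy the drivers and their allocatable counts, rename the object), the sanitization step could be as simple as this sketch (helper name hypothetical):

```go
package nodeinfo

import (
	storagev1 "k8s.io/api/storage/v1"
)

// sanitizedCSINode builds a CSINode for a fresh template node: driver
// names and Allocatable.Count (the per-driver attach limit) are kept
// verbatim by the deep copy, and only the object is renamed to match
// the sanitized template node.
func sanitizedCSINode(orig *storagev1.CSINode, templateNodeName string) *storagev1.CSINode {
	if orig == nil {
		return nil
	}
	out := orig.DeepCopy()
	out.Name = templateNodeName
	return out
}
```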


Scaling from zero should work similarly to scaling from 1, but the main problem is that we do not have a NodeInfo that can tell us what the CSI attach limit would be on a node being spun up in a NodeGroup.

We propose introducing an annotation, similar to the CPU and memory capacity annotations in cluster-api, to convey the attach limits available on a node.

This is another example of an existing CA pattern that we'd ideally not reuse here. Not all cloud providers integrated with CA support scale-from-0 tags like cluster-api does. And the ones that do, do it slightly differently because of e.g. different validation rules.

I strongly believe that we should move away from this pattern ASAP, and start configuring these things at the k8s API layer. We have kubernetes/autoscaler#7799 for that, which I plan to start working on relatively soon.

We have to introduce a similar mechanism in the various cloud providers that return Template objects, so they incorporate volume limits. This will allow us to handle the case of scaling from zero.
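By analogy with the existing `capacity.cluster-autoscaler.kubernetes.io/cpu` and `.../memory` scale-from-zero annotations in the cluster-api provider, the new key and its parsing could look like the sketch below; the annotation name is hypothetical, not an existing key:

```go
package clusterapi

import "strconv"

// Hypothetical annotation key, modeled on the existing cluster-api
// capacity annotations for CPU and memory.
const csiAttachLimitAnnotation = "capacity.cluster-autoscaler.kubernetes.io/csi-attach-limit"

// attachLimitFromAnnotations extracts the advertised attach limit from a
// scalable resource's annotations; ok is false when absent or malformed.
func attachLimitFromAnnotations(annotations map[string]string) (limit int32, ok bool) {
	raw, found := annotations[csiAttachLimitAnnotation]
	if !found {
		return 0, false
	}
	n, err := strconv.ParseInt(raw, 10, 32)
	if err != nil || n < 0 {
		return 0, false
	}
	return int32(n), true
}
```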


## Kubernetes Scheduler change

Could you describe the problematic CA/scheduler interactions wrt volume limits (now and after this proposal) somewhere?

I'm not sure if I'm getting the problem right myself, is this right?

Now:

  • Scheduler CSI plugin assumes that "no information about a CSI driver published in a CSINode" means "no limits for volumes from that driver".
  • For existing Nodes with CSI driver information already published, CA correctly takes the volume limits into account when running scheduler filters in simulations (e.g. when packing pending Pods on existing Nodes in the cluster at the beginning of the loop).
  • For fake "upcoming" Nodes created in-memory by CA during scale-up simulations the corresponding "upcoming" CSINode is not created/taken into account. So the volume limits are not taken into account when running scheduler filters, which makes CA pack more Pods per Node than actually fit, which makes it undershoot scale-ups.
  • For existing Nodes with CSI driver information already published, scheduler correctly takes the volume limits into account when scheduling.
  • For new Nodes with not all CSI driver information published yet, scheduler can let Pods in that can't actually run on the Node.

After:

  • Scheduler CSI plugin assumes that "no information about a CSI driver published in a CSINode" means "0 limit for volumes from that driver".
  • No change for existing Nodes with CSI driver information already published - CA and scheduler still behave correctly.
  • Scheduler waits until all relevant CSI driver info is published before scheduling a Pod, removing the race condition for new Nodes.
  • Cluster Autoscaler correctly simulates "upcoming" CSINodes for "upcoming" Nodes and makes correct scale-up decisions.
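If that reading is right, the semantic flip in the scheduler's CSI limits logic amounts to something like this sketch (function and flag names illustrative):

```go
package csilimits

import (
	"math"

	storagev1 "k8s.io/api/storage/v1"
)

// volumeLimit returns how many volumes of the given driver a node can
// accept. The proposal changes the meaning of "driver not published yet"
// from "unlimited" to "zero".
func volumeLimit(csiNode *storagev1.CSINode, driverName string, treatUnpublishedAsZero bool) int64 {
	if csiNode != nil {
		for _, d := range csiNode.Spec.Drivers {
			if d.Name != driverName {
				continue
			}
			if d.Allocatable != nil && d.Allocatable.Count != nil {
				return int64(*d.Allocatable.Count)
			}
			return math.MaxInt64 // published, but no limit reported
		}
	}
	// Driver not published (or no CSINode object at all).
	if treatUnpublishedAsZero {
		return 0 // proposed behaviour
	}
	return math.MaxInt64 // current behaviour
}
```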

scaling from 1, because in the former case no CSI volume limit information will be available
if no node exists in a NodeGroup.

5. We propose that, when deciding which pods should be considered for scaling nodes in the `podListProcessor.Process` function, we update the hinting_simulator to consider the CSI volume limits of existing nodes. This will allow cluster-autoscaler to know exactly whether all unschedulable pods will fit on recently spun-up or currently running nodes.

What exact changes to hinting_simulator would be needed? Will the existing caching break because of this proposal?


5. We propose that, when deciding which pods should be considered for scaling nodes in the `podListProcessor.Process` function, we update the hinting_simulator to consider the CSI volume limits of existing nodes. This will allow cluster-autoscaler to know exactly whether all unschedulable pods will fit on recently spun-up or currently running nodes.

Making the aforementioned changes should allow us to handle scaling of nodes from 1.

I think we're missing at least 1 more step: adapting simulator/node_info_utils.go to propagate the CSI information when "sanitizing" NodeInfos.

@deads2k
Contributor

deads2k commented Jun 20, 2025

I'll take one more pass this afternoon for PRR, but right now the PRR answers haven't clarified robustness to human error in upgrade/downgrade and enable/disable/enable.

Labels

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
  • kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory.)
  • sig/autoscaling (Categorizes an issue or PR as relevant to SIG Autoscaling.)
  • sig/storage (Categorizes an issue or PR as relevant to SIG Storage.)
  • size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)

10 participants