Add a proposal for integrating volume limits into cluster autoscaler #5031
Conversation
gnufied commented Jan 9, 2025
- One-line PR description: Integrate CSI Volume attach limits with cluster autoscaler (#5030)
- Other comments:
This looks like a draft to me @gnufied

/retitle [WIP] Add a proposal for integrating volume limits into cluster autoscaler

Feel free to amend the PR title if that's not correct.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
(force-pushed: e387396 → 0c63a15)
/remove-lifecycle rotten
Co-authored-by: Drew Hagen <[email protected]>
- Since a node typically does not come up with a CSI driver already installed, too many pods may get scheduled onto it, more than the node can actually support. This leaves a bunch of pods simply stuck.
Once cluster-autoscaler is aware of CSI volume attach limits, we can fix Kubernetes' built-in scheduler to not schedule pods to nodes that don't have the CSI driver installed.
This motivation is confusing to me.
You want cluster-autoscaler to be aware of volumes that can be attached to a node.
Why are you mentioning the CSI driver on a node not being present? That seems unrelated to the number of volumes?
Okay, you are calling out two different problems and targeting them together.
## Design Details
Reading this design detail, I am thinking of #5328.
Hmm, it does have some overlap, but that KEP is IMO different and I don't see how it integrates with the autoscaler yet (maybe they need to integrate those node capabilities with the autoscaler at some point).
#### Alpha
- All of the planned code changes for alpha will be done in cluster-autoscaler, not in the k/k repository.
"not in k/k repository"?
So this first alpha stage is only cluster-autoscaler and your k/k PR will not be in scope for alpha?
So, the change in the main k/k repository is pretty small and can only be enabled once the autoscaler changes have gone GA. In fact, to address some backward compatibility concerns, we have to be careful not to enable the scheduler changes sooner. I have covered this in the Notes and Caveats section.
## Production Readiness Review Questionnaire
Given the new requirements on KEPs (promoting to beta should require the feature to be functionally complete), I ask that you fill out the PRR as much as you can.
We also propose that, if a given node is not reporting any installed CSI drivers, we do not schedule pods that need CSI volumes to that node.
The proposed change is small and a draft PR is available here: https://github.com/kubernetes/kubernetes/pull/130702
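
For illustration, a minimal sketch of that check, using the `CSINode` type from `k8s.io/api/storage/v1`; `fitsOnNode` and `requiredDrivers` are hypothetical names, not code from the draft PR:

```go
package sketch

import storagev1 "k8s.io/api/storage/v1"

// fitsOnNode (hypothetical helper) returns false when a pod requires CSI
// drivers that the node's CSINode object has not (yet) reported as installed.
func fitsOnNode(requiredDrivers []string, csiNode *storagev1.CSINode) bool {
	installed := map[string]bool{}
	if csiNode != nil {
		for _, d := range csiNode.Spec.Drivers {
			installed[d.Name] = true
		}
	}
	for _, name := range requiredDrivers {
		if !installed[name] {
			// Driver not reported: treat the node as unable to attach any
			// volumes for this driver, rather than as having no limit.
			return false
		}
	}
	return true
}
```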
This PR does not reference the feature gate you mention here?
Will this work be gated by the same feature gate or are these two separate features?
Yes, that is just a POC. First we want to fix this in cluster-autoscaler before we target kube-scheduler; the crux of the code will not change. It will be a pretty small change in the main k/k repository.
### Monitoring Requirements
try to answer these.
### Scalability
Please fill this section out.
Please try to fill out the PRR as much as you can.
Co-authored-by: Kevin Hannon <[email protected]>
/hold

Since this is a storage PR, I think a hold is sufficient to get an ACK/approval from autoscaling and scheduling. Please unhold once you have a review from autoscaling and scheduling.
(force-pushed: 203a91c → e71d1d6)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
An order-dependent design makes disablement and re-enablement hard. What can we do to avoid an order-dependent design?
For the case of a newer scheduler running with an older autoscaler, we propose that the `VolumeLimitScaling` feature should be disabled in the scheduler.
this coordination sounds ripe for human error
I have pushed an update that clarifies this scheduler and autoscaler coupling. PTAL.
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. This will simply restore the old behaviour.
When the feature is disabled, what removes the existing labels? Do they remain? If they remain, what order should the various processes be updated?
Which label are you thinking of? The GPULabel-style label that this proposes is one way of solving it, and those labels can be removed after the node is considered "ready".
But that is not the only way of solving this problem. Once #5233 is implemented (even if alpha), we could start to have one true mechanism for node readiness.
###### What happens if we reenable the feature if it was previously rolled back?
It should work fine.
similar question about rollout order. The statement above about new schedulers having trouble suggests this needs consideration.
I have pushed an update that clarifies this scheduler and autoscaler coupling. PTAL.
###### Are there any tests for feature enablement/disablement?
If we have an order-dependent feature, the unit testing needs to be thorough. It is far better to have a feature that "just works" regardless of the order.
Reviewing from SIG Autoscaling. We need to figure out the finer details about changes in CA, but the overall approach seems correct and implementable. Not sure if the plan is to still get this in the 1.34 cycle, I'm in favor if so.
### Notes/Constraints/Caveats (Optional)
Scheduler changes can only be merged after the cluster-autoscaler changes have gone GA and there are no remaining concerns about the scheduler changes.
Is this a strict requirement, or could they both be done in the same release behind the same feature gate? Wouldn't using a single feature gate solve problems related to coordinating CA/scheduler versions in the cluster and upgrades/downgrades?
### Scaling a node-group that already has one or more nodes
1. We propose adding a label, similar to GPULabel, to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.
Two points about reusing the GPULabel pattern:
- It's a pretty limited mechanism that we had trouble extending for DRA. We can reuse it as a stop-gap if necessary, but IMO integrating CSI drivers with the Node Readiness Gates proposal should be the long-term solution.
- If the scale-from-0-nodes logic is reliably solved, it can be used for correctly determining Node readiness. A Node can be compared to the scale-from-0 template of its NodeGroup, and considered unready until all CSI drivers present on the template are present on the Node. This is what we did for DRA: "Handle node readiness for DRA after a scale-up" (kubernetes/autoscaler#8109). See the sketch after this list.
IMO relying on the scale-from-0 logic until the long-term solution lands would be better than adding separate labels just for readiness.
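
A rough sketch of that readiness comparison, assuming the autoscaler can obtain a template CSINode for the NodeGroup (helper names are hypothetical):

```go
package sketch

import storagev1 "k8s.io/api/storage/v1"

// driverNames collects the CSI driver names reported by a CSINode.
func driverNames(cn *storagev1.CSINode) map[string]bool {
	names := map[string]bool{}
	if cn != nil {
		for _, d := range cn.Spec.Drivers {
			names[d.Name] = true
		}
	}
	return names
}

// csiReady (hypothetical helper) treats a node as unready until it reports
// every CSI driver that the scale-from-0 template of its NodeGroup expects.
func csiReady(templateCSINode, nodeCSINode *storagev1.CSINode) bool {
	reported := driverNames(nodeCSINode)
	for name := range driverNames(templateCSINode) {
		if !reported[name] {
			return false
		}
	}
	return true
}
```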
3. We propose that, when saving `ClusterState`, we capture and add `csiDrivers` information to all existing nodes.
Can we be more specific here? Does this mean listing CSINodes in `StaticAutoscaler.RunOnce()` and extending `ClusterSnapshot.SetClusterState()` to take a list of CSINodes?
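
Purely illustrative, one possible shape of such an extension; the interface below is a reduced stand-in, not cluster-autoscaler's actual `ClusterSnapshot` signature:

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

// ClusterSnapshot is a simplified stand-in for cluster-autoscaler's snapshot
// abstraction, extended so that CSINode objects are captured alongside Nodes
// and scheduled Pods when the cluster state is (re)set each loop.
type ClusterSnapshot interface {
	SetClusterState(
		nodes []*apiv1.Node,
		scheduledPods []*apiv1.Pod,
		csiNodes []*storagev1.CSINode, // new: one CSINode per node, if published
	) error
}
```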
4. We propose that, when getting `nodeInfosForGroups`, the returned `nodeInfo` map also contains CSI driver information, which can be used later for scheduling decisions.
Again, can you be more specific here? How does `TemplateNodeInfoProvider` get the CSI driver information? How do we provide it with CSINodes for existing Nodes? How do we "sanitize" a CSINode when creating a template? I assume we just copy the allocatable, but I'd include it here for completeness.
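
For completeness, a sketch of the "just copy the allocatable" interpretation; `sanitizeCSINode` is a hypothetical helper, while the field names come from `k8s.io/api/storage/v1`:

```go
package sketch

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sanitizeCSINode (hypothetical) builds a CSINode for a fresh template node
// by copying driver names and per-driver allocatable attach counts from an
// example node, renaming the object to match the template node (a CSINode
// shares its name with the node it describes).
func sanitizeCSINode(example *storagev1.CSINode, freshNodeName string) *storagev1.CSINode {
	out := &storagev1.CSINode{ObjectMeta: metav1.ObjectMeta{Name: freshNodeName}}
	for _, d := range example.Spec.Drivers {
		nd := storagev1.CSINodeDriver{Name: d.Name}
		if d.Allocatable != nil && d.Allocatable.Count != nil {
			count := *d.Allocatable.Count // per-driver volume attach limit
			nd.Allocatable = &storagev1.VolumeNodeResources{Count: &count}
		}
		out.Spec.Drivers = append(out.Spec.Drivers, nd)
	}
	return out
}
```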
Scaling from zero should work similarly to scaling from 1, but the main problem is that we have no NodeInfo that can tell us what the CSI attach limit would be on a node being spun up in a NodeGroup.
We propose introducing an annotation, similar to the CPU and memory resource annotations in cluster-api, to convey the attach limits available on a node.
This is another example of an existing CA pattern that we'd ideally not reuse here. Not all cloud providers integrated with CA support scale-from-0 tags like cluster-api does. And the ones that do, do it slightly differently because of e.g. different validation rules.
I strongly believe that we should move away from this pattern ASAP, and start configuring these things on the k8s API layer. We have kubernetes/autoscaler#7799 for that, which I plan to start working on relatively soon.
We have to introduce a similar mechanism in the various cloud providers that return Template objects, so that they incorporate volume limits. This will allow us to handle the case of scaling from zero.
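
To make this concrete, a sketch of reading such an annotation; the key below is hypothetical (the KEP does not fix a name) and merely mirrors the existing cluster-api scale-from-0 capacity annotations for CPU and memory:

```go
package sketch

import "strconv"

// Hypothetical annotation key, modeled on the cluster-api scale-from-0
// capacity annotations used for CPU and memory.
const csiAttachLimitAnnotation = "capacity.cluster-autoscaler.kubernetes.io/csi-attach-limit"

// attachLimitFromAnnotations reads a per-node volume attach limit from
// NodeGroup annotations, returning ok=false when it is absent or malformed.
func attachLimitFromAnnotations(annotations map[string]string) (limit int32, ok bool) {
	raw, found := annotations[csiAttachLimitAnnotation]
	if !found {
		return 0, false
	}
	n, err := strconv.ParseInt(raw, 10, 32)
	if err != nil || n < 0 {
		return 0, false
	}
	return int32(n), true
}
```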
## Kubernetes Scheduler change
Could you describe the problematic CA/scheduler interactions wrt volume limits (now and after this proposal) somewhere?
I'm not sure if I'm getting the problem right myself, is this right?
Now:
- Scheduler CSI plugin assumes that "no information about a CSI driver published in a CSINode" means "no limits for volumes from that driver".
- For existing Nodes with CSI driver information already published, CA correctly takes the volume limits into account when running scheduler filters in simulations (e.g. when packing pending Pods on existing Nodes in the cluster at the beginning of the loop).
- For fake "upcoming" Nodes created in-memory by CA during scale-up simulations, the corresponding "upcoming" CSINode is not created/taken into account. So the volume limits are not taken into account when running scheduler filters, which makes CA pack more Pods per Node than actually fit, which makes it undershoot scale-ups.
- For existing Nodes with CSI driver information already published, the scheduler correctly takes the volume limits into account when scheduling.
- For new Nodes with not all CSI driver information published yet, the scheduler can let Pods in that can't actually run on the Node.

After:
- Scheduler CSI plugin assumes that "no information about a CSI driver published in a CSINode" means "0 limit for volumes from that driver" (this semantic flip is sketched after the list).
- No change for existing Nodes with CSI driver information already published - CA and scheduler still behave correctly.
- Scheduler waits until all relevant CSI driver info is published before scheduling a Pod, removing the race condition for new Nodes.
- Cluster Autoscaler correctly simulates "upcoming" CSINodes for "upcoming" Nodes and makes correct scale-up decisions.
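
A sketch of that semantic flip, with the feature gate reduced to a plain boolean (names are illustrative, not the actual scheduler plugin code):

```go
package sketch

import (
	"math"

	storagev1 "k8s.io/api/storage/v1"
)

// attachLimit returns the volume limit to enforce for a driver on a node.
// Today a driver missing from CSINode is treated as "no limit"; under the
// proposed VolumeLimitScaling behaviour it is treated as "0 attachable".
func attachLimit(csiNode *storagev1.CSINode, driver string, volumeLimitScaling bool) int64 {
	if csiNode != nil {
		for _, d := range csiNode.Spec.Drivers {
			if d.Name == driver && d.Allocatable != nil && d.Allocatable.Count != nil {
				return int64(*d.Allocatable.Count)
			}
		}
	}
	if volumeLimitScaling {
		return 0 // proposed: nothing can attach until the driver is reported
	}
	return math.MaxInt64 // current: missing info means no limit is enforced
}
```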
Scaling from zero differs from scaling from 1, because in the former case no CSI volume limit information will be available if no node exists in a NodeGroup.
5. We propose that, when deciding which pods should be considered for scaling nodes in the `podListProcessor.Process` function, we update the hinting_simulator to consider the CSI volume limits of existing nodes. This will allow cluster-autoscaler to know exactly whether all unschedulable pods will fit on recently spun-up or currently running nodes.
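
A minimal sketch of the per-driver fit check such a simulation would need, assuming the simulator can count a node's already-attached volumes per driver (all names hypothetical):

```go
package sketch

// podFitsVolumeLimits (hypothetical) checks whether a pod's CSI volumes fit
// on a node, given per-driver usage and limits. A driver with no known limit
// is treated as having zero capacity, matching the proposed semantics.
func podFitsVolumeLimits(podNeedsByDriver, nodeUsedByDriver, nodeLimitByDriver map[string]int) bool {
	for driver, needed := range podNeedsByDriver {
		limit, known := nodeLimitByDriver[driver]
		if !known {
			return false
		}
		if nodeUsedByDriver[driver]+needed > limit {
			return false
		}
	}
	return true
}
```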
What exact changes to hinting_simulator would be needed? Will the existing caching break because of this proposal?
Making the aforementioned changes should allow us to handle scaling of nodes from 1.
I think we're missing at least 1 more step: adapting `simulator/node_info_utils.go` to propagate the CSI information when "sanitizing" NodeInfos.
I'll take one more pass this afternoon for PRR, but right now the PRR answers haven't clarified robustness to human error in upgrade/downgrade and enable/disable/enable.