
Commit 7f93d68

Merge pull request kubernetes#5368 from tallclair/ippr-mem-limits
KEP-1287: Allow memory limit decreases
2 parents 9d1f63c + c7b9304 commit 7f93d68

1 file changed: +14, -8 lines

keps/sig-node/1287-in-place-update-pod-resources/README.md

@@ -397,9 +397,6 @@ WindowsPodSandboxConfig.
 configurations will see values that may not represent actual configurations. As a
 mitigation, this change needs to be documented and highlighted in the
 release notes, and in top-level Kubernetes documents.
-1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
-could be in use, and approaches such as setting limit near current usage may
-be required. This issue needs further investigation.
 1. Scheduler race condition: If a resize happens concurrently with the scheduler evaluating the node
 where the pod is resized, it can result in a node being over-scheduled, which will cause the pod
 to be rejected with an `OutOfCPU` or `OutOfMemory` error. Solving this race condition is out of
@@ -847,11 +844,17 @@ Setting the memory limit below current memory usage can cause problems. If the k
 sufficient memory, the outcome depends on the cgroups version. With cgroups v1 the change will
 simply be rejected by the kernel, whereas with cgroups v2 it will trigger an oom-kill.
 
-In the initial beta release of in-place resize, we will **disallow** `PreferNoRestart` memory limit
-decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
-future, but the design of how limit decreases will be approached is still undecided.
+If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
+**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
+guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
+usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
+skipped for that container. The pod condition `PodResizeInProgress` will remain, with an `Error`
+reason, and a message reporting the current usage & desired limit. This is considered best-effort
+since it is still subject to a time-of-check-time-of-use (TOCTOU) race condition where the usage exceeds the limit after the
+check is performed. A similar check will also be performed at the pod level before lowering the pod
+cgroup memory limit.
 
-Memory limit decreases with `RestartRequired` are still allowed.
+_Version skew note:_ Kubernetes v1.33 (and earlier) nodes only check the pod-level memory usage.
 
 ### Swap
 
@@ -996,7 +999,8 @@ This will be reconsidered post-beta as a future enhancement.
 
 ### Future Enhancements
 
-1. Allow memory limits to be decreased, and handle the case where limits are set below usage.
+1. Improve memory limit decrease oom-kill prevention by leveraging other kernel mechanisms or using
+gradual decreases.
 1. Kubelet (or Scheduler) evicts lower priority Pods from Node to make room for
 resize. Pre-emption by Kubelet may be simpler and offer lower latencies.
 1. Allow ResizePolicy to be set on Pod level, acting as default if (some of)
@@ -1650,6 +1654,8 @@ _This section must be completed when targeting beta graduation to a release._
 and update CRI `UpdateContainerResources` contract
 - Add back `AllocatedResources` field to resolve a scheduler corner case
 - Introduce Actuated resources for actuation
+- 2025-06-03 - v1.34 post-beta updates
+- Allow no-restart memory limit decreases
 
 ## Drawbacks
 
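The best-effort check introduced in the second hunk (read the container's memory usage before a limit decrease and skip the resize if usage already exceeds the desired limit) can be sketched in Go. This is a minimal illustration under stated assumptions, not the Kubelet's implementation: `usageReader` is a hypothetical stand-in for the StatsProvider, and `resizeRequest`, `usageExceedsLimitError`, `checkMemoryLimitDecrease`, and `fakeStats` are illustrative names. It assumes the memory resize restart policy is `NotRequired` and covers only the container-level check.

```go
package main

import "fmt"

// usageReader is a hypothetical stand-in for the Kubelet's StatsProvider:
// it reports a container's current memory usage in bytes.
type usageReader interface {
	MemoryUsageBytes(container string) (int64, error)
}

// resizeRequest is a hypothetical description of a pending memory limit change.
type resizeRequest struct {
	Container    string
	CurrentLimit int64 // bytes
	DesiredLimit int64 // bytes
}

// usageExceedsLimitError carries the data the KEP says is reported in the
// PodResizeInProgress condition message (current usage and desired limit).
type usageExceedsLimitError struct {
	Container string
	Usage     int64
	Desired   int64
}

func (e *usageExceedsLimitError) Error() string {
	return fmt.Sprintf("container %q memory usage (%d) exceeds desired limit (%d)",
		e.Container, e.Usage, e.Desired)
}

// checkMemoryLimitDecrease returns nil if the resize may proceed, or an error
// if it should be skipped. Only decreases are checked. The result is
// best-effort: usage can still grow past the new limit after this check
// returns (the TOCTOU race described in the KEP).
func checkMemoryLimitDecrease(r usageReader, req resizeRequest) error {
	if req.DesiredLimit >= req.CurrentLimit {
		return nil // not a decrease, so lowering the limit cannot cause an oom-kill
	}
	usage, err := r.MemoryUsageBytes(req.Container)
	if err != nil {
		return fmt.Errorf("reading memory usage for %q: %w", req.Container, err)
	}
	if usage > req.DesiredLimit {
		return &usageExceedsLimitError{Container: req.Container, Usage: usage, Desired: req.DesiredLimit}
	}
	return nil
}

// fakeStats is an in-memory usageReader used only for this example.
type fakeStats map[string]int64

func (f fakeStats) MemoryUsageBytes(container string) (int64, error) {
	usage, ok := f[container]
	if !ok {
		return 0, fmt.Errorf("no stats for container %q", container)
	}
	return usage, nil
}

func main() {
	const mi = int64(1 << 20)
	stats := fakeStats{"app": 300 * mi}

	// 512Mi -> 256Mi while using 300Mi: the resize is skipped.
	fmt.Println(checkMemoryLimitDecrease(stats, resizeRequest{Container: "app", CurrentLimit: 512 * mi, DesiredLimit: 256 * mi}))
	// prints: container "app" memory usage (314572800) exceeds desired limit (268435456)

	// 512Mi -> 400Mi while using 300Mi: the resize may proceed.
	fmt.Println(checkMemoryLimitDecrease(stats, resizeRequest{Container: "app", CurrentLimit: 512 * mi, DesiredLimit: 400 * mi}))
	// prints: <nil>
}
```

The same comparison would apply at the pod level, against the pod cgroup's usage, before lowering the pod cgroup memory limit. Either way the check only narrows the window for an oom-kill; it cannot close it, because usage may grow between reading the stats and writing the new cgroup limit.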