Introduce WG Checkpoint Restore #8508

Open
wants to merge 1 commit into
base: master

adrianreber
Member

As described in sig-wg-lifecycle.md this PR is the next step after sending an email to [email protected] about the creation of the Working Group Checkpoint Restore.

CC: @rst0git, @viktoriaas, @xhejtman

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 3, 2025
@k8s-ci-robot k8s-ci-robot requested review from deads2k and macsko July 3, 2025 13:33
@k8s-ci-robot
Contributor

Welcome @adrianreber!

It looks like this is your first PR to kubernetes/community 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/community has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adrianreber
Once this PR has been reviewed and has the lgtm label, please assign saschagrunert for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/cli Categorizes an issue or PR as relevant to SIG CLI. labels Jul 3, 2025
@k8s-ci-robot
Contributor

Hi @adrianreber. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jul 3, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Jul 3, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Jul 3, 2025
@kannon92
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 10, 2025
@kannon92
Contributor

Looking at #8519,

I see that we are missing a charter.

@adrianreber
Member Author

> Looking at #8519,
>
> I see that we are missing a charter.

In https://github.com/kubernetes/community/blob/master/sig-wg-lifecycle.md#GitHub it says to add a charter once this initial PR has been merged. That's why I skipped it.

the integration of Checkpoint/Restore functionality into Kubernetes.

charter_link: charter.md
stakeholder_sigs:
Member

sig auth may have a big say in security of this whole restoration pipeline

Member

@rst0git rst0git Jul 19, 2025

Thank you for pointing this out! Security is definitely an important topic that we need to discuss with sig-auth, both for the checkpoint API and the restoration pipeline. The following paper and master's thesis describe our recent work on this topic:

Member Author

I added sig auth to the list of stakeholder sigs

Member

this showed up in the sig-auth meeting, we may have missed the discussion around this WG

if this WG is contemplating taking state from a running pod / saving it / letting it be consumed on another node or from another pod or another namespace, then sig-auth is definitely interested in making sure the permissions model around that exists and is ~consistent with similar things Kubernetes does elsewhere (like PVC / snapshots)

We're happy to consult on that, I'm not sure our awareness / involvement rises to the level of sponsoring the WG :)

cc @kubernetes/sig-auth-leads

This working group aims to provide a central location for the community to discuss
the integration of Checkpoint/Restore functionality into Kubernetes.

charter_link: charter.md
Member

is charter included into this PR?

Member Author

now it is, I didn't add it initially as the lifecycle document mentions that it is added later, but looking at the WG PRs it seems to be common to have a charter in the initial PR.

@k8s-ci-robot k8s-ci-robot added the committee/steering Denotes an issue or PR intended to be handled by the steering committee. label Jul 20, 2025
@enj enj added this to SIG Auth Jul 21, 2025
@enj enj moved this to Needs Triage in SIG Auth Jul 21, 2025
@adrianreber adrianreber marked this pull request as ready for review July 22, 2025 16:02
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 22, 2025
@adrianreber
Member Author

/test pull-community-verify

@adrianreber
Member Author

/verify-owners

the integration of Checkpoint/Restore functionality into Kubernetes.

charter_link: charter.md
stakeholder_sigs:
Member

This is a valuable initiative. The charter mentions that the scope includes checkpointing and restoring 'workloads' and providing 'guidance for developers on checkpoint-friendly app design.' Given this focus, it's essential for SIG Apps to be involved as a key stakeholder.

Member

@janetkuo This is a good idea, thank you so much for suggesting it!

Member Author

Yes, thanks @janetkuo. I added SIG Apps to the proposal.

- Investigate and propose Kubernetes APIs for checkpoint/restore operations.
- Work with SIGs for the best integration of checkpoint/restore functionality
and APIs.
- Provide guidance for developers on checkpoint-friendly app design and
Member

@SergeyKanzhelev SergeyKanzhelev Jul 28, 2025

there may be an API needed to communicate between the app and the API server that the checkpoint is requested AND/OR that the app is ready for checkpoint. Something that goes beyond just guidance.

Member Author

That is actually something we have discussed for years in the context of containers (outside of Kubernetes), but we never found the right way to do it. We were looking at kernel interfaces or systemd interfaces, because for many applications it could be helpful to free temporary memory to reduce checkpoint size or even drop confidential information. Also, after restore it would be good to tell the application that certain cryptographic values may need to be reset or regenerated. I will try to include something mentioning this. Thanks.
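
One way to picture the notification idea discussed in this thread is a pair of application-side hooks: one that runs before a checkpoint to shrink or scrub state, and one that runs after restore to regenerate secrets. The sketch below is purely illustrative. `CheckpointHooks`, `on_pre_checkpoint`, `on_post_restore`, and the `App` class are hypothetical names, and the code does not correspond to any existing Kubernetes, kernel, or systemd interface.

```python
import secrets
from typing import Callable, List


class CheckpointHooks:
    """Hypothetical registry of callbacks run around a checkpoint/restore cycle."""

    def __init__(self) -> None:
        self._pre_checkpoint: List[Callable[[], None]] = []
        self._post_restore: List[Callable[[], None]] = []

    def on_pre_checkpoint(self, fn: Callable[[], None]) -> None:
        # Register work to do before the process image is captured.
        self._pre_checkpoint.append(fn)

    def on_post_restore(self, fn: Callable[[], None]) -> None:
        # Register work to do once the process has been restored.
        self._post_restore.append(fn)

    def prepare_for_checkpoint(self) -> bool:
        # Run every pre-checkpoint callback, then report readiness.
        for fn in self._pre_checkpoint:
            fn()
        return True

    def resume_after_restore(self) -> None:
        # Run every post-restore callback.
        for fn in self._post_restore:
            fn()


class App:
    """Toy application holding a temporary cache and a secret session key."""

    def __init__(self, hooks: CheckpointHooks) -> None:
        self.cache = {"tmp": b"x" * 1024}
        self.session_key = secrets.token_bytes(32)
        hooks.on_pre_checkpoint(self.drop_cache)
        hooks.on_post_restore(self.rotate_key)

    def drop_cache(self) -> None:
        # Free temporary memory so the checkpoint is smaller.
        self.cache.clear()

    def rotate_key(self) -> None:
        # Regenerate cryptographic material that must not be reused after restore.
        self.session_key = secrets.token_bytes(32)
```

How such hooks would be triggered (a signal, a socket, an API call) is exactly the open question raised here, so the sketch deliberately leaves the transport out.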

Co-authored-by: Sergey Kanzhelev <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Jul 28, 2025
@k8s-ci-robot
Contributor

The following users are mentioned in OWNERS file(s) but are untrusted for the following reasons. One way to make the user trusted is to add them as members of the kubernetes org. You can then trigger verification by writing /verify-owners in a comment.

  • viktoriaas
    • User is not a member of the org. Satisfy at least one of these conditions to make the user trusted.
  • rst0git
    • User is not a member of the org. Satisfy at least one of these conditions to make the user trusted.

@aramase
Member

aramase commented Aug 4, 2025

/assign ritazh

(assigned as part of SIG Auth triage; to review the SIG Auth updates)

@aramase aramase moved this from Needs Triage to In Review in SIG Auth Aug 4, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 5, 2025
Stakeholders in this working group span multiple SIGs that own parts of the
code in core kubernetes components and addons.

- SIG CLI
Member

has there been outreach to these SIGs yet? I see some SIG node participants/leaders but not CLI yet for example

Member

@rst0git rst0git Aug 20, 2025

Changing from SIG CLI to SIG API machinery after SIG CLI mentioned that new commands are first introduced via a plugin.

kubernetes/enhancements#5091

I have a simple prototype of a plugin for the kubectl checkpoint command that we used in our demos and can discuss further in the working group.
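
For context, kubectl discovers plugins as executables named `kubectl-<name>` on the `PATH`, and the kubelet exposes a checkpoint endpoint of the form `POST /checkpoint/{namespace}/{pod}/{container}`. The following is a minimal sketch of the URL-building step such a plugin might contain. It is not the prototype mentioned above; the function name `checkpoint_url`, the use of the default kubelet port 10250, and the omission of authentication are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import quote


def checkpoint_url(
    node_host: str,
    namespace: str,
    pod: str,
    container: str,
    port: int = 10250,
    timeout: Optional[int] = None,
) -> str:
    """Build the kubelet checkpoint URL for one container (hypothetical helper)."""
    # Percent-encode each path segment so unusual names cannot alter the path.
    path = "/checkpoint/{}/{}/{}".format(
        quote(namespace, safe=""),
        quote(pod, safe=""),
        quote(container, safe=""),
    )
    url = f"https://{node_host}:{port}{path}"
    if timeout is not None:
        # Optional timeout in seconds, passed as a query parameter.
        url += f"?timeout={int(timeout)}"
    return url
```

A real plugin would still have to send the `POST` request with credentials the kubelet accepts, which is where the permission questions raised elsewhere in this conversation come in.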

Member

OK, we should remove the SIG if they are not a stakeholder, or else the WG organizers should reach out to the SIGs.

@BenTheElder
Member

BenTheElder commented Aug 20, 2025

@kubernetes/sig-node-leads are you all +1, officially?

@haircommander
Contributor

+1 from me

meetings: []
contact:
  slack: wg-checkpoint-restore
  mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore
Member

This is missing the steering liaison field. We discussed this in the steering meeting today; you can add me.

Looking forward to being more directly helpful with @rst0git @viktoriaas and crew :-)

Added, thanks

Suggested change
mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore
mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore
liaison:
  github: BenTheElder
  name: Benjamin Elder


- maintain a solid communication line between the Kubernetes groups and the
wider CNCF community
- submit a proposal to the KubeCon/CloudNativeCon maintainers track
Member

I doubt this incentivizes the right behavior; it may encourage people to create WGs just to get a slot at KubeCon.

Comment on lines +80 to +84
As a first mandate, the WG will define a roadmap and tasks in the first quarter
of operation.

After that the WG will distribute the different tasks to different community
members to define possible APIs and how it can be integrated in Kubernetes.
Member

@aojea aojea Aug 21, 2025

We may need some clarification on this paragraph. For readers unfamiliar with Kubernetes governance, the current wording ("WG will define a roadmap," "distribute tasks," "define possible APIs") could give the impression that the WG is an authoritative body with the power to approve APIs and dictate roadmaps for the project.

As we know, a WG's primary role is to serve as a forum for cross-SIG collaboration, and SIGs are the ultimate authority for code and API changes.

To avoid this potential confusion, could we rephrase this section to better reflect the WG's role as a facilitator and a place for proposing ideas to the SIGs?

what about something like:

Suggested change
As a first mandate, the WG will define a roadmap and tasks in the first quarter
of operation.
After that the WG will distribute the different tasks to different community
members to define possible APIs and how it can be integrated in Kubernetes.
As a first mandate, the WG will propose a draft roadmap and identify key tasks in the first quarter of operation.
After that, the WG will facilitate collaboration among community members to explore possible APIs and draft proposals for their integration into Kubernetes, which will then be presented to the relevant SIGs.

## Contact
- Slack: [#wg-checkpoint-restore](https://kubernetes.slack.com/messages/wg-checkpoint-restore)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore)

adding @BenTheElder as liaison to README too.

Suggested change
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore)
- Steering Committee Liaison: Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**)

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. committee/steering Denotes an issue or PR intended to be handled by the steering committee. do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: In Review
Status: Needs Triage