5 changes: 5 additions & 0 deletions OWNERS_ALIASES
@@ -130,6 +130,11 @@ aliases:
    - mwielgus
    - soltysh
    - swatisehgal
  wg-checkpoint-restore-leads:
    - adrianreber
    - haircommander
    - rst0git
    - viktoriaas
  wg-data-protection-leads:
    - xing-yang
    - yuxiangqian
1 change: 1 addition & 0 deletions sig-api-machinery/README.md
@@ -54,6 +54,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
## Working Groups

The following [working groups][working-group-definition] are sponsored by sig-api-machinery:
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Structured Logging](/wg-structured-logging)


1 change: 1 addition & 0 deletions sig-apps/README.md
@@ -58,6 +58,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-apps:
* [WG Batch](/wg-batch)
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Data Protection](/wg-data-protection)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)
6 changes: 6 additions & 0 deletions sig-auth/README.md
@@ -62,6 +62,12 @@ subprojects, and resolve cross-subproject technical issues and decisions.
- [@kubernetes/sig-auth-test-failures](https://github.com/orgs/kubernetes/teams/sig-auth-test-failures) - Test Failures and Triage
- Steering Committee Liaison: Patrick Ohly (**[@pohly](https://github.com/pohly)**)

## Working Groups

The following [working groups][working-group-definition] are sponsored by sig-auth:
* [WG Checkpoint Restore](/wg-checkpoint-restore)


## Subprojects

The following [subprojects][subproject-definition] are owned by sig-auth:
1 change: 1 addition & 0 deletions sig-cli/README.md
@@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
## Working Groups

The following [working groups][working-group-definition] are sponsored by sig-cli:
* [WG Checkpoint Restore](/wg-checkpoint-restore)
Contributor

With my SIG-CLI hat on, I'm raising a similar comment to the other one: this topic wasn't brought to SIG-CLI's attention.

Contributor

cc @kubernetes/sig-cli-leads

* [WG Node Lifecycle](/wg-node-lifecycle)


1 change: 1 addition & 0 deletions sig-list.md
@@ -62,6 +62,7 @@ When the need arises, a [new SIG can be created](sig-wg-lifecycle.md)
| Name | Label | Stakeholder SIGs |Organizers | Contact | Meetings |
|------|-------|------------------|-----------|---------|----------|
|[Batch](wg-batch/README.md)|[batch](https://github.com/kubernetes/kubernetes/labels/wg%2Fbatch)|* Apps<br>* Autoscaling<br>* Node<br>* Scheduling<br>|* [Kevin Hannon](https://github.com/kannon92), Red Hat<br>* [Marcin Wielgus](https://github.com/mwielgus), Google<br>* [Maciej Szulik](https://github.com/soltysh), Defense Unicorns<br>* [Swati Sehgal](https://github.com/swatisehgal), Red Hat<br>|* [Slack](https://kubernetes.slack.com/messages/wg-batch)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-batch)|* Regular Meeting ([calendar](https://calendar.google.com/calendar/embed?src=8ulop9k0jfpuo0t7kp8d9ubtj4%40group.calendar.google.com)): [Thursdays (starting February 15th 2024)s at 3PM CET (Central European Time) (monthly)](https://zoom.us/j/98329676612?pwd=c0N2bVV1aTh2VzltckdXSitaZXBKQT09)<br>
|[Checkpoint Restore](wg-checkpoint-restore/README.md)|[checkpoint-restore](https://github.com/kubernetes/kubernetes/labels/wg%2Fcheckpoint-restore)|* API Machinery<br>* Apps<br>* Auth<br>* CLI<br>* Node<br>* Scheduling<br>|* [Adrian Reber](https://github.com/adrianreber), Red Hat<br>* [Peter Hunt](https://github.com/haircommander), Red Hat<br>* [Radostin Stoyanov](https://github.com/rst0git), University of Oxford<br>* [Viktória Spišaková](https://github.com/viktoriaas), Masaryk University<br>|* [Slack](https://kubernetes.slack.com/messages/wg-checkpoint-restore)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore)|
|[Data Protection](wg-data-protection/README.md)|[data-protection](https://github.com/kubernetes/kubernetes/labels/wg%2Fdata-protection)|* Apps<br>* Storage<br>|* [Xing Yang](https://github.com/xing-yang), VMware<br>* [Xiangqian Yu](https://github.com/yuxiangqian), Google<br>|* [Slack](https://kubernetes.slack.com/messages/wg-data-protection)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-data-protection)|* Regular WG Meeting: [Wednesdays at 9:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/j/6933410772)<br>
|[Device Management](wg-device-management/README.md)|[device-management](https://github.com/kubernetes/kubernetes/labels/wg%2Fdevice-management)|* Architecture<br>* Autoscaling<br>* Network<br>* Node<br>* Scheduling<br>|* [John Belamaric](https://github.com/johnbelamaric), Google<br>* [Kevin Klues](https://github.com/klueska), NVIDIA<br>* [Patrick Ohly](https://github.com/pohly), Intel<br>|* [Slack](https://kubernetes.slack.com/messages/wg-device-management)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-device-management)|* Regular WG Meeting (Asia/Europe): [Wednesdays at 9:00 CET (Central European Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)<br>* Regular WG Meeting (Europe/America): [Tuesdays at 8:30 PT (Pacific Time) (biweekly)](https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09)<br>
|[etcd Operator](wg-etcd-operator/README.md)|[etcd-operator](https://github.com/kubernetes/kubernetes/labels/wg%2Fetcd-operator)|* Cluster Lifecycle<br>* etcd<br>|* [Benjamin Wang](https://github.com/ahrtr), VMware<br>* [Ciprian Hacman](https://github.com/hakman), Microsoft<br>* [Josh Berkus](https://github.com/jberkus), Red Hat<br>* [James Blair](https://github.com/jmhbnz), Red Hat<br>* [Justin Santa Barbara](https://github.com/justinsb), Google<br>|* [Slack](https://kubernetes.slack.com/messages/wg-etcd-operator)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-etcd-operator)|* Regular WG Meeting: [Tuesdays at 11:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/my/cncfetcdproject)<br>
1 change: 1 addition & 0 deletions sig-node/README.md
@@ -54,6 +54,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-node:
* [WG Batch](/wg-batch)
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)
1 change: 1 addition & 0 deletions sig-scheduling/README.md
@@ -63,6 +63,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-scheduling:
* [WG Batch](/wg-batch)
* [WG Checkpoint Restore](/wg-checkpoint-restore)
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)
37 changes: 37 additions & 0 deletions sigs.yaml
@@ -3583,6 +3583,43 @@ workinggroups:
      liaison:
        github: aojea
        name: Antonio Ojea
  - dir: wg-checkpoint-restore
    name: Checkpoint Restore
    mission_statement: >
      This working group aims to provide a central location for the community to discuss
      the integration of Checkpoint/Restore functionality into Kubernetes.
    charter_link: charter.md
Member

Is the charter included in this PR?

Member Author

Now it is. I didn't add it initially because the lifecycle document says it is added later, but looking at other WG PRs, it seems common to include a charter in the initial PR.

    stakeholder_sigs:
Member

SIG Auth may have a big say in the security of this whole restoration pipeline.

Member

@rst0git rst0git Jul 19, 2025

Thank you for pointing this out! Security is definitely an important topic that we need to discuss with sig-auth, both for the checkpoint API and the restoration pipeline. The following paper and master's thesis describe our recent work on this topic:

Member Author

I added SIG Auth to the list of stakeholder SIGs.

Member

This came up in the sig-auth meeting; we may have missed the discussion around this WG.

If this WG is contemplating taking state from a running pod, saving it, and letting it be consumed on another node or from another pod or another namespace, then sig-auth is definitely interested in making sure a permissions model around that exists and is ~consistent with similar things Kubernetes does elsewhere (like PVC / snapshots).

We're happy to consult on that, but I'm not sure our awareness / involvement rises to the level of sponsoring the WG :)

cc @kubernetes/sig-auth-leads

Member

This is a valuable initiative. The charter mentions that the scope includes checkpointing and restoring 'workloads' and providing 'guidance for developers on checkpoint-friendly app design.' Given this focus, it's essential for SIG Apps to be involved as a key stakeholder.

Member

@janetkuo This is a good idea, thank you so much for suggesting it!

Member Author

Yes, thanks @janetkuo. I added SIG Apps to the proposal.

Contributor

I agree with Janet here, but please make sure to show up and present the scope of this proposal at one of the future SIG-Apps calls.

      - API Machinery
      - Apps
      - Auth
      - CLI
      - Node
      - Scheduling
    label: checkpoint-restore
    leadership:
      chairs:
        - github: adrianreber
          name: Adrian Reber
          company: Red Hat
          email: [email protected]
        - github: haircommander
          name: Peter Hunt
          company: Red Hat
          email: [email protected]
        - github: rst0git
          name: Radostin Stoyanov
          company: University of Oxford
          email: [email protected]
        - github: viktoriaas
          name: Viktória Spišaková
          company: Masaryk University
          email: [email protected]
    meetings: []
    contact:
      slack: wg-checkpoint-restore
      mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore
Member

This is missing the steering liaison field. We discussed it in the steering meeting today; you can add me.

Looking forward to being more directly helpful with @rst0git @viktoriaas and crew :-)

Added, thanks

Suggested change
      mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore
      mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore
      liaison:
        github: BenTheElder
        name: Benjamin Elder

Contributor

Did you forget to push the addition?

Member

@soltysh Thank you for your comment! Adrian is on vacation right now, but he will merge it when he returns.

Contributor

The mailing lists for all WGs/SIGs are part of our managed Google Groups, with a kubernetes.io domain, so this should be https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore. Once you get the group created, we'll be able to provision this for you.

  - dir: wg-data-protection
    name: Data Protection
    mission_statement: >
38 changes: 38 additions & 0 deletions wg-checkpoint-restore/README.md
@@ -0,0 +1,38 @@
<!---
This is an autogenerated file!
Please do not edit this file directly, but instead make changes to the
sigs.yaml file in the project root.
To understand how this file is generated, see https://git.k8s.io/community/generator/README.md
--->
# Checkpoint Restore Working Group

This working group aims to provide a central location for the community to discuss the integration of Checkpoint/Restore functionality into Kubernetes.

The [charter](charter.md) defines the scope and governance of the Checkpoint Restore Working Group.

## Stakeholder SIGs
* [SIG API Machinery](/sig-api-machinery)
* [SIG Apps](/sig-apps)
* [SIG Auth](/sig-auth)
* [SIG CLI](/sig-cli)
* [SIG Node](/sig-node)
* [SIG Scheduling](/sig-scheduling)



## Organizers

* Adrian Reber (**[@adrianreber](https://github.com/adrianreber)**), Red Hat
* Peter Hunt (**[@haircommander](https://github.com/haircommander)**), Red Hat
* Radostin Stoyanov (**[@rst0git](https://github.com/rst0git)**), University of Oxford
* Viktória Spišaková (**[@viktoriaas](https://github.com/viktoriaas)**), Masaryk University

## Contact
- Slack: [#wg-checkpoint-restore](https://kubernetes.slack.com/messages/wg-checkpoint-restore)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-wg-checkpoint-restore)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore)

adding @BenTheElder as liaison to README too.

Suggested change
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fcheckpoint-restore)
- Steering Committee Liaison: Benjamin Elder (**[@BenTheElder](https://github.com/BenTheElder)**)

<!-- BEGIN CUSTOM CONTENT -->

<!-- END CUSTOM CONTENT -->
91 changes: 91 additions & 0 deletions wg-checkpoint-restore/charter.md
@@ -0,0 +1,91 @@

# WG Checkpoint Restore Charter

This charter adheres to the conventions described in the [Kubernetes Charter README] and uses
the Roles and Organization Management outlined in [sig-governance].

## Scope

The Checkpoint/Restore Working Group aims to solve the problem of transparently
checkpointing and restoring workloads in Kubernetes, a functionality discussed
for over five years. The group will deliver the design and implementation of
Contributor

Can you link to those discussions?

Checkpoint/Restore functionality in Kubernetes, serving as a central hub for
Contributor

Why does it have to be a central part of Kubernetes when multiple external solutions already exist?

community information and discussion. This initiative addresses a wide range of
problems, including fault tolerance, improved resource utilization, and
accelerated application startup times.
Contributor

The first thing I'd like to point out is that there are two main use cases:

  1. the whole control-plane snapshot
  2. workload

Which one is this group planning to cover? As I read this document, I see both used interchangeably, which is very confusing. That's why I'd start by clearly drawing the line between the two and properly documenting which of the two (or both) you are planning to tackle.


### In scope

- Identify core Kubernetes checkpoint/restore use cases (e.g., live migration,
fault tolerance, debugging, snapshotting) and gather stakeholder requirements.
- Investigate and propose Kubernetes APIs for checkpoint/restore operations.
- Work with SIGs for the best integration of checkpoint/restore functionality
and APIs.
- Provide guidance for developers on checkpoint-friendly app design and
Member

@SergeyKanzhelev SergeyKanzhelev Jul 28, 2025

There may be an API needed to communicate between the app and the API server that a checkpoint is requested AND/OR that the app is ready for checkpoint. That is something beyond just guidance.

Member Author

That is actually something we have discussed for years in the context of containers (outside of Kubernetes), but we never found the right way to do it. We looked at kernel interfaces and systemd interfaces, because for many applications it could be helpful to free temporary memory to reduce checkpoint size, or even to drop confidential information. Also, after restore it would be good to tell the application that certain cryptographic values may need to be reset or regenerated. I will try to include something mentioning this. Thanks.

recommendations for operators on feature management.
- Work closely with relevant upstream projects (CRI-O, containerd, CRIU, gVisor)
for alignment and integration.
- Revisit the existing implementations to find and remedy possible inefficiencies.
One example is the existing checkpoint archive format which has already been
identified as being a major source of slowdown.

### Out of scope

- Not focused on general OS-level checkpointing outside Kubernetes
pods/containers.
- Will not dictate internal application checkpointing logic; focuses on
Kubernetes platform orchestration of *container/pod state.

## Stakeholders

Stakeholders in this working group span multiple SIGs that own parts of the
code in core kubernetes components and addons.

- SIG CLI
Member

Has there been outreach to these SIGs yet? I see some SIG Node participants/leaders, but not CLI yet, for example.

Member

@rst0git rst0git Aug 20, 2025

Changing from SIG CLI to SIG API machinery after SIG CLI mentioned that new commands are first introduced via a plugin.

kubernetes/enhancements#5091

I have a simple prototype of a plugin for the kubectl checkpoint command that we used in our demos and can discuss further in the working group.

Member

OK, we should remove the SIG if they are not a stakeholder, or else the WG organizers should reach out to the SIGs.

- SIG API Machinery
- SIG Node
- SIG Scheduling
- SIG Auth
- SIG Apps

## Deliverables

The list of deliverables includes the following high-level features:

- In the early stage, we mainly want to offer a well-defined location for the
community to find information, ask questions, and discuss the next steps of
enabling checkpoint and restore in Kubernetes.

Later:

- Ability to checkpoint and restore a container using kubectl
- Ability to checkpoint and restore a pod using kubectl
- Integration of container/pod checkpointing in scheduling decisions
Contributor

Why would pod checkpointing have anything to do with scheduling?


## Roles and Organization Management

This WG adheres to the Roles and Organization Management outlined in [wg-governance]
and opts-in to updates and modifications to [wg-governance].

[wg-governance]: /committee-steering/governance/wg-governance.md

Additionally, the WG commits to:

- maintain a solid communication line between the Kubernetes groups and the
wider CNCF community
- submit a proposal to the KubeCon/CloudNativeCon maintainers track
Member

I doubt this incentivizes the right behavior; it may encourage people to form WGs just to get a slot at KubeCon.

Contributor

I agree with Antonio: this particular line should be removed. The previous point is sufficient.


## Timelines and Disbanding

As a first mandate, the WG will define a roadmap and tasks in the first quarter
of operation.

After that the WG will distribute the different tasks to different community
members to define possible APIs and how it can be integrated in Kubernetes.
Comment on lines +80 to +84
Member

@aojea aojea Aug 21, 2025

We may need some clarification on this paragraph. For readers unfamiliar with Kubernetes governance, the current wording ("WG will define a roadmap," "distribute tasks," "define possible APIs") could give the impression that the WG is an authoritative body with the power to approve APIs and dictate roadmaps for the project.

As we know, a WG's primary role is to serve as a forum for cross-SIG collaboration, and SIGs are the ultimate authority for code and API changes.

To avoid this potential confusion, could we rephrase this section to better reflect the WG's role as a facilitator and a place for proposing ideas to the SIGs?

What about something like:

Suggested change
As a first mandate, the WG will define a roadmap and tasks in the first quarter
of operation.
After that the WG will distribute the different tasks to different community
members to define possible APIs and how it can be integrated in Kubernetes.
As a first mandate, the WG will propose a draft roadmap and identify key tasks in the first quarter of operation.
After that, the WG will facilitate collaboration among community members to explore possible APIs and draft proposals for their integration into Kubernetes, which will then be presented to the relevant SIGs.


Achieving the aforementioned deliverables, also mentioned in the `In Scope`
section, will allow us to decide when to disband this WG. There is no
expectation that the Working Group will be converted into a SIG long term.

[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md