Skip to content

Optimize image loading for Podman machines #26660

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Honny1
Copy link
Member

@Honny1 Honny1 commented Jul 18, 2025

Add support for loading images directly from machine paths to avoid unnecessary file transfers when the image archive is already accessible on the running machine through mounted directories.

Changes include:

  • New /libpod/local/images/load API endpoint for direct machine loading
  • Machine detection and path mapping functionality
  • Fallback in tunnel mode to try optimized loading first
  • Improved error handling for non-JSON API responses

This optimization significantly speeds up image loading operations when working with remote Podman machines by eliminating redundant file transfers for already-accessible image archives.

Fixes: https://issues.redhat.com/browse/RUN-3249
Fixes: #26321

Does this PR introduce a user-facing change?

New API endpoint for MacOS and Windows that significantly speeds up remote image loading by using direct-path detection to avoid unnecessary file transfers, complete with a seamless fallback mechanism.

@openshift-ci openshift-ci bot added release-note do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jul 18, 2025
Copy link
Contributor

openshift-ci bot commented Jul 18, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Honny1
Once this PR has been reviewed and has the lgtm label, please assign tomsweeneyredhat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions bot added machine kind/api-change Change to remote API; merits scrutiny labels Jul 18, 2025
@@ -31,6 +32,7 @@ import (
"github.com/containers/podman/v5/pkg/domain/infra/abi"
domainUtils "github.com/containers/podman/v5/pkg/domain/utils"
"github.com/containers/podman/v5/pkg/errorhandling"
"github.com/containers/podman/v5/pkg/machine"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not import anything machine from here I would say that makes the import structure really weird that the server now all of the sudden imports machine code

Comment on lines 382 to 390
localMap := machine.LocalAPIMap{}
if err := json.NewDecoder(r.Body).Decode(&localMap); err != nil {
utils.Error(w, http.StatusBadRequest, fmt.Errorf("failed to decode request body: %w", err))
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is intuitive, you parse struct where one field is just not used at all. My suggestion add a signle query param path where the client must specify the path how it is on the VM.

}

cleanPath := filepath.Clean(localMap.RemotePath)
switch _, err := os.Stat(cleanPath); {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use fileutils.Exists

cleanPath := filepath.Clean(localMap.RemotePath)
switch _, err := os.Stat(cleanPath); {
case err == nil:
case os.IsNotExist(err):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mention in that function new code should use errors.Is(err, fs.ErrNotExist)

but is there a reason to check for this file here at all? I would assume imageEngine.Load() call would alos fail with ErrNotExist so we could just handle it there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imageEngine.Load() attempts to open the file several times to determine the correct image format in the tar file. I would prefer to avoid repeatedly attempting to open the tar file.

Comment on lines +50 to +56
if h.Response.Header.Get("Content-Type") == "application/json" {
return handleError(data, &errorhandling.ErrorModel{})
}
return &errorhandling.ErrorModel{
Message: string(data),
ResponseCode: h.Response.StatusCode,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice! But please split this out into a separate commit with its own explanation as it is mostly unrelated to the machine change


imageTypes "github.com/containers/image/v5/types"
handlersTypes "github.com/containers/podman/v5/pkg/api/handlers/types"
"github.com/containers/podman/v5/pkg/auth"
"github.com/containers/podman/v5/pkg/bindings"
"github.com/containers/podman/v5/pkg/domain/entities/reports"
"github.com/containers/podman/v5/pkg/domain/entities/types"
"github.com/containers/podman/v5/pkg/machine"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well I don't think we should import any machine code from bindings. The bindings itself are supported by external users so we should not cause machine specifc imports here.

@@ -221,6 +223,22 @@ func (ir *ImageEngine) Inspect(ctx context.Context, namesOrIDs []string, opts en
}

func (ir *ImageEngine) Load(ctx context.Context, opts entities.ImageLoadOptions) (*entities.ImageLoadReport, error) {
if localMap, ok := shim.CheckPathOnRunningMachine(ir.ClientCtx, opts.Input); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK machine only compiles on amd64/arm64 and not the other arches so this breaks the build, you will need some stub that always returns false.

Also we should not make calls into machine code at all if the connection is not a machine. We actually have the machine info already so the rereading of connection you do int he machine code is redundant.

If you look into NewImageEngine() there is facts.MachineMode you could use. i.e. you could store it in the ImageEngine struct and then just access it here.

@Honny1 Honny1 force-pushed the speed-up-load branch 2 times, most recently from 5b32a32 to 49668ff Compare August 4, 2025 15:28
Copy link

[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.

1 similar comment
Copy link

[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.

@Honny1
Copy link
Member Author

Honny1 commented Aug 5, 2025

/packit retest-failed

@Honny1 Honny1 marked this pull request as ready for review August 5, 2025 12:03
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 5, 2025
@Honny1
Copy link
Member Author

Honny1 commented Aug 5, 2025

@Luap99 PTAL, I have addressed your review.

I don't think the failed machine-wsl test is related to these changes. I think something went wrong with WSL.

@Honny1 Honny1 requested a review from Luap99 August 5, 2025 12:50
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not going to work on windows with WSL since that doesn't do mounts the normal way, all WSL machines have /mnt/c, etc... mounted by default.

There is specgen.ConvertWinMountPath() which must be used to translate the windows paths AFAIK.

@@ -139,6 +140,25 @@ func Load(ctx context.Context, r io.Reader) (*types.ImageLoadReport, error) {
return &report, response.Process(&report)
}

func LoadLocal(ctx context.Context, m *local_utils.LocalAPIMap) (*types.ImageLoadReport, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local_utils is an internal package which means it will be impossible for any external bindings user to access LocalAPIMap type AFAICT. That is likely not wanted? I would say just accept the path as string argument here.

Comment on lines 231 to 234
errModel, ok := err.(*errorhandling.ErrorModel)
if !ok {
return nil, err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use errors.As() instead. Code might wrap the struct in which case your error type check here fails.

// InspectRunningMachine finds and returns information about the currently running machine.
// It searches across all VM providers to find a machine in the running state.
// Returns nil if no running machine is found.
func InspectRunningMachine(vmstubbers []vmconfigs.VMProvider, _ machine.ListOptions) (*machine.InternalInspectInfo, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right or future proof, wsl machines support multiple running machines at the same time and we might want to have that for other providers as well.

I think the checking the ssh connection info like done in cmd/podman/compose_machine.go is likely the better thing to match the machine vs connection. the bindings client.URI should show the URL

@Honny1 Honny1 force-pushed the speed-up-load branch 11 times, most recently from 9738a3c to 28e0a9c Compare August 8, 2025 13:07
@Honny1 Honny1 force-pushed the speed-up-load branch 8 times, most recently from 92b20e8 to c104e6f Compare August 11, 2025 15:34
@Honny1
Copy link
Member Author

Honny1 commented Aug 11, 2025

Benchmark Results

The following benchmarks measure the execution time of the podman load command.

  • Image: docker.io/tensorflow/tensorflow:latest-gpu
  • Command: podman load -i tensorflow.tar

Note: All measurements were taken on development machines with background processes running. Performance may be influenced.


MacOS

  • Chip: Apple M3 Pro
  • RAM: 36 GB
  • SSD: 1 TB Apple SSD

AppleHV (Apple Hypervisor)

  • Before: 74.96 s
  • After: 25.17 s (66% improvement)

Libkrun

  • Before: 84.07 s
  • After: 37.58 s (55% improvement)

Windows

  • CPU: Intel Core i7-8665U
  • RAM: 32 GB
  • SSD: 500 GB Samsung 970 Evo Plus

Hyper-V

  • Before: 700.74 s
  • After: 154.58 s (78% improvement)

WSL (16 GB RAM Limit)

  • Before: 109.63 s
  • After: 140.33 s

WSL (2 GB RAM Limit)

  • Before: 147.77 s
  • After: 142.51 s

Analysis of WSL Results

The Podman machine's performance under WSL is dictated by the global RAM limit of the WSL environment, rather than the specific memory setting configured for the Podman machine. This behavior may be a bug and requires further investigation.

With a 16 GB limit, podman load can copy the large tarball into a RAM-based temporary file system (tmpfs). This explains both the high memory usage and the exceptionally fast "Before" time. The speed advantage is due to two factors: leveraging high-speed RAM over the slower SSD and bypassing the performance overhead of the Windows-to-Linux filesystem translation layer.

When RAM is restricted to 2 GB, the system cannot effectively use tmpfs in RAM for an operation of this size and falls back to SSD. As a result, performance becomes bound by SSD speed, leading to more comparable "Before" and "After" times. The small difference in these results is within the expected margin of error for benchmarks conducted in a non-dedicated development environment.

@Honny1
Copy link
Member Author

Honny1 commented Aug 11, 2025

@Luap99 I've addressed your comments and fixed the issue with Windows path translation. This is ready for another review.

@Honny1
Copy link
Member Author

Honny1 commented Aug 11, 2025

/packit retest-failed

@Honny1 Honny1 requested a review from Luap99 August 11, 2025 17:50
@Luap99
Copy link
Member

Luap99 commented Aug 12, 2025

bypassing the performance overhead of the Windows-to-Linux filesystem translation layer.

I think that is the main point, what happens when we copy we read on the windows side send via network, extract to the wsl fs (ext4 I think?) which then has normal fs speed. Now with this change we use the 9p proxy, and I guess the windows 9p mount that it uses has just terrible performance if we can outperform it with the old way. That seems quite concerning for WSL but outside our control.
https://stackoverflow.com/questions/68972448/why-is-wsl-extremely-slow-when-compared-with-native-windows-npm-yarn-processing

Comment on lines 28 to 63
dirs, err := env.GetMachineDirs(machineProvider.VMType())
if err != nil {
return nil, err
}

machineList, err := vmconfigs.LoadMachinesInDir(dirs)
if err != nil {
return nil, fmt.Errorf("listing machines: %w", err)
}

// Now we know that the connection points to a machine and we
// can find the machine by looking for the one with the
// matching port.
connectionPort, err := strconv.Atoi(parsedConnection.Port())
if err != nil {
return nil, fmt.Errorf("parsing connection port: %w", err)
}
for _, mc := range machineList {
if connectionPort != mc.SSH.Port {
continue
}

state, err := machineProvider.State(mc, false)
if err != nil {
return nil, err
}

if state != define.Running {
return nil, fmt.Errorf("machine %s is not running but in state %s", mc.Name, state)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to duplicate code from the compose code, I suggest you split this into a function here and then make the compose code use it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I duplicated the code.

Check Content-Type header before unmarshaling errors to avoid
unnecessary JSON parsing overhead for plain text responses.

Signed-off-by: Jan Rodák <[email protected]>
Add support for loading images directly from machine paths to avoid
unnecessary file transfers when the image archive is already accessible
on the running machine through mounted directories.

Changes include:
- New /libpod/local/images/load API endpoint for direct machine loading
- Machine detection and path mapping functionality
- Fallback in tunnel mode to try optimized loading first

This optimization significantly speeds up image loading operations
when working with remote Podman machines by eliminating redundant
file transfers for already-accessible image archives.

Fixes: https://issues.redhat.com/browse/RUN-3249
Fixes: containers#26321

Signed-off-by: Jan Rodák <[email protected]>
@Honny1 Honny1 requested a review from Luap99 August 12, 2025 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/api-change Change to remote API; merits scrutiny machine release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Podman-Remote needs api to do local artifact creation or builds.
2 participants