
Fall back from chunked pulls when the chunked metadata is too large #2230

Merged
4 commits merged into containers:main from the chunked-too-large branch on Jan 30, 2025

Conversation

@mtrmac (Collaborator) commented Jan 24, 2025

This is a short-term fix for containers/podman#24885: if the chunked metadata is too large, trigger fallback to the traditional non-chunked pull path instead of a hard failure. See the linked issue for previous discussion.


RFC: While working on this, the APIs of various functions started looking increasingly unwieldy, so I started with a refactoring of the way chunkedDiffer is initialized and the way the read…Manifest functions return data. I think it’s a slight improvement, but not at all an obvious one. I’d be happy to drop the initial refactoring commits and turn this into only the fallback logic.


This is a short-term mitigation for the existing limits. It seems that there are various other (closely and not so closely related) code paths that can similarly allocate a large amount of memory, with, at best, inconsistent enforcement of limits.


Marked draft because I haven’t tested the code yet.

@giuseppe PTAL

See individual commit messages for details.

@cgwalters (Contributor) left a comment

Only a superficial skim

func getProperDiffer(store storage.Store, blobDigest digest.Digest, blobSize int64, annotations map[string]string, iss ImageSourceSeekable, pullOptions pullOptions) (graphdriver.Differ, bool, error) {
// newUninitializedDiffer creates the very initial state of a chunkedDiffer.
// Next, .initialize… and .finishSetup must be called.
func newUninitializedDiffer(pullOptions pullOptions, stream ImageSourceSeekable, blobDigest digest.Digest, blobSize int64) *chunkedDiffer {
Contributor

For extra safety we could add a fully_initialized: bool to the struct and some asserts in different places around that?
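For illustration, a minimal sketch of that idea, assuming a hypothetical fullyInitialized field and assertReady helper; the names and the surrounding struct layout are made up here, not the actual chunkedDiffer:

```go
package chunked

// Hypothetical sketch of the suggestion above: track the initialization
// state explicitly and assert on it before the differ is used.
type chunkedDiffer struct {
	// ... existing fields elided ...
	fullyInitialized bool
}

func (d *chunkedDiffer) finishSetup() {
	// ... the rest of the setup ...
	d.fullyInitialized = true
}

// assertReady would be called at the top of methods like ApplyDiff.
func (d *chunkedDiffer) assertReady() {
	if !d.fullyInitialized {
		panic("internal error: chunkedDiffer used before finishSetup")
	}
}
```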

Collaborator Author

I’m not really sure where to go here.

I was primarily trying to document the various field lifetimes and reduce some duplication — but I’m not convinced the structure as it is should stay. This certainly could be made more enterprisey, by adding state enums and encapsulation and so on, but I have a vague feeling that there should be a cleaner, more elegant way to express this, if only I could find it. So I’d rather document and record things that might help find that ideal structure, and not set things in stone in a way that makes it harder to move towards that ideal structure.

Some of this is partially a consequence of the public API structure not being a great fit — now that we have convert_images, we can convert images only in the .ApplyDiff phase, when the graph driver tells us the destination directory, but structurally we’d prefer to do the conversion very early and then have the .ApplyDiff phase work only with a parsed zstd manifest. This accounts for most of the complex lifetimes of the “Chunked format”/“Chunked metadata” fields. But I don’t think we can change that easily.

There are many other things to consider:

  • Split this into two objects — differContext with the source/options/blob ID, plus differState with the rest? That would replace the newUninitialized/.initialize part (see the sketch after this list)
  • Storing topMetadata in chunkedDiffer as a whole — saves some lines, but makes the estargz/zstd:chunked differences less salient
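For illustration only, a rough sketch of the first option; the names and field sets are hypothetical, and pullOptions/ImageSourceSeekable stand for the package’s existing types:

```go
package chunked

import digest "github.com/opencontainers/go-digest"

// differContext: inputs that are fixed for the whole pull, known at
// GetDiffer time (this would replace the newUninitialized/.initialize part).
type differContext struct {
	pullOptions pullOptions
	stream      ImageSourceSeekable
	blobDigest  digest.Digest
	blobSize    int64
}

// differState: everything computed later (format detection, parsed
// metadata, conversion decisions, ...).
type differState struct {
	// ...
}
```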

… I think the bottom line is that I just don’t know what a good solution looks like yet, and now is not the best time to go figure that out. So maybe I should keep just a modified variant of the first commit, documenting the field groups, and leave the restructuring for some other time.

WDYT?

Contributor

I only drop in unpredictably and mostly superficially here (don't have bandwidth for more) in the overall hope that I maintain the background understanding to eventually execute on bootc-dev/bootc#20

So with that in mind

… I think the bottom line is that I just don’t know what a good solution looks like yet, and now is not the best time to go figure that out. So maybe I should keep just a modified variant of the first commit, documenting the field groups, and leave the restructuring for some other time.

That makes sense to me!

}

footer := make([]byte, footerSize)
streamsOrErrors, err := getBlobAt(blobStream, ImageSourceChunk{Offset: uint64(blobSize - footerSize), Length: uint64(footerSize)})
 	if err != nil {
-		return topMetadata{}, err
+		var badRequestErr ErrBadRequest // If the source does not support partial reads, we might want to fall back.
+		return topMetadata{}, errors.As(err, &badRequestErr), err
Contributor

It seems like it might be slightly cleaner to create a new error type, e.g. ErrShouldFallback, that wraps these, and have the caller test for that error type?

Collaborator Author

Oh, great point. The ultimate goal is to return ErrFallbackToOrdinaryLayerDownload … and we can just do that directly, without the canFallback bool. Originally I was thinking that this error type is part of the API of GetDiffer and that the called subroutines should not need to care, but threading the extra return value through all over the place is really not any better. I’ll rework this.
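A minimal sketch of the resulting pattern, with the error type defined here only for illustration (the actual definition in c/storage may differ): callees wrap the underlying cause, and the caller of GetDiffer only needs errors.As:

```go
package chunked

import (
	"errors"
	"fmt"
)

// Illustrative stand-in; not necessarily how c/storage defines it.
type ErrFallbackToOrdinaryLayerDownload struct{ Err error }

func (e ErrFallbackToOrdinaryLayerDownload) Error() string { return e.Err.Error() }
func (e ErrFallbackToOrdinaryLayerDownload) Unwrap() error { return e.Err }

// A callee can trigger the fallback directly, without threading a
// canFallback bool through every return value.
func readFooter() error {
	cause := fmt.Errorf("zstd:chunked metadata too large") // placeholder for the real check
	return ErrFallbackToOrdinaryLayerDownload{Err: cause}
}

// The caller decides between a partial pull and an ordinary layer download.
func pullLayer() error {
	if err := readFooter(); err != nil {
		var fallback ErrFallbackToOrdinaryLayerDownload
		if errors.As(err, &fallback) {
			// fall back to the traditional non-chunked pull path
			return nil
		}
		return err
	}
	return nil
}
```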

@cgwalters (Contributor)

if the chunked metadata is too large, trigger fallback to the traditional non-chunked pull path

Of course it's often these huge layers where partial pulls are most valuable...

On that topic, I know I’ve said this elsewhere, but I still think us "productizing" a "rechunker/canonicalizer" for containers would be a huge win across the board (again, it also helps pushes and registry storage, not just pulls). It would also mean less work for each individual zstd:chunked pull, keeping metadata size down.

@mtrmac (Collaborator Author) commented Jan 27, 2025

if the chunked metadata is too large, trigger fallback to the traditional non-chunked pull path

Of course it's often these huge layers where partial pulls are most valuable...

… and the metadata scales with the layer size, not number of files, so it will eventually go over the limit.

That’s true, but, also, we need pulls not to outright fail now. Handling images with larger-than-memory metadata is more work than this.

On that topic I know I've said this elsewhere but I still think us "productizing" a "rechunker/canonicalizer"

I have no idea what this means. What are the inputs and outputs? Is this working within a single layer, or somehow moving files across layers?

@cgwalters (Contributor)

That’s true, but, also, we need pulls not to outright fail now. Handling images with larger-than-memory metadata is more work than this.

Yep, agreed.

I have no idea what this means. What are the inputs and outputs? Is this working within a single layer, or somehow moving files across layers?

See all the things in ostreedev/ostree-rs-ext#69 (especially my favorite https://grahamc.com/blog/nix-and-layered-docker-images/ ) - it can mean multiple things but basically a combination of adding determinism (file timestamps ruin all caching across the board) and yes splitting files into distinct layers - the code in rpm-ostree does this starting from an input filesystem tree, using the rpm database.

@mtrmac force-pushed the chunked-too-large branch from 0e924af to 1ebfbe8 on January 28, 2025, 18:41
@mtrmac (Collaborator Author) commented Jan 28, 2025

Is this working within a single layer, or somehow moving files across layers?

See all the things in ostreedev/ostree-rs-ext#69 (especially my favorite https://grahamc.com/blog/nix-and-layered-docker-images/ ) - it can mean multiple things but basically a combination of adding determinism (file timestamps ruin all caching across the board) and yes splitting files into distinct layers - the code in rpm-ostree does this starting from an input filesystem tree, using the rpm database.

My general intuition is that the layer concept of Docker is, to a fair extent, an implementation detail that should not be treated as a constraint… and, with zstd:chunked, it’s already not that relevant anymore, because zstd:chunked always uses reflinks across layers (and copies chunks of modified files, rsync-style).

I think that users care that “pulling multiple small applications on top of a single large shared base image is cheap”. That’s a specific property that directly affects how images are built, with explainable performance properties.

It seems to me that re-packing, both within image boundaries and across, doesn’t have that kind of explainable performance. “It will be better, on average”, sure. But maybe there are other ways to achieve that. Hypothetically, what if some future zstd:chunked-v2 metadata also contained deterministic hashes of full directory trees, so that we could directly determine “all of /usr/share/python is the same as in that other image”? What I find attractive about this is that users would never hear about a “re-packer” as anything that needs tuning.
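Purely to make that hypothetical concrete (this is not part of zstd:chunked today), a deterministic directory-tree digest could be computed roughly like this, hashing sorted relative paths, modes, and regular-file contents so that identical trees always produce identical digests:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// treeDigest computes a deterministic digest of a directory tree: only
// names, modes, and contents are hashed, so build timestamps do not matter.
// A real design would also cover symlink targets, xattrs, ownership, etc.
func treeDigest(root string) (string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		paths = append(paths, path)
		return nil
	})
	if err != nil {
		return "", err
	}
	sort.Strings(paths) // WalkDir is already lexically ordered, but be explicit

	h := sha256.New()
	for _, p := range paths {
		info, err := os.Lstat(p)
		if err != nil {
			return "", err
		}
		rel, _ := filepath.Rel(root, p)
		fmt.Fprintf(h, "%s\x00%o\x00", rel, info.Mode())
		if info.Mode().IsRegular() {
			f, err := os.Open(p)
			if err != nil {
				return "", err
			}
			if _, err := io.Copy(h, f); err != nil {
				f.Close()
				return "", err
			}
			f.Close()
		}
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	d, err := treeDigest(os.Args[1])
	if err != nil {
		panic(err)
	}
	fmt.Println(d)
}
```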

Obviously this also interacts with composefs; IIRC it’s still not 100% settled whether we want a single top-level composefs manifest, or one composefs per layer plus overlay to stack them.

@mtrmac (Collaborator Author) commented Jan 28, 2025

PR updated:

  • Dropped most of the chunkedDiffer restructuring
  • The code now creates ErrFallbackToOrdinaryLayerDownload (and private errFallbackCanConvert) directly
  • If we refuse to pull a layer partially because its metadata is too large, we also refuse to use the convert_images path (reading the full file and creating a new partial representation — its metadata would also be too large, and the conversion would fail). In that case, per the existing convert_images semantics (use case: “always use composefs“), the pull continues to fail instead of falling back to a non-partial pull. (A sketch of this decision is below.)
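A sketch of that decision with hypothetical names (maxMetadataSize, checkChunkedMetadataSize); the real code is wired through ErrFallbackToOrdinaryLayerDownload and errFallbackCanConvert rather than this simplified wrapping:

```go
package chunked

import "fmt"

// errFallbackToOrdinaryPull stands in for the real fallback error type.
var errFallbackToOrdinaryPull = fmt.Errorf("falling back to ordinary layer download")

// Constant and signature are hypothetical; they just mirror the description above.
const maxMetadataSize = 50 << 20

func checkChunkedMetadataSize(size int64, convertImages bool) error {
	if size <= maxMetadataSize {
		return nil // small enough: continue with the partial pull
	}
	if convertImages {
		// convert_images would re-create equally large metadata and fail
		// during conversion, so per its semantics ("always use composefs")
		// the pull fails outright instead of silently degrading.
		return fmt.Errorf("chunked metadata too large (%d bytes)", size)
	}
	// Otherwise ask the caller to retry via the traditional non-chunked path.
	return fmt.Errorf("chunked metadata too large (%d bytes): %w", size, errFallbackToOrdinaryPull)
}
```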

Manually tested against containers/podman#24885 (comment), both with and without convert_images.

Ready for review, PTAL.

@mtrmac (Collaborator Author) commented Jan 28, 2025

Also filed #2232 to track adding support for partial pulls of very large images.

@cgwalters (Contributor)

My general intuition is that the layer concept of Docker is, to a fair extent, an implementation detail that should not be treated as a constraint… and, with zstd:chunked, it’s not that relevant anymore, already, because zstd:chunked always uses reflinks across layers (and copies chunks of modified files, rsync-style).

I think you're wrong. In a nutshell, generating reproducible container images benefits not just pulling but also pushing images and registry storage (especially multiplied by many build variants of the same image).

Today linux-firmware alone is 400MB - splitting that out into its own layer that is always bit-for-bit identical (given same input and gzip compressor) is a huge win. Trust me.

More in e.g. containers/image#902 (comment)

@giuseppe (Member) left a comment

LGTM. Since we know there are already such big images out there, what do you think about increasing the max size too as part of this PR?

@mtrmac (Collaborator Author) commented Jan 30, 2025

since we know there are already such big images out there, what do you think about increasing the max size too as part of this PR?

I can’t think of any strong reason to prefer one size over any other. I’m open to changing this, but then what should we choose?


That image has a <92 MB uncompressed manifest vs. a 50 MB limit — and a tar-split, for which we currently don’t set any limit, of 310 MB.

Just intuitively with no numbers, I think 50 vs. 100 MB, say, is about equally reasonable and I wouldn’t argue at all about that difference … but, also, neither is obviously small enough. And we would probably want to hard-code a limit larger than the images we see in the wild.

So, 200? 256? MB? That, or the >300 (400? 500? MB) for tar-split is getting rather further into the “unreasonable for small systems” territory.


If it helps move things along, I’d be happy to bump that to 100 MB.
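As background on enforcing such limits when the exact size is not known up front (e.g. for the tar-split data, which currently has no limit), a generic sketch, not the c/storage code, of capping an in-memory read:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// readAllWithLimit reads r fully but fails once more than limit bytes arrive,
// instead of allocating an unbounded amount of memory.
func readAllWithLimit(r io.Reader, limit int64) ([]byte, error) {
	var buf bytes.Buffer
	n, err := io.Copy(&buf, io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if n > limit {
		return nil, fmt.Errorf("data exceeds the %d-byte limit", limit)
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := readAllWithLimit(strings.NewReader("tar-split payload"), 50<<20)
	fmt.Println(len(data), err)
}
```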

Try to split fields by purpose, and document some of the context.

For consistency, use the same order in the struct literals.

Only reorders existing fields, should not change behavior.

Signed-off-by: Miloslav Trmač <[email protected]>
Instead, have the callees produce ErrFallbackToOrdinaryLayerDownload
directly.

Signed-off-by: Miloslav Trmač <[email protected]>
That's a logically better place: it pairs the getBlobAt
calls with the ErrBadRequest types specific to those call sites.

We will also add more fallback reasons.

Signed-off-by: Miloslav Trmač <[email protected]>
... but not if the fallback would be convert_images, again
creating too large metadata.

Signed-off-by: Miloslav Trmač <[email protected]>
@cgwalters (Contributor)

That, or the >300 (400? 500? MB) for tar-split is getting rather further into the “unreasonable for small systems” territory.

I thought we had a tracker for this, but basically, just to reiterate: it's really quite silly and circular to ship tar-split in zstd:chunked; since we control the format, we should define a "canonical tar" that we can reproduce given the metadata we have.
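To make the "canonical tar" idea concrete, a toy sketch of deterministic tar writing (sorted entries, fixed timestamps, default uid/gid and no user/group names), as a general illustration rather than a proposed format:

```go
package main

import (
	"archive/tar"
	"os"
	"sort"
	"time"
)

type entry struct {
	name string
	mode int64
	data []byte
}

// writeCanonicalTar writes entries in sorted order with all non-essential
// header fields held constant, so the same inputs always produce a
// bit-for-bit identical archive.
func writeCanonicalTar(out *os.File, entries []entry) error {
	sort.Slice(entries, func(i, j int) bool { return entries[i].name < entries[j].name })
	tw := tar.NewWriter(out)
	for _, e := range entries {
		hdr := &tar.Header{
			Name:    e.name,
			Mode:    e.mode,
			Size:    int64(len(e.data)),
			ModTime: time.Unix(0, 0), // fixed timestamp for determinism
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := tw.Write(e.data); err != nil {
			return err
		}
	}
	return tw.Close()
}

func main() {
	f, err := os.Create("canonical.tar")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := writeCanonicalTar(f, []entry{
		{name: "usr/bin/hello", mode: 0o755, data: []byte("#!/bin/sh\necho hello\n")},
	}); err != nil {
		panic(err)
	}
}
```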

@TomSweeneyRedHat (Member)

@cgwalters @giuseppe @mtrmac is this good to merge now, or are further changes needed? I'm asking as we'd like to get this into c/storage 1.57 and Podman v5.4. I'm fine with adding additional changes over the next few weeks if needed for RHEL 9.6/10.0 and Podman v5.4.*.

@mtrmac (Collaborator Author) commented Jan 30, 2025

@TomSweeneyRedHat I think the code can go in as is, and we are talking just about tuning parameters (which is not time-sensitive the way the core of this PR is); but I obviously shouldn’t be approving my own PRs.

@giuseppe (Member) left a comment

/lgtm

openshift-ci bot commented Jan 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, mtrmac

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@giuseppe (Member)

If it helps move things along, I’d be happy to bump that to 100 MB.

I think we can merge this as it is now, and we can bump the limit later. Since we already saw an image that is 92MB, should we just play safer here and use at least 150MB?

@mtrmac (Collaborator Author) commented Jan 30, 2025

since we control the format we should define a "canonical tar" that we can reproduce given the metadata we have.

That would be necessary for some purposes, but let’s not plan on doing that in zstd:chunked. It turns out it is impossible to provide all of:

  • compatibility with chunked-unaware tar consumers
  • full performance for chunked-aware consumers
  • a guarantee that both consumers see the same contents.

So if we were to make any new format (with a user-visible new option), let’s bite the bullet and break the tar consumers.

@giuseppe (Member)

I thought we had a tracker for this but basically just to reiterate it's really quite silly and circular to ship tar-split in zstd:chunked, since we control the format we should define a "canonical tar" that we can reproduce given the metadata we have.

Yes, I agree; we need to look into that and avoid repeating the same information in three different places.

@openshift-merge-bot bot merged commit f3d8abc into containers:main on Jan 30, 2025
20 checks passed
@mtrmac (Collaborator Author) commented Jan 30, 2025

Since we already saw an image that is 92MB, should we just play safer here and use at least 150MB?

*shrug* why not. For the record: #2240.
