
Fall back from chunked pulls when the chunked metadata is too large #2230

Merged
4 commits merged into containers:main from the chunked-too-large branch on Jan 30, 2025

Conversation

@mtrmac (Collaborator) commented Jan 24, 2025

This is a short-term fix for containers/podman#24885: if the chunked metadata is too large, trigger fallback to the traditional non-chunked pull path instead of a hard failure. See the linked issue for previous discussion.


RFC: While working on this, the APIs of various functions started looking increasingly unwieldy, so I started with a refactoring of the way chunkedDiffer is initialized and the way the read…Manifest functions return data. I think it’s a slight improvement, but not at all an obvious one. I’d be happy to drop the initial refactoring commits and turn this into only the fallback logic.


This is a short-term mitigation for the existing limits. It seems that there are various other (closely and not so closely related) code paths that can similarly allocate a large amount of memory, with, at best, inconsistent enforcement of limits.


Marked draft because I haven’t tested the code yet.

@giuseppe PTAL

See individual commit messages for details.

@cgwalters (Contributor) left a comment

Only a superficial skim

func getProperDiffer(store storage.Store, blobDigest digest.Digest, blobSize int64, annotations map[string]string, iss ImageSourceSeekable, pullOptions pullOptions) (graphdriver.Differ, bool, error) {
// newUninitializedDiffer creates the very initial state of a chunkedDiffer.
// Next, .initialize… and .finishSetup must be called.
func newUninitializedDiffer(pullOptions pullOptions, stream ImageSourceSeekable, blobDigest digest.Digest, blobSize int64) *chunkedDiffer {
Contributor

For extra safety we could add a fully_initialized: bool to the struct and some asserts in different places around that?
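For illustration, a minimal sketch of that idea, assuming a hypothetical fullyInitialized field and assertReady helper; the names and the surrounding struct layout are made up here, not the actual chunkedDiffer:

```go
package chunked

// Hypothetical sketch of the suggestion above: track the initialization
// state explicitly and assert on it before the differ is used.
type chunkedDiffer struct {
	// ... existing fields elided ...
	fullyInitialized bool
}

func (d *chunkedDiffer) finishSetup() {
	// ... the rest of the setup ...
	d.fullyInitialized = true
}

// assertReady would be called at the top of methods like ApplyDiff.
func (d *chunkedDiffer) assertReady() {
	if !d.fullyInitialized {
		panic("internal error: chunkedDiffer used before finishSetup")
	}
}
```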

Collaborator Author

I’m not really sure where to go here.

I was primarily trying to document the various field lifetimes and reduce some duplication — but I’m not convinced the structure as it is should stay. This certainly could be made more enterprisey, by adding state enums and encapsulation and so on, but I have a vague feeling that there should be a cleaner, more elegant way to express this, if only I could find it. So I’d rather document and record things that might help find that ideal structure, and not set things in stone in a way that makes it harder to move towards that ideal structure.

Some of this is partially a consequence of the public API structure not being a great fit — now that we have convert_images, we can convert images only in the .ApplyDiff phase, when the graph driver tells us the destination directory, but structurally we’d prefer to do the conversion very early and then have the .ApplyDiff phase work only with a parsed zstd manifest. This accounts for most of the complex lifetimes of the “Chunked format”/“Chunked metadata” fields. But I don’t think we can change that easily.

There are many other things to consider:

  • Split this into two objects — differContext with the source/options/blob ID, plus differState with the rest? That would replace the newUninitialized/.initialize part (see the sketch after this list)
  • Storing topMetadata in chunkedDiffer as a whole — saves some lines, but makes the estargz/zstd:chunked differences less salient
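For illustration only, a rough sketch of the first option; the names and field sets are hypothetical, and pullOptions/ImageSourceSeekable stand for the package’s existing types:

```go
package chunked

import digest "github.com/opencontainers/go-digest"

// differContext: inputs that are fixed for the whole pull, known at
// GetDiffer time (this would replace the newUninitialized/.initialize part).
type differContext struct {
	pullOptions pullOptions
	stream      ImageSourceSeekable
	blobDigest  digest.Digest
	blobSize    int64
}

// differState: everything computed later (format detection, parsed
// metadata, conversion decisions, ...).
type differState struct {
	// ...
}
```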

… I think the bottom line is that I just don’t know what a good solution looks like yet, and now is not the best time to go figure that out. So maybe I should keep just a modified variant of the first commit, documenting the field groups, and leave the restructuring for some other time.

WDYT?

Contributor

I only drop in unpredictably and mostly superficially here (don't have bandwidth for more) in the overall hope that I maintain the background understanding to eventually execute on bootc-dev/bootc#20

So with that in mind

… I think the bottom line is that I just don’t know what a good solution looks like yet, and now is not the best time to go figure that out. So maybe I should keep just a modified variant of the first commit, documenting the field groups, and leave the restructuring for some other time.

That makes sense to me!

}

footer := make([]byte, footerSize)
streamsOrErrors, err := getBlobAt(blobStream, ImageSourceChunk{Offset: uint64(blobSize - footerSize), Length: uint64(footerSize)})
 	if err != nil {
-		return topMetadata{}, err
+		var badRequestErr ErrBadRequest // If the source does not support partial reads, we might want to fall back.
+		return topMetadata{}, errors.As(err, &badRequestErr), err
Contributor

It seems like it might be slightly cleaner to create a new error type, e.g. ErrShouldFallback, that wraps these, and have the caller test for that error type?

Collaborator Author

Oh, great point. The ultimate goal is to return ErrFallbackToOrdinaryLayerDownload … and we can just do that directly, without the canFallback bool. Originally I was thinking that this error type is part of the API of GetDiffer and that the called subroutines should not need to care, but threading the extra return value through all over the place is really not any better. I’ll rework this.
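A minimal sketch of the resulting pattern, with the error type defined here only for illustration (the actual definition in c/storage may differ): callees wrap the underlying cause, and the caller of GetDiffer only needs errors.As:

```go
package chunked

import (
	"errors"
	"fmt"
)

// Illustrative stand-in; not necessarily how c/storage defines it.
type ErrFallbackToOrdinaryLayerDownload struct{ Err error }

func (e ErrFallbackToOrdinaryLayerDownload) Error() string { return e.Err.Error() }
func (e ErrFallbackToOrdinaryLayerDownload) Unwrap() error { return e.Err }

// A callee can trigger the fallback directly, without threading a
// canFallback bool through every return value.
func readFooter() error {
	cause := fmt.Errorf("zstd:chunked metadata too large") // placeholder for the real check
	return ErrFallbackToOrdinaryLayerDownload{Err: cause}
}

// The caller decides between a partial pull and an ordinary layer download.
func pullLayer() error {
	if err := readFooter(); err != nil {
		var fallback ErrFallbackToOrdinaryLayerDownload
		if errors.As(err, &fallback) {
			// fall back to the traditional non-chunked pull path
			return nil
		}
		return err
	}
	return nil
}
```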

@cgwalters (Contributor)

if the chunked metadata is too large, trigger fallback to the traditional non-chunked pull path

Of course it's often these huge layers where partial pulls are most valuable...

On that topic, I know I’ve said this elsewhere, but I still think us "productizing" a "rechunker/canonicalizer" for containers would be a huge win across the board (again, it also helps pushes and registry storage, not just pulls). It would also mean less work for each individual zstd:chunked pull, keeping metadata size down.

@mtrmac (Collaborator Author) commented Jan 27, 2025

if the chunked metadata is too large, trigger fallback to the traditional non-chunked pull path

Of course it's often these huge layers where partial pulls are most valuable...

… and the metadata scales with the layer size, not number of files, so it will eventually go over the limit.

That’s true, but, also, we need pulls not to outright fail now. Handling images with larger-than-memory metadata is more work than this.

On that topic I know I've said this elsewhere but I still think us "productizing" a "rechunker/canonicalizer"

I have no idea what this means. What are the inputs and outputs? Is this working within a single layer, or somehow moving files across layers?

@cgwalters (Contributor)

That’s true, but, also, we need pulls not to outright fail now. Handling images with larger-than-memory metadata is more work than this.

Yep, agreed.

I have no idea what this means. What are the inputs and outputs? Is this working within a single layer, or somehow moving files across layers?

See all the things in ostreedev/ostree-rs-ext#69 (especially my favorite https://grahamc.com/blog/nix-and-layered-docker-images/ ) - it can mean multiple things but basically a combination of adding determinism (file timestamps ruin all caching across the board) and yes splitting files into distinct layers - the code in rpm-ostree does this starting from an input filesystem tree, using the rpm database.

@mtrmac force-pushed the chunked-too-large branch from 0e924af to 1ebfbe8 on January 28, 2025, 18:41
@mtrmac (Collaborator Author) commented Jan 28, 2025

Is this working within a single layer, or somehow moving files across layers?

See all the things in ostreedev/ostree-rs-ext#69 (especially my favorite https://grahamc.com/blog/nix-and-layered-docker-images/ ) - it can mean multiple things but basically a combination of adding determinism (file timestamps ruin all caching across the board) and yes splitting files into distinct layers - the code in rpm-ostree does this starting from an input filesystem tree, using the rpm database.

My general intuition is that the layer concept of Docker is, to a fair extent, an implementation detail that should not be treated as a constraint… and, with zstd:chunked, it’s already not that relevant anymore, because zstd:chunked always uses reflinks across layers (and copies chunks of modified files, rsync-style).

I think that users care that “pulling multiple small applications on top of a single large shared base image is cheap”. That’s a specific property that directly affects how images are built, with explainable performance properties.

It seems to me that re-packing, both within image boundaries and across, doesn’t have that kind of explainable performance. “It will be better, on average”, sure. But maybe there are other ways to achieve that. Hypothetically, what if some future zstd:chunked-v2 metadata also contained deterministic hashes of full directory trees, so that we could directly determine “all of /usr/share/python is the same as in that other image”? What I find attractive about this is that users would never hear about a “re-packer” as anything that needs tuning.
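Purely to make that hypothetical concrete (this is not part of zstd:chunked today), a deterministic directory-tree digest could be computed roughly like this, hashing sorted relative paths, modes, and regular-file contents so that identical trees always produce identical digests:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// treeDigest computes a deterministic digest of a directory tree: only
// names, modes, and contents are hashed, so build timestamps do not matter.
// A real design would also cover symlink targets, xattrs, ownership, etc.
func treeDigest(root string) (string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		paths = append(paths, path)
		return nil
	})
	if err != nil {
		return "", err
	}
	sort.Strings(paths) // WalkDir is already lexically ordered, but be explicit

	h := sha256.New()
	for _, p := range paths {
		info, err := os.Lstat(p)
		if err != nil {
			return "", err
		}
		rel, _ := filepath.Rel(root, p)
		fmt.Fprintf(h, "%s\x00%o\x00", rel, info.Mode())
		if info.Mode().IsRegular() {
			f, err := os.Open(p)
			if err != nil {
				return "", err
			}
			if _, err := io.Copy(h, f); err != nil {
				f.Close()
				return "", err
			}
			f.Close()
		}
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	d, err := treeDigest(os.Args[1])
	if err != nil {
		panic(err)
	}
	fmt.Println(d)
}
```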

Obviously this also interacts with composefs; IIRC it’s still not 100% settled whether we want a single top-level composefs manifest, or one composefs per layer plus overlay to stack them.

@mtrmac (Collaborator Author) commented Jan 28, 2025

PR updated:

  • Dropped most of the chunkedDiffer restructuring
  • The code now creates ErrFallbackToOrdinaryLayerDownload (and private errFallbackCanConvert) directly
  • If we refuse to pull a layer partially because its metadata is too large, we also refuse to use the convert_images path (reading the full file and creating a new partial representation — its metadata would also be too large, and the conversion would fail). In that case, per the existing convert_images semantics (use case: “always use composefs“), the pull continues to fail instead of falling back to a non-partial pull. (A sketch of this decision is below.)
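A sketch of that decision with hypothetical names (maxMetadataSize, checkChunkedMetadataSize); the real code is wired through ErrFallbackToOrdinaryLayerDownload and errFallbackCanConvert rather than this simplified wrapping:

```go
package chunked

import "fmt"

// errFallbackToOrdinaryPull stands in for the real fallback error type.
var errFallbackToOrdinaryPull = fmt.Errorf("falling back to ordinary layer download")

// Constant and signature are hypothetical; they just mirror the description above.
const maxMetadataSize = 50 << 20

func checkChunkedMetadataSize(size int64, convertImages bool) error {
	if size <= maxMetadataSize {
		return nil // small enough: continue with the partial pull
	}
	if convertImages {
		// convert_images would re-create equally large metadata and fail
		// during conversion, so per its semantics ("always use composefs")
		// the pull fails outright instead of silently degrading.
		return fmt.Errorf("chunked metadata too large (%d bytes)", size)
	}
	// Otherwise ask the caller to retry via the traditional non-chunked path.
	return fmt.Errorf("chunked metadata too large (%d bytes): %w", size, errFallbackToOrdinaryPull)
}
```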

Manually tested against containers/podman#24885 (comment), both with and without convert_images.

Ready for review, PTAL.

@mtrmac (Collaborator Author) commented Jan 28, 2025

Also filed #2232 to track adding support for partial pulls of very large images.

@cgwalters (Contributor)

My general intuition is that the layer concept of Docker is, to a fair extent, an implementation detail that should not be treated as a constraint… and, with zstd:chunked, it’s not that relevant anymore, already, because zstd:chunked always uses reflinks across layers (and copies chunks of modified files, rsync-style).

I think you're wrong. In a nutshell, generating reproducible container images benefits not just pulling but also pushing images and registry storage (especially multiplied by many build variants of the same image).

Today linux-firmware alone is 400MB - splitting that out into its own layer that is always bit-for-bit identical (given same input and gzip compressor) is a huge win. Trust me.

More in e.g. containers/image#902 (comment)

@giuseppe (Member) left a comment

LGTM. Since we know there are already such big images out there, what do you think about increasing the max size too as part of this PR?

@mtrmac (Collaborator Author) commented Jan 30, 2025

since we know there are already such big images out there, what do you think about increasing the max size too as part of this PR?

I can’t think of any strong reason to prefer one size over any other. I’m open to changing this, but then what should we choose?


That image has a <92 MB uncompressed manifest vs. a 50 MB limit — and a tar-split, for which we currently don’t set any limit, of 310 MB.

Just intuitively with no numbers, I think 50 vs. 100 MB, say, is about equally reasonable and I wouldn’t argue at all about that difference … but, also, neither is obviously small enough. And we would probably want to hard-code a limit larger than the images we see in the wild.

So, 200? 256? MB? That, or the >300 (400? 500? MB) for tar-split is getting rather further into the “unreasonable for small systems” territory.


If it helps move things along, I’d be happy to bump that to 100 MB.
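As background on enforcing such limits when the exact size is not known up front (e.g. for the tar-split data, which currently has no limit), a generic sketch, not the c/storage code, of capping an in-memory read:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// readAllWithLimit reads r fully but fails once more than limit bytes arrive,
// instead of allocating an unbounded amount of memory.
func readAllWithLimit(r io.Reader, limit int64) ([]byte, error) {
	var buf bytes.Buffer
	n, err := io.Copy(&buf, io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if n > limit {
		return nil, fmt.Errorf("data exceeds the %d-byte limit", limit)
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := readAllWithLimit(strings.NewReader("tar-split payload"), 50<<20)
	fmt.Println(len(data), err)
}
```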

Try to split fields by purpose, and document some of the context.

For consistency, use the same order in the struct literals.

Only reorders existing fields, should not change behavior.

Signed-off-by: Miloslav Trmač <[email protected]>
Instead, have the callees produce ErrFallbackToOrdinaryLayerDownload
directly.

Signed-off-by: Miloslav Trmač <[email protected]>
That's a logically better place: it pairs the getBlobAt
calls with the ErrBadRequest types specific to those call sites.

We will also add more fallback reasons.

Signed-off-by: Miloslav Trmač <[email protected]>
... but not if the fallback would be convert_images, again
creating too large metadata.

Signed-off-by: Miloslav Trmač <[email protected]>
@cgwalters (Contributor)

That, or the >300 (400? 500? MB) for tar-split is getting rather further into the “unreasonable for small systems” territory.

I thought we had a tracker for this, but basically, just to reiterate: it's really quite silly and circular to ship tar-split in zstd:chunked; since we control the format, we should define a "canonical tar" that we can reproduce given the metadata we have.
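To make the "canonical tar" idea concrete, a toy sketch of deterministic tar writing (sorted entries, fixed timestamps, default uid/gid and no user/group names), as a general illustration rather than a proposed format:

```go
package main

import (
	"archive/tar"
	"os"
	"sort"
	"time"
)

type entry struct {
	name string
	mode int64
	data []byte
}

// writeCanonicalTar writes entries in sorted order with all non-essential
// header fields held constant, so the same inputs always produce a
// bit-for-bit identical archive.
func writeCanonicalTar(out *os.File, entries []entry) error {
	sort.Slice(entries, func(i, j int) bool { return entries[i].name < entries[j].name })
	tw := tar.NewWriter(out)
	for _, e := range entries {
		hdr := &tar.Header{
			Name:    e.name,
			Mode:    e.mode,
			Size:    int64(len(e.data)),
			ModTime: time.Unix(0, 0), // fixed timestamp for determinism
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := tw.Write(e.data); err != nil {
			return err
		}
	}
	return tw.Close()
}

func main() {
	f, err := os.Create("canonical.tar")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := writeCanonicalTar(f, []entry{
		{name: "usr/bin/hello", mode: 0o755, data: []byte("#!/bin/sh\necho hello\n")},
	}); err != nil {
		panic(err)
	}
}
```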

@TomSweeneyRedHat (Member)

@cgwalters @giuseppe @mtrmac is this good to merge now, or are further changes needed? I'm asking as we'd like to get this into c/storage 1.57 and Podman v5.4. I'm fine with adding additional changes over the next few weeks if needed for RHEL 9.6/10.0 and Podman v5.4.*.

@mtrmac (Collaborator Author) commented Jan 30, 2025

@TomSweeneyRedHat I think the code can go in as is, and we are talking just about tuning parameters (which is not time-sensitive the way the core of this PR is); but I obviously shouldn’t be approving my own PRs.

@giuseppe (Member) left a comment

/lgtm

openshift-ci bot commented Jan 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, mtrmac

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@giuseppe (Member)

If it helps move things along, I’d be happy to bump that to 100 MB.

I think we can merge this as it is now, and we can bump the limit later. Since we already saw an image that is 92MB, should we just play safer here and use at least 150MB?

@mtrmac (Collaborator Author) commented Jan 30, 2025

since we control the format we should define a "canonical tar" that we can reproduce given the metadata we have.

That would be necessary for some purposes, but let’s not plan on doing that in zstd:chunked. It turns out it is impossible to provide all of:

  • compatibility with chunked-unaware tar consumers
  • full performance for chunked-aware consumers
  • a guarantee that both consumers see the same contents.

So if we were to make any new format (with a user-visible new option), let’s bite the bullet and break the tar consumers.

@giuseppe (Member)

I thought we had a tracker for this but basically just to reiterate it's really quite silly and circular to ship tar-split in zstd:chunked, since we control the format we should define a "canonical tar" that we can reproduce given the metadata we have.

Yes, I agree; we need to look into that and avoid repeating the same information in three different places.

@openshift-merge-bot bot merged commit f3d8abc into containers:main on Jan 30, 2025
20 checks passed
@mtrmac (Collaborator Author) commented Jan 30, 2025

Since we already saw an image that is 92MB, should we just play safer here and use at least 150MB?

*shrug* why not. For the record: #2240.
