Fall back from chunked pulls when the chunked metadata is too large #2230
Conversation
Only a superficial skim
pkg/chunked/storage_linux.go
Outdated
```go
func getProperDiffer(store storage.Store, blobDigest digest.Digest, blobSize int64, annotations map[string]string, iss ImageSourceSeekable, pullOptions pullOptions) (graphdriver.Differ, bool, error) {

// newUninitializedDiffer creates the very initial state of a chunkedDiffer.
// Next, .initialize… and .finishSetup must be called.
func newUninitializedDiffer(pullOptions pullOptions, stream ImageSourceSeekable, blobDigest digest.Digest, blobSize int64) *chunkedDiffer {
```
For extra safety we could add a fully_initialized: bool to the struct and some asserts in different places around that?
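Purely as an illustration of that suggestion (none of this is in the PR, and the real chunkedDiffer has many more fields), a guard flag plus an assert could look roughly like this:

```go
package main

import "fmt"

// differ is a toy stand-in for chunkedDiffer, used only to show the idea of a
// fullyInitialized guard that the two-phase construction sets and later
// phases assert on.
type differ struct {
	blobSize         int64
	fullyInitialized bool
}

// newUninitialized mirrors the "very initial state" constructor.
func newUninitialized(blobSize int64) *differ {
	return &differ{blobSize: blobSize}
}

// finishSetup stands in for .initialize…/.finishSetup completing construction.
func (d *differ) finishSetup() {
	d.fullyInitialized = true
}

// applyDiff refuses to run on a half-constructed differ.
func (d *differ) applyDiff() error {
	if !d.fullyInitialized {
		return fmt.Errorf("internal error: differ used before finishSetup")
	}
	return nil
}

func main() {
	d := newUninitialized(1 << 20)
	fmt.Println(d.applyDiff()) // internal error: differ used before finishSetup
	d.finishSetup()
	fmt.Println(d.applyDiff()) // <nil>
}
```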
I’m not really sure where to go here.
I was primarily trying to document the various field lifetimes, and reduce some duplication — but I’m not convinced the structure as it is should stay. This certainly could be made more enterprisey, by adding state enums and encapsulation and …, but I have a vague feeling that there should be a cleaner, more elegant way to express this, if only I could find it. So I’d rather document / record things that might help find that ideal structure, and not set things in stone in a way that makes it harder to move towards that ideal structure.
Some of this is partially a consequence of the public API structure not being a great fit — now that we have convert_images, we can convert images only in the .ApplyDiff phase, when the graph driver tells us the destination directory, but structurally we’d prefer to do the conversion very early and then have the .ApplyDiff phase only work with a parsed zstd manifest. This accounts for most of the complex lifetimes of the “Chunked format”/“Chunked metadata” fields. But I don’t think we can change that easily.
There are many other things to consider:
- Split this into two objects — differContext with the source/options/blob ID, + differState with more? That would replace the newUninitialized/.initialize part (see the rough sketch after this list).
- Storing topMetadata in chunkedDiffer as a whole — saves some lines, but makes the estargz/zstd:chunked differences less salient.
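A rough sketch of the first option, just to make the shape concrete; the names differContext/differState come from the list above, everything else is invented for illustration:

```go
package main

import "fmt"

// differContext would hold the inputs that are fixed for the whole pull:
// the blob identity, the seekable source, and the pull options.
type differContext struct {
	blobDigest string
	blobSize   int64
	// pullOptions and the ImageSourceSeekable stream would live here too.
}

// differState would hold everything built up while processing the blob,
// replacing the current newUninitialized/.initialize two-step construction.
type differState struct {
	ctx             differContext
	convertedToZstd bool
	// the parsed manifest, TOC digest, file metadata, … would live here too.
}

func newDifferState(ctx differContext) *differState {
	return &differState{ctx: ctx}
}

func main() {
	st := newDifferState(differContext{blobDigest: "sha256:…", blobSize: 1 << 20})
	fmt.Printf("%+v\n", *st)
}
```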
… I think the bottom line is that I just don’t know what a good solution looks like yet, and now is not the best time to go figure that out. So maybe I should keep just a modified variant of the first commit, documenting the field groups, and leave the restructuring for some other time.
WDYT?
I only drop in unpredictably and mostly superficially here (don't have bandwidth for more) in the overall hope that I maintain the background understanding to eventually execute on bootc-dev/bootc#20
So with that in mind
> … I think the bottom line is that I just don’t know what a good solution looks like yet, and now is not the best time to go figure that out. So maybe I should keep just a modified variant of the first commit, documenting the field groups, and leave the restructuring for some other time.
That makes sense to me!
pkg/chunked/compression_linux.go
Outdated
```diff
 	footer := make([]byte, footerSize)
 	streamsOrErrors, err := getBlobAt(blobStream, ImageSourceChunk{Offset: uint64(blobSize - footerSize), Length: uint64(footerSize)})
 	if err != nil {
-		return topMetadata{}, err
+		var badRequestErr ErrBadRequest // If the source does not support partial reads, we might want to fall back.
+		return topMetadata{}, errors.As(err, &badRequestErr), err
```
It seems like it might be slightly cleaner to create a new error type, e.g. ErrShouldFallback, that wraps these, and the caller tests for that error type?
Oh, great point. The ultimate goal is to return ErrFallbackToOrdinaryLayerDownload… and we can just do that directly, without the canFallback bool. Originally I was thinking that this error type is a part of the API of GetDiffer and that called subroutines should not need to care, but returning the extra return value all over the place is really not any better. I’ll rework this.
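A minimal, self-contained sketch of that reworked shape, with stand-in types (the real ErrFallbackToOrdinaryLayerDownload, ErrBadRequest, and getBlobAt in c/storage have different definitions and signatures):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in types for illustration only.
type errFallback struct{ err error }

func (e errFallback) Error() string { return "falling back to ordinary layer download: " + e.err.Error() }
func (e errFallback) Unwrap() error { return e.err }

type errBadRequest struct{}

func (errBadRequest) Error() string { return "source does not support partial reads" }

// readFooter sketches the callee side: instead of returning a separate
// canFallback bool, it wraps the recoverable error in the fallback type, so
// the top-level caller only has to test for that one type.
func readFooter() error {
	err := error(errBadRequest{}) // pretend the partial-read request failed this way
	var bad errBadRequest
	if errors.As(err, &bad) {
		return errFallback{err: err}
	}
	return err
}

func main() {
	err := readFooter()
	var fallback errFallback
	if errors.As(err, &fallback) {
		fmt.Println("caller sees:", err)
	}
}
```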
Of course it's often these huge layers where partial pulls are most valuable... On that topic, I know I've said this elsewhere, but I still think us "productizing" a "rechunker/canonicalizer" for containers would be a huge win across the board (again, it also helps pushes and registry storage, not just pulls), and it also means less work for each individual zstd:chunked pull, keeping metadata size down.
> … and the metadata scales with the layer size, not number of files, so it will eventually go over the limit.

That’s true, but, also, we need pulls not to outright fail now. Handling images with larger-than-memory metadata is more work than this.
I have no idea what this means. What are the inputs and outputs? Is this working within a single layer, or somehow moving files across layers?
Yep, agreed.
See all the things in ostreedev/ostree-rs-ext#69 (especially my favorite, https://grahamc.com/blog/nix-and-layered-docker-images/ ). It can mean multiple things, but basically a combination of adding determinism (file timestamps ruin all caching across the board) and, yes, splitting files into distinct layers; the code in rpm-ostree does this starting from an input filesystem tree, using the rpm database.
(Branch force-pushed from 0e924af to 1ebfbe8.)
My general intuition is that the layer concept of Docker is, to a fair extent, an implementation detail that should not be treated as a constraint… and, with zstd:chunked, it’s not that relevant anymore already, because zstd:chunked always uses reflinks across layers (and copies chunks of modified files, rsync-style).

I think that users care that “pulling multiple small applications on top of a single large shared base image is cheap”. That’s a specific property that directly affects how images are built, with explainable performance properties. It seems to me that re-packing, both within image boundaries and across them, doesn’t have that kind of explainable performance. “It will be better, on average”, sure. But maybe there are other ways to achieve that. Hypothetically, what if some future zstd:chunked-v2 metadata also contained deterministic hashes of full directory trees, so that we could directly determine “all of …”?

Obviously this also interacts with composefs; IIRC it’s still not 100% settled whether we want a single top-level composefs manifest or one composefs per layer + overlay to stack them.
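Purely to illustrate what “deterministic hashes of full directory trees” could mean (nothing like this exists in zstd:chunked today; it is a hypothetical), such a digest might be computed along these lines:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// treeDigest hashes every entry of a directory tree, in lexical order, by
// relative path and (for regular files) contents. A real format would also
// have to cover modes, ownership, xattrs, symlink targets, and so on; this is
// only meant to illustrate the idea of a deterministic per-tree hash.
func treeDigest(root string) (string, error) {
	h := sha256.New()
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		fmt.Fprintf(h, "path:%s\n", rel)
		if d.Type().IsRegular() {
			f, err := os.Open(path)
			if err != nil {
				return err
			}
			defer f.Close()
			if _, err := io.Copy(h, f); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	digest, err := treeDigest("/usr/share/doc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(digest)
}
```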
PR updated:
Manually tested against containers/podman#24885 (comment), both with and without …. Ready for review, PTAL.
Also filed #2232 to track adding support for partial pulls of very large images.
I think you're wrong. In a nutshell, generating reproducible container images benefits not just pulling but also pushing images and registry storage (especially multiplied by many build variants of the same image). Today linux-firmware alone is 400 MB; splitting that out into its own layer that is always bit-for-bit identical (given the same input and gzip compressor) is a huge win. Trust me. More in e.g. containers/image#902 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Since we know there are already such big images out there, what do you think about increasing the max size too as part of this PR?
I can’t think of any strong reason to prefer one size over any other. I’m open to changing this, but then what should we choose? That image has a <92 MB uncompressed manifest vs. a 50 MB limit — and a tar-split, for which we currently don’t set any limit, of 310 MB.

Just intuitively, with no numbers, I think 50 vs. 100 MB, say, is about equally reasonable and I wouldn’t argue at all about that difference… but, also, neither is obviously small enough. And we would probably want to hard-code a limit larger than the images we see in the wild. So, 200? 256? MB? That, or the >300 (400? 500?) MB for tar-split, is getting rather further into the “unreasonable for small systems” territory.

If it helps move things along, I’d be happy to bump that to 100 MB.
Try to split fields by purpose, and document some of the context. For consistency, use the same order in the struct literals. Only reorders existing fields, should not change behavior. Signed-off-by: Miloslav Trmač <[email protected]>
Instead, have the callees produce ErrFallbackToOrdinaryLayerDownload directly. Signed-off-by: Miloslav Trmač <[email protected]>
That's a logically better place, it pairs the getBlobAt calls with the ErrBadRequest types specific to those call sites. We will, also, add more fallback reasons. Signed-off-by: Miloslav Trmač <[email protected]>
... but not if the fallback would be convert_images, again creating too large metadata. Signed-off-by: Miloslav Trmač <[email protected]>
(Branch force-pushed from 1ebfbe8 to 009737b.)
I thought we had a tracker for this, but just to reiterate: it's really quite silly and circular to ship tar-split in zstd:chunked. Since we control the format, we should define a "canonical tar" that we can reproduce given the metadata we have.
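To make the "canonical tar" idea concrete, here is a sketch under assumed conventions (sorted entries, fixed timestamps and modes, one header format) of regenerating a reproducible tar stream from metadata alone. This is not what c/storage implements; the real metadata also carries modes, ownership, xattrs, and chunking information:

```go
package main

import (
	"archive/tar"
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
	"time"
)

// canonicalTar emits tar entries in a fixed, normalized form, so the archive
// (and therefore its digest) is reproducible from the metadata without
// shipping a tar-split stream. Files are reduced to name -> contents here
// purely for brevity.
func canonicalTar(files map[string][]byte) ([]byte, error) {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names) // fixed entry order

	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for _, name := range names {
		hdr := &tar.Header{
			Name:    name,
			Mode:    0o644,             // fixed mode for this toy example
			Size:    int64(len(files[name])),
			ModTime: time.Unix(0, 0),   // fixed timestamp
			Format:  tar.FormatPAX,     // one fixed header format
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(files[name]); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, _ := canonicalTar(map[string][]byte{"etc/hostname": []byte("example\n")})
	fmt.Printf("%x\n", sha256.Sum256(data)) // same inputs always yield the same digest
}
```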
@cgwalters @giuseppe @mtrmac is this good to merge now, or are further changes needed? I'm asking as we'd like to get this into c/storage 1.57 and Podman v5.4. I'm fine with adding additional changes over the next few weeks if needed for RHEL 9.6/10.0 and Podman v5.4.*.
@TomSweeneyRedHat I think the code can go in as is, and we are talking just about tuning parameters (which is not time-sensitive the way the core of this PR is); but I obviously shouldn’t be approving my own PRs.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: giuseppe, mtrmac.
I think we can merge this as it is now, and we can bump the limit later. Since we already saw an image that is 92 MB, should we just play safer here and use at least 150 MB?
That would be necessary for some purposes, but let’s not plan doing that in zstd:chunked. It turns out it is impossible to provide all of:
So if we were to make any new format (with a user-visible new option), let’s bite the bullet and break the tar consumers.
Yes, I agree; we need to look into that and avoid repeating the same information in 3 different places.
*shrug* Why not. For the record: #2240.
This is a short-term fix for containers/podman#24885: if the chunked metadata is too large, trigger a fallback to the traditional non-chunked pull path instead of a hard failure. See the linked issue for previous discussion.
RFC: While working on this, the APIs of various functions started looking increasingly unwieldy, so I started with a refactoring of the way chunkedDiffer is initialized, and the way the read…Manifest functions return data. I think it’s a slight improvement, but it’s not at all an obvious improvement. I’d be happy to drop the initial refactor commits and turn this into only the fallback logic.

This is a short-term mitigation of the existing limits. It seems that there are various other (close and not that close) code paths that can similarly allocate a large amount of memory, with, at least… inconsistent enforcement of limits.
Marked draft because I didn’t test the code yet.
@giuseppe PTAL
See individual commit messages for details.
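For readers skimming this thread, a simplified sketch of the fallback behavior this PR adds; the real code lives in pkg/chunked and uses different names, and the 50 MB constant stands in for the currently hard-coded limit discussed above:

```go
package main

import (
	"errors"
	"fmt"
)

const maxTocSize = 50 << 20 // stand-in for the hard-coded metadata size limit

var errFallbackToOrdinaryLayerDownload = errors.New("falling back to ordinary layer download")

// checkTocSize refuses to do a partial (chunked) pull when the metadata
// advertised in the annotations is too large, signalling fallback instead of
// failing the pull outright.
func checkTocSize(tocSize int64) error {
	if tocSize > maxTocSize {
		return fmt.Errorf("chunked metadata is %d bytes, over the %d byte limit: %w",
			tocSize, int64(maxTocSize), errFallbackToOrdinaryLayerDownload)
	}
	return nil
}

func main() {
	err := checkTocSize(92 << 20) // e.g. the ~92 MB manifest mentioned in the discussion
	if errors.Is(err, errFallbackToOrdinaryLayerDownload) {
		fmt.Println("pulling the layer the traditional, non-chunked way:", err)
	}
}
```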