
Feature Request: recover from partially pulled layer (network error, NOT out of disk space) without restarting #14616

Open
praveenkumar opened this issue Jun 16, 2022 · 19 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@praveenkumar
Contributor

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind feature

Description
When pulling an image with a large layer (>1GB) from a registry, and the pull fails partway through because of a network issue or because the user cancels it,
the layer is re-pulled from scratch instead of resuming from the partially pulled layer.

It would be good to have the partial layer saved, and when the pull is attempted again for the download to resume from where it left off.

Steps to reproduce the issue:

  • Create an image with very large layers and store it in a repository.
  • Pull the image over a slow and possibly unreliable network.
  • Have the pull fail when the layer is close to complete (e.g. exit manually).

Describe the results you received:

  • Pull the same image again and you will see that the layer which was close to complete starts pulling from scratch again.

Describe the results you expected:

  • The pull should resume instead of starting from scratch.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman version
Client:       Podman Engine
Version:      4.1.0
API Version:  4.1.0
Go Version:   go1.18
Built:        Fri May  6 21:45:54 2022
OS/Arch:      linux/amd64

This is a similar issue to #7497.

@openshift-ci openshift-ci bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 16, 2022
@Luap99
Member

Luap99 commented Jun 16, 2022

@mtrmac @vrothberg PTAL

@mtrmac
Collaborator

mtrmac commented Jun 16, 2022

This specific RFE is, I think, basically local to c/image/docker/dockerImageSource.GetBlob: on an EOF or some other failures (as long as we received some data), use a range request to continue where we left off. Mostly local and transparent to the rest, apart from having to decide on a heuristic for which failures to handle like that.
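For illustration, a minimal Go sketch of the range-request resume idea described above. This is not the actual c/image code; resumeBlob and its signature are invented for this example:

// Hypothetical sketch: resume a blob download with an HTTP Range request
// after `already` bytes have been received. Not actual c/image code.
package blobresume

import (
	"fmt"
	"io"
	"net/http"
)

// resumeBlob re-requests blobURL starting at byte offset `already` and
// appends the remaining bytes to dst. The registry must answer
// 206 Partial Content; anything else means the caller has to fall back
// to re-pulling the whole layer.
func resumeBlob(client *http.Client, blobURL string, already int64, dst io.Writer) error {
	req, err := http.NewRequest(http.MethodGet, blobURL, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", already))

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusPartialContent {
		return fmt.Errorf("registry did not honor the range request: %s", resp.Status)
	}
	_, err = io.Copy(dst, resp.Body)
	return err
}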

There might be some interaction with c/common/pkg/retry if the failures are timeouts. I’m not sure that we need to handle that, apart from maybe containers/common#654 (impose a total timeout on a single image copy).
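A hedged sketch of what a total timeout on a single image copy (the containers/common#654 idea) could look like; copyImage here is a stand-in for the real copy, not an existing c/image call:

package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// copyImage is a placeholder for the real image copy; it only exists so
// this sketch compiles and runs.
func copyImage(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // pretend the copy takes 2 seconds
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Give the whole copy, including any per-layer retries, one fixed budget.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	if err := copyImage(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			log.Fatal("image copy exceeded the total timeout")
		}
		log.Fatal(err)
	}
}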

To be explicit, this RFE is not “retry a layer pull individually, don’t abort the copy and retry on c/common/pkg/retry to do the copy all from scratch again”.

@akiross

akiross commented Jun 30, 2022

Another situation in which the current state of things caused annoyance: the process was interrupted by lack of disk space.

I had to download a 12GB image; after all the blobs were downloaded, the process was interrupted during unpacking due to missing space on the device. The blobs were discarded, and I had to download them all again.

I'm unsure if this is related to what the OP said, but it is part of the same "continue where you left off" user experience.

$ podman version
Client:       Podman Engine
Version:      4.1.1
API Version:  4.1.1
Go Version:   go1.17.10
Built:        Tue Jan  1 01:00:00 1980
OS/Arch:      linux/amd64

@mtrmac
Collaborator

mtrmac commented Jun 30, 2022

@akiross That’s not this specific RFE. (Also, it’s not something I’d expect now that we commit layers immediately, but please report it separately.)

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@alien999999999

alien999999999 commented Dec 19, 2022

So, what happened for me is:

Trying to pull docker.io/some/image...
Getting image source signatures
Copying blob ea362f368469 done  
Copying blob a4431157df48 done  
Copying blob 014f7000a66e done  
Copying blob 85dcc8c0c752 done  
Copying blob b119e9491b35 done  
Copying blob bbcb5ff8fceb done  
Copying blob e83e99622cc7 [=====================>----------------] 1.7GiB / 3.0GiB
Copying blob 074ef7e60abf done  
Getting image source signatures
Copying blob bbcb5ff8fceb done  
Copying blob b119e9491b35 done  
Copying blob ea362f368469 done  
Copying blob 85dcc8c0c752 done  
Copying blob a4431157df48 done  
Copying blob 014f7000a66e done  
Copying blob e83e99622cc7 [=====================>----------------] 1.7GiB / 3.0GiB
Copying blob 074ef7e60abf done  
Getting image source signatures
Copying blob ea362f368469 done  
Copying blob b119e9491b35 done  
Copying blob a4431157df48 done  
Copying blob 014f7000a66e done  
Copying blob bbcb5ff8fceb done  
Copying blob 85dcc8c0c752 done  
Copying blob e83e99622cc7 [=====================>----------------] 1.7GiB / 3.0GiB
Copying blob 074ef7e60abf done  
Getting image source signatures
Copying blob ea362f368469 done  
Copying blob b119e9491b35 done  
Copying blob 014f7000a66e done  
Copying blob 85dcc8c0c752 done  
Copying blob bbcb5ff8fceb done  
Copying blob a4431157df48 done  
Copying blob 074ef7e60abf done  
Copying blob e83e99622cc7 [=====================>----------------] 1.7GiB / 3.0GiB
  write /tmp/storage328542328/8: no space left on device
Error: Error writing blob: error storing blob to file "/tmp/storage328542328/8": write /tmp/storage328542328/8: no space left on device

It looks like some timeout kicks in (can I increase it somewhere?) and it retries, but it seems to start over...

@vrothberg
Member

@alien999999999, your other comment indicated that you're using Podman v3.0. I suggest updating to a more recent version of Podman, and either increasing /tmp or moving it to a bigger partition.

@alien999999999

I'm going to look at that, but I think it's part of the distribution, and I'm not sure I can easily upgrade.

In any case, enlarging /tmp is not going to work; it just endlessly times out and retries... Do you know if there is a timeout setting I can just double? Or could I download this blob manually so it gets picked up, or something? I noticed an ostree option...

@alien999999999

Yeah, it's on the distro, but upgrading it is difficult. What else would I have to upgrade?

@vrothberg
Member

A timeout/retry won't help when there's not enough space on the device. All improvements for this issue are shipped with later versions of Podman.

@alien999999999

/tmp is 8GB and the file is 3GB; it's only after retrying 1.7GB four times that it runs out of disk space...

@alien999999999

According to https://stackoverflow.com/questions/16895294/how-to-set-timeout-for-http-get-requests-in-golang, http.Client has a Timeout field. Could you expose this through the pull_options?
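For reference, a minimal Go example of the Timeout field mentioned above; the registry URL is a made-up placeholder. Note that http.Client.Timeout bounds the entire request, including reading the response body, so for a multi-gigabyte blob on a slow link it would have to be very generous:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Timeout covers the whole request, response-body read included;
	// 0 means no client-side limit at all.
	client := &http.Client{Timeout: 60 * time.Minute}

	// Placeholder URL, for illustration only.
	resp, err := client.Get("https://registry.example.com/v2/library/example/blobs/sha256:0000")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}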

@alien999999999

If I manually download this blob with curl commands, how can I put it into an ostree? Is there any documentation on that?

@mtrmac
Collaborator

mtrmac commented Dec 23, 2022

@alien999999999 As in #14616 (comment), running out of space is not what we are trying to track in this RFE. Please file a fresh issue and discuss it there.

@mtrmac mtrmac changed the title Feature Request: recover from partially pulled layer without restarting Feature Request: recover from partially pulled layer (network error, NOT out of disk space) without restarting Dec 23, 2022
@alien999999999

So, yes, that's exactly it; this is NOT about the "out of disk space"... that is NOT my problem. My problem is that it restarts after a timeout at 1.7GB (before the 3GB blob is done), and does so in an endless loop; the out-of-disk-space error just happens as a result of the constant retrying (I could add terabytes and eventually it would also fill up). I was hoping there was a tunable timeout, so I could double it and the 3GB would be transferred and I could continue.

@rhatdan
Member

rhatdan commented Jan 3, 2023

@mtrmac Reminder.

@ghost

ghost commented Oct 31, 2024

any chance this will be supported soon?

@mtrmac
Collaborator

mtrmac commented Oct 31, 2024

containers/image#1816 has existed for quite some time now, and I think that's the extent of what makes sense to do.


A strict reading of the RFE is that if podman pull is interrupted, e.g. using Ctrl+C, the partially pulled layer is stored somewhere and reused on a later retry. To do that, we would need to somehow ensure that if there is no "later retry", the partial file (which can be many gigabytes in size) is automatically deleted, without bothering the user. I don't know how that could happen, and I think it's very unlikely we would ever implement that.

@ghost

ghost commented Oct 31, 2024

@mtrmac Is it possible to add an option for a recoverable pull, where the user takes care of the partial file?
It could then be deleted when using system prune.
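
As a rough illustration of that suggestion, a sketch of prune-time cleanup of leftover partial downloads; the directory layout and names here are invented and do not correspond to what containers/storage actually does:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// prunePartialBlobs removes partial-download files older than maxAge from a
// hypothetical partial-blob directory. Purely illustrative: the directory
// and file layout are invented for this sketch.
func prunePartialBlobs(partialDir string, maxAge time.Duration) error {
	entries, err := os.ReadDir(partialDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			continue
		}
		if time.Since(info.ModTime()) > maxAge {
			path := filepath.Join(partialDir, e.Name())
			fmt.Println("removing stale partial blob:", path)
			_ = os.Remove(path)
		}
	}
	return nil
}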
