WIP / Feature recompress (2) #770
Conversation
Avoids any stale locks or uncommitted segments.
Why work in reverse, i.e. starting at the latest segment and working towards the oldest segment? In my mind it is a goal in backup systems like Borg to minimize the "acquisition time", the time needed to store a snapshot of the data. Using heavy compression will usually increase this time. Enter borg recompress: use some fast compression while running borg create (LZ4 is pretty much ideal for any CPU+disk combination) to minimize acquisition time, then run borg recompress later at any time to squash space usage down. It will compress the chunks of the latest archives first. A useful feature might be to track in-repo where it left off, so it can skip segments without even looking at them (segment IDs are ever increasing).
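For illustration, a minimal sketch of that "track where it left off" idea. The helper names (`repository.segment_ids()`, `recompress_segment()`) are assumptions for this sketch, not borg's actual API:

```python
def recompress_newest_first(repository, recompress_segment, last_done=-1):
    # Segment IDs only ever increase, so a stored watermark lets a later
    # run skip already-finished segments without looking at them.
    segment_ids = sorted(repository.segment_ids(), reverse=True)
    for segment_id in segment_ids:
        if segment_id <= last_done:
            break  # everything at or below the watermark was handled earlier
        recompress_segment(segment_id)
    # The next run can start skipping at the newest segment we saw this time.
    return segment_ids[0] if segment_ids else last_done
```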
Also some clean up
Force-pushed from d37580b to 0cdb9e9
repository.commit()
if exit_soon:
    break
except Exception as e:  # too broad!
I added this part when evaluating "Ctrl-C safety" the first time (and it caught the KeyboardInterrupt reliably when inside[1]), but I think now that signals are handled explicitly it's unnecessary.
[1] Outside this part it didn't matter much; it only left behind stale locks and the like.
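For context, a minimal sketch of what "signals are handled explicitly" could look like (assumed shape, not this PR's actual code): SIGINT only sets a flag, and the loop commits and stops at the next safe point, so the broad except is no longer needed.

```python
import signal

exit_soon = False  # checked by the main loop at safe points

def _request_exit(signum, frame):
    # Only set a flag here; the loop commits and breaks when it is safe,
    # so no stale locks or uncommitted segments are left behind.
    global exit_soon
    exit_soon = True

signal.signal(signal.SIGINT, _request_exit)
```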
Force-pushed from bf92bbe to ffab0a1
now this is becoming useful...!
Force-pushed from ffab0a1 to e63241c
@@ -75,7 +75,11 @@ def id_hash(self, data):
     def encrypt(self, data):
         pass

-    def decrypt(self, id, data):
+    def decrypt(self, id, data, no_decompress=False):
how about compress=True? ;)
that's certainly not less_legible :-)
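A sketch of a positively-named variant along the lines being suggested (signature and helper names are illustrative assumptions, written as a standalone function rather than the merged API):

```python
def decrypt(key, id, data, decompress=True):
    # Positive flag instead of the double negative no_decompress=True.
    plaintext = key.decrypt_raw(id, data)  # hypothetical low-level helper
    return key.compressor.decompress(plaintext) if decompress else plaintext
```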
… others, resolved "segment_id <= sp" vs "id < sp"
Force-pushed from 3f7e1da to 9355069
So, now we've got a "major" problem: csize in the chunk ID lists (if I remember correctly this was also an annoyance in multi-threading). We could do that. But now our problems just increased manifold: items -- that is, the item dicts with all the metadata -- are chunked as well, and the list of chunk IDs that represents the chunked items is the 'items' key of the archive metadata. And this metadata block is just another chunk, with its chunk ID -- the archive's ID -- depending, as usual, on the chunk contents. Changing the csize changes the chunk IDs of the items, and that changes all archive IDs. However, archive IDs must be final and never change (by our definition). So independent of how we approach recompression, either we
Dropping csize means that it's harder to find the compressed size of items on disk: while we don't need to actually read the data, we'd need to take a peek at the segment holding the chunk, since the segment entry contains the length of the stored data. My opinion, however: recompression is too useful compared to the csize field to not have it. Workarounds to consider...
Maybe 1.) for now (ideally we'd not have any trade-off, but this one seems reasonable).
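To make that cascade concrete, a toy illustration (not borg's real data structures; SHA-256 stands in for the real id_hash here):

```python
import hashlib

def chunk_id(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()  # stand-in for the real id_hash

# A chunk reference carries (id, size, csize); items containing such
# references are themselves chunked, and the archive ID hashes the metadata
# that lists those item chunks.
item = {'path': 'file', 'chunks': [(b'id0', 1024, 700)]}
item_chunk = repr(item).encode()
archive_meta = {'items': [chunk_id(item_chunk)]}
archive_id = chunk_id(repr(archive_meta).encode())

# Recompressing only changes csize, yet the item chunk's ID -- and therefore
# the archive ID -- changes with it, which must never happen.
item['chunks'] = [(b'id0', 1024, 500)]
assert chunk_id(repr(item).encode()) != chunk_id(item_chunk)
```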
Remote repos not working and the csize issues make this look rather questionable.
I agree! Remote repos can be supported, but the csize turns up again as a PITA on principle. So I'm thinking we should solve this problem at the root-cause level (and the root cause is that we put csize, which is unrelated to the identity of chunks [1], into the chunk lists, which are final/sealed). What are the clients (in an API sense) of csize?
info requires an up-to-date Cache; list does as well (it could be lazy about it, though, which might be worth another PR). So here's my line of thought: new archives get csize=0 (or -1 or some other dummy value) in all chunk lists. Repository gets a
If we decide to solve csize this way, it shouldn't be done in this PR imho, though.
[1] Chunk identity is our chunk ID:
csize also "punches through" our abstractions (borg.key).
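A hypothetical sketch of the direction proposed above (dummy csize in new chunk lists, stored size queried from the repository on demand; the method name is invented for illustration):

```python
CSIZE_UNKNOWN = 0  # or -1; some agreed-upon dummy value in new chunk lists

def stored_size(repository, chunk_id):
    # The segment entry already records how many bytes are stored on disk,
    # so this lookup needs neither decryption nor decompression.
    return repository.get_stored_length(chunk_id)  # hypothetical repo API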
Yes, csize is a troublemaker. But I guess we need to delay any change of the repo API until later; I'd like to make a few releases with features and fixes that we can do without breaking repo / repo API compatibility.
That's reasonable. Should we postpone / close this one then, or put a disclaimer on it for now basically saying
Because I see recompress as something really useful until there is (hopefully) someday a faster, multi-threaded borg create, to achieve... (citing commit message)
How about a more high-level and more generic approach for now, even if it is more expensive? As this is likely a rare / one-time / never operation for most users, efficiency is not as important as for the regularly invoked stuff. E.g. like:
transform_chunks would recompress the chunks as needed. As the chunk IDs do not change, space needs could be similar to your approach, but it would also work for remote repos and would not require any API or storage change (AFAICS). This "recreate archive" approach could even do other stuff: like dropping files according to a new exclude list, or applying dynamic compression (depending on file type) - so it is more like create with data coming from an archive rather than from local disk. About dropping csize: this would be something useful for the multithreading branch and for a major release (2.0, 3.0?). We should keep the thoughts / research results in a ticket about that.
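Roughly, the shape of that approach could look like the following (all names here are hypothetical, just restating the idea in code):

```python
def recreate_archive(source, target, transform_chunks):
    # Read items from the source archive, recompress (or otherwise transform)
    # their chunks, and write an equivalent new archive. Chunk IDs stay the
    # same, so deduplication is preserved and no repo/API change is needed.
    for item in source.iter_items():
        item.chunks = transform_chunks(item.chunks)
        target.add_item(item)
    target.save()
```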
That sounds like an idea; it would also fulfill feature requests for selectively deleting from old archives (via --exclude, and, I don't know if someone asked for that, but someone probably did).

If we allow processing of single archives, the csize will be wrong again for all other archives, though (in the general case; see below). And we can't really tell if that's the case: refcount > 1 of a chunk can also mean it's a reference from the same archive. Ensuring that it is would be prohibitive in terms of either time or additional memory use.

If this were used in a "progressive recompression" scheme, i.e. after every or every group of

So that's better than approaches No. 1 and 2 in these regards. Space usage when used to recompress whole repositories should be ok, working chronologically. Keeping a list / dict of seen chunks around will cost (a lot of) memory (but could be disabled, similar to --space-save), but it avoids lots of I/O + decryption + Compressor.detect, which would dominate otherwise.

I'll look into it.
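A minimal sketch of that seen-chunks bookkeeping (interfaces are assumed, not the actual PR code):

```python
def recompress_once(chunk_ids, fetch_and_recompress, seen=None):
    # Remember which chunk IDs were already handled so chunks referenced from
    # several items/archives are fetched, decrypted and detection-tested
    # only once.
    seen = set() if seen is None else seen
    for chunk_id in chunk_ids:
        if chunk_id in seen:
            continue
        fetch_and_recompress(chunk_id)
        seen.add(chunk_id)
    return seen
```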
Use with caution: permanent data loss by specifying incorrect patterns is easily possible. Make a dry run to make sure you got everything right.

borg rewrite has many uses:
- Can selectively remove files/dirs from old archives, e.g. to free space or purging picturarum biggus dickus from history
- Recompress data
- Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate with Borg 1.x archives (or to experiment with chunker-params for specific use cases)

It is interrupt- and resumable.

Chunks are not freed on-the-fly. Rationale: only makes sense when rechunkifying, but the logic on which new chunks free which input chunks is complicated and *very* delicate.

Current TODOs:
- Detect and skip (unless --force) already recompressed chunks -- delayed until the current PRs on the borg.key APIs are decided: borgbackup#810 borgbackup#789
- Usage example

Future TODOs:
- Refactor tests using py.test fixtures -- would require porting ArchiverTestCase to py.test: many changes, this changeset is already borderline too large.
- Possibly add a --target option to not replace the source archive -- with the target possibly in another repo (better than "cp" due to full integrity checking, and deduplication at the target)

Fixes borgbackup#787 borgbackup#686 borgbackup#630 borgbackup#70 (and probably some I overlooked)

Also see borgbackup#757 and borgbackup#770
Use with caution: permanent data loss by specifying incorrect patterns is easily possible. Make a dry run to make sure you got everything right.

borg recreate has many uses:
- Can selectively remove files/dirs from old archives, e.g. to free space or purging picturarum biggus dickus from history
- Recompress data
- Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate with Borg 1.x archives (or to experiment with chunker-params for specific use cases)

It is interrupt- and resumable.

Chunks are not freed on-the-fly. Rationale: only makes sense when rechunkifying, but the logic on which new chunks free which input chunks is complicated and *very* delicate.

Future TODOs:
- Refactor tests using py.test fixtures -- would require porting ArchiverTestCase to py.test: many changes, this changeset is already borderline too large.
- Possibly add a --target option to not replace the source archive -- with the target possibly in another repo (better than "cp" due to full integrity checking, and deduplication at the target)
- Detect and skip (unless --always-recompress) already recompressed chunks

Fixes borgbackup#787 borgbackup#686 borgbackup#630 borgbackup#70 (and probably some I overlooked)

Also see borgbackup#757 and borgbackup#770
This is the announced successor to #756.

I've now crunched a few hundred GB through a few versions of this (including the multi-threaded prototype) and so far no data loss (diff of SHA-512 `borg list`s) and no complaints from `borg check`.

Improvements over older prototypes now include "really clean exit" (i.e. no stale locks, everything nice and committed, it even prints stats if asked to), proper progress indication and reverse processing (see 5c2ea3a for rationale). And now we really don't write anything on --dry-run.
Additional space usage bound: depends on what the data is stored with and what you're compressing with. You can use recompress to decompress stuff (`-C none`). Generally speaking, it's about 10 * max_segment_size [1].

[1] I'm going with 10 MSS currently, because I'm assuming that commits are costly (and I think they are, since they do all the clean-up of over-written -- in the key-value store sense -- segments).
Also, additional space use will obviously be much larger if you aren't compressing.
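As a sketch of why the bound comes out at roughly 10 * max_segment_size (the loop shape and names are assumptions, not the PR's exact code):

```python
def recompress_with_bounded_space(repository, recompressed_chunks,
                                  max_segment_size, commit_every=10):
    # Commits are costly (they clean up over-written segments), so only
    # commit after about `commit_every` segments' worth of new data; the
    # extra space used in between stays around commit_every * max_segment_size.
    written = 0
    for chunk_id, data in recompressed_chunks:
        repository.put(chunk_id, data)
        written += len(data)
        if written >= commit_every * max_segment_size:
            repository.commit()
            written = 0
    repository.commit()  # final commit so nothing uncommitted remains
```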
TODO
- `id_ not in repository` => `continue` check, to only recompress objects in the index.

Feature list