WIP / Feature recompress (2) #770
Conversation
Avoids any stale locks or uncommitted segments.
Why work in reverse, i.e. starting at the latest segment and working towards the oldest segment? In my mind it is a goal in backup systems like Borg to minimize the "acquisition time", the time needed to store a snapshot of the data. Using heavy compression will usually increase this time. Enter borg recompress: use some fast compression while running borg create (LZ4 is pretty much ideal for any CPU+disk combination) to minimize acquisition time, then run borg recompress later at any time to squash space usage down. It will compress the chunks of the latest archives first. A useful feature might be to track in-repo where it left off, so it can skip segments without even looking at them (segment IDs are ever increasing).
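For illustration, a minimal sketch of that "track where it left off" idea. The helper names (`repository.segment_ids()`, `recompress_segment()`) are assumptions for this sketch, not borg's actual API:

```python
def recompress_newest_first(repository, recompress_segment, last_done=-1):
    # Segment IDs only ever increase, so a stored watermark lets a later
    # run skip already-finished segments without looking at them.
    segment_ids = sorted(repository.segment_ids(), reverse=True)
    for segment_id in segment_ids:
        if segment_id <= last_done:
            break  # everything at or below the watermark was handled earlier
        recompress_segment(segment_id)
    # The next run can start skipping at the newest segment we saw this time.
    return segment_ids[0] if segment_ids else last_done
```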
Also some clean up
Force-pushed from d37580b to 0cdb9e9
repository.commit()
if exit_soon:
    break
except Exception as e:  # too broad!
I added this part when evaluating "Ctrl-C safety" the first time (and it caught the KeyboardInterrupt reliably when inside[1]), but I think now that signals are handled explicitly it's unnecessary.
[1] Outside this part it didn't matter much; it only left behind stale locks and the like.
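For context, a minimal sketch of what "signals are handled explicitly" could look like (assumed shape, not this PR's actual code): SIGINT only sets a flag, and the loop commits and stops at the next safe point, so the broad except is no longer needed.

```python
import signal

exit_soon = False  # checked by the main loop at safe points

def _request_exit(signum, frame):
    # Only set a flag here; the loop commits and breaks when it is safe,
    # so no stale locks or uncommitted segments are left behind.
    global exit_soon
    exit_soon = True

signal.signal(signal.SIGINT, _request_exit)
```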
Force-pushed from bf92bbe to ffab0a1
now this is becoming useful...!
Force-pushed from ffab0a1 to e63241c
@@ -75,7 +75,11 @@ def id_hash(self, data):
     def encrypt(self, data):
         pass

-    def decrypt(self, id, data):
+    def decrypt(self, id, data, no_decompress=False):
how about compress=True? ;)
that's certainly not less_legible :-)
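A sketch of a positively-named variant along the lines being suggested (signature and helper names are illustrative assumptions, written as a standalone function rather than the merged API):

```python
def decrypt(key, id, data, decompress=True):
    # Positive flag instead of the double negative no_decompress=True.
    plaintext = key.decrypt_raw(id, data)  # hypothetical low-level helper
    return key.compressor.decompress(plaintext) if decompress else plaintext
```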
… others, resolved "segment_id <= sp" vs "id < sp"
Force-pushed from 3f7e1da to 9355069
So, now we've got a "major" problem: csize in the chunk ID lists (if I remember correctly this was also an annoyance in multi-threading). We could do that. But now our problems just increased manifold: items -- that is, the item dicts with all the metadata -- are chunked as well, and the list of chunk IDs that represents the chunked items is the 'items' key of the archive metadata. And this metadata block is just another chunk, with its chunk ID -- the archive's ID -- depending, as usual, on the chunk contents. Changing the csize changes the chunk IDs of the items, and that changes all archive IDs. However, archive IDs must be final and never change (by our definition). So independent of how we approach recompression, either we
Dropping csize means that it's harder to find the compressed size of items on disk: while we don't need to actually read the data, we'd need to take a peek at the segment holding the chunk, since the segment entry contains the length of the stored data. My opinion, however: recompression is too useful compared to the csize field to not have it. Workarounds to consider...
Maybe 1.) for now (ideally we'd not have any trade-off, but this one seems reasonable).
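To make that cascade concrete, a toy illustration (not borg's real data structures; SHA-256 stands in for the real id_hash here):

```python
import hashlib

def chunk_id(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()  # stand-in for the real id_hash

# A chunk reference carries (id, size, csize); items containing such
# references are themselves chunked, and the archive ID hashes the metadata
# that lists those item chunks.
item = {'path': 'file', 'chunks': [(b'id0', 1024, 700)]}
item_chunk = repr(item).encode()
archive_meta = {'items': [chunk_id(item_chunk)]}
archive_id = chunk_id(repr(archive_meta).encode())

# Recompressing only changes csize, yet the item chunk's ID -- and therefore
# the archive ID -- changes with it, which must never happen.
item['chunks'] = [(b'id0', 1024, 500)]
assert chunk_id(repr(item).encode()) != chunk_id(item_chunk)
```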
Remote repos not working and the csize issues make this look rather questionable.
I agree! Remote repos can be supported, but the csize turns up again as a PITA on principle. So I'm thinking we should solve this problem at the root-cause level (and the root cause is that we put csize, which is unrelated to the identity of chunks [1], into the chunk lists, which are final/sealed). What are the clients (in an API sense) of csize?
info requires an up-to-date Cache; list does as well (it could be lazy about it, though, which might be worth another PR). So here's my line of thought: new archives get csize=0 (or -1 or some other dummy value) in all chunk lists. Repository gets a
If we decide to solve csize this way, it shouldn't be done in this PR imho, though.
[1] Chunk identity is our chunk ID:
csize also "punches through" our abstractions (borg.key).
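A hypothetical sketch of the direction proposed above (dummy csize in new chunk lists, stored size queried from the repository on demand; the method name is invented for illustration):

```python
CSIZE_UNKNOWN = 0  # or -1; some agreed-upon dummy value in new chunk lists

def stored_size(repository, chunk_id):
    # The segment entry already records how many bytes are stored on disk,
    # so this lookup needs neither decryption nor decompression.
    return repository.get_stored_length(chunk_id)  # hypothetical repo API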
Yes, csize is a troublemaker. But I guess we need to delay any change of the repo API until later; I'd like to make a few releases with features and fixes that we can do without breaking repo / repo API compatibility.
That's reasonable. Should we postpone / close this one then, or put a disclaimer on it for now basically saying
Because I see recompress as something really useful until there is (hopefully) someday a faster, multi-threaded borg create, to achieve... (citing commit message)
How about a more high-level and more generic approach for now, even if it is more expensive? As this is likely a rare / one-time / never operation for most users, efficiency is not as important as for the regularly invoked stuff. E.g. like:
transform_chunks would recompress the chunks as needed. As the chunk IDs do not change, space needs could be similar to your approach, but it would also work for remote repos and would not require any API or storage change (AFAICS). This "recreate archive" approach could even do other stuff: like dropping files according to a new exclude list, or applying dynamic compression (depending on file type) - so it is more like create with data coming from an archive rather than from local disk. About dropping csize: this would be something useful for the multithreading branch and for a major release (2.0, 3.0?). We should keep the thoughts / research results in a ticket about that.
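Roughly, the shape of that approach could look like the following (all names here are hypothetical, just restating the idea in code):

```python
def recreate_archive(source, target, transform_chunks):
    # Read items from the source archive, recompress (or otherwise transform)
    # their chunks, and write an equivalent new archive. Chunk IDs stay the
    # same, so deduplication is preserved and no repo/API change is needed.
    for item in source.iter_items():
        item.chunks = transform_chunks(item.chunks)
        target.add_item(item)
    target.save()
```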
That sounds like an idea; it would also fulfill feature requests for selectively deleting from old archives (via --exclude, and, I don't know if someone asked for that, but someone probably did).

If we allow processing of single archives, the csize will be wrong again for all other archives, though (in the general case; see below). And we can't really tell if that's the case: refcount > 1 of a chunk can also mean it's a reference from the same archive. Ensuring that it is would be prohibitive in terms of either time or additional memory use.

If this were used in a "progressive recompression" scheme, i.e. after every or every group of

So that's better than approaches No. 1 and 2 in these regards. Space usage when used to recompress whole repositories should be ok, working chronologically. Keeping a list / dict of seen chunks around will cost (a lot of) memory (but could be disabled, similar to --space-save), but it avoids lots of I/O + decryption + Compressor.detect, which would dominate otherwise.

I'll look into it.
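A minimal sketch of that seen-chunks bookkeeping (interfaces are assumed, not the actual PR code):

```python
def recompress_once(chunk_ids, fetch_and_recompress, seen=None):
    # Remember which chunk IDs were already handled so chunks referenced from
    # several items/archives are fetched, decrypted and detection-tested
    # only once.
    seen = set() if seen is None else seen
    for chunk_id in chunk_ids:
        if chunk_id in seen:
            continue
        fetch_and_recompress(chunk_id)
        seen.add(chunk_id)
    return seen
```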
Use with caution: permanent data loss by specifying incorrect patterns is easily possible. Make a dry run to make sure you got everything right.

borg rewrite has many uses:
- Can selectively remove files/dirs from old archives, e.g. to free space or purging picturarum biggus dickus from history
- Recompress data
- Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate with Borg 1.x archives (or to experiment with chunker-params for specific use cases)

It is interrupt- and resumable.

Chunks are not freed on-the-fly. Rationale: only makes sense when rechunkifying, but the logic on which new chunks free which input chunks is complicated and *very* delicate.

Current TODOs:
- Detect and skip (unless --force) already recompressed chunks -- delayed until the current PRs on the borg.key APIs are decided: borgbackup#810 borgbackup#789
- Usage example

Future TODOs:
- Refactor tests using py.test fixtures -- would require porting ArchiverTestCase to py.test: many changes, this changeset is already borderline too large.
- Possibly add a --target option to not replace the source archive -- with the target possibly in another repo (better than "cp" due to full integrity checking, and deduplication at the target)

Fixes borgbackup#787 borgbackup#686 borgbackup#630 borgbackup#70 (and probably some I overlooked)

Also see borgbackup#757 and borgbackup#770
Use with caution: permanent data loss by specifying incorrect patterns is easily possible. Make a dry run to make sure you got everything right.

borg recreate has many uses:
- Can selectively remove files/dirs from old archives, e.g. to free space or purging picturarum biggus dickus from history
- Recompress data
- Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate with Borg 1.x archives (or to experiment with chunker-params for specific use cases)

It is interrupt- and resumable.

Chunks are not freed on-the-fly. Rationale: only makes sense when rechunkifying, but the logic on which new chunks free which input chunks is complicated and *very* delicate.

Future TODOs:
- Refactor tests using py.test fixtures -- would require porting ArchiverTestCase to py.test: many changes, this changeset is already borderline too large.
- Possibly add a --target option to not replace the source archive -- with the target possibly in another repo (better than "cp" due to full integrity checking, and deduplication at the target)
- Detect and skip (unless --always-recompress) already recompressed chunks

Fixes borgbackup#787 borgbackup#686 borgbackup#630 borgbackup#70 (and probably some I overlooked)

Also see borgbackup#757 and borgbackup#770
This is the announced successor to #756.

I've now crunched a few hundred GB through a few versions of this (including the multi-threaded prototype) and so far no data loss (diff of SHA-512 `borg list`s) and no complaints from `borg check`.

Improvements over older prototypes now include "really clean exit" (i.e. no stale locks, everything nice and committed, it even prints stats if asked to), proper progress indication and reverse processing (see 5c2ea3a for rationale). And now we really don't write anything on --dry-run.
Additional space usage bound: depends on what the data is stored with and what you're compressing with. You can use recompress to decompress stuff (`-C none`). Generally speaking, it's about 10 * max_segment_size [1].

[1] I'm going with 10 MSS currently, because I'm assuming that commits are costly (and I think they are, since they do all the clean-up of over-written -- in the key-value store sense -- segments).
Also, additional space use will obviously be much larger if you aren't compressing.
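As a sketch of why the bound comes out at roughly 10 * max_segment_size (the loop shape and names are assumptions, not the PR's exact code):

```python
def recompress_with_bounded_space(repository, recompressed_chunks,
                                  max_segment_size, commit_every=10):
    # Commits are costly (they clean up over-written segments), so only
    # commit after about `commit_every` segments' worth of new data; the
    # extra space used in between stays around commit_every * max_segment_size.
    written = 0
    for chunk_id, data in recompressed_chunks:
        repository.put(chunk_id, data)
        written += len(data)
        if written >= commit_every * max_segment_size:
            repository.commit()
            written = 0
    repository.commit()  # final commit so nothing uncommitted remains
```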
TODO
- `id_ not in repository` => `continue` check, to only recompress objects in the index.

Feature list