Flexible compression - pattern matching on path/filename #810

Merged

Conversation

ThomasWaldmann (Member) commented Mar 28, 2016

Current state:

  • --compression-from reads a file with a list of compression-spec:matching-pattern entries, which is used to set up CompressionDecider1 (see the sketch after the TODO list)
  • docs/misc/compression.conf is an example file (currently used to disable compression)
  • if no pattern matches, the file is compressed as given by --compression
  • CompressionDecider2 operates on chunks - it does not do much yet (out of scope for this PR): it either takes the desired compression from the chunk's meta or returns the default compression (set by --compression) if there is no meta entry
  • a debug log line shows each filename and the compression chosen for it
  • clean_lines() can be reused later to deduplicate similar code elsewhere

TODO:

  • support --compression-from for recreate, see TODO in the code
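
Roughly, the decider part could look like this - a minimal sketch, assuming shell-style fnmatch patterns instead of Borg's own pattern classes and treating the compression spec as an opaque string (the class and helper names come from this PR, but their bodies here are simplified guesses):

import fnmatch


def clean_lines(lines):
    """Strip whitespace, skip blank lines and #-comments (simplified stand-in)."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith('#'):
            yield line


class CompressionDecider1:
    """Decide a compression spec per file, based on path/filename patterns."""

    def __init__(self, default_spec, config_lines):
        # each config line looks like "<compression-spec>:<matching-pattern>"
        self.default_spec = default_spec
        self.rules = []
        for line in clean_lines(config_lines):
            spec, _, pattern = line.partition(':')
            self.rules.append((pattern.strip(), spec.strip()))

    def decide(self, path):
        for pattern, spec in self.rules:
            if fnmatch.fnmatch(path, pattern):
                return spec
        return self.default_spec   # no match -> fall back to --compression


# hypothetical usage
decider = CompressionDecider1('lz4', ['none:*.mp4', 'none:*.zip', 'lzma,6:*.txt'])
print(decider.decide('/home/user/video.mp4'))   # -> 'none'
print(decider.decide('/home/user/notes.txt'))   # -> 'lzma,6'
print(decider.decide('/home/user/data.bin'))    # -> 'lz4' (fallback)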

@@ -178,7 +178,7 @@ class Compressor:
    compresses using a compressor with given name and parameters
    decompresses everything we can handle (autodetect)
    """
    def __init__(self, name='null', **kwargs):
ThomasWaldmann (Member, Author) commented on this line:

forgotten in a rename 'null' -> 'none' (CNONE.name == 'none').

enkore (Contributor) commented Mar 29, 2016

A different implementation idea would be to always run LZ4 over every chunk (it's very fast, after all) and see how well it compresses it, just as an indicator of compressibility. If LZ4 doesn't compress a chunk at all, or only by 2 % or so, then it's unlikely that zlib or LZMA could do anything about it either.

This would be similar to what compressing file systems do, afaik.

The advantage would be that it can be implemented within the Compressor API we already have. E.g. --compression auto,zlib,9 => AutoCompressor uses LZ4 to decide whether no compression or zlib,9 compression should be used (on a per-chunk basis). This also doesn't need file extensions to make the decision (some formats / file extensions are always compressed, others may or may not be, and such a list would need to be maintained).

(Also, no extra options to enable/disable it, and it doesn't impact all the code that handles chunks, etc.)
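
A minimal sketch of that auto idea, with the python-lz4 package and stdlib zlib standing in for Borg's own compressor classes (an assumption, as are the 0.97 threshold and the returned spec strings):

import zlib
import lz4.block


def auto_compress(data, threshold=0.97):
    """Return (compression_spec, payload) for one chunk."""
    probe = lz4.block.compress(data)          # cheap compressibility probe
    if len(probe) >= len(data) * threshold:
        # LZ4 barely helped, so zlib/LZMA are unlikely to do much better:
        # store the chunk uncompressed
        return 'none', data
    # data looks compressible, spend the cycles on the "real" compressor
    return 'zlib,9', zlib.compress(data, 9)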

Re. metadata for sparse files: I think these need to go into the item, not chunk metadata (which isn't saved)... unless we want to make the chunker itself sparse-file-aware, which I think would be too complex for too little gain.

Ape (Contributor) commented Mar 29, 2016

Using the file extension helps a bit, but it's really not that flexible, since we cannot list all the possible extensions, and sometimes files do not have a detectable extension at all.

Also, there might be files with a listed extension that are nevertheless readily compressible. This may be a weird, rare case, but who knows what kind of file naming schemes people are using out there.

enkore added a commit to enkore/borg that referenced this pull request Mar 29, 2016
Use with caution: permanent data loss by specifying incorrect patterns
is easily possible. Make a dry run to make sure you got everything right.

borg rewrite has many uses:
- Can selectively remove files/dirs from old archives, e.g. to free
  space or purging picturarum biggus dickus from history
- Recompress data
- Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate
  with Borg 1.x archives. (Or to experiment with chunker-params for
  specific use cases.)

It is interrupt- and resumable.

Chunks are not freed on-the-fly.
Rationale:
  Freeing on-the-fly only makes sense when rechunkifying, but the logic for
  deciding which input chunks can be freed for which new chunks is
  complicated and *very* delicate.


Current TODOs:
- Detect and skip (unless --force) already recompressed chunks
  -- delayed until current PRs on borg.key APIs are decided
     borgbackup#810 borgbackup#789
- Usage example

Future TODOs:
- Refactor tests using py.test fixtures
  -- would require porting ArchiverTestCase to py.test: many changes,
     this changeset is already borderline too large.
- Possibly add a --target option to not replace the source archive
  -- with the target possibly in another Repo
     (better than "cp" due to full integrity checking, and deduplication
      at the target)

Fixes borgbackup#787 borgbackup#686 borgbackup#630 borgbackup#70 (and probably some I overlooked)
Also see borgbackup#757 and borgbackup#770
enkore mentioned this pull request on Mar 29, 2016
enkore added four more commits to enkore/borg that referenced this pull request on Mar 29, 2016, all with the same commit message as above.
ThomasWaldmann (Member, Author) commented:

@Ape I think the biggest issue is files without an extension, like some apps generate; the other cases seem rather uncommon.

@enkore good idea with the magic detection (although I am not sure how good LZ4 is as a predictor, especially since LZMA considers much bigger spans of input data), but it would still use quite some CPU cycles, as it would have to invoke LZ4 (per file? or even per chunk?).

If you have an idea, compatible with 1.0 repos, for how to implement precise reproduction of sparse files other than the "pass-meta" change plus using a special type of "compression" for the holes, I'd like to see it laid out in a separate ticket (or PR).

For sparse files, chunk.meta['sparse'] would be saved into the repo using the special compression type:

  • as usual for normal file data (not a hole)
  • a special compression type for a file hole (which just saves the size of the hole and the fact that it was a hole) - this needs metadata flowing with the data from the reader to the compressor, and also backwards from the decompressor to the writer

enkore (Contributor) commented Mar 30, 2016

LZ4 / speed: I am seeing 2 GB/s compression speed in fast mode on a single core (lz4 -1b on a screenshot; PNG, incompressible, 1.5 MB). I think "fast mode" is what we use. [Edit: for a very well compressible file (compress.c from Cython, 357 kB, 21 % ratio) I see ~800 MB/s -- using -9 it manages a 14 % ratio at a 20× slowdown.]

API-wise interesting: If we give LZ4_compress_limitedOutput a smaller buffer than the input data it will abort early. E.g. we set maxOutputSize to, say, 0.8 * input size and it should abort early if it can't compress better than 80 %.

LZMA: From the tests I've made so far with the various compressors, they all end up in a similar ballpark. Depending on the test data, LZ4 gives me a 25-50 % ratio, zlib somewhere around 25-35 %, and LZMA basically the same as zlib, maybe 20-35 %.

More importantly: When zlib/LZMA give better results, LZ4 does as well, if zlib/LZMA is worse, LZ4 is worse as well. They seem to scale well relative to each other.

This might be related to the chunk size: I think a compressor like LZMA doesn't really work well with these small chunks (<2 MB) of data. So far there were few cases where LZMA,6 and LZMA,9 actually made a tangible difference. For me it's ~10 times slower than zlib,9 and often makes little difference, sometimes even producing slightly larger output.


Btw., an interesting compression algorithm for chunk compression might be zstd, which has a dedicated "training mode" to be more efficient with small chunks. I guess it is a bit like "solid compression", just without actually having to squish the data together.


Gist of how I think of "perfect hole repro": use SEEK_DATA/SEEK_HOLE to map the data/hole structure to [(False, length), (True, length), ...] (or even sign-code it) and put that in item[b'sparse']. When writing, read item[b'sparse'] and skip over the (False, length) runs, creating holes. Not the most efficient way (holes will compress very well), but backwards compatible with 1.0 (when extracting). A sketch of the mapping step follows below.
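
A minimal sketch of the mapping step, assuming Linux and Python 3.3+ (where os.SEEK_DATA / os.SEEK_HOLE exist); the item[b'sparse'] key is the proposal above, not an existing Borg format:

import errno
import os


def sparse_map(fd):
    """Return [(is_data, length), ...] runs for an open file descriptor."""
    size = os.fstat(fd).st_size
    runs = []
    pos = 0
    while pos < size:
        try:
            data_start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:          # only a hole until EOF
                runs.append((False, size - pos))
                break
            raise
        if data_start > pos:                    # hole before the next data run
            runs.append((False, data_start - pos))
        hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
        runs.append((True, hole_start - data_start))
        pos = hole_start
    return runs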

enkore added a commit to enkore/borg that referenced this pull request on Apr 1, 2016, again with the same commit message as above.
verygreen (Contributor) commented:

I think another good idea is to check the compressor output, and if the result is bigger than (or equal to) the original data, discard the compressed output and store the data uncompressed. Mark it not just as plain/uncompressed, but invent a new compressor ID, "uncompressed-uncompressible" - this is needed for later, when people start recompressing their repositories with more powerful algorithms, to save CPU time on data that just plain does not compress.
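
A minimal sketch of that check, with zlib standing in for whatever compressor is configured; the one-byte markers are made up for illustration, not Borg's real compressor IDs:

import zlib

COMPRESSED = b'Z'       # hypothetical marker: zlib-compressed payload follows
INCOMPRESSIBLE = b'U'   # hypothetical marker: stored plain, compression did not help


def compress_or_store(data, level=6):
    compressed = zlib.compress(data, level)
    if len(compressed) >= len(data):
        # output is not smaller -> keep the original bytes and remember
        # that this chunk is known to be incompressible
        return INCOMPRESSIBLE + data
    return COMPRESSED + compressed


def worth_recompressing(blob):
    # a later "recompress with a stronger algorithm" pass can cheaply
    # skip chunks already known to be incompressible
    return not blob.startswith(INCOMPRESSIBLE)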

ThomasWaldmann changed the title from "Flexible compression" to "Flexible compression (WIP)" on Apr 18, 2016
ZoomRmc commented Apr 25, 2016

Regarding files that do not need compression: Duplicati 2 has a text file, read from its application directory, with quite a comprehensive list of file extensions (not perfect, though: it lacks .opus, for example, and .tiff files can be uncompressed): default_compressed_extensions.txt

ZoomRmc commented Apr 25, 2016

Maybe some entropy measurement can be used to predict how well a file/chunk will compress? I naively ran a Rust implementation of Shannon entropy on a test set and got the following results:

# shannon -f test.log

104857600 bytes
shannon entropy:  5.191260284047
Done in: 89.617991 msec
--------------------
# shannon -f testset.wav

91110528 bytes
shannon entropy:  7.614343072426
Done in: 73.439318 msec
--------------------
# shannon -f testset.flac

53651888 bytes
shannon entropy:  7.990729784876
Done in: 42.863125 msec
--------------------
# shannon -f testset.xz

76732004 bytes
shannon entropy:  7.999779254371
Done in: 61.242032 msec

# shannon -f e:\test.jpg

45127926 bytes
shannon entropy:  7.941891449844
Done in: 36.107949  msec
--------------------
# shannon -f nevergonnagiveyouup.mp4

60735907 bytes
shannon entropy:  7.997137165205
Done in: 47.598294  msec
--------------------
# shannon -f nixos-graphical-16.03.581.e409886-x86_64-linux.iso

1052770304 bytes
shannon entropy:  7.993876963475
Done in: 7.579552695 sec
--------------------

Entropy is on a 0 to 8 scale (bits per byte); the times include reading the files from HDD. This is fast, as far as I can tell, but I don't know whether it's a viable metric. The same .iso from the set compressed to .xz with a 0.938 ratio.
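
For reference, a naive per-byte Shannon entropy (the 0-8 bits/byte scale used above) is only a few lines of Python; whether a threshold on it is a reliable compressibility predictor for Borg-sized chunks is exactly the open question here:

import collections
import math


def shannon_entropy(data):
    """Bits per byte: 0.0 for constant data, up to 8.0 for random-looking data."""
    if not data:
        return 0.0
    n = len(data)
    counts = collections.Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())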

enkore (Contributor) commented Apr 25, 2016

For a test like that you really want to mlock the test files to make sure no cache-trashing program (e.g. Firefox, Thunderbird, IntelliJ) kicks them out of the page cache. 140 MB/s is probably the disk I/O limit; if it's already near the CPU limit there, then it's way too inefficient.

For reference, on my half-decade old AMD processor LZ4 processes about 2 GB/s of incompressible data.

ThomasWaldmann force-pushed the flexible-compression branch 2 times, most recently from 3264ab3 to 61edea7, on April 25, 2016 at 22:54
codecov-io commented Apr 25, 2016

Current coverage is 83.93%

Merging #810 into master will increase coverage by <0.01%

@@             master       #810   diff @@
==========================================
  Files            15         15          
  Lines          4881       4930    +49   
  Methods           0          0          
  Messages          0          0          
  Branches        878        888    +10   
==========================================
+ Hits           4093       4138    +45   
- Misses          559        562     +3   
- Partials        229        230     +1   

Last updated by 0ffbd99

ThomasWaldmann (Member, Author) commented Apr 26, 2016

Idea:

Implement this as --compression-config=FILENAME (or --compression-from-file=...?) as an alternative to the --compression option.

The given file would contain a list of <compression-args>:<path/filename-pattern> entries (like the currently hardcoded one).

The list would either end with a catch-all pattern like lz4:*, or 'none' would be used as the internal fallback (just like when --compression is not given, 'none' is the default).

ThomasWaldmann (Member, Author) commented Apr 26, 2016

In a future PR, the compression modes (currently: none, lz4, zlib,N, lzma,N) could be amended by a mode auto,X, which would mean that a predictor (LZ4?) is used to automatically decide whether to compress a chunk or not. If it decides yes, compression X would be applied.

(or even auto-X for easier parsing)

This future change could easily be integrated with this one, as auto,X would just be another choice among the compression modes.

Ape (Contributor) commented Apr 26, 2016

Is there any benefit to using a manually specified file path pattern whitelist / blacklist if auto,X works?

ThomasWaldmann (Member, Author) commented Apr 26, 2016

Yes: not wasting CPU cycles when you can already tell from the path / filename.

Also, you could use different compression methods depending on the path / filename, not just "X or none":

lz4:*.vmdk
lzma:*.txt
none:*.jpg

You can also combine both methods: for file extensions that may or may not already be compressed, you could add an entry to the file like:

auto,zlib:*.tiff

enkore (Contributor) commented Apr 26, 2016

There are two independent parts here. See #810 (comment) and the follow-up.

ThomasWaldmann changed the title from "Flexible compression (WIP)" to "Flexible compression - by pattern matching on path/filename (WIP)" on Apr 26, 2016
ThomasWaldmann changed the title from "Flexible compression - by pattern matching on path/filename (WIP)" to "Flexible compression - pattern matching on path/filename" on Apr 26, 2016
ZoomRmc commented Apr 27, 2016

@ThomasWaldmann, that's a great idea. We could possibly even have unique settings for different kinds of data. It would be great if that compression config file were sorted by compression method and easily editable. Something like this, maybe (sorry for the Lua-inspired syntax):

lz4 = {*.vmdk, *.tar, /home/user/vm/*...}
lzma = {*.txt, *.log, /var/log/*, ...}
auto = {
    lzma={*.tif, *.raw, *.dng},
    zlib={*.iso, *.dd},
    ....
  }
none = { *.jpg, *.mp3, *.zip ...}

Do I understand correctly that compression is performed per chunk, and that each chunk has metadata with a mark of the file type it came from?

Also, there's a paper from IBM researchers specifically on compressibility prediction: PDF, Slides. They got pretty good and reliable results. There's a part about real-time analysis of chunks of data; we could use some of those ideas.

Relevant SO question

ThomasWaldmann (Member, Author) commented:

@ZoomRmc yes, different compression for different file types is the goal. But I'd rather reuse the existing pattern definition formats we already use elsewhere.

Compression is performed per chunk (and at that time, the filename it came from is known). We do not store the file name or type with the chunk, but we do store per-file metadata that has a list of content chunk IDs.

Thanks for the links!

ThomasWaldmann added this to the "1.1 - near future goals" milestone on Apr 27, 2016
ThomasWaldmann merged commit 2bb9bc4 into borgbackup:master on Apr 27, 2016
ThomasWaldmann deleted the flexible-compression branch on April 27, 2016 at 21:41
dragetd (Contributor) commented May 1, 2016

Excuse the off-topic comment, but I'd like to share my humble opinion: I'd prefer a heuristic (see the BTRFS suggestion on the other chunking issue, or a simple idea) or some built-in defaults over more options that blow up the documentation and usage complexity.

ThomasWaldmann (Member, Author) commented:

@dragetd see #1006.
