Flexible compression - pattern matching on path/filename #810

Merged

Conversation

ThomasWaldmann (Member) commented Mar 28, 2016

Current state:

  • --compression-from reads a file with a list of compression-spec:matching-pattern entries, which is used to set up CompressionDecider1 (see the sketch after the TODO list)
  • docs/misc/compression.conf is an example file (currently used to disable compression)
  • if no pattern matches, the file is compressed as given by --compression
  • CompressionDecider2 operates on chunks - it does not do much yet (out of scope for this PR): it either takes the desired compression from the chunk's meta or returns the default compression (set by --compression) if there is no meta entry
  • a debug log line shows each filename and the compression chosen for it
  • clean_lines() can be reused later to deduplicate similar code elsewhere

TODO:

  • support --compression-from for recreate, see TODO in the code
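
Roughly, the decider part could look like this - a minimal sketch, assuming shell-style fnmatch patterns instead of Borg's own pattern classes and treating the compression spec as an opaque string (the class and helper names come from this PR, but their bodies here are simplified guesses):

import fnmatch


def clean_lines(lines):
    """Strip whitespace, skip blank lines and #-comments (simplified stand-in)."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith('#'):
            yield line


class CompressionDecider1:
    """Decide a compression spec per file, based on path/filename patterns."""

    def __init__(self, default_spec, config_lines):
        # each config line looks like "<compression-spec>:<matching-pattern>"
        self.default_spec = default_spec
        self.rules = []
        for line in clean_lines(config_lines):
            spec, _, pattern = line.partition(':')
            self.rules.append((pattern.strip(), spec.strip()))

    def decide(self, path):
        for pattern, spec in self.rules:
            if fnmatch.fnmatch(path, pattern):
                return spec
        return self.default_spec   # no match -> fall back to --compression


# hypothetical usage
decider = CompressionDecider1('lz4', ['none:*.mp4', 'none:*.zip', 'lzma,6:*.txt'])
print(decider.decide('/home/user/video.mp4'))   # -> 'none'
print(decider.decide('/home/user/notes.txt'))   # -> 'lzma,6'
print(decider.decide('/home/user/data.bin'))    # -> 'lz4' (fallback)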

@@ -178,7 +178,7 @@ class Compressor:
    compresses using a compressor with given name and parameters
    decompresses everything we can handle (autodetect)
    """
    def __init__(self, name='null', **kwargs):
ThomasWaldmann (Member, Author) commented on this line:

forgotten in a rename 'null' -> 'none' (CNONE.name == 'none').

enkore (Contributor) commented Mar 29, 2016

A different implementation idea would be to always run LZ4 over every chunk (it's very fast, after all) and see how well it compresses it, just as an indicator of compressibility. If LZ4 doesn't compress a chunk at all, or only by 2 % or so, then it's unlikely that zlib or LZMA could do anything about it either.

This would be similar to what compressing file systems do, afaik.

The advantage would be that it can be implemented within the Compressor API we already have. E.g. --compression auto,zlib,9 => AutoCompressor uses LZ4 to decide whether no compression or zlib,9 compression should be used (on a per-chunk basis). This also doesn't need file extensions to make the decision (some formats / file extensions are always compressed, others may or may not be, and such a list would need to be maintained).

(Also, no extra options to enable/disable it, and it doesn't impact all the code that handles chunks, etc.)
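
A minimal sketch of that auto idea, with the python-lz4 package and stdlib zlib standing in for Borg's own compressor classes (an assumption, as are the 0.97 threshold and the returned spec strings):

import zlib
import lz4.block


def auto_compress(data, threshold=0.97):
    """Return (compression_spec, payload) for one chunk."""
    probe = lz4.block.compress(data)          # cheap compressibility probe
    if len(probe) >= len(data) * threshold:
        # LZ4 barely helped, so zlib/LZMA are unlikely to do much better:
        # store the chunk uncompressed
        return 'none', data
    # data looks compressible, spend the cycles on the "real" compressor
    return 'zlib,9', zlib.compress(data, 9)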

Re. metadata for sparse files: I think these need to go into the item, not chunk metadata (which isn't saved)... unless we want to make the chunker itself sparse-file-aware, which I think would be too complex for too little gain.

Ape (Contributor) commented Mar 29, 2016

Using the file extension helps a bit, but it's really not that flexible, since we cannot list all the possible extensions, and sometimes files do not have a detectable extension at all.

Also, there might be files with a listed extension that are nevertheless readily compressible. This may be a weird, rare case, but who knows what kind of file naming schemes people are using out there.

enkore added a commit to enkore/borg that referenced this pull request Mar 29, 2016
Use with caution: permanent data loss by specifying incorrect patterns
is easily possible. Make a dry run to make sure you got everything right.

borg rewrite has many uses:
- Can selectively remove files/dirs from old archives, e.g. to free
  space or purging picturarum biggus dickus from history
- Recompress data
- Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate
  with Borg 1.x archives. (Or to experiment with chunker-params for
  specific use cases.)

It is interrupt- and resumable.

Chunks are not freed on-the-fly.
Rationale:
  Freeing on-the-fly only makes sense when rechunkifying, but the logic for
  deciding which input chunks can be freed for which new chunks is
  complicated and *very* delicate.


Current TODOs:
- Detect and skip (unless --force) already recompressed chunks
  -- delayed until current PRs on borg.key APIs are decided
     borgbackup#810 borgbackup#789
- Usage example

Future TODOs:
- Refactor tests using py.test fixtures
  -- would require porting ArchiverTestCase to py.test: many changes,
     this changeset is already borderline too large.
- Possibly add a --target option to not replace the source archive
  -- with the target possibly in another Repo
     (better than "cp" due to full integrity checking, and deduplication
      at the target)

Fixes borgbackup#787 borgbackup#686 borgbackup#630 borgbackup#70 (and probably some I overlooked)
Also see borgbackup#757 and borgbackup#770
enkore mentioned this pull request on Mar 29, 2016
enkore added four more commits to enkore/borg that referenced this pull request on Mar 29, 2016, all with the same commit message as above.
ThomasWaldmann (Member, Author) commented:

@Ape I think the biggest issue is files without an extension, like some apps generate; the other cases seem rather uncommon.

@enkore good idea with the magic detection (although I am not sure how good LZ4 is as a predictor, especially since LZMA considers much bigger spans of input data), but it would still use quite some CPU cycles, as it would have to invoke LZ4 (per file? or even per chunk?).

If you have an idea, compatible with 1.0 repos, for how to implement precise reproduction of sparse files other than the "pass-meta" change plus using a special type of "compression" for the holes, I'd like to see it laid out in a separate ticket (or PR).

For sparse files, chunk.meta['sparse'] would be saved into the repo using the special compression type:

  • as usual for normal file data (not a hole)
  • a special compression type for a file hole (which just saves the size of the hole and the fact that it was a hole) - this needs metadata flowing with the data from the reader to the compressor, and also backwards from the decompressor to the writer

enkore (Contributor) commented Mar 30, 2016

LZ4 / speed: I am seeing 2 GB/s compression speed in fast mode on a single core (lz4 -1b on a screenshot; PNG, incompressible, 1.5 MB). I think "fast mode" is what we use. [Edit: for a very well compressible file (compress.c from Cython, 357 kB, 21 % ratio) I see ~800 MB/s -- using -9 it manages a 14 % ratio at a 20× slowdown.]

API-wise interesting: If we give LZ4_compress_limitedOutput a smaller buffer than the input data it will abort early. E.g. we set maxOutputSize to, say, 0.8 * input size and it should abort early if it can't compress better than 80 %.

LZMA: From the tests I've made so far with the various compressors, they all end up in a similar ballpark. Depending on the test data, LZ4 gives me a 25-50 % ratio, zlib somewhere around 25-35 %, and LZMA basically the same as zlib, maybe 20-35 %.

More importantly: When zlib/LZMA give better results, LZ4 does as well, if zlib/LZMA is worse, LZ4 is worse as well. They seem to scale well relative to each other.

This might be related to the chunk size: I think a compressor like LZMA doesn't really work well with these small chunks (<2 MB) of data. So far there were few cases where LZMA,6 and LZMA,9 actually made a tangible difference. For me it's ~10 times slower than zlib,9 and often makes little difference, sometimes even producing slightly larger output.


Btw., an interesting compression algorithm for chunk compression might be zstd, which has a dedicated "training mode" to be more efficient with small chunks. I guess it is a bit like "solid compression", just without actually having to squish the data together.


Gist of how I think of "perfect hole repro": use SEEK_DATA/SEEK_HOLE to map the data/hole structure to [(False, length), (True, length), ...] (or even sign-code it) and put that in item[b'sparse']. When writing, read item[b'sparse'] and skip over the (False, length) runs, creating holes. Not the most efficient way (holes will compress very well), but backwards compatible with 1.0 (when extracting). A sketch of the mapping step follows below.
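
A minimal sketch of the mapping step, assuming Linux and Python 3.3+ (where os.SEEK_DATA / os.SEEK_HOLE exist); the item[b'sparse'] key is the proposal above, not an existing Borg format:

import errno
import os


def sparse_map(fd):
    """Return [(is_data, length), ...] runs for an open file descriptor."""
    size = os.fstat(fd).st_size
    runs = []
    pos = 0
    while pos < size:
        try:
            data_start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:          # only a hole until EOF
                runs.append((False, size - pos))
                break
            raise
        if data_start > pos:                    # hole before the next data run
            runs.append((False, data_start - pos))
        hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
        runs.append((True, hole_start - data_start))
        pos = hole_start
    return runs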

enkore added a commit to enkore/borg that referenced this pull request on Apr 1, 2016, again with the same commit message as above.
verygreen (Contributor) commented:

I think another good idea is to check the compressor output, and if the result is bigger than (or equal to) the original data, discard the compressed output and store the data uncompressed. Mark it not just as plain/uncompressed, but invent a new compressor ID, "uncompressed-uncompressible" - this is needed for later, when people start recompressing their repositories with more powerful algorithms, to save CPU time on data that just plain does not compress.
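
A minimal sketch of that check, with zlib standing in for whatever compressor is configured; the one-byte markers are made up for illustration, not Borg's real compressor IDs:

import zlib

COMPRESSED = b'Z'       # hypothetical marker: zlib-compressed payload follows
INCOMPRESSIBLE = b'U'   # hypothetical marker: stored plain, compression did not help


def compress_or_store(data, level=6):
    compressed = zlib.compress(data, level)
    if len(compressed) >= len(data):
        # output is not smaller -> keep the original bytes and remember
        # that this chunk is known to be incompressible
        return INCOMPRESSIBLE + data
    return COMPRESSED + compressed


def worth_recompressing(blob):
    # a later "recompress with a stronger algorithm" pass can cheaply
    # skip chunks already known to be incompressible
    return not blob.startswith(INCOMPRESSIBLE)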

ThomasWaldmann changed the title from "Flexible compression" to "Flexible compression (WIP)" on Apr 18, 2016
ZoomRmc commented Apr 25, 2016

Regarding files that do not need compression: Duplicati 2 has a text file, read from its application directory, with quite a comprehensive list of file extensions (not perfect, though: it lacks .opus, for example, and .tiff files can be uncompressed): default_compressed_extensions.txt

ZoomRmc commented Apr 25, 2016

Maybe some entropy measurement can be used to predict how well a file/chunk will compress? I naively ran a Rust implementation of Shannon entropy on a test set and got the following results:

# shannon -f test.log

104857600 bytes
shannon entropy:  5.191260284047
Done in: 89.617991 msec
--------------------
# shannon -f testset.wav

91110528 bytes
shannon entropy:  7.614343072426
Done in: 73.439318 msec
--------------------
# shannon -f testset.flac

53651888 bytes
shannon entropy:  7.990729784876
Done in: 42.863125 msec
--------------------
# shannon -f testset.xz

76732004 bytes
shannon entropy:  7.999779254371
Done in: 61.242032 msec

# shannon -f e:\test.jpg

45127926 bytes
shannon entropy:  7.941891449844
Done in: 36.107949  msec
--------------------
# shannon -f nevergonnagiveyouup.mp4

60735907 bytes
shannon entropy:  7.997137165205
Done in: 47.598294  msec
--------------------
# shannon -f nixos-graphical-16.03.581.e409886-x86_64-linux.iso

1052770304 bytes
shannon entropy:  7.993876963475
Done in: 7.579552695 sec
--------------------

Entropy is on a 0 to 8 scale (bits per byte); the times include reading the files from HDD. This is fast, as far as I can tell, but I don't know whether it's a viable metric. The same .iso from the set compressed to .xz with a 0.938 ratio.
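
For reference, a naive per-byte Shannon entropy (the 0-8 bits/byte scale used above) is only a few lines of Python; whether a threshold on it is a reliable compressibility predictor for Borg-sized chunks is exactly the open question here:

import collections
import math


def shannon_entropy(data):
    """Bits per byte: 0.0 for constant data, up to 8.0 for random-looking data."""
    if not data:
        return 0.0
    n = len(data)
    counts = collections.Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())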

enkore (Contributor) commented Apr 25, 2016

For a test like that you really want to mlock the test files to make sure no cache-trashing program (e.g. Firefox, Thunderbird, IntelliJ) kicks them out of the page cache. 140 MB/s is probably the disk I/O limit; if it's already near the CPU limit there, then it's way too inefficient.

For reference, on my half-decade old AMD processor LZ4 processes about 2 GB/s of incompressible data.

ThomasWaldmann force-pushed the flexible-compression branch 2 times, most recently from 3264ab3 to 61edea7, on April 25, 2016 at 22:54
codecov-io commented Apr 25, 2016

Current coverage is 83.93%

Merging #810 into master will increase coverage by <0.01%

@@             master       #810   diff @@
==========================================
  Files            15         15          
  Lines          4881       4930    +49   
  Methods           0          0          
  Messages          0          0          
  Branches        878        888    +10   
==========================================
+ Hits           4093       4138    +45   
- Misses          559        562     +3   
- Partials        229        230     +1   

Last updated by 0ffbd99

ThomasWaldmann (Member, Author) commented Apr 26, 2016

Idea:

Implement this as --compression-config=FILENAME (or --compression-from-file=...?) as an alternative to the --compression option.

The given file would contain a list of <compression-args>:<path/filename-pattern> entries (like the currently hardcoded one).

The list would either end with a catch-all pattern like lz4:*, or 'none' would be used as the internal fallback (just like when --compression is not given, 'none' is the default).

ThomasWaldmann (Member, Author) commented Apr 26, 2016

In a future PR, the compression modes (currently: none, lz4, zlib,N, lzma,N) could be amended by a mode auto,X, which would mean that a predictor (LZ4?) is used to automatically decide whether to compress a chunk or not. If it decides yes, compression X would be applied.

(or even auto-X for easier parsing)

This future change could easily be integrated with this one, as auto,X would just be another choice among the compression modes.

Ape (Contributor) commented Apr 26, 2016

Is there any benefit to using a manually specified file path pattern whitelist / blacklist if auto,X works?

ThomasWaldmann (Member, Author) commented Apr 26, 2016

Yes: not wasting CPU cycles when you can already tell from the path / filename.

Also, you could use different compression methods depending on the path / filename, not just "X or none":

lz4:*.vmdk
lzma:*.txt
none:*.jpg

You can also combine both methods: for file extensions that may or may not already be compressed, you could add an entry to the file like:

auto,zlib:*.tiff

enkore (Contributor) commented Apr 26, 2016

There are two independent parts here. See #810 (comment) and the follow-up.

ThomasWaldmann changed the title from "Flexible compression (WIP)" to "Flexible compression - by pattern matching on path/filename (WIP)" on Apr 26, 2016
ThomasWaldmann changed the title from "Flexible compression - by pattern matching on path/filename (WIP)" to "Flexible compression - pattern matching on path/filename" on Apr 26, 2016
ZoomRmc commented Apr 27, 2016

@ThomasWaldmann, that's a great idea. We could possibly even have unique settings for different kinds of data. It would be great if that compression config file were sorted by compression method and easily editable. Something like this, maybe (sorry for the Lua-inspired syntax):

lz4 = {*.vmdk, *.tar, /home/user/vm/*...}
lzma = {*.txt, *.log, /var/log/*, ...}
auto = {
    lzma={*.tif, *.raw, *.dng},
    zlib={*.iso, *.dd},
    ....
  }
none = { *.jpg, *.mp3, *.zip ...}

Do I understand correctly that compression is performed per chunk, and that each chunk has metadata with a mark of the file type it came from?

Also, there's a paper from IBM researchers specifically on compressibility prediction: PDF, Slides. They got pretty good and reliable results. There's a part about real-time analysis of chunks of data; we could use some of those ideas.

Relevant SO question

ThomasWaldmann (Member, Author) commented:

@ZoomRmc yes, different compression for different file types is the goal. But I'd rather reuse the existing pattern definition formats we already use elsewhere.

Compression is performed per chunk (and at that time, the filename it came from is known). We do not store the file name or type with the chunk, but we do store per-file metadata that has a list of content chunk IDs.

Thanks for the links!

ThomasWaldmann added this to the "1.1 - near future goals" milestone on Apr 27, 2016
ThomasWaldmann merged commit 2bb9bc4 into borgbackup:master on Apr 27, 2016
ThomasWaldmann deleted the flexible-compression branch on April 27, 2016 at 21:41
dragetd (Contributor) commented May 1, 2016

Excuse the off-topic comment, but I'd like to share my humble opinion: I'd prefer a heuristic (see the BTRFS suggestion on the other chunking issue, or a simple idea) or some built-in defaults over more options that blow up the documentation and usage complexity.

ThomasWaldmann (Member, Author) commented:

@dragetd see #1006.
