Flexible compression - pattern matching on path/filename #810
Conversation
@@ -178,7 +178,7 @@ class Compressor:
    compresses using a compressor with given name and parameters
    decompresses everything we can handle (autodetect)
    """
    def __init__(self, name='null', **kwargs):
forgotten in a rename 'null' -> 'none' (CNONE.name == 'none').
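Just to spell out the suggested fix (a sketch, not the actual commit):

```python
class Compressor:
    """
    compresses using a compressor with given name and parameters
    decompresses everything we can handle (autodetect)
    """
    def __init__(self, name='none', **kwargs):  # 'null' -> 'none', matching CNONE.name
        ...
```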
A different implementation idea would be to always run LZ4 over every chunk (it's very fast, after all) and see how well LZ4 compresses it, just as an indicator of compressibility. If LZ4 doesn't compress a chunk at all, or only by 2 % or so, then it's unlikely that zlib or LZMA could do much about it either. This would be similar to what compressing file systems do, afaik. The advantage would be that it can be implemented within the Compressor API we already have (see the sketch below), with no extra options to enable/disable it, and without touching all the code that handles chunks, etc.

Re. metadata for sparse files: I think these need to go into the item, not the chunk metadata (which isn't saved)... unless we want to make the chunker itself sparse-file-aware, which I think would be too complex for too little gain.
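Not part of the PR, but roughly what that heuristic could look like inside a single compress call (using the third-party python-lz4 package; the 2 % threshold and the zlib fallback are just placeholders):

```python
import zlib

import lz4.block  # third-party "lz4" package, assumed available


def detect_and_compress(chunk: bytes, threshold: float = 0.98) -> bytes:
    """Run fast-mode LZ4 as a cheap compressibility probe; only spend
    zlib time on chunks that LZ4 could shrink by more than ~2 %."""
    probe = lz4.block.compress(chunk, mode='fast')
    if len(probe) >= len(chunk) * threshold:
        # essentially incompressible -> store as-is (or keep the LZ4 result)
        return chunk
    # compressible -> use the slower, stronger compressor
    return zlib.compress(chunk, 6)
```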
Using the file extension helps a bit, but it's really not that flexible, since we cannot list all the possible extensions, and sometimes files do not have a detectable extension at all. Also, there might be files with a listed extension that are still readily compressible. This may be a weird, rare case, but who knows what kind of file naming schemes people are using out there.
Use with caution: permanent data loss by specifying incorrect patterns is easily possible. Make a dry run to make sure you got everything right.

borg rewrite has many uses:
- Can selectively remove files/dirs from old archives, e.g. to free space or to purge picturarum biggus dickus from history
- Recompress data
- Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate with Borg 1.x archives. (Or to experiment with chunker-params for specific use cases.)

It is interrupt- and resumable.

Chunks are not freed on-the-fly. Rationale: this only makes sense when rechunkifying, but the logic of which input chunks to free for which new chunks is complicated and *very* delicate.

Current TODOs:
- Detect and skip (unless --force) already recompressed chunks -- delayed until current PRs on borg.key APIs are decided (borgbackup#810, borgbackup#789)
- Usage example

Future TODOs:
- Refactor tests using py.test fixtures -- would require porting ArchiverTestCase to py.test: many changes, this changeset is already borderline too large.
- Possibly add a --target option to not replace the source archive -- with the target possibly in another Repo (better than "cp" due to full integrity checking, and deduplication at the target)

Fixes borgbackup#787 borgbackup#686 borgbackup#630 borgbackup#70 (and probably some I overlooked)

Also see borgbackup#757 and borgbackup#770
@Ape I think the biggest issue is files without an extension, like some apps generate; the other cases seem rather uncommon.

@enkore good idea with the magic detection (although I am not sure how good LZ4 is as a predictor, esp. since LZMA considers much bigger scopes of input data), but it would still use quite some cycles, as it would have to invoke LZ4 (per file? or even per chunk?).

If you have a compatible-with-1.0-repos idea about how to implement precise reproduction of sparse files, other than a "pass-meta" change plus a special type of "compression" for the holes, I'd like to see it laid out in a separate ticket (or PR). For sparse files, chunk.meta['sparse'] would be saved into the repo using that special compression type.
LZ4/Speed: I am seeing 2 GB/s compression speed for fast mode on a single core. (API-wise interesting: if we give LZ4_compress_limitedOutput a smaller buffer than the input data, it will abort early. E.g. we set maxOutputSize to, say, slightly less than the input size.)

LZMA: From the tests I've made so far with the various compressions, they all end up in a similar ball park. Depending on the test data, LZ4 gives me a 25-50 % ratio, zlib somewhere like 25-35 %, and LZMA basically the same as zlib, maybe 20-35 %. More importantly: when zlib/LZMA give better results, LZ4 does as well; if zlib/LZMA do worse, LZ4 is worse as well. They seem to scale well relative to each other.

This might be related to the chunk size; I think a compression like LZMA doesn't really work well with these small chunks (<2 MB) of data. So far there were few cases where LZMA,6 and LZMA,9 actually made a tangible difference. For me it's ~10 times slower than zlib,9 and often makes little difference, sometimes even slightly larger output.

Btw., an interesting compression algorithm for chunk compression might be zstd, which has a dedicated "training mode" to be more efficient with small chunks. I guess that is a bit like "solid compression", just without actually having to squish the data together.

Gist of how I think of "perfect hole repro": use SEEK_DATA/SEEK_HOLE to map the data/holes structure to [(False, length), (True, length), ...] (or even sign-code it) and put that into the item metadata.
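For illustration, that data/hole mapping can be sketched with os.SEEK_DATA/os.SEEK_HOLE (Linux-only; the function name and the exact tuple layout are made up here, not taken from Borg):

```python
import os


def map_extents(path):
    """Return [(is_data, length), ...] describing data runs and holes
    of a sparse file, via SEEK_DATA/SEEK_HOLE (Linux, Python >= 3.3)."""
    extents = []
    with open(path, 'rb') as f:
        fd = f.fileno()
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            try:
                data_start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:  # ENXIO: no more data, the rest is one hole
                extents.append((False, size - pos))
                break
            if data_start > pos:
                extents.append((False, data_start - pos))
            # there is always an implicit hole at EOF, so this never fails
            hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
            extents.append((True, hole_start - data_start))
            pos = hole_start
    return extents
```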
I think another good idea is to check the compressor output, and if the result is bigger than (or equal to) the original data, discard the compressed output and store the data as plaintext. Mark it not just as plaintext, but invent a new compressor id, "uncompressed-uncompressible": this is needed for later, when people start recompressing their repositories with more powerful algorithms, to save CPU time on data that just plain does not compress.
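A hedged sketch of that check; the marker bytes below are invented for illustration and are not Borg's real compressor IDs:

```python
import zlib

# hypothetical 1-byte type markers, *not* Borg's on-disk compressor IDs
ZLIB_ID = b'\x01'
INCOMPRESSIBLE_ID = b'\x02'  # stored as-is AND known not to compress


def compress_chunk(data: bytes) -> bytes:
    """Compress a chunk, but fall back to plaintext if that does not help."""
    compressed = zlib.compress(data, 6)
    if len(compressed) >= len(data):
        # compression did not help: keep plaintext and remember *why*,
        # so a later recompression pass can skip this chunk cheaply
        return INCOMPRESSIBLE_ID + data
    return ZLIB_ID + compressed
```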
Regarding files that do not need compression: Duplicati2 ships a quite comprehensive (but not perfect: it lacks .opus, for example, and .tiff files can be uncompressed) list of file extensions, read from a text file in the application directory: default_compressed_extensions.txt
Maybe some entropy measurement could be used to predict how well a file/chunk will compress? I just naively ran a Rust implementation of a Shannon entropy measurement on a test set; the entropy values fell in the 1 to 8 range (the measured time includes reading the files from HDD). This is fast, as far as I can tell, but I don't know if it's a viable metric. The same .iso from the set compressed to .xz with a 0.938 ratio.
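For reference, a byte-level Shannon entropy estimate like the one described is only a few lines (a plain Python sketch, not the Rust implementation mentioned above):

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Bits per byte, between 0 and 8; values close to 8 suggest
    the data is already compressed / poorly compressible."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
```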
For a test like that you really want to vmlock the test files, to make sure no cache-trashing program (e.g. Firefox, Thunderbird, IntelliJ) kicks them out of the page cache. 140 MB/s is probably the disk I/O limit; if it's already near the CPU limit there, then it's way too inefficient. For reference, on my half-decade-old AMD processor, LZ4 processes about 2 GB/s of incompressible data.
Current coverage is 83.93%

@@            master    #810    diff @@
==========================================
  Files           15      15
  Lines         4881    4930     +49
  Methods          0       0
  Messages        0       0
  Branches       878     888     +10
==========================================
+ Hits          4093    4138     +45
- Misses         559     562      +3
- Partials       229     230      +1
Idea: implement this as a separate file of patterns given on the command line. In the given file, there would be a list of patterns, each mapped to a compression spec. The list would either end with a catch-all pattern giving the default compression, or anything not matched would fall back to the compression given on the command line.
In a future PR, the compression modes (currently a single fixed choice of none/lz4/zlib/lzma per run) could gain a heuristic "auto" mode (or even per-chunk selection). That future change could be easily integrated with this one, as the per-pattern entries would simply accept the new mode like any other compression spec.
Is there any benefit to a manually specified file path pattern whitelist/blacklist if we have automatic compressibility detection anyway?
Yes: not wasting CPU cycles if you can already tell from the path/filename. Also, you could assign different compression methods depending on the path/filename, not just "X or none".

You can also combine both methods. For some file extensions, the content may or may not already be compressed; those could get an entry in the file along the lines of the sketch below.
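Purely illustrative (no syntax was settled in this thread): such a file could map a compression spec to each pattern, with a catch-all default at the end, e.g.:

```
# hypothetical compression-patterns file -- syntax is illustrative only
none:*.gz
none:*.zip
none:*.mp4
lzma,6:*.txt
lzma,6:/var/log/*
# extensions of uncertain compressibility -> heuristic/auto mode
auto,zlib,6:*.tif
# catch-all default
zlib,6:*
```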
There are two independent parts here. See #810 (comment) and the follow-up.
@ThomasWaldmann, that's a great idea. We could possibly even have unique settings for different kinds of data. It would be great if that compression config file were sorted by compression method and easily editable. Something like this, maybe (sorry for the Lua-inspired syntax):

lz4  = {*.vmdk, *.tar, /home/user/vm/*, ...}
lzma = {*.txt, *.log, /var/log/*, ...}
auto = {
    lzma = {*.tif, *.raw, *.dng},
    zlib = {*.iso, *.dd},
    ...
}
none = {*.jpg, *.mp3, *.zip, ...}

Do I understand correctly that compression is performed per chunk, and that each chunk has metadata marking the file type it came from?

Also, there's a paper from IBM researchers specifically on compressibility prediction: PDF, Slides. They got pretty good and reliable results. There's a part about analyzing chunks of data in real time; we could utilize some of those ideas.
@ZoomRmc yes, different compression for different file types is the goal. But I'd rather re-use the existing pattern definition formats we already use elsewhere. Compression is performed per chunk (and at that time, the filename it came from is known), but we do not store the file name or type with the chunk; we store per-file metadata that has a list of content chunk IDs. Thanks for the links!
Excuse the off-topic comment, but I'd like to share my humble opinion: I'd prefer a heuristic (see the BTRFS suggestion in the other chunking issue, or a simple idea) or some built-in defaults over more options that blow up the documentation and usage complexity.
current state:
TODO:
TODO in the code