Deduplicated size of a given set of archives? #5741

Open
rom1v opened this issue Mar 20, 2021 · 6 comments
Labels
space repo space usage, quotas, ...
Comments

@rom1v
Contributor

rom1v commented Mar 20, 2021

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes, in particular https://borgbackup.readthedocs.io/en/stable/usage/info.html

Is this a BUG / ISSUE report or a QUESTION?

A question (or feature request).

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

borg 1.1.15

Operating system (distribution) and version.

Linux (Debian sid).

Describe the problem you're observing.

Let's create two directories to back up, each containing a file and an identical copy of it (10M for userA, 100M for userB):

$ mkdir userA userB
$ dd if=/dev/urandom of=userA/fileA bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.257317 s, 40.8 MB/s
$ dd if=/dev/urandom of=userB/fileB bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 2.54954 s, 41.1 MB/s
$ cp userA/fileA userA/fileA2
$ cp userB/fileB userB/fileB2
$ tree userA userB
userA
├── fileA
└── fileA2
userB
├── fileB
└── fileB2

0 directories, 4 files

Then, create an archive for userA and another for userB:

$ export BORG_REPO=borg
$ borg init -e none
$ borg create -Cnone ::userA-monday userA
$ borg create -Cnone ::userB-monday userB

The deduplicated size is (as expected) half the original size, for each user:

$ borg info ::userA-monday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB             10.49 MB
All archives:              230.69 MB            230.69 MB            115.35 MB
$ borg info ::userB-monday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:              209.72 MB            209.72 MB            104.86 MB
All archives:              230.69 MB            230.69 MB            115.35 MB

Now, let's create a new archive for each user (without any changes):

$ borg create -Cnone ::userA-tuesday userA
$ borg create -Cnone ::userB-tuesday userB

Now, the deduplicated size is basically 0 for all archives (removing any single archive would free almost no space):

$ borg info ::userA-monday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB                433 B
All archives:              461.39 MB            461.39 MB            115.35 MB
$ borg info ::userA-tuesday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB                435 B
All archives:              461.39 MB            461.39 MB            115.35 MB
$ borg info ::userB-monday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:              209.72 MB            209.72 MB                433 B
All archives:              461.39 MB            461.39 MB            115.35 MB
$ borg info ::userB-tuesday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:              209.72 MB            209.72 MB                435 B
All archives:              461.39 MB            461.39 MB            115.35 MB

But in practice, we may still be interested in knowing the deduplicated size of all of userA's archives taken together. Is it possible to get this information easily?

Currently, it seems we can retrieve the global deduplicated size ("All archives") or the deduplicated size of a single archive, but not that of a set of archives (typically, all archives with a given prefix). As a consequence, as soon as new archives with few changes are created, the per-archive deduplicated size becomes meaningless in practice.

The filtering options of borg info just list the individual archives matching the filter; they do not report the deduplicated size of the resulting set as a whole:

$ borg info --prefix userA- | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB                433 B
All archives:              461.39 MB            461.39 MB            115.35 MB
--
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB                435 B
All archives:              461.39 MB            461.39 MB            115.35 MB
@ThomasWaldmann
Member

Yeah, for the archive's deduped size, borg always computes what this archive adds (size of all unique chunks [chunks ONLY used by this archive]) compared to the rest of the repo.
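
For illustration, here is a minimal Python sketch of that definition (all names are made up, this is not borg's actual code), assuming a chunk index that maps chunk id -> (repo-wide reference count, chunk size):

from collections import Counter

def archive_unique_size(archive_chunk_ids, chunk_index):
    """Deduplicated size of one archive: the space that would be freed
    if only this archive were deleted, i.e. the total size of chunks
    referenced by no other archive."""
    local_refs = Counter(archive_chunk_ids)  # references made by this archive
    total = 0
    for cid, n in local_refs.items():
        refcount, size = chunk_index[cid]  # repo-wide refcount, chunk size
        if refcount == n:                  # every reference is from this archive
            total += size
    return total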

@ThomasWaldmann ThomasWaldmann changed the title Deduplicated size of a given prefix? Deduplicated size of a given set of archives? Mar 20, 2021
@ThomasWaldmann
Member

Related: #71

@ThomasWaldmann ThomasWaldmann added the space repo space usage, quotas, ... label Sep 30, 2024
@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0b13, 2.0.0rc1 Oct 3, 2024
@ThomasWaldmann
Member

ThomasWaldmann commented Nov 2, 2024

Related: #8514 (hashtable flags)

Guess we can implement this now:

  • allocate flags bit C (== defining set C membership) for the considered set of chunks (the union of all chunks referenced by the considered archives).
  • allocate flags bit R (== defining set R membership) for the rest set of chunks (the union of all chunks referenced by all other archives).
  • the unique size of the considered archives is then sum(size(x) for x in C - R), i.e. the total size of all chunks with the C bit set and the R bit not set (see the sketch below).
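
A rough Python sketch of these three steps (hypothetical names; in borg the flag bits would live in the chunk index hashtable, a plain dict stands in for it here):

FLAG_C = 1  # chunk referenced by at least one considered archive
FLAG_R = 2  # chunk referenced by at least one remaining archive

def unique_size_of_set(considered, rest, chunks_of, chunk_size):
    """Space that would be freed if all 'considered' archives were deleted."""
    flags = {}  # chunk id -> flag bits (stand-in for the hashtable flags)
    for archive in considered:
        for cid in chunks_of(archive):
            flags[cid] = flags.get(cid, 0) | FLAG_C
    for archive in rest:
        for cid in chunks_of(archive):
            flags[cid] = flags.get(cid, 0) | FLAG_R
    # sum(size(x) for x in C - R): C bit set, R bit not set
    return sum(chunk_size[cid] for cid, f in flags.items() if f == FLAG_C)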

Memory needs are not an issue (basically zero additional).

But the runtime is O(archives count * chunks referenced per archive), and this cost would be incurred for every considered/rest combination queried.

So I guess, runtime-wise, one such query might be comparable to a borg compact run.

Is it worth implementing this?

@ThomasWaldmann
Member

ThomasWaldmann commented Nov 2, 2024

Or, with more memory:

  • allocate one bit per archive and set that bit for every chunk referenced by that archive.
  • with that, the unique size of any combination of considered archives / rest of archives can be determined from this built-only-once bitvector.
  • the unique size is sum(size(x) for x in chunks if any C(n) bit is set and no R(m) bit is set) (see the sketch below).
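
A hypothetical Python sketch of this variant (again, made-up names, not borg code): the bitvector is built once, after which any considered/rest split can be answered from it:

def build_refs(archives, chunks_of):
    """Build once: chunk id -> bitmask with bit i set iff archives[i] references it."""
    refs = {}
    for i, archive in enumerate(archives):
        bit = 1 << i
        for cid in chunks_of(archive):
            refs[cid] = refs.get(cid, 0) | bit
    return refs

def unique_size(considered_mask, refs, chunk_size):
    """Total size of chunks referenced only by archives inside considered_mask."""
    return sum(chunk_size[cid]
               for cid, mask in refs.items()
               if mask & considered_mask and not (mask & ~considered_mask))

For example, considered_mask = (1 << 0) | (1 << 1) treats archives[0] and archives[1] as the considered set and everything else as the rest.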

Memory needs: an additional archives_count / 8 * chunks_count bytes.

Runtime is O(archives count * chunks referenced per archive), but only once; after that (as long as the chunk index is in memory), any combination can be queried quickly.

The problem in general is the number of queries needed: if my maths aren't wrong, one would need 2^archives_count queries against that index to check all combinations of selected archives vs. remaining archives (with 20 archives, that is already about a million queries).

If we were only interested in consecutive archives, the number of queries would drop to O(archives_count^2) (400 queries for 20 archives). This could be reduced further by limiting the length of such a consecutive-archives "run", which makes sense especially when the archive count is high (and the archives come from the same backup source data set).

@rom1v
Contributor Author

rom1v commented Nov 29, 2024

allocate one bit per archive, set the bit for the chunks referenced by that archive

That makes sense.

The problem in general is the number of queries needed: if my maths aren't wrong, one would need 2^archives_count queries against that index to check all combinations of selected archives vs. remaining archives.

But typically we don't want to check all combinations of selected archives; we just want to measure the "marginal"/deduplicated size of one given subset of archives (or maybe I didn't understand what you said).

@ThomasWaldmann
Member

@rom1v If you have fixed sets R and C, yes, you only need to query the index once.

Building the index needs to iterate over all archives though (and some people have many archives), so that does not scale very well (similar to the archives part of borg check).

If the sets are fixed and only that one query is intended, the 2-bit approach is better; the n-bit approach would just consume more memory without giving any advantage.
