Deduplicated size of a given set of archives? #5741

Open
rom1v opened this issue Mar 20, 2021 · 6 comments
Labels
space repo space usage, quotas, ...
Comments

@rom1v
Contributor

rom1v commented Mar 20, 2021

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes, in particular https://borgbackup.readthedocs.io/en/stable/usage/info.html

Is this a BUG / ISSUE report or a QUESTION?

A question (or feature request).

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

borg 1.1.15

Operating system (distribution) and version.

Linux (Debian sid).

Describe the problem you're observing.

Let's create two directories to back up, each containing a file and an identical copy of it (10M for userA, 100M for userB):

$ mkdir userA userB
$ dd if=/dev/urandom of=userA/fileA bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.257317 s, 40.8 MB/s
$ dd if=/dev/urandom of=userB/fileB bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 2.54954 s, 41.1 MB/s
$ cp userA/fileA userA/fileA2
$ cp userB/fileB userB/fileB2
$ tree userA userB
userA
├── fileA
└── fileA2
userB
├── fileB
└── fileB2

0 directories, 4 files

Then, create an archive for userA and another for userB:

$ export BORG_REPO=borg
$ borg init -e none
$ borg create -Cnone ::userA-monday userA
$ borg create -Cnone ::userB-monday userB

The deduplicated size is (as expected) half the original size, for each user:

$ borg info ::userA-monday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB             10.49 MB
All archives:              230.69 MB            230.69 MB            115.35 MB
$ borg info ::userB-monday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:              209.72 MB            209.72 MB            104.86 MB
All archives:              230.69 MB            230.69 MB            115.35 MB

Now, let's create a new archive for each user (without any changes):

$ borg create -Cnone ::userA-tuesday userA
$ borg create -Cnone ::userB-tuesday userB

Now, the deduplicated size is basically 0 for all archives (removing any single archive would free almost no space):

$ borg info ::userA-monday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB                433 B
All archives:              461.39 MB            461.39 MB            115.35 MB
$ borg info ::userA-tuesday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB                435 B
All archives:              461.39 MB            461.39 MB            115.35 MB
$ borg info ::userB-monday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:              209.72 MB            209.72 MB                433 B
All archives:              461.39 MB            461.39 MB            115.35 MB
$ borg info ::userB-tuesday | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:              209.72 MB            209.72 MB                435 B
All archives:              461.39 MB            461.39 MB            115.35 MB

But in practice, we may still be interested in knowing the deduplicated size of all of userA's archives taken together. Is it possible to get this information easily?

Currently, it seems we can retrieve the global deduplicated size ("All archives") or the deduplicated size of a single archive, but not that of a set of archives (typically, all archives with a given prefix). As a consequence, as soon as new archives with few changes are created, the per-archive deduplicated size becomes meaningless in practice.

The filtering options of borg info just list the individual archives matching the filter; they do not report the deduplicated size of the resulting set as a whole:

$ borg info --prefix userA- | grep -A2 Dedup
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB                433 B
All archives:              461.39 MB            461.39 MB            115.35 MB
--
                       Original size      Compressed size    Deduplicated size
This archive:               20.97 MB             20.97 MB                435 B
All archives:              461.39 MB            461.39 MB            115.35 MB
@ThomasWaldmann
Member

Yeah, for the archive's deduped size, borg always computes what this archive adds (size of all unique chunks [chunks ONLY used by this archive]) compared to the rest of the repo.
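
For illustration, here is a minimal Python sketch of that definition (all names are made up, this is not borg's actual code), assuming a chunk index that maps chunk id -> (repo-wide reference count, chunk size):

from collections import Counter

def archive_unique_size(archive_chunk_ids, chunk_index):
    """Deduplicated size of one archive: the space that would be freed
    if only this archive were deleted, i.e. the total size of chunks
    referenced by no other archive."""
    local_refs = Counter(archive_chunk_ids)  # references made by this archive
    total = 0
    for cid, n in local_refs.items():
        refcount, size = chunk_index[cid]  # repo-wide refcount, chunk size
        if refcount == n:                  # every reference is from this archive
            total += size
    return total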

@ThomasWaldmann ThomasWaldmann changed the title Deduplicated size of a given prefix? Deduplicated size of a given set of archives? Mar 20, 2021
@ThomasWaldmann
Member

Related: #71

@ThomasWaldmann ThomasWaldmann added the space repo space usage, quotas, ... label Sep 30, 2024
@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0b13, 2.0.0rc1 Oct 3, 2024
@ThomasWaldmann
Member

ThomasWaldmann commented Nov 2, 2024

Related: #8514 (hashtable flags)

Guess we can implement this now:

  • allocate flags bit C (== defining set C membership) for the considered set of chunks (the union of all chunks referenced by the considered archives).
  • allocate flags bit R (== defining set R membership) for the rest set of chunks (the union of all chunks referenced by all other archives).
  • the unique size of the considered archives is then sum(size(x) for x in C - R), i.e. the total size of all chunks with the C bit set and the R bit not set (see the sketch below).
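
A rough Python sketch of these three steps (hypothetical names; in borg the flag bits would live in the chunk index hashtable, a plain dict stands in for it here):

FLAG_C = 1  # chunk referenced by at least one considered archive
FLAG_R = 2  # chunk referenced by at least one remaining archive

def unique_size_of_set(considered, rest, chunks_of, chunk_size):
    """Space that would be freed if all 'considered' archives were deleted."""
    flags = {}  # chunk id -> flag bits (stand-in for the hashtable flags)
    for archive in considered:
        for cid in chunks_of(archive):
            flags[cid] = flags.get(cid, 0) | FLAG_C
    for archive in rest:
        for cid in chunks_of(archive):
            flags[cid] = flags.get(cid, 0) | FLAG_R
    # sum(size(x) for x in C - R): C bit set, R bit not set
    return sum(chunk_size[cid] for cid, f in flags.items() if f == FLAG_C)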

Memory needs are not an issue (basically zero additional).

But the runtime is O(archives count * chunks referenced per archive), and this cost would be incurred for every considered/rest combination queried.

So I guess, runtime-wise, one such query might be comparable to a borg compact run.

Is it worth implementing this?

@ThomasWaldmann
Member

ThomasWaldmann commented Nov 2, 2024

Or, with more memory:

  • allocate one bit per archive and set that bit for every chunk referenced by that archive.
  • with that, the unique size of any combination of considered archives / rest of archives can be determined from this built-only-once bitvector.
  • the unique size is sum(size(x) for x in chunks if any C(n) bit is set and no R(m) bit is set) (see the sketch below).
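
A hypothetical Python sketch of this variant (again, made-up names, not borg code): the bitvector is built once, after which any considered/rest split can be answered from it:

def build_refs(archives, chunks_of):
    """Build once: chunk id -> bitmask with bit i set iff archives[i] references it."""
    refs = {}
    for i, archive in enumerate(archives):
        bit = 1 << i
        for cid in chunks_of(archive):
            refs[cid] = refs.get(cid, 0) | bit
    return refs

def unique_size(considered_mask, refs, chunk_size):
    """Total size of chunks referenced only by archives inside considered_mask."""
    return sum(chunk_size[cid]
               for cid, mask in refs.items()
               if mask & considered_mask and not (mask & ~considered_mask))

For example, considered_mask = (1 << 0) | (1 << 1) treats archives[0] and archives[1] as the considered set and everything else as the rest.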

Memory needs: an additional archives_count / 8 * chunks_count bytes.

Runtime is O(archives count * chunks referenced per archive), but only once; after that (as long as the chunk index is in memory), any combination can be queried quickly.

The problem in general is the number of queries needed: if my maths aren't wrong, one would need 2^archives_count queries against that index to check all combinations of selected archives vs. remaining archives (with 20 archives, that is already about a million queries).

If we were only interested in consecutive archives, the number of queries would drop to O(archives_count^2) (400 queries for 20 archives). This could be reduced further by limiting the length of such a consecutive-archives "run", which makes sense especially when the archive count is high (and the archives come from the same backup source data set).

@rom1v
Contributor Author

rom1v commented Nov 29, 2024

allocate one bit per archive, set the bit for the chunks referenced by that archive

That makes sense.

The problem in general is the number of queries needed: if my maths aren't wrong, one would need 2^archives_count queries against that index to check all combinations of selected archives vs. remaining archives.

But typically we don't want to check all combinations of selected archives; we just want to measure the "marginal"/deduplicated size of one given subset of archives (or maybe I didn't understand what you said).

@ThomasWaldmann
Member

@rom1v If you have fixed sets R and C, yes, you only need to query the index once.

Building the index needs to iterate over all archives though (and some people have many archives), so that does not scale very well (similar to the archives part of borg check).

If the sets are fixed and only that one query is intended, the 2-bit approach is better; the n-bit approach would just consume more memory without giving any advantage.
