Deduplicated size of a given set of archives? #5741
Comments
Yeah, for the archive's deduped size, borg always computes what this archive adds (size of all unique chunks [chunks ONLY used by this archive]) compared to the rest of the repo.
Related: #71
Related: #8514 (hashtable flags). Guess we can implement this now:
Memory needs are not an issue (basically zero additional), but runtime is O(archives count * chunks referenced per archive). This runtime would be needed per combination of considered/rest archives, so guess this might be similar to a … Is it worth implementing this?
Or, with more memory:
Memory needs: an additional archives_count/8 * chunks_count bytes. Runtime is O(archives count * chunks referenced per archive), but only once: after that (as long as the chunk index is in memory), any combination can be queried quickly.
The problem in general is the amount of queries needed: if my maths aren't wrong, one would need 2^archives_count queries against that index to check all combinations of selected archives vs. remaining archives. If we were only interested in consecutive archives, the number of queries would drop to O(archives_count^2) (e.g., for 20 archives that is about 10^6 queries for all combinations vs. a few hundred for consecutive runs). This could be further reduced by limiting the length of such a consecutive-archives "run", which makes sense especially if the archives count is rather high (and the archives are from the same backup source data set).
That makes sense.
But typically we don't want to check all combinations of selected archives, just measure the "marginal/deduplicated size" of a given subset of archives (or I didn't understand what you said).
@rom1v If you have fixed sets R and C, yes, you only need to query the index once. Building the index needs to iterate over all archives though (and some people have many archives), so that does not scale very well (similar to the archives part of …). If the sets are fixed and only that one query is intended, the 2-bit approach is better; the n-bit approach would just consume more memory and would not give any advantage.
Have you checked borgbackup docs, FAQ, and open Github issues?
Yes, in particular https://borgbackup.readthedocs.io/en/stable/usage/info.html
Is this a BUG / ISSUE report or a QUESTION?
A question (or feature request).
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
borg 1.1.15
Operating system (distribution) and version.
Linux (Debian sid).
Describe the problem you're observing.
Let's create two directories to back up, each containing a duplicated file (respectively 10M and 100M):
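For example, something like this (directory layout and sizes as described above; file names are made up for illustration):

```bash
# Illustrative setup: each user directory contains one file plus an identical copy of it.
mkdir -p userA userB
dd if=/dev/urandom of=userA/data bs=1M count=10     # 10M file for userA
cp userA/data userA/data.copy                       # duplicate of the same file
dd if=/dev/urandom of=userB/data bs=1M count=100    # 100M file for userB
cp userB/data userB/data.copy                       # duplicate of the same file
```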
Then, create an archive for userA and another for userB. The deduplicated size is (as expected) half the original size, for each user:
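For illustration (repository path, encryption mode and archive names are made up; exact numbers depend on compression and chunking):

```bash
# Illustrative commands; /path/to/repo and the archive names are placeholders.
borg init --encryption=none /path/to/repo
borg create --stats /path/to/repo::userA-1 userA
borg create --stats /path/to/repo::userB-1 userB

# Per-archive stats: the deduplicated size is roughly half the original size,
# because the copied file only references chunks that are already stored.
borg info /path/to/repo::userA-1
borg info /path/to/repo::userB-1
```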
Now, let's create a new archive for each user (without any changes). The deduplicated size then becomes basically 0 for all archives (removing exactly one archive would not save any space):
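Continuing the illustrative example (same placeholder repository and naming scheme):

```bash
# Second round of archives, source data unchanged.
borg create --stats /path/to/repo::userA-2 userA
borg create --stats /path/to/repo::userB-2 userB

# Each archive's deduplicated size now drops to almost nothing: every chunk is
# also referenced by the other archive of the same user, so no chunk is unique
# to any single archive.
borg info /path/to/repo::userA-1
borg info /path/to/repo::userA-2
```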
But in practice, we may still be interested in knowing the deduplicated size of all archives of userA. Is it possible to get this information easily?
Currently, it seems we can retrieve the global deduplicated size ("All archives"), or the deduplicated size of a single archive, but not that of a set of archives (typically for a given prefix). As a consequence, as soon as a new archive is created with few changes, the deduplicated size is meaningless in practice.
The filtering options of borg info just list individual archives matching the filter, but not the deduplicated size of the set of resulting archives:
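For example, assuming the placeholder names above and the archive filter options of borg info in 1.1:

```bash
# This prints per-archive stats for every archive whose name starts with "userA",
# but no combined "deduplicated size of all userA archives taken together".
borg info /path/to/repo --prefix userA
```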