"Analyze" function to find (and remove) missed non-dedupable temp/cache hotspots #71

Closed
jumper444 opened this issue Jun 27, 2015 · 5 comments · Fixed by #8436

Comments

@jumper444

My previous issue asked whether the 'delete' command could be modified to remove individual files or directories from one or more archives (or from the entire repository). The feature discussed below is a way of finding non-dedupable 'hotspots' in backups (typically missed or hidden cache or temp files) so that they can then be deleted to reclaim space.

I suggest considering a command such as "analyze" that works at the repository level (or across multiple archives; the more the better). This command would look for two things:

  1. Files (of fixed name and directory location) which, over multiple backups, have an extremely high non-dedupable ratio of data vs their size.

  2. Directories (of fixed name and location) which, over multiple backups, have a very high ratio of non-dedupable data vs their size.

Such a scan would immediately reveal accidentally missed swap files, temp files, and temp directories. An administrator could use the command to find candidate data and, after further analysis, delete it.

In case (1), the file name and location stay the same between archives, yet the file keeps changing, so every backup adds a very large amount of new data chunks for it. That almost certainly means you've found some sort of temp file whose deletion from the backup will reclaim a large amount of space. For example, on backups of Windows machines this test would flag "pagefile.sys" (the Windows swap file) as a huge red flag. Note that it isn't in a 'cache' directory and doesn't have a .TMP extension, yet this file does not need to be backed up, and its exclusion (or deletion post-backup with the 'delete' command) would allow massive size savings.
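A minimal sketch of how case (1) could be scored, assuming each archive can be represented as a mapping from file path to a list of `(chunk_id, size)` pairs. That input shape and the name `file_churn` are hypothetical and only for illustration; this is not borg's internal data model or API.

```python
from collections import defaultdict


def file_churn(archives):
    """Return {path: fraction of bytes that were new chunks}, per file.

    `archives` is a hypothetical list (oldest first) of dicts mapping a
    file path to a list of (chunk_id, size_in_bytes) pairs.
    The first archive only seeds the set of known chunks, because in a
    first backup everything is new by definition.
    """
    seen = set()                    # chunk ids stored by earlier archives
    new_bytes = defaultdict(int)
    total_bytes = defaultdict(int)
    for index, archive in enumerate(archives):
        added = set()               # chunk ids first seen in this archive
        for path, chunks in archive.items():
            for chunk_id, size in chunks:
                if index > 0:
                    total_bytes[path] += size
                    if chunk_id not in seen and chunk_id not in added:
                        new_bytes[path] += size
                added.add(chunk_id)
        seen |= added
    return {p: new_bytes[p] / total_bytes[p] for p in total_bytes if total_bytes[p]}


# Made-up example data: the swap file changes completely every time,
# the text file never changes.
archives = [
    {"pagefile.sys": [("a1", 100)], "notes.txt": [("n1", 10)]},
    {"pagefile.sys": [("a2", 100)], "notes.txt": [("n1", 10)]},
    {"pagefile.sys": [("a3", 100)], "notes.txt": [("n1", 10)]},
]
ratios = file_churn(archives)
for path, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{ratio:5.0%}  {path}")
```

A file whose ratio stays close to 100% over many archives would be a candidate for manual inspection; what threshold and how many archives are needed is something that would have to be found by testing.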

Case (2) is where the names of the temp files keep changing randomly (so case (1) won't work) but their location doesn't change. This would find hotspots like "c:\windows\temp", which again could be deleted from a backup repository to reclaim space. (In this case the directory is clearly labeled 'temp', but this was just the first example I could think of. Computers have many temp directories with random file names that don't stand out when you just look at their names.)
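For case (2), one possible approach is to roll per-file numbers up to every ancestor directory, so that a directory full of randomly named files still shows up as one hotspot. The sketch below assumes a hypothetical `file_stats` mapping of path to `(new_bytes, total_bytes)` summed over several archives; the function name and input shape are illustrative assumptions, not anything borg provides.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def directory_churn(file_stats):
    """Aggregate (new_bytes, total_bytes) per file into a ratio per directory.

    `file_stats` is a hypothetical dict: path -> (new_bytes, total_bytes).
    Each file contributes to all of its ancestor directories.
    PureWindowsPath is used only because the example paths are Windows
    paths; it works on any platform.
    """
    new_bytes = defaultdict(int)
    total_bytes = defaultdict(int)
    for path, (new, total) in file_stats.items():
        for parent in PureWindowsPath(path).parents:
            new_bytes[str(parent)] += new
            total_bytes[str(parent)] += total
    return {d: new_bytes[d] / total_bytes[d] for d in total_bytes if total_bytes[d]}


# Made-up example data: two randomly named temp files, one stable binary.
file_stats = {
    r"C:\Windows\Temp\x1f3.tmp": (500, 500),
    r"C:\Windows\Temp\9ab2.tmp": (300, 300),
    r"C:\Windows\notepad.exe": (0, 200),
}
churn = directory_churn(file_stats)
for directory, ratio in sorted(churn.items(), key=lambda item: -item[1]):
    print(f"{ratio:5.0%}  {directory}")
```

Because parents inherit the churn of their children, the deepest directory with a high ratio is usually the interesting one; the higher levels are diluted by stable data.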

The specific parameters of the analyze command would need some testing to determine what to display and how to calculate and display it. Obviously, any results would require further manual inspection before anything is deleted. But such a feature would do a good job of highlighting missed hotspots in large or complex backups.

Thoughts?

@jumper444 jumper444 changed the title "Analyze" function to find (and remove) missed non-dedupe hotspots "Analyze" function to find (and remove) missed non-dedupable temp/cache hotspots Jun 27, 2015
@ThomasWaldmann
Member

Interesting idea, but quite some effort to implement. So the question is whether you can't simply find these files/directories by looking at the --verbose log output of the second and later backups. There, borg reports U for unchanged files and A for added files.
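A rough sketch of that log-based approach, assuming each relevant log line has the form `<status letter> <path>`. The exact log format, the helper name `changed_counts`, and the `M` status used in the sample data are assumptions for illustration; only U and A are named above.

```python
from collections import Counter


def changed_counts(log_texts):
    """Count, per path, how often it was reported with a status other than U.

    `log_texts` is a list of log outputs (one string per backup run).
    Lines that do not look like "<single letter> <path>" are ignored.
    """
    counts = Counter()
    for text in log_texts:
        for line in text.splitlines():
            status, _, path = line.partition(" ")
            if len(status) == 1 and path and status != "U":
                counts[path] += 1
    return counts


# Made-up log excerpts from a second and third backup run.
second_run = "U home/user/notes.txt\nM swapfile\nA home/user/cache/blob1\n"
third_run = "U home/user/notes.txt\nM swapfile\nA home/user/cache/blob2\n"
counts = changed_counts([second_run, third_run])
for path, n in counts.most_common():
    print(n, path)
```

This only counts how often a path was not unchanged; it says nothing about how many bytes were actually new, so it is a cheaper but coarser signal than a chunk-level analysis.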

@RonnyPfannschmidt
Contributor

I am interested in helping with this (as this is what I currently want to do).
In combination with recreate --exclude, it can be used to clean up backups iteratively.

@ThomasWaldmann ThomasWaldmann added this to the 2.x milestone Sep 29, 2024
@ThomasWaldmann
Member

This idea depends on analysing only archives that contain basically the same data set at different points in time.

For borg 1.x that would mean some pattern matching on the archive name (like -a), for borg 2 it could also use archive series (identical archive names).
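As an illustration of selecting such a set by name, here is a small sketch using shell-style glob matching from the Python standard library. The archive names and the helper `select_series` are made up, and the glob syntax is an assumption; how borg's own `-a` option matches names may differ.

```python
from fnmatch import fnmatchcase


def select_series(archive_names, pattern):
    """Return the archive names matching a shell-style pattern, sorted.

    Sorting by name only gives chronological order if the names embed a
    sortable timestamp, as in the made-up examples below.
    """
    return sorted(name for name in archive_names if fnmatchcase(name, pattern))


# Hypothetical archive names from two different machines.
names = ["laptop-2024-09-01", "server-2024-09-01", "laptop-2024-09-02"]
series = select_series(names, "laptop-*")
print(series)
```

Restricting the analysis to one such series avoids comparing unrelated data sets, where almost every chunk would look "new".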

@ThomasWaldmann
Member

ThomasWaldmann commented Sep 29, 2024

#8436 is a start.

@ThomasWaldmann
Member

@jumper444 @RonnyPfannschmidt can you review the PR / give feedback?


4 participants