Thoughts on "delete" for files/directories? #70
Comments
Assuming you prune your backups, you can just add the unwanted files to the exclude list and wait until the archives which contain them get pruned. It takes a while (until the last reference to these chunks is gone), but it works. The feature you suggest is doable, I guess, but quite a bit of work (and also a bit dangerous to use; imagine someone enters a blank at a bad place and kills all his backup files), so the question is whether it should be implemented, and with what priority.
The ability to delete anything on a computer has simple, known risks. They are what they are, but programs, filesystems, and databases don't remove the command just because it could go wrong or be misused. The common solution for most utilities I've seen is a 'no action' or 'list changes only' command line switch. For robocopy on Windows it is "/L", meaning: list all changes you would make with this invocation, but don't actually do anything. You run with "/L", review, and make sure your wildcards, regexes, or whatever are hitting what you expect. Then you remove the switch and make it permanent.

I presume that when removing an archive, Borg looks it up, gets a list of contained files, then issues a separate delete call per file. This new concept just creates the list separately, from 'exclude file' type patterns which are already programmed, instead of from the archive record.

I see value in the feature, and I see similar commands on most other utilities that operate on data/file batches. Waiting on a prune (which I currently don't do) isn't really the same thing. Even then, I would only prune things after years (based on various data requirements). The file/directory delete allows efficient cleanup of either 'non-dedupable hotspots' found post-backup, or removal of whole segments of data which are clearly no longer needed, without disturbing the remainder of an archive which might be needed for years.
What about a 'copy on write' approach, as in ZFS? We would never delete directly; archives would remain intact (at least in the first step). Borg would gain the feature of allowing one of its OWN ARCHIVES as an input argument for the CREATE command. The exclude file and patterns would still be allowed, so you would just be running an archive through the program, applying exclude-file patterns, and storing it back in the repository under another archive name: a loopback which theoretically would take almost no space or time to create. Then (upon inspection and whatever satisfaction is necessary) the user uses the existing delete command to delete the first archive. Thus you have a relatively safe (two-step, atomic) way of removing specific files and directories from Borg.

Example: take backup "april" and make a new one, "aprilCleaned", by running it through the exclude file. Then delete the original backup when you are sure you don't want it, leaving "aprilCleaned" instead. Essentially you have now removed a subset of files and directories.

What we are missing at this time is a RENAME command for an archive to finish things cleanly. The hack for now would be to do the above process a second time, without any excludes, just to get the desired archive name.

So what we've come up with in this post is a way of deleting files from an archive without using a specific file/directory delete command (assuming that is desired), by instead allowing Borg to accept its own archives as input for a create command, in a loopback.
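The two-step loopback described above could look roughly like this. To be clear about assumptions: the archive-as-input form of `borg create` does not exist in Borg, and the exclude-file name is made up; the commands are only composed and printed for review, not executed.

```shell
#!/bin/sh
# Hypothetical two-step cleanup via "archive as create input".
# The create-from-archive syntax does NOT exist in Borg; this is a sketch.
REPO=/backups/repo

# Step 1: run the "april" archive back through create, applying excludes,
# and store the result under a new archive name.
STEP1="borg create --exclude-from ExcludeFile $REPO::aprilCleaned $REPO::april"

# Step 2: once satisfied with "aprilCleaned", delete the original archive.
STEP2="borg delete $REPO::april"

echo "$STEP1"
echo "$STEP2"
```

Because step 2 only runs after the new archive has been inspected, a mistake in the exclude patterns costs nothing: the original archive is still there.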
In my opinion this is the number one feature missing from Attic & Borg. Not having it means that I have to put a lot of planning and care into what gets backed up, as trying to remove unwanted data later on basically isn't possible. I'd really like to have a few big backup archives, but having to account for removing certain data later means I have to split all my backups into a bunch of separate archives, just in case. My paranoia on this issue often leads me to delay setting up a backup script until I've had time to inspect all the source files for things to exclude, meaning in some instances I'm not running backups on some systems at all, just because of this.

I don't prune my backups (based on time); I prefer to keep everything... except for stuff that didn't need keeping in the first place. But this isn't really possible at the moment, so I have to either keep everything or start a new archive from scratch, thereby losing much of the advantage of deduplication to begin with. CrashPlan makes this pretty easy: you can just deselect a folder from your selected sources, and it will remove all data for those files.

Aside from the "accidentally included stuff that should have been excluded to begin with" case, I'd also like to use Borg/Attic to back up things like all my photos BEFORE I've had the chance to cull them. The same goes for pretty much any other data that was assumed needed to begin with, but later isn't. As it currently stands, there are quite a few use cases like this that mean for now I'm stuck with CrashPlan or something like rsnapshot.

And yes, we definitely need a dry-run command option to see what will occur before running it for real. Thanks!
@hi2u thanks for the feedback, some comments:

You can follow a slightly different, safer strategy for your backups: just start from an "everything" (full system) backup. Only exclude the most obvious stuff that you can immediately decide on (if anything). That might be a bit bigger than needed, but at least you immediately have a backup without any "planning" delay. Use -v so it creates a list of all files it backs up, so you can look through them to refine your exclude list. If it takes 2 weeks to optimize your excludes, you can still delete the backup archives of the first 2 weeks later, when you are sure you do not need them any more. The repo will not consume more space than if you had not made these 2 weeks of backups at all.

Keeping every backup is not advisable; it might get rather slow in case a cache resync is needed. The first time it does a cache resync, the time will grow linearly with the number of backup archives (later, it will be faster). Also, the space needs in .cache/borg are linear in the backup archive count. So consider the usual pruning approach of having good coverage of the recent hours/days/weeks, but less coverage of the more distant past.

About CrashPlan: that is a rather dangerous operation; do they ask for confirmation about that deselection?
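The "back up everything first, refine excludes later" strategy above can be sketched as a single command. The `--list` and `--exclude-from` flags are real `borg create` options (check `borg create --help` on your version); the repository path, excludes path, and archive naming are just examples. The command is composed and printed here rather than run:

```shell
#!/bin/sh
# Sketch: full-system backup with per-file listing, driven by an exclude
# file that grows as you review the listed output. Paths are examples.
REPO=/backups/repo
EXCLUDES=/etc/borg/excludes.lst

CMD="borg create --list --exclude-from $EXCLUDES $REPO::{hostname}-{now} /"
echo "$CMD"
```

Each run prints every file it backs up; anything you spot that should not be there goes into the excludes file before the next run, and the early over-large archives can simply be deleted later.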
BTW,
2 weeks wouldn't be a big deal, but I often want to delete stuff from over a year ago, while of course keeping the rest of the data from those old backups. A good example of this is that I take a lot of photos on a DSLR. Often I don't even look through the photos until a year or two later, but when I do, I'll delete all the bad shots, which is usually more than half of them. The storage considerations are even bigger when it comes to audio and video files.

Yeah, CrashPlan gives you a clear warning after de-selecting a folder (and removing it from backup archives). Sure it could be dangerous, but not as much as deleting an entire repo/archive, which is currently the only option as I understand it?

If this were to be implemented, would it make it any easier to make a mounted archive read+write (including deleting files/folders)? I assume that would probably be even harder than a new Borg command? But it would be amazingly flexible and useful.
If one implemented a global delete that cuts a file out of all archives, that would be more dangerous than completely deleting a single archive (because the file would then be gone from all the backups). A read-write filesystem isn't the right way, and performance wouldn't be good.
On Mon, Sep 28, 2015 at 04:43:23AM -0700, TW wrote:
Maybe delete-files (or some other appropriate name, like maybe "filter"). Best case, there'd be an option to loop over all archives in the repository. If that seems like a desirable implementation, I'm interested in working on it.

Ed Blackman
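The "loop over all archives in the repository" option could also be scripted externally. `borg list --short` (one archive name per line) is a real command; the per-archive `borg filter` invocation shown is the hypothetical command under discussion. A placeholder archive list stands in so the sketch runs without a repository:

```shell
#!/bin/sh
# Sketch: apply a (hypothetical) per-archive filter to every archive.
REPO=/backups/repo

# In practice, with a real repository: ARCHIVES=$(borg list --short "$REPO")
ARCHIVES="april may june"

for A in $ARCHIVES; do
    # Hypothetical subcommand; echoed for review rather than executed.
    echo "borg filter --exclude '*.tmp' $REPO::$A"
done
```

Building the loop outside Borg keeps the dangerous operation to one archive at a time, which fits the "start small" sentiment later in this thread.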
I had written a separate issue (as referenced) immediately after this one and wanted to link it here, because it depends on file or directory delete functionality. Obviously one of the main features of Borg is dedupe. Please read issue 71 briefly if this sounds interesting.
OK, let's assume global-delete is technically doable as a command. What would the command line / user interaction look like?
Coming from the Windows side of things, I'm not as qualified to answer the question. I would presume that in the Linux/Unix world there are various utilities in common use that have faced this, and that this would not be the first (or second) shell tool to solve it. My point being that while I'm unable to name the solution, I would expect one already exists. What can be copied instead of re-inventing the wheel?
Is rsync's implementation a good one? Also, I find the approach taken by robocopy on Windows interesting, as it helps to prevent mistakes.

(I jumped on regex as possibly too much in my first post, but then basically suggested it in this second one. Maybe that simply is the standard that should be used? I'm OK with it, but again I think the convention should be something common to most Linux users, and I'm not sure which of a few competing approaches that would be.)
Command name

I guess the first question to answer is whether this should be:

a) Additional functionality added to the existing "delete" command.
b) A new, separate command.

I think B is the better option, as it gives some clear separation from Attic, and is just clearer in general, as they're fairly different operations.

Wildcards/filters

Personally, I think to start with at least, don't even worry about the filter/wildcard stuff. Just the ability to delete specific folders/files (without wildcards) would meet my needs, as I can simply mount or list a repo/archive, then write a small script (or just write an xargs command) to generate the commands to delete everything I want to get rid of. This means there's less need for you (the Borg developers) to spend any time worrying about a user accidentally deleting too much stuff. Without wildcard functionality, it's up to the user to script multiple deletions themselves. You don't even need to worry about deletions occurring across multiple archives in a single command either; this can also be done with an external script.

Of course some filter/wildcard functionality down the track could be nice too. But don't let all the extra work of that get in the way of just getting a basic single-deletion command going. Filtering can be treated as a separate development task in the future, and I don't even think it's very important, considering this isn't a command that people would be using every day.

The most common and flexible way to filter files/folders is the "GNU find" command. This is what I'd probably be using on my mounted archive in my script to generate the "borg rm" commands. So if you did end up doing internal filtering in Borg down the track, perhaps "find" itself, or its syntax, could be involved in some way (probably would require mounting). Combining with "GNU find" in some way follows the UNIX philosophy of "do one thing and do it well". There's no need to re-invent the "how to filter a list of folders/files" wheel.
Find already does a great job, but obviously requires mounting. Also, as @jumper444 mentioned, 99% of the time users probably want to differentiate paths of folders vs files. Some integration with "GNU find" (even if by an external script) covers this, and much more filtering power.

I don't know Python, so I can't really help with Borg itself. But if somebody could develop a non-wildcard/filtering "rm" command at least, I (or anyone else) could contribute an external script that does the mounting, FINDing, unmounting, and RMing with Borg. Such a script would be pretty easy to write. It might be a bit slow to execute, considering it's working on a mounted archive, but as I mentioned before, this isn't an everyday operation, so performance doesn't matter so much.
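The mount-find-rm pipeline described above might look like the following. `borg mount`, `borg umount`, and GNU find are real tools; `borg rm` is the hypothetical deletion command this thread is asking for, so all borg invocations are echoed for review rather than executed:

```shell
#!/bin/sh
# External-script approach: mount the archive, select paths with find,
# emit one (hypothetical) deletion command per match for review.
REPO=/backups/repo
ARCHIVE=april
MNT=$(mktemp -d)

echo "borg mount $REPO::$ARCHIVE $MNT"
# Each path that find prints would become one `borg rm` invocation:
echo "find $MNT -type f -name '*.tmp' | xargs -n1 echo borg rm $REPO::$ARCHIVE"
echo "borg umount $MNT"

rmdir "$MNT"
```

Because the script only prints the deletion commands, the user gets the robocopy-style "/L" review pass for free: inspect the list, then pipe it to a shell when satisfied.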
I think there is an 80/20 approach here which is safe and simple. No heavy engineering or complexity to solve something which may not even be needed (or may need to be done differently after experience). By this I mean no regex or wildcards or such. Just start small and adapt if needed. I see two commands:
I would presume there could be multiple of each delete type in a single shell command (if desired), or a reference to a text file containing multiple file/directory lines if there are many (Borg already has code to handle an external list file, as I recall).

Finally, there is the problem of specifying which archives any particular delete applies to. I see three options, the first two of which are obvious. The third may be extra or presumptuous. I mention option C because of the nature of Borg as a backup over time, and also due to things like "prune", which likewise logically operates on time spans. For simplicity, drop C and just have A or B to start.

As for syntax, there is already a "borg delete". Logically it is already constructed so that it appears to be a generic delete command. It is already allowing:

so it might be argued that if you keep this view/approach, then all you are doing with a directory is going one "unit" lower down INTO the archive and deleting the next smaller unit, a directory. Logically it would then follow that the unit smaller than a directory is simply a file. So this thought process keeps "borg delete" and extends the syntax to go finer and finer (smaller and smaller units inside the repository; 4 levels (repository, archive, directory, file) instead of the current 2 (repository, archive)).

HOWEVER, concern has been expressed that file/dir deletion is more dangerous and unique (and perhaps more complex). So perhaps it warrants a new command to firewall it apart. So maybe:

Those are my thoughts for now on what might be a minimum implementation and syntax.
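The four nesting levels proposed above (repository, archive, directory, file) would then read as a natural extension of the existing `borg delete` syntax. The first two invocations below are real today; the two deeper levels are the hypothetical extension:

```shell
#!/bin/sh
# The existing two levels, then the two proposed deeper ones (hypothetical).
L1="borg delete myRepository"                               # whole repository
L2="borg delete myRepository::aprilBackup"                  # one archive
L3="borg delete myRepository::aprilBackup home/user/tmp"    # a directory (proposed)
L4="borg delete myRepository::aprilBackup home/user/a.tmp"  # a single file (proposed)
printf '%s\n' "$L1" "$L2" "$L3" "$L4"
```

The appeal of this shape is that the user learns nothing new: the positional argument after the archive simply narrows the scope, level by level.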
TODO: define goals for a bounty, start simple.
Definitely the number one missing feature for me as well. It has happened to me a lot to realize later that I've backed up, by mistake, huge data that I don't care about, because I can easily retrieve it from somewhere else, or because it's temporary data; and I still want to keep old backups, so pruning is not enough.

However, after reading this thread, the solution of allowing the input of the "create" command to be one of its own archives (or an additional "copy" command) is the best compromise to me. It offers the feature (a lot more efficiently than doing it via a mount, which would reread everything), is perfectly safe, is probably the easiest to implement, and easily allows more elaborate behaviors through scripting.
It would be like create, but sourcing its input data from an existing archive. Process includes/excludes in the same way as create. Write to a temporary archive name and, at the end, rename it to the original name. Later, this could maybe get extended to re-process the contents of files, like recompressing them with a different compression method than originally used.
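That write-to-temp-then-rename flow could be sketched as follows. `borg rename` is a real subcommand; the create-from-archive step (shown here with a made-up `--from-archive` flag) is the hypothetical part, so all commands are composed and printed for review:

```shell
#!/bin/sh
# Sketch: recreate into a temporary archive name, then swap it over the
# original. The --from-archive flag is invented for illustration.
REPO=/backups/repo
SRC=april
TMP="$SRC.recreate"

S1="borg create --exclude '*.tmp' $REPO::$TMP --from-archive $REPO::$SRC"
S2="borg delete $REPO::$SRC"
S3="borg rename $REPO::$TMP $SRC"   # borg rename exists in Borg >= 1.0
printf '%s\n' "$S1" "$S2" "$S3"
```

Doing the rename last means an interrupted run leaves either the intact original, or the original plus a clearly-named temp archive, never a half-replaced one.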
I like that approach because of its extensibility.
Instead of a separate
@jdchristensen I see what you mean, but:
If deleting functionality were initially implemented as a "recreate" feature, would the newly recreated archive still have the old timestamp for listing and pruning purposes? Otherwise it's going to be a bit tricky keeping track of the timeline and doing pruning etc.; having to keep track of it all in the archive's text name might get a bit messy. Generally, when using a deletion feature, the user really wants to replace the old archive entirely with the fixed one, so I think ideally it should, on the surface at least (for pruning and listing), have the original backup's older timestamp rather than the newer timestamp of when the deletion/recreate happened.

@ThomasWaldmann, a while back you said...
Just to confirm my assumption... ignoring performance for a moment... would a RW filesystem simply be much harder to build? Because I don't think performance matters at all for something so rarely used. RW access to delete stuff would be absolutely amazing, if possible, and it means the user can really do anything they want without Borg needing specific support for that type of change. But I assume this is simply too hard? (Ignoring performance.)
@jdchristensen Something like a "borg clone" command or "borg pull" or "borg push" command might be useful? I'd expect it to work similarly to moving around git branches. (Getting a bit off-topic for this issue.)
Use with caution: permanent data loss by specifying incorrect patterns is easily possible. Make a dry run to make sure you got everything right.

borg rewrite has many uses:

- Can selectively remove files/dirs from old archives, e.g. to free space or purge picturarum biggus dickus from history
- Recompress data
- Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate with Borg 1.x archives. (Or to experiment with chunker-params for specific use cases.)

It is interrupt- and resumable.

Chunks are not freed on-the-fly. Rationale: this only makes sense when rechunkifying, but the logic on which new chunks free which input chunks is complicated and *very* delicate.

Current TODOs:

- Detect and skip (unless --force) already recompressed chunks -- delayed until current PRs on borg.key APIs are decided (borgbackup#810, borgbackup#789)
- Usage example

Future TODOs:

- Refactor tests using py.test fixtures -- would require porting ArchiverTestCase to py.test: many changes, this changeset is already borderline too large.
- Possibly add a --target option to not replace the source archive -- with the target possibly in another repo (better than "cp" due to full integrity checking, and deduplication at the target)

Fixes borgbackup#787, borgbackup#686, borgbackup#630, borgbackup#70 (and probably some I overlooked). Also see borgbackup#757 and borgbackup#770.
Use with caution: permanent data loss by specifying incorrect patterns is easily possible. Make a dry run to make sure you got everything right.

borg recreate has many uses:

- Can selectively remove files/dirs from old archives, e.g. to free space or purge picturarum biggus dickus from history
- Recompress data
- Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate with Borg 1.x archives. (Or to experiment with chunker-params for specific use cases.)

It is interrupt- and resumable.

Chunks are not freed on-the-fly. Rationale: this only makes sense when rechunkifying, but the logic on which new chunks free which input chunks is complicated and *very* delicate.

Future TODOs:

- Refactor tests using py.test fixtures -- would require porting ArchiverTestCase to py.test: many changes, this changeset is already borderline too large.
- Possibly add a --target option to not replace the source archive -- with the target possibly in another repo (better than "cp" due to full integrity checking, and deduplication at the target)
- Detect and skip (unless --always-recompress) already recompressed chunks

Fixes borgbackup#787, borgbackup#686, borgbackup#630, borgbackup#70 (and probably some I overlooked). Also see borgbackup#757 and borgbackup#770.
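A usage example for the recreate command described in this commit message (the message itself warns to dry-run first). The flag names below match Borg 1.1's `borg recreate`, but verify with `borg recreate --help` on your version; the repository path and pattern are examples, and the commands are printed rather than run:

```shell
#!/bin/sh
# Preview the effect first, then run for real, per the commit message's
# "make a dry run" warning.
DRY="borg recreate --dry-run --exclude '*.tmp' /backups/repo::april"
RUN="borg recreate --exclude '*.tmp' /backups/repo::april"
echo "$DRY"
echo "$RUN"
```

The dry run prints what would be removed without touching the archive, which is exactly the "/L"-style safety switch requested earlier in this thread.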
What are the thoughts on allowing the delete of specific files or entire directories WITHIN a repository or archive?
Such as... delete all the *.tmp files in aprilBackup:

borg delete myRepository::aprilBackup -p "*.tmp"

(-p = pattern, or a delete-pattern file, or some such; I don't know what the exact notation would be)

Remove the tmp files in the entire repository:

borg delete myRepository -p "*.tmp"

etc., etc., etc.
Yes, I'm aware that you should ideally exclude files from the repository in the first place, but that is definitely not always possible, or known in advance. There are many instances where you might later realize that significant files that have been backed up are not wanted and (most importantly) are HIGHLY non-dedupable. You want to pull those out with a pattern and shrink your repository without removing entire archives.
(I've run across this in practice by backing up multiple FirefoxPortable instances. Obviously you exclude the CACHE, but what wasn't immediately known is that there are numerous other files that are essentially cache or temp in nature while not so named. These are also hard to dedup and large in size. Now that I know what they are, I'd love to wipe them out of the repository, but can't easily.)
This then brings me to a second feature, which would be diagnostic in nature: actually allowing the finding, programmatically, of these 'dedupe hotspots' across multiple backups. I've never seen such a thing in practice and will post it as an "issue" immediately following this one for separate discussion. (But finding such hotspots is useless if a user can't "delete" individual data in a repository or backup.)