Thoughts on "delete" for files/directories? #70

jumper444 · 2015-06-27T17:57:03Z

What are the thoughts on allowing the delete of specific files or entire directories WITHIN a repository or archive?

Such as...delete all the .tmp files in aprilBackup:
"borg delete myRepository::aprilBackup -p ".tmp"
(-p = pattern or delete pattern file or such; don't know what exact notation would be)

Remove the tmp files in entire repository:
"borg delete myRepository -p "*.tmp"
etc...etc...etc....

Yes, I'm aware that you should ideally exclude putting files into the repository in the first place, but that is definitely not always possible or known. There are many instances where you might later realize significant files that have been backed up, are not wanted, and (most importantly) are HIGHLY non-dedupable. You want to pull those out with a pattern and shrink your repository without removing entire archives.

(I've run across this in practice by backing up multiple FirefoxPortable instances. Obviously you exclude the CACHE, but what wasn't known immediately is that there are numerous other files that are essentially cache or temp in nature while not so names. These are also hard to dedup and large in size. Now that I know what they are I'd love to wipe them out of the repository, but can't easily.)

This then brings me to a second feature which would be diagnostic in nature and actually allow the finding, programatically, of these 'dedupe hotspots' in multiple backups. I've never seen such a thing in practice and will post it as "issue" immediately following this one for separate discussion. (But the finding of such hotspots is useless if a user can't "delete" individual data in a repository or backup.)

ThomasWaldmann · 2015-06-27T18:54:07Z

Assuming you prune your backups, you can just add the unwanted files to the exclude list and wait until the archive which contain them get pruned. Takes a while (until the last reference to these chunks is gone), but works.

The feature you suggest is doable I guess, but quite some work (and also a bit dangerous to use, imagine someone enters a blank at a bad place and kills all his backup files), so the question is whether it should be implemented and with what priority.

jumper444 · 2015-06-29T14:53:33Z

The ability to delete anything on a computer has simple known risks. They are what they are, but programs, filesystems, or databases or whatever don't remove the command just because it could go wrong or be misused. The common solution for most utlities I've seen is a 'no action' or 'list changes only' command line switch. For robocopy on windows it is "/L" meaning list all changes you will do with this exec, but don't actually do anything. You run with "/L", review and make sure your wildcards, or regex, or whatever are hitting what you expect. Then remove the switch and make it permanant.

I presume when removing an archive Borg looks it up, get's a list of contained files, then issues a separate delete call per file. This new concept just creates the list separately from 'exclude file' type patterns which are already programmed instead of the archive record.

I see value for the feature and I see similar commands on most other utilities that operate on data/file batches. Waiting on a prune (which I currently don't do) isn't really the same thing. Even still I would only prune things after years (based on various data requirements).

The file/directory delete allows efficient cleanup of either 'nondedupable hotspots' found post backup or removal of whole segments of data which clearly are now not needed without disturbing the remainder of an archive which might be needed for years.

jumper444 · 2015-06-29T14:57:54Z

What about a 'copy on write' approach from ZFS. We would never do a delete directly - archives would remain intact (at least in the first step).

Borg would have the feature of allowing one of its OWN ARCHIVES as an input argument for the CREATE command. The excludefile and patterns are still allowed so you are just running an archive through the program, using excludefile patterns, and storing it back in the repository under another archive name! A loopback which theoretically would take almost no space or time to create. Then (upon inspection and whatever satisfaction is necessary) the user uses the existing delete command to delete the first archive. Thus you have a relatively safe (two step, atomic) way of removing specific files and directories from Borg.

Example: Take backup "april" and make a new one "aprilCleaned" by running it through ExcludeFile
"borg create -v --exclude-from EXCLUDEFILE ~/myBackup::aprilCleaned ~/myBackup::april
borg here is using ~/myBackup::april as input to create "aprilCleaned" with the exclude pattern.

Then, Delete the original backup when you are sure you don't want it. Leaving aprilCleaned instead. Essentially you have now removed a subset of files and directories:
"borg delete ~/myBackup::april"

What we are missing is a RENAME command for an archive at this time to finish things cleanly. Such as:
"borg rename ~/myBackup::aprilCleaned april"
Try to rename the cleaned archive back to "april". Not possible.

But the hack for now would be to do the above process again without
an exclude file, then delete. It's a mess, but possible:
"borg create -v ~/myBackup:april ~/myBackup::aprilCleaned"
"borg delete ~/myBackup::aprilCleaned"

So what we've come up with on this post is a way of deleting files from an archive without using a specific file/directory delete command (assuming that is desired) by, instead, allowing Borg to accept its own archives as input for a create command in a loopback.
Would this be of any value?

hi2u · 2015-09-28T04:03:16Z

In my opinion this is the number one feature missing from Attic & Borg.

Not having this feature means that I have to put a lot of planning and care into what gets backed up, as trying to remove unwanted backups later on basically isn't possible. I'd really like to have a few big backup archives, but having to account for removing certain data later on means I have to split all my backups into a bunch of separate archives, just in case.

My paranoia on this issue often leads me to delaying setting up a backup script to begin with before I've had time to inspect all the source files for things to exclude. Meaning in some instances I'm not running backups on some systems at all just because of this.

I don't prune my backups (based on time), I prefer to keep everything... except for stuff that didn't need keeping in the first place. But obviously this isn't really possible at the moment, so I just have to keep everything or basically start a new archive from scratch, thereby basically losing much of the advantage of deduplication to begin with.

CrashPlan makes this pretty easy, as you can just deselect a folder from your selected sources, and it will remove all data for those files.

Also aside from the "accidentally included stuff that should have been excluded to begin with", I'd also like to use Borg/Attic to backup things like all my photos BEFORE I'd had the chance to cull them. Same goes for pretty much any other data that was assumed needed to begin with, but later not. As it currently stands, there's quite a few use cases like this that mean for now I'm stuck with using CrashPlan or something like rsnapshot.

And yes, we definitely need a dry-run command option to see what will occur before running it for real.

Thanks!

ThomasWaldmann · 2015-09-28T10:07:07Z

@hi2u thanks for the feedback, some comments:

You can follow a slightly different, more safe strategy for your backups:

Just start from a "everything" (full system backup). Only exclude the most obvious stuff that you can immediately decide (if anything). That might be a bit bigger than needed, but at least you immediately have a backup without any "planning" delay. Use -v so it creates a list of all files it backs up, so you can look through them to refine your exclude list.

If it takes 2 weeks to optimize your excludes, you can still delete the backup archives of the first 2 weeks later, when you are sure you do not need them any more. The repo will not consume more space than as if you did not make these 2 weeks of backups at all.

Keeping every backup is not advisable, it might get rather slow in case a cache resync is needed. The first time it does a cache resync, the time will linearly grow with the amount of backup archives (later, it will be faster). Also the space needs in .cache/borg are linear with the backup archive count.

So think about the usual pruning approach of having a good coverage of the recent hours/days/weeks, but less coverage for the more distant past.

About CrashPlan: that is a rather dangerous operation, do they ask for confirmation about that deselection?

ThomasWaldmann · 2015-09-28T10:11:27Z

BTW, borg create --dry-run ... exists since a while.

hi2u · 2015-09-28T11:34:26Z

2 weeks wouldn't be a big deal, but I often want to delete stuff from over a year ago. But of course I want to keep the rest of the data from those old backups.

A good example of this is that I take a lot of photos on a DSLR. Often I don't even look through the photos until a year or two later. But when I do, I'll delete all the bad shots, which is usually more than half of them. The storage considerations are even more so when it comes to audio and video files.

Yeah CrashPlan gives you a clear warning after de-selecting a folder (and removing it from backup archives).

Sure it could be dangerous, but not as much as deleting an entire repo/archive, which is currently the only option as I understand it?

If this were to be implemented, would it make it any easier to make the a mounted archive read+write (including deleting files/folders)? I assume that would probably be even harder than a new Borg command? But it would be amazingly flexible and useful.

ThomasWaldmann · 2015-09-28T11:43:23Z

If one would implement a global delete that cuts out a file from all archives, that would be more dangerous than completely deleting a single archive (because the file is then gone from all the backups).

A rw filesystem isn't the right way and performance wouldn't be good.

edgewood · 2015-09-28T17:21:17Z

On Mon, Sep 28, 2015 at 04:43:23AM -0700, TW wrote:

If one would implement a global delete that cuts out a file from all archives, that would be more dangerous than completely deleting a single archive (because the file is then gone from all the backups).

A rw filesystem isn't the right way and performance wouldn't be good.

Maybe delete-files (or some other appropriate name, like maybe "filter")
just creates a new archive from the old, leaving out the named files.
Perhaps it optionally also deletes the old archive and renames the new
archive to the old name, keeping the old archive's timestamp. I'm not
super familiar with the internals (yet) outside of a few areas I've
worked in, but that seems safe, low-impact spacewise, and
straightforward.

Best case there'd be an option to loop over all archives in the
repository (with a "are you SURE?"), worst case users would need to
script that loop if they wanted it.

If that seems like a desirable implementation I'm interested in working
on it. (Note though that it would probably be a month+ project, as I
will have several competing demands on my time in the upcoming weeks.)

Ed Blackman

jumper444 · 2015-09-28T17:27:59Z

I had written a separate issue (as referenced) immediately after this one and wanted to link it in here because it depends on a file or directory delete funtionality.
#71

Obviously one of the main features of Borg is dedupe.
Essentially I am suggesting that it could be possible to programatically WITHOUT user interaction find dedupe hotspots in backup data by having a borg subroutine that looks for files or directories which have large numbers of non-duplicating changes between backups.
Such a routine would quickly show storage locations which are likely temp or cache data and (after manual examination outside the scope of Borg) could likely be excluded in future backups (and hopefully deleted in existing backups...which is the discussion in this issue.)

Please read issue 71 briefly if this sounds interesting.

ThomasWaldmann · 2015-09-28T19:39:23Z

OK, let's assume global-delete is technically doable as a command.

How would the commandline / user interaction look like?

jumper444 · 2015-10-01T22:08:09Z

Coming from the windows side of things I'm not as qualified to answer the question.
My experience is either dos simple (* and ?) or with full blown regex instead.
One is underkill, the other is over.

I would presume that in the linux/unix world of various utils of common use
there is an appropriate level of complexity and simplicity that does what is
needed and not more.

I would also presume that this would not be the first (or second) shell
utility or filter or pipe or whatever to come into existence eons ago
somewhere. Often we (humans) get things right
the third or fourth time something is done. For example (flame on)
Emacs is very much in the 'first' category, but it is a huge monstrosity
to some views and later editors tried different commands, approaches,
etc. (flame off; no insult intended).

My point being that while I'm unable to name the solution, I would
suggest choosing an approach ALREADY used by something
relatively well know that also works on files and directories in a
similar way.

What can be copied instead of re-inventing the wheel?
But let's not copy the most intricate or obscure.
Let's try to make it the linguistic equivalent of Python or Ruby.
Not C++ or PHP.

jumper444 · 2015-10-01T22:13:59Z

Is rsync's implementation a good one?
https://unix.stackexchange.com/questions/2161/rsync-filter-copying-one-pattern-only

Also, I find an interesting approach by robocopy on windows which helps to prevent mistakes:
There are exclude switches for BOTH files AND directories.
Often when trying to knock things out (or include them in) you are well aware that you ONLY want certain directories OR files. You rarely (in a sincle include/exclude) are shooting for both at the same time...when it happens it is usually an accident and causes errors.
Thus maybe when a filtering is created the switches are split into those acting on files and those acting on directories?

(I jumped on regex as possibly too much in my first post, but then basically suggested it in this second. Maybe that simply is the standard that should be used? I'm ok with it, but I again think the convention should be something common to most linux users and I'm not sure which of a few competing approaches that would be.)

hi2u · 2015-10-02T07:05:52Z

Command name

I guess the first question to answer is whether this should be:

a) Additional functionality added to the existing "delete" command.
b) A new separate command, maybe called something like "rm".

I think B is the better option, as it gives some clear separation from Attic, and is just clearer in general, as they're fairly different operations.

Wildcards/filters

Personally, I think to start with at least, don't even worry about the filter/wildcard stuff. Just the ability to delete specific folders/files (without wildcards) would meet my needs, as I can simply mount or list a repo/archive, then write a small script (or just write an xargs command) to generate the commands to delete everything I want to get rid of. This means there's less need for you (Borg developers) to spend any time worrying about a user accidentally deleting too much stuff. Without wildcard functionality, it's up to the user to script multiple deletions themselves.

You don't even need to worry about deletions occurring across multiple archives in a single command either. This can also be done with an external script.

Of course some filter/wildcard functionality down the track could be nice too. But don't let all the extra work of that get in the way of just getting a basic single deletion command going. Filtering can be treated as a separate development task in the future, but I don't even think it's very important at all considering this isn't a command that people would be using every day.

The most common and flexible way to filter files/folders is the "GNU find" command. This is what I'd probably be using on my mounted archive in my script to generate the "borg rm" commands. So if you did end up doing internal filtering in Borg down the track, perhaps "find" itself, or its syntax could be involved in some way (probably would require mounting).

Combining "GNU find" in some way follows the UNIX philosophy of "do one thing and do it well". There's no need to re-invent the "how to filter a list of folders/files" wheel. Find already does a great job, but obviously requires mounting.

Also as @jumper444 mentioned, 99% of the time users probably want to differentiate paths of folders vs files. Some integration with "GNU find" (even if by an external script) covers this, and much more filtering power.

I don't know Python, so I can't really help with Borg itself. But if somebody could develop a non-wildcard/filtering "rm" command at least, I (or anyone else) could contribute external script that does the mounting, FINDing, unmounting, and RMing with Borg. Such a script would be pretty easy to write. Maybe a bit slow to execute considering it's working on a mounted archive, but as I mentioned before, this isn't an everyday operation, so performance doesn't matter so much.

jumper444 · 2015-10-08T06:32:24Z

I think there is an 80/20 approach here which is safe and simple. No heavy engineering or complexity to solve something which may not even be needed (or needed to be done differently after experience). By this I mean no regex or wildcards or such. Just start small and adapt if needed.

I see two commands:

delete directory "x" (x being fully specified single dir without any wildcards or anything)
delete file "y" (same...a full specification of exactly one file)

I would presume there could be multiple of each delete type in a single shell command (if desired). Or reference to a text file containing multiple file/directory lines if there are many (Borg already has code to handle an enternal list file as I recall).

Finally there is the problem of specifying which archives any particular delete applies towards. I see three options, the first two of which are obvious. The third may be extra or presumptuous:
A) entire repository (all archives)
B) one archive in repository
C) a date range of archives

I mention option C because the nature of Borg as a backup over time and also due to things like "prune" which likewise logically operates on things related to time spans. For simplicity drop C and just have A or B to start.

As for syntax, there is already a "borg delete".
This command already serves duty to handle TWO different tasks:
->delete a repository or delete an archive.

Logically it is already constructed so that it appears to be a generic delete command. It is already allowing:
-> delete it ALL (repository; then...
-> delete one "unit" SMALLER than "all" (aka archive;...

so it might be argued that if you keep this view/approach then all you are doing with a directory is going one "unit" lower down INTO the archive and deleting the next smaller unit...aka..a directory
-> delete one "unit" smaller than archive (aka..directory)

Logically it would then follow the smaller unit to a directory is simply a file. So this thought process keeps the "borg delete" and modifies the syntax to go finer and finer (small and smaller inside the repository; 4 levels (repository, archive, dir, file) instead of just 2 now(repository, archive)).

HOWEVER, concern has been expressed that file/dir deletion is more dangerous and unique (and more complex perhaps). So perhaps it warrants a new command to firewall it apart. So maybe:
"borg deleteData ~/myBorg::mondayBackup -dir xxx -file yyy....????"
(I can also see use of a separate command...at least during beta...to make sure that the normal use of 'delete' by those not intending to delete files and directories doesn't happen. Or...maybe having two commands instead of one is bad practice?)

Those are my thoughts for now what might be a minimum implementation and syntax.
The thread will continue with more input and revision (maybe my own. These are early thoughts.)

ThomasWaldmann · 2015-10-09T11:00:17Z

TODO: define goals for a bounty, start simple.

cyril42e · 2015-11-11T18:17:12Z

Definitely the number one missing feature for me as well. Happened a lot to me to realize later that I've backed up by mistake huge data that I don't care about because I can easily retrieve it from somewhere else, or because it's temporary data, and I still want to keep old backups so pruning is not enough.

However after reading this thread, the solution of allowing the input of "create" command to be one of its own archive (or an additional "copy" command) is the best compromise to me. It offers the feature (a lot more efficiently than doing it with a mount which would reread everything), is perfectly safe, probably the easiest to implement, and easily allows more elaborate behaviors through scripting.

ThomasWaldmann · 2016-02-21T13:55:28Z

borg recreate <mostly same args / options as create> might be also an option.

it would be like create, but source its input data from an existing archive. process includes / excludes in same way as create. write to a temp archivename and at the end, rename it to original name.

later, this could maybe get extended to re-process the contents of files, like recompressing them with a different compression method than originally.

alp82 · 2016-02-21T13:58:00Z

I like that approach because of its extensibility.

jdchristensen · 2016-02-21T15:09:04Z

Instead of a separate recreate command, how about letting create use an archive as its source. This would allow the user to backup from an archive in a local repo to an archive in a remote repo, for example. I could imagine using this to copy a selection of old backups to a remote server that I decide to add into my backup routine. Or to backup from a local repo using an older repo format to a new local repo, generalizing the upgrade command. Etc.

ThomasWaldmann · 2016-02-21T15:59:35Z

@jdchristensen I see what you mean, but:

there might be syntax complications. repo::archive vs path/filename ambiguities.
going to another repo requires transfering file contents (and dealing with crypto + compression) in any case, so this is always expensive,

hi2u · 2016-02-22T04:01:07Z

If deleting functionality were initially implemented as a "recreate" feature, would the newly recreated archive still have the old timestamp for listing and pruning purposes? Otherwise it's going to be a bit tricky keeping track of the timeline and doing pruning etc. Having to keep track of it all in the archive's text name might get a bit messy

Generally when using a deletion feature, the user really wants to replace the old archive entirely with the fixed one, so I think ideally it should, on the surface at least (for pruning and listing), have the older original backup timestamp rather than the newer timestamp of when the deletion/recreate happened.

@ThomasWaldmann , a while back you said...

A rw filesystem isn't the right way and performance wouldn't be good.

Just to confirm my assumption... Ignoring performance for a moment... would a RW filesystem simply be much harder to build? Because I don't think performance matters at all for something so rarely used. RW access to delete stuff would be absolutely amazing, if possible, and it means the user can really do anything they want without Borg needing specific support for that type of change. But I assume this is simply too hard? (ignoring performance)

tgharold · 2016-03-15T01:26:04Z

@jdchristensen Something like a "borg clone" command or "borg pull" or "borg push" command might be useful? I'd expect it to work similarly to moving around git branches. (Getting a bit off-topic for this issue.)

Use with caution: permanent data loss by specifying incorrect patterns is easily possible. Make a dry run to make sure you got everything right. borg rewrite has many uses: - Can selectively remove files/dirs from old archives, e.g. to free space or purging picturarum biggus dickus from history - Recompress data - Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate with Borg 1.x archives. (Or to experiment with chunker-params for specific use cases It is interrupt- and resumable. Chunks are not freed on-the-fly. Rationale: Makes only sense when rechunkifying, but logic on which new chunks to free what input chunks is complicated and *very* delicate. Current TODOs: - Detect and skip (unless --force) already recompressed chunks -- delayed until current PRs on borg.key APIs are decided borgbackup#810 borgbackup#789 - Usage example Future TODOs: - Refactor tests using py.test fixtures -- would require porting ArchiverTestCase to py.test: many changes, this changeset is already borderline too large. - Possibly add a --target option to not replace the source archive -- with the target possibly in another Repo (better than "cp" due to full integrity checking, and deduplication at the target) Fixes borgbackup#787 borgbackup#686 borgbackup#630 borgbackup#70 (and probably some I overlooked) Also see borgbackup#757 and borgbackup#770

Use with caution: permanent data loss by specifying incorrect patterns is easily possible. Make a dry run to make sure you got everything right. borg recreate has many uses: - Can selectively remove files/dirs from old archives, e.g. to free space or purging picturarum biggus dickus from history - Recompress data - Rechunkify data, to have upgraded Attic / Borg 0.xx archives deduplicate with Borg 1.x archives. (Or to experiment with chunker-params for specific use cases It is interrupt- and resumable. Chunks are not freed on-the-fly. Rationale: Makes only sense when rechunkifying, but logic on which new chunks to free what input chunks is complicated and *very* delicate. Future TODOs: - Refactor tests using py.test fixtures -- would require porting ArchiverTestCase to py.test: many changes, this changeset is already borderline too large. - Possibly add a --target option to not replace the source archive -- with the target possibly in another Repo (better than "cp" due to full integrity checking, and deduplication at the target) - Detect and skip (unless --always-recompress) already recompressed chunks Fixes borgbackup#787 borgbackup#686 borgbackup#630 borgbackup#70 (and probably some I overlooked) Also see borgbackup#757 and borgbackup#770

enkore · 2016-04-10T15:24:11Z

"borg recreate" (#812) is now merged into master and allows this (among other things).

(Also see #686)

ThomasWaldmann added the enhancement label Oct 8, 2015

DrupaListo-com mentioned this issue Nov 29, 2015

3 ideas: borg razor ( = "borg diff" between archives + "borg delete-dir-from-repo" ) #449

Closed

ThomasWaldmann mentioned this issue Feb 21, 2016

Can I selectively delete within old archives? #686

Closed

enkore mentioned this issue Mar 29, 2016

Done, feature recreate #812

Merged

enkore closed this as completed Apr 10, 2016

pkol mentioned this issue Mar 30, 2021

Input/output error #5749

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thoughts on "delete" for files/directories? #70

Thoughts on "delete" for files/directories? #70

jumper444 commented Jun 27, 2015

ThomasWaldmann commented Jun 27, 2015

jumper444 commented Jun 29, 2015

jumper444 commented Jun 29, 2015

hi2u commented Sep 28, 2015

ThomasWaldmann commented Sep 28, 2015

ThomasWaldmann commented Sep 28, 2015

hi2u commented Sep 28, 2015

ThomasWaldmann commented Sep 28, 2015

edgewood commented Sep 28, 2015

jumper444 commented Sep 28, 2015

ThomasWaldmann commented Sep 28, 2015

jumper444 commented Oct 1, 2015

jumper444 commented Oct 1, 2015

hi2u commented Oct 2, 2015

jumper444 commented Oct 8, 2015

ThomasWaldmann commented Oct 9, 2015

cyril42e commented Nov 11, 2015

ThomasWaldmann commented Feb 21, 2016

alp82 commented Feb 21, 2016

jdchristensen commented Feb 21, 2016

ThomasWaldmann commented Feb 21, 2016

hi2u commented Feb 22, 2016

tgharold commented Mar 15, 2016

enkore commented Apr 10, 2016

Thoughts on "delete" for files/directories? #70

Thoughts on "delete" for files/directories? #70

Comments

jumper444 commented Jun 27, 2015

ThomasWaldmann commented Jun 27, 2015

jumper444 commented Jun 29, 2015

jumper444 commented Jun 29, 2015

hi2u commented Sep 28, 2015

ThomasWaldmann commented Sep 28, 2015

ThomasWaldmann commented Sep 28, 2015

hi2u commented Sep 28, 2015

ThomasWaldmann commented Sep 28, 2015

edgewood commented Sep 28, 2015

jumper444 commented Sep 28, 2015

ThomasWaldmann commented Sep 28, 2015

jumper444 commented Oct 1, 2015

jumper444 commented Oct 1, 2015

hi2u commented Oct 2, 2015

Command name

Wildcards/filters

jumper444 commented Oct 8, 2015

ThomasWaldmann commented Oct 9, 2015

cyril42e commented Nov 11, 2015

ThomasWaldmann commented Feb 21, 2016

alp82 commented Feb 21, 2016

jdchristensen commented Feb 21, 2016

ThomasWaldmann commented Feb 21, 2016

hi2u commented Feb 22, 2016

tgharold commented Mar 15, 2016

enkore commented Apr 10, 2016