[Tracking Issue] Deduplicate blob files #6265
Comments
Maybe instead of
One of the reasons why #5495 exists is that it preserves original file names so that they are displayed as expected in external programs; this makes it possible to avoid file copying. However, this requires that the Delta Chat blobdir be traversable by that program, which isn't true for all supported platforms.
I closed #5495 for now; we can cherry-pick or re-open as needed, but it does not make much sense to get that in beforehand and without considering this issue first. I also have the gut feeling that it is better to leave things in a flat structure. #5495 would also only prevent copying if, at the same time, things are set read-only, which is not as easy as it sounds, IIRC. Also, the copying is not that much of an issue, as it affects exporting files only - not showing images or playing audio/video inside Delta Chat. Exporting is not done that often, happens only on direct user action, and takes just a moment anyway.
When receiving messages, blobs will be deduplicated with the new function `create_and_deduplicate_from_bytes()`. For sending files, this adds a new function `set_file_and_deduplicate()` instead of deduplicating by default. This is for #6265; read the issue description there for more details.

TODO:

- [x] Set files as read-only
- [x] Don't do a write when the file is already identical
- [x] The first 32 chars or so of the 64-character hash are enough. I calculated that if 10 billion people (i.e. all of humanity) used DC, and each of them had 200k distinct blob files (I have 4k in my day-to-day account), and we used 20 chars, then the expected number of name collisions would be ~0.0002 (and the probability that there is at least one name collision is lower than that)[^1]; a rough version of this estimate is sketched below. I added 12 more characters to be on the super safe side, but this wouldn't be necessary and I could also make it 20 instead of 32.
  - Not 100% sure whether the truncation is necessary at all - it would mainly matter if we might hit a length limit on some file systems (the blobdir is usually something like `accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs` (53 chars), plus 64 chars for the filename would be 117).
- [x] "touch" the files to prevent them from being deleted
- [x] TODOs in the code

For later PRs:

- Replace `BlobObject::create(…)` with `BlobObject::create_and_deduplicate(…)` in order to deduplicate every time core creates a file
- Modify JsonRPC to deduplicate blob files
- Possibly rename `BlobObject.name` to `BlobObject.file` in order to prevent confusion (because `name` usually means "user-visible name", not "name of the file on disk").

[^1]: Calculated with both https://printfn.github.io/fend/ and https://www.geogebra.org/calculator, both of which came to the same result ([1](https://github.com/user-attachments/assets/bbb62550-3781-48b5-88b1-ba0e29c28c0d), [2](https://github.com/user-attachments/assets/82171212-b797-4117-a39f-0e132eac7252))

---------

Co-authored-by: l <[email protected]>
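For reference, a rough sketch of how the collision estimate in the checklist above works out, assuming the 20 characters are hex digits (80 bits of the digest) and that name collisions only matter within a single account's blobdir - both assumptions are mine, not stated in the PR:

$$
10^{10}\ \text{accounts} \times \frac{\binom{2\cdot10^{5}}{2}}{2^{80}} \approx 10^{10} \times \frac{2\cdot10^{10}}{1.2\cdot10^{24}} \approx 1.7\cdot10^{-4}
$$

This matches the ~0.0002 above; the probability of at least one collision is bounded by this expected value.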
I think this can be closed; it's implemented in core, and there are open issues for the specific things that still need to be done.
This makes it so that files will be deduplicated when using the JsonRPC API. @nicodh and @WofWca, you know the Desktop code and how it uses the API, so you can probably tell me whether this is a good way of changing the JsonRPC code - feel free to push changes directly to this PR! This PR changes the existing functions instead of creating new ones; we can alternatively create new ones if that allows for a smoother transition.

This brings a few changes:

- If you pass a file that is already in the blobdir, it will be renamed to `<hash>.<extension>` immediately (previously, the filename on disk stayed the same).
- If you pass a file that's not in the blobdir yet, it will be copied to the blobdir immediately (previously, it was copied to the blobdir later, when sending).
- If you create a file and then pass it to `create_message()`, it's better to create it directly in the blobdir, since then it doesn't need to be copied.
- You must not write to files after they have been passed to core, because otherwise the hash will be wrong. So, if Desktop recodes videos, for example, the video file mustn't simply be overwritten. What you can do instead is write the recoded video to a file with a random name in the blobdir and then create a new message with the new attachment (see the sketch below). If needed, we can also create a JsonRPC method for `set_file_and_deduplicate()` that replaces the file on an existing message.

In order to test whether everything still works, the Desktop issue has a list of things to test: deltachat/deltachat-desktop#4498

Core issue: #6265

---------

Co-authored-by: l <[email protected]>
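A minimal sketch of that last point, assuming a hypothetical helper name and `.mp4` output; the real Desktop code would do the equivalent via its own file handling and the JSON-RPC API:

```rust
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Sketch only: write recoded data under a fresh name instead of overwriting
/// the already-attached file, since core derives the deduplicated
/// `<hash>.<extension>` name from the file contents.
fn write_recoded_copy(blobdir: &Path, recoded: &[u8]) -> std::io::Result<PathBuf> {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before Unix epoch")
        .as_nanos();
    let path = blobdir.join(format!("recoded-{nanos}.mp4")); // stand-in for a random name
    std::fs::write(&path, recoded)?;
    Ok(path) // attach this new path to a new message instead of reusing the old one
}
```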
We would like to eventually deduplicate blob files.
This supersedes #5495 and #4309. We may be able to revert #5778 afterwards.
Motivation
Especially with Webxdc, there are a lot of duplicate files in the blobs directory, because when the same file is sent to you twice, it is saved twice.
Also, it would be nice to use random filenames: it may happen that the SQL database references a file that doesn't exist anymore, and if the user then sends or receives a file with this filename, the new file will accidentally be shown in place of the removed one.
Prerequisites
- `dc_msg_get_filename()` (C-FFI) or `MessageObject.file_name` (JsonRPC) needs to be used to get the user-visible file name.
- `Param::Filename` is set to the actual original filename by `set_file()`, and `set_file()` doesn't have an `original_name` parameter.
- Add a new function `set_file_and_deduplicate(&mut self, path: &str, original_name: &str, mime: Option<&str>)` that is similar to `set_file()` but lets you specify the original file name (as opposed to what `set_file()` is doing). It should be made to work only on files that are already in the blobs directory, in order to avoid accidentally moving a file that was still needed. Also, it should be allowed to immediately move the file (as opposed to `set_file()`, which will only rename the file when sending); see the usage sketch below.
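To illustrate the proposed signature, here is a minimal usage sketch. It is hypothetical: `msg` stands for a core `Message`, the path is only an example, and the final API may differ.

```rust
// Illustrative only - how the proposed method might be called.
// The file is assumed to already live in the blobs directory.
msg.set_file_and_deduplicate(
    "accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs/photo.jpg", // path on disk
    "holiday photo.jpg", // original, user-visible file name
    Some("image/jpeg"),  // MIME type, if known
);
```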
Current plan

TL;DR: Save all files as `<hash>.<extension>`.

When inserting a file into the blobdir:

- Compute the hash with Blake3; we have the `blake3` and `iroh-blake3` dependencies anyway and the iroh devs really like it. It is supposed to be much faster than other cryptographic hashes: https://peergos.org/posts/blake3
- Check whether `<hash>.<extension>` already exists; if yes: use the existing file (and to be safe, check that the content is still correct and overwrite it otherwise). Only if it doesn't exist yet, create it (see the sketch below).

Existing files will be kept as they are. Also, the existing `set_file()` function still won't deduplicate; only the new `set_file_and_deduplicate()` and receiving messages will.
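A minimal sketch of this insert step, assuming the `blake3` crate and plain `std::fs` calls; this is illustrative, not the actual core implementation:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Sketch only: store `bytes` in the blobdir as `<hash>.<extension>`,
/// reusing an existing file if its content already matches.
fn deduplicated_path(blobdir: &Path, bytes: &[u8], extension: &str) -> std::io::Result<PathBuf> {
    // Blake3 hash of the content; the hex digest is truncated as discussed above.
    let hex = blake3::hash(bytes).to_hex();
    let path = blobdir.join(format!("{}.{}", &hex.as_str()[..32], extension));

    // Reuse the file if it exists and its content is still correct; otherwise (re)write it.
    // (A pre-existing read-only file would need its permissions relaxed before
    // overwriting - omitted here.)
    let already_correct = matches!(fs::read(&path), Ok(existing) if existing == bytes);
    if !already_correct {
        fs::write(&path, bytes)?;
        // Mark read-only so the deduplicated blob isn't modified accidentally.
        let mut perms = fs::metadata(&path)?.permissions();
        perms.set_readonly(true);
        fs::set_permissions(&path, perms)?;
    }
    Ok(path)
}
```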
Alternatives

`guess_msgtype_from_suffix()` uses the actual filename on the disk to guess the MIME type; this means that we need to be careful if we deduplicate files that have different extensions.
Open questions

Should `set_file_and_deduplicate()` rename the file immediately before returning, asynchronously in the background, or when sending out the message?