[Tracking Issue] Deduplicate blob files #6265
Comments
Maybe instead of
One of the reasons why #5495 exists is that it preserves original file names so that they are displayed as expected in external programs; this makes it possible to avoid file copying. However, this requires that the Delta Chat blobdir be traversable by that program, which isn't true for all supported platforms.
I closed #5495 for now; we can cherry-pick or re-open as needed, but it does not make much sense to get that in beforehand and without considering this issue first. I also have the gut feeling that it is better to leave things in a flat structure. #5495 would also only prevent copying if, at the same time, things are set read-only, which is not as easy as it sounds, IIRC. Also, the copying is not that much of an issue, as it affects exporting files only - not showing images or playing audio/video inside Delta Chat. Exporting is not done that often, happens only on direct user action, and takes just a moment anyway.
When receiving messages, blobs will be deduplicated with the new function `create_and_deduplicate_from_bytes()`. For sending files, this adds a new function `set_file_and_deduplicate()` instead of deduplicating by default. This is for #6265; read the issue description there for more details.

TODO:

- [x] Set files as read-only
- [x] Don't do a write when the file is already identical
- [x] The first 32 chars or so of the 64-character hash are enough. I calculated that if 10 billion people (i.e. all of humanity) used DC, and each of them had 200k distinct blob files (I have 4k in my day-to-day account), and we used 20 chars, then the expected number of name collisions would be ~0.0002 (and the probability that there is at least one name collision is lower than that)[^1]; a rough version of this estimate is sketched below. I added 12 more characters to be on the super safe side, but this wouldn't be necessary and I could also make it 20 instead of 32.
  - Not 100% sure whether the truncation is necessary at all - it would mainly matter if we might hit a length limit on some file systems (the blobdir is usually something like `accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs` (53 chars), plus 64 chars for the filename would be 117).
- [x] "touch" the files to prevent them from being deleted
- [x] TODOs in the code

For later PRs:

- Replace `BlobObject::create(…)` with `BlobObject::create_and_deduplicate(…)` in order to deduplicate every time core creates a file
- Modify JsonRPC to deduplicate blob files
- Possibly rename `BlobObject.name` to `BlobObject.file` in order to prevent confusion (because `name` usually means "user-visible name", not "name of the file on disk").

[^1]: Calculated with both https://printfn.github.io/fend/ and https://www.geogebra.org/calculator, both of which came to the same result ([1](https://github.com/user-attachments/assets/bbb62550-3781-48b5-88b1-ba0e29c28c0d), [2](https://github.com/user-attachments/assets/82171212-b797-4117-a39f-0e132eac7252))

---------

Co-authored-by: l <[email protected]>
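For reference, a rough sketch of how the collision estimate in the checklist above works out, assuming the 20 characters are hex digits (80 bits of the digest) and that name collisions only matter within a single account's blobdir - both assumptions are mine, not stated in the PR:

$$
10^{10}\ \text{accounts} \times \frac{\binom{2\cdot10^{5}}{2}}{2^{80}} \approx 10^{10} \times \frac{2\cdot10^{10}}{1.2\cdot10^{24}} \approx 1.7\cdot10^{-4}
$$

This matches the ~0.0002 above; the probability of at least one collision is bounded by this expected value.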
I think this can be closed; it's implemented in core, and there are open issues for the specific things that still need to be done.
This makes it so that files will be deduplicated when using the JsonRPC API. @nicodh and @WofWca, you know the Desktop code and how it uses the API, so you can probably tell me whether this is a good way of changing the JsonRPC code - feel free to push changes directly to this PR! This PR changes the existing functions instead of creating new ones; we can alternatively create new ones if that allows for a smoother transition.

This brings a few changes:

- If you pass a file that is already in the blobdir, it will be renamed to `<hash>.<extension>` immediately (previously, the filename on disk stayed the same).
- If you pass a file that's not in the blobdir yet, it will be copied to the blobdir immediately (previously, it was copied to the blobdir later, when sending).
- If you create a file and then pass it to `create_message()`, it's better to create it directly in the blobdir, since then it doesn't need to be copied.
- You must not write to files after they have been passed to core, because otherwise the hash will be wrong. So, if Desktop recodes videos, for example, the video file mustn't simply be overwritten. What you can do instead is write the recoded video to a file with a random name in the blobdir and then create a new message with the new attachment (see the sketch below). If needed, we can also create a JsonRPC method for `set_file_and_deduplicate()` that replaces the file on an existing message.

In order to test whether everything still works, the Desktop issue has a list of things to test: deltachat/deltachat-desktop#4498

Core issue: #6265

---------

Co-authored-by: l <[email protected]>
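A minimal sketch of that last point, assuming a hypothetical helper name and `.mp4` output; the real Desktop code would do the equivalent via its own file handling and the JSON-RPC API:

```rust
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Sketch only: write recoded data under a fresh name instead of overwriting
/// the already-attached file, since core derives the deduplicated
/// `<hash>.<extension>` name from the file contents.
fn write_recoded_copy(blobdir: &Path, recoded: &[u8]) -> std::io::Result<PathBuf> {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before Unix epoch")
        .as_nanos();
    let path = blobdir.join(format!("recoded-{nanos}.mp4")); // stand-in for a random name
    std::fs::write(&path, recoded)?;
    Ok(path) // attach this new path to a new message instead of reusing the old one
}
```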
We would like to eventually deduplicate blob files.
This supersedes #5495 and #4309. We may be able to revert #5778 afterwards.
Motivation
Especially with Webxdc, there are a lot of duplicate files in the blobs directory, because when the same file is sent to you twice, it is saved twice.
Also, it would be nice to use random filenames: it may happen that the SQL database references a file that doesn't exist anymore, and if the user then sends or receives a file with this filename, the new file will accidentally be shown in place of the removed one.
Prerequisites
- `dc_msg_get_filename()` (C-FFI) or `MessageObject.file_name` (JsonRPC) needs to be used to get the user-visible file name.
- `Param::Filename` is set to the actual original filename by `set_file()`, and `set_file()` doesn't have an `original_name` parameter.
- Add a new function `set_file_and_deduplicate(&mut self, path: &str, original_name: &str, mime: Option<&str>)` that is similar to `set_file()` but lets you specify the original file name (as opposed to what `set_file()` is doing). It should be made to work only on files that are already in the blobs directory, in order to avoid accidentally moving a file that was still needed. Also, it should be allowed to immediately move the file (as opposed to `set_file()`, which will only rename the file when sending); see the usage sketch below.
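To illustrate the proposed signature, here is a minimal usage sketch. It is hypothetical: `msg` stands for a core `Message`, the path is only an example, and the final API may differ.

```rust
// Illustrative only - how the proposed method might be called.
// The file is assumed to already live in the blobs directory.
msg.set_file_and_deduplicate(
    "accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs/photo.jpg", // path on disk
    "holiday photo.jpg", // original, user-visible file name
    Some("image/jpeg"),  // MIME type, if known
);
```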
Current plan

TL;DR: Save all files as `<hash>.<extension>`.

When inserting a file into the blobdir:

- Compute the hash with Blake3; we have the `blake3` and `iroh-blake3` dependencies anyway and the iroh devs really like it. It is supposed to be much faster than other cryptographic hashes: https://peergos.org/posts/blake3
- Check whether `<hash>.<extension>` already exists; if yes: use the existing file (and to be safe, check that the content is still correct and overwrite it otherwise). Only if it doesn't exist yet, create it (see the sketch below).

Existing files will be kept as they are. Also, the existing `set_file()` function still won't deduplicate; only the new `set_file_and_deduplicate()` and receiving messages will.
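A minimal sketch of this insert step, assuming the `blake3` crate and plain `std::fs` calls; this is illustrative, not the actual core implementation:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Sketch only: store `bytes` in the blobdir as `<hash>.<extension>`,
/// reusing an existing file if its content already matches.
fn deduplicated_path(blobdir: &Path, bytes: &[u8], extension: &str) -> std::io::Result<PathBuf> {
    // Blake3 hash of the content; the hex digest is truncated as discussed above.
    let hex = blake3::hash(bytes).to_hex();
    let path = blobdir.join(format!("{}.{}", &hex.as_str()[..32], extension));

    // Reuse the file if it exists and its content is still correct; otherwise (re)write it.
    // (A pre-existing read-only file would need its permissions relaxed before
    // overwriting - omitted here.)
    let already_correct = matches!(fs::read(&path), Ok(existing) if existing == bytes);
    if !already_correct {
        fs::write(&path, bytes)?;
        // Mark read-only so the deduplicated blob isn't modified accidentally.
        let mut perms = fs::metadata(&path)?.permissions();
        perms.set_readonly(true);
        fs::set_permissions(&path, perms)?;
    }
    Ok(path)
}
```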
Alternatives

`guess_msgtype_from_suffix()` uses the actual filename on the disk to guess the MIME type; this means that we need to be careful if we deduplicate files that have different extensions.
Open questions

Should `set_file_and_deduplicate()` rename the file immediately before returning, asynchronously in the background, or when sending out the message?