Corrupted indexes #149
Good you saved the indexes. Would be very interested in taking a look. |
Vote & author indexes are the same. Seq & timestamp are also ok after loading and checking. The post index does have 2 differences, at indexes 421 and 422; the rest is the same. I'll do some more debugging and post my tool for looking at the files. |
Some more debugging info, mostly for myself:
- value_content_root_.index (isRoot)
- value_content_root__map.32prefixmap (hasRoot)
- value_content_type_post.index
- value_content_contact__map.32prefixmap
Fork, vote link (both prefix maps) ok. Pub (bitset) also ok. |
I wonder if this could happen due to the log losing data rather than the indexes. The kind of bug I'm thinking of would follow these steps:
|
@Barbarrosa Thanks for the suggestion, I wouldn't have thought of that, I didn't even know about it. I think the difficulty in fixing this issue is reproducibility. We have hard evidence for it, but we don't have a way of knowing for sure whether a candidate fix was effective. In my opinion we should look for that first, maybe by simulating abrupt closing of the ssb instance and the associated filesystem operations. |
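One hedged sketch of what such a simulation could look like: run the writer in a child process and SIGKILL it at a random moment, then inspect what it left on disk. `writer.js` here is a hypothetical script (not from this thread) that would run the ssb instance and publish messages in a loop.

```js
const { fork } = require('child_process');

// writer.js is a hypothetical script that opens the ssb instance and
// publishes messages in a loop, letting the indexes persist as it goes.
const child = fork('./writer.js');

// Kill it hard at a random moment, mimicking Android killing the app
// (or a crash) in the middle of a write.
setTimeout(() => child.kill('SIGKILL'), Math.random() * 1000);

child.on('exit', () => {
  // Reopen the log and indexes here and verify that every index entry
  // points at data actually present in the log.
  console.log('writer killed; now verify index files against the log');
});
```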
More discoveries as I'm debugging the evidence too (I can use log.bipf to confirm what the true correct value is). This shows, towards the end of the log, good vs corrupted:
Offset, bit from good, bit from corrupted
However, it seems that the …
Notice that … The other thing I noticed was near the middle of the log, starting at seq …:
Offset, bit from good, bit from corrupted
In log.bipf, I noticed that those 4 messages are all encrypted. |
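For reference, a minimal sketch of the kind of bit-level diff shown in the tables above, assuming the two index copies are raw bitset payloads (jitdb's real index files carry a header, which this ignores) and illustrative folder names:

```js
const fs = require('fs');

// Compare two bitset files bit by bit and print every position where
// they disagree, in the same "offset, good bit, corrupted bit" shape
// as the tables above.
function diffBitsets(goodPath, corruptedPath) {
  const good = fs.readFileSync(goodPath);
  const bad = fs.readFileSync(corruptedPath);
  const len = Math.min(good.length, bad.length);
  for (let i = 0; i < len; i++) {
    if (good[i] === bad[i]) continue;
    for (let bit = 0; bit < 8; bit++) {
      const g = (good[i] >> bit) & 1;
      const b = (bad[i] >> bit) & 1;
      if (g !== b) console.log(i * 8 + bit, g, b);
    }
  }
  if (good.length !== bad.length) {
    console.log('length differs:', good.length, 'vs', bad.length);
  }
}

diffBitsets(
  'indexes-good/value_content_type_post.index',
  'indexes-corrupted/value_content_type_post.index'
);
```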
Oh that is quite interesting. Let me check that. |
There is a diff in canDecrypt:
And even more diffs in encrypted:
Was trying a few things, no dice sadly. |
@Barbarrosa async log is supposed to be written so that nothing is streamed to indexes that has not been written to disk. I'd be really interested in a test case that shows that is not the case :) |
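To make concrete what such a test case would have to expose, here is a toy sketch of the suspected failure mode. This is not async-append-only-log's actual code, just the shape of the bug, with made-up file names: an index persisted for a record the log has only buffered in memory.

```js
const fs = require('fs');

let logBytesOnDisk = 0; // bytes of the log actually flushed
let pending = [];       // appended records still buffered in memory

function append(record) {
  const offset =
    logBytesOnDisk + pending.reduce((sum, r) => sum + r.length, 0);
  pending.push(record);
  // The bug: the index hits disk before the log does.
  fs.writeFileSync('offset.index', String(offset));
  return offset;
}

function flush() {
  fs.appendFileSync('log.bin', Buffer.concat(pending));
  logBytesOnDisk += pending.reduce((sum, r) => sum + r.length, 0);
  pending = [];
}

append(Buffer.from('hello'));
// A SIGKILL at this point leaves offset.index referring to data that
// log.bin never received.
flush();
```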
I'm reminded of this old bug report in atomic-file. I wonder if we should start adding content hashing to detect non-atomic moves. It just seems like someone must have a good solution for this problem on Android. |
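A rough sketch of the hashing idea, with a made-up on-disk layout (a 32-byte SHA-256 prefix), not anything atomic-file actually does:

```js
const fs = require('fs');
const crypto = require('crypto');

// Write the payload (a Buffer) prefixed by its SHA-256, so a torn or
// partially moved file can be detected on the next read.
function writeWithHash(file, data) {
  const hash = crypto.createHash('sha256').update(data).digest();
  const tmp = file + '.tmp';
  fs.writeFileSync(tmp, Buffer.concat([hash, data]));
  fs.renameSync(tmp, file);
}

// Read it back and verify; a mismatch means the file is corrupted and
// the index should be rebuilt rather than trusted.
function readWithHash(file) {
  const buf = fs.readFileSync(file);
  const stored = buf.subarray(0, 32);
  const data = buf.subarray(32);
  const actual = crypto.createHash('sha256').update(data).digest();
  if (!stored.equals(actual)) throw new Error('corrupted file: ' + file);
  return data;
}
```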
I never thought of the hash idea. Sounds interesting! |
@staltz @arj03 The data loss via differently ordered writes seems a little difficult to reliably reproduce, but I think this Node.js issue roughly gets at what I'm thinking. Side notes: |
That paper is very interesting, thanks for linking it @Barbarrosa. Especially the https://github.com/WiscADSL/cuttlefs tool, which can be used to test a solution for fs corruption. |
I wonder if and how flumedb accounted for FS failures, because I haven't bumped into these kinds of corruption in all the years of using flumedb in ssb. |
So it seems quite similar to what we do in db2 (level + atomically). |
The lingering tmp files are a bad sign. I can see that atomically should clean up the files on exit, but that probably doesn't work if the process is killed. It needs a garbage collection step, but the files hint that the process was shut down during a write at least 4-5 times. |
Yeah, I'm curious if we could bring back atomic-file and improve it with the hash idea and other things. It's also a library that we control and understand, even if it's a bit abandoned currently. |
Would be fine with that. I would like to keep the browser part, because if I remember correctly atomic-file does some things in the browser that are slower than https://github.com/ssb-ngi-pointer/atomically-universal/blob/master/index.js. About hashing, here is a thread on speed. |
Was reading up on things in this thread. Android uses ext4 as the filesystem, so rename should be atomic, BUT without the fsync() the metadata may be written before the data, and if there was a crash in between, the newly renamed file would end up with partial/empty data. From this nice thread on is-rename-without-fsync-safe. I think the combination of adding hashes & adding fsync after write to atomic-file should get us a long way here. |
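A sketch of that write + fsync + rename combination in isolation, assuming Linux/ext4 semantics (the directory fsync at the end is not portable to Windows):

```js
const fs = require('fs');
const path = require('path');

function atomicWriteSync(file, data) {
  const tmp = file + '.tmp';
  const fd = fs.openSync(tmp, 'w');
  try {
    fs.writeSync(fd, data);
    fs.fsyncSync(fd); // make sure the data hits disk before the rename
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmp, file); // atomic replace on ext4
  const dirFd = fs.openSync(path.dirname(file), 'r');
  try {
    fs.fsyncSync(dirFd); // persist the rename's directory entry too
  } finally {
    fs.closeSync(dirFd);
  }
}
```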
Wow, good find! |
@arj03 I've been trying to reproduce on Termux using https://github.com/staltz/jitdb-files-fs but no luck so far |
I'm very thankful you worked on this one @arj03. I'll update and let it run for some weeks to see whether we bump into more corruptions. If I find something, I can open another issue, but let's hope there isn't anything. |
This is a placeholder for more details that I'll post later, because I'm on mobile.
We found a bug in Manyverse where opening a thread for msg root X would actually open msg Y, a completely unrelated msg. We also had other spooky issues, like some threads not appearing on the public feed despite being in the local log, or even msgs I published myself sometimes not showing up on the public feed.
I confirmed that this was caused by jitdb or ssb-db2 by deleting the indexes folder and letting it reindex. After that, the bugs disappeared.
Thankfully, I kept a copy of the corrupted indexes folder. I also kept a copy of the newly built good indexes folder, and both of these should match the same log, because I turned airplane mode on and had no new replication. This way, we can do a raw diff on the corrupted-vs-good indexes folders and discover what's going on.
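A minimal sketch of that raw diff, assuming the two copies live in folders named indexes-good and indexes-corrupted (the names are illustrative):

```js
const fs = require('fs');
const path = require('path');

// Compare every file in the corrupted folder against its counterpart
// in the good folder, byte for byte.
for (const name of fs.readdirSync('indexes-corrupted')) {
  const corrupted = fs.readFileSync(path.join('indexes-corrupted', name));
  const goodPath = path.join('indexes-good', name);
  if (!fs.existsSync(goodPath)) {
    console.log(name, 'missing from the good folder');
    continue;
  }
  const good = fs.readFileSync(goodPath);
  console.log(name, corrupted.equals(good) ? 'same' : 'DIFFERS');
}
```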
If I had to guess, I would say there was a bug in atomically, and core indexes like offset and timestamp began accumulating errors as the files got corrupted.