Broken seq.index, keys index, and duplicate records on log #291
If one deletes the level indexes and rebuilds them, they also end up at 1384618 entries. And there seems to be corruption in the jitdb indexes as well. I'll do some more testing.
This got me thinking. What version of Node is the iOS app running? I'm wondering if it might have something to do with the M1 chip. It seems like Node 16 was the first official release to support Apple silicon.
Node.js Mobile (on both Android and iOS) is 12.19.0. iPhones don't have (I think) M1 chips; those are only in MacBooks, iMacs, and the iPad Pro.
Ah right, it is the A15 in the latest iPhone.
Since you rebuilt the indexes, could you tell me what you see for the records at seqs 1381000, 1381001, and 1381002? I'd like to know what those 3 rows should have been. And I'm curious what message has the key
Idea: even if we don't discover the culprit, we can still build a system to discover inconsistencies and automatically heal them. Like, if the number of entries in leveldb doesn't match the core indexes in jitdb, then rebuild parts of the leveldb index. (Of course, not fixing the actual culprit will then make it harder for us to debug the culprit, and this issue could linger on forever). But just an idea.
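A minimal sketch of what such a self-healing check could look like, in plain JavaScript. Only the levelup key-stream counting is standard API; getJitdbSeqCount and rebuildLeveldbIndex are hypothetical placeholders for whatever ssb-db2 would expose, not real functions.

```js
// Count entries in a levelup instance by streaming its keys.
function countLeveldbEntries(level) {
  return new Promise((resolve, reject) => {
    let count = 0;
    level
      .createKeyStream()
      .on('data', () => { count += 1; })
      .on('error', reject)
      .on('end', () => resolve(count));
  });
}

// Hypothetical self-healing check: if the leveldb index disagrees with
// jitdb's core seq index, drop and rebuild the leveldb index from the log.
async function healIfInconsistent(level, jitdb) {
  const levelCount = await countLeveldbEntries(level);
  const jitdbCount = await getJitdbSeqCount(jitdb); // placeholder
  if (levelCount !== jitdbCount) {
    await rebuildLeveldbIndex(level); // placeholder
  }
}
```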
Those 3 are: %XFqGE3AfQT7tsoB8KnzTAas1xGgnncj0cxe7L/jhy1c=.sha256, %cBQmvYU179KsLKdoiemhkzaCq7GxW09yHtd2ClE68Vs=.sha256, %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256. One thing I noticed is that if I get the messages for my profile, I get only 1 message using the corrupted indexes, while clearing the whole thing I get > 8000. This is where I discovered the multiple indexes are corrupt. One thing I find strange is that we now have CRC checks, meaning that there should not be any corruption in read/write, rather in how they are extracted. Unless they were written before we added these checks; do you remember if we did a reindex after we deployed those? What I also find really strange is that it seems that both the level & the jitdb indexes are corrupt.
Oh, which jitdb indexes did you detect were corrupted?
I tried running a query for my post messages. I haven't gotten exactly to the bottom of it yet, but at least the author + post and one other index were corrupt.
Here goes some detective work:
The last jitdb without CRC was
In Manyverse, we forced the rebuild of all leveldb indexes (not jitdb indexes) recently:
But we forced the rebuild of all jitdb indexes (with
Note, CRC didn't work on iOS, so I made this patch to jitdb and finally
And then I updated the "rebuild all jitdb indexes" logic in Manyverse to also include iOS:
And I also updated
Both of the commits above were shipped into Manyverse version
That is interesting, because I checked the seq numbers of my post messages in Dave's log and I can see that they go way back (they start at seq 8017). This means that they should have been reindexed last month when you made that release. I tried dumping the raw author index and can see that it is vastly different from my reindex, even on the first seq. And this is a prefix index, so once a seq has been reindexed, it should never change. The file ISSUE_1328 must be correctly written, otherwise that whole file would fail... Hmm
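For context on the "prefix index" remark above: a prefix index stores, per seq, only the first few bytes of the indexed value, so an entry that has been written once should never change. A rough illustration of the concept (this is only a sketch, not jitdb's on-disk format or prefix length):

```js
// Build a toy author prefix index: position i holds the first `prefixLength`
// bytes of the author id of the record at seq i.
function buildAuthorPrefixIndex(records, prefixLength = 4) {
  return records.map((r) => Buffer.from(r.author).slice(0, prefixLength));
}

// Querying: prefix matches are only candidates (false positives possible),
// but a seq that was indexed once should keep the same prefix forever.
function candidateSeqsForAuthor(prefixIndex, author, prefixLength = 4) {
  const wanted = Buffer.from(author).slice(0, prefixLength);
  const seqs = [];
  prefixIndex.forEach((prefix, seq) => {
    if (prefix.equals(wanted)) seqs.push(seq);
  });
  return seqs;
}
```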
Just a clarification, I mentioned
Maybe it's possible that the file
For what it's worth (it looks like it might be), I updated my phone to the latest version last month. iOS automatically copied the data content of the apps during the migration, so I didn't have to install and sync Manyverse again. I can confirm I noticed the issue after the migration to the new phone.
Very interesting! Thanks! Just for more context, though: there is another user who is experiencing the same symptoms, so I think it's a somewhat more general issue; maybe they didn't change their phone.
Oh, I misread the version numbers. I thought 11 meant the month :)
That shouldn't happen and the folder should have been deleted before anyway.
I want to try later, to rebuild all of these indexes (both jitdb and leveldb) from the log, and then do a diff on all indexes and see which one went wrong first. Because it seems that once one is wrong, it could infect the other indexes with wrong data.
Can it be that the problem is that canDecrypt is not CRC-checked?
Sorry for the deleted message, I spoke too early; I was running on my own profile and needed to check if it is also a problem with another one. So I tested cblgh and I see the same problem. I get the correct number of results by removing the author + type post + canDecrypt indexes. In other words, all 3 of them are corrupted.
Still seems like we could do this, right? canDecrypt.index and encrypted.index are currently not CRC'd.
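A minimal sketch of the kind of checksum guard being discussed, assuming the simplest possible scheme of appending a CRC32 trailer to the index payload (an illustration of the idea only, not jitdb's actual index file format), using the crc-32 npm package:

```js
const fs = require('fs');
const CRC32 = require('crc-32'); // returns a signed 32-bit checksum

// Write the index payload followed by a 4-byte CRC trailer.
function writeIndexWithCRC(filename, payload) {
  const trailer = Buffer.alloc(4);
  trailer.writeInt32BE(CRC32.buf(payload), 0);
  fs.writeFileSync(filename, Buffer.concat([payload, trailer]));
}

// Load the index; return null if the stored CRC does not match the payload,
// so the caller knows to rebuild instead of trusting corrupted data.
function loadIndexWithCRC(filename) {
  const file = fs.readFileSync(filename);
  const payload = file.slice(0, file.length - 4);
  const stored = file.readInt32BE(file.length - 4);
  if (CRC32.buf(payload) !== stored) return null;
  return payload;
}
```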
Sure, that would be a good idea. The corruption is really strange, as it seems to happen to more or less everything as far as I can see.
Can you detect whether a corruption is due to "bad data" coming from another corrupted index, or whether all of the corruptions are "original corruptions"? It would be easier for us if we detected that just canDecrypt.index and encrypted.index were corrupted and then all of the other indexes got infected with that. But it could surely be a general filesystem-caused corruption.
Hmm, I can't do this unless I have Dave's secret keys (in order to create canDecrypt.index). I guess I could ask Dave to run some scripts using his secret key.
After further testing I'm not so sure that canDecrypt is actually corrupt. Let me do some more testing on the author + post indexes that are for sure corrupt.
Yeah, it's actually hard to know if canDecrypt is corrupted unless we get Dave to rebuild it on desktop.
PS: Dave is gonna run this https://github.com/staltz/ssb-db2-issue-291
Oh my. It turns out we might have to be really careful with bipf when upgrading. I was testing with the latest bipf + ssb-db2. Pinning db2 to 2.3.3 and bipf to 1.5.1 seems to produce the correct results for author + post. So at least that is good news. Maybe the problem is only leveldb then. It will be interesting to see what you find running the script you linked.
Seeing as I'm really slow today 😩 can you tell me when this fix: 557e882 was put into Manyverse and the level folders reindexed? It seems like 2.3.2 and 2.3.3 are not tagged in ssb-db2 git?
Updated to ssb-db2 2.3.2 in staltz/manyverse@ebe4bf3 and released as Manyverse 0.2110.22 |
Updated to ssb-db2 2.3.3 in staltz/manyverse@41bb9ea and released as Manyverse 0.2111.11 |
We made so many changes, I almost forgot half of them... ;) The reason I was so confused is that the author index change for bendy butt bumps the version to 2, but that is only taken into consideration when updating the index ;-) So nothing related to bipf. Stepping back, I don't think this has anything to do with index corruption. It smells more like a bug similar to the level bug. Maybe related to pausing in push-stream; that is just a guess, because I noticed there is also 1 message missing from the type index: sequence 1382999. So the following query will give you the message:
while this will not, until you rebuild:
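The two query snippets referenced above did not survive in this transcript. Purely to illustrate the shape of such queries with ssb-db2's operators API (sbot is assumed to be a secret-stack instance with ssb-db2 loaded, and the author id is a placeholder; this is not claimed to be the exact pair of queries from the comment):

```js
const {where, and, author, type, toCallback} = require('ssb-db2/operators');

// A query that avoids the damaged type index, e.g. filtering by author only:
sbot.db.query(
  where(author('@REPLACE_WITH_AUTHOR_ID.ed25519')),
  toCallback((err, msgs) => console.log(msgs.length))
);

// A query that also filters on type 'post', and therefore misses any record
// whose seq is absent from the corrupted type index until it is rebuilt:
sbot.db.query(
  where(and(author('@REPLACE_WITH_AUTHOR_ID.ed25519'), type('post'))),
  toCallback((err, msgs) => console.log(msgs.length))
);
```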
Jacob's keys level index has the same kind of bug. It is off by 38 instead of 2.
I have also noticed that his author index is correct, but instead
I have a theory about what might have caused the problem with seq.index. It just doesn't explain the author corruption or the level bugs. What is a bit puzzling about the corrupt author prefix index is that, if you look at the prefixes after it is corrupted, they don't point to the start of any known author.
Studying Jacob's
Correct ones (they differ only by offset):
And the META for Bad base:
Rebuilt base (differs only by entries):
38 is the diff.
Hmmmmmmmmmmmm good find. Why does Jacob have the same message multiple times? Check this (I exposed the getMsgByOffset method):
=>
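For reference, a rough sketch of this kind of check against the raw log, using async-append-only-log and bipf directly instead of the exposed getMsgByOffset; the path, blockSize, and offsets are placeholder assumptions:

```js
const Log = require('async-append-only-log');
const BIPF = require('bipf');

// Open the raw log; the blockSize must match what ssb-db2 used for log.bipf
// (64 KiB is an assumption here).
const log = Log('/path/to/db2/log.bipf', {blockSize: 64 * 1024});

// Two offsets suspected of holding the same message (placeholder values).
for (const offset of [123456, 234567]) {
  log.get(offset, (err, buffer) => {
    if (err) throw err;
    const record = BIPF.decode(buffer, 0);
    console.log(offset, record.key);
  });
}
```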
Wait, this is literally just on the log, no indexes involved?
Exactly
I wonder what came first, the corrupt indexes or double messages 🤔
Well, one thing we know for sure is that seq.index has a saving problem. We could push that fix and reset all indexes, and then wait for more months to pass and see what happens.
Agreed, it would be good to have that fix. I'm just still trying to understand what is going on here. I found some duplicate messages in Dave's database as well (key, first offset, duplicate offset).
I checked all of Jacob's and there are only the 2 you already found.
First of all, I have been really trying to break AAOL by treating it as a black box and throwing all kinds of random input at it. I haven't seen it break yet. So my hunch is that it must be somewhere else. The good news is I think I have found something. Look at this line and imagine we have added a new plugin. In this case we will reindex, but most plugins will just skip. BUT they will update their
🙊 🙈
I have a strong feeling/memory that we nitpicked that line of code many times and there was SOME reason why it was like that. Let me see if I can find anything from git history.
Hmm. It still seems like a bug to me. I'll try running the tests with this changed.
Yeah, feel free to tuborgify this. (Side note: when all this is fixed I'd like to update to the latest ssb-db2 instead of doing more backports. I hope I get to the bottom of the memleak quickly)
Great! To sum up: there seems to be a myriad of bugs, making debugging especially hard. But I'm glad that these bugs so far seem to have been in jitdb and db2.
@arj03 Any thoughts on what we should do about the broken log.bipf (with duplicate records)? I imagine if it remains broken it could also infect the indexes. Perhaps scan the log and delete duplicate records?
FYI I deleted my folder and key. But there might be other people with the same issue.
Jacob
Speaking of which, I just noticed on my phone's Manyverse that it also has this bug. Yay, what a great Christmas gift.
For things like base & keys, they should be fine with the extra messages. jitdb indexes like posts can be a bit more problematic, as you'll have too many records. You could delete them as you say; that should work. Just note that it's the first time you'll be running with deleted records, so there is always a risk there.
I'll close this issue later, when I discover the best way of deleting those duplicate records from the log (I am trying to avoid a
Here's the script I've been using to remove duplicates:

const BIPF = require('bipf');
const util = require('util');

// Assumes this runs inside an async function, with `log` being the
// async-append-only-log instance behind ssb-db2 (the owner of log.bipf).
const B_KEY = Buffer.from('key');
const existing = new Set<string>(); // key prefixes seen so far
const deletables = new Set<number>(); // offsets of duplicate records

// Find deletables: stream the whole log and remember the offset of any
// record whose key (prefix) we have already seen
await new Promise<void>((resolve) => {
  log.stream({gt: -1}).pipe({
    paused: false,
    write: function (record: {value: Buffer | null; offset: number}) {
      const buffer = record.value;
      if (!buffer) return; // already-deleted record
      const pKey = BIPF.seekKey(buffer, 0, B_KEY) as number;
      const shortKey = BIPF.decode(buffer, pKey).slice(0, 15) as string;
      if (existing.has(shortKey)) {
        deletables.add(record.offset);
      } else {
        existing.add(shortKey);
      }
    },
    end: () => {
      resolve();
    },
  });
});

// Delete the duplicate records from the log
const del = util.promisify(log.del);
for (const offset of deletables) {
  await del(offset);
}
existing.clear();
deletables.clear();

The shortKey idea is a prefix and it means that if you have 1 million records, then the

It works, and it's not too slow. It takes about ~5s to scan the log on desktop (maybe it'll take about 20s on mobile). Jacob's log.bipf had 2 duplicates, Dave's log.bipf had 6. I tested a couple of times and it seems solid. There was one crash I found and I'll submit a PR to ssb-db2, a simple thing.
Nice speed!
You should be able to do slice(1, 15) as the first char is always the same.
Yes, |
Now we can close :)
This issue has the same symptoms as ssbc/jitdb#161, and seems to be happening only on iOS; I haven't gotten bug reports for Android yet.
I got the db2 folder from Dave who has this problem and ran some scripts to study the data. Mostly, I'm concerned with the keys leveldb index, because when retrieving a message by a certain key, we get the wrong key.

For all records on the log, I printed out:
If you look closely, there seems to be an off-by-two gap. Notice these in bold (and that seqs 1381000 and 1381001 are missing):
Also, if I simply load some index files, the metadata says (notice the Entries):