Broken seq.index, keys index, and duplicate records on log #291

Closed
staltz opened this issue Dec 13, 2021 · 65 comments

staltz commented Dec 13, 2021

This issue has the same symptoms as ssbc/jitdb#161, and it seems to be happening only on iOS; I haven't gotten any bug reports for Android yet.

I got the db2 folder from Dave who has this problem and ran some scripts to study the data. Mostly, I'm concerned with the keys leveldb index, because when retrieving a message by a certain key, we get the wrong key.

For all records on the log, I printed out:

seq | offset | key in the keys leveldb index | key in the msg, if different
0 | 0 | %cwv/eckJoOfVqGx7bABbQduZBFrHZly0PfeE7nrwotc=.sha256
1 | 526 | %LqzwwHg3Lt+uU4S3DXdE8jkBfvma/1Yin1mrDzPF4V8=.sha256
2 | 995 | %V3/jZdpj1CsyVI3QH/rajkDahH1R30u1vf05OUA8h8U=.sha256
3 | 1464 | %Ctc7zcaT0N4Myvm3/+VEPItu1/1x0SJHysrCoJQzmUA=.sha256
4 | 2297 | %GNNx+fj9DmdMlazwqh439K+j0G1ovAGr4stelJj4EkM=.sha256
5 | 2766 | %PjwtwmxcVOyAZ399Z1Qy4og+z+sl8/P+Lfq3D/zV77w=.sha256
6 | 3250 | %p3aECWInrsJZdSO7Sp7CQR3a29MpIuVrAxwXG2tSTl0=.sha256
7 | 3736 | %m/OBdukgm7bDwQaT1P2aOHFXoLo5yzW7bfqy7JfrZeE=.sha256
...
1380993 | 992019409 | %bXb5tFQpLYK0kFOiOqhZRoJZoFsTBRLCWbCpIWNfUvY=.sha256
1380994 | 992019902 | %Fe9zF4gDp0a0EgvCSsaWmTgg1eHMKh5mKeDBnFr57LQ=.sha256
1380995 | 992020386 | %gp8jQXiLBKrUMuFWFjizZDUAD6FdSC694JbvPk8og2I=.sha256
1380996 | 992020870 | %7O1RVgAC1DGTieWj3dPY6ECyweQNV3XGsP0d3fvO7sM=.sha256
1380997 | 992021367 | %fsuepooJoYmmY5wrz8tsiUI6LKbAODDI39Fdcp7feGc=.sha256
1380998 | 992021851 | %RI616dra+5k4fCAL7OO6HcppXvejgh99xMc45p1qZEQ=.sha256
1380999 | 992022335 | %Cau8bVqsaWC98bwwYV7/dn4FNJ8KjFEj2PFI7D8kVHA=.sha256
1381002 | 992023787 | %XFqGE3AfQT7tsoB8KnzTAas1xGgnncj0cxe7L/jhy1c=.sha256 | %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256
1381003 | 992024723 | %cBQmvYU179KsLKdoiemhkzaCq7GxW09yHtd2ClE68Vs=.sha256 | %CuDpduTU/tUcUDET3nWzZF1kjPJlN0qb+Ir9U9UqmX0=.sha256
1381004 | 992025198 | %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256 | %NieKr/HRCGOvmIV9MLVXnIu3rtX4scD/yf9eRt3kxyQ=.sha256
1381005 | 992025940 | %CuDpduTU/tUcUDET3nWzZF1kjPJlN0qb+Ir9U9UqmX0=.sha256 | %M90y6E8OCiG/qY/Ns+FFRk7Gy8qfTkTGJFdtrMCrmXc=.sha256
1381006 | 992026945 | %NieKr/HRCGOvmIV9MLVXnIu3rtX4scD/yf9eRt3kxyQ=.sha256 | %p9GdZRKBYEV1LMj3DknZe1Pwsj+yzfDDyXMwHBLmIBE=.sha256
1381007 | 992027429 | %M90y6E8OCiG/qY/Ns+FFRk7Gy8qfTkTGJFdtrMCrmXc=.sha256 | %EcegYJ7ukIvt+6KMmhr3XGEhPIhrV+5ZyEhLPxJEsg4=.sha256
1381008 | 992027922 | %p9GdZRKBYEV1LMj3DknZe1Pwsj+yzfDDyXMwHBLmIBE=.sha256 | %UtpB04gXRGFiC7tim0FSGCnggAuCy3wDgQ1CM+Pq4N8=.sha256
...
1384614 | 994561122 | %M0d0QBvh5wd+4ogtz4cN2Whgi2mxc1pdMb+ybjxnRJM=.sha256 | %LxvYK+TO6ThepFgMMa01J4niIEnSIw7vIjfNsX8uBys=.sha256
1384615 | 994561591 | %tSWBFyjCin8d9FviNrRiOAUJGPWABcLbgvheUnoqhvk=.sha256 | %NEF5yav9PYStfL5YJ0WS5vyyGGmJF5GixH8AiFR6CX8=.sha256
1384616 | 994562060 | %LxvYK+TO6ThepFgMMa01J4niIEnSIw7vIjfNsX8uBys=.sha256 | %yoHZBhIi9yMQQurH+Zfl96Gd2x6TodEJrkO6Yn4QV74=.sha256
1384617 | 994562544 | %NEF5yav9PYStfL5YJ0WS5vyyGGmJF5GixH8AiFR6CX8=.sha256 | %N2MK2p64cY6y3fRMWLzCgtPvRXklwRhjK1PqrAcyyy4=.sha256
1384618 | 0 | %yoHZBhIi9yMQQurH+Zfl96Gd2x6TodEJrkO6Yn4QV74=.sha256 | %cwv/eckJoOfVqGx7bABbQduZBFrHZly0PfeE7nrwotc=.sha256
1384619 | 0 | %N2MK2p64cY6y3fRMWLzCgtPvRXklwRhjK1PqrAcyyy4=.sha256 | %cwv/eckJoOfVqGx7bABbQduZBFrHZly0PfeE7nrwotc=.sha256

If you look closely, there seems to be an off-by-two gap. Notice the rows below (and that seqs 1381000 and 1381001 are missing):

seq offset actualKey expectedKey
1380998 992021851 %RI61
1380999 992022335 %Cau8
1381002 992023787 %XFqG %uBNh
1381003 992024723 %cBQm %CuDp
1381004 992025198 %uBNh %NieK
1381005 992025940 %CuDp %M90y
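
For reference, a minimal sketch of the kind of dump script that produces a table like the one above (assuming log is the opened async-append-only-log instance behind log.bipf; the comparison against the keys leveldb is omitted here since its exact layout isn't shown):

// Stream the whole log, decode each record's key with bipf, and print
// "seq | offset | key" so it can be compared against the keys leveldb index.
const BIPF = require('bipf')

function dumpLog(log, cb) {
  const B_KEY = Buffer.from('key')
  let seq = 0
  log.stream({ gt: -1 }).pipe({
    paused: false,
    write(record) {
      if (!record.value) return // deleted record
      const pKey = BIPF.seekKey(record.value, 0, B_KEY)
      const key = BIPF.decode(record.value, pKey)
      console.log(`${seq++} | ${record.offset} | ${key}`)
    },
    end: cb,
  })
}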

Also, if I simply load some index files, the metadata says (notice the Entries):

Loaded seq.index with:
  Entries: 1384618
  Offset: 994562544

Loaded keys leveldb with:
  Version: 1
  Entries: 1384620
  Offset: 994562544

arj03 commented Dec 13, 2021

If one deletes the level indexes and rebuilds, they also end up at 1384618 entries. And there seems to be corruption in the jitdb indexes as well. I'll do some more testing.


arj03 commented Dec 13, 2021

This got me thinking. What version of Node is iOS running? I'm wondering if it might have something to do with the M1 chip. It seems like Node 16 was the first official release to support Apple silicon.


staltz commented Dec 14, 2021

Node.js Mobile (on both Android and iOS) is 12.19.0. iPhones don't have (I think) M1 chips; those are only in MacBooks, iMacs, and the iPad Pro.


arj03 commented Dec 14, 2021

Ah right, it's the A15 in the latest iPhone.


staltz commented Dec 14, 2021

Since you rebuilt the indexes, could you tell me what you see for the records at seqs 1381000, 1381001, and 1381002? I'd like to know what those 3 rows should have been. And I'm curious which message has the key %XFqG...


staltz commented Dec 14, 2021

Idea: even if we don't discover the culprit, we can still build a system to discover inconsistencies and automatically heal them. For example, if the number of entries in leveldb doesn't match the core indexes in jitdb, then rebuild parts of the leveldb index. (Of course, not fixing the actual culprit will then make it harder for us to debug it, and this issue could linger on forever.) But just an idea.
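
A very rough sketch of what such a self-healing check could look like (the helper names below are made up for illustration; they are not the real ssb-db2 or jitdb APIs):

// Compare the entry count persisted by the keys leveldb index against the
// entry count in jitdb's core seq.index, and rebuild level if they disagree.
function healIfInconsistent(getLevelEntryCount, getJitdbEntryCount, rebuildLevelIndexes, cb) {
  getLevelEntryCount((err, levelEntries) => {
    if (err) return cb(err)
    getJitdbEntryCount((err2, jitdbEntries) => {
      if (err2) return cb(err2)
      if (levelEntries !== jitdbEntries) {
        // mismatch detected, the leveldb index is suspect: rebuild it from the log
        return rebuildLevelIndexes(cb)
      }
      cb(null, 'consistent')
    })
  })
}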


arj03 commented Dec 14, 2021

Those 3 are: %XFqGE3AfQT7tsoB8KnzTAas1xGgnncj0cxe7L/jhy1c=.sha256, %cBQmvYU179KsLKdoiemhkzaCq7GxW09yHtd2ClE68Vs=.sha256, %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256.

One thing I noticed is that if I get the messages for my profile I get only 1 message using the corrupted indexes, while after clearing the whole thing I get > 8000. This is where I discovered that multiple indexes are corrupt. One thing I find strange is that we now have CRC checks, meaning this should not be corruption in read/write, but rather in how they are extracted. Unless they were written before we added these checks: do you remember if we did a reindex after deploying those?

What I also find really strange is that it seems that both level & the jitdb indexes are corrupt.


staltz commented Dec 14, 2021

Oh, which jitdb indexes did you detect were corrupted?


arj03 commented Dec 14, 2021

I tried running a query for my posts messages. I haven't gotten exactly to the bottom of it yet, but at least the author + post and one other index were corrupt.


staltz commented Dec 14, 2021

One thing I find strange is that we now have CRC checks, meaning this should not be corruption in read/write, but rather in how they are extracted. Unless they were written before we added these checks: do you remember if we did a reindex after deploying those?

Here goes some detective work:

The last jitdb without CRC was [email protected] and [email protected].

In Manyverse, we forced the rebuild of all leveldb indexes (not jitdb indexes) recently:

https://github.com/staltz/manyverse/blob/215b42f4c9931400ccbc3130834f956004f2471d/src/backend/ssb.ts#L35-L38

But we forced the rebuild of all jitdb indexes (with [email protected] and [email protected]) for Android only with this commit because we thought that (see issue ssbc/jitdb#149 (comment)) this issue afflicted only Android with ext4:

staltz/manyverse@82a65c5

Note, CRC didn't work on iOS, so I made this patch to jitdb and finally [email protected] supported CRC on iOS.

And then I updated the "rebuild all jitdb indexes" logic in Manyverse to also include iOS:

staltz/manyverse@7814647

And I also updated [email protected]:

staltz/manyverse@f033099

Both of the commits above were shipped into Manyverse version 0.2105.11, so I believe that iOS users had all of their indexes rebuilt and CRC-checked, but maybe something sneaky happened that I can't see?


arj03 commented Dec 14, 2021

That is interesting because I checked the seq numbers of my posts messages in Dave's log and I can see that they go way back (they start at seq 8017). This means that they should have been reindexed last month when you made that release. I tried dumping the raw author index and can see that it is vastly different from my reindex, even on the first seq. And this is a prefix index, so once a seq has been reindexed, it should never change. The file ISSUE_1328 must be correctly written, otherwise that whole file would fail... Hmm


staltz commented Dec 14, 2021

This means that they should have been reindexed last month when you made that release.

Just a clarification, I mentioned 0.2105.11 earlier, which means iOS should have had a full jitdb reindexing and CRC since May this year. You mentioned "last month", why?

The file ISSUE_1328 must be correctly written otherwise that whole file would fail... Hmm

Maybe it's possible that the file issue1328 wasn't written atomically, but that sounds impossible because it's a 0 byte file.

@davegomez

For what it's worth (and it looks like it might be worth something), I updated my phone to the latest model last month.

iOS automatically copied the data content of the apps during the migration, so I didn’t have to install and sync Manyverse again.

I can confirm I noticed the issue after the migration to the new phone.


staltz commented Dec 14, 2021

Very interesting! Thanks!

Just for more context, though, there is another user who is experiencing the same symptoms, so I think it's a somewhat more general issue; maybe they didn't change their phone.


arj03 commented Dec 14, 2021

This means that they should have been reindexed last month when you made that release.

Just a clarification, I mentioned 0.2105.11 earlier, which means iOS should have had a full jitdb reindexing and CRC since May this year. You mentioned "last month", why?

Oh, I misread the version numbers. I thought 11 meant the month :)

The file ISSUE_1328 must be correctly written otherwise that whole file would fail... Hmm

Maybe it's possible that the file issue1328 wasn't written atomically, but that sounds impossible because it's a 0 byte file.

That shouldn't happen and the folder should have been deleted before anyway.


staltz commented Dec 14, 2021

I want to try later, to rebuild all of these indexes (both jitdb and leveldb) from the log, and then do a diff on all indexes and see which one went wrong first. Because it seems that once one is wrong, it could infect the other indexes with wrong data.
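
Something like this crude byte-level comparison could at least show where a broken jitdb index file first diverges from a freshly rebuilt one (just a sketch, assuming the two index folders sit side by side):

const fs = require('fs')

// Returns the first byte offset where the two index files differ,
// or -1 if they are identical.
function firstDifference(fileA, fileB) {
  const a = fs.readFileSync(fileA)
  const b = fs.readFileSync(fileB)
  const len = Math.min(a.length, b.length)
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) return i
  }
  return a.length === b.length ? -1 : len
}

// e.g. firstDifference('indexes-broken/seq.index', 'indexes-rebuilt/seq.index')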


staltz commented Dec 14, 2021

Can it be that the problem is that canDecrypt is not CRC-checked?


arj03 commented Dec 14, 2021

Sorry for the deleted message, I spoke too early; I was running on my own profile and needed to check whether it's also a problem with another one. So I tested cblgh's and I see the same problem. I get the correct number of results by removing the author + type post + canDecrypt indexes. In other words, all 3 of them are corrupted.


staltz commented Dec 14, 2021

Can it be that the problem is that canDecrypt is not CRC-checked?

Still seems like we could do this, right? canDecrypt.index and encrypted.index are currently not CRC'd.
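
A rough sketch of the idea, using the crc-32 npm package for the checksum (the trailer layout below is made up for illustration; it is not jitdb's actual on-disk format):

const fs = require('fs')
const CRC32 = require('crc-32')

// Append a 4-byte CRC of the index body when saving...
function writeIndexWithCrc(filename, body, cb) {
  const trailer = Buffer.alloc(4)
  trailer.writeInt32BE(CRC32.buf(body), 0)
  fs.writeFile(filename, Buffer.concat([body, trailer]), cb)
}

// ...and verify it before trusting the index when loading.
function readIndexWithCrc(filename, cb) {
  fs.readFile(filename, (err, data) => {
    if (err) return cb(err)
    if (data.length < 4) return cb(new Error('index file too short'))
    const body = data.slice(0, data.length - 4)
    const stored = data.readInt32BE(data.length - 4)
    if (CRC32.buf(body) !== stored) return cb(new Error('CRC mismatch, rebuild this index'))
    cb(null, body)
  })
}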


arj03 commented Dec 14, 2021

Sure that would be a good idea.

The corruption is really strange as it seems to happen to more or less everything as far as I can see.


staltz commented Dec 14, 2021

The corruption is really strange as it seems to happen to more or less everything as far as I can see.

Can you detect whether a corruption is due to "bad data" coming from another corrupted index, or whether all of the corruptions are "original corruptions"? It would be easier for us if we detected that just canDecrypt.index and encrypted.index were corrupted and then all of the other indexes got infected with that.

But it could surely be a general filesystem-caused corruption.


staltz commented Dec 14, 2021

I want to try later, to rebuild all of these indexes (both jitdb and leveldb) from the log, and then do a diff on all indexes and see which one went wrong first.

Hmm, I can't do this unless I have Dave's secret keys (in order to create canDecrypt.index). I guess I could ask Dave to run some scripts using his secret key.


arj03 commented Dec 14, 2021

After further testing I'm not so sure that canDecrypt is actually corrupt. Let me do some more testing on author + post, which is for sure corrupt.


staltz commented Dec 14, 2021

Yeah, it's actually hard to know if canDecrypt is corrupted unless we get Dave to rebuild it on desktop.


staltz commented Dec 14, 2021

PS: Dave is gonna run this https://github.com/staltz/ssb-db2-issue-291


arj03 commented Dec 14, 2021

Oh my. It turns out we might have to be really careful with bipf when upgrading. I was testing with the latest bipf + ssb-db2. Pinning db2 to 2.3.3 and bipf to 1.5.1 seems to produce the correct results for author + post. So at least that is good news. Maybe the problem is only level db then. It will be interesting to see what you find when running the script you linked.


arj03 commented Dec 14, 2021

Seeing as I'm really slow today 😩 can you tell me when this fix (557e882) was put into Manyverse and the level folders were reindexed?

It seems like 2.3.2 and 2.3.3 are not tagged in the ssb-db2 git repo?


staltz commented Dec 14, 2021

Updated to ssb-db2 2.3.2 in staltz/manyverse@ebe4bf3 and released as Manyverse 0.2110.22


staltz commented Dec 14, 2021

Updated to ssb-db2 2.3.3 in staltz/manyverse@41bb9ea and released as Manyverse 0.2111.11


arj03 commented Dec 14, 2021

We made so many changes, I almost forgot half of them... ;)

The reason I was so confused is that the author index change for bendy butt bumps the version to 2, but that is only taken into consideration when updating the index ;-) So nothing related to bipf.

Stepping back, I don't think this has anything to do with index corruption. It smells more like a bug similar to the level bug. Maybe related to pausing in push-stream; that is just a guess, because I noticed there is also 1 message missing from the type index: sequence 1382999. So the following query will give you the message:

const { where, and, toCallback } = require('ssb-db2/operators')
const { seqs } = require('jitdb/operators')
ssb.db.query(
  where(
    and(
      seqs([1382999]),
    ),
  ),
  toCallback((err, msgs) => {
    console.log(msgs[0])
    ssb.close()
  })
)

while this will not, until you rebuild:

const { where, and, type, toCallback } = require('ssb-db2/operators')
const { seqs } = require('jitdb/operators')
ssb.db.query(
  where(
    and(
      seqs([1382999]),
      type('post'),
    ),
  ),
  toCallback((err, msgs) => {
    console.log(msgs[0])
    ssb.close()
  })
)


arj03 commented Dec 19, 2021

Jacob's keys level has the same kind of bug. It is off by 38 instead of 2.

1021983 | 788241486 | %/DnmQa4xqgIn5qaKgOx8ioUrU8i+ybE+hBFhWCVAuq0=.sha256
1021984 | 788242170 | %5nd7p453VUbLtm/oiWTZC8GPzAKWDNGOxz9e0MWBcF0=.sha256
1021985 | 788242654 | %FWzbolvHxC1NtcYAVzXrVHb+EurmkfX+LR0SRH19oxU=.sha256 | %D7CEf5EgPxX5/sWWgD2JREfpU3zShi85+eFsAhprwkc=.sha256
1021986 | 788243123 | %07shhpv4L6wEWpZxZdm0e5zU2QYo+o76iOgeIelSj9s=.sha256 | %LCC8qCZsZbqjKnQ0t0g+T59l15dD4wzDlMiPq+KBanc=.sha256
1021987 | 788244160 | %xc83mi/928R2MfwNP7DKLB/y5VMb2QZrlaojcrd5jpI=.sha256 | %UtAI9IUZn+MHKAYvdkc+7LU/FRXPJIGlklxkYTUtGo8=.sha256
1021988 | 788245071 | %6mFJx4VxZGswkf89liwNXAS1g0u/7CZ1qKxroremvMg=.sha256 | %j5hkP6mW4uvu0x6ujP1nY4Jdf2nHbyS7wb/KEK9pIJM=.sha256
1021989 | 788245555 | %fwRDVfejokKNk9b+P9mEBEazvIudeLXkyh48LMa1bX0=.sha256 | %+oGNg87aNM5SR404xhKFwRI8R3HcE8VqO1t4vcotc/A=.sha256

From the correct index:

1022023 | 788267008 | %FWzbolvHxC1NtcYAVzXrVHb+EurmkfX+LR0SRH19oxU=.sha256

These are the 38 missing, when compared to the total.

I have also noticed that his author index is correct, but seq.index is broken instead. The following 42 offsets are missing:

788242654
788243123
788244160
788245071
788245555
788246056
788246525
788247018
788247502
788247995
788248717
788249210
788249992
788251454
788252162
788252646
788253121
788253918
788254393
788254868
788255352
788255836
788256320
788256813
788258017
788258497
788258981
788259957
788260906
788261331
788262059
788262543
788263012
788263505
788264349
788264818
788265302
788265795

I have a theory about what might have caused the problem with seq.index. It just doesn't explain the author corruption or the level bugs. What is a bit puzzling about the corrupt author prefix index is that, if you look at the prefixes after it becomes corrupted, they don't point to the start of any known author.


staltz commented Dec 20, 2021

Studying Jacob's base leveldb, I noticed there are only two wrong KVs:

key=@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519
value={offset: 793494573, sequence: 1810}
key=@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519
value={offset: 793494080, sequence: 191}

Correct ones (they differ only by offset):

key=@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519
value={offset: 788258981, sequence: 1810}
key=@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519
value={offset: 788265302, sequence: 191}

And the META for base is:

Bad base:

  Version: 2
  Entries: 1032716
  Offset: 796004667

Rebuilt base (differs only by entries):

  Version: 2
  Entries: 1032754
  Offset: 796004667

38 is the diff.
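
For reference, this kind of comparison can be done with a small script along these lines (just a sketch, assuming both level folders open with the classic callback-style level package and JSON-encoded values):

const level = require('level')

// Walk every key/value in the suspect level folder and compare it against
// the same key in a freshly rebuilt folder, printing the entries that differ.
function compareLevels(badPath, rebuiltPath) {
  const bad = level(badPath, { valueEncoding: 'json' })
  const rebuilt = level(rebuiltPath, { valueEncoding: 'json' })

  bad.createReadStream().on('data', ({ key, value }) => {
    rebuilt.get(key, (err, expected) => {
      if (err) return console.log('missing in rebuilt index:', key)
      if (JSON.stringify(value) !== JSON.stringify(expected)) {
        console.log('mismatch at', key, value, expected)
      }
    })
  })
}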


arj03 commented Dec 20, 2021

Hmmmmmmmmmmmm good find. Why does Jacob have the same message multiple times?

Check this (I exposed the getMsgByOffset method):

ssb.db.getMsgByOffset(788258981, (err, msg) => {
  console.log(788258981, msg)
})

ssb.db.getMsgByOffset(793494573, (err, msg) => {
  console.log(7934945731, msg)
})

ssb.db.getMsgByOffset(788265302, (err, msg) => {
  console.log(788265302, msg)
})

ssb.db.getMsgByOffset(793494080, (err, msg) => {
  console.log(793494080, msg)
})

=>

788258981 {
  key: '%LFYUCrJ6S2TKCB7o+2f8++xFRQXQYpZg++jpDXsQnFk=.sha256',
  value: {
    previous: '%wRGPk9ftm1BzYMCPbbf5jIDsZVTdyXsX06lZjIlueE8=.sha256',
    sequence: 1810,
    author: '@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519',
    timestamp: 1638438970621,
    hash: 'sha256',
    content: {
      type: 'post',
      root: '%juRWvyC7fH9YwdCdI0YXhjGHad+XXsZqdThCyFBmutM=.sha256',
      fork: '%3rVlNUCs8OD6ceHGrRlcUz1A0BNvVQFHtFylGi2WLnA=.sha256',
      branch: '%lM7Yum/PzzVmTBvaCQBZDYkgwe+gkmW/IQu4/+yhoB4=.sha256',
      reply: [Object],
      channel: null,
      recps: null,
      text: "Do you know about the snail electric fence method? I've seen it done with just some copper wire and a 9 V battery. Haven't tried it myself though.\n",
      mentions: []
    },
    signature: 'cJKmGtkeSuJf4wg0AnuHs85iD9GAvVuiic494LDqZ5QVs9v+wGtoSqA89XOK2CEwE5hMzF80+nF+aOq3JMxMBw==.sig.ed25519'
  },
  timestamp: 1638625571799
}
7934945731 {
  key: '%LFYUCrJ6S2TKCB7o+2f8++xFRQXQYpZg++jpDXsQnFk=.sha256',
  value: {
    previous: '%wRGPk9ftm1BzYMCPbbf5jIDsZVTdyXsX06lZjIlueE8=.sha256',
    sequence: 1810,
    author: '@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519',
    timestamp: 1638438970621,
    hash: 'sha256',
    content: {
      type: 'post',
      root: '%juRWvyC7fH9YwdCdI0YXhjGHad+XXsZqdThCyFBmutM=.sha256',
      fork: '%3rVlNUCs8OD6ceHGrRlcUz1A0BNvVQFHtFylGi2WLnA=.sha256',
      branch: '%lM7Yum/PzzVmTBvaCQBZDYkgwe+gkmW/IQu4/+yhoB4=.sha256',
      reply: [Object],
      channel: null,
      recps: null,
      text: "Do you know about the snail electric fence method? I've seen it done with just some copper wire and a 9 V battery. Haven't tried it myself though.\n",
      mentions: []
    },
    signature: 'cJKmGtkeSuJf4wg0AnuHs85iD9GAvVuiic494LDqZ5QVs9v+wGtoSqA89XOK2CEwE5hMzF80+nF+aOq3JMxMBw==.sig.ed25519'
  },
  timestamp: 1638625912978
}
793494080 {
  key: '%bKTVmbsjc9YqpZQrP+V94+KyGMQ0n3jSUrOZY1ADlD8=.sha256',
  value: {
    previous: '%cquXN1dXcOkKIocYhFHjo0XnDI0EZf3e5R+wyHZHxow=.sha256',
    sequence: 191,
    author: '@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519',
    timestamp: 1636321596142,
    hash: 'sha256',
    content: { type: 'vote', channel: null, vote: [Object] },
    signature: 'E9Hvxu41Gf+TvwjLMTPFPLWYS+9OJgxfaJYAZWxu/Obp1pQ91p9e8VthA279+2LaSP3BENnU++xHV3YSLjvwDQ==.sig.ed25519'
  },
  timestamp: 1638625912974
}
788265302 {
  key: '%bKTVmbsjc9YqpZQrP+V94+KyGMQ0n3jSUrOZY1ADlD8=.sha256',
  value: {
    previous: '%cquXN1dXcOkKIocYhFHjo0XnDI0EZf3e5R+wyHZHxow=.sha256',
    sequence: 191,
    author: '@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519',
    timestamp: 1636321596142,
    hash: 'sha256',
    content: { type: 'vote', channel: null, vote: [Object] },
    signature: 'E9Hvxu41Gf+TvwjLMTPFPLWYS+9OJgxfaJYAZWxu/Obp1pQ91p9e8VthA279+2LaSP3BENnU++xHV3YSLjvwDQ==.sig.ed25519'
  },
  timestamp: 1638625571896
}


staltz commented Dec 20, 2021

Wait, this is literally just on the log, no indexes involved?


arj03 commented Dec 20, 2021

Exactly


arj03 commented Dec 20, 2021

I wonder what came first, the corrupt indexes or double messages 🤔


staltz commented Dec 20, 2021

Well, one thing we know for sure is that seq.index has a saving problem. We could push that fix and reset all indexes, and then wait for more months to pass and see what happens.


arj03 commented Dec 20, 2021

Agreed, it would be good to have that fix. I'm just still trying to understand what is going on here.

I found some duplicate messages in Dave's database as well (key, first offset, duplicate offset).

duplicate key %z/MDV4mT/XaiIibeP0C5vf+0tKLAcFxtMBDHLf5eu+E=.sha256 180348963 180749729
duplicate key %+hW4qOX4xuIO8QE9RTUAaHqL0bD28r4GyF/kYJLBz38=.sha256 180295907 180753120
duplicate key %/gzGlFUQBE3C6Kd1QfQSzZxG78IL4DmJssBIIdQCnlI=.sha256 180328614 180769814
duplicate key %NWG837v3J2qhjGmK9yEZvH7gZ34fzlEKfKjeuWXmRG8=.sha256 180371842 181260814
duplicate key %3Z/wvMZQlLWR+pSMn9yERhxwNU6jVrDQGqbvZV+sEZM=.sha256 180435314 245829508
duplicate key %rRUeWOTw1GQBHPJCwAsYSnuFCjRaLh/vafGZJnoA56o=.sha256 180472215 276944048

I checked all of Jacob's and there are only the 2 you already found.

@staltz changed the title from "Corrupted indexes on iOS" to "Broken seq.index and keys index and duplicate records on log" on Dec 21, 2021
@staltz changed the title from "Broken seq.index and keys index and duplicate records on log" to "Broken seq.index, keys index, and duplicate records on log" on Dec 21, 2021

arj03 commented Dec 21, 2021

First of all, I have been really trying to break AAOL by treating it as a black box and throwing all kinds of random input at it. I haven't seen it break yet. So my hunch is that it must be somewhere else.

Good news is I think I have found something. Look at this line and imagine we have added a new plugin. In this case we will reindex, but most plugins will just skip. BUT they will update their processedOffset, thus lowering it compared to processedSeq. If you then kill the app while some of them have updated, but NOT keys or base, then there will be a mismatch!
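
To illustrate the failure mode (this is a simplified, made-up plugin shape, not the actual ssb-db2 code): if the offset bookkeeping is advanced outside the branch that actually applies the record, a kill at the wrong moment persists state that disagrees between plugins; keeping it inside the braces ties the two together.

// Illustrative only.
function makePlugin(shouldProcess) {
  let processedSeq = -1
  let processedOffset = -1

  return {
    // Broken ordering: processedOffset moves even when the record is skipped,
    // so a crash can persist an offset that doesn't match processedSeq.
    onRecordBroken(record) {
      if (shouldProcess(record)) {
        processedSeq += 1
        // ...update the actual index here...
      }
      processedOffset = record.offset // <-- outside the braces
    },

    // Safer ordering: both values only advance together, inside the braces.
    onRecordFixed(record) {
      if (shouldProcess(record)) {
        processedSeq += 1
        // ...update the actual index here...
        processedOffset = record.offset
      }
    },
  }
}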


arj03 commented Dec 21, 2021

The million $ question is, why was that line not put inside the tuborg?

I mean it is such a lovely lady

image


staltz commented Dec 21, 2021

Look at this line and imagine we have added a new plugin. In this case we will reindex, but most plugins will just skip. BUT they will update their processedOffset, thus lowering it compared to processedSeq. If you then kill the app while some of them have updated, but NOT keys or base, then there will be a mismatch!

🙊 🙈

The million $ question is, why was that line not put inside the tuborg?

I have a strong feeling/memory that we nitpicked that line of code many times and there was SOME reason why it was like that. Let me see if I can find anything from git history.


staltz commented Dec 21, 2021


arj03 commented Dec 21, 2021

Hmm. It still seems like a bug to me. I'll try running the tests with this changed.


staltz commented Dec 21, 2021

Yeah, feel free to tuborgify this.

(Side note: when all this is fixed I'd like to update to the latest ssb-db2 instead of doing more backports. I hope I get to the bottom of the memleak quickly)


arj03 commented Dec 21, 2021

Great! To sum up: there seem to be a myriad of bugs, making debugging especially hard. But I'm glad that so far these bugs seem to have been in jitdb and db2.

  • off-by-x level
  • Jacob seq.index corruption
  • duplicate messages? I'm not sure about that one, but both of the bugs above are probably not super good to begin with
  • corruption in author (Dave's log). I'm still really puzzled by that one, but I'm pretty confident in the fixes for the first 2.


staltz commented Dec 22, 2021

@arj03 Any thoughts on what we should do about the broken log.bipf (with duplicate records)? I imagine if it remains broken it could also infect the indexes. Perhaps scan the log and delete duplicate records?


Powersource commented Dec 22, 2021 via email


staltz commented Dec 22, 2021

Speaking of which, I just noticed that Manyverse on my phone also has this bug. Yay, what a great Christmas gift.


arj03 commented Dec 22, 2021

@arj03 Any thoughts what we should do about the broken log.bipf (with duplicate records)? I imagine if it remains broken it could also infect the indexes. Perhaps scan the log and delete duplicate records?

Things like base & keys should be fine with the extra messages. JitDB indexes like posts can be a bit more problematic, as you'll have too many records. You could delete them as you say; that should work. Just note that it's the first time you'll be running with deleted records, so there is always a risk there.


staltz commented Dec 23, 2021

I'll close this issue later, when I discover the best way of deleting those duplicate records from the log (I am trying to avoid an O(n^2) algorithm, or a memory-hungry one).


staltz commented Dec 23, 2021

Here's the script I've been using to remove duplicates:

  // Context assumed from the surrounding app code (not shown in this snippet):
  // BIPF = require('bipf'), util = require('util'), and `log` is the opened
  // async-append-only-log instance behind db2/log.bipf.
  const B_KEY = Buffer.from('key');
  const existing = new Set<string>();
  const deletables = new Set<number>();

  // Find deletables
  await new Promise<void>((resolve) => {
    log.stream({gt: -1}).pipe({
      paused: false,
      write: function (record: {value: Buffer | null; offset: number}) {
        const buffer = record.value;
        if (!buffer) return;
        const pKey = BIPF.seekKey(buffer, 0, B_KEY) as number;
        const shortKey = BIPF.decode(buffer, pKey).slice(0, 15) as string;
        if (existing.has(shortKey)) {
          deletables.add(record.offset);
        } else {
          existing.add(shortKey);
        }
      },
      end: () => {
        resolve();
      },
    });
  });

  const del = util.promisify(log.del);
  for (const offset of deletables) {
    await del(offset);
  }

  existing.clear();
  deletables.clear();

The shortKey idea is a prefix: it means that if you have 1 million records, the existing Set will take ~16 MB of RAM, which is pretty fine.

It works, and it's not too slow. It takes about 5s to scan the log on desktop (maybe it'll take about 20s on mobile). Jacob's log.bipf had 2 duplicates, Dave's log.bipf had 6. I tested a couple of times and it seems solid.

There was one crash I found and I'll submit a PR to ssb-db2, a simple thing.


arj03 commented Dec 23, 2021

Nice speed!

const shortKey = BIPF.decode(buffer, pKey).slice(0, 15) as string;

You should be able to do slice(1, 15) as the first char is always the same.


staltz commented Dec 23, 2021

Yes, slice(1, 16), to have 16 characters.


staltz commented Dec 23, 2021

Now we can close :)
