Broken seq.index, keys index, and duplicate records on log #291

Closed
staltz opened this issue Dec 13, 2021 · 65 comments

staltz commented Dec 13, 2021

This issue has the same symptoms as ssbc/jitdb#161, and it seems to be happening only on iOS; I haven't gotten any bug reports for Android yet.

I got the db2 folder from Dave who has this problem and ran some scripts to study the data. Mostly, I'm concerned with the keys leveldb index, because when retrieving a message by a certain key, we get the wrong key.

For all records on the log, I printed out:

seq | offset | key in the keys leveldb index | key in the msg, if different
0 | 0 | %cwv/eckJoOfVqGx7bABbQduZBFrHZly0PfeE7nrwotc=.sha256
1 | 526 | %LqzwwHg3Lt+uU4S3DXdE8jkBfvma/1Yin1mrDzPF4V8=.sha256
2 | 995 | %V3/jZdpj1CsyVI3QH/rajkDahH1R30u1vf05OUA8h8U=.sha256
3 | 1464 | %Ctc7zcaT0N4Myvm3/+VEPItu1/1x0SJHysrCoJQzmUA=.sha256
4 | 2297 | %GNNx+fj9DmdMlazwqh439K+j0G1ovAGr4stelJj4EkM=.sha256
5 | 2766 | %PjwtwmxcVOyAZ399Z1Qy4og+z+sl8/P+Lfq3D/zV77w=.sha256
6 | 3250 | %p3aECWInrsJZdSO7Sp7CQR3a29MpIuVrAxwXG2tSTl0=.sha256
7 | 3736 | %m/OBdukgm7bDwQaT1P2aOHFXoLo5yzW7bfqy7JfrZeE=.sha256
...
1380993 | 992019409 | %bXb5tFQpLYK0kFOiOqhZRoJZoFsTBRLCWbCpIWNfUvY=.sha256
1380994 | 992019902 | %Fe9zF4gDp0a0EgvCSsaWmTgg1eHMKh5mKeDBnFr57LQ=.sha256
1380995 | 992020386 | %gp8jQXiLBKrUMuFWFjizZDUAD6FdSC694JbvPk8og2I=.sha256
1380996 | 992020870 | %7O1RVgAC1DGTieWj3dPY6ECyweQNV3XGsP0d3fvO7sM=.sha256
1380997 | 992021367 | %fsuepooJoYmmY5wrz8tsiUI6LKbAODDI39Fdcp7feGc=.sha256
1380998 | 992021851 | %RI616dra+5k4fCAL7OO6HcppXvejgh99xMc45p1qZEQ=.sha256
1380999 | 992022335 | %Cau8bVqsaWC98bwwYV7/dn4FNJ8KjFEj2PFI7D8kVHA=.sha256
1381002 | 992023787 | %XFqGE3AfQT7tsoB8KnzTAas1xGgnncj0cxe7L/jhy1c=.sha256 | %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256
1381003 | 992024723 | %cBQmvYU179KsLKdoiemhkzaCq7GxW09yHtd2ClE68Vs=.sha256 | %CuDpduTU/tUcUDET3nWzZF1kjPJlN0qb+Ir9U9UqmX0=.sha256
1381004 | 992025198 | %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256 | %NieKr/HRCGOvmIV9MLVXnIu3rtX4scD/yf9eRt3kxyQ=.sha256
1381005 | 992025940 | %CuDpduTU/tUcUDET3nWzZF1kjPJlN0qb+Ir9U9UqmX0=.sha256 | %M90y6E8OCiG/qY/Ns+FFRk7Gy8qfTkTGJFdtrMCrmXc=.sha256
1381006 | 992026945 | %NieKr/HRCGOvmIV9MLVXnIu3rtX4scD/yf9eRt3kxyQ=.sha256 | %p9GdZRKBYEV1LMj3DknZe1Pwsj+yzfDDyXMwHBLmIBE=.sha256
1381007 | 992027429 | %M90y6E8OCiG/qY/Ns+FFRk7Gy8qfTkTGJFdtrMCrmXc=.sha256 | %EcegYJ7ukIvt+6KMmhr3XGEhPIhrV+5ZyEhLPxJEsg4=.sha256
1381008 | 992027922 | %p9GdZRKBYEV1LMj3DknZe1Pwsj+yzfDDyXMwHBLmIBE=.sha256 | %UtpB04gXRGFiC7tim0FSGCnggAuCy3wDgQ1CM+Pq4N8=.sha256
...
1384614 | 994561122 | %M0d0QBvh5wd+4ogtz4cN2Whgi2mxc1pdMb+ybjxnRJM=.sha256 | %LxvYK+TO6ThepFgMMa01J4niIEnSIw7vIjfNsX8uBys=.sha256
1384615 | 994561591 | %tSWBFyjCin8d9FviNrRiOAUJGPWABcLbgvheUnoqhvk=.sha256 | %NEF5yav9PYStfL5YJ0WS5vyyGGmJF5GixH8AiFR6CX8=.sha256
1384616 | 994562060 | %LxvYK+TO6ThepFgMMa01J4niIEnSIw7vIjfNsX8uBys=.sha256 | %yoHZBhIi9yMQQurH+Zfl96Gd2x6TodEJrkO6Yn4QV74=.sha256
1384617 | 994562544 | %NEF5yav9PYStfL5YJ0WS5vyyGGmJF5GixH8AiFR6CX8=.sha256 | %N2MK2p64cY6y3fRMWLzCgtPvRXklwRhjK1PqrAcyyy4=.sha256
1384618 | 0 | %yoHZBhIi9yMQQurH+Zfl96Gd2x6TodEJrkO6Yn4QV74=.sha256 | %cwv/eckJoOfVqGx7bABbQduZBFrHZly0PfeE7nrwotc=.sha256
1384619 | 0 | %N2MK2p64cY6y3fRMWLzCgtPvRXklwRhjK1PqrAcyyy4=.sha256 | %cwv/eckJoOfVqGx7bABbQduZBFrHZly0PfeE7nrwotc=.sha256

If you look closely, there seems to be an off-by-two gap. Notice the rows below (and that seqs 1381000 and 1381001 are missing):

seq offset actualKey expectedKey
1380998 992021851 %RI61
1380999 992022335 %Cau8
1381002 992023787 %XFqG %uBNh
1381003 992024723 %cBQm %CuDp
1381004 992025198 %uBNh %NieK
1381005 992025940 %CuDp %M90y
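
For reference, a minimal sketch of the kind of dump script that produces a table like the one above (assuming log is the opened async-append-only-log instance behind log.bipf; the comparison against the keys leveldb is omitted here since its exact layout isn't shown):

// Stream the whole log, decode each record's key with bipf, and print
// "seq | offset | key" so it can be compared against the keys leveldb index.
const BIPF = require('bipf')

function dumpLog(log, cb) {
  const B_KEY = Buffer.from('key')
  let seq = 0
  log.stream({ gt: -1 }).pipe({
    paused: false,
    write(record) {
      if (!record.value) return // deleted record
      const pKey = BIPF.seekKey(record.value, 0, B_KEY)
      const key = BIPF.decode(record.value, pKey)
      console.log(`${seq++} | ${record.offset} | ${key}`)
    },
    end: cb,
  })
}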

Also, if I simply load some index files, the metadata says (notice the Entries):

Loaded seq.index with:
  Entries: 1384618
  Offset: 994562544

Loaded keys leveldb with:
  Version: 1
  Entries: 1384620
  Offset: 994562544

arj03 commented Dec 13, 2021

If one deletes the level indexes and rebuilds, they also end up at 1384618 entries. And there seems to be corruption in the jitdb indexes as well. I'll do some more testing.


arj03 commented Dec 13, 2021

This got me thinking. What version of Node is iOS running? I'm wondering if it might have something to do with the M1 chip. It seems like Node 16 was the first official release to support Apple silicon.


staltz commented Dec 14, 2021

Node.js Mobile (on both Android and iOS) is 12.19.0. iPhones don't have (I think) M1 chips; those are only in MacBooks, iMacs, and the iPad Pro.


arj03 commented Dec 14, 2021

Ah right, it's the A15 in the latest iPhone.


staltz commented Dec 14, 2021

Since you rebuilt the indexes, could you tell me what you see for the records at seqs 1381000, 1381001, and 1381002? I'd like to know what those 3 rows should have been. And I'm curious which message has the key %XFqG...


staltz commented Dec 14, 2021

Idea: even if we don't discover the culprit, we can still build a system to discover inconsistencies and automatically heal them. For example, if the number of entries in leveldb doesn't match the core indexes in jitdb, then rebuild parts of the leveldb index. (Of course, not fixing the actual culprit will then make it harder for us to debug it, and this issue could linger on forever.) But just an idea.
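
A very rough sketch of what such a self-healing check could look like (the helper names below are made up for illustration; they are not the real ssb-db2 or jitdb APIs):

// Compare the entry count persisted by the keys leveldb index against the
// entry count in jitdb's core seq.index, and rebuild level if they disagree.
function healIfInconsistent(getLevelEntryCount, getJitdbEntryCount, rebuildLevelIndexes, cb) {
  getLevelEntryCount((err, levelEntries) => {
    if (err) return cb(err)
    getJitdbEntryCount((err2, jitdbEntries) => {
      if (err2) return cb(err2)
      if (levelEntries !== jitdbEntries) {
        // mismatch detected, the leveldb index is suspect: rebuild it from the log
        return rebuildLevelIndexes(cb)
      }
      cb(null, 'consistent')
    })
  })
}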


arj03 commented Dec 14, 2021

Those 3 are: %XFqGE3AfQT7tsoB8KnzTAas1xGgnncj0cxe7L/jhy1c=.sha256, %cBQmvYU179KsLKdoiemhkzaCq7GxW09yHtd2ClE68Vs=.sha256, %uBNhUX420oJmq0O/2b/YxmokP+jQSAYptyrIGnd3QDw=.sha256.

One thing I noticed is that if I get the messages for my profile I get only 1 message using the corrupted indexes, while after clearing the whole thing I get > 8000. This is where I discovered that multiple indexes are corrupt. One thing I find strange is that we now have CRC checks, meaning this should not be corruption in read/write, but rather in how they are extracted. Unless they were written before we added these checks: do you remember if we did a reindex after deploying those?

What I also find really strange is that it seems that both level & the jitdb indexes are corrupt.


staltz commented Dec 14, 2021

Oh, which jitdb indexes did you detect were corrupted?


arj03 commented Dec 14, 2021

I tried running a query for my posts messages. I haven't gotten exactly to the bottom of it yet, but at least the author + post and one other index were corrupt.


staltz commented Dec 14, 2021

One thing I find strange is that we now have CRC checks, meaning this should not be corruption in read/write, but rather in how they are extracted. Unless they were written before we added these checks: do you remember if we did a reindex after deploying those?

Here goes some detective work:

The last jitdb without CRC was [email protected] and [email protected].

In Manyverse, we forced the rebuild of all leveldb indexes (not jitdb indexes) recently:

https://github.com/staltz/manyverse/blob/215b42f4c9931400ccbc3130834f956004f2471d/src/backend/ssb.ts#L35-L38

But we forced the rebuild of all jitdb indexes (with [email protected] and [email protected]) for Android only with this commit because we thought that (see issue ssbc/jitdb#149 (comment)) this issue afflicted only Android with ext4:

staltz/manyverse@82a65c5

Note, CRC didn't work on iOS, so I made this patch to jitdb and finally [email protected] supported CRC on iOS.

And then I updated the "rebuild all jitdb indexes" logic in Manyverse to also include iOS:

staltz/manyverse@7814647

And I also updated [email protected]:

staltz/manyverse@f033099

Both of the commits above were shipped into Manyverse version 0.2105.11, so I believe that iOS users had all of their indexes rebuilt and CRC-checked, but maybe something sneaky happened that I can't see?


arj03 commented Dec 14, 2021

That is interesting because I checked the seq numbers of my posts messages in Dave's log and I can see that they go way back (they start at seq 8017). This means that they should have been reindexed last month when you made that release. I tried dumping the raw author index and can see that it is vastly different from my reindex, even on the first seq. And this is a prefix index, so once a seq has been reindexed, it should never change. The file ISSUE_1328 must be correctly written, otherwise that whole file would fail... Hmm


staltz commented Dec 14, 2021

This means that they should have been reindexed last month when you made that release.

Just a clarification, I mentioned 0.2105.11 earlier, which means iOS should have had a full jitdb reindexing and CRC since May this year. You mentioned "last month", why?

The file ISSUE_1328 must be correctly written otherwise that whole file would fail... Hmm

Maybe it's possible that the file issue1328 wasn't written atomically, but that sounds impossible because it's a 0 byte file.

@davegomez

For what it's worth (and it looks like it might be worth something), I updated my phone to the latest model last month.

iOS automatically copied the data content of the apps during the migration, so I didn’t have to install and sync Manyverse again.

I can confirm I noticed the issue after the migration to the new phone.


staltz commented Dec 14, 2021

Very interesting! Thanks!

Just for more context, though, there is another user who is experiencing the same symptoms, so I think it's a somewhat more general issue; maybe they didn't change their phone.


arj03 commented Dec 14, 2021

This means that they should have been reindexed last month when you made that release.

Just a clarification, I mentioned 0.2105.11 earlier, which means iOS should have had a full jitdb reindexing and CRC since May this year. You mentioned "last month", why?

Oh, I misread the version numbers. I thought 11 meant the month :)

The file ISSUE_1328 must be correctly written otherwise that whole file would fail... Hmm

Maybe it's possible that the file issue1328 wasn't written atomically, but that sounds impossible because it's a 0 byte file.

That shouldn't happen and the folder should have been deleted before anyway.


staltz commented Dec 14, 2021

I want to try later, to rebuild all of these indexes (both jitdb and leveldb) from the log, and then do a diff on all indexes and see which one went wrong first. Because it seems that once one is wrong, it could infect the other indexes with wrong data.
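
Something like this crude byte-level comparison could at least show where a broken jitdb index file first diverges from a freshly rebuilt one (just a sketch, assuming the two index folders sit side by side):

const fs = require('fs')

// Returns the first byte offset where the two index files differ,
// or -1 if they are identical.
function firstDifference(fileA, fileB) {
  const a = fs.readFileSync(fileA)
  const b = fs.readFileSync(fileB)
  const len = Math.min(a.length, b.length)
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) return i
  }
  return a.length === b.length ? -1 : len
}

// e.g. firstDifference('indexes-broken/seq.index', 'indexes-rebuilt/seq.index')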


staltz commented Dec 14, 2021

Can it be that the problem is that canDecrypt is not CRC-checked?


arj03 commented Dec 14, 2021

Sorry for the deleted message, I spoke too early; I was running on my own profile and needed to check whether it's also a problem with another one. So I tested cblgh's and I see the same problem. I get the correct number of results by removing the author + type post + canDecrypt indexes. In other words, all 3 of them are corrupted.


staltz commented Dec 14, 2021

Can it be that the problem is that canDecrypt is not CRC-checked?

Still seems like we could do this, right? canDecrypt.index and encrypted.index are currently not CRC'd.
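
A rough sketch of the idea, using the crc-32 npm package for the checksum (the trailer layout below is made up for illustration; it is not jitdb's actual on-disk format):

const fs = require('fs')
const CRC32 = require('crc-32')

// Append a 4-byte CRC of the index body when saving...
function writeIndexWithCrc(filename, body, cb) {
  const trailer = Buffer.alloc(4)
  trailer.writeInt32BE(CRC32.buf(body), 0)
  fs.writeFile(filename, Buffer.concat([body, trailer]), cb)
}

// ...and verify it before trusting the index when loading.
function readIndexWithCrc(filename, cb) {
  fs.readFile(filename, (err, data) => {
    if (err) return cb(err)
    if (data.length < 4) return cb(new Error('index file too short'))
    const body = data.slice(0, data.length - 4)
    const stored = data.readInt32BE(data.length - 4)
    if (CRC32.buf(body) !== stored) return cb(new Error('CRC mismatch, rebuild this index'))
    cb(null, body)
  })
}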


arj03 commented Dec 14, 2021

Sure that would be a good idea.

The corruption is really strange as it seems to happen to more or less everything as far as I can see.


staltz commented Dec 14, 2021

The corruption is really strange as it seems to happen to more or less everything as far as I can see.

Can you detect whether a corruption is due to "bad data" coming from another corrupted index, or whether all of the corruptions are "original corruptions"? It would be easier for us if we detected that just canDecrypt.index and encrypted.index were corrupted and then all of the other indexes got infected with that.

But it could surely be a general filesystem-caused corruption.


staltz commented Dec 14, 2021

I want to try later, to rebuild all of these indexes (both jitdb and leveldb) from the log, and then do a diff on all indexes and see which one went wrong first.

Hmm, I can't do this unless I have Dave's secret keys (in order to create canDecrypt.index). I guess I could ask Dave to run some scripts using his secret key.


arj03 commented Dec 14, 2021

After further testing I'm not so sure that canDecrypt is actually corrupt. Let me do some more testing on author + post, which is for sure corrupt.


staltz commented Dec 14, 2021

Yeah, it's actually hard to know if canDecrypt is corrupted unless we get Dave to rebuild it on desktop.


staltz commented Dec 14, 2021

PS: Dave is gonna run this https://github.com/staltz/ssb-db2-issue-291


arj03 commented Dec 14, 2021

Oh my. It turns out we might have to be really careful with bipf when upgrading. I was testing with the latest bipf + ssb-db2. Pinning db2 to 2.3.3 and bipf to 1.5.1 seems to produce the correct results for author + post. So at least that is good news. Maybe the problem is only level db then. It will be interesting to see what you find when running the script you linked.


arj03 commented Dec 14, 2021

Seeing as I'm really slow today 😩 can you tell me when this fix (557e882) was put into Manyverse and the level folders were reindexed?

It seems like 2.3.2 and 2.3.3 are not tagged in the ssb-db2 git repo?


staltz commented Dec 14, 2021

Updated to ssb-db2 2.3.2 in staltz/manyverse@ebe4bf3 and released as Manyverse 0.2110.22


staltz commented Dec 14, 2021

Updated to ssb-db2 2.3.3 in staltz/manyverse@41bb9ea and released as Manyverse 0.2111.11


arj03 commented Dec 14, 2021

We made so many changes, I almost forgot half of them... ;)

The reason I was so confused is that the author index change for bendy butt bumps the version to 2, but that is only taken into consideration when updating the index ;-) So nothing related to bipf.

Stepping back, I don't think this has anything to do with index corruption. It smells more like a bug similar to the level bug. Maybe related to pausing in push-stream; that is just a guess, because I noticed there is also 1 message missing from the type index: sequence 1382999. So the following query will give you the message:

const { where, and, toCallback } = require('ssb-db2/operators')
const { seqs } = require('jitdb/operators')
ssb.db.query(
  where(
    and(
      seqs([1382999]),
    ),
  ),
  toCallback((err, msgs) => {
    console.log(msgs[0])
    ssb.close()
  })
)

while this will not, until you rebuild:

const { where, and, type, toCallback } = require('ssb-db2/operators')
const { seqs } = require('jitdb/operators')
ssb.db.query(
  where(
    and(
      seqs([1382999]),
      type('post'),
    ),
  ),
  toCallback((err, msgs) => {
    console.log(msgs[0])
    ssb.close()
  })
)


arj03 commented Dec 19, 2021

Jacob's keys level has the same kind of bug. It is off by 38 instead of 2.

1021983 | 788241486 | %/DnmQa4xqgIn5qaKgOx8ioUrU8i+ybE+hBFhWCVAuq0=.sha256
1021984 | 788242170 | %5nd7p453VUbLtm/oiWTZC8GPzAKWDNGOxz9e0MWBcF0=.sha256
1021985 | 788242654 | %FWzbolvHxC1NtcYAVzXrVHb+EurmkfX+LR0SRH19oxU=.sha256 | %D7CEf5EgPxX5/sWWgD2JREfpU3zShi85+eFsAhprwkc=.sha256
1021986 | 788243123 | %07shhpv4L6wEWpZxZdm0e5zU2QYo+o76iOgeIelSj9s=.sha256 | %LCC8qCZsZbqjKnQ0t0g+T59l15dD4wzDlMiPq+KBanc=.sha256
1021987 | 788244160 | %xc83mi/928R2MfwNP7DKLB/y5VMb2QZrlaojcrd5jpI=.sha256 | %UtAI9IUZn+MHKAYvdkc+7LU/FRXPJIGlklxkYTUtGo8=.sha256
1021988 | 788245071 | %6mFJx4VxZGswkf89liwNXAS1g0u/7CZ1qKxroremvMg=.sha256 | %j5hkP6mW4uvu0x6ujP1nY4Jdf2nHbyS7wb/KEK9pIJM=.sha256
1021989 | 788245555 | %fwRDVfejokKNk9b+P9mEBEazvIudeLXkyh48LMa1bX0=.sha256 | %+oGNg87aNM5SR404xhKFwRI8R3HcE8VqO1t4vcotc/A=.sha256

From the correct index:

1022023 | 788267008 | %FWzbolvHxC1NtcYAVzXrVHb+EurmkfX+LR0SRH19oxU=.sha256

These are the 38 missing, when compared to the total.

I have also noticed that his author index is correct, but seq.index is broken instead. The following 42 offsets are missing:

788242654
788243123
788244160
788245071
788245555
788246056
788246525
788247018
788247502
788247995
788248717
788249210
788249992
788251454
788252162
788252646
788253121
788253918
788254393
788254868
788255352
788255836
788256320
788256813
788258017
788258497
788258981
788259957
788260906
788261331
788262059
788262543
788263012
788263505
788264349
788264818
788265302
788265795

I have a theory about what might have caused the problem with seq.index. It just doesn't explain the author corruption or the level bugs. What is a bit puzzling about the corrupt author prefix index is that, if you look at the prefixes after it becomes corrupted, they don't point to the start of any known author.


staltz commented Dec 20, 2021

Studying Jacob's base leveldb, I noticed there are only two wrong KVs:

key=@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519
value={offset: 793494573, sequence: 1810}
key=@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519
value={offset: 793494080, sequence: 191}

Correct ones (they differ only by offset):

key=@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519
value={offset: 788258981, sequence: 1810}
key=@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519
value={offset: 788265302, sequence: 191}

And the META for base is:

Bad base:

  Version: 2
  Entries: 1032716
  Offset: 796004667

Rebuilt base (differs only by entries):

  Version: 2
  Entries: 1032754
  Offset: 796004667

38 is the diff.
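
For reference, this kind of comparison can be done with a small script along these lines (just a sketch, assuming both level folders open with the classic callback-style level package and JSON-encoded values):

const level = require('level')

// Walk every key/value in the suspect level folder and compare it against
// the same key in a freshly rebuilt folder, printing the entries that differ.
function compareLevels(badPath, rebuiltPath) {
  const bad = level(badPath, { valueEncoding: 'json' })
  const rebuilt = level(rebuiltPath, { valueEncoding: 'json' })

  bad.createReadStream().on('data', ({ key, value }) => {
    rebuilt.get(key, (err, expected) => {
      if (err) return console.log('missing in rebuilt index:', key)
      if (JSON.stringify(value) !== JSON.stringify(expected)) {
        console.log('mismatch at', key, value, expected)
      }
    })
  })
}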


arj03 commented Dec 20, 2021

Hmmmmmmmmmmmm good find. Why does Jacob have the same message multiple times?

Check this (I exposed the getMsgByOffset method):

ssb.db.getMsgByOffset(788258981, (err, msg) => {
  console.log(788258981, msg)
})

ssb.db.getMsgByOffset(793494573, (err, msg) => {
  console.log(7934945731, msg)
})

ssb.db.getMsgByOffset(788265302, (err, msg) => {
  console.log(788265302, msg)
})

ssb.db.getMsgByOffset(793494080, (err, msg) => {
  console.log(793494080, msg)
})

=>

788258981 {
  key: '%LFYUCrJ6S2TKCB7o+2f8++xFRQXQYpZg++jpDXsQnFk=.sha256',
  value: {
    previous: '%wRGPk9ftm1BzYMCPbbf5jIDsZVTdyXsX06lZjIlueE8=.sha256',
    sequence: 1810,
    author: '@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519',
    timestamp: 1638438970621,
    hash: 'sha256',
    content: {
      type: 'post',
      root: '%juRWvyC7fH9YwdCdI0YXhjGHad+XXsZqdThCyFBmutM=.sha256',
      fork: '%3rVlNUCs8OD6ceHGrRlcUz1A0BNvVQFHtFylGi2WLnA=.sha256',
      branch: '%lM7Yum/PzzVmTBvaCQBZDYkgwe+gkmW/IQu4/+yhoB4=.sha256',
      reply: [Object],
      channel: null,
      recps: null,
      text: "Do you know about the snail electric fence method? I've seen it done with just some copper wire and a 9 V battery. Haven't tried it myself though.\n",
      mentions: []
    },
    signature: 'cJKmGtkeSuJf4wg0AnuHs85iD9GAvVuiic494LDqZ5QVs9v+wGtoSqA89XOK2CEwE5hMzF80+nF+aOq3JMxMBw==.sig.ed25519'
  },
  timestamp: 1638625571799
}
7934945731 {
  key: '%LFYUCrJ6S2TKCB7o+2f8++xFRQXQYpZg++jpDXsQnFk=.sha256',
  value: {
    previous: '%wRGPk9ftm1BzYMCPbbf5jIDsZVTdyXsX06lZjIlueE8=.sha256',
    sequence: 1810,
    author: '@22YV3HX/BGClXH49ZebkcepqRpj7a9f5s0ddwhDSfxk=.ed25519',
    timestamp: 1638438970621,
    hash: 'sha256',
    content: {
      type: 'post',
      root: '%juRWvyC7fH9YwdCdI0YXhjGHad+XXsZqdThCyFBmutM=.sha256',
      fork: '%3rVlNUCs8OD6ceHGrRlcUz1A0BNvVQFHtFylGi2WLnA=.sha256',
      branch: '%lM7Yum/PzzVmTBvaCQBZDYkgwe+gkmW/IQu4/+yhoB4=.sha256',
      reply: [Object],
      channel: null,
      recps: null,
      text: "Do you know about the snail electric fence method? I've seen it done with just some copper wire and a 9 V battery. Haven't tried it myself though.\n",
      mentions: []
    },
    signature: 'cJKmGtkeSuJf4wg0AnuHs85iD9GAvVuiic494LDqZ5QVs9v+wGtoSqA89XOK2CEwE5hMzF80+nF+aOq3JMxMBw==.sig.ed25519'
  },
  timestamp: 1638625912978
}
793494080 {
  key: '%bKTVmbsjc9YqpZQrP+V94+KyGMQ0n3jSUrOZY1ADlD8=.sha256',
  value: {
    previous: '%cquXN1dXcOkKIocYhFHjo0XnDI0EZf3e5R+wyHZHxow=.sha256',
    sequence: 191,
    author: '@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519',
    timestamp: 1636321596142,
    hash: 'sha256',
    content: { type: 'vote', channel: null, vote: [Object] },
    signature: 'E9Hvxu41Gf+TvwjLMTPFPLWYS+9OJgxfaJYAZWxu/Obp1pQ91p9e8VthA279+2LaSP3BENnU++xHV3YSLjvwDQ==.sig.ed25519'
  },
  timestamp: 1638625912974
}
788265302 {
  key: '%bKTVmbsjc9YqpZQrP+V94+KyGMQ0n3jSUrOZY1ADlD8=.sha256',
  value: {
    previous: '%cquXN1dXcOkKIocYhFHjo0XnDI0EZf3e5R+wyHZHxow=.sha256',
    sequence: 191,
    author: '@V+BlA4XU3ZmjKLHlyG22tfmt+d67x5vGzgnbN1EzGjw=.ed25519',
    timestamp: 1636321596142,
    hash: 'sha256',
    content: { type: 'vote', channel: null, vote: [Object] },
    signature: 'E9Hvxu41Gf+TvwjLMTPFPLWYS+9OJgxfaJYAZWxu/Obp1pQ91p9e8VthA279+2LaSP3BENnU++xHV3YSLjvwDQ==.sig.ed25519'
  },
  timestamp: 1638625571896
}


staltz commented Dec 20, 2021

Wait, this is literally just on the log, no indexes involved?


arj03 commented Dec 20, 2021

Exactly


arj03 commented Dec 20, 2021

I wonder what came first, the corrupt indexes or double messages 🤔


staltz commented Dec 20, 2021

Well, one thing we know for sure is that seq.index has a saving problem. We could push that fix and reset all indexes, and then wait for more months to pass and see what happens.


arj03 commented Dec 20, 2021

Agreed, it would be good to have that fix. I'm just still trying to understand what is going on here.

I found some duplicate messages in Dave's database as well (key, first offset, duplicate offset).

duplicate key %z/MDV4mT/XaiIibeP0C5vf+0tKLAcFxtMBDHLf5eu+E=.sha256 180348963 180749729
duplicate key %+hW4qOX4xuIO8QE9RTUAaHqL0bD28r4GyF/kYJLBz38=.sha256 180295907 180753120
duplicate key %/gzGlFUQBE3C6Kd1QfQSzZxG78IL4DmJssBIIdQCnlI=.sha256 180328614 180769814
duplicate key %NWG837v3J2qhjGmK9yEZvH7gZ34fzlEKfKjeuWXmRG8=.sha256 180371842 181260814
duplicate key %3Z/wvMZQlLWR+pSMn9yERhxwNU6jVrDQGqbvZV+sEZM=.sha256 180435314 245829508
duplicate key %rRUeWOTw1GQBHPJCwAsYSnuFCjRaLh/vafGZJnoA56o=.sha256 180472215 276944048

I checked all of Jacob's and there are only the 2 you already found.

@staltz changed the title from "Corrupted indexes on iOS" to "Broken seq.index and keys index and duplicate records on log" on Dec 21, 2021
@staltz changed the title from "Broken seq.index and keys index and duplicate records on log" to "Broken seq.index, keys index, and duplicate records on log" on Dec 21, 2021

arj03 commented Dec 21, 2021

First of all, I have been really trying to break AAOL by treating it as a black box and throwing all kinds of random input at it. I haven't seen it break yet. So my hunch is that it must be somewhere else.

Good news is I think I have found something. Look at this line and imagine we have added a new plugin. In this case we will reindex, but most plugins will just skip. BUT they will update their processedOffset, thus lowering it compared to processedSeq. If you then kill the app while some of them have updated, but NOT keys or base, then there will be a mismatch!
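
To illustrate the failure mode (this is a simplified, made-up plugin shape, not the actual ssb-db2 code): if the offset bookkeeping is advanced outside the branch that actually applies the record, a kill at the wrong moment persists state that disagrees between plugins; keeping it inside the braces ties the two together.

// Illustrative only.
function makePlugin(shouldProcess) {
  let processedSeq = -1
  let processedOffset = -1

  return {
    // Broken ordering: processedOffset moves even when the record is skipped,
    // so a crash can persist an offset that doesn't match processedSeq.
    onRecordBroken(record) {
      if (shouldProcess(record)) {
        processedSeq += 1
        // ...update the actual index here...
      }
      processedOffset = record.offset // <-- outside the braces
    },

    // Safer ordering: both values only advance together, inside the braces.
    onRecordFixed(record) {
      if (shouldProcess(record)) {
        processedSeq += 1
        // ...update the actual index here...
        processedOffset = record.offset
      }
    },
  }
}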


arj03 commented Dec 21, 2021

The million $ question is, why was that line not put inside the tuborg?

I mean it is such a lovely lady

image


staltz commented Dec 21, 2021

Look at this line and imagine we have added a new plugin. In this case we will reindex, but most plugins will just skip. BUT they will update their processedOffset, thus lowering it compared to processedSeq. If you then kill the app while some of them have updated, but NOT keys or base, then there will be a mismatch!

🙊 🙈

The million $ question is, why was that line not put inside the tuborg?

I have a strong feeling/memory that we nitpicked that line of code many times and there was SOME reason why it was like that. Let me see if I can find anything from git history.


staltz commented Dec 21, 2021


arj03 commented Dec 21, 2021

Hmm. It still seems like a bug to me. I'll try running the tests with this changed.


staltz commented Dec 21, 2021

Yeah, feel free to tuborgify this.

(Side note: when all this is fixed I'd like to update to the latest ssb-db2 instead of doing more backports. I hope I get to the bottom of the memleak quickly)


arj03 commented Dec 21, 2021

Great! To sum up: there seem to be a myriad of bugs, making debugging especially hard. But I'm glad that so far these bugs seem to have been in jitdb and db2.

  • off-by-x level
  • Jacob seq.index corruption
  • duplicate messages? I'm not sure about that one, but both of the bugs above are probably not super good to begin with
  • corruption in author (Dave's log). I'm still really puzzled by that one, but I'm pretty confident in the fixes for the first 2.


staltz commented Dec 22, 2021

@arj03 Any thoughts on what we should do about the broken log.bipf (with duplicate records)? I imagine if it remains broken it could also infect the indexes. Perhaps scan the log and delete duplicate records?


Powersource commented Dec 22, 2021 via email


staltz commented Dec 22, 2021

Speaking of which, I just noticed that Manyverse on my phone also has this bug. Yay, what a great Christmas gift.


arj03 commented Dec 22, 2021

@arj03 Any thoughts what we should do about the broken log.bipf (with duplicate records)? I imagine if it remains broken it could also infect the indexes. Perhaps scan the log and delete duplicate records?

Things like base & keys should be fine with the extra messages. JitDB indexes like posts can be a bit more problematic, as you'll have too many records. You could delete them as you say; that should work. Just note that it's the first time you'll be running with deleted records, so there is always a risk there.


staltz commented Dec 23, 2021

I'll close this issue later, when I discover the best way of deleting those duplicate records from the log (I am trying to avoid an O(n^2) algorithm, or a memory-hungry one).


staltz commented Dec 23, 2021

Here's the script I've been using to remove duplicates:

  // Context assumed from the surrounding app code (not shown in this snippet):
  // BIPF = require('bipf'), util = require('util'), and `log` is the opened
  // async-append-only-log instance behind db2/log.bipf.
  const B_KEY = Buffer.from('key');
  const existing = new Set<string>();
  const deletables = new Set<number>();

  // Find deletables
  await new Promise<void>((resolve) => {
    log.stream({gt: -1}).pipe({
      paused: false,
      write: function (record: {value: Buffer | null; offset: number}) {
        const buffer = record.value;
        if (!buffer) return;
        const pKey = BIPF.seekKey(buffer, 0, B_KEY) as number;
        const shortKey = BIPF.decode(buffer, pKey).slice(0, 15) as string;
        if (existing.has(shortKey)) {
          deletables.add(record.offset);
        } else {
          existing.add(shortKey);
        }
      },
      end: () => {
        resolve();
      },
    });
  });

  const del = util.promisify(log.del);
  for (const offset of deletables) {
    await del(offset);
  }

  existing.clear();
  deletables.clear();

The shortKey idea is a prefix: it means that if you have 1 million records, the existing Set will take ~16 MB of RAM, which is pretty fine.

It works, and it's not too slow. It takes about 5s to scan the log on desktop (maybe it'll take about 20s on mobile). Jacob's log.bipf had 2 duplicates, Dave's log.bipf had 6. I tested a couple of times and it seems solid.

There was one crash I found and I'll submit a PR to ssb-db2, a simple thing.


arj03 commented Dec 23, 2021

Nice speed!

const shortKey = BIPF.decode(buffer, pKey).slice(0, 15) as string;

You should be able to do slice(1, 15) as the first char is always the same.


staltz commented Dec 23, 2021

Yes, slice(1, 16), to have 16 characters.


staltz commented Dec 23, 2021

Now we can close :)
