
[bug]: postgres error: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001) #8049

Closed
blckbx opened this issue Oct 1, 2023 · 7 comments · Fixed by #8611
Labels
bug (Unintended code behaviour) · database (Related to the database/storage of LND) · P1 (MUST be fixed or reviewed) · postgres

blckbx commented Oct 1, 2023

Background

I've been observing an issue with the PostgreSQL DB since version 0.17 regarding serialization of reads/writes (probably surfacing now because lnd exposes SQL errors in the logs?).
Before 0.17 I used to compact the database on the fly (VACUUM FULL) while lnd was running in the background. With 0.17 I ran into this serialization error when compacting. Today the same error also led to a force close of a channel (no compaction was running at the time):

2023-10-01 13:21:03.430 [ERR] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): failing link: unable to accept revocation: ERROR: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001) with error: invalid revocation

followed by multiple

2023-10-01 13:21:38.721 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3)
2023-10-01 13:21:39.055 [INF] PEER: Peer(020d1617e27ac022395352f2b3774969593d3d6ddff6fb117d820a9dda8da45217): unable to read message from peer: read next header: EOF
2023-10-01 13:21:39.055 [WRN] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): error when syncing channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error
2023-10-01 13:21:39.055 [INF] PEER: Peer(020d1617e27ac022395352f2b3774969593d3d6ddff6fb117d820a9dda8da45217): disconnecting 020d1617e27ac022395352f2b3774969593d3d6ddff6fb117d820a9dda8da45217@23.88.103.101:40009, reason: read handler closed
2023-10-01 13:21:39.055 [ERR] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): failing link: unable to synchronize channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error with error: unable to resume channel, recovery required
2023-10-01 13:21:39.055 [INF] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): exited
2023-10-01 13:21:39.155 [INF] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): stopping

resulting in

2023-10-01 13:44:00.128 [DBG] NTFN: Found spend of outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3: spend_tx=ef66de761c1253ae85d7d3deb4f3791c52d44484b9a51084c5da3d7262574891, block_height=810154
2023-10-01 13:44:00.385 [DBG] NTFN: Found spend of outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3: spend_tx=ef66de761c1253ae85d7d3deb4f3791c52d44484b9a51084c5da3d7262574891, block_height=810154
2023-10-01 13:44:00.385 [DBG] NTFN: Spend request outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3 spent at tip=810154
2023-10-01 13:44:10.989 [INF] NTFN: Dispatching confirmed spend notification for outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3 at current height=810154: ef66de761c1253ae85d7d3deb4f3791c52d44484b9a51084c5da3d7262574891[0] spending 61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3 at height=810154
2023-10-01 13:44:10.989 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=chainTrigger from state=StateDefault
2023-10-01 13:44:10.990 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): new block (height=810154) examining active HTLC's
2023-10-01 13:44:10.990 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): checking commit chain actions at height=810154, in_htlc_count=0, out_htlc_count=0
2023-10-01 13:44:10.991 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): no actions for chain trigger, terminating
2023-10-01 13:44:10.993 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): terminating at state=StateDefault
2023-10-01 13:44:11.257 [INF] CNCT: Unilateral close of ChannelPoint(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3) detected
2023-10-01 13:44:11.258 [INF] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): remote party has closed channel out on-chain
2023-10-01 13:44:22.062 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=remoteCloseTrigger from state=StateDefault
2023-10-01 13:44:22.062 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): new block (height=810154) examining active HTLC's
2023-10-01 13:44:22.062 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): checking commit chain actions at height=810154, in_htlc_count=0, out_htlc_count=0
2023-10-01 13:44:22.428 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=remoteCloseTrigger from state=StateContractClosed
2023-10-01 13:44:22.463 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): checking commit chain actions at height=810154, in_htlc_count=0, out_htlc_count=0
2023-10-01 13:44:22.621 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): inserting 1 contract resolvers
2023-10-01 13:44:22.847 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting to resolve *contractcourt.commitSweepResolver
2023-10-01 13:44:22.847 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): contract *contractcourt.commitSweepResolver not yet resolved
2023-10-01 13:44:22.978 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=remoteCloseTrigger from state=StateWaitingFullResolution
2023-10-01 13:44:22.978 [INF] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): still awaiting contract resolution
2023-10-01 13:44:22.997 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): still have unresolved contract: *contractcourt.commitSweepResolver
2023-10-01 13:44:22.997 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): terminating at state=StateWaitingFullResolution
2023-10-01 13:44:23.136 [DBG] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): commit conf_height=810154, unlock_height=810154
2023-10-01 13:44:23.136 [DBG] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): isDelayedOutput=false, isLocalCommitTx=false
2023-10-01 13:44:23.136 [INF] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): Sweeping with witness type: CommitmentNoDelayTweakless
2023-10-01 13:44:23.136 [INF] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): sweeping commit output

Your environment

  • version of lnd: 0.17-rc6
  • which operating system (uname -a on *Nix): Linux 5.15.0-84-generic 93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023 x86_64
  • version of btcd, bitcoind, or other backend: bitcoind v25
  • any other relevant environment details:
    • PostgreSQL 14
    • peer in question is running CLN v23.08.1

lnd.conf:

db.postgres.timeout=0
db.postgres.maxconnections=50
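
For context: db.postgres.timeout=0 disables lnd's database timeout entirely (if I understand the option correctly), and db.postgres.maxconnections caps the connection pool. In plain database/sql terms the pool cap is roughly the following (a sketch, not lnd's actual code; the driver import path depends on the pgx major version):

package db

import (
	"database/sql"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func openLndDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, err
	}
	// Mirrors db.postgres.maxconnections=50: allow at most 50
	// concurrent connections to PostgreSQL.
	db.SetMaxOpenConns(50)
	return db, nil
}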

Steps to reproduce

Reproduction method unknown.

Expected behaviour

No SQL error.

Actual behaviour

Serialization error in database.

Logs:
FC Log: fc.log
SQLSTATE 40001 while compacting: compacting.log

tl;dr:
I think this is the postgres equivalent of #7869. I'll try setting db.postgres.timeout to 10m, although I'm not sure why lnd isn't retrying transactions as specified in #7960 (I'm running the latest 0.17-rc6).

blckbx added labels: bug (Unintended code behaviour), needs triage · Oct 1, 2023

guggero commented Oct 2, 2023

Since #7927 we do retry on certain SQL failures. Perhaps we just need to catch SQLSTATE 40001 as a retryable error.
Setting the DB timeout probably isn't a bad idea, though I wouldn't set the value that high. Maybe try with 1 or 5 minutes first?
Things should also get better with #7992, which hopefully will make it in for v0.17.1-beta.


blckbx commented Oct 2, 2023

Since #7927 we do re-try on certain SQL failures. Perhaps we just need to catch SQLSTATE 40001 as a re-tryable error.

Reading https://github.com/lightningnetwork/lnd/blob/master/sqldb/sqlerrors.go#L70-L74, with pgerr.SerializationFailure referencing https://github.com/jackc/pgerrcode/blob/master/errcode.go#L218 (which is in fact SQLSTATE 40001), I assumed this was already the case.
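
Condensed, the check that code performs amounts to something like this (a paraphrased sketch, not the verbatim lnd code; the pgconn import path depends on the pgx major version):

package db

import (
	"errors"

	"github.com/jackc/pgerrcode"
	"github.com/jackc/pgx/v5/pgconn"
)

// isSerializationError reports whether err wraps a PostgreSQL
// serialization failure (SQLSTATE 40001), which should be safe to
// retry once the conflicting transaction has finished.
func isSerializationError(err error) bool {
	var pgErr *pgconn.PgError
	if !errors.As(err, &pgErr) {
		return false
	}
	return pgErr.Code == pgerrcode.SerializationFailure // "40001"
}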

Setting the DB timeout probably isn't a bad idea, though I wouldn't set the value that high. Maybe try with 1 or 5 minutes first?

Okay, I'll try 5 minutes and see how it goes. Usually compaction takes less than that.
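
i.e. in lnd.conf:

db.postgres.timeout=5m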

Things should also get better with #7992, which hopefully will make it in for v0.17.1-beta.

Yeah, can't wait to try out #7992! Although a huge key-value table is still a limitation when it comes to (b)locking behaviour, so I'm also looking forward to 0.18 and the first native SQL schemas.


Roasbeef commented Oct 2, 2023

So #7992 also fixes issues with the retry logic: previously, certain errors weren't properly wrapped, so they weren't detected as serialization errors.

Interesting that you're running into it as is though, since we have an in-process mutex that should limit writes to a single writer.

EDIT: ah, reading it again, I see you're running a background vacuum; that could trigger the retry logic there.
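
For clarity, the retry logic being discussed re-runs the whole transaction when a serialization failure surfaces. Schematically it looks something like this (an illustrative sketch with made-up names, not the exact lnd code; isSerializationError as sketched above):

package db

import (
	"context"
	"database/sql"
	"time"
)

// withSerializableRetry runs txBody inside a SERIALIZABLE transaction
// and re-runs the whole transaction on SQLSTATE 40001.
func withSerializableRetry(ctx context.Context, db *sql.DB,
	maxRetries int, txBody func(*sql.Tx) error) error {

	for attempt := 0; ; attempt++ {
		tx, err := db.BeginTx(ctx, &sql.TxOptions{
			Isolation: sql.LevelSerializable,
		})
		if err != nil {
			return err
		}

		if err = txBody(tx); err == nil {
			// Serialization failures can also surface at
			// COMMIT time, so check the commit error too.
			if err = tx.Commit(); err == nil {
				return nil
			}
		} else {
			_ = tx.Rollback()
		}

		if !isSerializationError(err) || attempt >= maxRetries {
			return err
		}

		// Simple linear backoff before the next attempt.
		time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
	}
}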


blckbx commented Oct 2, 2023

Okay, this explains why vacuuming the database with the same database user that lnd uses throws the error. I can reproduce the error by vacuuming in the background, regardless of the timeout setting.

Then I'm not sure what caused the serialization error that led to the force close. Could it be heavy usage due to parallelized rebalancing and forwarding?

I'll keep an eye on it.

saubyk added labels: P1 (MUST be fixed or reviewed), postgres, database (Related to the database/storage of LND); removed: needs triage · Oct 3, 2023

blckbx commented Oct 3, 2023

Today I compacted the database manually as user "postgres" instead. This correctly pauses all of lnd's DB actions and resumes them after compaction has finished. No errors in lnd's log. Still not sure what caused the force close, but concurrent database access under the same lnd user is the closest explanation I can think of.


blckbx commented Oct 8, 2023

Tried another compaction run, which caused LND to shut down again.
With config:

db.postgres.timeout=2m

Compaction duration: 200.74 sec
Compaction start time: 2023-10-08 16:41:50
First error sighting: 2023-10-08 16:42:10

2023-10-08 16:42:10.893 [CRT] CHDB: Caught unhandled error: ERROR: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001)
2023-10-08 16:42:10.894 [INF] CHDB: Sending request for shutdown
2023-10-08 16:42:10.894 [INF] LTND: Received shutdown request.
2023-10-08 16:42:10.894 [INF] LTND: Shutting down...

saubyk added this to the v0.18.0 milestone · Nov 21, 2023
saubyk added this to the lnd v0.18 project · Nov 21, 2023

saubyk commented Nov 21, 2023

This issue will be addressed with #7992
