
[bug]: postgres error: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001) #8049

Closed
blckbx opened this issue Oct 1, 2023 · 7 comments · Fixed by #8611
Labels
bug (Unintended code behaviour) · database (Related to the database/storage of LND) · P1 (MUST be fixed or reviewed) · postgres

blckbx commented Oct 1, 2023

Background

I've been observing an issue with the PostgreSQL DB since version 0.17 regarding serialization of reads/writes (probably surfacing now because lnd exposes SQL errors in the logs?).
Before 0.17 I used to compact the database on the fly (VACUUM FULL) while lnd was running in the background. With 0.17 I ran into this serialization error when compacting. Today the same error also led to a force close of a channel (no compaction was running at the time):

2023-10-01 13:21:03.430 [ERR] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): failing link: unable to accept revocation: ERROR: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001) with error: invalid revocation

followed by multiple

2023-10-01 13:21:38.721 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3)
2023-10-01 13:21:39.055 [INF] PEER: Peer(020d1617e27ac022395352f2b3774969593d3d6ddff6fb117d820a9dda8da45217): unable to read message from peer: read next header: EOF
2023-10-01 13:21:39.055 [WRN] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): error when syncing channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error
2023-10-01 13:21:39.055 [INF] PEER: Peer(020d1617e27ac022395352f2b3774969593d3d6ddff6fb117d820a9dda8da45217): disconnecting 020d1617e27ac022395352f2b3774969593d3d6ddff6fb117d820a9dda8da45217@23.88.103.101:40009, reason: read handler closed
2023-10-01 13:21:39.055 [ERR] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): failing link: unable to synchronize channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error with error: unable to resume channel, recovery required
2023-10-01 13:21:39.055 [INF] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): exited
2023-10-01 13:21:39.155 [INF] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): stopping

resulting in

2023-10-01 13:44:00.128 [DBG] NTFN: Found spend of outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3: spend_tx=ef66de761c1253ae85d7d3deb4f3791c52d44484b9a51084c5da3d7262574891, block_height=810154
2023-10-01 13:44:00.385 [DBG] NTFN: Found spend of outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3: spend_tx=ef66de761c1253ae85d7d3deb4f3791c52d44484b9a51084c5da3d7262574891, block_height=810154
2023-10-01 13:44:00.385 [DBG] NTFN: Spend request outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3 spent at tip=810154
2023-10-01 13:44:10.989 [INF] NTFN: Dispatching confirmed spend notification for outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3 at current height=810154: ef66de761c1253ae85d7d3deb4f3791c52d44484b9a51084c5da3d7262574891[0] spending 61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3 at height=810154
2023-10-01 13:44:10.989 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=chainTrigger from state=StateDefault
2023-10-01 13:44:10.990 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): new block (height=810154) examining active HTLC's
2023-10-01 13:44:10.990 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): checking commit chain actions at height=810154, in_htlc_count=0, out_htlc_count=0
2023-10-01 13:44:10.991 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): no actions for chain trigger, terminating
2023-10-01 13:44:10.993 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): terminating at state=StateDefault
2023-10-01 13:44:11.257 [INF] CNCT: Unilateral close of ChannelPoint(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3) detected
2023-10-01 13:44:11.258 [INF] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): remote party has closed channel out on-chain
2023-10-01 13:44:22.062 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=remoteCloseTrigger from state=StateDefault
2023-10-01 13:44:22.062 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): new block (height=810154) examining active HTLC's
2023-10-01 13:44:22.062 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): checking commit chain actions at height=810154, in_htlc_count=0, out_htlc_count=0
2023-10-01 13:44:22.428 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=remoteCloseTrigger from state=StateContractClosed
2023-10-01 13:44:22.463 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): checking commit chain actions at height=810154, in_htlc_count=0, out_htlc_count=0
2023-10-01 13:44:22.621 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): inserting 1 contract resolvers
2023-10-01 13:44:22.847 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting to resolve *contractcourt.commitSweepResolver
2023-10-01 13:44:22.847 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): contract *contractcourt.commitSweepResolver not yet resolved
2023-10-01 13:44:22.978 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=remoteCloseTrigger from state=StateWaitingFullResolution
2023-10-01 13:44:22.978 [INF] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): still awaiting contract resolution
2023-10-01 13:44:22.997 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): still have unresolved contract: *contractcourt.commitSweepResolver
2023-10-01 13:44:22.997 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): terminating at state=StateWaitingFullResolution
2023-10-01 13:44:23.136 [DBG] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): commit conf_height=810154, unlock_height=810154
2023-10-01 13:44:23.136 [DBG] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): isDelayedOutput=false, isLocalCommitTx=false
2023-10-01 13:44:23.136 [INF] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): Sweeping with witness type: CommitmentNoDelayTweakless
2023-10-01 13:44:23.136 [INF] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): sweeping commit output

Your environment

  • version of lnd: 0.17-rc6
  • which operating system (uname -a on *Nix): Linux 5.15.0-84-generic 93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023 x86_64
  • version of btcd, bitcoind, or other backend: bitcoind v25
  • any other relevant environment details:
    • PostgreSQL 14
    • peer in question is running CLN v23.08.1

lnd.conf:

db.postgres.timeout=0
db.postgres.maxconnections=50
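
For context: db.postgres.timeout=0 disables lnd's database timeout entirely (if I understand the option correctly), and db.postgres.maxconnections caps the connection pool. In plain database/sql terms the pool cap is roughly the following (a sketch, not lnd's actual code; the driver import path depends on the pgx major version):

package db

import (
	"database/sql"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func openLndDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, err
	}
	// Mirrors db.postgres.maxconnections=50: allow at most 50
	// concurrent connections to PostgreSQL.
	db.SetMaxOpenConns(50)
	return db, nil
}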

Steps to reproduce

Reproduction method unknown.

Expected behaviour

No SQL error.

Actual behaviour

Serialization error in database.

Logs:
FC Log: fc.log
SQLSTATE 40001 while compacting: compacting.log

tl;dr:
I think this is the postgres equivalent of #7869. I'll try setting db.postgres.timeout to 10m, although I'm not sure why lnd isn't retrying transactions as specified in #7960 (I'm running the latest 0.17-rc6).

blckbx added labels: bug (Unintended code behaviour), needs triage · Oct 1, 2023

guggero commented Oct 2, 2023

Since #7927 we do retry on certain SQL failures. Perhaps we just need to catch SQLSTATE 40001 as a retryable error.
Setting the DB timeout probably isn't a bad idea, though I wouldn't set the value that high. Maybe try with 1 or 5 minutes first?
Things should also get better with #7992, which hopefully will make it in for v0.17.1-beta.


blckbx commented Oct 2, 2023

Since #7927 we do re-try on certain SQL failures. Perhaps we just need to catch SQLSTATE 40001 as a re-tryable error.

Reading https://github.com/lightningnetwork/lnd/blob/master/sqldb/sqlerrors.go#L70-L74, with pgerr.SerializationFailure referencing https://github.com/jackc/pgerrcode/blob/master/errcode.go#L218 (which is in fact SQLSTATE 40001), I assumed this was already the case.
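
Condensed, the check that code performs amounts to something like this (a paraphrased sketch, not the verbatim lnd code; the pgconn import path depends on the pgx major version):

package db

import (
	"errors"

	"github.com/jackc/pgerrcode"
	"github.com/jackc/pgx/v5/pgconn"
)

// isSerializationError reports whether err wraps a PostgreSQL
// serialization failure (SQLSTATE 40001), which should be safe to
// retry once the conflicting transaction has finished.
func isSerializationError(err error) bool {
	var pgErr *pgconn.PgError
	if !errors.As(err, &pgErr) {
		return false
	}
	return pgErr.Code == pgerrcode.SerializationFailure // "40001"
}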

Setting the DB timeout probably isn't a bad idea, though I wouldn't set the value that high. Maybe try with 1 or 5 minutes first?

Okay, I'll try 5 minutes and see how it goes. Usually compaction takes less than that.
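
i.e. in lnd.conf:

db.postgres.timeout=5m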

Things should also get better with #7992, which hopefully will make it in for v0.17.1-beta.

Yeah, can't wait to try out #7992! Although a huge key-value table is still a limitation when it comes to (b)locking behaviour, so I'm also looking forward to 0.18 and the first native SQL schemas.


Roasbeef commented Oct 2, 2023

So #7992 also fixes issues with the retry logic: previously, certain errors weren't properly wrapped, so they weren't detected as serialization errors.

Interesting that you're running into it as is though, since we have an in-process mutex that should limit writes to a single writer.

EDIT: ah, reading it again, I see you're running a background vacuum; that could trigger the retry logic there.
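
For clarity, the retry logic being discussed re-runs the whole transaction when a serialization failure surfaces. Schematically it looks something like this (an illustrative sketch with made-up names, not the exact lnd code; isSerializationError as sketched above):

package db

import (
	"context"
	"database/sql"
	"time"
)

// withSerializableRetry runs txBody inside a SERIALIZABLE transaction
// and re-runs the whole transaction on SQLSTATE 40001.
func withSerializableRetry(ctx context.Context, db *sql.DB,
	maxRetries int, txBody func(*sql.Tx) error) error {

	for attempt := 0; ; attempt++ {
		tx, err := db.BeginTx(ctx, &sql.TxOptions{
			Isolation: sql.LevelSerializable,
		})
		if err != nil {
			return err
		}

		if err = txBody(tx); err == nil {
			// Serialization failures can also surface at
			// COMMIT time, so check the commit error too.
			if err = tx.Commit(); err == nil {
				return nil
			}
		} else {
			_ = tx.Rollback()
		}

		if !isSerializationError(err) || attempt >= maxRetries {
			return err
		}

		// Simple linear backoff before the next attempt.
		time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
	}
}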


blckbx commented Oct 2, 2023

Okay, this explains why vacuuming the database with the same database user that lnd uses throws the error. I can reproduce the error by vacuuming in the background, regardless of the timeout setting.

Then I'm not sure what caused the serialization error that led to the force close. Could it be heavy usage due to parallelized rebalancing and forwarding?

I'll keep an eye on it.

saubyk added labels: P1 (MUST be fixed or reviewed), postgres, database (Related to the database/storage of LND); removed: needs triage · Oct 3, 2023

blckbx commented Oct 3, 2023

Today I compacted the database manually as user "postgres" instead. This correctly pauses all of lnd's DB actions and resumes them after compaction has finished. No errors in lnd's log. Still not sure what caused the force close, but concurrent database access under the same lnd user is the closest explanation I can think of.


blckbx commented Oct 8, 2023

Tried another compaction run, which caused LND to shut down again.
With config:

db.postgres.timeout=2m

Compaction duration: 200.74 sec
Compaction start time: 2023-10-08 16:41:50
First error sighting: 2023-10-08 16:42:10

2023-10-08 16:42:10.893 [CRT] CHDB: Caught unhandled error: ERROR: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001)
2023-10-08 16:42:10.894 [INF] CHDB: Sending request for shutdown
2023-10-08 16:42:10.894 [INF] LTND: Received shutdown request.
2023-10-08 16:42:10.894 [INF] LTND: Shutting down...

saubyk added this to the v0.18.0 milestone · Nov 21, 2023
saubyk added this to the lnd v0.18 project · Nov 21, 2023

saubyk commented Nov 21, 2023

This issue will be addressed with #7992
