multi: check leader status with our health checker to correctly shut down LND if network partitions #8938

Merged

bhandras
Collaborator

@bhandras bhandras commented Jul 25, 2024

Change Description

LND currently holds the leader lease until shutdown, at which point it will resign. In some scenarios, it may be desirable for LND to relinquish leadership and shut down if it becomes partitioned from the etcd cluster. This PR aims to implement this behavior by adding a leader status check to the existing health checks, which will verify the leader status every minute. To prevent hanging due to network issues, we also introduce reasonable timeouts for etcd calls. This allows for a clean shutdown upon a request from the health check module.

Fixes: #8913

Steps to Test

make itest backend=bitcoind dbbackend=etcd icase=leader_health_check



@bhandras bhandras self-assigned this Jul 25, 2024
@bhandras bhandras added the database (Related to the database/storage of LND) and etcd labels Jul 25, 2024
@bhandras bhandras force-pushed the etcd-leader-election-fixups branch 2 times, most recently from 1e767fd to 5ebb557 Compare July 26, 2024 15:05
@bhandras bhandras marked this pull request as ready for review July 26, 2024 15:26
@@ -207,6 +207,12 @@ replace google.golang.org/protobuf => github.com/lightninglabs/protobuf-go-hex-d
// Temporary replace until the next version of sqldb is tagged.
replace github.com/lightningnetwork/lnd/sqldb => ./sqldb

// Temporary replace until the next version of healthcheck is tagged.
replace github.com/lightningnetwork/lnd/healthcheck => ./healthcheck
Member

Marker commit to make sure we remove the pins.

@Filiprogrammer
Contributor

Tested this on a 3 node regtest cluster:

  • lndetcd1 (Leader)
  • lndetcd2 (Waiting)
  • lndetcd3 (Waiting)

lnd.conf:

[Application Options]
listen=0.0.0.0:9735
alias=lndetcd

[Bitcoin]
bitcoin.regtest=true
bitcoin.node=bitcoind

[Bitcoind]
bitcoind.rpcuser=${BITCOIND_RPCUSER}
bitcoind.rpcpass=${BITCOIND_RPCPASS}
bitcoind.rpchost=${IP_OF_BITCOIND}:8332
bitcoind.zmqpubrawtx=tcp://${IP_OF_BITCOIND}:29001
bitcoind.zmqpubrawblock=tcp://${IP_OF_BITCOIND}:29002
bitcoind.estimatemode=ECONOMICAL

[tor]
tor.active=true
tor.v3=true

[db]
db.backend=etcd

[etcd]
db.etcd.host=127.0.0.1:2379
db.etcd.disabletls=1

[cluster]
cluster.enable-leader-election=1
cluster.leader-elector=etcd
cluster.etcd-election-prefix=cluster-leader
cluster.id=${HOSTNAME}

Logs:

14:43:44: Disconnected lndetcd1 from the network.

14:44:38 lndetcd2: [INF] LTND: Elected as leader (lndetcd2)

14:44:46 lndetcd1: [CRT] SRVR: Health check: leader status failed after 1 calls
14:44:46 lndetcd1: [INF] SRVR: Sending request for shutdown
14:44:46 lndetcd1: [INF] LTND: Received shutdown request.
14:44:46 lndetcd1: [INF] LTND: Shutting down...
14:44:46 lndetcd1: [INF] LTND: Systemd was notified about stopping
14:44:46 lndetcd1: [INF] LTND: Gracefully shutting down.
14:44:46 lndetcd1: [INF] NANN: Channel Status Manager shutting down...
14:44:48 lndetcd1: {"level":"warn","ts":"2024-07-29T14:44:48.658Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008fe540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:44:55 lndetcd1: {"level":"warn","ts":"2024-07-29T14:44:55.658Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:02 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:02.659Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008fe540/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:09 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:09.660Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:11.655Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008fe540/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:45:11 lndetcd1: [ERR] NANN: Unable to load active channels: no active channels exist
14:45:11 lndetcd1: [INF] HSWC: HTLC Switch shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=7
14:45:11 lndetcd1: [INF] HSWC: Onion processor shutting down...
14:45:11 lndetcd1: [INF] HSWC: Decaying hash log received shutdown request
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=12
14:45:11 lndetcd1: [INF] INVC: InvoiceRegistry shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=11
14:45:11 lndetcd1: [INF] CRTR: Channel Router shutting down...
14:45:11 lndetcd1: [INF] CNCT: ChainArbitrator shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=9
14:45:11 lndetcd1: [INF] FNDG: Funding manager shutting down...
14:45:11 lndetcd1: [INF] BRAR: Breach arbiter shutting down...
14:45:11 lndetcd1: [INF] UTXN: UTXO nursery shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=6
14:45:11 lndetcd1: [INF] DISC: Authenticated gossiper shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=10
14:45:11 lndetcd1: [INF] SWPR: Sweeper shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=5
14:45:11 lndetcd1: [INF] SWPR: TxPublisher stopping...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=4
14:45:11 lndetcd1: [INF] CHNF: ChannelNotifier shutting down...
14:45:11 lndetcd1: [INF] PRNF: PeerNotifier shutting down...
14:45:11 lndetcd1: [INF] HSWC: HtlcNotifier shutting down...
14:45:11 lndetcd1: [INF] CHBU: chanbackup.SubSwapper shutting down...
14:45:11 lndetcd1: [INF] NTFN: bitcoind notifier shutting down...
14:45:11 lndetcd1: [INF] NTFN: Stopping mempool notifier
14:45:11 lndetcd1: [ERR] NTFN: dead epoch stream in BestBlockTracker
14:45:11 lndetcd1: [INF] CHFT: ChannelEventStore shutting down...
14:45:11 lndetcd1: [ERR] HSWC: InterceptableSwitch stopped: block epoch stream stopped
14:45:23 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:23.661Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:23 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:23.661Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:30 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:30.686Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":3,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:37 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:37.687Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:41 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:41.661Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:45:41 lndetcd1: [INF] WTCL: (anchor) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
14:45:41 lndetcd1: [INF] WTCL: (taproot) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
14:45:44 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:44.690Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":4,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:51 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:51.691Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:58 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:58.689Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":5,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:05 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:05.690Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:11.663Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = Unknown desc = context deadline exceeded"}
14:46:12 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:12.690Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":6,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:19 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:19.692Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:26 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:26.692Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":7,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:33 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:33.693Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:40 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:40.694Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":8,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:41 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:41.673Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:46:41 lndetcd1: [INF] WTCL: (taproot) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
14:46:54 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:54.696Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:54 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:54.696Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":9,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:01 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:01.725Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:08 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:08.726Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":10,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:11.674Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:47:22 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:22.727Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:22 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:22.728Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":11,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:29 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:29.753Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":12,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:36 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:36.754Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:41 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:41.679Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:47:43 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:43.754Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":13,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:50 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:50.756Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:57 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:57.757Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":14,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:48:04 lndetcd1: {"level":"warn","ts":"2024-07-29T14:48:04.757Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:48:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:48:11.679Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:48:11 lndetcd1: [INF] HLCK: Health monitor shutting down...
14:48:11 lndetcd1: [INF] RPCS: Stopping RPC Server
14:48:11 lndetcd1: [INF] RPCS: Stopping VersionRPC Sub-RPC Server
14:48:11 lndetcd1: [INF] RPCS: Stopping RouterRPC Sub-RPC Server
14:48:11 lndetcd1: [INF] RPCS: Stopping WatchtowerClientRPC Sub-RPC Server
14:48:11 lndetcd1: [INF] TORC: Stopping tor controller
14:48:11 lndetcd1: [INF] LTND: Attempting to resign from leader role (lndetcd1)
14:48:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:48:11.758Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":15,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:48:14 lndetcd1: [INF] LTND: Shutdown complete

As can be seen from the logs I provided, between 14:44:38 and 14:44:46, both lndetcd1 and lndetcd2 were active at the same time.

Also, lndetcd1 took about three and a half minutes to shut down (14:44:46 to 14:48:14) while it kept trying to reach etcd.

@bhandras bhandras force-pushed the etcd-leader-election-fixups branch from 5ebb557 to b98958a Compare July 31, 2024 14:41
@bhandras
Collaborator Author

Tested this on a 3 node regtest cluster:

  • lndetcd1 (Leader)
  • lndetcd2 (Waiting)
  • lndetcd3 (Waiting)

lnd.conf:

[Application Options]
listen=0.0.0.0:9735
alias=lndetcd

[Bitcoin]
bitcoin.regtest=true
bitcoin.node=bitcoind

[Bitcoind]
bitcoind.rpcuser=${BITCOIND_RPCUSER}
bitcoind.rpcpass=${BITCOIND_RPCPASS}
bitcoind.rpchost=${IP_OF_BITCOIND}:8332
bitcoind.zmqpubrawtx=tcp://${IP_OF_BITCOIND}:29001
bitcoind.zmqpubrawblock=tcp://${IP_OF_BITCOIND}:29002
bitcoind.estimatemode=ECONOMICAL

[tor]
tor.active=true
tor.v3=true

[db]
db.backend=etcd

[etcd]
db.etcd.host=127.0.0.1:2379
db.etcd.disabletls=1

[cluster]
cluster.enable-leader-election=1
cluster.leader-elector=etcd
cluster.etcd-election-prefix=cluster-leader
cluster.id=${HOSTNAME}

Logs:

14:43:44: Disconnected lndetcd1 from the network.

14:44:38 lndetcd2: [INF] LTND: Elected as leader (lndetcd2)

14:44:46 lndetcd1: [CRT] SRVR: Health check: leader status failed after 1 calls
14:44:46 lndetcd1: [INF] SRVR: Sending request for shutdown
14:44:46 lndetcd1: [INF] LTND: Received shutdown request.
14:44:46 lndetcd1: [INF] LTND: Shutting down...
14:44:46 lndetcd1: [INF] LTND: Systemd was notified about stopping
14:44:46 lndetcd1: [INF] LTND: Gracefully shutting down.
14:44:46 lndetcd1: [INF] NANN: Channel Status Manager shutting down...
14:44:48 lndetcd1: {"level":"warn","ts":"2024-07-29T14:44:48.658Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008fe540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:44:55 lndetcd1: {"level":"warn","ts":"2024-07-29T14:44:55.658Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:02 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:02.659Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008fe540/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:09 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:09.660Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:11.655Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008fe540/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:45:11 lndetcd1: [ERR] NANN: Unable to load active channels: no active channels exist
14:45:11 lndetcd1: [INF] HSWC: HTLC Switch shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=7
14:45:11 lndetcd1: [INF] HSWC: Onion processor shutting down...
14:45:11 lndetcd1: [INF] HSWC: Decaying hash log received shutdown request
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=12
14:45:11 lndetcd1: [INF] INVC: InvoiceRegistry shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=11
14:45:11 lndetcd1: [INF] CRTR: Channel Router shutting down...
14:45:11 lndetcd1: [INF] CNCT: ChainArbitrator shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=9
14:45:11 lndetcd1: [INF] FNDG: Funding manager shutting down...
14:45:11 lndetcd1: [INF] BRAR: Breach arbiter shutting down...
14:45:11 lndetcd1: [INF] UTXN: UTXO nursery shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=6
14:45:11 lndetcd1: [INF] DISC: Authenticated gossiper shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=10
14:45:11 lndetcd1: [INF] SWPR: Sweeper shutting down...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=5
14:45:11 lndetcd1: [INF] SWPR: TxPublisher stopping...
14:45:11 lndetcd1: [INF] NTFN: Cancelling epoch notification, epoch_id=4
14:45:11 lndetcd1: [INF] CHNF: ChannelNotifier shutting down...
14:45:11 lndetcd1: [INF] PRNF: PeerNotifier shutting down...
14:45:11 lndetcd1: [INF] HSWC: HtlcNotifier shutting down...
14:45:11 lndetcd1: [INF] CHBU: chanbackup.SubSwapper shutting down...
14:45:11 lndetcd1: [INF] NTFN: bitcoind notifier shutting down...
14:45:11 lndetcd1: [INF] NTFN: Stopping mempool notifier
14:45:11 lndetcd1: [ERR] NTFN: dead epoch stream in BestBlockTracker
14:45:11 lndetcd1: [INF] CHFT: ChannelEventStore shutting down...
14:45:11 lndetcd1: [ERR] HSWC: InterceptableSwitch stopped: block epoch stream stopped
14:45:23 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:23.661Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:23 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:23.661Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:30 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:30.686Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":3,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:37 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:37.687Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:41 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:41.661Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:45:41 lndetcd1: [INF] WTCL: (anchor) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
14:45:41 lndetcd1: [INF] WTCL: (taproot) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
14:45:44 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:44.690Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":4,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:51 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:51.691Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:45:58 lndetcd1: {"level":"warn","ts":"2024-07-29T14:45:58.689Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":5,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:05 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:05.690Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:11.663Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = Unknown desc = context deadline exceeded"}
14:46:12 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:12.690Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":6,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:19 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:19.692Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:26 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:26.692Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":7,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:33 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:33.693Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:40 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:40.694Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":8,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:41 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:41.673Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:46:41 lndetcd1: [INF] WTCL: (taproot) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
14:46:54 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:54.696Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:46:54 lndetcd1: {"level":"warn","ts":"2024-07-29T14:46:54.696Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":9,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:01 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:01.725Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:08 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:08.726Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":10,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:11.674Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:47:22 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:22.727Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:22 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:22.728Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":11,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:29 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:29.753Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":12,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:36 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:36.754Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:41 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:41.679Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:47:43 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:43.754Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":13,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:50 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:50.756Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:47:57 lndetcd1: {"level":"warn","ts":"2024-07-29T14:47:57.757Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":14,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:48:04 lndetcd1: {"level":"warn","ts":"2024-07-29T14:48:04.757Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:48:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:48:11.679Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0008ff6c0/127.0.0.1:2379","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
14:48:11 lndetcd1: [INF] HLCK: Health monitor shutting down...
14:48:11 lndetcd1: [INF] RPCS: Stopping RPC Server
14:48:11 lndetcd1: [INF] RPCS: Stopping VersionRPC Sub-RPC Server
14:48:11 lndetcd1: [INF] RPCS: Stopping RouterRPC Sub-RPC Server
14:48:11 lndetcd1: [INF] RPCS: Stopping WatchtowerClientRPC Sub-RPC Server
14:48:11 lndetcd1: [INF] TORC: Stopping tor controller
14:48:11 lndetcd1: [INF] LTND: Attempting to resign from leader role (lndetcd1)
14:48:11 lndetcd1: {"level":"warn","ts":"2024-07-29T14:48:11.758Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d36c0/127.0.0.1:2379","attempt":15,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
14:48:14 lndetcd1: [INF] LTND: Shutdown complete

As can be seen from the logs I provided, between 14:44:38 and 14:44:46, both lndetcd1 and lndetcd2 were active at the same time.

Also, lndetcd1 took more than 3 minutes to shut down while trying to interact with etcd. (14:44:46 - 14:48:14)

You can try setting --cluster.leader-session-ttl=30s for example to make sure that it times out (and fails over) faster. Note that going super low with the TTL is not recommended.

@bhandras bhandras requested a review from Roasbeef July 31, 2024 14:45
@Filiprogrammer
Contributor

> You can try setting --cluster.leader-session-ttl=30s for example to make sure that it times out (and fails over) faster. Note that going super low with the TTL is not recommended.

This did not solve the problem of the two nodes occasionally running simultaneously for a few seconds.

But setting healthcheck.leader.interval=30s instead, while leaving the default cluster.leader-session-ttl=60, fixed this.
With these settings, the time for the disconnected leader to initiate a shutdown is between 10-30 seconds, and the time for another node to take over is between 40-60 seconds after the previous leader was disconnected. So healthcheck.leader.interval should be at least 20 seconds lower than cluster.leader-session-ttl to prevent the two from overlapping. 20 seconds seems to be the time interval at which the etcd lease is kept alive.

Therefore, it might be a good idea to reduce the default value of healthcheck.leader.interval to 30 seconds, or instead increase cluster.leader-session-ttl to 90 seconds.

Member

@Roasbeef Roasbeef left a comment

Reviewed 16 of 16 files at r1, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @bhandras)

@Roasbeef Roasbeef added this to the v0.18.3 milestone Aug 1, 2024
@Roasbeef Roasbeef requested review from a team, Crypt-iQ and morehouse and removed request for a team and morehouse August 1, 2024 00:38
Collaborator

@Crypt-iQ Crypt-iQ left a comment

LGTM besides nits, also release notes

return false, err
}

return string(resp.Kvs[0].Value) == e.id, nil
Collaborator

should this be length checked?

Collaborator Author

Good idea, done.

Collaborator

should it also be done for Leader() above?

Collaborator Author

Added to Leader() as well.

I don't think it is needed for either case as we can assume that there is a leader session so the key exists in the DB, but just in case it's good to have this extra check to avoid crashing in case of some unwanted failure.
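
A minimal sketch of the kind of length check discussed in this thread, not the code that was merged. The `keyValue` and `getResponse` types are hypothetical stand-ins that mirror only the fields of the etcd response used in the reviewed line, and `isLeader` and `errNoLeaderKey` are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
)

// keyValue and getResponse are hypothetical stand-ins for the etcd client
// response types; they carry only the fields this sketch needs.
type keyValue struct{ Value []byte }

type getResponse struct{ Kvs []keyValue }

// errNoLeaderKey is an illustrative error for a response without any keys.
var errNoLeaderKey = errors.New("leader key not found")

// isLeader compares the stored leader ID with id. It returns an error rather
// than panicking on an out-of-range index when the response is empty.
func isLeader(resp *getResponse, id string) (bool, error) {
	if resp == nil || len(resp.Kvs) == 0 {
		return false, errNoLeaderKey
	}

	return string(resp.Kvs[0].Value) == id, nil
}

func main() {
	resp := &getResponse{Kvs: []keyValue{{Value: []byte("lndetcd1")}}}

	ok, err := isLeader(resp, "lndetcd1")
	fmt.Println(ok, err) // true <nil>

	ok, err = isLeader(&getResponse{}, "lndetcd1")
	fmt.Println(ok, err) // false leader key not found
}
```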

go func() {
defer p.wg.Done()
// Ignore the copy error due to the connection being closed.
_, _ = io.Copy(targetConn, conn)
Collaborator

very clever, didn't know you could do this. could be useful for simulating other things like network failure w/o calling the disconnect func

Collaborator Author

Yeah agreed, it could be useful for other tests too in the future.
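
For readers unfamiliar with the technique, here is a rough, self-contained sketch of a TCP proxy that can simulate a network partition by dropping its connections. It is not the proxy added in this PR: `partitionProxy`, `newPartitionProxy`, `Stop`, and `demo` are hypothetical names, and the echo server merely stands in for a real backend such as etcd.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync"
)

// partitionProxy is a hypothetical helper that forwards TCP traffic to
// target until Stop is called, which drops every proxied connection.
type partitionProxy struct {
	ln     net.Listener
	target string
	quit   chan struct{}
	wg     sync.WaitGroup
}

func newPartitionProxy(target string) (*partitionProxy, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	p := &partitionProxy{ln: ln, target: target, quit: make(chan struct{})}
	p.wg.Add(1)
	go p.acceptLoop()
	return p, nil
}

func (p *partitionProxy) acceptLoop() {
	defer p.wg.Done()
	for {
		conn, err := p.ln.Accept()
		if err != nil {
			return // The listener was closed by Stop.
		}
		p.wg.Add(1)
		go p.handle(conn)
	}
}

// handle pipes data in both directions and closes both ends on Stop.
func (p *partitionProxy) handle(conn net.Conn) {
	defer p.wg.Done()
	defer conn.Close()
	targetConn, err := net.Dial("tcp", p.target)
	if err != nil {
		return
	}
	defer targetConn.Close()
	// Ignore the copy errors due to the connections being closed.
	go func() { _, _ = io.Copy(targetConn, conn) }()
	go func() { _, _ = io.Copy(conn, targetConn) }()
	<-p.quit
}

// Stop simulates a partition and waits for the handlers to exit.
func (p *partitionProxy) Stop() {
	close(p.quit)
	p.ln.Close()
	p.wg.Wait()
}

// demo echoes "ping" through the proxy, cuts the link, and reports the
// echoed bytes and whether a read after the partition failed.
func demo() (string, bool) {
	backend, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer backend.Close()
	go func() { // Echo server standing in for the real backend.
		if c, err := backend.Accept(); err == nil {
			_, _ = io.Copy(c, c)
			c.Close()
		}
	}()
	proxy, err := newPartitionProxy(backend.Addr().String())
	if err != nil {
		panic(err)
	}
	conn, err := net.Dial("tcp", proxy.ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	buf := make([]byte, 4)
	if _, err := conn.Write([]byte("ping")); err != nil {
		panic(err)
	}
	if _, err := io.ReadFull(conn, buf); err != nil {
		panic(err)
	}
	echoed := string(buf)
	proxy.Stop()
	_, err = conn.Read(buf)
	return echoed, err != nil
}

func main() {
	echoed, cut := demo()
	fmt.Println(echoed, cut)
}
```

Because the client only ever talks to the proxy address, a test can cut the link at any point without touching the backend process itself.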

sample-lnd.conf Outdated
; check.
; healthcheck.leader.attempts=1

; The amount of time we should backoff between failed attempts of leader checks.
Collaborator

description wrong?

Collaborator Author

Fixed.

@bhandras bhandras force-pushed the etcd-leader-election-fixups branch from b98958a to 182fad6 Compare August 1, 2024 15:15
@bhandras
Collaborator Author

bhandras commented Aug 1, 2024

> > You can try setting --cluster.leader-session-ttl=30s for example to make sure that it times out (and fails over) faster. Note that going super low with the TTL is not recommended.
>
> This did not solve the problem of the two nodes occasionally running simultaneously for a few seconds.
>
> But setting healthcheck.leader.interval=30s instead, while leaving the default cluster.leader-session-ttl=60, fixed this. With these settings, the time for the disconnected leader to initiate a shutdown is between 10-30 seconds, and the time for another node to take over is between 40-60 seconds after the previous leader was disconnected. So healthcheck.leader.interval should be at least 20 seconds lower than cluster.leader-session-ttl to prevent the two from overlapping. 20 seconds seems to be the time interval at which the etcd lease is kept alive.
>
> Therefore, it might be a good idea to reduce the default value of healthcheck.leader.interval to 30 seconds, or instead increase cluster.leader-session-ttl to 90 seconds.

Increased the TTL to 90 seconds as our healthchecks are every minute at a minimum currently.

@bhandras bhandras force-pushed the etcd-leader-election-fixups branch 2 times, most recently from 36f9c18 to 4a579b8 Compare August 1, 2024 15:39
Previously our RPC calls to etcd would hang even in the case of properly
set dial timeouts and even if there was a network partition. To ensure
liveness we need to make sure that calls fail correctly in case of
system failure. To fix this we add a default timeout of 30 seconds to
each etcd RPC call.
This is to ensure that the added functionality works correctly and
should be removed once these changes are merged and the packages are
tagged.

This commit extends our healthcheck with an optional leader check. This
is to ensure that given network partition or other cluster wide failure
we act as soon as possible to avoid a split-brain situation where a new
leader is elected but we still hold onto our etcd client.
@bhandras bhandras force-pushed the etcd-leader-election-fixups branch from 4a579b8 to 037161e Compare August 1, 2024 17:04
@Roasbeef Roasbeef merged commit 6e9eb1d into lightningnetwork:master Aug 1, 2024
28 of 33 checks passed
Labels
database Related to the database/storage of LND etcd
Successfully merging this pull request may close these issues.

[bug]: lnd not resigning from leader role when disconnected from cluster