multi: check leader status with our health checker to correctly shut down LND if network partitions #8938
Conversation
Force-pushed from 1e767fd to 5ebb557
@@ -207,6 +207,12 @@ replace google.golang.org/protobuf => github.com/lightninglabs/protobuf-go-hex-d
 // Temporary replace until the next version of sqldb is tagged.
 replace github.com/lightningnetwork/lnd/sqldb => ./sqldb

+// Temporary replace until the next version of healthcheck is tagged.
+replace github.com/lightningnetwork/lnd/healthcheck => ./healthcheck
Marker commit to make sure we remove the pins.
Tested this on a 3 node regtest cluster.

lnd.conf:

[Application Options]
listen=0.0.0.0:9735
alias=lndetcd

[Bitcoin]
bitcoin.regtest=true
bitcoin.node=bitcoind

[Bitcoind]
bitcoind.rpcuser=${BITCOIND_RPCUSER}
bitcoind.rpcpass=${BITCOIND_RPCPASS}
bitcoind.rpchost=${IP_OF_BITCOIND}:8332
bitcoind.zmqpubrawtx=tcp://${IP_OF_BITCOIND}:29001
bitcoind.zmqpubrawblock=tcp://${IP_OF_BITCOIND}:29002
bitcoind.estimatemode=ECONOMICAL

[tor]
tor.active=true
tor.v3=true

[db]
db.backend=etcd

[etcd]
db.etcd.host=127.0.0.1:2379
db.etcd.disabletls=1

[cluster]
cluster.enable-leader-election=1
cluster.leader-elector=etcd
cluster.etcd-election-prefix=cluster-leader
cluster.id=${HOSTNAME}

Logs:
As can be seen from the logs I provided, both lndetcd1 and lndetcd2 were active at the same time between 14:44:38 and 14:44:46. Also, lndetcd1 took more than 3 minutes (14:44:46 to 14:48:14) to shut down while trying to interact with etcd.
Force-pushed from 5ebb557 to b98958a
You can try setting
This did not solve the problem of the two nodes occasionally running simultaneously for a few seconds. But setting Therefore, it might be a good idea to reduce the default value of
Reviewed 16 of 16 files at r1, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @bhandras)
LGTM besides nits; also needs release notes.
	return false, err
}

return string(resp.Kvs[0].Value) == e.id, nil
should this be length checked?
Good idea, done.
should it also be done for Leader() above?
Added to Leader() as well.
I don't think it is needed in either case, as we can assume that there is a leader session and so the key exists in the DB. But it's good to have this extra check just in case, to avoid crashing on some unexpected failure.
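The agreed-upon guard might look like the following self-contained sketch; `kv` and `isLeader` are hypothetical stand-ins for the elector's handling of the etcd range response (`resp.Kvs`), not the actual lnd code:

```go
package main

import "fmt"

// kv is a hypothetical stand-in for etcd's mvccpb.KeyValue; in the real
// elector the slice comes from a clientv3 range response (resp.Kvs).
type kv struct {
	Key, Value []byte
}

// isLeader reports whether the election key's value matches our own id.
// Checking len(kvs) first avoids an index-out-of-range panic if the
// leader key is unexpectedly absent, e.g. after some unwanted failure.
func isLeader(kvs []kv, id string) (bool, error) {
	if len(kvs) == 0 {
		return false, fmt.Errorf("leader election key not found")
	}

	return string(kvs[0].Value) == id, nil
}

func main() {
	ok, err := isLeader([]kv{{Value: []byte("node-1")}}, "node-1")
	fmt.Println(ok, err) // true <nil>

	_, err = isLeader(nil, "node-1")
	fmt.Println(err) // leader election key not found
}
```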
go func() {
	defer p.wg.Done()

	// Ignore the copy error due to the connection being closed.
	_, _ = io.Copy(targetConn, conn)
Very clever, I didn't know you could do this. It could be useful for simulating other things like network failure without calling the disconnect func.
Yeah agreed, it could be useful for other tests too in the future.
sample-lnd.conf (Outdated)

; check.
; healthcheck.leader.attempts=1

; The amount of time we should backoff between failed attempts of leader checks.
description wrong?
Fixed.
Force-pushed from b98958a to 182fad6
Increased the TTL to 90 seconds, as our health checks currently run at most once per minute.
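Since the lease TTL must comfortably exceed the check interval, the relevant sample-lnd.conf section might look like the sketch below. Only `healthcheck.leader.attempts` and the backoff description appear verbatim in this review; the other option names and default values are assumptions that mirror the existing `healthcheck.*` option families and should be verified against the tagged sample-lnd.conf:

```
[healthcheck]
; How often we check our leader status (assumed name and default).
; healthcheck.leader.interval=1m
; Timeout for a single leader status check (assumed name and default).
; healthcheck.leader.timeout=30s
; The amount of time we should backoff between failed attempts of leader
; checks (assumed name and default).
; healthcheck.leader.backoff=30s
; The number of attempts per leader check.
; healthcheck.leader.attempts=1
```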
Force-pushed from 36f9c18 to 4a579b8
Previously our RPC calls to etcd could hang during a network partition, even with properly set dial timeouts. To ensure liveness we need calls to fail promptly in case of system failure, so this change adds a default timeout of 30 seconds to each etcd RPC call.
This is to ensure that the added functionality works correctly and should be removed once these changes are merged and the packages are tagged.
This commit extends our healthcheck with an optional leader check. This is to ensure that, given a network partition or other cluster-wide failure, we act as soon as possible to avoid a split-brain situation where a new leader is elected while we still hold onto our etcd client.
Force-pushed from 4a579b8 to 037161e
Change Description
LND currently holds the leader lease until shutdown, at which point it will resign. In some scenarios, it may be desirable for LND to relinquish leadership and shut down if it becomes partitioned from the etcd cluster. This PR aims to implement this behavior by adding a leader status check to the existing health checks, which will verify the leader status every minute. To prevent hanging due to network issues, we also introduce reasonable timeouts for etcd calls. This allows for a clean shutdown upon a request from the health check module.
Fixes: #8913
Steps to Test
make itest backend=bitcoind dbbackend=etcd icase=leader_health_check