-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
routing: cancelable payment loop #8734
routing: cancelable payment loop #8734
Conversation
48cfd5b
to
1d6e008
Compare
dcdd913
to
02072fa
Compare
02072fa
to
e2cfd30
Compare
e2cfd30
to
4fa701d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
Only one blocking comment around removing the context.Context
nil check.
The other comment is just a suggestion
0bfb3db
to
2ab2bd9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🚀
2ab2bd9
to
4f18995
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have timeout already? Why do we need this? I think once the payment is sent, there's no easy way to cancel it, maybe related #5461
That's also my understanding, we can't cancel the payment once the htlc is sent. Here, additionally to the existent timeout check that interrupts the payment loop on expiry, we pass in a context (from the rpc server client stream) that is checked for cancellation and interrupts the payment loop if the user cancelled the stream context(Ctrl+C'd the sendpayment from cli). This can only happen if a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code wise this looks good 🔥! Wanted to note that this code may break usage patterns, if users expect to be able to dispatch payments in the background with lncli sendpayment
or somehow if SendPaymentV2
is not waited on for responses. This would not be possible anymore with the current setting as it stops the payment loop in the next occasion if the context is cancelled. If the cmd is interrupted, this will lead to cancellation of the payment loop, as expected. Should we retain backward compatibility by adding a bool
cancellable
to the payment request (we could then decide whether we'd like to forward the context or to initialize a new background one)?
routing/payment_lifecycle.go
Outdated
@@ -319,27 +322,42 @@ lifecycle: | |||
} | |||
|
|||
// checkTimeout checks whether the payment has reached its timeout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to update the docs too
routing/payment_lifecycle.go
Outdated
select { | ||
case <-p.timeoutChan: | ||
log.Warnf("payment attempt not completed before timeout") | ||
err := failPayment(channeldb.FailureReasonTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can instead do return failPayment(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
routing/payment_lifecycle.go
Outdated
case <-p.router.quit: | ||
return fmt.Errorf("check payment timeout got: %w", | ||
ErrRouterShuttingDown) | ||
|
||
// Fall through if we haven't hit our time limit. | ||
// Fall through if we haven't hit our time limit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think gofmt
failed here
routing/payment_lifecycle.go
Outdated
} | ||
|
||
case <-ctx.Done(): | ||
log.Warnf("payment attempt context canceled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need cap the log and add a p.identifier
.
routing/payment_lifecycle_test.go
Outdated
@@ -179,6 +180,34 @@ func sendPaymentAndAssertFailed(t *testing.T, | |||
} | |||
} | |||
|
|||
// sendPaymentAndAssertFailed calls `resumePayment` and asserts that an error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// sendPaymentAndAssertFailed calls `resumePayment` and asserts that an error | |
// sendPaymentAndAssertContextCancelled calls `resumePayment` and asserts that an error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've called it sendPaymentAndAssertError
.
routing/payment_lifecycle_test.go
Outdated
@@ -179,6 +180,34 @@ func sendPaymentAndAssertFailed(t *testing.T, | |||
} | |||
} | |||
|
|||
// sendPaymentAndAssertFailed calls `resumePayment` and asserts that an error | |||
// is returned. | |||
func sendPaymentAndAssertContextCancelled(t *testing.T, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like it's the same as sendPaymentAndAssertFailed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've consolidated the two.
routing/payment_lifecycle.go
Outdated
return err | ||
} | ||
|
||
return ctx.Err() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately we cannot return an error here, as it'd cause the payment loop to exit. For the timeout case, we never exit the loop. Consider this case, say a payment is made of two HTLC attempts,
HTLC1 ---> hop1 ---> remote
HTLC2 ---> hop2 ---> remote
Then remote settles the invoice, but hop2 is offline, we'd end up in this case,
HTLC1 ---> hop1 ---> remote ---> settled
HTLC2 ---> hop2 ---> remote ---> pending
After 60s, we'd time out this payment, but we won't quit the loop as we still think the payment is inflight,
lnd/channeldb/payment_status.go
Line 148 in 9d358bc
// | true | true | false | true | StatusInFlight | |
Then hop2 comes online and settles it, we'd consider the payment successful.
lnd/channeldb/payment_status.go
Line 156 in 9d358bc
// | false | true | false | true | StatusSucceeded | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, once the HTLC is sent, there's no easy way to cancel it unless you get a clear response from your peer saying it's failed/settled. If we exit here because the context is canceled, the payment would still have the status StatusInFlight
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the detailed explanation. I see why we can't interrupt the entire payment loop but instead behave exactly how we do it the timeout case.
4f18995
to
215b4c9
Compare
215b4c9
to
9a3fa63
Compare
Thanks to the reviewers so far. I think I addressed all your concerns, namely:
I tested different scenarios locally and didn't encounter an error, additionally to the added test. Looking forward to another round of feedback. |
lnrpc/routerrpc/router_server.go
Outdated
|
||
// Cancel the timeout context. If the context already timed out or if | ||
// there was no timeout provided, this will be a no-op. | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't cancel the payment directly because trackPayment
is blocking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call is to avoid a linter issue that would arise if I wrote above ctx, _ = context.WithDeadline(ctx, timeout)
. And if I assign the cancel I have to call it. The right place seems to be after trackPayment
because then the context can be canceled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an alternative could be to place it after sendPayment
in SendPaymentAsync
, but not sure that's a good pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also silence the linter here with a comment as to why we don't need the context in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another opinion here would be appreciated (if the context cancelation may be done within the SendPaymentAsync
goroutine as a defer statement).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we call cancel here I think it will immediately invoke <-ctx.Done
in checkContext
since it's async sending. I think we can instead decide the right context to use here (stream.Context
or background ctx), pass it to SendPaymentAsync -> sendPayment
, and inside sendPayment
we create the timeout ctx, and call the defer cancel()
there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout thing may also be moved to the payment session, but that's a bigger refactor so future PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we call cancel here I think it will immediately invoke
<-ctx.Done
incheckContext
since it's async sending.
Before cancel()
is called there is the s.trackPayment
call which is blocking. So I think cancel()
is only called once the track payment call ends which implies that also the async payment loop ended. I think the cancel would then be correct here. We could add clarifying docs here.
Could also create the timeoutCtx in sendPayment
. instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before cancel() is called there is the s.trackPayment call which is blocking.
Gotya - yea to me it feels "safer" to do this in sendPayment
, so any future changes in s.trackPayment
won't affect the behavior of sending payments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I will change this.
routing/payment_lifecycle.go
Outdated
// checkContext checks whether the payment context has been canceled. | ||
// Cancellation occurs manually or if the context times out. | ||
func (p *paymentLifecycle) checkContext(ctx context.Context) error { | ||
failPayment := func(reason channeldb.FailureReason) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be inlined again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
failPayment
is still present :)
83eb67a
to
09592bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tACK LGTM ⚡. Perhaps the error could be changed to a timeout error and there's an open question concerning the cancel
call.
routing/payment_lifecycle.go
Outdated
// checkContext checks whether the payment context has been canceled. | ||
// Cancellation occurs manually or if the context times out. | ||
func (p *paymentLifecycle) checkContext(ctx context.Context) error { | ||
failPayment := func(reason channeldb.FailureReason) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
failPayment
is still present :)
lnrpc/routerrpc/router_server.go
Outdated
|
||
// Cancel the timeout context. If the context already timed out or if | ||
// there was no timeout provided, this will be a no-op. | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another opinion here would be appreciated (if the context cancelation may be done within the SendPaymentAsync
goroutine as a defer statement).
8acd646
to
6bcac70
Compare
lnrpc/routerrpc/router_server.go
Outdated
|
||
// Cancel the timeout context. If the context already timed out or if | ||
// there was no timeout provided, this will be a no-op. | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we call cancel here I think it will immediately invoke <-ctx.Done
in checkContext
since it's async sending. I think we can instead decide the right context to use here (stream.Context
or background ctx), pass it to SendPaymentAsync -> sendPayment
, and inside sendPayment
we create the timeout ctx, and call the defer cancel()
there.
lnrpc/routerrpc/router_server.go
Outdated
|
||
// Cancel the timeout context. If the context already timed out or if | ||
// there was no timeout provided, this will be a no-op. | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout thing may also be moved to the payment session, but that's a bigger refactor so future PRs.
err := p.router.cfg.Control.FailPayment( | ||
p.identifier, channeldb.FailureReasonTimeout, | ||
) | ||
reason := channeldb.FailureReasonTimeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
routing/payment_lifecycle.go
Outdated
if err != nil { | ||
return fmt.Errorf("FailPayment got %w", err) | ||
} | ||
|
||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't think we need to return here
@@ -318,23 +312,36 @@ lifecycle: | |||
return [32]byte{}, nil, *failure | |||
} | |||
|
|||
// checkTimeout checks whether the payment has reached its timeout. | |||
func (p *paymentLifecycle) checkTimeout() error { | |||
// checkContext checks whether the payment context has been canceled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or timed out...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second sentence clarifies that.
6bcac70
to
e63375a
Compare
13a9343
to
21b64f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, and needs a rebase!
@@ -19,14 +19,23 @@ | |||
|
|||
# Bug Fixes | |||
|
|||
* [SendPaymentV2]() now cancels the background payment loop if the user cancels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing the PR link
"timeout, id=%s", p.identifier.String()) | ||
} else { | ||
log.Warnf("Payment attempt context canceled, id=%s", | ||
p.identifier.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a new failure FailureReasonCanceled
to record the failure reason explicitly - think it will save us dome debugging timeout in the future. Also prefer doing it in a commit, or better a new PR since this PR is already very complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, I addressed this in this follow-up PR: #8836. There are unit and itests failing but I think they are unrelated to this PR.
21b64f5
to
1e6e323
Compare
Just merged the PR that fixed the itest, could you do a rebase and let it run again, thanks! |
1e6e323
to
da0b724
Compare
There still seems to be an unrelated error left: https://github.com/lightningnetwork/lnd/actions/runs/9501738226/job/26188011967?pr=8734#step:6:468 EDIT: Seems to be a flake which is not appearing in the latest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that test is a flake, and thanks for the PR! LGTM☄️
da0b724
to
4568dfc
Compare
In this commit we set up the payment loop context according to user-provided parameters. The `cancelable` parameter indicates whether the user is able to interrupt the payment loop by cancelling the server stream context. We'll additionally wrap the context in a deadline if the user provided a payment timeout. We remove the timeout channel of the payment_lifecycle.go and in favor of the deadline context.
Fixes #8534.
While interrupting
lncli estimateroutefee
ends the client program, the route server is left behind with an abandoned payment loop.To allow users to cancel payment requests properly, the payment loop in
resumePayment
has to be abandoned as well as soon as the user request it.lnd/routing/payment_lifecycle.go
Line 170 in f523f52
This PR makes the
SendPaymentV2
streaming context accessible from the payment loop and allows it to safely end if the context was cancelled.