agreement: handle pseudonode enqueueing failures #2741
Codecov Report
@@ Coverage Diff @@
## master #2741 +/- ##
==========================================
+ Coverage 47.43% 47.47% +0.04%
==========================================
Files 369 369
Lines 59494 59532 +38
==========================================
+ Hits 28221 28265 +44
- Misses 27984 27986 +2
+ Partials 3289 3281 -8
@@ -140,16 +140,18 @@ func (avv *AsyncVoteVerifier) verifyVote(verctx context.Context, l LedgerReader,
 	// if we're done while waiting for room in the requests channel, don't queue the request
 	req := asyncVerifyVoteRequest{ctx: verctx, l: l, uv: &uv, index: index, message: message, out: out}
 	avv.wg.Add(1)
-	if avv.backlogExecPool.EnqueueBacklog(avv.ctx, avv.executeVoteVerification, req, avv.execpoolOut) != nil {
+	if err := avv.backlogExecPool.EnqueueBacklog(avv.ctx, avv.executeVoteVerification, req, avv.execpoolOut); err != nil {
Interestingly, inside of EnqueueBacklog it is also doing a select on case <-avv.ctx.Done() (passed in as enqueueCtx, alongside a select for backlog.ctx.Done()) which will cause EnqueueBacklog to return ctx.Err(). If you have two pending selects like that on the stack, is it deterministic which one will fire first?
There is this comment warning "instead, enqueue so the worker will set the error value and return the cancelled vote properly." However if the waiting inside EnqueueBacklog() is interrupted by one of the Done()s it's waiting on, it will never reach the worker.
The fact that we have two pending selects isn't an issue. The one in verifyVote would be evaluated when we get to that point, and if it's not canceled yet, it would go to the default statement, executing EnqueueBacklog, where it would also block on the same channel.
As for the comment, I think it's a bug - I think that the intent was to have a fallthrough here. But I would leave that to a separate PR, as it's not really related to this change (i.e. in our case, we're passing a TODO context, which would never expire and therefore this statement would never be executed).
there are other callers (unauthenticatedBundle.verify and cryptoVerifier.voteFillWorker) passing a valid context into verifyVote — just not pseudonode
yes, I'm aware of that - that's why I did not attempt to change this code ;-)
LGTM. Ideally we should only see these new log messages and errors around shutdown time or if something is broken.
yes, that's what I'm hoping for.
Summary
This PR addresses two separate issues in the pseudonodeNode implementation:
- pseudonodeVotesTask.execute and pseudonodeProposalsTask.execute do not handle vote verification enqueueing failures (to the execution pool). This could lead to the pseudonode processing goroutine being stuck.
- pseudonodeVotesTask.execute could block forever in case the output channel is not being read from.
Trigger
This issue was detected by the telemetry:
Test Plan
A few unit tests added. More to come.