
agreement: handle pseudonode enqueueing failures #2741

Merged · 14 commits · Nov 17, 2021

Conversation

@tsachiherman (Contributor) commented Aug 13, 2021

Summary

This PR addresses two separate issues in the pseudonode implementation:

  1. pseudonodeVotesTask.execute and pseudonodeProposalsTask.execute do not handle vote-verification enqueueing failures (to the execution pool). This could leave the pseudonode processing goroutine stuck.
  2. pseudonodeVotesTask.execute could block forever if the output channel is not being read from.

Trigger

This issue was detected by the telemetry:

 ({num-occurrences}): {telemetry error string}
 (67): pseudonode.MakeVotes call failed(attest) pseudonode input channel is full

Test Plan

A few unit tests added; more to come.

@codecov-commenter commented Aug 13, 2021

Codecov Report

Merging #2741 (e81e4ad) into master (eea0a75) will increase coverage by 0.04%.
The diff coverage is 63.76%.


@@            Coverage Diff             @@
##           master    #2741      +/-   ##
==========================================
+ Coverage   47.43%   47.47%   +0.04%     
==========================================
  Files         369      369              
  Lines       59494    59532      +38     
==========================================
+ Hits        28221    28265      +44     
- Misses      27984    27986       +2     
+ Partials     3289     3281       -8     
Impacted Files Coverage Δ
agreement/pseudonode.go 71.80% <61.81%> (-0.35%) ⬇️
agreement/cryptoVerifier.go 75.17% <66.66%> (-0.56%) ⬇️
agreement/asyncVoteVerifier.go 89.83% <75.00%> (+8.01%) ⬆️
ledger/blockqueue.go 81.03% <0.00%> (-2.88%) ⬇️
network/requestTracker.go 71.12% <0.00%> (-0.44%) ⬇️
network/wsNetwork.go 63.04% <0.00%> (-0.30%) ⬇️
network/wsPeer.go 68.33% <0.00%> (+0.27%) ⬆️
ledger/acctupdates.go 65.96% <0.00%> (+0.95%) ⬆️
catchup/peerSelector.go 100.00% <0.00%> (+1.04%) ⬆️
catchup/service.go 70.57% <0.00%> (+1.24%) ⬆️
... and 4 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eea0a75...e81e4ad. Read the comment docs.

@tsachiherman tsachiherman marked this pull request as ready for review August 16, 2021 15:01
@tsachiherman tsachiherman requested review from cce and a user August 16, 2021 15:02
@tsachiherman tsachiherman mentioned this pull request Sep 2, 2021
@cce cce requested a review from zeldovich November 15, 2021 20:46
@@ -140,16 +140,18 @@ func (avv *AsyncVoteVerifier) verifyVote(verctx context.Context, l LedgerReader,
 	// if we're done while waiting for room in the requests channel, don't queue the request
 	req := asyncVerifyVoteRequest{ctx: verctx, l: l, uv: &uv, index: index, message: message, out: out}
 	avv.wg.Add(1)
-	if avv.backlogExecPool.EnqueueBacklog(avv.ctx, avv.executeVoteVerification, req, avv.execpoolOut) != nil {
+	if err := avv.backlogExecPool.EnqueueBacklog(avv.ctx, avv.executeVoteVerification, req, avv.execpoolOut); err != nil {
@cce (Contributor) commented Nov 16, 2021

Interestingly, inside of EnqueueBacklog it is also doing a select on case <-avv.ctx.Done() (passed in as enqueueCtx, alongside a select for backlog.ctx.Done()) which will cause EnqueueBacklog to return ctx.Err(). If you have two pending selects like that on the stack, is it deterministic which one will fire first?

Contributor:

There is this comment warning "instead, enqueue so the worker will set the error value and return the cancelled vote properly." However if the waiting inside EnqueueBacklog() is interrupted by one of the Done()s it's waiting on, it will never reach the worker.

Contributor Author:

The fact that we have two pending selects isn't an issue. The one in verifyVote would be evaluated when we get to that point, and if it's not canceled yet, it would go to the default statement, executing EnqueueBacklog, where it would also block on the same channel.

As for the comment, I think it's a bug - I think the intent was to have a fallthrough here. But I would leave that to a separate PR, as it's not really related to this change. (i.e. in our case, we're passing a TODO context, which would never expire, and therefore this statement would never be executed)

Contributor:

there are other callers (unauthenticatedBundle.verify and cryptoVerifier.voteFillWorker) passing a valid context into verifyVote — just not pseudonode

Contributor Author:

Yes, I'm aware of that - that's why I did not attempt to change this code ;-)

@tsachiherman tsachiherman requested a review from cce November 17, 2021 16:59
@cce (Contributor) left a comment

LGTM. Ideally we should only see these new log messages and errors around shutdown time or if something is broken.

@tsachiherman (Contributor, Author), replying:

> LGTM. Ideally we should only see these new log messages and errors around shutdown time or if something is broken.

yes, that's what I'm hoping for.
