This repository was archived by the owner on Dec 4, 2024. It is now read-only.

Exponential round timeouts cause very long restart times #261

Closed
mrwillis opened this issue Nov 30, 2021 · 0 comments · Fixed by #263

mrwillis commented Nov 30, 2021

Description

Related: #245.

We deliberately made some nodes faulty to test the resiliency of the cluster and noticed that, even after 2/3 of the nodes are no longer byzantine, it takes a very long time for the nodes to get back to consensus.

It's quite possible this is why we thought our testnet was stuck (#232, #248): in reality, a network partition or some other bug in node discovery may have caused the initial stall, and this issue then made the restart take a very long time.

Inspired by

image

Their implementation here: getamis/go-ethereum#99

WIP code adapted for the SDK is here: sx-network#2

The main issue is that if a cluster stalls for, say, an hour (due to connectivity problems, etc.), it might take 3+ hours for the nodes to recover even if they are all still connected, giving the illusion of a stuck chain. This is not what a chain operator would expect.

The only way to "fix" it is to restart all the nodes, which effectively resets their round to 1.

Admittedly, this is technically "working as intended" in the code but not something you would expect.

Your environment

  • OS and version: Ubuntu 20
  • Version of the Polygon SDK: 7f2e61d
  • Branch that causes this issue: develop with adapted modes from above

Steps to reproduce

  1. Create a 7 node cluster. Let it produce some blocks.
  2. Make 3 of the nodes byzantine (don't gossip blocks, gossip wrong messages etc) which will push it below the 2/3 threshold.
  3. The cluster should stop producing blocks.
  4. Wait 10 minutes or so.
  5. Replace the byzantine node with a standard node
  6. Observe it takes 30m+ to get all nodes to reach the same round and produce blocks again.
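
The threshold arithmetic behind steps 2 and 3 can be sketched as follows. This assumes the usual BFT bound of f = floor((n-1)/3) tolerated faulty nodes and a quorum of 2f+1; the SDK's exact quorum formula may differ, and the function names are hypothetical.

```go
package main

import "fmt"

// maxFaulty returns the number of byzantine nodes a cluster of n validators
// can tolerate under the assumed bound f = floor((n-1)/3).
func maxFaulty(n int) int {
	return (n - 1) / 3
}

// quorum returns the assumed number of honest votes needed: 2f+1.
func quorum(n int) int {
	return 2*maxFaulty(n) + 1
}

// canReachConsensus reports whether the honest nodes alone can form a quorum.
func canReachConsensus(total, byzantine int) bool {
	return total-byzantine >= quorum(total)
}

func main() {
	const total = 7
	fmt.Printf("n=%d tolerates f=%d, quorum=%d\n", total, maxFaulty(total), quorum(total))
	for _, byzantine := range []int{3, 2} {
		fmt.Printf("%d byzantine -> consensus possible: %v\n",
			byzantine, canReachConsensus(total, byzantine))
	}
}
```

With 7 nodes this gives a quorum of 5, so 3 byzantine nodes leave only 4 honest ones and block production halts, while bringing a single node back to honest behaviour is already enough to restore a quorum in principle.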

Expected behaviour

It should not take so long to recover and start producing blocks again.

Actual behaviour

It takes a very long time for the nodes to reach the same round and produce blocks.

Logs

2021-11-30T17:37:30.154-0500 [DEBUG] polygon.consensus.ibft: state change: new=CommitState
2021-11-30T17:37:30.154-0500 [INFO]  polygon.blockchain: write block: num=2680 parent=0xa145ad73ef2d8c399ce14712c068af2beba574288f0a991494ac6e9bcd718536
2021-11-30T18:13:28.291-0500 [INFO]  polygon.blockchain: write block: num=2681 parent=0x6a5c7127f8619b017887050065a198f0573eefdb713ffae70bd936558b8415ed
2021-11-30T18:13:38.377-0500 [INFO]  polygon.blockchain: write block: num=2682 parent=0x45d5213d7dcb10c7b62d69fbb4ea0a5b51cc0609da7e2332d5cb6a2c863a046a

Proposed solution

We have some ideas in #245.
