Exponential round timeouts cause very long restart times #261

mrwillis · 2021-11-30T23:36:58Z

Description

Related: #245.

We deliberately ran some faulty nodes to test the network's resiliency and noticed that the nodes take a very long time to get back to consensus, even once 2/3 of them are no longer byzantine.

This may well be why we thought our testnet was stuck (#232, #248). In reality, a network partition or some other bug in node discovery may have caused the initial stall, and the exponential round timeouts then made the recovery take a very long time.

Inspired by

Their implementation here: getamis/go-ethereum#99

Code here which is a WIP adapted for the SDK:

sx-network#2

The main issue is that if a cluster stops for, say, an hour (due to connectivity problems, etc.), it might take 3+ hours for the nodes to recover even if they are all still connected, giving the illusion of a stuck chain. This is not what a chain operator would expect.

The only way to "fix" it is to restart all the nodes, effectively resetting their round to 1.

Admittedly, this is technically "working as intended" in the code but not something you would expect.

Your environment

OS and version: Ubuntu 20
Version of the Polygon SDK: 7f2e61d
Branch that causes this issue: develop with adapted modes from above

Steps to reproduce

Create a 7-node cluster. Let it produce some blocks.
Make 3 of the nodes byzantine (don't gossip blocks, gossip wrong messages, etc.), which pushes the honest nodes below the 2/3 threshold.
The cluster should stop producing blocks.
Wait 10 minutes or so.
Replace the byzantine nodes with standard nodes.
Observe that it takes 30m+ for all nodes to reach the same round and produce blocks again.

Where the issue is, if you know
Which commands triggered the issue, if any

Expected behaviour

It should not take so long to recover and start producing blocks again.

Actual behaviour

It takes a very long time for the nodes to reach the same round and produce blocks.

Logs

2021-11-30T17:37:30.154-0500 [DEBUG] polygon.consensus.ibft: state change: new=CommitState
2021-11-30T17:37:30.154-0500 [INFO]  polygon.blockchain: write block: num=2680 parent=0xa145ad73ef2d8c399ce14712c068af2beba574288f0a991494ac6e9bcd718536
2021-11-30T18:13:28.291-0500 [INFO]  polygon.blockchain: write block: num=2681 parent=0x6a5c7127f8619b017887050065a198f0573eefdb713ffae70bd936558b8415ed
2021-11-30T18:13:38.377-0500 [INFO]  polygon.blockchain: write block: num=2682 parent=0x45d5213d7dcb10c7b62d69fbb4ea0a5b51cc0609da7e2332d5cb6a2c863a046a
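
The stall is visible as the gap between the writes of blocks 2680 and 2681. A small sketch for measuring it from the log timestamps follows. The helper name `gapBetween` is hypothetical, and the layout string is an assumption inferred from the log lines above.

```go
package main

import (
	"fmt"
	"time"
)

// logLayout matches timestamps such as 2021-11-30T17:37:30.154-0500.
// It is inferred from the log excerpt, not taken from the SDK.
const logLayout = "2006-01-02T15:04:05.000-0700"

// gapBetween parses two log timestamps and returns the time between them.
func gapBetween(earlier, later string) (time.Duration, error) {
	a, err := time.Parse(logLayout, earlier)
	if err != nil {
		return 0, err
	}
	b, err := time.Parse(logLayout, later)
	if err != nil {
		return 0, err
	}
	return b.Sub(a), nil
}

func main() {
	// Timestamps of the writes for blocks 2680 and 2681 in the logs above.
	gap, err := gapBetween("2021-11-30T17:37:30.154-0500", "2021-11-30T18:13:28.291-0500")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("gap between blocks 2680 and 2681:", gap)
}
```

For these two lines the gap is 35m58.137s, compared with the roughly 10 seconds between blocks 2681 and 2682.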

Proposed solution

We have some ideas here

#245

This was referenced Dec 1, 2021

Exponential round timeouts cause very long delays in recovering lost consesus #263

Merged

Consensus failure cycling AcceptState -> ValidateState -> RoundChangeState -> AcceptState #248

Closed

brkomir self-assigned this Dec 1, 2021

zivkovicmilos linked a pull request Dec 1, 2021 that will close this issue

Exponential round timeouts cause very long delays in recovering lost consesus #263

Merged

brkomir closed this as completed in #263 Dec 1, 2021
