This repository was archived by the owner on Dec 4, 2024. It is now read-only.
Exponential round timeouts cause very long restart times
Description
Related: #245.
We deliberately ran some faulty nodes to test the network's resiliency and noticed that the nodes take a very long time to get back to consensus, even after 2/3 of the nodes are no longer Byzantine.
This may well be why we thought our testnet was stuck (#232, #248). In reality, a network partition or some other bug in node discovery may have caused the initial stall, and the exponential round timeouts then made the restart take a very long time.
Inspired by
Their implementation here: getamis/go-ethereum#99
Code here which is a WIP adapted for the SDK:
sx-network#2
The main issue is that if a cluster stops for, say, an hour (due to connectivity problems, etc.), the nodes might take 3+ hours to recover even if they are all still connected, which gives the illusion of a stuck chain. This is not what a chain operator would expect.
The only way to "fix" it is to restart all the nodes, effectively resetting their round to 1.
Admittedly, this is technically "working as intended" in the code, but it is not behaviour you would expect.
Your environment
develop
with adapted modes from above
Steps to reproduce
Expected behaviour
It should not take so long to recover and start producing blocks again.
Actual behaviour
It takes a very long time for the nodes to reach the same round and produce blocks.
Logs
Proposed solution
We have some ideas here
#245