Discussion: Questions on exponentially-increasing IBFT randomTimeout()
Description
While investigating a chain halt that occurred on our testnet earlier this week, we found that restarting all validators fixed the chain: more messages were sent in a shorter timespan, allowing nodes to transition out of each round faster (since the required messageQueue length thresholds were met sooner). This sparked some internal discussion, which led to a few questions about randomTimeout():
Why use exponentially-increasing timeouts in the first place? We realize the original geth fork the SDK is based on uses this calculation as well, but we couldn't find any indication of why a fixed or linearly-increasing timeout wasn't used instead (see https://github.com/getamis/go-ethereum/blob/c7547381b2ea8999e423970d619835c662176790/consensus/istanbul/core/core.go#L316-L329). Was this to prevent multiple nodes from changing state at the same time?
Could introducing a new flag to cap the timeout at a specified maximum value alleviate the problem? If the maximum randomTimeout() were 10 minutes, for example, a chain would be able to recover from a halt within hours instead of what could be days (or more). Shouldn't the goal be to recover the chain as quickly as possible if a halt were to occur?
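For illustration only, here is a minimal Go sketch of the pattern we are describing, assuming a 10-second base timeout and a hypothetical 10-minute cap (the constants and function names below are ours, not the SDK's actual configuration). It shows how a per-round timeout that grows as roughly base + 2^round seconds reaches hours and then days within a couple dozen failed rounds, and how a capped variant would bound the per-round wait:

package main

import (
	"fmt"
	"math"
	"time"
)

// Illustrative values only; not actual SDK config fields.
const (
	baseTimeout = 10 * time.Second
	maxTimeout  = 10 * time.Minute // the proposed cap
)

// exponentialTimeout mirrors the exponential back-off pattern used for IBFT
// round changes: the wait roughly doubles with every failed round.
func exponentialTimeout(round uint64) time.Duration {
	timeout := baseTimeout
	if round > 0 {
		timeout += time.Duration(math.Pow(2, float64(round))) * time.Second
	}
	return timeout
}

// cappedTimeout applies the proposed maximum so that a long-lived halt
// never waits more than maxTimeout before the next round change.
func cappedTimeout(round uint64) time.Duration {
	t := exponentialTimeout(round)
	if t > maxTimeout {
		return maxTimeout
	}
	return t
}

func main() {
	for _, r := range []uint64{0, 5, 10, 15, 20} {
		fmt.Printf("round %2d: exponential=%v capped=%v\n",
			r, exponentialTimeout(r), cappedTimeout(r))
	}
}

By round 20 the uncapped timeout is over 12 days, which is why a long halt takes so long to recover from without restarting validators to reset the round counter; the capped variant would keep every round-change attempt within a bounded, operator-chosen window.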
Leaving this open as a discussion. Thanks in advance, and as always, we appreciate the hard work.
Your environment
OS and version: Ubuntu 20
Version of the Polygon SDK:
Branch that causes this issue: develop