
Only update max size of undo_db and fork_db after successfully pushed a block #1333

Closed
abitmore opened this issue Sep 17, 2018 · 25 comments

Comments

@abitmore
Member

abitmore commented Sep 17, 2018

Bug Description
There is a corner case that would cause chain reorganization to fail.

update_global_dynamic_data() is called in the middle of _apply_block():

update_global_dynamic_data( next_block, missed );

Max size of both fork db and undo db is updated inside update_global_dynamic_data():

_undo_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 1 );
_fork_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 1 );

If the max size shrinks or stays the same, the oldest (smallest-numbered) reversible block will be removed from the undo db and the fork db. If an exception is then thrown after update_global_dynamic_data() in _apply_block(), the changes made to the object database will be reverted, but the removed block won't be put back into the fork db or the undo db. In that case, the chain would be unable to reorganize onto another fork that does not contain the removed block.
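For illustration, here is a minimal sketch of the fix suggested by the title: only resize the databases after the block has been pushed successfully. Names and signatures below are simplified and hypothetical, not actual BitShares Core code.

```cpp
// Hypothetical, simplified sketch; not actual BitShares Core code.
// The resize is deferred until the block is known to have applied cleanly,
// so a block that throws can no longer shrink undo_db / fork_db as a side effect.
void database::push_block_sketch( const signed_block& new_block )
{
   auto session = _undo_db.start_undo_session();
   apply_block_sketch( new_block );   // may throw; if it does, nothing below runs
   session.commit();

   // Only after success: update the max sizes based on the new head and LIB.
   const auto& dgp = get_dynamic_global_properties();
   uint32_t max_size = dgp.head_block_number - dgp.last_irreversible_block_num + 1;
   _undo_db.set_max_size( max_size );
   _fork_db.set_max_size( max_size );
}
```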

By the way, there is another scenario described in steemit/steem#2911.

Expected Behavior
Always be able to reorganize the chain onto any fork that branches off after LIB.

Impacts
Describe which portion(s) of BitShares Core may be impacted by this bug. Please tick at least one box.

  • API (the application programming interface)
  • Build (the build process or something prior to compiled code)
  • CLI (the command line wallet)
  • Deployment (the deployment process after building such as Docker, Travis, etc.)
  • DEX (the Decentralized EXchange, market engine, etc.)
  • P2P (the peer-to-peer network for transaction/block propagation)
  • Performance (system or user efficiency, etc.)
  • Protocol (the blockchain logic, consensus, validation, etc.)
  • Security (the security of system or user data, etc.)
  • UX (the User Experience)
  • Other (please add below)

CORE TEAM TASK LIST

  • Evaluate / Prioritize Bug Report
  • Refine User Stories / Requirements
  • Define Test Cases
  • Design / Develop Solution
  • Perform QA/Testing
  • Update Documentation
abitmore added the security, bug, 3d Bug Classification (indicating the existing implementation does not match the intention of the design) and 6 Security Impact (flag identifying system/user security) labels Sep 17, 2018
@abitmore abitmore added this to the Future Feature Release milestone Sep 17, 2018
@abitmore
Member Author

Related: steemit/steem#2911 (different scenario).

@abitmore
Member Author

Steem PR: steemit/steem#2914 (also contains other changes, may or may not apply to BitShares).

@pmconrad
Contributor

Moving the checkpointing code to push_block means checkpoints are ignored during most of replay. I don't think we should do this.

@abitmore
Member Author

Just FYI, my patch for Steem didn't move the checkpointing code; instead, I added one in push_block. It brought better performance when syncing. https://github.com/abitmore/steem/commits/20180917-fix1

@jmjatlanta
Contributor

Please take a look over these ramblings and see if I am on the right track:

  1. The problem surfaces when the max size of _undo_db is decreased while there are (at least) 2 forks.
  2. The size of _undo_db is decreased when the LIB moves forward (among other times, but this is the scenario I can wrap my head around).
  3. A poorly timed error while pushing a block can trigger a chain reorganization, but a previous resize of _undo_db means that a block (or blocks) near LIB may no longer exist in _undo_db.

So, a test of this scenario would require
a) 2 forks
b) a movement forward of LIB to reduce undo_db and cause loss of data
c) an attempt to push a bad block on 1 of the forks to cause a reorganization of the chain

Am I close?

@pmconrad
Contributor

pmconrad commented Oct 19, 2018

The critical part is that

  • LIB is on a block common to both forks
  • a block X arrives on fork A that moves LIB forward to a block that is not common to both forks
  • X fails to apply after update_global_dynamic_data() -> LIB has moved already.
  • A block Y arrives on fork B. Y cannot be applied because the node can't roll back to the point before the fork, so the node is stuck.

I think this is close to impossible in practice. Fork B can only become valid if you replace 1/3 of the witnesses on that fork only, or when 1/3 of the witnesses are double-signing.

@jmjatlanta
Contributor

jmjatlanta commented Oct 19, 2018

Thank you for clearing that up. I was thinking about subsequent blocks coming in, but was getting lost: how can one fork have a LIB and another not? That would mean the LIB is reversible, which should be impossible.

Now I have it straight. A block comes in that moves LIB onto one fork; we tentatively accepted it and adjusted the size of undo_db, and then hit a problem with that same block. Now we don't have what we need to switch forks.

I will attempt to write a test for that scenario. I'll give it a lower priority, due to the low chance of this occurring.

@jmjatlanta
Contributor

jmjatlanta commented Oct 23, 2018

The calculation for LIB comes after the calculation for the _undo_db and _fork_db size. I believe that means that if the current block moves the LIB forward, it will not be until the next block that the blocks are prematurely removed from _undo_db.

So my latest hypothesis is that this problem cannot be caused by a block moving the LIB forward. If that hypothesis is correct, I ask the following 2 questions:

  1. What other processes can cause the _undo_db to shrink, other than movement of LIB?
  2. Could shrinking the size of _undo_db on the next block instead of the current block cause a problem?

I believe the answer to question 2 is no, as it has been working that way.

@pmconrad
Contributor

The actual shrinking (i. e. deletion of entries) in undo_db happens when a new undo_session is created. After applying a block, this will typically happen when pending transactions are replayed.
fork_db OTOH shrinks immediately during set_max_size.
Hm.


Suppose we're at block 10 and dgp.lib is 3.
fork_db must contain all forks starting with block number 4 with the longest fork having length 7 (blocks 4...10).
undo_db must contain sessions of blocks 10, 9, 8, 7, 6, 5, 4 plus one for pending_transactions, plus one temporary (see _push_transaction), i. e. at least 9 sessions.
Suppose both have max_size = 8. Note that undo_db.start_undo_session leaves max_size sessions on the stack before creating another one.
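As a toy illustration of that lazy vs. eager pruning (purely illustrative; not the graphene implementation):

```cpp
// Toy illustration only; not graphene code.
// undo_db prunes lazily: set_max_size just records the limit, and old sessions
// are only dropped when the next session is started. fork_db, by contrast,
// prunes eagerly inside set_max_size / push_block.
#include <cstddef>
#include <deque>

struct toy_undo_db
{
   std::deque<unsigned> sessions;   // one entry per undo session, oldest first
   std::size_t max_size = 0;

   void set_max_size( std::size_t s ) { max_size = s; }   // nothing is removed here

   void start_undo_session( unsigned block_num )
   {
      // leave at most max_size existing sessions on the stack, then add the new one
      while( sessions.size() > max_size )
         sessions.pop_front();
      sessions.push_back( block_num );
   }
};
```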

We receive block 11.

This happens:

  • _pending_tx_session is rolled back (via detail::without_pending_transactions). 7 undo sessions remain.
  • Block 11 is pushed into fork_db. fork_db deletes all blocks before (head - max_size), i. e. (11-7) = 4.
  • A new undo_session is created (number 8).
  • Transactions are applied (this could make undo_db.max_size increase due to proposals being executed, but let's ignore this for now).
  • update_global_dynamic_data sets head_block_number to 11, then max_size to (11-3+1) = 9. Note that LIB is still at 3.
    ** fork_db deletes all blocks before (11-9) = 2. Nothing happens, these have already been deleted.
    ** undo_db only updates its max_size.
  • update_last_irreversible_block updates LIB. Worst case is that LIB stays at 3.
  • a) Suppose the block is applied successfully.
    ** the block's undo_session is committed, which simply leaves it lying on the stack.
    ** pending_transactions are then pushed, which creates two more undo_sessions (9 and 10). Again, when 10 is created, 9 are left on the stack, so this is not a problem.
  • b) Suppose an error happens and the block is un-applied.
    ** the block's undo_session is rolled back, leaving 7 on the stack.
    ** the block is removed from fork_db. max_size remains at 9 for both.
    ** pending_transactions are then pushed, which creates two more undo_sessions (8 and 9).

This looks safe, unless I've missed something.


When blocks are popped, they stay in fork_db, while their corresponding undo_sessions are rolled back and removed from the undo_stack.

Note that popping blocks will also undo updates to dgp.last_irreversible_block, so from the POV of the chain LIB will be moving backwards. However the actual last_irreversible_block stays where it is, because it is irreversible by definition.

We only pop blocks when switching to a longer fork, so after popping some and then applying some more dgp.last_irreversible_block either is smaller than or equal to the actual LIB (which will result in max_size being even greater than before), or it is greater, which means that the actual LIB has advanced as well, and everything should be in order.

I can see one edge case though:

  • When blocks are re-applied while switching to a fork, the calculated max_size can be smaller than the difference between the original head and the actual LIB. In that case, fork_db will drop too many blocks, because fork_db uses the head of the new longest chain as its internal _head.

E. g. if in the above example we receive block 11' instead of 11, then roll back 10, 9, 8, and 7, then apply 7', we might have dgp.lib = 2 which results in max_size = (7-2+1) = 6. Then fork_db will drop all blocks before (11-6)=5, i. e. block 4 will (wrongly) be dropped from fork_db.

Thoughts?

@abitmore
Member Author

@pmconrad:

update_last_irreversible_block updates LIB. Worst case is that LIB stays at 3.

A worse case is that LIB advances, e.g. to 6; then blocks 3, 4 and 5 will be dropped too early.

When blocks are re-applied while switching to a fork, the calculated max_size can be smaller than the difference between the original head and the actual LIB. In that case, fork_db will drop too many blocks, because fork_db uses the head of the new longest chain as its internal _head.
E. g. if in the above example we receive block 11' instead of 11, then roll back 10, 9, 8, and 7, then apply 7', we might have dgp.lib = 2 which results in max_size = (7-2+1) = 6. Then fork_db will drop all blocks before (11-6)=5, i. e. block 4 will (wrongly) be dropped from fork_db.

This is the scenario described in steemit/steem#2911

@pmconrad
Contributor

A worse case is that LIB advances, e.g. to 6; then blocks 3, 4 and 5 will be dropped too early.

If LIB advances to 6 then 3,4,5 can be dropped.

The scenario when switching forks can be fixed easily as in steemit/steem@653f0b5 .
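For illustration, one possible shape of such a guard (a sketch only, not necessarily what that commit does; fork_head_block_num stands for the block number of the current fork_db head):

```cpp
// Sketch only; not the actual Steem/BitShares patch.
// While re-applying blocks of a longer fork, _dgp.head_block_number can lag
// behind the head of fork_db, so size the databases against whichever head is
// larger to avoid pruning blocks the fork switch may still need.
uint32_t fork_head_block_num = 0;   // block number of the fork_db head (accessor omitted in this sketch)
uint32_t effective_head = std::max<uint32_t>( _dgp.head_block_number, fork_head_block_num );
uint32_t max_size = effective_head - _dgp.last_irreversible_block_num + 1;

_undo_db.set_max_size( max_size );
_fork_db.set_max_size( max_size );
```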

@abitmore
Member Author

abitmore commented Oct 25, 2018

If LIB advances to 6 then 3,4,5 can be dropped.

... only when "the block is applied successfully".

The scenario when switching forks can be fixed easily as in steemit/steem@653f0b5 .

This is needed as well: steemit/steem@c54fd43#diff-778ae4a84a14457ed12b22337048d2bfR3259

@pmconrad
Contributor

... only when "the block is applied successfully".

Oh, right.

This is needed as well: steemit/steem@c54fd43#diff-778ae4a84a14457ed12b22337048d2bfR3259

Something like this, yes. Their codebase looks quite different at that point.

ryanRfox added the 4a Low Priority (indicating minimal impact to system/user -OR- an inexpensive workaround exists) label Feb 12, 2019
@ryanRfox
Contributor

@jmjatlanta is investigating a test case to identify this bug.

@jmjatlanta
Contributor

Branch jmj_1333 presents a test case that attempts to cause the issue, but with no success yet. There are currently gaps in my knowledge in (at least) 3 areas. The notes below will probably only make sense to me, but I do not want to forget them:

  1. LIB calculation happens near the end of the process, and resizing of undo_db and fork_db near the beginning. Therefore, LIB should be "set in stone" before anything bad happens. When the dbs are resized, in my mind, they should be accurate.

  2. LIB states a block number. Nothing in the LIB calculation mentions the actual block. So a rogue witness can sign 2 blocks on 2 forks at the same height; LIB would point to the height, not the block. I am not going to overthink this point, as the chances of this happening are certainly low. But it keeps coming up, as I have yet to figure out how to make the witness ordering different between forks. I am imagining witnesses being swapped out on one fork, so not a rogue witness, but a disagreement about who the witnesses are.

  3. My test assumes (rogue) witnesses signing on both chains. This is impractical, and is probably why the test works when it shouldn't. I need to carefully work through the examples given and try to separate the scenarios into unique tests. In short, a very clear understanding of one scenario is required to build the test; the other scenarios get in the way of my understanding.

@pmconrad
Contributor

Rogue witnesses are a separate problem, you can ignore them here.

"LIB states a block number" is fine, because by definition there cannot be competing LIBs. (With rogue witnesses, two competing forks each with their own LIB can emerge. This is a different problem, see BFT.)

See my comment #1333 (comment) above, and @abitmore 's remark that "Worst case is LIB advances to 6, then the block fails and is rolled back". Then the next block arrives and both fork_db and undo_db are pruned too far.

@nathanielhourt
Contributor

Hey all, trying to wrap my head around this issue too... Setting aside for a moment the discussion of the max undo size, a scenario which jumps the LIB forward by the maximum number of blocks should trigger the bug, right? So, does this toy scenario trigger it?

7 witnesses: ABCDEFGH
LIB at 5/7

| Block # | Signing Witness | LIB # | Sigs Since LIB |
| --- | --- | --- | --- |
| 1 | A | 0 | A |
| 2 | B | 0 | AB |
| 3 | C | 0 | ABC |
| 4 | D | 0 | ABCD |
| 5 | E | 1 | BCDE |
| 6 | F | 2 | CDEF |
| 7 | G | 3 | DEFG |
| 8 | H | 4 | EFGH |
| 9 | G | 4 | EFGH |
| 10 | F | 4 | EFGH |
| 11 | E | 4 | EFGH |
| 12 | A | 8 | AEFG |

... except block 12 has a bad transaction, so it fails midway through and gets rewound, and then B shows up with a different block 5 and production continues from that chain... should that trigger the bug?

@nathanielhourt
Contributor

I suppose in my scenario above, H could sign 12, which would leave LIB at 4, and then A signs 13, which would cause LIB to jump to 9:

| Block # | Witness | LIB | Sigs Since LIB |
| --- | --- | --- | --- |
| 12 | H | 4 | EFGH |
| 13 | A | 9 | AEFH |

So that would be the maximum jump size of 5 blocks. Then 13 fails and rewinds, B continues from 5...
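For anyone following along, the LIB column above can be reproduced with a small standalone calculation. This is an illustrative sketch, not the actual update_last_irreversible_block() code; the name calc_lib and the threshold of 5 confirmations (matching the toy "LIB at 5/7" assumption, here 5 of the 8 witnesses) are illustrative choices.

```cpp
// Illustrative sketch; not actual BitShares Core code.
// LIB is taken as the highest block number that at least `threshold` witnesses
// have confirmed (a witness signing block N confirms every block up to N).
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

uint32_t calc_lib( std::vector<uint32_t> last_confirmed, std::size_t threshold )
{
   // After nth_element, the value at `offset` is one that at least `threshold`
   // of the entries are greater than or equal to.
   const std::size_t offset = last_confirmed.size() - threshold;
   std::nth_element( last_confirmed.begin(), last_confirmed.begin() + offset,
                     last_confirmed.end() );
   return last_confirmed[offset];
}

int main()
{
   // Toy scenario after block 13: last block signed by each of witnesses A..H.
   std::vector<uint32_t> last_confirmed{ 13, 2, 3, 4, 11, 10, 9, 12 };
   std::cout << "LIB = " << calc_lib( last_confirmed, 5 ) << "\n";   // prints LIB = 9
}
```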

@pmconrad
Contributor

Yes, I think that's accurate. Except that A-H are 8 witnesses not 7 :-).

@abitmore
Member Author

This issue is about defensive approaches for when something wrong HAS happened. To be clear, anything that can cause an uncaught exception after update_global_dynamic_data() IS A BUG and should be fixed ASAP, because witnesses will be unable to generate a new block, which will lead to a chain halt.

  • If the exception IS caused by a new transaction in the new block, an approach is to try to produce an empty block again (Steem had this mechanism but BitShares does not). In this case, if the chain continues growing with an empty block at the front, it's not the scenario described in this issue. Ideally we should have a mechanism to detect the troublesome transaction and fix the related issue, but IMHO that's out of this issue's scope.

  • If the exception is NOT caused by a new transaction in the new block, the chain would halt with the current code. Currently we have no mechanism to automatically try to produce a block on top of an earlier block (after LIB, of course). Note: we can probably add such a mechanism later.

    • A typical scenario: witnesses are running different versions of witness_node which are slightly incompatible (without the devs knowing), e.g. when a new consensus-changing release is out.
    • Another typical scenario: the chain has halted and some witnesses are trying to revive it manually, starting a fork from LIB (because witness_node pops out all blocks after LIB when shutting down). If the exception was caused by a transaction after LIB, the revival attempt should succeed; ideally the witnesses on the old fork would switch to the new fork automatically once it is longer, but the switch may fail due to the potential bug described in this issue.

@jmjatlanta: to write a test case for this issue, we need to do something abnormal in the test; specifically, we need to trigger an exception after update_global_dynamic_data().
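One hypothetical way to do that in a unit test (sketched below; no such hook exists in the codebase today, and all names are made up):

```cpp
// Hypothetical sketch only; this hook does not exist in BitShares Core.
// Idea: a test-only flag that injects a failure right after
// update_global_dynamic_data() inside _apply_block(), so a test can hit the
// exact window this issue is about.
//
// In database (test builds only):
//    bool debug_throw_after_ugdd = false;   // hypothetical member
//
// In database::_apply_block(), right after update_global_dynamic_data(...):
//    if( debug_throw_after_ugdd )
//       FC_THROW( "injected failure after update_global_dynamic_data (issue 1333)" );
//
// The test would then build fork A until the next block advances LIB, set the
// flag, push that block (expecting the injected exception), and finally feed
// blocks from fork B to check that the node can still switch forks.
```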

@abitmore
Member Author

abitmore commented Mar 23, 2019

As @jmjatlanta noticed, we are actually resizing undo_db and fork_db based on the last block's LIB; that is, the resize is deferred to the next block. So there would be no issue as long as the new size is big enough, given that it is based on the state of the last block. We probably only need to change the +1 in these 2 lines

_undo_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 1 );
_fork_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 1 );

to +2, or perhaps no change is needed at all.
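In code, the suggested variant would simply be (same two lines, with +2 instead of +1):

```cpp
// Sketch of the +2 variant suggested above: one extra entry of headroom,
// since the resize is effectively based on the previous block's LIB.
_undo_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 2 );
_fork_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 2 );
```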

abitmore removed the 3d Bug Classification, 4a Low Priority, 6 Security Impact, bug and security labels Mar 23, 2019
@abitmore abitmore removed this from the Future Feature Release milestone Mar 23, 2019
@abitmore
Member Author

Not a bug. We're safe.

abitmore added the not a bug and 2a Discussion Needed (prompt for team to discuss at next stand up) labels and removed the not a bug and won't fix labels Mar 23, 2019
@abitmore
Member Author

abitmore commented Mar 23, 2019

We still have an issue here similar to steemit/steem#2911, which would be easier to reproduce. Perhaps it's better to create a new issue for it?

@abitmore abitmore reopened this Mar 23, 2019
@abitmore abitmore added this to the Future Feature Release milestone Mar 23, 2019
@nathanielhourt
Contributor

Yes, I think that's accurate. Except that A-H are 8 witnesses not 7 :-).

Wow I'm an idot 🤦‍♂️ haha

@abitmore
Member Author

Closing this since @jmjatlanta has created #1679 to follow up on the Steem fork_db issue (steemit/steem#2911).

@abitmore abitmore removed this from the Future Feature Release milestone Mar 25, 2019
abitmore added the won't fix and not a bug labels and removed the 2a Discussion Needed label Mar 25, 2019