
Only update max size of undo_db and fork_db after successfully pushed a block #1333

Closed
abitmore opened this issue Sep 17, 2018 · 25 comments

Comments

@abitmore
Member

abitmore commented Sep 17, 2018

Bug Description
There is a corner case that would cause chain reorganization to fail.

update_global_dynamic_data() is called in the middle of _apply_block():

update_global_dynamic_data( next_block, missed );

Max size of both fork db and undo db is updated inside update_global_dynamic_data():

_undo_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 1 );
_fork_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 1 );

If the max size shrinks or stays the same, the oldest (smallest-numbered) reversible block will be removed from the undo db and the fork db. If an exception is then thrown after update_global_dynamic_data() in _apply_block(), the changes made to the object database will be reverted, but the removed block won't be put back into the fork db or the undo db. In that case, the chain would be unable to reorganize onto another fork that does not contain the removed block.
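For illustration, here is a minimal sketch of the fix suggested by the title: only resize the databases after the block has been pushed successfully. Names and signatures below are simplified and hypothetical, not actual BitShares Core code.

```cpp
// Hypothetical, simplified sketch; not actual BitShares Core code.
// The resize is deferred until the block is known to have applied cleanly,
// so a block that throws can no longer shrink undo_db / fork_db as a side effect.
void database::push_block_sketch( const signed_block& new_block )
{
   auto session = _undo_db.start_undo_session();
   apply_block_sketch( new_block );   // may throw; if it does, nothing below runs
   session.commit();

   // Only after success: update the max sizes based on the new head and LIB.
   const auto& dgp = get_dynamic_global_properties();
   uint32_t max_size = dgp.head_block_number - dgp.last_irreversible_block_num + 1;
   _undo_db.set_max_size( max_size );
   _fork_db.set_max_size( max_size );
}
```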

By the way, there is another scenario described in steemit/steem#2911.

Expected Behavior
Always be able to reorganize the chain onto any fork that branches off after LIB.

Impacts
Describe which portion(s) of BitShares Core may be impacted by this bug. Please tick at least one box.

  • API (the application programming interface)
  • Build (the build process or something prior to compiled code)
  • CLI (the command line wallet)
  • Deployment (the deployment process after building such as Docker, Travis, etc.)
  • DEX (the Decentralized EXchange, market engine, etc.)
  • P2P (the peer-to-peer network for transaction/block propagation)
  • Performance (system or user efficiency, etc.)
  • Protocol (the blockchain logic, consensus, validation, etc.)
  • Security (the security of system or user data, etc.)
  • UX (the User Experience)
  • Other (please add below)

CORE TEAM TASK LIST

  • Evaluate / Prioritize Bug Report
  • Refine User Stories / Requirements
  • Define Test Cases
  • Design / Develop Solution
  • Perform QA/Testing
  • Update Documentation
abitmore added the security, bug, 3d Bug Classification (indicating the existing implementation does not match the intention of the design) and 6 Security Impact (flag identifying system/user security) labels Sep 17, 2018
@abitmore abitmore added this to the Future Feature Release milestone Sep 17, 2018
@abitmore
Member Author

Related: steemit/steem#2911 (different scenario).

@abitmore
Member Author

Steem PR: steemit/steem#2914 (also contains other changes, may or may not apply to BitShares).

@pmconrad
Contributor

Moving the checkpointing code to push_block means checkpoints are ignored during most of replay. I don't think we should do this.

@abitmore
Member Author

Just FYI, my patch for Steem didn't move the checkpointing code; instead, I added one in push_block. It brought better performance when syncing. https://github.com/abitmore/steem/commits/20180917-fix1

@jmjatlanta
Contributor

Please take a look over these ramblings and see if I am on the right track:

  1. The problem surfaces when the max size of _undo_db is decreased while there are (at least) 2 forks.
  2. The size of _undo_db is decreased when the LIB moves forward (among other times, but this is the scenario I can wrap my head around).
  3. A poorly timed error while pushing a block can trigger a chain reorganization, but a previous resize of _undo_db means that a block (or blocks) near LIB may no longer exist in _undo_db.

So, a test of this scenario would require
a) 2 forks
b) a movement forward of LIB to reduce undo_db and cause loss of data
c) an attempt to push a bad block on 1 of the forks to cause a reorganization of the chain

Am I close?

@pmconrad
Contributor

pmconrad commented Oct 19, 2018

The critical part is that

  • LIB is on a block common to both forks
  • a block X arrives on fork A that moves LIB forward to a block that is not common to both forks
  • X fails to apply after update_global_dynamic_data() -> LIB has moved already.
  • A block Y arrives on fork B. Y cannot be applied because the node can't roll back to the point before the fork, so the node is stuck.

I think this is close to impossible in practice. Fork B can only become valid if you replace 1/3 of the witnesses on that fork only, or when 1/3 of the witnesses are double-signing.

@jmjatlanta
Contributor

jmjatlanta commented Oct 19, 2018

Thank you for clearing that up. I was thinking about subsequent blocks coming in, but was getting lost: how can one fork have a LIB and another not? That would mean the LIB is reversible, which should be impossible.

Now I have it straight. A block comes in that moves LIB onto one fork; we tentatively accepted it and adjusted the size of undo_db, and then hit a problem with that same block. Now we don't have what we need to switch forks.

I will attempt to write a test for that scenario. I'll give it a lower priority, due to the low chance of this occurring.

@jmjatlanta
Contributor

jmjatlanta commented Oct 23, 2018

The calculation for LIB comes after the calculation for the _undo_db and _fork_db size. I believe that means that if the current block moves the LIB forward, it will not be until the next block that the blocks are prematurely removed from _undo_db.

So my latest hypothesis is that this problem cannot be caused by a block moving the LIB forward. If that hypothesis is correct, I ask the following 2 questions:

  1. What other processes can cause the _undo_db to shrink, other than movement of LIB?
  2. Could shrinking the size of _undo_db on the next block instead of the current block cause a problem?

I believe the answer to question 2 is no, as it has been working that way.

@pmconrad
Contributor

The actual shrinking (i. e. deletion of entries) in undo_db happens when a new undo_session is created. After applying a block, this will typically happen when pending transactions are replayed.
fork_db OTOH shrinks immediately during set_max_size.
Hm.


Suppose we're at block 10 and dgp.lib is 3.
fork_db must contain all forks starting with block number 4 with the longest fork having length 7 (blocks 4...10).
undo_db must contain sessions of blocks 10, 9, 8, 7, 6, 5, 4 plus one for pending_transactions, plus one temporary (see _push_transaction), i. e. at least 9 sessions.
Suppose both have max_size = 8. Note that undo_db.start_undo_session leaves max_size sessions on the stack before creating another one.
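As a toy illustration of that lazy vs. eager pruning (purely illustrative; not the graphene implementation):

```cpp
// Toy illustration only; not graphene code.
// undo_db prunes lazily: set_max_size just records the limit, and old sessions
// are only dropped when the next session is started. fork_db, by contrast,
// prunes eagerly inside set_max_size / push_block.
#include <cstddef>
#include <deque>

struct toy_undo_db
{
   std::deque<unsigned> sessions;   // one entry per undo session, oldest first
   std::size_t max_size = 0;

   void set_max_size( std::size_t s ) { max_size = s; }   // nothing is removed here

   void start_undo_session( unsigned block_num )
   {
      // leave at most max_size existing sessions on the stack, then add the new one
      while( sessions.size() > max_size )
         sessions.pop_front();
      sessions.push_back( block_num );
   }
};
```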

We receive block 11.

This happens:

  • _pending_tx_session is rolled back (via detail::without_pending_transactions). 7 undo sessions remain.
  • Block 11 is pushed into fork_db. fork_db deletes all blocks before (head - max_size), i. e. (11-7) = 4.
  • A new undo_session is created (number 8).
  • Transactions are applied (this could make undo_db.max_size increase due to proposals being executed, but let's ignore this for now).
  • update_global_dynamic_data sets head_block_number to 11, then max_size to (11-3+1) = 9. Note that LIB is still at 3.
    ** fork_db deletes all blocks before (11-9) = 2. Nothing happens, these have already been deleted.
    ** undo_db only updates its max_size.
  • update_last_irreversible_block updates LIB. Worst case is that LIB stays at 3.
  • a) Suppose the block is applied successfully.
    ** the block's undo_session is committed, which simply leaves it lying on the stack.
    ** pending_transactions are then pushed, which creates two more undo_sessions (9 and 10). Again, when 10 is created, 9 are left on the stack, so this is not a problem.
  • b) Suppose an error happens and the block is un-applied.
    ** the block's undo_session is rolled back, leaving 7 on the stack.
    ** the block is removed from fork_db. max_size remains at 9 for both.
    ** pending_transactions are then pushed, which creates two more undo_sessions (8 and 9).

This looks safe, unless I've missed something.


When blocks are popped, they stay in fork_db, while their corresponding undo_sessions are rolled back and removed from the undo_stack.

Note that popping blocks will also undo updates to dgp.last_irreversible_block, so from the POV of the chain LIB will be moving backwards. However the actual last_irreversible_block stays where it is, because it is irreversible by definition.

We only pop blocks when switching to a longer fork, so after popping some and then applying some more dgp.last_irreversible_block either is smaller than or equal to the actual LIB (which will result in max_size being even greater than before), or it is greater, which means that the actual LIB has advanced as well, and everything should be in order.

I can see one edge case though:

  • When blocks are re-applied while switching to a fork, the calculated max_size can be smaller than the difference between the original head and the actual LIB. In that case, fork_db will drop too many blocks, because fork_db uses the head of the new longest chain as its internal _head.

E. g. if in the above example we receive block 11' instead of 11, then roll back 10, 9, 8, and 7, then apply 7', we might have dgp.lib = 2 which results in max_size = (7-2+1) = 6. Then fork_db will drop all blocks before (11-6)=5, i. e. block 4 will (wrongly) be dropped from fork_db.

Thoughts?

@abitmore
Member Author

@pmconrad:

update_last_irreversible_block updates LIB. Worst case is that LIB stays at 3.

A worse case is that LIB advances, e.g. to 6; then blocks 3, 4 and 5 will be dropped too early.

When blocks are re-applied while switching to a fork, the calculated max_size can be smaller than the difference between the original head and the actual LIB. In that case, fork_db will drop too many blocks, because fork_db uses the head of the new longest chain as its internal _head.
E. g. if in the above example we receive block 11' instead of 11, then roll back 10, 9, 8, and 7, then apply 7', we might have dgp.lib = 2 which results in max_size = (7-2+1) = 6. Then fork_db will drop all blocks before (11-6)=5, i. e. block 4 will (wrongly) be dropped from fork_db.

This is the scenario described in steemit/steem#2911

@pmconrad
Contributor

A worse case is that LIB advances, e.g. to 6; then blocks 3, 4 and 5 will be dropped too early.

If LIB advances to 6 then 3,4,5 can be dropped.

The scenario when switching forks can be fixed easily as in steemit/steem@653f0b5 .
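For illustration, one possible shape of such a guard (a sketch only, not necessarily what that commit does; fork_head_block_num stands for the block number of the current fork_db head):

```cpp
// Sketch only; not the actual Steem/BitShares patch.
// While re-applying blocks of a longer fork, _dgp.head_block_number can lag
// behind the head of fork_db, so size the databases against whichever head is
// larger to avoid pruning blocks the fork switch may still need.
uint32_t fork_head_block_num = 0;   // block number of the fork_db head (accessor omitted in this sketch)
uint32_t effective_head = std::max<uint32_t>( _dgp.head_block_number, fork_head_block_num );
uint32_t max_size = effective_head - _dgp.last_irreversible_block_num + 1;

_undo_db.set_max_size( max_size );
_fork_db.set_max_size( max_size );
```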

@abitmore
Member Author

abitmore commented Oct 25, 2018

If LIB advances to 6 then 3,4,5 can be dropped.

... only when "the block is applied successfully".

The scenario when switching forks can be fixed easily as in steemit/steem@653f0b5 .

This is needed as well: steemit/steem@c54fd43#diff-778ae4a84a14457ed12b22337048d2bfR3259

@pmconrad
Contributor

... only when "the block is applied successfully".

Oh, right.

This is needed as well: steemit/steem@c54fd43#diff-778ae4a84a14457ed12b22337048d2bfR3259

Something like this, yes. Their codebase looks quite different at that point.

ryanRfox added the 4a Low Priority (indicating minimal impact to system/user -OR- an inexpensive workaround exists) label Feb 12, 2019
@ryanRfox
Contributor

@jmjatlanta is investigating a test case to identify this bug.

@jmjatlanta
Contributor

Branch jmj_1333 presents a test case that attempts to cause the issue, but with no success yet. There are currently gaps in my knowledge in (at least) 3 areas. The notes below will probably only make sense to me, but I do not want to forget them:

  1. LIB calculation happens near the end of the process, and resizing of undo_db and fork_db near the beginning. Therefore, LIB should be "set in stone" before anything bad happens. When the dbs are resized, in my mind, they should be accurate.

  2. LIB states a block number. Nothing in the LIB calculation mentions the actual block. So a rogue witness can sign 2 blocks on 2 forks at the same height; LIB would point to the height, not the block. I am not going to overthink this point, as the chances of this happening are certainly low. But it keeps coming up, as I have yet to figure out how to make the witness ordering different between forks. I am imagining witnesses being swapped out on one fork, so not a rogue witness, but a disagreement about who the witnesses are.

  3. My test assumes (rogue) witnesses signing on both chains. This is impractical, and is probably why the test works when it shouldn't. I need to carefully work through the examples given and try to separate the scenarios into unique tests. In short, a very clear understanding of one scenario is required to build the test; the other scenarios get in the way of my understanding.

@pmconrad
Contributor

Rogue witnesses are a separate problem, you can ignore them here.

"LIB states a block number" is fine, because by definition there cannot be competing LIBs. (With rogue witnesses, two competing forks each with their own LIB can emerge. This is a different problem, see BFT.)

See my comment #1333 (comment) above, and @abitmore 's remark that "Worst case is LIB advances to 6, then the block fails and is rolled back". Then the next block arrives and both fork_db and undo_db are pruned too far.

@nathanielhourt
Contributor

Hey all, trying to wrap my head around this issue too... Setting aside for a moment the discussion of the max undo size, a scenario which jumps the LIB forward by the maximum number of blocks should trigger the bug, right? So, does this toy scenario trigger it?

7 witnesses: ABCDEFGH
LIB at 5/7

| Block # | Signing Witness | LIB # | Sigs Since LIB |
| --- | --- | --- | --- |
| 1 | A | 0 | A |
| 2 | B | 0 | AB |
| 3 | C | 0 | ABC |
| 4 | D | 0 | ABCD |
| 5 | E | 1 | BCDE |
| 6 | F | 2 | CDEF |
| 7 | G | 3 | DEFG |
| 8 | H | 4 | EFGH |
| 9 | G | 4 | EFGH |
| 10 | F | 4 | EFGH |
| 11 | E | 4 | EFGH |
| 12 | A | 8 | AEFG |

... except block 12 has a bad transaction, so it fails midway through and gets rewound, and then B shows up with a different block 5 and production continues from that chain... should that trigger the bug?

@nathanielhourt
Contributor

I suppose in my scenario above, H could sign 12, which would leave LIB at 4, and then A signs 13, which would cause LIB to jump to 9:

| Block # | Witness | LIB | Sigs Since LIB |
| --- | --- | --- | --- |
| 12 | H | 4 | EFGH |
| 13 | A | 9 | AEFH |

So that would be the maximum jump size of 5 blocks. Then 13 fails and rewinds, B continues from 5...
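For anyone following along, the LIB column above can be reproduced with a small standalone calculation. This is an illustrative sketch, not the actual update_last_irreversible_block() code; the name calc_lib and the threshold of 5 confirmations (matching the toy "LIB at 5/7" assumption, here 5 of the 8 witnesses) are illustrative choices.

```cpp
// Illustrative sketch; not actual BitShares Core code.
// LIB is taken as the highest block number that at least `threshold` witnesses
// have confirmed (a witness signing block N confirms every block up to N).
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

uint32_t calc_lib( std::vector<uint32_t> last_confirmed, std::size_t threshold )
{
   // After nth_element, the value at `offset` is one that at least `threshold`
   // of the entries are greater than or equal to.
   const std::size_t offset = last_confirmed.size() - threshold;
   std::nth_element( last_confirmed.begin(), last_confirmed.begin() + offset,
                     last_confirmed.end() );
   return last_confirmed[offset];
}

int main()
{
   // Toy scenario after block 13: last block signed by each of witnesses A..H.
   std::vector<uint32_t> last_confirmed{ 13, 2, 3, 4, 11, 10, 9, 12 };
   std::cout << "LIB = " << calc_lib( last_confirmed, 5 ) << "\n";   // prints LIB = 9
}
```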

@pmconrad
Contributor

Yes, I think that's accurate. Except that A-H are 8 witnesses not 7 :-).

@abitmore
Member Author

This issue is about defensive approaches for when something wrong HAS happened. To be clear, anything that can cause an uncaught exception after update_global_dynamic_data() IS A BUG and should be fixed ASAP, because witnesses will be unable to generate a new block, which will lead to a chain halt.

  • If the exception IS caused by a new transaction in the new block, an approach is to try to produce an empty block again (Steem had this mechanism but BitShares does not). In this case, if the chain continues growing with an empty block at the front, it's not the scenario described in this issue. Ideally we should have a mechanism to detect the troublesome transaction and fix the related issue, but IMHO that's out of this issue's scope.

  • If the exception is NOT caused by a new transaction in the new block, the chain would halt with the current code. Currently we have no mechanism to automatically try to produce a block on top of an earlier block (after LIB, of course). Note: we can probably add such a mechanism later.

    • A typical scenario: witnesses are running different versions of witness_node which are slightly incompatible (without the devs knowing), e.g. when a new consensus-changing release is out.
    • Another typical scenario: the chain has halted and some witnesses are trying to revive it manually, starting a fork from LIB (because witness_node pops out all blocks after LIB when shutting down). If the exception was caused by a transaction after LIB, the revival attempt should succeed; ideally the witnesses on the old fork would switch to the new fork automatically once it is longer, but the switch may fail due to the potential bug described in this issue.

@jmjatlanta: to write a test case for this issue, we need to do something abnormal in the test; specifically, we need to trigger an exception after update_global_dynamic_data().
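One hypothetical way to do that in a unit test (sketched below; no such hook exists in the codebase today, and all names are made up):

```cpp
// Hypothetical sketch only; this hook does not exist in BitShares Core.
// Idea: a test-only flag that injects a failure right after
// update_global_dynamic_data() inside _apply_block(), so a test can hit the
// exact window this issue is about.
//
// In database (test builds only):
//    bool debug_throw_after_ugdd = false;   // hypothetical member
//
// In database::_apply_block(), right after update_global_dynamic_data(...):
//    if( debug_throw_after_ugdd )
//       FC_THROW( "injected failure after update_global_dynamic_data (issue 1333)" );
//
// The test would then build fork A until the next block advances LIB, set the
// flag, push that block (expecting the injected exception), and finally feed
// blocks from fork B to check that the node can still switch forks.
```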

@abitmore
Member Author

abitmore commented Mar 23, 2019

As @jmjatlanta noticed, we are actually resizing undo_db and fork_db based on the last block's LIB; that is, the resize is deferred to the next block. So there would be no issue as long as the new size is big enough, given that it is based on the state of the last block. We probably only need to change the +1 in these 2 lines

_undo_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 1 );
_fork_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 1 );

to +2, or perhaps no change is needed at all.
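In code, the suggested variant would simply be (same two lines, with +2 instead of +1):

```cpp
// Sketch of the +2 variant suggested above: one extra entry of headroom,
// since the resize is effectively based on the previous block's LIB.
_undo_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 2 );
_fork_db.set_max_size( _dgp.head_block_number - _dgp.last_irreversible_block_num + 2 );
```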

abitmore removed the 3d Bug Classification, 4a Low Priority, 6 Security Impact, bug and security labels Mar 23, 2019
@abitmore abitmore removed this from the Future Feature Release milestone Mar 23, 2019
@abitmore
Member Author

Not a bug. We're safe.

abitmore added the not a bug and 2a Discussion Needed (prompt for team to discuss at next stand up) labels and removed the not a bug and won't fix labels Mar 23, 2019
@abitmore
Member Author

abitmore commented Mar 23, 2019

We still have an issue here similar to steemit/steem#2911, which would be easier to reproduce. Perhaps it's better to create a new issue for it?

@abitmore abitmore reopened this Mar 23, 2019
@abitmore abitmore added this to the Future Feature Release milestone Mar 23, 2019
@nathanielhourt
Contributor

Yes, I think that's accurate. Except that A-H are 8 witnesses not 7 :-).

Wow I'm an idot 🤦‍♂️ haha

@abitmore
Member Author

Closing this since @jmjatlanta has created #1679 to follow up on the Steem fork_db issue (steemit/steem#2911).

@abitmore abitmore removed this from the Future Feature Release milestone Mar 25, 2019
abitmore added the won't fix and not a bug labels and removed the 2a Discussion Needed label Mar 25, 2019