Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix state merkle truncate #15800

Merged
merged 1 commit into from
Jan 23, 2025
Merged

Conversation

msmouse
Copy link
Contributor

@msmouse msmouse commented Jan 23, 2025

Description

in case of a partial commit on a unsharded state db, two things were wrong:

  1. when we see a partial version, we need to look at the previous version before jumping the the epoch boundary (we forgot to subtract 1 from the version)
  2. the ledger meta db were not passed in as intended (instead, we passed the state merkle meta db)

incomplete / faulty fix previously #15089

How Has This Been Tested?

The test_db_restart smoke test does catch this sometimes but it's not easy to trigger. I bumped the number of restarts on that test.

Key Areas to Review

Type of Change

  • Bug fix

Which Components or Systems Does This Change Impact?

  • Validator Node

in case of a partial commit on a unsharded state db, two things were
wrong:
1. when we see a partial version, we need to look at the previous version before jumping the the epoch boundary (we forgot to substract 1 from the version)
2. the ledger meta db were not passed in as intended (instead, we passed
   the state merkle meta db)

The test_db_restart smoke test does catch this sometimes but it's not
easy to trigger. I bumpped the number of restarts on that test.
Copy link

trunk-io bot commented Jan 23, 2025

⏱️ 3h 1m total CI duration on this PR
Slowest 15 Jobs Cumulative Duration Recent Runs
execution-performance / single-node-performance 1h 36m 🟩🟩🟩🟩
execution-performance / test-target-determinator 21m 🟩🟩🟩🟩
test-target-determinator 15m 🟩🟩🟩
check-dynamic-deps 14m 🟩🟩🟩🟩🟩
rust-cargo-deny 7m 🟩🟩🟩🟩
rust-doc-tests 6m 🟩
rust-doc-tests 5m 🟩
rust-doc-tests 5m 🟩
fetch-last-released-docker-image-tag 5m 🟩🟩🟩
semgrep/ci 2m 🟩🟩🟩🟩🟩
general-lints 2m 🟩🟩🟩🟩
Backport PR 1m 🟥🟩
file_change_determinator 51s 🟩🟩🟩🟩
file_change_determinator 35s 🟩🟩🟩
permission-check 11s 🟩🟩🟩🟩🟩

🚨 1 job on the last run was significantly faster/slower than expected

Job Duration vs 7d avg Delta
execution-performance / single-node-performance 25m 16m +60%

settingsfeedbackdocs ⋅ learn more about trunk.io

@msmouse msmouse requested a review from areshand January 23, 2025 20:00
@msmouse msmouse added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Jan 23, 2025

This comment has been minimized.

This comment has been minimized.

@msmouse msmouse enabled auto-merge (squash) January 23, 2025 22:02

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@msmouse msmouse merged commit f3d4dba into main Jan 23, 2025
135 of 138 checks passed
@msmouse msmouse deleted the 0123-alden-fix-state-merkle-truncate branch January 23, 2025 22:38
github-actions bot pushed a commit that referenced this pull request Jan 24, 2025
in case of a partial commit on a unsharded state db, two things were
wrong:
1. when we see a partial version, we need to look at the previous version before jumping the the epoch boundary (we forgot to substract 1 from the version)
2. the ledger meta db were not passed in as intended (instead, we passed
   the state merkle meta db)

The test_db_restart smoke test does catch this sometimes but it's not
easy to trigger. I bumpped the number of restarts on that test.

(cherry picked from commit f3d4dba)
github-actions bot pushed a commit that referenced this pull request Jan 24, 2025
in case of a partial commit on a unsharded state db, two things were
wrong:
1. when we see a partial version, we need to look at the previous version before jumping the the epoch boundary (we forgot to substract 1 from the version)
2. the ledger meta db were not passed in as intended (instead, we passed
   the state merkle meta db)

The test_db_restart smoke test does catch this sometimes but it's not
easy to trigger. I bumpped the number of restarts on that test.

(cherry picked from commit f3d4dba)
Copy link
Contributor

💔 Some backports could not be created

Status Branch Result
aptos-release-v1.25 An unhandled error occurred. Please see the logs for details
aptos-release-v1.26

Manual backport

To create the backport manually run:

backport --pr 15800

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

github-actions bot pushed a commit that referenced this pull request Jan 24, 2025
in case of a partial commit on a unsharded state db, two things were
wrong:
1. when we see a partial version, we need to look at the previous version before jumping the the epoch boundary (we forgot to substract 1 from the version)
2. the ledger meta db were not passed in as intended (instead, we passed
   the state merkle meta db)

The test_db_restart smoke test does catch this sometimes but it's not
easy to trigger. I bumpped the number of restarts on that test.

(cherry picked from commit f3d4dba)
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
aptos-release-v1.25
aptos-release-v1.26

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite compat success on d15fc969c89551a1461d931d327b8d4dbfb2f814 ==> 8d4ea519b35004f9d73391ad7177ad71c2b5d815

Compatibility test results for d15fc969c89551a1461d931d327b8d4dbfb2f814 ==> 8d4ea519b35004f9d73391ad7177ad71c2b5d815 (PR)
1. Check liveness of validators at old version: d15fc969c89551a1461d931d327b8d4dbfb2f814
compatibility::simple-validator-upgrade::liveness-check : committed: 13056.92 txn/s, latency: 2505.90 ms, (p50: 2600 ms, p70: 2700, p90: 2900 ms, p99: 3100 ms), latency samples: 427600
2. Upgrading first Validator to new version: 8d4ea519b35004f9d73391ad7177ad71c2b5d815
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 4353.75 txn/s, latency: 7130.60 ms, (p50: 7900 ms, p70: 8500, p90: 8800 ms, p99: 8900 ms), latency samples: 92600
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 4323.14 txn/s, latency: 7836.95 ms, (p50: 8800 ms, p70: 8900, p90: 9200 ms, p99: 9400 ms), latency samples: 151720
3. Upgrading rest of first batch to new version: 8d4ea519b35004f9d73391ad7177ad71c2b5d815
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 4425.82 txn/s, latency: 7036.18 ms, (p50: 7800 ms, p70: 8400, p90: 8700 ms, p99: 8900 ms), latency samples: 92960
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 4431.31 txn/s, latency: 7655.09 ms, (p50: 8600 ms, p70: 8700, p90: 9000 ms, p99: 9200 ms), latency samples: 156120
4. upgrading second batch to new version: 8d4ea519b35004f9d73391ad7177ad71c2b5d815
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 8121.96 txn/s, latency: 3751.40 ms, (p50: 4400 ms, p70: 4600, p90: 4800 ms, p99: 5000 ms), latency samples: 148500
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 7733.13 txn/s, latency: 4358.32 ms, (p50: 4700 ms, p70: 4800, p90: 5000 ms, p99: 5100 ms), latency samples: 259120
5. check swarm health
Compatibility test for d15fc969c89551a1461d931d327b8d4dbfb2f814 ==> 8d4ea519b35004f9d73391ad7177ad71c2b5d815 passed
Test Ok

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 8d4ea519b35004f9d73391ad7177ad71c2b5d815

two traffics test: inner traffic : committed: 14723.68 txn/s, latency: 2690.70 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5598340
two traffics test : committed: 99.98 txn/s, latency: 1395.24 ms, (p50: 1400 ms, p70: 1400, p90: 1500 ms, p99: 1700 ms), latency samples: 1680
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.450, avg: 1.417", "ConsensusProposalToOrdered: max: 0.294, avg: 0.291", "ConsensusOrderedToCommit: max: 0.423, avg: 0.404", "ConsensusProposalToCommit: max: 0.715, avg: 0.696"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.74s no progress at version 35505 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.65s no progress at version 2007429 (avg 0.58s) [limit 16].
Test Ok

msmouse added a commit that referenced this pull request Jan 24, 2025
in case of a partial commit on a unsharded state db, two things were
wrong:
1. when we see a partial version, we need to look at the previous version before jumping the the epoch boundary (we forgot to substract 1 from the version)
2. the ledger meta db were not passed in as intended (instead, we passed
   the state merkle meta db)

The test_db_restart smoke test does catch this sometimes but it's not
easy to trigger. I bumpped the number of restarts on that test.

(cherry picked from commit f3d4dba)

Co-authored-by: Alden Hu <[email protected]>
perryjrandall pushed a commit that referenced this pull request Jan 29, 2025
in case of a partial commit on a unsharded state db, two things were
wrong:
1. when we see a partial version, we need to look at the previous version before jumping the the epoch boundary (we forgot to substract 1 from the version)
2. the ledger meta db were not passed in as intended (instead, we passed
   the state merkle meta db)

The test_db_restart smoke test does catch this sometimes but it's not
easy to trigger. I bumpped the number of restarts on that test.

(cherry picked from commit f3d4dba)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR v1.25 v1.26
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants