Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data Streaming Service] Add support for dynamic prefetching #11951

Merged
merged 4 commits into from
Feb 21, 2024

Conversation

JoshLind
Copy link
Contributor

@JoshLind JoshLind commented Feb 8, 2024

Note: > 80% of this PR is just unit test changes/additions.

Description

This PR adds dynamic prefetching support to the data streaming service. Instead of having a static (config-defined) maximum number of in-flight requests, we now dynamically increase/decrease the maximum based on successful RPC responses and timeouts, i.e., for each successful response we increase the value (e.g., +1), for each timeout we decrease the value (e.g., -3), and when we see a timeout, freeze the increases for some time (e.g., 30 secs), to regain stability.

The feature is config gated, and the experiments show significant improvements for mainnet PFNs running in GCP, e.g., a 40% improvement in fast syncing time when combined with latency aware peer dialing (already landed).

The PR offers several commits:

  1. Add dynamic prefetching support to the data streaming service.
  2. (Tests) Add simple unit tests for dynamic prefetching.
  3. (Tests) Update the existing integration tests and add several new tests for dynamic prefetching.
  4. Add simple back pressure to the state snapshot receiver when fast syncing (the increased aggression now actually allows us saturate storage).

Test Plan

New and existing tests.

Copy link

trunk-io bot commented Feb 8, 2024

⏱️ 12h 38m total CI duration on this PR
Job Cumulative Duration Recent Runs
rust-unit-coverage 4h 9m 🟩
rust-smoke-coverage 3h 15m 🟩
rust-unit-tests 1h 28m 🟩🟩🟩
windows-build 1h 11m 🟩🟩🟩🟩
forge-e2e-test / forge 43m 🟩🟩🟩
rust-images / rust-all 39m 🟩🟩🟩
rust-lints 22m 🟩🟩🟩
run-tests-main-branch 19m 🟥🟥🟥
check 12m 🟩🟩🟩
general-lints 8m 🟩🟩🟩
check-dynamic-deps 8m 🟩🟩🟩🟩
semgrep/ci 2m 🟩🟩🟩🟩
file_change_determinator 34s 🟩🟩🟩
file_change_determinator 33s 🟩🟩🟩
file_change_determinator 32s 🟩🟩🟩
permission-check 14s 🟩🟩🟩
permission-check 13s 🟩🟩🟩
upload-to-codecov 13s 🟩
permission-check 12s 🟩🟩🟩🟩
permission-check 11s 🟩🟩🟩🟩
permission-check 8s 🟩🟩🟩
determine-docker-build-metadata 7s 🟩🟩🟩

🚨 1 job on the last run was significantly faster/slower than expected

Job Duration vs 7d avg Delta
run-tests-main-branch 6m 4m +46%

settingsfeedbackdocs ⋅ learn more about trunk.io

@JoshLind JoshLind added the CICD:run-forge-e2e-perf Run the e2e perf forge only label Feb 8, 2024

This comment has been minimized.

This comment has been minimized.

Copy link

codecov bot commented Feb 8, 2024

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (6371800) 69.8% compared to head (1df7efc) 69.8%.
Report is 5 commits behind head on main.

❗ Current head 1df7efc differs from pull request most recent head 1e8650e. Consider uploading reports for the commit 1e8650e to get more accurate results

Files Patch % Lines
...ate-sync/data-streaming-service/src/data_stream.rs 87.2% 7 Missing ⚠️
.../data-streaming-service/src/dynamic_prefetching.rs 99.8% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main   #11951    +/-   ##
========================================
  Coverage    69.8%    69.8%            
========================================
  Files        2199     2200     +1     
  Lines      416774   417399   +625     
========================================
+ Hits       291116   291742   +626     
+ Misses     125658   125657     -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@JoshLind JoshLind force-pushed the dyn_prefetch_aggr_3 branch from 1df7efc to 8088b11 Compare February 8, 2024 13:36

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@JoshLind JoshLind marked this pull request as ready for review February 8, 2024 16:11
@JoshLind JoshLind requested a review from gregnazario as a code owner February 8, 2024 16:11
Copy link
Contributor

@bchocho bchocho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JoshLind JoshLind enabled auto-merge (rebase) February 21, 2024 15:26

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 1e8650e43f7104cfd4392f5fcc65333ca080cf9d

two traffics test: inner traffic : committed: 7266 txn/s, latency: 5246 ms, (p50: 4800 ms, p90: 6400 ms, p99: 12900 ms), latency samples: 3146260
two traffics test : committed: 100 txn/s, latency: 2200 ms, (p50: 2100 ms, p90: 2400 ms, p99: 5500 ms), latency samples: 1840
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.253, avg: 0.202", "QsPosToProposal: max: 0.182, avg: 0.167", "ConsensusProposalToOrdered: max: 0.605, avg: 0.561", "ConsensusOrderedToCommit: max: 0.491, avg: 0.466", "ConsensusProposalToCommit: max: 1.054, avg: 1.027"]
Max round gap was 1 [limit 4] at version 19027. Max no progress secs was 7.255185 [limit 15] at version 1302782.
Test Ok

Copy link
Contributor

✅ Forge suite compat success on aptos-node-v1.9.5 ==> 1e8650e43f7104cfd4392f5fcc65333ca080cf9d

Compatibility test results for aptos-node-v1.9.5 ==> 1e8650e43f7104cfd4392f5fcc65333ca080cf9d (PR)
1. Check liveness of validators at old version: aptos-node-v1.9.5
compatibility::simple-validator-upgrade::liveness-check : committed: 5476 txn/s, latency: 5944 ms, (p50: 5100 ms, p90: 8100 ms, p99: 23900 ms), latency samples: 191660
2. Upgrading first Validator to new version: 1e8650e43f7104cfd4392f5fcc65333ca080cf9d
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 638 txn/s, submitted: 668 txn/s, expired: 30 txn/s, latency: 37243 ms, (p50: 43200 ms, p90: 57500 ms, p99: 59000 ms), latency samples: 54243
3. Upgrading rest of first batch to new version: 1e8650e43f7104cfd4392f5fcc65333ca080cf9d
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 250 txn/s, submitted: 479 txn/s, expired: 228 txn/s, latency: 38391 ms, (p50: 42200 ms, p90: 50200 ms, p99: 68100 ms), latency samples: 22829
4. upgrading second batch to new version: 1e8650e43f7104cfd4392f5fcc65333ca080cf9d
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3039 txn/s, latency: 10022 ms, (p50: 9600 ms, p90: 16300 ms, p99: 17100 ms), latency samples: 133720
5. check swarm health
Compatibility test for aptos-node-v1.9.5 ==> 1e8650e43f7104cfd4392f5fcc65333ca080cf9d passed
Test Ok

@JoshLind JoshLind merged commit 8d24ec7 into main Feb 21, 2024
79 of 80 checks passed
@JoshLind JoshLind deleted the dyn_prefetch_aggr_3 branch February 21, 2024 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CICD:run-forge-e2e-perf Run the e2e perf forge only
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants