Pod schedule change #15658

Merged
merged 2 commits into from
Jan 10, 2025

Conversation

areshand
Contributor

@areshand areshand commented Dec 28, 2024

Description

Only schedule a pod once its PVC is bound, unless the pod is the first pod for that PVC, in which case it is the one that initializes the PVC.

Previously, this was not an issue. Recently, however, the first pod of a PVC could be scheduled after the pods that follow it, causing disk zone misalignment. I will follow up separately on why pod scheduling is delayed on k8s.
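
The rule described above can be sketched as a small predicate. This is a minimal illustration, not the code from this PR: the function name `can_schedule` and the list-of-booleans representation are hypothetical, and the round-robin mapping of pod index `i` to PVC `i % len(pvcs)` is assumed from the snippet quoted in the review comment further down.

```python
from typing import Sequence


def can_schedule(i: int, pvc_bound_status: Sequence[bool]) -> bool:
    """Return True if pod i may be scheduled now.

    Assumption: pod i uses PVC number i % len(pvc_bound_status), so the
    pods with index < len(pvc_bound_status) are the first pod of each PVC.
    """
    num_pvcs = len(pvc_bound_status)
    is_first_pod_for_pvc = i < num_pvcs
    # First pods may always go ahead (they initialize the PVC);
    # later pods wait until their PVC is bound.
    return pvc_bound_status[i % num_pvcs] or is_first_pod_for_pvc


# Two PVCs, only PVC 0 is bound so far.
status = [True, False]
schedulable = [i for i in range(6) if can_schedule(i, status)]
print(schedulable)  # [0, 1, 2, 4]
```

In this example pods 3 and 5 share the unbound PVC 1 and are held back, while pod 1 is allowed through because it is the first pod for that PVC.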

How Has This Been Tested?

Ran replay-verify locally on a k8s cluster.

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation


trunk-io bot commented Dec 28, 2024

⏱️ 22m total CI duration on this PR
| Job | Cumulative Duration | Recent Runs |
| --- | --- | --- |
| rust-move-tests | 13m | 🟩 |
| rust-cargo-deny | 3m | 🟩 |
| check-dynamic-deps | 2m | 🟩🟩🟩 |
| general-lints | 1m | 🟩🟩 |
| semgrep/ci | 59s | 🟩🟩🟩 |
| rust-move-tests | 50s | |
| file_change_determinator | 38s | 🟩🟩🟩 |
| rust-move-tests | 14s | |
| permission-check | 7s | 🟩🟩🟩 |
| permission-check | 7s | 🟩🟩🟩 |


Comment on lines +433 to 435
pvc_bound_status[i % len(self.pvcs)] or i < len(self.pvcs)
): # we only create a new pod to initialize the PVC before the PVC is bound
if (

The condition for initializing PVCs needs to be more precise. The current check i < len(self.pvcs) allows any pod with an index less than the PVC count to proceed, but this could lead to multiple pods initializing the same PVC. Instead, use pvc_bound_status[i % len(self.pvcs)] or (i < len(self.pvcs) and i == i % len(self.pvcs)) to ensure only the first pod for each PVC can initialize it, while subsequent pods must wait for the PVC to be bound.

Spotted by Graphite Reviewer
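
As a quick check on the suggestion above, the sketch below compares the current and the suggested condition over a small range of inputs. The function names are hypothetical, and the check assumes non-negative pod indices and at least one PVC. Under those assumptions `i < n` already implies `i == i % n`, so the two conditions accept the same pods for every input tried here.

```python
from itertools import product


def current_condition(i, bound):
    # Condition as quoted from the diff.
    n = len(bound)
    return bound[i % n] or i < n


def suggested_condition(i, bound):
    # Condition as proposed in the review comment.
    n = len(bound)
    return bound[i % n] or (i < n and i == i % n)


def conditions_agree(max_pvcs=4, max_pods=12):
    # Exhaustively compare both conditions for every bound/unbound
    # combination of up to max_pvcs PVCs and pod indices below max_pods.
    for n in range(1, max_pvcs + 1):
        for bound in product([False, True], repeat=n):
            for i in range(max_pods):
                if current_condition(i, bound) != suggested_condition(i, bound):
                    return False
    return True


print(conditions_agree())  # True
```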


@areshand
Contributor Author

@grao1991 please stamp this?

@areshand areshand enabled auto-merge (rebase) January 10, 2025 18:51
@areshand areshand disabled auto-merge January 10, 2025 18:51


Contributor

✅ Forge suite compat success on 6593fb81261f25490ffddc2252a861c994234c2a ==> 7e6d3ed7082e229354fcf37b2fe8583d4a4855a5

Compatibility test results for 6593fb81261f25490ffddc2252a861c994234c2a ==> 7e6d3ed7082e229354fcf37b2fe8583d4a4855a5 (PR)
1. Check liveness of validators at old version: 6593fb81261f25490ffddc2252a861c994234c2a
compatibility::simple-validator-upgrade::liveness-check : committed: 16759.12 txn/s, latency: 2031.59 ms, (p50: 1800 ms, p70: 2000, p90: 2700 ms, p99: 5700 ms), latency samples: 568920
2. Upgrading first Validator to new version: 7e6d3ed7082e229354fcf37b2fe8583d4a4855a5
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 6367.81 txn/s, latency: 4769.98 ms, (p50: 5500 ms, p70: 5900, p90: 6000 ms, p99: 6200 ms), latency samples: 118880
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6024.37 txn/s, latency: 5523.52 ms, (p50: 6000 ms, p70: 6100, p90: 6200 ms, p99: 6300 ms), latency samples: 213860
3. Upgrading rest of first batch to new version: 7e6d3ed7082e229354fcf37b2fe8583d4a4855a5
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7777.75 txn/s, latency: 3821.58 ms, (p50: 4400 ms, p70: 4600, p90: 4900 ms, p99: 5100 ms), latency samples: 142800
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7947.94 txn/s, latency: 4225.32 ms, (p50: 4600 ms, p70: 4700, p90: 4900 ms, p99: 5000 ms), latency samples: 267580
4. upgrading second batch to new version: 7e6d3ed7082e229354fcf37b2fe8583d4a4855a5
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 12444.68 txn/s, latency: 2380.12 ms, (p50: 2700 ms, p70: 2800, p90: 3000 ms, p99: 3000 ms), latency samples: 214460
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 12233.11 txn/s, latency: 2681.71 ms, (p50: 2800 ms, p70: 2900, p90: 3100 ms, p99: 3200 ms), latency samples: 395160
5. check swarm health
Compatibility test for 6593fb81261f25490ffddc2252a861c994234c2a ==> 7e6d3ed7082e229354fcf37b2fe8583d4a4855a5 passed
Test Ok

Contributor

✅ Forge suite realistic_env_max_load success on 7e6d3ed7082e229354fcf37b2fe8583d4a4855a5

two traffics test: inner traffic : committed: 14663.66 txn/s, latency: 2711.04 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 4200 ms), latency samples: 5575540
two traffics test : committed: 99.96 txn/s, latency: 1541.88 ms, (p50: 1300 ms, p70: 1400, p90: 2300 ms, p99: 4200 ms), latency samples: 1760
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.504, avg: 1.351", "ConsensusProposalToOrdered: max: 0.294, avg: 0.290", "ConsensusOrderedToCommit: max: 0.312, avg: 0.303", "ConsensusProposalToCommit: max: 0.605, avg: 0.593"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.53s no progress at version 1195 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.53s no progress at version 2022528 (avg 0.52s) [limit 16].
Test Ok

@areshand areshand merged commit 78a80aa into main Jan 10, 2025
49 checks passed
@areshand areshand deleted the same_zone branch January 10, 2025 22:47
3 participants