[Bug] Broker loses bookie rack information in newer Pulsar versions #23282
Labels
release/blocker
Indicate the PR or issue that should block the release until it gets resolved
type/bug
The PR fixed a bug or issue reported a bug
Search before asking
Read release policy
Version
pulsar-3.0.6
Minimal reproduce step
What did you expect to see?
..
What did you see instead?
After upgrading to pulsar-3.0.6, we observed that when a bookie restarts, the rack information of some bookies becomes /defaultRegion/defaultRack, which is incorrect.
After digging into the code and the error log, this issue appears to be caused by PR #22846, which made BookieRackAffinityMapping#watchAvailableBookies asynchronous. However, I think this operation cannot be async.
Let's look at what happens when the bookieClient is constructed in Pulsar. The relevant code is at https://github.com/apache/bookkeeper/blob/1f1df813b9b4efd410925caadfa45cfb17b811ba/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/BookKeeper.java#L409-L548
pulsar/pulsar-broker-common/src/main/java/org/apache/pulsar/bookie/rackawareness/BookieRackAffinityMapping.java
Lines 114 to 170 in fc0e4e3
When the broker receives a notification for bookie creation from the metadataStore, execution enters this code block, which runs the first listener and then the second listener.
pulsar/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarRegistrationClient.java
Lines 221 to 233 in a8ae3e4
When the second listener runs placementPolicy.onClusterChanged(), it eventually reaches the following code and calls resolver.resolve(names). The resolver implementation is BookieRackAffinityMapping#resolve. https://github.com/apache/bookkeeper/blob/1f1df813b9b4efd410925caadfa45cfb17b811ba/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/TopologyAwareEnsemblePlacementPolicy.java#L554-L585
Therefore, the second listener actually depends on the first listener, and the two must be executed synchronously, in order.
Now that the first listener is asynchronous, when a bookie restarts the broker can permanently lose the rack information of that bookie, which causes serious problems.
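The ordering dependency can be illustrated with a minimal, self-contained sketch. This is not the Pulsar or BookKeeper code: the class `RackRaceSketch`, its methods, the bookie address, and the rack name are all hypothetical stand-ins, and a `CountDownLatch` is used only to force the "async update runs late" ordering deterministically. It shows that a resolve call which runs before the rack update has been applied falls back to the default rack.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical model of the two listeners; not the real Pulsar/BookKeeper classes.
class RackRaceSketch {
    static final String DEFAULT_RACK = "/defaultRegion/defaultRack";
    static final String BOOKIE = "bookie-1:3181";    // assumed example address
    static final String RACK = "/region-a/rack-1";   // assumed example rack

    // Stand-in for the rack mapping that the first listener maintains.
    private final Map<String, String> racks = new ConcurrentHashMap<>();

    // First listener, synchronous variant: the mapping is updated before returning.
    void updateRacksSync(String bookie, String rack) {
        racks.put(bookie, rack);
    }

    // First listener, asynchronous variant: the update is applied only after
    // the gate opens, modelling an update that is scheduled but runs late.
    CompletableFuture<Void> updateRacksAsync(String bookie, String rack, CountDownLatch gate) {
        return CompletableFuture.runAsync(() -> {
            try {
                gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            racks.put(bookie, rack);
        });
    }

    // Second listener: resolves the rack, falling back to the default when unknown.
    String resolve(String bookie) {
        return racks.getOrDefault(bookie, DEFAULT_RACK);
    }

    // Listeners run strictly in order: resolve sees the updated mapping.
    static String resolveWithSyncUpdate() {
        RackRaceSketch mapping = new RackRaceSketch();
        mapping.updateRacksSync(BOOKIE, RACK);
        return mapping.resolve(BOOKIE);
    }

    // The first listener only schedules the update: resolve runs before it is applied.
    static String resolveWithAsyncUpdate() {
        RackRaceSketch mapping = new RackRaceSketch();
        CountDownLatch gate = new CountDownLatch(1);
        CompletableFuture<Void> pending = mapping.updateRacksAsync(BOOKIE, RACK, gate);
        String seen = mapping.resolve(BOOKIE); // second listener runs first
        gate.countDown();                      // the async update is applied only now
        pending.join();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("sync update:  " + resolveWithSyncUpdate());
        System.out.println("async update: " + resolveWithAsyncUpdate());
    }
}
```

In this sketch the mapping does eventually hold the correct rack in the async case, but the caller of `resolve` has already consumed the default value. If nothing triggers another resolve afterwards, the stale default persists, which would match the permanent loss described above.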
We added a log statement in BookieRackAffinityMapping#updateRacksWithHost and confirmed that the problem occurs whenever the async code is executed later.
Anything else?
pulsar-2.9 does not have this issue.
Are you willing to submit a PR?