-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][broker] topic policy deadlock block metadata thread. #23786
[fix][broker] topic policy deadlock block metadata thread. #23786
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In short, I don't think this change fixes the deadlock.
Deadlock analysis
In thread A (broker-client-shared-internal-executor-5-1
):
return topics.computeIfAbsent(topicName.toString(), (tpName) ->
loadOrCreatePersistentTopic(tpName, createIfMissing, properties, topicPolicies));
It already holds the lock of SystemTopicBasedTopicPoliciesService#policyCacheInitMap
(via policyCacheInitMap.compute
method) and tries to acquire the lock of BrokerService#topics
.
Thread B (pulsar-io-3-6
) is blocked at the same line, while it already holds the lock of BrokerService#topics
and it goes into the loadOrCreatePersistentTopic
method. However, it eventually calls SystemTopicBasedTopicPoliciesService.getTopicPoliciesAsync
and tries to acquire the lock of SystemTopicBasedTopicPoliciesService#policyCacheInitMap
.
Thread C (metadata-store-10-1
) is just blocked at acquiring the lock of policyCacheInitMap
like what thread B is blocked. However, the root cause is the deadlock between thread A and B.
Why I think the changes do not work
You've made the following changes:
- Move the
policyCacheInitMap.compute
call fromthenAccept
tothenComposeAsync
. - Don't call
whenComplete
in thethenComposeAsync
method.
The 1st change just moves the thread C from the metadata store thread pool to the CompletableFuture#ASYNC_POOL
thread pool. It doesn't affect the deadlock between thread A and B. The 2nd change actually does not make a difference because in the existing code, it's already called in the thread pool of CompletableFuture#ASYNC_POOL
via CompletableFuture.runAsync
.
The public CompletableFuture<ManagedLedgerFactory> getManagedLedgerFactoryForTopic(TopicName topicName) {
// NOTE: it calls `getTopicPoliciesBypassSystemTopic` and tries to acquire the lock of
//. `SystemTopicBasedTopicPoliciesService#policyCacheInitMap`
return getManagedLedgerConfig(topicName) called by protected CompletableFuture<Map<String, String>> fetchTopicPropertiesAsync(TopicName topicName) {
if (!topicName.isPartitioned()) {
return getManagedLedgerFactoryForTopic(topicName).thenCompose( called by private void checkOwnershipAndCreatePersistentTopic(/* ... */) {
TopicName topicName = TopicName.get(topic);
pulsar.getNamespaceService().isServiceUnitActiveAsync(topicName)
.thenAccept(isActive -> {
if (isActive) {
CompletableFuture<Map<String, String>> propertiesFuture;
if (properties == null) {
//Read properties from storage when loading topic.
propertiesFuture = fetchTopicPropertiesAsync(topicName); // HERE called by protected CompletableFuture<Optional<Topic>> loadOrCreatePersistentTopic(final String topic,
boolean createIfMissing, Map<String, String> properties, @Nullable TopicPolicies topicPolicies) {
/* ... */
checkTopicNsOwnership(topic)
.thenRun(() -> {
final Semaphore topicLoadSemaphore = topicLoadRequestSemaphore.get();
if (topicLoadSemaphore.tryAcquire()) {
checkOwnershipAndCreatePersistentTopic(topic, createIfMissing, topicFuture, // HERE
properties, topicPolicies); Here I think the correct fix is to move |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #23786 +/- ##
============================================
+ Coverage 73.57% 74.16% +0.59%
+ Complexity 32624 31764 -860
============================================
Files 1877 1853 -24
Lines 139502 143369 +3867
Branches 15299 16277 +978
============================================
+ Hits 102638 106331 +3693
+ Misses 28908 28669 -239
- Partials 7956 8369 +413
Flags with carried forward coverage won't be shown. Click here to find out more.
|
...ker/src/main/java/org/apache/pulsar/broker/service/SystemTopicBasedTopicPoliciesService.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deadlock is caused by locks from two concurrent hash maps (policyCacheInitMap
and BrokerService#topics
), so fixing one (policyCacheInitMap
) of them in this PR is okay for me.
@mattisonchao Please create a separate PR for cherry-picking to branch-3.3 since there are major merge conflicts. The merge conflicts seem to be mainly due to lack of #23319 changes in branch-3.3 . |
(cherry picked from commit 86f8a84)
Motivation
Modifications
Verifying this change
Documentation
doc-not-needed