[fix][ml] Fix deadlock in PendingReadsManager #23958

Merged on Feb 11, 2025 (2 commits)

@lhotari lhotari commented Feb 11, 2025

Fixes #23952

Motivation

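A deadlock can occur in PendingReadsManager after the changes in #23901. The deadlock was captured in a test case, but code analysis shows that the same problem can also occur in production. In the thread dump below, one thread runs the read callbacks while holding the PendingRead monitor and then blocks on the dispatcher's monitor, while a second thread holds the dispatcher's monitor and blocks in PendingRead.addListener.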

Found one Java-level deadlock:
=============================
"main":
  waiting to lock monitor 0x00007fd520fe6f10 (object 0x000010003425f350, a org.apache.pulsar.broker.service.persistent.PersistentSubscription),
  which is held by "PulsarTestContext-executor-OrderedExecutor-0-0"

"PulsarTestContext-executor-OrderedExecutor-0-0":
  waiting to lock monitor 0x00007fd4f400dd00 (object 0x000010003425fd70, a org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer),
  which is held by "broker-topic-workers-OrderedExecutor-0-0"

"broker-topic-workers-OrderedExecutor-0-0":
  waiting to lock monitor 0x00007fd7a406f4a0 (object 0x000010003427f678, a org.apache.bookkeeper.mledger.impl.cache.PendingReadsManager$PendingRead),
  which is held by "PulsarTestContext-executor-OrderedExecutor-0-0"

Java stack information for the threads listed above:
===================================================
"main":
	at org.apache.pulsar.broker.service.persistent.PersistentSubscription.close(PersistentSubscription.java)
	- waiting to lock <0x000010003425f350> (a org.apache.pulsar.broker.service.persistent.PersistentSubscription)
	at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$close$56(PersistentTopic.java:1697)
	...
"PulsarTestContext-executor-OrderedExecutor-0-0":
	at org.apache.pulsar.broker.service.AbstractDispatcherSingleActiveConsumer.disconnectActiveConsumers(AbstractDispatcherSingleActiveConsumer.java)
	- waiting to lock <0x000010003425fd70> (a org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer)
	at org.apache.pulsar.broker.service.persistent.PersistentSubscription.resetCursor(PersistentSubscription.java:856)
	- locked <0x000010003425f350> (a org.apache.pulsar.broker.service.persistent.PersistentSubscription)
	at org.apache.pulsar.broker.service.persistent.PersistentSubscription$6.findEntryComplete(PersistentSubscription.java:824)
	at org.apache.pulsar.broker.service.persistent.PersistentMessageFinder.findEntryComplete(PersistentMessageFinder.java:162)
	at org.apache.bookkeeper.mledger.impl.OpFindNewest.readEntryComplete(OpFindNewest.java:133)
	at org.apache.bookkeeper.mledger.impl.cache.RangeEntryCacheImpl$1.readEntriesComplete(RangeEntryCacheImpl.java:241)
	at org.apache.bookkeeper.mledger.impl.cache.PendingReadsManager$PendingRead.readEntriesComplete(PendingReadsManager.java:253)
	- locked <0x000010003427f678> (a org.apache.bookkeeper.mledger.impl.cache.PendingReadsManager$PendingRead)
	at org.apache.bookkeeper.mledger.impl.cache.PendingReadsManager$PendingRead.lambda$attach$0(PendingReadsManager.java:232)
	at org.apache.bookkeeper.mledger.impl.cache.PendingReadsManager$PendingRead$$Lambda/0x00007fd54cb0fc60.run(Unknown Source)
	at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
	at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.runWith(java.base/Thread.java:1596)
	at java.lang.Thread.run(java.base/Thread.java:1583)
"broker-topic-workers-OrderedExecutor-0-0":
	at org.apache.bookkeeper.mledger.impl.cache.PendingReadsManager$PendingRead.addListener(PendingReadsManager.java)
	- waiting to lock <0x000010003427f678> (a org.apache.bookkeeper.mledger.impl.cache.PendingReadsManager$PendingRead)
	at org.apache.bookkeeper.mledger.impl.cache.PendingReadsManager.readEntries(PendingReadsManager.java:430)
	at org.apache.bookkeeper.mledger.impl.cache.RangeEntryCacheImpl.doAsyncReadEntriesByPosition(RangeEntryCacheImpl.java:416)
	at org.apache.bookkeeper.mledger.impl.cache.RangeEntryCacheImpl.asyncReadEntriesByPosition(RangeEntryCacheImpl.java:303)
	at org.apache.bookkeeper.mledger.impl.cache.RangeEntryCacheImpl.asyncReadEntry0(RangeEntryCacheImpl.java:282)
	at org.apache.bookkeeper.mledger.impl.cache.RangeEntryCacheImpl.asyncReadEntry(RangeEntryCacheImpl.java:264)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncReadEntry(ManagedLedgerImpl.java:2180)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.internalReadFromLedger(ManagedLedgerImpl.java:2150)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncReadEntries(ManagedLedgerImpl.java:1906)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.asyncReadEntriesWithSkip(ManagedCursorImpl.java:871)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.asyncReadEntriesWithSkipOrWait(ManagedCursorImpl.java:1024)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.asyncReadEntriesOrWait(ManagedCursorImpl.java:997)
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.readMoreEntries(PersistentDispatcherSingleActiveConsumer.java:387)
	- locked <0x000010003425fd70> (a org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer)
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer.lambda$dispatchEntriesToConsumer$2(PersistentDispatcherSingleActiveConsumer.java:242)
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer$$Lambda/0x00007fd54cb4bc30.run(Unknown Source)
	at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
	at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:113)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.runWith(java.base/Thread.java:1596)
	at java.lang.Thread.run(java.base/Thread.java:1583)

Found 1 deadlock.
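
The dump can be read as a wait-for graph: each blocked thread points at the thread that holds the monitor it needs. The following is a minimal sketch of that reading, not part of the PR. The class `WaitForGraphSketch` and its methods are hypothetical names; only the thread names are taken from the dump. It suggests that "main" is merely blocked behind the cycle, while the two executor threads form the cycle itself.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative helper (hypothetical, not Pulsar code) that models the dump as a wait-for graph. */
final class WaitForGraphSketch {

    /** Edges "blocked thread -> thread holding the wanted monitor", copied from the dump. */
    static Map<String, String> fromDump() {
        Map<String, String> waitsFor = new LinkedHashMap<>();
        waitsFor.put("main", "PulsarTestContext-executor-OrderedExecutor-0-0");
        waitsFor.put("PulsarTestContext-executor-OrderedExecutor-0-0",
                "broker-topic-workers-OrderedExecutor-0-0");
        waitsFor.put("broker-topic-workers-OrderedExecutor-0-0",
                "PulsarTestContext-executor-OrderedExecutor-0-0");
        return waitsFor;
    }

    /** Follows the edges from start and returns the threads on the cycle, or an empty set if none. */
    static Set<String> findCycle(Map<String, String> waitsFor, String start) {
        Set<String> visited = new LinkedHashSet<>();
        String current = start;
        // Walk until we either run out of edges or revisit a thread.
        while (current != null && visited.add(current)) {
            current = waitsFor.get(current);
        }
        if (current == null) {
            return Set.of();
        }
        // "current" is the first revisited thread; everything from it onward is the cycle.
        Set<String> cycle = new LinkedHashSet<>();
        boolean inCycle = false;
        for (String thread : visited) {
            if (thread.equals(current)) {
                inCycle = true;
            }
            if (inCycle) {
                cycle.add(thread);
            }
        }
        return cycle;
    }

    public static void main(String[] args) {
        // Prints the two executor threads; "main" is not part of the cycle.
        System.out.println(findCycle(fromDump(), "main"));
    }
}
```

Breaking any single edge of the cycle resolves the deadlock; the edge that this PR targets is the one where callbacks run while the PendingRead monitor is held.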

Modifications

  • Run callbacks in PendingRead without holding the synchronization lock on the PendingRead instance.
  • Copy the listeners/callbacks before invoking them so that synchronization isn't needed while they run.
  • Refactor PendingRead and replace the single "boolean completed" flag with a state field.
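
The pattern described in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual PendingReadsManager code: the class `PendingReadSketch`, its `State` values, and the `Consumer`-based listener type are all hypothetical. The state is changed and the listener list is copied inside a short synchronized block, and the callbacks then run after the monitor has been released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical, simplified stand-in for PendingRead; not the actual Pulsar implementation. */
final class PendingReadSketch<T> {
    // Assumed states for illustration; the real state field may define different values.
    private enum State { PENDING, COMPLETED }

    private State state = State.PENDING;                            // guarded by "this"
    private final List<Consumer<T>> listeners = new ArrayList<>();  // guarded by "this"

    /** Registers a listener; returns false if the read has already completed. */
    synchronized boolean addListener(Consumer<T> listener) {
        if (state == State.COMPLETED) {
            return false;
        }
        listeners.add(listener);
        return true;
    }

    /** Marks the read as completed and notifies listeners outside the monitor. */
    void complete(T result) {
        List<Consumer<T>> snapshot;
        synchronized (this) {
            if (state == State.COMPLETED) {
                return; // already completed; nothing left to notify
            }
            state = State.COMPLETED;
            snapshot = new ArrayList<>(listeners);
            listeners.clear();
        }
        // The monitor on "this" is no longer held here, so a callback that blocks on
        // another lock cannot keep a concurrent addListener() call waiting.
        for (Consumer<T> listener : snapshot) {
            listener.accept(result);
        }
    }

    public static void main(String[] args) {
        PendingReadSketch<String> read = new PendingReadSketch<>();
        read.addListener(v ->
                System.out.println("got " + v + ", lock held: " + Thread.holdsLock(read)));
        read.complete("entries"); // prints "got entries, lock held: false"
        System.out.println("late listener accepted: " + read.addListener(v -> { }));
        // prints "late listener accepted: false"
    }
}
```

Because a listener that arrives after completion is rejected rather than queued, the caller in this sketch has to handle the `false` return value itself, for example by starting a new read.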

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

codecov-commenter commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 81.81818% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.24%. Comparing base (bbc6224) to head (26b6028).
Report is 895 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...keeper/mledger/impl/cache/PendingReadsManager.java | 81.81% | 3 Missing and 1 partial ⚠️ |

@@             Coverage Diff              @@
##             master   #23958      +/-   ##
============================================
+ Coverage     73.57%   74.24%   +0.67%     
+ Complexity    32624    31894     -730     
============================================
  Files          1877     1853      -24     
  Lines        139502   143821    +4319     
  Branches      15299    16339    +1040     
============================================
+ Hits         102638   106782    +4144     
+ Misses        28908    28650     -258     
- Partials       7956     8389     +433     
| Flag | Coverage Δ |
|---|---|
| inttests | 26.86% <77.27%> (+2.27%) ⬆️ |
| systests | 23.26% <77.27%> (-1.06%) ⬇️ |
| unittests | 73.76% <81.81%> (+0.91%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| ...keeper/mledger/impl/cache/PendingReadsManager.java | 86.41% <81.81%> (-0.25%) ⬇️ |

... and 1043 files with indirect coverage changes

@lhotari lhotari merged commit 367faef into apache:master Feb 11, 2025
52 checks passed
lhotari added a commit that referenced this pull request Feb 11, 2025
lhotari added a commit that referenced this pull request Feb 11, 2025
lhotari added a commit that referenced this pull request Feb 11, 2025
hanmz pushed a commit to hanmz/pulsar that referenced this pull request Feb 12, 2025
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 19, 2025
(cherry picked from commit 367faef)
(cherry picked from commit f2aa71b)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 20, 2025
(cherry picked from commit 367faef)
(cherry picked from commit 1fd5903)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 24, 2025
(cherry picked from commit 367faef)
(cherry picked from commit f2aa71b)