
Guava LocalCache recencyQueue is 223M entries dominating 5.3GB of heap #2408

Closed
toaler opened this issue Mar 2, 2016 · 5 comments

@toaler

toaler commented Mar 2, 2016

Similar to the issue described a while ago here [1], I'm seeing the same problem where the recencyQueue is 223M entries deep. My understanding is that the recencyQueue is drained on eviction, expiry, write, or every 64 reads from the cache. The specific cache is mainly read-dominant. The cache is set up as follows.

private static final Cache<Class<? extends Foo>, Optional<Bar>> propertyCache = CacheBuilder
        .newBuilder().maximumSize(50).build();   

The extract from MAT shows the following: 223M entries strongly referenced by the LocalCache$Segment[4].

I haven't been able to reproduce the problem locally, and it happens very infrequently in production. Any ideas on how/why this is happening?

propertyCache com.google.common.cache.LocalCache$LocalManualCache 5356412192 16
localCache com.google.common.cache.LocalCache 5356412176 128
segments com.google.common.cache.LocalCache$Segment[4] 5356412048 32
[1] com.google.common.cache.LocalCache$Segment 5356400000 80
map com.google.common.cache.LocalCache 5356412176 128
segments com.google.common.cache.LocalCache$Segment[4] 5356412048 32
[1] com.google.common.cache.LocalCache$Segment 5356400000 80
map com.google.common.cache.LocalCache 5356412176 128
segments com.google.common.cache.LocalCache$Segment[4] 5356412048 32
[1] com.google.common.cache.LocalCache$Segment 5356400000 80

        recencyQueue  java.util.concurrent.ConcurrentLinkedQueue size = 223183166 5356396032 24
@toaler
Author

toaler commented Mar 3, 2016

@ben-manes
Contributor

ben-manes commented Mar 2, 2016

In synthetic tests this is easy to reproduce, but it wasn't considered worth handling beyond a safety threshold in realistic usages (per CLHM in your reference). The long-term plan was to switch to a ring buffer, but that kept being put off as it wasn't seen as overly critical. Without a real-world failure it was treated as a GC + perf optimization, which is less important given Google's impressive infrastructure (e.g. less application-cache centric by having fast data stores).

The best solution is #2063. You can of course switch to Caffeine if you're on Java 8, where I tried to take care of all of the improvements that never made it out of Guava's backlog. But I suspect that the problem is some runaway threads doing very little else but causing churn on the cache. You might want to see why this propertyCache is so dominating that the segment queues can't be drained fast enough. That might be partially mitigated by increasing the concurrencyLevel (which defaults to 4) to allow more segments to be drained in parallel. I wouldn't expect an explicit cleanUp() thread to help, but it's worth a shot. For both Guava and CLHM we only saw this in synthetic stress tests and this is the first real-world error, making me suspicious that you have a bug that's the real culprit here.
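A hedged sketch of those two mitigations together, using Guava's public CacheBuilder.concurrencyLevel and Cache.cleanUp APIs (Foo and Bar below are placeholders standing in for the types from the original snippet; whether this actually helps depends on the workload):

    import java.util.Optional;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    class PropertyCacheHolder {
        interface Foo {}           // placeholder for the original issue's type
        static final class Bar {}  // placeholder for the original issue's type

        // More segments means more recency queues that can be drained in
        // parallel by the calling threads.
        static final Cache<Class<? extends Foo>, Optional<Bar>> propertyCache =
                CacheBuilder.newBuilder()
                        .concurrencyLevel(16)   // the default is 4
                        .maximumSize(50)
                        .build();

        // Optional extra: periodically force the queues to be drained even
        // when no writes occur.
        static final ScheduledExecutorService cleaner =
                Executors.newSingleThreadScheduledExecutor();

        static {
            cleaner.scheduleWithFixedDelay(propertyCache::cleanUp, 1, 1, TimeUnit.SECONDS);
        }
    }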

@toaler
Author

toaler commented Mar 8, 2016

Thanks Ben,

I'm struggling to reproduce the issue. What is the access pattern that causes the recency queue to grow unbounded?

- Brian


@ben-manes
Contributor

The access rate has to exceed the drain (cleanUp) rate. If you have more threads than CPUs doing nothing but reading from the cache without any pause time, then the queue will grow excessively large. See this gist, which is an adaptation of the stress tests I use. You'll see that Guava quickly fails with the error java.lang.OutOfMemoryError: GC overhead limit exceeded.
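For reference (this is not the gist mentioned above, which isn't reproduced here), a minimal sketch of that kind of synthetic stress test: more reader threads than cores in a tight read loop with no pause time. Whether it actually hits the OutOfMemoryError depends on hardware, heap size, and GC settings.

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    public class RecencyQueueStress {
        public static void main(String[] args) {
            Cache<Integer, Integer> cache =
                    CacheBuilder.newBuilder().maximumSize(50).build();
            for (int i = 0; i < 50; i++) {
                cache.put(i, i); // warm the cache so every read is a hit
            }
            // Every hit enqueues a recency event; with no writes and no pauses
            // the per-segment drain can fall behind the enqueue rate.
            int threads = 2 * Runtime.getRuntime().availableProcessors();
            for (int t = 0; t < threads; t++) {
                new Thread(() -> {
                    for (int i = 0; ; i = (i + 1) % 50) {
                        cache.getIfPresent(i);
                    }
                }).start();
            }
        }
    }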

In Guava each segment is drained independently and the work is delegated to one of the calling threads. In that test the accesses are spread across a distribution. In reality some entries in a cache are hotter than others (a Zipf distribution), so the per-segment model was pragmatic but not optimal. Still, the assumption that the application does other work besides thrashing the cache holds true in practice, which is why this was only ever seen in synthetic tests. Using a ring buffer that loses events corrects this problem, and losing events is justified because most of them are for the same hot entries, which the cache will keep anyway. If the access distribution were uniform, a perfect LRU wouldn't have a better hit rate either.
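To make the "ring buffer that loses events" idea concrete, here is a deliberately simplified sketch, not actual Guava or Caffeine code (Caffeine uses striped, lock-free ring buffers): a bounded buffer whose writes drop events when it is full, so readers never block and the backlog can never grow without bound.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.function.Consumer;

    // Simplified illustration of a lossy recency buffer.
    final class LossyRecencyBuffer<K> {
        private final ArrayBlockingQueue<K> buffer = new ArrayBlockingQueue<>(128);

        // Record a read; if the buffer is full the event is silently dropped.
        void recordRead(K key) {
            buffer.offer(key); // non-blocking, returns false when full
        }

        // Drain pending events into the eviction policy, represented here by a Consumer.
        void drainTo(Consumer<K> policy) {
            K key;
            while ((key = buffer.poll()) != null) {
                policy.accept(key);
            }
        }
    }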

electrum pushed a commit to prestodb/presto-hive-apache that referenced this issue May 12, 2017
Guava's cache performs poorly under high load when tracking recency. The
details are in google/guava#2408. The short version is that each access is
tracked and aggregated. If there's constant read access by more threads
than there are cores (as happens for the AvroSerde in a Presto worker),
the whole thing gets backed up and ultimately leads to a lengthy GC pause
and query timeouts.

Expiring based on age instead of recency avoids the issue.
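As an illustration of the "age instead of recency" approach the commit message describes (the actual Presto change is not reproduced here, and the key/value types below are placeholders), a Guava cache configured only with expireAfterWrite, with no maximumSize or expireAfterAccess, has no access-order bookkeeping to do on reads:

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    class AgeBasedCacheExample {
        // Entries expire a fixed time after they were written; reads do not
        // need to be tracked for eviction order, so a read-heavy hot path
        // stays cheap.
        static final Cache<String, Object> schemaCache =
                CacheBuilder.newBuilder()
                        .expireAfterWrite(10, TimeUnit.MINUTES)
                        .build();
    }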
@cpovirk
Member

cpovirk commented Jul 24, 2019

At this point, we recommend Caffeine, and we're unlikely to make even bugfixes to cache unless there's a common issue :(
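For readers making that switch, a minimal sketch of the equivalent Caffeine cache (assuming Java 8+; Foo and Bar again stand in for the original issue's types):

    import java.util.Optional;

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    class PropertyCacheWithCaffeine {
        interface Foo {}           // placeholder for the original issue's type
        static final class Bar {}  // placeholder for the original issue's type

        // Caffeine records reads in bounded, lossy buffers, so a read-heavy
        // workload cannot build up an unbounded recency queue.
        static final Cache<Class<? extends Foo>, Optional<Bar>> propertyCache =
                Caffeine.newBuilder().maximumSize(50).build();
    }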
