
Guava LocalCache recencyQueue is 223M entries dominating 5.3GB of heap #2408

Closed
toaler opened this issue Mar 2, 2016 · 5 comments

@toaler

toaler commented Mar 2, 2016

Similar to the issue described a while ago here [1], I'm seeing the same problem where the recencyQueue is 223M entries deep. My understanding is that the recencyQueue is drained on eviction, expiry, write, or every 64 reads from the cache. The specific cache is mainly read-dominant. The cache is set up as follows.

private static final Cache<Class<? extends Foo>, Optional<Bar>> propertyCache = CacheBuilder
        .newBuilder().maximumSize(50).build();   

The extract from MAT shows the following: 223M entries strongly referenced by the LocalCache$Segment[4].

I haven't been able to reproduce the problem locally, and it happens very infrequently in production. Any ideas on how/why this is happening?

propertyCache com.google.common.cache.LocalCache$LocalManualCache 5356412192 16
localCache com.google.common.cache.LocalCache 5356412176 128
segments com.google.common.cache.LocalCache$Segment[4] 5356412048 32
[1] com.google.common.cache.LocalCache$Segment 5356400000 80
map com.google.common.cache.LocalCache 5356412176 128
segments com.google.common.cache.LocalCache$Segment[4] 5356412048 32
[1] com.google.common.cache.LocalCache$Segment 5356400000 80
map com.google.common.cache.LocalCache 5356412176 128
segments com.google.common.cache.LocalCache$Segment[4] 5356412048 32
[1] com.google.common.cache.LocalCache$Segment 5356400000 80

        recencyQueue  java.util.concurrent.ConcurrentLinkedQueue size = 223183166 5356396032 24
@toaler
Author

toaler commented Mar 3, 2016

@ben-manes
Contributor

ben-manes commented Mar 2, 2016

In synthetic tests this is easy to reproduce, but it wasn't considered worth handling beyond a safety threshold in realistic usages (per CLHM in your reference). The long-term plan was to switch to a ring buffer, but that kept being put off as it wasn't seen as overly critical. Without a real-world failure it was treated as a GC + perf optimization, which is less important given Google's impressive infrastructure (e.g. less application-cache centric by having fast data stores).

The best solution is #2063. You can of course switch to Caffeine if you're on Java 8, where I tried to take care of all of the improvements that never made it out of Guava's backlog. But I suspect that the problem is some runaway threads doing very little else but causing churn on the cache. You might want to see why this propertyCache is so dominating that the segment queues can't be drained fast enough. That might be partially mitigated by increasing the concurrencyLevel (which defaults to 4) to allow more segments to be drained in parallel. I wouldn't expect an explicit cleanUp() thread to help, but it's worth a shot. For both Guava and CLHM we only saw this in synthetic stress tests and this is the first real-world error, making me suspicious that you have a bug that's the real culprit here.
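A hedged sketch of those two mitigations together, using Guava's public CacheBuilder.concurrencyLevel and Cache.cleanUp APIs (Foo and Bar below are placeholders standing in for the types from the original snippet; whether this actually helps depends on the workload):

    import java.util.Optional;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    class PropertyCacheHolder {
        interface Foo {}           // placeholder for the original issue's type
        static final class Bar {}  // placeholder for the original issue's type

        // More segments means more recency queues that can be drained in
        // parallel by the calling threads.
        static final Cache<Class<? extends Foo>, Optional<Bar>> propertyCache =
                CacheBuilder.newBuilder()
                        .concurrencyLevel(16)   // the default is 4
                        .maximumSize(50)
                        .build();

        // Optional extra: periodically force the queues to be drained even
        // when no writes occur.
        static final ScheduledExecutorService cleaner =
                Executors.newSingleThreadScheduledExecutor();

        static {
            cleaner.scheduleWithFixedDelay(propertyCache::cleanUp, 1, 1, TimeUnit.SECONDS);
        }
    }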

@toaler
Author

toaler commented Mar 8, 2016

Thanks Ben,

I'm struggling to reproduce the issue. What is the access pattern that causes the recency queue to grow unbounded?

- Brian


@ben-manes
Contributor

The access rate has to exceed the drain (cleanUp) rate. If you have more threads than CPUs doing nothing but reading from the cache without any pause time, then the queue will grow excessively large. See this gist, which is an adaptation of the stress tests I use. You'll see that Guava quickly fails with the error java.lang.OutOfMemoryError: GC overhead limit exceeded.
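For reference (this is not the gist mentioned above, which isn't reproduced here), a minimal sketch of that kind of synthetic stress test: more reader threads than cores in a tight read loop with no pause time. Whether it actually hits the OutOfMemoryError depends on hardware, heap size, and GC settings.

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    public class RecencyQueueStress {
        public static void main(String[] args) {
            Cache<Integer, Integer> cache =
                    CacheBuilder.newBuilder().maximumSize(50).build();
            for (int i = 0; i < 50; i++) {
                cache.put(i, i); // warm the cache so every read is a hit
            }
            // Every hit enqueues a recency event; with no writes and no pauses
            // the per-segment drain can fall behind the enqueue rate.
            int threads = 2 * Runtime.getRuntime().availableProcessors();
            for (int t = 0; t < threads; t++) {
                new Thread(() -> {
                    for (int i = 0; ; i = (i + 1) % 50) {
                        cache.getIfPresent(i);
                    }
                }).start();
            }
        }
    }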

In Guava each segment is drained independently and the work is delegated to one of the calling threads. In that test the accesses are spread across a distribution. In reality some entries in a cache are hotter than others (a Zipf distribution), so the per-segment model was pragmatic but not optimal. Still, the assumption that the application does other work besides thrashing the cache holds true in practice, which is why this was only ever seen in synthetic tests. Using a ring buffer that loses events corrects this problem, and losing events is justified because most of them are for the same hot entries, which the cache will keep anyway. If the access distribution were uniform, a perfect LRU wouldn't have a better hit rate either.
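To make the "ring buffer that loses events" idea concrete, here is a deliberately simplified sketch, not actual Guava or Caffeine code (Caffeine uses striped, lock-free ring buffers): a bounded buffer whose writes drop events when it is full, so readers never block and the backlog can never grow without bound.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.function.Consumer;

    // Simplified illustration of a lossy recency buffer.
    final class LossyRecencyBuffer<K> {
        private final ArrayBlockingQueue<K> buffer = new ArrayBlockingQueue<>(128);

        // Record a read; if the buffer is full the event is silently dropped.
        void recordRead(K key) {
            buffer.offer(key); // non-blocking, returns false when full
        }

        // Drain pending events into the eviction policy, represented here by a Consumer.
        void drainTo(Consumer<K> policy) {
            K key;
            while ((key = buffer.poll()) != null) {
                policy.accept(key);
            }
        }
    }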

electrum pushed a commit to prestodb/presto-hive-apache that referenced this issue May 12, 2017
Guava's cache performs poorly under high load when tracking recency. The
details are in google/guava#2408. The short version is that each access is
tracked and aggregated. If there's constant read access by more threads
than there are cores (as happens for the AvroSerde in a Presto worker),
the whole thing gets backed up and ultimately leads to a lengthy GC pause
and query timeouts.

Expiring based on age instead of recency avoids the issue.
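As an illustration of the "age instead of recency" approach the commit message describes (the actual Presto change is not reproduced here, and the key/value types below are placeholders), a Guava cache configured only with expireAfterWrite, with no maximumSize or expireAfterAccess, has no access-order bookkeeping to do on reads:

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    class AgeBasedCacheExample {
        // Entries expire a fixed time after they were written; reads do not
        // need to be tracked for eviction order, so a read-heavy hot path
        // stays cheap.
        static final Cache<String, Object> schemaCache =
                CacheBuilder.newBuilder()
                        .expireAfterWrite(10, TimeUnit.MINUTES)
                        .build();
    }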
@cpovirk
Member

cpovirk commented Jul 24, 2019

At this point, we recommend Caffeine, and we're unlikely to make even bugfixes to cache unless there's a common issue :(
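For readers making that switch, a minimal sketch of the equivalent Caffeine cache (assuming Java 8+; Foo and Bar again stand in for the original issue's types):

    import java.util.Optional;

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    class PropertyCacheWithCaffeine {
        interface Foo {}           // placeholder for the original issue's type
        static final class Bar {}  // placeholder for the original issue's type

        // Caffeine records reads in bounded, lossy buffers, so a read-heavy
        // workload cannot build up an unbounded recency queue.
        static final Cache<Class<? extends Foo>, Optional<Bar>> propertyCache =
                Caffeine.newBuilder().maximumSize(50).build();
    }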
