Guava LocalCache recencyQueue is 223M entries dominating 5.3GB of heap #2408
Comments
Forgot the reference link: [1] https://groups.google.com/forum/#!msg/guava-discuss/LWutCZo8eH0/pBgXKa6293wJ
In synthetic tests this is easy to reproduce, but it was viewed as something realistic usages wouldn't hit, so it was never handled beyond a safety threshold (per CLHM in your reference). The long-term plan was to switch to a ring buffer, but that kept being put off because it wasn't seen as critical. Without a real-world failure it was treated as a GC and performance optimization, which is less important given Google's impressive infrastructure (e.g. less application-cache centric, since the data stores are fast). The best solution is #2063. You can of course switch to Caffeine if you're on Java 8, where I tried to take care of all of the improvements that never made it out of Guava's backlog. But I suspect that the problem is some runaway threads doing very little else but causing churn on the cache. You might want to look into why that churn is happening.
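Editorial aside: a minimal sketch of the ring-buffer idea mentioned above, assuming a simple single-drainer design. This is illustrative only, not Guava's or Caffeine's actual implementation, and all names are made up. The point is that a full buffer drops events instead of queueing them, so it can never grow the way the unbounded recencyQueue can.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Consumer;

/**
 * Illustrative lossy read buffer (hypothetical, not Guava's or Caffeine's code):
 * accesses are recorded into a fixed-size ring and silently dropped when the
 * drain cannot keep up, so the buffer cannot grow without bound.
 */
final class LossyReadBuffer<E> {
  private static final int SIZE = 16;              // must be a power of two
  private static final int MASK = SIZE - 1;

  private final AtomicReferenceArray<E> buffer = new AtomicReferenceArray<>(SIZE);
  private final AtomicLong writeCounter = new AtomicLong();
  private final AtomicLong readCounter = new AtomicLong(); // advanced only by the drainer

  /** Records an access, or drops it when the ring is full. Returns true if recorded. */
  boolean offer(E event) {
    long tail = writeCounter.get();
    if (tail - readCounter.get() >= SIZE) {
      return false;                                // full: lose the event instead of queueing it
    }
    if (writeCounter.compareAndSet(tail, tail + 1)) {
      buffer.lazySet((int) (tail & MASK), event);
      return true;
    }
    return false;                                  // lost the race; also treated as a drop
  }

  /** Drains recorded events, e.g. to replay them onto an LRU list under a lock. */
  void drainTo(Consumer<E> consumer) {
    long head = readCounter.get();
    long tail = writeCounter.get();
    while (head < tail) {
      int index = (int) (head & MASK);
      E event = buffer.get(index);
      if (event == null) {
        break;                                     // a writer claimed the slot but hasn't published it yet
      }
      buffer.lazySet(index, null);
      head++;
      consumer.accept(event);
    }
    readCounter.lazySet(head);
  }
}
```

Dropping events is acceptable here because, as noted later in the thread, most accesses hit the same hot entries, which the cache would retain anyway.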
Thanks Ben, I'm struggling to reproduce the issue. What is the access pattern that triggers it?
The access rate has to exceed the drain (cleanUp) rate. If you have more threads than CPUs doing nothing but reading from the cache, without any pause time, then the queue will grow excessively large. See this gist, which is an adaptation of the stress tests I use; you'll see that Guava quickly fails with an error.

In Guava each segment is drained independently and the work is delegated to one of the calling threads. In that test the accesses are spread across the distribution. In reality some entries in a cache are hotter than others (a Zipf distribution), so the per-segment model was pragmatic but not optimal. Still, the assumption that the application is doing other work besides thrashing the cache holds true in practice, which is why this was only seen in synthetic tests. Using a ring buffer that loses events corrects the problem, and is justified because most of the events are for the same hot entries, which the cache will keep anyway. If the access distribution were uniform, a perfect LRU wouldn't have a better hit rate anyway.
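The gist itself isn't reproduced in this thread, but a rough sketch of that kind of stress test, under the assumptions described above (more reader threads than cores, no pause time, a size-bounded cache), could look like the following. The sizes and durations are made-up values, and whether it actually blows up depends on the hardware and drain thresholds.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/** Rough sketch of a read-only stress test; parameters are illustrative. */
public class RecencyQueueStress {
  public static void main(String[] args) throws InterruptedException {
    Cache<Integer, Integer> cache = CacheBuilder.newBuilder()
        .maximumSize(10_000)                        // size eviction => recency tracking is active
        .build();
    for (int i = 0; i < 10_000; i++) {
      cache.put(i, i);
    }

    // More reader threads than cores, each doing nothing but cache reads.
    int threads = 2 * Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.execute(() -> {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        while (!Thread.currentThread().isInterrupted()) {
          cache.getIfPresent(random.nextInt(10_000)); // each read records a recency event
        }
      });
    }

    // Watch heap usage; if the reads outpace the piggy-backed drain, it keeps climbing.
    for (int i = 0; i < 60; i++) {
      TimeUnit.SECONDS.sleep(1);
      long used = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
      System.out.printf("used heap: %,d bytes%n", used);
    }
    pool.shutdownNow();
  }
}
```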
Guava's cache performs poorly under high load when tracking recency. The details are in google/guava#2408. The short version is that each access is tracked and aggregated. If there's constant read access by more threads than there are cores (as happens for the AvroSerde in a Presto worker), the whole thing gets backed up and ultimately leads to a lengthy GC pause and query timeouts. Expiring based on age instead of recency avoids the issue.
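Editorial aside: in CacheBuilder terms that workaround amounts to the difference between the two configurations sketched below (the durations are made-up values). Only the access-expiring or size-bounded variants need to record reads, which is the code path that backs up here.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

class ExpirationChoice {
  // Access-based expiration: every read must be recorded so the entry's recency
  // can be updated; this is the read-tracking path the issue is about.
  static final Cache<String, String> recencyBased = CacheBuilder.newBuilder()
      .expireAfterAccess(10, TimeUnit.MINUTES)   // illustrative value
      .build();

  // Write-based (age) expiration: entries age out from their write time, so reads
  // don't need to be tracked for expiration purposes.
  static final Cache<String, String> ageBased = CacheBuilder.newBuilder()
      .expireAfterWrite(10, TimeUnit.MINUTES)    // illustrative value
      .build();
}
```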
At this point we recommend Caffeine, and we're unlikely to make even bugfixes to the cache unless there's a common issue :(
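For anyone migrating as suggested, a roughly equivalent Caffeine cache (Java 8+) is sketched below; the values are illustrative and the API is a near drop-in replacement for Guava's.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.TimeUnit;

public class CaffeineMigration {
  public static void main(String[] args) {
    // Caffeine records reads in bounded, lossy buffers, so heavy read traffic
    // cannot cause the kind of unbounded queue growth described in this issue.
    Cache<String, String> cache = Caffeine.newBuilder()
        .maximumSize(10_000)                      // illustrative value
        .expireAfterAccess(10, TimeUnit.MINUTES)  // illustrative value
        .build();

    cache.put("key", "value");
    System.out.println(cache.getIfPresent("key"));
  }
}
```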
Similar to the issue described a while ago here [1], I'm seeing the same problem where the recencyQueue is 223M entries deep. My understanding is that the recencyQueue is drained on eviction, expiry, write, or every 64 reads from the cache. The specific cache is mainly read-dominant. The cache is set up as follows.
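(The original configuration snippet didn't survive here; the following is a placeholder reconstruction consistent with the description of a read-dominant, recency-tracked cache, with made-up names and values.)

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

class PropertyCacheHolder {
  // Placeholder reconstruction: either maximumSize or expireAfterAccess is enough
  // to activate per-read recency tracking in LocalCache.
  static final Cache<String, Object> propertyCache = CacheBuilder.newBuilder()
      .maximumSize(100_000)                       // placeholder value
      .expireAfterAccess(30, TimeUnit.MINUTES)    // placeholder value
      .build();
}
```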
The extract from MAT shows the following: 223M entries strongly referenced by one segment of the LocalCache$Segment[4] array.
I haven't been able to reproduce the problem locally, and it happens very infrequently in production. Any ideas on how/why this is happening?
    (columns: retained heap in bytes, shallow heap in bytes)
    propertyCache  com.google.common.cache.LocalCache$LocalManualCache   5356412192   16
      localCache   com.google.common.cache.LocalCache                    5356412176  128
        segments   com.google.common.cache.LocalCache$Segment[4]         5356412048   32
          [1]      com.google.common.cache.LocalCache$Segment            5356400000   80
            map         com.google.common.cache.LocalCache               5356412176  128
              segments  com.google.common.cache.LocalCache$Segment[4]    5356412048   32
                [1]     com.google.common.cache.LocalCache$Segment       5356400000   80
                  map        com.google.common.cache.LocalCache          5356412176  128
                    segments com.google.common.cache.LocalCache$Segment[4] 5356412048 32
                      [1]    com.google.common.cache.LocalCache$Segment  5356400000   80