-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: avoid a possible max call stack exceeded error #2929
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a review to unblock running the Jenkins CI tests.
(Do I need to "approve" it for tests to run? Testing this. :)
@Marsup Thanks! I'll take a look on Monday. Of course a repro and test case would be great, but I can understand if that's not easy/possible here. |
I've tried to isolate it, but couldn't. The bug usually happens on our side after several days of async iteration, even if there was a reproducible scenario, I'm uncertain if it's achievable in a reasonable amount of time. |
26b07e6
to
33dcf9e
Compare
❕ Build Aborted
Expand to view the summary
Build stats
Steps errors
Expand to view the steps failures
|
a questionably useful reproFWIW, I am able to reproduce a similar /* eslint-disable */
const apm = require('./').start({
captureExceptions: false,
logUncaughtExceptions: true,
transactionMaxSpans: 20000
})
apm.startTransaction('myTrans')
const depth = 10000
let i
for (i = 0; i < depth; i++) {
apm.startSpan(`s${i}`)
}
for (i = 0; i < depth; i++) {
apm.currentSpan.end()
}
apm.endTransaction() Running that with node v14.19.3 yields:
My line number in timers.js differs because I've made tweaks. I have some
Aside: one can play with a smaller v8/node stack size to trigger this more easily:
What could be the cause?Reading the code, I believe this can only be one of:
The former would be a bug. It should be impossible because you cannot re-parent. The The latter seems possible. My reproduction above shows a manual (and synchronous, because I was being lazy) case where a Span parent-child tree that is deeper than the v8 max stack size allows can trigger this.
Yes, that seems plausible. Is your code explicitly creating spans with the APM Agent's API (i.e. Possible workaroundI stopped trying to find a repro for the "very deep span tree" because I think there is a simpler workaround for the RangeError: diff --git a/lib/instrumentation/timer.js b/lib/instrumentation/timer.js
index 8e134d6f..be722935 100644
--- a/lib/instrumentation/timer.js
+++ b/lib/instrumentation/timer.js
@@ -8,17 +8,29 @@
var microtime = require('relative-microtime')
+/**
+ * Return `time`, if given and valid, or the current time (calculated using
+ * the given `timer`).
+ *
+ * @param {Timer} timer
+ * @param {Number} [time] - An optional float number of milliseconds since
+ * the epoch (i.e. the same as provided by `Date.now()`).
+ * @returns {Number} A time in *microseconds* since the Epoch.
+ */
function maybeTime (timer, time) {
- if (timer._parent) return maybeTime(timer._parent, time)
- return time >= 0 ? time * 1000 : timer._timer()
+ if (time >= 0) {
+ return time * 1000
+ } else {
+ return timer._timer()
+ }
}
module.exports = class Timer {
// `startTime`: millisecond float
- constructor (timer, startTime) {
- this._parent = timer
- this._timer = timer ? timer._timer : microtime()
- this.start = maybeTime(this, startTime) // microsecond integer
+ constructor (parentTimer, startTime) {
+ this._parent = parentTimer
+ this._timer = parentTimer ? parentTimer._timer : microtime()
+ this.start = maybeTime(this, startTime) // microsecond integer (note: might not be an integer)
this.endTimestamp = null
this.duration = null // millisecond float
this.selfTime = null // millisecond float When a That this code re-uses the I'll get a PR going and testing with this. If it passes tests, then I think we can merge it. ConclusionsI still dislike that we don't fully understand what is happening here. Is there really an accidental internal tree of |
…in maybeTime(...) This avoids a possible 'RangeError: Maximum call stack size exceeded' as described in #2929. The case the user hit is in `maybeTime(...)` used in Span timer handling by dropping the recursion. A `Timer` instance's `_timer` is always the same object as its parent's `_timer` if it has a parent, so there is no need to walk the chain. Refs: #2929
Our scenario is a very basic scheduler that could be summed up like this: async function () {
while (true) {
let transaction;
try {
transaction = apm.startTransaction('name);
await doStuff(); // That call will create some spans along the way
if (transaction) transaction.end('success');
} catch (e) {
if (transaction) transaction.end('error');
}
await setTimeout(nextTurn);
}
} It's launched in the context of the setup of a hapi plugin if that changes anything. Your solution seems clean indeed, and I agree with your conclusion that it's frustrating not to find the origin. I can't think of a case where we would be accidentally creating deeply nested trees voluntarily though, which is why we thought of #2611, our tasks are performed without any recursivity, just standard loops, p-map, and occasional async iterators. |
Thank you very much for this! This helped me get a repro that might match your case. possibly matching repro/* eslint-disable */
// repro-2929.js
const apm = require('./').start({
captureExceptions: false,
logUncaughtExceptions: true
})
const { setTimeout } = require('timers/promises')
// Do some stuff. Create some spans.
async function doStuff() {
// await Promise.resolve()
const s1 = apm.startSpan('s1')
await setTimeout(1)
const s2 = apm.startSpan('s2')
await setTimeout(1)
if (s2) s2.end()
if (s1) s1.end()
}
async function setupBackgroundTask () {
// await Promise.resolve()
let n = 0
const nextTurn = 1000
while (true) {
let transaction;
try {
transaction = apm.startTransaction(`doStuff-${++n}`);
await doStuff(); // That call will create some spans along the way
if (transaction) transaction.end('success');
} catch (e) {
if (transaction) transaction.end('error');
}
await setTimeout(nextTurn);
}
}
async function main() {
setupBackgroundTask()
console.log('current run context after setupBackgroundTask: %s', apm._instrumentation._runCtxMgr)
// Do some periodic thing that creates some spans. Normally these spans
// would not be created because there is no current transaction. However,
// because of https://github.com/elastic/apm-agent-nodejs/issues/2611
// we have a current transaction from `setupBackgroundTask()` (even if ended).
let i = 0
setInterval(async function () {
i++
await doStuff()
// Periodically take a look to see if there is a current "RunContext" with
// a deep span stack.
if (i === 100) {
console.log('current run context after 100 iterations: %s', apm._instrumentation._runCtxMgr)
}
if (i % 100 === 0) {
console.log('current span stack depth:', apm._instrumentation._runCtxMgr.active()._spans.length)
}
}, 5)
}
main() An example run of that:
I built upon your scheduler (naming it This line: console.log('current run context after setupBackgroundTask: %s', apm._instrumentation._runCtxMgr) prints an internal object that summarizes the "current run context" which holds the current transaction and stack of current spans. We can see in the example run that -- due to #2611 -- after calling Any async task that is created at this point in the code -- for example the Once it gets deep enough we crash on the Experiments for you@Marsup If you are able, there are a couple experiments you could do on your code that might help show if this is indeed the type of thing you are hitting.
Fixing thisFixing that Some thoughts on possible fixes/workarounds: First, I think the best thing would be to add the ".withTransaction()" and Second, we could consider emitting a warning for the user if it looks like we Note that this isn't perfect. We could still be leaking Spans via the Third, we could/should have a troubleshooting doc landing anchor for this |
@Marsup One thing I forgot to mention. After doing these experiments (if you are able), ultimately I think you'll want to add the following workaround in your |
I indeed do notice a transaction after that call, it's marked as
It's kind of hard, the leak is happening inside a hapi plugin registration, so the next call with a span creation is hard to predict. I've put your setInterval function right after that leak and the span stack depth is not growing at all, it's a constant 0. |
I'll get the maybeTime fix/workaround (#2939) in soon. I've opened #2983 to consider adding |
…in maybeTime(...) (#2939) This avoids a possible 'RangeError: Maximum call stack size exceeded' as described in #2929. The case the user hit is in `maybeTime(...)` used in Span timer handling by dropping the recursion. A `Timer` instance's `_timer` is always(*) the same object as its parent's `_timer` if it has a parent, so there is no need to walk the chain. (*): Assuming the internal `timer` option to Span/Transaction creation isn't manually used. This change adds a deprecation warning if that is the case. Refs: #2929
[email protected] is release with the |
Thanks a lot! Do we need to keep this one open then? |
I would like some issue to track adding one or more of the warnings and troubleshooting docs mentioned in "Fixing this" above. I'll open a separate issue for that now. ...time passes... I've opened #2995 for that. We can close this one now. Thank very much @Marsup. |
…in maybeTime(...) (elastic#2939) This avoids a possible 'RangeError: Maximum call stack size exceeded' as described in elastic#2929. The case the user hit is in `maybeTime(...)` used in Span timer handling by dropping the recursion. A `Timer` instance's `_timer` is always(*) the same object as its parent's `_timer` if it has a parent, so there is no need to walk the chain. (*): Assuming the internal `timer` option to Span/Transaction creation isn't manually used. This change adds a deprecation warning if that is the case. Refs: elastic#2929
Sometimes, our production servers hit the following error:
We don't yet have an explanation, but we strongly suspect that #2611 is involved because it mostly happens on long-running async/await chains.
Until it's fixed, I think this function can be rewritten iteratively instead of recursively. Also, it doesn't appear to need to look up for parents if
time
is provided, as it's going to be constant all the way.Do you feel this change needs tests?
Checklist