Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: avoid a possible max call stack exceeded error #2929

Closed
wants to merge 1 commit into from

Conversation

Marsup
Copy link
Contributor

@Marsup Marsup commented Sep 16, 2022

Sometimes, our production servers hit the following error:

RangeError: Maximum call stack size exceeded
    at maybeTime (/home/runner/application/node_modules/.pnpm/[email protected]/node_modules/elastic-apm-node/lib/instrumentation/timer.js:11:20)
    at maybeTime (/home/runner/application/node_modules/.pnpm/[email protected]/node_modules/elastic-apm-node/lib/instrumentation/timer.js:12:29)
    at maybeTime (/home/runner/application/node_modules/.pnpm/[email protected]/node_modules/elastic-apm-node/lib/instrumentation/timer.js:12:29)
    at maybeTime (/home/runner/application/node_modules/.pnpm/[email protected]/node_modules/elastic-apm-node/lib/instrumentation/timer.js:12:29)
    at maybeTime (/home/runner/application/node_modules/.pnpm/[email protected]/node_modules/elastic-apm-node/lib/instrumentation/timer.js:12:29)
...

We don't yet have an explanation, but we strongly suspect that #2611 is involved because it mostly happens on long-running async/await chains.

Until it's fixed, I think this function can be rewritten iteratively instead of recursively. Also, it doesn't appear to need to look up for parents if time is provided, as it's going to be constant all the way.

Do you feel this change needs tests?

Checklist

  • Implement code
  • Add tests
  • Update TypeScript typings
  • Update documentation
  • Add CHANGELOG.asciidoc entry
  • Commit message follows commit guidelines

@github-actions github-actions bot added agent-nodejs Make available for APM Agents project planning. community triage labels Sep 16, 2022
@trentm trentm self-requested a review September 16, 2022 17:27
Copy link
Member

@trentm trentm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a review to unblock running the Jenkins CI tests.
(Do I need to "approve" it for tests to run? Testing this. :)

@trentm
Copy link
Member

trentm commented Sep 16, 2022

@Marsup Thanks! I'll take a look on Monday.

Of course a repro and test case would be great, but I can understand if that's not easy/possible here.

@Marsup
Copy link
Contributor Author

Marsup commented Sep 16, 2022

I've tried to isolate it, but couldn't. The bug usually happens on our side after several days of async iteration, even if there was a reproducible scenario, I'm uncertain if it's achievable in a reasonable amount of time.

@apmmachine
Copy link
Contributor

❕ Build Aborted

The PR is not allowed to run in the CI yet

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Start Time: 2022-09-20T09:31:43.249+0000

  • Duration: 4 min 15 sec

Steps errors 2

Expand to view the steps failures

Load a resource file from a library
  • Took 0 min 0 sec . View more details here
  • Description: approval-list/elastic/apm-agent-nodejs.yml
Error signal
  • Took 0 min 0 sec . View more details here
  • Description: githubApiCall: The REST API call https://api.github.com/orgs/elastic/members/Marsup return the message : java.lang.Exception: httpRequest: Failure connecting to the service https://api.github.com/orgs/elastic/members/Marsup : httpRequest: Failure connecting to the service https://api.github.com/orgs/elastic/members/Marsup : Code: 404Error: {"message":"User does not exist or is not a member of the organization","documentation_url":"https://docs.github.com/rest/reference/orgs#check-organization-membership-for-a-user"}

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • run module tests for <modules> : Run TAV tests for one or more modules, where <modules> can be either a comma separated list of modules (e.g. memcached,redis) or the string literal ALL to test all modules

  • run benchmark tests : Run the benchmark test only.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@trentm
Copy link
Member

trentm commented Sep 20, 2022

a questionably useful repro

FWIW, I am able to reproduce a similar RangeError: Maximum call stack size exceeded with this script:

/* eslint-disable */
const apm = require('./').start({
  captureExceptions: false,
  logUncaughtExceptions: true,
  transactionMaxSpans: 20000
})

apm.startTransaction('myTrans')

const depth = 10000
let i
for (i = 0; i < depth; i++) {
  apm.startSpan(`s${i}`)
}
for (i = 0; i < depth; i++) {
  apm.currentSpan.end()
}

apm.endTransaction()

Running that with node v14.19.3 yields:

...
/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26
    return maybeTime(timer._parent, time, depth)
           ^

RangeError: Maximum call stack size exceeded
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:26:12)

My line number in timers.js differs because I've made tweaks. I have some console.log(...)s that seem to make a difference on whether a RangeError is actually thrown, so I don't claim to have a handle on the v8 inner workings here.

  1. I don't think this is yet a useful test case.
  2. This isn't necessarily the case you are getting.

Aside: one can play with a smaller v8/node stack size to trigger this more easily:

$ node --v8-options
...
  --stack-size (default size of stack region v8 is allowed to use (in kBytes))
        type: int  default: 984
...

What could be the cause?

Reading the code, I believe this can only be one of:

  1. a circular timer._parent loop
  2. a timer._parent chain that actually is huge (exceeding max stack depth)

The former would be a bug. It should be impossible because you cannot re-parent. The _parent is determined at Span creation time; there is no way to go back and reset that Timer._parent to make a loop.

The latter seems possible. My reproduction above shows a manual (and synchronous, because I was being lazy) case where a Span parent-child tree that is deeper than the v8 max stack size allows can trigger this.

we strongly suspect that #2611 is involved because it mostly happens on long-running async/await chains

Yes, that seems plausible. Is your code explicitly creating spans with the APM Agent's API (i.e. apm.startSpan(...))? Or is this happening just with automatic instrumentation?

Possible workaround

I stopped trying to find a repro for the "very deep span tree" because I think there is a simpler workaround for the RangeError:

diff --git a/lib/instrumentation/timer.js b/lib/instrumentation/timer.js
index 8e134d6f..be722935 100644
--- a/lib/instrumentation/timer.js
+++ b/lib/instrumentation/timer.js
@@ -8,17 +8,29 @@

 var microtime = require('relative-microtime')

+/**
+ * Return `time`, if given and valid, or the current time (calculated using
+ * the given `timer`).
+ *
+ * @param {Timer} timer
+ * @param {Number} [time] - An optional float number of milliseconds since
+ *    the epoch (i.e. the same as provided by `Date.now()`).
+ * @returns {Number} A time in *microseconds* since the Epoch.
+ */
 function maybeTime (timer, time) {
-  if (timer._parent) return maybeTime(timer._parent, time)
-  return time >= 0 ? time * 1000 : timer._timer()
+  if (time >= 0) {
+    return time * 1000
+  } else {
+    return timer._timer()
+  }
 }

 module.exports = class Timer {
   // `startTime`: millisecond float
-  constructor (timer, startTime) {
-    this._parent = timer
-    this._timer = timer ? timer._timer : microtime()
-    this.start = maybeTime(this, startTime) // microsecond integer
+  constructor (parentTimer, startTime) {
+    this._parent = parentTimer
+    this._timer = parentTimer ? parentTimer._timer : microtime()
+    this.start = maybeTime(this, startTime) // microsecond integer (note: might not be an integer)
     this.endTimestamp = null
     this.duration = null // millisecond float
     this.selfTime = null // millisecond float

When a Timer instance is created with a _parent, then it always re-uses the parent's _timer. So, unless I'm really missing something, there is no need to walk the timer._parent chain to the root when that root Timer instance has the same _timer (a microtime) as the leaf Timer -- they are the same object. So we can just use it directly.

That this code re-uses the _timer attribute name for objects of different types (a Timer in some cases, a microtime in others) made this code confusing to read.

I'll get a PR going and testing with this. If it passes tests, then I think we can merge it.

Conclusions

I still dislike that we don't fully understand what is happening here. Is there really an accidental internal tree of Spans in your program that is so deep that this recursive maybeTime blows the stack?

trentm added a commit that referenced this pull request Sep 20, 2022
…in maybeTime(...)

This avoids a possible 'RangeError: Maximum call stack size exceeded' as
described in #2929.

The case the user hit is in `maybeTime(...)` used in Span timer handling
by dropping the recursion.  A `Timer` instance's `_timer` is always the
same object as its parent's `_timer` if it has a parent, so there is no
need to walk the chain.

Refs: #2929
@Marsup
Copy link
Contributor Author

Marsup commented Sep 21, 2022

Our scenario is a very basic scheduler that could be summed up like this:

async function () {
  while (true) {
    let transaction;
    try {
      transaction = apm.startTransaction('name);
      await doStuff(); // That call will create some spans along the way
      if (transaction) transaction.end('success');
    } catch (e) {
      if (transaction) transaction.end('error');
    }
    await setTimeout(nextTurn);
  }
}

It's launched in the context of the setup of a hapi plugin if that changes anything.

Your solution seems clean indeed, and I agree with your conclusion that it's frustrating not to find the origin.

I can't think of a case where we would be accidentally creating deeply nested trees voluntarily though, which is why we thought of #2611, our tasks are performed without any recursivity, just standard loops, p-map, and occasional async iterators.

@trentm
Copy link
Member

trentm commented Sep 22, 2022

Our scenario is a very basic scheduler that could be summed up like this:

Thank you very much for this! This helped me get a repro that might match your case.

possibly matching repro

/* eslint-disable */
// repro-2929.js
const apm = require('./').start({
  captureExceptions: false,
  logUncaughtExceptions: true
})

const { setTimeout } = require('timers/promises')

// Do some stuff. Create some spans.
async function doStuff() {
  // await Promise.resolve()
  const s1 = apm.startSpan('s1')
  await setTimeout(1)

  const s2 = apm.startSpan('s2')
  await setTimeout(1)
  if (s2) s2.end()

  if (s1) s1.end()
}

async function setupBackgroundTask () {
  // await Promise.resolve()
  let n = 0
  const nextTurn = 1000
  while (true) {
    let transaction;
    try {
      transaction = apm.startTransaction(`doStuff-${++n}`);
      await doStuff(); // That call will create some spans along the way
      if (transaction) transaction.end('success');
    } catch (e) {
      if (transaction) transaction.end('error');
    }
    await setTimeout(nextTurn);
  }
}

async function main() {
  setupBackgroundTask()
  console.log('current run context after setupBackgroundTask: %s', apm._instrumentation._runCtxMgr)

  // Do some periodic thing that creates some spans. Normally these spans
  // would not be created because there is no current transaction. However,
  // because of https://github.com/elastic/apm-agent-nodejs/issues/2611
  // we have a current transaction from `setupBackgroundTask()` (even if ended).
  let i = 0
  setInterval(async function () {
    i++
    await doStuff()

    // Periodically take a look to see if there is a current "RunContext" with
    // a deep span stack.
    if (i === 100) {
      console.log('current run context after 100 iterations: %s', apm._instrumentation._runCtxMgr)
    }
    if (i % 100 === 0) {
      console.log('current span stack depth:', apm._instrumentation._runCtxMgr.active()._spans.length)
    }
  }, 5)
}

main()

An example run of that:

% node --stack-size=80 repro-2929.js
current run context after setupBackgroundTask: AsyncLocalStorageRunContextManager( RunContext<Transaction(fc86e1, 'doStuff-1'), [Span(c19e38, 's1')]> )
current run context after 100 iterations: AsyncLocalStorageRunContextManager( RunContext<Transaction(fc86e1, 'doStuff-1', ended), [Span(c19e38, 's1', ended),Span(9530bf, 's1', ended),Span(fde8ad, 's1', ended),Span(dcf6c0, 's1', ended),Span(b7ac41, 's1', ended),Span(8930cd, 's1', ended),Span(d38e2d, 's1', ended),Span(e2028c, 's1', ended),Span(670ea6, 's1', ended),Span(893717, 's1', ended),Span(b739de, 's1', ended),Span(02576b, 's1', ended),Span(7af829, 's1', ended),Span(5e3a43, 's1', ended),Span(34556b, 's1', ended),Span(738e2a, 's1', ended),Span(85e60f, 's1', ended),Span(1124b1, 's1', ended),Span(963c82, 's1', ended),Span(9d38d3, 's1', ended),Span(3351ec, 's1', ended),Span(43e67b, 's1', ended),Span(1e43da, 's1', ended),Span(8362e9, 's1', ended),Span(c2b13c, 's1', ended),Span(3d0c96, 's1', ended),Span(dd235f, 's1', ended),Span(77ad1e, 's1', ended),Span(40b032, 's1', ended),Span(35ddc8, 's1', ended),Span(d27f33, 's1', ended),Span(fcb267, 's1', ended),Span(4728fd, 's1', ended),Span(19781b, 's1', ended),Span(7dbfd5, 's1', ended),Span(5f6143, 's1', ended),Span(051151, 's1', ended),Span(35ef36, 's1', ended),Span(88a32b, 's1', ended),Span(bbd87e, 's1', ended),Span(851842, 's1', ended),Span(6cbedd, 's1', ended),Span(89b207, 's1', ended),Span(817cb0, 's1', ended),Span(bb9194, 's1', ended),Span(c35ba9, 's1', ended),Span(2d55e1, 's1', ended),Span(7dc085, 's1', ended),Span(210864, 's1', ended),Span(697bc5, 's1', ended),Span(d087bf, 's1', ended),Span(e31933, 's1', ended),Span(866df5, 's1', ended),Span(04f3f0, 's1', ended),Span(735171, 's1', ended),Span(768f65, 's1', ended),Span(4a603a, 's1', ended),Span(fd68b6, 's1', ended),Span(a4a8a3, 's1', ended),Span(3e0339, 's1', ended),Span(2494c1, 's1', ended),Span(8c5066, 's1', ended),Span(30589c, 's1', ended),Span(3a119d, 's1', ended),Span(6083b5, 's1', ended),Span(76bcbb, 's1', ended),Span(605644, 's1', ended),Span(986d4e, 's1', ended),Span(f13b8d, 's1', ended),Span(11dea1, 's1', ended),Span(06ba43, 's1', ended),Span(56ea28, 's1', ended),Span(ae4870, 's1', ended),Span(00da63, 's1', ended),Span(30b0a8, 's1', ended),Span(2b69c1, 's1', ended),Span(7d439e, 's1', ended),Span(7789f6, 's1', ended),Span(cfee59, 's1', ended),Span(669e41, 's1', ended),Span(da5383, 's1', ended),Span(c716f6, 's1', ended),Span(6f645d, 's1', ended),Span(59e02a, 's1', ended),Span(36c3b8, 's1', ended),Span(a2b152, 's1', ended),Span(782383, 's1', ended),Span(140467, 's1', ended),Span(d08ea5, 's1', ended),Span(ff6988, 's1', ended),Span(95af59, 's1', ended),Span(fa7608, 's1', ended),Span(bfc12c, 's1', ended),Span(875af1, 's1', ended),Span(c05010, 's1', ended),Span(f503cc, 's1', ended),Span(bcc09c, 's1', ended),Span(467b65, 's1', ended),Span(bf9aa0, 's1', ended),Span(b1992a, 's1', ended),Span(d66816, 's1', ended)]> )
current span stack depth: 101
current span stack depth: 201
current span stack depth: 301
current span stack depth: 401
current span stack depth: 501
current span stack depth: 601
current span stack depth: 701
current span stack depth: 801
current span stack depth: 901
current span stack depth: 1001
current span stack depth: 1101
current span stack depth: 1201
current span stack depth: 1301
/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:28
function maybeTime (timer, time) {
                   ^

RangeError: Maximum call stack size exceeded
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:28:20)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:29:29)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:29:29)
...
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:29:29)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:29:29)
    at maybeTime (/Users/trentm/el/apm-agent-nodejs4/lib/instrumentation/timer.js:29:29)

I built upon your scheduler (naming it function setupBackgroundTask). I set it up to have an incrementing transaction name doStuff-${n} -- normally a bad idea because high cardinality, but it helps us see the problematic behaviour here.

This line:

  console.log('current run context after setupBackgroundTask: %s', apm._instrumentation._runCtxMgr)

prints an internal object that summarizes the "current run context" which holds the current transaction and stack of current spans. We can see in the example run that -- due to #2611 -- after calling setupBackgroundTask we unintentionally have a current transaction (Transaction(fc86e1, 'doStuff-1')).

Any async task that is created at this point in the code -- for example the setInterval -- will inherit that current transaction. If that async task creates spans, then those will get added to this transaction. The await doStuff() inside the setInterval again suffers from #2611 and results in repeatedly appending a new Span('s1') to the current run context. I added some console.logs to periodically show how deep that stack is getting.

Once it gets deep enough we crash on the maybeTime recursion issue.

Experiments for you

@Marsup If you are able, there are a couple experiments you could do on your code that might help show if this is indeed the type of thing you are hitting.

  1. Put a line like console.log('current run context after setupBackgroundTask: %s', apm._instrumentation._runCtxMgr); just after the equivalent of your setupBackgroundTask();. Does that show a current transaction "leaking" out from the scheduler function?
  2. If so, can you identify one or more async tasks created synchronously after that scheduler function -- the equivalent of the setInterval in my repro -- that might be creating spans without their own transaction? You might most easily find them by adding console.log('current span stack depth:', apm._instrumentation._runCtxMgr.active()._spans.length); after the first await someFunction() in any of those async tasks that are inheriting the leaked transaction.

Fixing this

Fixing that maybeTime recursion is good. However, it doesn't solve your issue (assuming this repro matches your issue). This case is leaking Spans -- which could be a long term memory leak issue -- and a deep current span stack leads to poor CPU performance in the agent (inside its RunContext.js).

Some thoughts on possible fixes/workarounds:

First, I think the best thing would be to add the ".withTransaction()" and
".withSpan()" APIs to our agent API as suggested in
#2611

Second, we could consider emitting a warning for the user if it looks like we
are hitting a pathological situation like this. E.g. in RunContext.enterSpan()
there is a huge perf cost if frequently entering/leaving a very deep stack
(copying the _spans array). A heuristic for this warning could be (a) _spans
is deep and (b) the transaction ended long ago.

Note that this isn't perfect. We could still be leaking Spans via the
span._timer._parent linked list, even if we don't have a single deep
RunContext stack. An alternate heuristic warning could be on
Transaction.createSpan: if _buildSpans is large and transaction ended long ago,
then likely in this case.

Third, we could/should have a troubleshooting doc landing anchor for this
situation. It discuss the issue somewhat and perhaps defers to a longer
discussion on an issue (perhaps on #2611?).

@trentm
Copy link
Member

trentm commented Sep 23, 2022

Experiments for you

...

@Marsup One thing I forgot to mention. After doing these experiments (if you are able), ultimately I think you'll want to add the following workaround in your async function setupBackgroundTask() equivalent: add a await Promise.resolve(); at the start of the function, or at least before any apm.startTransaction(). This will ensure that the async function yields an async context that does not include the APM transaction. No leaked transaction, no accidental accumulation of spans.

@trentm trentm removed the triage label Sep 23, 2022
@trentm trentm self-assigned this Sep 23, 2022
@Marsup
Copy link
Contributor Author

Marsup commented Sep 27, 2022

I indeed do notice a transaction after that call, it's marked as ended, there are also spans sometimes (all ended as well), probably depending on timing. I have added the await Promise.resolve() both at the start of the function and before any call to startTransaction, I can't say I'm seeing a difference in the logs.

If so, can you identify one or more async tasks created synchronously after that scheduler function

It's kind of hard, the leak is happening inside a hapi plugin registration, so the next call with a span creation is hard to predict.

I've put your setInterval function right after that leak and the span stack depth is not growing at all, it's a constant 0.

@trentm
Copy link
Member

trentm commented Oct 13, 2022

I'll get the maybeTime fix/workaround (#2939) in soon. I've opened #2983 to consider adding withSpan/withTransaction() APIs soonish. I'll keep this PR open for now (or I can add a separate issue if anyone prefers) to track adding one or more of the warnings and troubleshooting docs mentioned in "Fixing this" above.

trentm added a commit that referenced this pull request Oct 17, 2022
…in maybeTime(...) (#2939)

This avoids a possible 'RangeError: Maximum call stack size exceeded' as
described in #2929.

The case the user hit is in `maybeTime(...)` used in Span timer handling
by dropping the recursion.  A `Timer` instance's `_timer` is always(*) the
same object as its parent's `_timer` if it has a parent, so there is no
need to walk the chain.

(*): Assuming the internal `timer` option to Span/Transaction creation
  isn't manually used. This change adds a deprecation warning if that
  is the case.

Refs: #2929
@trentm
Copy link
Member

trentm commented Oct 19, 2022

[email protected] is release with the maybeTime recursion fix now.

@Marsup
Copy link
Contributor Author

Marsup commented Oct 19, 2022

Thanks a lot! Do we need to keep this one open then?

@trentm
Copy link
Member

trentm commented Oct 19, 2022

Do we need to keep this one open then?

I would like some issue to track adding one or more of the warnings and troubleshooting docs mentioned in "Fixing this" above. I'll open a separate issue for that now.

...time passes...

I've opened #2995 for that. We can close this one now. Thank very much @Marsup.

@trentm trentm closed this Oct 19, 2022
@Marsup Marsup deleted the fix/max-call-stack branch October 19, 2022 16:04
PeterEinberger pushed a commit to fpm-git/apm-agent-nodejs that referenced this pull request Aug 20, 2024
…in maybeTime(...) (elastic#2939)

This avoids a possible 'RangeError: Maximum call stack size exceeded' as
described in elastic#2929.

The case the user hit is in `maybeTime(...)` used in Span timer handling
by dropping the recursion.  A `Timer` instance's `_timer` is always(*) the
same object as its parent's `_timer` if it has a parent, so there is no
need to walk the chain.

(*): Assuming the internal `timer` option to Span/Transaction creation
  isn't manually used. This change adds a deprecation warning if that
  is the case.

Refs: elastic#2929
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agent-nodejs Make available for APM Agents project planning. community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants