
Clear wnb after close to prevent memory leak to Gateways. #5212

Closed
wants to merge 1 commit into from

Conversation

davidzhao

We are seeing a memory leak whenever a gateway connection fails (and reconnects).

The sequence of events goes like this:

  1. the local NATS cluster has a large number of subscriptions (>1M)
  2. that cluster tries to send its subscription interest to a remote region over a link with limited bandwidth
  3. the write times out and the gateway reconnects
  4. memory grows by a few hundred MB with each iteration
  5. the NATS server OOMs

The problem is that even after the gateway reconnects, the buffers from the previous flush attempt are still retained, including writes that were too large and timed out. While the bandwidth limitation persists, each reconnection attempt increases memory usage by hundreds of MB.
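For context, the fix in this one-commit PR amounts to releasing those buffers when the connection closes. Below is a minimal standalone sketch of the idea; only the out.wnb field name comes from this discussion, the surrounding types are simplified stand-ins, and the real server returns each buffer to a pool rather than just dropping it.

```go
package main

import "fmt"

// Simplified stand-ins for the server's types; only the out.wnb field
// name is taken from this PR, the rest is illustrative.
type outbound struct {
	wnb [][]byte // pending write buffers, grown when flushes fail
}

type client struct {
	out outbound
}

// closeConnection sketches the proposed fix: drop every buffer still
// referenced by wnb so a retained client does not keep hundreds of MB
// alive across gateway reconnects. (The actual PR returns the buffers
// to the server's buffer pool.)
func (c *client) closeConnection() {
	for i := range c.out.wnb {
		c.out.wnb[i] = nil
	}
	c.out.wnb = nil
}

func main() {
	c := &client{}
	// Simulate a failed flush leaving a large buffer behind.
	c.out.wnb = append(c.out.wnb, make([]byte, 1<<20))
	c.closeConnection()
	fmt.Println(c.out.wnb == nil) // true
}
```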

Signed-off-by: [email protected]

@davidzhao davidzhao requested a review from a team as a code owner March 13, 2024 06:53
@derekcollison derekcollison requested a review from kozlovic March 13, 2024 10:46
@kozlovic (Member) left a comment

LGTM. Note, I would not consider this a memory leak, but if it is, then we should get to the bottom of what is holding a reference to this connection. I know that we keep a rotation of closed connections, but that should not play a part in keeping a reference to this buffer. Also, this is not specific to gateways; it affects gateways in your setup because the traffic goes over a poor link, but it would apply to any other type of connection as well.
Lastly, are you using GOMEMLIMIT? If not, maybe that is why your server OOM'ed (again, assuming this is not a memory leak but just delayed GC).

@davidzhao (Author)

we should get to the bottom of what is holding a reference to this connection. I know that we keep a rotation of closed connections, but that should not play a part in keeping a reference to this buffer.

It looks to me like the connection goes into reconnect when timeout errors are encountered. Is it supposed to be closed entirely so that a new one opens up (when the destination is a gateway)?

@neilalexander (Member)

Need to be especially careful with wnb as it is used outside of the lock during writev calls, so if there is a writev in progress, another goroutine setting wnb = nil will create a data race. Are you sure the flusher is not running at this point?

@kozlovic (Member)

@davidzhao The reconnect() call will attempt to re-create a connection, depending on the type of connection we are dealing with. But in the end, it is a brand new object, so the current connection *client should no longer be referenced.

@neilalexander As to the use of wnb without the lock, you are right. However, flushAndClose() is only invoked from the writeLoop when it detects that the connection is marked "closed", unless the writeLoop was never started. So I think it should be safe.

@kozlovic (Member)

@derekcollison Let me experiment a bit more before we merge. I want to make sure that @neilalexander's concerns are addressed.

@kozlovic (Member)

@davidzhao We should backtrack a bit. I had assumed that you found from profiling that the memory used was because of wnb, but is that the case? Since you say there are over 1M subscriptions, I wonder if the memory increase you see at each reconnect is due to the fact that the server saves "closed connections" in a ring buffer, which includes the list of subscriptions, and that may be the reason for the memory usage.
I am going to check the impact of setting c.out.wnb to nil, etc., to make sure that there is no race as @neilalexander fears, but the point is that making those changes would be pointless if this were not the reason for the memory growth. So if you could go into more detail on the findings that led you to submit this PR, I would appreciate it.

@kozlovic (Member)

@davidzhao Hmm... never mind, after checking it looks like we save closed connections only for client and leaf connections, not other types. Digging more...

@kozlovic (Member) left a comment

Ok, so I have verified that it is safe to access c.out.wnb from here because once a writeLoop is started, server code should always enqueue protocols for the writeLoop to send. The writeLoop is the one invoking flushOutbound and only if it has detected that the connection is closed will it call flushAndClose().
What I am not sure about is why not returning these buffers to the pool causes the memory growth. I mean, should not the GC reclaim those? I am not contesting that they should be returned to the pool (this is the right thing to do), but why is the memory not reclaimed if the connection is no longer referenced?
@neilalexander Is there anything particular about nbPoolGet that explains this?
@derekcollison I am approving this PR but would rather wait for @neilalexander to comment/approve.

@derekcollison (Member)

The GC will not reclaim memory if it does not think it needs to.

So in container-based environments it's critical to set GOMEMLIMIT to ~75% of the memory provisioned in the environment. If the provisioned memory is large, say 32GB, I suggest 16-20GB for the server's GOMEMLIMIT.

@davidzhao (Author)

@kozlovic Here's the profile from our production environment. This instance had high memory use and was close to getting killed.

sfo-nats3-heap

But in the end, it is a brand new object, so the current connection *client should no longer be referenced.

I'm not seeing this path. Tracing the code, reconnect would reuse the same object when connecting to remote gateways.

What I am not sure about is why not returning these buffers to the pool causes the memory growth.

I could be wrong, but to me c.out.wnb is still holding onto these buffers. If the transmission fails and the client object is reused, it'll keep appending to this slice. Clearing it out during close seems to be the sensible thing to do.

@neilalexander (Member)

@kozlovic Yes, pools are emptied on a GC run, but if the GC doesn't think it needs to run yet, it's possible for them to build up in size.

The problem may not be the pooled buffers themselves, though; it may be the length of the c.out.wnb slice itself. In that case wiping it on shutdown is fine, but we may also want to check cap(c.out.wnb) more regularly (perhaps in flushOutbound) to make sure that one spike doesn't mean we hold onto a large underlying array potentially forever.
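The capacity check could look something like the following. The names (maxHeldBufs, shrinkAfterFlush) and the threshold value are illustrative, not the server's, and cap here counts buffer slots in the [][]byte, not bytes:

```go
package main

import "fmt"

// maxHeldBufs is an illustrative threshold on the number of buffer
// slots retained between flushes; the real server would pick its own.
const maxHeldBufs = 1024

// shrinkAfterFlush would run at the end of a flush: if a traffic spike
// left wnb with an oversized backing array, drop it so the GC can
// reclaim it, instead of the connection holding it for its lifetime.
func shrinkAfterFlush(wnb [][]byte) [][]byte {
	if cap(wnb) > maxHeldBufs {
		return nil // next flush allocates a right-sized slice
	}
	return wnb[:0] // keep the modest backing array for reuse
}

func main() {
	spike := make([][]byte, 0, 1<<16)
	fmt.Println(shrinkAfterFlush(spike) == nil) // true: released

	small := make([][]byte, 4, 8)
	fmt.Println(cap(shrinkAfterFlush(small))) // 8: retained for reuse
}
```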

@davidzhao Can you please supply the full memory profile from /allocs?debug=0, since the attached SVG looks to be empty?

@kozlovic (Member)

kozlovic commented Mar 14, 2024

I'm not seeing this path. Tracing the code, reconnect would reuse the same object when connecting to remote gateways.

@davidzhao, no, look at the reconnect code. In the case of an outbound GW connection, we end up invoking this:

srv.startGoRoutine(func() { srv.reconnectGateway(gwCfg) })

which invokes:

func (s *Server) reconnectGateway(cfg *gatewayCfg) {

then:

func (s *Server) solicitGateway(cfg *gatewayCfg, firstConnect bool) {

and finally:

func (s *Server) createGateway(cfg *gatewayCfg, url *url.URL, conn net.Conn) {

where a new connection (*client) is created:

c := &client{srv: s, nc: conn, start: now, last: now, kind: GATEWAY}

Starting at the first link I posted, there is no longer any reference to the original connection *client.

@davidzhao (Author)

davidzhao commented Mar 14, 2024

@davidzhao, no, look at the reconnect code. In the case of an outbound GW connection, we end up invoking this:

Thank you for keeping me honest. That's the thread that I missed.

So the observed "leak" might just be that the GC isn't reclaiming the memory. And I think the PR will still help reduce allocations by returning these buffers to the pool.

@davidzhao (Author)

Update: we've set GOMEMLIMIT and observed that the GC is working much harder. However, it's still not reclaiming the memory held in wnb. Does this mean that something is holding on to the old client object? Updated profiles attached.

updated-profile.zip

@kozlovic (Member) left a comment

Unfortunately, @neilalexander is right that this is "racy". I have seen a test fail with a data race when an internalSendLoop for the system account calls flushClients()->flushOutbound(), which races with accessing wnb from here.

However, I have spent a lot of time figuring out what is holding a reference to the connection. I may have found it, but I need more time to investigate and see if that would prevent the memory usage without doing the cleaning here.

kozlovic added a commit that referenced this pull request Mar 25, 2024
The main issue for router/gateway connections was that they would
be registered with the global account but never removed from there,
which would cause the connection to be retained. The memory leak
was apparent in one user setup with a large number of subscriptions
that were transferred to a slow gateway, causing accumulation of
partial buffers before the connection was dropped.

For gateway connections, we also needed to clean the outbound's
map since it can contain a sublist referencing its own connection.
Same for subscriptions in c.subs.

Another found cause of a retained connection is with leafnodes,
where we use a subscription referencing the leafnode connection
that was registered in a global sublist but never removed.

Also, the connection's c.out.sg (a `*Cond`) needed to be set to `nil`
when the connection is closed. Without that, it seems that the connection
would not be released (at least based on experimentation). We now
make sure that c.out.sg is not nil before sending a signal.

The bottom line is that it looks like having an object reference
itself in some way prevents the GC from releasing the object, even
if the "top level" object is no longer reachable. For instance,
something like `obj.someField = obj` seems to prevent the GC from
releasing `obj`, even if `obj` itself is no longer referenced.

The original issue/PR (#5212) simply cleaned c.out.wnb,
but the way it was done was racy since c.out.wnb is used without
the lock in flushOutbound. Once the retain/release issue is fixed,
cleaning this buffer is not really required (but good practice,
especially if not running with GOMEMLIMIT), so we take care
of cleaning this up, but under the protection of the flushOutbound
flag. If set, flushAndClose() will not do the cleaning; flushOutbound
will do it.

Relates to #5212

Signed-off-by: Ivan Kozlovic <[email protected]>
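The flag handoff described in the commit message can be sketched roughly as follows. The names here (conn, flushing, writev) are hypothetical; in the server the flag lives in c.flags and the lock is the client lock:

```go
package main

import (
	"fmt"
	"sync"
)

// conn is an illustrative stand-in: whichever side is NOT mid-flush
// is the one that cleans wnb.
type conn struct {
	mu       sync.Mutex
	flushing bool // stands in for the flushOutbound flag
	closed   bool
	wnb      [][]byte
}

// writev stands in for the socket write performed outside the lock.
func writev(bufs [][]byte) {}

func (c *conn) flushOutbound() {
	c.mu.Lock()
	c.flushing = true
	bufs := c.wnb
	c.wnb = nil
	c.mu.Unlock()

	writev(bufs) // the racy window: buffers used without the lock held

	c.mu.Lock()
	c.flushing = false
	if !c.closed {
		c.wnb = bufs[:0] // keep the backing array for reuse
	} // if close happened mid-flush, we own the cleanup: leave wnb nil
	c.mu.Unlock()
}

func (c *conn) flushAndClose() {
	c.mu.Lock()
	c.closed = true
	if !c.flushing {
		c.wnb = nil // safe: no concurrent flush is touching the buffers
	} // otherwise flushOutbound cleans up when it finishes
	c.mu.Unlock()
}

func main() {
	c := &conn{wnb: [][]byte{make([]byte, 8)}}
	c.flushAndClose()
	fmt.Println(c.wnb == nil) // true: no flush in progress, cleaned here
}
```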
@kozlovic (Member)

@davidzhao I have submitted PR #5244 to address the issue in a more systemic way. You should be referenced in the release notes should my PR be accepted.

@davidzhao (Author)

@davidzhao I have submitted PR #5244 to address the issue in a more systemic way. You should be referenced in the release notes should my PR be accepted.

That's fantastic news! I'm looking forward to trying it out once it makes it into a release. In the meantime, we have been running a patched build that clears wnb, and it has improved the memory footprint in our case. But you are right that it addresses the symptom and not the root cause.

derekcollison added a commit that referenced this pull request Mar 26, 2024
@kozlovic (Member)

@davidzhao Let's close this PR; the other one has already been merged to main and should be available in nightly builds. If you have a chance to test it out, that would be great.

@kozlovic kozlovic closed this Mar 26, 2024
derekcollison pushed a commit that referenced this pull request Mar 29, 2024
neilalexander pushed a commit that referenced this pull request Apr 4, 2024
neilalexander pushed a commit that referenced this pull request Apr 4, 2024
neilalexander pushed a commit that referenced this pull request Apr 4, 2024