-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIXED] Possible memory leak due to connections not being released #5244
Conversation
The main issue for router/gateway connections were that they would be registered with the global account but never removed from there, which would cause the connection to be retained. The memory leak was apparent in one user setup with large amount of subscriptions that were transfered to a slow gateway, causing accumulation of partial buffers before the connection was dropped. For gateway connections, we also needed to clean the outbound's map since it can contain sublist referencing its own connection. Same for subscriptions in c.subs. Another found cause for retained connection is with leafnode where we are using a subscription referencing the leafnode connection that was globally registered in a global sublist but never removed. Also the connection's c.out.sg (a `*Cond`) needed to be set to `nil` when the connection is closed. Without that, it seems that the connection would not be released (at least based on experimentation). We now make sure that c.out.sg is not nil before sending a signal. The bottom line is that it looks like having an object referencing itself in some way, prevents the GC from releasing the object, even if the "top level" object is no longer reachable. For instance, suppose `obj.someField = obj` seem to prevent the GC from releasing `obj`, even if `obj` itself is no longer referenced. The original issue/PR (#5212) was simply cleaning the c.out.wnb, but the way it was done was racy since c.out.wnb is used without lock in flushOutbound. Once the retain/release issue is fixed, cleaning this buffer is not really required (but good practice especially if not running with GOMEMLIMIT), so we take care of cleaning this up, but under the protection of the flushOutbound flag. If set, flushAndClose() will not do the cleaning, flushOutbound will do it. Relates to #5212 Signed-off-by: Ivan Kozlovic <[email protected]>
} else if c.kind == LEAF { | ||
a.nleafs++ | ||
} | ||
// If we did not add it, we are done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes in accounts.go are simply exiting early if the client was not added/removed, instead of checking for the boolean at every following "if" statements.
@@ -1590,6 +1599,8 @@ func (c *client) flushOutbound() bool { | |||
} | |||
if err != nil { | |||
c.Errorf("Error compressing data: %v", err) | |||
// We need to grab the lock now before marking as closed and exiting | |||
c.mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was missing since we are in a location where the lock was released but we are failing and returning before acquiring the lock after the TCP write. The caller is expected that the function returns with lock held.
@@ -2099,7 +2113,8 @@ func (c *client) processConnect(arg []byte) error { | |||
} | |||
|
|||
// If no account designation. | |||
if c.acc == nil { | |||
// Do this only for CLIENT and LEAF connections. | |||
if c.acc == nil && (c.kind == CLIENT || c.kind == LEAF) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the main issue with the Gateway reported memory leak. The connection was assigned to the global account's client list, but never removed from there. So I believe that we should do it only for those type of connections.
@@ -1162,7 +1162,7 @@ func (c *client) removeRemoteSubs() { | |||
c.mu.Lock() | |||
srv := c.srv | |||
subs := c.subs | |||
c.subs = make(map[string]*subscription) | |||
c.subs = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everywhere in this PR where I set c.subs = nil
, I could, like we were doing here, create an empty map. But I believe that we check for connection not being closed when accessing (at least setting a value to) c.subs.
@derekcollison @neilalexander Here is a simple example that shows why we need to do some of the cleaning I do in this PR for the connection to be released. Without some of the fields set to nil, the connection is seemingly not released (but I can't know for certain since SetFinalizer() says that there is no guarantee that the finalizer will be invoked.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The main issue for router/gateway connections were that they would be registered with the global account but never removed from there, which would cause the connection to be retained. The memory leak was apparent in one user setup with large amount of subscriptions that were transfered to a slow gateway, causing accumulation of partial buffers before the connection was dropped.
For gateway connections, we also needed to clean the outbound's map since it can contain sublist referencing its own connection. Same for subscriptions in c.subs.
Another found cause for retained connection is with leafnode where we are using a subscription referencing the leafnode connection that was globally registered in a global sublist but never removed.
Also the connection's c.out.sg (a
*Cond
) needed to be set tonil
when the connection is closed. Without that, it seems that the connection would not be released (at least based on experimentation). We now make sure that c.out.sg is not nil before sending a signal.The bottom line is that it looks like having an object referencing itself in some way, prevents the GC from releasing the object, even if the "top level" object is no longer reachable. For instance, suppose
obj.someField = obj
seem to prevent the GC from releasingobj
, even ifobj
itself is no longer referenced.The original issue/PR (#5212) was simply cleaning the c.out.wnb, but the way it was done was racy since c.out.wnb is used without lock in flushOutbound. Once the retain/release issue is fixed, cleaning this buffer is not really required (but good practice especially if not running with GOMEMLIMIT), so we take care of cleaning this up, but under the protection of the flushOutbound flag. If set, flushAndClose() will not do the cleaning, flushOutbound will do it.
Relates to #5212
Signed-off-by: Ivan Kozlovic [email protected]