Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED] Possible memory leak due to connections not being released #5244

Merged
merged 1 commit into from
Mar 26, 2024

Conversation

kozlovic
Copy link
Member

The main issue for router/gateway connections were that they would be registered with the global account but never removed from there, which would cause the connection to be retained. The memory leak was apparent in one user setup with large amount of subscriptions that were transfered to a slow gateway, causing accumulation of partial buffers before the connection was dropped.

For gateway connections, we also needed to clean the outbound's map since it can contain sublist referencing its own connection. Same for subscriptions in c.subs.

Another found cause for retained connection is with leafnode where we are using a subscription referencing the leafnode connection that was globally registered in a global sublist but never removed.

Also the connection's c.out.sg (a *Cond) needed to be set to nil when the connection is closed. Without that, it seems that the connection would not be released (at least based on experimentation). We now make sure that c.out.sg is not nil before sending a signal.

The bottom line is that it looks like having an object referencing itself in some way, prevents the GC from releasing the object, even if the "top level" object is no longer reachable. For instance, suppose obj.someField = obj seem to prevent the GC from releasing obj, even if obj itself is no longer referenced.

The original issue/PR (#5212) was simply cleaning the c.out.wnb, but the way it was done was racy since c.out.wnb is used without lock in flushOutbound. Once the retain/release issue is fixed, cleaning this buffer is not really required (but good practice especially if not running with GOMEMLIMIT), so we take care of cleaning this up, but under the protection of the flushOutbound flag. If set, flushAndClose() will not do the cleaning, flushOutbound will do it.

Relates to #5212

Signed-off-by: Ivan Kozlovic [email protected]

The main issue for router/gateway connections were that they would
be registered with the global account but never removed from there,
which would cause the connection to be retained. The memory leak
was apparent in one user setup with large amount of subscriptions
that were transfered to a slow gateway, causing accumulation of
partial buffers before the connection was dropped.

For gateway connections, we also needed to clean the outbound's
map since it can contain sublist referencing its own connection.
Same for subscriptions in c.subs.

Another found cause for retained connection is with leafnode where
we are using a subscription referencing the leafnode connection
that was globally registered in a global sublist but never removed.

Also the connection's c.out.sg (a `*Cond`) needed to be set to `nil`
when the connection is closed. Without that, it seems that the connection
would not be released (at least based on experimentation). We now
make sure that c.out.sg is not nil before sending a signal.

The bottom line is that it looks like having an object referencing
itself in some way, prevents the GC from releasing the object, even
if the "top level" object is no longer reachable. For instance,
suppose `obj.someField = obj` seem to prevent the GC from releasing
`obj`, even if `obj` itself is no longer referenced.

The original issue/PR (#5212) was simply cleaning the c.out.wnb,
but the way it was done was racy since c.out.wnb is used without
lock in flushOutbound. Once the retain/release issue is fixed,
cleaning this buffer is not really required (but good practice
especially if not running with GOMEMLIMIT), so we take care
of cleaning this up, but under the protection of the flushOutbound
flag. If set, flushAndClose() will not do the cleaning, flushOutbound
will do it.

Relates to #5212

Signed-off-by: Ivan Kozlovic <[email protected]>
@kozlovic kozlovic requested a review from a team as a code owner March 26, 2024 00:22
} else if c.kind == LEAF {
a.nleafs++
}
// If we did not add it, we are done
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in accounts.go are simply exiting early if the client was not added/removed, instead of checking for the boolean at every following "if" statements.

@@ -1590,6 +1599,8 @@ func (c *client) flushOutbound() bool {
}
if err != nil {
c.Errorf("Error compressing data: %v", err)
// We need to grab the lock now before marking as closed and exiting
c.mu.Lock()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was missing since we are in a location where the lock was released but we are failing and returning before acquiring the lock after the TCP write. The caller is expected that the function returns with lock held.

@@ -2099,7 +2113,8 @@ func (c *client) processConnect(arg []byte) error {
}

// If no account designation.
if c.acc == nil {
// Do this only for CLIENT and LEAF connections.
if c.acc == nil && (c.kind == CLIENT || c.kind == LEAF) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the main issue with the Gateway reported memory leak. The connection was assigned to the global account's client list, but never removed from there. So I believe that we should do it only for those type of connections.

@@ -1162,7 +1162,7 @@ func (c *client) removeRemoteSubs() {
c.mu.Lock()
srv := c.srv
subs := c.subs
c.subs = make(map[string]*subscription)
c.subs = nil
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everywhere in this PR where I set c.subs = nil, I could, like we were doing here, create an empty map. But I believe that we check for connection not being closed when accessing (at least setting a value to) c.subs.

@kozlovic
Copy link
Member Author

@derekcollison @neilalexander Here is a simple example that shows why we need to do some of the cleaning I do in this PR for the connection to be released. Without some of the fields set to nil, the connection is seemingly not released (but I can't know for certain since SetFinalizer() says that there is no guarantee that the finalizer will be invoked.

func TestConnectionGC(t *testing.T) {
	c := &client{cid: 1, subs: make(map[string]*subscription)}
	c.out.sg = sync.NewCond(&c.mu)
	t.Logf("Client %p created", c)
	c.subs["foo"] = &subscription{subject: []byte("foo"), client: c}

	ch := make(chan struct{}, 1)
	runtime.SetFinalizer(c, func(c *client) {
		t.Logf("Client %p GC'ed", c)
		ch <- struct{}{}
	})

	// With the following commented out, the client would not be GC'ed.

	// The c.out.sg (*Cond) needs to be set to nil for GC to work.
	// c.out.sg = nil

	// The subscription needs to be set to nil
	// c.subs = nil
	//
	// Or at least the reference to itself removed, either by emptying
	// the map:
	// for key := range c.subs {
	// 	delete(c.subs, key)
	// }
	//
	// Or by setting the client to `nil`, but that can't be done if there
	// is a risk that the subscription is being used somewhere else.
	// for _, sub := range c.subs {
	// 	sub.client = nil
	// }
	//

        // Set the client to nil, it won't be enough if the above is not done.
	c = nil
	tk := time.NewTicker(50 * time.Millisecond)
	tm := time.NewTimer(5 * time.Second)
	for {
		select {
		case <-ch:
			return
		case <-tk.C:
			runtime.GC()
		case <-tm.C:
			t.Fatalf("Did not get GC'ed")
		}
	}
}

Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit 9c686cf into main Mar 26, 2024
4 checks passed
@derekcollison derekcollison deleted the connections_leak branch March 26, 2024 00:52
derekcollison added a commit that referenced this pull request Apr 4, 2024
Cherry-pick the following PRs into the v2.10.14 release branch:

* #5237
* #5238
* #5242
* #5208
* #5244
* #5248
* #5246
* #5112
* #5247
* #5256
* #5258
* #5264
* #5266
* #5267
* #5265
* #5271
* #5270
* #5272
* #5276
* #5274
* #5275
* #5277
* #5279

Signed-off-by: Neil Twigg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants