stop reconnect_on_network_failure rescue/retry loop on session#close #149

andreaseger
Contributor

I noticed that our processes sometimes get stuck on shutdown after we call close on the MarchHare::Session.

While debugging, I found that this happens during a network outage: in that case march_hare is "stuck" in a never-ending rescue/retry loop in reconnecting_on_network_failures.

To solve this, I now track whether close was called and, if so, exit the automatic recovery.

Honestly, I'm not entirely sure how to test this automatically. I can manually reproduce the issue and confirm that this change fixes it, but I'm not sure I can turn that into a spec because it more or less involves killing RabbitMQ.
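For readers following along, here is a minimal sketch of the rescue/retry loop described above and of how an "explicitly closed" flag lets it terminate. It is not MarchHare's actual code: `SketchSession`, `NetworkError`, the `connector` lambda and the `:closed` return value are hypothetical stand-ins, and only the method name `reconnecting_on_network_failures` and the `@was_explicitly_closed` flag are taken from this PR.

```ruby
# Hypothetical stand-in for a session; NOT MarchHare's real implementation.
class SketchSession
  class NetworkError < StandardError; end

  attr_reader :attempts

  def initialize(connector)
    @connector = connector # callable that raises NetworkError while the broker is down
    @was_explicitly_closed = false
    @attempts = 0
  end

  def close
    @was_explicitly_closed = true
  end

  # Retries the connector until it succeeds or the session has been closed.
  def reconnecting_on_network_failures(interval_in_s)
    begin
      # Without this guard the loop would retry forever during an outage.
      return :closed if @was_explicitly_closed
      @attempts += 1
      @connector.call
    rescue NetworkError
      sleep(interval_in_s)
      retry
    end
  end
end

session = nil
# Simulated outage: every attempt fails; the third attempt also closes the session.
connector = lambda do
  session.close if session.attempts >= 3
  raise SketchSession::NetworkError, "broker unreachable"
end
session = SketchSession.new(connector)
result = session.reconnecting_on_network_failures(0)
```

In this sketch the loop stops on the first retry after `close` has been called, which is the behavior the change aims for.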

@andreaseger andreaseger force-pushed the end_network_recovery_on_session_close branch from 925db94 to 13b98e9 Compare March 6, 2020 12:23
Contributor Author

@andreaseger andreaseger left a comment

Took another stab at this with the intention of making the logic in the code easier to understand. Also made the behavior change more explicit by raising an exception.

@@ -318,7 +322,8 @@ def disable_automatic_recovery
# Begins automatic connection recovery (typically only used internally
# to recover from network failures)
def automatically_recover
@logger.debug("session: begin automatic connection recovery")
raise ConnectionClosedException if @was_explicitly_closed
Contributor Author

I tried to make this new behavior more obvious by raising an exception if automatically_recover is called after an explicit close.
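A small sketch of what that guard looks like from the caller's side. This is an illustration under assumptions, not the library's code: `GuardedSession` and its `recoveries` counter are hypothetical, and `ConnectionClosedException` is defined locally here as a stand-in for the exception class named in the diff.

```ruby
# Local stand-in for the exception class referenced in the diff.
class ConnectionClosedException < StandardError; end

# Hypothetical session used only to illustrate the guard.
class GuardedSession
  attr_reader :recoveries

  def initialize
    @was_explicitly_closed = false
    @recoveries = 0
  end

  def close
    @was_explicitly_closed = true
  end

  def automatically_recover
    # Refuse to recover a session that the application closed on purpose.
    raise ConnectionClosedException, "session was closed explicitly" if @was_explicitly_closed
    @recoveries += 1
  end
end

s = GuardedSession.new
s.automatically_recover # allowed: the session has not been closed
s.close

outcome =
  begin
    s.automatically_recover
    :recovered
  rescue ConnectionClosedException => e
    e.message
  end
```

Raising instead of silently returning makes a recovery attempt after `close` visible to the caller, at the cost of requiring callers to handle the exception.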

@@ -583,6 +588,7 @@ def converting_rjc_exceptions_to_ruby(&block)
def reconnecting_on_network_failures(interval_in_ms, &fn)
@logger.debug("session: reconnecting_on_network_failures")
begin
return if @was_explicitly_closed
Contributor Author

It somehow seems better to exit the loop here rather than in the rescue, but the effect is basically the same.

@@ -10,7 +10,7 @@

def close_all_connections!
# wait for stats to refresh, make sure to run bin/ci/before_build.sh as well!
- sleep 0.7
+ sleep 1.7
Contributor Author

I had huge issues getting two specs here to work (lines 318 and 333); increasing this made them stable for me.

Member

This script updates management plugin settings so that changes are observable over HTTP API earlier. Was it executed?

Contributor Author

I did run the script, although I had to modify it to work with the RabbitMQ instance I have running inside a podman/docker container. I can revert this and see if it works on Travis; that would be fine for me. I also didn't try many values here: I just added 1 extra second, and then the specs ran stably locally.

Member

It's fine. Let's keep the value that works for you.

@@ -61,6 +71,7 @@
end
c = MarchHare.connect(executor_factory: factory, network_recovery_interval: 0)
c.close
c.instance_variable_set(:@was_explicitly_closed, false)
Contributor Author

To make these tests work, I now override the internal instance variable. Not ideal, for sure. I also don't quite understand the intention behind the close + automatically_recover in these tests. Maybe you can shed some light on why it is done this way.

- calling `automatically_recover` after explicit `close` will fail now
- introduce `reopen` method to enable reconnect after explicit close
- make explicit close handling more obvious on recovery
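The commit notes above can be illustrated with a rough sketch of how a `reopen` method might relate to the explicit-close flag. The PR text only states that `reopen` enables a reconnect after an explicit close; the body shown here (clear the flag, then run the normal recovery path) is an assumption, and `ReopenableSession` with its `open?` helper is hypothetical.

```ruby
# Hypothetical session; the real MarchHare::Session is not reproduced here.
class ReopenableSession
  class ConnectionClosedException < StandardError; end

  def initialize
    @was_explicitly_closed = false
    @open = true
  end

  def open?
    @open
  end

  def close
    @was_explicitly_closed = true
    @open = false
  end

  def automatically_recover
    raise ConnectionClosedException if @was_explicitly_closed
    @open = true
  end

  # Assumed behavior: clear the explicit-close flag, then recover as usual.
  def reopen
    @was_explicitly_closed = false
    automatically_recover
  end
end

session = ReopenableSession.new
session.close

# Recovery after an explicit close is refused.
refused =
  begin
    session.automatically_recover
    false
  rescue ReopenableSession::ConnectionClosedException
    true
  end

# An explicit reopen is the supported way back to an open session.
session.reopen
```

With a method like this, a spec would not need to reset the flag through `instance_variable_set`.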
@andreaseger andreaseger force-pushed the end_network_recovery_on_session_close branch from 13b98e9 to 08574c0 Compare March 6, 2020 21:39
@michaelklishin michaelklishin merged commit b6eb022 into ruby-amqp:master Mar 7, 2020
@michaelklishin
Member

Thank you!

@andreaseger andreaseger deleted the end_network_recovery_on_session_close branch March 7, 2020 20:27
@andreaseger
Contributor Author

Any plans to create a release with this?

@michaelklishin
Member

4.2.0 is out

@andreaseger
Contributor Author

Awesome thanks
