No error if RabbitMQ goes down #76
Comments
@ebuildy that is because we now use the RabbitMQ library's built in auto-reconnect facility which is more robust. We'd be glad to accept a patch that hooks into it and logs these events however! |
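For context, here is a minimal sketch of what hooking the Java client's automatic recovery for logging could look like. This is an illustration against the plain com.rabbitmq.client API with assumed settings, not the plugin's actual code:

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.Recoverable;
import com.rabbitmq.client.RecoveryListener;

public class RecoveryLogging {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        // Automatic connection recovery; retries every 5 seconds by default.
        factory.setAutomaticRecoveryEnabled(true);
        factory.setNetworkRecoveryInterval(5000);

        Connection conn = factory.newConnection();

        // Log when the connection is torn down (close frame, missed heartbeats, I/O error).
        conn.addShutdownListener(cause ->
            System.err.println("RabbitMQ connection shut down: " + cause.getMessage()));

        // Recovering connections implement Recoverable; log recovery events.
        if (conn instanceof Recoverable) {
            ((Recoverable) conn).addRecoveryListener(new RecoveryListener() {
                public void handleRecovery(Recoverable r) {
                    System.err.println("RabbitMQ connection recovered");
                }
                public void handleRecoveryStarted(Recoverable r) {
                    System.err.println("RabbitMQ connection recovery started");
                }
            });
        }
    }
}

A patch along these lines would only need to forward those messages to the plugin's logger instead of stderr.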
You are talking about this: https://www.rabbitmq.com/api-guide.html#recovery ? Thank you, I must first finish some PRs on other logstash plugins and the RabbitMQ Java client (rabbitmq/rabbitmq-java-client#138), then I will take a look at it. |
@andrewvc besides the fact that there are no logs when the rabbitmq server closes the channel, a reconnect isn't happening either, which is why I assume logstash with the rabbitmq input is not usable at the moment, especially in a dockerized environment... any solutions here? We use haproxy in between rabbit and logstash. |
@digitalkaoz why is the rabbitmq server closing the channel? AFAIK if you do that the meaning is you want the client to stop and not restart. Please correct me if I'm wrong here. |
@digitalkaoz that's a different situation, may I add, from one where the server has gone down as a result of an error. |
@ebuildy @digitalkaoz , if you all can share a reproducible test case I'd be glad to try it and find a fix. Just list the steps to reproduce, preferably just using logstash and a local rabbitmq. |
@andrewvc the rabbit and haproxy servers are restarted during deployments (they are just docker containers). Here is the setup: rabbitmq -> 5672(tcp) -> haproxy -> 5672(tcp) -> logstash

input {
  rabbitmq {
    connection_timeout => 1000
    heartbeat => 30
    host => "haproxy"
    subscription_retry_interval_seconds => 30
    queue => "logs"
    user => "guest"
    password => "guest"
  }
}

If we only restart rabbitmq it seems fine, because we gracefully shut down the server:

$ docker exec rabbitmq rabbitmqctl close_connection {$CON_ID} "rabbit shutdown"
$ docker exec rabbitmq rabbitmqctl stop_app

We explicitly close the connection first to force a clean close. But if we restart haproxy, the TCP connection gets cut off ungracefully and the logstash plugin won't notice that the channel is closed (and doesn't go into restart mode) because it never receives an error frame. The Java recovery stuff doesn't seem to work for that. Hope that's enough information. |
I think that's a quite common scenario in a containerized world |
When I restart logstash I get the following error:
|
@digitalkaoz thanks for the very detailed description here! @michaelklishin should the java client's auto connection recovery just handle this? |
It takes time to detect peer unavailability. I don't see how this is a connection recovery issue: recovery cannot start if a client connection is still considered to be alive. |
Also: do not set heartbeats to a value lower than 5 seconds. That is a recipe for false positives. |
We waited a lot but logstash won't recover the connection. I would rather get an error, but neither this plugin, nor march_hare, nor the java-amqp lib seems to notice that |
@michaelklishin thanks for popping in! He's setting the heartbeat to 30 seconds. It's confusing because the connection timeout is in millis, but the heartbeat is in seconds. I think this is correct yes? Looking at the source of the java client it looks like timeout is set in millis and heartbeats in seconds. Is that all correct? |
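To make the units concrete, here is a small sketch of the two settings on the Java client side (an assumption about how the logstash options map down, not taken from the plugin source):

import com.rabbitmq.client.ConnectionFactory;

class HeartbeatUnits {
    static ConnectionFactory configure() {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setConnectionTimeout(1000);  // milliseconds, i.e. connection_timeout => 1000 is 1 second
        factory.setRequestedHeartbeat(30);   // seconds, i.e. heartbeat => 30 is 30 seconds
        return factory;
    }
}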
@andrewvc @michaelklishin I created a tiny docker-compose setup to reproduce this issue: https://github.com/digitalkaoz/rabbit-haproxy-logstash-fail simply follow the |
Heartbeat timeout is in seconds per protocol spec. Also see this thread. |
So you have HAproxy in between. HAproxy has its own timeouts (which heartbeats in theory make irrelevant but still) and what RabbitMQ sees is a TCP connection from HAproxy, not the client. I can do a March Hare release with a preview version of the 3.6.6 Java client, however. |
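For reference, a purely illustrative haproxy TCP-proxy snippet (names and values are assumptions, not taken from this setup) showing where the proxy's own timeouts sit:

listen rabbitmq
    bind *:5672
    mode tcp
    # haproxy applies its own idle timeouts to the proxied TCP stream; if these
    # are shorter than the heartbeat interval, the proxy can cut idle
    # connections on its own, regardless of what client and broker negotiate.
    timeout client  3h
    timeout server  3h
    timeout connect 5s
    server rabbit rabbitmq:5672 check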
Thanks @digitalkaoz , I'll try and repro today. |
Also thanks for the background @michaelklishin . I'm wondering if what's happening is that HAProxy is switching the backend without letting logstash know, and without disconnecting from logstash. Will report back. |
It still fails when I wait, let's say, 5 minutes... @michaelklishin as I understand it, the relevant fix is in https://github.com/rabbitmq/rabbitmq-java-client/releases/tag/v4.0.0.M2 ? |
@andrewvc typically proxies propagate upstream connection loss. At least that's what HAproxy does by default. In fact, if it didn't then the two logical ends of the TCP connection would be out of sync. I doubt it's a Java client problem here but I've updated March Hare to the latest 3.6.x version from source. |
@digitalkaoz I'm sorry but there is no evidence that I have seen that this is a client library issue. A good thing to try would be to eliminate HAproxy from the picture entirely and also verify the effective heartbeat timeout value (management UI displays it on the connection page). |
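One broker-side way to check that effective value (a generic rabbitmqctl invocation, not something prescribed in this thread) is:

$ rabbitmqctl list_connections name peer_host timeout
# "timeout" is the negotiated heartbeat timeout for each connection, in seconds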
@michaelklishin sure, if I remove haproxy from the setup the reconnect works, but that's not a typical setup (at least not at scale). With haproxy in the middle, reconnect fails (even with a heartbeat of |
It should. We don't know why that doesn't happen: that's why I recommended checking the actual effective value. |
After restarting rabbit there is no connection alive on the rabbit node because nobody initiated a new connection, neither haproxy nor logstash, because their connection is still there but de facto useless? |
@michaelklishin, is there a best practice here for what @digitalkaoz wants to accomplish? I think having a layer of indirection via a load balancer makes sense. This is something that's come up multiple times, and I'm wondering if the RabbitMQ project has an official stance here. |
There's no stance as HAproxy is supposed to be transparent to the client. @digitalkaoz HAproxy is not supposed to initiate any connections upstream unless a new client connects to it. Can we please stop guessing and move on to forming hypotheses and verifying them? The first one I have is: the heartbeat value is actually not set to what the reporter thinks it is. I've mentioned above how this can be verified. Without this we can pour hours and hours of time into it and get nothing out of it. My team already spends hours every day on various support questions and we would appreciate it if the reporters cooperated a bit to save us all some time. I also pushed a Java client update to March Hare, even though I doubt it will change anything. A Wireshark/tcpdump capture will provide a lot of information to look at. |
A Wireshark capture should make it easy to see heartbeat frames on the wire: not only their actual presence but also timestamps (they are sent at roughly 1/2 of the effective timeout value). |
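As an example (generic commands, not something prescribed here), one could capture the AMQP port on the logstash host and open the file in Wireshark:

$ tcpdump -i any -s 0 -w amqp.pcap port 5672
# heartbeat frames should then show up roughly every (heartbeat timeout / 2) seconds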
@michaelklishin the heartbeat is set to |
Bumping versions fixed this issue for me @michaelklishin . Thanks for pushing the new version. Opened this PR to bump it: logstash-plugins/logstash-mixin-rabbitmq_connection#32 We need a separate PR to do a better job of logging bad connections. |
Please respond to this ticket if you're still experiencing this bug @digitalkaoz and we can reopen it. You should be able to upgrade by running |
@andrewvc I'm not sure how it fixed it for you?
Same thing: if I kill and start rabbit, logstash still doesn't recover the connection. The last message from logstash is now (thanks to your newest commit ;) )
The important thing is to not restart logstash! |
@digitalkaoz yes, I followed your procedure (using your excellent docker-compose):
That all works for me. Not for you? BTW, I'd recommend upgrading to logstash-input-rabbitmq 5.2.0 |
Still no luck:
What am I doing wrong? My steps:
Logstash still won't catch up :/ |
@digitalkaoz I'm sorry to be so blunt but if you want to get to the bottom of it you need to put in some debugging effort of your own. We have provided several hypotheses, investigated at least two, explained what various log entries mean, what can be altered to identify what application may be introducing the behaviour observed, analyzed at least one Wireshark dump, and based on all the findings produced a new March Hare version. That's enough information to understand how to debug a small distributed system like this one. Asking "what am I doing wrong" is unlikely to be productive. Trying things at random is unlikely to be efficient. |
@digitalkaoz can you update the docker-compose stuff with the latest? One thing that was different in my repro is that I ran everything except logstash in docker. That shouldn't make a difference, but maybe it does. |
Hey @digitalkaoz I'm still glad to help you get to the bottom of this. @michaelklishin thanks so much for your help, I really appreciate it! I'm glad to personally put in some more hours here myself. What would help me move this forward @digitalkaoz is if you could update your docker config to the latest versions so we can be reproing the same things. |
Sorry @andrewvc, I had a busy weekend. I updated everything to the latest versions, but still no luck. The repo is up to date now.
|
In the snippet above both RabbitMQ and the publisher are shut down and started at the same time. Note that these are not the same steps that at least some in this thread performed, and this creates a natural race condition between the two starting. If the producer starts before RabbitMQ then connection recovery WILL NOT BE ENGAGED since there never was a successful connection to recover: the initial TCP connection will fail. |
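To make that distinction concrete, here is a minimal sketch (plain Java client, assumed settings; not the plugin's code) of the application-level retry loop that the very first connection needs, since automatic recovery only takes over after one successful connect:

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class InitialConnectRetry {
    // Automatic recovery handles drops of an established connection,
    // but a failed *initial* connect must be retried by the application.
    static Connection connectWithRetry(ConnectionFactory factory) throws InterruptedException {
        while (true) {
            try {
                return factory.newConnection();
            } catch (Exception e) {
                System.err.println("Initial connect failed, retrying in 5s: " + e);
                Thread.sleep(5000);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setAutomaticRecoveryEnabled(true);
        Connection conn = connectWithRetry(factory);
        System.err.println("Connected, open = " + conn.isOpen());
    }
}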
@michaelklishin the producer waits 10s before doing anything, so rabbit has time to start |
I've reopened this issue now that I've reconfirmed it @digitalkaoz , but as I've traced the issue to what I believe to definitely be a MarchHare bug we should move the discussion to ruby-amqp/march_hare#107 Interestingly MarchHare does recover if rabbitmq comes back in under the 5s window. If it takes any longer it is perma-wedged. |
BTW, thanks for the excellent docker compose @digitalkaoz |
@andrewvc and I worked on this for a while today. A packet capture in Wireshark showed that the rabbitmq client used by Logstash was indeed waiting 5 seconds (as expected) after rabbitmq was stopped before reconnecting. However, reconnection was only attempted once:
The problem we are looking at now is to understand why the rabbitmq client does not reconnect again after the first reconnection attempt has timed out. |
Based on the behavior, I was able to reproduce this without docker and without haproxy
|
@jordansissel this is an edge case in the protocol. Before connection negotiation happens we cannot set up a heartbeat timer, but without it it takes a long time to detect unresponsive peers (whether there is a TCP connection issue or just no response, as in the case above). It therefore requires an intermediary or a bogus server that doesn't respond to a protocol header. |
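One way to stand up such a bogus server locally (an assumption about the netcat-based repro mentioned further down, not necessarily the exact commands used) is to let netcat accept TCP connections on 5672 and never answer the AMQP protocol header:

$ nc -l 5672        # BSD netcat; traditional netcat uses `nc -l -p 5672`
# point the client at localhost:5672: the TCP connect succeeds, but the
# handshake never completes, so failure detection takes much longer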
@michaelklishin confirmed this with @jordansissel . The fix is in this PR: ruby-amqp/march_hare#108 Thanks for pointing us to the right place! |
@andrewvc @jordansissel thank you for finding a way to reproduce with netcat. If there is a way for you to verify end-to-end with March Hare master, I'd be happy to do a March Hare release tomorrow morning (European time). |
March Hare 2.20.0 is up. |
@michaelklishin thanks, I'll test it now |
@michaelklishin @andrewvc @jordansissel I can confirm that 2.20.0 of march_hare fixes the reconnection issues. Thanks for all your efforts! |
@digitalkaoz fantastic! |
Should these changes be ported over to logstash-output-rabbitmq as well? |
I think it is, because the mixin (used by input and output) loads
|
Before version 3.3.0, when the Rabbit server went down, logstash would throw:
With version 4.0.1, nothing happens (as you can see in the docker-compose logs, here https://gist.github.com/ebuildy/d405dab7dbaca2e3a03dc2a7fa6f5b29) if the queue server is down. Is this normal?
Thank you,