
No error if RabbitMQ goes down #76

Closed
ebuildy opened this issue Mar 29, 2016 · 78 comments

@ebuildy

ebuildy commented Mar 29, 2016

Before version 3.3.0, when the RabbitMQ server went down, Logstash logged:

Queue subscription ended! Will retry in 5s
RabbitMQ connection down, will wait to retry subscription

With version 4.0.1, nothing happens if the queue server is down (as you can see in the docker-compose logs here: https://gist.github.com/ebuildy/d405dab7dbaca2e3a03dc2a7fa6f5b29). Is this normal?

Thank you,

@andrewvc
Contributor

@ebuildy That is because we now use the RabbitMQ library's built-in auto-reconnect facility, which is more robust. We'd be glad to accept a patch that hooks into it and logs these events, however!
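
For reference, a rough sketch of what such logging hooks could look like at the level of the RabbitMQ Java client (march_hare wraps this API, so a real patch to the plugin would go through march_hare instead; treat this as an illustration, not the plugin's code):

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.Recoverable;
import com.rabbitmq.client.RecoveryListener;

public class RecoveryLogging {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setAutomaticRecoveryEnabled(true);   // the built-in auto-reconnect facility
        factory.setNetworkRecoveryInterval(5000);    // retry every 5s, roughly the old behaviour

        Connection connection = factory.newConnection();

        // Log why the connection went away (fires for both clean and unexpected shutdowns).
        connection.addShutdownListener(cause ->
                System.err.println("RabbitMQ connection shut down: " + cause.getMessage()));

        // Connections created with automatic recovery enabled implement Recoverable,
        // so we can log when the client re-establishes the connection.
        ((Recoverable) connection).addRecoveryListener(new RecoveryListener() {
            public void handleRecovery(Recoverable recoverable) {
                System.err.println("RabbitMQ connection recovered");
            }
            public void handleRecoveryStarted(Recoverable recoverable) {
                System.err.println("RabbitMQ connection recovery started");
            }
        });
    }
}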

@ebuildy
Author

ebuildy commented Mar 30, 2016

Are you talking about this: https://www.rabbitmq.com/api-guide.html#recovery ?

Thank you. I must first finish some PRs on other Logstash plugins and the RabbitMQ Java client (rabbitmq/rabbitmq-java-client#138), then I will take a look at it.

@digitalkaoz

@andrewvc Besides there being no logs when the RabbitMQ server closes the channel, a reconnect isn't happening either, which is why I assume Logstash with the rabbitmq input is not usable at the moment, especially in a dockerized environment... Any solutions here? We use a haproxy in between RabbitMQ and Logstash.

@andrewvc
Contributor

@digitalkaoz why is the rabbitmq server closing the channel?

AFAIK if you do that the meaning is you want the client to stop and not restart. Please correct me if I'm wrong here.

@andrewvc
Contributor

@digitalkaoz that's a different situation, may I add, from one where the server has gone down as a result of an error.

@andrewvc
Contributor

@ebuildy @digitalkaoz , if you all can share a reproducible test case I'd be glad to try it and find a fix. Just list the steps to reproduce, preferably just using logstash and a local rabbitmq.

@digitalkaoz

digitalkaoz commented Oct 18, 2016

@andrewvc The rabbitmq and haproxy servers are restarted during deployments (they are just docker containers).

Here is the setup:

rabbitmq -> 5672(tcp) -> haproxy -> 5672(tcp) -> logstash

rabbitmq.conf is default

haproxy.conf

listen amqp
        mode tcp
        option tcplog
        bind *:5672
        log global
        balance         roundrobin
        timeout connect 2s
        timeout client  2m
        timeout server  2m
        option          clitcpka
        option tcp-check
        server rabbitmq rabbitmq:5672 check port 5672

logstash.conf

input {
    rabbitmq {
        connection_timeout => 1000
        heartbeat => 30
        host => "haproxy"
        subscription_retry_interval_seconds => 30
        queue => "logs"
        user => "guest"
        password => "guest"
    }
}

If we only restart RabbitMQ, it seems fine when we gracefully shut down the server:

$ docker exec rabbitmq rabbitmqctl close_connection {$CON_ID} "rabbit shutdown"
$ docker exec rabbitmq rabbitmqctl stop_app

We explicitly close the connection to force the logstash-input-rabbitmq plugin to restart.

But if we restart haproxy, the TCP connection gets cut off ungracefully and the Logstash plugin won't notice that the channel was closed because it received an error frame (and doesn't go into restart mode). The Java recovery machinery doesn't seem to work for that.

Hope that's enough information.

@digitalkaoz

I think that's quite a common scenario in a containerized world.

@digitalkaoz

digitalkaoz commented Oct 18, 2016

When I restart Logstash I get the following error:

An unexpected error occurred!", :error=>com.rabbitmq.client.AlreadyClosedException: connection is already closed due to connection error; cause: java.io.EOFException, :class=>"Java::ComRabbitmqClient::AlreadyClosedException", :backtrace=>["com.rabbitmq.client.impl.AMQChannel.ensureIsOpen(com/rabbitmq/client/impl/AMQChannel.java:195)", "com.rabbitmq.client.impl.AMQChannel.rpc(com/rabbitmq/client/impl/AMQChannel.java:241)", "com.rabbitmq.client.impl.ChannelN.basicCancel(com/rabbitmq/client/impl/ChannelN.java:1127)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)", "RUBY.method_missing(/opt/logstash/vendor/bundle/jruby/1.9/gems/march_hare-2.15.0-java/lib/march_hare/channel.rb:874)", "RUBY.shutdown_consumer(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-rabbitmq-4.1.0/lib/logstash/inputs/rabbitmq.rb:279)", "RUBY.stop(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-rabbitmq-4.1.0/lib/logstash/inputs/rabbitmq.rb:273)", "RUBY.do_stop(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/inputs/base.rb:83)", "org.jruby.RubyArray.each(org/jruby/RubyArray.java:1613)", "RUBY.shutdown(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/pipeline.rb:385)", "RUBY.stop_pipeline(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:413)", "RUBY.shutdown_pipelines(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:406)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1342)", "RUBY.shutdown_pipelines(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:406)", "RUBY.shutdown(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:402)", "RUBY.execute(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/agent.rb:233)", "RUBY.run(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/runner.rb:94)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:281)", "RUBY.run(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.4.0-java/lib/logstash/runner.rb:99)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:281)", "RUBY.initialize(/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/task.rb:24)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}

@andrewvc
Contributor

@digitalkaoz thanks for the very detailed description here!

@michaelklishin should the java client's auto connection recovery just handle this?

@michaelklishin
Contributor

It takes time to detect peer unavailability. I don't see how this is a connection recovery issue: recovery cannot start if a client connection is still considered to be alive.

@michaelklishin
Contributor

Also: do not set heartbeats to a value lower than 5 seconds. That is a recipe for false positives.

@digitalkaoz

digitalkaoz commented Oct 19, 2016

We waited a lot, but Logstash won't recover the connection. I would rather get false positives than nothing... any more ideas? The error message at the end says it very clearly: the channel was already closed because it received an error frame, and the channel is unusable once it has received an error frame...

and neither this plugin, nor march_hare, nor the Java AMQP lib seems to notice that

@andrewvc
Contributor

@michaelklishin thanks for popping in!

He's setting the heartbeat to 30 seconds. It's confusing because the connection timeout is in millis, but the heartbeat is in seconds.

I think this is correct, yes? Looking at the source of the Java client, it looks like the timeout is set in millis and heartbeats in seconds. Is that all correct?
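
For anyone double-checking the units, a minimal sketch at the Java-client level (not the plugin's actual wiring; the plugin passes these values through march_hare):

import com.rabbitmq.client.ConnectionFactory;

public class TimeoutUnits {
    public static void main(String[] args) {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setConnectionTimeout(1000);   // TCP connection timeout, in milliseconds
        factory.setRequestedHeartbeat(30);    // requested heartbeat interval, in seconds
    }
}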

@digitalkaoz

@andrewvc @michaelklishin I created a tiny docker-compose setup to reproduce this issue:

https://github.com/digitalkaoz/rabbit-haproxy-logstash-fail

Simply follow the reproduction steps...

@michaelklishin
Contributor

Heartbeat timeout is in seconds per protocol spec. Also see this thread.

@michaelklishin
Contributor

So you have HAproxy in between. HAproxy has its own timeouts (which heartbeats in theory make irrelevant but still) and what RabbitMQ sees is a TCP connection from HAproxy, not the client. I can do a March Hare release with a preview version of the 3.6.6 Java client, however.

@andrewvc
Contributor

Thanks @digitalkaoz , I'll try and repro today.

@andrewvc
Contributor

Also thanks for the background @michaelklishin . I'm wondering if what's happening is that HAProxy is switching the backend without letting logstash know, and without disconnecting from logstash. Will report back.

@digitalkaoz

It still fails when I wait, let's say, 5 minutes... @michaelklishin as I understand it, the relevant fix is in https://github.com/rabbitmq/rabbitmq-java-client/releases/tag/v4.0.0.M2 ?

@michaelklishin
Contributor

@andrewvc typically proxies propagate upstream connection loss. At least that's what HAproxy does by default. In fact, if it didn't then the two logical ends of the TCP connection would be out of sync.

I doubt it's a Java client problem here but I've updated March Hare to the latest 3.6.x version from source.

@michaelklishin
Contributor

@digitalkaoz I'm sorry but there is no evidence that I have seen that this is a client library issue. A good thing to try would be to eliminate HAproxy from the picture entirely and also verify the effective heartbeat timeout value (management UI displays it on the connection page).
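
Besides the management UI, the negotiated value can also be read off a test connection via the Java client; a minimal sketch, assuming a reachable broker (or proxy) on localhost:

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class HeartbeatCheck {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");        // point at the broker or the proxy under test
        factory.setRequestedHeartbeat(30);   // requested value, in seconds

        Connection connection = factory.newConnection();
        // The effective (negotiated) heartbeat may differ from the requested one.
        System.out.println("Effective heartbeat: " + connection.getHeartbeat() + "s");
        connection.close();
    }
}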

@digitalkaoz

digitalkaoz commented Oct 25, 2016

@michaelklishin Sure, if I remove haproxy from the setup the reconnect works, but that's not a typical setup (at least not at scale).

With haproxy in the middle, the reconnect fails (even with a heartbeat of 10s), but my question is why? Shouldn't the heartbeat tell the plugin that the connection isn't healthy anymore and re-initiate a recovery/reconnect?

@michaelklishin
Contributor

It should. We don't know why that doesn't happen: that's why I recommended checking the actual effective value.

@digitalkaoz

After restarting RabbitMQ there is no connection alive on the rabbit node because nobody initiated a new one, neither haproxy nor Logstash, because their connection is still there but de facto useless?

@andrewvc
Contributor

@michaelklishin, is there a best practice here for accomplishing what @digitalkaoz wants to accomplish? I think having a layer of indirection via a load balancer makes sense. This is something that's come up multiple times, I'm wondering if the RabbitMQ project has an official stance here.

@michaelklishin
Contributor

There's no stance as HAproxy is supposed to be transparent to the client.

@digitalkaoz HAproxy is not supposed to initiate any connections upstream unless a new client connects to it.

Can we please stop guessing and move on to forming hypotheses and verifying them? The first one I have is: the heartbeat value is actually not set to what the reporter thinks it is. I've mentioned how this can be verified above. Without this we can pour hours and hours of time into it and get nothing out of it. My team already spends hours every day on various support questions, and we would appreciate it if reporters cooperated a bit to save us all some time.

I also pushed a Java client update to March Hare, even though I doubt it will change anything.

A Wireshark/tcpdump capture will provide a lot of information to look at.

@michaelklishin
Contributor

A Wireshark capture should make it easy to see heartbeat frames on the wire: not only their actual presence but also timestamps (they are sent at roughly 1/2 of the effective timeout value).

@digitalkaoz

@michaelklishin The heartbeat is set to 10s, as I told you above... but that didn't change anything. Or do you mean some other value?

[Screenshot: screen shot 2016-10-25 at 16:17:12]

@andrewvc
Contributor

Bumping versions fixed this issue for me, @michaelklishin. Thanks for pushing the new version.

Opened this PR to bump it: logstash-plugins/logstash-mixin-rabbitmq_connection#32

We need a separate PR to do a better job logging bad connections.

@andrewvc
Contributor

Please respond to this ticket if you're still experiencing this bug @digitalkaoz and we can reopen it. You should be able to upgrade by running logstash-plugin --update.

@digitalkaoz

digitalkaoz commented Oct 26, 2016

@andrewvc I'm not sure how it fixes it for you?

march_hare-2.19.0
logstash-mixin-rabbitmq_connection-4.2.0
logstash-input-rabbitmq-4.1.0

Same thing: if I kill and start RabbitMQ, Logstash still doesn't recover the connection. The last message from Logstash is now (thanks to your newest commit ;) ):

{:timestamp=>"2016-10-26T19:12:31.687000+0000", :message=>"RabbitMQ connection was closed!", :url=>"amqp://guest:XXXXXX@proxy:5672/", :automatic_recovery=>true, :cause=>com.rabbitmq.client.ShutdownSignalException: connection error, :level=>:warn}

The important thing is to not restart Logstash!
Can you tell me your steps, i.e. how it worked on your side?

@andrewvc
Contributor

andrewvc commented Oct 26, 2016

@digitalkaoz Yes, I followed your procedure (using your excellent docker-compose):

  1. Start up the whole stack, wait for LS to come up
  2. docker-compose kill rabbit producer
  3. docker-compose up rabbit producer

That all works for me. Not for you? BTW, I'd recommend upgrading to logstash-input-rabbitmq 5.2.0

@digitalkaoz

digitalkaoz commented Oct 27, 2016

still no luck:

logstash:5
logstash-mixin-rabbitmq_connection-4.2.0
logstash-input-rabbitmq-5.2.0
march_hare-2.19.0

What am I doing wrong?

My steps:

$ docker-compose up # all up
$ docker-compose kill rabbit producer #kill rabbit+producer
$ sleep 10
$ docker-compose start rabbit producer #restart rabbit+producer

Still Logstash won't catch up :/

@michaelklishin
Contributor

@digitalkaoz I'm sorry to be so blunt but if you want to get to the bottom of it you need to put in some debugging effort of your own. We have provided several hypotheses, investigated at least two, explained what various log entries mean, what can be altered to identify what application may be introducing the behaviour observed, analyzed at least one Wireshark dump, and based on all the findings produced a new March Hare version. That's enough information to understand how to debug a small distributed system like this one.

Asking "what am I doing wrong" is unlikely to be productive. Trying things at random is unlikely to be efficient.

@andrewvc
Contributor

@digitalkaoz can you update the docker-compose stuff with the latest? One thing that was different in my repro is that I ran everything except logstash in docker. That shouldn't make a difference, but maybe it does.

@andrewvc
Contributor

Hey @digitalkaoz I'm still glad to help you get to the bottom of this. @michaelklishin thanks so much for your help, I really appreciate it!

I'm glad to personally put in some more hours here myself. What would help me move this forward @digitalkaoz is if you could update your docker config to the latest versions so we can be reproing the same things.

@digitalkaoz

Sorry @andrewvc, I had a busy weekend.

I updated everything to the latest versions, but still no luck. The repo is up to date now.

$ docker-compose up # all up
$ docker-compose kill rabbit producer
$ sleep 10
$ docker-compose start rabbit producer

@michaelklishin
Contributor

In the snippet above both RabbitMQ and the publisher are shut down and started at the same time. Note that these are not the same steps that at least some in this thread performed, and it creates a natural race condition between the two starting. If the producer starts before RabbitMQ, then connection recovery WILL NOT BE ENGAGED, since there never was a successful connection to recover: the initial TCP connection will fail.

@digitalkaoz

@michaelklishin the producer waits 10s before doing anything, so rabbit has time to start

@andrewvc
Contributor

andrewvc commented Nov 1, 2016

I've reopened this issue now that I've reconfirmed it, @digitalkaoz, but as I've traced the issue to what I believe is definitely a MarchHare bug, we should move the discussion to ruby-amqp/march_hare#107.

Interestingly, MarchHare does recover if RabbitMQ comes back within the 5s window. If it takes any longer, it is perma-wedged.

@andrewvc andrewvc reopened this Nov 1, 2016
@andrewvc
Contributor

andrewvc commented Nov 1, 2016

BTW, thanks for the excellent docker compose @digitalkaoz

@jordansissel
Contributor

jordansissel commented Nov 1, 2016

@andrewvc and I worked on this for a while today. A packet capture in Wireshark showed that the RabbitMQ client in Logstash was indeed waiting 5 seconds (as expected) after RabbitMQ was stopped before reconnecting. However, reconnecting was only attempted once:

  • ✔️ Client's connection is terminated by HAproxy because RabbitMQ was terminated
  • ✔️ Client initiates new connection to haproxy
  • ✔️ Client sends protocol negotiation: AMQP\0\0\9\1 ("AMQP" + hex 00000901)
  • ✔️ 5 seconds later, Client starts tcp teardown, sending FIN to close the connection. Presumably because the backend (haproxy) did not respond to the client's protocol negotiation
  • ❌ No further connection attempts are made by the client. Expectation is that after 5 seconds, again, the client would attempt to reconnect.

The problem we are looking at now is to try and understand why the RabbitMQ client is not reconnecting after the first reconnection attempt timed out.
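
For context on the protocol-negotiation step above, the 8-byte header the client sends is literally "AMQP" followed by 0, 0, 9, 1; a small illustration of those bytes (not client code):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AmqpProtocolHeader {
    public static void main(String[] args) {
        // "AMQP" + 0x00 0x00 0x09 0x01, i.e. the AMQP 0-9-1 protocol header.
        byte[] header = new byte[8];
        System.arraycopy("AMQP".getBytes(StandardCharsets.US_ASCII), 0, header, 0, 4);
        header[4] = 0; header[5] = 0; header[6] = 9; header[7] = 1;
        System.out.println(Arrays.toString(header));
    }
}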

@jordansissel
Contributor

jordansissel commented Nov 1, 2016

Based on the behavior, I was able to reproduce this without docker and without haproxy:

  • Run rabbitmq + logstash
  • Let logstash rmq input connect
  • Stop rabbitmq
  • run nc -l 5672 to accept the next connection retry from Logstash
  • wait until 'AMQP' appears on nc (this is the rmq client attempting to reconnect)
  • After this attempt to reconnect, rmq input never tries again.
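
For anyone who prefers to script the bogus server, a Java equivalent of the nc step (an illustration, assuming port 5672 is free once RabbitMQ is stopped):

import java.net.ServerSocket;
import java.net.Socket;

// Accepts one connection and never answers the AMQP protocol header,
// mimicking `nc -l 5672` in the steps above.
public class SilentAmqpServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(5672)) {
            Socket client = server.accept();
            System.out.println("Accepted connection from " + client.getRemoteSocketAddress());
            Thread.sleep(Long.MAX_VALUE);   // hold the socket open without responding
        }
    }
}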

@michaelklishin
Contributor

@jordansissel This is an edge case in the protocol. Before connection negotiation happens we cannot set up a heartbeat timer, but without one it takes a long time to detect unresponsive peers (whether there is a TCP connection issue or simply no response, as in the nc example).

Reproducing it therefore requires an intermediary or a bogus server that doesn't respond to the protocol header.
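
As a side note, the Java client does let you bound how long the handshake itself may hang; a sketch assuming ConnectionFactory#setHandshakeTimeout is available in the client version in use (it does not change the recovery behaviour discussed here):

import com.rabbitmq.client.ConnectionFactory;

public class HandshakeTimeoutExample {
    public static void main(String[] args) {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setConnectionTimeout(5000);   // bound the TCP connect, in milliseconds
        factory.setHandshakeTimeout(5000);    // bound the AMQP handshake (header + negotiation), in milliseconds
    }
}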

@andrewvc
Contributor

andrewvc commented Nov 1, 2016

@michaelklishin confirmed this with @jordansissel . The fix is in this PR: ruby-amqp/march_hare#108

Thanks for pointing us to the right place!

@michaelklishin
Contributor

@andrewvc @jordansissel thank you for finding a way to reproduce with netcat. If there is a way for you to verify end-to-end with March Hare master, I'd be happy to do a March Hare release tomorrow morning (European time).

@michaelklishin
Contributor

March Hare 2.20.0 is up.

@digitalkaoz

digitalkaoz commented Nov 2, 2016

@michaelklishin Thanks, I'll test it now.
@andrewvc Can you release a 4.2.1 of logstash-mixin-rabbitmq_connection so I get march_hare 2.20.0 in?

@digitalkaoz

@michaelklishin @andrewvc @jordansissel I can confirm that march_hare 2.20.0 fixed the reconnection issues. Thanks for all your efforts!

@andrewvc
Contributor

andrewvc commented Nov 2, 2016

@digitalkaoz fantastic!

@andrewvc andrewvc closed this as completed Nov 2, 2016
@jakauppila

Should these changes be ported over to logstash-output-rabbitmq as well?

@digitalkaoz

I think so: because the mixin (used by input and output) loads march_hare, the output gets this fix too. Or am I wrong?

