[bug]: unable to read message from peer: EOF, disconnections and routing stops #8125
Comments
I would recommend updating to v0.17 (or v0.17.1, which should be out soon, if you want to wait), as issues around SQLite and the p2p code were fixed. Afterwards it will be easier to debug the issue, if it still exists.
0.17 contains a few things that I hear have not been fixed, so I'd rather wait until I see no more hiccups reported by plebs; upgrading could open a completely new can of worms while not assuring a fix for the previous one 😅
It would be great if you could encourage other operators to report those problems here as well, if they have not been reported yet. Could you give some goroutine dumps like described here? These files would be helpful:
I remember people talking about ungraceful shutdowns in 0.17 and possibly other things in the plebnet groups, but I cannot pinpoint anything precise; I was just left with the impression of "0.17 = unsafe to use yet". Regarding the profiling, I have the port enabled already, but I guess it would only be useful when the issue presents itself again, right?
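For reference, lnd can expose a Go pprof HTTP server for exactly this kind of goroutine dump; a minimal lnd.conf sketch, assuming port 9736 (an arbitrary choice, not from this thread):

```
; lnd.conf fragment -- enables the Go pprof endpoint (port is arbitrary)
[Application Options]
profile=9736
```

Once enabled, a goroutine dump can be fetched with something like `curl http://localhost:9736/debug/pprof/goroutine?debug=2 > goroutine.txt` while the issue is happening.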
This is an overreaction IMO. There is nothing inherently "unsafe" with the 0.17 release.
Okay, just got the issue again. This was after removing the "connecting to many more peers" part. I have run the profiling commands you asked for and just restarted.
I am sorry if I caused any hard feelings; that's just my MO with lnd versions, and I know now that several plebs are running on it. I will consider 0.17.1 later though.
Where you couldn't convince me to upgrade, a couple of other plebs succeeded, so I've been running 0.17 for a couple of days now. Too early to say it's fixed anyway; in the past I've run even for a week before noticing this issue.
You don't need to sync gossip from 200 sync peers; it won't help with much, and we default to fewer peers to reduce the amount of actual gossip bandwidth. It's unclear what your actual issue is: peers may disconnect for various reasons, so that doesn't appear to be abnormal. If you have 200 gossip sync peers and they're all sending you data, but processing is slow for whatever reason, then the processing queue may fill up, the peers block on writing, and eventually disconnect. Looking at the logs, I don't see anything out of the ordinary, but you may want to move
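The back-pressure mechanism described here (slow processing, full queue, blocked writers, eventual disconnect) can be illustrated with a toy sketch; this is plain Python, not lnd code, and the queue size, timings, and peer count are made up:

```python
import queue
import threading
import time

# Toy model: gossip "peers" write into a bounded processing queue.
# When the consumer is slow the queue fills, writers block, and a
# writer that waits too long gives up -- the analogue of a disconnect.

msgs = queue.Queue(maxsize=4)   # small buffer to force back-pressure
dropped = []                    # peers that gave up ("disconnected")

def peer(name):
    for i in range(10):
        try:
            # Block for at most 0.1s waiting for space in the queue.
            msgs.put((name, i), timeout=0.1)
        except queue.Full:
            dropped.append(name)  # writer timed out: "disconnect"
            return

def slow_consumer():
    while True:
        item = msgs.get()
        if item is None:          # sentinel: shut down
            return
        time.sleep(0.05)          # deliberately slow processing

consumer = threading.Thread(target=slow_consumer)
consumer.start()

peers = [threading.Thread(target=peer, args=(f"peer{i}",)) for i in range(8)]
for t in peers:
    t.start()
for t in peers:
    t.join()

msgs.put(None)   # let the consumer drain and exit
consumer.join()
print(f"{len(dropped)} of 8 peers gave up")
```

With 8 producers contending for a 4-slot queue drained at one item per 0.05s, most writers exceed their 0.1s patience, so some subset of peers always "disconnects"; the exact count varies with scheduling.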
The 200 active sync peers, or even the 47 I've been on for a while, didn't look like a determining factor in the issue, because after I went back to the default 3 and stopped increasing the number of peers, I encountered the same problem after a couple of days. Since I have a 12-core Xeon, I don't think the processing load is a factor either, but I took the plunge into 0.17 and will be able to report on that now.
Processing load may not be a factor, but I/O may very well be. But can you describe the problem you perceive more precisely? That you have 300 peers, most of which you don't have channels open with, and they disconnect from time to time? That's totally normal: connections aren't persistent unless you have a channel with them. If you weren't able to keep connections to peers you had a channel with, then that would actually be something to look into. Your channel peers are the only peers that really matter. Having more connections to other peers you don't have channels with doesn't necessarily help your node, and doesn't help you route any more effectively.
The core of the issue is described in the first paragraph of the OP (I am on 0.17 now and still have to reproduce it; on 0.16.4 it was like clockwork that it would happen after some time). In the meanwhile I have upgraded and been running 0.17 for 2.5 days now with no issue, and I just restarted after having increased the number of active sync peers, to test whether the new version still has the same issue.
Hi @KnockOnWoodNode, is this still an issue after upgrading to 0.17?
Looks like it is not, so far; I've been running with many active sync peers and no funny business. I wouldn't say the issue is completely left behind yet, but it's looking good.
Thanks for reporting back @KnockOnWoodNode, closing this issue now. Please reopen if you observe the issue again and we can investigate.
Background
After some amount of time, my node starts having channels to several peers deactivated, one after the other, only to be reactivated within a few minutes at most. At the same time it becomes unable to route HTLCs: I see no more forwards being notified by the bos bot, and one of my peers who tracks these things reports a spike in pending outgoing HTLCs from their node to mine whenever this happens, which slowly resolve themselves by failing.
Restarting lnd solves the issue, until next time this happens.
I couldn't form solid hypotheses about why this happens, but here are all the details I can provide, so maybe you have some ideas of your own.
I run the sqlite backend, and increased the timeout to 10m to avoid SQLITE_BUSY errors. I don't remember this error happening before, but I am 90% sure it started after I began connecting to more peers beyond my direct channel peers, to get gossip updates from the network faster. (This was before I knew about active and passive sync peers: at first I was connecting to many peers which were all passive; later on I caught up and increased my active peers value, but none of this seems to have had any influence on the issue.)
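For context, the two knobs mentioned here (active sync peers and the SQLite timeout) map to lnd config options roughly like the following sketch; the option names are as I recall them from sample-lnd.conf and should be double-checked against your lnd version:

```
; lnd.conf fragment (hedged: verify names against sample-lnd.conf)
[Application Options]
; number of peers to actively sync graph state with (default 3)
numgraphsyncpeers=3

[db]
db.backend=sqlite
; how long SQLite waits on a locked database before returning SQLITE_BUSY
db.sqlite.busytimeout=10m
```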
What I seemed to notice, besides the problem arising after I increased the number of peers I connect to, is that the more peers I have, the sooner it happens. Using persistent connections or not doesn't appear to change anything.
I attached a log for one node, picked among the ones my node detected disconnections to this last time. I had increased the PEER log level to debug and zgrep'd the logs for its pubkey. I have since restored the info log level for everything.
I have disabled, for the time being, my script that connects to more peers, to be able to report what happens in the upcoming days.
rocket.log
Your environment

- version of lnd: 0.16.4
- which operating system (uname -a on *Nix): 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux
- version of btcd, bitcoind, or other backend: 24.1

Steps to reproduce
Have the sqlite backend (no idea if necessary), have an active routing node with 40-something channels, and connect to many peers (above 300 for a faster mishap) with
lncli connect <pubkey>@<address>:<port>
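If you want to script that repro step, here is a sketch that only prints the `lncli connect` invocations instead of executing them; the pubkeys and addresses below are placeholders, and in practice `PEERS` would come from your own peer list:

```shell
#!/bin/sh
# Placeholder peer list -- NOT real nodes; substitute your own.
PEERS="02aa...placeholder1@203.0.113.10:9735
03bb...placeholder2@203.0.113.11:9735"

# Build one `lncli connect` command per peer; printed rather than
# executed, so the list can be reviewed before running it for real.
CMDS=$(printf '%s\n' "$PEERS" | while IFS= read -r peer; do
    printf 'lncli connect %s\n' "$peer"
done)
echo "$CMDS"
```

Piping the output through `sh` would actually run the connects, which is why the sketch keeps generation and execution separate.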
Expected behaviour
lnd continues operating normally, managing forwards like a champ
Actual behaviour
channels are disconnected at random, and HTLCs are not being processed