
[bug]: unable to read message from peer: EOF, disconnections and routing stops #8125

Closed
KnockOnWoodNode opened this issue Oct 30, 2023 · 16 comments
Labels
bug Unintended code behaviour needs triage

Comments

@KnockOnWoodNode

KnockOnWoodNode commented Oct 30, 2023

Background

It will happen, after some amount of time, that my node starts having channels deactivated to several peers, one after the other, only to be reactivated within a few minutes at most; at the same time, it becomes unable to route HTLCs: I see no more forwards being notified by the bos bot, and one of my peers who tracks these things reports a spike in pending outgoing HTLCs from their node to mine whenever this happens, which slowly resolve themselves by failing.
Restarting lnd solves the issue, until next time this happens.

I couldn't form solid hypotheses about why this happens, but here are all the details I can provide, so maybe you have some ideas of your own.
I run the sqlite backend and increased the timeout to 10m to avoid SQLITE_BUSY errors. I don't remember this error happening before, but I am 90% sure it started after I began connecting to more peers than just my direct channel ones, to get gossip updates from the network faster (this was before I knew about active vs. passive sync peers: at first the many peers I connected to were all passive, and I later caught up and increased my active sync peers value, but none of this seems to have had any influence on the issue).
What I also seemed to notice, besides the problem arising after I increased the number of peers I connect to, is that the more peers I have, the sooner this happens. Using persistent connections or not doesn't appear to change anything.
I attached a log for one node, picked among the ones my node detected disconnections from this last time. I had increased the PEER log level to debug and zgrep'd the logs for its pubkey; I have since restored the info log level for everything.
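Roughly, the log extraction was along these lines (a minimal sketch; the pubkey placeholder and the default lnd log path are assumptions):

  # grep the current and rotated lnd logs for a single peer's pubkey
  zgrep '<peer_pubkey>' ~/.lnd/logs/bitcoin/mainnet/lnd.log* > rocket.log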

I have disabled, for the time being, my script that connects to more peers, to be able to report what happens in the upcoming days.

rocket.log

Your environment

  • version of lnd: 0.16.4
  • which operating system (uname -a on *Nix): 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux
  • version of btcd, bitcoind, or other backend: 24.1
  • any other relevant environment details: sqlite backend, 12-core Xeon with 64GB of ECC RAM and a 6-SSD ZFS mirror pool

Steps to reproduce

Have the sqlite backend (no idea if it's necessary), have an active routing node with 40-something channels, and connect to many peers (above 300 for a faster mishap) with lncli connect <pubkey>@<address>:<port>
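For the bulk-connect step, something along these lines does the job (a minimal sketch, assuming a peers.txt file holding one <pubkey>@<address>:<port> URI per line; the file name is just an example):

  # connect to every peer URI listed in peers.txt
  while read -r uri; do
    lncli connect "$uri"
  done < peers.txt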

Expected behaviour

lnd continues operating normally, managing forwards like a champ

Actual behaviour

channels are disconnected at random, HTLCs are not being processed

@KnockOnWoodNode KnockOnWoodNode added bug Unintended code behaviour needs triage labels Oct 30, 2023
@bitromortac
Collaborator

I would recommend updating to v0.17 (or v0.17.1, which should be out soon, if you want to wait), as issues around SQLite and the p2p code were fixed. Afterwards it will be easier to debug the issue if it still exists.

@KnockOnWoodNode
Author

I would recommend updating to v0.17 (or v0.17.1, which should be out soon, if you want to wait), as issues around SQLite and the p2p code were fixed. Afterwards it will be easier to debug the issue if it still exists.

0.17 contains a few things that I hear have not been fixed, so I'd rather wait until I see no more hiccups reported by plebs; upgrading could open a completely new can of worms while not guaranteeing a solution to the previous one 😅

@bitromortac
Collaborator

It would be great if you could encourage other operators to report those problems here as well, if they haven't been reported yet. Could you provide some goroutine dumps as described here? These files would be helpful:
curl http://localhost:PORT/debug/pprof/goroutine?debug=0 > goroutine_0.prof
curl http://localhost:PORT/debug/pprof/goroutine?debug=2 > goroutine_2.prof
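Note that the pprof endpoint above is only served if lnd's profiling port is enabled; a minimal sketch of the relevant lnd.conf line, with 9736 as an example port:

  # lnd.conf: expose the HTTP pprof/profiling server (port number is an example)
  profile=9736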

@KnockOnWoodNode
Author

It would be great if you could encourage other operators to report those problems here as well, if they haven't been reported yet. Could you provide some goroutine dumps as described here? These files would be helpful: curl http://localhost:PORT/debug/pprof/goroutine?debug=0 > goroutine_0.prof curl http://localhost:PORT/debug/pprof/goroutine?debug=2 > goroutine_2.prof

I remember people talking about ungraceful shutdowns in 0.17, and possibly other things, in the plebnet groups, but I cannot pinpoint anything precise; I was just left with the impression of "0.17 = unsafe to use yet".

Regarding the profiling, I have the port enabled already, but I guess it would be useful only when the issue presents itself again, right?

@saubyk
Collaborator

saubyk commented Oct 31, 2023

"0.17=unsafe to use yet".

This is an overreaction IMO. There is nothing inherently "unsafe" about the 0.17 release.

@KnockOnWoodNode
Author

It would be great if you could encourage other operators to report those problems here as well, if they haven't been reported yet. Could you provide some goroutine dumps as described here? These files would be helpful: curl http://localhost:PORT/debug/pprof/goroutine?debug=0 > goroutine_0.prof curl http://localhost:PORT/debug/pprof/goroutine?debug=2 > goroutine_2.prof

Okay, just got the issue again.

This was after removing the "connecting to many more peers" part, but keeping numgraphsyncpeers=200, which meant having 51 active sync peers. So this also happens with more active sync peers.

I have run the profiling commands you asked for and just restarted with numgraphsyncpeers=10 to see if it happens again; let me know if you want profiling reports for that as well.

goroutine_.zip

@KnockOnWoodNode
Author

"0.17=unsafe to use yet".

This is an overreaction IMO. There is nothing inherently "unsafe" about the 0.17 release.

I am sorry if I caused any hard feelings; that's just my MO with lnd versions, and I know now that several plebs are running on it. I will consider 0.17.1 later though.

@bitromortac
Collaborator

#8146 is likely unrelated to this issue, but there are some goroutines blocked on WaitForDisconnect, for which some fixes were made in v0.17 (see for example #7856). So I'd suggest waiting until you're running v0.17.1 and then seeing if the issue still persists.

@KnockOnWoodNode
Author

#8146 is likely unrelated to this issue, but there are some goroutines blocked on WaitForDisconnect, for which some fixes were made in v0.17 (see for example #7856). So I'd suggest waiting until you're running v0.17.1 and then seeing if the issue still persists.

Where you couldn't convince me to upgrade, a couple of other plebs succeeded, so I've been running 0.17 for a couple of days now.
Nothing weird has happened yet, but I'm back to a normal number of peers and 3 active sync peers, and I believe I'll be increasing those very soon.

Too early to say it's fixed anyway; in the past I've run for even a week before noticing this issue.

@Roasbeef
Member

Roasbeef commented Nov 3, 2023

This was after removing the "connecting to many more peers" part, but keeping numgraphsyncpeers=200 which meant having 51 active sync peers. So this also happens with more active sync peers.

You don't need to sync gossip from 200 sync peers; it won't help much, and we use fewer peers to reduce the amount of actual gossip bandwidth. It's unclear what your actual issue is: peers may disconnect for various reasons, so that doesn't appear to be abnormal.

If you have 200 gossip sync peers and they're all sending you data, but processing is slow for whatever reason, then the processing queue may fill up, they block on writing, and eventually disconnect.

Looking at the logs, I don't see anything out of the ordinary, but you may want to move PEER and DISC to info so the logs have a bit less spam.
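For reference, a minimal sketch of the corresponding lnd.conf settings (numgraphsyncpeers is shown at its default of 3, and the log levels are the ones suggested above):

  # lnd.conf: keep the default number of active gossip sync peers
  numgraphsyncpeers=3
  # quieten the peer and discovery subsystems
  debuglevel=PEER=info,DISC=info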

@KnockOnWoodNode
Author

KnockOnWoodNode commented Nov 3, 2023

This was after removing the "connecting to many more peers" part, but keeping numgraphsyncpeers=200 which meant having 51 active sync peers. So this also happens with more active sync peers.

You don't need to sync gossip from 200 sync peers; it won't help much, and we use fewer peers to reduce the amount of actual gossip bandwidth. It's unclear what your actual issue is: peers may disconnect for various reasons, so that doesn't appear to be abnormal.

If you have 200 gossip sync peers and they're all sending you data, but processing is slow for whatever reason, then the processing queue may fill up, they block on writing, and eventually disconnect.

Looking at the logs, I don't see anything out of the ordinary, but you may want to move PEER and DISC to info so the logs have a bit less spam.

The 200 active sync peers, or even the 47 I've been on for a while, didn't look like a determining factor in the issue, because after I went back to the default 3 and stopped increasing the number of peers, I encountered the same problem after a couple of days. Since I have a 12-core Xeon I don't think processing load is a factor either, but I took the plunge into 0.17 and will be able to report on that now.

@Roasbeef
Member

Roasbeef commented Nov 3, 2023

Processing load may not be a factor, but I/O may very well be.

But can you describe the problem you perceive more precisely? That you have 300 peers, most of which you don't have channels open with, and they disconnect from time to time?

That's totally normal: connections aren't persistent unless you have a channel with them. If you weren't able to keep connections to peers you had a channel with, then that would actually be something to look into. Your channel peers are the only peers that really matter.

Having more connections to other peers you don't have channels with doesn't necessarily help your node, and doesn't help you route any more effectively.

@KnockOnWoodNode
Author

Processing load may not be a factor, but I/O may very well be.

But can you describe the problem you perceive more precisely? That you have 300 peers, most of which you don't have channels open with, and they disconnect from time to time?

That's totally normal: connections aren't persistent unless you have a channel with them. If you weren't able to keep connections to peers you had a channel with, then that would actually be something to look into. Your channel peers are the only peers that really matter.

Having more connections to other peers you don't have channels with doesn't necessarily help your node, and doesn't help you route any more effectively.

The core of the issue is described in the first paragraph of the OP (I am on 0.17 now and still have to reproduce it; on 0.16.4 it was like clockwork that it would happen after some time).
Several channel peers (leaving aside non-channel peers, which it is normal to see disconnect after a while) would get randomly disconnected and reconnect after a few minutes with the error in the OP title, and this "signaled" that the issue had started, as all forwarding activity on my node stopped from the first disconnections until I restarted lnd.
One of my channel peers who tracks these stats reported a sudden high spike of pending outgoing HTLCs towards me that slowly receded, all the while no HTLCs were actually getting through to me. This happened every 2 days in the last week, and at an interval of about 1-2 weeks before then.
During these last days I have also had another event repeat itself twice, which I can't back up with logs, where I had over 100 pending incoming HTLCs (and just a handful of outgoing ones) which wouldn't resolve until I restarted lnd.

In the meanwhile I have upgraded and have been running 0.17 for 2.5 days now with no issue, and I just restarted after increasing the number of active sync peers, to test whether the new version still has the same issue.

@saubyk
Collaborator

saubyk commented Nov 7, 2023

Hi @KnockOnWoodNode is this still an issue after upgrading to 0.17?

@KnockOnWoodNode
Author

Hi @KnockOnWoodNode is this still an issue after upgrading to 0.17?

Looks like it is not, so far; I've been running with many active sync peers and no funny business. I wouldn't say it's completely behind us yet, but it's looking good.

@saubyk
Collaborator

saubyk commented Nov 8, 2023

Thanks for reporting back @KnockOnWoodNode, closing this issue now. Please reopen if you observe the issue again and we can investigate.
