geth not connecting to bootnodes on private test network #1685

Closed
ckeenan opened this issue Aug 19, 2015 · 19 comments

Comments

@ckeenan

ckeenan commented Aug 19, 2015

Hi, I have noticed that for geth on a private net, setting bootnodes works fine at first, but eventually (a few hours later) setting bootnodes does nothing. Peers added manually can sync up fine, though.

I see the same issue with bootnodes running on both AWS and DO instances. Perhaps it's a byproduct of a network that is too small?

@radster360

We are running our private net on GCE and I have similar issues too. I am running geth 1.0.1. My private net was running fine, but I just checked: my mining node is at a different block and my other non-mining node has stopped syncing. It ran for a few days without issue.

After killing the nodes and restarting them, block synchronization resumed and now both nodes are synced up.

@fjl
Contributor

fjl commented Aug 19, 2015

@ckeenan Please explain what you mean by "setting bootnodes". It is important to note that geth does not maintain a persistent connection to the bootstrap nodes. They are only used as the first point of contact for the node discovery protocol.
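
A rough way to check whether discovery has actually produced connections (a sketch, assuming a console is attached to the running node and the IPC endpoint is in its default location; <datadir> is a placeholder):

geth attach ipc:<datadir>/geth.ipc
> net.peerCount
> admin.peers

net.peerCount and admin.peers report whatever connections currently exist, regardless of whether they came via the bootnode, discovery, or admin.addPeer.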

@ckeenan
Author

ckeenan commented Aug 19, 2015

@fjl geth --bootnodes="enode://..." .... Discovery indeed works fine when I occasionally shut down and restart that client (to test the problem) during the first hour or so with the bootnodes parameter set. Eventually a shutdown and startup no longer connects to the bootnode, even though the parameter is set. Any ideas?

@obscuren obscuren added the * label Sep 23, 2015
@aakilfernandes

I'm having this same problem until I add the peer via admin.addPeer. Here's what I'm passing to geth:

geth --port 30304 --rpc --rpcaddr 127.0.0.1 --rpcport 8101 --rpccorsdomain http://127.0.0.1:8000 --genesis config/development/genesis.json --datadir datadir/development/ --networkid 5473 --bootnodes="enode://3ede10cd6a5e382db43804f3267c5a6eab6021e245b6eb28d1d9b11720638490e876ce5e61bd76992befa89935c5a34f139df7dda82b00d5ea09da6f96f77839@40.117.36.75:30303" --unlock 42c35b6c220d570bd8898a540406e0b026479f7b --password config/development/password console

I'm able to work around it with

admin.addPeer('enode://3ede10cd6a5e382db43804f3267c5a6eab6021e245b6eb28d1d9b11720638490e876ce5e61bd76992befa89935c5a34f139df7dda82b00d5ea09da6f96f77839@40.117.36.75:30303')
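
For what it's worth, admin.addPeer does not survive a restart; the persistent equivalent is a static-nodes.json file in the data directory (a sketch, reusing the datadir from the command above; depending on the geth version the file lives in the datadir root or in its geth/ subdirectory):

datadir/development/static-nodes.json:
[
  "enode://3ede10cd6a5e382db43804f3267c5a6eab6021e245b6eb28d1d9b11720638490e876ce5e61bd76992befa89935c5a34f139df7dda82b00d5ea09da6f96f77839@40.117.36.75:30303"
]

geth redials static nodes on its own if the connection drops.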

@aakilfernandes

Geth
Version: 1.3.3
Protocol Versions: [63 62 61]
Network Id: 1
Go Version: go1.5.3
OS: darwin
GOPATH=
GOROOT=/usr/local/Cellar/go/1.5.3/libexec

@ckeenan
Author

ckeenan commented Feb 18, 2016

@aakilfernandes the clock on one of the nodes could be out of sync. Try sudo ntpdate -s time.nist.gov, or see the Frontier Guide, "Connecting to the Network: Common Problems With Connectivity".
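
A small sketch for checking the skew before touching the clock (assuming ntpdate is installed and time.nist.gov is reachable):

ntpdate -q time.nist.gov         # query only: prints the offset without adjusting the clock
sudo ntpdate -s time.nist.gov    # steps the clock, as suggested above

Discovery packets carry an expiration timestamp, so a sufficiently skewed clock causes them to be silently dropped.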

@aakilfernandes

@ckeenan that was it. Thanks!

@aakilfernandes

Actually @ckeenan (and all interested), it looks like the problem is back, unfortunately. I think the issue is that bootnodes aren't automatically used as peers. When the problem went away, there was a 3rd node in the mix. So I think what's happening is that geth asks the bootnode for peers; if the bootnode doesn't know of any peers, geth doesn't connect. However, geth should use the bootnode itself as a peer if available.

@yxliang01

Just a random thought: does it do this to avoid overloading the bootnode (which is potentially an important, well-known public node)?

@immartian

I noticed some details after setting verbosity=6: a bootnode isn't tied to one specific private network; it serves all networks, including the testnet and the main chain. So eventually private nodes won't be able to find each other via this bootnode.

@karalabe
Member

Geth 1.6 has a feature that if the node cannot find any good peers for 30 seconds, it will try to connect to the bootnode itself. This should help short term. Long term we need to enable discovery v5 on the eth protocol for test/private networks, which should solve the issue properly.
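
For anyone who wants to experiment with the "long term" part already: geth builds of that era expose v5 discovery behind an experimental flag (a sketch, assuming the flag is still called --v5disc in your version; the enode URL is a placeholder):

geth --networkid 5473 --bootnodes "enode://<node-id>@<ip>:30303" --v5disc console

Whether the eth protocol actually selects peers through v5 depends on the geth version, so treat this as experimental rather than a guaranteed fix.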

@yxliang01

@karalabe Well, the problem was that it just doesn't connect even after 30 seconds.
@immartian Well, are you talking about the official bootnodes? I have this problem with my own nodes as bootnodes.

@galladivya

I can't connect to any peer using the --bootnodes flag, and I also used a static-nodes.json file. No use; it still returns 0 when I check net.peerCount.
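
In case the file is simply not being read: static-nodes.json has to be valid JSON, an array of enode URLs, in the location geth expects (a sketch of the shape, assuming a geth version that reads it from the geth/ subdirectory of the datadir; the peer addresses are made up):

<datadir>/geth/static-nodes.json:
[
  "enode://<node-id-of-peer-1>@10.0.0.11:30303",
  "enode://<node-id-of-peer-2>@10.0.0.12:30303"
]

The IP and port must be the ones each peer actually listens on; net.peerCount staying at 0 often just means the enode IDs or addresses in the file are stale.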

@pablochacin

pablochacin commented Jan 27, 2018

I'm having the same problem, except that the first peer could actually connect to the boot node (both see each other as peers). I'm testing using Docker, but I don't think this adds any additional problem to the mix.

Can someone please confirm whether this is an issue or a "feature" (that is, a well-known issue that for some reason is the expected behavior)? Thanks in advance.

@karalabe
Member

@pablochacin Please describe your exact setup. We don't see any issues on either mainnet or testnets (which are for all intents and purposes "private" networks).

@pablochacin

@karalabe Never mind. I left it running for some time (> 30 min) and eventually the peers found each other. Now I need to figure out how to be sure this has happened before I test anything (like whisper messaging). Any suggestion here would be appreciated.

As a side note, to my surprise, I found an external node (with a public IP) also listed as a peer. How could this happen? I'm using network id 15.

172.17.0.2:50294 <- docker container (bootnode)
178.238.233.123:30303 <- external node
172.17.0.4:59714 <- docker container (peer)
172.17.0.5:30303 <- docker container (peer)

@pablochacin

pablochacin commented Jan 27, 2018

--nodiscover
Use this to make sure that your node is not discoverable by people who do not manually add you. Otherwise, there is a chance that your node may be inadvertently added to a stranger's node if they have the same genesis file and network id

Looks like adding nodes manually is the best way to go, both to ensure they are added deterministically and to prevent the issue of spooky nodes.
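
A sketch of what the manual-only setup looks like in practice, with made-up datadirs and ports, for two nodes that should only ever see each other:

geth --datadir nodeA --networkid 5473 --port 30303 --nodiscover console
geth --datadir nodeB --networkid 5473 --port 30304 --nodiscover console

Then, in node B's console:

admin.addPeer("enode://<node-A-id>@<node-A-ip>:30303")

With --nodiscover the only peers are those added via admin.addPeer or static-nodes.json, which also avoids the stranger-node problem quoted above.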

@zscole
Contributor

zscole commented Mar 28, 2018

@pablochacin Unless you add the --nodiscover flag when you launch your geth instance, any other node using the same genesis file and network ID can peer with the nodes on your private testnet. Since 15 isn't a large, random number, I'd say that there are some folks out there that are probably using that network ID and likely have the same settings configured in their custom genesis file.

Edit: Sorry, didn't see your last comment, but yeah. XD
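
The practical takeaway is to pick a large random network ID and, for geth versions whose genesis file takes a config block, a matching chainId (a sketch with made-up numbers):

genesis.json: { "config": { "chainId": 48291, ... }, ... }

geth --datadir datadir/development init genesis.json
geth --datadir datadir/development --networkid 48291 --nodiscover console

networkid and chainId don't strictly have to match, but keeping them equal avoids confusion, and a random value of five or more digits makes an accidental collision with a stranger's network unlikely.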

tony-ricciardi pushed a commit to tony-ricciardi/go-ethereum that referenced this issue Jan 20, 2022
* Fix deadlock during StartValidating

StartValidating makes a call to RefreshValPeers while holding
coreMu and RefreshValPeers waits for all validator peers to be deleted
and then reconnects to known validators.

If any of those peers has called IsValidating before RefreshValPeers
tries to delete them, the system gets stuck in a deadlock because
IsValidating also tries to acquire coreMu. The peer will never acquire
coreMu because it is held by StartValidating, and StartValidating will
never return because it is waiting for all peers to disconnect.

This commit makes coreStarted into an atomic variable so that peers can
make threadsafe calls to IsValidating without needing to acquire
coreMu.

* Fix long wait for nodes to connect

At test startup sometimes nodes were taking in the region of 30s to connect
whilst other times it was happening in μs. The problem was we were
trying to connect all peers to all other peers. That meant that for any
two peers they would both dial each other. Sometimes if this occurred
close enough in time both sides would hang up the connections (I call
this cross dialing). This happens because each side counts their
outgoing connection as connected and then when the incoming connection
arrives they drop it because they see themselves as already connected.
When this happened nodes would retry after some time probably 30s and
then be connected.

The fix was to ensure that for any two nodes only one of them dials the
other.

@hickscorp

This could be related to #23210

What I noticed is that specifying BootstrapNodes, StaticNodes and/or TrustedNodes is a potential issue in private settings, yes.
If we specify, say, 3 enode addresses in any or all of these parameters, geth will fail to start completely rather than retrying later.
It's really something that should IMO be addressed, because it makes it practically impossible to have a self-healing network: if any of the specified nodes is down, then none of the other nodes will be able to restart if they also go down. They just crash on startup.
In that other GitHub issue I suggested making this a graceful warning rather than a fatal error, but have had zero response.
