
connections die sometimes (and don't recover) #22

Closed
mixmix opened this issue Mar 18, 2019 · 27 comments

@mixmix
Member

mixmix commented Mar 18, 2019

setup: 2 Patchbay-based peers on the same network. One long-standing, one "new" (no log), with connections locked down to be local-only:

const Config = require('ssb-config/inject')

// only allow local-network gossip: no friend or global scheduling, no seeds
const config = Config('ssb', {
  gossip: {
    local: true,
    friends: false,
    global: false,
    seed: false
  }
})

// restrict the net transport to device/local scope in both directions
config.connections.outgoing.net[0].scope = ['device', 'local']
config.connections.incoming.net[0].scope = ['device', 'local']

what we see: beautifully fast replication (say 4+MB/s), and then at some point (say after 100 or 400 MB) something about the connection with the other local peer goes wrong and the log stops getting data. sbot status polls show that the peer trying to pull fresh data keeps trying to reconnect (the scheduler is doing its job, and this is the only legal connection), but it never connects again. If I close the "new" peer and start it again, it picks up where it left off fine, but it might get stuck again.
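A minimal sketch of the kind of status polling described above (assuming ssb-client and the standard sbot.status method; the 2-second interval, the sync.since field name, and the MB/s arithmetic are illustrative and may need adjusting for your versions):

// poll sbot.status() every 2 seconds and print how fast the local log grows
const Client = require('ssb-client')

Client((err, sbot) => {
  if (err) throw err
  let last = null
  setInterval(() => {
    sbot.status((err, status) => {
      if (err) return console.error(err)
      const size = status.sync.since // byte offset of the local flume log (field name may vary by version)
      if (last !== null) {
        console.log(`log at ${size} bytes, ~${((size - last) / (2 * 1024 * 1024)).toFixed(2)} MB/s`)
      }
      last = size
    })
  }, 2000)
})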

guess: it seems like it's getting itself into some polluted state, but I don't know what the problem is.

meanwhile ssb-replicate just works, but it's dramatically slower (1MB/s) 😢


@mmckegg and @pietgeursen have reported the same things in their recent work.

@mixmix
Member Author

mixmix commented Mar 18, 2019

If this bug got resolved, it would take the data-grabbing part of initialSync from 10 minutes down to 2.5 minutes!

The indexing is just slow at the moment, but that's probably all going to change with some of that new Rust work.

@dominictarr
Contributor

my hunch is that it's queuing up too much, blocking the CPU, and then timing out?

beautifully fast replication (say 4+MB/s)

are you measuring this at the send side or the receive side?

@christianbundy
Contributor

@mixmix

Thanks for the awesome bug report. I got a bit mixed up between the two peers while reading the "what we see" bit; can you verify that this is correct?

  • Alice has no data
  • Bob has some data
  • Alice sees beautifully fast download (say 4+MB/s)
  • at some point (say after 100 or 400 MB) something about the connection with Bob goes wrong
    • Alice stops receiving data
    • sbot status polls show that Alice continues to try to reconnect
      • Alice never connects to Bob again
  • If I close Alice and start it again it picks up where it left off fine, but might get stuck again.

Names [obviously] don't matter, but I want to make sure I understand the setup correctly. Also, are they both configured the same (e.g. does Bob also try to connect to Alice) or do they have designated roles as uploader and downloader?

@christianbundy
Contributor

Last note: is there any test code or steps that you could recommend to reproduce? I've only got one machine but I'd really like to get this fixed ASAP.

@arj03
Member

arj03 commented Apr 29, 2019

I wonder if #15 could help with this problem.

@christianbundy
Contributor

@pietgeursen What do you think?

@pietgeursen

pietgeursen commented Apr 29, 2019 via email

@dominictarr
Contributor

I would still like to know what was meant by "4 MB": how was it being measured? Was it detected at the receive side or the send side?

#22 (comment)

@dominictarr
Contributor

hey everyone, I'd quite like to fix this, but I don't have enough information to know how to reproduce it. Do I need two computers? @mixmix

@mmckegg
Contributor

mmckegg commented Jun 9, 2019

In my experience you need two computers AND an internet connection between them (LAN doesn’t seem to trigger it).

@pietgeursen and I wrote up detailed reproduction steps a few months back, but I'm not sure what the link is. Maybe Piet has it?

Something like:

  • connect to remote peer
  • follow them
  • watch the data flow.... until it stalls!
  • reconnect and the data starts to flow again

We found that this could be reproduced by connecting to any large pub. We were also able to reproduce it with a tunneled connection over the internet between our two sbots.

One way to make this work with only one computer might be to tunnel a port via a remote machine using ssh and then run two sbots on the same machine communicating over this port.

The key takeaway is that the transport seems to be important. On my internet, this happens almost every time I try to do a large sync with a remote peer.
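A rough sketch of forcing that first "connect to remote peer" step from code, for anyone trying to reproduce (assumes ssb-client and the gossip plugin's connect method; the multiserver address below is a placeholder, not a real pub):

// connect one sbot to a single large remote peer, then watch whether the
// log keeps growing (e.g. with the status-polling sketch earlier in the thread)
const Client = require('ssb-client')

// placeholder address: net:<host>:<port>~shs:<base64 public key, no '@' or '.ed25519'>
const address = 'net:pub.example.com:8008~shs:BASE64_PUBLIC_KEY'

Client((err, sbot) => {
  if (err) throw err
  sbot.gossip.connect(address, (err) => {
    if (err) return console.error('connect failed:', err)
    console.log('connected to', address)
  })
})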

@dominictarr
Contributor

@mmckegg thanks! yeah, it smells like it's actually related to something closer to networking; maybe it's time for muxrpc to finally get backpressure...

@dominictarr
Contributor

debug idea: ssb-replicate works but is slow; EBT goes fast but then stalls. What happens if you slow EBT down to ssb-replicate speed?
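One way to try that experiment (a sketch only, not how ssb-ebt is actually wired up): wrap whatever pull-stream source you are testing in an asyncMap that delays each message:

// delay every item in a pull-stream source by a few milliseconds
const pull = require('pull-stream')

function throttle (source, delayMs) {
  return pull(
    source,
    pull.asyncMap((msg, cb) => {
      setTimeout(() => cb(null, msg), delayMs)
    })
  )
}

// e.g. ~5ms per message caps throughput around 200 msgs/sec
// const slowSource = throttle(fastSource, 5)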

@mixmix
Member Author

mixmix commented Jun 12, 2019

I could reproduce this by trying to do an initial sync.
With Patchbay you can easily create new identities, and if there's a peer on the same network it should be hella fast to follow them and get all their stuff. AND IT IS... except it crashes out and doesn't pick back up sometimes.

You might be able to do it with 2 peers on the same machine... but it's easy to get this happening if there's a friend in the room. It's easy to spot when it happens with the onboarding I've partly built for ssb-ahoy, but you can see it with watch ssb-server status too - just watch for the database to stop growing when you know it should go higher.

I saw it stopping at seemingly random points:

  • gets 45MB in
  • gets 201MB in etc

@dominictarr
Contributor

update: I tried replicating with myself, two sbots running on one computer. It went really slow, ~200 messages per second (which I measured by adding a setInterval and console.log into muxrpc/pull-weird). Then I tried replicating with my pub, and I saw 2000+ messages per second. I also saw it disconnect and reconnect, but I haven't seen it stall yet.
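The counting itself can be done with a pull-stream through; a sketch of that kind of setInterval counter (not the exact patch to muxrpc/pull-weird; the label is illustrative):

// a pull-stream 'through' that counts items and logs a per-second rate
const pull = require('pull-stream')

function countPerSecond (label) {
  let count = 0
  setInterval(() => {
    console.log(label, count, 'msgs/sec')
    count = 0
  }, 1000)
  return pull.through(() => count++)
}

// usage: pull(source, countPerSecond('ebt'), sink)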

@dominictarr
Contributor

I still do not understand how you were measuring "4 MB per second". Are you just looking at how fast the log was growing?

@dominictarr
Contributor

currently I'm testing this from a script that loads ssb plugins directly... could the frontends be causing this?

@dominictarr
Contributor

I am measuring the message receive rate with this command:

 tail -f $TEMP_SSB_DIR/flume/log.offset | pv > /dev/null

pv is "pipe view", a very handy command.

@dominictarr
Contributor

I'm only seeing 1-2 MB a second that way; could be my slow computer though.

@dominictarr
Contributor

with my script and minimal indexes, I replicated the entire log (563 MB) from following myself with hops: 2, in 6 minutes.

@dominictarr
Contributor

I noticed my connections were going down and coming back up, and eventually realized that my server was crashing because of ssb-viewer (something in the markdown parser)... now that the connection lives longer, maybe it will stall.

@dominictarr
Contributor

hmm, it still seems to be disconnecting and reconnecting...

@dominictarr
Contributor

I set timers.inactivity = -1 and I saw it stall. It was still connected, but the EBT stream ended... trying to figure out why that would happen...
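For anyone following along, that timer comes from the config; a sketch of setting it for a test run (the value is the whole point here, the rest is the usual ssb-config boilerplate):

// disable the inactivity timeout, so a quiet-but-still-connected stream
// is not hidden by an automatic disconnect and reconnect
const Config = require('ssb-config/inject')
const config = Config('ssb', {
  timers: {
    inactivity: -1 // the ssb-config default ends up as 600e3, i.e. ten minutes
  }
})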

@dominictarr
Contributor

my first hunch is it ends after an invalid message; I vaguely remember something about that...

@dominictarr
Contributor

💥 I figured it out!

there was a mistake here: https://github.com/dominictarr/epidemic-broadcast-trees/blob/master/events.js#L448
If you are connected to a peer and anyone blocks them, you mark them as blocked, which causes the replication stream to end: https://github.com/dominictarr/epidemic-broadcast-trees/blob/master/stream.js#L96-L97. But ending the stream does not disconnect the underlying connection. If you have config.timers.inactivity set to something small, it will disconnect because the stream goes quiet, but with the default behaviour (https://github.com/ssbc/secret-stack/blob/master/core.js#L93), if you are running ssb-config, it sets config.timers so that inactivity is 600e3, ten minutes.

I wasn't seeing it at first because I wasn't using the config module in my script, so inactivity was 5 seconds, which I didn't notice.

The fix is simple; the line in events.js should be:

if(state.id == ev.id && state.peers[ev.target]) {

i.e. only end the stream if we are the one doing the blocking.

@dominictarr
Contributor

this should be fixed in [email protected]; please confirm @mixmix @christianbundy @mmckegg

@mixmix
Member Author

mixmix commented Jun 14, 2019

this is great news, looking forward to testing this @dominictarr !

There were 2 ways I was roughly testing speed in the past, both watching the log size grow (see the sketch below):

  • calling sbot.status every 2 seconds from the client
  • running $ watch -d ls -lah ~/.ssb/flume/log.offset (similar, but doesn't bug the sbot)

The "speed" was something I did in my head, looking at how many MB it was jumping (roughly) on each poll. There was a 4x difference between classic replication and EBT.

Notes on those stats:

  • between 2 computers on same wifi
  • both running SSDs
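A Node version of the second measurement approach, for anyone without watch handy (a sketch; it assumes the default ~/.ssb/flume/log.offset path and a 2-second poll):

// watch the flume log file grow without calling into the sbot
const fs = require('fs')
const path = require('path')
const os = require('os')

const logPath = path.join(os.homedir(), '.ssb', 'flume', 'log.offset')

let last = fs.statSync(logPath).size
setInterval(() => {
  const size = fs.statSync(logPath).size
  const rate = (size - last) / (2 * 1024 * 1024)
  console.log(`${(size / 1024 / 1024).toFixed(1)} MB total, ~${rate.toFixed(2)} MB/s`)
  last = size
}, 2000)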

christianbundy added a commit to ssbc/patchwork that referenced this issue Jun 16, 2019
This was previously disabled because of a bug that killed replication,
but once that bug is resolved this should be good to merge.

See: ssbc/ssb-ebt#22
arj03 closed this as completed Feb 12, 2020