
connections die sometimes (and don't recover) #22

Closed
mixmix opened this issue Mar 18, 2019 · 27 comments

@mixmix
Member

mixmix commented Mar 18, 2019

setup: 2 Patchbay-based peers on the same network. One long-standing, one "new" (no log), with connections locked down to be local-only:

const Config = require('ssb-config/inject')

// only allow local-network gossip: no friend or global scheduling, no seeds
const config = Config('ssb', {
  gossip: {
    local: true,
    friends: false,
    global: false,
    seed: false
  }
})

// restrict the net transport to device/local scope in both directions
config.connections.outgoing.net[0].scope = ['device', 'local']
config.connections.incoming.net[0].scope = ['device', 'local']

what we see: beautifully fast replication (say 4+MB/s), and then at some point (say after 100 or 400 MB) something about the connection with the other local peer goes wrong and the log stops getting data. sbot status polls show that the peer trying to pull fresh data keeps trying to reconnect (the scheduler is doing its job, and this is the only legal connection), but it never connects again. If I close the "new" peer and start it again, it picks up where it left off fine, but it might get stuck again.
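A minimal sketch of the kind of status polling described above (assuming ssb-client and the standard sbot.status method; the 2-second interval, the sync.since field name, and the MB/s arithmetic are illustrative and may need adjusting for your versions):

// poll sbot.status() every 2 seconds and print how fast the local log grows
const Client = require('ssb-client')

Client((err, sbot) => {
  if (err) throw err
  let last = null
  setInterval(() => {
    sbot.status((err, status) => {
      if (err) return console.error(err)
      const size = status.sync.since // byte offset of the local flume log (field name may vary by version)
      if (last !== null) {
        console.log(`log at ${size} bytes, ~${((size - last) / (2 * 1024 * 1024)).toFixed(2)} MB/s`)
      }
      last = size
    })
  }, 2000)
})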

guess: it seems like it's getting itself into some polluted state, but I don't know what the problem is.

meanwhile ssb-replicate just works, but it's dramatically slower (1MB/s) 😢


@mmckegg and @pietgeursen have reported the same things in their recent work.

@mixmix
Member Author

mixmix commented Mar 18, 2019

If this bug got resolved, it would take the data-grabbing part of initialSync from 10 minutes down to 2.5 minutes!

The indexing is just slow at the moment, but that's probably all going to change with some of that new Rust work.

@dominictarr
Contributor

my hunch is that it's queuing up too much, blocking the CPU, and then timing out?

beautifully fast replication (say 4+MB/s)

are you measuring this at the send side or the receive side?

@christianbundy
Contributor

@mixmix

Thanks for the awesome bug report. I got a bit mixed up between the two peers while reading the "what we see" bit; can you verify that this is correct?

  • Alice has no data
  • Bob has some data
  • Alice sees beautifully fast download (say 4+MB/s)
  • at some point (say after 100 or 400 MB) something about the connection with Bob goes wrong
    • Alice stops receiving data
    • sbot status polls show that Alice continues to try to reconnect
      • Alice never connects to Bob again
  • If I close Alice and start it again it picks up where it left off fine, but might get stuck again.

Names [obviously] don't matter, but I want to make sure I understand the setup correctly. Also, are they both configured the same (e.g. does Bob also try to connect to Alice) or do they have designated roles as uploader and downloader?

@christianbundy
Contributor

Last note: is there any test code or steps that you could recommend to reproduce? I've only got one machine but I'd really like to get this fixed ASAP.

@arj03
Member

arj03 commented Apr 29, 2019

I wonder if #15 could help with this problem.

@christianbundy
Contributor

@pietgeursen What do you think?

@pietgeursen

pietgeursen commented Apr 29, 2019 via email

@dominictarr
Contributor

I would still like to know what was meant by "4 MB": how was it being measured? Was it detected at the receive side or the send side?

#22 (comment)

@dominictarr
Contributor

hey everyone, I'd quite like to fix this, but I don't have enough information to know how to reproduce it. Do I need two computers? @mixmix

@mmckegg
Contributor

mmckegg commented Jun 9, 2019

In my experience you need two computers AND an internet connection between them (LAN doesn’t seem to trigger it).

@pietgeursen and I wrote up detailed reproduction steps a few months back, but I'm not sure what the link is. Maybe Piet has it?

Something like:

  • connect to remote peer
  • follow them
  • watch the data flow.... until it stalls!
  • reconnect and the data starts to flow again

We found that this could be reproduced by connecting to any large pub. We were also able to reproduce it with a tunneled connection over the internet between our two sbots.

One way to make this work with only one computer might be to tunnel a port via a remote machine using ssh and then run two sbots on the same machine communicating over this port.

The key takeaway is that the transport seems to be important. On my internet, this happens almost every time I try to do a large sync with a remote peer.
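A rough sketch of forcing that first "connect to remote peer" step from code, for anyone trying to reproduce (assumes ssb-client and the gossip plugin's connect method; the multiserver address below is a placeholder, not a real pub):

// connect one sbot to a single large remote peer, then watch whether the
// log keeps growing (e.g. with the status-polling sketch earlier in the thread)
const Client = require('ssb-client')

// placeholder address: net:<host>:<port>~shs:<base64 public key, no '@' or '.ed25519'>
const address = 'net:pub.example.com:8008~shs:BASE64_PUBLIC_KEY'

Client((err, sbot) => {
  if (err) throw err
  sbot.gossip.connect(address, (err) => {
    if (err) return console.error('connect failed:', err)
    console.log('connected to', address)
  })
})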

@dominictarr
Contributor

@mmckegg thanks! yeah, it smells like it's actually related to something closer to networking; maybe it's time for muxrpc to finally get backpressure...

@dominictarr
Contributor

debug idea: ssb-replicate works but is slow; EBT goes fast but then stalls. What happens if you slow EBT down to ssb-replicate speed?
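One way to try that experiment (a sketch only, not how ssb-ebt is actually wired up): wrap whatever pull-stream source you are testing in an asyncMap that delays each message:

// delay every item in a pull-stream source by a few milliseconds
const pull = require('pull-stream')

function throttle (source, delayMs) {
  return pull(
    source,
    pull.asyncMap((msg, cb) => {
      setTimeout(() => cb(null, msg), delayMs)
    })
  )
}

// e.g. ~5ms per message caps throughput around 200 msgs/sec
// const slowSource = throttle(fastSource, 5)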

@mixmix
Member Author

mixmix commented Jun 12, 2019

I could reproduce this by trying to do an initial sync.
With Patchbay you can easily create new identities, and if there's a peer on the same network it should be hella fast to follow them and get all their stuff. AND IT IS... except it crashes out and doesn't pick back up sometimes.

You might be able to do it with 2 peers on the same machine... but it's easy to get this happening if there's a friend in the room. It's easy to spot when it happens with the onboarding I've partly built for ssb-ahoy, but you can see it with watch ssb-server status too - just watch for the database to stop growing when you know it should go higher.

I saw it stopping at seemingly random points:

  • gets 45MB in
  • gets 201MB in etc

@dominictarr
Contributor

update: I tried replicating with myself, two sbots running on one computer. It went really slow, ~200 messages per second (which I measured by adding a setInterval and console.log into muxrpc/pull-weird). Then I tried replicating with my pub, and I saw 2000+ messages per second. I also saw it disconnect and reconnect, but I haven't seen it stall yet.
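The counting itself can be done with a pull-stream through; a sketch of that kind of setInterval counter (not the exact patch to muxrpc/pull-weird; the label is illustrative):

// a pull-stream 'through' that counts items and logs a per-second rate
const pull = require('pull-stream')

function countPerSecond (label) {
  let count = 0
  setInterval(() => {
    console.log(label, count, 'msgs/sec')
    count = 0
  }, 1000)
  return pull.through(() => count++)
}

// usage: pull(source, countPerSecond('ebt'), sink)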

@dominictarr
Contributor

I still do not understand how you were measuring "4 MB per second". Are you just looking at how fast the log was growing?

@dominictarr
Contributor

currently I'm testing this from a script that loads ssb plugins directly... could the frontends be causing this?

@dominictarr
Contributor

I am measuring the message receive rate with this command:

 tail -f $TEMP_SSB_DIR/flume/log.offset | pv > /dev/null

pv is "pipe view", a very handy command.

@dominictarr
Contributor

I'm only seeing 1-2 MB a second that way; could be my slow computer though.

@dominictarr
Contributor

with my script and minimal indexes, I replicated the entire log (563 MB) from following myself with hops: 2, in 6 minutes.

@dominictarr
Contributor

I noticed my connections were going down and coming back up, and eventually realized that my server was crashing because of ssb-viewer (something in the markdown parser)... now that the connection lives longer, maybe it will stall.

@dominictarr
Contributor

hmm, it still seems to be disconnecting and reconnecting...

@dominictarr
Contributor

I set timers.inactivity = -1 and I saw it stall. It was still connected, but the EBT stream ended... trying to figure out why that would happen...
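For anyone following along, that timer comes from the config; a sketch of setting it for a test run (the value is the whole point here, the rest is the usual ssb-config boilerplate):

// disable the inactivity timeout, so a quiet-but-still-connected stream
// is not hidden by an automatic disconnect and reconnect
const Config = require('ssb-config/inject')
const config = Config('ssb', {
  timers: {
    inactivity: -1 // the ssb-config default ends up as 600e3, i.e. ten minutes
  }
})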

@dominictarr
Contributor

my first hunch is it ends after an invalid message; I vaguely remember something about that...

@dominictarr
Contributor

💥 I figured it out!

there was a mistake here: https://github.com/dominictarr/epidemic-broadcast-trees/blob/master/events.js#L448
If you are connected to a peer and anyone blocks them, you mark them as blocked, which causes the replication stream to end: https://github.com/dominictarr/epidemic-broadcast-trees/blob/master/stream.js#L96-L97. But ending the stream does not disconnect the underlying connection. If you have config.timers.inactivity set to something small, it will disconnect because the stream goes quiet, but with the default behaviour (https://github.com/ssbc/secret-stack/blob/master/core.js#L93), if you are running ssb-config, it sets config.timers so that inactivity is 600e3, ten minutes.

I wasn't seeing it at first because I wasn't using the config module in my script, so inactivity was 5 seconds, which I didn't notice.

The fix is simple; the line in events.js should be:

if(state.id == ev.id && state.peers[ev.target]) {

i.e. only end the stream if we are the one doing the blocking.

@dominictarr
Contributor

this should be fixed in [email protected]; please confirm @mixmix @christianbundy @mmckegg

@mixmix
Member Author

mixmix commented Jun 14, 2019

this is great news, looking forward to testing this @dominictarr !

There were 2 ways I was roughly testing speed in the past, both watching the log size grow (see the sketch below):

  • calling sbot.status every 2 seconds from the client
  • running $ watch -d ls -lah ~/.ssb/flume/log.offset (similar, but doesn't bug the sbot)

The "speed" was something I did in my head, looking at how many MB it was jumping (roughly) on each poll. There was a 4x difference between classic replication and EBT.

Notes on those stats:

  • between 2 computers on same wifi
  • both running SSDs
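A Node version of the second measurement approach, for anyone without watch handy (a sketch; it assumes the default ~/.ssb/flume/log.offset path and a 2-second poll):

// watch the flume log file grow without calling into the sbot
const fs = require('fs')
const path = require('path')
const os = require('os')

const logPath = path.join(os.homedir(), '.ssb', 'flume', 'log.offset')

let last = fs.statSync(logPath).size
setInterval(() => {
  const size = fs.statSync(logPath).size
  const rate = (size - last) / (2 * 1024 * 1024)
  console.log(`${(size / 1024 / 1024).toFixed(1)} MB total, ~${rate.toFixed(2)} MB/s`)
  last = size
}, 2000)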

christianbundy added a commit to ssbc/patchwork that referenced this issue Jun 16, 2019
This was previously disabled because of a bug that killed replication,
but once that bug is resolved this should be good to merge.

See: ssbc/ssb-ebt#22
arj03 closed this as completed Feb 12, 2020