-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connections die somtimes (and don't recover) #22
Comments
If this bug got resolved, this changes the data grabbing part of initialSync from 10 minutes down to 2.5 minutes! The indexing is just slow at the moment, but that's probably all going to change with some of that new rust work |
my hunch is that it's queuing up too much, and blocking the cpu and then timing out?
are you measuring this at the send side or the receive side? |
Thanks for the awesome bug report. I got a bit mixed up between the two peers while reading the "what we see" bit, can you verify that this is correct?
Names [obviously] don't matter, but I want to make sure I understand the setup correctly. Also, are they both configured the same (e.g. does Bob also try to connect to Alice) or do they have designated roles as uploader and downloader? |
Last note: is there any test code or steps that you could recommend to reproduce? I've only got one machine but I'd really like to get this fixed ASAP. |
I wonder if #15 could help with this problem. |
@pietgeursen What do you think? |
Matt and I spent a lot of time debugging this a few months ago in
patchwork. We had the code from #15 running and it didn't seem to change
things.
What we saw:
- legacy replication connections never dropped out
With ebt:
- doing a gossip.reconnect was enough to get it going again
- we didn't see the problem when doing an initial sync on a lan. We did see
it from a pub.
- tested with ssb.mikey.nz and ssb.organicdesign.pub
- we did see it with sbot 13 & 14
…On Tue, Apr 30, 2019 at 6:59 AM Christian Bundy ***@***.***> wrote:
@pietgeursen <https://github.com/pietgeursen> What do you think?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#22 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ACQRI6UZVNCGVVJF4UCI4SLPS5AP5ANCNFSM4G7DUKKQ>
.
|
I would still like to know what was meant by "4 mb" how was it being measured? was it detected at the receive side or the send side? |
hey everyone I'd quite like to fix this, but I don't have enough information to know how to reproduce this. do I need two computers? @mixmix |
1 similar comment
hey everyone I'd quite like to fix this, but I don't have enough information to know how to reproduce this. do I need two computers? @mixmix |
In my experience you need two computers AND an internet connection between them (LAN doesn’t seem to trigger it). Me and @pietgeursen wrote up detailed reproduction steps a few months back, but not sure what the link is. Maybe Piet has it? Something like:
We found that this could be reproduced by connecting to any large pub. We were also able to reproduce it with a tunneled connection over the internet between our two sbots. One way to make this work with only one computer might be to tunnel a port via a remote machine using ssh and then run two sbots on the same machine communicating over this port. The key takeaway is that the transport seems to be important. On my internet, this happens almost every time I try to do a large sync with a remote peer. |
@mmckegg thanks! yeah it smells like it's actually related to something closer to networking, maybe it's time for muxrpc to finally get back pressure... |
debug idea: ssb-replicate works, but is slow, ebt goes fast, but then stalls. what happens if you slow ebt down to ssb-replicate speed? |
I could reproduce this by trying to do initial sync. You might be able to do it with 2 peers on the same machine ... but it's easy if there's a friend in the room to get this happening. It's easy to spot when it happens with the onboarding I've partly built for ssb-ahoy but you can see it with I saw it stopping at seemingly random points:
|
update: I tried replicating with my self, two sbots running on one computer. it went really slow, ~200 messages per second. (which I measured by adding a setInterval and console.log into muxrpc/pull-weird) then I tried replicating with my pub, and I saw 2000+ messages per second. I also saw it disconnect and reconnect, but I didn't see it stall yet. |
I still do not understand how you where measuring "4 mb per second" are you just looking at how fast the log was growing? |
currently I'm testing this from a script that loads ssb plugins directly... could the frontends be causing this? |
I am measuring the message receive rate with this command:
|
I'm only seeing 1-2 mb a second that way, could be my slow computer though. |
with my script and minimal indexes, I replicated the entire log, 563 mb of log, from following my self with |
I noticed my connections where going down and coming back up, eventually realized that my server was crashing because of ssb-viewer (something in the markdown parser)... now that the connection lives longer, maybe it will stall |
hmm, it still seems to be disconnecting and reconnecting... |
I set |
my first hunch is it ends after an invalid message, vaguely remember something about that... |
💥 I figured it out! there was a mistake here: https://github.com/dominictarr/epidemic-broadcast-trees/blob/master/events.js#L448 I wasn't seeing at first, because I wasn't using the config module in my script, so inactivitiy was 5 seconds, which I didn't notice. The fix is simple, the line in events.js should be:
i.e. only end the stream if we block them. |
this should be fixed in [email protected], please confirm @mixmix @christianbundy @mmckegg |
this is great news, looking forward to testing this @dominictarr ! There might have been 2 ways I was roughly testing speed in the past. Watching the log size grow by
Notes on those stats:
|
This was previously disabled because of a bug that killed replication, but once that bug is resolved this should be good to merge. See: ssbc/ssb-ebt#22
setup : 2 patchbay based peers on the same network. One long standing, one "new" (no log), with connections locked down to be local-only:
what we see : beautifully fast replication (say 4+MB/s), and then at some point (say after 100 or 400 MB) something about the connection with the other local peer goes wrong, and the log stops getting data.
sbot status
polls show that the peer trying to pull fresh data continues to try to reconnect (as the scheduler is doing its job, and this is the only legal connection), but it never connects again. If I close the "new" peer and start it again it picks up where it left off fine, but might get stuck again.guess : It seems like it's getting itself into some polluted state, but I don't know what the problem is.
meanwhile
ssb-replicate
just works, but it's dramatically slower (1MB/s) 😢@mmckegg and @pietgeursen have reported the same things in their recent work.
The text was updated successfully, but these errors were encountered: