Parsing errors leading to disconnections when WS compression is enabled #1783
Comments
Thanks for the report. Do you modify the default max payload?
Nope, it's using the default value; the messages are of varying sizes but not over 2Mb :)
OK, then the default would be using the 1Mb max payload.
Yes, sorry, 1Mb; I don't know why I had 2Mb in my mind :)
I haven't dug too deep yet, but could it be due to some partial flush? klauspost/compress#993 Edit: hmm... probably not, the deflate function uses the same lib and it seems to only do z_sync_flush.
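For anyone following along, here is a minimal sketch (not taken from the server or client code) of what a z_sync_flush looks like with Go's compress/flate, whose API klauspost/compress mirrors: the flush appends the 0x00 0x00 0xFF 0xFF trailer that permessage-deflate (RFC 7692) message boundaries rely on.

```go
package main

import (
	"bytes"
	"compress/flate" // klauspost/compress/flate exposes the same API
	"fmt"
)

func main() {
	var buf bytes.Buffer
	zw, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		panic(err)
	}

	zw.Write([]byte("hello"))
	// Flush is the equivalent of zlib's Z_SYNC_FLUSH: pending data is
	// flushed on a byte boundary and followed by an empty stored block,
	// i.e. the trailing 0x00 0x00 0xFF 0xFF bytes that permessage-deflate
	// expects each compressed message to end with.
	zw.Flush()

	out := buf.Bytes()
	fmt.Printf("last 4 bytes: % x\n", out[len(out)-4:])
}
```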
Oh, possibly worth noting: the messages are in protobuf format, hence they are binary. I don't know if that could be randomly tripping some control byte sequence somewhere.
UPDATE: It seems that the problem is related to the PING/PONG frames. After doing some gigantic network captures to try to catch some of the events, I noticed that the very last thing that comes in is a PONG before the client throws the parsing error and disconnects. I confirmed this by increasing the PingInterval in the clients from the default 2m to 1h, and the issues nearly disappeared (apart from a random pod having one disconnect once in a while after ~1h).
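For context, a minimal sketch of the client-side workaround described above, assuming a local ws:// endpoint; nats.PingInterval is the nats.go option used to raise the interval from the default 2 minutes.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Raise the client PING interval from the default 2 minutes to 1 hour
	// so that client-initiated PING/PONG exchanges become rare; this is
	// the change that made the disconnects nearly disappear.
	nc, err := nats.Connect("ws://localhost:8080",
		nats.PingInterval(1*time.Hour),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}
```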
Thanks @rsafonseca, to clarify: is the PONG coming from the server or from the client?
The PONG is coming from the server, and the PONG packet itself has the expected 6-byte payload. Side note, maybe related to the issue here: I noticed that the PING sent by the client has the compression bit set, and if I understand correctly it shouldn't. If it should, then the problem could be the fact that the PONG sent by the server doesn't.
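For reference, RFC 7692 forbids setting the per-message compressed bit (RSV1) on control frames. Below is an illustrative sketch (describeFrameByte is a hypothetical helper, not server or client code) of how the first header byte of a frame breaks down:

```go
package main

import "fmt"

// describeFrameByte decodes the first byte of a WebSocket frame header
// (RFC 6455 §5.2): the FIN flag, RSV1 (the permessage-deflate
// "compressed" bit from RFC 7692), and the 4-bit opcode. RFC 7692
// does not allow RSV1 on control frames such as PING and PONG.
func describeFrameByte(b byte) (fin, compressed bool, opcode byte) {
	fin = b&0x80 != 0
	compressed = b&0x40 != 0 // RSV1
	opcode = b & 0x0F
	return
}

func main() {
	// 0xC9: FIN set, RSV1 set, opcode 0x9 (PING) — an invalid combination
	// under permessage-deflate, matching the compressed-PING observation.
	fin, compressed, opcode := describeFrameByte(0xC9)
	fmt.Println(fin, compressed, opcode)
}
```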
Another inconsistency I noticed, which is probably irrelevant here, is that the client is masking its writes and the server is not. Looking at the nats-server code, I think only leafnodes are actually doing masking on their WS packets.
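For reference, RFC 6455 requires clients to mask every frame they send to the server and servers to send frames unmasked, so this asymmetry is expected on a plain client/server connection. A small illustrative sketch of the XOR masking transform:

```go
package main

import "fmt"

// maskPayload applies the RFC 6455 §5.3 client-to-server masking: each
// payload byte is XORed with key[i mod 4]. Applying it twice restores
// the original, so the same function also unmasks.
func maskPayload(key [4]byte, payload []byte) {
	for i := range payload {
		payload[i] ^= key[i%4]
	}
}

func main() {
	key := [4]byte{0x12, 0x34, 0x56, 0x78}
	data := []byte("PONG")
	maskPayload(key, data)
	fmt.Printf("masked:   % x\n", data)
	maskPayload(key, data)
	fmt.Printf("unmasked: %s\n", data)
}
```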
The WS opcode bits for the control messages also seem wrong, both from the server and from the client. All packets are using opcode 0x2 (binary message), whereas pings should use 0x9 and pongs should use 0xA. There's a bunch of code that relies on these opcodes that's likely not doing anything because of this. In summary, the issue is quite possibly that the PONG coming from the server has opcode 0x2 and doesn't have the compression bit set, since it's treated like a normal message instead of a control frame.
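For reference, a sketch of the RFC 6455 opcodes involved (isControl is a hypothetical helper, not code from either repo); a PONG sent with opcode 0x2 would bypass any handling that keys off the control-frame opcodes:

```go
package main

import "fmt"

// Opcodes defined in RFC 6455 §5.2.
const (
	opText   = 0x1
	opBinary = 0x2
	opClose  = 0x8
	opPing   = 0x9
	opPong   = 0xA
)

// isControl reports whether an opcode denotes a control frame; control
// opcodes occupy the range 0x8–0xF, i.e. the high bit of the nibble is set.
func isControl(opcode byte) bool { return opcode&0x8 != 0 }

func main() {
	// A PONG emitted with opcode 0x2 looks like a binary data frame, so
	// control-frame handling keyed on isControl() never runs for it.
	for _, op := range []byte{opBinary, opPing, opPong} {
		fmt.Printf("opcode 0x%X control=%v\n", op, isControl(op))
	}
}
```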
@wallyqs Should we fix the opcodes both on the client and the server and disable permessage-deflate on the client PING, or just set the permessage-deflate bit on PONGs coming from the server side?
@wallyqs I've added simple reproduction steps for the bug.
Got to the bottom of the actual problem. The statements above still hold true, but they are not the source of the problem. I've opened and linked a PR here with a summarised explanation :)
Thanks @rsafonseca!
Thank you! I'll look at the PR.
Observed behavior
After enabling websocket connections with compression enabled, we have some client disconnections happening every few minutes. There are millions of messages of varying sizes and payloads passing through without issues, but once a client disconnects it causes a number of issues on our side, so this is very problematic.
The issues do not occur when compression is disabled on the WS connection.
nats: Parse Error [26]: 'Regi'
nats: Parse Error [0]: '"Frame'
nats: Parse Error [0]: ':{"Ena'
Expected behavior
No disconnections due to parsing errors
Server and client version
server 2.10.24
client 1.38.0
Host environment
Running on Kubernetes with x86_64 nodes and the containerd runtime, deployed using the official NATS Helm chart.
Steps to reproduce
For easy 100% reproducibility, I've used the nats-bench example; here are some simple steps to reproduce (a sketch of the mounted nats.conf follows the steps):
docker run -it --mount type=bind,src=$(pwd)/nats.conf,dst=/etc/nats/nats-server.conf -p 8080:8080 nats -c /etc/nats/nats-server.conf
go build
./nats-bench -s ws://localhost:8080 -n 1000000 -np 2 -ns 2 foo
Disabling compression or increasing the ping interval makes the bench complete successfully.
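The steps above mount a nats.conf whose contents aren't shown; here is a minimal sketch of a config consistent with the commands (the websocket block values are assumptions, not the reporter's actual file):

```
# nats.conf — minimal sketch; port, no_tls, and compression are assumed values
websocket {
  port: 8080
  no_tls: true
  compression: true
}
```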