Parsing errors leading to disconnections when WS compression is enabled #1783

Closed
rsafonseca opened this issue Jan 21, 2025 · 16 comments · Fixed by #1790
Labels
defect Suspected defect such as a bug or regression

Comments

@rsafonseca
Contributor

rsafonseca commented Jan 21, 2025

Observed behavior

After enabling compression on our websocket connections, we see some client disconnections happening every few minutes. There are millions of messages of varying sizes and payloads passing by without issues, but once a client disconnects it causes a number of issues on our side, so this is very problematic.
The issues do not occur when compression is disabled on the WS connection.

nats: Parse Error [26]: 'Regi'
nats: Parse Error [0]: '"Frame'
nats: Parse Error [0]: ':{"Ena'

Expected behavior

No disconnections due to parsing errors

Server and client version

server 2.10.24
client 1.38.0

Host environment

Running on kubernetes, x86_64 arch nodes, using containerd runtime, using official nats helm chart

Steps to reproduce

  • Run a nats server with enabled websockets and compression
  • Run a nats.go client that connects to the server over websockets, using compression and a high ping rate for easier reproduction

For easy 100% reproducibility, I've used the nats-bench example, here's some simple steps to reproduce:

  • Add the contents of this commit to existing nats repo rsafonseca@4994597
  • cd to examples/nats-bench
  • Run a local nats container with the following command to mount the required websocket config: docker run -it --mount type=bind,src=$(pwd)/nats.conf,dst=/etc/nats/nats-server.conf -p 8080:8080 nats -c /etc/nats/nats-server.conf
  • Compile the example nats-bench by running go build
  • Run the following example command: ./nats-bench -s ws://localhost:8080 -n 1000000 -np 2 -ns 2 foo
  • Lo and behold: a myriad of failures, parsing errors, and disconnects

Disabling compression or increasing the ping interval allows the bench to complete successfully.
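
For illustration, here's a minimal nats.go client sketch along the lines of the reproduction above; the URL, subject, payload size, and ping interval are assumptions for demonstration, not the exact code from the linked commit:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect over websocket with compression enabled and an aggressive ping
	// interval so the parse errors show up quickly. The URL, subject, and
	// interval below are illustrative assumptions.
	nc, err := nats.Connect("ws://localhost:8080",
		nats.Compression(true),           // enable websocket permessage-deflate
		nats.PingInterval(2*time.Second), // default is 2m; shorter reproduces faster
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected: %v", err) // surfaces errors like "nats: Parse Error ..."
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Keep traffic flowing while pings are exchanged in the background.
	payload := make([]byte, 1024)
	for i := 0; i < 1_000_000; i++ {
		if err := nc.Publish("foo", payload); err != nil {
			log.Fatal(err)
		}
	}
	if err := nc.Flush(); err != nil {
		log.Fatal(err)
	}
}
```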

@rsafonseca rsafonseca added the defect Suspected defect such as a bug or regression label Jan 21, 2025
@wallyqs
Member

wallyqs commented Jan 21, 2025

Thanks for the report. Do you modify the default max_payload setting by any chance?

@rsafonseca
Contributor Author

Nope, it's using the default value. The messages are of varying sizes but not over 2Mb :)

@wallyqs
Member

wallyqs commented Jan 21, 2025

OK, then the default would be the 1Mb max payload.

@rsafonseca
Contributor Author

Yes, sorry, 1Mb; not sure why I had 2Mb in my mind :)

@rsafonseca
Contributor Author

rsafonseca commented Jan 21, 2025

I haven't dug too deep yet, but it may possibly be due to a partial flush? klauspost/compress#993
Support for this was added in the patch version after the one nats.go is using.

edit: hmm... probably not, the deflate function uses the same lib and it seems to only do z_sync_flush.

@rsafonseca
Contributor Author

Oh, possibly worth noting: the messages are in protobuf format, hence they are binary. IDK if that could be randomly tripping some control byte sequence somewhere.

@rsafonseca
Contributor Author

UPDATE: It seems that the problem is related to the PING/PONG frames. After doing some gigantic network captures to try to catch some of the events, I noticed that the very last thing that comes in is a PONG before the client throws the parsing error and disconnects.

I confirmed this by increasing the PingInterval in the clients from the default 2m to 1h, and the issues nearly disappeared (apart from a random pod having one disconnect once in a while after ~1h).
I then decreased the PingInterval to 30s, and the amount of parsing errors increased beyond the number of occurrences with the default setting.
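
For reference, the experiment above boils down to changing the client's PingInterval option; a minimal sketch, assuming everything else stays at defaults (the URL is illustrative):

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Default PingInterval is 2 minutes. Per the observations above, raising it
	// to 1*time.Hour made the parse errors nearly disappear, while lowering it
	// to 30*time.Second made them more frequent.
	nc, err := nats.Connect("ws://localhost:8080",
		nats.Compression(true),
		nats.PingInterval(30*time.Second), // try 1*time.Hour for comparison
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	select {} // keep the connection alive so pings keep flowing
}
```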

@wallyqs
Member

wallyqs commented Jan 22, 2025

Thanks @rsafonseca. To clarify, is the PONG coming from the server or from the client?

@rsafonseca
Contributor Author

rsafonseca commented Jan 23, 2025

The PONG is coming from the server; the PONG packet itself has the expected 6-byte payload:
50 4f 4e 47 0d 0a  PONG..

Side note, maybe related to the issue here: I noticed that the PING sent by the client has the compression bit set, and IIUC it shouldn't. If it should, then the problem could be the fact that the PONG sent by the server doesn't.

@rsafonseca
Contributor Author

Another inconsistency I noticed, which is probably irrelevant here, is that the client is masking its writes and the server is not. Looking at the nats-server code, I think only leafnodes are actually doing masking on their WS packets.

@rsafonseca
Contributor Author

rsafonseca commented Jan 23, 2025

The WS opcode bits for the control messages also seem wrong, both from the server and from the client. ALL packets are using opcode 0x2 (binary message), whereas pings should be using 0x9 and pongs should be using 0xA. There's a bunch of code that relies on these opcodes that's likely not doing anything due to this.

In summary, the issue is quite possibly that the PONG coming from the server has opcode 0x2 and doesn't have the compression bit set, since it's treated like a normal message instead of a control frame.
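
For context, the opcodes and the compression bit being discussed are defined by RFC 6455 and RFC 7692 (permessage-deflate). Here's a small sketch of the relevant constants, with illustrative names rather than the ones used in the nats-server or nats.go source:

```go
package wsframe // illustrative package name

// RFC 6455 first header byte layout: FIN | RSV1 | RSV2 | RSV3 | 4-bit opcode.
// RSV1 is reused by permessage-deflate (RFC 7692) as the "compressed" flag,
// and per RFC 7692 it must not be set on control frames.
const (
	finalBit      byte = 1 << 7 // FIN
	compressedBit byte = 1 << 6 // RSV1, the permessage-deflate "compressed" flag

	opText   byte = 0x1
	opBinary byte = 0x2 // what the captures show on every frame, data and PING/PONG alike
	opClose  byte = 0x8
	opPing   byte = 0x9 // websocket-level ping control frame
	opPong   byte = 0xA // websocket-level pong control frame
)
```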

@rsafonseca
Contributor Author

rsafonseca commented Jan 23, 2025

@wallyqs should we fix the opcodes both on client and server and disable permessage-deflate on the client PING, or just set the permessage-deflate bit on PONGs coming from the server side?

@rsafonseca
Contributor Author

@wallyqs I've added simple reproduction steps for the bug

@rsafonseca
Contributor Author

Got to the bottom of the actual problem. The statements above still hold true, but they are not the source of the problem. I've opened and linked a PR here with a summarised explanation :)

@derekcollison
Member

Thanks @rsafonseca !

@piotrpio
Collaborator

Thank you! I'll look at the PR.
