Parsing errors leading to disconnections when WS compression is enabled #1783

Closed
rsafonseca opened this issue Jan 21, 2025 · 16 comments · Fixed by #1790
Labels
defect Suspected defect such as a bug or regression

Comments

@rsafonseca
Contributor

rsafonseca commented Jan 21, 2025

Observed behavior

After enabling compression on our websocket connections, we see some client disconnections happening every few minutes. There are millions of messages of varying sizes and payloads passing by without issues, but once a client disconnects it causes a number of issues on our side, so this is very problematic.
The issues do not occur when compression is disabled on the WS connection.

nats: Parse Error [26]: 'Regi'
nats: Parse Error [0]: '"Frame'
nats: Parse Error [0]: ':{"Ena'

Expected behavior

No disconnections due to parsing errors

Server and client version

server 2.10.24
client 1.38.0

Host environment

Running on kubernetes, x86_64 arch nodes, using containerd runtime, using official nats helm chart

Steps to reproduce

  • Run a nats server with enabled websockets and compression
  • Run a nats.go client that connects to the server over websockets, using compression and a high ping rate for easier reproduction

For easy 100% reproducibility, I've used the nats-bench example, here's some simple steps to reproduce:

  • Add the contents of this commit to existing nats repo rsafonseca@4994597
  • cd to examples/nats-bench
  • Run a local nats container with the following command to mount the required websocket config: docker run -it --mount type=bind,src=$(pwd)/nats.conf,dst=/etc/nats/nats-server.conf -p 8080:8080 nats -c /etc/nats/nats-server.conf
  • Compile the example nats-bench by running go build
  • Run the following example command: ./nats-bench -s ws://localhost:8080 -n 1000000 -np 2 -ns 2 foo
  • Lo and behold: a myriad of failures, parsing errors, and disconnects

Disabling compression or increasing the ping interval allows the bench to complete successfully.
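
For illustration, here's a minimal nats.go client sketch along the lines of the reproduction above; the URL, subject, payload size, and ping interval are assumptions for demonstration, not the exact code from the linked commit:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect over websocket with compression enabled and an aggressive ping
	// interval so the parse errors show up quickly. The URL, subject, and
	// interval below are illustrative assumptions.
	nc, err := nats.Connect("ws://localhost:8080",
		nats.Compression(true),           // enable websocket permessage-deflate
		nats.PingInterval(2*time.Second), // default is 2m; shorter reproduces faster
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected: %v", err) // surfaces errors like "nats: Parse Error ..."
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Keep traffic flowing while pings are exchanged in the background.
	payload := make([]byte, 1024)
	for i := 0; i < 1_000_000; i++ {
		if err := nc.Publish("foo", payload); err != nil {
			log.Fatal(err)
		}
	}
	if err := nc.Flush(); err != nil {
		log.Fatal(err)
	}
}
```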

@rsafonseca rsafonseca added the defect Suspected defect such as a bug or regression label Jan 21, 2025
@wallyqs
Member

wallyqs commented Jan 21, 2025

Thanks for the report. Do you modify the default max_payload setting by any chance?

@rsafonseca
Contributor Author

Nope, it's using the default value. The messages are of varying sizes but not over 2Mb :)

@wallyqs
Member

wallyqs commented Jan 21, 2025

OK, then the default would be the 1Mb max payload.

@rsafonseca
Contributor Author

Yes, sorry, 1Mb; not sure why I had 2Mb in my mind :)

@rsafonseca
Contributor Author

rsafonseca commented Jan 21, 2025

I haven't dug too deep yet, but it may possibly be due to a partial flush? klauspost/compress#993
Support for this was added in the patch version after the one nats.go is using.

edit: hmm... probably not, the deflate function uses the same lib and it seems to only do z_sync_flush.

@rsafonseca
Contributor Author

Oh, possibly worth noting: the messages are in protobuf format, hence they are binary. IDK if that could be randomly tripping some control byte sequence somewhere.

@rsafonseca
Contributor Author

UPDATE: It seems that the problem is related to the PING/PONG frames. After doing some gigantic network captures to try to catch some of the events, I noticed that the very last thing that comes in is a PONG before the client throws the parsing error and disconnects.

I confirmed this by increasing the PingInterval in the clients from the default 2m to 1h, and the issues nearly disappeared (apart from a random pod having one disconnect once in a while after ~1h).
I then decreased the PingInterval to 30s, and the amount of parsing errors increased beyond the number of occurrences with the default setting.
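
For reference, the experiment above boils down to changing the client's PingInterval option; a minimal sketch, assuming everything else stays at defaults (the URL is illustrative):

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Default PingInterval is 2 minutes. Per the observations above, raising it
	// to 1*time.Hour made the parse errors nearly disappear, while lowering it
	// to 30*time.Second made them more frequent.
	nc, err := nats.Connect("ws://localhost:8080",
		nats.Compression(true),
		nats.PingInterval(30*time.Second), // try 1*time.Hour for comparison
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	select {} // keep the connection alive so pings keep flowing
}
```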

@wallyqs
Member

wallyqs commented Jan 22, 2025

Thanks @rsafonseca. To clarify, is the PONG coming from the server or from the client?

@rsafonseca
Contributor Author

rsafonseca commented Jan 23, 2025

The PONG is coming from the server; the PONG packet itself has the expected 6-byte payload:
50 4f 4e 47 0d 0a  PONG..

Side note, maybe related to the issue here: I noticed that the PING sent by the client has the compression bit set, and IIUC it shouldn't. If it should, then the problem could be the fact that the PONG sent by the server doesn't.

@rsafonseca
Contributor Author

Another inconsistency I noticed, which is probably irrelevant here, is that the client is masking its writes and the server is not. Looking at the nats-server code, I think only leafnodes are actually doing masking on their WS packets.

@rsafonseca
Contributor Author

rsafonseca commented Jan 23, 2025

The WS opcode bits for the control messages also seem wrong, both from the server and from the client. ALL packets are using opcode 0x2 (binary message), whereas pings should be using 0x9 and pongs should be using 0xA. There's a bunch of code that relies on these opcodes that's likely not doing anything due to this.

In summary, the issue is quite possibly that the PONG coming from the server has opcode 0x2 and doesn't have the compression bit set, since it's treated like a normal message instead of a control frame.
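
For context, the opcodes and the compression bit being discussed are defined by RFC 6455 and RFC 7692 (permessage-deflate). Here's a small sketch of the relevant constants, with illustrative names rather than the ones used in the nats-server or nats.go source:

```go
package wsframe // illustrative package name

// RFC 6455 first header byte layout: FIN | RSV1 | RSV2 | RSV3 | 4-bit opcode.
// RSV1 is reused by permessage-deflate (RFC 7692) as the "compressed" flag,
// and per RFC 7692 it must not be set on control frames.
const (
	finalBit      byte = 1 << 7 // FIN
	compressedBit byte = 1 << 6 // RSV1, the permessage-deflate "compressed" flag

	opText   byte = 0x1
	opBinary byte = 0x2 // what the captures show on every frame, data and PING/PONG alike
	opClose  byte = 0x8
	opPing   byte = 0x9 // websocket-level ping control frame
	opPong   byte = 0xA // websocket-level pong control frame
)
```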

@rsafonseca
Contributor Author

rsafonseca commented Jan 23, 2025

@wallyqs should we fix the opcodes both on client and server and disable permessage-deflate on the client PING, or just set the permessage-deflate bit on PONGs coming from the server side?

@rsafonseca
Contributor Author

@wallyqs I've added simple reproduction steps for the bug

@rsafonseca
Contributor Author

Got to the bottom of the actual problem. The statements above still hold true, but they are not the source of the problem. I've opened and linked a PR here with a summarised explanation :)

@derekcollison
Member

Thanks @rsafonseca !

@piotrpio
Collaborator

Thank you! I'll look at the PR.
