Some acknowledged messages not being deleted from stream with sources and WorkQueue retention #5148

Closed
zatlodan opened this issue Feb 29, 2024 · 9 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@zatlodan

zatlodan commented Feb 29, 2024

Observed behavior

A stream (STREAM_B_Q) with a single consumer and retention set to WorkQueue reports a non-zero message count after all messages have been consumed and acknowledged by that consumer.
This stream (STREAM_B_Q) sources from another stream (STREAM_A) whose retention is set to Limits.

This behavior occurred after a large amount of data was published into the source stream (STREAM_A).

Some more details:

  • The stream configuration was never changed
  • The stream has never had any consumers other than the one described below
  • No related issues were found in the server logs

STREAM_A

This is the source stream into which the data were published (a minimal creation sketch follows the listing below).

Config

Subjects: STREAM_A.>
Replicas: 3
Storage: File
Retention: Limits
Acknowledgements: true
Discard Policy: Old
Duplicate Window: 5m0s
Direct Get: true
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits: Unlimited

State

Messages: 7,003,488
Bytes: 1.7 GiB
FirstSeq: 640,772 @ 2024-02-20T12:21:01 UTC
LastSeq: 7,644,259 @ 2024-02-29T10:12:15 UTC
Active Consumers: 0
Number of Subjects: 1
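
For reference, a minimal sketch of how a stream with this configuration could be created with the npm `nats` client (the server URL is a placeholder; only the settings listed above are mirrored, everything else is left at defaults):

```ts
import { connect, DiscardPolicy, RetentionPolicy, StorageType, nanos } from "nats";

// Sketch: create the Limits-retention source stream described above.
const nc = await connect({ servers: "nats://localhost:4222" }); // placeholder URL
const jsm = await nc.jetstreamManager();

await jsm.streams.add({
  name: "STREAM_A",
  subjects: ["STREAM_A.>"],
  retention: RetentionPolicy.Limits,
  storage: StorageType.File,
  num_replicas: 3,
  discard: DiscardPolicy.Old,
  duplicate_window: nanos(5 * 60 * 1000), // 5m0s, expressed in nanoseconds
  allow_direct: true,
});

await nc.drain();
```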

STREAM_B_Q

This is the WorkQueue stream with the issue (a minimal creation sketch follows the listing below).

Config

Subjects: STREAM_B_Q.>
Replicas: 3
Storage: File
Retention: WorkQueue
Acknowledgements: true
Discard Policy: Old
Duplicate Window: 2m0s
Direct Get: true
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits: Unlimited
Sources: STREAM_A

State

Messages: 1,980  <-- This is the issue
Bytes: 636 KiB
FirstSeq: 7,318,290 @ 2024-02-23T11:13:58 UTC
LastSeq: 7,644,857 @ 2024-02-29T10:27:17 UTC
Deleted Messages: 324,588 <-- This is the issue
Active Consumers: 1
Number of Subjects: 1
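
A minimal sketch of the equivalent creation call for this work-queue stream, again with the npm `nats` client (same placeholder connection as the STREAM_A sketch; the `sources` entry is what makes it pull messages from STREAM_A):

```ts
import { connect, DiscardPolicy, RetentionPolicy, StorageType, nanos } from "nats";

// Sketch: create the WorkQueue stream that sources from STREAM_A.
const nc = await connect({ servers: "nats://localhost:4222" }); // placeholder URL
const jsm = await nc.jetstreamManager();

await jsm.streams.add({
  name: "STREAM_B_Q",
  subjects: ["STREAM_B_Q.>"],
  retention: RetentionPolicy.Workqueue,
  storage: StorageType.File,
  num_replicas: 3,
  discard: DiscardPolicy.Old,
  duplicate_window: nanos(2 * 60 * 1000), // 2m0s, expressed in nanoseconds
  allow_direct: true,
  sources: [{ name: "STREAM_A" }],
});

await nc.drain();
```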

Consumer

Name: stream-b-testing-consumer
Pull Mode: true
Deliver Policy: All
Ack Policy: Explicit
Ack Wait: 30s
Replay Policy: Instant
Max Ack Pending: 10,000
Max Waiting Pulls: 10
Replicas: 3
Last Delivered Message: Consumer sequence: 7,645,772 Stream sequence: 7,644,857 Last delivery: 9m19s ago
Acknowledgment floor: Consumer sequence: 7,645,772 Stream sequence: 7,644,857 Last Ack: 8m49s ago
Outstanding Acks: 0 out of maximum 10,000
Redelivered Messages: 0
Unprocessed Messages: 0
Waiting Pulls: 1 of maximum 10
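
A minimal sketch of how this durable pull consumer could be created and drained with the npm `nats` client (v2.x); the fetch batch size and expiry are arbitrary, everything else mirrors the settings above. With WorkQueue retention, every acknowledged message should be removed from STREAM_B_Q:

```ts
import { connect, AckPolicy, DeliverPolicy, ReplayPolicy, nanos } from "nats";

const nc = await connect({ servers: "nats://localhost:4222" }); // placeholder URL
const jsm = await nc.jetstreamManager();

// Sketch: durable pull consumer matching the settings listed above.
await jsm.consumers.add("STREAM_B_Q", {
  durable_name: "stream-b-testing-consumer",
  ack_policy: AckPolicy.Explicit,
  deliver_policy: DeliverPolicy.All,
  replay_policy: ReplayPolicy.Instant,
  ack_wait: nanos(30 * 1000), // 30s
  max_ack_pending: 10_000,
  max_waiting: 10,
  num_replicas: 3,
});

// Sketch: pull messages and acknowledge them explicitly.
const js = nc.jetstream();
const batch = js.fetch("STREAM_B_Q", "stream-b-testing-consumer", { batch: 100, expires: 5_000 });
for await (const m of batch) {
  // ...process m.data...
  m.ack();
}
```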

View from metrics
This is the jetstream_stream_total_messages metric for the stream STREAM_B_Q around the time the issue arose.
It shows 0 messages in the stream before the bulk publish and 1,980 after.

[Screenshot: jetstream_stream_total_messages chart for STREAM_B_Q]

Cluster info
4 nodes, all with the same version and hardware specs, on the same private network.
No leaf nodes connected.

Expected behavior

All messages are removed from the stream after acknowledgement, and the stream reports 0 total messages (a quick check sketch follows).
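
A minimal sketch of that check, assuming the same connection and `jsm` handle as the earlier sketches:

```ts
// Sketch: after everything is acked, the stream state should be empty.
const info = await jsm.streams.info("STREAM_B_Q");
console.log(info.state.messages); // expected: 0
```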

Server and client version

Server:
Version: 2.10.5
Git Commit: 0883d32
Go Version: go1.21.4

Consuming JS client:
https://www.npmjs.com/package/nats
Version: 2.15.1

CLI used to check
Version: 0.0.35

Host environment

No response

Steps to reproduce

The issue is quite flaky and occurs randomly throughout the month, but it seems to be triggered by sudden spikes of published data in the source stream.

  1. Set up a 3- or 4-node cluster
  2. Create the streams with the configuration described in the "Observed behavior" section
  3. Publish a large batch of data into the source stream (see the publish sketch after this list)
  4. Wait for the consumer to consume the messages
  5. Observe the issue in the work-queue stream
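
A minimal sketch of step 3 (bulk publish into the source stream); the subject names, payloads, and burst size are placeholders:

```ts
import { connect, StringCodec } from "nats";

const nc = await connect({ servers: "nats://localhost:4222" }); // placeholder URL
const js = nc.jetstream();
const sc = StringCodec();

// Sketch: publish a large burst into STREAM_A and wait for the JetStream acks.
const acks: Promise<unknown>[] = [];
for (let i = 0; i < 100_000; i++) {
  acks.push(js.publish(`STREAM_A.bulk.${i}`, sc.encode(`payload-${i}`)));
}
await Promise.all(acks);
await nc.drain();
```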
@zatlodan zatlodan added the defect Suspected defect such as a bug or regression label Feb 29, 2024
@derekcollison
Member

Thanks for the report. Best for you to upgrade to the latest patch version, 2.10.11. If the issue persists, let us know.

@derekcollison
Member

Will leave open for now.

@zakk616

zakk616 commented Mar 19, 2024

I was facing this with nats-server version 2.9.25; after upgrading to version 2.10.12 the issue was resolved.

@zatlodan
Author

I was facing this with nats-server version 2.9.25; after upgrading to version 2.10.12 the issue was resolved.

Thanks for the reply.
We will be updating NATS on our prod environment this week; I will post an update as soon as I can.

@zatlodan
Author

zatlodan commented Apr 17, 2024

We have updated all our NATS server environments to version 2.10.12.

We have cleared the affected streams of any messages and recreated the consumers.

The issue is still there, but it looks different now.
After a week of monitoring we have a hanging message in 3 of our 10 streams.
Currently it's just a single message.

The difference now is that only one of the instances sees the message as stuck.
In some cases it's the leader of the stream/consumer and in some cases it's not.

@vigith

vigith commented Apr 17, 2024

I think #5270 fixes this and is available in 2.10.14.

@wallyqs
Member

wallyqs commented Apr 17, 2024

Thanks for the update @zatlodan, that is a condition that we were able to reproduce and was addressed in the v2.10.14 release from last week.

@zatlodan
Author

Okay, thank you for the response. We will update to 2.10.14 and let you know.

@zatlodan
Author

It seems the issue is no longer present after the update to 2.10.14.

Thank you all for the help; I will now close this issue 👍
