
Pod repeatedly fails due to "nats: consumer not found" error #2444

Open
juliev0 opened this issue Mar 5, 2025 · 1 comment
Labels
bug Something isn't working

juliev0 commented Mar 5, 2025

Describe the bug

This report is based on looking through all of the artifacts from this CI failure.

Scenario: a new isbsvc test-isbservice-rollout-1 is created along with a new pipeline test-pipeline-rollout-2.

The sink vertex pod named "out" continually restarts. Unfortunately, we only have the log from the first failure, not the subsequent ones, but presumably they are the same:

{"level":"info","ts":"2025-03-04T15:35:05.204127832Z","logger":"numaflow.Sink-processor","caller":"commands/processor.go:48","msg":"Starting vertex data processor","version":"Version: v1.4.3-rc3+3b92155, BuildDate: 2025-02-24T00:39:14Z, GitCommit: 3b921554682af1b663f451aefe8ceb106bffebc8, GitTag: , GitTreeState: clean, GoVersion: go1.23.4, Compiler: gc, Platform: linux/amd64"}
Error: failed to get consumer info, nats: consumer not found
Usage:
  numaflow processor [flags]

Flags:
  -h, --help                 help for processor
      --isbsvc-type string   ISB Service type, e.g. jetstream
      --type string          Processor type, 'source', 'sink' or 'udf'

panic: failed to get consumer info, nats: consumer not found

goroutine 1 [running]:
github.com/numaproj/numaflow/cmd/commands.Execute(...)
	/Users/jwang21/workspace/numaproj/numaflow/cmd/commands/root.go:33
main.main()
	/Users/jwang21/workspace/numaproj/numaflow/cmd/main.go:24 +0x3c

The Pod restarted 6 times and failed every time.

Prior to this, the Job Pod failed and then succeeded. Unfortunately, we don't have the log from the successful run, only the failed one:

{"level":"info","ts":"2025-03-04T15:34:32.556800234Z","logger":"numaflow.isbsvc-create","caller":"isbsvc/jetstream_service.go:89","msg":"Succeeded to create a side inputs KV","pipeline":"test-pipeline-rollout-2","kvName":"numaplane-system-test-pipeline-rollout-2_SIDE_INPUTS"}
{"level":"info","ts":"2025-03-04T15:34:33.061041705Z","logger":"numaflow.isbsvc-create","caller":"isbsvc/jetstream_service.go:161","msg":"Succeeded to create a stream","pipeline":"test-pipeline-rollout-2","stream":"numaplane-system-test-pipeline-rollout-2-cat-0"}
{"level":"info","ts":"2025-03-04T15:34:33.516849285Z","logger":"numaflow.isbsvc-create","caller":"isbsvc/jetstream_service.go:172","msg":"Succeeded to create a consumer for a stream","pipeline":"test-pipeline-rollout-2","stream":"numaplane-system-test-pipeline-rollout-2-cat-0","consumer":"numaplane-system-test-pipeline-rollout-2-cat-0"}
{"level":"error","ts":"2025-03-04T15:34:38.526356783Z","logger":"numaflow.isbsvc-create","caller":"commands/isbsvc_create.go:93","msg":"Failed to create buffers, buckets and side inputs store.","pipeline":"test-pipeline-rollout-2","error":"failed to create stream \"numaplane-system-test-pipeline-rollout-2-out-0\" and buffers, context deadline exceeded","stacktrace":"github.com/numaproj/numaflow/cmd/commands.NewISBSvcCreateCommand.func1\n\t/Users/jwang21/workspace/numaproj/numaflow/cmd/commands/isbsvc_create.go:93\ngithub.ghproxy.top/spf13/cobra.(*Command).execute\n\t/Users/jwang21/go/pkg/mod/github.com/spf13/[email protected]/command.go:985\ngithub.ghproxy.top/spf13/cobra.(*Command).ExecuteC\n\t/Users/jwang21/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117\ngithub.ghproxy.top/spf13/cobra.(*Command).Execute\n\t/Users/jwang21/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041\ngithub.ghproxy.top/numaproj/numaflow/cmd/commands.Execute\n\t/Users/jwang21/workspace/numaproj/numaflow/cmd/commands/root.go:32\nmain.main\n\t/Users/jwang21/workspace/numaproj/numaflow/cmd/main.go:24\nruntime.main\n\t/usr/local/Cellar/go/1.23.4/libexec/src/runtime/proc.go:272"}
{"level":"error","ts":"2025-03-04T15:34:38.638514817Z","logger":"numaflow.isbsvc-create","caller":"nats/nats_client.go:69","msg":"Nats default: disconnected","pipeline":"test-pipeline-rollout-2","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/Users/jwang21/workspace/numaproj/numaflow/pkg/shared/clients/nats/nats_client.go:69\ngithub.ghproxy.top/nats-io/nats%2ego.(*Conn).close.func1\n\t/Users/jwang21/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5332\ngithub.ghproxy.top/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/Users/jwang21/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:3011"}
{"level":"info","ts":"2025-03-04T15:34:38.656961721Z","logger":"numaflow.isbsvc-create","caller":"nats/nats_client.go:63","msg":"Nats default: connection closed","pipeline":"test-pipeline-rollout-2"}
Error: failed to create stream "numaplane-system-test-pipeline-rollout-2-out-0" and buffers, context deadline exceeded
Usage:
  numaflow isbsvc-create [flags]

Flags:
      --buckets strings                  Buckets to create
      --buffers strings                  Buffers to create
  -h, --help                             help for isbsvc-create
      --isbsvc-type string               ISB Service type, e.g. jetstream
      --serving-source-streams strings   Serving source streams to create
      --side-inputs-store string         Name of the side inputs store

panic: failed to create stream "numaplane-system-test-pipeline-rollout-2-out-0" and buffers, context deadline exceeded

goroutine 1 [running]:
github.com/numaproj/numaflow/cmd/commands.Execute(...)
	/Users/jwang21/workspace/numaproj/numaflow/cmd/commands/root.go:33
main.main()
	/Users/jwang21/workspace/numaproj/numaflow/cmd/main.go:24 +0x3c

I am attaching all artifacts and logs that we have:

pod-logs-progressive-functional (7).zip
resource-changes-progressive-functional (5).zip

The timeline is this:

2025-03-04T15:33:43.781272599Z pipeline created

2025-03-04T15:34:29.771129989Z Create Job starts running

2025-03-04T15:34:49: Create Job Pod restarts after failure and succeeds

2025-03-04T15:35:05.204127832Z test-pipeline-rollout-2 out-0 runs

2025-03-04T15:35:08 test-pipeline-rollout-2 out-0 panics

2025-03-04T15:38:04Z test-pipeline-rollout-2 out-0 has now restarted 5 times

To Reproduce
Steps to reproduce the behavior:

This may not be easily reproducible. This CI test usually passes.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

For quick help and support, join our slack channel.

@juliev0 juliev0 added bug Something isn't working area/controller and removed area/controller labels Mar 5, 2025
juliev0 commented Mar 5, 2025

I know this has only happened once and may be hard to reproduce. I'm okay with not looking into it yet, but I wanted to create a record of it in case it happens again.
