
Pod repeatedly fails due to "nats: consumer not found" error #2444

Open
juliev0 opened this issue Mar 5, 2025 · 1 comment
Labels
bug Something isn't working

juliev0 commented Mar 5, 2025

Describe the bug

This report is based on looking through all of the artifacts from this CI failure.

Scenario: a new isbsvc test-isbservice-rollout-1 is created along with a new pipeline test-pipeline-rollout-2.

The sink vertex pod named "out" continually restarts. Unfortunately, we only have the log from the first failure, not the subsequent ones, but presumably they are the same:

{"level":"info","ts":"2025-03-04T15:35:05.204127832Z","logger":"numaflow.Sink-processor","caller":"commands/processor.go:48","msg":"Starting vertex data processor","version":"Version: v1.4.3-rc3+3b92155, BuildDate: 2025-02-24T00:39:14Z, GitCommit: 3b921554682af1b663f451aefe8ceb106bffebc8, GitTag: , GitTreeState: clean, GoVersion: go1.23.4, Compiler: gc, Platform: linux/amd64"}
Error: failed to get consumer info, nats: consumer not found
Usage:
  numaflow processor [flags]

Flags:
  -h, --help                 help for processor
      --isbsvc-type string   ISB Service type, e.g. jetstream
      --type string          Processor type, 'source', 'sink' or 'udf'

panic: failed to get consumer info, nats: consumer not found

goroutine 1 [running]:
github.com/numaproj/numaflow/cmd/commands.Execute(...)
	/Users/jwang21/workspace/numaproj/numaflow/cmd/commands/root.go:33
main.main()
	/Users/jwang21/workspace/numaproj/numaflow/cmd/main.go:24 +0x3c

The Pod restarted 6 times and failed every time.

Prior to this, the Job Pod failed and then succeeded. Unfortunately, we don't have the log from the successful run, only the failed one:

{"level":"info","ts":"2025-03-04T15:34:32.556800234Z","logger":"numaflow.isbsvc-create","caller":"isbsvc/jetstream_service.go:89","msg":"Succeeded to create a side inputs KV","pipeline":"test-pipeline-rollout-2","kvName":"numaplane-system-test-pipeline-rollout-2_SIDE_INPUTS"}
{"level":"info","ts":"2025-03-04T15:34:33.061041705Z","logger":"numaflow.isbsvc-create","caller":"isbsvc/jetstream_service.go:161","msg":"Succeeded to create a stream","pipeline":"test-pipeline-rollout-2","stream":"numaplane-system-test-pipeline-rollout-2-cat-0"}
{"level":"info","ts":"2025-03-04T15:34:33.516849285Z","logger":"numaflow.isbsvc-create","caller":"isbsvc/jetstream_service.go:172","msg":"Succeeded to create a consumer for a stream","pipeline":"test-pipeline-rollout-2","stream":"numaplane-system-test-pipeline-rollout-2-cat-0","consumer":"numaplane-system-test-pipeline-rollout-2-cat-0"}
{"level":"error","ts":"2025-03-04T15:34:38.526356783Z","logger":"numaflow.isbsvc-create","caller":"commands/isbsvc_create.go:93","msg":"Failed to create buffers, buckets and side inputs store.","pipeline":"test-pipeline-rollout-2","error":"failed to create stream \"numaplane-system-test-pipeline-rollout-2-out-0\" and buffers, context deadline exceeded","stacktrace":"github.com/numaproj/numaflow/cmd/commands.NewISBSvcCreateCommand.func1\n\t/Users/jwang21/workspace/numaproj/numaflow/cmd/commands/isbsvc_create.go:93\ngithub.ghproxy.top/spf13/cobra.(*Command).execute\n\t/Users/jwang21/go/pkg/mod/github.com/spf13/[email protected]/command.go:985\ngithub.ghproxy.top/spf13/cobra.(*Command).ExecuteC\n\t/Users/jwang21/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117\ngithub.ghproxy.top/spf13/cobra.(*Command).Execute\n\t/Users/jwang21/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041\ngithub.ghproxy.top/numaproj/numaflow/cmd/commands.Execute\n\t/Users/jwang21/workspace/numaproj/numaflow/cmd/commands/root.go:32\nmain.main\n\t/Users/jwang21/workspace/numaproj/numaflow/cmd/main.go:24\nruntime.main\n\t/usr/local/Cellar/go/1.23.4/libexec/src/runtime/proc.go:272"}
{"level":"error","ts":"2025-03-04T15:34:38.638514817Z","logger":"numaflow.isbsvc-create","caller":"nats/nats_client.go:69","msg":"Nats default: disconnected","pipeline":"test-pipeline-rollout-2","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/Users/jwang21/workspace/numaproj/numaflow/pkg/shared/clients/nats/nats_client.go:69\ngithub.ghproxy.top/nats-io/nats%2ego.(*Conn).close.func1\n\t/Users/jwang21/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5332\ngithub.ghproxy.top/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/Users/jwang21/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:3011"}
{"level":"info","ts":"2025-03-04T15:34:38.656961721Z","logger":"numaflow.isbsvc-create","caller":"nats/nats_client.go:63","msg":"Nats default: connection closed","pipeline":"test-pipeline-rollout-2"}
Error: failed to create stream "numaplane-system-test-pipeline-rollout-2-out-0" and buffers, context deadline exceeded
Usage:
  numaflow isbsvc-create [flags]

Flags:
      --buckets strings                  Buckets to create
      --buffers strings                  Buffers to create
  -h, --help                             help for isbsvc-create
      --isbsvc-type string               ISB Service type, e.g. jetstream
      --serving-source-streams strings   Serving source streams to create
      --side-inputs-store string         Name of the side inputs store

panic: failed to create stream "numaplane-system-test-pipeline-rollout-2-out-0" and buffers, context deadline exceeded

goroutine 1 [running]:
github.com/numaproj/numaflow/cmd/commands.Execute(...)
	/Users/jwang21/workspace/numaproj/numaflow/cmd/commands/root.go:33
main.main()
	/Users/jwang21/workspace/numaproj/numaflow/cmd/main.go:24 +0x3c

I am attaching all artifacts and logs that we have:

pod-logs-progressive-functional (7).zip
resource-changes-progressive-functional (5).zip

The timeline is this:

2025-03-04T15:33:43.781272599Z pipeline created

2025-03-04T15:34:29.771129989Z Create Job starts running

2025-03-04T15:34:49: Create Job Pod restarts after failure and succeeds

2025-03-04T15:35:05.204127832Z test-pipeline-rollout-2 out-0 runs

2025-03-04T15:35:08 test-pipeline-rollout-2 out-0 panics

2025-03-04T15:38:04Z test-pipeline-rollout-2 out-0 has now restarted 5 times

To Reproduce
Steps to reproduce the behavior:

This may not be easily reproducible. This CI test usually passes.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

For quick help and support, join our slack channel.

@juliev0 juliev0 added bug Something isn't working area/controller and removed area/controller labels Mar 5, 2025
juliev0 commented Mar 5, 2025

I know this has only happened once and may be hard to reproduce. I'm okay with not looking into it yet, but I wanted to create a record of it in case it happens again.
