
Application DaemonSet Progressing forever blocks sync waves #7448

Open
sidewinder12s opened this issue Oct 14, 2021 · 12 comments
Labels: bug/in-triage (needs further triage to be correctly classified), bug (something isn't working), sync-waves

Comments


Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

In very large clusters where nodes are constantly cycling and an ArgoCD Application is managing DaemonSets, the health status of the Application will often be stuck in Progressing for extended periods of time.

This blocks Sync Waves from functioning if you put those Applications in an early wave.
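
For illustration, this is the kind of setup where the blocking shows up; the Application name, repo URL, and wave number below are hypothetical, but the point is that the DaemonSet-owning app sits in an early wave of a parent app-of-apps, so every later wave waits on its health:

    # Hypothetical child Application in the parent app-of-apps
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: cni-plugin
      namespace: argocd
      annotations:
        # Early wave: apps in later waves wait until this one reports Healthy
        argocd.argoproj.io/sync-wave: "-1"
    spec:
      project: default
      source:
        repoURL: https://example.com/infra-charts.git
        path: charts/cni-plugin
        targetRevision: HEAD
      destination:
        server: https://kubernetes.default.svc
        namespace: kube-system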

I've also assumed this puts a lot of strain on ArgoCD, especially since I've noticed the UI keeping pods around in a degraded state that no longer appear to exist in the cluster, leading me to believe the cache or some other component might be in a degraded state.

I had wondered if this PR might help clear the situation with a hard refresh: #6463

To Reproduce

We've seen this on a few of our larger batch clusters that scale from 100 to 500 nodes at a time, with ArgoCD Applications in an App of Apps pattern that manage DaemonSets (CNI plugins, logging DaemonSets, etc.).

Expected behavior

I'd expect the Application to eventually finish progressing.

Version

argocd: v2.0.4+0842d44
  BuildDate: 2021-06-23T01:27:53Z
  GitCommit: 0842d448107eb1397b251e63ec4d4bc1b4efdd6e
  GitTreeState: clean
  GoVersion: go1.16
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.0.4+0842d44
  BuildDate: 2021-06-23T01:27:53Z
  GitCommit: 0842d448107eb1397b251e63ec4d4bc1b4efdd6e
  GitTreeState: clean
  GoVersion: go1.16
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v3.9.4 2021-02-09T19:22:10Z
  Helm Version: v3.5.1+g32c2223
  Kubectl Version: v0.20.4
  Jsonnet Version: v0.17.0

We've also scaled up our repo servers and Application Controller replicas to 8 to try and shard the large clusters out more.

Logs

Happy to try and provide any logs if someone can point me to which component they'd like to see. The logs otherwise are pretty noisy.

sidewinder12s added the bug label on Oct 14, 2021
@jannfis (Member) commented Oct 20, 2021

Is it only the health check that stays in a progressing state, or is the underlying DaemonSet not reporting readiness either?

jannfis added the bug/in-triage label on Oct 20, 2021
@sidewinder12s (Author) commented Oct 21, 2021

Is it only the health check that stays in a progressing state, or is the underlying DaemonSet not reporting readiness either?

@jannfis I believe the daemonset is usually overall reporting progressing (as the target count for the daemonset fluctuates), with the underlying pods being a mixed bag.

Many are healthy, some will report unhealthy as they start or stop, and some report unhealthy as they restart as part of their launch process (the VPC CNI plugin does this, for example, I think).

I believe I'm also seeing issues where the app controller gets behind and the cache goes stale, hanging onto pods that no longer exist. With 2.0.x it seemed the only way to really clear these was to restart the app controller, but I think I saw some changes in 2.1 that might let a hard refresh invalidate the app cache; I'm upgrading later today to see if this is improved.

Edit:

Taking a look at the AWS VPC CNI Plugin chart for example:

The health detail on the DaemonSet is Waiting for daemon set "aws-node" rollout to finish: 437 of 465 updated pods are available... as it scales up, even though the app has been synced/deployed for months.

The main app health is then simply progressing.

I'm not familiar enough with the Lua health scripts to know whether they can differentiate between a sync/rollout succeeding for a generation and pods generally reporting healthy when determining the app's overall health.
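
For what it's worth, my rough mental model of the built-in DaemonSet health check (the real implementation lives in gitops-engine and is Go code, so the Lua below is only my own approximation, not the shipped script) is something like this, which would explain why a fluctuating desired count keeps the app in Progressing:

    -- Approximation only: a Lua sketch of what the default DaemonSet health logic appears to do
    hs = {}
    hs.message = ""
    local status = obj.status or {}
    if obj.metadata.generation > (status.observedGeneration or 0) then
      hs.status = "Progressing"
      hs.message = "Waiting for rollout to finish: observed generation is behind desired generation"
    elseif (status.updatedNumberScheduled or 0) < (status.desiredNumberScheduled or 0) then
      hs.status = "Progressing"
      hs.message = "Waiting for rollout to finish: not all pods have been updated"
    elseif (status.numberAvailable or 0) < (status.desiredNumberScheduled or 0) then
      hs.status = "Progressing"
      hs.message = "Waiting for rollout to finish: not all updated pods are available"
    else
      hs.status = "Healthy"
    end
    return hs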

@sidewinder12s (Author)

On ArgoCD 2.1.4 I'm still observing issues with Applications with large DaemonSets (100+ pods/nodes) retaining stale pods in the DaemonSet that no longer exist. These present as Progressing, which may actually be what is making this issue seem worse than it is.

Any idea what component is falling behind here, or any way to bust that cache? If I describe one of the pods that reports as Progressing, the pod doesn't exist anymore when talking to the Kubernetes API (and the Argo UI presents an empty live manifest). The only way I had found to get it to properly update was to restart the application controllers (I am sharded across 8). Most of the rest of our metrics on ArgoCD components don't show any of them particularly overloaded.

I also have timeout.reconciliation set to 2 hours to reduce load on our git/helm repos.
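
In case it helps anyone else, that knob lives in the argocd-cm ConfigMap; a minimal sketch of how we have it set (assuming the usual duration string format is accepted):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: argocd-cm
      namespace: argocd
      labels:
        app.kubernetes.io/part-of: argocd
    data:
      # How often apps are re-reconciled against git/helm even without a change event
      timeout.reconciliation: 2h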

@eelkonio commented Nov 3, 2021

I've encountered the same problem. When one pod of a daemonset cannot be scheduled onto the node it belongs to because limits have been reached on the node, the pod will stay in a pending/Progressing status.
When the application is updated to a new version, all resources are updated except the daemonset. The old pods of the daemonset will continue to run until that one pod is scheduled correctly. Only then can the daemonset update all its pods again.

What I would expect (or wish to see) is that ArgoCD would update the daemonset and restart all pods accordingly, regardless of whether some pods are in a Progressing/pending state.

This has resulted in quite a few applications on multiple clusters not being updated for several releases until we noticed this behaviour. Now we manually delete any daemonsets of "Progressing" applications, which triggers the new version to come alive. Cumbersome, but the only way, I think, until a solution is presented in ArgoCD.

Cheers,
Eelko

@duxing commented Oct 20, 2022

Running into the same issue and would love to hear from the ArgoCD community what the proper solution is.

Context: I deployed a daemonset with its official Helm chart and one of the values was incorrect, causing pods to get stuck in an initializing state and the corresponding ArgoCD app to show a health status of Progressing.

I pushed a change to GitHub with the wrong values corrected, but I noticed the app claims to be in sync while the daemonset is still stuck in Progressing with the old values.

I've attempted a force sync, deleting the daemonset in ArgoCD, and deleting the daemonset via kubectl directly, and the daemonset is still defined with the old values.

output from argocd version:

argocd: v2.4.0+91aefab
  BuildDate: 2022-06-10T17:23:37Z
  GitCommit: 91aefabc5b213a258ddcfe04b8e69bb4a2dd2566
  GitTreeState: clean
  GoVersion: go1.18.3
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.4.0+91aefab
  BuildDate: 2022-06-10T17:23:37Z
  GitCommit: 91aefabc5b213a258ddcfe04b8e69bb4a2dd2566
  GitTreeState: clean
  GoVersion: go1.18.3
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.1+g5cb9af4
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

@davidmontoyago

Hi everyone, my team is encountering the same issue with DaemonSets and clusters of 150+ nodes. The symptoms are the following:

  1. Logs for application-controller pods report healthy syncing operations for the Application containing the DaemonSets. Log entries are of the sort: Skipping auto-sync: application status is Synced and No status changes. Skipping patch.
  2. Once we open the specific Application in the UI, ArgoCD's API server begins making LOTS of calls to service application.ApplicationService, specifically methods ListResourceActions and ListResourceLinks. As long as there are new pods being added to the DaemonSet and the UI is trying to render them, these calls keep flooding the logs, the Application status is reported as Progressing, and slowness in the UI becomes noticeable. These method calls eventually get rate limited by the K8s API server, translating into errors in the UI.
  3. We then click on the DaemonSet to inspect its manifests and status, but it's unable to render any details, just displaying a blank page and the error message Unable to load data: Request has been terminated. Possible causes: the network is offline, Origin is not allowed by Access-Control-Allow-Origin, the page is being unloaded, etc. In the browser console, we see lots of 400 and 404 errors, which are attempts at fetching pods that are no longer present. This continues until the browser becomes really slow; it reports up to 500 MB of content downloaded before it freezes. At the same time, we see lots of calls to the CanI method, which eventually triggers an error on our SSO, kicking the user session out.

about our environment:

  • ArgoCD v2.8.4+c279299.
  • our DaemonSets are currently configured with updateStrategy type set to RollingUpdate (see the sketch below).
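
For reference, a sketch of that strategy and the knob that controls rollout parallelism; the DaemonSet name and the 10% figure are just illustrations (the Kubernetes default for maxUnavailable is 1, which makes rollouts on 150+ node clusters crawl):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: example-daemonset  # hypothetical
    spec:
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          # Raising this lets more pods update in parallel, so the DS spends less time rolling
          maxUnavailable: "10%"
      # selector/template omitted for brevity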

I'm really curious to hear how others have mitigated this issue. Please help!

@jaredhancock31

I've also seen this issue. A DaemonSet is fully up and matching the number of nodes that exist in the cluster. Yet, the argo app for it perpetually spins in "progressing" state. Example:

total number of nodes: 100
daemonSet status:

status:
  currentNumberScheduled: 100
  desiredNumberScheduled: 100
  numberAvailable: 100
  numberMisscheduled: 0
  numberReady: 100
  observedGeneration: 2
  updatedNumberScheduled: 100

argoCD app message: Waiting for daemon set "foobar" rollout to finish: 99 of 100 updated pods are available

all pods in the DS are totally healthy. Curious if anyone has found a resolution to this as it leads to a lot of alerts for me.

ArgoCD Version: v2.8.0+804d4b8

@SweetOps commented Dec 1, 2023

Hi there, I know that's not a complete fix, but you can set up a custom health check for DaemonSet. In my example, the DS will be marked as Healthy once ~10% of its pods are ready.

    resource.customizations: |
      apps/DaemonSet:
        health.lua: |
          hs = {}
          hs.status = "Progressing"
          hs.message = ""
          if obj.status ~= nil then
            if obj.status.numberReady ~= nil and obj.status.desiredNumberScheduled ~= nil and obj.status.numberMisscheduled ~= nil then
              if obj.status.numberMisscheduled / obj.status.desiredNumberScheduled >= 0.1 then
                hs.status = "Degraded"
              elseif obj.status.numberReady / obj.status.desiredNumberScheduled >= 0.1 then
                hs.status = "Healthy"
              else
                hs.status = "Progressing"
              end
              hs.message = obj.status.numberReady .. "/" .. obj.status.desiredNumberScheduled .. " pods are ready"
            end
          end
          return hs

@sidewinder12s (Author)

Hi there, I know that's not a complete fix, but you can set up a custom health check for DaemonSet. In my example, the DS will be marked as Healthy once ~10% of its pods are ready.

    resource.customizations: |
      apps/DaemonSet:
        health.lua: |
          hs = {}
          hs.status = "Progressing"
          hs.message = ""
          if obj.status ~= nil then
            if obj.status.numberReady ~= nil and obj.status.desiredNumberScheduled ~= nil and obj.status.numberMisscheduled ~= nil then
              if obj.status.numberMisscheduled / obj.status.desiredNumberScheduled >= 0.1 then
                hs.status = "Degraded"
              elseif obj.status.numberReady / obj.status.desiredNumberScheduled >= 0.1 then
                hs.status = "Healthy"
              else
                hs.status = "Progressing"
              end
              hs.message = obj.status.numberReady .. "/" .. obj.status.desiredNumberScheduled .. " pods are ready"
            end
          end
          return hs

Huh, I hadn't thought about modifying that. Is there a reason you didn't go with the inverse and say it's healthy as long as 90% of pods are ready/available?

@SweetOps commented Dec 1, 2023

Huh, I hadn't thought about modifying that. Is there a reason you didn't go with the inverse and say it's healthy as long as 90% of pods are ready/available?

@sidewinder12s, I believe that waiting for 90% takes a while if your workload spans more than 100 nodes. Anyway, it's just an example and you can adapt it to your needs.
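
If you do want the inverse, a minimal sketch gating on availability instead could look like this (the 0.9 threshold and the choice of numberAvailable rather than numberReady are arbitrary; adapt them to your own alerting):

    resource.customizations: |
      apps/DaemonSet:
        health.lua: |
          hs = {}
          hs.status = "Progressing"
          hs.message = ""
          if obj.status ~= nil and obj.status.numberAvailable ~= nil and obj.status.desiredNumberScheduled ~= nil and obj.status.desiredNumberScheduled > 0 then
            if obj.status.numberAvailable / obj.status.desiredNumberScheduled >= 0.9 then
              hs.status = "Healthy"
            end
            hs.message = obj.status.numberAvailable .. "/" .. obj.status.desiredNumberScheduled .. " pods are available"
          end
          return hs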

@evheniyt

Inspired by the customization proposed above, we found another approach:

      resource.customizations: |
        apps/DaemonSet:
          health.lua: |
            hs = {}
            hs.status = "Progressing"
            hs.message = ""
            if obj.status ~= nil then
              if obj.status.updatedNumberScheduled ~= nil and obj.status.desiredNumberScheduled ~= nil then
                if obj.status.desiredNumberScheduled == obj.status.updatedNumberScheduled then
                  hs.status = "Healthy"
                else
                  hs.status = "Progressing"
                end
                hs.message = obj.status.updatedNumberScheduled .. "/" .. obj.status.desiredNumberScheduled .. " pods are updated"
              end
            end
            return hs

The idea, instead of using a % of ready Pods, is to mark the daemonset Progressing only when there are real changes to its spec (desiredNumberScheduled != updatedNumberScheduled). Because daemonsets update pods one by one by default, an approach based on % may never show the Progressing status. This approach lets us track changes in the daemonset.

@sidewinder12s (Author) commented Jan 28, 2025

Inspired by the customization proposed above, we found another approach:

  resource.customizations: |
    apps/DaemonSet:
      health.lua: |
        hs = {}
        hs.status = "Progressing"
        hs.message = ""
        if obj.status ~= nil then
          if obj.status.updatedNumberScheduled ~= nil and obj.status.desiredNumberScheduled ~= nil then
            if obj.status.desiredNumberScheduled == obj.status.updatedNumberScheduled then
              hs.status = "Healthy"
            else
              hs.status = "Progressing"
            end
            hs.message = obj.status.updatedNumberScheduled .. "/" .. obj.status.desiredNumberScheduled .. " pods are updated"
          end
        end
        return hs

The idea, instead of using a % of ready Pods, is to mark the daemonset Progressing only when there are real changes to its spec (desiredNumberScheduled != updatedNumberScheduled). Because daemonsets update pods one by one by default, an approach based on % may never show the Progressing status. This approach lets us track changes in the daemonset.

So far so good, though it appears this may not deal with the case where a DS has no scheduled pods or desired scheduled pods.

Yeah, if you have a DS that has never scheduled a pod, the updatedNumberScheduled status field isn't there.

This appears to deal with that:

    apps/DaemonSet:
      health.lua: |
        hs = {}
        hs.status = "Progressing"
        hs.message = ""
        if obj.status ~= nil then
          if obj.status.updatedNumberScheduled ~= nil and obj.status.desiredNumberScheduled ~= nil then
            if obj.status.desiredNumberScheduled == obj.status.updatedNumberScheduled then
              hs.status = "Healthy"
            else
              hs.status = "Progressing"
            end
            hs.message = obj.status.updatedNumberScheduled .. "/" .. obj.status.desiredNumberScheduled .. " pods are updated"
          -- If a daemonset has never scheduled a pod, updatedNumberScheduled is nil
          elseif obj.status.desiredNumberScheduled == 0 then
              hs.status = "Healthy"
          end
        end
        return hs
