
Application DaemonSet Progressing forever blocks sync waves #7448

Open
sidewinder12s opened this issue Oct 14, 2021 · 12 comments
Labels: bug/in-triage (needs further triage to be correctly classified), bug (something isn't working), sync-waves

Comments


Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

In very large clusters where nodes are constantly cycling and an ArgoCD Application is managing DaemonSets, the health status of the Application will often be stuck in Progressing for extended periods of time.

This blocks Sync Waves from functioning if you put those Applications in an early wave.
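
For illustration, this is the kind of setup where the blocking shows up; the Application name, repo URL, and wave number below are hypothetical, but the point is that the DaemonSet-owning app sits in an early wave of a parent app-of-apps, so every later wave waits on its health:

    # Hypothetical child Application in the parent app-of-apps
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: cni-plugin
      namespace: argocd
      annotations:
        # Early wave: apps in later waves wait until this one reports Healthy
        argocd.argoproj.io/sync-wave: "-1"
    spec:
      project: default
      source:
        repoURL: https://example.com/infra-charts.git
        path: charts/cni-plugin
        targetRevision: HEAD
      destination:
        server: https://kubernetes.default.svc
        namespace: kube-system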

I've also assumed this puts a lot of strain on ArgoCD, especially since I've noticed the UI keeping pods around in a degraded state that no longer appear to exist in the cluster, leading me to believe the cache or some other component might be in a degraded state.

I had wondered if this PR might help clear the situation with a hard refresh: #6463

To Reproduce

We've seen this on a few of our larger batch clusters that scale from 100 to 500 nodes at a time, with ArgoCD Applications in an App of Apps pattern that manage DaemonSets (CNI plugins, logging DaemonSets, etc.).

Expected behavior

I'd expect the Application to eventually finish progressing.

Version

argocd: v2.0.4+0842d44
  BuildDate: 2021-06-23T01:27:53Z
  GitCommit: 0842d448107eb1397b251e63ec4d4bc1b4efdd6e
  GitTreeState: clean
  GoVersion: go1.16
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.0.4+0842d44
  BuildDate: 2021-06-23T01:27:53Z
  GitCommit: 0842d448107eb1397b251e63ec4d4bc1b4efdd6e
  GitTreeState: clean
  GoVersion: go1.16
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v3.9.4 2021-02-09T19:22:10Z
  Helm Version: v3.5.1+g32c2223
  Kubectl Version: v0.20.4
  Jsonnet Version: v0.17.0

We've also scaled up our repo servers and Application Controller replicas to 8 to try and shard the large clusters out more.

Logs

Happy to try and provide any logs if someone can point me to which component they'd like to see. The logs otherwise are pretty noisy.

sidewinder12s added the bug label on Oct 14, 2021
@jannfis (Member) commented Oct 20, 2021

Is it only the health check that stays in a progressing state, or is the underlying DaemonSet not reporting readiness either?

jannfis added the bug/in-triage label on Oct 20, 2021
@sidewinder12s (Author) commented Oct 21, 2021

Is it only the health check that stays in a progressing state, or is the underlying DaemonSet not reporting readiness either?

@jannfis I believe the daemonset is usually overall reporting progressing (as the target count for the daemonset fluctuates), with the underlying pods being a mixed bag.

Many are healthy, some will report unhealthy as they start or stop, and some report unhealthy as they restart as part of their launch process (the VPC CNI plugin does this, for example, I think).

I believe I'm also seeing issues where the app controller gets behind and the cache goes stale, hanging onto pods that no longer exist. With 2.0.x it seemed the only way to really clear these was to restart the app controller, but I think I saw some changes in 2.1 that might let a hard refresh invalidate the app cache; I'm upgrading later today to see if this is improved.

Edit:

Taking a look at the AWS VPC CNI Plugin chart for example:

The health detail on the DaemonSet is Waiting for daemon set "aws-node" rollout to finish: 437 of 465 updated pods are available... as it scales up, even though the app has been synced/deployed for months.

The main app health is then simply progressing.

I'm not familiar enough with the Lua health scripts to know whether they can differentiate between a sync/rollout succeeding for a generation and pods generally reporting healthy when determining the app's overall health.
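
For what it's worth, my rough mental model of the built-in DaemonSet health check (the real implementation lives in gitops-engine and is Go code, so the Lua below is only my own approximation, not the shipped script) is something like this, which would explain why a fluctuating desired count keeps the app in Progressing:

    -- Approximation only: a Lua sketch of what the default DaemonSet health logic appears to do
    hs = {}
    hs.message = ""
    local status = obj.status or {}
    if obj.metadata.generation > (status.observedGeneration or 0) then
      hs.status = "Progressing"
      hs.message = "Waiting for rollout to finish: observed generation is behind desired generation"
    elseif (status.updatedNumberScheduled or 0) < (status.desiredNumberScheduled or 0) then
      hs.status = "Progressing"
      hs.message = "Waiting for rollout to finish: not all pods have been updated"
    elseif (status.numberAvailable or 0) < (status.desiredNumberScheduled or 0) then
      hs.status = "Progressing"
      hs.message = "Waiting for rollout to finish: not all updated pods are available"
    else
      hs.status = "Healthy"
    end
    return hs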

@sidewinder12s (Author)

On ArgoCD 2.1.4 I'm still observing issues with Applications with large DaemonSets (100+ pods/nodes) retaining stale pods in the DaemonSet that no longer exist. These present as Progressing, which may actually be what is making this issue seem worse than it is.

Any idea what component is falling behind here, or any way to bust that cache? If I describe one of the pods that reports as Progressing, the pod doesn't exist anymore when talking to the Kubernetes API (and the Argo UI presents an empty live manifest). The only way I had found to get it to properly update was to restart the application controllers (I am sharded across 8). Most of the rest of our metrics on ArgoCD components don't show any of them particularly overloaded.

I also have timeout.reconciliation set to 2 hours to reduce load on our git/helm repos.
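
In case it helps anyone else, that knob lives in the argocd-cm ConfigMap; a minimal sketch of how we have it set (assuming the usual duration string format is accepted):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: argocd-cm
      namespace: argocd
      labels:
        app.kubernetes.io/part-of: argocd
    data:
      # How often apps are re-reconciled against git/helm even without a change event
      timeout.reconciliation: 2h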

@eelkonio commented Nov 3, 2021

I've encountered the same problem. When one pod of a daemonset cannot be scheduled onto the node it belongs to because limits have been reached on the node, the pod will stay in a pending/Progressing status.
When the application is updated to a new version, all resources are updated except the daemonset. The old pods of the daemonset will continue to run until that one pod is scheduled correctly. Only then can the daemonset update all its pods again.

What I would expect (or wish to see) is that ArgoCD would update the daemonset and restart all pods accordingly, regardless of whether some pods are in a Progressing/pending state.

This has resulted in quite a few applications on multiple clusters not being updated for several releases until we noticed this behaviour. Now we manually delete any daemonsets of "Progressing" applications, which triggers the new version to come alive. Cumbersome, but the only way, I think, until a solution is presented in ArgoCD.

Cheers,
Eelko

@duxing commented Oct 20, 2022

Running into the same issue and would love to hear from the ArgoCD community what the proper solution is.

Context: I deployed a daemonset with its official Helm chart and one of the values was incorrect, causing pods to get stuck in an initializing state and the corresponding ArgoCD app to show a health status of Progressing.

I pushed a change to GitHub with the wrong values corrected, but I noticed the app claims to be in sync while the daemonset is still stuck in Progressing with the old values.

I've attempted a force sync, deleting the daemonset in ArgoCD, and deleting the daemonset via kubectl directly, and the daemonset is still defined with the old values.

output from argocd version:

argocd: v2.4.0+91aefab
  BuildDate: 2022-06-10T17:23:37Z
  GitCommit: 91aefabc5b213a258ddcfe04b8e69bb4a2dd2566
  GitTreeState: clean
  GoVersion: go1.18.3
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.4.0+91aefab
  BuildDate: 2022-06-10T17:23:37Z
  GitCommit: 91aefabc5b213a258ddcfe04b8e69bb4a2dd2566
  GitTreeState: clean
  GoVersion: go1.18.3
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.1+g5cb9af4
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

@davidmontoyago

Hi everyone, my team is encountering the same issue with DaemonSets and clusters of 150+ nodes. The symptoms are the following:

  1. Logs for application-controller pods report healthy syncing operations for the Application containing the DaemonSets. Log entries are of the sort: Skipping auto-sync: application status is Synced and No status changes. Skipping patch.
  2. Once we open the specific Application in the UI, ArgoCD's API server begins making LOTS of calls to service application.ApplicationService, specifically methods ListResourceActions and ListResourceLinks. As long as there are new pods being added to the DaemonSet and the UI is trying to render them, these calls keep flooding the logs, the Application status is reported as Progressing, and slowness in the UI becomes noticeable. These method calls eventually get rate limited by the K8s API server, translating into errors in the UI.
  3. We then click on the DaemonSet to inspect its manifests and status, but it's unable to render any details, just displaying a blank page and the error message Unable to load data: Request has been terminated. Possible causes: the network is offline, Origin is not allowed by Access-Control-Allow-Origin, the page is being unloaded, etc. In the browser console, we see lots of 400 and 404 errors, which are attempts at fetching pods that are no longer present. This continues until the browser becomes really slow; it reports up to 500 MB of content downloaded before it freezes. At the same time, we see lots of calls to the CanI method, which eventually triggers an error on our SSO, kicking the user session out.

about our environment:

  • ArgoCD v2.8.4+c279299.
  • our DaemonSets are currently configured with updateStrategy type set to RollingUpdate (see the sketch below).
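
For reference, a sketch of that strategy and the knob that controls rollout parallelism; the DaemonSet name and the 10% figure are just illustrations (the Kubernetes default for maxUnavailable is 1, which makes rollouts on 150+ node clusters crawl):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: example-daemonset  # hypothetical
    spec:
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          # Raising this lets more pods update in parallel, so the DS spends less time rolling
          maxUnavailable: "10%"
      # selector/template omitted for brevity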

I'm really curious to hear how others have mitigated this issue. Please help!

@jaredhancock31

I've also seen this issue. A DaemonSet is fully up and matching the number of nodes that exist in the cluster. Yet, the argo app for it perpetually spins in "progressing" state. Example:

total number of nodes: 100
daemonSet status:

status:
  currentNumberScheduled: 100
  desiredNumberScheduled: 100
  numberAvailable: 100
  numberMisscheduled: 0
  numberReady: 100
  observedGeneration: 2
  updatedNumberScheduled: 100

argoCD app message: Waiting for daemon set "foobar" rollout to finish: 99 of 100 updated pods are available

all pods in the DS are totally healthy. Curious if anyone has found a resolution to this as it leads to a lot of alerts for me.

ArgoCD Version: v2.8.0+804d4b8

@SweetOps commented Dec 1, 2023

Hi there, I know that's not a complete fix, but you can set up a custom health check for DaemonSet. In my example, the DS will be marked as Healthy once ~10% of its pods are ready.

    resource.customizations: |
      apps/DaemonSet:
        health.lua: |
          hs = {}
          hs.status = "Progressing"
          hs.message = ""
          if obj.status ~= nil then
            if obj.status.numberReady ~= nil and obj.status.desiredNumberScheduled ~= nil and obj.status.numberMisscheduled ~= nil then
              if obj.status.numberMisscheduled / obj.status.desiredNumberScheduled >= 0.1 then
                hs.status = "Degraded"
              elseif obj.status.numberReady / obj.status.desiredNumberScheduled >= 0.1 then
                hs.status = "Healthy"
              else
                hs.status = "Progressing"
              end
              hs.message = obj.status.numberReady .. "/" .. obj.status.desiredNumberScheduled .. " pods are ready"
            end
          end
          return hs

@sidewinder12s (Author)

Hi there, I know that's not a complete fix, but you can set up a custom health check for DaemonSet. In my example, the DS will be marked as Healthy once ~10% of its pods are ready.

    resource.customizations: |
      apps/DaemonSet:
        health.lua: |
          hs = {}
          hs.status = "Progressing"
          hs.message = ""
          if obj.status ~= nil then
            if obj.status.numberReady ~= nil and obj.status.desiredNumberScheduled ~= nil and obj.status.numberMisscheduled ~= nil then
              if obj.status.numberMisscheduled / obj.status.desiredNumberScheduled >= 0.1 then
                hs.status = "Degraded"
              elseif obj.status.numberReady / obj.status.desiredNumberScheduled >= 0.1 then
                hs.status = "Healthy"
              else
                hs.status = "Progressing"
              end
              hs.message = obj.status.numberReady .. "/" .. obj.status.desiredNumberScheduled .. " pods are ready"
            end
          end
          return hs

Huh, I hadn't thought about modifying that. Is there a reason you didn't go with the inverse and say it's healthy as long as 90% of pods are ready/available?

@SweetOps commented Dec 1, 2023

Huh, I hadn't thought about modifying that. Is there a reason you didn't go with the inverse and say it's healthy as long as 90% of pods are ready/available?

@sidewinder12s, I believe that waiting for 90% takes a while if your workload spans more than 100 nodes. Anyway, it's just an example and you can adapt it to your needs.
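
If you do want the inverse, a minimal sketch gating on availability instead could look like this (the 0.9 threshold and the choice of numberAvailable rather than numberReady are arbitrary; adapt them to your own alerting):

    resource.customizations: |
      apps/DaemonSet:
        health.lua: |
          hs = {}
          hs.status = "Progressing"
          hs.message = ""
          if obj.status ~= nil and obj.status.numberAvailable ~= nil and obj.status.desiredNumberScheduled ~= nil and obj.status.desiredNumberScheduled > 0 then
            if obj.status.numberAvailable / obj.status.desiredNumberScheduled >= 0.9 then
              hs.status = "Healthy"
            end
            hs.message = obj.status.numberAvailable .. "/" .. obj.status.desiredNumberScheduled .. " pods are available"
          end
          return hs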

@evheniyt

Inspired by the customization proposed above, we found another approach:

      resource.customizations: |
        apps/DaemonSet:
          health.lua: |
            hs = {}
            hs.status = "Progressing"
            hs.message = ""
            if obj.status ~= nil then
              if obj.status.updatedNumberScheduled ~= nil and obj.status.desiredNumberScheduled ~= nil then
                if obj.status.desiredNumberScheduled == obj.status.updatedNumberScheduled then
                  hs.status = "Healthy"
                else
                  hs.status = "Progressing"
                end
                hs.message = obj.status.updatedNumberScheduled .. "/" .. obj.status.desiredNumberScheduled .. " pods are updated"
              end
            end
            return hs

The idea, instead of using a % of ready Pods, is to mark the daemonset Progressing only when there are real changes to its spec (desiredNumberScheduled != updatedNumberScheduled). Because daemonsets update pods one by one by default, an approach based on % may never show the Progressing status. This approach lets us track changes in the daemonset.

@sidewinder12s (Author) commented Jan 28, 2025

Inspired by the customization proposed above, we found another approach:

  resource.customizations: |
    apps/DaemonSet:
      health.lua: |
        hs = {}
        hs.status = "Progressing"
        hs.message = ""
        if obj.status ~= nil then
          if obj.status.updatedNumberScheduled ~= nil and obj.status.desiredNumberScheduled ~= nil then
            if obj.status.desiredNumberScheduled == obj.status.updatedNumberScheduled then
              hs.status = "Healthy"
            else
              hs.status = "Progressing"
            end
            hs.message = obj.status.updatedNumberScheduled .. "/" .. obj.status.desiredNumberScheduled .. " pods are updated"
          end
        end
        return hs

The idea, instead of using a % of ready Pods, is to mark the daemonset Progressing only when there are real changes to its spec (desiredNumberScheduled != updatedNumberScheduled). Because daemonsets update pods one by one by default, an approach based on % may never show the Progressing status. This approach lets us track changes in the daemonset.

So far so good, though it appears this may not deal with the case where a DS has no scheduled pods or desired scheduled pods.

Yeah, if you have a DS that has never scheduled a pod, the updatedNumberScheduled status field isn't there.

This appears to deal with that:

    apps/DaemonSet:
      health.lua: |
        hs = {}
        hs.status = "Progressing"
        hs.message = ""
        if obj.status ~= nil then
          if obj.status.updatedNumberScheduled ~= nil and obj.status.desiredNumberScheduled ~= nil then
            if obj.status.desiredNumberScheduled == obj.status.updatedNumberScheduled then
              hs.status = "Healthy"
            else
              hs.status = "Progressing"
            end
            hs.message = obj.status.updatedNumberScheduled .. "/" .. obj.status.desiredNumberScheduled .. " pods are updated"
          -- If a daemonset has never scheduled a pod, updatedNumberScheduled is nil
          elseif obj.status.desiredNumberScheduled == 0 then
              hs.status = "Healthy"
          end
        end
        return hs
