Application DaemonSet Progressing forever blocks sync waves #7448
Comments
Is it only the health checks that stay in a progressing state, or is the underlying |
@jannfis I believe the daemonset is usually reporting Progressing overall (as the target count for the daemonset fluctuates), with the underlying pods being a mixed bag. Many are healthy, some report unhealthy as they start or stop, and some report unhealthy as they restart as part of their launch process (the VPC CNI plugin does this, for example, I think). I believe I'm also seeing issues where the app controller falls behind and then the cache goes stale/hangs onto pods that no longer exist. With 2.0.x it seemed the only way to really clear these was to restart the app controller, but I think I saw some changes in 2.1 that might let a hard refresh invalidate the app cache; I'm upgrading later today to see if this is improved.

Edit: Taking a look at the AWS VPC CNI Plugin chart for example, the health details on the daemonset are: The main app health is then simply Progressing. I'm not super familiar with the Lua health scripts, so I don't know whether they can differentiate between a sync/rollout being successful for a generation and pods generally reporting healthy when computing the app's overall health. |
I'm still observing issues in ArgoCD 2.1.4 where Applications with large daemonsets (100+ pods/nodes) retain stale pods in the daemonset that no longer exist. These present as Progressing, which may actually be what is making this issue seem worse than it is. Any idea what component is falling behind here, or any way to bust that cache? If I describe one of the pods that reports as I also have |
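For reference, a hard refresh can be requested per application without restarting the controller, which bypasses the cached manifests for that app (a minimal sketch; the app name my-daemonset-app is hypothetical and the flags assume a 2.1-era argocd CLI):

  # Ask the API server for a hard refresh of one application
  argocd app get my-daemonset-app --hard-refresh

  # Equivalent via annotation, picked up and then removed by the controller
  kubectl -n argocd annotate application my-daemonset-app argocd.argoproj.io/refresh=hard --overwrite
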
I've encountered the same problem. When one pod of a daemonset cannot be scheduled onto the node it belongs to because the node's limits have been reached, the pod stays in a Pending/Progressing status. What I would expect/wish to see is that ArgoCD would update the daemonset and restart all pods accordingly, no matter whether pods are in a Progressing/Pending state or not. This has resulted in quite a few applications on multiple clusters not being updated for several releases until we noticed this behaviour. For now we manually delete any daemonsets of "Progressing" applications, which triggers the new version to come alive. Cumbersome, but the only way I know of until a solution is presented in ArgoCD. Cheers, |
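A minimal sketch of that manual workaround with kubectl (namespace and daemonset name are hypothetical; deleting the DaemonSet lets ArgoCD recreate it at the desired revision on the next sync):

  # Delete the stuck DaemonSet; ArgoCD recreates it from Git on the next sync
  kubectl -n kube-system delete daemonset my-node-agent

  # Or, less disruptively, try cycling the existing pods first
  kubectl -n kube-system rollout restart daemonset/my-node-agent
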
running into the same issue and would love to hear from
context: deployed a
I pushed a change to github with the wrong values corrected, but I noticed the app is claimed to be in sync but the
I've attempted force sync, deleting the
the output from
|
hi everyone, my team is encountering the same issue with
about our environment:
i'm really curious to hear how others have mitigated this issue. please help! |
I've also seen this issue. A DaemonSet is fully up and matching the number of nodes that exist in the cluster, yet the Argo app for it perpetually spins in the "progressing" state.

Example, with a total of 100 nodes in the cluster:

status:
  currentNumberScheduled: 100
  desiredNumberScheduled: 100
  numberAvailable: 100
  numberMisscheduled: 0
  numberReady: 100
  observedGeneration: 2
  updatedNumberScheduled: 100

ArgoCD app message:

All pods in the DS are totally healthy. Curious if anyone has found a resolution to this, as it leads to a lot of alerts for me.

ArgoCD Version: |
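When debugging this it can help to pull just the status fields the health scripts look at straight off the live object and compare them with what the UI shows (a sketch; namespace and daemonset name are hypothetical):

  # Status counters for a single DaemonSet
  kubectl -n kube-system get daemonset my-node-agent \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.updatedNumberScheduled} {.status.numberReady} {.status.numberAvailable}{"\n"}'

  # Or across every DaemonSet in the cluster
  kubectl get ds -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,UPDATED:.status.updatedNumberScheduled,READY:.status.numberReady
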
Hi there, I know that's not a complete fix, however you can set up a custom health check for DaemonSet. In my example, the ds will be marked Degraded once at least 10% of the desired pods are misscheduled, and Healthy once at least 10% are ready:

resource.customizations: |
  apps/DaemonSet:
    health.lua: |
      hs = {}
      hs.status = "Progressing"
      hs.message = ""
      if obj.status ~= nil then
        if obj.status.numberReady ~= nil and obj.status.desiredNumberScheduled ~= nil and obj.status.numberMisscheduled ~= nil then
          if obj.status.numberMisscheduled / obj.status.desiredNumberScheduled >= 0.1 then
            hs.status = "Degraded"
          elseif obj.status.numberReady / obj.status.desiredNumberScheduled >= 0.1 then
            hs.status = "Healthy"
          else
            hs.status = "Progressing"
          end
          hs.message = obj.status.numberReady .. "/" .. obj.status.desiredNumberScheduled .. " pods are ready"
        end
      end
      return hs
|
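For completeness, resource.customizations lives under the data section of the argocd-cm ConfigMap; a minimal sketch of the surrounding resource (namespace and labels may differ in your install):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: argocd-cm
    namespace: argocd
    labels:
      app.kubernetes.io/part-of: argocd
  data:
    resource.customizations: |
      apps/DaemonSet:
        health.lua: |
          hs = {}
          -- paste the health check from the comment above here
          return hs
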
Huh, I hadn't thought about modifying that. Is there a reason you didn't go with the inverse and say it's healthy as long as 90% of pods are ready/available? |
@sidewinder12s, I believe that waiting for 90% can take a while if your workload runs on more than 100 nodes. Anyway, it's just an example and you can adapt it to your needs. |
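For reference, the inverse check the question describes might look roughly like this (a sketch, not something posted in the thread; it marks the ds Healthy only once at least 90% of the desired pods are available, and treats an empty DaemonSet as Healthy to avoid dividing by zero):

  apps/DaemonSet:
    health.lua: |
      hs = {}
      hs.status = "Progressing"
      hs.message = ""
      if obj.status ~= nil and obj.status.desiredNumberScheduled ~= nil and obj.status.numberAvailable ~= nil then
        if obj.status.desiredNumberScheduled == 0 then
          hs.status = "Healthy"
        elseif obj.status.numberAvailable / obj.status.desiredNumberScheduled >= 0.9 then
          hs.status = "Healthy"
        end
        hs.message = obj.status.numberAvailable .. "/" .. obj.status.desiredNumberScheduled .. " pods are available"
      end
      return hs
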
Inspired by the customization proposed above, we found another approach:

resource.customizations: |
  apps/DaemonSet:
    health.lua: |
      hs = {}
      hs.status = "Progressing"
      hs.message = ""
      if obj.status ~= nil then
        if obj.status.updatedNumberScheduled ~= nil and obj.status.desiredNumberScheduled ~= nil then
          if obj.status.desiredNumberScheduled == obj.status.updatedNumberScheduled then
            hs.status = "Healthy"
          else
            hs.status = "Progressing"
          end
          hs.message = obj.status.updatedNumberScheduled .. "/" .. obj.status.desiredNumberScheduled .. " pods are updated"
        end
      end
      return hs

The idea is, instead of using the % of ready pods, to mark the daemonset Healthy once all of its scheduled pods have been updated to the current generation (updatedNumberScheduled == desiredNumberScheduled). |
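This is close to part of what kubectl itself waits on for a DaemonSet rollout, so the new health status can be cross-checked against the cluster directly (a sketch; namespace and daemonset name are hypothetical):

  # Returns once the updated pods are scheduled and available, or times out
  kubectl -n kube-system rollout status daemonset/my-node-agent --timeout=5m
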
So far so good, though it appears this may not deal with the case where a DS has no scheduled pods or desired scheduled pods. Yeah, if you have a DS that has never scheduled a pod, the updatedNumberScheduled field is nil, so the check above never marks it Healthy. This appears to deal with that:

apps/DaemonSet:
  health.lua: |
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil then
      if obj.status.updatedNumberScheduled ~= nil and obj.status.desiredNumberScheduled ~= nil then
        if obj.status.desiredNumberScheduled == obj.status.updatedNumberScheduled then
          hs.status = "Healthy"
        else
          hs.status = "Progressing"
        end
        hs.message = obj.status.updatedNumberScheduled .. "/" .. obj.status.desiredNumberScheduled .. " pods are updated"
      -- If a daemonset has never scheduled a pod, updatedNumberScheduled is nil
      elseif obj.status.desiredNumberScheduled == 0 then
        hs.status = "Healthy"
      end
    end
    return hs
|
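If you want to sanity-check the nil handling before putting it in argocd-cm, the script logic can be exercised as plain Lua with a stubbed obj (a sketch; assumes a local Lua 5.x interpreter, which is not how Argo CD itself evaluates health scripts):

  -- save as ds_health_check.lua and run with: lua ds_health_check.lua
  local function health(obj)
    local hs = { status = "Progressing", message = "" }
    if obj.status ~= nil then
      if obj.status.updatedNumberScheduled ~= nil and obj.status.desiredNumberScheduled ~= nil then
        if obj.status.desiredNumberScheduled == obj.status.updatedNumberScheduled then
          hs.status = "Healthy"
        end
        hs.message = obj.status.updatedNumberScheduled .. "/" .. obj.status.desiredNumberScheduled .. " pods are updated"
      elseif obj.status.desiredNumberScheduled == 0 then
        -- never-scheduled DaemonSet: updatedNumberScheduled is absent
        hs.status = "Healthy"
      end
    end
    return hs
  end

  print(health({ status = { desiredNumberScheduled = 0 } }).status)                                 -- Healthy
  print(health({ status = { desiredNumberScheduled = 100, updatedNumberScheduled = 42 } }).status)  -- Progressing
  print(health({ status = { desiredNumberScheduled = 100, updatedNumberScheduled = 100 } }).status) -- Healthy
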
Checklist:

I've pasted the output of argocd version.

Describe the bug
In very large clusters where nodes are constantly cycling and an ArgoCD Application is managing DaemonSets, the health status of the Application will often be stuck in Progressing for extended periods of time.
This has the effect of blocking Sync Waves from functioning if you put those Applications in an early wave.
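For context, the wave ordering that gets blocked comes from the sync-wave annotation on the child Applications in the app-of-apps; a minimal sketch of the pattern (names, repo URL, and wave number are hypothetical):

  apiVersion: argoproj.io/v1alpha1
  kind: Application
  metadata:
    name: cni-plugin
    namespace: argocd
    annotations:
      argocd.argoproj.io/sync-wave: "0"   # later waves wait for this app to become Healthy
  spec:
    project: default
    source:
      repoURL: https://github.com/example/infra-charts
      path: charts/cni-plugin
      targetRevision: HEAD
    destination:
      server: https://kubernetes.default.svc
      namespace: kube-system
    syncPolicy:
      automated: {}

While the DaemonSet (and therefore this Application) reports Progressing, the parent sync never advances to the later waves.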
I've also assumed this puts a lot of strain on ArgoCD, especially since I've noticed it keeping pods around in a degraded state in the UI that no longer appear to exist in the cluster, leading me to believe the cache or some other component might be in a degraded state.
I had wondered if this PR might help clear the situation with a hard refresh: #6463
To Reproduce
We've seen this on a few of our larger batch clusters that scale between 100 and 500 nodes at a time, with ArgoCD Applications in an App of Apps pattern that manage daemonsets (CNI plugins, logging daemonsets, etc.).
Expected behavior
I'd expect the Application to eventually finish progressing.
Version
We've also scaled up our repo servers and Application Controller replicas to 8 to try and shard the large clusters out more.
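For anyone reproducing this setup: controller sharding is wired by scaling the application controller StatefulSet and telling it how many replicas exist so managed clusters get distributed across them (a sketch; the exact mechanism can differ between Argo CD versions, so check the HA docs for your release):

  kubectl -n argocd scale statefulset argocd-application-controller --replicas 8
  kubectl -n argocd set env statefulset/argocd-application-controller ARGOCD_CONTROLLER_REPLICAS=8
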
Logs
Happy to try and provide any logs if someone can point me to which component they'd like to see. The logs otherwise are pretty noisy.