TLS Secret Not Found Causes Upstream Timeout on Startup #12926
Comments
This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Your issue description does not state whether live traffic was hitting the cluster while the controller pod(s) restarted. It is not possible to guess, because there is literally no data provided that can be analyzed, such as the complete controller logs and the output of all the kubectl commands asked for in the new bug report template. Since the reproduce steps hint that you restarted an existing controller, it is highly likely that traffic was persistent throughout the restart. If so, there is currently no instrumentation in the controller to do what you expect, because the Kubernetes Ingress API will match an incoming request to a rule before the controller has finished locating and handling requirements like the certs in the TLS sections. You can stop traffic when you restart the controller, or adjust the various probes and extend the liveness/readiness settings; a sketch of that follows below.
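For reference, a minimal sketch of what extending the probes could look like on the controller Deployment. The port is ingress-nginx's default health-check port; the timing values are illustrative assumptions, not tested recommendations:

```yaml
# Hypothetical tweak to the controller container's readiness probe: delay
# readiness so the new pod receives no traffic until it has had time to
# sync its secrets.
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254            # ingress-nginx's default health-check port
  initialDelaySeconds: 60  # illustrative: matches the ~60s window reported here
  periodSeconds: 10
  failureThreshold: 3
```

Note that delaying readiness only keeps traffic away from the new pod for a fixed window; it does not actually confirm that the secret has been read.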
Hi @longwuyuan, thanks for looking into this. It was actually a rolling upgrade rather than restarting pods (the earlier reproduce step was an attempt to simplify things, sorry for the confusion). The pod ran on a new VM, so there was no traffic to it beforehand. I updated the issue with more logs and uploaded everything from the time the pod was created up to the timestamp of the last "upstream timed out" log entry; the logs after that just indicate successful handling and forwarding. Could you take a look again?
@juniwang Thank you for the clarification; it explains the use case. But the problem remains: for established connections there will be no new TLS negotiation, in theory, but for any new connection a TLS cert needs to be presented by the server. If the controller pod is not fully synchronized and has not yet read and accessed the required certificate in a secret, simply because there was a rolling upgrade, that does not change the TLS handshake process and the error will be logged. The only way to avoid the error message is to stop any HTTPS request from reaching the cluster for the short period until the controller pod has fully read the secret of type TLS.
@longwuyuan yes, I agree that it's because the required TLS cert is not ready. We are trying to add a startup probe to make sure the cert is ready before the pod takes any traffic, but this is just a workaround. There is nothing special in our setup (we fetch the cert from Azure Key Vault via the CSI driver, mount it as a TLS cert, and start ingress). I just want to learn whether this is a general issue in the ingress controller and whether there is a better way to stop the error. Any suggestion is appreciated.
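Since a plain /healthz probe cannot see certificate state, one way to gate startup is an exec probe that checks which certificate the controller is actually serving. A rough sketch, assuming example.com stands in for a real host from the Ingress tls section and that the openssl binary is available inside the controller container (if it is not, the same check could run from a sidecar):

```yaml
startupProbe:
  exec:
    command:
      - sh
      - -c
      # Succeed only once the served certificate's subject matches the
      # expected host; example.com is a placeholder for a real tls host.
      - >
        echo | openssl s_client -connect 127.0.0.1:443
        -servername example.com 2>/dev/null
        | openssl x509 -noout -subject | grep -q example.com
  periodSeconds: 5
  failureThreshold: 30   # allow up to ~150s for the secret to sync
```

Until the secret is synced, ingress-nginx serves its self-signed fallback certificate, so the subject match fails and the probe keeps the pod out of rotation.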
I apologize, I don't know of a faster way to sync state. On the other hand, I sometimes think about the basic realities: regardless of how secure my vault of secrets is, all Kubernetes secrets can be base64-decoded. So a security expert should advise on best practices, in case you want to fetch secrets once during cluster creation instead of repeatedly fetching the same cert.
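To make the base64 point concrete, this is all a secret of type TLS is: the PEM files, base64-encoded (not encrypted) in the cluster. A sketch with hypothetical names and truncated placeholder data:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: example-com-tls        # hypothetical name, referenced by the Ingress
  namespace: default
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTi...   # base64 of the PEM certificate (placeholder)
  tls.key: LS0tLS1CRUdJTi...   # base64 of the PEM private key (placeholder)
```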
You are right, there must be ways to improve the secrets loading. Besides this, is it possible to add some sort of certificate validation to ingress's built-in healthz API?
TLS validation does not need new code, I think. It's built into HTTPS.
That's not what I mean. Most ingress controller users rely on the default built-in healthz endpoint for their probes, which returns 200 even while TLS secrets are still being synced. I understand not all users need this; maybe an option to let users control it?
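For context, the probes under discussion look roughly like this; they only report the controller's own process health and say nothing about Ingress TLS secrets (the port is ingress-nginx's default health-check port):

```yaml
# Stock-style probes on the controller container: a 200 from /healthz means
# the controller process is up, not that any TLS secret has been loaded.
livenessProbe:
  httpGet:
    path: /healthz
    port: 10254
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
```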
Sorry for not being clear, and thanks for explaining. The 200 from healthz reflects the state of the controller, that is, of the things that are part of the controller code. Ingress resources are objects outside the controller, so their certs would not completely fit the description of healthz. It's unfortunate that you have this situation and that there is no known solution for this problem. I think the only action currently possible is to stop traffic to the ingress resources during an upgrade, open traffic only when everything is in sync, and meanwhile ignore all the cert-related errors. But one aspect here is a complete mystery to me: if the certificates for the different ingress resources' tls sections are in the vault, why do you have to fetch them anew each time the controller restarts? If you fetch a certificate one time, it will remain as a secret in the cluster until you delete it.
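To illustrate that last point: the Ingress only references the secret by name, so a secret created once stays usable across controller restarts until someone deletes it. A minimal sketch with hypothetical names:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress            # hypothetical
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - example.com
      secretName: example-com-tls  # the secret sketched above; must exist
                                   # before TLS handshakes can succeed
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-svc  # hypothetical backend service
                port:
                  number: 80
```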
Description:
We are experiencing an issue where the NGINX Ingress Controller logs errors related to missing TLS secrets during its startup phase. This results in upstream timeouts, but after approximately 60 seconds, everything starts working as expected.
Logs Observed:
Logs_2025_03_04_09_04.csv
Steps to Reproduce:
1. Run the ingress-nginx controller with a TLS cert fetched from Azure Key Vault via the CSI driver and mounted as a secret of type TLS.
2. Perform a rolling upgrade of the controller so that a new pod starts on a new VM.
3. Send HTTPS requests to the cluster as soon as the new pod starts receiving traffic.
Expected Behavior:
HTTPS requests are served correctly as soon as a new controller pod begins receiving traffic, without transient TLS errors.
Actual Behavior:
For roughly the first 60 seconds after startup, the controller logs "tls secret not found" and requests fail with upstream timeout errors (110: Operation timed out).
Environment:
- Ingress-NGINX controller: v1.10.6
- NGINX: nginx/1.25.5
- Kubernetes: 1.30.7
Additional Context:
It appears that the Ingress Controller is attempting to process requests before the TLS secret is fully available, leading to these transient errors. We would like to understand if there's a way to delay the initialization or improve handling in such scenarios.
Would appreciate any insights or guidance on potential workarounds or fixes! 🚀