TLS Secret Not Found Causes Upstream Timeout on Startup #12926

Open
juniwang opened this issue Mar 3, 2025 · 10 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@juniwang

juniwang commented Mar 3, 2025

Description:

We are experiencing an issue where the NGINX Ingress Controller logs errors related to missing TLS secrets during its startup phase. This results in upstream timeouts, but after approximately 60 seconds, everything starts working as expected.

Logs Observed:

-------------------------------------------------------------------------------
2025-03-03T02:00:00.0000000Z,NGINX Ingress controller
2025-03-03T02:00:00.0000000Z,"  Release:       v1.10.6"
2025-03-03T02:00:00.0000000Z,"  Build:         git-ae204a74d"
2025-03-03T02:00:00.0000000Z,"  Repository:    https://github.com/kubernetes/ingress-nginx"
2025-03-03T02:00:00.0000000Z,"  nginx version: nginx/1.25.5"
2025-03-03T02:00:00.0000000Z,
2025-03-03T02:00:00.0000000Z,-------------------------------------------------------------------------------
2025-03-03T02:00:00.0000000Z,
2025-03-03T02:00:00.0000000Z,W0303 02:00:50.146369       7 client_config.go:667] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2025-03-03T02:00:00.0000000Z,"I0303 02:00:50.146583       7 main.go:205] ""Creating API client"" host=""https://10.16.0.1:443"""
2025-03-03T02:00:00.0000000Z,"I0303 02:00:50.168673       7 main.go:248] ""Running in Kubernetes cluster"" major=""1"" minor=""30"" git=""v1.30.7"" state=""clean"" commit=""0c76c645d5a665cfeb736719b1cc47354193dc9a"" platform=""linux/amd64"""
2025-03-03T02:00:00.0000000Z,"I0303 02:00:50.387078       7 main.go:101] ""SSL fake certificate created"" file=""/etc/ingress-controller/ssl/default-fake-certificate.pem"""
2025-03-03T02:00:00.0000000Z,"I0303 02:00:50.425829       7 nginx.go:267] ""Starting NGINX Ingress controller"""
2025-03-03T02:00:00.0000000Z,"I0303 02:00:50.428885       7 store.go:531] ""adding ingressclass as ingress-class-by-name is configured"" ingressclass=""<redacted>"""
2025-03-03T02:00:00.0000000Z,"I0303 02:00:50.433040       7 event.go:377] Event(v1.ObjectReference{Kind:""ConfigMap"", Namespace:""mynamespace"", Name:""<redacted>"", UID:""5687de03-50d3-494b-b682-206d582ac75d"", APIVersion:""v1"", ResourceVersion:""126794660"", FieldPath:""""}): type: 'Normal' reason: 'CREATE' ConfigMap <redacted>"
2025-03-03T02:00:00.0000000Z,"I0303 02:00:51.530096       7 store.go:440] ""Found valid IngressClass"" ingress=""<redacted>"" ingressclass=""<redacted>"""
2025-03-03T02:00:00.0000000Z,"I0303 02:00:51.530293       7 event.go:377] Event(v1.ObjectReference{Kind:""Ingress"", Namespace:""mynamespace"", Name:""<redacted>"", UID:""ed1d133f-3250-47f6-add2-fe00d560be5a"", APIVersion:""networking.k8s.io/v1"", ResourceVersion:""126794735"", FieldPath:""""}): type: 'Normal' reason: 'Sync' Scheduled for sync"
2025-03-03T02:00:00.0000000Z,I0303 02:00:51.628195       7 leaderelection.go:257] attempting to acquire leader lease <redacted>...
2025-03-03T02:00:00.0000000Z,"I0303 02:00:51.628234       7 nginx.go:310] ""Starting NGINX process"""
2025-03-03T02:00:00.0000000Z,"W0303 02:00:51.628826       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:00:00.0000000Z,"W0303 02:00:51.628894       7 controller.go:1435] Error getting SSL certificate ""my-tls-secret"": local SSL certificate my-tls-secret was not found. Using default certificate"
2025-03-03T02:00:00.0000000Z,"W0303 02:00:51.628911       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:00:00.0000000Z,"W0303 02:00:51.629037       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:00:00.0000000Z,"I0303 02:00:51.629113       7 controller.go:190] ""Configuration changes detected, backend reload required"""
2025-03-03T02:00:00.0000000Z,"I0303 02:00:51.632136       7 status.go:84] ""New leader elected"" identity=""<redacted>-6cf7b498bkfvgk"""
2025-03-03T02:00:00.0000000Z,"W0303 02:00:51.634237       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:00:00.0000000Z,"I0303 02:00:51.693386       7 controller.go:210] ""Backend successfully reloaded"""
2025-03-03T02:00:00.0000000Z,"I0303 02:00:51.693489       7 controller.go:221] ""Initial sync, sleeping for 1 second"""
2025-03-03T02:00:00.0000000Z,"I0303 02:00:51.693543       7 event.go:377] Event(v1.ObjectReference{Kind:""Pod"", Namespace:""mynamespace"", Name:""myIngressController"", UID:""8307f3b3-1c06-4319-b232-8986b7b78181"", APIVersion:""v1"", ResourceVersion:""130033760"", FieldPath:""""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration"
2025-03-03T02:00:00.0000000Z,"W0303 02:00:55.506421       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:00:00.0000000Z,"W0303 02:00:55.506489       7 controller.go:1435] Error getting SSL certificate ""my-tls-secret"": local SSL certificate my-tls-secret was not found. Using default certificate"
2025-03-03T02:00:00.0000000Z,"W0303 02:00:55.506496       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:00:00.0000000Z,"W0303 02:00:55.506527       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:00:00.0000000Z,"W0303 02:00:58.840193       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:00:00.0000000Z,"W0303 02:00:58.840258       7 controller.go:1435] Error getting SSL certificate ""my-tls-secret"": local SSL certificate my-tls-secret was not found. Using default certificate"
2025-03-03T02:00:00.0000000Z,"W0303 02:00:58.840266       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:00:00.0000000Z,"W0303 02:00:58.840345       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:00:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:02.173234       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:02.173289       7 controller.go:1435] Error getting SSL certificate ""my-tls-secret"": local SSL certificate my-tls-secret was not found. Using default certificate"
2025-03-03T02:01:00.0000000Z,"W0303 02:01:02.173301       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:02.173431       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:05.507434       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:05.507489       7 controller.go:1435] Error getting SSL certificate ""my-tls-secret"": local SSL certificate my-tls-secret was not found. Using default certificate"
2025-03-03T02:01:00.0000000Z,"W0303 02:01:05.507495       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:05.507535       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:08.841072       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:08.841120       7 controller.go:1435] Error getting SSL certificate ""my-tls-secret"": local SSL certificate my-tls-secret was not found. Using default certificate"
2025-03-03T02:01:00.0000000Z,"W0303 02:01:08.841128       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:08.841164       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:09 [crit] 34#34: *615 SSL_do_handshake() failed (SSL: error:0A000119:SSL routines::decryption failed or bad record mac error:0A000139:SSL routines::record layer failure) while SSL handshaking, client: <redacted IP>, server: 0.0.0.0:443"
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:10 [crit] 32#32: *702 SSL_do_handshake() failed (SSL: error:0A000119:SSL routines::decryption failed or bad record mac error:0A000139:SSL routines::record layer failure) while SSL handshaking, client: <redacted IP>, server: 0.0.0.0:443"
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:11 [crit] 38#38: *863 SSL_do_handshake() failed (SSL: error:0A000119:SSL routines::decryption failed or bad record mac error:0A000139:SSL routines::record layer failure) while SSL handshaking, client: <redacted IP>, server: 0.0.0.0:443"
2025-03-03T02:01:00.0000000Z,"W0303 02:01:12.173904       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:12.174015       7 controller.go:1435] Error getting SSL certificate ""my-tls-secret"": local SSL certificate my-tls-secret was not found. Using default certificate"
2025-03-03T02:01:00.0000000Z,"W0303 02:01:12.174046       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"W0303 02:01:12.174107       7 controller.go:1232] Error loading custom default certificate, falling back to generated default:"
2025-03-03T02:01:00.0000000Z,local SSL certificate my-tls-secret was not found
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:14 [crit] 39#39: *1470 SSL_do_handshake() failed (SSL: error:0A000119:SSL routines::decryption failed or bad record mac error:0A000139:SSL routines::record layer failure) while SSL handshaking, client: <redacted IP>, server: 0.0.0.0:443"
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:17 [crit] 32#32: *2025 SSL_do_handshake() failed (SSL: error:0A000119:SSL routines::decryption failed or bad record mac error:0A000139:SSL routines::record layer failure) while SSL handshaking, client: <redacted IP>, server: 0.0.0.0:443"
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:18 [crit] 34#34: *2286 SSL_do_handshake() failed (SSL: error:0A000119:SSL routines::decryption failed or bad record mac error:0A000139:SSL routines::record layer failure) while SSL handshaking, client: <redacted IP>, server: 0.0.0.0:443"
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:18 [crit] 35#35: *2405 SSL_do_handshake() failed (SSL: error:0A000119:SSL routines::decryption failed or bad record mac error:0A000139:SSL routines::record layer failure) while SSL handshaking, client: <redacted IP>, server: 0.0.0.0:443"
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:19 [crit] 34#34: *2549 SSL_do_handshake() failed (SSL: error:0A000119:SSL routines::decryption failed or bad record mac error:0A000139:SSL routines::record layer failure) while SSL handshaking, client: <redacted IP>, server: 0.0.0.0:443"
2025-03-03T02:01:00.0000000Z,"E0303 02:01:20.443651       7 ssl.go:75] ""Error generating certificate chain for Secret"" err=""Get \""http://www.microsoft.com/pkiops/certs/Microsoft%20Azure%20RSA%20TLS%20Issuing%20CA%2007%20-%20xsign.crt\"": dial tcp: lookup www.microsoft.com: i/o timeout"""
2025-03-03T02:01:00.0000000Z,"I0303 02:01:20.444203       7 backend_ssl.go:67] ""Adding secret to local store"" name=""my-tls-secret"""
2025-03-03T02:01:00.0000000Z,"I0303 02:01:20.444238       7 store.go:611] ""Secret was added and it is used in ingress annotations. Parsing"" secret=""my-tls-secret"""
2025-03-03T02:01:00.0000000Z,"I0303 02:01:20.444478       7 controller.go:190] ""Configuration changes detected, backend reload required"""
2025-03-03T02:01:00.0000000Z,"I0303 02:01:20.517637       7 controller.go:210] ""Backend successfully reloaded"""
2025-03-03T02:01:00.0000000Z,"I0303 02:01:20.517880       7 event.go:377] Event(v1.ObjectReference{Kind:""Pod"", Namespace:""mynamespace"", Name:""myIngressController"", UID:""8307f3b3-1c06-4319-b232-8986b7b78181"", APIVersion:""v1"", ResourceVersion:""130033760"", FieldPath:""""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration"
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:21 [error] 38#38: *1945 upstream timed out (110: Operation timed out) while connecting to upstream, client: <redacted IP>, server: ~^(?<subdomain>[\w-]+)\.mydomain\.com$, request: ""GET /redactedPath HTTP/1.1"", upstream: ""https://<internalIp>:443/redactedPath"" host: ""<redacted host>"""
2025-03-03T02:01:00.0000000Z,
2025-03-03T02:01:00.0000000Z,
2025-03-03T02:01:00.0000000Z,
2025-03-03T02:01:00.0000000Z,
2025-03-03T02:01:00.0000000Z,
2025-03-03T02:01:00.0000000Z,
2025-03-03T02:01:00.0000000Z,
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:25 [error] 36#36: *2794 upstream timed out (110: Operation timed out) while connecting to upstream, client: <redacted IP>, server: ~^(?<subdomain>[\w-]+)\.mydomain\.com$, request: ""GET /redactedPath HTTP/1.1"", upstream: ""https://<internalIp>:443/redactedPath"" host: ""<redacted host>"""
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:25 [error] 38#38: *2803 upstream timed out (110: Operation timed out) while connecting to upstream, client: <redacted IP>, server: ~^(?<subdomain>[\w-]+)\.mydomain\.com$, request: ""GET /redactedPath HTTP/1.1"", upstream: ""https://<internalIp>:443/redactedPath"" host: ""<redacted host>"""
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:25 [error] 36#36: *2798 upstream timed out (110: Operation timed out) while connecting to upstream, client: <redacted IP>, server: ~^(?<subdomain>[\w-]+)\.mydomain\.com$, request: ""GET /redactedPath HTTP/1.1"", upstream: ""https://<internalIp>:443/redactedPath"" host: ""<redacted host>"""
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:25 [error] 300#300: *2839 upstream timed out (110: Operation timed out) while connecting to upstream, client: <redacted IP>, server: ~^(?<subdomain>[\w-]+)\.mydomain\.com$, request: ""GET /redactedPath HTTP/1.1"", upstream: ""https://<internalIp>:443/redactedPath"" host: ""<redacted host>"""
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:25 [error] 301#301: *2848 upstream timed out (110: Operation timed out) while connecting to upstream, client: <redacted IP>, server: ~^(?<subdomain>[\w-]+)\.mydomain\.com$, request: ""GET /redactedPath HTTP/1.1"", upstream: ""https://<internalIp>:443/redactedPath"" host: ""<redacted host>"""
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:25 [error] 305#305: *2855 upstream timed out (110: Operation timed out) while connecting to upstream, client: <redacted IP>, server: ~^(?<subdomain>[\w-]+)\.mydomain\.com$, request: ""GET /redactedPath HTTP/1.1"", upstream: ""https://<internalIp>:443/redactedPath"" host: ""<redacted host>"""
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:25 [error] 305#305: *2837 upstream timed out (110: Operation timed out) while connecting to upstream, client: <redacted IP>, server: ~^(?<subdomain>[\w-]+)\.mydomain\.com$, request: ""GET /redactedPath HTTP/1.1"", upstream: ""https://<internalIp>:443/redactedPath"" host: ""<redacted host>"""
2025-03-03T02:01:00.0000000Z,"2025/03/03 02:01:25 [error] 33#33: *2849 upstream timed out (110: Operation timed out) while connecting to upstream, client: <redacted IP>, server: ~^(?<subdomain>[\w-]+)\.mydomain\.com$, request: ""GET /redactedPath HTTP/1.1"", upstream: ""https://<internalIp>:443/redactedPath"" host: ""<redacted host>"""
2025-03-03T02:01:00.0000000Z,

Logs_2025_03_04_09_04.csv

Steps to Reproduce:

  1. Deploy the NGINX Ingress Controller using the default Helm chart.
  2. Configure an Ingress resource with TLS enabled. The TLS secret is mounted using a CSI driver.
  3. Update anything other than the TLS settings and redeploy the ingress controller to trigger a rolling update.
  4. Observe the logs during the initial startup phase.

Expected Behavior:

  • The Ingress Controller should properly wait for the TLS secret to be available before processing requests.
  • The upstream should not time out due to missing TLS secrets.

Actual Behavior:

  • During startup, the controller logs "tls secret not found" errors.
  • The upstream times out with error (110: Operation timed out).
  • After around 30 seconds, the issue resolves itself, and the Ingress Controller starts handling traffic correctly. The "upstream timed out" errors stop at that point.
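
The length of the window can be read directly from the controller log above. The sketch below parses three lines copied from that log (with the CSV quoting removed) and computes the elapsed time. The helper names `klog_time` and `seconds_between` are illustrative, and the year is supplied by hand because the klog header does not carry one.

```python
from datetime import datetime

# Lines copied from the log above (klog header: severity + MMDD hh:mm:ss.uuuuuu).
STARTED = 'I0303 02:00:50.425829       7 nginx.go:267] "Starting NGINX Ingress controller"'
FIRST_MISS = ('W0303 02:00:51.628826       7 controller.go:1232] '
              'Error loading custom default certificate, falling back to generated default:')
SECRET_ADDED = ('I0303 02:01:20.444203       7 backend_ssl.go:67] '
                '"Adding secret to local store" name="my-tls-secret"')


def klog_time(line: str, year: int = 2025) -> datetime:
    """Parse the timestamp at the start of a klog-formatted line."""
    date_part, time_part = line.split()[:2]
    # date_part is the severity letter followed by MMDD; the year is an input.
    return datetime.strptime(f"{year}{date_part[1:]} {time_part}", "%Y%m%d %H:%M:%S.%f")


def seconds_between(earlier: str, later: str) -> float:
    """Elapsed seconds between two klog lines."""
    return (klog_time(later) - klog_time(earlier)).total_seconds()


startup_to_secret = seconds_between(STARTED, SECRET_ADDED)
miss_to_secret = seconds_between(FIRST_MISS, SECRET_ADDED)
print(f"controller start -> secret added: {startup_to_secret:.3f}s")  # 30.018s
print(f"first warning    -> secret added: {miss_to_secret:.3f}s")     # 28.815s
```

In this log the secret reaches the local store about 30 seconds after the controller starts, and the last "upstream timed out" line is stamped 02:01:25, a few seconds after that.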

Environment:

  • Ingress-NGINX version: v1.10.6
  • nginx version: nginx/1.25.5
  • Kubernetes version: 1.30.7
  • Cloud Provider: Azure AKS
  • TLS Secret Source: CSI Driver

Additional Context:

It appears that the Ingress Controller is attempting to process requests before the TLS secret is fully available, leading to these transient errors. We would like to understand if there's a way to delay the initialization or improve handling in such scenarios.

Would appreciate any insights or guidance on potential workarounds or fixes! 🚀

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Mar 3, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority labels Mar 3, 2025
@longwuyuan
Contributor

Your issue description does not state whether you had live traffic hitting the cluster while the controller pod(s) restarted. So it's not possible to guess, as there is literally no data provided that can be analyzed, such as the complete logs of the controller and the output of all the kubectl commands asked for in the new bug report template.

Since the reproduction steps hint that you restarted an existing controller, it is a highly likely guess that the traffic was persistent throughout the restart. If so, there is currently no instrumentation in the controller to do what you expect, because the Kubernetes Ingress API will match the incoming request to a rule before the controller has finished locating and handling requirements such as the certs in the TLS sections.

You can stop traffic when you restart the controller, or hack at the various probes and extend the liveness/readiness settings, etc.

@juniwang
Author

juniwang commented Mar 4, 2025

Hi @longwuyuan, thanks for looking into this. Actually it's a rolling upgrade rather than a pod restart (the earlier reproduction step was an attempt to simplify things, sorry for the confusion). It's a pod on a new VM, and there was no traffic before.

I updated the issue with more logs, and uploaded all logs from the time the pod was created to the timestamp of the last "upstream timed out" log. Logs afterwards just indicate successful handling and forwarding. Could you take a look again?

@longwuyuan
Contributor

@juniwang Thank you for the clarification. It explains the use case.

But the problem remains. For established connections there will, in theory, be no new TLS negotiation. For any new connection, however, the server needs to present a TLS cert. If the controller pod is not fully synchronized and has not yet read the required certificate from its secret, just because there was a rolling upgrade, that does not change the TLS handshake process, and an error will be logged. The only way to avoid the error message is to stop any HTTPS request coming to the cluster for that short period, until the controller pod has fully read the secret of type TLS.
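
One way to tell from outside when that short period is over is to look at which certificate the server presents on a new connection. The sketch below is a minimal illustration, not part of ingress-nginx: it fetches the presented certificate without verifying it and checks whether it is the generated fallback. It assumes the fallback certificate carries the common name "Kubernetes Ingress Controller Fake Certificate", and it detects that with a plain byte search over the DER encoding rather than a real X.509 parse. The function names are hypothetical.

```python
import socket
import ssl
from typing import Optional

# Assumed common name of the controller's generated fallback certificate.
FAKE_CERT_CN = b"Kubernetes Ingress Controller Fake Certificate"


def is_fallback_cert(der: bytes) -> bool:
    """Heuristic: the fallback CN appears verbatim inside the DER bytes."""
    return FAKE_CERT_CN in der


def presented_cert(host: str, port: int = 443,
                   server_name: Optional[str] = None,
                   timeout: float = 5.0) -> bytes:
    """Return the DER-encoded certificate the server presents for server_name."""
    ctx = ssl.create_default_context()
    # Verification is disabled on purpose: the fallback certificate is
    # self-signed and the goal is only to inspect what is being served.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=server_name or host) as tls:
            return tls.getpeercert(binary_form=True)
```

A deployment script could call `is_fallback_cert(presented_cert(pod_ip, 443, "app.mydomain.com"))` in a loop and only route traffic to the new pod once it returns `False`; the host name here is a placeholder.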

@juniwang
Author

juniwang commented Mar 4, 2025

@longwuyuan yes, I agree that it's because the required TLS cert is not ready. We are trying to add a startup probe to make sure the cert is ready before the pod takes any traffic, but this is just a workaround. There is nothing special in our setup (we fetch the cert from Azure Key Vault via the CSI driver, mount it as a TLS cert, and start ingress). We just want to learn whether this is a general issue in the ingress controller and whether there is a better way to stop the error. Any suggestion is appreciated.
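
A startup-probe check of this kind could look roughly like the sketch below. It assumes the CSI driver mounts the certificate and key as PEM files at the hypothetical paths shown, and it only verifies the mounted files; it cannot tell whether the controller has already added the secret to its local store. Python is used for illustration only, since the controller image may not ship an interpreter.

```python
import ssl
from pathlib import Path

# Hypothetical mount paths; adjust to wherever the CSI volume is mounted.
CERT_PATH = Path("/mnt/secrets-store/tls.crt")
KEY_PATH = Path("/mnt/secrets-store/tls.key")


def cert_ready(cert_path: Path, key_path: Path) -> bool:
    """Return True once both files exist and load as a matching cert/key pair."""
    if not (cert_path.is_file() and key_path.is_file()):
        return False
    try:
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        # Raises ssl.SSLError for malformed PEM or a key that does not match.
        ctx.load_cert_chain(certfile=str(cert_path), keyfile=str(key_path))
    except (ssl.SSLError, OSError):
        return False
    return True


def probe_exit_code(cert_path: Path = CERT_PATH, key_path: Path = KEY_PATH) -> int:
    """Exit code for an exec probe: 0 means ready, 1 means keep waiting."""
    return 0 if cert_ready(cert_path, key_path) else 1
```

Wired into an exec probe, the process would end with `sys.exit(probe_exit_code())`, so the kubelet keeps the pod out of rotation until the files load cleanly.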

@longwuyuan
Contributor

I apologize, I don't know of a faster way to sync state.

On the other hand, sometimes I think about the basic realities. Regardless of how secure my vault of secrets is, all secrets can be base64 decoded. So I think a security expert needs to advise on best practices, in case you want to fetch secrets once during cluster creation, instead of repeatedly needing to fetch the same cert.

@juniwang
Author

juniwang commented Mar 5, 2025

You are right, there must be some way to improve the secret loading. Besides this, is it possible to add some sort of certificate validation to the ingress controller's built-in healthz API?

@longwuyuan
Contributor

TLS validation does not need new code, I think. It's built into HTTPS.

@juniwang
Author

juniwang commented Mar 5, 2025

That's not what I mean. Most ingress controller users rely on the default built-in healthz API for liveness/readiness checks. From my observation, healthz always returns 200 regardless of certificate status. In our case, we'd expect it to return a code other than 200 until the certs are ready. Is this something that could be supported in the ingress controller, so that users don't have to work around it?

I understand not all users need this, maybe an option to allow users to control it?
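
To make the request concrete, the behaviour can be sketched as a separate health endpoint that answers 503 until the certificate is available and 200 afterwards. This is not the controller's healthz implementation. It is a standalone illustration that could run as a sidecar, where a failing readiness probe on any container keeps the whole pod out of the Ready state. The path, port, and readiness rule below are all assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical path and port, chosen for illustration only.
CERT_PATH = Path("/mnt/secrets-store/tls.crt")
PORT = 8081


def health_response(ready: bool) -> Tuple[int, bytes]:
    """Map certificate readiness to an HTTP status and body."""
    if ready:
        return 200, b"ok\n"
    return 503, b"certificate not ready\n"


def make_handler(is_ready: Callable[[], bool]):
    """Build a request handler that re-evaluates is_ready() on every GET."""

    class CertHealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = health_response(is_ready())
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # suppress per-request logging from probe traffic

    return CertHealthHandler


def serve(port: int = PORT) -> None:
    """Serve the check until the process is stopped."""
    # Assumption: "ready" means the mounted file exists and is non-empty.
    is_ready = lambda: CERT_PATH.is_file() and CERT_PATH.stat().st_size > 0
    HTTPServer(("", port), make_handler(is_ready)).serve_forever()
```

Pointing an HTTP readiness probe at this port would give the "non-200 until certs are ready" behaviour as an opt-in, without changing what the built-in healthz reports.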

@longwuyuan
Contributor

Sorry for not being clear. And thanks for explaining.

So the 200 from healthz reflects the state of the controller itself, that is, of the controller code. Ingress resources are objects outside the controller. So it would not completely fit the description of health even if your expectation was met.

It's unfortunate that you have this situation, and it's unfortunate that there is no known solution for this problem. I think currently the only possible action is to stop traffic to the ingress resources during an upgrade, open traffic only when everything is in sync, and meanwhile ignore all the errors related to certs.

But there is one aspect here that is a complete mystery to me. If the certificates for the TLS sections of your different ingress resources are in a vault, why do you have to fetch them anew each time the controller restarts? If you fetch a certificate once, it will remain as a secret in the cluster until you delete it.
