Telemetry for error conditions #3498
It would be useful to have telemetry (like Datadog metrics) for system errors/failures. If this was intentionally not implemented, was there another recommended approach for monitoring failures in a SPIRE cluster?
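For background, SPIRE's metrics are (to the best of my knowledge) built on the go-metrics library, which can ship measurements to Datadog through a DogStatsd sink. The snippet below is a generic, hypothetical go-metrics example of emitting an error counter to DogStatsd; the address, service name, and metric keys are placeholders, and this is not SPIRE's own telemetry configuration.

```go
package main

import (
	"log"
	"time"

	metrics "github.com/armon/go-metrics"
	"github.com/armon/go-metrics/datadog"
)

func main() {
	// Ship metrics to a local DogStatsd agent. Address and service name
	// are placeholders for illustration only.
	sink, err := datadog.NewDogStatsdSink("127.0.0.1:8125", "example-host")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("example-service"), sink); err != nil {
		log.Fatal(err)
	}

	// Emit an error counter; in Datadog this shows up as a count you can
	// alert on, tagged with the label below.
	metrics.IncrCounterWithLabels(
		[]string{"example", "errors"}, 1,
		[]metrics.Label{{Name: "kind", Value: "attestation"}},
	)

	// Give the statsd client a moment to flush before exiting.
	time.Sleep(time.Second)
}
```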
Hi @keeganwitt, we have our current telemetry documented at TELEMETRY.md. I would suggest taking a look there to see which existing metrics may help your observability into SPIRE.
These should already be emitted with RPC call counter metrics. There is a Server RPC call counter metric for node attestation failures (including the x509pop case you pointed out), and Agent RPC call counter metrics for workload attestation failures for both X.509-SVID and JWT-SVID.
General networking issues between the Agent and Server should be noticeable through the Agent manager metrics. This should be observable with these Agent metrics.
I would generally classify all of these cases as "issues that render a component of SPIRE inoperable". There are a couple of ways these cases could be detected.
If you discover any other telemetry gaps, feel free to create a new issue with the specific cases, and we'd be happy to review them.
The RPC metrics include a "status" label with the gRPC status code that the RPC completed with. Does that help?
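As a rough illustration of how a counter with a "status" label can be produced, the sketch below shows a gRPC unary server interceptor that increments a metric for every completed RPC, labeled with the method name and status code. This is only a sketch using go-metrics; the metric key and label names are hypothetical, and it is not SPIRE's actual interceptor.

```go
package main

import (
	"context"

	metrics "github.com/armon/go-metrics"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// statusCounterInterceptor increments a counter for every completed RPC,
// labeled with the method name and the gRPC status code. The metric key
// and label names here are hypothetical, not SPIRE's actual telemetry keys.
func statusCounterInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	resp, err := handler(ctx, req)

	// status.Code returns codes.OK when err is nil.
	code := status.Code(err)
	metrics.IncrCounterWithLabels(
		[]string{"rpc", "call"}, 1,
		[]metrics.Label{
			{Name: "method", Value: info.FullMethod},
			{Name: "status", Value: code.String()},
		},
	)
	return resp, err
}

func main() {
	// Register the interceptor when constructing the gRPC server.
	_ = grpc.NewServer(grpc.UnaryInterceptor(statusCounterInterceptor))
}
```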
Ohh -- I had not noticed this was captured by the metric. Thank you for mentioning this! I will go back through the existing telemetry and re-evaluate whether the monitoring I need actually already exists. This might be something worth adding to the documentation.
Yes, I totally agree. We can do a better job of describing exactly what is included with the "CallCounter" metrics.
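To make the "CallCounter" idea concrete, the pattern generally wraps a single operation: start a timer, run the call, and on completion emit a counter and latency sample labeled with how the call finished. The sketch below only illustrates that pattern under assumed metric keys; it is not SPIRE's telemetry package.

```go
package main

import (
	"errors"
	"time"

	metrics "github.com/armon/go-metrics"
)

// callCounter illustrates the general "call counter" pattern: one sample
// per call, emitted on completion with a status label and elapsed time.
// The metric keys and labels are hypothetical.
type callCounter struct {
	key   []string
	start time.Time
}

func startCall(key ...string) *callCounter {
	return &callCounter{key: key, start: time.Now()}
}

// done emits the counter and latency, labeling the sample with "OK" or
// "error" depending on how the call finished.
func (c *callCounter) done(errp *error) {
	statusLabel := "OK"
	if errp != nil && *errp != nil {
		statusLabel = "error"
	}
	labels := []metrics.Label{{Name: "status", Value: statusLabel}}
	metrics.IncrCounterWithLabels(c.key, 1, labels)
	metrics.MeasureSinceWithLabels(append(c.key, "elapsed_time"), c.start, labels)
}

// attestNode is a stand-in for an operation we want to instrument.
func attestNode() (err error) {
	counter := startCall("server", "node_api", "attest")
	defer counter.done(&err)

	// ... attestation work; returning an error flips the status label.
	return errors.New("attestation failed")
}

func main() {
	_ = attestNode()
}
```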
I've made an attempt at describing call counters here. Happy to have your review on that, @keeganwitt :) |
Thank you for adding this! I left a couple of comments. Side note for anyone who finds themselves here from Google: In my testing,