
Telemetry for error conditions #3498

Closed
keeganwitt opened this issue Oct 12, 2022 · 8 comments

@keeganwitt
Contributor

It would be useful to have telemetry (like Datadog metrics) for system errors/failures. For example,

  • Failed agent attestations
  • Failed workload attestations
  • Expired SVIDs
  • Server crashes (although this can be captured currently with a Kubernetes metric)
  • Agent crashes (although this can be captured currently with a Kubernetes metric)
  • Failures to update datastore (network problems, corrupt database, etc)
  • Failures to connect to datastore (network problem, invalid creds, etc)
  • Migration script failures (e.g. can't run downgrade migration)
  • Expired/invalid certificates for x509pop attestations
  • Configuration file errors (invalid configuration, syntax errors, missing file, etc)
  • Network issues between agent and server
@keeganwitt
Contributor Author

If this was intentionally not implemented, is there another recommended approach for monitoring failures in a SPIRE cluster?

evan2645 added the triage/in-progress label on Oct 13, 2022
@rturner3
Collaborator

rturner3 commented Oct 18, 2022

Hi @keeganwitt, our current telemetry is documented in TELEMETRY.md. I would suggest taking a look there to see which existing metrics may help with your observability into SPIRE.

The following items from your list should already be covered by the existing RPC call counter metrics:

  • Failed agent attestations
  • Failed workload attestations
  • Network issues between agent and server
  • Expired/invalid certificates for x509pop attestations

Server RPC call counter metric for node attestation failures (including the x509pop case you pointed out):

  • Keys: rpc, agent, v1, agent, attest_agent

Agent RPC call counter metrics for workload attestation failures:

X.509-SVID

  • Keys: rpc, workload_api, fetch, x509svid

JWT-SVID

  • Keys: rpc, workload_api, fetch, jwtsvid

General networking issues between the Agent and Server should be noticeable through the Agent manager metrics:

  • Keys: manager, sync, fetch_entries_updates
  • Keys: manager, sync, fetch_svids_updates
Regarding expired SVIDs, this should be observable with these Agent metrics:

  • Keys: cache_manager, expiring_svids
  • Keys: cache_manager, outdated_svids

Regarding failures to update the datastore (network problems, corrupt database, etc), these should be observable through the Server metrics prefixed with the datastore key.

That leaves the remaining cases:

  • Server crashes (although this can be captured currently with a Kubernetes metric)
  • Agent crashes (although this can be captured currently with a Kubernetes metric)
  • Failures to connect to datastore (network problem, invalid creds, etc)
  • Migration script failures (e.g. can't run downgrade migration)
  • Configuration file errors (invalid configuration, syntax errors, missing file, etc)

I would generally classify all of these cases as "issues that render a component of SPIRE inoperable". Here are a couple of ways these cases could be detected:

  • Tap into the existing uptime_in_ms metric to see if a Server or Agent is unexpectedly crashlooping
  • All of these cases will result in ERROR level logs being emitted, which will contain more detailed information about the nature of the failure. Those logs can also be used as a source for detection. In general, we emit logs anywhere an error happens, but if the error condition is known to be something sparse (e.g. startup failure), we don't usually emit metrics since it is often difficult to define alert conditions/thresholds with a high signal-to-noise ratio for sparse metrics.

If you discover any other telemetry gaps, feel free to create a new issue with the specific cases, and we'd be happy to review them.
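[Editorial note] For readers unsure how the "Keys:" lists above map onto the metric names a sink actually receives, here is a minimal, illustrative Go sketch. It is not SPIRE's own telemetry code; it assumes a go-metrics-style pipeline (the armon/go-metrics library), an in-memory sink, and a spire_agent service prefix, and shows the cache_manager, expiring_svids key list being flattened into a dotted metric name.

```go
// Illustrative sketch only -- not SPIRE's telemetry code. It shows how a key
// list such as ["cache_manager", "expiring_svids"] is flattened into a dotted
// metric name by a go-metrics sink.
package main

import (
	"fmt"
	"time"

	metrics "github.com/armon/go-metrics"
)

func main() {
	// An in-memory sink keeps the example self-contained; a real deployment
	// would configure a statsd/DogStatsd/Prometheus sink instead.
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)

	cfg := metrics.DefaultConfig("spire_agent")
	cfg.EnableHostname = false // keep the example name free of a hostname segment
	m, err := metrics.New(cfg, sink)
	if err != nil {
		panic(err)
	}

	// Emit a gauge keyed the same way the discussion above describes:
	// cache_manager, expiring_svids.
	m.SetGauge([]string{"cache_manager", "expiring_svids"}, 3)

	// The sink now holds a sample named "spire_agent.cache_manager.expiring_svids".
	for _, interval := range sink.Data() {
		for name := range interval.Gauges {
			fmt.Println(name)
		}
	}
}
```

With a real sink configured instead of the in-memory one, the same key lists surface as the dotted metric names referenced in TELEMETRY.md, plus whatever labels the sink supports.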

@keeganwitt
Contributor Author

keeganwitt commented Oct 19, 2022

Thank you for your response! I'm still confused about how the existing telemetry would help monitor some of these cases. Let's talk about one or two specific examples, then maybe it'll become clearer to me.

In the case of node attestation failures, you suggest using spire_server.rpc.agent.v1.agent.attest_agent. My understanding is that this metric is a count of agent attestation attempts. Are you suggesting that a spike in this number would imply failures, since each retry would increase that number? Or that the number would be low because fewer agents than expected successfully complete attestation?

Here's an example of a SPIRE cluster that has a few nodes failing to attest. How would you infer from this metric that a failure is occurring?
[screenshot of the attest_agent metric graph]

Probably some of the metrics I mentioned wouldn't make sense to implement, and start-up issues couldn't be caught very easily, as you say.

@azdagron
Member

The RPC metrics include a "status" label with the gRPC status code that the RPC completed with. Does that help?
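[Editorial note] For concreteness, here is a small sketch (not SPIRE's actual implementation) of the pattern being described: a per-RPC counter that carries the gRPC status code as a label, written with the Prometheus Go client and google.golang.org/grpc/status. The counter name and the recordCall helper are made up for the example; in SPIRE the equivalent data arrives through the call counter metrics and their status label.

```go
// Illustrative sketch only -- not SPIRE's code. It shows the general shape of
// an RPC call counter that carries the gRPC status code as a label, which is
// what lets failed calls be separated from successful ones when alerting.
package main

import (
	"errors"
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// attestAgentCalls is a hypothetical counter standing in for the
// spire_server.rpc.agent.v1.agent.attest_agent call counter.
var attestAgentCalls = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "rpc_agent_v1_agent_attest_agent",
		Help: "AttestAgent RPC completions, labeled by gRPC status code",
	},
	[]string{"status"},
)

// recordCall increments the counter with the status code of the RPC result:
// codes.OK when err is nil, the embedded code for gRPC status errors, and
// codes.Unknown for anything else.
func recordCall(err error) {
	attestAgentCalls.WithLabelValues(status.Code(err).String()).Inc()
}

func main() {
	prometheus.MustRegister(attestAgentCalls)

	recordCall(nil)                                                        // status="OK"
	recordCall(status.Error(codes.PermissionDenied, "attestation failed")) // status="PermissionDenied"
	recordCall(errors.New("unexpected failure"))                           // status="Unknown"

	fmt.Println("recorded 3 AttestAgent calls with status labels")
}
```

A monitor can then key off the rate of samples whose status label is anything other than OK, rather than the raw call count alone.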

@keeganwitt
Contributor Author

Ohh -- I had not noticed this was captured by the metric. Thank you for mentioning this! I will go back through the existing telemetry and re-evaluate whether the monitoring I need actually already exists.

This might be something worth adding to the documentation.

@azdagron
Member

> This might be something worth adding to the documentation.

Yes, I totally agree. We can do better at describing exactly what is included with the "CallCounter" metrics.

@azdagron
Member

I've made an attempt at describing call counters here. Happy to have your review on that, @keeganwitt :)

@keeganwitt
Contributor Author

keeganwitt commented Oct 20, 2022

Thank you for adding this! I left a couple comments.

Side note for anyone who finds themselves here from Google: In my testing, spire_server.rpc.agent.v1.agent.attest_agent > status does not show failures (looking for any status other than OK) when an entry is missing for x509pop attestation. But it might work for other attestors or other failure causes. I'm not really sure why this is the case.
