
Telemetry for error conditions #3498

Closed
keeganwitt opened this issue Oct 12, 2022 · 8 comments

@keeganwitt
Contributor

It would be useful to have telemetry (like Datadog metrics) for system errors/failures. For example,

  • Failed agent attestations
  • Failed workload attestations
  • Expired SVIDs
  • Server crashes (although this can be captured currently with a Kubernetes metric)
  • Agent crashes (although this can be captured currently with a Kubernetes metric)
  • Failures to update datastore (network problems, corrupt database, etc)
  • Failures to connect to datastore (network problem, invalid creds, etc)
  • Migration script failures (e.g. can't run downgrade migration)
  • Expired/invalid certificates for x509pop attestations
  • Configuration file errors (invalid configuration, syntax errors, missing file, etc)
  • Network issues between agent and server
@keeganwitt
Contributor Author

If this was intentionally not implemented, is there another recommended approach for monitoring failures in a SPIRE cluster?

evan2645 added the triage/in-progress label on Oct 13, 2022
@rturner3
Collaborator

rturner3 commented Oct 18, 2022

Hi @keeganwitt, our current telemetry is documented in TELEMETRY.md. I would suggest taking a look there to see which existing metrics may help with your observability into SPIRE.

The following items from your list should already be covered by the existing RPC call counter metrics:

  • Failed agent attestations
  • Failed workload attestations
  • Network issues between agent and server
  • Expired/invalid certificates for x509pop attestations

Server RPC call counter metric for node attestation failures (including the x509pop case you pointed out):

  • Keys: rpc, agent, v1, agent, attest_agent

Agent RPC call counter metrics for workload attestation failures:

X.509-SVID

  • Keys: rpc, workload_api, fetch, x509svid

JWT-SVID

  • Keys: rpc, workload_api, fetch, jwtsvid

General networking issues between the Agent and Server should be noticeable through the Agent manager metrics:

  • Keys: manager, sync, fetch_entries_updates
  • Keys: manager, sync, fetch_svids_updates
Regarding expired SVIDs, this should be observable with these Agent metrics:

  • Keys: cache_manager, expiring_svids
  • Keys: cache_manager, outdated_svids

Regarding failures to update the datastore (network problems, corrupt database, etc), these should be observable through the Server metrics prefixed with the datastore key.

That leaves the remaining cases:

  • Server crashes (although this can be captured currently with a Kubernetes metric)
  • Agent crashes (although this can be captured currently with a Kubernetes metric)
  • Failures to connect to datastore (network problem, invalid creds, etc)
  • Migration script failures (e.g. can't run downgrade migration)
  • Configuration file errors (invalid configuration, syntax errors, missing file, etc)

I would generally classify all of these cases as "issues that render a component of SPIRE inoperable". Here are a couple of ways these cases could be detected:

  • Tap into the existing uptime_in_ms metric to see if a Server or Agent is unexpectedly crashlooping
  • All of these cases will result in ERROR level logs being emitted, which will contain more detailed information about the nature of the failure. Those logs can also be used as a source for detection. In general, we emit logs anywhere an error happens, but if the error condition is known to be something sparse (e.g. startup failure), we don't usually emit metrics since it is often difficult to define alert conditions/thresholds with a high signal-to-noise ratio for sparse metrics.

If you discover any other telemetry gaps, feel free to create a new issue with the specific cases, and we'd be happy to review them.
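[Editorial note] For readers unsure how the "Keys:" lists above map onto the metric names a sink actually receives, here is a minimal, illustrative Go sketch. It is not SPIRE's own telemetry code; it assumes a go-metrics-style pipeline (the armon/go-metrics library), an in-memory sink, and a spire_agent service prefix, and shows the cache_manager, expiring_svids key list being flattened into a dotted metric name.

```go
// Illustrative sketch only -- not SPIRE's telemetry code. It shows how a key
// list such as ["cache_manager", "expiring_svids"] is flattened into a dotted
// metric name by a go-metrics sink.
package main

import (
	"fmt"
	"time"

	metrics "github.com/armon/go-metrics"
)

func main() {
	// An in-memory sink keeps the example self-contained; a real deployment
	// would configure a statsd/DogStatsd/Prometheus sink instead.
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)

	cfg := metrics.DefaultConfig("spire_agent")
	cfg.EnableHostname = false // keep the example name free of a hostname segment
	m, err := metrics.New(cfg, sink)
	if err != nil {
		panic(err)
	}

	// Emit a gauge keyed the same way the discussion above describes:
	// cache_manager, expiring_svids.
	m.SetGauge([]string{"cache_manager", "expiring_svids"}, 3)

	// The sink now holds a sample named "spire_agent.cache_manager.expiring_svids".
	for _, interval := range sink.Data() {
		for name := range interval.Gauges {
			fmt.Println(name)
		}
	}
}
```

With a real sink configured instead of the in-memory one, the same key lists surface as the dotted metric names referenced in TELEMETRY.md, plus whatever labels the sink supports.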

@keeganwitt
Contributor Author

keeganwitt commented Oct 19, 2022

Thank you for your response! I'm still confused about how the existing telemetry would help monitor some of these cases. Let's talk about one or two specific examples, then maybe it'll become clearer to me.

In the case of node attestation failures, you suggest using spire_server.rpc.agent.v1.agent.attest_agent. My understanding is that this metric is a count of agent attestation attempts. Are you suggesting that a spike in this number would imply failures, since each retry would increase that number? Or that the number would be low because fewer agents than expected successfully complete attestation?

Here's an example of a SPIRE cluster that has a few nodes failing to attest. How would you infer from this metric that a failure is occurring?
[screenshot of the attest_agent metric graph]

Probably some of the metrics I mentioned wouldn't make sense to implement, and start-up issues couldn't be caught very easily, as you say.

@azdagron
Member

The RPC metrics include a "status" label with the gRPC status code that the RPC completed with. Does that help?
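[Editorial note] For concreteness, here is a small sketch (not SPIRE's actual implementation) of the pattern being described: a per-RPC counter that carries the gRPC status code as a label, written with the Prometheus Go client and google.golang.org/grpc/status. The counter name and the recordCall helper are made up for the example; in SPIRE the equivalent data arrives through the call counter metrics and their status label.

```go
// Illustrative sketch only -- not SPIRE's code. It shows the general shape of
// an RPC call counter that carries the gRPC status code as a label, which is
// what lets failed calls be separated from successful ones when alerting.
package main

import (
	"errors"
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// attestAgentCalls is a hypothetical counter standing in for the
// spire_server.rpc.agent.v1.agent.attest_agent call counter.
var attestAgentCalls = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "rpc_agent_v1_agent_attest_agent",
		Help: "AttestAgent RPC completions, labeled by gRPC status code",
	},
	[]string{"status"},
)

// recordCall increments the counter with the status code of the RPC result:
// codes.OK when err is nil, the embedded code for gRPC status errors, and
// codes.Unknown for anything else.
func recordCall(err error) {
	attestAgentCalls.WithLabelValues(status.Code(err).String()).Inc()
}

func main() {
	prometheus.MustRegister(attestAgentCalls)

	recordCall(nil)                                                        // status="OK"
	recordCall(status.Error(codes.PermissionDenied, "attestation failed")) // status="PermissionDenied"
	recordCall(errors.New("unexpected failure"))                           // status="Unknown"

	fmt.Println("recorded 3 AttestAgent calls with status labels")
}
```

A monitor can then key off the rate of samples whose status label is anything other than OK, rather than the raw call count alone.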

@keeganwitt
Contributor Author

Ohh -- I had not noticed this was captured by the metric. Thank you for mentioning this! I will go back through the existing telemetry and re-evaluate whether the monitoring I need actually already exists.

This might be something worth adding to the documentation.

@azdagron
Member

> This might be something worth adding to the documentation.

Yes, I totally agree. We can do better at describing exactly what is included with the "CallCounter" metrics.

@azdagron
Member

I've made an attempt at describing call counters here. Happy to have your review on that, @keeganwitt :)

@keeganwitt
Contributor Author

keeganwitt commented Oct 20, 2022

Thank you for adding this! I left a couple comments.

Side note for anyone who finds themselves here from Google: In my testing, spire_server.rpc.agent.v1.agent.attest_agent > status does not show failures (looking for any status other than OK) when an entry is missing for x509pop attestation. But it might work for other attestors or other failure causes. I'm not really sure why this is the case.
