Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stricter notion of esReacheable: require health response #5796

Merged
merged 4 commits into from
Jun 24, 2022

Conversation

pebrc
Copy link
Collaborator

@pebrc pebrc commented Jun 17, 2022

Fixes #5776

Simplest (?) possible fix without doing extra requests. The goal is avoid confusing error messages like "could not verify license" but instead have a meaningful status like "Elasticsearch is not reachable". The implication here is that esReachable is now tied to the first successful observation of cluster health. Those happen asynchronously in the observer mechanism and are not done inside the reconciliation loop. I think that is OK as the cluster is not available immediately but with a certain delay anyway, depending on hardware and the time it takes for the ES container in the Pods to become ready. If the timing is unfortunate we might lose an additional 10 seconds (which is the current observation interval) which seems acceptable to me.

@botelastic botelastic bot added the triage label Jun 17, 2022
@pebrc pebrc added the >enhancement Enhancement of existing functionality label Jun 17, 2022
@botelastic botelastic bot removed the triage label Jun 17, 2022
@botelastic botelastic bot removed the triage label Jun 17, 2022
@pebrc
Copy link
Collaborator Author

pebrc commented Jun 20, 2022

run/e2e-tests tags=es

@pebrc pebrc marked this pull request as ready for review June 22, 2022 06:41
@pebrc pebrc marked this pull request as draft June 22, 2022 06:50
@pebrc pebrc marked this pull request as ready for review June 22, 2022 08:52
Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this PR, I agree that the current status may be confusing.
I'm must confess however I'm a bit hesitant regarding the way to improve things as there can be 2 reasons for having an "unknown" state:

  1. The operator has not attempted to reach the cluster yet, in which case I would expect the condition to be corev1.ConditionUnknown. We don't really no if Elasticsearch is not responding to requests.
  2. The operator attempted to get the cluster state, but an error occurred (time out, credentials issue...). Health is still unknown but we now have an error to explain why. The condition should then be corev1.ConditionFalse, with maybe the error as the condition message.

Edit: Sorry I was focusing on the code while working on the PR and forgot your comment 🤦

That being said you can argue that the observer should attempt a first connection pretty quickly once the operator is started or when a cluster is created, so may be I'm overthinking...

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the timing is unfortunate we might lose an additional 10 seconds (which is the current observation interval) which seems acceptable to me.

Same. We could still consider it as a first step and improve it later if needed.

@pebrc
Copy link
Collaborator Author

pebrc commented Jun 24, 2022

cla/check

@pebrc pebrc merged commit 444de57 into elastic:main Jun 24, 2022
fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
The goal is avoid confusing error messages like "could not verify license" but instead have a meaningful status like "Elasticsearch is not reachable". The implication here is that esReachable is now tied to the first successful observation of cluster health. Those happen asynchronously in the observer mechanism and are not done inside the reconciliation loop. I think that is OK as the cluster is not available immediately but with a certain delay anyway, depending on hardware and the time it takes for the ES container in the Pods to become ready. If the timing is unfortunate we might lose an additional 10 seconds (which is the current observation interval) which seems acceptable to me.

Co-authored-by: Michael Morello <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement Enhancement of existing functionality v2.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot verify license message is misleading
2 participants