Stricter notion of esReacheable: require health response #5796
Thanks for raising this PR, I agree that the current status may be confusing.
I must confess, however, that I'm a bit hesitant about how to improve things, as there can be two reasons for having an "unknown" state:
1. The operator has not attempted to reach the cluster yet, in which case I would expect the condition to be corev1.ConditionUnknown. We don't really know whether Elasticsearch is responding to requests.
2. The operator attempted to get the cluster state, but an error occurred (timeout, credentials issue...). Health is still unknown, but we now have an error to explain why. The condition should then be corev1.ConditionFalse, with maybe the error as the condition message.
Edit: Sorry I was focusing on the code while working on the PR and forgot your comment 🤦
That being said, you can argue that the observer should attempt a first connection pretty quickly once the operator is started or when a cluster is created, so maybe I'm overthinking...
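For illustration, the two cases discussed in this comment could be told apart roughly as follows. This is only a sketch, not the operator's code: `ConditionStatus` is a local stand-in for `corev1.ConditionStatus` so the snippet compiles on its own, and `observation`, `reachableCondition`, and `errTimeout` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// ConditionStatus is a local stand-in for corev1.ConditionStatus so that the
// sketch compiles without the Kubernetes API module.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// errTimeout is a hypothetical error used only to exercise the failure case.
var errTimeout = errors.New("request timed out")

// observation is a hypothetical summary of the last attempt to reach the cluster.
type observation struct {
	attempted bool  // false until the operator has tried at least once
	err       error // non-nil if the last attempt failed
}

// reachableCondition maps an observation to a condition status and message:
// no attempt yet gives Unknown, a failed attempt gives False with the error
// as the message, and a successful attempt gives True.
func reachableCondition(o observation) (ConditionStatus, string) {
	switch {
	case !o.attempted:
		return ConditionUnknown, "cluster has not been observed yet"
	case o.err != nil:
		return ConditionFalse, o.err.Error()
	default:
		return ConditionTrue, "Elasticsearch is reachable"
	}
}

func main() {
	cases := []observation{
		{},                                 // never attempted
		{attempted: true, err: errTimeout}, // attempted, but failed
		{attempted: true},                  // attempted and succeeded
	}
	for _, o := range cases {
		status, msg := reachableCondition(o)
		fmt.Printf("%s: %s\n", status, msg)
	}
}
```

Keeping "not attempted" separate from "attempted and failed" is what allows the error to be surfaced as the condition message in the second case.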
If the timing is unfortunate we might lose an additional 10 seconds (which is the current observation interval) which seems acceptable to me.
Same. We could still consider it as a first step and improve it later if needed.
Co-authored-by: Michael Morello <[email protected]>
Fixes #5776
Simplest (?) possible fix without doing extra requests. The goal is to avoid confusing error messages like "could not verify license" and instead report a meaningful status like "Elasticsearch is not reachable". The implication here is that esReachable is now tied to the first successful observation of cluster health. Those observations happen asynchronously in the observer mechanism and are not done inside the reconciliation loop. I think that is OK, as the cluster is not available immediately but only after a certain delay anyway, depending on hardware and the time it takes for the ES container in the Pods to become ready. If the timing is unfortunate we might lose an additional 10 seconds (which is the current observation interval), which seems acceptable to me.
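A minimal sketch of the stricter check described here, assuming the observer exposes its last known health as a string. The names `observedState`, `healthUnknown`, and the `endpointsReady` input are hypothetical and are not taken from the operator's code.

```go
package main

import "fmt"

// healthUnknown is a hypothetical marker for "no health response received yet".
const healthUnknown = "unknown"

// observedState is a hypothetical snapshot written by the asynchronous observer.
type observedState struct {
	health string // healthUnknown until the first successful health request
}

// esReachable reports the cluster as reachable only if the pre-existing
// readiness signal (assumed here to be a boolean) holds AND the observer has
// already received a health response. No extra request is made; the function
// only reads what the observer has recorded so far.
func esReachable(endpointsReady bool, s observedState) bool {
	return endpointsReady && s.health != healthUnknown
}

func main() {
	// Before the first successful observation, the cluster is not considered reachable.
	fmt.Println(esReachable(true, observedState{health: healthUnknown}))
	// Once the observer has recorded a health value, it is.
	fmt.Println(esReachable(true, observedState{health: "green"}))
}
```

Because the check only reads the observer's last result, reconciliation may see the cluster as unreachable for up to one observation interval longer than strictly necessary, which is the 10 second cost mentioned above.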