Extend full upgrade to any version upgrade of non-HA ES #5408
Conversation
I think I'm missing something, assuming the upscaled+renamed …
I think the problem is that once you rename a …
LGTM from a code perspective.
I'm running TestVersionUpgrade* e2e tests locally, will ✅ once it's done.
@@ -75,7 +76,7 @@ func shouldSetInitialMasterNodes(es esv1.Elasticsearch, k8sClient k8s.Client, no
return true, nil
}
// - we're upgrading (effectively restarting) a single zen1 master to zen2
nit: I think this comment should be updated
I think we have to use the …
diff --git a/test/e2e/es/version_upgrade_test.go b/test/e2e/es/version_upgrade_test.go
index 113fac858..63b9e8705 100644
--- a/test/e2e/es/version_upgrade_test.go
+++ b/test/e2e/es/version_upgrade_test.go
@@ -55,6 +55,10 @@ func TestVersionUpgradeTwoNodes68xTo7x(t *testing.T) {
// due to minimum_master_nodes=2, the cluster is unavailable while the first master is upgraded
initial := elasticsearch.NewBuilder("test-version-up-2-68x-to-7x").
WithVersion(srcVersion).
+ // 7x non-HA upgrades cannot honour a change budget with maxUnavailable 1 because we are rolling both nodes at once
+ // setting the change budget accordingly allows this test to pass as otherwise the changeBudgetWatcher step would
+ // fail as it would detect a change budget violation.
+ WithChangeBudget(2, 2).
WithESMasterDataNodes(2, elasticsearch.DefaultResources)
mutated := initial.WithNoESTopology().
@@ -173,6 +177,10 @@ func TestVersionUpgradeTwoNodesToLatest7x(t *testing.T) {
initial := elasticsearch.NewBuilder("test-version-up-2-to-7x").
WithVersion(srcVersion).
+ // 7x non-HA upgrades cannot honour a change budget with maxUnavailable 1 because we are rolling both nodes at once
+ // setting the change budget accordingly allows this test to pass as otherwise the changeBudgetWatcher step would
+ // fail as it would detect a change budget violation.
+ WithChangeBudget(2, 2).
WithESMasterDataNodes(2, elasticsearch.DefaultResources)
mutated := initial.WithNoESTopology().
Good catch. I thought I tested that but apparently not 😞
LGTM !
The problem is that once you rename a nodeSet we are no longer executing the upgrade path of the reconciliation loop; instead we have two independent operations that are only coordinated through the maxUnavailable change budget: a downscale of the old nodeSet and an upscale of the new nodeSet. Neither the downscale nor the upscale logic benefits from the changes in this commit, so we are again in a situation where things can go wrong.
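To make that scenario concrete, here is a minimal sketch of the kind of combined change being described (resource name, nodeSet names and versions are all invented for illustration): the version is bumped and the nodeSet is renamed in the same apply, so the operator ends up performing a downscale of the old nodeSet plus an upscale of the new one rather than following the upgrade path.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: example
spec:
  version: 7.14.0          # version bumped from e.g. 7.13.4 ...
  nodeSets:
  - name: masters-b        # ... while the nodeSet is renamed (previously "masters-a")
    count: 2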
Elasticsearch documentation implies that two-node clusters cannot be upgraded in a rolling fashion, because we would be removing more than 50% of the nodes during the upgrade. A stricter validation of this invariant, introduced in 7.14, was causing the test failures we have seen in our e2e test pipelines once we had fixed the faulty non-HA upgrade test. This affects any 7.x version, but as of 7.14 there is an error and the nodes will not boot up.
In light of this information, this PR extends the full restart upgrade mode to any version upgrade. It also adjusts the setting of initial master nodes during the 6.x to 7.x transition to include two-node clusters. This gives us a uniform way of handling those upgrades.
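As a hedged illustration of the clusters this applies to (the resource name and versions are made up), a two-node, non-HA cluster like the one below would now go through the full restart upgrade path when spec.version is bumped, and would also have initial master nodes set when crossing the 6.x to 7.x boundary:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: non-ha-example
spec:
  version: 7.14.0   # bumping this now triggers a full restart upgrade for this topology
  nodeSets:
  - name: default
    count: 2        # two master+data nodes, i.e. not highly available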
There is still a case where an upgrade can get stuck that this PR does not cover: if a user combines a version upgrade with an upscale/downscale operation, e.g. by bumping the Elasticsearch version and renaming the nodeSet at the same time. This will lead to something that looks like a rolling upgrade, as we are removing the old nodes and adding new nodes one by one. There is still a chance that the upgrade gets stuck in that scenario and I don't see any easy way to fix that (other than through documentation). One potential way the upgrade might get stuck is as follows:
I think this scenario can be avoided by setting a change budget with maxUnavailable: 0, or by avoiding the combination of nodeSet addition/removal with a version upgrade.
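For reference, a sketch of the first mitigation, assuming the standard update strategy fields of the Elasticsearch resource are used (the resource name and version are made up): with maxUnavailable: 0 the operator should not take a node from the old nodeSet down before its replacement is ready.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: example
spec:
  version: 7.14.0
  updateStrategy:
    changeBudget:
      maxUnavailable: 0   # never remove a node before its replacement is available
  nodeSets:
  - name: default
    count: 2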