Rolling upgrades between major versions for 2-node Elasticsearch cluster impossible #5321

Closed
thbkrkr opened this issue Feb 3, 2022 · 2 comments · Fixed by #5327
Labels
>bug Something isn't working

Comments

@thbkrkr
Contributor

thbkrkr commented Feb 3, 2022

Rolling upgrades between major versions for a 2-node Elasticsearch cluster can be impossible. If the first node to be upgraded is the master node, the rolling upgrade gets stuck because the cluster never gets back to two nodes: the second, not-yet-upgraded node cannot join the cluster formed by the first upgraded node, and cluster health stays yellow.

Log from the second node:

{
  "type": "server",
  "timestamp": "2022-02-03T09:29:04,631Z",
  "level": "WARN",
  "component": "o.e.c.c.JoinHelper",
  "cluster.name": "test-version-up-2-to-8x-wddt",
  "node.name": "test-version-up-2-to-8x-wddt-es-masterdata-0",
	
  "message": "last failed join attempt was 7ms ago, failed to join {test-version-up-2-to-8x-wddt-es-masterdata-1}{Dm1aKtG4QtqAWdG1qYhxrg}{66D72IvmSR2LTi77eKbd_g}{10.42.176.208}{10.42.176.208:9300}{cdfhilmrstw}{k8s_node_name=gke-thbkrkr-dev-cluster-default-pool-9292dcc0-3h08, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824} with JoinRequest{sourceNode={test-version-up-2-to-8x-wddt-es-masterdata-0}{A50qiG3DRo-dL0toCbiQxA}{_tnSXZa1QOGe97A63qoKDw}{10.42.177.120}{10.42.177.120:9300}{cdfhilmrstw}{k8s_node_name=gke-thbkrkr-dev-cluster-default-pool-a617668f-p82k, ml.machine_memory=2147483648, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}, minimumTerm=2, optionalJoin=Optional[Join{term=2, lastAcceptedTerm=1, lastAcceptedVersion=65, sourceNode={test-version-up-2-to-8x-wddt-es-masterdata-0}{A50qiG3DRo-dL0toCbiQxA}{_tnSXZa1QOGe97A63qoKDw}{10.42.177.120}{10.42.177.120:9300}{cdfhilmrstw}{k8s_node_name=gke-thbkrkr-dev-cluster-default-pool-a617668f-p82k, ml.machine_memory=2147483648, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}, targetNode={test-version-up-2-to-8x-wddt-es-masterdata-1}{Dm1aKtG4QtqAWdG1qYhxrg}{66D72IvmSR2LTi77eKbd_g}{10.42.176.208}{10.42.176.208:9300}{cdfhilmrstw}{k8s_node_name=gke-thbkrkr-dev-cluster-default-pool-9292dcc0-3h08, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824}}]}",
  "cluster.uuid": "En4wbZQ-Ru-J0IXK-Ysl0g",
  "node.id": "A50qiG3DRo-dL0toCbiQxA" , 
  "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [test-version-up-2-to-8x-wddt-es-masterdata-1][10.42.176.208:9300][internal:cluster/coordination/join]",
    "Caused by: java.lang.IllegalStateException:

      node version [7.17.0] may not join a cluster comprising only nodes of version [8.0.0] or greater",

    "at org.elasticsearch.cluster.coordination.JoinTaskExecutor.ensureVersionBarrier(JoinTaskExecutor.java:325) ~[elasticsearch-7.17.0-SNAPSHOT.jar:7.17.0-SNAPSHOT]",
    "at org.elasticsearch.cluster.coordination.Coordinator.validateJoinRequest(Coordinator.java:585) ~[elasticsearch-7.17.0-SNAPSHOT.jar:7.17.0-SNAPSHOT]",
    "at org.elasticsearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$9(Coordinator.java:556) ~[elasticsearch-7.17.0-SNAPSHOT.jar:7.17.0-SNAPSHOT]",
    "at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) ~[elasticsearch-7.17.0-SNAPSHOT.jar:7.17.0-SNAPSHOT]",
...
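
For context, the join rejection comes from Elasticsearch's version barrier (JoinTaskExecutor.ensureVersionBarrier in the stack trace): per the error message, a node may not join a cluster whose nodes are all on a higher version than its own. The Go sketch below only models that rule for illustration; the real check is Java code inside Elasticsearch.

// Illustration only: models the version barrier that rejects the join above.
package main

import "fmt"

type version struct{ major, minor, patch int }

// before reports whether v is an older version than o.
func (v version) before(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.patch < o.patch
}

// ensureVersionBarrier mimics the check that fails here: a joining node is
// rejected if it is older than every node already in the cluster.
func ensureVersionBarrier(joining, minClusterNode version) error {
	if joining.before(minClusterNode) {
		return fmt.Errorf("node version [%d.%d.%d] may not join a cluster comprising only nodes of version [%d.%d.%d] or greater",
			joining.major, joining.minor, joining.patch,
			minClusterNode.major, minClusterNode.minor, minClusterNode.patch)
	}
	return nil
}

func main() {
	// The situation from the log: a 7.17.0 node joining an 8.0.0-only cluster.
	fmt.Println(ensureVersionBarrier(version{7, 17, 0}, version{8, 0, 0}))
}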
YAML manifest to reproduce:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: test-version-up-2-to-8x
  namespace: e2e-mercury
spec:
  version: 7.17.0-SNAPSHOT
  #version: 8.0.0-SNAPSHOT
  nodeSets:
  - config:
      logger.org.elasticsearch.cluster.service.MasterService: trace
      node.store.allow_mmap: false
    count: 2
    name: masterdata
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 2Gi
        securityContext:
          fsGroup: 12345
          runAsNonRoot: true
          runAsUser: 12345
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
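
Presumably the intended reproduction is: apply this manifest, wait for the two-node cluster to form, then switch spec.version to the commented-out 8.0.0-SNAPSHOT value and re-apply; the rolling upgrade then gets stuck as described above when the master node is upgraded first.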

ECK should do a full cluster restart on clusters with only 1 or 2 voting/master-eligible nodes.
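
A minimal sketch of that idea in Go (the operator's language); the names below (shouldForceFullRestart, masterEligibleNodes) are hypothetical and not the actual ECK upgrade-driver code:

// Hypothetical sketch of the proposal above: when a cluster has only one or
// two master-eligible nodes and the target version crosses a major boundary,
// skip the graceful rolling upgrade and recreate all pods at once.
func shouldForceFullRestart(masterEligibleNodes, currentMajor, targetMajor int) bool {
	if masterEligibleNodes > 2 {
		// With three or more masters a rolling upgrade can keep quorum.
		return false
	}
	// With 1 or 2 masters, a major upgrade can deadlock: the upgraded node
	// forms a new-major cluster that the remaining old-major node cannot join.
	return targetMajor > currentMajor
}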

thbkrkr added the >bug label on Feb 3, 2022
@pebrc
Collaborator

pebrc commented Feb 3, 2022

I think we should extend the condition for "forced upgrades" to include 2-node clusters and not restrict it to major version upgrades. There will always be a loss of availability on a two-node cluster when upgrading, so it does not make sense to try to orchestrate it gracefully.
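
In code terms this would amount to dropping the major-version restriction from the sketch in the previous comment, roughly (again with hypothetical names, not actual ECK code):

// Hypothetical broader condition: force the non-graceful upgrade for any
// version change on a one- or two-master cluster, since availability is
// lost during the upgrade either way.
func shouldForceFullRestart(masterEligibleNodes int) bool {
	return masterEligibleNodes <= 2
}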

@pebrc
Collaborator

pebrc commented Feb 4, 2022

A few more observations:

  • this seems to affect only major version upgrades; I have not been able to reproduce the issue on a minor upgrade
  • I was originally tempted to do all cluster changes on single- and two-node clusters in a full-restart fashion, but there is an argument for sticking to rolling upgrades for most changes: even after the cluster breaks down, individual nodes can still serve partial search results (depending on shard placement, of course)
