Rolling upgrades between major versions for 2-node Elasticsearch cluster impossible #5321

Closed
thbkrkr opened this issue Feb 3, 2022 · 2 comments · Fixed by #5327
Labels
>bug Something isn't working

Comments

@thbkrkr
Contributor

thbkrkr commented Feb 3, 2022

Rolling upgrades between major versions for a 2-node Elasticsearch cluster can be impossible. If the first node to be upgraded is the master node, the rolling upgrade gets stuck because the cluster never gets back to two nodes: the second, not-yet-upgraded node cannot join the cluster formed by the first upgraded node, and cluster health stays yellow.

Log from the second node:

{
  "type": "server",
  "timestamp": "2022-02-03T09:29:04,631Z",
  "level": "WARN",
  "component": "o.e.c.c.JoinHelper",
  "cluster.name": "test-version-up-2-to-8x-wddt",
  "node.name": "test-version-up-2-to-8x-wddt-es-masterdata-0",
	
  "message": "last failed join attempt was 7ms ago, failed to join {test-version-up-2-to-8x-wddt-es-masterdata-1}{Dm1aKtG4QtqAWdG1qYhxrg}{66D72IvmSR2LTi77eKbd_g}{10.42.176.208}{10.42.176.208:9300}{cdfhilmrstw}{k8s_node_name=gke-thbkrkr-dev-cluster-default-pool-9292dcc0-3h08, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824} with JoinRequest{sourceNode={test-version-up-2-to-8x-wddt-es-masterdata-0}{A50qiG3DRo-dL0toCbiQxA}{_tnSXZa1QOGe97A63qoKDw}{10.42.177.120}{10.42.177.120:9300}{cdfhilmrstw}{k8s_node_name=gke-thbkrkr-dev-cluster-default-pool-a617668f-p82k, ml.machine_memory=2147483648, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}, minimumTerm=2, optionalJoin=Optional[Join{term=2, lastAcceptedTerm=1, lastAcceptedVersion=65, sourceNode={test-version-up-2-to-8x-wddt-es-masterdata-0}{A50qiG3DRo-dL0toCbiQxA}{_tnSXZa1QOGe97A63qoKDw}{10.42.177.120}{10.42.177.120:9300}{cdfhilmrstw}{k8s_node_name=gke-thbkrkr-dev-cluster-default-pool-a617668f-p82k, ml.machine_memory=2147483648, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}, targetNode={test-version-up-2-to-8x-wddt-es-masterdata-1}{Dm1aKtG4QtqAWdG1qYhxrg}{66D72IvmSR2LTi77eKbd_g}{10.42.176.208}{10.42.176.208:9300}{cdfhilmrstw}{k8s_node_name=gke-thbkrkr-dev-cluster-default-pool-9292dcc0-3h08, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824}}]}",
  "cluster.uuid": "En4wbZQ-Ru-J0IXK-Ysl0g",
  "node.id": "A50qiG3DRo-dL0toCbiQxA" , 
  "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [test-version-up-2-to-8x-wddt-es-masterdata-1][10.42.176.208:9300][internal:cluster/coordination/join]",
    "Caused by: java.lang.IllegalStateException:

      node version [7.17.0] may not join a cluster comprising only nodes of version [8.0.0] or greater",

    "at org.elasticsearch.cluster.coordination.JoinTaskExecutor.ensureVersionBarrier(JoinTaskExecutor.java:325) ~[elasticsearch-7.17.0-SNAPSHOT.jar:7.17.0-SNAPSHOT]",
    "at org.elasticsearch.cluster.coordination.Coordinator.validateJoinRequest(Coordinator.java:585) ~[elasticsearch-7.17.0-SNAPSHOT.jar:7.17.0-SNAPSHOT]",
    "at org.elasticsearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$9(Coordinator.java:556) ~[elasticsearch-7.17.0-SNAPSHOT.jar:7.17.0-SNAPSHOT]",
    "at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) ~[elasticsearch-7.17.0-SNAPSHOT.jar:7.17.0-SNAPSHOT]",
...
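
For context, the join rejection comes from Elasticsearch's version barrier (JoinTaskExecutor.ensureVersionBarrier in the stack trace): per the error message, a node may not join a cluster whose nodes are all on a higher version than its own. The Go sketch below only models that rule for illustration; the real check is Java code inside Elasticsearch.

// Illustration only: models the version barrier that rejects the join above.
package main

import "fmt"

type version struct{ major, minor, patch int }

// before reports whether v is an older version than o.
func (v version) before(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	if v.minor != o.minor {
		return v.minor < o.minor
	}
	return v.patch < o.patch
}

// ensureVersionBarrier mimics the check that fails here: a joining node is
// rejected if it is older than every node already in the cluster.
func ensureVersionBarrier(joining, minClusterNode version) error {
	if joining.before(minClusterNode) {
		return fmt.Errorf("node version [%d.%d.%d] may not join a cluster comprising only nodes of version [%d.%d.%d] or greater",
			joining.major, joining.minor, joining.patch,
			minClusterNode.major, minClusterNode.minor, minClusterNode.patch)
	}
	return nil
}

func main() {
	// The situation from the log: a 7.17.0 node joining an 8.0.0-only cluster.
	fmt.Println(ensureVersionBarrier(version{7, 17, 0}, version{8, 0, 0}))
}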
YAML manifest to reproduce:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: test-version-up-2-to-8x
  namespace: e2e-mercury
spec:
  version: 7.17.0-SNAPSHOT
  #version: 8.0.0-SNAPSHOT
  nodeSets:
  - config:
      logger.org.elasticsearch.cluster.service.MasterService: trace
      node.store.allow_mmap: false
    count: 2
    name: masterdata
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 2Gi
        securityContext:
          fsGroup: 12345
          runAsNonRoot: true
          runAsUser: 12345
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
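
Presumably the intended reproduction is: apply this manifest, wait for the two-node cluster to form, then switch spec.version to the commented-out 8.0.0-SNAPSHOT value and re-apply; the rolling upgrade then gets stuck as described above when the master node is upgraded first.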

ECK should do a full cluster restart on clusters with only 1 or 2 voting/master-eligible nodes.
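
A minimal sketch of that idea in Go (the operator's language); the names below (shouldForceFullRestart, masterEligibleNodes) are hypothetical and not the actual ECK upgrade-driver code:

// Hypothetical sketch of the proposal above: when a cluster has only one or
// two master-eligible nodes and the target version crosses a major boundary,
// skip the graceful rolling upgrade and recreate all pods at once.
func shouldForceFullRestart(masterEligibleNodes, currentMajor, targetMajor int) bool {
	if masterEligibleNodes > 2 {
		// With three or more masters a rolling upgrade can keep quorum.
		return false
	}
	// With 1 or 2 masters, a major upgrade can deadlock: the upgraded node
	// forms a new-major cluster that the remaining old-major node cannot join.
	return targetMajor > currentMajor
}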

thbkrkr added the >bug label on Feb 3, 2022
@pebrc
Collaborator

pebrc commented Feb 3, 2022

I think we should extend the condition for "forced upgrades" to include 2-node clusters and not restrict it to major version upgrades. There will always be a loss of availability on a two-node cluster when upgrading, so it does not make sense to try to orchestrate it gracefully.
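
In code terms this would amount to dropping the major-version restriction from the sketch in the previous comment, roughly (again with hypothetical names, not actual ECK code):

// Hypothetical broader condition: force the non-graceful upgrade for any
// version change on a one- or two-master cluster, since availability is
// lost during the upgrade either way.
func shouldForceFullRestart(masterEligibleNodes int) bool {
	return masterEligibleNodes <= 2
}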

@pebrc
Collaborator

pebrc commented Feb 4, 2022

A few more observations:

  • this seems to affect only major version upgrades; I have not been able to reproduce the issue on a minor upgrade
  • I was originally tempted to do all cluster changes on single- and two-node clusters in a full-restart fashion, but there is an argument for sticking to rolling upgrades for most changes: even after the cluster breaks down, individual nodes can still serve partial search results (depending on shard placement, of course)
