
[Bug] Upgrading from ECK 2.1.0 to 2.2.0 Causes issues with Kibana and Fleet during Rolling Restart #5684

Closed
BenB196 opened this issue May 21, 2022 · 4 comments
Labels
>bug Something isn't working v2.3.0

Comments

@BenB196

BenB196 commented May 21, 2022

Bug Report

What did you do?

I upgraded ECK from 2.1.0 to 2.2.0

What did you expect to see?

I expected a rolling upgrade to happen (related #5648).

What did you see instead? Under which circumstances?

I saw a rolling upgrade, but Kibana and Fleet were stuck in crash loops for much of the rolling restart.

Environment

  • ECK version: 2.1.0 -> 2.2.0

  • Kubernetes information:

    • On premise? yes
    • Cloud: GKE / EKS / AKS? n/a
    • Kubernetes distribution: Openshift / Rancher / PKS? Rancher RKE2 v1.22.8+rke2r1
  • Resource definition:
    Set up a "large" cluster across multiple availability zones (AZs), with the ClusterIP service pointing to a subset of nodes across the different AZs.

  • Logs:

Relevant Kibana Log

License information could not be obtained from Elasticsearch due to {"error":{"root_cause":[{"type":"security_exception","reason":"failed to authenticate service account [elastic/kibana] with token name [kibana-prod_kibana-prod_3de51e32-e911-41a2-acd0-c74242b32687]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Bearer realm=\"security\"","ApiKey"]}}],"type":"security_exception","reason":"failed to authenticate service account [elastic/kibana] with token name [kibana-prod_kibana-prod_3de51e32-e911-41a2-acd0-c74242b32687]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Bearer realm=\"security\"","ApiKey"]}},"status":401} error

Issue/Bug:

It appears that with the switch to service accounts for Kibana and Fleet Server (#5468), it can take a decent amount of time in large clusters before all of the ingress nodes behind the service have the new service account tokens available, so authentication keeps failing until they do.

In large clusters where a rolling restart can take hours, this can leave Kibana and Fleet unusable for some time.
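One way to see how far the new token has propagated (a rough sketch, assuming the standard ECK secret/service names shown below and that you run it from outside the cluster) is the get service account credentials API, which lists the nodes on which the file-backed token has been picked up:

# Hypothetical check; $NAME is whatever you named the Elasticsearch cluster.
NAME=$WHATEVER_NAME_YOU_USED_FOR_THE_ELASTICSEARCH_CLUSTER
PW=$(kubectl get secret "$NAME-es-elastic-user" -o go-template='{{.data.elastic | base64decode }}')
kubectl port-forward service/$NAME-es-http 9200 &
curl -sk -u "elastic:$PW" "https://localhost:9200/_security/service/elastic/kibana/credential"
# The "nodes_credentials" section of the response lists the file-backed tokens and the
# nodes they were found on; auth through the ClusterIP service only succeeds reliably
# once the nodes behind that service show up there.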

Steps to Reproduce:

(Note: I'm providing a setup similar to mine as I know it's reproducible with it, but it might also be reproducible with a smaller setup.)

  1. Have the ECK operator on 2.1.0
  2. Have a relatively large cluster (something that takes a decent amount of time to restart).
  3. Have the cluster be spread across multiple availability zones, and have pod and node affinities set, so that ECK will restart the cluster one AZ at a time.
  4. Have the ClusterIP service set up to hit 1 (or more) nodes across the different AZs (see the sketch after this list).
    • e.g.: Having 1 dedicated coordinating node per AZ, and having the ClusterIP service only hit those nodes.
  5. Have at least 1 Kibana instance and Fleet server instance deployed (having more, like 2 or 3, makes the issue more easily noticeable/reproducible)
  6. Upgrade ECK operator to 2.2.0
  7. See that it starts a rolling upgrade on the Elasticsearch cluster, Kibana deployment, and Fleet server deployment
  8. See that Kibana deployment and Fleet server deployment go into crash loops because service account auth is failing.
  9. Wait for all nodes that are backing the ClusterIP service to be restarted
  10. Notice that the Kibana and Fleet server deployments are now able to properly auth with the Elasticsearch cluster.
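A minimal sketch of the topology in step 4, assuming the standard ECK pod labels and a nodeSet named coordinating (all names here are placeholders, not from the original report):

# Hypothetical ClusterIP service that only targets the dedicated coordinating nodes.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: es-coordinating
spec:
  type: ClusterIP
  selector:
    elasticsearch.k8s.elastic.co/cluster-name: my-cluster
    elasticsearch.k8s.elastic.co/statefulset-name: my-cluster-es-coordinating
  ports:
  - name: https
    port: 9200
    targetPort: 9200
EOF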

I would expect ECK either to not start a rolling restart of Kibana and Fleet server until after the rolling restart of the Elasticsearch cluster has completed, or to not switch over to the new service account authentication method until after the rolling restart is completed.

@botelastic botelastic bot added the triage label May 21, 2022
@thbkrkr thbkrkr added the >bug Something isn't working label Jun 1, 2022
@botelastic botelastic bot removed the triage label Jun 1, 2022
@framsouza
Contributor

This issue also impacts the connection between Elasticsearch and Kibana.

@Gainutdinov

same thing on EKS :(

chart version 2.2.0

Kibana says:

[2022-06-23T12:35:18.090+00:00][ERROR][elasticsearch-service] Unable to retrieve version information from Elasticsearch nodes. security_exception: [security_exception] Reason: failed to authenticate service account [elastic/kibana] with token name [elastic-system_kibana-cluster0_679e6c32-3897-4f54-81b6-6ab95d390182]

@pebrc
Collaborator

pebrc commented Jun 23, 2022

While we are looking into a fix for the next release, there are currently two things you can do if you are affected:

  1. simply wait until the service account token is rolled out, which might take quite a bit of time
  2. if that is not an option for you: temporarily create another service account token via the API and configure Kibana or Fleet Server, respectively, to use it.

For this second approach:

For Kibana:

  1. create the service account token as documented in the Elasticsearch docs
NAME=$WHATEVER_NAME_YOU_USED_FOR_THE_ELASTICSEARCH_CLUSTER
PW=$(kubectl get secret "$NAME-es-elastic-user" -o go-template='{{.data.elastic | base64decode }}')
# assuming you run this from outside the k8s cluster
kubectl port-forward service/$NAME-es-http 9200
curl -k -u "elastic:$PW" -X POST "https://localhost:9200/_security/service/elastic/kibana/credential/token/issue-5684"
  2. A successful request will return a JSON document like
{
  "created": true,
  "token": {
    "name": "issue-5684",
    "value": "AAEAAWVsYXN0aWM...vZmxlZXQtc2VydmVyL3Rva2VuMTo3TFdaSDZ" 
  }
}
  3. Configure Kibana to use the token
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
spec:
  version: 8.2.2
  count: 1
  elasticsearchRef:
    name: $NAME
  config:
    elasticsearch.serviceAccountToken: AAEAAWVsYXN0aWM...vZmxlZXQtc2VydmVyL3Rva2VuMTo3TFdaSDZ

This will quickly roll out a new Kibana replica set with a working configuration.
  4. Remove this configuration again once the file-based token the ECK operator uses has been fully deployed in the Elasticsearch cluster, to allow ECK to continue to manage the connection in the future.
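When you remove the temporary configuration, you can also invalidate the API-based token again (a sketch reusing the port-forward and $PW from step 1; issue-5684 is the token name created above):

# Delete the temporary API-based service account token once it is no longer needed.
curl -k -u "elastic:$PW" -X DELETE "https://localhost:9200/_security/service/elastic/kibana/credential/token/issue-5684"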

For Fleet Server

Similar steps as above, but with a different service account token:

curl -k -u "elastic:$PW" -X POST "https://localhost:9200/_security/service/elastic/fleet-server/credential/token/issue-5684"

And the Fleet Server configuration needs to use an environment variable:

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
spec:
  version: 8.2.2
  kibanaRef:
    name: kibana
  elasticsearchRefs:
  - name: $NAME
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
        containers: 
        - name: agent
          env:
          - name: FLEET_SERVER_SERVICE_TOKEN
            value: AAEAAWVsYXN0aWM...vZmxlZXQtc2VydmVyL3Rva2VuMTo3TFdaSDZ

This restores availability quickly because the API-based tokens are immediately available throughout the Elasticsearch cluster, and rolling out a new Kibana or Fleet Server instance is usually quick unless you have many Kibana instances running.
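To confirm the workaround has taken effect, one option is to watch the rollouts. This is a sketch that assumes the usual ECK naming of the Deployments (<kibana-name>-kb and <agent-name>-agent, matching the example resources above), so adjust the names to your own resources:

kubectl rollout status deployment/kibana-kb
kubectl rollout status deployment/fleet-server-agent
# The Kibana logs should no longer show the 401 security_exception once the new Pods are up.
kubectl logs deployment/kibana-kb --tail=20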

@pebrc
Collaborator

pebrc commented Jul 4, 2022

We have a fix in #5830, which shipped with ECK 2.3, so I am closing this issue for now and will also update the known issue in the documentation for ECK 2.2.

@pebrc pebrc closed this as completed Jul 4, 2022