
[Bug] Upgrading from ECK 2.1.0 to 2.2.0 Causes issues with Kibana and Fleet during Rolling Restart #5684

Closed
BenB196 opened this issue May 21, 2022 · 4 comments
Labels
>bug Something isn't working v2.3.0

Comments

@BenB196

BenB196 commented May 21, 2022

Bug Report

What did you do?

I upgraded ECK from 2.1.0 to 2.2.0

What did you expect to see?

I expected a rolling upgrade to happen (related #5648).

What did you see instead? Under which circumstances?

I saw a rolling upgrade, but Kibana and Fleet were stuck in crash loops for much of the rolling restart.

Environment

  • ECK version: 2.1.0 -> 2.2.0

  • Kubernetes information:

    • On premise? yes
    • Cloud: GKE / EKS / AKS? n/a
    • Kubernetes distribution: Openshift / Rancher / PKS? Rancher RKE2 v1.22.8+rke2r1
  • Resource definition:
    Set up a "large" cluster across multiple availability zones (AZs), with the ClusterIP service pointing to a subset of nodes across the different AZs.

  • Logs:

Relevant Kibana Log

License information could not be obtained from Elasticsearch due to {"error":{"root_cause":[{"type":"security_exception","reason":"failed to authenticate service account [elastic/kibana] with token name [kibana-prod_kibana-prod_3de51e32-e911-41a2-acd0-c74242b32687]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Bearer realm=\"security\"","ApiKey"]}}],"type":"security_exception","reason":"failed to authenticate service account [elastic/kibana] with token name [kibana-prod_kibana-prod_3de51e32-e911-41a2-acd0-c74242b32687]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Bearer realm=\"security\"","ApiKey"]}},"status":401} error

Issue/Bug:

It appears that with the switch to service accounts for Kibana and Fleet Server (#5468), it can take a decent amount of time in large clusters before all of the ingress nodes behind the service have the new service account tokens available, so authentication keeps failing until they do.

In large clusters where a rolling restart can take hours, this can leave Kibana and Fleet unusable for some time.
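One way to see how far the new token has propagated (a rough sketch, assuming the standard ECK secret/service names shown below and that you run it from outside the cluster) is the get service account credentials API, which lists the nodes on which the file-backed token has been picked up:

# Hypothetical check; $NAME is whatever you named the Elasticsearch cluster.
NAME=$WHATEVER_NAME_YOU_USED_FOR_THE_ELASTICSEARCH_CLUSTER
PW=$(kubectl get secret "$NAME-es-elastic-user" -o go-template='{{.data.elastic | base64decode }}')
kubectl port-forward service/$NAME-es-http 9200 &
curl -sk -u "elastic:$PW" "https://localhost:9200/_security/service/elastic/kibana/credential"
# The "nodes_credentials" section of the response lists the file-backed tokens and the
# nodes they were found on; auth through the ClusterIP service only succeeds reliably
# once the nodes behind that service show up there.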

Steps to Reproduce:

(Note: I'm providing a setup similar to mine as I know it's reproducible with it, but it might also be reproducible with a smaller setup.)

  1. Have the ECK operator on 2.1.0
  2. Have a relatively large cluster (something that takes a decent amount of time to restart).
  3. Have the cluster be spread across multiple availability zones, and have pod and node affinities set, so that ECK will restart the cluster one AZ at a time.
  4. Have the ClusterIP service set up to hit 1 (or more) nodes across the different AZs (see the sketch after this list).
    • e.g.: Having 1 dedicated coordinating node per AZ, and having the ClusterIP service only hit those nodes.
  5. Have at least 1 Kibana instance and Fleet server instance deployed (having more, like 2 or 3, makes the issue more easily noticeable/reproducible)
  6. Upgrade ECK operator to 2.2.0
  7. See that it starts a rolling upgrade on the Elasticsearch cluster, Kibana deployment, and Fleet server deployment
  8. See that Kibana deployment and Fleet server deployment go into crash loops because service account auth is failing.
  9. Wait for all nodes that are backing the ClusterIP service to be restarted
  10. Notice that the Kibana and Fleet server deployments are now able to properly auth with the Elasticsearch cluster.
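A minimal sketch of the topology in step 4, assuming the standard ECK pod labels and a nodeSet named coordinating (all names here are placeholders, not from the original report):

# Hypothetical ClusterIP service that only targets the dedicated coordinating nodes.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: es-coordinating
spec:
  type: ClusterIP
  selector:
    elasticsearch.k8s.elastic.co/cluster-name: my-cluster
    elasticsearch.k8s.elastic.co/statefulset-name: my-cluster-es-coordinating
  ports:
  - name: https
    port: 9200
    targetPort: 9200
EOF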

I would expect ECK either to not start a rolling restart of Kibana and Fleet server until after the rolling restart of the Elasticsearch cluster has completed, or to not switch over to the new service account authentication method until after the rolling restart is completed.

@botelastic botelastic bot added the triage label May 21, 2022
@thbkrkr thbkrkr added the >bug Something isn't working label Jun 1, 2022
@botelastic botelastic bot removed the triage label Jun 1, 2022
@framsouza
Contributor

This issue also impacts the connection between Elasticsearch and Kibana.

@Gainutdinov

same thing on EKS :(

chart version 2.2.0

Kibana says:

[2022-06-23T12:35:18.090+00:00][ERROR][elasticsearch-service] Unable to retrieve version information from Elasticsearch nodes. security_exception: [security_exception] Reason: failed to authenticate service account [elastic/kibana] with token name [elastic-system_kibana-cluster0_679e6c32-3897-4f54-81b6-6ab95d390182]

@pebrc
Collaborator

pebrc commented Jun 23, 2022

While we are looking into a fix for the next release, there are currently two things you can do if you are affected:

  1. simply wait until the service account token is rolled out, which might take quite a bit of time
  2. if that is not an option for you: temporarily create another service account token via the API and configure Kibana or Fleet Server, respectively, to use it.

For this second approach:

For Kibana:

  1. create the service account token as documented in the Elasticsearch docs
NAME=$WHATEVER_NAME_YOU_USED_FOR_THE_ELASTICSEARCH_CLUSTER
PW=$(kubectl get secret "$NAME-es-elastic-user" -o go-template='{{.data.elastic | base64decode }}')
# assuming you run this from outside the k8s cluster
kubectl port-forward service/$NAME-es-http 9200
curl -k -u "elastic:$PW" -X POST "https://localhost:9200/_security/service/elastic/kibana/credential/token/issue-5684"
  2. A successful request will return a JSON document like
{
  "created": true,
  "token": {
    "name": "issue-5684",
    "value": "AAEAAWVsYXN0aWM...vZmxlZXQtc2VydmVyL3Rva2VuMTo3TFdaSDZ" 
  }
}
  3. Configure Kibana to use the token
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
spec:
  version: 8.2.2
  count: 1
  elasticsearchRef:
    name: $NAME
  config:
    elasticsearch.serviceAccountToken: AAEAAWVsYXN0aWM...vZmxlZXQtc2VydmVyL3Rva2VuMTo3TFdaSDZ

This will quickly roll out a new Kibana replica set with a working configuration.
  4. Remove this configuration again once the file-based token the ECK operator uses has been fully deployed in the Elasticsearch cluster, to allow ECK to continue to manage the connection in the future.
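When you remove the temporary configuration, you can also invalidate the API-based token again (a sketch reusing the port-forward and $PW from step 1; issue-5684 is the token name created above):

# Delete the temporary API-based service account token once it is no longer needed.
curl -k -u "elastic:$PW" -X DELETE "https://localhost:9200/_security/service/elastic/kibana/credential/token/issue-5684"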

For Fleet Server

Similar steps as above, but with a different service account token:

curl -k -u "elastic:$PW" -X POST "https://localhost:9200/_security/service/elastic/fleet-server/credential/token/issue-5684"

And the Fleet Server configuration needs to use an environment variable:

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
spec:
  version: 8.2.2
  kibanaRef:
    name: kibana
  elasticsearchRefs:
  - name: $NAME
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
        containers: 
        - name: agent
          env:
          - name: FLEET_SERVER_SERVICE_TOKEN
            value: AAEAAWVsYXN0aWM...vZmxlZXQtc2VydmVyL3Rva2VuMTo3TFdaSDZ

This restores availability quickly because the API-based tokens are immediately available throughout the Elasticsearch cluster, and rolling out a new Kibana or Fleet Server instance is usually quick unless you have many Kibana instances running.
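To confirm the workaround has taken effect, one option is to watch the rollouts. This is a sketch that assumes the usual ECK naming of the Deployments (<kibana-name>-kb and <agent-name>-agent, matching the example resources above), so adjust the names to your own resources:

kubectl rollout status deployment/kibana-kb
kubectl rollout status deployment/fleet-server-agent
# The Kibana logs should no longer show the 401 security_exception once the new Pods are up.
kubectl logs deployment/kibana-kb --tail=20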

@pebrc
Collaborator

pebrc commented Jul 4, 2022

We have a fix in #5830, which shipped with ECK 2.3, so I am closing this issue for now and will also update the known issue in the documentation for ECK 2.2.

@pebrc pebrc closed this as completed Jul 4, 2022