kubelet service not started on scaleup node boot #1899

Closed
DanyC97 opened this issue Jun 24, 2019 · 24 comments

DanyC97 commented Jun 24, 2019

Version

$ openshift-install version
openshift-install v4.1.0-201905212232-dirty
built from commit 71d8978039726046929729ad15302973e3da18ce
release image quay.io/openshift-release-dev/ocp-release@sha256:b8307ac0f3ec4ac86c3f3b52846425205022da52c16f56ec31cbe428501001d6

RHCOS

rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T22:55:04Z)

Platform (aws|libvirt|openstack):

vmware

What happened?

Deployed a UPI VMware cluster and then started to add a new node. The VM booted, Ignition kicked in and laid down RHCOS, including setting the static IP and the hostname; however, the node never joined the K8s cluster, and oc get nodes didn't show the new node.

What you expected to happen?

I expected the new node to show up in the oc get nodes output.

How to reproduce it (as minimally and precisely as possible)?

  1. create a cluster
  2. scale up a node using a custom dani-k8s-node-2.ign file, to which the files below were added so that we could set a static IP and hostname
cat > ${NGINX_DIRECTORY}/${HOST}-ens192 << EOF
DEVICE=ens192
BOOTPROTO=none
ONBOOT=yes
NETMASK=${NETMASK}
IPADDR=${IP}
GATEWAY=${GATEWAY}
PEERDNS=no
DNS1=${DNS1}
DNS2=${DNS2}
DOMAIN=${DOMAIN_NAME}
IPV6INIT=no
EOF

  ENS192=$(cat ${NGINX_DIRECTORY}/${HOST}-ens192 | base64 -w0)
  rm ${NGINX_DIRECTORY}/${HOST}-ens192

  cat > ${NGINX_DIRECTORY}/${HOST}-ifcfg-ens192.json << EOF
{
  "append" : false,
  "mode" : 420,
  "filesystem" : "root",
  "path" : "/etc/sysconfig/network-scripts/ifcfg-ens192",
  "contents" : {
    "source" : "data:text/plain;charset=utf-8;base64,${ENS192}",
    "verification" : {}
  },
  "user" : {
    "name" : "root"
  },
  "group": {
    "name": "root"
  }
}
EOF

and

cat > ${NGINX_DIRECTORY}/${HOST}-hostname << EOF
${HOST}.${DOMAIN_NAME}
EOF
  HN=$(cat ${NGINX_DIRECTORY}/${HOST}-hostname | base64 -w0)
  rm ${NGINX_DIRECTORY}/${HOST}-hostname
  cat > ${NGINX_DIRECTORY}/${HOST}-hostname.json << EOF
{
  "append" : false,
  "mode" : 420,
  "filesystem" : "root",
  "path" : "/etc/hostname",
  "contents" : {
    "source" : "data:text/plain;charset=utf-8;base64,${HN}",
    "verification" : {}
  },
  "user" : {
    "name" : "root"
  },
  "group": {
    "name": "root"
  }
}
EOF
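
The report doesn't show how these two JSON fragments were spliced into dani-k8s-node-2.ign, so the following is only a hypothetical merge step (the worker.ign input and the output file name are assumptions), appending both fragments to the config's storage.files array with jq:

# hypothetical merge step, not part of the original report
jq --slurpfile eth "${NGINX_DIRECTORY}/${HOST}-ifcfg-ens192.json" \
   --slurpfile hn "${NGINX_DIRECTORY}/${HOST}-hostname.json" \
   '.storage.files += $eth + $hn' \
   worker.ign > "${NGINX_DIRECTORY}/${HOST}.ign"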

Note that a dummy ignition file

dani-k8s-node-2-ignition-starter.ign
{
  "ignition": {
    "config": {
      "append": [
        {
          "source": "http://10.85.174.63/repo/dani-k8s-node-2.ign",
          "verification": {}
        }
      ]
    },
    "timeouts": {},
    "version": "2.2.0"
  },
  "networkd": {},
  "passwd": {},
  "storage": {},
  "systemd": {}
}


was passed, which then redirected to the custom dani-k8s-node-2.ign file into which the above snippets had been injected. That triggered a reboot between the Ignition apply steps.

  3. observe whether the node joins the cluster; if not, check whether kubelet.service is up and running (see the sketch below).
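
For step 3, a minimal check could look like this (run the oc command from a machine with the cluster kubeconfig, and the systemctl commands on the new node itself):

# does the node show up at all?
oc get nodes

# on the new node: is the kubelet unit running and enabled?
systemctl status kubelet.service
systemctl is-enabled kubelet.service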

Note - I guess your Terraform vSphere UPI example should allow you to reproduce the issue; I haven't tried using your code.

Anything else we need to know?

On closer inspection, while ssh'ed onto the node, I found the following:

  • kubelet.service is not active and is disabled
[root@dani-k8s-node-2 ~]# systemctl list-unit-files | grep disabled | egrep "mcd|kubelet|rpm-ostree|podman|crio"
crio-shutdown.service                          disabled
io.podman.service                              disabled
kubelet.service                                disabled
mcd-write-pivot-reboot.service                 disabled
io.podman.socket                               disabled
rpm-ostreed-automatic.timer                    disabled

and

[root@dani-k8s-node-2 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

  • although the systemd unit file looks okay, there is no symlink created, which explains why the service is neither enabled nor running
[root@dani-k8s-node-2 ~]# cat  /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service

[Service]
Type=notify
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state
EnvironmentFile=/etc/os-release
EnvironmentFile=-/etc/kubernetes/kubelet-workaround
EnvironmentFile=-/etc/kubernetes/kubelet-env

ExecStart=/usr/bin/hyperkube \
    kubelet \
      --config=/etc/kubernetes/kubelet.conf \
      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --container-runtime=remote \
      --container-runtime-endpoint=/var/run/crio/crio.sock \
      --allow-privileged \
      --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_version=${VERSION_ID},node.openshift.io/os_id=${ID} \
      --minimum-container-ttl-duration=6m0s \
      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \
      --client-ca-file=/etc/kubernetes/ca.crt \
      --cloud-provider=vsphere \
       \
      --anonymous-auth=false \
      --v=3 \

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

but no symlink present

[root@dani-k8s-node-2 ~]# ll /etc/systemd/system/multi-user.target.wants/
total 0
lrwxrwxrwx. 1 root root 39 May 20 23:06 chronyd.service -> /usr/lib/systemd/system/chronyd.service
lrwxrwxrwx. 1 root root 70 May 20 23:06 console-login-helper-messages-issuegen.service -> /usr/lib/systemd/system/console-login-helper-messages-issuegen.service
lrwxrwxrwx. 1 root root 47 May 20 23:06 coreos-growpart.service -> /usr/lib/systemd/system/coreos-growpart.service
lrwxrwxrwx. 1 root root 69 May 20 23:06 coreos-regenerate-iscsi-initiatorname.service -> /usr/lib/systemd/system/coreos-regenerate-iscsi-initiatorname.service
lrwxrwxrwx. 1 root root 67 May 20 23:06 coreos-root-bash-profile-workaround.service -> /usr/lib/systemd/system/coreos-root-bash-profile-workaround.service
lrwxrwxrwx. 1 root root 51 May 20 23:06 coreos-useradd-core.service -> /usr/lib/systemd/system/coreos-useradd-core.service
lrwxrwxrwx. 1 root root 36 May 20 23:06 crio.service -> /usr/lib/systemd/system/crio.service
lrwxrwxrwx. 1 root root 59 May 20 23:06 ignition-firstboot-complete.service -> /usr/lib/systemd/system/ignition-firstboot-complete.service
lrwxrwxrwx. 1 root root 42 May 20 23:06 irqbalance.service -> /usr/lib/systemd/system/irqbalance.service
lrwxrwxrwx. 1 root root 41 May 20 23:06 mdmonitor.service -> /usr/lib/systemd/system/mdmonitor.service
lrwxrwxrwx. 1 root root 46 May 20 23:06 NetworkManager.service -> /usr/lib/systemd/system/NetworkManager.service
lrwxrwxrwx. 1 root root 51 Jun 17 17:24 ostree-finalize-staged.path -> /usr/lib/systemd/system/ostree-finalize-staged.path
lrwxrwxrwx. 1 root root 37 May 20 23:06 pivot.service -> /usr/lib/systemd/system/pivot.service
lrwxrwxrwx. 1 root root 48 May 20 23:06 remote-cryptsetup.target -> /usr/lib/systemd/system/remote-cryptsetup.target
lrwxrwxrwx. 1 root root 40 May 20 23:06 remote-fs.target -> /usr/lib/systemd/system/remote-fs.target
lrwxrwxrwx. 1 root root 53 May 20 23:06 rpm-ostree-bootstatus.service -> /usr/lib/systemd/system/rpm-ostree-bootstatus.service
lrwxrwxrwx. 1 root root 36 May 20 23:06 sshd.service -> /usr/lib/systemd/system/sshd.service
lrwxrwxrwx. 1 root root 40 May 20 23:06 vmtoolsd.service -> /usr/lib/systemd/system/vmtoolsd.service

Running systemctl start kubelet.service kicks off the whole process, and at the end the node joins the cluster.
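
(For anyone hitting the same symptom: starting the unit by hand only covers the current boot. To also survive reboots the unit would need to be enabled, e.g. as below, although on RHCOS the Ignition/MCO-generated config is what is supposed to do this for you.)

# one-off workaround on the affected node
sudo systemctl enable --now kubelet.service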


DanyC97 commented Jun 24, 2019

/cc @cgwalters in case you have some thoughts from the RHCOS/MCO side.

@abhinavdahiya
Contributor

are you sure you approved the CSR for your machine https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere

@abhinavdahiya
Contributor

was passed which then redirected to the custom dani-k8s-node-2.ign file which had injected the above snippets. That triggered a reboot between the ignition apply steps

what do you mean by rebooted between ignition apply steps.. ignition doesn't require reboot.


DanyC97 commented Jun 25, 2019

@abhinavdahiya thank you for taking the time to respond, much appreciated !

are you sure you approved the CSR for your machine https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere

I never needed to approve any CSR; what I saw on previous deployments (for control and compute nodes) was that the CSRs were auto-approved.

I'm not sure if there are two paths here in v4 w.r.t. approving the CSRs (if there are, please help me understand how it works); all I can say is that, looking at the cluster I have running (where the above node didn't join the cluster), I see

oc get -n openshift-cluster-machine-approver all
NAME                                    READY   STATUS    RESTARTS   AGE
pod/machine-approver-7cd7f97455-g9x5q   1/1     Running   0          11d

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/machine-approver   1/1     1            1           11d

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/machine-approver-7cd7f97455   1         1         1       11d

which comes from cluster-machine-approver, and I can see traces like the following in the pod's log:

I0624 20:31:13.706332       1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.708184       1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
I0624 20:31:13.708206       1 main.go:164] Error syncing csr csr-5tpq5: Invalid request
I0624 20:31:13.748375       1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.750478       1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
I0624 20:31:13.750496       1 main.go:164] Error syncing csr csr-5tpq5: Invalid request
I0624 20:31:13.830649       1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.833242       1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
E0624 20:31:13.833321       1 main.go:174] Invalid request
I0624 20:31:13.833366       1 main.go:175] Dropping CSR "csr-5tpq5" out of the queue: Invalid request

That said, I am not sure if this is part of the machine-api-operator, as I haven't worked out how to trace a deployment back to its operator (any hints much appreciated ;)), but I guess it is.
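
(Not from this thread, just a generic way to look for an owner: operator-managed resources often carry ownerReferences, though not every operator sets them, so this may come back empty.)

# generic owner lookup for the deployment seen above
oc get deployment machine-approver -n openshift-cluster-machine-approver \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'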

Update

see attached the whole machine-approver-7cd7f97455-g9x5q pod's log in case you spot something ...

pod-machine-approver-7cd7f97455-g9x5q.log


DanyC97 commented Jun 25, 2019

was passed which then redirected to the custom dani-k8s-node-2.ign file which had injected the above snippets. That triggered a reboot between the ignition apply steps

what do you mean by rebooted between ignition apply steps.. ignition doesn't require reboot.

please see the dani-k8s-node-2_journalctl.log attached. I've added some bookmarks ### dani ### so you can see why I assumed the above sequence triggered the reboot.


DanyC97 commented Jun 26, 2019

@abhinavdahiya any thoughts? Or maybe @staebler might be able to chime in?
If you need any additional info, please let me know; I'm keeping the lab env up in case more info is needed, which hopefully will help you out.


DanyC97 commented Jun 27, 2019

@cgwalters @abhinavdahiya can either of you please help me understand the design/behavior of:

  • CSR auto-approval? Is there a bug in the docs? If not, then please teach me how to fish - aka help me understand what the intended design was:
    • CSRs approved manually after kubelet starts
    • CSRs auto-approved
  • the kubelet service not being started on worker nodes

@rphillips
Contributor

@DanyC97 the controller-manager within the openshift-controller-manager namespace actually does the issuing of the certificate. The logs there might help.

@staebler
Contributor

@DanyC97 The cluster-machine-approver will only approve machines that are added via a machine resource. The vSphere platform does not use machine resources yet. So, as a user, you must manually approve CSR requests for new nodes. As a convenience, the bootstrap machine will auto-approve CSR requests for nodes while it is running. However, that should not be relied upon.
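
For reference, the manual approval flow referred to here boils down to something like the following (the CSR name is whatever oc get csr lists as Pending):

oc get csr
oc adm certificate approve <csr-name>

# or, to approve everything currently listed (use with care)
oc get csr -o name | xargs oc adm certificate approve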


DanyC97 commented Jun 28, 2019

@DanyC97 The cluster-machine-approver will only approve machines that are added via a machine resource. The vSphere platform does not use machine resources yet. So, as a user, you must manually approve CSR requests for new nodes.

oh so there are 2 paths, many many thanks for sharing this info @staebler !

As a convenience, the bootstrap machine will auto-approve CSRs requests for nodes while it is running. However, that should not be relied upon.

right, so I'll try a test of switching off the bootstrap node and seeing if the CSR requests are left in the Pending state; that should confirm it.


DanyC97 commented Jun 28, 2019

@DanyC97 the controller-manager within the openshift-controller-manager namespaces actually does the issuing of the certificate. The logs there might help.

I'll check, @rphillips, thanks for the info!
I'm curious to understand the whole flow, as things don't add up (yet) in my head:

  • if the vSphere platform does not use machine resources yet, then there is no cluster-machine-approver - OK
  • if there is no cluster-machine-approver, then let's say we fall back to the bootstrap node - TBC
  • so where does the controller-manager within the openshift-controller-manager fit into the whole picture? Because it's not part of the bootstrap node, is it? -> will find out


DanyC97 commented Jun 29, 2019

@DanyC97 the controller-manager within the openshift-controller-manager namespaces actually does the issuing of the certificate. The logs there might help.

i'll check @rphillips , thanks for the info!
i'll be curious to understand the whole flow as things don't add up (yet) in my head:

  • if the vSphere platform does not use machine resources yet, then there is no cluster-machine-approver - OK
  • if there is no cluster-machine-approver, then let's say we fall back to the bootstrap node - TBC
  • so where does the controller-manager within the openshift-controller-manager fit into the whole picture? Because it's not part of the bootstrap node, is it? -> will find out

sadly @rphillips, the only output I see in the pods running in the openshift-controller-manager ns is

W0629 08:33:30.961010       1 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.ConfigMap ended with: too old resource version: 4962577 (4963900)
W0629 08:33:49.357787       1 reflector.go:256] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.TemplateInstance ended with: The resourceVersion for the provided watch is too old.
W0629 08:35:23.082164       1 reflector.go:256] github.com/openshift/client-go/route/informers/externalversions/factory.go:101: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0629 08:36:01.201731       1 reflector.go:256] github.com/openshift/client-go/build/informers/externalversions/factory.go:101: watch of *v1.Build ended with: The resourceVersion for the provided watch is too old.
W0629 08:36:21.794776       1 reflector.go:256] github.com/openshift/client-go/apps/informers/externalversions/factory.go:101: watch of *v1.DeploymentConfig ended with: The resourceVersion for the provided watch is too old.
W0629 08:37:07.652800       1 reflector.go:256] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: watch of *v1.ImageStream ended with: The resourceVersion for the provided watch is too old.
W0629 08:38:49.543850       1 reflector.go:256] github.com/openshift/origin/pkg/unidling/controller/unidling_controller.go:199: watch of *v1.Event ended with: The resourceVersion for the provided watch is too old.
W0629 08:41:06.451980       1 reflector.go:256] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.TemplateInstance ended with: The resourceVersion for the provided watch is too old.
W0629 08:41:17.966140       1 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.ConfigMap ended with: too old resource version: 4964042 (4965587)

so nothing much related to CSRs


DanyC97 commented Jul 1, 2019

@DanyC97 The cluster-machine-approver will only approve machines that are added via a machine resource. The vSphere platform does not use machine resources yet. So, as a user, you must manually approve CSR requests for new nodes.

oh so there are 2 paths, many many thanks for sharing this info @staebler !

As a convenience, the bootstrap machine will auto-approve CSRs requests for nodes while it is running. However, that should not be relied upon.

right, so i'll try a test of switching off the bootstrap and see if the CSRs requests are left in pending state, that should confirm.

@staebler you were spot on Sir !

I've turned off the bootstrap node and it behaves as per the docs:

[root@dani-dev ~]# oc get csr
NAME        AGE    REQUESTOR                                                                   CONDITION
csr-28p87   3m4s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

which imo is a docs bug.
And for folks who need to know whether the bootstrap node approved anything, journalctl | grep approve-csr | grep -v "No resources found" (run on the bootstrap node) will do the job.


DanyC97 commented Jul 1, 2019

was passed which then redirected to the custom dani-k8s-node-2.ign file which had injected the above snippets. That triggered a reboot between the ignition apply steps

what do you mean by rebooted between ignition apply steps.. ignition doesn't require reboot.

please see the dani-k8s-node-2_journalctl.log attached. I've added some bookmarks ### dani ### so you can see why i assumed the above sequence triggered the reboot.

@abhinavdahiya if you can point me in the right direction, I'd be happy to keep digging.


staebler commented Jul 1, 2019

which imo this is a docs bug.

@DanyC97 Can you be more specific about how you feel that the docs are deficient?

The following is an excerpt from the docs. It implies to me that the CSRs may be approved automatically or may need to be approved manually. It is the responsibility of the user to verify that the CSRs are approved--whether automatically or manually.

When you add machines to a cluster, two pending certificates signing request (CSRs) are generated for each machine that you added. You must confirm that these CSRs are approved or, if necessary, approve them yourself.


DanyC97 commented Jul 1, 2019

which imo this is a docs bug.

@DanyC97 Can you be more specific about how you feel that the docs are deficient?

The following is an excerpt from the docs. It implies to me that the CSRs may be approved automatically or may need to be approved manually. It is the responsibility of the user to verify that the CSRs are approved--whether automatically or manually.

When you add machines to a cluster, two pending certificates signing request (CSRs) are generated for each machine that you added. You must confirm that these CSRs are approved or, if necessary, approve them yourself.

sure, sorry I wasn't clear enough.

Indeed you are right, the text does imply that. However, regarding You must confirm that these CSRs are approved -> if you try to double-check that they were approved, you won't be able to do so without a bit more info, because:

  • oc get csr => will show no output (e.g. Approved, Issued); they've already been dealt with by the bootstrap node. Maybe the assumption here is that if you see no Pending certs then everything worked okay (one way to list what's left is sketched below).
  • checking journald on the bootstrap node -> only works if you have the knowledge to do so
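
One way to at least list the conditions of whatever CSRs are still around (a sketch; CSRs that have already been garbage-collected simply won't show up):

oc get csr -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[*].type}{"\n"}{end}'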

Imo having a section - a small note and/or paragraph - mentioning what you taught me here would be miles better, hence me calling it a bug.

Also, in the docs, step 1 says

Confirm that the cluster recognizes the machines:

but you can't confirm that, since there are no nodes in NotReady state; nodes only appear once the kubelet service is running .. however, if you are unlucky like me, the docs won't help much.

Maybe this belongs in a troubleshooting section; either way I think a clue could be added to help folks.

Update

sorry if I was too strong on the docs by calling it a bug; maybe it's an enhancement

@cgwalters
Member

Ignition should have enabled kubelet.service, it's part of the Ignition generated by the MCO.
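
For context, in the spec-2.2 Ignition config rendered by the MCO that enablement is expressed as a systemd unit entry roughly like the following (a trimmed sketch; the real worker config carries the full unit contents):

{
  "ignition": { "version": "2.2.0" },
  "systemd": {
    "units": [
      {
        "name": "kubelet.service",
        "enabled": true,
        "contents": "..."
      }
    ]
  }
}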


DanyC97 commented Jul 1, 2019

was passed which then redirected to the custom dani-k8s-node-2.ign file which had injected the above snippets. That triggered a reboot between the ignition apply steps

what do you mean by rebooted between ignition apply steps.. ignition doesn't require reboot.

please see the dani-k8s-node-2_journalctl.log attached. I've added some bookmarks ### dani ### so you can see why i assumed the above sequence triggered the reboot.

@abhinavdahiya if you can point me in the right direction, i'd be happy to keep digging.

right, after continuing to spin up new nodes, I found that the service kubelet depends on

 cat /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service

hasn't started ...hmmm

systemctl status rpc-statd
● rpc-statd.service - NFS status monitor for NFSv2/3 locking.
   Loaded: loaded (/usr/lib/systemd/system/rpc-statd.service; static; vendor preset: disabled)
   Active: inactive (dead)
[root@localhost ~]# cat /usr/lib/systemd/system/rpc-statd.service
[Unit]
Description=NFS status monitor for NFSv2/3 locking.
DefaultDependencies=no
Conflicts=umount.target
Requires=nss-lookup.target rpcbind.socket
Wants=network-online.target
After=network-online.target nss-lookup.target rpcbind.socket

PartOf=nfs-utils.service

[Service]
Environment=RPC_STATD_NO_NOTIFY=1
Type=forking
PIDFile=/var/run/rpc.statd.pid
ExecStart=/usr/sbin/rpc.statd


DanyC97 commented Jul 2, 2019

and the dependencies for rpc-statd are:

systemctl list-dependencies rpc-statd
rpc-statd.service
● ├─rpcbind.socket
● ├─system.slice
● ├─network-online.target
● │ └─NetworkManager-wait-online.service
● └─nss-lookup.target


where rpcbind.service is not running.
And network-online.target is up and happy:

systemctl status network-online.target
● network-online.target - Network is Online
   Loaded: loaded (/usr/lib/systemd/system/network-online.target; static; vendor preset: disabled)
   Active: active since Tue 2019-07-02 12:29:55 UTC; 18min ago
     Docs: man:systemd.special(7)
           https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget

Jul 02 12:29:55 dani-k8s-node-2.dani.local systemd[1]: Reached target Network is Online.


DanyC97 commented Jul 8, 2019

Updating in case someone else bumps into this issue.

A) the initial issue was that nodes were failing to join the cluster because kubelet.service was not started.

After a lot of digging (a few dead ends and lots of bunny hops) it turned out the problem was caused by my DNS setup; in particular, my node's FQDN was not within the subdomain where the api and api-int endpoints were (see the dig sketch after the examples below).

E.g

NOK

  • node fqdn => dani-k8s-node-1.dani.local
  • endpoints => api-int.dev-okd4.dani.local

OK

  • node fqdn => dani-k8s-node-1.dev-okd4.dani.local
  • endpoints => api-int.dev-okd4.dani.local
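
A quick way to sanity-check this (a sketch using the hostnames from this thread; substitute your own DNS server, domain and node IP):

# the api-int endpoint and the node's FQDN should both resolve, and the node's
# forward and reverse records should sit inside the cluster's subdomain
dig +short api-int.dev-okd4.dani.local
dig +short dani-k8s-node-1.dev-okd4.dani.local
dig +short -x <node-ip>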

B) the second issue (which was more a question for my own knowledge) was around how the CSRs get approved. With @staebler's help I've understood there are 2 paths:

  • if the bootstrap node is still up, the CSRs are auto-approved
  • otherwise manual approval is needed, as per the docs

DanyC97 closed this as completed Jul 8, 2019
@cgwalters
Member

if bootstrap node is still up, the CSRs are auto aproved

Probably what we should have done is have the bootstrap approve CSRs for masters only or so... that's all we actually need for installs, and having it do workers too adds confusion.


DanyC97 commented Jul 9, 2019

@cgwalters thinking out loud, I think it's okay to have both use-cases; especially when you build a 100-node cluster, you don't want a human to have to accept the CSRs, nor to keep running/watching for pending CSRs with your favorite tool/script.

A note in the docs imo would be exactly what folks need:

  • if folks are willing to keep running a bootstrap node and understand the security implications => they get something for free
  • if folks are not willing to keep a bootstrap node running only for CSR auto-approval => no free lunch ;)


staebler commented Jul 9, 2019

I do not think that we want to advise leaving the bootstrap node running any longer than necessary.

@fdammeke

@DanyC97 could you elaborate on your findings regarding DNS a bit more? I'm bumping into the same issue as you describe when adding a node to an existing cluster. The initially provisioned cluster doesn't have FQDN hostnames within the api/api-int endpoint URL either, so I'm curious how this is related.

What catches my attention is that there are no symbolic links for kubelet or machine-config-daemon in /etc/systemd/system/multi-user.target.wants/ on the extra worker nodes, but they are present on the initial installation worker nodes. Any thoughts?
