kubelet service not started on scaleup node boot #1899

Closed
DanyC97 opened this issue Jun 24, 2019 · 24 comments

DanyC97 commented Jun 24, 2019

Version

$ openshift-install version
openshift-install v4.1.0-201905212232-dirty
built from commit 71d8978039726046929729ad15302973e3da18ce
release image quay.io/openshift-release-dev/ocp-release@sha256:b8307ac0f3ec4ac86c3f3b52846425205022da52c16f56ec31cbe428501001d6

RHCOS

rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T22:55:04Z)

Platform (aws|libvirt|openstack):

vmware

What happened?

Deployed a UPI VMware cluster and then started to add a new node. The VM booted, Ignition kicked in and laid down RHCOS, including setting the static IP and the hostname; however, the node never joined the K8s cluster, and oc get nodes didn't show the new node.

What you expected to happen?

I expected the new node to show up in the oc get nodes output.

How to reproduce it (as minimally and precisely as possible)?

  1. create a cluster
  2. scale up a node using a custom dani-k8s-node-2.ign file, to which the files below were added so that we could set a static IP and hostname
cat > ${NGINX_DIRECTORY}/${HOST}-ens192 << EOF
DEVICE=ens192
BOOTPROTO=none
ONBOOT=yes
NETMASK=${NETMASK}
IPADDR=${IP}
GATEWAY=${GATEWAY}
PEERDNS=no
DNS1=${DNS1}
DNS2=${DNS2}
DOMAIN=${DOMAIN_NAME}
IPV6INIT=no
EOF

  ENS192=$(cat ${NGINX_DIRECTORY}/${HOST}-ens192 | base64 -w0)
  rm ${NGINX_DIRECTORY}/${HOST}-ens192

  cat > ${NGINX_DIRECTORY}/${HOST}-ifcfg-ens192.json << EOF
{
  "append" : false,
  "mode" : 420,
  "filesystem" : "root",
  "path" : "/etc/sysconfig/network-scripts/ifcfg-ens192",
  "contents" : {
    "source" : "data:text/plain;charset=utf-8;base64,${ENS192}",
    "verification" : {}
  },
  "user" : {
    "name" : "root"
  },
  "group": {
    "name": "root"
  }
}
EOF

and

cat > ${NGINX_DIRECTORY}/${HOST}-hostname << EOF
${HOST}.${DOMAIN_NAME}
EOF
  HN=$(cat ${NGINX_DIRECTORY}/${HOST}-hostname | base64 -w0)
  rm ${NGINX_DIRECTORY}/${HOST}-hostname
  cat > ${NGINX_DIRECTORY}/${HOST}-hostname.json << EOF
{
  "append" : false,
  "mode" : 420,
  "filesystem" : "root",
  "path" : "/etc/hostname",
  "contents" : {
    "source" : "data:text/plain;charset=utf-8;base64,${HN}",
    "verification" : {}
  },
  "user" : {
    "name" : "root"
  },
  "group": {
    "name": "root"
  }
}
EOF
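
The report doesn't show how these two JSON fragments were spliced into dani-k8s-node-2.ign, so the following is only a hypothetical merge step (the worker.ign input and the output file name are assumptions), appending both fragments to the config's storage.files array with jq:

# hypothetical merge step, not part of the original report
jq --slurpfile eth "${NGINX_DIRECTORY}/${HOST}-ifcfg-ens192.json" \
   --slurpfile hn "${NGINX_DIRECTORY}/${HOST}-hostname.json" \
   '.storage.files += $eth + $hn' \
   worker.ign > "${NGINX_DIRECTORY}/${HOST}.ign"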

Note that a dummy ignition file

dani-k8s-node-2-ignition-starter.ign
{
  "ignition": {
    "config": {
      "append": [
        {
          "source": "http://10.85.174.63/repo/dani-k8s-node-2.ign",
          "verification": {}
        }
      ]
    },
    "timeouts": {},
    "version": "2.2.0"
  },
  "networkd": {},
  "passwd": {},
  "storage": {},
  "systemd": {}
}


was passed, which then redirected to the custom dani-k8s-node-2.ign file into which the above snippets had been injected. That triggered a reboot between the Ignition apply steps.

  3. observe whether the node joins the cluster; if not, check whether kubelet.service is up and running (see the sketch below).
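
For step 3, a minimal check could look like this (run the oc command from a machine with the cluster kubeconfig, and the systemctl commands on the new node itself):

# does the node show up at all?
oc get nodes

# on the new node: is the kubelet unit running and enabled?
systemctl status kubelet.service
systemctl is-enabled kubelet.service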

Note - I guess your Terraform vSphere UPI example should allow you to reproduce the issue; I haven't tried using your code.

Anything else we need to know?

On closer inspection, while ssh'ed onto the node, I found the following:

  • kubelet.service is not active and is disabled
[root@dani-k8s-node-2 ~]# systemctl list-unit-files | grep disabled | egrep "mcd|kubelet|rpm-ostree|podman|crio"
crio-shutdown.service                          disabled
io.podman.service                              disabled
kubelet.service                                disabled
mcd-write-pivot-reboot.service                 disabled
io.podman.socket                               disabled
rpm-ostreed-automatic.timer                    disabled

and

[root@dani-k8s-node-2 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

  • although the systemd unit file looks okay, there is no symlink created, which explains why the service is neither enabled nor running
[root@dani-k8s-node-2 ~]# cat  /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service

[Service]
Type=notify
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state
EnvironmentFile=/etc/os-release
EnvironmentFile=-/etc/kubernetes/kubelet-workaround
EnvironmentFile=-/etc/kubernetes/kubelet-env

ExecStart=/usr/bin/hyperkube \
    kubelet \
      --config=/etc/kubernetes/kubelet.conf \
      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --container-runtime=remote \
      --container-runtime-endpoint=/var/run/crio/crio.sock \
      --allow-privileged \
      --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_version=${VERSION_ID},node.openshift.io/os_id=${ID} \
      --minimum-container-ttl-duration=6m0s \
      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \
      --client-ca-file=/etc/kubernetes/ca.crt \
      --cloud-provider=vsphere \
       \
      --anonymous-auth=false \
      --v=3 \

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

but no symlink present

[root@dani-k8s-node-2 ~]# ll /etc/systemd/system/multi-user.target.wants/
total 0
lrwxrwxrwx. 1 root root 39 May 20 23:06 chronyd.service -> /usr/lib/systemd/system/chronyd.service
lrwxrwxrwx. 1 root root 70 May 20 23:06 console-login-helper-messages-issuegen.service -> /usr/lib/systemd/system/console-login-helper-messages-issuegen.service
lrwxrwxrwx. 1 root root 47 May 20 23:06 coreos-growpart.service -> /usr/lib/systemd/system/coreos-growpart.service
lrwxrwxrwx. 1 root root 69 May 20 23:06 coreos-regenerate-iscsi-initiatorname.service -> /usr/lib/systemd/system/coreos-regenerate-iscsi-initiatorname.service
lrwxrwxrwx. 1 root root 67 May 20 23:06 coreos-root-bash-profile-workaround.service -> /usr/lib/systemd/system/coreos-root-bash-profile-workaround.service
lrwxrwxrwx. 1 root root 51 May 20 23:06 coreos-useradd-core.service -> /usr/lib/systemd/system/coreos-useradd-core.service
lrwxrwxrwx. 1 root root 36 May 20 23:06 crio.service -> /usr/lib/systemd/system/crio.service
lrwxrwxrwx. 1 root root 59 May 20 23:06 ignition-firstboot-complete.service -> /usr/lib/systemd/system/ignition-firstboot-complete.service
lrwxrwxrwx. 1 root root 42 May 20 23:06 irqbalance.service -> /usr/lib/systemd/system/irqbalance.service
lrwxrwxrwx. 1 root root 41 May 20 23:06 mdmonitor.service -> /usr/lib/systemd/system/mdmonitor.service
lrwxrwxrwx. 1 root root 46 May 20 23:06 NetworkManager.service -> /usr/lib/systemd/system/NetworkManager.service
lrwxrwxrwx. 1 root root 51 Jun 17 17:24 ostree-finalize-staged.path -> /usr/lib/systemd/system/ostree-finalize-staged.path
lrwxrwxrwx. 1 root root 37 May 20 23:06 pivot.service -> /usr/lib/systemd/system/pivot.service
lrwxrwxrwx. 1 root root 48 May 20 23:06 remote-cryptsetup.target -> /usr/lib/systemd/system/remote-cryptsetup.target
lrwxrwxrwx. 1 root root 40 May 20 23:06 remote-fs.target -> /usr/lib/systemd/system/remote-fs.target
lrwxrwxrwx. 1 root root 53 May 20 23:06 rpm-ostree-bootstatus.service -> /usr/lib/systemd/system/rpm-ostree-bootstatus.service
lrwxrwxrwx. 1 root root 36 May 20 23:06 sshd.service -> /usr/lib/systemd/system/sshd.service
lrwxrwxrwx. 1 root root 40 May 20 23:06 vmtoolsd.service -> /usr/lib/systemd/system/vmtoolsd.service

Running systemctl start kubelet.service kicks off the whole process, and at the end the node joins the cluster.
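
(For anyone hitting the same symptom: starting the unit by hand only covers the current boot. To also survive reboots the unit would need to be enabled, e.g. as below, although on RHCOS the Ignition/MCO-generated config is what is supposed to do this for you.)

# one-off workaround on the affected node
sudo systemctl enable --now kubelet.service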


DanyC97 commented Jun 24, 2019

/cc @cgwalters in case you have some thoughts from the RHCOS/MCO side.

@abhinavdahiya
Contributor

are you sure you approved the CSR for your machine https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere

@abhinavdahiya
Contributor

was passed which then redirected to the custom dani-k8s-node-2.ign file which had injected the above snippets. That triggered a reboot between the ignition apply steps

what do you mean by rebooted between ignition apply steps.. ignition doesn't require reboot.


DanyC97 commented Jun 25, 2019

@abhinavdahiya thank you for taking the time to respond, much appreciated !

are you sure you approved the CSR for your machine https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere

I never needed to approve any CSR; what I saw on previous deployments (for control and compute nodes) was that the CSRs were auto-approved.

I'm not sure if there are two paths here in v4 w.r.t. approving the CSRs (if there are, please help me understand how it works); all I can say is that, looking at the cluster I have running (where the above node didn't join the cluster), I see

oc get -n openshift-cluster-machine-approver all
NAME                                    READY   STATUS    RESTARTS   AGE
pod/machine-approver-7cd7f97455-g9x5q   1/1     Running   0          11d

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/machine-approver   1/1     1            1           11d

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/machine-approver-7cd7f97455   1         1         1       11d

which comes from cluster-machine-approver, and I can see traces like the following in the pod's log:

I0624 20:31:13.706332       1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.708184       1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
I0624 20:31:13.708206       1 main.go:164] Error syncing csr csr-5tpq5: Invalid request
I0624 20:31:13.748375       1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.750478       1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
I0624 20:31:13.750496       1 main.go:164] Error syncing csr csr-5tpq5: Invalid request
I0624 20:31:13.830649       1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.833242       1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
E0624 20:31:13.833321       1 main.go:174] Invalid request
I0624 20:31:13.833366       1 main.go:175] Dropping CSR "csr-5tpq5" out of the queue: Invalid request

That said, I am not sure if this is part of the machine-api-operator, as I haven't worked out how to trace a deployment back to its operator (any hints much appreciated ;)), but I guess it is.
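
(Not from this thread, just a generic way to look for an owner: operator-managed resources often carry ownerReferences, though not every operator sets them, so this may come back empty.)

# generic owner lookup for the deployment seen above
oc get deployment machine-approver -n openshift-cluster-machine-approver \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'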

Update

see attached the whole machine-approver-7cd7f97455-g9x5q pod's log in case you spot something ...

pod-machine-approver-7cd7f97455-g9x5q.log


DanyC97 commented Jun 25, 2019

was passed which then redirected to the custom dani-k8s-node-2.ign file which had injected the above snippets. That triggered a reboot between the ignition apply steps

what do you mean by rebooted between ignition apply steps.. ignition doesn't require reboot.

please see the dani-k8s-node-2_journalctl.log attached. I've added some bookmarks ### dani ### so you can see why I assumed the above sequence triggered the reboot.


DanyC97 commented Jun 26, 2019

@abhinavdahiya any thoughts? Or maybe @staebler might be able to chime in?
If you need any additional info, please let me know; I'm keeping the lab env up in case more info is needed, which hopefully will help you out.


DanyC97 commented Jun 27, 2019

@cgwalters @abhinavdahiya can either of you please help me understand the design/behavior of:

  • CSR auto-approval? Is there a bug in the docs? If not, then please teach me how to fish - aka help me understand what the intended design was:
    • CSRs approved manually after kubelet starts
    • CSRs auto-approved
  • the kubelet service not being started on worker nodes

@rphillips
Contributor

@DanyC97 the controller-manager within the openshift-controller-manager namespace actually does the issuing of the certificate. The logs there might help.

@staebler
Contributor

@DanyC97 The cluster-machine-approver will only approve machines that are added via a machine resource. The vSphere platform does not use machine resources yet. So, as a user, you must manually approve CSR requests for new nodes. As a convenience, the bootstrap machine will auto-approve CSR requests for nodes while it is running. However, that should not be relied upon.
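
For reference, the manual approval flow referred to here boils down to something like the following (the CSR name is whatever oc get csr lists as Pending):

oc get csr
oc adm certificate approve <csr-name>

# or, to approve everything currently listed (use with care)
oc get csr -o name | xargs oc adm certificate approve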


DanyC97 commented Jun 28, 2019

@DanyC97 The cluster-machine-approver will only approve machines that are added via a machine resource. The vSphere platform does not use machine resources yet. So, as a user, you must manually approve CSR requests for new nodes.

oh so there are 2 paths, many many thanks for sharing this info @staebler !

As a convenience, the bootstrap machine will auto-approve CSRs requests for nodes while it is running. However, that should not be relied upon.

right, so I'll try a test of switching off the bootstrap node and seeing if the CSR requests are left in the Pending state; that should confirm it.


DanyC97 commented Jun 28, 2019

@DanyC97 the controller-manager within the openshift-controller-manager namespaces actually does the issuing of the certificate. The logs there might help.

I'll check, @rphillips, thanks for the info!
I'm curious to understand the whole flow, as things don't add up (yet) in my head:

  • if the vSphere platform does not use machine resources yet, then there is no cluster-machine-approver - OK
  • if there is no cluster-machine-approver, then let's say we fall back to the bootstrap node - TBC
  • so where does the controller-manager within the openshift-controller-manager fit into the whole picture? Because it's not part of the bootstrap node, is it? -> will find out


DanyC97 commented Jun 29, 2019

@DanyC97 the controller-manager within the openshift-controller-manager namespaces actually does the issuing of the certificate. The logs there might help.

i'll check @rphillips , thanks for the info!
i'll be curious to understand the whole flow as things don't add up (yet) in my head:

  • if the vSphere platform does not use machine resources yet, then there is no cluster-machine-approver - OK
  • if there is no cluster-machine-approver, then let's say we fall back to the bootstrap node - TBC
  • so where does the controller-manager within the openshift-controller-manager fit into the whole picture? Because it's not part of the bootstrap node, is it? -> will find out

sadly @rphillips, the only output I see in the pods running in the openshift-controller-manager ns is

W0629 08:33:30.961010       1 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.ConfigMap ended with: too old resource version: 4962577 (4963900)
W0629 08:33:49.357787       1 reflector.go:256] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.TemplateInstance ended with: The resourceVersion for the provided watch is too old.
W0629 08:35:23.082164       1 reflector.go:256] github.com/openshift/client-go/route/informers/externalversions/factory.go:101: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0629 08:36:01.201731       1 reflector.go:256] github.com/openshift/client-go/build/informers/externalversions/factory.go:101: watch of *v1.Build ended with: The resourceVersion for the provided watch is too old.
W0629 08:36:21.794776       1 reflector.go:256] github.com/openshift/client-go/apps/informers/externalversions/factory.go:101: watch of *v1.DeploymentConfig ended with: The resourceVersion for the provided watch is too old.
W0629 08:37:07.652800       1 reflector.go:256] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: watch of *v1.ImageStream ended with: The resourceVersion for the provided watch is too old.
W0629 08:38:49.543850       1 reflector.go:256] github.com/openshift/origin/pkg/unidling/controller/unidling_controller.go:199: watch of *v1.Event ended with: The resourceVersion for the provided watch is too old.
W0629 08:41:06.451980       1 reflector.go:256] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.TemplateInstance ended with: The resourceVersion for the provided watch is too old.
W0629 08:41:17.966140       1 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.ConfigMap ended with: too old resource version: 4964042 (4965587)

so nothing much related to CSRs


DanyC97 commented Jul 1, 2019

@DanyC97 The cluster-machine-approver will only approve machines that are added via a machine resource. The vSphere platform does not use machine resources yet. So, as a user, you must manually approve CSR requests for new nodes.

oh so there are 2 paths, many many thanks for sharing this info @staebler !

As a convenience, the bootstrap machine will auto-approve CSRs requests for nodes while it is running. However, that should not be relied upon.

right, so i'll try a test of switching off the bootstrap and see if the CSRs requests are left in pending state, that should confirm.

@staebler you were spot on Sir !

I've turned off the bootstrap node and it behaves as per the docs:

[root@dani-dev ~]# oc get csr
NAME        AGE    REQUESTOR                                                                   CONDITION
csr-28p87   3m4s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

which imo is a docs bug.
And for folks who need to know whether the bootstrap node approved anything, journalctl | grep approve-csr | grep -v "No resources found" (run on the bootstrap node) will do the job.


DanyC97 commented Jul 1, 2019

was passed which then redirected to the custom dani-k8s-node-2.ign file which had injected the above snippets. That triggered a reboot between the ignition apply steps

what do you mean by rebooted between ignition apply steps.. ignition doesn't require reboot.

please see the dani-k8s-node-2_journalctl.log attached. I've added some bookmarks ### dani ### so you can see why i assumed the above sequence triggered the reboot.

@abhinavdahiya if you can point me in the right direction, I'd be happy to keep digging.


staebler commented Jul 1, 2019

which imo this is a docs bug.

@DanyC97 Can you be more specific about how you feel that the docs are deficient?

The following is an excerpt from the docs. It implies to me that the CSRs may be approved automatically or may need to be approved manually. It is the responsibility of the user to verify that the CSRs are approved--whether automatically or manually.

When you add machines to a cluster, two pending certificates signing request (CSRs) are generated for each machine that you added. You must confirm that these CSRs are approved or, if necessary, approve them yourself.


DanyC97 commented Jul 1, 2019

which imo this is a docs bug.

@DanyC97 Can you be more specific about how you feel that the docs are deficient?

The following is an excerpt from the docs. It implies to me that the CSRs may be approved automatically or may need to be approved manually. It is the responsibility of the user to verify that the CSRs are approved--whether automatically or manually.

When you add machines to a cluster, two pending certificates signing request (CSRs) are generated for each machine that you added. You must confirm that these CSRs are approved or, if necessary, approve them yourself.

sure, sorry I wasn't clear enough.

Indeed you are right, the text does imply that. However, regarding You must confirm that these CSRs are approved -> if you try to double-check that they were approved, you won't be able to do so without a bit more info, because:

  • oc get csr => will show no output (e.g. Approved, Issued); they've already been dealt with by the bootstrap node. Maybe the assumption here is that if you see no Pending certs then everything worked okay (one way to list what's left is sketched below).
  • checking journald on the bootstrap node -> only works if you have the knowledge to do so
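
One way to at least list the conditions of whatever CSRs are still around (a sketch; CSRs that have already been garbage-collected simply won't show up):

oc get csr -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[*].type}{"\n"}{end}'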

Imo having a section - a small note and/or paragraph - mentioning what you taught me here would be miles better, hence me calling it a bug.

Also, in the docs, step 1 says

Confirm that the cluster recognizes the machines:

but you can't confirm that, since there are no nodes in NotReady state; nodes only appear once the kubelet service is running .. however, if you are unlucky like me, the docs won't help much.

Maybe this belongs in a troubleshooting section; either way I think a clue could be added to help folks.

Update

sorry if I was too strong on the docs by calling it a bug; maybe it's an enhancement

@cgwalters
Member

Ignition should have enabled kubelet.service, it's part of the Ignition generated by the MCO.
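
For context, in the spec-2.2 Ignition config rendered by the MCO that enablement is expressed as a systemd unit entry roughly like the following (a trimmed sketch; the real worker config carries the full unit contents):

{
  "ignition": { "version": "2.2.0" },
  "systemd": {
    "units": [
      {
        "name": "kubelet.service",
        "enabled": true,
        "contents": "..."
      }
    ]
  }
}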


DanyC97 commented Jul 1, 2019

was passed which then redirected to the custom dani-k8s-node-2.ign file which had injected the above snippets. That triggered a reboot between the ignition apply steps

what do you mean by rebooted between ignition apply steps.. ignition doesn't require reboot.

please see the dani-k8s-node-2_journalctl.log attached. I've added some bookmarks ### dani ### so you can see why i assumed the above sequence triggered the reboot.

@abhinavdahiya if you can point me in the right direction, i'd be happy to keep digging.

right, after continuing to spin up new nodes, I found that the service kubelet depends on

 cat /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service

hasn't started ...hmmm

systemctl status rpc-statd
● rpc-statd.service - NFS status monitor for NFSv2/3 locking.
   Loaded: loaded (/usr/lib/systemd/system/rpc-statd.service; static; vendor preset: disabled)
   Active: inactive (dead)
[root@localhost ~]# cat /usr/lib/systemd/system/rpc-statd.service
[Unit]
Description=NFS status monitor for NFSv2/3 locking.
DefaultDependencies=no
Conflicts=umount.target
Requires=nss-lookup.target rpcbind.socket
Wants=network-online.target
After=network-online.target nss-lookup.target rpcbind.socket

PartOf=nfs-utils.service

[Service]
Environment=RPC_STATD_NO_NOTIFY=1
Type=forking
PIDFile=/var/run/rpc.statd.pid
ExecStart=/usr/sbin/rpc.statd


DanyC97 commented Jul 2, 2019

and the dependencies for rpc-statd are:

systemctl list-dependencies rpc-statd
rpc-statd.service
● ├─rpcbind.socket
● ├─system.slice
● ├─network-online.target
● │ └─NetworkManager-wait-online.service
● └─nss-lookup.target


where rpcbind.service is not running.
And network-online.target is up and happy:

systemctl status network-online.target
● network-online.target - Network is Online
   Loaded: loaded (/usr/lib/systemd/system/network-online.target; static; vendor preset: disabled)
   Active: active since Tue 2019-07-02 12:29:55 UTC; 18min ago
     Docs: man:systemd.special(7)
           https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget

Jul 02 12:29:55 dani-k8s-node-2.dani.local systemd[1]: Reached target Network is Online.


DanyC97 commented Jul 8, 2019

Updating in case someone else bumps into this issue.

A) the initial issue was that nodes were failing to join the cluster because kubelet.service was not started.

After a lot of digging (a few dead ends and lots of bunny hops) it turned out the problem was caused by my DNS setup; in particular, my node's FQDN was not within the subdomain where the api and api-int endpoints were (see the dig sketch after the examples below).

E.g

NOK

  • node fqdn => dani-k8s-node-1.dani.local
  • endpoints => api-int.dev-okd4.dani.local

OK

  • node fqdn => dani-k8s-node-1.dev-okd4.dani.local
  • endpoints => api-int.dev-okd4.dani.local
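
A quick way to sanity-check this (a sketch using the hostnames from this thread; substitute your own DNS server, domain and node IP):

# the api-int endpoint and the node's FQDN should both resolve, and the node's
# forward and reverse records should sit inside the cluster's subdomain
dig +short api-int.dev-okd4.dani.local
dig +short dani-k8s-node-1.dev-okd4.dani.local
dig +short -x <node-ip>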

B) the second issue (which was more a question for my own knowledge) was around how the CSRs get approved. With @staebler's help I've understood there are 2 paths:

  • if the bootstrap node is still up, the CSRs are auto-approved
  • otherwise manual approval is needed, as per the docs

DanyC97 closed this as completed Jul 8, 2019
@cgwalters
Member

if bootstrap node is still up, the CSRs are auto aproved

Probably what we should have done is have the bootstrap approve CSRs for masters only or so... that's all we actually need for installs, and having it do workers too adds confusion.


DanyC97 commented Jul 9, 2019

@cgwalters thinking out loud, I think it's okay to have both use-cases; especially when you build a 100-node cluster, you don't want a human to have to accept the CSRs, nor to keep running/watching for pending CSRs with your favorite tool/script.

A note in the docs imo would be exactly what folks need:

  • if folks are willing to keep running a bootstrap node and understand the security implications => they get something for free
  • if folks are not willing to keep a bootstrap node running only for CSR auto-approval => no free lunch ;)


staebler commented Jul 9, 2019

I do not think that we want to advise leaving the bootstrap node running any longer than necessary.

@fdammeke

@DanyC97 could you elaborate on your findings regarding DNS a bit more? I'm bumping into the same issue as you describe when adding a node to an existing cluster. The initially provisioned cluster doesn't have FQDN hostnames within the api/api-int endpoint URL either, so I'm curious how this is related.

What catches my attention is that there are no symbolic links for kubelet or machine-config-daemon in /etc/systemd/system/multi-user.target.wants/ on the extra worker nodes, but they are present on the initial installation worker nodes. Any thoughts?
