kubelet service not started on scaleup node boot #1899
/cc @cgwalters in case you have some thoughts from the RHCOS/MCO side.
Are you sure you approved the CSR for your machine? https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere
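For reference, pending CSRs can be listed and approved with standard `oc` commands (a minimal sketch; the CSR name below is a placeholder):

```console
# List CSRs; a newly booted node's requests sit in Pending until approved
oc get csr

# Approve a specific CSR (csr-abc12 is a placeholder name)
oc adm certificate approve csr-abc12
```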
What do you mean by "rebooted between ignition apply steps"? Ignition doesn't require a reboot.
@abhinavdahiya thank you for taking the time to respond, much appreciated!
i never needed to approve any CSR; what i saw on previous deployments (for control and compute nodes) was that the CSRs were auto-approved. I'm not sure if there are two paths here in v4 w.r.t. approving the CSRs (if there are, please help me understand how they work), however all i can say is that, looking in the cluster i have running (where the above node didn't join the cluster), i see
which is coming from cluster-machine-approver and i can see in the pod's log traces like
That said, i am not sure if this is part of the machine-api-operator, as i haven't worked out how to trace a deployment back to its operator (any hints much appreciated ;)), but i guess it is. Update: see attached the whole log.
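A generic way to poke at this, assuming a 4.x cluster where the approver runs in its own namespace (the namespace and deployment names below are my assumption and may vary by release):

```console
# See which cluster operators exist and their health
oc get clusteroperators

# Inspect the machine-approver's pods and logs for CSR decisions
oc -n openshift-cluster-machine-approver get pods
oc -n openshift-cluster-machine-approver logs deploy/machine-approver
```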
please see the dani-k8s-node-2_journalctl.log attached. I've added some bookmarks.
@abhinavdahiya any thoughts? Or maybe @staebler might be able to chime in?
@cgwalters @abhinavdahiya can any one of you please help me understand the design/behavior of
@DanyC97 the controller-manager within the openshift-controller-manager namespace actually does the issuing of the certificate. The logs there might help.
@DanyC97 The cluster-machine-approver will only approve machines that are added via a
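(The object type in the sentence above was lost in the extract; assuming it refers to Machine API resources, the machines the approver knows about can be listed like so:)

```console
# Machines created via the Machine API live in this namespace
oc -n openshift-machine-api get machines
```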
oh so there are 2 paths, many many thanks for sharing this info @staebler!
right, so i'll try a test of switching off the bootstrap and seeing if the CSR requests are left in a pending state; that should confirm it.
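A quick way to run that check (standard `oc` usage, not from the original comment):

```console
# With the bootstrap node off, new-node CSRs should sit in Pending
oc get csr --sort-by=.metadata.creationTimestamp | grep Pending
```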
i'll check @rphillips, thanks for the info!
sadly @rphillips the only output i see in the pods running in
so not much related to CSRs.
@staebler you were spot on, Sir! i've turned off the bootstrap node and it behaves as per the docs,
which imo is a docs bug.
@abhinavdahiya if you can point me in the right direction, i'd be happy to keep digging.
@DanyC97 Can you be more specific about how you feel that the docs are deficient? The following is an excerpt from the docs. It implies to me that the CSRs may be approved automatically or may need to be approved manually. It is the responsibility of the user to verify that the CSRs are approved, whether automatically or manually.
sure, sorry i wasn't clear enough. Indeed you are right, the text implies that; however, re "You must confirm that these CSRs are approved": if you try to double-check whether they were approved, you won't be able to do so w/o a bit more info because:
Imo having a section, a small note and/or paragraph, to mention what you taught me here would be miles better, hence me calling it a bug. Also, the docs say in step 1 "Confirm that the cluster recognizes the machines:", but you can't confirm that since there are no nodes in
Maybe it's a section for a troubleshooting guide; either way i think a clue can be added to help folks. Update: sorry if i was too strong in claiming the docs have a bug; maybe it's an enhancement.
Ignition should have enabled |
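For context, enabling a unit from an Ignition config looks roughly like this (a sketch against the Ignition 2.2 spec used by RHCOS 4.x; `kubelet.service` here is my assumption for the elided unit name):

```json
{
  "ignition": { "version": "2.2.0" },
  "systemd": {
    "units": [
      { "name": "kubelet.service", "enabled": true }
    ]
  }
}
```

The net effect of `enabled: true` for a unit with `WantedBy=multi-user.target` is the symlink under `/etc/systemd/system/multi-user.target.wants/` that later comments note was missing.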
right, after spinning new nodes again, i found out that the dependent service
hasn't started ...hmmm
and the dependencies for
where
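The elided unit listings can be reproduced on any node with standard systemctl queries, e.g.:

```console
# What kubelet.service pulls in, and what pulls kubelet in
systemctl list-dependencies kubelet.service
systemctl list-dependencies --reverse kubelet.service

# Its declared ordering/requirement relationships
systemctl show kubelet.service -p Requires,Wants,After,WantedBy
```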
updating in case someone else bumps into this issue.
A) the initial issue was around the fact that nodes were failing to join the cluster because the
After a lot of digging (a few dead ends and lots of bunny hops) it turned out the problem was caused by my DNS setup; in particular, my nodes' fqdn was not within the subdomain where the
E.g. NOK:
OK:
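(The concrete NOK/OK hostnames were lost above; a hypothetical illustration of the pattern, with made-up domains:)

```
NOK: node fqdn  dani-k8s-node-2.lab.local              <- outside the cluster's subdomain
     api-int    api-int.mycluster.example.com
OK:  node fqdn  dani-k8s-node-2.mycluster.example.com  <- inside the cluster's subdomain
     api-int    api-int.mycluster.example.com
```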
B) the second issue (which was more like a question, for my own knowledge) was around the fact that the
Probably what we should have done is have the bootstrap approve CSRs for masters only, or so... that's all we actually need for installs, and having it do workers too adds confusion.
@cgwalters thinking out loud, i think it's okay to have both use-cases; especially when you build a 100-node cluster, you don't need a human to accept the CSRs, nor to keep running/watching your favorite tool/script for pending CSRs. A note in the docs imo would be exactly what folks need:
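For what it's worth, the human-free variant folks usually script is a batch approval along these lines (standard `oc` plus `xargs`; this approves everything outstanding, so use with care):

```console
# Approve every outstanding CSR in one go
oc get csr -o name | xargs oc adm certificate approve
```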
I do not think that we want to advise leaving the bootstrap node running any longer than necessary.
@DanyC97 could you elaborate on your findings regarding DNS a bit more? I'm bumping into the same issue as you describe when adding a node to an existing cluster. The initially provisioned cluster doesn't have fqdn hostnames within the api/api-int endpoint URL either, so I'm curious how this is related. What comes to my attention is the fact that there are no symbolic links for kubelet or machine-config-daemon in /etc/systemd/system/multi-user.target.wants/ on the extra worker nodes, but they are present on the initial installation worker nodes. Any thoughts?
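A quick way to compare the two kinds of nodes (plain systemd commands, not from the comment itself):

```console
# Run on both an original worker and a scaled-up worker, then diff the output
systemctl is-enabled kubelet.service
ls -l /etc/systemd/system/multi-user.target.wants/ | grep -E 'kubelet|machine-config'
```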
Version
RHCOS
Platform (aws|libvirt|openstack):
vmware
What happened?
Deployed a UPI VMware cluster and then started to add a new node. The VM booted, ignition kicked in and laid down RHCOS, including setting the static IP and the hostname; however, the node never joined the K8s cluster and `oc get nodes` didn't show the new node.
What you expected to happen?
i expected the new node to show up in the `oc get nodes` output.
How to reproduce it (as minimally and precisely as possible)?
Boot a new node with a custom `dani-k8s-node-2.ign` file to which files were added so that we can set a static IP and the hostname.
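(The actual file snippets were elided above; a hypothetical Ignition 2.2 fragment that writes `/etc/hostname`, one common way to pin the hostname, could look like this:)

```json
{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/hostname",
        "mode": 420,
        "contents": { "source": "data:,dani-k8s-node-2" }
      }
    ]
  }
}
```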
Note that a dummy ignition file was passed first, which then redirected to the custom `dani-k8s-node-2.ign` file which had the above snippets injected. That triggered a reboot between the ignition apply steps, after which `kubelet.service` should be up and running.
Note - i guess your terraform UPI vSphere example should allow you to reproduce the issue; i haven't tried using your code.
Anything else we need to know?
On closer inspection while ssh'ed onto the node i found the following: `kubelet.service` was not active and was disabled, and no symlink was present.
Running `systemctl start kubelet.service` kicks off the whole process and in the end the node joins the cluster.
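For anyone landing here with the same symptom, a minimal check-and-recover sequence (plain systemd commands; only `systemctl start kubelet.service` is from the original report):

```console
# Check whether the unit is enabled and whether its wants/ symlink exists
systemctl is-enabled kubelet.service
ls -l /etc/systemd/system/multi-user.target.wants/ | grep kubelet

# Start it now and enable it for subsequent boots
sudo systemctl enable --now kubelet.service
```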