
Stuck after reboot #1589

Closed
jknipper opened this issue Nov 26, 2024 · 14 comments · Fixed by sapcc/kubernikus#989
Labels
kind/bug Something isn't working platform/VMWare

Comments

@jknipper

jknipper commented Nov 26, 2024

Description

From time to time we see servers stuck in the boot process after a reboot. This only happens in rare cases, after an OS update was applied.

Impact

The server is stuck after reboot and needs to be rebooted manually a second time to bring it up.

Environment and steps to reproduce

  1. The servers are virtual machines running on a VMware hypervisor.
  2. An automatic reboot is triggered after an OS update is applied.
  3. We could not reproduce the behavior yet; it happens very rarely.

Expected behavior

The machine boots without interruption.

Additional information

From the attached screenshot of the server console it seems that the boot process got stuck while trying to mount sysroot.

Screenshot from 2024-11-26 12-22-17

This issue started some time ago and is hard for us to debug. Any suggestions on how we could investigate this further are greatly appreciated!

@tormath1
Contributor

Hello @jknipper, thanks for raising this. Are you able to tell us since which release you noticed this issue? It seems to be a disk issue; do you have a specific configuration for your instances?

@jknipper
Author

From what I can see in our alerts/logging, it started on a regular basis in July or August. We are running the stable release and update all instances shortly after a new release is published. It's hard to tell whether this is on the hypervisor side or an issue with Flatcar. There was no real change in how the instances are provisioned; maybe something changed on the VMware side, I'll try to find out.

@tormath1
Contributor

Do all the instances reboot at the same time on the hypervisor?

@jknipper
Author

jknipper commented Dec 3, 2024

Do all the instances reboot at the same time on the hypervisor?

No, reboots are managed by some operator in Kubernetes and are pretty much random.

What is interesting from the screenshot is that the filesystem check for that same device seems to succeed, but mounting it afterwards fails. Looking at the Flatcar release history, our observations seem to coincide with the switch from kernel 6.1.x to 6.6.x, but that could also be a coincidence.

@kayrus

kayrus commented Feb 21, 2025

systemd-fsck runs fsck as a child process, and fsck may exit with a non-zero return code.
see https://www.freedesktop.org/software/systemd/man/latest/[email protected]
and https://man7.org/linux/man-pages/man8/fsck.8.html
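For reference, fsck's exit status is a bitmask of conditions (per fsck(8)); a small sketch of decoding it, with a helper name of our own invention:

```shell
# Sketch: decode fsck's bitmask exit status (see fsck(8)).
# decode_fsck is our own helper, not part of systemd or util-linux.
decode_fsck() {
  local code=$1 out=""
  [ $((code & 1)) -ne 0 ] && out="$out errors-corrected"
  [ $((code & 2)) -ne 0 ] && out="$out reboot-required"
  [ $((code & 4)) -ne 0 ] && out="$out errors-left-uncorrected"
  [ $((code & 8)) -ne 0 ] && out="$out operational-error"
  echo "${out# }"
}

decode_fsck 0    # clean: prints an empty line
decode_fsck 3    # prints: errors-corrected reboot-required
```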

I wonder how I can see the https://github.com/systemd/systemd/blob/4d65c9f70c866e678bd1c408a0a9ebf9046c5db0/src/fsck/fsck.c#L399 log in the console output during boot. It seems that boot logs are not saved to the root partition, so after the next reboot they are lost.

Does it make sense to add this unit drop-in by default?

# /etc/systemd/system/systemd-fsck-root.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=5s
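To try such a drop-in out, it can be written like this (demonstrated in a scratch directory here; on a real node the target path is /etc/systemd/system/systemd-fsck-root.service.d/override.conf and writing it needs root):

```shell
# Scratch demo of writing the override drop-in; swap the mktemp dir
# for /etc/systemd/system on an actual machine.
unit_dir=$(mktemp -d)/systemd-fsck-root.service.d
mkdir -p "$unit_dir"
cat > "$unit_dir/override.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=5s
EOF
grep -q '^Restart=on-failure$' "$unit_dir/override.conf" && echo "drop-in written"
```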

@kayrus

kayrus commented Mar 4, 2025

We discovered that we're using the default OpenStack Flatcar image, while our OpenStack hypervisor is based on VMware, and there is no open-vm-tools in the OpenStack image. Therefore a regular VM reboot acts as a hard reboot.

I tried rebooting a Flatcar VM based on the vmware image with a tiny adjustment on the OEM filesystem (sed -i 's|vmware|openstack|g' /oem/grub.cfg), and it reboots flawlessly with all services being gracefully shut down. See the recorded output of two VMs spawned from different images:

qemu image - https://s3.gifyu.com/images/bbEbQ.gif
modified vmware image - https://s3.gifyu.com/images/bbEbW.gif

Now I have a question: how can I adjust the boot parameters using Ignition without the sed -i 's|vmware|openstack|g' /oem/grub.cfg? Adding shouldExist with flatcar.oem.id=openstack has no effect. Our VM user-data is not taken into account.
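For the record, the shouldExist attempt mentioned above would presumably look something like this in Ignition 3.3.0 syntax (a sketch of the attempted config, which, as noted, had no effect in this setup):

```json
{
  "ignition": { "version": "3.3.0" },
  "kernelArguments": {
    "shouldExist": ["flatcar.oem.id=openstack"]
  }
}
```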

P.S. Since when does the basic Flatcar image not contain open-vm-tools? Previously we didn't have such reboot issues. We also noticed that systemd was updated from 252 to 255 on Aug 7, 2024. Could this be another reason for our issue?

@tormath1
Contributor

tormath1 commented Mar 4, 2025

Thanks for getting back on this and for your investigation.

Now I have a question: how can I adjust the boot parameters using Ignition without the sed -i 's|vmware|openstack|g' /oem/grub.cfg? Adding shouldExist with flatcar.oem.id=openstack has no effect. Our VM user-data is not taken into account.

I am confused: if you run OpenStack on top of VMware, shouldn't you just run OpenStack images, or am I missing something?

P.S. Since when does the basic Flatcar image not contain open-vm-tools? Previously we didn't have such reboot issues. We also noticed that systemd was updated from 252 to 255 on Aug 7, 2024. Could this be another reason for our issue?

open-vm-tools has always been part of the vmware OEM Flatcar image (as it's VMware-specific tooling, we don't ship it in the generic image: https://stable.release.flatcar-linux.net/amd64-usr/current/oem-vmware_packages.txt)

@kayrus

kayrus commented Mar 4, 2025

I am confused: if you run OpenStack on top of VMware, shouldn't you just run OpenStack images, or am I missing something?

Right, since the very beginning (CoreOS) we've been using OpenStack images, and unfortunately they don't contain open-vm-tools. My discovery is that the lack of open-vm-tools causes an incorrect VM reboot: the VM doesn't try to reach the reboot.target. My tests showed that if we use the vmware-specific image with an OEM partition modification (e.g. sed -i 's|vmware|openstack|g' /oem/grub.cfg), everything works as expected, and when I run openstack reboot my-flatcar-vm I can see it trying to reach the reboot.target. See the difference between the console recordings of the openstack and adjusted vmware images: the openstack image reboots instantly with no console log output, while the modified vmware image reboots flawlessly.

@kayrus

kayrus commented Mar 4, 2025

UPD: I'm trying to find a way to tell the Flatcar vmware image that it is running in an OpenStack environment. Unfortunately, modifying the EFI partition's /flatcar/grub/grub.cfg.tar with a snippet like:

set oem=""
if [ -n "$oem_id" ]; then
    set oem="flatcar.oem.id=openstack"
fi

doesn't help. The only way I found is to modify the BTRFS OEM partition's /oem/grub.cfg with:

set oem_id="openstack"

Are there any other methods to force the vmware image to think that it is openstack, without modifying the image?
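The sed change itself is tiny; here is a runnable sketch on a scratch copy of the file (on a real node the target is /oem/grub.cfg on the mounted OEM partition, and editing it needs root):

```shell
# Demonstrate the oem_id flip on a scratch stand-in for /oem/grub.cfg.
grub_cfg=$(mktemp)
printf 'set oem_id="vmware"\n' > "$grub_cfg"   # assumed original content

sed -i 's|^set oem_id=.*|set oem_id="openstack"|' "$grub_cfg"

grep -q '^set oem_id="openstack"$' "$grub_cfg" && echo "oem_id flipped"
```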

What if open-vm-tools fixes some Flatcar bug that prevents it from reaching reboot.target? I just ran a test with a Flatcar VM spawned from a modified vmware image: I tried to reboot it with the vmtoolsd.service started and with it stopped.

In the first case I got a broadcast message over ssh and my ssh connection was closed by the remote server:

Broadcast message from [email protected] (Tue 2025-03-04 21:27:41 UTC):

The system will reboot now!

With vmtoolsd.service stopped, the VM was rebooted without a graceful shutdown; there was no broadcast message and the connection was not closed by the remote server, it just got stuck.

I also tested Flatcar openstack releases down to 3602.2.3; they all behave the same way: the ssh connection gets stuck, with no broadcast reboot message. In addition I enabled the acpid service with logging, but there were no logs at all during the reboot.

Looks like we were lucky not to have faced the broken partition issue earlier.

@kayrus

kayrus commented Mar 5, 2025

TL;DR: we're using VMware hypervisors for our OpenStack environment. When we use:

  • the flatcar openstack image: Ignition works, but the reboot process is broken
  • the flatcar vmware image: Ignition is broken, but the reboot process works

Solution found: set oem_id="openstack" in /oem/grub.cfg on the BTRFS OEM partition. Cons: requires an additional modification of the Flatcar vmware image.
Ideal solution: find a way to avoid modifying the Flatcar vmware image, or at least avoid modifying the OEM BTRFS partition, which requires privileged permissions for our pipeline Docker container.

@tormath1
Contributor

tormath1 commented Mar 5, 2025

@kayrus using the OpenStack image, did you try to set hw_qemu_guest_agent=yes in the metadata (https://www.flatcar.org/docs/latest/setup/customization/acpi/#qemu-guest-agent)?

Another thing you could try is to use the oem-vmware sysext image (which you can provision through Ignition) on the OpenStack image, but this is hacky: https://stable.release.flatcar-linux.net/amd64-usr/current/oem-vmware.raw

@kayrus

kayrus commented Mar 5, 2025

@kayrus using the OpenStack image, did you try to set hw_qemu_guest_agent=yes in the metadata (https://www.flatcar.org/docs/latest/setup/customization/acpi/#qemu-guest-agent)?

I don't think this would help. VMware hypervisors don't send ACPI events on reboot; they rely on open-vm-tools, and when the hypervisor doesn't see this tool running, it forces the VM to reboot.

Another thing you could try is to use the oem-vmware sysext image (which you can provision through Ignition) on the OpenStack image, but this is hacky: https://stable.release.flatcar-linux.net/amd64-usr/current/oem-vmware.raw

This might be an option. I'll try it and let you know if it helps.

@kayrus

kayrus commented Mar 5, 2025

Using the Ignition config below did the trick, thank you!

{
  "ignition": {
    "version": "3.3.0"
  },
  "storage": {
    "files": [
      {
        "path": "/etc/extensions/oem-vmware.raw",
        "contents": {
          "source": "https://stable.release.flatcar-linux.net/amd64-usr/current/oem-vmware.raw"
        },
        "mode": 420
      }
    ]
  }
}

@tormath1
Contributor

tormath1 commented Mar 5, 2025

@kayrus good that it works for you - but I'll investigate deeper. It's not supposed to work this way haha.

3 participants