Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

amd64: arm64: hang during file descriptor close #1044

Open
ader1990 opened this issue May 25, 2023 · 3 comments
Open

amd64: arm64: hang during file descriptor close #1044

ader1990 opened this issue May 25, 2023 · 3 comments
Labels
kind/bug Something isn't working

Comments

@ader1990
Copy link

ader1990 commented May 25, 2023

Description

First issue:

Baremetal Flatcar AMD64 hangs during ceph installation (containerized ceph installation using rook-ceph kubernetes). Debugging showed that this line hangs, during mon_osd_crush_smoke_test stage:
https://github.com/ceph/ceph/blob/389ea1666327f508d01b9318718c70eec0ed6972/src/common/fork_function.h#L66

When mon_osd_crush_smoke_test = false was set, the issue did not reproduce. Same containerized environment on Ubuntu 22.04 did not reproduce the issue. Same hardware, same identical container images, same worfklow -- which means the OS is most likely having the issue.

Second issue:

Virtual Flatcar ARM64 (qemu-kvm virtual machine) hangs during flatcar-install run. Debugging showed that the command udevadm settle hangs.

I think these two issues are related, because it looks to be a filedescriptor close problem.

More investigation is required, I will add more information while it becomes available.

Did anyone encounter this behaviour?

Impact

TBD.

Environment and steps to reproduce

First issue:

During ceph installation on baremetal (Lenovo baremetal) + virtual (qemu kvm), amd64, ceph osd goes to 100% cpu for no aparrent reason. If we set mon_osd_crush_smoke_test = false, the issue is not present anymore.

After some debugging, this line hangs forever: https://github.com/ceph/ceph/blob/389ea1666327f508d01b9318718c70eec0ed6972/src/common/fork_function.h#L66 . the ceph osd container gets killed after a while because it was not ready, and gets recreated and then same scenario applies.
The 100% CPU usage is due to the fact the code tries to retry the fdclose without any maximum retry count.

As a consequence, ceph cannot be deployed on Flatcar without the flag override from the global config:

    [global]
    mon_osd_crush_smoke_test = false

Second issue:

During installation of flatcar on arm64 qemu kvm, using flatcar-install that uses channel | version. After executing the flatcar-install bash script, the following line from the bash script flatcar-install hangs: udevadm settle.

Expected behavior

To not hang.

Additional information

There were no useful logs in the system logs, dmesg or any hung task error message usually thrown by the kernel. Tried with ALL channels versions of Flatcar available, it looks to be a generic issue for all the Flatcar images.

@pothos
Copy link
Member

pothos commented May 25, 2023

These are two separate things, or? One is udevadm settle hanging, one is the ceph container?
From where do you run flatcar-install - from a PXE booted image, i.e., from RAM and where do you install to? The ceph container issue could be related to the kernel or containerd version, what release did you use?

@pothos pothos moved this from 📝 Needs Triage to 🪵Backlog in Flatcar tactical, release planning, and roadmap May 25, 2023
@ader1990
Copy link
Author

These are two separate things, or? One is udevadm settle hanging, one is the ceph container? From where do you run flatcar-install - from a PXE booted image, i.e., from RAM and where do you install to? The ceph container issue could be related to the kernel or containerd version, what release did you use?

Further investigation is required to see if actually udevadm settle is actually hanging during an fd close too (I ll need to manually build a debug version to see where exactly it does hang).

I just presented these two issues in the same bug report because they seem to have the same cause. I will split them up when it s proven that they do not have the same cause.

flatcar-install was run manually from a KVM VM ARM64, VM running the image provided by https://alpha.release.flatcar-linux.net/arm64-usr/current/flatcar_production_qemu_uefi_image.img.bz2

The ceph issue is definitely not a containerd issue, as we used the same CAPI image binaries for both Flatcar / Ubuntu (curtoasy of the capi image builder project - https://github.com/kubernetes-sigs/image-builder). It might be a kernel issue, definitely, we need to find a way to reproduce it outside of the whole ceph code first.

These two issues are not blockers (as can be avoided by minimal code changes), but may be caused by an underlying issue that might produce non-reproducible side effects or down-right bugs.

Thank you,
Adrian Vladu.

@pothos
Copy link
Member

pothos commented Jun 27, 2023

The flatcar-install bash regression got addressed #1059

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
Development

No branches or pull requests

2 participants