amd64: arm64: hang during file descriptor close #1044
Comments
These are two separate things, or? One is |
Further investigation is required to see whether they actually are separate. I presented these two issues in the same bug report because they seem to have the same cause. I will split them up once it is proven that they do not.
The ceph issue is definitely not a containerd issue, as we used the same CAPI image binaries for both Flatcar and Ubuntu (courtesy of the CAPI image builder project - https://github.com/kubernetes-sigs/image-builder). It might well be a kernel issue, but we first need to find a way to reproduce it outside of the ceph code. These two issues are not blockers (they can be avoided with minimal code changes), but they may be caused by an underlying issue that could produce non-reproducible side effects or downright bugs. Thank you, |
The |
Description
First issue:
Baremetal Flatcar AMD64 hangs during ceph installation (a containerized ceph installation using rook-ceph on Kubernetes). Debugging showed that this line hangs during the mon_osd_crush_smoke_test stage:
https://github.com/ceph/ceph/blob/389ea1666327f508d01b9318718c70eec0ed6972/src/common/fork_function.h#L66
When mon_osd_crush_smoke_test = false was set, the issue did not reproduce. The same containerized environment on Ubuntu 22.04 did not reproduce the issue either. The hardware, the container images and the workflow were identical, which means the OS is most likely at fault.
Second issue:
Virtual Flatcar ARM64 (qemu-kvm virtual machine) hangs during a flatcar-install run. Debugging showed that the command udevadm settle hangs.
I think these two issues are related, because both look like a file descriptor close problem.
More investigation is required; I will add more information as it becomes available.
Did anyone encounter this behaviour?
Impact
TBD.
Environment and steps to reproduce
First issue:
During ceph installation on baremetal (Lenovo) and virtual (qemu-kvm) amd64 machines, the ceph osd goes to 100% CPU for no apparent reason. If we set mon_osd_crush_smoke_test = false, the issue is no longer present.
After some debugging, this line hangs forever: https://github.com/ceph/ceph/blob/389ea1666327f508d01b9318718c70eec0ed6972/src/common/fork_function.h#L66 . The ceph osd container gets killed after a while because it was not ready, gets recreated, and then the same scenario repeats.
The 100% CPU usage occurs because the code retries the file descriptor close without any maximum retry count.
As a consequence, ceph cannot be deployed on Flatcar without overriding this flag (mon_osd_crush_smoke_test = false) in the global config.
Second issue:
This occurs during installation of Flatcar on arm64 qemu-kvm, using flatcar-install with channel | version. After the flatcar-install bash script is executed, the following line from the script hangs:
udevadm settle
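To make the hang easier to confirm and report, the command can be wrapped in a hard time limit. The following is a minimal sketch, not part of flatcar-install: the helper name `run_with_timeout` and the 30-second limit are hypothetical choices for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode).

    finished is False and returncode is None if cmd did not exit
    within timeout_s seconds.
    """
    try:
        proc = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising the timeout.
        return False, None
    return True, proc.returncode


if __name__ == "__main__":
    # Hypothetical limit; choose a value that fits the environment.
    finished, rc = run_with_timeout(["udevadm", "settle"], timeout_s=30)
    if finished:
        print(f"udevadm settle exited with code {rc}")
    else:
        print("udevadm settle did not finish within 30 seconds", file=sys.stderr)
```

Running this on the affected arm64 VM and on a known-good host would show whether the command returns at all. If the process is blocked inside the kernel, the kill issued on timeout may not take effect immediately.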
Expected behavior
Neither operation should hang.
Additional information
There were no useful logs in the system logs or dmesg, and none of the hung-task error messages the kernel usually emits. I tried all available Flatcar channels and versions; it appears to be a generic issue for all Flatcar images.
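In the absence of logs, one data point that could be collected on both the Flatcar and the Ubuntu hosts (ideally from inside the affected container) is the file descriptor limit the process sees. This is only an assumption about where to look, since both hangs appear to involve closing file descriptors; it is not a confirmed cause. The sketch below is Linux-only and the function name `fd_limits` is hypothetical.

```python
import os
import resource


def fd_limits():
    """Collect the file descriptor limits visible to the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd"
    return {
        "soft_nofile": soft,
        "hard_nofile": hard,
        # Value a C program would get from sysconf(_SC_OPEN_MAX).
        "sc_open_max": os.sysconf("SC_OPEN_MAX"),
        # Descriptors currently open, including the one used for this listing.
        "open_fds": len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None,
    }


if __name__ == "__main__":
    for name, value in fd_limits().items():
        print(f"{name}: {value}")
```

A large difference in these values between the two operating systems would be worth adding to this report.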