Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rawhide: kola: 20220205: coreos.boot-mirror.luks test enters emergency.target #1092

Closed
dustymabe opened this issue Feb 6, 2022 · 28 comments · Fixed by coreos/fedora-coreos-config#1603 or coreos/fedora-coreos-config#1617
Assignees
Labels
jira for syncing to jira kind/bug rawhide Issues that uniquely pertain to rawhide

Comments

@dustymabe
Copy link
Member

dustymabe commented Feb 6, 2022

The coreos.boot-mirror.luks just started failing semi consistently in the rawhide stream. We first saw it in CI and now it's happening in the pipeline fedora-coreos-pipeline#19. Cracking open the console.txt for the test shows us:

  106.786833] systemd[1]: ignition-ostree-uuid-root.service - Ignition OSTree: Regenerate Filesystem UUID (root) was skipped because of a failed c
[  106.816385] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...^M
         Starting ^[[0;1;39mignition-ostree-t…tion OSTree: Restore Partitions...^M
[  106.873485] ignition-ostree-transposefs[1852]: Restoring rootfs from RAM...^M
[  106.983108] ignition-ostree-transposefs[1856]: realpath: /dev/disk/by-label/root: No such file or directory^M
[  106.991587] systemd[1]: ignition-ostree-transposefs-restore.service: Main process exited, code=exited, status=1/FAILURE^M
[  106.994594] systemd[1]: ignition-ostree-transposefs-restore.service: Failed with result 'exit-code'.^M
[  106.999748] systemd[1]: Failed to start ignition-ostree-transposefs-restore.service


Full console.txt file: coreos-boot-mirror-luks-console.txt

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Feb 6, 2022
The ignition-ostree-transposefs-restore.service is failing on boot
and we're dropping into the emergency shell. Let's snooze the test
for now on rawhide while we investigate.

See coreos/fedora-coreos-tracker#1092
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Feb 7, 2022
The ignition-ostree-transposefs-restore.service is failing on boot
and we're dropping into the emergency shell. Let's snooze the test
for now on rawhide while we investigate.

See coreos/fedora-coreos-tracker#1092
@HuijingHei HuijingHei added the jira for syncing to jira label Feb 8, 2022
@HuijingHei HuijingHei self-assigned this Feb 8, 2022
@HuijingHei
Copy link
Member

HuijingHei commented Feb 9, 2022

I do a local build with the rawhide branch of the fcos configs repo, run coreos.boot-mirror.luks test with kola, and result is passed

There are something different between the logs, maybe the failed reason is:
when running ignition-ostree-transposefs, find root device /dev/disk/by-label/root, but actual is /dev/dm-0

Failed logs:

Reached target  - Initrd Root Device.
[  106.786833] systemd[1]: ignition-ostree-uuid-root.service - Ignition OSTree: Regenerate Filesystem UUID (root) was skipped because of a failed condition check (ConditionPathExists=!/run/ignition-ostree-transposefs).
[  106.816385] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...
         Starting mignition-ostree-t…tion OSTree: Restore Partitions...
[  106.873485] ignition-ostree-transposefs[1852]: Restoring rootfs from RAM...
[  106.983108] ignition-ostree-transposefs[1856]: realpath: /dev/disk/by-label/root: No such file or directory

Successful logs:

[  143.842513] systemd[1]: Reached target initrd-root-device.target - Initrd Root Device.^M
[  143.859420] ignition-ostree-transposefs[1825]: Expected /dev/disk/by-label/root to point to /dev/dm-0, but points to /dev/disk/by-label/root; triggering udev^M
[  143.876207] systemd[1]: ignition-ostree-uuid-root.service - Ignition OSTree: Regenerate Filesystem UUID (root) was skipped because of a failed condition check (ConditionPathExists=!/run/ignition-ostree-transposefs).^M
[  143.898409] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...^M
[  143.991660] ignition-ostree-transposefs[1825]: Mounting /dev/disk/by-label/root (/dev/dm-0) to /sysroot^M

@dustymabe
Copy link
Member Author

If you run it ten times does it pass every time? cosa kola run coreos.boot-mirror.luks --multiply=10 should do it.

@HuijingHei
Copy link
Member

If you run it ten times does it pass every time? cosa kola run coreos.boot-mirror.luks --multiply=10 should do it.

Yes, all passed after run totally 20 times

@dustymabe
Copy link
Member Author

Hmm. Maybe the issue only happens in our CI environment. Regardless we can remove the snooze and see if it happens again.

HuijingHei added a commit to HuijingHei/fedora-coreos-config that referenced this issue Feb 10, 2022
Can not reproduce coreos.boot-mirror.luks failed issue locally after
running 20 times, remove from denylist to see if it happens.

See coreos/fedora-coreos-tracker#1092
HuijingHei added a commit to HuijingHei/fedora-coreos-config that referenced this issue Feb 10, 2022
Can not reproduce `coreos.boot-mirror.luks` failed issue locally
 with rawhide after running 20 times, remove from denylist to see
if it happens.

See coreos/fedora-coreos-tracker#1092
dustymabe pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Feb 10, 2022
Can not reproduce `coreos.boot-mirror.luks` failed issue locally
 with rawhide after running 20 times, remove from denylist to see
if it happens.

See coreos/fedora-coreos-tracker#1092
dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue Feb 10, 2022
Can not reproduce `coreos.boot-mirror.luks` failed issue locally
 with rawhide after running 20 times, remove from denylist to see
if it happens.

See coreos/fedora-coreos-tracker#1092
@travier travier added the rawhide Issues that uniquely pertain to rawhide label Feb 10, 2022
@dustymabe
Copy link
Member Author

This is back in 37.20220217.91.0 (build#80).

I was able to reproduce locally pretty easily. #HuijingHei - can you try with this qcow (will need to uncompress first).

@jlebon
Copy link
Member

jlebon commented Feb 17, 2022

Looking at the error, this patch may help:

diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh
index 18224c36..4f72aacb 100755
--- a/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/40ignition-ostree/ignition-ostree-transposefs.sh
@@ -56,7 +56,7 @@ udev_trigger_on_label_mismatch() {
     local expected_dev=$1; shift
     local actual_dev
     expected_dev=$(realpath "${expected_dev}")
-    actual_dev=$(realpath "/dev/disk/by-label/$label")
+    actual_dev=$(realpath "/dev/disk/by-label/$label" || :)
     if [ "$actual_dev" != "$expected_dev" ]; then
         echo "Expected /dev/disk/by-label/$label to point to $expected_dev, but points to $actual_dev; triggering udev"
         udevadm trigger --settle "$expected_dev"

Though would be good to narrow down on what's happening there at the udev level and possibly file another RHBZ before working around it. The udevadm trigger was initially added because of issues in RHCOS, not FCOS.

dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Feb 17, 2022
@HuijingHei
Copy link
Member

This is back in 37.20220217.91.0 (build#80).

I was able to reproduce locally pretty easily. #HuijingHei - can you try with this qcow (will need to uncompress first).

Test with fedora-coreos-37.20220217.dev.0-qemu.x86_64.qcow2, failed to entered emergency.target in initramfs 2 times(total 10), but the error logs are different with the first description

Details

[  204.645515] systemd[1]: Timed out waiting for device dev-disk-by\x2dpartlabel-root\x2d2.device - /dev/disk/by-partlabel/root-2.
[  204.695643] ignition[861]: disks: createRaids: op(a): [failed]   waiting for devices [/dev/disk/by-partlabel/boot-1 /dev/disk/by-partlabel/boot-2 /dev/dist
[  204.788764] multipathd[552]: exit (signal)
[  206.313988] systemd[1]: dev-disk-by\x2dpartlabel-root\x2d2.device: Job dev-disk-by\x2dpartlabel-root\x2d2.device/start failed with result 'timeout'.
[  206.361174] ignition[861]: Ignition failed: failed to create raids: failed to wait on raids devs: device unit dev-disk-by\x2dpartlabel-root\x2d2.device tit
[  206.423551] ignition[861]: }
[  206.431028] systemd[1]: ignition-disks.service: Main process exited, code=exited, status=1/FAILURE
[  206.459967] systemd[1]: ignition-disks.service: Failed with result 'exit-code'.

@HuijingHei
Copy link
Member

@dustymabe is there anyway I can find console logs about coreos.boot-mirror.luks in build#80?

@dustymabe
Copy link
Member Author

It looks like maybe the logs got cleaned up from that run already :(

@dustymabe
Copy link
Member Author

I just ran this locally against the ISO from #1092 (comment) and it failed:

[   81.712037] systemd[1]: Starting ignition-ostree-transposefs-restore.service - Ignition OSTree: Restore Partitions...^M  
         Starting ^[[0;1;39mignition-ostree-t…tion OSTree: Restore Partitions...^M                                          
[   81.733279] ignition-ostree-transposefs[1856]: Restoring rootfs from RAM...^M                                            
[   81.802958] ignition-ostree-transposefs[1860]: realpath: /dev/disk/by-label/root: No such file or directory^M            
[   81.805947] systemd[1]: ignition-ostree-transposefs-restore.service: Main process exited, code=exited, status=1/FAILURE^M
[   81.806792] systemd[1]: ignition-ostree-transposefs-restore.service: Failed with result 'exit-code'.^M                   
[^[[0;1;31mFAILED^[[0m] Failed to start ^[[0;1;39mignition-o…nition OSTree: Restore Partitions.^M                           

So.. indeed looks like your failure is different because mine found those devices:

[   39.664588] ignition[855]: disks: createRaids: op(a): [finished] waiting for devices [/dev/disk/by-partlabel/boot-1 /dev/disk/by-partlabel/boot-2 /dev/disk/by-partlabel/root-1 /dev/disk/by-partlabel/root-2]

@HuijingHei
Copy link
Member

I can reproduce with fedora-coreos-37.20220217.dev.0-qemu.x86_64.qcow2 1 time ( totally 20), the error logs are the same as the description. But I have no idea how to debug this, any suggestions? Thanks!

@dustymabe
Copy link
Member Author

The snooze for coreos.boot-mirror.luks was extended in coreos/fedora-coreos-config#1543.

@jlebon
Copy link
Member

jlebon commented Feb 28, 2022

@HuijingHei One thing I find helpful when debugging udev-related things is to boot with rd.udev.log_level=debug. That will give you debug output for systemd-udevd and you should be able to see for example at what point the root symlink goes away.

@HuijingHei
Copy link
Member

HuijingHei commented Mar 1, 2022

Thanks @jlebon for your suggestion.
Append rd.udev.log_level=debug to kernel argument and run coreos.boot-mirror.luks, here is console log, hope it will be helpful
console.txt

@jlebon
Copy link
Member

jlebon commented Mar 1, 2022

@HuijingHei The debug logs are in the journal. If you want it on the console, you can add systemd.log_target=console but because it's a lot of data, the printing may influence timing and make the bug not appear.

Hmm but actually, you're triggering this by running the test and not manually doing cosa run with a similar Butane config, right? In that case, you should have the initrd journal dump in the test output directory. The file is called ignition-virtio-dump.txt.

@miabbott
Copy link
Member

Skimmed that diff, but nothing jumped out related to storage devices, mounts, udev, etc. I think we'd need to bisect.

$ git log --pretty=oneline --no-merges v5.17-rc2..dcb85f85fa6f | wc -l
143
dcb85f85fa6f gcc-plugins/stackleak: Use noinstr in favor of notrace
87563a043cef ax25: fix reference count leaks of ax25_dev
80d4609008e6 net: stmmac: ensure PTP time register reads are consistent
1f2cfdd349b7 printk: Fix incorrect __user type in proc_dointvec_minmax_sysadmin()
67d6212afda2 Revert "module, async: async_synchronize_full() on module init iff async is used"
34a081761e4e net: ipa: request IPA register values be retained
ac62a0174d62 dt-bindings: net: qcom,ipa: add optional qcom,qmp property
2bdfd2825c96 cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
7f3bdbc3f131 tools/resolve_btfids: Do not print any commands when building silently
c36c04c2e132 Revert "mm/gup: small refactoring: simplify try_grab_page()"
b293dcc473d2 bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
4a81f6da9cb2 net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
b67985be4009 tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
81eb8b0b1878 net: sparx5: do not refer to skb after passing it on
186edf7e368c selinux: fix double free of cond_list on error paths
c86d86131ab7 Partially revert "net/smc: Add netlink net namespace support"
ad5185735f7d net/mlx5e: Avoid field-overflowing memcpy()
6d5c900eb641 net/mlx5e: Use struct_group() for memcpy() region
5b209d1a22af net/mlx5e: Avoid implicit modify hdr for decap drop rule
de47db0cf7f4 net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
5352859b3bfa net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
736dfe4e68b8 net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
d8e5883d694b net/mlx5: E-Switch, Fix uninitialized variable modact
ec41332e02bd net/mlx5e: Fix handling of wrong devices during bond netevent
7957837b816f net/mlx5e: Fix broken SKB allocation in HW-GRO
b8d91145ed7c net/mlx5e: Fix wrong calculation of header index in HW_GRO
880b51769190 net/mlx5: Bridge, Fix devlink deadlock on net namespace deletion
55b2ca702cfa net/mlx5: Fix offloading with ESWITCH_IPV4_TTL_MODIFY_ENABLE
5623ef8a1188 net/mlx5e: TC, Reject rules with forward and drop actions
3c5193a87b0f net/mlx5: Use del_timer_sync in fw reset flow of halting poll
4a08a131351e net/mlx5e: Fix module EEPROM query
a2446bc77a16 net/mlx5e: TC, Reject rules with drop and modify hdr action
350d9a823734 net/mlx5: Bridge, ensure dev_name is null-terminated
04f8c12f031f net/mlx5: Bridge, take rtnl lock in init error handler
d0cfa548dbde net: macsec: Verify that send_sci is on when setting Tx sci explicitly
63e4b45c82ed ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
479f5547239d tcp: fix mem under-charging with zerocopy sendmsg()
e42e70ad6ae2 af_packet: fix data-race in packet_setsockopt / packet_setsockopt
c6f6f2444bdb rtnetlink: make sure to refresh master_dev/m_ops in __rtnl_newlink()
04c2a47ffb13 net: sched: fix use-after-free in tc_new_tfilter()
6dde7acdb3dc ethernet: smc911x: fix indentation in get/set EEPROM
b7892f7d5cb2 tools: Ignore errors from `which' when searching a GCC toolchain
1a2beb3d5a0b mailmap: update Christian Brauner's email address
24f600856418 cgroup-v1: Require capabilities to set release_agent
cad014b7b5a6 e1000e: Handshake with CSME starts from ADL platforms
68defd528f94 e1000e: Separate ADP board type from TGP
836f35f79153 platform/x86: thinkpad_acpi: Fix incorrect use of platform profile on AMD platforms
881cc731df6a net: phy: Fix qca8081 with speeds lower than 2.5Gb/s
ee12595147ac fanotify: Fix stale file descriptor in copy_event_to_user()
2161ba070999 MIPS: KVM: fix vz.c kernel-doc notation
50317b636e71 MIPS: octeon: Fix missed PTR->PTR_WD conversion
9cef24c8b76c net: macsec: Fix offload support for NETDEV_UNREGISTER event
7af037c39b60 net: stmmac: dump gmac4 DMA registers correctly
4223f8651287 net: dsa: mt7530: make NET_DSA_MT7530 select MEDIATEK_GE_PHY
6533e558c650 i40e: Fix reset path while removing the driver
3d2504663c41 i40e: Fix reset bw limit when DCB enabled with 1 TC
ec049891b2dc kselftest: Fix vdso_test_abi return status
f83a96e5f033 spi: mediatek: Avoid NULL pointer crash in interrupt
baf927a833ca pinctrl: microchip-sgpio: Fix support for regmap
341adeec9ada net/smc: Forward wakeup to smc socket waitqueue after fallback
3a5286955bf5 pinctrl: bcm63xx: fix unmet dependency on REGMAP for GPIO_REGMAP
5297c693d8c8 pinctrl: bcm2835: Fix a few error paths
6449520391df net: stmmac: properly handle with runtime pm in stmmac_dvr_remove()
1f84a9450d75 gve: fix the wrong AdminQ buffer queue index check
d01ffb9eee4a ax25: add refcount in ax25_dev to avoid UAF bugs
4e0f718daf97 ax25: improve the incomplete fix to avoid UAF and NPD bugs
500c77eed0fe pinctrl: zynqmp: Revert "Unify pin naming"
928d6fe996f6 net: stmmac: dwmac-visconti: No change to ETHER_CLOCK_SEL for unexpected speed request.
ab451ea952fe nfsd: nfsd4_setclientid_confirm mistakenly expires confirmed client.
60b1e97140a4 spi: dt-bindings: Fix 'reg' child node schema
2cbd27267ffe spi: bcm-qspi: check for valid cs before applying chip select
7674b7b559b6 net: amd-xgbe: ensure to reset the tx_timer_active flag
5aac9108a180 net: amd-xgbe: Fix skb data length underflow
dae1d8ac3189 selftests: skip mincore.check_file_mmap when fs lacks needed support
ac9e0a250bb1 selftests: openat2: Skip testcases that fail with EOPNOTSUPP
ea3396725aa1 selftests: openat2: Add missing dependency in Makefile
e051cdf655fa selftests: openat2: Print also errno in failure messages
b9199181a9ef selftests: futex: Use variable MAKE instead of make
908a26e139e8 selftests/exec: Remove pipe from TEST_GEN_FILES
01dabed20573 selftests/zram: Adapt the situation that /dev/zram0 is being used
d18da7ec3719 selftests/zram01.sh: Fix compression ratio calculation
fc4eb486a59d selftests/zram: Skip max_comp_streams interface on newer kernel
dede34b2c1a8 docs/kselftest: clarify running mainline tests on stables
b07f41373254 netfilter: nf_tables: remove assignment with no effect in chain blob builder
d19a7af73b5e lockd: fix failure to cleanup client locks
5d8a8b324ff4 MAINTAINERS: Remove Harry Morris bouncing address
79c37ca73a6e net: ieee802154: Return meaningful error codes from the netlink helpers
621b24b09eb6 net: ieee802154: ca8210: Stop leaking skb's
e5ce576d45bf net: ieee802154: at86rf230: Stop leaking skb's
d753c4004820 net: ieee802154: mcr20a: Fix lifs/sifs periods
1293fccc9e89 net: ieee802154: hwsim: Ensure proper channel selection at probe time
eda0cf1202ac selftests: nft_concat_range: add test for reload with no element add/del
f459bfd4b979 netfilter: nft_byteorder: track register operations
aeac4554eb54 netfilter: nft_reject_bridge: Fix for missing reply from prerouting
aad51ca71ad8 selftests: netfilter: check stateless nat udp checksum fixup
c858620d2ae3 selftests: netfilter: reduce zone stress test running time
34243b9ec856 netfilter: nft_ct: fix use after free when attaching zone template
7355bfe0e0cc netfilter: Remove flowtable relics
c80d401c52a2 cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
37c2c83ca4f1 spi: uniphier: fix reference count leak in uniphier_spi_probe()
e937440f7fc4 spi: meson-spicc: add IRQ check in meson_spicc_probe
23e3404de1ae spi: uniphier: Fix a bug that doesn't point to private data correctly
90cafce461de spi: change clk_disable_unprepare to clk_unprepare
e2bcbd7769ee tools headers UAPI: remove stale lirc.h
0e3135d3bfa5 bpf: Fix possible race in inc_misses_counter
92d25637a3a4 kselftest: signal all child processes
40d70d4d6097 selftests: cpufreq: Write test output to stdout as well
f034cc1301e7 selftests: rtc: Increase test timeout so that all tests run
235528072f28 kunit: tool: Import missing importlib.abc
f26d04331360 audit: improve audit queue handling when "audit=1" on cmdline
63ee956f69d8 bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
e12963c45326 pinctrl: intel: Fix a glitch when updating IRQ flags on a preconfigured line
e986f0e602f1 pinctrl: intel: fix unexpected interrupt
77311237eaff pinctrl: Place correctly CONFIG_PINCTRL_ST in the Makefile
bdac3bbd0dc6 spi: spi-rockchip: Add rk3568-spi compatible
b8fb0d9b4766 platform/x86: amd-pmc: Correct usage of SMU version
f8c28b93d262 platform/x86: asus-tf103c-dock: Make 2 global structs static
f7086daab3b5 platform/x86: amd-pmc: Make amd_pmc_stb_debugfs_fops static
17da2d5f9369 platform/x86: ISST: Fix possible circular locking dependency detected
a29012ab2316 platform/x86: intel_crystal_cove_charger: Fix IRQ masking / unmasking
b288420e773f platform/x86: thinkpad_acpi: Add quirk for ThinkPads without a fan
512eb73cfd12 platform/x86: touchscreen_dmi: Add info for the RWC NANOTE P8 AY07J 2-in-1
c197e969e308 platform/surface: Reinstate platform dependency
17f6736a020e platform/x86: x86-android-tablets: Trivial typo fix for MODULE_AUTHOR
5de2ffd5acd3 platform/x86: x86-android-tablets: Fix the buttons on CZC P10T tablet
442bf564eb0c platform/x86: x86-android-tablets: Constify the gpiod_lookup_tables arrays
84c2dcdd475f platform/x86: x86-android-tablets: Add an init() callback to struct x86_dev_info
4ce2a32d4026 platform/x86: x86-android-tablets: Add support for disabling ACPI _AEI handlers
ddec7abd4d93 platform/x86: x86-android-tablets: Correct crystal_cove_charger module name
1fd6bb5b47a6 pinctrl: sunxi: Fix H616 I2S3 pin data
aa28514592d5 pinctrl: cherryview: Trigger hwirq0 for interrupt-lines without a mapping
25d2e41cf59b pinctrl: thunderbay: rework loops looking for groups names
30cc53897470 pinctrl: thunderbay: comment process of building functions a bit
5298d4bfe80f unicode: clean up the Kconfig symbol confusion
9df15d842a0f spi: stm32: make SPI_MASTER_MUST_TX flags only specific to STM32F4
3cefddb72f80 spi: stm32: remove inexistant variables in struct stm32_spi_cfg comment
e4d63473d311 spi: stm32-qspi: Update spi registering
b4c18c18ebf7 regulator: MAX20086: add gpio/consumer.h
6e7f90d163af lockd: fix server crash on reboot of client holding lock
94fd19752b28 ovl: don't fail copy up if no fileattr support on upper
4ee7e4a6c9b2 ovl: fix NULL pointer dereference in copy up warning
d068eebbd482 cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
879cf8006475 regulator: max20086: fix error code in max20086_parse_regulators_dt()

@HuijingHei
Copy link
Member

I do not know why I can not reproduce locally, not sure my steps are correct.

cosa init -b rawhide https://github.com/coreos/fedora-coreos-config
cosa fetch
cosa build
kola run coreos.boot-mirror.luks --multiply=10

@dustymabe
Copy link
Member Author

I do not know why I can not reproduce locally, not sure my steps are correct.

Most likely it's a race condition and my system is either faster or slower than yours :)

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Mar 15, 2022
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Mar 15, 2022
@dustymabe
Copy link
Member Author

I'm no longer seeing this issue with latest kernels in rawhide. It appears it stopped showing up in the following kernel transition:

kernel 5.17.0-0.rc5.20220225git53ab78cd6d5a.106.fc37.x86_64 → 5.17.0-0.rc6.109.fc37.x86_64

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Mar 16, 2022
We're no longer seeing a test failure there. It appears the following
kernel transition fixed the issue:

```
kernel 5.17.0-0.rc5.20220225git53ab78cd6d5a.106.fc37.x86_64 → 5.17.0-0.rc6.109.fc37.x86_64
```

Closes coreos/fedora-coreos-tracker#1092
dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Mar 16, 2022
We're no longer seeing a test failure there. It appears the following
kernel transition fixed the issue:

```
kernel 5.17.0-0.rc5.20220225git53ab78cd6d5a.106.fc37.x86_64 → 5.17.0-0.rc6.109.fc37.x86_64
```

Closes coreos/fedora-coreos-tracker#1092
jlebon pushed a commit to coreos/fedora-coreos-config that referenced this issue Mar 16, 2022
We're no longer seeing a test failure there. It appears the following
kernel transition fixed the issue:

```
kernel 5.17.0-0.rc5.20220225git53ab78cd6d5a.106.fc37.x86_64 → 5.17.0-0.rc6.109.fc37.x86_64
```

Closes coreos/fedora-coreos-tracker#1092
@dustymabe
Copy link
Member Author

dustymabe commented Mar 21, 2022

ok more information here.. I was mislead in my previous understanding that the issue was fixed. It turns out that kernel NVR versions that include ${date}git${hash} have debug turned on, whereas the kernels that don't do not. For example:

  • kernel-5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64
  • kernel-5.17.0-0.rc5.20220225git53ab78cd6d5a.106.fc37.x86_64

have debug turned ON, where:

  • kernel-5.17.0-0.rc2.83.fc36.x86_64
  • kernel-5.17.0-0.rc6.109.fc37.x86_64

do NOT.

This problem actually only shows itself when debug is turned on. I was able to reproduce this issue on testing-devel based on Fedora Linux 35 when I used the debug kernel:

$ git diff
diff --git a/manifests/bootable-rpm-ostree.yaml b/manifests/bootable-rpm-ostree.yaml
index 784acd4..d23d3a6 100644
--- a/manifests/bootable-rpm-ostree.yaml
+++ b/manifests/bootable-rpm-ostree.yaml
@@ -7,7 +7,7 @@
 packages:
  # Kernel + systemd.  Note we explicitly specify kernel-{core,modules}
  # because otherwise depsolving could bring in kernel-debug.
- - kernel kernel-core kernel-modules systemd
+ - kernel-debug kernel-debug-core kernel-debug-modules systemd
  # linux-firmware now a recommends so let's explicitly include it
  # https://gitlab.com/cki-project/kernel-ark/-/commit/32271d0cd9bd52d386eb35497c4876a8f041f70b
  # https://src.fedoraproject.org/rpms/kernel/c/f55c3e9ed8605ff28cb9a922efbab1055947e213?branch=rawhide

NOTE: For now I needed this patch to rpm-ostree to do the build.

The problem actually goes back much farther. I chased it for a while but then gave up. Here are the debug kernels I tried and confirmed the issue occurs on (all built on top of testing-devel as of today):

  • kernel-5.16.0-60.fc36
  • kernel-5.15.18-200.fc35
  • kernel-5.14.18-300.fc35
  • kernel-5.13.16-200.fc34
  • kernel-5.12.19-300.fc34
  • kernel-5.10.19-200.fc33

So this problem has been around over a year in the kernel. I imagine we only started hitting it this year because some of our code introduced some path which made the race easier to hit.

Rather than waste much more time on this issue (since it only happens with debug kernels) we should consider ways to make the test artificially not fail if we're on a debug kernel.

@dustymabe dustymabe reopened this Mar 21, 2022
dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Mar 21, 2022
… to debug kernels

It appears that on debug kernels the "/dev/disk/by-label/$label" device *can* not
appear at all (some sort of race condition) and thus calling `realpath` on it will
fail. Let's just make the call to `realpath` not be fatal so we can workaround this
issue that has been around in the kernel for some time.

Closes coreos/fedora-coreos-tracker#1092
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Mar 21, 2022
… to debug kernels

It appears that on debug kernels the "/dev/disk/by-label/$label" device *can* not
appear at all (some sort of race condition) and thus calling `realpath` on it will
fail. Let's just make the call to `realpath` not be fatal so we can workaround this
issue that has been around in the kernel for some time.

Closes coreos/fedora-coreos-tracker#1092
@dustymabe
Copy link
Member Author

I decided for now to punt on this issue with a workaround since I've spent way too much time on it and the fact that it only presents itself with a debug kernel, which we'll never ship to our prod streams.

Someone else is welcome to pick up and investigate further and open bugs if they find enough information to open a BZ, but I need to move on.

HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
The ignition-ostree-transposefs-restore.service is failing on boot
and we're dropping into the emergency shell. Let's snooze the test
for now on rawhide while we investigate.

See coreos/fedora-coreos-tracker#1092
HuijingHei added a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
Can not reproduce `coreos.boot-mirror.luks` failed issue locally
 with rawhide after running 20 times, remove from denylist to see
if it happens.

See coreos/fedora-coreos-tracker#1092
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
We're no longer seeing a test failure there. It appears the following
kernel transition fixed the issue:

```
kernel 5.17.0-0.rc5.20220225git53ab78cd6d5a.106.fc37.x86_64 → 5.17.0-0.rc6.109.fc37.x86_64
```

Closes coreos/fedora-coreos-tracker#1092
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
… to debug kernels

It appears that on debug kernels the "/dev/disk/by-label/$label" device *can* not
appear at all (some sort of race condition) and thus calling `realpath` on it will
fail. Let's just make the call to `realpath` not be fatal so we can workaround this
issue that has been around in the kernel for some time.

Closes coreos/fedora-coreos-tracker#1092
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
The ignition-ostree-transposefs-restore.service is failing on boot
and we're dropping into the emergency shell. Let's snooze the test
for now on rawhide while we investigate.

See coreos/fedora-coreos-tracker#1092
HuijingHei added a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
Can not reproduce `coreos.boot-mirror.luks` failed issue locally
 with rawhide after running 20 times, remove from denylist to see
if it happens.

See coreos/fedora-coreos-tracker#1092
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
We're no longer seeing a test failure there. It appears the following
kernel transition fixed the issue:

```
kernel 5.17.0-0.rc5.20220225git53ab78cd6d5a.106.fc37.x86_64 → 5.17.0-0.rc6.109.fc37.x86_64
```

Closes coreos/fedora-coreos-tracker#1092
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
… to debug kernels

It appears that on debug kernels the "/dev/disk/by-label/$label" device *can* not
appear at all (some sort of race condition) and thus calling `realpath` on it will
fail. Let's just make the call to `realpath` not be fatal so we can workaround this
issue that has been around in the kernel for some time.

Closes coreos/fedora-coreos-tracker#1092
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira for syncing to jira kind/bug rawhide Issues that uniquely pertain to rawhide
Projects
None yet
5 participants