comma three with NVMe: loggerd crash #34742

yuzisee · 2025-03-01T17:23:58Z

Describe the bug

I got "Process not running" (loggerd) during release3-staging

But it won't create logs to upload if loggerd isn't working... so I don't have a route to share? (At least, connect.comma.ai isn't showing this particular route)

Provide a route where the issue occurs

Dongle ID is 21015c36062c3cef (without loggerd working there is no logs, right?)

openpilot version

release3-staging (installed Feb 28, 2025)

Additional info

Original thread: https://discord.com/channels/469524606043160576/616456819027607567/1345445493869777119

The text was updated successfully, but these errors were encountered:

adeebshihadeh · 2025-03-01T19:33:46Z

It hit this assert. Do you ever SSH into your device? Is it possible you changed the perms on the log directory? Was it fine on subsequent drives?

https://commaai.sentry.io/issues/5994831372/?query=user.id%3A21015c36062c3cef&referrer=issue-stream&statsPeriod=14d&stream_index=0

yuzisee · 2025-03-01T20:05:32Z

It hit this assert.

Hmm, looks like it's trying to create a folder for the route itself?

Do you ever SSH into your device?

Once long ago, but not on this device in over a year at least and haven't had any issues since.

Is it possible you changed the perms on the log directory? Was it fine on subsequent drives?

Previous drives also work (I can see routes I drove earlier that same morning with release3-staging and they successfully uploaded and are visible on connect.comma.ai

Also, let me see if Settings > Device > Reboot fixes the issue... stay tuned.

https://commaai.sentry.io/issues/5994831372/?query=user.id%3A21015c36062c3cef&referrer=issue-stream&statsPeriod=14d&stream_index=0

yuzisee · 2025-03-01T20:08:31Z

Also, let me see if Settings > Device > Reboot fixes the issue... stay tuned.

Yes. Rebooting the device also fixes it... so it couldn't be a permissions issue right?

No reason for the permission to change

(a) by switching vehicle ignition from on → off → on
nor, (b) via Settings > Device > Reboot on openpilot itself

right?

adeebshihadeh · 2025-03-01T20:13:23Z

Ah, if you haven't SSH'd in for a long time or used a fork without doing a factory reset, then this should definitely be handled.

yuzisee · 2025-03-01T21:11:51Z

Ah, if you haven't SSH'd in for a long time or used a fork without doing a factory reset, then this should definitely be handled.

But even 9735cf2 already shipped in https://github.com/commaai/openpilot/releases/tag/v0.9.7 so if this is, in fact, a 0.9.8 regression it would have to

be some sort of timing issue / race condition?
maybe a buffer overflow (the path itself is corrupted at some point)

For what it's worth, the last path to upload correctly before the issue was
00000000--7b39adbf9f
and then the first path to upload correctly after the issue (once the device had rebooted) was
00000005--73586bbde4

so presumably I attempted to engage openpilot 4 times in between (sounds about right) and all of those hit the assert (you should see exactly 4 reports of that assert triggering from my device, if so)

adeebshihadeh · 2025-03-01T22:13:45Z

So that number is the route count, not engaged count (route = one "ignition" session). If this is a regression, I suspect it has to do with an NVMe race condition. If that's true, I'd expect the issue to start on the first route after a bootup and persist until reboot.

adeebshihadeh · 2025-03-01T22:22:11Z

Yeah, it's NVMe related. You're also getting:

Traceback (most recent call last):
  File "/data/openpilot/system/loggerd/uploader.py", line 58, in listdir_by_creation
    paths = [f for f in os.listdir(d) if os.path.isdir(os.path.join(d, f))]
                        ^^^^^^^^^^^^^
OSError: [Errno 5] Input/output error: '/data/media/0/realdata/'

Unfortunately, I haven't been able to repro this on my desk so far, and we don't even have a bootlog from the bad boot. If it happens again, it'd be great if you can post the output of this:

dmesg | nc termbin.com 9999
journalctl -xe | nc termbin.com 9999

yuzisee · 2025-03-03T05:18:07Z

If it happens again, it'd be great if you can post the output of this:
dmesg | nc termbin.com 9999
journalctl -xe | nc termbin.com 9999

FWIW it's happened in four separate occasions since installing release3-staging ~48 hours ago, and it seems to be easiest to reproduce by either...
(a) leaving the device on for a long time after a drive or two, or
(b) or going for a long drive

Seemingly, as long as I can avoid having the device turn off (e.g. if the 12V battery in the car gets low, the Comma device will turn itself off) it does seem like it will always eventually trigger at some point.

Here are the kernel & systemd logs:

dmesg.txt

journalctl_xe.txt

I've connected a trickle charger to my Prius just now so we can hopefully keep the device in this state for as long as possible, in case there's anything else you want to try and debug

adeebshihadeh · 2025-03-03T06:06:18Z

Your NVMe is getting disconnected.

[30106.651116] nvme 0001:01:00.0: Failed status: 0xffffffff, reset controller.
[30106.651282] nvme 0001:01:00.0: of_irq_parse_pci() failed with rc=134
[30106.656727] q6asm_callback: cmd = 0x10bcd returned error = 0x1
[30106.656805] __q6asm_cmd: DSP returned error[ADSP_EFAILED] opcode 68557
[30106.671085] pci_raw_set_power_state: 20 callbacks suppressed
[30106.671094] nvme 0001:01:00.0: Refused to change power state, currently in D3
[30106.671153] nvme nvme0: Removing after probe failure status: -19
[30106.671433] nvme0n1: detected capacity change from 250059350016 to 0

Although we made big userspace updates, the kernel is largely unchanged. It's possible this is a HW failure rather than a regression in the software. Since it's so reproducible, can you try to repro on the current release?

adeebshihadeh · 2025-03-03T06:09:19Z

It seemed to happen right as you went onroad though... perhaps it's reproducible by cycling ignition rather than pure uptime.

adeebshihadeh · 2025-03-03T17:31:39Z

Ah, I forgot that I recently made this stricter: 8272221.

The NVMe behavior likely always happened on your device, but it didn't show the alert until now. I'm going to just revert that for this release and revisit for the next one.

yuzisee · 2025-03-03T18:07:23Z

It seemed to happen right as you went onroad though... perhaps it's reproducible by cycling ignition rather than pure uptime.

Yeah. I did see it only once in the middle of driving (it would have been the 00000012-… route (on Sunday Mar 2), which never uploaded for the same reason) but the other times were just as I went onroad, that's true. However, it's possible that the NVME issue occurs while parked and the assertion triggers only once we go onroad (because there's no reason for 8272221 to trigger when parked)

Since it's so reproducible, can you try to repro on the current release?

Once on release, is there anything else I can log or capture other than dmesg and journalctl that would be helpful?

I'm going to just revert that for this release and revisit for the next one.

If we revert the change, I won't be able to tell when the issue triggers — so I won't know when to dmesg/journalctl I assume?

yuzisee added the bug label Mar 1, 2025

adeebshihadeh added this to the 0.9.8 milestone Mar 1, 2025

adeebshihadeh modified the milestones: 0.9.8, 0.9.9 Mar 3, 2025

adeebshihadeh changed the title ~~release3-staging Error message: Process not running (loggerd)~~ Handle comma three NVMe going down Mar 3, 2025

adeebshihadeh changed the title ~~Handle comma three NVMe going down~~ comma three with NVMe: loggerd crash Mar 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

comma three with NVMe: loggerd crash #34742

comma three with NVMe: loggerd crash #34742

yuzisee commented Mar 1, 2025

adeebshihadeh commented Mar 1, 2025 •

edited

Loading

yuzisee commented Mar 1, 2025

yuzisee commented Mar 1, 2025 •

edited

Loading

adeebshihadeh commented Mar 1, 2025

yuzisee commented Mar 1, 2025 •

edited

Loading

adeebshihadeh commented Mar 1, 2025

adeebshihadeh commented Mar 1, 2025 •

edited

Loading

yuzisee commented Mar 3, 2025 •

edited

Loading

adeebshihadeh commented Mar 3, 2025

adeebshihadeh commented Mar 3, 2025

adeebshihadeh commented Mar 3, 2025

yuzisee commented Mar 3, 2025 •

edited

Loading

comma three with NVMe: loggerd crash #34742

comma three with NVMe: loggerd crash #34742

Comments

yuzisee commented Mar 1, 2025

Describe the bug

Provide a route where the issue occurs

openpilot version

Additional info

adeebshihadeh commented Mar 1, 2025 • edited Loading

yuzisee commented Mar 1, 2025

yuzisee commented Mar 1, 2025 • edited Loading

adeebshihadeh commented Mar 1, 2025

yuzisee commented Mar 1, 2025 • edited Loading

adeebshihadeh commented Mar 1, 2025

adeebshihadeh commented Mar 1, 2025 • edited Loading

yuzisee commented Mar 3, 2025 • edited Loading

adeebshihadeh commented Mar 3, 2025

adeebshihadeh commented Mar 3, 2025

adeebshihadeh commented Mar 3, 2025

yuzisee commented Mar 3, 2025 • edited Loading

adeebshihadeh commented Mar 1, 2025 •

edited

Loading

yuzisee commented Mar 1, 2025 •

edited

Loading

yuzisee commented Mar 1, 2025 •

edited

Loading

adeebshihadeh commented Mar 1, 2025 •

edited

Loading

yuzisee commented Mar 3, 2025 •

edited

Loading

yuzisee commented Mar 3, 2025 •

edited

Loading