Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

comma three with NVMe: loggerd crash #34742

Open
yuzisee opened this issue Mar 1, 2025 · 12 comments
Open

comma three with NVMe: loggerd crash #34742

yuzisee opened this issue Mar 1, 2025 · 12 comments
Labels
Milestone

Comments

@yuzisee
Copy link

yuzisee commented Mar 1, 2025

Describe the bug

I got "Process not running" (loggerd) during release3-staging

But it won't create logs to upload if loggerd isn't working... so I don't have a route to share? (At least, connect.comma.ai isn't showing this particular route)

Provide a route where the issue occurs

Dongle ID is 21015c36062c3cef (without loggerd working there is no logs, right?)

openpilot version

release3-staging (installed Feb 28, 2025)

Additional info

Original thread: https://discord.com/channels/469524606043160576/616456819027607567/1345445493869777119

Image

@yuzisee yuzisee added the bug label Mar 1, 2025
@adeebshihadeh adeebshihadeh added this to the 0.9.8 milestone Mar 1, 2025
@adeebshihadeh
Copy link
Contributor

adeebshihadeh commented Mar 1, 2025

It hit this assert. Do you ever SSH into your device? Is it possible you changed the perms on the log directory? Was it fine on subsequent drives?

https://commaai.sentry.io/issues/5994831372/?query=user.id%3A21015c36062c3cef&referrer=issue-stream&statsPeriod=14d&stream_index=0

@yuzisee
Copy link
Author

yuzisee commented Mar 1, 2025

It hit this assert.

Hmm, looks like it's trying to create a folder for the route itself?

Do you ever SSH into your device?

Once long ago, but not on this device in over a year at least and haven't had any issues since.

Is it possible you changed the perms on the log directory? Was it fine on subsequent drives?

Previous drives also work (I can see routes I drove earlier that same morning with release3-staging and they successfully uploaded and are visible on connect.comma.ai

Also, let me see if Settings > Device > Reboot fixes the issue... stay tuned.

https://commaai.sentry.io/issues/5994831372/?query=user.id%3A21015c36062c3cef&referrer=issue-stream&statsPeriod=14d&stream_index=0

@yuzisee
Copy link
Author

yuzisee commented Mar 1, 2025

Also, let me see if Settings > Device > Reboot fixes the issue... stay tuned.

Yes. Rebooting the device also fixes it... so it couldn't be a permissions issue right?

No reason for the permission to change

  • (a) by switching vehicle ignition from onoffon
  • nor, (b) via Settings > Device > Reboot on openpilot itself

right?

@adeebshihadeh
Copy link
Contributor

Ah, if you haven't SSH'd in for a long time or used a fork without doing a factory reset, then this should definitely be handled.

@yuzisee
Copy link
Author

yuzisee commented Mar 1, 2025

Ah, if you haven't SSH'd in for a long time or used a fork without doing a factory reset, then this should definitely be handled.

But even 9735cf2 already shipped in https://github.com/commaai/openpilot/releases/tag/v0.9.7 so if this is, in fact, a 0.9.8 regression it would have to

  1. be some sort of timing issue / race condition?
  2. maybe a buffer overflow (the path itself is corrupted at some point)

For what it's worth, the last path to upload correctly before the issue was
00000000--7b39adbf9f
and then the first path to upload correctly after the issue (once the device had rebooted) was
00000005--73586bbde4

so presumably I attempted to engage openpilot 4 times in between (sounds about right) and all of those hit the assert (you should see exactly 4 reports of that assert triggering from my device, if so)

@adeebshihadeh
Copy link
Contributor

So that number is the route count, not engaged count (route = one "ignition" session). If this is a regression, I suspect it has to do with an NVMe race condition. If that's true, I'd expect the issue to start on the first route after a bootup and persist until reboot.

@adeebshihadeh
Copy link
Contributor

adeebshihadeh commented Mar 1, 2025

Yeah, it's NVMe related. You're also getting:

Traceback (most recent call last):
  File "/data/openpilot/system/loggerd/uploader.py", line 58, in listdir_by_creation
    paths = [f for f in os.listdir(d) if os.path.isdir(os.path.join(d, f))]
                        ^^^^^^^^^^^^^
OSError: [Errno 5] Input/output error: '/data/media/0/realdata/'

Unfortunately, I haven't been able to repro this on my desk so far, and we don't even have a bootlog from the bad boot. If it happens again, it'd be great if you can post the output of this:

dmesg | nc termbin.com 9999
journalctl -xe | nc termbin.com 9999

@yuzisee
Copy link
Author

yuzisee commented Mar 3, 2025

If it happens again, it'd be great if you can post the output of this:

dmesg | nc termbin.com 9999
journalctl -xe | nc termbin.com 9999

FWIW it's happened in four separate occasions since installing release3-staging ~48 hours ago, and it seems to be easiest to reproduce by either...
(a) leaving the device on for a long time after a drive or two, or
(b) or going for a long drive

Seemingly, as long as I can avoid having the device turn off (e.g. if the 12V battery in the car gets low, the Comma device will turn itself off) it does seem like it will always eventually trigger at some point.


Here are the kernel & systemd logs:

dmesg.txt

journalctl_xe.txt


I've connected a trickle charger to my Prius just now so we can hopefully keep the device in this state for as long as possible, in case there's anything else you want to try and debug

@adeebshihadeh
Copy link
Contributor

Your NVMe is getting disconnected.

[30106.651116] nvme 0001:01:00.0: Failed status: 0xffffffff, reset controller.
[30106.651282] nvme 0001:01:00.0: of_irq_parse_pci() failed with rc=134
[30106.656727] q6asm_callback: cmd = 0x10bcd returned error = 0x1
[30106.656805] __q6asm_cmd: DSP returned error[ADSP_EFAILED] opcode 68557
[30106.671085] pci_raw_set_power_state: 20 callbacks suppressed
[30106.671094] nvme 0001:01:00.0: Refused to change power state, currently in D3
[30106.671153] nvme nvme0: Removing after probe failure status: -19
[30106.671433] nvme0n1: detected capacity change from 250059350016 to 0

Although we made big userspace updates, the kernel is largely unchanged. It's possible this is a HW failure rather than a regression in the software. Since it's so reproducible, can you try to repro on the current release?

@adeebshihadeh
Copy link
Contributor

It seemed to happen right as you went onroad though... perhaps it's reproducible by cycling ignition rather than pure uptime.

@adeebshihadeh
Copy link
Contributor

Ah, I forgot that I recently made this stricter: 8272221.

The NVMe behavior likely always happened on your device, but it didn't show the alert until now. I'm going to just revert that for this release and revisit for the next one.

@adeebshihadeh adeebshihadeh modified the milestones: 0.9.8, 0.9.9 Mar 3, 2025
@adeebshihadeh adeebshihadeh changed the title release3-staging Error message: Process not running (loggerd) Handle comma three NVMe going down Mar 3, 2025
@adeebshihadeh adeebshihadeh changed the title Handle comma three NVMe going down comma three with NVMe: loggerd crash Mar 3, 2025
@yuzisee
Copy link
Author

yuzisee commented Mar 3, 2025

It seemed to happen right as you went onroad though... perhaps it's reproducible by cycling ignition rather than pure uptime.

Yeah. I did see it only once in the middle of driving (it would have been the 00000012-… route (on Sunday Mar 2), which never uploaded for the same reason) but the other times were just as I went onroad, that's true. However, it's possible that the NVME issue occurs while parked and the assertion triggers only once we go onroad (because there's no reason for 8272221 to trigger when parked)

Since it's so reproducible, can you try to repro on the current release?

Once on release, is there anything else I can log or capture other than dmesg and journalctl that would be helpful?

I'm going to just revert that for this release and revisit for the next one.

If we revert the change, I won't be able to tell when the issue triggers — so I won't know when to dmesg/journalctl I assume?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants