panic in go-unixfs hamt/hamt.go #9063
This is on my radar, but I will wait for confirmation on the new version and any possible reproduction of the issue.
Although it seemed to be stable for a longer time than usual, we encountered the same issue today. Here's the output below; I added some comments to it. It is basically a Docker container with ipfs: it initially runs an add operation and then periodic publish operations. Here's the output of the container:
Also adding the output of the periodic publish operations:
One additional thing to note is that the ipfs node is incredibly resource intensive in our case. Our server nearly always runs at full capacity, with ~350-400% of the CPU and around 8GB of RAM consumed by the ipfs node.
Thanks for the report @kuzdogan; I will take a closer look next week, but this seems consistent with the original report panicking in our new custom parallel walk.
Avoiding the panic in ipfs/go-unixfs#123 (blocking this issue on landing it). Other than that, I have no idea what to do (or how to investigate this further) other than letting it roll and seeing if we can get more information through the added log line.
Fixing the problem here: ipfs/go-unixfs#124
@ajnavarro Could you point me to where we had the race in our parallel walk?
Any call to these Shard methods (Set, SetLink, Swap, Remove, Take) from different threads can cause a race condition in both cases, reading or writing, and corrupt slice data.
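For illustration only (this is not the actual go-unixfs Shard code; the type and method below are made up), this is the general shape of the race being described: two goroutines appending to the same slice-backed structure without a lock. Running it with go run -race flags the data race.

```go
package main

import "fmt"

// shard is a stand-in for a structure that keeps its child links in a slice.
type shard struct {
	links []string
}

// set appends without any synchronization; append may reallocate the backing
// array, so concurrent callers race on both the slice header and its contents.
func (s *shard) set(name string) {
	s.links = append(s.links, name)
}

func main() {
	s := &shard{}
	done := make(chan struct{})
	go func() {
		for i := 0; i < 1000; i++ {
			s.set(fmt.Sprintf("a-%d", i))
		}
		close(done)
	}()
	for i := 0; i < 1000; i++ {
		s.set(fmt.Sprintf("b-%d", i))
	}
	<-done
	// Under the race the final count is typically short of 2000, i.e. the
	// slice data was silently corrupted.
	fmt.Println("entries:", len(s.links))
}
```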
@ajnavarro Yes, but where are those threads created that manipulate the same shard? By that criterion anything that doesn't have a lock has a race condition, but it depends on how it's being used. I need more specifics to understand why this panic specifically is being fixed by ipfs/go-unixfs#124.
@kuzdogan, @schomatis I tried to reproduce the error locally with no luck. One thing I noticed is that the add process takes only 10-15 minutes for me instead of 10h (adding 13GB into kubo):
Are you using mechanical disks? I want to rule out that I'm doing something incorrectly.
2022-08-04 maintainer conversation: the hypothesis is that MFS is related and that it's potentially not following certain concurrency expectations. Things we can do: fuzzing at the top level (run a hundred times with large sharded directories); a rough sketch of such a run follows the Discord link below. @Jorropo is going to paste in the link for a relevant Discord conversation.
https://discord.com/channels/806902334369824788/847893063841349652/1003259317383868426
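A minimal sketch of the kind of top-level stress run proposed above, assuming a local kubo daemon and the ipfs binary on PATH; the paths, worker count, and payload size are arbitrary choices, not anything prescribed in this thread. It hammers large MFS directories concurrently with files mkdir -p and files write --create, roughly the workload the reporter describes:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os/exec"
	"sync"
)

// run shells out to the local ipfs binary, feeding data (if any) on stdin.
func run(stdin []byte, args ...string) error {
	cmd := exec.Command("ipfs", args...)
	if stdin != nil {
		cmd.Stdin = bytes.NewReader(stdin)
	}
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("ipfs %v: %v: %s", args, err, out)
	}
	return nil
}

func main() {
	const workers, iterations = 8, 500
	payload := bytes.Repeat([]byte("x"), 4096)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				// Many entries under each worker directory push it past
				// the sharding threshold, so the HAMT code is exercised.
				dir := fmt.Sprintf("/stress/worker-%d/dir-%d", w, i)
				if err := run(nil, "files", "mkdir", "-p", dir); err != nil {
					log.Fatal(err)
				}
				if err := run(payload, "files", "write", "--create", dir+"/file"); err != nil {
					log.Fatal(err)
				}
			}
		}(w)
	}
	wg.Wait()
	fmt.Println("done; check the daemon logs for the hamt panic")
}
```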
Some more context from @aschmahmann on the #ipfs-implementers channel:
What's the next step here? Given we've had this open for weeks, do we believe this is a P1? |
The next step should be to do some fuzz testing on MFS and check if we can reproduce the error. There is a fixing attempt here: ipfs/go-unixfs#124, but maybe we want to do the blocking at the MFS level.
Lowering the priority because:
When we pick this up again, we need to:
2022-09-08 conversation:
I have good news and bad news. I was able to reproduce a panic on Shard from MFS here: ipfs/go-mfs#103.
The nil pointers come when childer is nil. This can only happen at https://github.com/ipfs/go-unixfs/blob/2c23c3ea6fae3ef1b487cfc0c606a4ffc7893676/hamt/hamt.go#L798, when swapping values that don't exist, so a new value is added.
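Purely as an illustration of that failure mode (the types and method below are hypothetical stand-ins, not the real go-unixfs structs): a swap on a name that was never inserted has no child entry to dereference, so the guarded version treats it as an insert instead of hitting a nil pointer.

```go
package main

import "fmt"

// Hypothetical stand-ins for a shard and its child entries.
type child struct{ val string }

type shard struct {
	children map[string]*child
}

// swap replaces the value stored under name and returns the old one. The crash
// described above corresponds to dereferencing a child that was never created;
// guarding turns that case into an insert instead.
func (s *shard) swap(name, val string) (old string, existed bool) {
	c := s.children[name]
	if c == nil {
		// The value does not exist yet: this "swap" is really an insert,
		// so create the entry rather than dereferencing a nil child.
		s.children[name] = &child{val: val}
		return "", false
	}
	old, c.val = c.val, val
	return old, true
}

func main() {
	s := &shard{children: map[string]*child{}}
	fmt.Println(s.swap("a", "1")) // insert path, no nil dereference
	fmt.Println(s.swap("a", "2")) // replace path, returns the old value
}
```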
(If I understand correctly) this was fixed in #9402 and will be included in Kubo 0.17.0-rc2.
Version
(updated to 0.13.0, now waiting to see if it reproduces there too)
Description
Running ipfs files mkdir -p followed by ipfs files write -c in a tight loop sometimes produces a panic. The relevant code calling /api/v0/files/mkdir and /api/v0/files/write over the HTTP RPC API client is here.
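For reference, a hedged sketch of what such a tight loop looks like against the HTTP RPC API directly. The two endpoints come from the issue text; the query-parameter names and the multipart upload follow the usual kubo RPC conventions, so treat the details as an approximation rather than a verified client:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"mime/multipart"
	"net/http"
	"net/url"
)

const api = "http://127.0.0.1:5001" // assumes a local kubo daemon on the default RPC port

func mkdir(path string) error {
	// POST /api/v0/files/mkdir?arg=<path>&parents=true
	resp, err := http.Post(api+"/api/v0/files/mkdir?parents=true&arg="+url.QueryEscape(path), "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("mkdir %s: %s", path, resp.Status)
	}
	return nil
}

func write(path string, data []byte) error {
	// POST /api/v0/files/write?arg=<path>&create=true with the data as a multipart file part.
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)
	part, err := mw.CreateFormFile("file", "data")
	if err != nil {
		return err
	}
	if _, err := part.Write(data); err != nil {
		return err
	}
	mw.Close()
	resp, err := http.Post(api+"/api/v0/files/write?create=true&arg="+url.QueryEscape(path),
		mw.FormDataContentType(), &body)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("write %s: %s", path, resp.Status)
	}
	return nil
}

func main() {
	// Tight mkdir+write loop; the reporter sees the panic only after days of this.
	for i := 0; ; i++ {
		dir := fmt.Sprintf("/loop/dir-%d", i)
		if err := mkdir(dir); err != nil {
			log.Fatal(err)
		}
		if err := write(dir+"/file", []byte("hello")); err != nil {
			log.Fatal(err)
		}
	}
}
```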
The panic happens randomly for them, every 2-3 days.
Asked them to update to 0.13.0 and report if it happens again.
cc @schomatis