Sorting takes extremely long when sorting a four shank probe by property #2625
Comments
This is good info to have. I thought maybe this was a Windows-only issue, but maybe there is a global problem with the KS4 wrapper, @alejoe91? @guidomeijer (for background also see #2569), I was doing some KS4 testing and finding similar problems, but since all my data is multishank I had set this aside. I'm busy this week, but maybe I'll pick this back up and work on it some more unless Alessio figures it out first.
Hi @guidomeijer I'll take a look! Maybe the GPU capability is not propagated correctly when running by property, which makes KS4 run on CPU. |
@zm711 the |
I noticed that it does allocate some GPU memory (500-700 MB) but it's much lower than if you do |
How much memory do you have? Is it maybe doing something where it is dividing the memory by the number of shanks? Or by n_jobs (maybe memory divided by n_jobs)? How many n_jobs are you using?
The GPU has 16 GB of memory. I didn't specify the n_jobs parameter anywhere. |
The default is to use all available so if you don't change it, it will use all cpus. So how many cpu cores do you have? |
It's an AMD Ryzen Threadripper with 16 cores. But when I look at the usage it's only using one core at a time.
Maybe there is some sort of interaction because 16 GB for the GRAM/16 cores would give 1GB or less/core and then if it is only using one core then maybe that is part of the issue. I'll try to read into it bit more. When you do |
Still only one core at a time and 2 GB of GPU memory. But it processes a recording very fast (~2 hours). |
I've been trying to reread deeply into how Kilosort uses |
Actually thinking about this a bit more, I think this could be related to the Mountainsort5 caching issue. When I do Just to elaborate a bit more let's say I have a 10 gb file and I'm splitting in half. When I do |
Thanks @zm711 This could be a lead. Each run sorter job should get a channel slice object, so it shouldn't "know" that there are more channels. Maybe there is indeed something wrong in the splitting and job distribution! I'll take a look |
False start on my part. I added a bunch of prints to check the status, and the reason ms5 is doubling in size is that the data is being cast to float from the uint dtype of the recording. I'll keep searching in KS4 when I have free moments, but I'm still not sure.
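The size doubling from that kind of cast can be checked without any sorter involved. Below is a minimal sketch using only the standard library `array` module. It assumes the recording is stored as 16-bit unsigned integers and that the cast target is a 32-bit float; both dtypes are assumptions, since the comment above only says "uint" and "float". `n_samples` is a made-up value.

```python
from array import array

n_samples = 30_000  # hypothetical: one second of one channel at 30 kHz

# 'H' = unsigned 16-bit integers (assumed on-disk dtype)
raw = array("H", [0] * n_samples)
# 'f' = 32-bit floats (assumed cast target), same values
as_float = array("f", raw.tolist())

raw_bytes = raw.itemsize * len(raw)
float_bytes = as_float.itemsize * len(as_float)
ratio = float_bytes / raw_bytes

print(f"uint16: {raw_bytes} bytes, float32: {float_bytes} bytes, ratio: {ratio}")
```

Under these assumptions any cached copy of the cast data takes twice the space of the original, independent of how the channels are split.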
Hi everyone, does anyone have new information regarding this issue? I have been having similar issues using Kilosort4 through SpikeInterface on a multi-shank probe. Thank you!
Kilosort 4.0.4 should now handle multiple shanks (MouseLand/Kilosort#641 (comment)) so you can also just sort the whole recording in one go instead of sorting by property. I'm testing it now. |
Update: using Kilosort 4.0.4 still does not work for me when sorting 4-shank probes. It took 16 hours just to complete the first step and there are many more steps to go.
Couple questions. When you start from KS4 directly do you use a binary file? How did you write the binary file if you have it written? When sorting are you doing it locally or some sort of network/server mount? |
I use the binary file that comes directly out of SpikeGLX. I'm doing the sorting locally on my computer. |
The part that I'm struggling with is that at the beginning of the issue it was said that splitting the recording and running |
I found this thread because I am having the same issue. If I run kilosort directly on a single probe's .dat file, the run time is about the same as the recording time. However, if I have both probes loaded in SpikeInterface, each probe's sorting takes about 4x longer. When I open a single probe's results in Phy, I can actually see 2 probes in ProbeView. So I'm inclined to think that both probes are somehow present. Maybe this has a weird interaction with Kilosort's native handling of multiple probes as @guidomeijer mentioned. I see this issue on both Ubuntu 22.04 and Windows 10. I've also had problems with running multiple jobs on Windows (e.g., when running the analyzer). Not sure if that plays a role here, but this is in reference to @alejoe91's comment:
In the meantime, am I to understand that using |
@gkBCCN, what recording are you using? What probe type? Seeing 2 probes is weird and we could look into that more. |
Hey @zm711 . I'm using the SpikeGadgets .rec format that I added recently and my file has 2 Neuropixels1 probes. |
I think I found the issue in my case. If I split up the probe in shanks to run the destriping per shank and then merge the result back together into one recording, everything after that takes an insane amount of time. Even just plotting the traces doesn't work. If I skip this step, or do the destriping on the whole recording without splitting it up, it runs fine. This is the code that causes the slow-down:
Bear in mind when I say it runs fine I am talking about using |
Hi guys, I'm also looking into this and trying to reproduce the performance issue. I'm quite convinced, as @guidomeijer said, that this is mainly due to preprocessing. Here's a simple benchmark using simulated data (artificially split into 4 groups):
And these are the printed elapsed times:
So there is an overhead in running by group, but I think it's due to the overhead of running KS four times rather than once. @guidomeijer I'll look into why the highpass spatial filter behaves so differently if you split by group!
@gkBCCN can you share your code? |
Do you mean the preprocessing by KS4 or by SpikeInterface? All steps were slower in my case (templates, clustering, etc.). |
The SpikeInterface code up to the sorting run |
I just followed the tutorial:
where "recording" is BTW: there's a typo in the example on https://github.com/SpikeInterface/spikeinterface/blob/main/doc/how_to/process_by_channel_group.rst, Option 1: Manual splitting:
It should be sub_recording instead of split_preprocessed_recording. |
OK, if I use Option 1 - Manual Splitting (as described in my previous comment), which is the |
Hold the phone. Even if I use
My bad. I guess I should use sub_recording as I did in the sorting loop, correct? |
Yes, if you want one phy folder for each probe. It might be a good approach also to export everything into one Phy folder |
Hi guys, I found and fixed the issue!!! @guidomeijer the problem was in the aggregate channels (see PR #2736 ). The The new implementation is 10 times faster: @guidomeijer @gkBCCN Can you try the |
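To illustrate why the way an aggregated recording fetches its traces can matter so much, here is a toy model in plain Python. It is not the SpikeInterface implementation and not the code from the PR. It assumes (this is an assumption, the comment above does not spell it out) that the cost comes from asking a lazy preprocessing chain for traces once per channel instead of once per parent recording. `LazyParent`, `fetch_per_channel`, `fetch_grouped`, and the 4 x 96 channel layout are all hypothetical.

```python
class LazyParent:
    """Hypothetical lazy recording: every get_traces call recomputes the
    whole preprocessing chain, so the call count is a proxy for cost."""

    def __init__(self, name, n_channels):
        self.name = name
        self.n_channels = n_channels
        self.calls = 0

    def get_traces(self, channel_indices):
        self.calls += 1
        # Fake "traces": one label per requested channel.
        return [f"{self.name}:ch{i}" for i in channel_indices]


def fetch_per_channel(parents):
    """One get_traces call per aggregated channel."""
    out = []
    for parent in parents:
        for i in range(parent.n_channels):
            out.extend(parent.get_traces([i]))
    return out


def fetch_grouped(parents):
    """One get_traces call per parent, covering all of its channels."""
    out = []
    for parent in parents:
        out.extend(parent.get_traces(list(range(parent.n_channels))))
    return out


def make_shanks():
    # Hypothetical layout: 4 shanks with 96 channels each (384 in total).
    return [LazyParent(f"shank{k}", 96) for k in range(4)]


slow, fast = make_shanks(), make_shanks()
same_output = fetch_per_channel(slow) == fetch_grouped(fast)
slow_calls = sum(p.calls for p in slow)
fast_calls = sum(p.calls for p in fast)
print(same_output, slow_calls, fast_calls)
```

Both strategies return the same data, but the per-channel one triggers the expensive step once per channel. If the parent applies a filter that needs all channels of a shank, each of those calls would redo work that the grouped request does once.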
Hi @alejoe91
Recording time: 00:06:26.6 h:m:s
Kilosort GUI: Total = 324.52s = 00:05:25 h:m:s
Single probe (run_sorter): Total = 469.16s = 00:07:49 h:m:s
Aggregate (run_sorter_by_property): Total = 1265.55s = 00:21:06 h:m:s
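For comparing these numbers, it can help to express each run time as a multiple of the recording duration. A small sketch in plain Python using only the values reported above; the helper `format_hms` is a hypothetical name introduced here for illustration.

```python
def format_hms(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS, rounded to the nearest second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


recording_s = 6 * 60 + 26.6  # 00:06:26.6, as reported above

run_times_s = {
    "Kilosort GUI": 324.52,
    "Single probe (run_sorter)": 469.16,
    "Aggregate (run_sorter_by_property)": 1265.55,
}

# Run time relative to the duration of the recording itself.
ratios = {name: t / recording_s for name, t in run_times_s.items()}

for name, t in run_times_s.items():
    print(f"{name}: {format_hms(t)} ({ratios[name]:.2f}x recording time)")
```

With these figures the aggregate run takes roughly 2.7 times as long as the single-probe run_sorter call.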
Based on my reading it looks like single probe vs aggregate is now scaling appropriately. I think that the overhead of using the wrapper will mean that running this through SI would be expected to be slightly slower, but you get the benefit of sorting shanks individually rather than relying on KS4 trying to make the multiple shanks work (like with the stacking trick that was previously necessary). That's my interpretation at least. |
But the aggregate time is not for both probes. These are all processing times for a single probe. |
Sorry I misunderstood that. I thought it was for multiple probes! |
Do you know what your time for a |
@gkBCCN how are you using the aggregate function to run with a single probe? |
@zm711 |
So it's the same. I guess that's because you're using CMR, which is way faster than highpass_spatial_filtering. I'll run some more tests on my side.
Just to be clear, I'm using SpikeGadgets .rec files as input. |
Yep, but it's a memmap file so I don't think that makes a difference |
Honestly at this point it might make sense for us to try a profiler so we can see which step is taking long. If @gkBCCN knows how to use a profiler you could try it. I'm working on some other analysis but next week I can try profiling a call to kilosort4 on some of my data that I use on |
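For anyone who wants to try the profiler route, here is a minimal sketch with the standard library `cProfile` and `pstats` modules. `run_sorting_placeholder` is a hypothetical stand-in; in practice its body would be replaced by the real call being investigated (for example the run_sorter or run_sorter_by_property call discussed in this thread). Note that `cProfile` only sees time spent in the Python process it runs in, so work done in separate worker processes or on the GPU would show up only as waiting time in the caller.

```python
import cProfile
import io
import pstats
import time


def run_sorting_placeholder():
    """Hypothetical stand-in for the real sorting call being profiled."""
    time.sleep(0.01)
    return sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
result = run_sorting_placeholder()
profiler.disable()

# Collect the report in a string so it can be printed or saved to a file.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)  # ten most expensive entries
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the total run at the top, which is usually enough to tell whether the time goes into preprocessing, data transfer, or the sorter itself.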
@gkBCCN how long does it take to run both probes in KS directly or with the run_sorter_by_property? Note that an overhead is expected for sure, because of all the machinery in place to initialize and transfer data back and forth to the GPU... |
In my experience so far it always takes twice as long for a second probe when running run_sorter_by_property. That will necessarily be the case when running KS directly, because I run it twice, each time on a separate .dat file, but both have the same size. |
Anyways, @guidomeijer I tested again and your problem with destriping+aggregate should be solved. Here are some run times on a 384-channel recording:
Again, the overhead of running multiple probes is expected, but IMO it will give better results, especially when applying probe-specific preprocessing as in @guidomeijer's example and for drift correction!
Yes it's solved, thanks! |
Hi there! I'm trying to sort a four-shank Neuropixels recording with Kilosort4, but when I use run_sorter_by_property it takes 8 hours to sort one of the four shanks and an estimated 160 hours to recompute the spike templates. When I split the recording by shank and do run_sorter on a single shank, it runs normally. I'm using a local installation of Kilosort4 which has access to the GPU (I checked). Any idea what might be going on? I had a related issue on the Kilosort4 GitHub but it doesn't seem to be a Kilosort issue: MouseLand/Kilosort#631
Setup:
NVIDIA RTX 4080
64 GB RAM
Ubuntu 20.04
kilosort 4.0.2
spikeinterface 0.100.2
Code: