Sorting takes extremely long when sorting a four shank probe by property #2625

guidomeijer · 2024-03-26T13:13:58Z

Hi there! I'm trying to sort a four-shank Neuropixel recording with Kilosort4 but when I try to use run_sorter_by_property it takes 8 hours to sort one of the four shanks and an estimated 160 hours to recompute the spike templates. When I split the recording by shank and do run_sorter on a single shank it runs normally. I'm using a local installation of Kilosort4 which has access to the GPU (I checked). Any idea what might be going on? I had a related issue on the Kilosort4 github but it doesn't seem to be a Kilosort issue: MouseLand/Kilosort#631

Setup:
NVIDIA RTX 4080
64 GB RAM
Ubuntu 20.04
kilosort 4.0.2
spikeinterface 0.100.2

Code:

            # Apply high-pass filter
            print('Applying high-pass filter.. ')
            rec_filtered = spre.highpass_filter(rec)
            
            # Correct for inter-sample phase shift
            print('Correcting for phase shift.. ')
            rec_shifted = spre.phase_shift(rec_filtered)
            
            # Detect and interpolate over bad channels
            print('Detecting and interpolating over bad channels.. ')
            bad_channel_ids, all_channels = spre.detect_bad_channels(rec_shifted)
            
            # If there are too many bad channels, skip the interpolation step
            prec_bad_ch = np.sum(all_channels == 'noise') / all_channels.shape[0]
            if prec_bad_ch < (1/3):
                rec_interpolated = spre.interpolate_bad_channels(rec_shifted, bad_channel_ids)
                print(f'{np.sum(all_channels == "noise")} ({prec_bad_ch*100:.0f}%) bad channels')
            else:
                rec_interpolated = rec_shifted
                print(f'{np.sum(all_channels == "noise")} ({prec_bad_ch*100:.0f}%) bad channels,',
                      'skipping the interpolation step')
            
            # If there are multiple shanks, do destriping per shank
            print('Destriping.. ')
            if np.unique(rec_interpolated.get_property('group')).shape[0] > 1:
                
                # Loop over shanks and do preprocessing per shank
                rec_split = rec_interpolated.split_by(property='group')
                rec_destriped = []
                for sh in range(len(rec_split)):
                    rec_destriped.append(spre.highpass_spatial_filter(rec_split[sh]))
                
                # Merge back together
                rec_final = si.aggregate_channels(rec_destriped)
                
                # Run spike sorting per shank 
                try:
                    print(f'\nStarting {split(probe_path)[-1]} spike sorting at {datetime.now().strftime("%H:%M")}')
                    sort = run_sorter_by_property(
                        sorter_name=settings_dict['SPIKE_SORTER'],
                        recording=rec_final,
                        grouping_property='group',
                        working_folder=join(probe_path, settings_dict['SPIKE_SORTER'] + id_str),
                        verbose=True,
                        docker_image=settings_dict['USE_DOCKER'],
                        **sorter_params)
                
                except Exception as err:
                    # Log error to disk
                    print(err)
                    logf = open(os.path.join(probe_path, 'error_log.txt'), 'w')
                    logf.write(str(err))
                    logf.close()
                    
                    # Continue with next recording
                    continue

zm711 · 2024-03-26T13:38:12Z

This is good info to have. I thought maybe this was a Windows only issue, but maybe there is a global problem with the KS4 wrapper @alejoe91?

@guidomeijer (for background also see #2569), I was doing some KS4 testing and finding similar problems. But since all my data is multishank I had just set this aside. I'm busy this week, but maybe I'll pick this back up and work on it some more unless Alessio figures it out first.

alejoe91 · 2024-03-26T14:36:19Z

Hi @guidomeijer

I'll take a look! Maybe the GPU capability is not propagated correctly when running by property, which makes KS4 run on CPU.
That's the only thing I can think of.

alejoe91 · 2024-03-26T15:17:09Z

@zm711 run_sorter_by_property is basically a map to run_sorter_jobs, so we probably need to focus there. I'll try to reproduce the issue locally over the next few days.

guidomeijer · 2024-03-26T16:07:41Z

I noticed that it does allocate some GPU memory (500-700 MB) but it's much lower than if you do run_sorter

zm711 · 2024-03-26T16:59:28Z

How much memory do you have? Is it maybe doing something where it is dividing the memory by the number of shanks?

Or how many n_jobs (maybe memory divided/n_jobs)?

guidomeijer · 2024-03-27T08:27:32Z

The GPU has 16 GB of memory. I didn't specify the n_jobs parameter anywhere.

zm711 · 2024-03-27T09:39:00Z

The default is to use all available so if you don't change it, it will use all cpus. So how many cpu cores do you have?

guidomeijer · 2024-03-28T13:15:47Z

It's an AMD Ryzen Threadripper with 16 cores. But when I look at the usage, it's only using one core at a time.

zm711 · 2024-03-28T14:44:16Z

Maybe there is some sort of interaction: 16 GB of GPU memory divided across 16 cores would give 1 GB or less per core, and if it is only using one core then maybe that is part of the issue. I'll try to read into it a bit more. When you do run_sorter, how many cores and how much GPU memory are engaged?

guidomeijer · 2024-04-02T12:18:12Z

Still only one core at a time and 2 GB of GPU memory. But it processes a recording very fast (~2 hours).

zm711 · 2024-04-02T17:15:35Z

I've been trying to reread deeply into how Kilosort uses get_traces, and I can't see why it would be different between creating the ChannelSliceRecording before run_sorter or inside of run_sorter_by_property. I've run a bunch of testing and at each step inside our wrapper it says device being used is 'cuda' so it is not switching to 'cpu' in between steps. @alejoe91, I'm happy to test anything out, but I'm really not sure for this one.

zm711 · 2024-04-03T14:10:02Z

@alejoe91,

Actually, thinking about this a bit more, I think this could be related to the Mountainsort5 caching issue. When I do run_sorter_by_property, mountainsort5 still writes the whole recording rather than the sub_recording for its caching. I think the same thing could be happening here: even though we are giving it a sub_recording, the RecordingExtractorAsArray in Kilosort might be looking at the whole recording rather than the sub. Any ideas why that might be? (Again, this makes sense to me in that 2 hours/shank = ~8 hours for 4 shanks.)

Just to elaborate a bit more, let's say I have a 10 GB file and I'm splitting it in half. When I do run_sorter_by_property for KS2 or KS3 the recording.dat is ~5 GB. Doing the exact same thing with MS5 leads to a recording.dat of 10 GB. So if we stay in Python we either write the whole recording erroneously or we write the half data twice.

alejoe91 · 2024-04-04T07:45:35Z

Thanks @zm711

This could be a lead. Each run sorter job should get a channel slice object, so it shouldn't "know" that there are more channels. Maybe there is indeed something wrong in the splitting and job distribution! I'll take a look

zm711 · 2024-04-04T11:08:39Z

False start on my part. I added a bunch of prints to check the status, and the reason ms5 is doubling in size is that the data is being cast to a float from the uint dtype of the recording. I'll keep searching in KS4 when I have free moments, but I'm still not sure.
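
The size doubling follows directly from the sample width. A minimal sketch of the arithmetic, assuming 16-bit unsigned samples cast to 32-bit floats (the exact dtypes are not stated in this thread) and reusing the 10 GB, split-in-half example from the earlier comment:

```python
import struct

# Assumed sample formats (not stated in the thread): 16-bit unsigned integers
# in the recording, 32-bit floats after the cast.
BYTES_PER_UINT16 = struct.calcsize("H")
BYTES_PER_FLOAT32 = struct.calcsize("f")


def cached_size_gb(full_size_gb, fraction_of_channels, bytes_in, bytes_out):
    """Size of a cached sub-recording after an optional dtype cast."""
    return full_size_gb * fraction_of_channels * bytes_out / bytes_in


# Half of a 10 GB recording, written without a cast.
size_no_cast = cached_size_gb(10, 0.5, BYTES_PER_UINT16, BYTES_PER_UINT16)
# The same half, cast to float32 before it is written.
size_with_cast = cached_size_gb(10, 0.5, BYTES_PER_UINT16, BYTES_PER_FLOAT32)

print(size_no_cast, size_with_cast)
```

Under these assumptions the half recording comes out at 5 GB without the cast and 10 GB with it, which matches the file sizes reported for KS2/KS3 and MS5.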

BovenE · 2024-04-15T14:55:37Z

Hi everyone, does anyone have any new information regarding this issue? I have been having similar issues using kilosort4 through spikeinterface on a multi-shank probe. Thank you!

guidomeijer · 2024-04-16T08:23:40Z

Kilosort 4.0.4 should now handle multiple shanks (MouseLand/Kilosort#641 (comment)) so you can also just sort the whole recording in one go instead of sorting by property. I'm testing it now.

guidomeijer · 2024-04-18T06:49:04Z

Update: using Kilosort 4.0.4 still does not work for me when sorting 4-shank probes. It took 16 hours just to complete the first step and there are many more steps to go.

zm711 · 2024-04-18T10:47:16Z

Couple questions. When you start from KS4 directly do you use a binary file? How did you write the binary file if you have it written?

When sorting are you doing it locally or some sort of network/server mount?

guidomeijer · 2024-04-18T10:58:50Z

I use the binary file that comes directly out of SpikeGLX. I'm doing the sorting locally on my computer.

zm711 · 2024-04-18T11:07:09Z

The part that I'm struggling with is that at the beginning of the issue it is said that splitting the recording and running run_sorter on the pre-split recording is the same speed as KS4, but run_sorter_by_property is slow. So I was wondering if the binary file vs. making a SpikeInterface recording was the slow thing. My logic being that with a straight binary file they just call to the binary file, but with a SpikeInterface array they call to a class they made which then calls a function that calls to our class which calls a function. So we are running through an extra 3-4 layers of Python in that case. But if that is really slowing things down I would expect a similar slowdown for run_sorter and run_sorter_by_property. Could you re-confirm the speed for each of your test conditions? Something like:
KS4 native: ~1 hour/shank
KS4 run_sorter (pre-split)...
KS4 run_sorter_by_property....

gkBCCN · 2024-04-18T12:06:51Z

I found this thread because I am having the same issue. If I run kilosort directly on a single probe's .dat file, the run time is about the same as the recording time. However, if I have both probes loaded in SpikeInterface, each probe's sorting takes about 4x longer. When I open a single probe's results in Phy, I can actually see 2 probes in ProbeView.

So I'm inclined to think that both probes are somehow present. Maybe this has a weird interaction with Kilosort's native handling of multiple probes as @guidomeijer mentioned.

I see this issue on both Ubuntu 22.04 and Windows 10.

I've also had problems with running multiple jobs on Windows (e.g., when running the analyzer). Not sure if that plays a role here, but this is in reference to @alejoe91's comment:

Thanks @zm711

This could be a lead. Each run sorter job should get a channel slice object, so it shouldn't "know" that there are more channels.
Maybe there is indeed something wrong in the splitting and job distribution! I'll take a look

In the meantime, am I to understand that using si.run_sorter is a working alternative?

zm711 · 2024-04-18T12:14:56Z

@gkBCCN, what recording are you using? What probe type? Seeing 2 probes is weird and we could look into that more.

gkBCCN · 2024-04-18T12:18:17Z

Hey @zm711 . I'm using the SpikeGadgets .rec format that I added recently and my file has 2 Neuropixels1 probes.

guidomeijer · 2024-04-18T12:41:11Z

I think I found the issue in my case. If I split up the probe in shanks to run the destriping per shank and then merge the result back together into one recording, everything after that takes an insane amount of time. Even just plotting the traces doesn't work. If I skip this step, or do the destriping on the whole recording without splitting it up, it runs fine.

This is the code that causes the slow-down:

            # If there are multiple shanks, do destriping per shank
            print('Destriping.. ')
            if np.unique(rec_interpolated.get_property('group')).shape[0] > 1:
                
                # Loop over shanks and do preprocessing per shank
                rec_split = rec_interpolated.split_by(property='group')
                rec_destriped = []
                for sh in range(len(rec_split)):
                    rec_destriped.append(spre.highpass_spatial_filter(rec_split[sh]))
                
                # Merge back together
                rec_final = si.aggregate_channels(rec_destriped)

Bear in mind when I say it runs fine I am talking about using run_sorter. I think run_sorter_by_property is still very slow but I haven't checked this recently because with Kilosort 4.0.4 it's not necessary anymore.
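
One way to localize a slowdown like this is to time a fixed-size get_traces call on each stage of the preprocessing chain. The following is a minimal sketch, assuming only that each object exposes get_traces(start_frame=..., end_frame=...) the way SpikeInterface recordings do. The helper time_get_traces and the FakeRecording stand-in are hypothetical names used for illustration.

```python
import time


def time_get_traces(recording, num_frames=30_000, repeats=3):
    """Return the best wall-clock time (s) for reading `num_frames` samples."""
    best = float("inf")
    for _ in range(repeats):
        t_start = time.perf_counter()
        recording.get_traces(start_frame=0, end_frame=num_frames)
        best = min(best, time.perf_counter() - t_start)
    return best


class FakeRecording:
    """Hypothetical stand-in so the sketch runs without any data."""

    def __init__(self, delay_s):
        self.delay_s = delay_s

    def get_traces(self, start_frame=None, end_frame=None):
        time.sleep(self.delay_s)  # pretend to read and filter a chunk
        return []


# Two made-up stages with different read costs.
stages = {"fast_stage": FakeRecording(0.001), "slow_stage": FakeRecording(0.05)}
timings = {name: time_get_traces(rec) for name, rec in stages.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

With real data, the dictionary would hold objects such as rec_interpolated, the per-shank outputs of highpass_spatial_filter, and rec_final, which should show whether the aggregated recording is the slow one.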

alejoe91 · 2024-04-18T13:03:06Z

Hi guys,

I'm also looking into this and trying to reproduce the performance issue. I'm quite convinced, as @guidomeijer said, that this is mainly due to preprocessing.

Here's a simple benchmark using simulated data (artificially split into 4 groups):

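import time

import numpy as np
import spikeinterface.full as si
import spikeinterface.sorters as ss

# create simulated recording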
rec, sort = si.generate_ground_truth_recording(num_channels=128, durations=[150], num_units=60)
num_groups = 4
channels_per_group = rec.get_num_channels() // num_groups
channel_groups = []
for i in range(num_groups):
    channel_groups.extend([i] * channels_per_group)
rec.set_channel_groups(channel_groups)

print("\n\nRUNNING ALL\n\n")
t_start = time.perf_counter()
sorting_ks4_all = ss.run_sorter("kilosort4", rec, output_folder="ks4_all")
t_stop = time.perf_counter()
elapsed_all = np.round(t_stop - t_start, 2)
print(f"Elapsed time all: {elapsed_all} s")

print("\n\nRUNNING LOOP\n\n")
t_start = time.perf_counter()
rec_dict = rec.split_by("group")
for g, rec_g in rec_dict.items():
    sorting_ks4_loop = ss.run_sorter("kilosort4", rec_g, output_folder=f"ks4_loop{g}")
t_stop = time.perf_counter()
elapsed_loop = np.round(t_stop - t_start, 2)
print(f"Elapsed time loop: {elapsed_loop} s")

print("\n\nRUNNING BY PROPERTY\n\n")
t_start = time.perf_counter()
sorting_ks4_prop = ss.run_sorter_by_property("kilosort4", rec, grouping_property="group", working_folder="ks4_prop")
t_stop = time.perf_counter()
elapsed_prop = np.round(t_stop - t_start, 2)
print(f"Elapsed time by property: {elapsed_prop} s")

And these are the printed elapsed times:

RUNNING ALL
Elapsed time all: 51.15 s

RUNNING LOOP
Elapsed time loop: 132.78 s

RUNNING BY PROPERTY
Elapsed time by property: 130.43 s

So there is an overhead in running by group, but I think it's due to the overhead of running KS 4 times rather than once.

@guidomeijer I'll look into why the highpass spatial filter behaves so differently if you split by group!

alejoe91 · 2024-04-18T13:04:57Z

@gkBCCN can you share your code?

gkBCCN · 2024-04-18T13:05:18Z

Do you mean the preprocessing by KS4 or by SpikeInterface? All steps were slower in my case (templates, clustering, etc.).

alejoe91 · 2024-04-18T13:09:08Z

Do you mean the preprocessing by KS4 or by SpikeInterface? All steps were slower in my case (templates, clustering, etc.).

The SpikeInterface code up to the sorting run

gkBCCN · 2024-04-18T13:10:27Z

I just followed the tutorial:

sorting = si.run_sorter_by_property(
    sorter_name=sorter_algorithm,
    recording=recording,
    grouping_property='group',
    working_folder=sorting_folder
)

where "recording" is raw_dat = read_spikegadgets(rec_file), which is then high pass filtered and common referenced.

BTW: there's a typo in the example on https://github.com/SpikeInterface/spikeinterface/blob/main/doc/how_to/process_by_channel_group.rst, Option 1: Manual splitting:

split_preprocessed_recording = preprocessed_recording.split_by("group")

sortings = {}
for group, sub_recording in split_preprocessed_recording.items():
    sorting = run_sorter(
        sorter_name='kilosort2',
        recording=split_preprocessed_recording,
        output_folder=f"folder_KS2_group{group}"
    )
    sortings[group] = sorting

It should be sub_recording instead of split_preprocessed_recording.
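
For reference, the corrected loop can be sketched as below. To keep the sketch runnable without SpikeInterface or a sorter installed, the sorter call is passed in as a parameter. The names sort_each_group and fake_run_sorter are hypothetical, and the keyword arguments simply mirror the documentation example quoted above.

```python
def sort_each_group(split_recordings, run_sorter_fn, sorter_name, folder_prefix):
    """Run a sorter once per group, passing only that group's sub-recording."""
    sortings = {}
    for group, sub_recording in split_recordings.items():
        sortings[group] = run_sorter_fn(
            sorter_name=sorter_name,
            recording=sub_recording,  # the sub-recording, not the whole dict
            output_folder=f"{folder_prefix}{group}",
        )
    return sortings


def fake_run_sorter(sorter_name, recording, output_folder):
    """Hypothetical stand-in that just reports what it was asked to sort."""
    return (sorter_name, recording, output_folder)


# Strings stand in for the sub-recordings returned by split_by("group").
split_preprocessed_recording = {0: "shank0", 1: "shank1"}
sortings = sort_each_group(
    split_preprocessed_recording, fake_run_sorter, "kilosort2", "folder_KS2_group"
)
print(sortings[1])
```

In real use, run_sorter_fn would be SpikeInterface's run_sorter and the dictionary would come from preprocessed_recording.split_by("group").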

gkBCCN · 2024-04-18T13:21:58Z

OK, if I use Option 1 - Manual Splitting (as described in my previous comment), which is the si.run_sorter option mentioned in the original post, the sorting completes in roughly the same time as the GUI, albeit a bit slower. Thanks for that, @guidomeijer!

gkBCCN · 2024-04-18T13:32:36Z

Hold the phone. Even if I use run_sorter on each probe separately, the binary file that is created by export_to_phy is double the size, as @zm711 noted. That could explain why phy sees two probes. After the sorting is finished, I loop over both probe numbers and load the results, then create an analyzer using the entire recording.

for probe_num in range(1, len(recording.get_probes())+1):
    sorting = si.read_sorter_folder(sorting_folder / f'{probe_num-1}')
    analyzer = si.create_sorting_analyzer(sorting, recording, sparse=True, format="memory")

My bad. I guess I should use sub_recording as I did in the sorting loop, correct?

alejoe91 · 2024-04-18T14:28:07Z

Yes, if you want one phy folder for each probe. It might be a good approach also to export everything into one Phy folder

alejoe91 · 2024-04-19T09:22:33Z

Hi guys, I found and fixed the issue!!!

@guidomeijer the problem was in aggregate_channels (see PR #2736). Its get_traces was grabbing one channel at a time, and since the highpass spatial filter uses all channels for processing, this made it extremely slow.

The new implementation is 10 times faster:

@guidomeijer @gkBCCN Can you try the aggregate_channels + run_sorter_by_property from the PR?
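
To illustrate why fetching one channel at a time is so costly when the parent applies a filter across all of its channels, here is a toy sketch. It is not the SpikeInterface implementation: ToyFilteredShank and both aggregate functions are hypothetical, the 4 x 96 channel layout is an assumption, and the only point is to count how often the expensive all-channel computation runs.

```python
class ToyFilteredShank:
    """Hypothetical parent whose filter always processes all of its channels."""

    def __init__(self, channel_ids):
        self.channel_ids = list(channel_ids)
        self.filter_runs = 0

    def get_traces(self, channel_ids):
        # A spatial filter needs every channel, whatever subset is requested.
        self.filter_runs += 1
        filtered = {ch: f"trace_{ch}" for ch in self.channel_ids}
        return [filtered[ch] for ch in channel_ids]


def aggregate_per_channel(shanks):
    """Slow pattern: one get_traces call per channel."""
    traces = []
    for shank in shanks:
        for ch in shank.channel_ids:
            traces.extend(shank.get_traces([ch]))
    return traces


def aggregate_per_parent(shanks):
    """Faster pattern: one get_traces call per parent recording."""
    traces = []
    for shank in shanks:
        traces.extend(shank.get_traces(shank.channel_ids))
    return traces


def make_shanks():
    # Assumed layout: four shanks of 96 channels each.
    return [ToyFilteredShank(range(i * 96, (i + 1) * 96)) for i in range(4)]


slow_shanks, fast_shanks = make_shanks(), make_shanks()
slow_traces = aggregate_per_channel(slow_shanks)
fast_traces = aggregate_per_parent(fast_shanks)
slow_runs = sum(s.filter_runs for s in slow_shanks)
fast_runs = sum(s.filter_runs for s in fast_shanks)
print(slow_runs, fast_runs)
```

Both patterns return the same traces, but in this toy model the per-channel version runs the filter once per channel and the per-parent version once per shank.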

gkBCCN · 2024-04-19T13:35:17Z

Hi @alejoe91
I created a new conda environment with your changes and ran all three versions and this is what I got:

Recording time: 00:06:26.6 h:m:s

Kilosort GUI: Total = 324.52s = 00:05:25 h:m:s
preprocessing = 0.64s
drift = 98.42s
extracting spikes using templates = 91.09s
1st clustering = 47.34s
extracting spikes using cluster waveforms = 30.10s
final clustering = 51.01s

Single probe (run_sorter): Total = 469.16s = 00:07:49 h:m:s
preprocessing = 3.21s
drift = 119.10s
extracting spikes using templates = 106.68s
1st clustering = 105.24s
extracting spikes using cluster waveforms = 50.54s
final clustering = 78.57s

Aggregate (run_sorter_by_property): Total = 1265.55s = 00:21:06 h:m:s
preprocessing = 19.15s
drift = 398.12s
extracting spikes using templates = 372.54s
1st clustering = 94.45s
extracting spikes using cluster waveforms = 299.04s
final clustering = 76.56s
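
A quick way to read these numbers is to compute per-step ratios between the two SpikeInterface runs. The sketch below uses only the timings reported above; the variable names are illustrative.

```python
# Per-step timings in seconds, copied from the report above.
single_probe = {
    "preprocessing": 3.21,
    "drift": 119.10,
    "extracting spikes using templates": 106.68,
    "1st clustering": 105.24,
    "extracting spikes using cluster waveforms": 50.54,
    "final clustering": 78.57,
}
aggregate = {
    "preprocessing": 19.15,
    "drift": 398.12,
    "extracting spikes using templates": 372.54,
    "1st clustering": 94.45,
    "extracting spikes using cluster waveforms": 299.04,
    "final clustering": 76.56,
}

# Ratio > 1 means the step is slower under run_sorter_by_property.
ratios = {step: aggregate[step] / single_probe[step] for step in single_probe}
for step, ratio in ratios.items():
    print(f"{step}: {ratio:.1f}x")
```

Preprocessing, drift, and both spike-extraction passes come out roughly 3x to 6x slower, while the two clustering steps are about the same. That pattern would be consistent with the slowdown sitting in how traces are read rather than in the clustering itself, though the timings alone do not prove it.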

zm711 · 2024-04-19T13:39:56Z

Based on my reading it looks like single probe vs aggregate is now scaling appropriately. I think that the overhead of using the wrapper will mean that running this through SI would be expected to be slightly slower, but you get the benefit of sorting shanks individually rather than relying on KS4 trying to make the multiple shanks work (like with the stacking trick that was previously necessary). That's my interpretation at least.

gkBCCN · 2024-04-19T13:41:18Z

But the aggregate time is not for both probes. These are all processing times for a single probe.

zm711 · 2024-04-19T13:41:50Z

Sorry I misunderstood that. I thought it was for multiple probes!

zm711 · 2024-04-19T14:00:18Z

Do you know what your time for run_sorter_by_property is without the PR? Is the ~1200 seconds a 10x speedup from before? There might be something else in addition that is still slowing things down, but did the PR cause an improvement from the baseline?

alejoe91 · 2024-04-19T14:12:56Z

@gkBCCN how are you using the aggregate function to run with a single probe?

gkBCCN · 2024-04-19T14:21:11Z

@alejoe91 I just interrupted the kernel after one probe finished. But KS reports the times as it's running, which I wrote above. @zm711 I'm currently running the non-PR run_sorter_by_property. It should be done soon-ish...

gkBCCN · 2024-04-19T14:27:53Z

@zm711
non PR Aggregate (run_sorter_by_property): Total = 1227.03s = 00:20:27 h:m:s
preprocessing = 13.31s
drift = 381.12s
extracting spikes using templates = 359.76s
1st clustering = 103.20s
extracting spikes using cluster waveforms = 299.05s
final clustering = 78.12s

alejoe91 · 2024-04-19T14:34:13Z

So it's the same. I guess that's because you're using CMR, which is way faster than highpass spatial filtering. I'll run some more tests on my side.

gkBCCN · 2024-04-19T14:36:37Z

Just to be clear, I'm using SpikeGadgets .rec files as input.

alejoe91 · 2024-04-19T14:39:34Z

Yep, but it's a memmap file so I don't think that makes a difference

zm711 · 2024-04-19T14:42:03Z

Honestly, at this point it might make sense for us to try a profiler so we can see which step is taking long. If @gkBCCN knows how to use a profiler you could try it. I'm working on some other analysis, but next week I can try profiling a call to kilosort4 on some of my data that I use with run_sorter_by_property. Then hopefully we can see which class is causing the bottleneck on our side.
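
For anyone who wants to try this, a minimal profiling sketch using Python's built-in cProfile is shown below. The helper profile_call and the slow_step function are hypothetical; slow_step only stands in for whatever call is being investigated.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the slowest call chains appear first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def slow_step():
    """Hypothetical stand-in for the call being investigated."""
    return sum(i * i for i in range(200_000))


result, report = profile_call(slow_step)
print(report)
```

With real data, func could be a small wrapper around the run_sorter_by_property call on a short recording. Functions in the recording classes with a large cumulative time would then point to the bottleneck.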

alejoe91 · 2024-04-19T14:53:50Z

@gkBCCN how long does it take to run both probes in KS directly or with the run_sorter_by_property?

Note that an overhead is expected for sure, because of all the machinery in place to initialize and transfer data back and forth to the GPU...

gkBCCN · 2024-04-19T14:57:11Z

In my experience so far it always takes twice as long for a second probe when running run_sorter_by_property. That will necessarily be the case when running KS directly, because I run it twice, each time on a separate .dat file, but both have the same size.

alejoe91 · 2024-04-19T15:00:12Z

Anyways, @guidomeijer I tested again and your problem with destriping+aggregate should be solved.

Here are some run times on a 384-channel recording:

Apply destriping -> run KS4 (no grouping): 365.5 s
Split by group (every 96th channel) -> apply destriping -> run sorter in a loop: 647.3 s
Split by group, apply destriping, aggregate, run sorter: 596.27 s

Again, the overhead of running multiple probes is expected, but IMO it will give better results, especially when applying probe-specific preprocessing as in @guidomeijer's example and for drift correction!

guidomeijer · 2024-04-30T13:29:33Z

Yes it's solved, thanks!

alejoe91 added the sorters Related to sorters module label Mar 26, 2024

alejoe91 added a commit to alejoe91/spikeinterface that referenced this issue Mar 28, 2024

propagate SpikeInterface#2639 and SpikeInterface#2625

14b84cf

This was referenced Apr 19, 2024

Fix aggregate_channels performance #2735

Closed

Fix performance issue for aggregate_channels #2736

Merged

alejoe91 closed this as completed in #2736 Apr 19, 2024

alejoe91 reopened this Apr 19, 2024

alejoe91 closed this as completed Apr 30, 2024
