
SpykingCircus2 clustering crash at template estimation #3722

Open
b-grimaud opened this issue Feb 25, 2025 · 23 comments
Labels
sorters Related to sorters module

Comments

@b-grimaud
Contributor

I'm trying to run SC2 on 3Brain HD-MEA data (4096 channels, 20 kHz) with mostly default parameters:

p = {
 'apply_motion_correction': False,
 'apply_preprocessing': True,
 'cache_preprocessing': {'delete_cache': True,
                         'memory_limit': 0.5,
                         'mode': 'zarr'},
 'clustering': {'legacy': True},
 'debug': False,
 'detection': {'detect_threshold': 5, 'peak_sign': 'neg'},
 'filtering': {'filter_order': 2,
               'freq_max': 7000,
               'freq_min': 150,
               'ftype': 'bessel',
               'margin_ms': 10},
 'general': {'ms_after': 2, 'ms_before': 2, 'radius_um': 75},
 'job_kwargs': {'n_jobs': 40},  # 'total_memory': '50G'
 'matched_filtering': True,
 'matching': {'method': 'circus-omp-svd'},
 'merging': {'auto_merge': {'corr_diff_thresh': 0.25, 'min_spikes': 10},
             'correlograms_kwargs': {},
             'similarity_kwargs': {'max_lag_ms': 0.2,
                                   'method': 'cosine',
                                   'support': 'union'}},
 'motion_correction': {'preset': 'dredge_fast'},
 'multi_units_only': False,
 'seed': 42,
 'selection': {'method': 'uniform',
               'min_n_peaks': 100000,
               'n_peaks_per_channel': 5000,
               'seed': 42,
               'select_per_channel': False},
 'sparsity': {'amplitude_mode': 'peak_to_peak',
              'method': 'snr',
              'threshold': 0.25},
 'whitening': {'mode': 'local', 'regularize': False}}
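
For reference, this parameter dict is presumably passed to the sorter along these lines (a sketch; the output folder name is illustrative and recording is the loaded Biocam recording):

from spikeinterface.sorters import run_sorter

sorting = run_sorter("spykingcircus2", recording, folder="sc2_output", verbose=True, **p)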

The only trace I get is:

spykingcircus2 could benefit from using torch. Consider installing it
Preprocessing the recording (bandpass filtering + CMR + whitening)
noise_level (workers: 20 processes): 100%|███████████████████████████████████████████████████| 20/20 [00:24<00:00,  1.23s/it]
Use zarr_path=/tmp/spikeinterface_cache/tmpdx30z0db/CCLII4NL.zarr
write_zarr_recording 
engine=process - n_jobs=40 - samples_per_chunk=19,753 - chunk_memory=308.64 MiB - total_memory=12.06 GiB - chunk_duration=1.00s (999.96 ms)
write_zarr_recording (workers: 40 processes): 100%|██████████████████████████████████████████| 61/61 [01:34<00:00,  1.54s/it]
detect peaks using locally_exclusive + 1 node (workers: 40 processes): 100%|█████████████████| 61/61 [00:11<00:00,  5.10it/s]
detect peaks using matched_filtering (workers: 40 processes): 100%|██████████████████████████| 61/61 [02:18<00:00,  2.27s/it]
Kept 179242 peaks for clustering
extracting features (workers: 40 processes): 100%|███████████████████████████████████████████| 61/61 [00:06<00:00,  9.63it/s]
split_clusters with local_feature_clustering: 100%|███████████████████████████████████| 4210/4210 [00:00<00:00, 42564.83it/s]
Bus error (core dumped)

I've been able to trace the crash back to a call to estimate_templates (here), which then seems to call estimate_templates_with_accumulator.

From what I could gather, this looks like an out-of-memory error, but I've never seen anything quite like this with other Python OOM issues.

The GUI monitor shows a modest 17×10⁶ TB being used:

[system monitor screenshot]

And dmesg shows the following:

[  954.327570] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/[email protected]/tmux-spawn-ad081ad7-fd73-4200-8a54-e76e0ce4b80d.scope,task=python,pid=8624,uid=1000
[  954.327924] Out of memory: Killed process 8624 (python) total-vm:14242184kB, anon-rss:10079372kB, file-rss:6260kB, shmem-rss:0kB, UID:1000 pgtables:20984kB oom_score_adj:0
[  956.739762] systemd-journald[641]: Under memory pressure, flushing caches.
[  957.536557] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00007300] Failed to grab modeset ownership
[  957.631410] rfkill: input handler disabled

I'm not sure if this is normal behavior and there are simply too many redundant units, or if there's an actual issue with memory handling.

I've tried passing total_memory to both the sorter's job_kwargs and SpikeInterface's global job_kwargs, but I'm not sure it's taken into account when not dealing with the recording itself.

@alejoe91 added the sorters label Feb 25, 2025
@yger
Collaborator

yger commented Feb 25, 2025

I never tested with 4096 channels, so I should have a look. This estimate_template() function preallocates a big matrix of n_channels x n_templates x num_time_samples, so if n_templates is large, this matrix can get large. However, assuming you have roughly 5000 templates, it should still fit in memory. I'll try some tests, but with 125 GB of RAM it should work smoothly.

@yger
Collaborator

yger commented Feb 25, 2025

Note that total_memory will not really work here; it only applies to chunked preprocessing, not to the clustering step. Is this obtained with the main branch? Any idea of the number of templates found? If the recording is not too big, I could have a look, let me know, while I do some tests with synthetic data.

@b-grimaud
Contributor Author

note that total_memory will not really work here, this is only for chunk preprocessing

This is what I had gathered from some issues with computing PCs during quality metrics, with similar memory usage.

Is this obtained with the main branch?

Yes, I checked with the latest commits.

No idea whatsoever on the number of templates found?

With HerdingSpikes I get around 2000 units from that recording.

If recording not too big, I could have a look, let me know

I have a minute-long recording that's just under 10 GB, would that be OK?

@yger
Collaborator

yger commented Feb 25, 2025

Perfect, please share it with me so that I can troubleshoot.

@b-grimaud
Contributor Author

Here's the link: https://filesender.renater.fr/?s=download&token=25c4ac59-a7f6-45db-8eb5-ca153d154ef2

Then:

import spikeinterface.extractors as se
import spikeinterface.preprocessing as spre
recording = se.read_biocam(data_path, mea_pitch=60, electrode_width=21)
recording = spre.unsigned_to_signed(recording, bit_depth=12)

@yger
Collaborator

yger commented Feb 26, 2025

The problem, I think, might be the estimate_template() function. Indeed, this function, when used in a parallel context, allocates a giant array of size (num_workers, num_templates, num_channels, num_samples). This can get very large for very high-density arrays.
When workers operate on a shared array in memory, can't we simply write to the same array? Do we have concurrent-access issues? I'll need to think about that, but currently this is one of the reasons I can imagine.

@samuelgarcia @alejoe91, what do you think?
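
For illustration, here is a minimal sketch (not the SpikeInterface implementation) of the "single shared array" idea discussed above, using multiprocessing.shared_memory with a lock so that each worker's final accumulation does not race with the others:

import multiprocessing as mp
from multiprocessing import shared_memory

import numpy as np

SHAPE = (100, 64, 80)  # (n_templates, n_channels, n_samples) -- toy sizes
DTYPE = np.float32


def worker(shm_name, lock, seed):
    # attach to the shared buffer and accumulate this worker's partial sums into it
    rng = np.random.default_rng(seed)
    shm = shared_memory.SharedMemory(name=shm_name)
    acc = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    local = rng.standard_normal(SHAPE).astype(DTYPE)  # stands in for locally summed waveforms
    with lock:  # only the final "+=" is serialised
        acc += local
    shm.close()


if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(SHAPE)) * DTYPE().itemsize)
    acc = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    acc[:] = 0
    lock = mp.Lock()
    workers = [mp.Process(target=worker, args=(shm.name, lock, seed)) for seed in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    print("accumulated mean:", acc.mean())
    shm.close()
    shm.unlink()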

@b-grimaud
Contributor Author

It looks like the giant array is indeed the problem. The sorter found 3768 raw units before cleaning, so that's 3768 clusters x 40 cores x 4096 channels x 80 samples x 4 bytes for float32, which means about 184 GB for that array, on top of what's already loaded in memory.
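
For reference, a quick back-of-the-envelope check of that figure, assuming one float32 accumulator per worker:

n_templates, n_workers, n_channels, n_samples = 3768, 40, 4096, 80
accumulator_bytes = n_templates * n_workers * n_channels * n_samples * 4  # 4 bytes per float32
print(f"{accumulator_bytes / 1024**3:.0f} GiB")  # prints ~184 GiB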

I ran it again on a single core, no more OOM errors, but it took almost 15 hours to process a minute-long recording.

I do get a bunch of scipy.optimize.least_squares error: x0 is infeasible. messages just before the end, but I'm still not sure which extractor to use to check the sorting.

If there were some functional equivalent of ChunkRecordingExecutor for processing those large arrays, it would make HD-MEA data a lot easier to work with.

@yger
Collaborator

yger commented Feb 27, 2025

Yes, I'm working on a patch, trying to bypass this step and thus avoid this memory bottleneck. When you say you ran it on a single core, do you mean the whole software or just this step? Only estimate_templates needs to be launched on a reduced number of cores to avoid memory saturation. That could also be an option (use the maximum number of cores for all steps except here, and reduce it as a function of memory).

@b-grimaud
Contributor Author

For now I just set the n_jobs sorter param to 1; I did not directly specify the number of workers for estimate_templates.

@alejoe91
Member

I'm also getting a bunch of scipy.optimize.least_squares error: x0 is infeasible!!!

@yger
Collaborator

yger commented Feb 27, 2025

The scipy.optimize warnings should not be a major deal, and I'll make a PR to automatically adjust the number of workers during template estimation, before something more robust and stable.

@yger
Collaborator

yger commented Feb 27, 2025

I've tried a patch in the branch total_duration, see #3721.
I'll test it later, but in theory this should be a temporary patch, adjusting the number of cores during the estimate_templates() call within the clustering to reduce the memory footprint (assuming we can only use 25% of the available memory).
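
For reference, a rough sketch of that heuristic (not the actual code from #3721; psutil and the 25% RAM budget are assumptions here):

import numpy as np
import psutil


def suggested_n_jobs(n_templates, n_channels, n_samples, n_jobs, ram_fraction=0.25):
    # cap n_jobs so that one float32 accumulator per worker fits within a fraction of the free RAM
    bytes_per_worker = n_templates * n_channels * n_samples * np.dtype("float32").itemsize
    budget = psutil.virtual_memory().available * ram_fraction
    return int(max(1, min(n_jobs, budget // bytes_per_worker)))


# e.g. ~3768 templates on 4096 channels, 80 samples per template, 40 requested jobs
print(suggested_n_jobs(3768, 4096, 80, n_jobs=40))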

@yger
Collaborator

yger commented Feb 27, 2025

I've checked, and the patch is working for now, allowing us to keep the speed. Please use it while I keep working on a deeper patch.

@yger
Collaborator

yger commented Feb 27, 2025

However, I must say that while it seems to run, it is not at the speed of light. I'll keep your 1 min long file as an example to optimize everything.

@b-grimaud
Contributor Author

I checked it out; I'll let it run overnight to be sure, but it has already passed the point where it used to crash.
It selected 5 workers as the optimal number given the memory constraints, which worked quite well.

If shared access to a single array is too tricky, I'd love to see the same system applied to some of the quality metrics calculations, especially the PCA-based ones, as they also tend to pre-allocate big arrays from what I can tell.

On an unrelated note, as I understand it, SC2 creates a sorting analyzer in its final steps to do some curation. Would it make sense to be able to save that analyzer to disk without everything else saved by the debug option? If an analyzer is going to be created down the line, we might as well do it with the sorting still in memory and the recording properly pre-processed.

@yger
Collaborator

yger commented Feb 28, 2025

For concurrent writes to a shared array, we'll have to wait for @samuelgarcia's input. But what I did here is a simple hack: before estimate_templates, I adapt the number of jobs as a function of the RAM requested and available. I agree that maybe this should be a generic option and go directly into the estimate_templates() function, so that all sub-functions using it benefit from it and avoid crashing (maybe PCA, but not only). We'll discuss that, if there is no better option to make this function robust w.r.t. sizes / number of channels.

Yes, SC2 creates an analyzer internally, and indeed I could save it with a special option (could be debug or something else). However, the run_sorter API currently returns a sorting, so it will be up to the user to go into the created folder to grab this analyzer. But this is a good idea. The analyzer also won't have waveforms, but again, this can be added if needed by users. I'll think about that and make a PR.

@b-grimaud
Contributor Author

Seems like there's another hang-up further down the line:

Preprocessing the recording (bandpass filtering + CMR + whitening)
noise_level (workers: 20 processes): 100%|██████████████████████████████████████████| 20/20 [00:17<00:00,  1.12it/s]
Use zarr_path=/tmp/spikeinterface_cache/tmpvgbkwrhv/60557ZEJ.zarr
write_zarr_recording 
engine=process - n_jobs=40 - samples_per_chunk=3,327 - chunk_memory=51.98 MiB - total_memory=2.03 GiB - chunk_duration=0.17s (168.42 ms)
write_zarr_recording (workers: 40 processes): 100%|███████████████████████████████| 357/357 [01:20<00:00,  4.46it/s]
detect peaks using locally_exclusive + 1 node (workers: 40 processes): 100%|██████| 357/357 [00:06<00:00, 54.69it/s]
detect peaks using matched_filtering (workers: 40 processes): 100%|███████████████| 357/357 [01:09<00:00,  5.12it/s]
Kept 174679 peaks for clustering
extracting features (workers: 40 processes): 100%|████████████████████████████████| 357/357 [00:04<00:00, 78.11it/s]
split_clusters with local_feature_clustering: 100%|███████████████████████████| 4219/4219 [00:00<00:00, 6412.43it/s]
estimate_templates (workers: 5 processes): 100%|██████████████████████████████████| 357/357 [00:13<00:00, 27.37it/s]
Found 3678 raw clusters, starting to clean with matching
Kept 3673 non-duplicated clusters
Bus error (core dumped)

I'll try to trace it down properly, but my guess is maybe find_spikes_from_templates()?

The analyzer also won't have waveforms, but again, this can be added if needed by users.

Agreed, this is partly what I had mentioned in #3525: it looks like a lot of sorters already compute things that could be useful for downstream analysis, only for them to be discarded after sorting and then recomputed. Having the ability to get an analyzer with some preloaded extensions could save some time.

@yger
Collaborator

yger commented Feb 28, 2025

Yes, I'll keep debugging and making the software work for this number of channels. At least, in this PR, I've also reduced the memory footprint of the SVD, and now, as you requested, the final analyzer is saved.

@b-grimaud
Contributor Author

I tried again with the newer version of the PR, but I'm having some trouble replicating what I had before: if I pass the number of cores directly to the sorter with 'job_kwargs': {'n_jobs': 40}, I get a (no parallelization) tag with each operation, and the process takes about 10 hours.

If I pass the job_kwargs globally with:

import spikeinterface.full as si
global_job_kwargs = {'chunk_duration': '1s', 'n_jobs': -1, 'progress_bar': True}
si.set_global_job_kwargs(**global_job_kwargs)

And still pass the arguments to the sorter, then parallelization happens, but I still get another crash down the road. As far as I can tell, global_job_kwargs overrides the sorter-specific arguments.

There might be a situation where the amount of available memory is so low that the patch only selects a single core, but as far as I know not all steps of the sorting pipeline make use of that adjustment.

@b-grimaud
Contributor Author

It looks like get_optimal_n_jobs was removed in this commit; I'm not sure if this was intentional, since apparently some commits were split into a different PR.

@b-grimaud
Contributor Author

I get a (no parallelization) tag with each operation

I think I found the source of the issue: in set_optimal_chunk_size, job_kwargs is updated by job_kwargs = fix_job_kwargs(dict(chunk_duration=f"{chunk_duration}s")), which overwrites what was passed to the sorter with the global job_kwargs.

If those weren't set beforehand, it falls back to the default of a single core.

It would probably need something like:

job_kwargs.update(dict(chunk_duration=f"{chunk_duration}s"))
job_kwargs = fix_job_kwargs(job_kwargs)

Or job_kwargs = fix_job_kwargs({**job_kwargs, 'chunk_duration': f"{chunk_duration}s"}) if you want to keep it as a one-liner.

@yger
Collaborator

yger commented Mar 12, 2025

Indeed, sorry for that. But this branch will still not work on your data. I'll try to finish one with a new clustering that avoids the cleaning of the dict and the re-estimation of the templates from the data. This should make it work for your large arrays. I'm on it.

@yger
Collaborator

yger commented Mar 12, 2025

At least, I brought back get_optimal_n_jobs (sorry for the mistake) and your fix for set_optimal_chunk_size. Thanks a lot! Let's make the code work (quickly) on 4225 channels!
