Spike loss in repeated simulations of exact same script #1394

Closed · jarsi opened this issue Jan 14, 2020 · 13 comments · Fixed by #1442
Assignee: suku248
Labels: I: Behavior changes · S: Critical · T: Bug · ZC: Kernel · ZP: PR Created
Milestone: NEST 2.18.1

jarsi commented Jan 14, 2020

I have observed that under some circumstances the number of recorded spikes can diverge. In these repeated simulations I run the exact same script (a static version of examples/nest/hpc_benchmark.sli) and randomly get different results. The following table summarizes the problem:

commit ntasks number of threads jube_wp_iteration number_of_spikes
v2.14.0 32 24 0 316209
v2.14.0 32 24 1 316209
v2.14.0 32 24 2 316209
v2.14.0 32 24 3 316209
v2.14.0 32 24 4 316209
v2.14.0 32 24 5 316209
v2.14.0 32 24 6 316209
v2.14.0 32 24 7 316209
v2.14.0 32 24 8 316209
v2.14.0 32 24 9 316209
v2.14.0 768 1 0 316209
v2.14.0 768 1 1 316209
v2.14.0 768 1 2 316209
v2.14.0 768 1 3 316209
v2.14.0 768 1 4 316209
v2.14.0 768 1 5 316209
v2.14.0 768 1 6 316209
v2.14.0 768 1 7 316209
v2.14.0 768 1 8 316209
v2.14.0 768 1 9 316209
v2.16.0 32 24 0 328277
v2.16.0 32 24 1 333873
v2.16.0 32 24 2 325318
v2.16.0 32 24 3 325193
v2.16.0 32 24 4 325318
v2.16.0 32 24 5 325318
v2.16.0 32 24 6 327136
v2.16.0 32 24 7 315251
v2.16.0 32 24 8 325318
v2.16.0 32 24 9 325318
v2.16.0 768 1 0 325318
v2.16.0 768 1 1 325318
v2.16.0 768 1 2 325318
v2.16.0 768 1 3 325318
v2.16.0 768 1 4 325318
v2.16.0 768 1 5 325318
v2.16.0 768 1 6 325318
v2.16.0 768 1 7 325318
v2.16.0 768 1 8 325318
v2.16.0 768 1 9 325318

The simulations using the 2.14 release yield the same results, regardless of the division between threads and MPI processes. The number of spikes in the 2.16 version is constant when using only MPI processes and fluctuates when using a hybrid scheme. The slight difference between 2.14 and 2.16 is expected, as spikes occurring in the last time step are only recorded to file in 2.16.

As spike loss only occurs with 2.16, the bug was probably introduced with the 5g kernel.

Proof of spike loss in spike and membrane recordings

To further understand the problem I have recorded both the spikes and the membrane voltages, using the 2.16 release. I ran either 768 MPI processes with 1 thread each on 32 nodes or 32 MPI processes with 24 threads each on 32 nodes. As spike loss has not been seen in pure MPI simulations, I assume that the MPI-only simulation serves as the ground truth.

Indeed, a difference in the gdf files can be found. In the incorrectly behaving simulation, one neuron spikes 0.1 ms too late.

GID     time (correct)  time (incorrect)
505355 15.400 15.500

The effect of the spike loss appears much earlier than the incorrect spike itself. This can be verified by recording the membrane potentials: in several neurons the membrane potential diverges at time step 8.5.

GID     time                    V_m (correct)           V_m (incorrect)
9227 8.100000000000000 15.989619245265740 15.989619245265740
9227 8.199999999999999 16.117462627384160 16.117462627384160
9227 8.300000000000001 16.094458742910806 16.094458742910806
9227 8.400000000000000 15.910287217298723 15.910287217298723
9227 8.500000000000000 15.619021029332181 15.612824291718939
9227 8.600000000000000 15.196174602593645 15.175888438558591
9227 8.700000000000001 14.757521415092604 14.719968900712942
9227 8.800000000000001 14.411463132935133 14.356240673973033
9227 8.900000000000000 14.096045016672580 14.024277256018921
9227 9.000000000000000 13.708270006887023 13.621829152920018
93707 8.100000000000000 6.271737931625098 6.271737931625098
93707 8.199999999999999 6.071139950650960 6.071139950650960
93707 8.300000000000001 5.790890521131046 5.790890521131046
93707 8.400000000000000 5.438537419394722 5.438537419394722
93707 8.500000000000000 5.110002763271045 5.103806025657804
93707 8.600000000000000 4.808702211113251 4.788416047078197
93707 8.700000000000001 4.616389331353123 4.578836816973460
93707 8.800000000000001 4.484518465270027 4.429296006307928
93707 8.900000000000000 4.362229872249083 4.290462111595425
93707 9.000000000000000 4.195666685355289 4.109225831388283

suku248 added the ZC: Kernel, I: Behavior changes, S: Critical, and T: Bug labels on Jan 14, 2020
suku248 added this to the NEST 2.18.1 milestone on Jan 14, 2020
suku248 self-assigned this on Jan 14, 2020
heplesser added the ZP: In progress label on Jan 14, 2020
heplesser commented:

@jarsi Thank you for your excellent detective work! To summarise first

  • The error does not occur if using only MPI parallelisation
  • The error does not occur for smaller models / numbers of parallel processes
  • The error occurs when simulating a network with 500k+ neurons using 32 MPI processes with 24 threads each
  • The fact that, when an error occurs, its effect differs slightly between runs indicates that we are probably dealing with a non-deterministic concurrency error
  • The difference in membrane potential occurs at 8.5 ms only for the two neurons with GIDs 9227 and 93707
  • Interestingly, both those GIDs map to VP 11, i.e., to the same VP

I did a small test and the difference in membrane potential observed at t=8.5ms, which is 0.006197 mV lower for the incorrect than for the reference case for both neurons, is consistent with exactly one input spike not being delivered correctly during the time step ending at 8.5ms.

The missing spike can have one of two origins:

  1. A spike emitted by another neuron.
  2. A spike emitted by a poisson generator.

If it was a spike emitted by another neuron that was lost in transmission to VP 11, I would have expected more neurons to be affected, since each neuron has on average 14.6 targets per VP and the probability of having only 2 should be pretty low. It could still be that the spike arrives at the VP, but is not properly delivered to all local targets.
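
As a rough sanity check (assuming the number of targets a given source neuron has on one VP is approximately Poisson-distributed with the mean of 14.6 quoted above): P(X = 2) = e^(-14.6) · 14.6^2 / 2! ≈ 5 · 10^-5, so seeing only two affected targets from a lost neuron spike would indeed be very unlikely.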

Still, right now it seems more likely that this is a spike generated by a Poisson generator, where spike trains differ for all target neurons. One way to test whether the problem is caused by neuron or generator spikes would be to change the weight of the generator -> neuron connections, e.g., by a factor of 1.3 (not an integer multiple). This will change the dynamics, but as soon as we see the first difference in membrane potential, the size of that difference will tell us the weight of the connection that caused it, thus differentiating between generator and neuron input.

Another thing to look at would be the full V_m traces for the two neurons.

jarsi commented Jan 14, 2020

@heplesser, thank you for your summary. I agree with almost all of your points, apart from a small inaccuracy, probably stemming from my description: the difference in membrane potential at 8.5 ms does not occur for only two neurons, but for many more. To keep the included table short I showed just two example cases, which happen to live on the same virtual process. I can supply the exact number of occurrences and the corresponding virtual processes, but this will take some time due to the number of files and their sizes.

jarsi commented Jan 15, 2020

I had a closer look at the membrane potentials. A divergence can be seen on two virtual processes, VP 11 and VP 25. In total there are 28 divergences at time step 8.5: 16 on VP 11 and 12 on VP 25.

heplesser commented:

@jarsi 12 and 16 are highly compatible with an average of 14.6 ± 3.8 targets per neuron per VP. Therefore, this observation indicates to me that a spike has been lost in transmission from the pre- to the post-synaptic VP. Without further analysis we cannot be sure whether one and the same spike failed to reach two VPs or whether two different spikes were not delivered correctly.

The next step is a connectome analysis for the 28 affected targets. Since we have a fixed delay of 1.5 ms, the lost spike must have been fired at 7.0 ms, so by comparing connectome information with spike traces we may be able to pinpoint precisely which spike was lost.

heplesser commented:

Here is one more update based on further analysis by @jarsi.

  • Connectome analysis shows that there is exactly one neuron (GID 5) in the network projecting to the 28 affected targets mentioned above (for a specific, erroneous trial). The connectome was obtained using GetConnections, i.e., from information on the post-synaptic process of each connection after simulating for 0.1 ms to force spike exchange.
  • The neuron with GID 5 spiked at 6.9 ms, consistent with a first observable membrane potential difference for reference and erroneous simulations at 8.5 ms.
  • We can conclude from this that the spike fired by GID 5 at 6.9 ms was not correctly transmitted to VPs 11 and 25.
  • Unfortunately, the resulting "snowball effect" leads to very different spiking patterns by the time of the next spike fired by GID 5, so we cannot determine whether later spikes by GID 5 are subject to the same failure as well.
  • The next question now is whether
    • the spike is not transmitted correctly because the presynaptic connection tables are incorrect (error in the network construction phase) or
    • the spike is lost during transmission (error in the network simulation phase)

suku248 commented Jan 20, 2020

This is a summary of my experiments with the hpc_benchmark.sli from the NEST examples, which I adjusted in order to reproduce the bug.

The main finding is: the bug only occurs if the same synapse model is used for both E->E and E->I connections (i.e. no STDP synapses for E->E). I have tested static_synapse, static_synapse_hpc and a static synapse created through CopyModel (syn_ex, already used for E->I in the original script). Just a side note: it is not relevant whether I use a separate synapse model created through CopyModel to connect the stimulus to E and I. And I don't think synaptic plasticity plays a role, but I haven't tested it.

I have used 32 MPI processes and 24 threads per process (same as Jari). If I reduce the number of threads to 22 or 20, the bug does not occur (in 1000 ms of simulation), which kind of speaks against a race condition. I have not tested with more than 24 threads yet.

The minimum number of neurons per VP (in steps of 100) is 500, i.e. no bug for 400 or fewer neurons per VP.

So, the occurrence of the bug seems to require a minimum number of synapses in a connection vector and a minimum number of threads. Therefore, I think this bug might be related to #1088.

jarsi commented Feb 18, 2020

A short update on this issue.

@suku248 found through systematic search that the bug is related to the method SourceTable::save_entry_point( const thread tid ) and that protecting all calls to this method with omp critical ensured correct behavior.

This function saves the current position in the SourceTable before MPI communication so that operation can continue safely from where it left off after MPI communication. There are two ways in which this function can be accessed.

The if ( not saved_entry_point_[ tid ] ) inside SourceTable::save_entry_point( const thread tid ) is there to ensure that the position is saved only on the first pass during each MPI communication round. But print statements hint that the inner part is actually entered several times.

The reason for this seems to be the way in which saved_entry_point_ is defined:

std::vector< bool > saved_entry_point_;

This is a std::vector< bool > where each entry belongs to a different thread. While this is perfectly safe for most std::vector< T >, it unfortunately is not when dealing with bools. Among other things, cppreference has the following to say about this container:

Does not guarantee that different elements in the same container can be modified concurrently by different threads.

This is due to the way this data structure is implemented. To make it space efficient, the vector elements are coalesced such that each element occupies a single bit instead of sizeof(bool) bytes. This means that writing an individual element, i.e. updating a single bit, involves a read-modify-write of the whole byte (or word) containing it. This byte contains bits belonging to other threads as well. If multiple threads operate on the same byte at the same time, they can overwrite each other's updates, resulting in undefined behavior.
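
To make the failure mode concrete, here is a minimal, self-contained sketch (illustrative only, not NEST code) in which each thread toggles only "its own" flag, mimicking the per-thread saved_entry_point_ pattern; with std::vector< bool > the per-thread flags share underlying words, so updates can be lost:

```cpp
// Illustrative sketch only, not NEST code: each thread toggles only the flag
// at its own index. With std::vector< bool > the flags are bit-packed into
// shared words, so one thread's read-modify-write can clobber a neighbouring
// thread's bit; with std::deque< bool > each element is a separate object and
// concurrent writes to different elements are safe.
#include <cstdio>
#include <deque>
#include <vector>
#include <omp.h>

int main()
{
  const int num_threads = omp_get_max_threads();
  const int num_toggles = 1000000; // even, so every flag should end up false again

  std::vector< bool > packed_flags( num_threads, false ); // bit-packed: data race
  std::deque< bool > safe_flags( num_threads, false );    // one object per element

#pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    for ( int i = 0; i < num_toggles; ++i )
    {
      packed_flags[ tid ] = not packed_flags[ tid ];
      safe_flags[ tid ] = not safe_flags[ tid ];
    }
  }

  int packed_wrong = 0;
  int safe_wrong = 0;
  for ( int t = 0; t < num_threads; ++t )
  {
    packed_wrong += packed_flags[ t ] ? 1 : 0;
    safe_wrong += safe_flags[ t ] ? 1 : 0;
  }
  // The deque count is always 0; the vector< bool > count is often nonzero
  // (it is undefined behaviour, so no particular outcome is guaranteed).
  std::printf( "inconsistent flags: vector<bool> = %d, deque<bool> = %d\n", packed_wrong, safe_wrong );
  return 0;
}
```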

To test whether this is indeed the problem, I changed the type of saved_entry_point_ to std::deque< bool >, which does not implement any of these std::vector< bool > optimizations. This indeed fixes the problem: the spike counts agree and are the same for each run.

std::vector< bool > must be avoided when dealing with threaded code.

The next to-do is to decide which alternative container to use.

heplesser commented Feb 18, 2020

@jarsi @suku248 @jakobj Congratulations on this excellent detective work!

Concerning the new choice of container, cppreference (Section "Thread safety", item 3) states that

Different elements in the same container can be modified concurrently by different threads, except for the elements of std::vector ...

so using any other container should be fine. Beyond correctness, cache thrashing may be an issue here. I ran a few tests on a 2x16-core Epyc2 Rome server: 64 threads write in parallel 10^7 times each to a 64-element std::vector, where each element is an unsigned integer type of a given width, resulting in the following runtimes (some fluctuation; values are medians "estimated by eye"):

uint width (bits)   runtime
1 6.5 s
8 2.7 s
16 0.8 s
32 0.7 s
64 0.7 s
128 0.5 s
256 0.3 s
512 0.2 s

For widths > 64 bits, I padded with unused vector elements. The effects might be smaller when writes are mixed with other operations, but I think this suggests using a std::vector<std::uint32_t> or std::vector<std::uint64_t> to represent vectors of bools.
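
For reference, a sketch of the kind of measurement described above (assumed setup, not the exact benchmark code; the stride values and timing scaffolding are mine): each thread writes 10^7 times to its own slot of a std::vector<std::uint64_t>, and increasing the stride between per-thread slots, i.e. padding with unused elements, reduces false sharing of cache lines:

```cpp
// Hedged benchmark sketch (compile e.g. with -O2 -fopenmp): every thread
// repeatedly writes to "its own" element; a larger stride between per-thread
// slots means fewer threads share a cache line, which should lower the runtime.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <omp.h>

double
time_writes( const int stride, const int num_threads, const int num_writes )
{
  // Unused padding elements between the per-thread slots, as in the
  // "> 64 bit" cases above.
  std::vector< std::uint64_t > data( static_cast< size_t >( num_threads ) * stride, 0 );

  const auto start = std::chrono::steady_clock::now();
#pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    volatile std::uint64_t* slot = &data[ static_cast< size_t >( tid ) * stride ];
    for ( int i = 0; i < num_writes; ++i )
    {
      *slot = i; // each thread writes only to its own slot
    }
  }
  return std::chrono::duration< double >( std::chrono::steady_clock::now() - start ).count();
}

int main()
{
  const int num_threads = omp_get_max_threads();
  const int num_writes = 10000000; // 10^7 writes per thread, as in the test above

  // stride 1: 64 bits per thread; stride 8: 512 bits (a full cache line) per thread
  const int strides[] = { 1, 2, 4, 8 };
  for ( const int stride : strides )
  {
    std::printf( "stride %d (%d bits per thread): %.2f s\n",
      stride, stride * 64, time_writes( stride, num_threads, num_writes ) );
  }
  return 0;
}
```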

In this context we should also review the CompletionChecker (which uses a C-style bool array with atomic writes and should be safe), so that we have the same implementation everywhere.

Furthermore, we need to review use of vector<bool> in correlospinmatrix_detector and topology/connection_creator, although these uses seem different at first check and may not be affected (but might still benefit from refactoring).

jarsi commented Feb 18, 2020

In addition to the files you mentioned, there are even more occurrences of std::vector< bool >:

$ git grep "std::vector< bool >"
models/correlospinmatrix_detector.h:    std::vector< bool > curr_state_; //!< current state of neuron i
nestkernel/event.h:      // and a std::vector entry may differ (e.g. for std::vector< bool >)
nestkernel/simulation_manager.cpp:  std::vector< bool > done;
nestkernel/source_table.h:  std::vector< bool > is_cleared_;
nestkernel/source_table.h:  std::vector< bool > saved_entry_point_;
topology/connection_creator_impl.h:        std::vector< bool > is_selected( positions.size() );
topology/connection_creator_impl.h:        std::vector< bool > is_selected( positions.size() );
topology/connection_creator_impl.h:        std::vector< bool > is_selected( positions->size() );
topology/connection_creator_impl.h:        std::vector< bool > is_selected( positions->size() );
topology/connection_creator_impl.h:    std::vector< bool > is_selected( targets.size() );

Both occurrences in nestkernel/source_table.h deal with threads. The one in nestkernel/simulation_manager.cpp is only written to inside #pragma omp single and #pragma omp critical regions, but it could still be replaced.

heplesser commented:

The following PRs are related to this issue: #1416, #1421, #1422, #1427, #1428, #1430.

heplesser commented:

I have reviewed the use of vector< bool > in correlospinmatrix_detector and in topology/connection_creator. In the first case the use is per detector object and therefore thread-safe, while in the second case the use is only for the fixed_{in,out}degree cases, which are handled single-threaded in topology.

Therefore, those uses are safe and need not be changed.

suku248 commented Feb 25, 2020

PR #1442 implements the strategy of using a vector of integers wrapped in PerThreadBoolIndicator (replacing CompletedChecker) and should resolve this issue.
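
For illustration, here is a minimal sketch of the general idea, i.e. one full-width integer per thread acting as a boolean flag (the name PerThreadBoolSketch and its interface are mine; the actual PerThreadBoolIndicator in PR #1442 may differ in detail):

```cpp
// Minimal sketch of a per-thread boolean indicator backed by a vector of
// 64-bit integers; unlike std::vector< bool >, no two threads ever write to
// the same object. Illustrative only; not the interface of PR #1442.
#include <cstddef>
#include <cstdint>
#include <vector>

class PerThreadBoolSketch
{
public:
  explicit PerThreadBoolSketch( const std::size_t num_threads )
    : flags_( num_threads, 0 )
  {
  }

  void
  set_true( const std::size_t tid )
  {
    flags_[ tid ] = 1; // each thread writes only to its own 64-bit element
  }

  void
  set_false( const std::size_t tid )
  {
    flags_[ tid ] = 0;
  }

  bool
  get( const std::size_t tid ) const
  {
    return flags_[ tid ] != 0;
  }

  // Check whether all threads have set their flag; call from a single thread
  // (e.g. after a barrier) so that no other thread is writing concurrently.
  bool
  all_true() const
  {
    for ( const auto f : flags_ )
    {
      if ( f == 0 )
      {
        return false;
      }
    }
    return true;
  }

private:
  // One full word per thread instead of one bit; following the timings above,
  // the elements could additionally be padded to a cache line if needed.
  std::vector< std::uint64_t > flags_;
};
```

Using one word (or one padded cache line) per thread trades a little memory for the guarantee that the threads never touch the same object, which is exactly what the bit-packed std::vector< bool > could not provide.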

Just for the record: the std::vector< bool > done that is local to SimulationManager::update_() uses a different strategy to ensure thread safety. For code consistency, it should be checked whether it could also be replaced with PerThreadBoolIndicator.

heplesser added the ZP: PR Created label and removed the ZP: In progress label on Feb 26, 2020
heplesser commented:

@suku248 Thanks a lot! Concerning done in SimulationManager::update_(), I agree that the current implementation is safe but looks as if it could be improved upon. I would suggest doing this in a later comprehensive review of threading.
