Spike loss in repeated simulations of exact same script #1394

Closed · jarsi opened this issue Jan 14, 2020 · 13 comments · Fixed by #1442
Assignee: suku248
Labels: I: Behavior changes · S: Critical · T: Bug · ZC: Kernel · ZP: PR Created
Milestone: NEST 2.18.1

jarsi commented Jan 14, 2020

I have observed that under some circumstances the number of recorded spikes can diverge. In these repeated simulations I run the exact same script (a static version of examples/nest/hpc_benchmark.sli) and randomly get different results. The following table summarizes the problem:

commit ntasks number of threads jube_wp_iteration number_of_spikes
v2.14.0 32 24 0 316209
v2.14.0 32 24 1 316209
v2.14.0 32 24 2 316209
v2.14.0 32 24 3 316209
v2.14.0 32 24 4 316209
v2.14.0 32 24 5 316209
v2.14.0 32 24 6 316209
v2.14.0 32 24 7 316209
v2.14.0 32 24 8 316209
v2.14.0 32 24 9 316209
v2.14.0 768 1 0 316209
v2.14.0 768 1 1 316209
v2.14.0 768 1 2 316209
v2.14.0 768 1 3 316209
v2.14.0 768 1 4 316209
v2.14.0 768 1 5 316209
v2.14.0 768 1 6 316209
v2.14.0 768 1 7 316209
v2.14.0 768 1 8 316209
v2.14.0 768 1 9 316209
v2.16.0 32 24 0 328277
v2.16.0 32 24 1 333873
v2.16.0 32 24 2 325318
v2.16.0 32 24 3 325193
v2.16.0 32 24 4 325318
v2.16.0 32 24 5 325318
v2.16.0 32 24 6 327136
v2.16.0 32 24 7 315251
v2.16.0 32 24 8 325318
v2.16.0 32 24 9 325318
v2.16.0 768 1 0 325318
v2.16.0 768 1 1 325318
v2.16.0 768 1 2 325318
v2.16.0 768 1 3 325318
v2.16.0 768 1 4 325318
v2.16.0 768 1 5 325318
v2.16.0 768 1 6 325318
v2.16.0 768 1 7 325318
v2.16.0 768 1 8 325318
v2.16.0 768 1 9 325318

The simulations using the 2.14 release yield the same results, regardless of the division between threads and MPI processes. The number of spikes in the 2.16 version is constant when using only MPI processes and fluctuates when using a hybrid scheme. The slight difference between 2.14 and 2.16 is expected, as spikes occurring in the last time step are only recorded to file in 2.16.

As spike loss only occurs with 2.16, the bug was probably introduced with the 5g kernel.

Proof of spike loss in spike and membrane recordings

To further understand the problem I have recorded both the spikes and the membrane voltages, using the 2.16 release. I ran either 768 MPI processes with 1 thread each on 32 nodes or 32 MPI processes with 24 threads each on 32 nodes. As spike loss has not been seen in pure MPI simulations, I assume that the MPI-only simulation serves as the ground truth.

Indeed, a difference in the gdf files can be found. In the incorrectly behaving simulation, one neuron spikes 0.1 ms too late.

GID     time (correct)  time (incorrect)
505355 15.400 15.500

The effect of the spike loss appears much earlier than the incorrect spike itself. This can be verified by recording the membrane potentials: in several neurons the membrane potential diverges at time step 8.5.

GID     time                    V_m (correct)           V_m (incorrect)
9227 8.100000000000000 15.989619245265740 15.989619245265740
9227 8.199999999999999 16.117462627384160 16.117462627384160
9227 8.300000000000001 16.094458742910806 16.094458742910806
9227 8.400000000000000 15.910287217298723 15.910287217298723
9227 8.500000000000000 15.619021029332181 15.612824291718939
9227 8.600000000000000 15.196174602593645 15.175888438558591
9227 8.700000000000001 14.757521415092604 14.719968900712942
9227 8.800000000000001 14.411463132935133 14.356240673973033
9227 8.900000000000000 14.096045016672580 14.024277256018921
9227 9.000000000000000 13.708270006887023 13.621829152920018
93707 8.100000000000000 6.271737931625098 6.271737931625098
93707 8.199999999999999 6.071139950650960 6.071139950650960
93707 8.300000000000001 5.790890521131046 5.790890521131046
93707 8.400000000000000 5.438537419394722 5.438537419394722
93707 8.500000000000000 5.110002763271045 5.103806025657804
93707 8.600000000000000 4.808702211113251 4.788416047078197
93707 8.700000000000001 4.616389331353123 4.578836816973460
93707 8.800000000000001 4.484518465270027 4.429296006307928
93707 8.900000000000000 4.362229872249083 4.290462111595425
93707 9.000000000000000 4.195666685355289 4.109225831388283

suku248 added the ZC: Kernel, I: Behavior changes, S: Critical, and T: Bug labels on Jan 14, 2020
suku248 added this to the NEST 2.18.1 milestone on Jan 14, 2020
suku248 self-assigned this on Jan 14, 2020
heplesser added the ZP: In progress label on Jan 14, 2020
heplesser commented:

@jarsi Thank you for your excellent detective work! To summarise first

  • The error does not occur if using only MPI parallelisation
  • The error does not occur for smaller models / numbers of parallel processes
  • The error occurs when simulating a network with 500k+ neurons using 32 MPI processes with 24 threads each
  • The fact that, when an error occurs, its effect differs slightly between runs indicates that we are probably dealing with a non-deterministic concurrency error
  • The difference in membrane potential occurs at 8.5 ms only for the two neurons with GIDs 9227 and 93707
  • Interestingly, both those GIDs map to VP 11, i.e., to the same VP

I did a small test and the difference in membrane potential observed at t=8.5ms, which is 0.006197 mV lower for the incorrect than for the reference case for both neurons, is consistent with exactly one input spike not being delivered correctly during the time step ending at 8.5ms.

The missing spike can have one of two origins:

  1. A spike emitted by another neuron.
  2. A spike emitted by a poisson generator.

If it was a spike emitted by another neuron that was lost in transmission to VP 11, I would have expected more neurons to be affected, since each neuron has on average 14.6 targets per VP and the probability of having only 2 should be pretty low. It could still be that the spike arrives at the VP, but is not properly delivered to all local targets.
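
As a rough sanity check (assuming the number of targets a given source neuron has on one VP is approximately Poisson-distributed with the mean of 14.6 quoted above): P(X = 2) = e^(-14.6) · 14.6^2 / 2! ≈ 5 · 10^-5, so seeing only two affected targets from a lost neuron spike would indeed be very unlikely.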

Still, right now it seems more likely that this is a spike generated by a Poisson generator, where spike trains differ for all target neurons. One way to test whether the problem is caused by neuron or generator spikes would be to change the weight of the generator -> neuron connections, e.g., by a factor of 1.3 (not an integer multiple). This will change the dynamics, but as soon as we see the first difference in membrane potential, the size of that difference will tell us the weight of the connection that caused it, thus differentiating between generator and neuron input.

Another thing to look at would be the full V_m traces for the two neurons.

jarsi commented Jan 14, 2020

@heplesser, thank you for your summary. I agree with almost all of your points, apart from a small inaccuracy, probably stemming from my description: the difference in membrane potential at 8.5 ms does not occur for only two neurons, but for many more. To keep the included table short I showed just two example cases, which happen to live on the same virtual process. I can supply the exact number of occurrences and the corresponding virtual processes, but this will take some time due to the number of files and their sizes.

jarsi commented Jan 15, 2020

I had a closer look at the membrane potentials. A divergence can be seen on two virtual processes, VP 11 and VP 25. In total there are 28 divergences at time step 8.5: 16 on VP 11 and 12 on VP 25.

heplesser commented:

@jarsi 12 and 16 are highly compatible with an average of 14.6 ± 3.8 targets per neuron per VP. Therefore, this observation indicates to me that a spike has been lost in transmission from the pre- to the post-synaptic VP. Without further analysis we cannot be sure whether one and the same spike failed to reach two VPs or whether two different spikes were not delivered correctly.

The next step is a connectome analysis for the 28 affected targets. Since we have a fixed delay of 1.5 ms, the lost spike must have been fired at 7.0 ms, so by comparing connectome information with spike traces we may be able to pinpoint precisely which spike was lost.

heplesser commented:

Here is one more update based on further analysis by @jarsi.

  • Connectome analysis shows that there is exactly one neuron (GID 5) in the network projecting to the 28 affected targets mentioned above (for a specific, erroneous trial). The connectome was obtained using GetConnections, i.e., from information on the post-synaptic process of each connection after simulating for 0.1 ms to force spike exchange.
  • The neuron with GID 5 spiked at 6.9 ms, consistent with a first observable membrane potential difference for reference and erroneous simulations at 8.5 ms.
  • We can conclude from this that the spike fired by GID 5 at 6.9 ms was not correctly transmitted to VPs 11 and 25.
  • Unfortunately, the resulting "snowball effect" leads to very different spiking patterns by the time of the next spike fired by GID 5, so we cannot determine whether later spikes by GID 5 are subject to the same failure as well.
  • The next question now is whether
    • the spike is not transmitted correctly because the presynaptic connection tables are incorrect (error in the network construction phase) or
    • the spike is lost during transmission (error in the network simulation phase)

suku248 commented Jan 20, 2020

This is a summary of my experiments with the hpc_benchmark.sli from the NEST examples, which I adjusted in order to reproduce the bug.

The main finding is: the bug only occurs if the same synapse model is used for both E->E and E->I connections (i.e. no STDP synapses for E->E). I have tested static_synapse, static_synapse_hpc and a static synapse created through CopyModel (syn_ex, already used for E->I in the original script). Just a side note: it is not relevant whether I use a separate synapse model created through CopyModel to connect the stimulus to E and I. And I don't think synaptic plasticity plays a role, but I haven't tested it.

I have used 32 MPI processes and 24 threads per process (same as Jari). If I reduce the number of threads to 22 or 20, the bug does not occur (in 1000 ms of simulation), which kind of speaks against a race condition. I have not tested with more than 24 threads yet.

The minimum number of neurons per VP (in steps of 100) is 500, i.e. no bug for 400 or fewer neurons per VP.

So, the occurrence of the bug seems to require a minimum number of synapses in a connection vector and a minimum number of threads. Therefore, I think this bug might be related to #1088.

jarsi commented Feb 18, 2020

A short update on this issue.

@suku248 found through systematic search that the bug is related to the method SourceTable::save_entry_point( const thread tid ) and that protecting all calls to this method with omp critical ensured correct behavior.

This function saves the current position in the SourceTable before MPI communication so that operation can continue safely from where it left off after MPI communication. There are two ways in which this function can be accessed.

The if ( not saved_entry_point_[ tid ] ) inside SourceTable::save_entry_point( const thread tid ) is there to ensure that the position is saved only on the first pass during each MPI communication round. But print statements hint that the inner part is actually entered several times.

The reason for this seems to be the way in which saved_entry_point_ is defined:

std::vector< bool > saved_entry_point_;

This is a std::vector< bool > where each entry belongs to a different thread. While this is perfectly safe for most std::vector< T >, it unfortunately is not when dealing with bools. Among other things, cppreference has the following to say about this container:

Does not guarantee that different elements in the same container can be modified concurrently by different threads.

This is due to the way this data structure is implemented. To make it space efficient, the vector elements are coalesced such that each element occupies a single bit instead of sizeof(bool) bytes. This means that writing an individual element, i.e. updating a single bit, involves a read-modify-write of the whole byte (or word) containing it. This byte contains bits belonging to other threads as well. If multiple threads operate on the same byte at the same time, they can overwrite each other's updates, resulting in undefined behavior.
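
To make the failure mode concrete, here is a minimal, self-contained sketch (illustrative only, not NEST code) in which each thread toggles only "its own" flag, mimicking the per-thread saved_entry_point_ pattern; with std::vector< bool > the per-thread flags share underlying words, so updates can be lost:

```cpp
// Illustrative sketch only, not NEST code: each thread toggles only the flag
// at its own index. With std::vector< bool > the flags are bit-packed into
// shared words, so one thread's read-modify-write can clobber a neighbouring
// thread's bit; with std::deque< bool > each element is a separate object and
// concurrent writes to different elements are safe.
#include <cstdio>
#include <deque>
#include <vector>
#include <omp.h>

int main()
{
  const int num_threads = omp_get_max_threads();
  const int num_toggles = 1000000; // even, so every flag should end up false again

  std::vector< bool > packed_flags( num_threads, false ); // bit-packed: data race
  std::deque< bool > safe_flags( num_threads, false );    // one object per element

#pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    for ( int i = 0; i < num_toggles; ++i )
    {
      packed_flags[ tid ] = not packed_flags[ tid ];
      safe_flags[ tid ] = not safe_flags[ tid ];
    }
  }

  int packed_wrong = 0;
  int safe_wrong = 0;
  for ( int t = 0; t < num_threads; ++t )
  {
    packed_wrong += packed_flags[ t ] ? 1 : 0;
    safe_wrong += safe_flags[ t ] ? 1 : 0;
  }
  // The deque count is always 0; the vector< bool > count is often nonzero
  // (it is undefined behaviour, so no particular outcome is guaranteed).
  std::printf( "inconsistent flags: vector<bool> = %d, deque<bool> = %d\n", packed_wrong, safe_wrong );
  return 0;
}
```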

To test whether this is indeed the problem, I changed the type of saved_entry_point_ to std::deque< bool >, which does not implement any of these std::vector< bool > optimizations. This indeed fixes the problem: the spike counts agree and are the same for each run.

std::vector< bool > must be avoided when dealing with threaded code.

The next to-do is to decide which alternative container to use.

heplesser commented Feb 18, 2020

@jarsi @suku248 @jakobj Congratulations on this excellent detective work!

Concerning the new choice of container, cppreference (Section "Thread safety", item 3) states that

Different elements in the same container can be modified concurrently by different threads, except for the elements of std::vector ...

so using any other container should be fine. Beyond correctness, cache thrashing may be an issue here. I ran a few tests on a 2x16-core Epyc2 Rome server: 64 threads write in parallel 10^7 times each to a 64-element std::vector, where each element is an unsigned integer type of a given width, resulting in the following runtimes (some fluctuation; values are medians "estimated by eye"):

uint width (bits)   runtime
1 6.5 s
8 2.7 s
16 0.8 s
32 0.7 s
64 0.7 s
128 0.5 s
256 0.3 s
512 0.2 s

For widths > 64 bits, I padded with unused vector elements. The effects might be smaller when writes are mixed with other operations, but I think this suggests using a std::vector<std::uint32_t> or std::vector<std::uint64_t> to represent vectors of bools.
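
For reference, a sketch of the kind of measurement described above (assumed setup, not the exact benchmark code; the stride values and timing scaffolding are mine): each thread writes 10^7 times to its own slot of a std::vector<std::uint64_t>, and increasing the stride between per-thread slots, i.e. padding with unused elements, reduces false sharing of cache lines:

```cpp
// Hedged benchmark sketch (compile e.g. with -O2 -fopenmp): every thread
// repeatedly writes to "its own" element; a larger stride between per-thread
// slots means fewer threads share a cache line, which should lower the runtime.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <omp.h>

double
time_writes( const int stride, const int num_threads, const int num_writes )
{
  // Unused padding elements between the per-thread slots, as in the
  // "> 64 bit" cases above.
  std::vector< std::uint64_t > data( static_cast< size_t >( num_threads ) * stride, 0 );

  const auto start = std::chrono::steady_clock::now();
#pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    volatile std::uint64_t* slot = &data[ static_cast< size_t >( tid ) * stride ];
    for ( int i = 0; i < num_writes; ++i )
    {
      *slot = i; // each thread writes only to its own slot
    }
  }
  return std::chrono::duration< double >( std::chrono::steady_clock::now() - start ).count();
}

int main()
{
  const int num_threads = omp_get_max_threads();
  const int num_writes = 10000000; // 10^7 writes per thread, as in the test above

  // stride 1: 64 bits per thread; stride 8: 512 bits (a full cache line) per thread
  const int strides[] = { 1, 2, 4, 8 };
  for ( const int stride : strides )
  {
    std::printf( "stride %d (%d bits per thread): %.2f s\n",
      stride, stride * 64, time_writes( stride, num_threads, num_writes ) );
  }
  return 0;
}
```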

In this context we should also review the CompletionChecker (which uses a C-style bool array with atomic writes and should be safe), so that we have the same implementation everywhere.

Furthermore, we need to review use of vector<bool> in correlospinmatrix_detector and topology/connection_creator, although these uses seem different at first check and may not be affected (but might still benefit from refactoring).

jarsi commented Feb 18, 2020

In addition to the files you mentioned, there are even more occurrences of std::vector< bool >:

$ git grep "std::vector< bool >"
models/correlospinmatrix_detector.h:    std::vector< bool > curr_state_; //!< current state of neuron i
nestkernel/event.h:      // and a std::vector entry may differ (e.g. for std::vector< bool >)
nestkernel/simulation_manager.cpp:  std::vector< bool > done;
nestkernel/source_table.h:  std::vector< bool > is_cleared_;
nestkernel/source_table.h:  std::vector< bool > saved_entry_point_;
topology/connection_creator_impl.h:        std::vector< bool > is_selected( positions.size() );
topology/connection_creator_impl.h:        std::vector< bool > is_selected( positions.size() );
topology/connection_creator_impl.h:        std::vector< bool > is_selected( positions->size() );
topology/connection_creator_impl.h:        std::vector< bool > is_selected( positions->size() );
topology/connection_creator_impl.h:    std::vector< bool > is_selected( targets.size() );

Both occurrences in nestkernel/source_table.h deal with threads. The one in nestkernel/simulation_manager.cpp is only written to inside #pragma omp single and #pragma omp critical regions, but it could still be replaced.

heplesser commented:

The following PRs are related to this issue: #1416, #1421, #1422, #1427, #1428, #1430.

heplesser commented:

I have reviewed the use of vector< bool > in correlospinmatrix_detector and in topology/connection_creator. In the first case the use is per detector object and therefore thread-safe, while in the second case the use is only for the fixed_{in,out}degree cases, which are handled single-threaded in topology.

Therefore, those uses are safe and need not be changed.

suku248 commented Feb 25, 2020

PR #1442 implements the strategy of using a vector of integers wrapped in PerThreadBoolIndicator (replacing CompletedChecker) and should resolve this issue.
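
For illustration, here is a minimal sketch of the general idea, i.e. one full-width integer per thread acting as a boolean flag (the name PerThreadBoolSketch and its interface are mine; the actual PerThreadBoolIndicator in PR #1442 may differ in detail):

```cpp
// Minimal sketch of a per-thread boolean indicator backed by a vector of
// 64-bit integers; unlike std::vector< bool >, no two threads ever write to
// the same object. Illustrative only; not the interface of PR #1442.
#include <cstddef>
#include <cstdint>
#include <vector>

class PerThreadBoolSketch
{
public:
  explicit PerThreadBoolSketch( const std::size_t num_threads )
    : flags_( num_threads, 0 )
  {
  }

  void
  set_true( const std::size_t tid )
  {
    flags_[ tid ] = 1; // each thread writes only to its own 64-bit element
  }

  void
  set_false( const std::size_t tid )
  {
    flags_[ tid ] = 0;
  }

  bool
  get( const std::size_t tid ) const
  {
    return flags_[ tid ] != 0;
  }

  // Check whether all threads have set their flag; call from a single thread
  // (e.g. after a barrier) so that no other thread is writing concurrently.
  bool
  all_true() const
  {
    for ( const auto f : flags_ )
    {
      if ( f == 0 )
      {
        return false;
      }
    }
    return true;
  }

private:
  // One full word per thread instead of one bit; following the timings above,
  // the elements could additionally be padded to a cache line if needed.
  std::vector< std::uint64_t > flags_;
};
```

Using one word (or one padded cache line) per thread trades a little memory for the guarantee that the threads never touch the same object, which is exactly what the bit-packed std::vector< bool > could not provide.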

Just for the record: the std::vector< bool > done that is local to SimulationManager::update_() uses a different strategy to ensure thread safety. For code consistency, it should be checked whether it could also be replaced with PerThreadBoolIndicator.

heplesser added the ZP: PR Created label and removed the ZP: In progress label on Feb 26, 2020
heplesser commented:

@suku248 Thanks a lot! Concerning done in SimulationManager::update_(), I agree that the current implementation is safe but looks as if it could be improved upon. I would suggest doing this in a later comprehensive review of threading.
