Spike loss in repeated simulations of exact same script #1394
@jarsi Thank you for your excellent detective work! To summarise first:
I did a small test, and the difference in membrane potential observed at t = 8.5 ms, which is 0.006197 mV lower for the incorrect than for the reference case for both neurons, is consistent with exactly one input spike not being delivered correctly during the time step ending at 8.5 ms. The missing spike can have two origins:
If it was a spike emitted by another neuron and lost in transmission to VP 11, I would have expected more neurons to be affected, since each neuron has on average 14.6 targets per VP and the probability of having only 2 should be pretty low. It could still be that the spike arrives at the VP but is not properly delivered to all local targets. Still, right now it seems more likely that this is a spike generated by a Poisson generator, where spike trains differ for all target neurons. One way to test whether the problem is caused by neuron or generator spikes would be to change the weight of the generator -> neuron connections, e.g. by a factor of 1.3 (not an integer multiple). This will change the dynamics, but as soon as we see the first difference in membrane potential, the size of this change will tell us the weight of the connection that caused it, thus differentiating between generator and neuron input. Another thing to look at would be the full V_m traces for the two neurons.
@heplesser, thank you for your summary. I agree with almost all of your points, except for a small inaccuracy, probably stemming from my description:
I had a closer look at the membrane potentials. A divergence can be seen on two virtual processes, processes 11 and 25. In total there are 28 divergences at time step 8.5.
@jarsi 12 and 16 are highly compatible with an average of 14.6 ± 3.8 targets per neuron per VP. Therefore, this observation indicates to me that a spike has been lost in transmission from the pre- to the post-synaptic VP. We cannot be sure without further analysis whether one and the same spike failed to reach two VPs or whether two different spikes were not delivered correctly. The next step is a connectome analysis for the 28 affected targets. Since we have a fixed delay of 1.5 ms, the lost spike must have been fired at 7.0 ms, so by comparing connectome information with spike traces, we may be able to pinpoint precisely which spike was lost.
Here is one more update based on further analysis by @jarsi.
This is a summary of my experiments with the benchmark script. The main finding is: the bug only occurs if the same synapse model is used for both E->E and E->I connections (i.e. no STDP synapses for E->E). I have used 32 MPI processes and 24 threads per process (same as Jari). If I reduce the number of threads to 22 or 20, the bug does not occur (in 1000 ms of simulation). This kind of speaks against a race condition. I have not yet tested with more than 24 threads. The minimum number of neurons per VP (in steps of 100) is 500, i.e. there is no bug for 400 or fewer neurons per VP. So, the occurrence of the bug seems to require a minimum number of synapses in a connection vector and a minimum number of threads. Therefore, I think this bug might be related to #1088.
A short update on this issue. @suku248 found through systematic search that the bug is related to a method that saves the current position of the SourceTable before MPI communication, such that operation can continue safely from where it left off after MPI communication. There are two ways in which this function can be accessed. The reason for the bug seems to be the data structure used for the per-thread bookkeeping of this function.
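To make the access pattern concrete, here is a minimal sketch of such a save/restore mechanism. The class name, method names, and signatures are hypothetical and only illustrate the pattern, not the actual NEST code; the key point is that the per-thread "saved" flags all live in one shared container.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the save/restore pattern around MPI communication;
// names and signatures are illustrative, not the actual NEST code.
class SourceTablePositions
{
public:
  explicit SourceTablePositions( const std::size_t num_threads )
    : saved_( num_threads, false ) // per-thread flags in one shared std::vector< bool >
    , position_( num_threads, 0 )
  {
  }

  // Called by each thread just before MPI communication.
  void
  save_entry_point( const std::size_t tid, const long current_position )
  {
    if ( not saved_[ tid ] ) // concurrent writes by different threads collide
    {                        // inside the packed words of saved_ (see below)
      position_[ tid ] = current_position;
      saved_[ tid ] = true;
    }
  }

  // Called after MPI communication so the thread continues where it left off.
  long
  restore_entry_point( const std::size_t tid )
  {
    assert( saved_[ tid ] );
    saved_[ tid ] = false;
    return position_[ tid ];
  }

private:
  std::vector< bool > saved_;    // one bit per thread, packed into shared machine words
  std::vector< long > position_; // saved read positions, one per thread
};
```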
This is a std::vector< bool >. This is due to the way this data structure is implemented: in order to make it space efficient, the vector elements are coalesced such that each element occupies a single bit instead of at least a full byte. As a consequence, writes to different elements from different threads go through read-modify-write operations on the same underlying machine words and are therefore not thread safe. To test whether this is indeed the problem, I changed the element type of this vector to a plain integer, so that each flag occupies its own memory location.
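A self-contained illustration of the difference, assuming nothing about NEST itself, only about the containers: each thread toggles only "its own" element. With std::vector< int > every element ends up correct, while with std::vector< bool > toggles of neighbouring bits can overwrite each other because they share a machine word (formally, this is a data race).

```cpp
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// Compile with -pthread. Each of 64 threads toggles only its own element an odd
// number of times, so every element should end up "true" / 1.
int
main()
{
  const std::size_t num_threads = 64;
  const long toggles = 1000001; // odd number of toggles per element

  std::vector< bool > packed( num_threads, false ); // bits packed into shared words
  std::vector< int > plain( num_threads, 0 );       // one full int per element

  auto work = [ & ]( const std::size_t idx ) {
    for ( long i = 0; i < toggles; ++i )
    {
      packed[ idx ] = not packed[ idx ]; // read-modify-write on a shared word (data race)
      plain[ idx ] = 1 - plain[ idx ];   // touches only element idx (no race)
    }
  };

  std::vector< std::thread > threads;
  for ( std::size_t t = 0; t < num_threads; ++t )
  {
    threads.emplace_back( work, t );
  }
  for ( auto& t : threads )
  {
    t.join();
  }

  std::size_t packed_correct = 0;
  std::size_t plain_correct = 0;
  for ( std::size_t t = 0; t < num_threads; ++t )
  {
    packed_correct += packed[ t ] ? 1 : 0;
    plain_correct += plain[ t ] ? 1 : 0;
  }

  // plain_correct is always 64; packed_correct is typically smaller because
  // concurrent toggles of neighbouring bits overwrite each other.
  std::cout << "vector<bool>: " << packed_correct << "/" << num_threads << " elements correct\n";
  std::cout << "vector<int> : " << plain_correct << "/" << num_threads << " elements correct\n";
  return 0;
}
```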
The next step is to decide which alternative container to use.
@jarsi @suku248 @jakobj Congratulations on this excellent detective work! Concerning the new choice of container, cppreference (section "Thread safety", item 3) states that different elements in the same container can be modified concurrently by different threads, with the exception of the elements of std::vector< bool >,
so using any other container should be fine. Beyond correctness, cache thrashing (false sharing) may be an issue here. I ran a few tests on a 2x16-core Epyc2 Rome server, with 64 threads writing in parallel 10^7 times each to their own element of a 64-element vector. For element sizes beyond 64 bit, I padded with unused vector elements. Effects might be smaller when writes are mixed with other operations, but I think this suggests using a container whose elements are large enough (or padded) so that different threads do not write to the same cache line; a sketch of such a benchmark is given below. We should in this context also review the related data structures. Furthermore, we need to review the use of std::vector< bool > elsewhere in the code base.
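The following is a sketch of that kind of micro-benchmark, not the original test code; the names, cache-line size, and thread count are assumptions. It compares densely packed per-thread slots, where neighbouring slots share cache lines, with slots padded to one cache line each.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// Compile with e.g. g++ -O2 -std=c++17 -pthread (C++17 for over-aligned allocation).

constexpr std::size_t kNumThreads = 64;
constexpr long kWrites = 10000000;     // 10^7 writes per thread, as in the tests above
constexpr std::size_t kCacheLine = 64; // bytes; typical x86-64 cache-line size

struct DenseSlot // neighbouring slots share cache lines -> false sharing
{
  long value;
};

struct PaddedSlot // one cache line per slot -> no false sharing
{
  alignas( kCacheLine ) long value;
};

template < typename Slot >
double
time_writes()
{
  std::vector< Slot > slots( kNumThreads );
  const auto start = std::chrono::steady_clock::now();

  std::vector< std::thread > threads;
  for ( std::size_t t = 0; t < kNumThreads; ++t )
  {
    threads.emplace_back( [ &slots, t ]() {
      volatile long& v = slots[ t ].value; // volatile keeps every store in the loop
      for ( long i = 0; i < kWrites; ++i )
      {
        v = i; // each thread writes only to its own slot
      }
    } );
  }
  for ( auto& th : threads )
  {
    th.join();
  }

  const std::chrono::duration< double > elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

int
main()
{
  std::cout << "densely packed slots  : " << time_writes< DenseSlot >() << " s\n";
  std::cout << "cache-line padded slots: " << time_writes< PaddedSlot >() << " s\n";
  return 0;
}
```

On a typical multi-socket machine the padded variant should be substantially faster, in line with the observation above; the exact factor depends on the hardware.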
In addition to the files you mentioned, there are even more occurrences of std::vector< bool >. Both occurrences in the additional files should be reviewed as well.
I have reviewed the remaining uses of std::vector< bool >. Those uses are safe and need not be changed.
PR #1442 implements the strategy of using a vector of integers wrapped in a dedicated class.
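A minimal sketch of what such a wrapper might look like; the class name, method names, and exact interface are assumptions and will differ from the actual class introduced in PR #1442. The idea is simply that each thread gets its own full integer, so concurrent writes by different threads touch distinct memory locations while the rest of the code keeps a bool-like interface.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only; the wrapper introduced in PR #1442 may look different.
class ThreadFlagVector
{
public:
  void
  initialize( const std::size_t num_threads, const bool value )
  {
    flags_.assign( num_threads, value ? 1 : 0 );
  }

  bool
  get( const std::size_t tid ) const
  {
    return flags_[ tid ] != 0;
  }

  void
  set_true( const std::size_t tid )
  {
    flags_[ tid ] = 1; // distinct ints: concurrent writes by distinct threads are safe
  }

  void
  set_false( const std::size_t tid )
  {
    flags_[ tid ] = 0;
  }

  bool
  all_true() const
  {
    for ( const int f : flags_ )
    {
      if ( f == 0 )
      {
        return false;
      }
    }
    return true;
  }

private:
  std::vector< int > flags_; // one full int per thread instead of one packed bit
};
```

If false sharing (as discussed above) turns out to matter, the same wrapper could additionally pad each entry to a full cache line.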
@suku248 Thanks a lot!
I have observed that under some circumstances the number of recorded spikes can diverge between repeated simulations. In these repeated simulations I run the exact same script (a static version of examples/nest/hpc_benchmark.sli) and randomly get different results. The problem can be summarized as follows:
Simulations using the 2.14 release yield the same results, regardless of the division between threads and MPI processes. The number of spikes in the 2.16 version is constant when using only MPI processes and fluctuates when using a hybrid scheme. The slight variation between 2.14 and 2.16 is to be expected, as the spikes occurring in the last time step are only recorded to file in 2.16.
As spike loss only occurs with 2.16, the bug was probably introduced with the 5g kernel.
Proof of spike loss in spike and membrane recordings
To further understand the problem, I have recorded both the spikes and the membrane voltages using the 2.16 release. I use either 768 MPI processes with 1 thread each on 32 nodes, or 32 MPI processes with 24 threads each on 32 nodes. As spike loss has not been seen in pure MPI simulations, I assume that the MPI-only simulation serves as the ground truth.
Indeed, a difference in the gdf files can be found: in the incorrectly behaving simulation, one neuron spikes 0.1 ms too late. The effect of the spike loss, however, sets in much earlier than the first incorrect spike can be registered. This can be verified by recording the membrane potentials: the membrane potentials of several neurons diverge at time step 8.5.
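As an aside, locating the first divergence between two such recordings can be automated. The following is a small sketch, assuming the two runs have been merged and consistently sorted into two-column "gid time" text files; the file names are hypothetical.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Print the first line at which the reference (MPI-only) and candidate (hybrid)
// recordings differ. File names are placeholders for the merged, sorted outputs.
int
main()
{
  std::ifstream reference( "spikes_mpi_only.gdf" );
  std::ifstream candidate( "spikes_hybrid.gdf" );
  if ( not reference or not candidate )
  {
    std::cerr << "could not open recordings\n";
    return 1;
  }

  std::string ref_line;
  std::string cand_line;
  long line_number = 0;
  while ( std::getline( reference, ref_line ) and std::getline( candidate, cand_line ) )
  {
    ++line_number;
    if ( ref_line != cand_line )
    {
      std::cout << "first divergence at line " << line_number << ":\n"
                << "  reference: " << ref_line << "\n"
                << "  candidate: " << cand_line << "\n";
      return 0;
    }
  }
  std::cout << "no divergence found in " << line_number << " common lines\n";
  return 0;
}
```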