Assertion failed: (source_lid < MAX_LID) for GetConnections in Small Neural networks #1489

Closed
sternj98 opened this issue Mar 28, 2020 · 17 comments · Fixed by #1502
Labels
I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: Critical Needs to be addressed immediately T: Bug Wrong statements in the code or documentation

@sternj98

sternj98 commented Mar 28, 2020

Describe the bug
The error "Assertion failed: (source_lid < MAX_LID)" occurs when querying connectivity with GetConnections for 24 or more connections, using the Conda installation on macOS 10.14.

To Reproduce
To reproduce the behavior, run the following script:
import nest
nest.ResetKernel()
n = 5 # number of neurons
epop1 = nest.Create("iaf_psc_alpha",n)
epop2 = nest.Create("iaf_psc_alpha",n)
nest.Connect(epop1,epop2,{'rule': 'all_to_all'})
conns = nest.GetConnections(epop1,target = epop2)
nest.GetStatus(conns,["target","weight"])

This runs fine for n < 5, but I get the following error with n >= 5:

Expected behavior
Assertion failed: (source_lid < MAX_LID), function set_source_lid, file /usr/local/miniconda/conda-bld/nest-simulator_1583214474797/work/nestkernel/target_data.h, line 264.
Abort trap: 6

Note that other scripts run fine, including simulations of larger networks as in example script brunel_alpha_numpy.py .
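To relate the neuron count n to the number of connections being queried, here is a small sketch that needs no NEST installation. It assumes only that the 'all_to_all' rule connects every source to every target, so two populations of size n yield n * n connections; the function name is made up for illustration.

```python
def all_to_all_connection_count(n_sources, n_targets):
    """Number of connections created by an all-to-all rule (one per source/target pair)."""
    return n_sources * n_targets


# The report says n < 5 works and n >= 5 fails; show the connection counts around that boundary.
for n in range(3, 7):
    print(f"n = {n}: {all_to_all_connection_count(n, n)} connections")
```

The step from n = 4 to n = 5 is a step from 16 to 25 connections. Whether the connection count itself or the number of sources is the deciding factor is not settled by this count alone.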

Desktop/Environment (please complete the following information):

  • OS: macOS 10.14
  • Python-Version: Python 3.6
  • NEST-Version: NEST 2.20
  • Installation: Conda

Additional context
Reproduced successfully by Hans Ekkehard

@terhorstd terhorstd added ZC: Kernel DO NOT USE THIS LABEL I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) ZP: Pending DO NOT USE THIS LABEL S: High Should be handled next T: Bug Wrong statements in the code or documentation labels Mar 30, 2020
@heplesser
Contributor

heplesser commented Mar 30, 2020

I can reproduce the problem under OSX 10.14 and 10.15, but only when installing from conda-forge (nompi-variant). The assertion is not triggered if I build NEST from source.

Here is a reproducer in SLI:

/iaf_psc_alpha 10 Create ;
/p1 [1 5] Range def
/p2 [6 10] Range def
p1 p2 /all_to_all Connect
<<  >> GetConnections

The assertion is not triggered if I create only 8 neurons and split them in two groups of 4. Indeed, it turns out that the absolute minimum required to trigger the assertion is a single group of five neurons connected to itself:

 /iaf_psc_alpha 5 Create 1 arraystore Range dup Connect << >> GetConnections

triggering the same exception. Instead of calling GetConnections, one can also trigger the assertion by simulating, since target data needs to be collected in both cases:

 /iaf_psc_alpha 5 Create 1 arraystore Range dup Connect 1 Simulate

@heplesser
Contributor

@steffengraber Could you, for debugging purposes, create a conda recipe that builds NEST as bare-bones as possible, especially without GSL, possibly also without ncurses and readline?

BTW, Conda is throwing in a lot of additional compiler flags, as revealed by running nest-config --cflags:

-march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden -std=c++14 -fmessage-length=0 -isystem /Users/plesser/miniconda3/envs/nest_test/include -fdebug-prefix-map=/usr/local/miniconda/conda-bld/nest-simulator_1583214474797/work=/usr/local/src/conda/nest-simulator-2.20.0 -fdebug-prefix-map=/Users/plesser/miniconda3/envs/nest_test=/usr/local/src/conda-prefix -std=c++11 -O2 -Wall 

I experimented with adding some of them to my local build, but that did not cause the problem.

I also traced library loading. When running the conda-installed NEST version, libc++ is loaded twice

dyld: loaded: /Users/plesser/miniconda3/envs/nest_test/lib/libc++.1.dylib
...
dyld: loaded: /usr/lib/libc++.1.dylib

and those are different files. Bending some paths, I made the first be the second, so only one libc++ is loaded, but the problem persists.

Or is there any way to build (compile) a conda package locally to be able to do some more debugging?

@heplesser
Contributor

@steffengraber did excellent detective work, so it is now clear that building NEST with Boost causes the problem. The problem occurs under Linux and macOS and, if present, also leads to some failing tests. It is not clear why the problem does not lead to failing tests on Travis.

As a work-around, one can either build NEST locally without Boost (the default) or install the following versions from Conda, which do not include Boost:

Linux

conda install -c conda-forge nest-simulator=*=*nompi_py38hcb0619c_103*

macOS

conda install -c conda-forge nest-simulator=*=*nompi_py38h8ffda2a_103*

..

@hakonsbm
Contributor

@heplesser Looks like it can be related to #1239, then.

@heplesser
Contributor

@hakonsbm Indeed!

I tested this in the debugger now and have found the following: When the assertion triggers here, we have

source_lid == 18446744073709551615 == 2^64-1

This is the result of VPManager::gid_to_lid() when called with gid == 0. The method should really have an assertion ensuring that it is only called with gid > 0. It is called with gid == 0 because, in SourceTable::get_next_target_data(), a call to current_source.get_gid() returns 0; the full current_source object is (gid_ = 0, processed_ = true, primary_ = true).
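To illustrate how gid == 0 can produce exactly 2^64-1, here is a minimal sketch in plain Python. It assumes that the local id is computed as ceil(gid / number of virtual processes) - 1 and stored in an unsigned 64-bit integer; that formula, the function names, and the 19-bit value used for the limit are assumptions for illustration and are not quoted from the NEST sources.

```python
import math

UINT64_MODULUS = 2 ** 64

# Hypothetical limit for illustration only; the real MAX_LID is defined in
# nestkernel/target_data.h and its value is not given in this thread.
ASSUMED_MAX_LID = 2 ** 19 - 1


def gid_to_lid_sketch(gid, num_vps=1):
    """Assumed formula: ceil(gid / num_vps) - 1, wrapped like an unsigned 64-bit value."""
    signed_lid = math.ceil(gid / num_vps) - 1
    return signed_lid % UINT64_MODULUS  # -1 wraps around to 2**64 - 1


def checked_gid_to_lid(gid, num_vps=1):
    """Same computation, guarded by the kind of assertion suggested above."""
    assert gid > 0, "gid_to_lid must only be called with gid > 0"
    return gid_to_lid_sketch(gid, num_vps)


print(gid_to_lid_sketch(1))  # first valid gid maps to local id 0
print(gid_to_lid_sketch(0))  # invalid gid 0 wraps around to 18446744073709551615
```

Under these assumptions the wrapped value is far above any plausible MAX_LID, which matches the failing assertion. A guard like the one in checked_gid_to_lid would report the invalid gid at its origin instead of later in set_source_lid.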

I suspect that the problem is related to sorting, since that is where Boost comes in. Indeed, the following passes:

0 << /sort_connections_by_source false >> SetStatus
/iaf_psc_alpha 5 Create 1 arraystore Range dup Connect 1 Simulate

while this does not:

0 << /sort_connections_by_source false >> SetStatus
/iaf_psc_alpha 5 Create 1 arraystore Range dup Connect 1 Simulate

So clearly using Boost for sorting causes the problem.
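For PyNEST users, the passing configuration (sorting disabled) could be sketched roughly as follows. This is a translation of the SLI line under the assumption that the kernel property sort_connections_by_source can be set through nest.SetKernelStatus before any connections are created; the helper names are made up, and the functions take the nest module as an argument so the logic can be exercised without NEST installed.

```python
try:
    import nest  # PyNEST; optional so the sketch can be read and tested without it
except ImportError:
    nest = None


def disable_source_sorting(kernel):
    """Reset the kernel and switch off sorting of connections by source.

    Assumption: the property must be set before connections are created.
    """
    kernel.ResetKernel()
    kernel.SetKernelStatus({"sort_connections_by_source": False})


def run_minimal_network(kernel, n=5, simtime=1.0):
    """Build the five-neuron network from this thread with sorting disabled and simulate."""
    disable_source_sorting(kernel)
    neurons = kernel.Create("iaf_psc_alpha", n)
    kernel.Connect(neurons, neurons)  # default rule is all-to-all
    kernel.Simulate(simtime)
    return neurons


if nest is not None:
    run_minimal_network(nest)
```

This is a diagnostic aid rather than a fix: it sidesteps the sorting code path, and the effect of disabling sorting on performance or on other features is not examined here.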

As @niltonlk pointed out in #1239, the problem seems to exist only for Boost 1.69.0 and later, so a change in Boost must be breaking our code.

@hakonsbm I am assigning you to this task since you have the most experience with Boost sorting.

@heplesser heplesser added this to the NEST 2.20.1 milestone Mar 31, 2020
@heplesser heplesser added ZP: In progess DO NOT USE THIS LABEL S: Critical Needs to be addressed immediately and removed ZP: Pending DO NOT USE THIS LABEL S: High Should be handled next labels Mar 31, 2020
@heplesser
Contributor

Escalated to critical due to the identified incompatibility with Boost 1.69.0 and later.

@lekshmideepu
Contributor

@steffengraber did excellent detective work, so it is now clear that building NEST with Boost causes the problem. The problem occurs under Linux and macOS and, if present, also leads to some failing tests. It is not clear why the problem does not lead to failing tests on Travis.

As a work-around, one can either build NEST locally without Boost (the default) or install the following versions from Conda, which do not include Boost:

Linux

conda install -c conda-forge nest-simulator=*=*nompi_py38hcb0619c_103*

macOS

conda install -c conda-forge nest-simulator=*=*nompi_py38h8ffda2a_103*

..

@heplesser Boost is turned OFF on macOS on Travis.

@heplesser
Contributor

@lekshmideepu The problem also occurs on Linux, but only if NEST is built with a recent version of Boost (1.69.0 or later). On Travis, we use 1.58.0. Would there be a way to move to a more recent version of Boost on Travis?

@lekshmideepu
Contributor

@heplesser sure, I could try with a more recent version of Boost on Travis

@hakonsbm
Contributor

I suspect that the problem is related to sorting, since that is where Boost comes in. Indeed, the following passes:

0 << /sort_connections_by_source false >> SetStatus
/iaf_psc_alpha 5 Create 1 arraystore Range dup Connect 1 Simulate

while this does not:

0 << /sort_connections_by_source false >> SetStatus
/iaf_psc_alpha 5 Create 1 arraystore Range dup Connect 1 Simulate

@heplesser Should it be /sort_connections_by_source true in the failing example?

I'm still not able to reproduce the issue on my computer (with Boost 1.72.0). I will keep experimenting.

@heplesser
Contributor

heplesser commented Mar 31, 2020

@hakonsbm Yes, the problem occurs if connections are sorted. I am using Boost 1.72.0, and @steffengraber had the problems also on Linux (which version?). The problem also occurs with current master, with the code from above then changed to

<< /sort_connections_by_source true >> SetKernelStatus
/iaf_psc_alpha 5 Create dup Connect 1 Simulate

BTW, on macOS the problem occurs when compiling with clang and with gcc; compiling with gcc requires an additional -D_GLIBCXX_USE_CXX11_ABI=0 compiler flag to avoid problems with the boost unit testing framework from Homebrew.

@heplesser
Contributor

@hakonsbm I had a quick look at the code of Boost 1.68.0 vs 1.69.0. Among other things, sort/common/pivot.hpp got a different mid3() implementation, and sort/spreadsort/detail/integer_sort.hpp since 1.69 falls back on boost::sort::pdqsort() instead of std::sort() for small sizes.

@steffengraber
Contributor

I have a conda build environment on Ubuntu 19.10 and have now built NEST 2.20.0 with different versions of Boost. A make installcheck gave the following results:

  • with boost v1.72.0 - errors
  • with boost v1.71.0 - errors
  • with boost v1.70.0 - errors
  • with boost v1.69.0 - errors
  • with boost v1.68.0 - ok
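The results above can be condensed into a small version check. This is only a sketch: the threshold 1.69.0 comes from the test results in this thread, the helper names are made up, and it applies to NEST 2.20.0 built with Boost; versions outside the tested range are an extrapolation.

```python
# First Boost release for which the thread reports errors with NEST 2.20.0.
FIRST_AFFECTED_BOOST = (1, 69, 0)


def parse_version(text):
    """Turn a string such as 'v1.69.0' or '1.69.0' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def is_affected(boost_version):
    """True if this Boost version is at or above the first failing release."""
    return parse_version(boost_version) >= FIRST_AFFECTED_BOOST


for version in ["1.58.0", "1.68.0", "1.69.0", "1.70.0", "1.71.0", "1.72.0"]:
    status = "errors expected" if is_affected(version) else "ok"
    print(f"Boost {version}: {status}")
```

Comparing tuples of integers rather than strings keeps the comparison correct should a two-digit component meet a three-digit one.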

@heplesser
Contributor

@steffengraber Thanks for testing! The problem seems indeed related to the change from std::sort() to boost::sort::pdqsort() for small sizes that happened with v1.69.0. @hakonsbm is on it.

@heplesser heplesser removed the ZP: In progess DO NOT USE THIS LABEL label Apr 7, 2020
@danxan

danxan commented Jul 6, 2020

Hi,

I recently installed NEST 2.20 using Conda on a Linux system (Manjaro 20) and I still get this same error, both with and without MPI.

Did you update the Conda package with the fix to this issue?

I am referring to the package used in this guide:
https://nest-simulator.readthedocs.io/en/nest-2.20.1/installation/

Thanks

@hakonsbm
Contributor

@danxan The fix for this issue will be part of NEST 2.20.1, which is not released yet.

@Kodemannen

Kodemannen commented Aug 10, 2020

Edit: @hakonsbm @danxan @heplesser It works for me now after installing NEST from source.

Hey

I think I am having the same issue, but it happens even if I don't use nest.GetConnections().

Just a simple simulation of an interconnected network of >= 5 neurons will cause the error. The following snippet produces it:

import nest

n = 5
simtime = 10

neurons = nest.Create("iaf_psc_alpha", n)
nest.Connect(neurons, neurons)

nest.Simulate(simtime)

Any idea when the next release will be?

Cheers!

Desktop/Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Shell: bash
  • Python-Version: Python 3.8.5
  • NEST-Version: nest-2.20.0
  • Installation: conda-forge
