segmentation fault for slightly larger networks, related to version with openmpi #1881

Closed
ChristianKeup opened this issue Dec 18, 2020 · 12 comments
Labels
I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: Normal Handle this with default priority T: External bug Not an issue that can be solved here. (May need documentation, though)

Comments

@ChristianKeup
Contributor

ChristianKeup commented Dec 18, 2020

Hello,

I'm using a laptop with 16 GB of memory running Ubuntu 18.04. If I install NEST with OpenMPI using conda, as described in the documentation (although I don't really need OpenMPI on the laptop):

conda create --name ENVNAME -c conda-forge nest-simulator=*=mpi_openmpi*

and then run the following minimal example "simulation" as a python script:

import nest  
nrns = nest.Create('iaf_psc_delta', 5000)  
conn_dict = {"rule": "fixed_indegree", "indegree": 200}  
nest.Connect(nrns, nrns, conn_dict, {'weight': -1.0})  
nest.Simulate(100)  

I get the following output with a segmentation fault error:

[INFO] [2020.12.18 15:55:56 /home/conda/feedstock_root/build_artifacts/nest-simulator_1604245416729/work/nestkernel/rng_manager.cpp:217 @ Network::create_rngs] : Creating default RNGs
[INFO] [2020.12.18 15:55:56 /home/conda/feedstock_root/build_artifacts/nest-simulator_1604245416729/work/nestkernel/rng_manager.cpp:260 @ Network::create_grng_] : Creating new default global RNG

          -- N E S T --

Copyright (C) 2004 The NEST Initiative

Version: nest-2.20.0
Built: Nov 1 2020 15:48:07

This program is provided AS IS and comes with
NO WARRANTY. See the file LICENSE for details.

Problems or suggestions?
Visit https://www.nest-simulator.org

Type 'nest.help()' to find out more about NEST.

Dec 18 15:55:56 NodeManager::prepare_nodes [Info]:
Preparing 5000 nodes for simulation.
[inm6187:15729] *** Process received signal ***
[inm6187:15729] Signal: Segmentation fault (11)
[inm6187:15729] Signal code: Address not mapped (1)
[inm6187:15729] Failing at address: 0xfffffffffffffff8
[inm6187:15729] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f9f7dbfb980]
[inm6187:15729] [ 1] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libmodels.so(+0x4b5960)[0x7f9f5f3c0960]
[inm6187:15729] [ 2] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libmodels.so(+0x4b5f5e)[0x7f9f5f3c0f5e]
[inm6187:15729] [ 3] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libmodels.so(+0x4b7936)[0x7f9f5f3c2936]
[inm6187:15729] [ 4] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libmodels.so(+0x4b8882)[0x7f9f5f3c3882]
[inm6187:15729] [ 5] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libmodels.so(+0x4b8ae4)[0x7f9f5f3c3ae4]
[inm6187:15729] [ 6] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libmodels.so(_ZN4nest4sortINS_6SourceENS_16StaticConnectionINS_24TargetIdentifierPtrRportEEEEEvR11BlockVectorIT_ERS5_IT0_E+0x1d9)[0x7f9f5f3c3d19]
[inm6187:15729] [ 7] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libnestkernel.so(_ZN4nest17ConnectionManager16sort_connectionsEi+0x99)[0x7f9f5eda5c19]
[inm6187:15729] [ 8] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libnestkernel.so(_ZN4nest17SimulationManager32update_connection_infrastructureEi+0x1f2)[0x7f9f5ed98ae2]
[inm6187:15729] [ 9] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libgomp.so.1(GOMP_parallel+0x42)[0x7f9f5e71ee8c]
[inm6187:15729] [10] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libnestkernel.so(_ZN4nest17SimulationManager7prepareEv+0x1be)[0x7f9f5ed97e4e]
[inm6187:15729] [11] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libnestkernel.so(_ZN4nest17SimulationManager8simulateERKNS_4TimeE+0x12)[0x7f9f5eda3442]
[inm6187:15729] [12] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libnestkernel.so(_ZN4nest8simulateERKd+0xc3)[0x7f9f5ed837d3]
[inm6187:15729] [13] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libnestkernel.so(_ZNK4nest10NestModule16SimulateFunction7executeEP14SLIInterpreter+0x45)[0x7f9f5ed56b05]
[inm6187:15729] [14] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libsli.so(+0x743b3)[0x7f9f5eaa73b3]
[inm6187:15729] [15] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter8execute_Em+0x222)[0x7f9f5eaabc62]
[inm6187:15729] [16] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/../../../libsli.so(_ZN14SLIInterpreter7executeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x162)[0x7f9f5eaac202]
[inm6187:15729] [17] /home/keup/miniconda2/envs/nesttestmpi/lib/python3.9/site-packages/nest/pynestkernel.so(+0x2c01c)[0x7f9f5fa4c01c]
[inm6187:15729] [18] python(+0x198199)[0x55b0b4506199]
[inm6187:15729] [19] python(_PyEval_EvalFrameDefault+0x608)[0x55b0b454df88]
[inm6187:15729] [20] python(_PyFunction_Vectorcall+0x19a)[0x55b0b450bdea]
[inm6187:15729] [21] python(_PyEval_EvalFrameDefault+0x3ba)[0x55b0b454dd3a]
[inm6187:15729] [22] python(_PyFunction_Vectorcall+0x19a)[0x55b0b450bdea]
[inm6187:15729] [23] python(_PyObject_Call+0x10b)[0x55b0b44cce4b]
[inm6187:15729] [24] python(_PyEval_EvalFrameDefault+0x2eaf)[0x55b0b455082f]
[inm6187:15729] [25] python(+0x1388f0)[0x55b0b44a68f0]
[inm6187:15729] [26] python(_PyFunction_Vectorcall+0x336)[0x55b0b450bf86]
[inm6187:15729] [27] python(_PyEval_EvalFrameDefault+0x4c85)[0x55b0b4552605]
[inm6187:15729] [28] python(+0x1388f0)[0x55b0b44a68f0]
[inm6187:15729] [29] python(PyEval_EvalCodeWithName+0x47)[0x55b0b458cfd7]
[inm6187:15729] *** End of error message ***
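
A note on the failing address in the trace above: read as a signed 64-bit integer, 0xfffffffffffffff8 is -8, i.e. 8 bytes below address zero. That would be consistent with code stepping just before the start of a range, though this reading is an assumption and not something the trace itself establishes. The arithmetic, as a small Python sketch (the helper name as_signed_64 is made up for illustration):

```python
# Reinterpret the faulting address from the trace as a signed 64-bit integer.
fault_address = 0xFFFFFFFFFFFFFFF8


def as_signed_64(value: int) -> int:
    """Interpret an unsigned 64-bit value as a two's-complement signed value."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


offset = as_signed_64(fault_address)
print(offset)  # -8
```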

If, however, I reduce the number of neurons from 5000 to 4000, or if I install NEST via conda without OpenMPI support, it works just fine (and then also with much larger networks, e.g. 50000 iaf_psc_delta neurons).

system/installation:

  • OS: Ubuntu-18.04
  • Shell: bash
  • Python-Version: 3.9.1
  • NEST-Version: 2.20.0
  • Installation: conda with MPI

Since I don't need MPI on the laptop, the issue is no longer a direct problem for me. Thanks @AlexVanMeegen for pointing me in the right direction. He also suggested that opening an issue could be useful.

Best, Christian

@heplesser heplesser added I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: Normal Handle this with default priority T: Bug Wrong statements in the code or documentation labels Jan 4, 2021
@heplesser
Contributor

@ChristianKeup Thanks for reporting this! This seems rather strange indeed. According to the stack trace, the segfault occurs when NEST sorts connections before starting the simulation. This does not happen (if I remember right) when not using MPI. Some questions/suggestions:

  • Are you running with more than one MPI process (i.e., mpirun -np N python) or just serially?
  • Which Boost version are you using? CMake reports this.
  • Could you try with current master?
  • Could you try without Python, i.e., run the simulation from a SLI script?
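
The reading of the stack trace above rests on the mangled C++ names in frames 7 and 8. A minimal sketch of demangling them from Python, assuming the binutils tool c++filt is installed (the helper name demangle is made up for illustration):

```python
import shutil
import subprocess


def demangle(symbols):
    """Demangle Itanium-ABI C++ symbol names by piping them through c++filt."""
    tool = shutil.which("c++filt")
    if tool is None:
        raise RuntimeError("c++filt not found (it ships with binutils)")
    result = subprocess.run(
        [tool],
        input="\n".join(symbols) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Symbols copied from frames 7 and 8 of the reported trace.
frames = [
    "_ZN4nest17ConnectionManager16sort_connectionsEi",
    "_ZN4nest17SimulationManager32update_connection_infrastructureEi",
]
```

Calling demangle(frames) should give nest::ConnectionManager::sort_connections(int) and nest::SimulationManager::update_connection_infrastructure(int), which is where the "sorts connections before starting the simulation" reading comes from.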

@ChristianKeup
Contributor Author

ChristianKeup commented Jan 13, 2021

Hello,

  1. I ran the script only serially
  2. I am not sure whether the conda installation uses Boost, since I don't get the CMake output in that case. At least libboost is not among the packages installed in the conda environment. When compiling the current NEST master, I tried both with Boost 1.73 and without Boost.
  3. I compiled the current master with the MPI flag on, and indeed the error did not occur (both with and without Boost).
  4. There is also no error when I execute the SLI commands below (however, I'm not sure I used the Connect command correctly; I couldn't find a help page on it):

ResetKernel
/iaf_psc_delta 5000 Create /neurons Set
neurons neurons Connect
100.0 Simulate

Maybe it would be good to see if the issue is reproducible on another machine?
Best, Christian

@heplesser
Contributor

Thanks for the update! To build the same network with SLI, use

ResetKernel
/n /iaf_psc_delta 5000 Create def
n n << /rule /fixed_indegree /indegree 200 >> << /weight -1.0 >> Connect
100 Simulate

On my computer, that works fine. Could you try on yours?

My suspicion now would be that some libraries are mixed up in the Conda install. If anyone else with a Conda installation of NEST on Linux could test that would be useful.

@ChristianKeup
Contributor Author

Ah, thanks. This code works on the current master, but on the conda-installed version 2.20 it throws an error:

Connect [Error]: ArgumentType
The type of the second parameter did not match the argument(s) of this function.

Could it be that in version 2.20 the Connect function took different arguments?

Concerning your suspicion, this is also in line with what Moritz Helias guessed when I mentioned the issue to him. Maybe it links to a wrong version of OpenMPI or a related library.

@ChristianKeup
Contributor Author

Thanks @AlexVanMeegen for testing the issue on his laptop. With a conda install with MPI, he gets the same segmentation fault:

[inm6184:30818] *** Process received signal ***
[inm6184:30818] Signal: Segmentation fault (11)
[inm6184:30818] Signal code: Address not mapped (1)
[inm6184:30818] Failing at address: 0xfffffffffffffff8
[inm6184:30818] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7efeb6ea2980]
[inm6184:30818] [ 1] /home/vanmeegen/miniconda3/envs/bugnest/lib/python3.9/site-packages/nest/../../../libmodels.so(+0x4b5960)[0x7efe98bb1960]
.
.
.
[inm6184:30818] [29] python(_PyEval_EvalCodeWithName+0x47)[0x55fab61b3cf7]
[inm6184:30818] *** End of error message ***
Segmentation fault (core dumped)

He also pointed out that the very same problem may have been reported by Dominic Standage on the NEST user mailing list last year; the title of that mail was "Microcircuit example, segmentation fault".

@heplesser
Contributor

For NEST 2.20, which doesn't have the fancy node collections yet, the corresponding script is

/nn 5000 def
/iaf_psc_delta nn Create pop
/n 1 nn cvgidcollection def
n n << /rule /fixed_indegree /indegree 200 >> << /weight -1.0 >> Connect
100 Simulate

@steffengraber Since this seems to be a conda problem, could you take a look?

@ChristianKeup
Contributor Author

Thanks, this SLI code produced the same error as the Python script.
I also got the error when using fixed_outdegree instead:
n n << /rule /fixed_outdegree /outdegree 200 >> << /weight -1.0 >> Connect
But with all_to_all connectivity,
n n << /rule /all_to_all >> << /weight -1.0 >> Connect
the error did not appear.
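
The comparisons reported so far (4000 vs. 5000 neurons, and fixed_indegree vs. fixed_outdegree vs. all_to_all) can be scripted so that each case runs in its own interpreter and a crash in one does not stop the others. The following is only a sketch: the helper names are made up, the PyNEST calls are the ones from the original report, and it assumes a POSIX system on which the crashing process is terminated by SIGSEGV.

```python
import signal
import subprocess
import sys

# Reproducer from the original report, parametrised by size and connection rule.
REPRODUCER = """
import nest
nrns = nest.Create('iaf_psc_delta', {n})
nest.Connect(nrns, nrns, {conn!r}, {{'weight': -1.0}})
nest.Simulate(100)
"""

# Cases mentioned in this thread.
CASES = [
    (4000, {"rule": "fixed_indegree", "indegree": 200}),
    (5000, {"rule": "fixed_indegree", "indegree": 200}),
    (5000, {"rule": "fixed_outdegree", "outdegree": 200}),
    (5000, {"rule": "all_to_all"}),
]


def reproducer(n, conn):
    """Return the source of the reproducer script for one case."""
    return REPRODUCER.format(n=n, conn=conn)


def run_isolated(code, python=sys.executable):
    """Run code in a fresh interpreter and return its exit status."""
    return subprocess.run([python, "-c", code], capture_output=True).returncode


def died_of_segfault(returncode):
    """On POSIX, a return code of -N means the child was killed by signal N."""
    return returncode == -signal.SIGSEGV


def run_all(cases=CASES):
    """Run every case and print whether it crashed."""
    for n, conn in cases:
        status = run_isolated(reproducer(n, conn))
        verdict = "SEGFAULT" if died_of_segfault(status) else f"exit {status}"
        print(n, conn["rule"], verdict)
```

Calling run_all() inside the affected conda environment would print one line per case; nothing is executed until it is called.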

@heplesser
Contributor

Can you run the script in a debugger and report the stacktrace?
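
One possible way to do this non-interactively is gdb's batch mode. A minimal sketch, assuming gdb is installed; the script name segfault.py and the helper names are hypothetical:

```python
import subprocess
import sys


def gdb_backtrace_command(script, python=sys.executable):
    """Build a gdb command line that runs the script and prints a backtrace when it stops."""
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", python, script]


def run_under_gdb(script):
    """Run the script under gdb and return gdb's standard output."""
    result = subprocess.run(
        gdb_backtrace_command(script), capture_output=True, text=True
    )
    return result.stdout
```

For example, print(run_under_gdb("segfault.py")) would show the backtrace at the point of the crash. How much detail it contains depends on whether the libraries were built with debugging symbols.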

@hakonsbm
Contributor

hakonsbm commented Jan 14, 2021

I can reproduce the issue. The Conda package is built without debugging symbols, so gdb doesn't give much info, but points to the function void boost::sort::pdqsort_detail::pdqsort_loop<...>(...) as the point where it goes wrong.

There are two Conda packages for Python 3.9; they depend on Boost 1.72 and 1.74, respectively. With a NEST Conda package that doesn't depend on Boost (nest-simulator=2.20.0=mpi_openmpi_py38hb43900d_0), the script doesn't cause a segfault. You can see the dependencies with, for example, conda search conda-forge::nest-simulator=2.20.0=mpi_openmpi_* --info.
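
The dependency check can also be scripted. The sketch below filters the JSON form of the conda search command quoted above for Boost-related dependencies. It assumes that conda search --json returns a mapping from package name to a list of records with "build" and "depends" fields, and that Boost shows up in a dependency name containing "boost"; both are assumptions worth checking against the actual output. The helper names are made up.

```python
import json
import subprocess


def boost_dependencies(search_result):
    """Map each build string to the dependency entries that mention Boost."""
    found = {}
    for records in search_result.values():
        for record in records:
            deps = [d for d in record.get("depends", []) if "boost" in d.lower()]
            found[record["build"]] = deps
    return found


def conda_search(spec):
    """Run conda search for a package spec and return the parsed JSON output."""
    out = subprocess.run(
        ["conda", "search", spec, "--info", "--json"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return json.loads(out)
```

For example, boost_dependencies(conda_search("conda-forge::nest-simulator=2.20.0=mpi_openmpi_*")) would list, per build, which Boost packages (if any) it depends on; builds with an empty list would be the ones to try.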

So this looks like the Boost sorting problem that was fixed in #1502, a fix that is part of 2.20.1. Note that there are also packages for NEST 2.20.1, but none of them are compiled with MPI.

@ChristianKeup Can you also try with a version that doesn't depend on Boost, for example nest-simulator=2.20.0=mpi_openmpi_py38hb43900d_0?

@steffengraber
Contributor

@hakonsbm Thank you for sorting this out.
Because of some problems with MPI, we decided to no longer offer it in the conda packages and recommend installing NEST with MPI from source:
https://nest-simulator.readthedocs.io/en/nest-2.20.1/installation/linux_install.html

@terhorstd
Contributor

@steffengraber This should be documented and then closed. Moving to the Documentation project.

@heplesser
Contributor

Closing, as this was an external error and is resolved by dropping MPI from the conda package.

@heplesser heplesser added T: External bug Not an issue that can be solved here. (May need documentation, though) and removed T: Bug Wrong statements in the code or documentation labels Jun 30, 2021