The Initialisation Optimisation #2097

thorstenhater · 2023-03-15T14:56:18Z

What?

Add profiler mark-up to initialisation
Brunel
- Fix a bug in the example if we cannot add enough connections
- Eliminate data movement in connections_on
Do some minor optimisations in label resolution and then ...
... replace std::visit w/ std::get/std::holds_alternative. I'll the numbers do the talking
Add busyring from NSUITE to examples
- Add example workloads for different scenarios
- This is mainly to use these for profilining.
Go nuts and refactor fvm_layout.cpp :/
Notice that much time in initialise is lost due minuscule allocations for piecewise push_back
- Cause: Many, many calls to pw_zip_with due to integrate
- Add reserve/resize where we can.
- Slate problem to be fixed once we get better access to std::pmr. GCC has it, Clang 16 will have it.

Impact

Before

✦ ❯ bin/brunel -m 5000 -n 5000 -t 10 -G 10000
==========================================
  Brunel model miniapp
  - distributed : 1 (mpi)
  - threads     : 8
  - gpus        : no
==========================================
REGION                           CALLS     WALL    THREAD        %
root                                 -    5.019    40.148    100.0
  init                               -    3.510    28.082     69.9

    communicator                     -    3.492    27.940     69.6
      update                         -    3.492    27.940     69.6
        connections                  1    2.796    22.365     55.7
        gid_connections              1    0.367     2.938      7.3
        sort_connections             1    0.330     2.636      6.6
        collect_gids                 1    0.000     0.000      0.0
        index                        1    0.000     0.000      0.0
        clear                        1    0.000     0.000      0.0
    simulation                       -    0.018     0.142      0.4
      update                         -    0.012     0.099      0.2
        generators                   1    0.012     0.099      0.2
      resolvers                      1    0.004     0.030      0.1
      group                          -    0.001     0.007      0.0
        factory                      1    0.001     0.007      0.0
        targets_and_sources          1    0.000     0.000      0.0
      source                         -    0.000     0.003      0.0
        MPI                          1    0.000     0.003      0.0
      sources                        1    0.000     0.002      0.0
      comm                           1    0.000     0.000      0.0
  communication                      -    1.370    10.957     27.3
    enqueue                          -    1.343    10.742     26.8
      setup                    4000000    0.537     4.299     10.7
      tree                     2000000    0.479     3.831      9.5
      sort                     2000000    0.214     1.715      4.3
      merge                    2000000    0.112     0.898      2.2
    walkspikes                     200    0.026     0.212      0.5
    exchange                         -    0.000     0.002      0.0
      gather                       200    0.000     0.001      0.0
      sort                         200    0.000     0.001      0.0
      gatherlocal                  200    0.000     0.001      0.0
    spikeio                        200    0.000     0.000      0.0
  advance                            -    0.139     1.110      2.8
    lif                            200    0.139     1.109      2.8
    spikes                         200    0.000     0.000      0.0


There were 10000 spikes

---- meters -------------------------------------------------------------------------------
meter                         time(s)
-------------------------------------------------------------------------------------------
setup                           0.000
model-init                     28.245
model-simulate                  2.672
meter-total                    30.918

After

✦ ❯ bin/brunel -m 5000 -n 5000 -t 10 -G 10000
==========================================
  Brunel model miniapp
  - distributed : 1 (mpi)
  - threads     : 8
  - gpus        : no
==========================================
REGION                           CALLS     WALL    THREAD        %
root                                 -    0.450     3.601    100.0
  communication                      -    0.236     1.886     52.4
    enqueue                          -    0.231     1.849     51.4
      setup                    4000000    0.110     0.883     24.5
      tree                     2000000    0.072     0.576     16.0
      sort                     2000000    0.030     0.239      6.6
      merge                    2000000    0.019     0.152      4.2
    walkspikes                     200    0.005     0.036      1.0
    exchange                         -    0.000     0.000      0.0
      gather                       200    0.000     0.000      0.0
      gatherlocal                  200    0.000     0.000      0.0
      sort                         200    0.000     0.000      0.0
    spikeio                        200    0.000     0.000      0.0
  init                               -    0.198     1.588     44.1
    communicator                     -    0.195     1.564     43.4
      update                         -    0.195     1.564     43.4
        connections                  1    0.122     0.977     27.1
        gid_connections              1    0.048     0.381     10.6
        sort_connections             1    0.026     0.206      5.7
        collect_gids                 1    0.000     0.000      0.0
        clear                        1    0.000     0.000      0.0
        index                        1    0.000     0.000      0.0
    simulation                       -    0.003     0.024      0.7
      update                         -    0.003     0.021      0.6
        generators                   1    0.003     0.021      0.6
      resolvers                      1    0.000     0.002      0.0
      group                          -    0.000     0.001      0.0
        factory                      1    0.000     0.001      0.0
        targets_and_sources          1    0.000     0.000      0.0
      source                         -    0.000     0.000      0.0
        MPI                          1    0.000     0.000      0.0
      sources                        1    0.000     0.000      0.0
      comm                           1    0.000     0.000      0.0
  advance                            -    0.016     0.127      3.5
    lif                            200    0.016     0.127      3.5
    spikes                         200    0.000     0.000      0.0


There were 10000 spikes

---- meters -------------------------------------------------------------------------------
meter                         time(s)
-------------------------------------------------------------------------------------------
setup                           0.000
model-init                      1.613
model-simulate                  1.427
meter-total                     3.040

How?

See here
https://stackoverflow.com/questions/57726401/stdvariant-vs-inheritance-vs-other-ways-performance
std::visit has horrible performance.

boeschf · 2023-03-15T16:06:29Z

Thanks for this! A couple remarks:

Could you group the profiler timings a bit differently by grouping everything that is not "init" under a common tag (maybe "run" or the like)?
Can you test what happens if the cell_groups are larger than 1 cell (i.e. the default when using the GPU?)

thorstenhater · 2023-03-15T16:15:34Z

Is already done: I did not observe any difference (using up to 8 threads) when setting group size = 1 (default) and = 10000 (that in the benchmark above -G)

thorstenhater · 2023-03-15T18:19:11Z

Is a reasonable request, but requires re-writing all the profiler tags. I'd rather make a new PR fixing the problem's root cause: We cannot have nested regions. That OK for you?

* Split fvm_layout into multiple phases. * Run system profiler on busyring with pure initialisation workload * Conclude we spend _all_ of our time in minuscule allocations * These are caused by integrate_* > pw_zip_with > push_back.

arbor/util/maputil.hpp

arbor/communication/communicator.cpp

python/recipe.cpp

arbor/fvm_layout.cpp

boeschf · 2023-03-21T15:28:51Z

arbor/fvm_layout.cpp

Thanks for refactoring! Big effort, but improves quality considerably.

At a certain point it was hard to stop. I am still a bit annoyed by my inability to get rid of a bunch of useless
allocations:

pw_over_cable

integrate_area

pw_zip_with

all these allocate tons of small, intermediate vectors. Like someone with a taste for functional languages and
their use of list-as-control-structure wrote that part (@halfflat, maybe?). It's not BAD, just 10% or so, but that
whole thing (=initialize) dies a death of a million papercuts.

maybe if pw_elements wasn't a container but a range adaptor it'd be easier to get rid of those allocations, but I haven't looked closely. For now I think this is good enough.

Me too, or I'd still be trying.

example/busyring/CMakeLists.txt

thorstenhater added 3 commits March 10, 2023 15:45

Try to squeeze initialize.

24bd4e8

Optimise and fix(!) brunel example. Improve label resolution init 10x.

819116a

Add connection to seen.

ac1f929

thorstenhater added 14 commits March 16, 2023 13:47

Add busyring.

9721d0e

Smaller optimisations in pw.

36cc771

Refactor fvm-layouting for profiling and sanity.

ab8eefe

* Split fvm_layout into multiple phases. * Run system profiler on busyring with pure initialisation workload * Conclude we spend _all_ of our time in minuscule allocations * These are caused by integrate_* > pw_zip_with > push_back.

Next round.

e6b84ed

Some simplification, fix tests.

b3bf7a6

the tyrant demands his due.

88f703d

Smaller optimisations after playing a bit with PMR.

c11e1a9

Simplify mech verification and make_*.

127ab95

Slight touchup.

ce5f14e

Compression.

cafd4d1

Compression part II: streamline parameters namings and defaultings.

51cafbc

Compression III: collate arguments.

b05eb14

Constexpr is too restrictive.

ab9a485

Final clean-up.

9e2fad7

boeschf reviewed Mar 21, 2023

View reviewed changes

thorstenhater added 2 commits March 21, 2023 20:41

Better readme, clean up some left-overs.

211bc1f

Templates are hard.

ff9da80

boeschf approved these changes Mar 22, 2023

View reviewed changes

thorstenhater merged commit 9b4cf86 into arbor-sim:master Mar 22, 2023

thorstenhater deleted the perf/the-initialisation-optimisation branch March 22, 2023 09:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The Initialisation Optimisation #2097

The Initialisation Optimisation #2097

thorstenhater commented Mar 15, 2023 •

edited

Loading

boeschf commented Mar 15, 2023

thorstenhater commented Mar 15, 2023

thorstenhater commented Mar 15, 2023

boeschf Mar 21, 2023

thorstenhater Mar 21, 2023

boeschf Mar 22, 2023 •

edited

Loading

thorstenhater Mar 22, 2023

The Initialisation Optimisation #2097

The Initialisation Optimisation #2097

Conversation

thorstenhater commented Mar 15, 2023 • edited Loading

What?

Impact

Before

After

How?

boeschf commented Mar 15, 2023

thorstenhater commented Mar 15, 2023

thorstenhater commented Mar 15, 2023

boeschf Mar 21, 2023

Choose a reason for hiding this comment

thorstenhater Mar 21, 2023

Choose a reason for hiding this comment

boeschf Mar 22, 2023 • edited Loading

Choose a reason for hiding this comment

thorstenhater Mar 22, 2023

Choose a reason for hiding this comment

thorstenhater commented Mar 15, 2023 •

edited

Loading

boeschf Mar 22, 2023 •

edited

Loading