Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The Initialisation Optimisation #2097

Conversation

thorstenhater
Copy link
Contributor

@thorstenhater thorstenhater commented Mar 15, 2023

What?

  • Add profiler mark-up to initialisation
  • Brunel
    • Fix a bug in the example if we cannot add enough connections
    • Eliminate data movement in connections_on
  • Do some minor optimisations in label resolution and then ...
  • ... replace std::visit w/ std::get/std::holds_alternative. I'll the numbers do the talking
  • Add busyring from NSUITE to examples
    • Add example workloads for different scenarios
    • This is mainly to use these for profilining.
  • Go nuts and refactor fvm_layout.cpp :/
  • Notice that much time in initialise is lost due minuscule allocations for piecewise push_back
    • Cause: Many, many calls to pw_zip_with due to integrate
    • Add reserve/resize where we can.
    • Slate problem to be fixed once we get better access to std::pmr. GCC has it, Clang 16 will have it.

Impact

Before

✦ ❯ bin/brunel -m 5000 -n 5000 -t 10 -G 10000
==========================================
  Brunel model miniapp
  - distributed : 1 (mpi)
  - threads     : 8
  - gpus        : no
==========================================
REGION                           CALLS     WALL    THREAD        %
root                                 -    5.019    40.148    100.0
  init                               -    3.510    28.082     69.9

    communicator                     -    3.492    27.940     69.6
      update                         -    3.492    27.940     69.6
        connections                  1    2.796    22.365     55.7
        gid_connections              1    0.367     2.938      7.3
        sort_connections             1    0.330     2.636      6.6
        collect_gids                 1    0.000     0.000      0.0
        index                        1    0.000     0.000      0.0
        clear                        1    0.000     0.000      0.0
    simulation                       -    0.018     0.142      0.4
      update                         -    0.012     0.099      0.2
        generators                   1    0.012     0.099      0.2
      resolvers                      1    0.004     0.030      0.1
      group                          -    0.001     0.007      0.0
        factory                      1    0.001     0.007      0.0
        targets_and_sources          1    0.000     0.000      0.0
      source                         -    0.000     0.003      0.0
        MPI                          1    0.000     0.003      0.0
      sources                        1    0.000     0.002      0.0
      comm                           1    0.000     0.000      0.0
  communication                      -    1.370    10.957     27.3
    enqueue                          -    1.343    10.742     26.8
      setup                    4000000    0.537     4.299     10.7
      tree                     2000000    0.479     3.831      9.5
      sort                     2000000    0.214     1.715      4.3
      merge                    2000000    0.112     0.898      2.2
    walkspikes                     200    0.026     0.212      0.5
    exchange                         -    0.000     0.002      0.0
      gather                       200    0.000     0.001      0.0
      sort                         200    0.000     0.001      0.0
      gatherlocal                  200    0.000     0.001      0.0
    spikeio                        200    0.000     0.000      0.0
  advance                            -    0.139     1.110      2.8
    lif                            200    0.139     1.109      2.8
    spikes                         200    0.000     0.000      0.0


There were 10000 spikes

---- meters -------------------------------------------------------------------------------
meter                         time(s)
-------------------------------------------------------------------------------------------
setup                           0.000
model-init                     28.245
model-simulate                  2.672
meter-total                    30.918

After

✦ ❯ bin/brunel -m 5000 -n 5000 -t 10 -G 10000
==========================================
  Brunel model miniapp
  - distributed : 1 (mpi)
  - threads     : 8
  - gpus        : no
==========================================
REGION                           CALLS     WALL    THREAD        %
root                                 -    0.450     3.601    100.0
  communication                      -    0.236     1.886     52.4
    enqueue                          -    0.231     1.849     51.4
      setup                    4000000    0.110     0.883     24.5
      tree                     2000000    0.072     0.576     16.0
      sort                     2000000    0.030     0.239      6.6
      merge                    2000000    0.019     0.152      4.2
    walkspikes                     200    0.005     0.036      1.0
    exchange                         -    0.000     0.000      0.0
      gather                       200    0.000     0.000      0.0
      gatherlocal                  200    0.000     0.000      0.0
      sort                         200    0.000     0.000      0.0
    spikeio                        200    0.000     0.000      0.0
  init                               -    0.198     1.588     44.1
    communicator                     -    0.195     1.564     43.4
      update                         -    0.195     1.564     43.4
        connections                  1    0.122     0.977     27.1
        gid_connections              1    0.048     0.381     10.6
        sort_connections             1    0.026     0.206      5.7
        collect_gids                 1    0.000     0.000      0.0
        clear                        1    0.000     0.000      0.0
        index                        1    0.000     0.000      0.0
    simulation                       -    0.003     0.024      0.7
      update                         -    0.003     0.021      0.6
        generators                   1    0.003     0.021      0.6
      resolvers                      1    0.000     0.002      0.0
      group                          -    0.000     0.001      0.0
        factory                      1    0.000     0.001      0.0
        targets_and_sources          1    0.000     0.000      0.0
      source                         -    0.000     0.000      0.0
        MPI                          1    0.000     0.000      0.0
      sources                        1    0.000     0.000      0.0
      comm                           1    0.000     0.000      0.0
  advance                            -    0.016     0.127      3.5
    lif                            200    0.016     0.127      3.5
    spikes                         200    0.000     0.000      0.0


There were 10000 spikes

---- meters -------------------------------------------------------------------------------
meter                         time(s)
-------------------------------------------------------------------------------------------
setup                           0.000
model-init                      1.613
model-simulate                  1.427
meter-total                     3.040

How?

See here
https://stackoverflow.com/questions/57726401/stdvariant-vs-inheritance-vs-other-ways-performance
std::visit has horrible performance.

@boeschf
Copy link
Contributor

boeschf commented Mar 15, 2023

Thanks for this! A couple remarks:

  • Could you group the profiler timings a bit differently by grouping everything that is not "init" under a common tag (maybe "run" or the like)?
  • Can you test what happens if the cell_groups are larger than 1 cell (i.e. the default when using the GPU?)

@thorstenhater
Copy link
Contributor Author

  1. Is already done: I did not observe any difference (using up to 8 threads) when setting group size = 1 (default) and = 10000 (that in the benchmark above -G)

@thorstenhater
Copy link
Contributor Author

  1. Is a reasonable request, but requires re-writing all the profiler tags. I'd rather make a new PR fixing the problem's root cause: We cannot have nested regions. That OK for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring! Big effort, but improves quality considerably.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a certain point it was hard to stop. I am still a bit annoyed by my inability to get rid of a bunch of useless
allocations:

  • pw_over_cable
  • integrate_area
  • pw_zip_with

all these allocate tons of small, intermediate vectors. Like someone with a taste for functional languages and
their use of list-as-control-structure wrote that part (@halfflat, maybe?). It's not BAD, just 10% or so, but that
whole thing (=initialize) dies a death of a million papercuts.

Copy link
Contributor

@boeschf boeschf Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe if pw_elements wasn't a container but a range adaptor it'd be easier to get rid of those allocations, but I haven't looked closely. For now I think this is good enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too, or I'd still be trying.

@thorstenhater thorstenhater merged commit 9b4cf86 into arbor-sim:master Mar 22, 2023
@thorstenhater thorstenhater deleted the perf/the-initialisation-optimisation branch March 22, 2023 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants