Python/concurrency defaults #1979
Conversation
…eads by default when no mpi comm is provided
Hi @brenthuisman, thanks, that might save a few people a question on the help board. However, the tests run into a timeout.
I restarted the job a few times now; it's the Python examples running on the pip source install that do not terminate. The logs are not very helpful (we need to print script filenames to see which one is actually not terminating)! I'll try to take a look today or tomorrow. Update: the offending script is
Interesting ... can you figure out where exactly it spins?
Ok, this appears exactly in this constellation
and results in an uninterruptible hang in
Further: spike recording or the number of connections is not relevant. Yes,
instead of
Other threaded blocks here are ok.
Ok, found the perpetrator. See #1984.
This should be unblocked now.
Seems to hang in the Python examples again.
Merge remote-tracking branch 'upstream/master' into python/concurrency_defaults
Runs OK now locally!
Nice! A few changes I think we should have before merging:
- Please pull out the pybind version bump.
- Make C++ behave like we'd like Python to.
- Have 'avail_threads' even in the presence of a GPU.
If you like, you can also make this even fancier by making it play nice with MPI, but that's not necessary to merge.
python/simulation.cpp
Outdated
@@ -210,7 +212,7 @@ void register_simulation(pybind11::module& m, pyarb_global_ptr global_ptr) {
const std::optional<arb::domain_decomposition>& decomp,
std::uint64_t seed) {
try {
auto ctx = ctx_ ? ctx_ : std::make_shared<context_shim>(arb::make_context());
auto ctx = ctx_ ? ctx_ : std::make_shared<context_shim>(arb::make_context({arbenv::thread_concurrency(),-1}));
So, under the idea that Python and C++ APIs should co-evolve, I am against this. Keep it at the default here and change the default in C++. But still keep the option to ask for 'avail_threads' explicitly.
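For reference, a minimal sketch of the explicit opt-in referred to here, using the string form that appears in the docs and example excerpts quoted later in this thread (whether that spelling survives the final PR is part of the discussion below):

import arbor

# "avail_threads" asks the context to query and use the maximum number of
# threads the system makes available (string spelling as in the quoted docs).
ctx = arbor.context(threads="avail_threads")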
So, also in C++, make avail_threads an option, you mean?
Well, currently it is, right, via thread_concurrency?
Ah, sure, but it isn't a default (yet). I thought that was your request?
I was thinking that the best place to set this default (once) is in proc_allocation, but no, because it does not know about MPI. So, the default context constructors should set num_threads to arbenv::thread_concurrency(), if MPI is not enabled.
In the interest of keeping things similar, should the variable in Python be called thread_concurrency, and be a member of arbor? E.g. arbor.context(arbor.thread_concurrency(), None)? I like it!
Should be in arbor.env, yes, but that's the vision.
The C++ c'tor could be this:
context(std::optional<int> threads={},
        std::optional<int> gpu_id={},
        std::optional<MPI_Comm> comm={}) {
    if (!threads && !comm) {
        // no explicit request, no MPI: pick a sensible local default
    }
    else if (!threads && comm) {
        // no explicit request, MPI given: default per-rank thread count
    }
    else if ...
I know that gpu_id=-1 is used instead of optional, and that we cannot mention the type MPI_Comm w/o MPI=ON, but this is the type we should have in mind.
Unlike in Python, there is no explicit constructor in C++; it all goes through proc_allocation: execution_context(const proc_allocation& resources = proc_allocation{}). Also, having this in C++ would make (all of) Arbor dependent on arbenv.
Since there is currently no way to do something like auto ctx = arbor::context{gpu_id=1}, we could set proc_allocation.num_threads by changing the non-MPI constructor to execution_context(const proc_allocation& resources = proc_allocation{arbenv::thread_concurrency(),-1});
To the extent that execution_context construction maps to Python, it sets num_threads under the same circumstances.
> Unlike in Python, there is no explicit constructor in C++, it all goes through proc_allocation: execution_context(const proc_allocation& resources = proc_allocation{}).
You could call proc_allocation a set of kwargs. Once we have achieved C++20 in its full glory, this becomes a thing:
execution_context({.threads=4, .comm=MPI_COMM_WORLD})
> Also, having this in C++ would make (all of) Arbor dependent on arbenv.
That's fine; to a certain extent, arbenv is a library made for Arbor to interact with the environment. As long as we do not pull it into the deepest layers.
Also, nothing is stopping us from making new, purpose-built c'tors, e.g.
context from_mpi_one_thread_per_logical_core(MPI_Comm& comm, int gpu_id=-1);
context from_mpi_one_thread_per_physical_core(MPI_Comm& comm, int gpu_id=-1);
// ...
Also, if proc_allocation is really too much of a hassle to have around, feel free to remove it.
I've updated the code as per my earlier suggestion, in the interest of merging this now.
- Removal of proc_allocation is fine by me, but I'm not a user. The C++ tests do use it, so that's why I have not suggested removal.
- MPI cleverness: to me that can be seen as a separate issue (the PR was initially written to get a better default for local use). I'll ask in the channel if users are missing something.
doc/python/hardware.rst
Outdated
The number of threads available locally for execution. Must be set to 1 at minimum. 1 by default.
Passing ``"avail_threads"`` (as string) will query and use the maximum number of threads the system makes available.

The number of threads available locally for execution. Defaults to the maximum number of threads the system makes available if gpu_id and mpi are not set, else defaults to 1.
Considering the rest of the change set, how about digging in a bit deeper and coming up with sensible defaults for all cases? It's not a requirement to merge, but if we are going to put it in eventually, why not in 0.8?
I don't know whether arbenv has the proper calls, but
- MPI: figure out how many threads a task is allowed to use (using hwloc maybe) and set that when handed 'avail_threads'. Other than that, there are OS-specific calls (which hwloc uses); we would need to consider macOS and Linux.
- GPU: can just use 'avail_threads'; it speeds up construction a bit. Independent of the change above. If 'avail_threads' receives super-charged capabilities wrt MPI, it'll benefit; if not, then not.
- MPI+GPU: do whatever MPI does. Likely we are using one task per GPU.
My initial idea was to only touch the default constructor, to provide the best experience for novice users.
If you'd like to touch more than the default constructor: we could initialize threads to thread_concurrency. It's straightforward for the user to understand, but, in particular when MPI is used, it may result in surprising behavior.
Well, the experienced folks appreciate less typing, too. So, yes, thread_concurrency is fine, but its interaction with MPI must be understood and documented. A sensible approach could be to try and use thread_concurrency/MPI_size. BUT if we are on an HPC cluster, which should be 90% of cases when MPI is in play, someone might have set up a thread mask accordingly already. Some time ago we looked at OMP_NUM_THREADS, which is used and exported by SLURM and its ilk.
MPI config is its own world of expertise, and I wager normal users stick to 1 thread per rank (I have never seen or done otherwise, unless it was you trying to squeeze out every last drop).
So, my preference is to default threads to thread_concurrency unless mpi is set, in which case it stays 1 (see the sketch below). If cleverer people than I come up with a thought-out proposal for MPI, then I'm happy to consider it. Agree?
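Written out as plain Python, the rule being agreed on here looks roughly like the sketch below; os.cpu_count() stands in for arbenv::thread_concurrency(), and none of this is Arbor's actual implementation:

import os

def default_threads(mpi_comm=None, requested=None):
    # Sketch of the proposed defaulting rule, not library code.
    if requested is not None:
        return requested            # an explicit thread count always wins
    if mpi_comm is not None:
        return 1                    # with MPI, stay at 1 thread per rank
    return os.cpu_count() or 1      # local runs: one thread per available core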
As I said, I'd be fine with merging this part as is, but I consider the extra effort of making it extra nice for the user interesting. I think the common patterns are
- No MPI, 1 thread per (logical) core
  - with GPU
  - without GPU
- MPI, still 1 thread per (logical) core, but now each task gets handed 1/n-th, where n is the number of tasks according to
  - 1 task per socket
  - 1 task per GPU
So, the approach is to compute
- the (logical) core count: arbenv::cores(logical=true|false)
- the number of tasks: handed to you via Comm_size, defaults to 1
and context or thread_concurrency could be instructed whether to distribute threads over tasks or not; see the sketch after this list.
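A rough sketch of that computation, with os.cpu_count() standing in for arbenv::cores() and n_tasks standing in for the MPI Comm_size value (1 without MPI); counting physical rather than logical cores would need something like hwloc:

import os

def threads_per_task(n_tasks=1):
    # Sketch only: distribute the locally available (logical) cores over the
    # MPI tasks placed on this node, never dropping below one thread.
    cores = os.cpu_count() or 1
    return max(1, cores // n_tasks)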
See the note on MPI above.
As I said, it'd be nice to have rather than critical. I am personally fine with typing these out, but as we were sanding off some rougher edges, it offered an easy win.
I'd want to test this out, but for this week I don't think I'll have the time. Making the release will almost certainly eat up the rest of my hours, and I prefer not to postpone it further. Maybe a small discussion in the team meeting about removing proc_alloc and having these new constructors?
As said, optional, and only if proc_alloc blocks you from achieving more.
C++ was left out, since I assumed (some people in the past certainly argued so) that there should be no magic. I'm OK with keeping them aligned, just want to make sure this has changed.
For me, alignment beats the no_magic paradigm, so if there must be magic, it should work the same way everywhere.
…ncurrency threads. Update docs and tests.
Hmm, I see a CI build error pop up that I didn't see locally:
So, how do we handle this for the shared variant? Do we need that variant for anything (I don't recall noticing it before)?
Force-pushed from fdbc3e2 to b661fd1.
…arg construction.
Updated, changed. PR message reflects the major changes.
Shaping up to be a nice addition; some comments.
doc/concepts/interconnectivity.rst
Outdated
These are largely unrelated formatting changes? Fine, but confusing to review.
True-ish, I had to touch hardware.rst and discovered some things neglected/incorrect in #2001.
@@ -214,6 +214,7 @@ TEST(SPIKES_TEST_CLASS, threshold_watcher_interpolation) {
dict.set("mid", arb::ls::on_branches(0.5));

arb::proc_allocation resources;
resources.num_threads = 1;
Why one?
Unless testing the defaults, I like tests to be explicit. default_concurrency() would also be fine.
@@ -152,6 +152,7 @@ namespace {
TEST(domain_decomposition, homogenous_population)
{
proc_allocation resources;
resources.num_threads = 1;
Same as below.
@@ -219,6 +220,7 @@ TEST(domain_decomposition, homogenous_population)
TEST(domain_decomposition, heterogenous_population)
{
proc_allocation resources;
resources.num_threads = 1;
Same as below.
python/example/network_ring.py
Outdated
# (12) Create an execution context using all locally available threads and simulation
ctx = arbor.context(threads="avail_threads")
sim = arbor.simulation(recipe, ctx)
# (12) Create a simulation using all locally available threads
Yeah, it's short, yet less clear what's going on.
# (12) Create a simulation using all locally available threads
# (12) Create a simulation using the default settings:
# - Use all threads available
# - Use round-robin distribution of cells across groups with one cell per group
# - No GPU
# - No MPI
# - Empty label dictionary
# Other constructors of simulation can be used to change all of these.
Good point. Slightly modified in upcoming commit. Label dict left out; it's set elsewhere.
@@ -66,6 +66,17 @@ std::optional<T> py2optional(pybind11::object o, const char* msg) {
return o.is_none()? std::nullopt: std::optional<T>(std::move(value));
}

template <typename T, typename F>
template <typename T, typename F>
std::optional<T> optional_when(T&& o, F&& pred) {
    // Only construct the optional when the predicate holds.
    if (pred(o)) {
        return {std::forward<T>(o)};
    } else {
        return {};
    }
}
- A slightly more descriptive name
- Do not construct the optional if not needed
- Play nicely with rvalue refs, lvalue refs, and values.
@@ -136,8 +124,10 @@ void register_contexts(pybind11::module& m) {
"bind_procs"_a=false,
"bind_threads"_a=false,
"Construct an allocation with arguments:\n"
" threads: The number of threads available locally for execution. Must be set to 1 at minimum. 1 by default.\n"
" gpu_id: The identifier of the GPU to use, None by default.\n")
" threads: The number of threads available locally for execution. Must be set to 1 at minimum. 1 by default.\n"
Might this be a symbolic constant or string? Such as 'avail_threads'?
Available as arbor.env.default_concurrency().
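For illustration, a hedged usage sketch; the module path and function name are taken from the comment above and should be treated as the binding proposed in this PR, not a long-stable API:

import arbor

# Ask for exactly the number of threads the environment reports as usable.
ctx = arbor.context(threads=arbor.env.default_concurrency(), gpu_id=None)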
arb::proc_allocation allocation() const {
return arb::proc_allocation(num_threads, gpu_id.value_or(-1), bind_procs, bind_threads);
void proc_allocation_shim::set_num_threads(unsigned threads) {
if (0==threads) {
Don't we throw this from C++?
arb::proc_allocation does not do any checking. Nor does anything else, for that matter. Only the pyobject ever did.
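A small illustration of the behavior described here, assuming the Python proc_allocation wrapper accepts a threads keyword as in the excerpt above (the exception type is not verified, hence the broad catch):

import arbor

try:
    alloc = arbor.proc_allocation(threads=0)   # the Python shim rejects 0 threads
except (ValueError, RuntimeError) as err:
    print("rejected by the Python layer:", err)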
set_property(SOURCE config.cpp PROPERTY COMPILE_DEFINITIONS ARB_DATA="${CMAKE_INSTALL_DATAROOTDIR}" APPEND)
set_property(SOURCE config.cpp PROPERTY COMPILE_DEFINITIONS ARB_CXX_COMPILER="${CMAKE_CXX_COMPILER}" APPEND)
set_property(SOURCE config.cpp PROPERTY COMPILE_DEFINITIONS ARB_PREFIX="${CMAKE_INSTALL_PREFIX}" APPEND)
set_property(SOURCE config.cpp PROPERTY COMPILE_DEFINITIONS ARB_BINARY="${CMAKE_INSTALL_BINDIR}" APPEND)
What changed here?
Forgot to list it in the PR message (added): arbor.config() now returns timestamp and python_lib_path. The prefix need not be where CMake installed the package (and for pip installs it isn't; it'll typically be a venv).
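For reference, a sketch of inspecting those fields; the key names follow the comment above, and the full contents of the returned dictionary may differ:

import arbor

cfg = arbor.config()                 # build/runtime configuration as a dict
print(cfg.get("timestamp"))          # build timestamp, as mentioned above
print(cfg.get("python_lib_path"))    # where the Python package actually lives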
arbor/execution_context.hpp
Outdated
@@ -25,7 +25,7 @@ struct ARB_ARBOR_API execution_context {
task_system_handle thread_pool;
gpu_context_handle gpu;

execution_context(const proc_allocation& resources = proc_allocation{});
execution_context(const proc_allocation& resources = proc_allocation{1,-1});
I'd like a symbolic constant here, i.e. adding
constexpr static int gpu_nil_id = -1;
to context.hpp
One weirdness ... Other than that good.
# (12) Create a simulation using the default settings:
# - Use all threads available
# - Use round-robin distribution of cells across groups with one cell per group
# - Use GPU if present
Do we? Boy we are smart ;)
- Rename create_context to make_context_shim, to align with arb::make_context.
- make_context_shim initializes a context with no MPI and an arbenv::default_allocation(). This is a proc_allocation with arbenv::thread_concurrency() threads and an arbenv::default_gpu(). This no-arg constructor is not exposed and is used only by the no-arg arbor.context() constructor and the single-arg (=recipe) arbor.simulation() constructor.
- Expose arbenv in arbor.env, e.g. arbor.env.thread_concurrency() binds arbenv::thread_concurrency().
- [x] Align Pybind minver in python/CMakeLists.txt with elsewhere.
- doc/cpp/hardware.rst
- doc/concept/interconnectivity.rst
- proc_allocation printer for 📊 Pin threads #2094