Use number of physical not logical cores for auto nthreads? #43692
Comments
I agree. My understanding is that determining the number of physical cores is surprisingly difficult, which is the main reason we haven't done this yet.
Related: JuliaLang/LinearAlgebra.jl#671. This is about BLAS, but the main problem is the same: determining the number of physical cores without pulling in yet another external dependency.
To make matters worse, we now officially do need to care about big/little designs, since the M1 and 12th-gen Intel chips use them.
It may very well be the case that we should ship hwloc.
How big is hwloc? Also, is it generally the same size on all operating systems and architectures?
That's not too bad. |
Are there other things that we can use hwloc for, besides just counting the number of physical cores? The more uses we can get out of hwloc, the more compelling the argument is for shipping hwloc with Julia. |
Yeah, there are other interesting things one can do with hwloc, but I'm not sure how much they matter for Base. One word of caution: hwloc is being used by quite a few JLLs, and we should use it as a static library to avoid pinning the entire ecosystem to one version.
We can also query the cache hierarchy and NUMA nodes. In principle, we can have a more "intelligent" scheduler that uses knowledge like this. I'd guess the allocator/GC can do something interesting too. |
LLVM should provide this, too: |
If anyone wants to play around with hwloc I setup a minimal deps file to get a static library in https://github.com/JuliaLang/julia/tree/vc/hwloc |
Oh, this is cool. I didn't know this. But I was thinking of more detailed info, like which cores share which L3 (not all CPUs share L3 on a per-socket basis). Also, since we've decoupled codegen and the runtime into separate libraries, I don't think we want to use LLVM in the runtime.
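On Linux this kind of cache-sharing information is also exposed through sysfs, without hwloc. Below is a minimal sketch (not anything shipped in Julia) that reads which logical CPUs share each L3 cache; it assumes the standard Linux `/sys/devices/system/cpu/*/cache` layout and degrades to an empty list on systems without it:

```python
import glob
import os

def l3_sharing_groups():
    """Sketch: read which logical CPUs share each L3 cache from Linux sysfs.

    Assumes the standard sysfs cache-topology layout; returns an empty
    list where it is absent (macOS, Windows, some stripped containers).
    """
    groups = set()
    for cache_dir in glob.glob("/sys/devices/system/cpu/cpu*/cache/index*"):
        try:
            with open(os.path.join(cache_dir, "level")) as f:
                if f.read().strip() != "3":
                    continue
            # shared_cpu_list is a range string like "0-7" or "0,2,4,6"
            with open(os.path.join(cache_dir, "shared_cpu_list")) as f:
                groups.add(f.read().strip())
        except OSError:
            continue
    return sorted(groups)
```

Each distinct `shared_cpu_list` string corresponds to one L3 domain, which is the granularity a cache-aware scheduler would care about.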
Are you building hwloc from source here? Is there a way that we can use the pre-built Ygg binaries for hwloc, but not interfere with the Hwloc_jll package? Because as you mentioned above, we don't want to force everyone to use the same version of Hwloc_jll (the way that the stdlib JLLs currently do). Maybe that's a question for @staticfloat |
I'm not sure Hwloc_jll builds a static library (can't check now, away from computer). |
Hwloc_jll does not build a static library right now. But we could, and could then download/extract it and use it just like everything else. That being said, if there were a way to get equivalent information from LLVM, I'd definitely prefer that, as 1.7MB (the size of the hwloc library) is not negligible.
Given that there are some efforts to parse cgroups in libuv (libuv/libuv#2323), I don't think it's crazy to parse cgroup and proc in libuv to get the core count on Linux. I don't know about Windows, though. I also don't know what other kinds of magic Hwloc has.
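To illustrate the proc-parsing route: on Linux, `/proc/cpuinfo` exposes a `physical id` and `core id` per logical CPU, and hyperthread siblings share both, so counting distinct pairs gives the physical core count. A hedged sketch (the function name is made up for illustration; some ARM kernels omit these fields, in which case it returns `None`):

```python
def count_physical_cores(cpuinfo_text):
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo text.

    Hyperthread siblings share both ids, so each pair is one physical
    core. Returns None if the fields are missing (e.g. some ARM kernels).
    """
    cores = set()
    phys = core = None
    # Processor blocks are separated by blank lines; append one so the
    # final block is flushed too.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            if phys is not None and core is not None:
                cores.add((phys, core))
            phys = core = None
            continue
        key, _, val = line.partition(":")
        key, val = key.strip(), val.strip()
        if key == "physical id":
            phys = val
        elif key == "core id":
            core = val
    return len(cores) or None
```

Usage would be `count_physical_cores(open("/proc/cpuinfo").read())`. This is exactly the kind of per-OS special-casing that makes hwloc attractive: Windows and macOS need entirely different code paths.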
The methods @chriselrod linked query cache size, cache line size, and cache associativity. It doesn't look like there is a way to get the core count. I'd guess the core count is less useful to LLVM (can it specialize to the number of CPUs?), although I wonder how much of the OpenMP runtime is in the libLLVM we ship. But I guess it's the runtime's job to count CPUs? cc @vchuravy

I also note that #42340 partially solves this for "sufficiently well-behaving" environments that set up affinity for each job allocation, e.g., HPC clusters and cloud services. The point is that we can't determine the number of CPUs we should use solely from the hardware information. We need to respect how much computing resources are assigned to a given job.
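The affinity-respecting approach mentioned above can be sketched in a few lines: on Linux, the scheduler affinity mask reflects `taskset` and cgroup cpusets, so it is a better default than the raw hardware count. A minimal sketch (assuming Python's stdlib as a stand-in for what the runtime would do in C):

```python
import os

def usable_cpu_count():
    """Prefer the CPU affinity mask over the raw hardware count.

    sched_getaffinity(0) reflects taskset/cgroup cpusets on Linux, which
    is what well-behaved HPC/cloud job allocations set up; fall back to
    os.cpu_count() on platforms without it (macOS, Windows).
    """
    try:
        return len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:
        return os.cpu_count()
```

Under e.g. `taskset -c 0-3`, this returns 4 regardless of how many CPUs the machine has, which is the behavior #42340 aims for.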
In the multithreading meeting we discussed that adding the Hwloc dep might be useful. The CPU detection code has been getting more and more gnarly as efficiency cores become more common. The apple-aarch64 code is already quite messy. I don't think LLVM exposes this information in an easy API; it cares more about what kind of core is present than about how many cores there are.
I do think it makes sense to expose this info programmatically, but I'm not so sure that we should default to physical cores. It seems that Julia's threading actually does pretty well with hyperthreading, which OpenMP and BLAS do not. What might make sense is for Julia to default to the number of logical cores and BLAS to default to the number of physical cores. |
In 1.7, `auto` uses hyperthreading. For some workloads, using only physical cores without hyperthreading may be faster. I'm not sure which workloads these are (does anybody have a reference with benchmarks?), but if Julia users would rather not use hyperthreading by default, `auto` should do that. If Julia doesn't have access to the number of physical cores without Hwloc.jl, it could use a heuristic like `ceil(jl_cpu_threads/2)`.
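For concreteness, the proposed fallback heuristic can be sketched as follows (the function name and flag are hypothetical, written here in Python rather than the runtime's C; it assumes roughly 2 hyperthreads per physical core, which is wrong on SMT4 hardware like POWER):

```python
import math

def auto_nthreads(logical_cpus, use_hyperthreading=True):
    """Hypothetical sketch of the `auto` heuristic discussed above.

    With hyperthreading, keep the logical count (current 1.7 behavior);
    without it, approximate the physical core count as ceil(logical/2).
    """
    if use_hyperthreading:
        return logical_cpus
    return math.ceil(logical_cpus / 2)
```

The `ceil` matters so that a machine reporting an odd logical count (e.g. hyperthreading disabled on some cores) still gets at least half of them rounded up.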