
Use number of physical not logical cores for auto nthreads? #43692

Open
jtrakk opened this issue Jan 6, 2022 · 19 comments
Labels
multithreading Base.Threads and related functionality speculative Whether the change will be implemented is speculative

Comments

@jtrakk

jtrakk commented Jan 6, 2022

In 1.7, auto uses all logical cores, including hyperthreads. For some workloads, using only physical cores without hyperthreading may be faster. I'm not sure which workloads those are (does anybody have a reference with benchmarks?), but if Julia users would rather not use hyperthreading by default, auto should do that.

If Julia can't determine the number of physical cores without Hwloc.jl, it could use a heuristic like ceil(jl_cpu_threads/2).

@JeffBezanson JeffBezanson added the multithreading Base.Threads and related functionality label Jan 7, 2022
@JeffBezanson
Member

I agree. My understanding is that determining the number of physical cores is surprisingly difficult, which is the main reason we haven't done this yet.

@giordano
Contributor

giordano commented Jan 7, 2022

Related: JuliaLang/LinearAlgebra.jl#671. That issue is about BLAS, but the main problem is the same: determining the number of physical cores without pulling in yet another external dependency.

@oscardssmith
Member

To make matters worse, we now officially need to care about big.LITTLE designs, since Apple's M1 and 12th-gen Intel chips use them.

@vtjnash vtjnash added the speculative Whether the change will be implemented is speculative label Jan 7, 2022
@ViralBShah
Member

It may very well be the case that we should ship hwloc with Julia.

@DilumAluthge
Member

How big is Hwloc_jll?

Also, is it generally the same size on all operating systems and architectures?

@ViralBShah
Member

ViralBShah commented Jan 11, 2022

It's small: 2-4 MB.

https://github.com/JuliaBinaryWrappers/Hwloc_jll.jl/releases/tag/Hwloc-v2.7.0%2B0

@DilumAluthge
Member

That's not too bad.

@DilumAluthge
Member

DilumAluthge commented Jan 11, 2022

Are there other things that we can use hwloc for, besides just counting the number of physical cores? The more uses we can get out of hwloc, the more compelling the argument is for shipping hwloc with Julia.

@vchuravy
Member

Yeah, there are other interesting things one can do with hwloc, but I'm not sure how much they matter for Base.

One word of caution: hwloc is being used by quite a few JLLs, and we should use it as a static lib to avoid pinning the entire ecosystem to one version.

@tkf
Member

tkf commented Jan 14, 2022

besides just counting the number of physical cores

We can also query the cache hierarchy and NUMA nodes. In principle, we can have a more "intelligent" scheduler that uses knowledge like this. I'd guess the allocator/GC can do something interesting too.

@chriselrod
Contributor

We can also query the cache hierarchy

LLVM should provide this, too:
https://github.com/llvm/llvm-project/blob/0af1808f9b99b49b87b8503466110baee42c5aea/llvm/include/llvm/Analysis/TargetTransformInfo.h#L2124-L2130

@vchuravy
Member

If anyone wants to play around with hwloc I setup a minimal deps file to get a static library in https://github.com/JuliaLang/julia/tree/vc/hwloc

@tkf
Member

tkf commented Jan 15, 2022

LLVM should provide this

Oh, this is cool. I didn't know this. But I was thinking of more detailed info, like which cores share which L3 (not all CPUs do this on a per-socket basis). Also, since we've decoupled codegen and the runtime into separate libraries, I don't think we want to use LLVM in the runtime.

@DilumAluthge
Member

If anyone wants to play around with hwloc I setup a minimal deps file to get a static library in https://github.com/JuliaLang/julia/tree/vc/hwloc

Are you building hwloc from source here?

Is there a way that we can use the pre-built Ygg binaries for hwloc, but not interfere with the Hwloc_jll package? Because as you mentioned above, we don't want to force everyone to use the same version of Hwloc_jll (the way that the stdlib JLLs currently do).

Maybe that's a question for @staticfloat

@giordano
Contributor

Is there a way that we can use the pre-built Ygg binaries for hwloc, but not interfere with the Hwloc_jll package?

I'm not sure Hwloc_jll builds a static library (can't check now, away from computer).

@staticfloat
Member

Hwloc_jll does not build a static library right now. But we could, and could then download/extract it and use it just like everything else.

That being said, if there were a way to get equivalent information from LLVM, I'd definitely prefer that, as 1.7 MB (the size of the .so) is still a hefty price to pay for this functionality. On macOS at least, you can get this kind of information via a few sysctls; I'd hope that Linux/Windows don't make it too much worse.

@tkf
Member

tkf commented Jan 15, 2022

Given that there are some efforts toward parsing cgroups in libuv (libuv/libuv#2323), I don't think it's crazy to parse cgroup and proc in libuv to get the core-count information on Linux. I don't know about Windows, though. I also don't know what other kinds of magic Hwloc has.

That being said, if there were a way to get equivalent information from LLVM,

The methods @chriselrod linked query cache size, cache line size, and cache associativity. It doesn't look like there's a way to get the core count. I'd guess core count is less useful to LLVM (can it specialize for the number of CPUs?), although I wonder how much of the OpenMP stuff is in the libLLVM we ship. But I guess it's the runtime's job to count CPUs? cc @vchuravy

I also note that #42340 partially solves this for "sufficiently well-behaved" environments that set up affinity for each job allocation, e.g. HPC clusters and cloud services. The point is that we can't determine the number of CPUs we should use solely from the hardware information. We need to respect how much computing resource is assigned to a julia process by an "outer scheduler" (whatever spawns the julia process). Of course, something like Hwloc is nice to have for making it work automatically on manually managed workstations and laptops.

@gbaraldi
Member

gbaraldi commented Oct 5, 2022

In the multithreading meeting we discussed that adding the Hwloc dep might be useful. The CPU-detection code has been getting more and more gnarly as efficiency cores become more common. The apple-aarch64 code is already quite messy. I don't think LLVM exposes this information in an easy API; it cares more about what kind of core is there than about how many cores there are.

@StefanKarpinski
Member

I do think it makes sense to expose this info programmatically, but I'm not so sure we should default to physical cores. It seems that Julia's threading actually does pretty well with hyperthreading, which OpenMP and BLAS do not. What might make sense is for Julia to default to the number of logical cores and for BLAS to default to the number of physical cores.
