
Split CompilerJob in dynamic and static part. #395

Merged: 5 commits merged into master from tb/static_compiler_config on Mar 14, 2023

Conversation

@maleadt (Member) commented on Feb 23, 2023

Currently, we need to re-create the entire CompilerJob struct every time we want to check the cache, because it contains the function we want to compile. For example, simplified from https://github.com/JuliaGPU/CUDA.jl/blob/dbcaca84191fb8621f097d85dad80e1627f1c11b/src/compiler/execution.jl#L299-L308:

const cache = Dict()

function compile(ft::F, tt::TT) where {F,TT}
    function compiler(job)
        GPUCompiler.compile(:asm, job)[1]
    end
    function linker(job, compiled)
        compiled
    end

    # a fresh FunctionSpec and CompilerJob are allocated on every call
    source = FunctionSpec(ft, tt)
    target = NativeCompilerTarget()
    params = TestCompilerParams()
    job = CompilerJob(source, target, params)
    GPUCompiler.cached_compilation(cache, job, compiler, linker)::String
end

This results in unnecessary allocations on every kernel launch, even when no compilation actually happens:

julia> @benchmark main()
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.690 μs … 190.522 μs  ┊ GC (min … max): 0.00% … 93.60%
 Time  (median):     7.394 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.391 μs ±   1.973 μs  ┊ GC (mean ± σ):  0.24% ±  0.94%

           ▂▂▂▁      ▁▃▅▇█▆▆▅▃▂
  ▁▁▁▂▃▄▅▇█████▇▇▆▆▆▇██████████▇▆▄▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  6.69 μs         Histogram: frequency by time        8.58 μs <

 Memory estimate: 952 bytes, allocs estimate: 4.

This PR avoids that by introducing a CompilerConfig that contains all the static bits of a CompilerJob; the CompilerJob then simply bundles such a config with a FunctionSpec. In addition, cached_compilation now only takes the function and argument types, so nothing needs to be allocated on the hot path anymore (assuming the CompilerConfig can be retrieved from, e.g., a global dict look-up):

const target = NativeCompilerTarget()
const params = TestCompilerParams()
const config = CompilerConfig(target, params)

const cache = Dict()

function compile(ft::F, tt::TT) where {F,TT}
    function compiler(job)
        GPUCompiler.compile(:asm, job)[1]
    end
    function linker(job, compiled)
        compiled
    end
    GPUCompiler.cached_compilation(cache, config, ft, tt, compiler, linker)::String
end

julia> @benchmark main()
BenchmarkTools.Trial: 10000 samples with 245 evaluations.
 Range (min … max):  309.873 ns …   6.200 μs  ┊ GC (min … max): 0.00% … 93.55%
 Time  (median):     318.204 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   323.585 ns ± 122.187 ns  ┊ GC (mean ± σ):  0.83% ±  2.09%

         ▂▅▇█▇▆▅▃▃▁▂ ▂▁     ▁▁▁▂▃▃▃▃▂▁                          ▂
  ▆▅▄▄▆▆████████████████▇▆▆▇███████████▇▆▆▆▆▄▆▄▄▃▁▅▄▃▁▃▁▁▅▄▃▆▆▇ █
  310 ns        Histogram: log(frequency) by time        357 ns <

 Memory estimate: 96 bytes, allocs estimate: 2.
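
(For reference, the "global dict look-up" mentioned above could look roughly like the sketch below. This is illustrative only; config_cache and get_config are hypothetical names, not part of this PR.)

# Memoize a CompilerConfig per key, so the hot path only pays for a Dict
# lookup instead of reconstructing the (expensive) target on every launch.
const config_cache = Dict{Any,CompilerConfig}()

function get_config(key)
    get!(config_cache, key) do
        target = NativeCompilerTarget()   # expensive: queries host CPU features
        params = TestCompilerParams()
        CompilerConfig(target, params)
    end
end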

@maleadt (Member, Author) commented on Feb 23, 2023

Getting rid of the final allocation would double performance again, but I can't seem to find where it comes from. Removing the lock/unlock does the job, but I fail to see how that causes an allocation (it gets attributed to the call into the outlined compilation code). Even with a SpinLock it allocates...
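
(One way to test whether the lock itself allocates; a minimal sketch, independent of this PR's code:)

l = ReentrantLock()
lock(l); unlock(l)                # warm up so compilation isn't measured
@allocated (lock(l); unlock(l))   # 0 if the uncontended path is allocation-free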

@vchuravy (Member) commented:
Maybe use the allocation profiler with a frequency of 1
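
(A minimal sketch of that suggestion, using Julia ≥ 1.8's Profile.Allocs; the PProf visualization is optional:)

using Profile

# Record every allocation (sample_rate=1) while running the hot path.
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 main()

results = Profile.Allocs.fetch()
# Or visualize: using PProf; PProf.Allocs.pprof()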

maleadt force-pushed the tb/test_cached_compilation branch from 4d91a45 to 1ac5160 on March 13, 2023
Base automatically changed from tb/test_cached_compilation to master on March 13, 2023
maleadt force-pushed the tb/static_compiler_config branch from f57650a to c23bbba on March 13, 2023
codecov bot commented on Mar 13, 2023

Codecov Report

Patch coverage: 87.03% and project coverage change: -1.19 ⚠️

Comparison is base (860ec6a) at 80.47%, compared to head (a819a59) at 79.28%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #395      +/-   ##
==========================================
- Coverage   80.47%   79.28%   -1.19%     
==========================================
  Files          24       24              
  Lines        2863     2844      -19     
==========================================
- Hits         2304     2255      -49     
- Misses        559      589      +30     
Impacted Files      Coverage Δ
src/bpf.jl          93.33% <ø> (ø)
src/reflection.jl   35.38% <50.00%> (-1.54%) ⬇️
src/driver.jl       91.45% <66.66%> (ø)
src/ptx.jl          67.40% <70.00%> (-4.85%) ⬇️
src/optim.jl        80.09% <75.00%> (-2.99%) ⬇️
src/jlgen.jl        72.01% <85.71%> (-6.07%) ⬇️
src/interface.jl    77.55% <88.88%> (-4.81%) ⬇️
src/cache.jl        98.33% <95.83%> (+7.42%) ⬆️
src/gcn.jl          69.91% <100.00%> (ø)
src/irgen.jl        86.81% <100.00%> (-1.22%) ⬇️
... and 6 more

... and 2 files with indirect coverage changes

@maleadt (Member, Author) commented on Mar 13, 2023

Got it down to 32 bytes, but I'm not sure where those still come from. The profiler also doesn't help:

[screenshot: allocation profile]

I don't see how hash(::UInt, ::UInt) would allocate, and the other node is the very function I'm trying to optimize, so it doesn't help narrow things down.

But anyway, this is a 25x speed-up, so probably good enough for now.

maleadt requested review from jpsamaroo and vchuravy, and removed the request for jpsamaroo, on March 13, 2023
@vchuravy (Member) commented on Mar 13, 2023

We are invoking hash; this means we need to allocate (box) the return value. Same for cached_compilation.
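
(A minimal sketch of that effect, independent of this PR's code: a dynamically-invoked call returns Any, so a bits-type result such as a UInt gets boxed, which allocates.)

h(x) = hash(x, UInt(0))

h(42)                                # warm up so compilation isn't measured
@allocated h(42)                     # direct call: 0 bytes, the UInt stays unboxed
@allocated Base.invokelatest(h, 42)  # dynamic invocation: > 0 bytes (boxed return)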

@vchuravy (Member) commented on Mar 13, 2023

Can you share the pprof? https://pprof.me if it's small enough

@maleadt (Member, Author) commented on Mar 13, 2023

Sure: https://pprof.me/d48281f/
Although I'm not sure that works... so here's the actual db too:
alloc-profile.pb.gz

maleadt marked this pull request as ready for review on March 13, 2023
maleadt merged commit c51dab3 into master on Mar 14, 2023
maleadt deleted the tb/static_compiler_config branch on March 14, 2023
@maleadt (Member, Author) commented on Mar 14, 2023

Ugh, it turns out all of the improvement here comes from making the NativeCompilerTarget constant, because LLVMGetHostCPUFeatures takes ages. So the actual split into a static CompilerConfig doesn't help 🤦

@vchuravy Revert this, or keep it? It does make sense as a refactor, but it adds another level of structuring that isn't terribly useful right now. On the other hand, I guess it will be needed if we ever want to make kernel launches fully allocation-less, but we're far from that right now:

julia> @benchmark @cuda launch=false identity(nothing)
BenchmarkTools.Trial: 10000 samples with 267 evaluations.
 Range (min … max):  298.236 ns …   7.093 μs  ┊ GC (min … max): 0.00% … 94.74%
 Time  (median):     310.446 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   323.410 ns ± 249.096 ns  ┊ GC (mean ± σ):  3.24% ±  3.99%

        ▃▆▆███▆▄▂▁▁▁▁                                           ▂
  ▅▃▄▄▁▇██████████████▇▆▆▆▆▃▃▃▁▅▆▇▆▇▇▆▆▃▃▄▅▆▇█████▇▇▇▆▆▇▇████▇▇ █
  298 ns        Histogram: log(frequency) by time        369 ns <

 Memory estimate: 320 bytes, allocs estimate: 8.
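
(For reference, a minimal sketch of the change that did matter: hoisting target construction out of the hot path so LLVMGetHostCPUFeatures runs only once. The lazy getter here is illustrative, not the PR's actual code.)

# NativeCompilerTarget() queries the host CPU features via LLVM, which is
# slow, so construct it once and reuse it.
const HOST_TARGET = Ref{Union{Nothing,NativeCompilerTarget}}(nothing)

function native_target()
    t = HOST_TARGET[]
    if t === nothing
        t = NativeCompilerTarget()  # pays the LLVMGetHostCPUFeatures cost once
        HOST_TARGET[] = t
    end
    return t
end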

@maleadt (Member, Author) commented on Mar 14, 2023

Actually, it allows me to get to:

julia> @benchmark cufunction(identity, Tuple{Nothing})
BenchmarkTools.Trial: 10000 samples with 323 evaluations.
 Range (min … max):  270.523 ns …   6.102 μs  ┊ GC (min … max): 0.00% … 93.72%
 Time  (median):     286.220 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   301.062 ns ± 101.832 ns  ┊ GC (mean ± σ):  0.59% ±  1.84%

   ▄██▆▇▆▅▄▁                              ▁▁                    ▂
  ▇██████████▇▇▄▅▅▄▄▄▅▄▄▄▄▄▁▄▄▃▄▁▃▁▁▁▁▅▆▆███████████▇▇█▇▇▇██▆▅▆ █
  271 ns        Histogram: log(frequency) by time        506 ns <

 Memory estimate: 64 bytes, allocs estimate: 4.

So it's not much faster, but there are significantly fewer allocations, which people are allergic to. Let's keep this.
