
Split CompilerJob in dynamic and static part. #395

Merged: 5 commits merged into master from tb/static_compiler_config on Mar 14, 2023

Conversation

@maleadt (Member) commented on Feb 23, 2023

Currently, we need to re-create the entire CompilerJob struct every time we want to check the cache, because it contains the function we want to compile. For example, simplified from https://github.com/JuliaGPU/CUDA.jl/blob/dbcaca84191fb8621f097d85dad80e1627f1c11b/src/compiler/execution.jl#L299-L308:

const cache = Dict()

function compile(ft::F, tt::TT) where {F,TT}
    function compiler(job)
        GPUCompiler.compile(:asm, job)[1]
    end
    function linker(job, compiled)
        compiled
    end

    # a fresh FunctionSpec and CompilerJob are allocated on every call
    source = FunctionSpec(ft, tt)
    target = NativeCompilerTarget()
    params = TestCompilerParams()
    job = CompilerJob(source, target, params)
    GPUCompiler.cached_compilation(cache, job, compiler, linker)::String
end

This results in unnecessary allocations on every kernel launch, even when no compilation actually happens:

julia> @benchmark main()
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.690 μs … 190.522 μs  ┊ GC (min … max): 0.00% … 93.60%
 Time  (median):     7.394 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.391 μs ±   1.973 μs  ┊ GC (mean ± σ):  0.24% ±  0.94%

           ▂▂▂▁      ▁▃▅▇█▆▆▅▃▂
  ▁▁▁▂▃▄▅▇█████▇▇▆▆▆▇██████████▇▆▄▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  6.69 μs         Histogram: frequency by time        8.58 μs <

 Memory estimate: 952 bytes, allocs estimate: 4.

This PR avoids that by introducing a CompilerConfig that contains all the static bits of a CompilerJob; the CompilerJob then simply bundles such a config with a FunctionSpec. In addition, cached_compilation now only takes the function and argument types, so nothing needs to be allocated on the hot path anymore (assuming the CompilerConfig can be retrieved from, e.g., a global dict look-up):

const target = NativeCompilerTarget()
const params = TestCompilerParams()
const config = CompilerConfig(target, params)

const cache = Dict()

function compile(ft::F, tt::TT) where {F,TT}
    function compiler(job)
        GPUCompiler.compile(:asm, job)[1]
    end
    function linker(job, compiled)
        compiled
    end
    GPUCompiler.cached_compilation(cache, config, ft, tt, compiler, linker)::String
end

julia> @benchmark main()
BenchmarkTools.Trial: 10000 samples with 245 evaluations.
 Range (min … max):  309.873 ns …   6.200 μs  ┊ GC (min … max): 0.00% … 93.55%
 Time  (median):     318.204 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   323.585 ns ± 122.187 ns  ┊ GC (mean ± σ):  0.83% ±  2.09%

         ▂▅▇█▇▆▅▃▃▁▂ ▂▁     ▁▁▁▂▃▃▃▃▂▁                          ▂
  ▆▅▄▄▆▆████████████████▇▆▆▇███████████▇▆▆▆▆▄▆▄▄▃▁▅▄▃▁▃▁▁▅▄▃▆▆▇ █
  310 ns        Histogram: log(frequency) by time        357 ns <

 Memory estimate: 96 bytes, allocs estimate: 2.
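
(For reference, the "global dict look-up" mentioned above could look roughly like the sketch below. This is illustrative only; config_cache and get_config are hypothetical names, not part of this PR.)

# Memoize a CompilerConfig per key, so the hot path only pays for a Dict
# lookup instead of reconstructing the (expensive) target on every launch.
const config_cache = Dict{Any,CompilerConfig}()

function get_config(key)
    get!(config_cache, key) do
        target = NativeCompilerTarget()   # expensive: queries host CPU features
        params = TestCompilerParams()
        CompilerConfig(target, params)
    end
end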

@maleadt (Member, Author) commented on Feb 23, 2023

Getting rid of the final allocation would double performance again, but I can't seem to find where it comes from. Removing the lock/unlock does the job, but I fail to see how that causes an allocation (it gets attributed to the call into the outlined compilation code). Even with a SpinLock it allocates...
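
(One way to test whether the lock itself allocates; a minimal sketch, independent of this PR's code:)

l = ReentrantLock()
lock(l); unlock(l)                # warm up so compilation isn't measured
@allocated (lock(l); unlock(l))   # 0 if the uncontended path is allocation-free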

@vchuravy (Member) commented:
Maybe use the allocation profiler with a frequency of 1
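
(A minimal sketch of that suggestion, using Julia ≥ 1.8's Profile.Allocs; the PProf visualization is optional:)

using Profile

# Record every allocation (sample_rate=1) while running the hot path.
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 main()

results = Profile.Allocs.fetch()
# Or visualize: using PProf; PProf.Allocs.pprof()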

maleadt force-pushed the tb/test_cached_compilation branch from 4d91a45 to 1ac5160 on March 13, 2023
Base automatically changed from tb/test_cached_compilation to master on March 13, 2023
maleadt force-pushed the tb/static_compiler_config branch from f57650a to c23bbba on March 13, 2023
codecov bot commented on Mar 13, 2023

Codecov Report

Patch coverage: 87.03% and project coverage change: -1.19 ⚠️

Comparison is base (860ec6a) at 80.47%, compared to head (a819a59) at 79.28%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #395      +/-   ##
==========================================
- Coverage   80.47%   79.28%   -1.19%     
==========================================
  Files          24       24              
  Lines        2863     2844      -19     
==========================================
- Hits         2304     2255      -49     
- Misses        559      589      +30     
Impacted Files      Coverage Δ
src/bpf.jl          93.33% <ø> (ø)
src/reflection.jl   35.38% <50.00%> (-1.54%) ⬇️
src/driver.jl       91.45% <66.66%> (ø)
src/ptx.jl          67.40% <70.00%> (-4.85%) ⬇️
src/optim.jl        80.09% <75.00%> (-2.99%) ⬇️
src/jlgen.jl        72.01% <85.71%> (-6.07%) ⬇️
src/interface.jl    77.55% <88.88%> (-4.81%) ⬇️
src/cache.jl        98.33% <95.83%> (+7.42%) ⬆️
src/gcn.jl          69.91% <100.00%> (ø)
src/irgen.jl        86.81% <100.00%> (-1.22%) ⬇️
... and 6 more

... and 2 files with indirect coverage changes

@maleadt (Member, Author) commented on Mar 13, 2023

Got it down to 32 bytes, but I'm not sure where those still come from. The profiler also doesn't help:

[screenshot: allocation profile]

I don't see how hash(::UInt, ::UInt) would allocate, and the other node is the very function I'm trying to optimize, so it doesn't help narrow things down.

But anyway, this is a 25x speed-up, so probably good enough for now.

maleadt requested review from jpsamaroo and vchuravy, and removed the request for jpsamaroo, on March 13, 2023
@vchuravy (Member) commented on Mar 13, 2023

We are invoking hash; this means we need to allocate (box) the return value. Same for cached_compilation.
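
(A minimal sketch of that effect, independent of this PR's code: a dynamically-invoked call returns Any, so a bits-type result such as a UInt gets boxed, which allocates.)

h(x) = hash(x, UInt(0))

h(42)                                # warm up so compilation isn't measured
@allocated h(42)                     # direct call: 0 bytes, the UInt stays unboxed
@allocated Base.invokelatest(h, 42)  # dynamic invocation: > 0 bytes (boxed return)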

@vchuravy (Member) commented on Mar 13, 2023

Can you share the pprof? https://pprof.me if it's small enough

@maleadt (Member, Author) commented on Mar 13, 2023

Sure: https://pprof.me/d48281f/
Although I'm not sure that works... so here's the actual db too:
alloc-profile.pb.gz

maleadt marked this pull request as ready for review on March 13, 2023
maleadt merged commit c51dab3 into master on Mar 14, 2023
maleadt deleted the tb/static_compiler_config branch on March 14, 2023
@maleadt (Member, Author) commented on Mar 14, 2023

Ugh, it turns out all of the improvement here comes from making the NativeCompilerTarget constant, because LLVMGetHostCPUFeatures takes ages. So the actual split into a static CompilerConfig doesn't help 🤦

@vchuravy Revert this, or keep it? It does make sense as a refactor, but it adds another level of structuring that isn't terribly useful right now. On the other hand, I guess it will be needed if we ever want to make kernel launches fully allocation-less, but we're far from that right now:

julia> @benchmark @cuda launch=false identity(nothing)
BenchmarkTools.Trial: 10000 samples with 267 evaluations.
 Range (min … max):  298.236 ns …   7.093 μs  ┊ GC (min … max): 0.00% … 94.74%
 Time  (median):     310.446 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   323.410 ns ± 249.096 ns  ┊ GC (mean ± σ):  3.24% ±  3.99%

        ▃▆▆███▆▄▂▁▁▁▁                                           ▂
  ▅▃▄▄▁▇██████████████▇▆▆▆▆▃▃▃▁▅▆▇▆▇▇▆▆▃▃▄▅▆▇█████▇▇▇▆▆▇▇████▇▇ █
  298 ns        Histogram: log(frequency) by time        369 ns <

 Memory estimate: 320 bytes, allocs estimate: 8.
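
(For reference, a minimal sketch of the change that did matter: hoisting target construction out of the hot path so LLVMGetHostCPUFeatures runs only once. The lazy getter here is illustrative, not the PR's actual code.)

# NativeCompilerTarget() queries the host CPU features via LLVM, which is
# slow, so construct it once and reuse it.
const HOST_TARGET = Ref{Union{Nothing,NativeCompilerTarget}}(nothing)

function native_target()
    t = HOST_TARGET[]
    if t === nothing
        t = NativeCompilerTarget()  # pays the LLVMGetHostCPUFeatures cost once
        HOST_TARGET[] = t
    end
    return t
end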

@maleadt (Member, Author) commented on Mar 14, 2023

Actually, it allows me to get to:

julia> @benchmark cufunction(identity, Tuple{Nothing})
BenchmarkTools.Trial: 10000 samples with 323 evaluations.
 Range (min … max):  270.523 ns …   6.102 μs  ┊ GC (min … max): 0.00% … 93.72%
 Time  (median):     286.220 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   301.062 ns ± 101.832 ns  ┊ GC (mean ± σ):  0.59% ±  1.84%

   ▄██▆▇▆▅▄▁                              ▁▁                    ▂
  ▇██████████▇▇▄▅▅▄▄▄▅▄▄▄▄▄▁▄▄▃▄▁▃▁▁▁▁▅▆▆███████████▇▇█▇▇▇██▆▅▆ █
  271 ns        Histogram: log(frequency) by time        506 ns <

 Memory estimate: 64 bytes, allocs estimate: 4.

So it's not much faster, but there are significantly fewer allocations, which people are allergic to. Let's keep this.
