
Rework cached compilation; remove invalidation generator #445

Merged 15 commits into master on May 15, 2023

Conversation

maleadt (Member) commented May 11, 2023

This PR changes how we do cached compilation. Before, we looked up entries in a small cache indexed by the codegen world age, which we obtained from a hacky generator. As @vtjnash pointed out, that world age isn't valid and shouldn't leak into runtime code, so I redesigned cached compilation here to more closely resemble what Base does: we now store compiled and linked objects 'next to' the CodeInstances (just like Base stores pointers inside the CI).

At run time, we still use a small cache, but it is indexed by the current TLS world age. When the world changes, that may have happened due to an unrelated method redefinition, so we query the CI cache (intersecting world ages) and look up the GPU object that's stored next to it.
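A rough sketch of that two-level lookup, with hypothetical names (`ci_cache_lookup`, the `gpu_object` field, and the cache layout are illustrative, not GPUCompiler's actual API):

```julia
# Illustrative sketch only: a small runtime cache keyed by the current TLS
# world age, falling back to the CI cache when the world has changed.
const jit_cache = Dict{Any,Any}()

function cached_lookup(ci_cache, key)
    world = Base.get_world_counter()          # current TLS world age
    obj = get(jit_cache, (key, world), nothing)
    if obj === nothing
        # The world changed, possibly due to an unrelated method
        # redefinition: find a CodeInstance whose [min_world, max_world]
        # range still covers `world` (hypothetical helper).
        ci = ci_cache_lookup(ci_cache, key, world)
        obj = ci.gpu_object                   # compiled object stored 'next to' the CI
        jit_cache[(key, world)] = obj         # fast path for subsequent calls
    end
    return obj
end
```

The point of the fallback is that a world-age bump does not necessarily invalidate the kernel: if the cached CodeInstance's validity range still covers the new world, the previously compiled GPU object can be reused without recompiling.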

@vchuravy This broke LazyCodegen. I'm not sure why it even relied on the codegen world age, as it doesn't do any invalidation-related tests.

@wsmoses I know Enzyme relies on this, sorry. Feel free to copy the old code there.

Fixes #435, #440, #146

@maleadt maleadt changed the title Rework cached compilation; remove invalidation generator WIP: Rework cached compilation; remove invalidation generator May 11, 2023
@maleadt maleadt changed the title WIP: Rework cached compilation; remove invalidation generator Rework cached compilation; remove invalidation generator May 11, 2023
maleadt (Member, Author) commented May 11, 2023

Found the issue: The cache that CUDA provides is itself keyed on the context, which ensures that after a device_reset! a new cache is automatically used. This doesn't work if we put the compiled objects inside of GPUCompiler's CodeCache, so I'm putting them in the user-provided cache now.

We could probably do something better, but I'm not a fan of adding yet another interface like empty_ci_caches!. Maybe CUDA.jl just ought to specialize ci_cache and key it with a context. For now, slightly abusing the cache that's passed into cached_compilation does the job though.
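A minimal sketch of that context-keyed pattern, assuming a `context()` accessor for the active CUDA context (all names here are illustrative, not CUDA.jl's actual code):

```julia
# Illustrative sketch only: one compilation cache per CUDA context, so that
# device_reset! (which installs a new context) implicitly starts with an
# empty cache, without needing an empty_ci_caches!-style interface.
const _caches = Dict{Any,Dict{Any,Any}}()

function compilation_cache()
    ctx = context()   # hypothetical handle to the active CUDA context
    return get!(() -> Dict{Any,Any}(), _caches, ctx)
end
```

Stale entries for dead contexts would still need to be released eventually, which is one reason specializing `ci_cache` per context might be the cleaner long-term design.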

maleadt (Member, Author) commented May 11, 2023

Surprisingly, this gets rid of some of the remaining allocations in cached_compilation.

Before:

```
julia> @benchmark @cuda identity(nothing)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.330 μs …   5.463 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.512 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.535 μs ± 108.118 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂▅▆█▇▃▇▅    ▂▂▁▃▃
  ▂▁▁▂▂▂▂▃▃▃▅▆████████▇▆▇███████▇▇▆▆▅▄▄▄▄▄▄▃▄▃▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂ ▄
  2.33 μs         Histogram: frequency by time        2.83 μs <

 Memory estimate: 288 bytes, allocs estimate: 6.
```

After:

```
julia> @benchmark @cuda identity(nothing)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.076 μs …  5.234 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.309 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.333 μs ± 99.139 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▁▅▅██▄▄▂       ▁
  ▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▄██████████▇▇█████▆▄▅▄▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁ ▃
  2.08 μs        Histogram: frequency by time         2.6 μs <

 Memory estimate: 256 bytes, allocs estimate: 4.
```

maleadt force-pushed the tb/rm_generator branch from b1c311e to 2ad7a4f (May 15, 2023 09:07)
codecov bot commented May 15, 2023

Codecov Report

Patch coverage: 87.82% and project coverage change: -7.08 ⚠️

Comparison is base (acf9c49) 85.71% compared to head (a5dfcbb) 78.63%.

❗ Current head a5dfcbb differs from pull request most recent head 6c9be17. Consider uploading reports for the commit 6c9be17 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #445      +/-   ##
==========================================
- Coverage   85.71%   78.63%   -7.08%     
==========================================
  Files          24       23       -1     
  Lines        2962     2926      -36     
==========================================
- Hits         2539     2301     -238     
- Misses        423      625     +202     
Impacted Files Coverage Δ
src/GPUCompiler.jl 100.00% <ø> (ø)
src/jlgen.jl 78.39% <82.66%> (+6.08%) ⬆️
src/execution.jl 67.79% <96.15%> (-23.12%) ⬇️
src/interface.jl 80.89% <100.00%> (-4.82%) ⬇️
src/optim.jl 84.40% <100.00%> (+0.14%) ⬆️
src/validation.jl 96.27% <100.00%> (ø)

... and 11 files with indirect coverage changes


vtjnash (Contributor) commented May 15, 2023

Yes, Base does not store that Tuple, for the reason you found. It uses an iterated lookup instead: the first level is keyed by the MethodInstance (mi), followed by a linear scan (usually of just one entry) over all of the possibilities for that mi.
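The iterated lookup described above could be sketched as follows (illustrative cache type and lookup logic, not Base's actual internals):

```julia
# Illustrative sketch only: a two-level cache as vtjnash describes.
# First level: keyed by the MethodInstance. Second level: a linear scan
# over the (usually single) CodeInstance valid for the queried world.
struct SketchCache
    entries::Dict{Core.MethodInstance,Vector{Core.CodeInstance}}
end

function lookup(cache::SketchCache, mi::Core.MethodInstance, world::UInt)
    for ci in get(cache.entries, mi, Core.CodeInstance[])
        # A CodeInstance is valid for worlds in [min_world, max_world].
        if ci.min_world <= world <= ci.max_world
            return ci   # usually just one entry to scan
        end
    end
    return nothing      # no valid entry: compilation is required
end
```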

maleadt force-pushed the tb/rm_generator branch from c8ec094 to a5dfcbb (May 15, 2023 14:08)
maleadt force-pushed the tb/rm_generator branch from a5dfcbb to 6c9be17 (May 15, 2023 14:51)
@maleadt maleadt merged commit 3de799a into master May 15, 2023
@maleadt maleadt deleted the tb/rm_generator branch May 15, 2023 14:55
Development

Successfully merging this pull request may close these issues.

Kernel invalidation relies on undefined behavior