
Rework cached compilation; remove invalidation generator #445

Merged 15 commits into master on May 15, 2023

Conversation

maleadt (Member) commented May 11, 2023

This PR changes how we do cached compilation. Before, we looked up entries in a small cache indexed by the codegen world age, which we obtained from a hacky generator. As @vtjnash pointed out, that world age isn't valid and shouldn't leak into runtime code, so I redesigned cached compilation here to more closely resemble what Base does: we now store compiled and linked objects 'next to' the CodeInstances (just like Base stores pointers inside the CI).

At run time, we still use a small cache, but it is indexed by the current TLS world age. When the world changes, that may have happened due to an unrelated method redefinition, so we query the CI cache (intersecting world ages) and look up the GPU object that's stored next to it.
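A rough sketch of that two-level lookup, with hypothetical names (`ci_cache_lookup`, the `gpu_object` field, and the cache layout are illustrative, not GPUCompiler's actual API):

```julia
# Illustrative sketch only: a small runtime cache keyed by the current TLS
# world age, falling back to the CI cache when the world has changed.
const jit_cache = Dict{Any,Any}()

function cached_lookup(ci_cache, key)
    world = Base.get_world_counter()          # current TLS world age
    obj = get(jit_cache, (key, world), nothing)
    if obj === nothing
        # The world changed, possibly due to an unrelated method
        # redefinition: find a CodeInstance whose [min_world, max_world]
        # range still covers `world` (hypothetical helper).
        ci = ci_cache_lookup(ci_cache, key, world)
        obj = ci.gpu_object                   # compiled object stored 'next to' the CI
        jit_cache[(key, world)] = obj         # fast path for subsequent calls
    end
    return obj
end
```

The point of the fallback is that a world-age bump does not necessarily invalidate the kernel: if the cached CodeInstance's validity range still covers the new world, the previously compiled GPU object can be reused without recompiling.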

@vchuravy This broke LazyCodegen. I'm not sure why it even relied on the codegen world age, as it doesn't do any invalidation-related tests.

@wsmoses I know Enzyme relies on this, sorry. Feel free to copy the old code there.

Fixes #435, #440, #146

@maleadt maleadt changed the title Rework cached compilation; remove invalidation generator WIP: Rework cached compilation; remove invalidation generator May 11, 2023
@maleadt maleadt changed the title WIP: Rework cached compilation; remove invalidation generator Rework cached compilation; remove invalidation generator May 11, 2023
maleadt (Member, Author) commented May 11, 2023

Found the issue: The cache that CUDA provides is itself keyed on the context, which ensures that after a device_reset! a new cache is automatically used. This doesn't work if we put the compiled objects inside of GPUCompiler's CodeCache, so I'm putting them in the user-provided cache now.

We could probably do something better, but I'm not a fan of adding yet another interface like empty_ci_caches!. Maybe CUDA.jl just ought to specialize ci_cache and key it with a context. For now, slightly abusing the cache that's passed into cached_compilation does the job though.
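A minimal sketch of that context-keyed pattern, assuming a `context()` accessor for the active CUDA context (all names here are illustrative, not CUDA.jl's actual code):

```julia
# Illustrative sketch only: one compilation cache per CUDA context, so that
# device_reset! (which installs a new context) implicitly starts with an
# empty cache, without needing an empty_ci_caches!-style interface.
const _caches = Dict{Any,Dict{Any,Any}}()

function compilation_cache()
    ctx = context()   # hypothetical handle to the active CUDA context
    return get!(() -> Dict{Any,Any}(), _caches, ctx)
end
```

Stale entries for dead contexts would still need to be released eventually, which is one reason specializing `ci_cache` per context might be the cleaner long-term design.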

maleadt (Member, Author) commented May 11, 2023

Surprisingly, this gets rid of some of the remaining allocations in cached_compilation.

Before:

```
julia> @benchmark @cuda identity(nothing)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.330 μs …   5.463 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.512 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.535 μs ± 108.118 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂▅▆█▇▃▇▅    ▂▂▁▃▃
  ▂▁▁▂▂▂▂▃▃▃▅▆████████▇▆▇███████▇▇▆▆▅▄▄▄▄▄▄▃▄▃▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂ ▄
  2.33 μs         Histogram: frequency by time        2.83 μs <

 Memory estimate: 288 bytes, allocs estimate: 6.
```

After:

```
julia> @benchmark @cuda identity(nothing)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.076 μs …  5.234 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.309 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.333 μs ± 99.139 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▁▅▅██▄▄▂       ▁
  ▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▄██████████▇▇█████▆▄▅▄▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁ ▃
  2.08 μs        Histogram: frequency by time         2.6 μs <

 Memory estimate: 256 bytes, allocs estimate: 4.
```

maleadt force-pushed the tb/rm_generator branch from b1c311e to 2ad7a4f (May 15, 2023 09:07)
codecov bot commented May 15, 2023

Codecov Report

Patch coverage: 87.82% and project coverage change: -7.08 ⚠️

Comparison is base (acf9c49) 85.71% compared to head (a5dfcbb) 78.63%.

❗ Current head a5dfcbb differs from pull request most recent head 6c9be17. Consider uploading reports for the commit 6c9be17 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #445      +/-   ##
==========================================
- Coverage   85.71%   78.63%   -7.08%     
==========================================
  Files          24       23       -1     
  Lines        2962     2926      -36     
==========================================
- Hits         2539     2301     -238     
- Misses        423      625     +202     
Impacted Files Coverage Δ
src/GPUCompiler.jl 100.00% <ø> (ø)
src/jlgen.jl 78.39% <82.66%> (+6.08%) ⬆️
src/execution.jl 67.79% <96.15%> (-23.12%) ⬇️
src/interface.jl 80.89% <100.00%> (-4.82%) ⬇️
src/optim.jl 84.40% <100.00%> (+0.14%) ⬆️
src/validation.jl 96.27% <100.00%> (ø)

... and 11 files with indirect coverage changes


vtjnash (Contributor) commented May 15, 2023

Yes, Base does not store that Tuple, for the reason you found. It uses an iterated lookup instead: the first level is keyed by the MethodInstance (mi), followed by a linear scan (usually of just one entry) over all of the possibilities for that mi.
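The iterated lookup described above could be sketched as follows (illustrative cache type and lookup logic, not Base's actual internals):

```julia
# Illustrative sketch only: a two-level cache as vtjnash describes.
# First level: keyed by the MethodInstance. Second level: a linear scan
# over the (usually single) CodeInstance valid for the queried world.
struct SketchCache
    entries::Dict{Core.MethodInstance,Vector{Core.CodeInstance}}
end

function lookup(cache::SketchCache, mi::Core.MethodInstance, world::UInt)
    for ci in get(cache.entries, mi, Core.CodeInstance[])
        # A CodeInstance is valid for worlds in [min_world, max_world].
        if ci.min_world <= world <= ci.max_world
            return ci   # usually just one entry to scan
        end
    end
    return nothing      # no valid entry: compilation is required
end
```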

maleadt force-pushed the tb/rm_generator branch from c8ec094 to a5dfcbb (May 15, 2023 14:08)
maleadt force-pushed the tb/rm_generator branch from a5dfcbb to 6c9be17 (May 15, 2023 14:51)
@maleadt maleadt merged commit 3de799a into master May 15, 2023
@maleadt maleadt deleted the tb/rm_generator branch May 15, 2023 14:55
Development

Successfully merging this pull request may close these issues.

Kernel invalidation relies on undefined behavior