Split CompilerJob in dynamic and static part. #395
Getting rid of the final allocation would double performance again, but I can't seem to find where it comes from. Removing the |
Maybe use the allocation profiler with a frequency of 1 |
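For reference, sampling every allocation could look roughly like the sketch below. It assumes Julia 1.8 or newer, where the `Profile.Allocs` standard library is available; `leaky` is a hypothetical stand-in for the code under investigation, not anything from this PR.

```julia
using Profile

# Hypothetical stand-in for the code being investigated: `collect` allocates an array.
leaky(n) = sum(collect(1:n))

leaky(10)  # warm up first so compilation allocations are not recorded

Profile.Allocs.clear()
# sample_rate=1 records every allocation instead of a random subset.
Profile.Allocs.@profile sample_rate=1 leaky(10)

results = Profile.Allocs.fetch()
for a in results.allocs
    # Each record carries the allocated type, its size in bytes, and a stack trace.
    println(a.type, ": ", a.size, " bytes")
end
```

The `stacktrace` field of each record is what usually points at the allocating call site.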
Codecov Report
Coverage diff, master vs. #395:
Coverage: 80.47% → 79.28% (-1.19%)
Files: 24 → 24
Lines: 2863 → 2844 (-19)
Hits: 2304 → 2255 (-49)
Misses: 559 → 589 (+30)
... and 2 files with indirect coverage changes |
Got it down to 32 bytes, but I'm not sure where those still come from. The profiler also doesn't help: I don't see how
But anyway, this is a 25x speed-up, so probably good enough for now. |
We are |
Can you share the pprof? https://pprof.me if it's small enough |
Sure: https://pprof.me/d48281f/ |
Ugh, turns out all of the improvement here comes from making the
@vchuravy Revert this, or keep it? It does make sense as a refactor, but it adds another level of structure that isn't terribly useful right now. On the other hand, I guess it will be needed if we ever want to make kernel launches fully allocation-free, but we're far from that right now:
|
Actually, it allows me to get to:
So not much faster, but significantly fewer allocations, which people are allergic to. So let's keep this. |
Currently, we need to re-create the entire `CompilerJob` struct every time we want to check the cache because it contains the function we want to compile. For example, simplified from https://github.com/JuliaGPU/CUDA.jl/blob/dbcaca84191fb8621f097d85dad80e1627f1c11b/src/compiler/execution.jl#L299-L308:
This results in unnecessary allocations on every kernel launch, even if it didn't cause any compilation:
This PR aims to avoid that by introducing a `CompilerConfig` containing all static bits of the `CompilerJob`, bundling it together with the `FunctionSpec` in the `CompilerJob`. In addition, `cached_compilation` now only takes the function and argument types, so nothing needs to be allocated on the hot path anymore (assuming the `CompilerConfig` can be retrieved from a global dict look-up or something):
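A minimal sketch of this split, using simplified stand-in types, might look like the following. The fields `target` and `kernel`, the `compile` helper, the `CACHE` dictionary and the exact argument list of `cached_compilation` are assumptions made for illustration, not GPUCompiler's actual definitions. The sketch only shows the structure; it does not demonstrate that the lookup is allocation-free.

```julia
# Static part: settings that stay the same across launches (hypothetical fields).
struct CompilerConfig
    target::Symbol
    kernel::Bool
end

# Dynamic part: identifies what is being compiled.
struct FunctionSpec
    f::Any
    tt::Type
end

# The job bundles both parts; here it is only constructed on a cache miss.
struct CompilerJob
    source::FunctionSpec
    config::CompilerConfig
end

const CACHE = Dict{UInt,Any}()
const COMPILATIONS = Ref(0)   # counts cache misses, for illustration only

# Hypothetical compile step standing in for the real compiler.
function compile(job::CompilerJob)
    COMPILATIONS[] += 1
    return string("compiled ", job.source.f, " for ", job.config.target)
end

# Look up by function, argument types and config without building a job first.
# The key is a plain hash; collisions are ignored in this sketch.
function cached_compilation(cache, config::CompilerConfig, f, tt::Type)
    key = hash(tt, hash(f, hash(config)))
    get!(cache, key) do
        compile(CompilerJob(FunctionSpec(f, tt), config))
    end
end

config = CompilerConfig(:ptx, true)
first_result  = cached_compilation(CACHE, config, sin, Tuple{Float64})
second_result = cached_compilation(CACHE, config, sin, Tuple{Float64})  # cache hit
```

The second call returns the cached value without constructing a `CompilerJob` or a `FunctionSpec`, which is the property the PR is after.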