Redesign the frame layout to avoid the redundant computation of the stackOffsets #99278
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details: Redesign the frame layout to avoid the redundant computation of the stackOffsets. First I push the LoongArch64's optimization; if needed, I can also implement the ARM64's optimization.
(force-pushed from 7d09326 to 46e29e5)
What is the optimization/benefit here? It does not seem like a good idea to make LA64 different from the other backends (e.g. I see it introduces more ifdefs in shared code).
I want to optimize the computation of the LclVarDsc's StackOffset.
This is the first PR; if it is OK, I can implement the same change for the other CPUs, at least the RISC CPUs.
I'm not that familiar with the stack layout stuff, but are you saying that ...
At least for RISC-ISA CPUs like LA64/ARM64/RISCV64. For LA64, I have tested it and it works.
I don't understand why LA needs to be different here. It doesn't seem like there is any reason to "optimize" the setting of LclVarDsc StackOffset.
I will give a detailed explanation later.
Hi, @BruceForstall
In fact, the two steps can be merged together by an optimization like this PR.
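To make the "two steps" concrete, here is a minimal standalone C++ sketch of the idea being discussed, not the actual RyuJIT code: assign provisional (virtual) offsets in one pass, then shift them all by a delta in a fixup pass, versus producing the final offsets in a single pass when the adjustment is already known. The struct, helper names, and sizes are invented for illustration only.

```cpp
// Illustrative model only: today the JIT assigns "virtual" frame offsets in one
// pass and then shifts every local by a delta in a second pass; the two passes
// can be merged when the adjustment is known up front.
#include <cstdio>
#include <vector>

struct LocalVar {
    int size;        // slot size in bytes (invented for this sketch)
    int stackOffset; // offset from the frame pointer
};

// Pass 1: assign provisional (virtual) offsets, growing downward from 0.
static void assignVirtualOffsets(std::vector<LocalVar>& locals) {
    int offset = 0;
    for (LocalVar& lcl : locals) {
        offset -= lcl.size;
        lcl.stackOffset = offset;
    }
}

// Pass 2: once the callee-saved area size is known, shift every offset.
static void fixVirtualOffsets(std::vector<LocalVar>& locals, int delta) {
    for (LocalVar& lcl : locals) {
        lcl.stackOffset += delta;
    }
}

// Merged form: if the callee-saved area size is known up front, the final
// FP-relative offsets can be produced in a single pass.
static void assignFinalOffsets(std::vector<LocalVar>& locals, int calleeSavedSize) {
    int offset = -calleeSavedSize;
    for (LocalVar& lcl : locals) {
        offset -= lcl.size;
        lcl.stackOffset = offset;
    }
}

int main() {
    std::vector<LocalVar> twoPass = {{8, 0}, {16, 0}, {8, 0}};
    std::vector<LocalVar> onePass = twoPass;

    assignVirtualOffsets(twoPass);
    fixVirtualOffsets(twoPass, /*delta*/ -32); // e.g. 32 bytes of callee-saved registers below FP
    assignFinalOffsets(onePass, /*calleeSavedSize*/ 32);

    for (size_t i = 0; i < twoPass.size(); i++) {
        std::printf("local %zu: two-pass %d, one-pass %d\n", i, twoPass[i].stackOffset, onePass[i].stackOffset);
    }
    return 0;
}
```

Both paths print identical offsets, which is the sense in which the two steps are "merged" here.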
(force-pushed from 46e29e5 to 1a4347a)
Hi, @BruceForstall
@BruceForstall @jakobbotsch
Can you give us a little more time to think about this before spending more time on it?
OK.
(force-pushed from 3841fdb to 68831a7, then from dfb6b99 to 9deea5f)
Just modified the ... Besides, I should confirm whether the ...
(force-pushed from 9deea5f to 6ca44dd, then from 6d0fc84 to 1d3d085)
Hi, @jakobbotsch
The failures look related to this PR. For example:

Assert failure(PID 23 [0x00000017], Thread: 23 [0x0017]): Assertion failed 'frameSize < -2040' in 'Runtime_66089:TestEntryPoint()' during 'Generate code' (IL size 441; hash 0x76f50dc2; FullOpts)
File: /__w/1/s/src/coreclr/jit/codegenarm64.cpp:1227
Image: /root/helix/work/correlation/corerun

Regardless, as we mentioned above, we are still not sure that we want to accept this PR. From what I see this PR makes it harder for us to implement future optimizations like #35274. Making some backends work differently to other backends also makes maintenance harder for us in the future. It is important for us to keep the backends as uniform as possible so that we can make changes that work the same way for all targets, without having to keep several different models in mind. What exactly is the justification for this change?
I can fix this.
(1) This PR is also an optimization.
This PR includes LoongArch64, RISCV64 and ARM64 and unifies some places; I think this PR shares more code than before.
As described in this PR's description above, especially for ARM64 the CodeGen::genPushCalleeSavedRegisters will be significantly simplified. Of course, I can update the docs about the frame layout, and if needed I can add some possible next steps for future optimization.
So this PR is primarily meant as a throughput optimization? Do you have numbers on how much this improves throughput?
This PR optimizes the current two steps into one step. I didn't rewrite this part, and the optimization is very noticeable.
For the case that does not use a frame pointer: this PR is the first step and just optimizes the two steps into one, because today ARM64/LoongArch64/RISCV64 always use FP. Optimizing the no-frame-pointer case for ARM64/LoongArch64/RISCV64 is the next step; it does not conflict with this PR and can also be implemented on top of this PR in a separate PR.
We have ASM diffs and throughput measurements automatically run for arm64 in CI. You can see them by looking at "runtime-coreclr superpmi-diffs", clicking "View more details on Azure Pipelines" and then opening the "Extensions" tab.
As you say arm64 always uses FP today. #35274 is about optimizing it so that leaf methods do not use frame pointers. There are currently issues around EE suspension, but I know that @filipnavara has been looking into this. I also would expect LA64 and RISCV64 to be able to use the same optimization if we manage to solve the EE suspension issue. x64/x86 already implement this optimization. I like the LOC diff of this change -- +678/-1913 -- but it is hard to say anything about that given that it regresses arm64 by a lot.
If this PR were just for ARM64, I agree that it would increase the prolog/epilog size, because it doesn't use an approach like the existing one. If you can accept this PR only for RISCV64 and LoongArch64 but not ARM64, I can revert the ARM64 changes.
I will have to discuss with @BruceForstall what we think. But yes, it seems likely we won't be able to change arm64 as part of this PR, but again that makes LA64/RISCV64 different from ARM64 and introduces long term maintenance burden for us.
OK.
I think this does not conflict with future work; we can optimize step by step and unify these in the future, and I believe we can.
(force-pushed from 07586f8 to b812c5f)
Hi, @jakobbotsch
This PR is the first one, and I will push a series of PRs in the future to amend the related code for ARMARCH/XARCH/LA64/RISCV64; some ideas or PRs are not included in this PR.
Avoid the redundant computation of the stackOffsets. After the `lvaAssignVirtualFrameOffsetsToLocals`, there is no need to recompute the LclVarDsc's StackOffset within the `lvaFixVirtualFrameOffsets`.
(force-pushed from b812c5f to d5ec274)
Hi, @jakobbotsch
@shushanhf I need to establish with Bruce whether we want to go in this direction (Bruce and Kunal are generally the owners of the JIT backends, not me, even though I often review the LA64/RISCV64 PRs). Mainly we have to discuss the points highlighted above, where we're concerned about uniformity in the backends.
OK, thanks. I want to delay this PR's optimization to the future. I want to merge this PR in a way you can accept, because I have some other PRs to push after this one.
How about I keep this PR and create a new PR in which I just modify the frame layout, without modifying the redundant computation of the stackOffsets?
The title of this PR implies it is purely an optimization. But are you saying that actually this PR is fundamentally changing stack layout, and that's the change you want to make?

There are constraints on the exact expected frame layout from several sources I know of: unwinding, ETW on Windows, EnC support and pinvokes. I'm sure there are constraints from other sources I'm not familiar with. I would not be very comfortable taking fundamental changes to frame layout without doing a lot of due diligence.

For example: https://learn.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=msvc-170#arm64-stack-frame-layout
OK.
We don't have time to work on this for .NET 9, so we will review it in .NET 10. |
I will close this PR. |
Redesign the frame layout to avoid the redundant computation of the stackOffsets.

After the `lvaAssignVirtualFrameOffsetsToLocals`, there is no need to recompute the LclVarDsc's StackOffset within the `lvaFixVirtualFrameOffsets`. First I push the LoongArch64's optimization, then I will implement the ARM64's optimization (of course including the RISCV64).

After this PR's optimization, the frame layout is divided into two parts by the `FP` slot: the `stackOffsets` are positive above the `FP` slot, while the `stackOffsets` are negative below the `FP` slot. With this PR, the `stackOffsets` are finalized within `lvaAssignVirtualFrameOffsetsToLocals`, so there is no need to recompute the LclVarDsc's StackOffset within `lvaFixVirtualFrameOffsets`.

Besides, the OSR's stackOffset within `Compiler::generatePatchpointInfo()` can also be omitted, as the `offsetAdjust` is zero.

Especially for ARM64, `CodeGen::genPushCalleeSavedRegisters` will be significantly simplified. At the same time, there is also no need to use `SetSaveFpLrWithAllCalleeSavedRegisters`.

`TARGET_XARCH` does not always use the frame pointer, so I will amend `Compiler::lvaAssignVirtualFrameOffsetsToLocals` depending on whether the frame layout is based on the frame pointer or SP, like this: the stackOffset will be its final value within `Compiler::lvaAssignVirtualFrameOffsetsToLocals`, and then the final computation of the stackOffset within `Compiler::lvaFixVirtualFrameOffsets` will be deleted.
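As an illustration of the layout described above, here is a hedged C++ sketch, not taken from the PR itself: the frame is anchored at the `FP` slot, with incoming stack arguments above it at positive offsets and callee-saved registers and locals below it at negative offsets. The slot sizes, the "+2 skips saved FP/return address" assumption, and the helper names are invented for illustration only.

```cpp
// Illustrative model of the FP-anchored layout described above (sizes and
// names are invented; this is not the JIT's actual data layout).
//
//   higher addresses
//   +--------------------+
//   | incoming args      |  <- positive offsets from FP
//   +--------------------+
//   | saved FP / RA      |  <- FP points here (offset 0)
//   +--------------------+
//   | callee-saved regs  |  <- negative offsets from FP
//   | locals / spills    |
//   +--------------------+
//   lower addresses
#include <cstdio>

// FP-relative offset of the i-th incoming stack argument (positive side of FP).
// slotSize would be 8 on a 64-bit target; "+2" assumes the saved FP and the
// return-address slot sit between FP and the incoming arguments.
static int incomingArgOffset(int index, int slotSize = 8) {
    return (index + 2) * slotSize;
}

// FP-relative offset of a local placed below the callee-saved area
// (negative side of FP).
static int localOffset(int calleeSavedSize, int localBytesSoFar) {
    return -(calleeSavedSize + localBytesSoFar);
}

int main() {
    std::printf("incoming arg #0 at FP%+d\n", incomingArgOffset(0));
    std::printf("first 8-byte local at FP%+d\n",
                localOffset(/*calleeSavedSize*/ 32, /*localBytesSoFar*/ 8));
    return 0;
}
```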
Next plans:
- As ARM32 and XARCH don't always use the frame pointer, these should be done by a separate PR.
- This PR is the first one, and I will push a series of PRs in the future to amend the related code for ARMARCH/XARCH/LA64/RISCV64.
- Don't worry: I will not increase the ARM64 prolog/epilog asm size and will not disturb the runtime team's work on optimizing leaf functions' FP usage (#35274); I can wait to modify ARM64 until something like #35274 is finished.
- I will unify most of this for ARMARCH/XARCH/LA64/RISCV64 to ease the long-term burden of maintaining different implementations.