Parallel Suffix cleanup #102
base: master
ZTE(power << 1);
for (;;) {
    result = result ^ result.shiftIntraLaneLeft(power, shiftMask);
    if (!--log2Count) { break; }
@thecppzoo note one condition in the inner loop now
This condition is broken: it does not work when the bit count is not a power of 2, like 7.
inc/zoo/swar/associative_iteration.h
Outdated
template<typename S>
constexpr auto parallel_suffix(S input) {
    constexpr auto log2Count = S::Lanes;
This can't be right: the parallel suffix does not depend on the number of lanes, but on the number of bits in each lane.
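To illustrate why the step count follows the bits per lane rather than the lane count, here is a minimal sketch on a plain `std::uint64_t`. It does not use the zoo SWAR types: `broadcastSketch`, `shiftIntraLaneLeftSketch`, and `parallelSuffixSketch` are hypothetical names, and only the XOR step mirrors the loop body quoted in this thread.

```cpp
#include <cstdint>

// Hypothetical helper: copies the low LaneBits bits of `pattern` into every
// complete lane of a 64-bit word.
template<int LaneBits>
constexpr std::uint64_t broadcastSketch(std::uint64_t pattern) {
    std::uint64_t result = 0;
    for (int lane = 0; lane < 64 / LaneBits; ++lane) {
        result |= pattern << (lane * LaneBits);
    }
    return result;
}

// Hypothetical stand-in for an intra-lane left shift: bits that would cross
// into the neighboring lane are masked off before shifting.
template<int LaneBits>
constexpr std::uint64_t shiftIntraLaneLeftSketch(std::uint64_t v, int count) {
    static_assert(0 < LaneBits && LaneBits <= 64, "lane must fit in 64 bits");
    const std::uint64_t laneAll = ~std::uint64_t{0} >> (64 - LaneBits);
    const std::uint64_t keep = broadcastSketch<LaneBits>(laneAll >> count);
    return (v & keep) << count;
}

// XOR parallel suffix inside each lane: bit i becomes the XOR of bits 0..i of
// its lane. The loop runs ceil(log2(LaneBits)) times, whatever the lane count.
template<int LaneBits>
constexpr std::uint64_t parallelSuffixSketch(std::uint64_t input) {
    std::uint64_t result = input;
    for (int power = 1; power < LaneBits; power <<= 1) {
        result ^= shiftIntraLaneLeftSketch<LaneBits>(result, power);
    }
    return result;
}
```

With 7-bit lanes the loop runs for `power` = 1, 2, 4. The last step overshoots the lane width, which is harmless here because the overshooting bits are masked off before the shift.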
Looks like you might be reviewing an outdated version?
This implementation might be simple enough, sure, but it can only accept lane sizes that have a power of two number of bits.
hmmm... yeah i see what you mean...
On Thu, 12 Sept 2024 at 19:18, thecppzoo wrote:
This implementation might be simple enough, sure, but it can only accept
lane sizes that have a power of two number of bits.
Let's review if the implementation I made is less efficient than yours.
Otherwise, the much harder challenge of supporting any arbitrary bitcount
will have to decompose the number of bits into its binary representation to
make the groups, and then associative iteration (AI) would come to bear more clearly.
In simpler terms, this implementation is like multiplication when the
factor is a power of two, much easier.
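The multiplication analogy can be sketched as follows. This is only an illustration of the analogy, not code from the PR; both function names are hypothetical. The power-of-two case collapses to one shift, while the general case has to walk the binary representation of the factor.

```cpp
#include <cstdint>

// Easy case: the factor is 2^log2Factor, so a single shift is enough.
constexpr std::uint64_t multiplyByPowerOfTwoSketch(std::uint64_t value, int log2Factor) {
    return value << log2Factor;
}

// General case: decompose `factor` into its binary representation and add
// the correspondingly shifted `value` for every set bit.
constexpr std::uint64_t multiplyByDecompositionSketch(std::uint64_t value, std::uint64_t factor) {
    std::uint64_t result = 0;
    while (factor) {
        if (factor & 1) { result += value; }
        value <<= 1;   // value * 2^k for the next bit position
        factor >>= 1;  // move on to the next bit of the factor
    }
    return result;
}
```

For a factor such as 7 (binary 111) the general version needs three additions, one per set bit, which is the kind of grouping the comment describes for arbitrary bit counts.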
OK, now working with non-power-of-two numbers of bits; I just needed to idiot check
myself about the log2 impl for non-powers of two:
https://godbolt.org/z/aPa6q8r8c
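For readers without the link at hand, here is one way the step count could be computed for non-powers of two. This is an assumed sketch, not necessarily what the Godbolt link contains; `floorLog2Sketch` and `ceilLog2Sketch` are hypothetical names. The point is that a rounded-down log2 gives too few doubling steps for a width like 7, so the count has to be rounded up.

```cpp
// Position of the highest set bit; returns 0 for inputs 0 and 1.
constexpr int floorLog2Sketch(unsigned n) {
    int result = 0;
    while (n >>= 1) { ++result; }
    return result;
}

// Smallest k such that 2^k >= n, i.e. the number of doubling steps needed to
// cover n bits.
constexpr int ceilLog2Sketch(unsigned n) {
    return n <= 1 ? 0 : floorLog2Sketch(n - 1) + 1;
}
```

For 7 bits the rounded-down value is 2, which would only cover a window of 4 bits, while the rounded-up value is 3 and covers the whole lane.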
I just did this: I am very surprised and disappointed that the generated code for powers of two is basically identical. We now have a good example of code that the optimizer does not "understand", or perhaps we have to look deeper into whether this implementation is inherently inefficient. Another lesson is to always, always, always work first on the straightforward solution to the straightforward need, to have something to compare against the sophisticated solutions to abstract and general needs.
Not totally sure how this can be turned into AI right now... I think this function might be too simple for associative iteration?