Parallel Suffix cleanup #102

Draft
wants to merge 15 commits into
base: master

Collaborator

@jamierpond jamierpond commented Sep 12, 2024

I'm not totally sure how this can be turned into associative iteration (AI) right now... I think this function might be too simple for it?

template<typename S>
constexpr auto parallelSuffix(S input) {
    auto
        log2Count = log2_of_power_of_two(S::NBits),
        power = 1;
    auto
        result = input,
        shiftMask = S{~S::MostSignificantBit};

    for (;;) {
        result = result ^ result.shiftIntraLaneLeft(power, shiftMask);
        if (!--log2Count) { break; }
        shiftMask = shiftMask & S{shiftMask.value() >> power};
        power <<= 1;
    }

    return S{result};
}

ZTE(power << 1);
for (;;) {
result = result ^ result.shiftIntraLaneLeft(power, shiftMask);
if (!--log2Count) { break; }
Collaborator Author

@thecppzoo note one condition in the inner loop now

Owner

This condition is broken: it does not work when the bit count is not a power of 2, such as 7.
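
To make the objection concrete, here is a minimal scalar sketch, not the code in this PR. It models a single lane of `nbits` bits inside a `std::uint64_t` and assumes the loop computes a XOR scan by doubling, as the snippet under review appears to do. All names (`laneMask`, `prefixXorNaive`, `prefixXorSteps`, `floorLog2`, `ceilLog2`) are hypothetical. If the step count for a 7-bit lane effectively rounds down to 2 (an assumption about how the helper behaves on such input), the scan only covers 4 bits; it needs 3 steps.

```cpp
#include <cstdint>

// Hypothetical scalar model: one lane of `nbits` bits held in a uint64_t.
constexpr std::uint64_t laneMask(unsigned nbits) {
    return nbits >= 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << nbits) - 1;
}

// Reference: bit i of the result is the XOR of input bits 0..i.
constexpr std::uint64_t prefixXorNaive(std::uint64_t input, unsigned nbits) {
    std::uint64_t result = 0, running = 0;
    for (unsigned i = 0; i < nbits; ++i) {
        running ^= (input >> i) & 1;
        result |= running << i;
    }
    return result;
}

// Doubling version that performs exactly `steps` shift-and-XOR rounds.
constexpr std::uint64_t prefixXorSteps(
    std::uint64_t input, unsigned nbits, unsigned steps
) {
    const auto mask = laneMask(nbits);
    std::uint64_t result = input & mask;
    unsigned power = 1;
    for (unsigned s = 0; s < steps; ++s, power <<= 1) {
        // Bits shifted past the top of the lane are dropped by the mask.
        result ^= (result << power) & mask;
    }
    return result;
}

constexpr unsigned floorLog2(unsigned n) {
    unsigned r = 0;
    while (n >>= 1) { ++r; }
    return r;
}

constexpr unsigned ceilLog2(unsigned n) {
    return n <= 1 ? 0 : floorLog2(n - 1) + 1;
}

// 7-bit lane, only the lowest bit set: every bit of the scan should be 1.
static_assert(prefixXorNaive(1, 7) == 0x7F, "reference result");
static_assert(prefixXorSteps(1, 7, floorLog2(7)) == 0x0F, "two steps cover 4 bits");
static_assert(prefixXorSteps(1, 7, ceilLog2(7)) == 0x7F, "three steps cover all 7");
```

In this scalar model, rounding the step count up is enough for widths that are not powers of two; whether the same holds for the SWAR type in this PR, with its per-step `shiftMask`, is not something this sketch establishes.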


template<typename S>
constexpr auto parallel_suffix(S input) {
constexpr auto log2Count = S::Lanes;
Owner

This can't be right: the parallel suffix depends on the number of bits in each lane, not on the number of lanes.

Collaborator Author

Looks like you might be reviewing an outdated version?

@thecppzoo
Owner

This implementation might be simple enough, sure, but it only accepts lane sizes whose bit count is a power of two.
Let's check whether the implementation I made is less efficient than yours.
Otherwise, the much harder challenge of supporting an arbitrary bit count requires decomposing the number of bits into its binary representation to form the groups, and that is where associative iteration (AI) would come to bear more clearly.
In simpler terms, this implementation is like multiplication when the factor is a power of two: much easier.
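
The multiplication analogy can be sketched as follows. This is an illustrative, hypothetical helper (`iterateAssociative` and `Plus` are made-up names, not this library's API): it combines `count` copies of a value with an associative operator by walking the binary digits of `count`. When `count` is a power of two, only the doubling steps do real work; any other count also needs a combine step for each set bit.

```cpp
#include <cstdint>

// Hypothetical helper, not from this PR: combines `count` copies of `base`
// with an associative operator `op`, driven by the binary digits of `count`.
template<typename T, typename Op>
constexpr T iterateAssociative(T base, unsigned count, T neutral, Op op) {
    T result = neutral;
    while (count) {
        if (count & 1) {
            result = op(result, base);  // this binary digit contributes a group
        }
        base = op(base, base);          // doubling step: group size 1, 2, 4, ...
        count >>= 1;
    }
    return result;
}

// Addition as the associative operator turns the iteration into multiplication.
struct Plus {
    constexpr std::uint64_t operator()(std::uint64_t a, std::uint64_t b) const {
        return a + b;
    }
};

// 5 * 8: a power of two, so a single group after three doublings.
static_assert(
    iterateAssociative(std::uint64_t{5}, 8u, std::uint64_t{0}, Plus{}) == 40,
    "power-of-two factor"
);
// 5 * 7: 7 = 4 + 2 + 1, so three groups are combined.
static_assert(
    iterateAssociative(std::uint64_t{5}, 7u, std::uint64_t{0}, Plus{}) == 35,
    "factor with several set bits"
);
```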

@jamierpond
Collaborator Author

jamierpond commented Sep 13, 2024

@jamierpond
Collaborator Author

jamierpond commented Sep 13, 2024 via email

@jamierpond
Collaborator Author

jamierpond commented Sep 13, 2024 via email

@thecppzoo
Owner

@thecppzoo https://godbolt.org/z/5jdfffb1M

I just did this:
https://godbolt.org/z/cE1eoKM3d

I am very surprised and disappointed that the generated code for powers of two is basically identical. We now have a good example of code that the optimizer does not "understand", or perhaps we need to look deeper into whether this implementation is inherently inefficient.

Another lesson is to always, always, always work on the straightforward solution to the straightforward need first, so there is something to compare against sophisticated solutions to abstract and general needs.
