Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve internal DX around byte classification [1] #16864

Merged
merged 11 commits into from
Mar 5, 2025

Conversation

RobinMalfait
Copy link
Member

@RobinMalfait RobinMalfait commented Feb 28, 2025

This PR improves the internal DX when working with u8 classification into a smaller enum. This is done by implementing a ClassifyBytes proc derive macro. The benefit of this is that the DX is much better and everything you will see here is done at compile time.

Before:

#[derive(Debug, Clone, Copy, PartialEq)]
enum Class {
    ValidStart,
    ValidInside,
    OpenBracket,
    OpenParen,
    Slash,
    Other,
}

const CLASS_TABLE: [Class; 256] = {
    let mut table = [Class::Other; 256];

    macro_rules! set {
        ($class:expr, $($byte:expr),+ $(,)?) => {
            $(table[$byte as usize] = $class;)+
        };
    }

    macro_rules! set_range {
        ($class:expr, $start:literal ..= $end:literal) => {
            let mut i = $start;
            while i <= $end {
                table[i as usize] = $class;
                i += 1;
            }
        };
    }

    set_range!(Class::ValidStart, b'a'..=b'z');
    set_range!(Class::ValidStart, b'A'..=b'Z');
    set_range!(Class::ValidStart, b'0'..=b'9');

    set!(Class::OpenBracket, b'[');
    set!(Class::OpenParen, b'(');

    set!(Class::Slash, b'/');

    set!(Class::ValidInside, b'-', b'_', b'.');

    table
};

After:

#[derive(Debug, Clone, Copy, PartialEq, ClassifyBytes)]
enum Class {
    #[bytes_range(b'a'..=b'z', b'A'..=b'Z', b'0'..=b'9')]
    ValidStart,

    #[bytes(b'-', b'_', b'.')]
    ValidInside,

    #[bytes(b'[')]
    OpenBracket,

    #[bytes(b'(')]
    OpenParen,

    #[bytes(b'/')]
    Slash,

    #[fallback]
    Other,
}

Before we were generating a CLASS_TABLE that we could access directly, but now it will be part of the Class. This means that the usage has to change:

- CLASS_TABLE[cursor.curr as usize]
+ Class::TABLE[cursor.curr as usize]

This is slightly worse UX, and this is where another change comes in. We implemented the From<u8> for #enum_name trait inside of the ClassifyBytes derive macro. This allows us to use .into() on any u8 as long as we are comparing it to a Class instance. In our scenario:

- Class::TABLE[cursor.curr as usize]
+ cursor.curr.into()

Usage wise, this looks something like this:

        while cursor.pos < len {
-           match Class::TABLE[cursor.curr as usize] {
+           match cursor.curr.into() {
-               Class::Escape => match Class::Table[cursor.next as usize] {
+               Class::Escape => match cursor.next.into() {
                    // An escaped whitespace character is not allowed
                    Class::Whitespace => return MachineState::Idle,

                    // An escaped character, skip ahead to the next character
                    _ => cursor.advance(),
                },

                // End of the string
                Class::Quote if cursor.curr == end_char => return self.done(start_pos, cursor),

                // Any kind of whitespace is not allowed
                Class::Whitespace => return MachineState::Idle,

                // Everything else is valid
                _ => {}
            };

            cursor.advance()
        }

        MachineState::Idle
    }
}

If you manually look at the Class::TABLE in your editor for example, you can see that it is properly generated at compile time.

Given this input:

#[derive(Clone, Copy, ClassifyBytes)]
enum Class {
    #[bytes_range(b'a'..=b'z')]
    AlphaLower,

    #[bytes_range(b'A'..=b'Z')]
    AlphaUpper,

    #[bytes(b'@')]
    At,

    #[bytes(b':')]
    Colon,

    #[bytes(b'-')]
    Dash,

    #[bytes(b'.')]
    Dot,

    #[bytes(b'\0')]
    End,

    #[bytes(b'!')]
    Exclamation,

    #[bytes_range(b'0'..=b'9')]
    Number,

    #[bytes(b'[')]
    OpenBracket,

    #[bytes(b']')]
    CloseBracket,

    #[bytes(b'(')]
    OpenParen,

    #[bytes(b'%')]
    Percent,

    #[bytes(b'"', b'\'', b'`')]
    Quote,

    #[bytes(b'/')]
    Slash,

    #[bytes(b'_')]
    Underscore,

    #[bytes(b' ', b'\t', b'\n', b'\r', b'\x0C')]
    Whitespace,

    #[fallback]
    Other,
}

This is the result:
image

@RobinMalfait RobinMalfait force-pushed the feat/improve-classification branch from c3d9d8c to 0639df7 Compare February 28, 2025 08:54
@RobinMalfait RobinMalfait marked this pull request as ready for review February 28, 2025 09:01
@RobinMalfait RobinMalfait requested a review from a team as a code owner February 28, 2025 09:01
@RobinMalfait RobinMalfait changed the title Improve internal DX around byte classification Improve internal DX around byte classification [1] Feb 28, 2025
Base automatically changed from feat/improve-oxide-scanner to main March 5, 2025 10:55
@RobinMalfait RobinMalfait force-pushed the feat/improve-classification branch from 4d0d459 to 3887c6b Compare March 5, 2025 10:57
This is a little proc macro that allows us to write the `enum Class` in
a more ergonomic way.
This gets rid of the redundant `Class::CLASS_TABLE`

```diff
- Class::CLASS_TABLE
+ Class::TABLE
```
This allows us to apply this change:

```diff
- Class::CLASS_TABLE[cursor.curr as usize]
+ cursor.curr.classify()
```

Where `cursor.curr` is a `u8`
The `classify` method is implemented as `Class::CLASS_TABLE[*self as
usize]` and uses an `#[inline(always)]` as well.

This is just a DX improvement.
There is a tiny problem with the custom trait which exposes
`classify()`. The moment you have 2 enums with `ClassifyBytes` in the
same file, it means that the `trait U8Ext` was duplicated resulting in
errors.

To solve this, we can also implement the `From<u8> for #enum_name`
trait. This allows us to just use `b'a'.into()` in a context where you
are comparing it with an `Class::AlphaLower` for example.
In case the property starts with `--`, we will parse it as a CSS
variable which can contain a few more characters like emoji for example.

Initial reason that I included uppercase is because I was thinking about
`Webkit`, but that should be `-webkit` in CSS anyway.

Technically uppercase values are valid in CSS:
```css
BODY {
  BACKGROUND-COLOR: RED;
}
```
Play: https://play.tailwindcss.com/sUsWi0Ampe?file=css

But for the sake of the arbitrary property, let's only allow
lower-kebab-case property names.
@RobinMalfait RobinMalfait force-pushed the feat/improve-classification branch from 3887c6b to 54a1d6b Compare March 5, 2025 11:00
Copy link
Member

@philipp-spiess philipp-spiess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff. Happy to see the CLASS_TABLE macro stuff gone honestly. Not really happy that this has to be its own crate but what can you do. Also unsure wether Char#into() is super intuitive API but kinda also nice that it doesn't leak that it's a lookup table internally so maybe that's cool. Anyway no blockers from my side 👍

Comment on lines 86 to 87
// If no fallback variant is found, default to "Other"
let fallback_ident = fallback_variant.unwrap_or_else(|| Ident::new("Other", Span::call_site()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we error in the case no fallback variant is defined? Otherwise what happens if we don't define an Other on the enum? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can be explicit here; In this PR I've always added the #[fallback] anyway. Let me change it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented here:

5f22c13

@@ -102,7 +103,7 @@ impl Machine for ArbitraryPropertyMachine {
//
// E.g.: `[color:red]`
// ^^^^^
Class::Alpha => cursor.advance(),
Class::AlphaLower => cursor.advance(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sneaking in a behavioral change here I see 👀

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤫

If you're missing the `#[fallback]` variant, you will see an error like
this:

```
error: proc-macro derive panicked
  --> crates/oxide/src/extractor/string_machine.rs:71:45
   |
71 | #[derive(Debug, Clone, Copy, PartialEq, Eq, ClassifyBytes)]
   |                                             ^^^^^^^^^^^^^
   |
   = help: message: A variant marked with #[fallback] is missing
```
@RobinMalfait RobinMalfait merged commit 4c11001 into main Mar 5, 2025
5 checks passed
@RobinMalfait RobinMalfait deleted the feat/improve-classification branch March 5, 2025 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants