Improve internal DX around byte classification [1] (#16864)
This PR improves the internal DX of classifying a `u8` into a smaller enum. It does so by adding a `ClassifyBytes` derive proc macro. The classification is declared directly on the enum variants, and everything you see here still happens at compile time.

Before:

```rs
#[derive(Debug, Clone, Copy, PartialEq)]
enum Class {
    ValidStart,
    ValidInside,
    OpenBracket,
    OpenParen,
    Slash,
    Other,
}

const CLASS_TABLE: [Class; 256] = {
    let mut table = [Class::Other; 256];

    macro_rules! set {
        ($class:expr, $($byte:expr),+ $(,)?) => {
            $(table[$byte as usize] = $class;)+
        };
    }

    macro_rules! set_range {
        ($class:expr, $start:literal ..= $end:literal) => {
            let mut i = $start;
            while i <= $end {
                table[i as usize] = $class;
                i += 1;
            }
        };
    }

    set_range!(Class::ValidStart, b'a'..=b'z');
    set_range!(Class::ValidStart, b'A'..=b'Z');
    set_range!(Class::ValidStart, b'0'..=b'9');

    set!(Class::OpenBracket, b'[');
    set!(Class::OpenParen, b'(');
    set!(Class::Slash, b'/');
    set!(Class::ValidInside, b'-', b'_', b'.');

    table
};
```

After:

```rs
#[derive(Debug, Clone, Copy, PartialEq, ClassifyBytes)]
enum Class {
    #[bytes_range(b'a'..=b'z', b'A'..=b'Z', b'0'..=b'9')]
    ValidStart,

    #[bytes(b'-', b'_', b'.')]
    ValidInside,

    #[bytes(b'[')]
    OpenBracket,

    #[bytes(b'(')]
    OpenParen,

    #[bytes(b'/')]
    Slash,

    #[fallback]
    Other,
}
```

Previously we generated a standalone `CLASS_TABLE` that we could access directly. Now the table is part of `Class`, so the usage has to change:

```diff
- CLASS_TABLE[cursor.curr as usize]
+ Class::TABLE[cursor.curr as usize]
```

This is slightly worse UX, which is where another change comes in. The `ClassifyBytes` derive macro also implements `From<u8> for #enum_name`. This allows us to call `.into()` on any `u8`, as long as the surrounding context (for example, matching against `Class` variants) makes it clear that the target type is `Class`.
In our scenario:

```diff
- Class::TABLE[cursor.curr as usize]
+ cursor.curr.into()
```

In practice, usage looks something like this:

```diff
          while cursor.pos < len {
-             match Class::TABLE[cursor.curr as usize] {
+             match cursor.curr.into() {
-                 Class::Escape => match Class::TABLE[cursor.next as usize] {
+                 Class::Escape => match cursor.next.into() {
                      // An escaped whitespace character is not allowed
                      Class::Whitespace => return MachineState::Idle,

                      // An escaped character, skip ahead to the next character
                      _ => cursor.advance(),
                  },

                  // End of the string
                  Class::Quote if cursor.curr == end_char => return self.done(start_pos, cursor),

                  // Any kind of whitespace is not allowed
                  Class::Whitespace => return MachineState::Idle,

                  // Everything else is valid
                  _ => {}
              };

              cursor.advance()
          }

          MachineState::Idle
      }
  }
```

If you inspect `Class::TABLE` manually, for example in your editor, you can see that it is properly generated at compile time. Given this input:

```rs
#[derive(Clone, Copy, ClassifyBytes)]
enum Class {
    #[bytes_range(b'a'..=b'z')]
    AlphaLower,

    #[bytes_range(b'A'..=b'Z')]
    AlphaUpper,

    #[bytes(b'@')]
    At,

    #[bytes(b':')]
    Colon,

    #[bytes(b'-')]
    Dash,

    #[bytes(b'.')]
    Dot,

    #[bytes(b'\0')]
    End,

    #[bytes(b'!')]
    Exclamation,

    #[bytes_range(b'0'..=b'9')]
    Number,

    #[bytes(b'[')]
    OpenBracket,

    #[bytes(b']')]
    CloseBracket,

    #[bytes(b'(')]
    OpenParen,

    #[bytes(b'%')]
    Percent,

    #[bytes(b'"', b'\'', b'`')]
    Quote,

    #[bytes(b'/')]
    Slash,

    #[bytes(b'_')]
    Underscore,

    #[bytes(b' ', b'\t', b'\n', b'\r', b'\x0C')]
    Whitespace,

    #[fallback]
    Other,
}
```

This is the result:

<img width="1244" alt="image" src="https://github.com/user-attachments/assets/6ffd6ad3-0b2f-4381-a24c-593e4c72080e" />
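To make the shape of the generated code concrete, here is a hand-written sketch of roughly what a derive like `ClassifyBytes` could expand to for the smaller `Class` enum from the first example. It is an approximation under stated assumptions, not the macro's actual output: the description only confirms that a `Class::TABLE` lookup and a `From<u8>` impl exist, so the way the table is filled here and the `count_valid_start` helper are purely illustrative.

```rust
// Hand-written approximation of a `ClassifyBytes`-style expansion.
// Assumption: the real macro output may be structured differently.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Class {
    ValidStart,
    ValidInside,
    OpenBracket,
    OpenParen,
    Slash,
    Other,
}

impl Class {
    // Built inside a `const` initializer, so the table is computed at compile time.
    const TABLE: [Class; 256] = {
        // Every byte starts out as the fallback variant.
        let mut table = [Class::Other; 256];

        // `for` loops are not available in const contexts, so use `while`.
        let mut i = 0;
        while i < 256 {
            table[i] = match i as u8 {
                b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' => Class::ValidStart,
                b'-' | b'_' | b'.' => Class::ValidInside,
                b'[' => Class::OpenBracket,
                b'(' => Class::OpenParen,
                b'/' => Class::Slash,
                _ => Class::Other,
            };
            i += 1;
        }

        table
    };
}

// This impl is what lets call sites write `byte.into()` instead of
// indexing `Class::TABLE` by hand.
impl From<u8> for Class {
    fn from(byte: u8) -> Self {
        Class::TABLE[byte as usize]
    }
}

// Hypothetical helper (not part of the PR): counts the bytes classified
// as `ValidStart`, going through the `From<u8>` impl.
fn count_valid_start(input: &[u8]) -> usize {
    input
        .iter()
        .filter(|&&byte| Class::from(byte) == Class::ValidStart)
        .count()
}
```

Because the table has exactly 256 entries, indexing it with a `u8` cast to `usize` can never go out of bounds, which is what makes the infallible `From<u8>` conversion reasonable here.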