Improve Oxide candidate extractor [0] #16306

RobinMalfait · 2025-02-06T16:15:27Z

This PR adds a new candidate¹ extractor with 2 major goals in mind:

It must be way easier to reason about and maintain.
It must have on-par performance or better than the current candidate extractor.

Problem

Candidate extraction is a bit of a wild west in Tailwind CSS and it's a very critical step to make sure that all your classes are picked up correctly to ensure that your website/app looks good.

One issue we run into is that Tailwind CSS is used in many different "host" languages and frameworks with their own syntax. It's not only used in HTML but also in JSX/TSX, Vue, Svelte, Angular, Pug, Rust, PHP, Rails, Clojure, .NET, … the list goes on and all of these have different syntaxes. Introducing dedicated parsers for each of these languages would be a huge maintenance burden because there will be new languages and frameworks coming up all the time. The best thing we can do is make assumptions and so far we've done a pretty good job at that.

The only certainty we have is that there is at least some structure to the possible Tailwind classes used in a file. E.g.: abc#def is definitely not a valid class, hover:flex definitely is. In a perfect world we would limit the characters that can be used and define a formal grammar that each candidate must follow, but that's not really an option right now (maybe this is something we can implement in future major versions).

The current candidate extractor we have has grown organically over time and required patching things here and there to make it work in various scenarios (and edge cases due to the different languages Tailwind is used in).

While there is definitely some structure, we essentially work in 2 phases:

Try to extract 0..n candidates. (This is the hard part)
Validate each candidate to make sure they are valid looking classes (by validating against the few rules we have)

Another reason the current extractor is hard to reason about is that we need it to be fast and that comes with some trade-offs to readability and maintainability.

Unfortunately there will always be a lot of false positives, but if we extract more classes than necessary then that's fine. It's only when we pass the candidates to the core engine that we will know for sure if they are valid or not. (we have some ideas to limit the amount of false positives but that's for another time)

Solution

Since the introduction of Tailwind CSS v4, we re-worked the internals quite a bit and we have a dedicated internal AST structure for candidates. For example, if you take a look at this:

<div class="[@media(pointer:fine)]:data-[state=pending]:hover:text-red-500/(--my-opacity)"></div>

This will be parsed into the following AST:

[
  {
    "kind": "functional",
    "root": "text",
    "value": {
      "kind": "named",
      "value": "red-500",
      "fraction": null
    },
    "modifier": {
      "kind": "arbitrary",
      "value": "var(--my-opacity)"
    },
    "variants": [
      {
        "kind": "static",
        "root": "hover"
      },
      {
        "kind": "functional",
        "root": "data",
        "value": {
          "kind": "arbitrary",
          "value": "state=pending"
        },
        "modifier": null
      },
      {
        "kind": "arbitrary",
        "selector": "@media(pointer:fine)",
        "relative": false
      }
    ],
    "important": false,
    "raw": "[@media(pointer:fine)]:data-[state=pending]:hover:text-red-500/(--my-opacity)"
  }
]

We have a lot of information here and we gave these patterns a name internally. You'll see names like functional, static, arbitrary, modifier, variant, compound, ...

Some of these patterns will be important for the new candidate extractor as well:

Name	Example	Description
Static utility (named)	`flex`	A simple utility with no inputs whatsoever
Functional utility (named)	`bg-red-500`	A utility `bg` with an input that is named `red-500`
Arbitrary value	`bg-[#0088cc]`	A utility `bg` with an input that is arbitrary, denoted by `[…]`
Arbitrary variable	`bg-(--my-color)`	A utility `bg` with an input that is arbitrary and has a CSS variable shorthand, denoted by `(--…)`
Arbitrary property	`[color:red]`	A utility that sets a property to a value on the fly

A similar structure exists for modifiers, where each modifier must start with /:

Name	Example	Description
Named modifier	bg-red-500`/20`	A named modifier
Arbitrary value	bg-red-500`/[20%]`	An arbitrary value, denoted by `/[…]`
Arbitrary variable	bg-red-500`/(--my-opacity)`	An arbitrary variable, denoted by `/(…)`

Last but not least, we have variants. They have a very similar pattern but they must end in a :.

Name	Example	Description
Named variant	`hover:`	A named variant
Arbitrary value	`data-[state=pending]:`	An arbitrary value, denoted by `[…]`
Arbitrary variable	`supports-(--my-variable):`	An arbitrary variable, denoted by `(…)`
Arbitrary variant	`[@media(pointer:fine)]:`	Similar to arbitrary properties, this will generate a variant on the fly

The goal with the new extractor is to encode these separate patterns in dedicated pieces of code (we called them "machines" because they are mostly state machine based and because I've been watching Person of Interest but I digress).

This will allow us to focus on each pattern separately, so if there is a bug or some new syntax we want to support we can add it to those machines.

One nice benefit of this is that we can encode the rules and handle validation as we go. The moment we know that some pattern is invalid, we can bail out early.

At the time of writing this, there are a bunch of machines:

Overview of the machines

ArbitraryPropertyMachine

Extracts candidates such as [color:red]. Some of the rules are:
1. There must be a property name
2. There must be a :
3. There must be a value
There cannot be any spaces, and the brackets are included in the extracted span. If the property is a CSS variable, it must be a valid CSS variable (uses the CssVariableMachine).
```
[color:red]
^^^^^^^^^^^

[--my-color:red]
^^^^^^^^^^^^^^^^
```
Depends on the StringMachine and CssVariableMachine.
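The rules above can be sketched as a standalone check. This is a simplified illustration, not the actual machine: `looks_like_arbitrary_property` is a hypothetical name, the set of characters allowed in the property name is an assumption, and string handling and CSS variable validation (which the real machine delegates to the StringMachine and CssVariableMachine) are left out.

```rust
/// Hypothetical helper: checks the basic `[property:value]` shape.
/// Assumption: property names consist of ASCII alphanumerics, `-` and `_`.
fn looks_like_arbitrary_property(candidate: &[u8]) -> bool {
    // The brackets are part of the candidate.
    let inner = match candidate {
        [b'[', inner @ .., b']'] => inner,
        _ => return false,
    };

    // There cannot be any spaces.
    if inner.iter().any(|b| b.is_ascii_whitespace()) {
        return false;
    }

    // There must be a `:` separating the property from the value.
    let Some(colon) = inner.iter().position(|&b| b == b':') else {
        return false;
    };
    let (property, value) = (&inner[..colon], &inner[colon + 1..]);

    // There must be a property name and a value.
    let valid_property = !property.is_empty()
        && property
            .iter()
            .all(|b| b.is_ascii_alphanumeric() || *b == b'-' || *b == b'_');

    valid_property && !value.is_empty()
}
```

With this check, `[color:red]` and `[--my-color:red]` pass, while something like `[CssClass("gap-y-4")]` is rejected because it has no `:`.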
ArbitraryValueMachine

Extracts arbitrary values for utilities and modifiers including the brackets:
```
bg-[#0088cc]
   ^^^^^^^^^

bg-red-500/[20%]
           ^^^^^
```
Depends on the StringMachine.
ArbitraryVariableMachine

Extracts arbitrary variables including the parentheses. The first argument must be a valid CSS variable, the other arguments are optional fallback arguments.
```
(--my-value)
^^^^^^^^^^^^

bg-red-500/(--my-opacity)
           ^^^^^^^^^^^^^^
```
Depends on the StringMachine and CssVariableMachine.
CandidateMachine

Uses the variant machine and utility machine. It will make sure that 0 or more variants are directly touching and followed by a utility.
```
hover:focus:flex
^^^^^^^^^^^^^^^^

aria-invalid:bg-red-500/(--my-opacity)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
Depends on the VariantMachine and UtilityMachine.
CssVariableMachine

Extracts CSS variables. They must start with -- and must contain at least one alphanumeric character, -, or _. They can also contain any escaped character (except for whitespace).
```
bg-(--my-color)
    ^^^^^^^^^^

bg-red-500/(--my-opacity)
            ^^^^^^^^^^^^

bg-(--my-color)/(--my-opacity)
    ^^^^^^^^^^   ^^^^^^^^^^^^
```
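A rough sketch of these rules as a single function, assuming inclusive end positions like the spans shown later in this description. `css_variable_end` is a hypothetical name. It stops at the first byte that cannot be part of the variable and leaves it to the caller to decide whether that byte is acceptable, which may differ from how the real machine bails out.

```rust
/// Hypothetical helper: returns the inclusive end index of a CSS variable
/// starting at `start`, or `None` if there is no valid variable there.
fn css_variable_end(input: &[u8], start: usize) -> Option<usize> {
    // Must start with `--`.
    if input.get(start) != Some(&b'-') || input.get(start + 1) != Some(&b'-') {
        return None;
    }

    let mut pos = start + 2;
    while pos < input.len() {
        match input[pos] {
            b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b'-' | b'_' => pos += 1,
            // An escape consumes the next byte, which must not be whitespace.
            b'\\' => match input.get(pos + 1) {
                Some(next) if !next.is_ascii_whitespace() => pos += 2,
                _ => return None,
            },
            // Anything else ends the variable.
            _ => break,
        }
    }

    // There must be at least one character after the `--`.
    if pos == start + 2 {
        None
    } else {
        Some(pos - 1)
    }
}
```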
ModifierMachine

Extracts modifiers including the /
- /[ will delegate to the ArbitraryValueMachine
- /( will delegate to the ArbitraryVariableMachine
```
bg-red-500/20
          ^^^

bg-red-500/[20%]
          ^^^^^^

bg-red-500/(--my-opacity)
          ^^^^^^^^^^^^^^^
```
Depends on the ArbitraryValueMachine and ArbitraryVariableMachine.
NamedUtilityMachine

Extracts named utilities regardless of whether they are functional or static.
```
flex
^^^^

px-2.5
^^^^^^
```
This includes rules like: A . must be surrounded by digits.

Depends on the ArbitraryValueMachine and ArbitraryVariableMachine.
NamedVariantMachine

Extracts named variants regardless of whether they are functional or static. This is very similar to the NamedUtilityMachine but with different rules. We could combine them, but splitting things up makes it easier to reason about.

Another rule is that the : must be included.
```
hover:flex
^^^^^^

data-[state=pending]:flex
^^^^^^^^^^^^^^^^^^^^^

supports-(--my-variable):flex
^^^^^^^^^^^^^^^^^^^^^^^^^
```
Depends on the ArbitraryVariableMachine, ArbitraryValueMachine, and ModifierMachine.
StringMachine

This is a low-level machine that is used by various other machines. The only job this has is to extract strings that start with double quotes, single quotes or backticks.

We have this because once you are in a string, we don't have to make sure that brackets, parens and curlies are properly balanced. We have to make sure that balancing brackets are properly handled in other machines.
```
content-["Hello_World!"]
         ^^^^^^^^^^^^^^

bg-[url("https://example.com")]
        ^^^^^^^^^^^^^^^^^^^^^
```
UtilityMachine

Extracts utilities, it will use the lower level NamedUtilityMachine, ArbitraryPropertyMachine and ModifierMachine to extract the utility.

It will also handle important markers (including the legacy important marker).
```
flex
^^^^

bg-red-500/20
^^^^^^^^^^^^^

!bg-red-500/20      Legacy important marker
^^^^^^^^^^^^^^

bg-red-500/20!      New important marker
^^^^^^^^^^^^^^

!bg-red-500/20!     Both, but this is considered invalid
^^^^^^^^^^^^^^^
```
Depends on the ArbitraryPropertyMachine, NamedUtilityMachine, and ModifierMachine.
VariantMachine

Extracts variants, it will use the lower level NamedVariantMachine and ArbitraryValueMachine to extract the variant.
```
hover:focus:flex
^^^^^^
      ^^^^^^
```
Depends on the NamedVariantMachine and ArbitraryValueMachine.

One important thing to know here is that each machine runs to completion. They all implement a Machine trait that has a next(cursor) method and returns a MachineState.

The MachineState looks like this:

enum MachineState {
  Idle,
  Done(Span)
}

Where a Span is just the location in the input where the candidate was found.

struct Span {
  pub start: usize,
  pub end: usize,
}
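
To make the shape of this a bit more concrete, here is a minimal sketch of how the trait and these types could fit together. The `Cursor` below is a heavily simplified stand-in for the real one, and `DigitsMachine` is a made-up toy machine that does not exist in this PR. The span end is treated as inclusive, which matches the positions used in the examples in this description.

```rust
/// Location in the input where something was found (end is inclusive).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MachineState {
    Idle,
    Done(Span),
}

/// Simplified stand-in for the real cursor (assumption for this sketch).
pub struct Cursor<'a> {
    pub input: &'a [u8],
    pub pos: usize,
}

pub trait Machine {
    /// Runs to completion, starting at `cursor.pos`.
    fn next(&mut self, cursor: &mut Cursor<'_>) -> MachineState;
}

/// Toy machine (not part of the PR): matches a run of ASCII digits.
pub struct DigitsMachine;

impl Machine for DigitsMachine {
    fn next(&mut self, cursor: &mut Cursor<'_>) -> MachineState {
        let start = cursor.pos;
        while cursor.pos < cursor.input.len() && cursor.input[cursor.pos].is_ascii_digit() {
            cursor.pos += 1;
        }

        if cursor.pos == start {
            // Nothing matched at this position.
            MachineState::Idle
        } else {
            MachineState::Done(Span { start, end: cursor.pos - 1 })
        }
    }
}
```

The point is that a single `next` call either yields a full span or nothing; there is no in-between state to track from the outside.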

Complexities

Boundary characters:

When running these machines to completion, they typically don't check for boundary characters. The wrapping CandidateMachine checks for boundary characters instead.

A boundary character is where we know that even though the character is touching the candidate it will not be part of the candidate.

<div class="flex"></div>
<!--       ^    ^    -->

The quotes are touching the candidate flex, but they will not be part of the candidate itself, so this is considered a valid candidate.
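
As an illustration, a boundary check around an extracted span could look roughly like this. The set of boundary characters below (whitespace, quotes, and the start or end of the input) is an assumption made for this sketch; the real list is longer and lives in the extractor. `is_boundary` and `has_valid_boundaries` are hypothetical names.

```rust
/// Assumption for this sketch: whitespace, quotes and the edges of the
/// input count as boundaries. The real extractor has its own list.
fn is_boundary(byte: Option<u8>) -> bool {
    match byte {
        // Start or end of the input.
        None => true,
        Some(b) => b.is_ascii_whitespace() || matches!(b, b'"' | b'\'' | b'`'),
    }
}

/// Checks the bytes directly before `start` and directly after `end`
/// (both positions are inclusive, like `Span`).
fn has_valid_boundaries(input: &[u8], start: usize, end: usize) -> bool {
    let before = start.checked_sub(1).and_then(|i| input.get(i).copied());
    let after = input.get(end + 1).copied();
    is_boundary(before) && is_boundary(after)
}
```

With this, flex in `<div class="flex"></div>` passes, while `!filter` in `if (!filter) return true` does not because it is preceded by `(`.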

What to pick?

Let's imagine you are parsing this input:

<div class="hover:flex"></div>

The UtilityMachine will find hover and flex. The VariantMachine will find hover:. This means that at a certain point in the CandidateMachine you will see something like this:

let variant_machine_state = variant_machine.next(cursor);
//  MachineState::Done(Span { start: 12, end: 17 })        // `hover:`

let utility_machine_state = utility_machine.next(cursor);
//  MachineState::Done(Span { start: 12, end: 16 })        // `hover`

They are both done, but which one do we pick? In this scenario we will always pick the variant because its range will always be 1 character longer than the utility.

Of course there is an exception to this rule and it has to do with the fact that Tailwind CSS can be used in different languages and frameworks. A lot of people use clsx for dynamically applying classes to their React components. E.g.:

<div
  class={clsx({
    underline: someCondition(),
  })}
></div>

In this scenario, we will see underline: as a variant, and underline as a utility. We will pick the utility here because the next character is whitespace, so the variant can never be part of a valid candidate (variants and utilities must be touching). Picking the utility is also only valid because there wasn't a variant present prior to this candidate.

E.g.:

<div
  class={clsx({
    hover:underline: someCondition(),
  })}
></div>

This will be considered invalid. If you do want this, you should use quotes.

E.g.:

<div
  class={clsx({
    'hover:underline': someCondition(),
  })}
></div>
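
The decision described in this section can be sketched as a small function. This only covers the case where both machines are done at the same position; `Pick` and `pick` are hypothetical names and the real CandidateMachine handles more cases than this.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

#[derive(Debug, PartialEq)]
enum Pick {
    Variant(Span),
    Utility(Span),
    Invalid,
}

/// Both machines completed at the same start position. `seen_variant`
/// tells us whether a variant was already extracted for this candidate.
fn pick(input: &[u8], variant: Span, utility: Span, seen_variant: bool) -> Pick {
    // A variant must be directly followed by more of the candidate.
    let touching = input
        .get(variant.end + 1)
        .is_some_and(|b| !b.is_ascii_whitespace());

    if touching {
        // `hover:flex`: the variant wins, it is 1 character longer.
        Pick::Variant(variant)
    } else if !seen_variant {
        // `underline: someCondition()`: treat it as a plain utility.
        Pick::Utility(utility)
    } else {
        // `hover:underline: someCondition()`: needs quotes to be valid.
        Pick::Invalid
    }
}
```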

Overlapping/covered spans:

Another complexity is that the extracted spans for candidates can and will overlap. Let's take a look at this C# example:

public enum StackSpacing
{
  [CssClass("gap-y-4")]
  Small,

  [CssClass("gap-y-6")]
  Medium,

  [CssClass("gap-y-8")]
  Large
}

In this scenario, [CssClass("gap-y-4")] starts with a [ so we have a few options here:

It is an arbitrary property, e.g.: [color:red]
It is an arbitrary variant, e.g.: [@media(pointer:fine)]:

When running the parsers, both the VariantMachine and the UtilityMachine will run to completion but end up in a MachineState::Idle state.

This is because it is not a valid variant because it didn't end with a :.
It's also not a valid arbitrary property, because it didn't include a : to separate the property from the value.

Looking at the code as a human it's very clear what this is supposed to be, but not from the individual machines' perspective.

Obviously we want to extract the gap-y-* classes here.

To solve this problem, we will run over an additional slice of the input, starting at the position before the machines started parsing until the position where the machines stopped parsing.

That slice will be this one: [CssClass("gap-y-6")] (we already skipped over the whitespace). Now, for every [ character we see, we will start a new CandidateMachine right after the ['s position and run the machines over that slice. This will now eventually extract the gap-y-6 class.

The next question is, what if there was a : (e.g.: [CssClass("gap-y-6")]:)? Then the VariantMachine would complete, but the UtilityMachine would not because no utility exists after it. We will apply the same idea in this case.

Another issue is if we do have actual overlapping ranges. E.g.: let classes = ['[color:red]'];. This will extract both the [color:red] and color:red classes. You have to use your imagination, but the last one has the exact same structure as hover:flex (variant + utility).

In this case we will make sure to drop spans that are covered by other spans.

The extracted Spans will be valid candidates. Therefore, if the outermost candidate is valid, we can throw away the inner candidate.

Position:                       11112222222
                                67890123456
                                ↓↓↓↓↓↓↓↓↓↓↓

Span { start: 17, end: 25 }  //  color:red
Span { start: 16, end: 26 }  // [color:red]
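
One way to drop covered spans is to sort them and keep track of the furthest end seen so far. This is a sketch of the idea, not necessarily how the extractor does it; `drop_covered_spans` is a hypothetical name.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// Removes every span that is fully covered by another span.
fn drop_covered_spans(mut spans: Vec<Span>) -> Vec<Span> {
    // Sort by start (ascending), then by end (descending), so that an
    // outer span always comes before the spans it covers.
    spans.sort_by(|a, b| a.start.cmp(&b.start).then(b.end.cmp(&a.end)));

    let mut kept = Vec::new();
    let mut max_end: Option<usize> = None;

    for span in spans {
        // Every kept span starts at or before this one, so this span is
        // covered as soon as it does not extend past the furthest end.
        if max_end.is_some_and(|end| span.end <= end) {
            continue;
        }
        max_end = Some(span.end);
        kept.push(span);
    }

    kept
}
```

For the example above, only `Span { start: 16, end: 26 }` survives. Spans that merely overlap without one covering the other are both kept.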

Exceptions

JavaScript keys as candidates:

We already talked about the clsx scenario, but there are a few more exceptions and that has to do with different syntaxes.

CSS class shorthand in certain templating languages:

In Pug and Slim, you can have a syntax like this:

.flex.underline
  div Hello World

Generated HTML

<div class="flex underline">
  <div>Hello World</div>
</div>

We have to make sure that in these scenarios the . is a valid boundary character. For this, we introduce a pre-processing step to massage the input a little bit to improve the extraction of the data. We have to make sure we don't make the input smaller or longer otherwise the positions might be off.

In this scenario, we could simply replace the . with a space. But of course, there are scenarios in these languages where it's not safe to do that.

If you want to use px-2.5 with this syntax, then you'd write:

.flex.px-2.5
  div Hello World

But that's invalid because that technically means flex, px-2, and 5 as classes.

You can use this syntax to get around that:

div(class="px-2.5")
  div Hello World

Generated HTML

<div class="px-2.5">
  <div>Hello World</div>
</div>

Which means that we can't simply replace . with a space, but have to parse the input. Luckily we only care about strings (and we have a StringMachine for that) and ignore replacing . inside of strings.
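
A simplified sketch of such a pre-processing step is shown below. It tracks strings by hand instead of using the StringMachine, and `preprocess_dots` is a hypothetical name, so treat it as an illustration of the idea only. The important property is that the output has exactly the same length as the input.

```rust
/// Replaces `.` with a space everywhere except inside strings.
/// The output always has the same length as the input, so positions
/// of extracted candidates stay correct.
fn preprocess_dots(input: &[u8]) -> Vec<u8> {
    let mut output = input.to_vec();
    // The quote character of the string we are currently in, if any.
    let mut quote: Option<u8> = None;
    let mut i = 0;

    while i < input.len() {
        let byte = input[i];
        match quote {
            Some(q) => {
                if byte == b'\\' {
                    // Skip the escaped character.
                    i += 1;
                } else if byte == q {
                    quote = None;
                }
            }
            None => match byte {
                b'"' | b'\'' | b'`' => quote = Some(byte),
                b'.' => output[i] = b' ',
                _ => {}
            },
        }
        i += 1;
    }

    output
}
```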

Ruby's weird string syntax:

%w[flex underline]

This is valid syntax and is shorthand for:

["flex", "underline"]

Luckily this problem is solved by running the sub-machines after each [ character.

Performance

Testing:

Each machine has a test_…_performance test (that is ignored by default) that allows you to test the throughput of that machine. If you want to run them, you can use the following command:

cargo test test_variant_machine_performance --release -- --ignored

This will run the test in release mode and allows you to run the ignored test.

Caution

This test will fail, but it will print some output. E.g.:

tailwindcss_oxide::extractor::variant_machine::VariantMachine: Throughput: 737.75 MB/s over 0.02s
tailwindcss_oxide::extractor::variant_machine::VariantMachine:   Duration: 500ns

Readability:

One thing to note when looking at the code is that it's not always written in the cleanest way but we had to make some sacrifices for performance reasons.

The input is of type &[u8], so we are already dealing with bytes. Luckily, Rust has some nice ergonomics to easily write b'[' instead of 0x5b.

A concrete example where we had to sacrifice readability is the state machines where we check the previous, current and next character to make decisions. For a named utility one of the rules is that a . must be preceded by and followed by a digit. This can be written as:

match (cursor.prev, cursor.curr, cursor.next) {
  (b'0'..=b'9', b'.', b'0'..=b'9') => { /* … */ }
  _ => { /* … */ }
}

But this is not very fast because Rust can't optimize the match statement very well, especially because we are dealing with tuples containing 3 values and each value is a u8.

To solve this we use some nesting, once we reach b'.' only then will we check for the previous and next characters. We will also early return in most places. If the previous character is not a digit, there is no need to check the next character.
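
The nested version looks roughly like this. The `Cursor` here is a simplified stand-in with only the three bytes we care about, and `is_valid_dot` is a hypothetical name; the real code is part of a larger state machine.

```rust
/// Simplified stand-in for the real cursor (assumption for this sketch).
struct Cursor {
    prev: u8,
    curr: u8,
    next: u8,
}

/// A `.` is only valid when it is surrounded by digits, e.g. `px-2.5`.
#[inline(always)]
fn is_valid_dot(cursor: &Cursor) -> bool {
    match cursor.curr {
        // Only look at the neighbours once we know we are on a `.`.
        b'.' => {
            // Early return: no need to check `next` if `prev` is wrong.
            if !cursor.prev.is_ascii_digit() {
                return false;
            }
            cursor.next.is_ascii_digit()
        }
        _ => false,
    }
}
```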

Classification and jump tables:

Another optimization we did is to classify the characters into a much smaller enum such that Rust can optimize all match arms and create some jump tables behind the scenes.

E.g.:

#[derive(Debug, Clone, Copy, PartialEq)]
enum Class {
    /// ', ", or `
    Quote,

    /// \
    Escape,

    /// Whitespace characters
    Whitespace,

    Other,
}

const CLASS_TABLE: [Class; 256] = {
    let mut table = [Class::Other; 256];

    macro_rules! set {
        ($class:expr, $($byte:expr),+ $(,)?) => {
            $(table[$byte as usize] = $class;)+
        };
    }

    set!(Class::Quote, b'"', b'\'', b'`');
    set!(Class::Escape, b'\\');
    set!(Class::Whitespace, b' ', b'\t', b'\n', b'\r', b'\x0C');

    table
};

There are only 4 values in this enum, so Rust can optimize this very well. The CLASS_TABLE is generated at compile time and must be exactly 256 elements long to fit all u8 values.
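
To show how such a table is used, here is a self-contained sketch of a string scanner that matches on the class of each byte instead of on the byte itself. The table is written out without the macro to keep the example short. `string_end` is a hypothetical name, and the rules it applies (unescaped whitespace makes the string invalid, an escape skips the next byte) are assumptions for this sketch rather than a copy of the StringMachine.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Class {
    Quote,
    Escape,
    Whitespace,
    Other,
}

const CLASS_TABLE: [Class; 256] = {
    let mut table = [Class::Other; 256];
    table[b'"' as usize] = Class::Quote;
    table[b'\'' as usize] = Class::Quote;
    table[b'`' as usize] = Class::Quote;
    table[b'\\' as usize] = Class::Escape;
    table[b' ' as usize] = Class::Whitespace;
    table[b'\t' as usize] = Class::Whitespace;
    table[b'\n' as usize] = Class::Whitespace;
    table[b'\r' as usize] = Class::Whitespace;
    table[b'\x0C' as usize] = Class::Whitespace;
    table
};

/// Returns the inclusive position of the closing quote for the string
/// that starts at `start`, or `None` if there is no valid string there.
fn string_end(input: &[u8], start: usize) -> Option<usize> {
    let quote = *input.get(start)?;
    if CLASS_TABLE[quote as usize] != Class::Quote {
        return None;
    }

    let mut pos = start + 1;
    while pos < input.len() {
        // A single table lookup, then a match on a 4-variant enum.
        match CLASS_TABLE[input[pos] as usize] {
            // Skip the escaped byte (assumption for this sketch).
            Class::Escape => pos += 2,
            // Candidates cannot contain whitespace (assumption).
            Class::Whitespace => return None,
            // Only the same kind of quote closes the string.
            Class::Quote if input[pos] == quote => return Some(pos),
            _ => pos += 1,
        }
    }

    None
}
```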

Inlining:

Last but not least, sometimes we use functions to abstract some logic. Luckily Rust will optimize and inline most of the functions automatically. In some scenarios, explicitly adding a #[inline(always)] improves performance, sometimes it doesn't improve it at all.

You might notice that in some functions the annotation is added and in some it's not. Every state machine was tested on its own and whenever the performance was better with the annotation, it was added.

Test Plan

Each machine has a dedicated set of tests to try and extract the relevant part for that machine. Most machines don't even check boundary characters or try to extract nested candidates. So keep that in mind when adding new tests. Extracting inside of nested […] is only handled by the outer most extractor/mod.rs.
The main extractor/mod.rs has dedicated tests for recent bug reports related to missing candidates.
You can test each machine's performance if you want to.

There is a chance that this new parser is missing candidates even though a lot of tests are added and existing tests have been ported.

To double check, we ran the new extractor on our own projects to make sure we didn't miss anything obvious.

Tailwind UI

On Tailwind UI the diff looks like this:


diff --git a/./main.css b/./pr.css
index d83b0a506..b3dd94a1d 100644
--- a/./main.css
+++ b/./pr.css
@@ -5576,9 +5576,6 @@ @layer utilities {
     --tw-saturate: saturate(0%);
     filter: var(--tw-blur,) var(--tw-brightness,) var(--tw-contrast,) var(--tw-grayscale,) var(--tw-hue-rotate,) var(--tw-invert,) var(--tw-saturate,) var(--tw-sepia,) var(--tw-drop-shadow,);
   }
-  .\!filter {
-    filter: var(--tw-blur,) var(--tw-brightness,) var(--tw-contrast,) var(--tw-grayscale,) var(--tw-hue-rotate,) var(--tw-invert,) var(--tw-saturate,) var(--tw-sepia,) var(--tw-drop-shadow,) !important;
-  }
   .filter {
     filter: var(--tw-blur,) var(--tw-brightness,) var(--tw-contrast,) var(--tw-grayscale,) var(--tw-hue-rotate,) var(--tw-invert,) var(--tw-saturate,) var(--tw-sepia,) var(--tw-drop-shadow,);
   }

The reason !filter is gone, is because it was used like this:

getProducts.js
23:          if (!filter) return true

And right now ( and ) are not considered valid boundary characters for a candidate.

Catalyst

On Catalyst, the diff looks like this:


diff --git a/./main.css b/./pr.css
index 9f8ed129..4aec992e 100644
--- a/./main.css
+++ b/./pr.css
@@ -2105,9 +2105,6 @@
   .outline-transparent {
     outline-color: transparent;
   }
-  .filter {
-    filter: var(--tw-blur,) var(--tw-brightness,) var(--tw-contrast,) var(--tw-grayscale,) var(--tw-hue-rotate,) var(--tw-invert,) var(--tw-saturate,) var(--tw-sepia,) var(--tw-drop-shadow,);
-  }
   .backdrop-blur-\[6px\] {
     --tw-backdrop-blur: blur(6px);
     -webkit-backdrop-filter: var(--tw-backdrop-blur,) var(--tw-backdrop-brightness,) var(--tw-backdrop-contrast,) var(--tw-backdrop-grayscale,) var(--tw-backdrop-hue-rotate,) var(--tw-backdrop-invert,) var(--tw-backdrop-opacity,) var(--tw-backdrop-saturate,) var(--tw-backdrop-sepia,);
@@ -7141,46 +7138,6 @@
   inherits: false;
   initial-value: solid;
 }
-@property --tw-blur {
-  syntax: "*";
-  inherits: false;
-}
-@property --tw-brightness {
-  syntax: "*";
-  inherits: false;
-}
-@property --tw-contrast {
-  syntax: "*";
-  inherits: false;
-}
-@property --tw-grayscale {
-  syntax: "*";
-  inherits: false;
-}
-@property --tw-hue-rotate {
-  syntax: "*";
-  inherits: false;
-}
-@property --tw-invert {
-  syntax: "*";
-  inherits: false;
-}
-@property --tw-opacity {
-  syntax: "*";
-  inherits: false;
-}
-@property --tw-saturate {
-  syntax: "*";
-  inherits: false;
-}
-@property --tw-sepia {
-  syntax: "*";
-  inherits: false;
-}
-@property --tw-drop-shadow {
-  syntax: "*";
-  inherits: false;
-}
 @property --tw-backdrop-blur {
   syntax: "*";
   inherits: false;

The reason for this is that filter was only used as a function call:

src/app/docs/Code.tsx
31:    .filter((x) => x !== null)

This was tested on all templates and they all remove a very small number of classes that aren't used.

The script to test this looks like this:

bun --bun ~/github.com/tailwindlabs/tailwindcss/packages/@tailwindcss-cli/src/index.t -- -i ./src/styles/tailwind.css -o pr.css
bun --bun ~/github.com/tailwindlabs/tailwindcss--main/packages/@tailwindcss-cli/src/index.t -- -i ./src/styles/tailwind.css -o main.css

git diff --no-index --patch ./{main,pr}.css

This is using git worktrees, so the pr branch lives in a tailwindcss folder, and the main branch lives in a tailwindcss--main folder.

Fixes:

Fixes: Valid Angular class binding not recognized #15616
Fixes: Some classes not being detected in source files with square brackets #16750
Fixes: [v4] Slim template regression - 2xl variant classes not extracted #16790
Fixes: Tailwindcss classes not generated when used with angular class binding. #16801
Fixes: .eps files are included in source file detection #16880 (due to validating the arbitrary property)

Ideas for in the future

Right now each machine takes in a Cursor object. One potential improvement we can make is to rely on the input on its own instead of going via the wrapping Cursor object.
If you take a look at the AST, you'll notice that utilities and variants have a "root", these are basically prefixes of each available utility and/or variant. We can use this information to filter out candidates and bail out early if we know that a certain candidate will never produce a valid class.
Passthrough the prefix information. Everything that doesn't start with tw: can be skipped.

Design decisions that didn't make it

Once you reach this part, you can stop reading if you want to, but this is more like a brain dump of the things we tried and didn't work out. Wanted to include them as a reference in case we want to look back at this issue and know why certain things are implemented the way they are.

One character at a time

In an earlier implementation, the state machines were pure state machines where the next() function was called on every single character of the input. This had a lot of overhead because for every character we had to:

Ask the CandidateMachine which state it was in.
Check the cursor.curr (and potentially the cursor.prev and cursor.next) character.
If we were in a state where a nested state machine was running, we had to check its current state as well and so on.
Once we did all of that we could go to the next character.

In this approach, the MachineState looked like this instead:

enum MachineState {
  Idle,
  Parsing,
  Done(Span)
}

This had its own set of problems because now it's very hard to know whether we are done or not.

<div class="hover:flex"></div>
<!--            ^          -->

Let's look at the current position in the example above. At this point, it's both a valid variant and valid utility, so there was a lot of additional state we had to track to know whether we were done or not.

`Span` stitching

Another approach we tried was to just collect all valid variants and utilities and throw them in a big Vec<Span>. This reduced the amount of additional state to track and we could track a span the moment we saw a MachineState::Done(span).

The next thing we had to do was to make sure that:

Covered spans were removed. We still do this part in the current implementation.
Combine all touching variant spans (where span_a.end + 1 == span_b.start).
For every combined variant span, find a corresponding utility span.
- If there is no utility span, the candidate is invalid.
- If there are multiple candidate spans (this is in theory not possible because we dropped covered spans)
- If there is a candidate but it is attached to another set of spans, then the candidate is invalid. E.g.: flex!block
All left-over utility spans are candidates without variants.
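
For reference, the "combine all touching variant spans" step from this abandoned approach could be sketched like this. `combine_touching` is a hypothetical name and the input is assumed to be sorted by start position.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// Merges spans where `span_a.end + 1 == span_b.start`.
/// Assumes `spans` is sorted by `start`.
fn combine_touching(spans: &[Span]) -> Vec<Span> {
    let mut combined: Vec<Span> = Vec::new();

    for &span in spans {
        if let Some(last) = combined.last_mut() {
            if last.end + 1 == span.start {
                // Touching: extend the previous span.
                last.end = span.end;
                continue;
            }
        }
        combined.push(span);
    }

    combined
}
```

For `hover:focus:`, the spans for `hover:` (0 to 5) and `focus:` (6 to 11) become a single span from 0 to 11.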

This approach was slow, and still a bit hard to reason about.

Matching on tuples

Matching against the prev, curr and next characters was very readable and easy to reason about, but it was not very fast. Unfortunately we had to abandon this approach in favor of a more optimized one.

In a perfect world, we would still write it this way, but have some compile time macro that would optimize this for us.

Matching against `b'…'` instead of classification and jump tables

Similar to the previous point, while this is better for readability, it's not fast enough. The jump tables are much faster.

Luckily for us, each machine has its own set of rules and context, so it's much easier to reason about a single problem and optimize a single machine.

A candidate is what a potential Tailwind CSS class could be. It's a candidate because at this stage we don't know if it will actually produce something but it looks like it could be a valid class. E.g.: hover:bg-red-500 is a candidate, but it will only produce something if --color-red-500 is defined in your theme. ↩

This PR bumps the Prettier dependencies, and also pins the version. Noticed that a PR with a single empty commit started failing at the time of writing this (#16306). This is because prettier released a new minor version which results in slightly different output. Let's bump prettier and handle the differences, but also pin the version to avoid this in the future.

Co-authored-by: Jordan Pittman

~550 lines of code to mimic a real-world HTML file. Used for benchmarks.

Before this, the structure looked like:

```rs
struct ChangedContent {
    file: Option<PathBuf>,
    content: Option<String>
}
```

There are 2 problems with this:

1. It should be either a file or content, but not both and definitely not none. This structure doesn't model that very well. But this structure is needed to allow us to pass in a JS object with this information.
2. This is missing the extension information which is required to do some preprocessing.

The public ChangedContent is still the "wrong" implementation, but we translate it to a well-formed ChangedContent enum instead:

```rs
enum ChangedContent {
    File(path, extension),
    Content(contents, extension),
}
```
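
A minimal sketch of what such a translation could look like. The names `RawChangedContent` and `normalize` are hypothetical, the `extension` field on the raw struct is an assumption, and plain `String`s are used instead of the `Cow<'a, str>` from the actual diff to keep the example small. How the real code treats the "both" and "none" cases is not shown in this PR description; here they are simply rejected.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for the public, JS-friendly shape.
struct RawChangedContent {
    file: Option<PathBuf>,
    content: Option<String>,
    extension: String,
}

/// Well-formed version: either a file or content, always with an extension.
#[derive(Debug, PartialEq)]
enum ChangedContent {
    File(PathBuf, String),
    Content(String, String),
}

fn normalize(raw: RawChangedContent) -> Option<ChangedContent> {
    match (raw.file, raw.content) {
        (Some(path), None) => {
            // Assumption: for files, derive the extension from the path.
            let extension = path
                .extension()
                .and_then(|ext| ext.to_str())
                .unwrap_or_default()
                .to_string();
            Some(ChangedContent::File(path, extension))
        }
        (None, Some(contents)) => Some(ChangedContent::Content(contents, raw.extension)),
        // Both or none: not representable in the well-formed enum.
        _ => None,
    }
}
```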

+ setup a `src/main.rs` file for benchmarks

The reason we get `class` is because the pre-process step for Svelte files will replace `class:` with `class `, this means that the input looks like:

```diff
- <div class:px-4='condition'></div>
+ <div class px-4='condition'></div>
```

The reason we _don't_ get the `div` anymore, is because it's preceded by an invalid boundary character (`<`) and therefore we skip ahead to the next valid boundary character even though `div` on its own is a perfectly valid candidate.
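
The replacement itself can be sketched as follows. `preprocess_svelte` is a hypothetical name and the real pre-processor may do more than this. The key detail is that only the `:` is overwritten, so the length of the input does not change.

```rust
/// Replaces every `class:` with `class ` (same length as the input).
fn preprocess_svelte(input: &[u8]) -> Vec<u8> {
    let needle: &[u8] = b"class:";
    let mut output = input.to_vec();
    let mut i = 0;

    while i + needle.len() <= input.len() {
        if &input[i..i + needle.len()] == needle {
            // Overwrite only the `:` so positions stay the same.
            output[i + needle.len() - 1] = b' ';
            i += needle.len();
        } else {
            i += 1;
        }
    }

    output
}
```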

crates/oxide/src/lib.rs

RobinMalfait · 2025-02-26T15:24:49Z

crates/oxide/src/lib.rs

+            if blob.is_empty() {
+                return None;
+            }


This happens quite a lot which allows us to not create an Extractor at all.

RobinMalfait · 2025-02-26T15:27:20Z

crates/oxide/src/main.rs

+    let throughput = Throughput::compute(iterations, input.len(), || {
+        _ = black_box(
+            input
+                .split(|x| *x == b'\n')


This mimics how we do the real parsing on a line-by-line basis, but without the parallelism from Rayon. Including rayon here makes it much harder to reason about when you look at Instruments.

RobinMalfait · 2025-02-26T15:28:36Z

crates/oxide/src/parser.rs

@@ -1,1757 +0,0 @@
-use crate::{cursor::Cursor, fast_skip::fast_skip};


I dropped this, but maybe we can keep it around somewhere for additional benchmarks?

One thing we could consider is making the new parser opt-in for a bit or keeping the old around for fast opt-out, so we get some confidence it's not missing something critical

crates/oxide/src/extractor/mod.rs

rust-toolchain.toml

philipp-spiess

This is a partial review, going to finish this later. Really love reviewing the individual machines so far. One high level thought I have is about the MachineState type as it's not really holding any state information right now anymore (as we moved most of the state onto the stack into function closures which I really like actually). Do with that information what you want so far, I need to better formalize my thoughts by tomorrow :P

Memo to myself: continue here 38748b4

philipp-spiess · 2025-02-26T16:26:43Z

crates/oxide/src/lib.rs

-    pub content: Option<String>,
+pub enum ChangedContent<'a> {
+    File(PathBuf, Cow<'a, str>),
+    Content(String, Cow<'a, str>),


Does the second part here (which I think is the extension?) really have to be writeable? I wonder if a pointer to a string is enough here but could be I severely misunderstand stuff tbh so take with a grain of salt

crates/oxide/src/extractor/string_machine.rs

philipp-spiess · 2025-02-26T16:51:01Z

crates/oxide/src/extractor/string_machine.rs

+    macro_rules! set {
+        ($class:expr, $($byte:expr),+ $(,)?) => {
+            $(table[$byte as usize] = $class;)+
+        };
+    }
+
+    set!(Class::Quote, b'"', b'\'', b'`');
+    set!(Class::Escape, b'\\');
+    set!(Class::Whitespace, b' ', b'\t', b'\n', b'\r', b'\x0C');


table[b'"' as usize] = Class::Quote;
table[b'\'' as usize] = Class::Quote;
table[b'`' as usize] = Class::Quote;
table[b'\\' as usize] = Class::Escape;
table[b' ' as usize] = Class::Whitespace;
table[b'\t' as usize] = Class::Whitespace;
table[b'\n' as usize] = Class::Whitespace;
table[b'\r' as usize] = Class::Whitespace;
table[b'\x0C' as usize] = Class::Whitespace;

same number of lines btw 😄

Haha yep, I just copied it over and over again. In some situations it makes more sense especially when using ranges. E.g.: set_range!(Class::Alpha, b'a'..=b'z').

Improved the DX here in a separate PR: #16864
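To make the trade-off in this thread concrete, here is a minimal sketch of a byte classification table built with both macros. The `Class` variants `Alpha` and `Other`, the `build_table` function, and the body of `set_range!` are assumptions for illustration; only `set!` and the three `set!` calls come from the diff above:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Class {
    Quote,
    Escape,
    Whitespace,
    Alpha, // assumed variant, taken from the `set_range!` example
    Other, // assumed default for every byte that is not classified
}

// Build a 256-entry lookup table: one classification per possible byte value.
fn build_table() -> [Class; 256] {
    let mut table = [Class::Other; 256];

    // Same macro as in the diff: assign one class to a list of bytes.
    macro_rules! set {
        ($class:expr, $($byte:expr),+ $(,)?) => {
            $(table[$byte as usize] = $class;)+
        };
    }

    // Assumed shape of `set_range!`: assign one class to every byte in a range.
    macro_rules! set_range {
        ($class:expr, $range:expr) => {
            for byte in $range {
                table[byte as usize] = $class;
            }
        };
    }

    set!(Class::Quote, b'"', b'\'', b'`');
    set!(Class::Escape, b'\\');
    set!(Class::Whitespace, b' ', b'\t', b'\n', b'\r', b'\x0C');
    set_range!(Class::Alpha, b'a'..=b'z');

    table
}

fn main() {
    let table = build_table();
    // Classifying a byte is a single indexed load, with no branching per character.
    println!("{:?}", table[b'`' as usize]);
    println!("{:?}", table[b'q' as usize]);
}
```

For three quotes the explicit assignments are just as short, but the range form replaces 26 separate lines.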

crates/oxide/src/extractor/css_variable_machine.rs

crates/oxide/src/extractor/arbitrary_value_machine.rs

This is a faster implementation compared to `advance_by(2)`. It's a bit of an unsafe function similar to how `advance()` is unsafe because `cursor.pos` could be larger than the actual input length so use this in places where you are absolutely sure.
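A rough sketch of the trade-off described above. `SketchCursor` is a hypothetical stand-in, not the cursor from `crates/oxide/src/cursor.rs`, and the idea that `advance_by` clamps while `advance_twice` skips the bounds check is an assumption made for illustration:

```rust
// Hypothetical stand-in for the real cursor; field names are assumptions.
struct SketchCursor<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> SketchCursor<'a> {
    // General-purpose: clamps so `pos` never moves past the end of the input.
    fn advance_by(&mut self, amount: usize) {
        self.pos = (self.pos + amount).min(self.input.len());
    }

    // Fast path: no clamp. `pos` can end up larger than `input.len()` unless
    // the caller already knows that two more bytes exist.
    fn advance_twice(&mut self) {
        self.pos += 2;
    }

    // Current byte, or 0x00 when `pos` is out of range.
    fn current(&self) -> u8 {
        self.input.get(self.pos).copied().unwrap_or(0)
    }
}

fn main() {
    // Input bytes: backslash, double quote, `x`.
    let mut cursor = SketchCursor { input: br#"\"x"#, pos: 0 };

    // The escape and the escaped byte are both known to exist, so skipping
    // two positions stays in bounds.
    if cursor.current() == b'\\' && cursor.pos + 1 < cursor.input.len() {
        cursor.advance_twice();
    }

    println!("{}", cursor.current() as char);
}
```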

This reduces the state necessary and can bail early when we don't see any `[`. Increases performance as well:

```diff
- ArbitraryValueMachine: Throughput: 654.80 MB/s
+ ArbitraryValueMachine: Throughput: 718.51 MB/s
```

philipp-spiess

Awesome stuff. One thing to consider is if we want to ship this as-is (which might include some bugs we need to fix up fast afterwards) or if we want to either have an opt-in or opt-out based approach. I'm curious what your thoughts are about this?

philipp-spiess · 2025-02-27T10:56:03Z

crates/oxide/src/extractor/arbitrary_variable_machine.rs

+            // Exceptions:
+            // Arbitrary variable must start with a CSS variable
+            (r"(bar)", vec![]),
+            // Arbitrary variables must be valid CSS variables
+            (r"(--my-\ color)", vec![]),
+            (r"(--my#color)", vec![]),
+            // Fallbacks cannot have spaces
+            (r"(--my-color, red)", vec![]),
+            // Fallbacks cannot have escaped spaces
+            (r"(--my-color,\ red)", vec![]),
+            // Variables must have at least one character after the `--`
+            (r"(--)", vec![]),
+            (r"(--,red)", vec![]),


Add something like (-my-color) here. I think we never require there to be two dashes right now 👍

Added it, but was already covered. Once we see a dash we go straight to the CSS variable machine which requires 2 dashes.
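The exceptions listed in this thread can be summarised as a small standalone check. This is a simplified sketch and not the state machine from the PR: `looks_like_arbitrary_variable` is a hypothetical name, and it only accepts ASCII letters, digits, `-` and `_` in the variable name, whereas the real machines also deal with escapes and emoji:

```rust
// Hypothetical, simplified check covering only the rules from the tests above.
fn looks_like_arbitrary_variable(input: &str) -> bool {
    // Must be wrapped in parentheses.
    let Some(inner) = input.strip_prefix('(').and_then(|s| s.strip_suffix(')')) else {
        return false;
    };

    // Split an optional fallback off at the first comma.
    let (name, fallback) = match inner.split_once(',') {
        Some((name, fallback)) => (name, Some(fallback)),
        None => (inner, None),
    };

    // Two dashes are required, so `-my-color` and `bar` are rejected here.
    let Some(ident) = name.strip_prefix("--") else {
        return false;
    };

    // At least one character after `--`, and no `#`, `\` or spaces.
    let valid_ident = !ident.is_empty()
        && ident
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'_');
    if !valid_ident {
        return false;
    }

    // Fallbacks cannot contain spaces, escaped or not.
    match fallback {
        Some(value) => !value.bytes().any(|b| b.is_ascii_whitespace()),
        None => true,
    }
}

fn main() {
    for input in ["(--my-color)", "(--my-color,red)", "(-my-color)", "(--)", "(--my-color, red)"] {
        println!("{input}: {}", looks_like_arbitrary_variable(input));
    }
}
```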

crates/oxide/src/extractor/arbitrary_property_machine.rs

crates/oxide/src/extractor/modifier_machine.rs

crates/oxide/src/extractor/named_variant_machine.rs

crates/oxide/src/parser.rs

crates/oxide/src/lib.rs

crates/oxide/src/extractor/css_variable_machine.rs

crates/oxide/src/cursor.rs

We dropped some boundary characters such as `[` and `{` because these were only necessary for certain languages and frameworks such as Ruby and Svelte. However, we will now pre-process those.

In a perfect world, we could handle the Angular syntax as a preprocessing step as well but this has 2 issues:

1. Angular can be used in `.html` files
2. The special syntax can be used in JS files in a `@Component` decoration. See: https://angular.dev/guide/components/host-elements#binding-to-the-host-element

This means that we would have to pre-process _all_ JS/TS files just for the Angular case which is unfortunate.

Co-authored-by: Philipp Spiess <[email protected]>

We do not allow utilities to start with an uppercase letter. While we accept negative utilities, the next characters should also not accept any uppercase letters. So `Foo` is invalid, therefore `-Foo` should also be invalid.

Co-authored-by: Philipp Spiess <[email protected]>
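As a rough illustration of the rule in this commit message: `may_start_utility` is a hypothetical helper that checks only the uppercase rule, and treating an empty candidate or a lone `-` as invalid is an extra assumption:

```rust
// Hypothetical helper covering only the uppercase rule described above.
fn may_start_utility(candidate: &[u8]) -> bool {
    // Skip a single leading `-` so `-Foo` is judged by the same rule as `Foo`.
    let rest = candidate.strip_prefix(b"-").unwrap_or(candidate);

    match rest.first() {
        Some(first) => !first.is_ascii_uppercase(),
        // Assumption: nothing left after the optional `-` is not a utility.
        None => false,
    }
}

fn main() {
    for candidate in ["flex", "-mx-4", "Foo", "-Foo"] {
        println!("{candidate}: {}", may_start_utility(candidate.as_bytes()));
    }
}
```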

Co-authored-by: Philipp Spiess <[email protected]>

RobinMalfait force-pushed the feat/only-expose-used-variables branch from 7236df4 to f3439f3 Compare February 7, 2025 15:57

Base automatically changed from feat/only-expose-used-variables to main February 7, 2025 17:12

RobinMalfait force-pushed the feat/improve-oxide-scanner branch from bbb7a29 to 10dee6f Compare February 9, 2025 13:09

RobinMalfait mentioned this pull request Feb 9, 2025

Bump and pin prettier #16382

Merged

RobinMalfait force-pushed the feat/improve-oxide-scanner branch 4 times, most recently from 15c95dd to 0927bb7 Compare February 13, 2025 00:32

RobinMalfait force-pushed the feat/improve-oxide-scanner branch 2 times, most recently from fd88773 to 0ddac4f Compare February 26, 2025 10:24

RobinMalfait changed the title ~~Improve Oxide scanner~~ Improve Oxide candidate extractor Feb 26, 2025

RobinMalfait and others added 9 commits February 26, 2025 16:06

start of Oxide scanner improvements

1bde46d

bump Rust version

d9369ea

bump Rust dependencies

157ef40

apply cargo clippy

85341d7

make move_to less branchy

8208f1d

Co-authored-by: Jordan Pittman <[email protected]>

add advance() shorthand

15aaae6

add throughput helper for benchmarks

529c92c

Co-authored-by: Jordan Pittman <[email protected]>

add example fixture file

6dd19f7

~550 lines of code to mimic a real-world HTML file. Used for benchmarks.

RobinMalfait force-pushed the feat/improve-oxide-scanner branch from 0ddac4f to 6f2ec16 Compare February 26, 2025 15:06

RobinMalfait marked this pull request as ready for review February 26, 2025 15:07

RobinMalfait requested a review from a team as a code owner February 26, 2025 15:07

RobinMalfait added 6 commits February 26, 2025 16:47

add Machine

8deb2ab

add BracketStack

49885ce

add StringMachine

4a78d4c

add CssVariableMachine

d9aa816

add ArbitraryValueMachine

bfe611d

add ArbitraryVariableMachine

38748b4

RobinMalfait added 3 commits February 26, 2025 16:47

wire up new Extractor

bdf2e81

+ setup a `src/main.rs` file for benchmarks

update changelog

cc11fb5

RobinMalfait force-pushed the feat/improve-oxide-scanner branch from 6f2ec16 to cc11fb5 Compare February 26, 2025 15:48

RobinMalfait commented Feb 26, 2025

philipp-spiess reviewed Feb 26, 2025

RobinMalfait added 6 commits February 26, 2025 19:14

add advance_twice method

60e5f67

use cursor.advance_twice()

3ab7a52

simplify ArbitraryValueMachine

6bbf003

support emoji in CSS Variables

21fe019

remove cursor.rewind_by

f083c9d

Merge branch 'main' into feat/improve-oxide-scanner

36905b5

philipp-spiess reviewed Feb 27, 2025

RobinMalfait and others added 9 commits February 27, 2025 17:00

add dedicated pre-processors for certain file extensions

8cf78f0

remove tests when we already have dedicated tests for them

869e557

add more angular binding attribute tests

b00d370

add a few more CSS Variable tests

b0968b6

Co-authored-by: Philipp Spiess <[email protected]>

improve comments and fix typos

3d8c41a

Co-authored-by: Philipp Spiess <[email protected]>

use a set_range!

fb235a2

Co-authored-by: Philipp Spiess <[email protected]>

Merge branch 'main' into feat/improve-oxide-scanner

df51fba

RobinMalfait changed the title ~~Improve Oxide candidate extractor~~ Improve Oxide candidate extractor [0] Feb 28, 2025

RobinMalfait added 2 commits March 2, 2025 11:43

Merge branch 'main' into feat/improve-oxide-scanner

072d8d2

Merge branch 'main' into feat/improve-oxide-scanner

995948d

philipp-spiess approved these changes Mar 5, 2025

RobinMalfait merged commit b3c2556 into main Mar 5, 2025
5 checks passed

RobinMalfait deleted the feat/improve-oxide-scanner branch March 5, 2025 10:55

RobinMalfait mentioned this pull request Mar 5, 2025

[v4] Class within square brackets [] is ignored #16189

Closed

iquito mentioned this pull request Mar 6, 2025

Missing classes because of { boundary and new extractor #16999

Closed
