-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Oxide candidate extractor [0] #16306
Conversation
7236df4
to
f3439f3
Compare
bbb7a29
to
10dee6f
Compare
This PR bumps the Prettier dependencies, and also pins the version. Noticed that a PR with a single empty commit started failing at the time of writing this (#16306). This is because prettier released a new minor version which results in slightly different output. Let's bump prettier and handle the differences, but also pin the version to avoid this in the future.
15c95dd
to
0927bb7
Compare
fd88773
to
0ddac4f
Compare
Co-authored-by: Jordan Pittman <[email protected]>
Co-authored-by: Jordan Pittman <[email protected]>
~550 lines of code to mimic a real-world HTML file. Used for benchmarks.
Before this, the structure looked like: ```rs struct ChangedContent { file: Option<PathBuf>, content: Option<String> } ``` There are 2 problems with this: 1. It should be either a file or content, but not both and definitely not none. This structure doesn't model that very well. But this structure is needed to allow us to pass in a JS object with this information. 2. This is missing the extension information which is required to do some preprocessing. The public ChangedContent is still the "wrong" implementation, but we translate it to a well-formed ChangedContent enum instead: ```rs enum ChangedContent { File(path, extension), Content(contents, extension), } ```
0ddac4f
to
6f2ec16
Compare
+ setup a `src/main.rs` file for benchmarks
The reason we get `class` is because the pre-process step for Svelte files will replace `class:` with `class `, this means that the input looks like: ```diff - <div class:px-4='condition'></div> + <div class px-4='condition'></div> ``` The reason we _don't_ get the `div` anymore, is because it's preceded by an invalid boundary character (`<`) and therefore we skip ahead to the next valid boundary character even though `div` on its own is a perfectly valid candidate.
6f2ec16
to
cc11fb5
Compare
if blob.is_empty() { | ||
return None; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens quite a lot which allows us to not create a Extractor
at all.
let throughput = Throughput::compute(iterations, input.len(), || { | ||
_ = black_box( | ||
input | ||
.split(|x| *x == b'\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mimics how we do the real parsing on a line-by-line basis, but without the parallelism from Rayon. Including rayon here makes it much harder to reason about when you look at Instruments.
@@ -1,1757 +0,0 @@ | |||
use crate::{cursor::Cursor, fast_skip::fast_skip}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped this, but maybe we can keep it around somewhere for additional benchmarks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing we could consider is making the new parser opt-in for a bit or keeping the old around for fast opt-out, so we get some confidence it's not missing something critical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a partial review, going to finish this later. Really love reviewing the individual machines so far. One high level thought I have is about the MachineState
type as it's not really holding any state information right now anymore (as we moved most of the state onto the stack into function closures which I really like actually). Do with that information what you want so far, I need to besser formalize my thoughts by tomorrow :P
Memo to myself: continue here 38748b4
pub content: Option<String>, | ||
pub enum ChangedContent<'a> { | ||
File(PathBuf, Cow<'a, str>), | ||
Content(String, Cow<'a, str>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the second part here (which I think is the extension?) really have to be writeable? I wonder if a pointer to a string is enough here but could be I severely misunderstand stuff tbh so take with a grain of salt
macro_rules! set { | ||
($class:expr, $($byte:expr),+ $(,)?) => { | ||
$(table[$byte as usize] = $class;)+ | ||
}; | ||
} | ||
|
||
set!(Class::Quote, b'"', b'\'', b'`'); | ||
set!(Class::Escape, b'\\'); | ||
set!(Class::Whitespace, b' ', b'\t', b'\n', b'\r', b'\x0C'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
table[b'"' as usize] = Class::Quote;
table[b'\'' as usize] = Class::Quote;
table[b'`' as usize] = Class::Quote;
table[b'\\' as usize] = Class::Escape
table[b' ' as usize] = Class::Whitespace;
table[b'\t' as usize] = Class::Whitespace;
table[b'\n' as usize] = Class::Whitespace;
table[b'\r' as usize] = Class::Whitespace;
table[b'\x0C' as usize] = Class::Whitespace;
same number of lines btw 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha yep, I just copied it over and over again. In some situations it makes more sense especially when using ranges. E.g.: set_range!(Class::Alpha, b'a'..=b'z')
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved the DX here in a separate PR: #16864
This is a faster implementation compared to `advance_by(2)`. It's a bit of an unsafe function similar to how `advance()` is unsafe because `cursor.pos` could be larger than the actual input length so use this in places where you are absolutely sure.
This reduces the state necessary and can bail early when we don't see any `[`. Increases performance as well: ```diff - ArbitraryValueMachine: Throughput: 654.80 MB/s + ArbitraryValueMachine: Throughput: 718.51 MB/s ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome stuff. One thing to consider is if we want to ship this as-is (which might include some bugs we need to fix up fast afterwards) or if we want to either have an opt-in or opt-out based approach. I'm curious what your thoughts are about this?
// Exceptions: | ||
// Arbitrary variable must start with a CSS variable | ||
(r"(bar)", vec![]), | ||
// Arbitrary variables must be valid CSS variables | ||
(r"(--my-\ color)", vec![]), | ||
(r"(--my#color)", vec![]), | ||
// Fallbacks cannot have spaces | ||
(r"(--my-color, red)", vec![]), | ||
// Fallbacks cannot have escaped spaces | ||
(r"(--my-color,\ red)", vec![]), | ||
// Variables must have at least one character after the `--` | ||
(r"(--)", vec![]), | ||
(r"(--,red)", vec![]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add something like (-my-color)
here. I think we never require there to be two dashes right now 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added it, but was already covered. Once we see a dash we go straight to the CSS variable machine which requires 2 dashes.
We dropped some boundary characters such as `[` and `{` because these were only necessary for certain languages and frameworks such as Ruby and Svelte. However, we will now pre-process those. In a perfect world, we could handle the Angular syntax as a preprocessing step as well but this has 2 issues: 1. Angular can be used in `.html` files 2. The special syntax can be used in JS files in a `@Component` decoration. See: https://angular.dev/guide/components/host-elements#binding-to-the-host-element This means that we would have to pre-process _all_ JS/TS files just for the Angular case which is unfortunate.
Co-authored-by: Philipp Spiess <[email protected]>
Co-authored-by: Philipp Spiess <[email protected]>
We do not allow utilities to start with an uppercase letter. While we accept negative utilities, the next characters should also not accept any uppercase letters. So `Foo` is invalid, therefore `-Foo` should also be invalid. Co-authored-by: Philipp Spiess <[email protected]>
Co-authored-by: Philipp Spiess <[email protected]>
This PR adds a new candidate1 extractor with 2 major goals in mind:
Problem
Candidate extraction is a bit of a wild west in Tailwind CSS and it's a very critical step to make sure that all your classes are picked up correctly to ensure that your website/app looks good.
One issue we run into is that Tailwind CSS is used in many different "host" languages and frameworks with their own syntax. It's not only used in HTML but also in JSX/TSX, Vue, Svelte, Angular, Pug, Rust, PHP, Rails, Clojure, .NET, … the list goes on and all of these have different syntaxes. Introducing dedicated parsers for each of these languages would be a huge maintenance burden because there will be new languages and frameworks coming up all the time. The best thing we can do is make assumptions and so far we've done a pretty good job at that.
The only certainty we have is that there is at least some structure to the possible Tailwind classes used in a file. E.g.:
abc#def
is definitely not a valid class,hover:flex
definitely is. In a perfect world we limit the characters that can be used and defined a formal grammar that each candidate must follow, but that's not really an option right now (maybe this is something we can implement in future major versions).The current candidate extractor we have has grown organically over time and required patching things here and there to make it work in various scenarios (and edge cases due to the different languages Tailwind is used in).
While there is definitely some structure, we essentially work in 2 phases:
0..n
candidates. (This is the hard part)Another reason the current extractor is hard to reason about is that we need it to be fast and that comes with some trade-offs to readability and maintainability.
Unfortunately there will always be a lot of false positives, but if we extract more classes than necessary then that's fine. It's only when we pass the candidates to the core engine that we will know for sure if they are valid or not. (we have some ideas to limit the amount of false positives but that's for another time)
Solution
Since the introduction of Tailwind CSS v4, we re-worked the internals quite a bit and we have a dedicated internal AST structure for candidates. For example, if you take a look at this:
This will be parsed into the following AST:
We have a lot of information here and we gave these patterns a name internally. You'll see names like
functional
,static
,arbitrary
,modifier
,variant
,compound
, ...Some of these patterns will be important for the new candidate extractor as well:
flex
bg-red-500
bg
with an input that is namedred-500
bg-[#0088cc]
bg
with an input that is arbitrary, denoted by[…]
bg-(--my-color)
bg
with an input that is arbitrary and has a CSS variable shorthand, denoted by(--…)
[color:red]
A similar structure exist for modifiers, where each modifier must start with
/
:/20
/[20%]
/[…]
/(--my-opacity)
/(…)
Last but not least, we have variants. They have a very similar pattern but they must end in a
:
.hover:
data-[state=pending]:
[…]
supports-(--my-variable):
(…)
[@media(pointer:fine)]:
The goal with the new extractor is to encode these separate patterns in dedicated pieces of code (we called them "machines" because they are mostly state machine based and because I've been watching Person of Interest but I digress).
This will allow us to focus on each pattern separately, so if there is a bug or some new syntax we want to support we can add it to those machines.
One nice benefit of this is that we can encode the rules and handle validation as we go. The moment we know that some pattern is invalid, we can bail out early.
At the time of writing this, there are a bunch of machines:
Overview of the machines
ArbitraryPropertyMachine
Extracts candidates such as
[color:red]
. Some of the rules are::
There cannot be any spaces, the brackets are included, if the property is a CSS variable, it must be a valid CSS variable (uses the
CssVariableMachine
).Depends on the
StringMachine
andCssVariableMachine
.ArbitraryValueMachine
Extracts arbitrary values for utilities and modifiers including the brackets:
Depends on the
StringMachine
.ArbitraryVariableMachine
Extracts arbitrary variables including the parentheses. The first argument must be a valid CSS variable, the other arguments are optional fallback arguments.
Depends on the
StringMachine
andCssVariableMachine
.CandidateMachine
Uses the variant machine and utility machine. It will make sure that 0 or more variants are directly touching and followed by a utility.
Depends on the
VariantMachine
andUtilityMachine
.CssVariableMachine
Extracts CSS variables, they must start with
--
and must contain at least one alphanumeric character or,-
,_
and can contain any escaped character (except for whitespace).ModifierMachine
Extracts modifiers including the
/
/[
will delegate to theArbitraryValueMachine
/(
will delegate to theArbitraryVariableMachine
Depends on the
ArbitraryValueMachine
andArbitraryVariableMachine
.NamedUtilityMachine
Extracts named utilities regardless of whether they are functional or static.
This includes rules like: A
.
must be surrounded by digits.Depends on the
ArbitraryValueMachine
andArbitraryVariableMachine
.NamedVariantMachine
Extracts named variants regardless of whether they are functional or static. This is very similar to the
NamedUtilityMachine
but with different rules. We could combine them, but splitting things up makes it easier to reason about.Another rule is that the
:
must be included.Depends on the
ArbitraryVariableMachine
,ArbitraryValueMachine
, andModifierMachine
.StringMachine
This is a low-level machine that is used by various other machines. The only job this has is to extract strings that start with double quotes, single quotes or backticks.
We have this because once you are in a string, we don't have to make sure that brackets, parens and curlies are properly balanced. We have to make sure that balancing brackets are properly handled in other machines.
UtilityMachine
Extracts utilities, it will use the lower level
NamedUtilityMachine
,ArbitraryPropertyMachine
andModifierMachine
to extract the utility.It will also handle important markers (including the legacy important marker).
Depends on the
ArbitraryPropertyMachine
,NamedUtilityMachine
, andModifierMachine
.VariantMachine
Extracts variants, it will use the lower level
NamedVariantMachine
andArbitraryValueMachine
to extract the variant.Depends on the
NamedVariantMachine
andArbitraryValueMachine
.One important thing to know here is that each machine runs to completion. They all implement a
Machine
trait that has anext(cursor)
method and returns aMachineState
.The
MachineState
looks like this:Where a
Span
is just the location in the input where the candidate was found.Complexities
Boundary characters:
When running these machines to completion, they don't typically check for boundary characters, the wrapping
CandidateMachine
will check for boundary characters.A boundary character is where we know that even though the character is touching the candidate it will not be part of the candidate.
The quotes are touching the candidate
flex
, but they will not be part of the candidate itself, so this is considered a valid candidate.What to pick?
Let's imagine you are parsing this input:
The
UtilityMachine
will findhover
andflex
. TheVariantMachine
will findhover:
. This means that at a certain point in theCandidateMachine
you will see something like this:They are both done, but which one do we pick? In this scenario we will always pick the variant because its range will always be 1 character longer than the utility.
Of course there is an exception to this rule and it has to do with the fact that Tailwind CSS can be used in different languages and frameworks. A lot of people use
clsx
for dynamically applying classes to their React components. E.g.:In this scenario, we will see
underline:
as a variant, andunderline
as a utility. We will pick the utility in this scenario because the next character is whitespace so this will never be a valid candidate otherwise (variants and utilities must be touching). Another reason this is valid, is because there wasn't a variant present prior to this candidate.E.g.:
This will be considered invalid, if you do want this, you should use quotes.
E.g.:
Overlapping/covered spans:
Another complexity is that the extracted spans for candidates can and will overlap. Let's take a look at this C# example:
In this scenario,
[CssClass("gap-y-4")]
starts with a[
so we have a few options here:[color:red]
[@media(pointer:fine)]:
When running the parsers, both the
VariantMachine
and theUtilityMachine
will run to completion but end up in aMachineState::Idle
state.:
.:
to separate the property from the value.Looking at the code as a human it's very clear what this is supposed to be, but not from the individual machines perspective.
Obviously we want to extract the
gap-y-*
classes here.To solve this problem, we will run over an additional slice of the input, starting at the position before the machines started parsing until the position where the machines stopped parsing.
That slice will be this one:
[CssClass("gap-y-6")]
(we already skipped over the whitespace). Now, for every[
character we see, will start a newCandidateMachine
right after the[
's position and run the machines over that slice. This will now eventually extract thegap-y-6
class.The next question is, what if there was a
:
(e.g.:[CssClass("gap-y-6")]:
), then theVariantMachine
would complete, but theUtilityMachine
will not because not exists after it. We will apply the same idea in this case.Another issue is if we do have actual overlapping ranges. E.g.:
let classes = ['[color:red]'];
. This will extract both the[color:red]
andcolor:red
classes. You have to use your imagination, but the last one has the exact same structure ashover:flex
(variant + utility).In this case we will make sure to drop spans that are covered by other spans.
The extracted
Span
s will be valid candidates therefore if the outer most candidate is valid, we can throw away the inner candidate.Exceptions
JavaScript keys as candidates:
We already talked about the
clsx
scenario, but there are a few more exceptions and that has to do with different syntaxes.CSS class shorthand in certain templating languages:
In Pug and Slim, you can have a syntax like this:
.flex.underline div Hello World
Generated HTML
We have to make sure that in these scenarios the
.
is a valid boundary character. For this, we introduce a pre-processing step to massage the input a little bit to improve the extraction of the data. We have to make sure we don't make the input smaller or longer otherwise the positions might be off.In this scenario, we could simply replace the
.
with a space. But of course, there are scenarios in these languages where it's not safe to do that.If you want to use
px-2.5
with this syntax, then you'd write:.flex.px-2.5 div Hello World
But that's invalid because that technically means
flex
,px-2
, and5
as classes.You can use this syntax to get around that:
Generated HTML
Which means that we can't simply replace
.
with a space, but have to parse the input. Luckily we only care about strings (and we have aStringMachine
for that) and ignore replacing.
inside of strings.Ruby's weird string syntax:
This is valid syntax and is shorthand for:
Luckily this problem is solved by the running the sub-machines after each
[
character.Performance
Testing:
Each machine has a
test_…_performance
test (that is ignored by default) that allows you to test the throughput of that machine. If you want to run them, you can use the following command:cargo test test_variant_machine_performance --release -- --ignored
This will run the test in release mode and allows you to run the ignored test.
Caution
This test will fail, but it will print some output. E.g.:
Readability:
One thing to note when looking at the code is that it's not always written in the cleanest way but we had to make some sacrifices for performance reasons.
The
input
is of type&[u8]
, so we are already dealing with bytes. Luckily, Rust has some nice ergonomics to easily writeb'['
instead of0x5b
.A concrete example where we had to sacrifice readability is the state machines where we check the
previous
,current
andnext
character to make decisions. For a named utility one of the rules is that a.
must be preceded by and followed by a digit. This can be written as:But this is not very fast because Rust can't optimize the match statement very well, especially because we are dealing with tuples containing 3 values and each value is a
u8
.To solve this we use some nesting, once we reach
b'.'
only then will we check for the previous and next characters. We will also early return in most places. If the previous character is not a digit, there is no need to check the next character.Classification and jump tables:
Another optimization we did is to classify the characters into a much smaller
enum
such that Rust can optimize allmatch
arms and create some jump tables behind the scenes.E.g.:
There are only 4 values in this enum, so Rust can optimize this very well. The
CLASS_TABLE
is generated at compile time and must be exactly 256 elements long to fit allu8
values.Inlining:
Last but not least, sometimes we use functions to abstract some logic. Luckily Rust will optimize and inline most of the functions automatically. In some scenarios, explicitly adding a
#[inline(always)]
improves performance, sometimes it doesn't improve it at all.You might notice that in some functions the annotation is added and in some it's not. Every state machine was tested on its own and whenever the performance was better with the annotation, it was added.
Test Plan
[…]
is only handled by the outer mostextractor/mod.rs
.extractor/mod.rs
has dedicated tests for recent bug reports related to missing candidates.There is a chance that this new parser is missing candidates even though a lot of tests are added and existing tests have been ported.
To double check, we ran the new extractor on our own projects to make sure we didn't miss anything obvious.
Tailwind UI
On Tailwind UI the diff looks like this:
diff
The reason
!filter
is gone, is because it was used like this:And right now
(
and)
are not considered valid boundary characters for a candidate.Catalyst
On Catalyst, the diff looks like this:
diff
The reason for this is that
filter
was only used as a function call:This was tested on all templates and they all remove a very small amount of classes that aren't used.
The script to test this looks like this:
This is using git worktrees, so the
pr
branch lives in atailwindcss
folder, and themain
branch lives in atailwindcss--main
folder.Fixes:
.eps
files are included in source file detection #16880 (due to validating the arbitrary property)Ideas for in the future
Cursor
object. One potential improvement we can make is to rely on theinput
on its own instead of going via the wrappingCursor
object.prefix
information. Everything that doesn't start withtw:
can be skipped.Design decisions that didn't make it
Once you reach this part, you can stop reading if you want to, but this is more like a brain dump of the things we tried and didn't work out. Wanted to include them as a reference in case we want to look back at this issue and know why certain things are implemented the way they are.
One character at a time
In an earlier implementation, the state machines were pure state machines where the
next()
function was called on every single character of the input. This had a lot of overhead because for every character we had to:CandidateMachine
which state it was in.cursor.curr
(and potentially thecursor.prev
andcursor.next
) character.In this approach, the
MachineState
looked like this instead:This had its own set of problems because now it's very hard to know whether we are done or not.
Let's look at the current position in the example above. At this point, it's both a valid variant and valid utility, so there was a lot of additional state we had to track to know whether we were done or not.
Span
stitchingAnother approach we tried was to just collect all valid variants and utilities and throw them in a big
Vec<Span>
. This reduced the amount of additional state to track and we could track a span the moment we saw aMachineState::Done(span)
.The next thing we had to do was to make sure that:
span_a.end + 1 == span_b.start
).flex!block
This approach was slow, and still a bit hard to reason about.
Matching on tuples
While matching against the
prev
,curr
andnext
characters was very readable and easy to reason about. It was not very fast. Unfortunately had to abandon this approach in favor of a more optimized approach.In a perfect world, we would still write it this way, but have some compile time macro that would optimize this for us.
Matching against
b'…'
instead of classification and jump tablesSimilar to the previous point, while this is better for readability, it's not fast enough. The jump tables are much faster.
Luckily for us, each machine has it's own set of rules and context, so it's much easier to reason about a single problem and optimize a single machine.
Footnotes
A candidate is what a potential Tailwind CSS class could be. It's a candidate because at this stage we don't know if it will actually produce something but it looks like it could be a valid class. E.g.:
hover:bg-red-500
is a candidate, but it will only produce something if--color-red-500
is defined in your theme. ↩