Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore regex source generator #7598

Open
4 tasks
rainersigwald opened this issue May 4, 2022 · 38 comments
Open
4 tasks

Explore regex source generator #7598

rainersigwald opened this issue May 4, 2022 · 38 comments
Assignees
Labels
backlog help wanted Issues that the core team doesn't plan to work on, but would accept a PR for. Comment to claim. performance triaged
Milestone

Comments

@rainersigwald
Copy link
Member

rainersigwald commented May 4, 2022

.NET 7 introduces a regular expression source generator that can generate a bunch of code at compile time to make a more optimal matcher.

Since MSBuild uses many regular expressions and knows a bunch of them at compile time, we should adopt this.

Possible locations:

(non-exhaustive)

@rainersigwald rainersigwald added this to the .NET 7.0 milestone May 4, 2022
@AraHaan
Copy link
Member

AraHaan commented May 4, 2022

It would be nice if the .NET Framework version of msbuild could go away for .NET Core based ones and have Visual Studio use the one from only the .NET SDK so it does not have to bundle a version of msbuild as well to where the .NET Core msbuild can compile both .NET SDK style projects and non-sdk style projects.

This would also help improve the quality of msbuild (and Visual Studio) as a whole.

@dotnet dotnet deleted a comment from AR-May May 9, 2022
@AR-May
Copy link
Member

AR-May commented May 9, 2022

@rainersigwald How much of perf. win do you expect here?

@danmoseley
Copy link
Member

It is mostly a startup win.
Note that the build for .NET Framework would still need to use the existing pattern, as alluded above. The generator is only for targeting .NET 7+

@rainersigwald
Copy link
Member Author

I wouldn't expect a huge win on anything, but some of the pattern matches are in warm inner loops:

  1. Validating every property/item name (so hundreds/project evaluation + many at execution time).
  2. Scanning every line of ToolTask output.

As Dan mentioned, startup time is reduced and there may be possibilities for inlining/JIT magic.

I wish we could use the generator for our .NET Framework target for unification, but alas.

@danmoseley
Copy link
Member

Cc @stephentoub fyi

@stephentoub
Copy link
Member

stephentoub commented May 9, 2022

I wish we could use the generator for our .NET Framework target for unification, but alas.

Yeah, you'll end up wanting to do something like:

#if NET7_0_OR_GREATER
[RegexGenerator(Pattern)]
public static partial Regex AwesomeSauce();
#else
public static Regex AwesomeSauce() => s_regex;
private static readonly Regex s_regex = new Regex(Pattern, RegexOptions.Compiled);
#endif

@Forgind
Copy link
Member

Forgind commented Aug 18, 2022

Just noting this is blocked on #7790

@Eli-Black-Work
Copy link

Looks like #7790 has been merged, so this is unblocked? 🙂

@rainersigwald
Copy link
Member Author

@Bosch-Eli-Black Yes, but to be clear it remains low-priority for the core team.

@Eli-Black-Work
Copy link

@rainersigwald Understood 🙂

@danmoseley
Copy link
Member

@Bosch-Eli-Black I expect they'd accept a PR? (and related #7499 too)

@Eli-Black-Work
Copy link

@danmoseley Haha, ah, yes, I'd love to! 🙂 I've just begun working through my company's procedures for getting authorization to contribute to open source projects, though, and I'm afraid that may take a very long time indeed.

@rainersigwald rainersigwald added the help wanted Issues that the core team doesn't plan to work on, but would accept a PR for. Comment to claim. label Sep 27, 2022
@nikijohnson77
Copy link

Do you need any help I don't mind helping so just let me know if you do cuz I just did one of myself last night my project I got done with it last night

@Forgind
Copy link
Member

Forgind commented Oct 27, 2022

@nikijohnson77 We'd love some! Thanks for volunteering; I just assigned it to you. Submit a PR whenever you're ready, and feel free to ask questions.

@JaynieBai
Copy link
Member

JaynieBai commented Nov 30, 2022

#if NET7_0_OR_GREATER
[RegexGenerator(Pattern)]
public static partial Regex AwesomeSauce();
#else
public static Regex AwesomeSauce() => s_regex;
private static readonly Regex s_regex = new Regex(Pattern, RegexOptions.Compiled);
#endif

Do you have a definition to distinguish different target currently? such as NET7_0_OR_GREATER. @Forgind

@ViktorHofer
Copy link
Member

NET7_0_OR_GREATER comes via the SDK, so every project that uses the Microsoft.NET.Sdk should have access to it.

@AraHaan
Copy link
Member

AraHaan commented Nov 30, 2022

NET7_0_OR_GREATER comes via the SDK, so every project that uses the Microsoft.NET.Sdk should have access to it.

only if the project targets .NET 7 in its list of target frameworks otherwise the maximum defined is NET6_0_OR_GREATER if up to .NET 6 is in it.

@rainersigwald
Copy link
Member Author

In the MSBuild codebase we generally try to avoid generic ifdefs like NET7_0_OR_GREATER in favor of specific ones like FEATURE_REGEX_GENERATOR, defined like

<PropertyGroup Condition="'$(NetCoreBuild)'=='true'">
<CompilerToolsDir>$([System.IO.Path]::Combine($(ToolPackagesDir)Microsoft.Net.Compilers, $(CompilerToolsVersion), "tools"))$([System.IO.Path]::DirectorySeparatorChar)</CompilerToolsDir>
<DefineConstants>$(DefineConstants);FEATURE_ASSEMBLYLOADCONTEXT</DefineConstants>
<DefineConstants>$(DefineConstants);FEATURE_RUNTIMEINFORMATION</DefineConstants>
<DefineConstants>$(DefineConstants);USE_MSBUILD_DLL_EXTN</DefineConstants>
<DefineConstants>$(DefineConstants);WORKAROUND_COREFX_19110</DefineConstants>
<DefineConstants>$(DefineConstants);FEATURE_SYMLINK_TARGET</DefineConstants>
</PropertyGroup>

This is helpful because it self-documents why the code is conditionally compiled; while I don't expect it in this case it's conceivable that the regex source generator could someday work on .NET Framework 4.7.2 targets, and if it did adopting it would mean searching for the specific feature, rather than having to look at all net7+ stuff.

@stephentoub
Copy link
Member

while I don't expect it in this case it's conceivable that the regex source generator could someday work on .NET Framework 4.7.2 targets

Highly unlikely ;-)

@rainersigwald
Copy link
Member Author

But if I keep whining about it in this thread . . . 😇

We got most of our benefit from the granular ifdefs in the other direction: there were lots of features that were not in .NET Core 1.0 that got ported later, and the ifdefs let us gradually reunify a bunch of code. I still prefer the granular-ifdef strategy for its "documentation" of why things are conditional.

@stephentoub
Copy link
Member

But if I keep whining about it in this thread . . . 😇

Your voice will get tired. 😛

@AraHaan
Copy link
Member

AraHaan commented Nov 30, 2022

Alternatively, one can't stop them from defining their own version of the .NET 7 attribute for the source generator and install the generator as a nuget package on non-.NET 7 targets, right? If so, that might also be an option to reunify the code.

@stephentoub
Copy link
Member

Alternatively, one can't stop them from defining their own version of the .NET 7 attribute for the source generator and install the generator as a nuget package on non-.NET 7 targets, right? If so, that might also be an option to reunify the code.

The generator emits code that uses APIs new to .NET 7, including how the generated code plugs into Regex itself.

@danmoseley
Copy link
Member

danmoseley commented Nov 30, 2022

If the pattern is relatively straightforward and stable another option is to inspect the generated code and adjust it to make your own hardcoded pattern matcher, no Regex involved.

@rainersigwald
Copy link
Member Author

Just noting that I edited the OP to add a new location, where startup time might be a noticeable win: in solution parsing

// An example of a project line looks like this:
// Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "ClassLibrary1", "ClassLibrary1\ClassLibrary1.csproj", "{05A5AD00-71B5-4612-AF2F-9EA9121C4111}"
private static readonly Lazy<Regex> s_crackProjectLine = new Lazy<Regex>(
() => new Regex
(
"^" // Beginning of line
+ "Project\\(\"(?<PROJECTTYPEGUID>.*)\"\\)"
+ "\\s*=\\s*" // Any amount of whitespace plus "=" plus any amount of whitespace
+ "\"(?<PROJECTNAME>.*)\""
+ "\\s*,\\s*" // Any amount of whitespace plus "," plus any amount of whitespace
+ "\"(?<RELATIVEPATH>.*)\""
+ "\\s*,\\s*" // Any amount of whitespace plus "," plus any amount of whitespace
+ "\"(?<PROJECTGUID>.*)\""
+ "$", // End-of-line
RegexOptions.Compiled
)
);
// An example of a property line looks like this:
// AspNetCompiler.VirtualPath = "/webprecompile"
// Because website projects now include the target framework moniker as
// one of their properties, <PROPERTYVALUE> may now have '=' in it.
private static readonly Lazy<Regex> s_crackPropertyLine = new Lazy<Regex>(
() => new Regex
(
"^" // Beginning of line
+ "(?<PROPERTYNAME>[^=]*)"
+ "\\s*=\\s*" // Any amount of whitespace plus "=" plus any amount of whitespace
+ "(?<PROPERTYVALUE>.*)"
+ "$", // End-of-line
RegexOptions.Compiled
)
);

Since that's one of the first things we do in a dotnet build foo.sln scenario, the compiled-already-ness would likely remove regex compilation from a critical path.

@danmoseley
Copy link
Member

Reducing JITting is also helpful if Native AOT is ever a goal. (But given MSBuild reflects over assemblies it dynamically discovers, perhaps that would be challenging..)

@danmoseley
Copy link
Member

@nikijohnson77 how is it going on this?

@AraHaan
Copy link
Member

AraHaan commented Dec 6, 2022

Wait why is the regex used for solution parsing, if only there was a library that msbuild could use that uses a much better way of parsing solution files that is just as fast as .NET 7's regex.

@danmoseley
Copy link
Member

@AraHaan the change to use a source generated regex is quite localized and straightforward. If you're proposing replacing the component, that is best discussed in an issue specific to that (which I'm guessing exists)

@danmoseley
Copy link
Member

danmoseley commented Dec 22, 2022

OK, so I played with this and it's quite doable, but a little less than straightforward because of the complexity of MSBuild regexes, and the fact that they are built up from strings (which sometimes themselves are built up from strings).

As an extreme example, take this regex

internal static readonly Lazy<Regex> NonTransformItemMetadataPattern = new Lazy<Regex>(
() => new Regex
(
@"((?<=" + ItemVectorWithTransformLHS + @")" + ItemMetadataSpecification + @"(?!" +
ItemVectorWithTransformRHS + @")) | ((?<!" + ItemVectorWithTransformLHS + @")" +
ItemMetadataSpecification + @"(?=" + ItemVectorWithTransformRHS + @")) | ((?<!" +
ItemVectorWithTransformLHS + @")" + ItemMetadataSpecification + @"(?!" +
ItemVectorWithTransformRHS + @"))",
RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.Compiled
));

which includes strings that are themselves built up, eg.,
private const string ItemVectorWithTransformLHS = @"@\(\s*" + ProjectWriter.itemTypeOrMetadataNameSpecification + @"\s*->\s*'[^']*";

The generator can work on this, creating

[GeneratedRegex("((?<=@\\(\\s*[A-Za-z_][A-Za-z_0-9\\-]*\\s*->\\s*'[^']*)%\\(\\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\.\\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\)(?![^']*'(\\s*,\\s*'[^']*')?\\s*\\)))|((?<!@\\(\\s*[A-Za-z_][A-Za-z_0-9\\-]*\\s*->\\s*'[^']*)%\\(\\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\.\\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\)(?=[^']*'(\\s*,\\s*'[^']*')?\\s*\\)))|((?<!@\\(\\s*[A-Za-z_][A-Za-z_0-9\\-]*\\s*->\\s*'[^']*@)%\\(\\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\.\\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\)(?![^']*'(\\s*,\\s*'[^']*')?\\s*\\)))",RegexOptions.ExplicitCapture|RegexOptions.Compiled|RegexOptions.IgnorePatternWhitespace)]

It's probably worth converting to a verbatim string (put a @ at the front and replace \ with ) in order to make them more readable. I opened suggestion dotnet/runtime#79895 for that.

now I have

[GeneratedRegex(@"((?<=@\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*)%\(\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\-]*)\s*\)(?![^']*'(\s*,\s*'[^']*')?\s*\)))|((?<!@\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*)%\(\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\-]*)\s*\)(?=[^']*'(\s*,\s*'[^']*')?\s*\)))|((?<!@\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*@)%\(\s*(?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)?(?<NAME>[A-Za-z_][A-Za-z_0-9\-]*)\s*\)(?![^']*'(\s*,\s*'[^']*')?\s*\)))",RegexOptions.ExplicitCapture|RegexOptions.Compiled|RegexOptions.IgnorePatternWhitespace)]

The names of the substring variables had been acting as self documentation, so potentially one could manually add those as comments, something like this:

            [GeneratedRegex(@"
    ((?<=@\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*        # ItemVectorWithTransformLHS
)
    %\(\s* (?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)? (?<NAME>[A-Za-z_][A-Za-z_0-9\-]*) \s*\)        # ItemMetadataSpecification
(?!
    [^']*'(\s*,\s*'[^']*')?\s*\)        # ItemVectorWithTransformRHS
)) | ((?<!
    @\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*        # ItemVectorWithTransformLHS
)
    %\(\s* (?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)? (?<NAME>[A-Za-z_][A-Za-z_0-9\-]*) \s*\)        # ItemMetadataSpecification
(?=
    [^']*'(\s*,\s*'[^']*')?\s*\)        # ItemVectorWithTransformRHS
)) | ((?<!
    @\(\s*[A-Za-z_][A-Za-z_0-9\-]*\s*->\s*'[^']*        # ItemVectorWithTransformLHS
@)
    %\(\s* (?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\-]*)\s*\.\s*)? (?<NAME>[A-Za-z_][A-Za-z_0-9\-]*) \s*\)        # ItemMetadataSpecification
(?!
    [^']*'(\s*,\s*'[^']*')?\s*\)        # ItemVectorWithTransformRHS      
))
", RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace)]

TBH I'm not sure that's clearer. I don't think I would bother. It would be better to just memorialize the existing fragments in comments, rather than in the regex.

Note one other wrinkle - a number of these files compile for older versions than .NET 7. So it is necessary to wrap these in #if, so something like

            internal static readonly Lazy<Regex> ItemMetadataPattern = new Lazy<Regex>(
                () => new Regex(ItemMetadataSpecification,
                    RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.Compiled));

becomes

            internal static readonly Lazy<Regex> ItemMetadataPattern = new Lazy<Regex>(() => 
#if NET7_0_OR_GREATER
                GetItemMetadataPatternRegex());
#else
                new Regex(ItemMetadataSpecification,
                    RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture | RegexOptions.Compiled));
#endif

#if NET7_0_OR_GREATER
            [GeneratedRegex("%\\(\\s* (?<ITEM_SPECIFICATION>(?<ITEM_TYPE>[A-Za-z_][A-Za-z_0-9\\-]*)\\s*\\.\\s*)? (?<NAME>[A-Za-z_][A-Za-z_0-9\\-]*) \\s*\\)", RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace)]
            private static partial Regex GetItemMetadataPatternRegex();
#endif

These #if's aren't pretty, but it's worth it to eliminate all this JIT work at runtime when running on .NET 7+.

I note that most of the Regexes in the code are constructed lazily, since #1199 to delay generating the IL. I don't think that's necessary for the source generated ones, as constructing them is cheap. However, it's probably simplest to still put them inside the lazy construction, as I've done above, so that all the places that use them can remain unchanged.

@danmoseley
Copy link
Member

I also hit a couple of issues using the fixer
dotnet/runtime#79891 --- this is partially fixed in .NET 8 builds
dotnet/runtime#79892 --- I dealt with this by just partially reverting the changes.

Neither stop it working, just make a little more manual tuning necessary.

@nikijohnson77 does this make sense? LMK if you need more help.

@KalleOlaviNiemitalo
Copy link

Doesn't the regex source generator support references to named string constants in the attribute argument?

@AraHaan
Copy link
Member

AraHaan commented Dec 23, 2022

I feel like the regex string can be further simplified and made smaller (by defining common parts in each as const and then use string interpolation). So then even in the .NET 7's attribute code, it would use the constant used for the input regex string.

@danmoseley
Copy link
Member

If that works - great.

@danmoseley
Copy link
Member

the issues I linked above are now mostly fixed. Is anyone (@AraHaan @KalleOlaviNiemitalo?) interested in doing this? might start with just enabling one or two to see how it goes.

@AraHaan
Copy link
Member

AraHaan commented Apr 23, 2023

I been busy lately but I could see what could be done on this when I have more time.

@nikijohnson77
Copy link

Ty

@nikijohnson77
Copy link

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog help wanted Issues that the core team doesn't plan to work on, but would accept a PR for. Comment to claim. performance triaged
Projects
None yet
Development

No branches or pull requests