Look-behind and word boundary regex operators where supported don't work properly. #611
Comments
Hm, I'm not sure what can be done about this. In your example, we call the search function (regexec or re_search or whatever), which finds the first "ab". Then we call it again, passing the part of the line after the matched "ab". This also begins with "ab", and since it is at the start of the newly passed string, the search function thinks it is another match. There was a similar problem with using ^ to match the beginning of the line, but this was fixed by using the NOTBOL flag, which many of the search functions have, to tell the search function that the start of the string is not really the start of a line (REG_NOTBOL, PCRE_NOTBOL, etc). As far as I can see, there is no similar flag to indicate that the beginning of the string is not really the beginning of a word, in pcre, pcre2 or POSIX.
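The behaviour described above can be reproduced outside of less. The following is only a sketch: it uses Python's `re` module as a stand-in for the C regex libraries, the helper name `find_all_by_slicing` and the input "abab" are made up for illustration, and it is not the code less actually runs.

```python
import re

def find_all_by_slicing(pattern, line):
    """Find successive matches by re-searching only the remainder of the line.

    Hypothetical helper: it mimics passing "the part of the line after the
    match" to the search function, so the regex engine never sees the text
    that precedes the new starting point.
    """
    regex = re.compile(pattern)
    spans = []
    base = 0
    while base <= len(line):
        m = regex.search(line[base:])  # engine only sees the tail of the line
        if m is None:
            break
        spans.append((base + m.start(), base + m.end()))
        # Step past the match; move one extra character after an empty match
        # so the loop always makes progress.
        base += m.end() if m.end() > m.start() else m.end() + 1
    return spans

# The second "ab" is reported even though it does not follow a word boundary.
print(find_all_by_slicing(r"\bab", "abab"))
```

The second match appears because the sliced string starts with "ab", so the engine treats its first character as the start of a word.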
As mentioned above, with APIs that support specifying start offsets in their search/match functions (which AFAIK excludes the POSIX API but includes the GNU, PCRE and PCRE2 ones), the way to fix it would be, when searching for successive matches, to do:
Instead of:
That's typically what GNU tools do; Compare:
With
If that's fixed for the GNU regex API, then it may make sense to reorder the list for
See also https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=656694 for another reason why one may prefer
A word boundary search ("\<" in gnu or "\b" in pcre) will sometimes match strings that do not actually start at a word boundary. This fixes the problem, when using the gnu, pcre or pcre2 regex libraries, by using the start offset parameter in the match function. Related to #611.
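A rough illustration of the start-offset approach follows. It is a sketch in Python rather than the actual C change: the `pos` argument of `Pattern.search` stands in for the start offset parameter of the GNU, PCRE and PCRE2 match functions, and the helper name `find_all_with_offset` and the sample inputs are hypothetical.

```python
import re

def find_all_with_offset(pattern, line):
    """Find successive matches by searching the whole line from an offset.

    Hypothetical helper: the full line is always passed to the engine, so
    word-boundary checks can look at the character before the start offset.
    """
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(line):
        m = regex.search(line, pos)  # whole line plus a start offset
        if m is None:
            break
        spans.append(m.span())
        # Step past the match; move one extra character after an empty match
        # so the loop always makes progress.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return spans

# Only the "ab" occurrences that really start a word are reported.
print(find_all_with_offset(r"\bab", "abab ab"))
```

Because the engine can see the "b" that precedes offset 2, it no longer treats the second "ab" in "abab" as the start of a word.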
Sorry, I did not read your initial report carefully enough. This should be fixed in cc1bc7c for the gnu, pcre and pcre2 regex libraries. I am still evaluating whether to make gnu have precedence over posix, which seems a somewhat separate issue.
In 342c928, the fix is implemented for POSIX too, if the REG_STARTEND feature is supported.
Thanks, TIL, I wasn't aware of that. I guess that means there's no reason to reorder
Where "\<" is supported, such as with --with-regex=gnu or posix on GNU systems (the same happens with less '/\bab' with pcre2 and likely pcre as well), the search highlights the first 2 "ab"s even though the second "ab" does not follow a word boundary.
With the POSIX API, there may not be any way around that (and anyway, "\<" is not a POSIX regex operator, even if the GNU implementation supports it as an extension), but with the GNU or PCRE/PCRE2 APIs, it should be possible to avoid it, as their respective re_search() and pcre2_match() APIs can both search within an input string starting at a specified offset.
It also affects look-behind operators where available, like with PCRE2:
For instance (with less built with --with-regex=pcre2), the search highlights both "foo"s.
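The look-behind case can be sketched the same way. This uses Python's `re` module instead of PCRE2, and the pattern `(?<!foo)foo` and the input "foofoo" are assumptions chosen for illustration, not the command from the report.

```python
import re

# Hypothetical pattern: "foo" that is not immediately preceded by "foo".
regex = re.compile(r"(?<!foo)foo")
line = "foofoo"

first = regex.search(line)

# Re-searching only the remainder hides the preceding "foo" from the
# look-behind, so the second "foo" is wrongly reported as a match.
sliced = regex.search(line[first.end():])

# Passing the full line with a start offset keeps the preceding text
# visible, so the look-behind correctly rejects the second "foo".
with_offset = regex.search(line, first.end())

print(sliced is not None, with_offset is None)
```

Both searches start at the same position in the line; they differ only in whether the engine can see the text before that position.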