Look-behind and word boundary regex operators where supported don't work properly. #611

Open
stephane-chazelas opened this issue Mar 2, 2025 · 6 comments

Comments

@stephane-chazelas

echo ababcab | less '+/\<ab'

Where \< is supported, such as with --with-regex=gnu or posix on GNU systems, this highlights the first two abs even though the second ab does not follow a word boundary. The same happens with less '/\bab' with pcre2, and likely with pcre as well.

With the POSIX API, there may not be any way around that (and anyway, \< is not a POSIX regex operator, even if the GNU implementation supports it as an extension), but with the GNU or PCRE/PCRE2 APIs, it should be possible to avoid it as their respective re_search() and pcre2_match() APIs both can search within an input string starting at a specified offset.

It also affects look-behind operators where available like with PCRE2:

echo foofoo | less '+/(?<!foo)foo'

For instance, this (with less built with --with-regex=pcre2) highlights both foos.

@gwsw
Owner

gwsw commented Mar 2, 2025

Hm, I'm not sure what can be done about this. In your example, we call the search function (regexec or re_search or whatever) which finds the first "ab". Then we call it again, passing the part of the line after the matched "ab". This also begins with "ab", and since it is at the start of the newly-passed string, the search function thinks it is another match.

There was a similar problem with using ^ to match the beginning of the line, but this was fixed by using the NOTBOL flag which many of the search functions have, to tell the search function that the start of the string is not really the start of a line (REG_NOTBOL, PCRE_NOTBOL, etc). As far as I can see, there is not a similar flag to indicate that the beginning of the string is not really the beginning of a word, in pcre, pcre2 or posix.
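The repeated-search behaviour described in this comment can be reproduced outside less. What follows is only a minimal sketch, using Python's re module as a stand-in for the regex libraries discussed here; it is not less's code, and the helper name `matches_by_slicing` is made up for illustration. Each search is handed only the remainder of the line, so the engine cannot see the character that precedes it.

```python
import re


def matches_by_slicing(pattern, line):
    """Return match start offsets, re-searching only the unmatched tail.

    Hypothetical helper (not from less): after each match, the engine is
    given a fresh string that begins right after the previous match.
    """
    regex = re.compile(pattern)
    starts = []
    base = 0  # offset of the current tail within the full line
    while base <= len(line):
        m = regex.search(line[base:])  # the engine sees only the tail
        if m is None:
            break
        starts.append(base + m.start())
        # Step past the match; move at least one character on an empty match.
        base += max(m.end(), m.start() + 1)
    return starts
```

With Python's `\b` standing in for `\<`, `matches_by_slicing(r'\bab', 'ababcab')` returns `[0, 2]` and `matches_by_slicing(r'(?<!foo)foo', 'foofoo')` returns `[0, 3]`, which are the same spurious second matches reported in this issue.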

@stephane-chazelas
Author

As mentioned above, with APIs that support specifying start offsets in their search/match functions (which AFAIK excludes the POSIX API but includes the GNU, PCRE and PCRE2 ones), the way to fix it would be, when searching for successive matches, to do:

search(whole_line, pattern, offset_after_previous_match)

Instead of:

search(part_of_the_line_after_previous_match, pattern, 0)

That's typically what GNU tools do.

Compare:

$ echo ababcab | ltrace -e re_search sed 's/\<ab/X/g'
sed->re_search(0x55637c6d7330, "ababcab", 7, 0, 7, 0x55635122b680)                                = 0
sed->re_search(0x55637c6d7330, "ababcab", 7, 2, 5, 0x55635122b680)                                = -1
Xabcab

With less doing:

less->re_search(0x55ee5246f920, "ababcab", 7, 0, 7, 0x7ffe2d270f10)                               = 0
less->re_search(0x55ee5246f920, "ababcab", 7, 0, 7, 0x7ffe2d270df0)                               = 0
less->re_search(0x55ee5246f920, "abcab", 5, 0, 5, 0x7ffe2d270d90)                                 = 0
less->re_search(0x55ee5246f920, "cab", 3, 0, 3, 0x7ffe2d270d90)                                   = -1
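The difference between the two traces can be sketched as follows. This is an illustration only, with Python's re module as a stand-in: the `pos` argument of a compiled pattern's `search()` plays the role of the start-offset parameter of re_search() and pcre2_match(), and `matches_by_offset` is a hypothetical name, not a function from less or sed. The whole line is always passed; only the start offset moves, so the engine can still look at the text before it.

```python
import re


def matches_by_offset(pattern, line):
    """Return match start offsets, always searching the whole line.

    Hypothetical helper: successive searches pass the full line plus a
    start offset, so word-boundary and look-behind checks can inspect
    the characters before that offset.
    """
    regex = re.compile(pattern)
    starts = []
    pos = 0  # where the next search starts within the full line
    while pos <= len(line):
        m = regex.search(line, pos)  # full line, moving start offset
        if m is None:
            break
        starts.append(m.start())
        # Step past the match; move at least one character on an empty match.
        pos = max(m.end(), m.start() + 1)
    return starts
```

With Python's `\b` standing in for `\<`, `matches_by_offset(r'\bab', 'ababcab')` returns `[0]`, in line with the single substitution sed makes above, and `matches_by_offset(r'(?<!foo)foo', 'foofoo')` also returns `[0]`.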

If that's fixed for the GNU regex API, then it may make sense to reorder the list for --with-regex=auto so gnu comes before posix in configure.ac at least on GNU systems, as that's arguably a better API for the same regex flavour there as it allows at least \</\> to work properly (and bring more consistency with other GNU tools).

@stephane-chazelas
Author

See also https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=656694 for another reason why one may prefer gnu over posix.

gwsw added a commit that referenced this issue Mar 3, 2025
A word boundary search ("\<" in gnu or "\b" in pcre) will sometimes
match strings that do not actually start at a word boundary.
This fixes the problem, when using the gnu, pcre or pcre2 regex
libraries, by using the start offset parameter in the match function.

Related to #611.
@gwsw
Owner

gwsw commented Mar 3, 2025

Sorry, I did not read your initial report carefully enough. This should be fixed in cc1bc7c for the gnu, pcre and pcre2 regex libraries. I am still evaluating whether to make gnu have precedence over posix, which seems a somewhat separate issue.

@gwsw
Owner

gwsw commented Mar 4, 2025

In 342c928, the fix is implemented for POSIX too, if the REG_STARTEND feature is supported.

@stephane-chazelas
Author

Thanks, TIL, I wasn't aware of that REG_STARTEND BSD extension.

I guess that means there's no reason to reorder gnu before posix on GNU systems (which also support that extension) with --with-regex=auto any longer if it addresses both this issue and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=656694 (note I've not tested it).
