Look-behind and word boundary regex operators where supported don't work properly. #611

stephane-chazelas · 2025-03-02T20:32:10Z

echo ababcab | less '+/\<ab'

Where \< is supported, such as with --with-regex=gnu or posix on GNU systems, this highlights the first two "ab"s even though the second "ab" does not follow a word boundary. The same happens with less '/\bab' with pcre2, and likely with pcre as well.

With the POSIX API, there may not be any way around that (and anyway, \< is not a POSIX regex operator, even if the GNU implementation supports it as an extension), but with the GNU or PCRE/PCRE2 APIs, it should be possible to avoid it as their respective re_search() and pcre2_match() APIs both can search within an input string starting at a specified offset.

It also affects look-behind operators where available like with PCRE2:

echo foofoo | less '+/(?<!foo)foo'

For instance, this (with less built with --with-regex=pcre2) highlights both foos.


gwsw · 2025-03-02T23:01:45Z

Hm, I'm not sure what can be done about this. In your example, we call the search function (regexec or re_search or whatever) which finds the first "ab". Then we call it again, passing the part of the line after the matched "ab". This also begins with "ab", and since it is at the start of the newly-passed string, the search function thinks it is another match.
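
The repeated search on the remainder of the line can be reproduced outside less. The following is only a rough sketch in Python, whose `re` module stands in for the C regex libraries discussed here; `match_starts_by_slicing` is a hypothetical helper name, not anything from the less source.

```python
import re


def match_starts_by_slicing(pattern, line):
    """Hypothetical helper: find matches by re-searching the text after each match."""
    regex = re.compile(pattern)
    starts = []
    base = 0  # offset of the current slice within the original line
    while base <= len(line):
        # The engine only sees the remainder, so it treats the slice start
        # as the start of the string (and therefore as a word boundary).
        m = regex.search(line[base:])
        if m is None:
            break
        starts.append(base + m.start())
        # Step past the match; advance at least one character on empty matches.
        base += max(m.end(), m.start() + 1)
    return starts


print(match_starts_by_slicing(r"\bab", "ababcab"))  # prints [0, 2]
```

The second reported match at offset 2 is the false positive: once the first "ab" is cut off, nothing tells the engine that a word character preceded the slice.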

There was a similar problem with using ^ to match the beginning of the line, but this was fixed by using the NOTBOL flag which many of the search functions have, to tell the search function that the start of the string is not really the start of a line (REG_NOTBOL, PCRE_NOTBOL, etc). As far as I can see, there is not a similar flag to indicate that the beginning of the string is not really the beginning of a word, in pcre, pcre2 or posix.

stephane-chazelas · 2025-03-03T06:32:27Z

As mentioned above, with APIs that support specifying a start offset in their search/match functions (which AFAIK excludes the POSIX API but includes the GNU, PCRE and PCRE2 ones), the way to fix it would be, when searching for successive matches, to do:

search(whole_line, pattern, offset_after_previous_match)

Instead of:

search(part_of_the_line_after_previous_match, pattern, 0)

That's typically what GNU tools do.

Compare:

$ echo ababcab | ltrace -e re_search sed 's/\<ab/X/g'
sed->re_search(0x55637c6d7330, "ababcab", 7, 0, 7, 0x55635122b680)                                = 0
sed->re_search(0x55637c6d7330, "ababcab", 7, 2, 5, 0x55635122b680)                                = -1
Xabcab

With less doing:

less->re_search(0x55ee5246f920, "ababcab", 7, 0, 7, 0x7ffe2d270f10)                               = 0
less->re_search(0x55ee5246f920, "ababcab", 7, 0, 7, 0x7ffe2d270df0)                               = 0
less->re_search(0x55ee5246f920, "abcab", 5, 0, 5, 0x7ffe2d270d90)                                 = 0
less->re_search(0x55ee5246f920, "cab", 3, 0, 3, 0x7ffe2d270d90)                                   = -1
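
The offset-based approach that the sed trace shows can be sketched in Python as well. This is an illustration under assumptions, not the less implementation: Python's `Pattern.search(string, pos)` plays the role of the start-offset parameter of re_search() or pcre2_match(), and `match_starts_by_offset` is a hypothetical helper name.

```python
import re


def match_starts_by_offset(pattern, line):
    """Hypothetical helper: find matches by searching the whole line from an offset."""
    regex = re.compile(pattern)
    starts = []
    pos = 0
    while pos <= len(line):
        # The whole line is passed every time; only the starting offset moves,
        # so \b and look-behind can still inspect the text before pos.
        m = regex.search(line, pos)
        if m is None:
            break
        starts.append(m.start())
        # Continue after the match; advance one character on empty matches.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return starts


print(match_starts_by_offset(r"\bab", "ababcab"))        # prints [0]
print(match_starts_by_offset(r"(?<!foo)foo", "foofoo"))  # prints [0]
```

Only the first "ab" and the first "foo" are reported, which is the behaviour expected for both examples in the initial report.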

If that's fixed for the GNU regex API, then it may make sense to reorder the list for --with-regex=auto so that gnu comes before posix in configure.ac, at least on GNU systems. That's arguably a better API for the same regex flavour there, as it at least allows \</\> to work properly (and brings more consistency with other GNU tools).

stephane-chazelas · 2025-03-03T08:18:45Z

See also https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=656694 for another reason why one may prefer gnu over posix.

A word boundary search ("\<" in gnu or "\b" in pcre) will sometimes match strings that do not actually start at a word boundary. This fixes the problem, when using the gnu, pcre or pcre2 regex libraries, by using the start offset parameter in the match function. Related to #611.

gwsw · 2025-03-03T21:39:35Z

Sorry, I did not read your initial report carefully enough. This should be fixed in cc1bc7c for the gnu, pcre and pcre2 regex libraries. I am still evaluating whether to make gnu have precedence over posix, which seems a somewhat separate issue.

gwsw · 2025-03-04T17:52:11Z

In 342c928, the fix is implemented for POSIX too, if the REG_STARTEND feature is supported.

stephane-chazelas · 2025-03-05T08:10:04Z

Thanks, TIL, I wasn't aware of that REG_STARTEND BSD extension.

I guess that means there's no reason to reorder gnu before posix on GNU systems (which also support that extension) with --with-regex=auto any longer if it addresses both this issue and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=656694 (note I've not tested it).
