Addition of unrelated, unused parser rule to grammar causes different parsing results #3851

kaby76 · 2022-08-31T00:19:55Z

This is related to a SO question https://stackoverflow.com/q/73522066/4779853

The grammar is poorly written as an EOF-start rule, but that is irrelevant because parsing results should always be consistent when a completely unrelated, unused parser rule is added or deleted. Therefore there is something wrong in the runtime.

grammar my; 
st: 'H' hd | EOF ;
hd: 'D' d | 'C' c | st ;
d: hd ;
c: 'D' c | hd ;
s1: 'D' s1 | c ;
// p: hd ;
SKP: [ \t\r\n]+ -> skip;

Input
H C D C C D

Commented rule p causes parser to work one way, uncomment rule p works another way. It is illogical for a rule that is unused to influence a parse because it's never used in a derivation. It must be investigated.

The text was updated successfully, but these errors were encountered:

KvanTTT · 2022-08-31T14:23:45Z

It looks like the similar bug was already reported: #2179 But it's about weird error instead of different parsing result.

parrt · 2022-08-31T23:54:19Z

A random unused rule can cause FOLLOW sets to be different if I recall. Yeah, this grammar is pretty messed up. The issue relates to whether or not lookahead "falls off the edge" ever. Because of the tangle of rules (ref to hd), there is no rule w/o a follow set. adding p means adding a place where hd can be followed by nothing. W/o swapping in more knowledge from cold storage I can't explain any better.

parrt · 2022-09-01T00:04:44Z

Ah. i think it's the LL(1) analyzer. See comments on _LOOK(). EOF can be added to FOLLOW set depending on hitting end of a rule. Not saying code is correct exactly but I can't remember enough to resolve.

kaby76 · 2022-09-01T12:45:36Z

This is legal in Antlr. What does it mean?

grammar T3;
s : 'a' EOF 'c' | 'b' ;

I'm not sure whether it's a good idea to allow rules with EOF that

are in one alt and missing in another;
or allow symbols to follow an EOF.
It's not clear what these mean, and Antlr has a problem in grammar my where commented and uncommented rule p yields two different trees.

If it is allowed, it must be crystal clear what they mean. (I have an xpath that should flag such rules, but trxgrep has a bug that I now have to fix.)

KvanTTT · 2022-09-01T14:24:07Z

s : 'a' EOF 'c' | 'b' ;

I think such code should be disallowed. EOF might be placed only at the end position of the rule. EOF_CLOSURE error should be extended for covering such cases.

ericvergnaud · 2022-09-01T14:25:52Z

True, but then you'd also want to disable combination of rules that have an EOF not at the end. Tricky imho...

KvanTTT · 2022-09-01T14:40:44Z

are in one alt and missing in another;

~~Why do you think it should be disallowed? It's quite real code.~~

Yes, it looks like it also doesn't make sense. If it's encountered, the grammar should be rewritten to prevent potential weird behaviour and errors.

or allow symbols to follow an EOF.

Agree with that. It's nonsense since the part after EOF is unreachable.

True, but then you'd also want to disable combination of rules that have an EOF not at the end. Tricky imho...

At least not ending EOF can be disallowed within one rule, it's easy. Detecting not ending EOF within the entire grammar is also possible. To implement it, containsEOF flag should be considered for every rule. By the way, I've implemented the similar warnings but for beginning position: PREDICATE_AT_THE_BEGINNING_OF_LEXER_RULE_DEGRADES_PERFORMANCE.

I suggest introducing the new errors together with warnings from #3626, they are processed in the same way.

kaby76 · 2022-09-01T15:23:31Z

#3763 -- we could just take it out of the hands of users. :)

parrt · 2022-09-01T19:56:41Z

Hi guys...any change to the secret sauce parsing algorithm would require me to fully understand it again. Not sure I am willing to dig deep enough. Changing the grammar semantics for EOF is less risky but would likely break backward compatibility. Also I've just never had a problem putting EOF on end of a start rule or leaving it off so I can parse partial phrases.

parrt · 2022-09-01T19:58:21Z

I'd rather we focus on larger issues than grammar warnings that I might find too risky to merge. Maybe help with antlr4-lab?

kaby76 · 2022-09-01T20:30:44Z

Some good news.

There are no rules with EOF followed by another element in grammars-v4.
There are a few grammars where an EOF occurs on one alt but not in another. I'll take a look at these.

./cool/COOL.g4 has an EOF in one alt, but not in another.
./csharp/Antlr4cs/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./csharp/CSharp/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./csharp/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./golang/Go/GoParser.g4 has an EOF in one alt, but not in another.
./golang/GoParser.g4 has an EOF in one alt, but not in another.
./hypertalk/HyperTalk.g4 has an EOF in one alt, but not in another.
./java/java/JavaParser.g4 has an EOF in one alt, but not in another.
./javadoc/JavadocParser.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/CSharp/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/CSharpSharwell/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/Go/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/JavaScript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/Python/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/TypeScript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/javascript/JavaScriptParser.g4 has an EOF in one alt, but not in another.
./javascript/jsx/JavaScriptParser.g4 has an EOF in one alt, but not in another.
./javascript/typescript/TypeScriptParser.g4 has an EOF in one alt, but not in another.
./kirikiri-tjs/TJSParser.g4 has an EOF in one alt, but not in another.
./kotlin/kotlin-formal/KotlinParser.g4 has an EOF in one alt, but not in another.
./objc/two-step-processing/ObjectiveCPreprocessorParser.g4 has an EOF in one alt, but not in another.
./save.generated/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./save.generated/Test/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./smtlibv2/SMTLIBv2.g4 has an EOF in one alt, but not in another.
./sql/tsql/TSqlParser.g4 has an EOF in one alt, but not in another.
./v/V.g4 has an EOF in one alt, but not in another.
./wat/WatParser.g4 has an EOF in one alt, but not in another.

Both of these conditions are really easy to find with Trash. Here's a Bash script:

#!/usr/bin/bash
for i in `find . -name '*.g4' | grep -v Generated | grep -v examples | grep -v Lexer`
do
 count=`trparse $i -t antlr4 2> /dev/null \
  | trxgrep ' //parserRuleSpec//alternative/element[.//TOKEN_REF/text()="EOF"]/following-sibling::element' \
  | trtext -c`
 if [ "$count" != "0" ]
 then
  echo $i has an EOF usage followed by another element.
 fi
 count=`trparse $i -t antlr4 2> /dev/null \
  | trxgrep ' //labeledAlt[.//TOKEN_REF/text()="EOF" and count(../labeledAlt) > 1]' \
  | trtext -c`
 if [ "$count" != "0" ]
 then
  echo $i has an EOF in one alt, but not in another.
 fi
done

kaby76 mentioned this issue Apr 22, 2023

EOF fails to match when starting rule used elsewhere. #4242

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Addition of unrelated, unused parser rule to grammar causes different parsing results #3851

Addition of unrelated, unused parser rule to grammar causes different parsing results #3851

kaby76 commented Aug 31, 2022 •

edited

Loading

KvanTTT commented Aug 31, 2022

parrt commented Aug 31, 2022

parrt commented Sep 1, 2022

kaby76 commented Sep 1, 2022 •

edited

Loading

KvanTTT commented Sep 1, 2022

ericvergnaud commented Sep 1, 2022

KvanTTT commented Sep 1, 2022 •

edited

Loading

kaby76 commented Sep 1, 2022

parrt commented Sep 1, 2022

parrt commented Sep 1, 2022

kaby76 commented Sep 1, 2022

Addition of unrelated, unused parser rule to grammar causes different parsing results #3851

Addition of unrelated, unused parser rule to grammar causes different parsing results #3851

Comments

kaby76 commented Aug 31, 2022 • edited Loading

KvanTTT commented Aug 31, 2022

parrt commented Aug 31, 2022

parrt commented Sep 1, 2022

kaby76 commented Sep 1, 2022 • edited Loading

KvanTTT commented Sep 1, 2022

ericvergnaud commented Sep 1, 2022

KvanTTT commented Sep 1, 2022 • edited Loading

kaby76 commented Sep 1, 2022

parrt commented Sep 1, 2022

parrt commented Sep 1, 2022

kaby76 commented Sep 1, 2022

kaby76 commented Aug 31, 2022 •

edited

Loading

kaby76 commented Sep 1, 2022 •

edited

Loading

KvanTTT commented Sep 1, 2022 •

edited

Loading