Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of unrelated, unused parser rule to grammar causes different parsing results #3851

Open
kaby76 opened this issue Aug 31, 2022 · 11 comments

Comments

@kaby76
Copy link
Contributor

kaby76 commented Aug 31, 2022

This is related to a SO question https://stackoverflow.com/q/73522066/4779853

The grammar is poorly written as an EOF-start rule, but that is irrelevant because parsing results should always be consistent when a completely unrelated, unused parser rule is added or deleted. Therefore there is something wrong in the runtime.

grammar my; 
st: 'H' hd | EOF ;
hd: 'D' d | 'C' c | st ;
d: hd ;
c: 'D' c | hd ;
s1: 'D' s1 | c ;
// p: hd ;
SKP: [ \t\r\n]+ -> skip;

Input
H C D C C D

Commented rule p causes parser to work one way, uncomment rule p works another way. It is illogical for a rule that is unused to influence a parse because it's never used in a derivation. It must be investigated.

@KvanTTT
Copy link
Member

KvanTTT commented Aug 31, 2022

It looks like the similar bug was already reported: #2179 But it's about weird error instead of different parsing result.

@parrt
Copy link
Member

parrt commented Aug 31, 2022

A random unused rule can cause FOLLOW sets to be different if I recall. Yeah, this grammar is pretty messed up. The issue relates to whether or not lookahead "falls off the edge" ever. Because of the tangle of rules (ref to hd), there is no rule w/o a follow set. adding p means adding a place where hd can be followed by nothing. W/o swapping in more knowledge from cold storage I can't explain any better.

@parrt
Copy link
Member

parrt commented Sep 1, 2022

Ah. i think it's the LL(1) analyzer. See comments on _LOOK(). EOF can be added to FOLLOW set depending on hitting end of a rule. Not saying code is correct exactly but I can't remember enough to resolve.

@kaby76
Copy link
Contributor Author

kaby76 commented Sep 1, 2022

This is legal in Antlr. What does it mean?

grammar T3;
s : 'a' EOF 'c' | 'b' ;

I'm not sure whether it's a good idea to allow rules with EOF that

  • are in one alt and missing in another;
  • or allow symbols to follow an EOF.
    It's not clear what these mean, and Antlr has a problem in grammar my where commented and uncommented rule p yields two different trees.

If it is allowed, it must be crystal clear what they mean. (I have an xpath that should flag such rules, but trxgrep has a bug that I now have to fix.)

@KvanTTT
Copy link
Member

KvanTTT commented Sep 1, 2022

s : 'a' EOF 'c' | 'b' ;

I think such code should be disallowed. EOF might be placed only at the end position of the rule. EOF_CLOSURE error should be extended for covering such cases.

@ericvergnaud
Copy link
Contributor

True, but then you'd also want to disable combination of rules that have an EOF not at the end. Tricky imho...

@KvanTTT
Copy link
Member

KvanTTT commented Sep 1, 2022

are in one alt and missing in another;

Why do you think it should be disallowed? It's quite real code.

Yes, it looks like it also doesn't make sense. If it's encountered, the grammar should be rewritten to prevent potential weird behaviour and errors.

or allow symbols to follow an EOF.

Agree with that. It's nonsense since the part after EOF is unreachable.

True, but then you'd also want to disable combination of rules that have an EOF not at the end. Tricky imho...

At least not ending EOF can be disallowed within one rule, it's easy. Detecting not ending EOF within the entire grammar is also possible. To implement it, containsEOF flag should be considered for every rule. By the way, I've implemented the similar warnings but for beginning position: PREDICATE_AT_THE_BEGINNING_OF_LEXER_RULE_DEGRADES_PERFORMANCE.

I suggest introducing the new errors together with warnings from #3626, they are processed in the same way.

@kaby76
Copy link
Contributor Author

kaby76 commented Sep 1, 2022

#3763 -- we could just take it out of the hands of users. :)

@parrt
Copy link
Member

parrt commented Sep 1, 2022

Hi guys...any change to the secret sauce parsing algorithm would require me to fully understand it again. Not sure I am willing to dig deep enough. Changing the grammar semantics for EOF is less risky but would likely break backward compatibility. Also I've just never had a problem putting EOF on end of a start rule or leaving it off so I can parse partial phrases.

@parrt
Copy link
Member

parrt commented Sep 1, 2022

I'd rather we focus on larger issues than grammar warnings that I might find too risky to merge. Maybe help with antlr4-lab?

@kaby76
Copy link
Contributor Author

kaby76 commented Sep 1, 2022

Some good news.

  • There are no rules with EOF followed by another element in grammars-v4.
  • There are a few grammars where an EOF occurs on one alt but not in another. I'll take a look at these.
./cool/COOL.g4 has an EOF in one alt, but not in another.
./csharp/Antlr4cs/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./csharp/CSharp/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./csharp/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./golang/Go/GoParser.g4 has an EOF in one alt, but not in another.
./golang/GoParser.g4 has an EOF in one alt, but not in another.
./hypertalk/HyperTalk.g4 has an EOF in one alt, but not in another.
./java/java/JavaParser.g4 has an EOF in one alt, but not in another.
./javadoc/JavadocParser.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/CSharp/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/CSharpSharwell/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/Go/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/JavaScript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/Python/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/ecmascript/TypeScript/ECMAScript.g4 has an EOF in one alt, but not in another.
./javascript/javascript/JavaScriptParser.g4 has an EOF in one alt, but not in another.
./javascript/jsx/JavaScriptParser.g4 has an EOF in one alt, but not in another.
./javascript/typescript/TypeScriptParser.g4 has an EOF in one alt, but not in another.
./kirikiri-tjs/TJSParser.g4 has an EOF in one alt, but not in another.
./kotlin/kotlin-formal/KotlinParser.g4 has an EOF in one alt, but not in another.
./objc/two-step-processing/ObjectiveCPreprocessorParser.g4 has an EOF in one alt, but not in another.
./save.generated/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./save.generated/Test/CSharpPreprocessorParser.g4 has an EOF in one alt, but not in another.
./smtlibv2/SMTLIBv2.g4 has an EOF in one alt, but not in another.
./sql/tsql/TSqlParser.g4 has an EOF in one alt, but not in another.
./v/V.g4 has an EOF in one alt, but not in another.
./wat/WatParser.g4 has an EOF in one alt, but not in another.
  • Both of these conditions are really easy to find with Trash. Here's a Bash script:
#!/usr/bin/bash
for i in `find . -name '*.g4' | grep -v Generated | grep -v examples | grep -v Lexer`
do
 count=`trparse $i -t antlr4 2> /dev/null \
  | trxgrep ' //parserRuleSpec//alternative/element[.//TOKEN_REF/text()="EOF"]/following-sibling::element' \
  | trtext -c`
 if [ "$count" != "0" ]
 then
  echo $i has an EOF usage followed by another element.
 fi
 count=`trparse $i -t antlr4 2> /dev/null \
  | trxgrep ' //labeledAlt[.//TOKEN_REF/text()="EOF" and count(../labeledAlt) > 1]' \
  | trtext -c`
 if [ "$count" != "0" ]
 then
  echo $i has an EOF in one alt, but not in another.
 fi
done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants