-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lexer.getCharIndex() return value not behaving as expected #3606
Comments
I've set up a repo here that follows the standardized "grammars-v4" format, and which shows that if an empty action "{}" is placed within the +-closure in the LINE rule, getCharIndex()/CharIndex returns a value that is expected. For whatever reason, the diff between the offset saved after SPACE and the offset fetched before DOT is constant, formed after the first LINE token found without the empty action in the +-closure. regardless whether there are any other tokens found in the input. After the first LINE found, the offset is constant. If one wants to duplicate this, I suggest removing the "{}" empty action in the Java code or in the CSharp code, then use trgen to quickly create an application that can be debugged in seconds. I suggest examples/test5.in. So, for the input:
you see as output:
not
The problem exists in CSharp or Java, and I've duplicated the problem in older versions of Antlr4. This is a problem because it reduces faith that actions actually work. |
I've traced this down to the main lexer automaton, ExecATN(). In that code, there are a few lines to check if there was a transition from a DFAState "s" and symbol "t" previously computed. If so, the old value is used. The problem is that the old state contains a list of ATNConfig that records the actions that need to be executed. Those actions contain offsets that need to be recomputed. The easiest fix is to just recompute the DFAState from scratch. That will create a new list of ATNConfig with the correct offsets. The following code fixes the problem.
ExecATN() is now modified thus:
With the call to IsCandidate(), the state is recomputed if the state contains ATNConfig that are index-sensitive actions. With the change, the actions for the problem are performed perfectly. |
@parrt This seems to be a problem that should be addressed at some point. Also, I worry about how actions work where someone is relying on an action to run prior to executing a semantic predicate in a lexer rule. It looks like it may not work as expected because actions are queued up whereas semantic predicate actions are executed on the fly. |
wow! holy crap. thanks for all the analysis @kaby76 and @tianshuang !! Seems like easiest fix is to disable DFA caching in the presence of the actions. Would slow that rule down a bit but... |
Hi. It appears to have been my intention, or at least a result of how I handle actually decided to match lexer rules. please see my comment here: #3611. I am updating the documentation to be more explicit in this case. |
…tions and semantic predicates; Fixes antlr#3611. Fixes antlr#3606. Signed-off-by: Terence Parr <[email protected]>
In the meanwhile, I will add a check to Trash to warn when there is a mix of plain actions and semantic predicates in lexer rules. People do not read documentation. If they want to do that, they should rewrite actions into code that returns true and performs a side-effect. |
At least I suggest introducting a warning to the ANTLR tool about actions in parsing decisions. Something like
The current behaviour of actions and predicates is not intuitive indeed. |
Indeed, a check should be implemented--it will save on a lot of headaches. The check should test mixing actions and predicates across all uses of the lexer symbols--not just within one lexer rule--as trying to do so still breaks (I tested). |
Does it affect only lexer rules? And should predicates be presented for warning? If I understand correctly, actions are always executed after rule recognition. For example: A: { action() } 'a'; // warning
B: 'b' { action() }; // no warning
C: 'c' { action() } { predicate() }?; // warning
D: 'd' { predicate() }? { action() }; // no warning |
Right, it only affects lexer rules. There's no queue data structure on the parser side, only on the lexer side. And, I've already tested parser rules--you can mix actions and predicates. |
I'll check to see if placing the action in a predicate fixes this getCharIndex() problem. If so, then we don't need to change ExecATN(). |
I suspect lexer grammar ACTION_SHOULD_BE_PLACED_TO_THE_END_OF_LEXER_RULE;
@members {String s = "";
// Action
void a(String x) {
s += x;
}
// Predicate
boolean p() {
return !s.equals("");
}}
// No warnings
OR: 'or1' {a("or1");} | 'or2' {a("or2");};
AND: 'and1' 'and2' {a("and1" + "and2");};
PREDICATE: 'predicate' {p()}? {a("predicate");};
ACTIONS: 'actions' {a("action1");} {a("action2");};
RECURSIVE: '/*' (RECURSIVE | .)*? '*/' {a("RECURSIVE");};
USE_FRAGMENT: 'f1_' FRAGMENT; // No warning since fragment action at the end and FRAGMENT at the end
// Warnings
AND_W: 'and_w1' {a("and_w1" + "and_w2");} 'and_w2';
CLOSURE_W: ('closure_w' {a("closure_w");})+;
PREDICATE_W: 'predicate_w' {a("predicate_w");} {p()}?;
OPTIONAL_W: 'optional_' {a("optional_w");} 'w'?;
USE_FRAGMENT_W: 'f2_' FRAGMENT '_f2'; // Warning because FRAGMENT action is anyway performed after `USE_FRAGMENT` recognition
USE_REF: 'r_' REF '_r';
REF: 'ref' {a("ref");};
fragment FRAGMENT: 'fragment' {a("fragment");};
WS: [ \r\n]+; |
Yes, that could work, too--forcing all actions to occur after all predicates. It corresponds to the order that ExecATN() would execute them (first predicates, then actions). But, I would think we'd have to make sure one doesn't interleave the actions and predicates across rules, e.g., |
Yes, you are right. I've updated my sample with recursive (action is performed once) and fragment rule. |
Also, actually |
Hi guys. We can't be randomly changing the behavior of the lexers. :) too much history. I agree with @kaby76 however that we really need a warning message. the issue is the interaction of predicates and actions not the fact that actions are in the middle of a rule. We definitely want those actions executed at different places, and we can't randomly decided to cut off that functionality. predicates in the lexer should really be about what column you are in or what the current text is. anything else is dangerous. please be advised that even parsing actions and semantic predicates can be misleading. remember that the predicates are generally evaluated to make parsing decisions, which is outside the parsing process. Only those side effects from actions that have executed before a parsing decision is initiated can be tested by the predicate. I think something like this would not operate as you would expect:
In summary, actions are allowed anywhere inside the lexer rule. semantic predicates are tested AFTER all viable tokens have been discovered and then the semantic predicates decide what to do. actions occur only after a token has been discovered, which is a function of semantic predicates. So, warning when someone uses predicates and actions, regardless of the order, might be the right idea because of looping and so on. I think I'm remembering this correctly after eight years haha. |
I've check for parser. It works as expected: grammar T;
@parser::members {boolean a = false; boolean b = false;}
x: y y end;
y: ({a = true;} {a}? A | {b = true;} {b}? B);
end: 'end' {System.out.println(a); System.out.println(b);};
A: 'a';
B: 'b';
END: 'end';
WS: [ \r\n]+ -> channel(HIDDEN); It prints If I change But there is another issue with parser semantic predicates. Parser on the following grammar: grammar T;
x: y | z;
y: A {false}?;
z: A;
A: 'a'; is unable to recogize just simple grammar T;
x: y | z;
y: {false}? A;
z: A;
A: 'a'; It is also quite unexpected behavior. |
Could you clarify what's unexpected in this sample? It looks like action is executed after predicate as it's described. I have another interesting sample: lexer grammar L;
@members {String s = "";}
A: '(' B ')';
B: 'b' {s += "b";};
END: 'end' {System.out.println(s);};
WS: [ \r\n]+; In this case |
Sorry, I meant |
A : { a(); } B; B : 'b' { b() }?; In this case my first assumption works fine without subrules analysing: action should be placed to the end of lexer rule. I think it matters even without predicates since code within action may use internal lexer state (current position). |
Sorry. For some reason, I thought when debugging the lexer ExecATN() that actions were queued up from sub-rules. They're not. You are right, they're not even executed. Yes, a warning would be good here. I'm not sure it's really necessary to force (or just warn??) users to change from this |
I don't think so. Consider the following grammar: lexer grammar L;
A: {System.out.print(getCharPositionInLine());} 'A' {System.out.print(getCharPositionInLine());}; It prints |
Wonderful. For CSharp, I don't get "11" but "01".
But, yes, for Java, it's "11". What a friggin' mess. |
Interesting. Does the mix of semantic predicates and actions work sequentially for C#? |
Well, that is for sequential actions only. I haven't tested it with sequential predicates. I use C#. I prefer to not use primitive language and environment Java (although they are starting to finally get with it, but they will never fix the JVM to "reposition the instruction pointer and go!", the main reason I hate Java with a passion). I get the same result with or without the modifications I gave above to LexerATNSimulator.ExecATN(). That's why I've been saying there are two different bugs here in C#. Now we have another. |
Sorry, it looks like everything works fine, A: {System.out.print(getCharIndex());} 'A' {System.out.print(getCharIndex());}; It prints |
Phew. That's good. It's hard to remember what a method in one runtime maps to another method in another. |
Maybe it's a good time to unify these ? (getCharPositionInLine is definitely much better than getCharIndex) |
Sorry, I started discussion about action/predicate order in wrong topic. Correct topic is #3611. The current topic is about incorrect working of lexer grammar L;
@lexer::members {private int startIndex = 0;}
LINE: {startIndex = getCharIndex();} LETTER {System.out.println(_input.getText(Interval.of(startIndex, getCharIndex() - 1)));} SLASH;
fragment LETTER: [a-z]+;
fragment SLASH: '-'; It prints |
They are different, |
I was referring to unify the names across runtimes
… Le 5 avr. 2022 à 22:08, Ivan Kochurkin ***@***.***> a écrit :
Maybe it's a good time to unify these ? (getCharPositionInLine is definitely much better than getCharIndex)
They are different, getCharIndex returns linear index starts from the beginning of file. getCharPositionInLine returns index from the beginning of the current line and it looks like it's being updated after matching of token unlike the first one.
—
Reply to this email directly, view it on GitHub <#3606 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAZNQJBCOHFAO63ZBLOE4BDVDSMTNANCNFSM5R5OEPKQ>.
You are receiving this because you commented.
|
There are similar methods accross runtimes: |
I'm not sure how much can be done coordinating the names in the runtime fields/method/properties across targets. The least that can be done is to have a nice table to refer to or an IDE plugin that implements the table. |
hi guys. Trying to dive into this conversation but there's a lot going on. first, let me clear up some spell misconceptions about the interaction of semantic predicates and actions. No semantic predicate is evaluated during prediction that is hidden by an action. further, semantic predicates are not evaluated during prediction if syntax on is sufficient to make decisions. for example,
No semantic predicates need to be evaluated to distinguish between A and B. Further, they can't be evaluated anyway because they are hidden by actions. any action is assumed to be a side effect that affects the semantic predicates. The following example always matches the first alternative because these semantic predicates are hidden by the actions:
Here is a case where semantic predicates are evaluated to resolve a syntactic ambiguity:
This is not unexpected behavior either:
y and z are both 'a' which is a syntactic ambiguity for rule x. That means that the first alternative will be given precedence in x. The semantic predicate is not evaluated becauseIt is hidden by the token. See https://github.com/antlr/antlr4/blob/master/doc/predicates.md#finding-visible-predicates which starts with "The parser will not evaluate predicates during prediction that occur after an action or token reference." By moving the predicate, it now gets evaluated to resolve the syntactic ambiguity, forcing the second alternative to match.
|
When it comes to the evaluation of actions in the lexer, I can't remember the details but there is definitely only so much context that we can recreate. Remember that the actions are only executed after the entire rule has been matched, but we try to execute them in the right context. |
I implemented correct
It looks like we need a warning for this case as well. It's also confusing, I've encoutered the problem several times (for example, detect if identifier matches |
I'm not sure a warning is warranted there. The rule is clear and i might want the pred evaluated after prediction in that spot. |
But it's evaluated anyway later and failed predicate exception is raised if it returns false: Could you elaborate on an example where pred evaluation after prediction is useful? Moreover, I don't completely understand why |
Well, If you provide a predicate it should be evaluated at some point; it can't be ignored. the discovery of semantic predicates is complicated by the fact that it can look through multiple rules and sub rules when making a decision. So it might be hidden in one case and not hidden in another, depending on where that rule is refer to and what the parsing decision is. So the idea is that if you give a predicate it's like an assert, but it can also be hoisted into the prediction decision to help disambiguate things. I can't think of a great example that isn't better done with a separate semantic phase, but you might do something like:
|
I just read My coauthors implementation of context sensitive lexer actions. it looks like only the index in the character stream is reset before the actions are executed; no other context is saved or restored. So it sounds like getCharIndex() is the only thing we can rely on in an action, other than user-defined code. I suppose there is value in actions in loops like:
But, it really does confuse people... honestly I'm not sure if I have enough of this remembered in my head now to make an accurate decision about what to do here. |
I don't understand why it was closed. It should be reopened. |
Github seems to have a "feature" that automatically closes issues when it parses "fixes pound-sign-number" or "fixed pound-sign-number" in a |
Got it. I see the commit
But it actually doesn't related to the current issue since the issue is about incorrect working of |
Sorry about that guys. It got interwined with preds. Can we start a fresh bug, copy over stuff from this one? |
Sure, it's even better: #3645 |
Language: Java
ANTLR Version: 4.9.3
Execute the following code:
The output is as follows:
Expected output:
The text was updated successfully, but these errors were encountered: