Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexer.getCharIndex() return value not behaving as expected #3606

Closed
4 tasks done
tianshuang opened this issue Mar 29, 2022 · 45 comments
Closed
4 tasks done

Lexer.getCharIndex() return value not behaving as expected #3606

tianshuang opened this issue Mar 29, 2022 · 45 comments

Comments

@tianshuang
Copy link
Contributor

tianshuang commented Mar 29, 2022

  • I have reproduced my issue using the latest version of ANTLR
  • I have asked at stackoverflow
  • Responses from the above seem to indicate that my issue could be an ANTLR bug
  • I have done a search of the existing issues to make sure I'm not sending in a duplicate

Language: Java
ANTLR Version: 4.9.3

parser grammar TestParser;

options { tokenVocab=TestLexer; }

root
    : LINE+ EOF
    ;
lexer grammar TestLexer;

@lexer::members {
    private int startIndex = 0;

    private void updateStartIndex() {
        startIndex = getCharIndex();
    }

    private void printNumber() {
        String number = _input.getText(Interval.of(startIndex, getCharIndex() - 1));
        System.out.println(number);
    }
}

LINE:                          {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} DOT .+? NEWLINE;
OTHER:                         . -> skip;

fragment NUMBER:               [0-9]+;
fragment ANSWER:               '( ' [A-D] ' )';
fragment SPACE:                ' ';
fragment NEWLINE:              '\n';
fragment DOT:                  '.';

Execute the following code:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;

public class TestParseTest {

    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( B ) 12. hahaha\n"+
                "( B ) 123. hahaha\n");
        Lexer lexer = new TestLexer(charStream);

        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();

        System.out.println(parseTree.toStringTree(parser));
    }

}

The output is as follows:

12
12
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)

Expected output:

12
123
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)
@kaby76
Copy link
Contributor

kaby76 commented Mar 29, 2022

I've set up a repo here that follows the standardized "grammars-v4" format, and which shows that if an empty action "{}" is placed within the +-closure in the LINE rule, getCharIndex()/CharIndex returns a value that is expected.

For whatever reason, the diff between the offset saved after SPACE and the offset fetched before DOT is constant, formed after the first LINE token found without the empty action in the +-closure. regardless whether there are any other tokens found in the input. After the first LINE found, the offset is constant.

If one wants to duplicate this, I suggest removing the "{}" empty action in the Java code or in the CSharp code, then use trgen to quickly create an application that can be debugged in seconds. I suggest examples/test5.in.

So, for the input:

( B ) 12345. hahaha
( B ) 2234555888. hahaha
( B ) 335. hahaha
( B ) 44321. hahaha
( B ) 51. hahaha
( B ) 62222222. hahaha

you see as output:

12345
22345
335.
44321
51. h
62222

not

12345
2234555888
335
44321
51
62222222

The problem exists in CSharp or Java, and I've duplicated the problem in older versions of Antlr4.

This is a problem because it reduces faith that actions actually work.

@kaby76
Copy link
Contributor

kaby76 commented Mar 30, 2022

I've traced this down to the main lexer automaton, ExecATN(). In that code, there are a few lines to check if there was a transition from a DFAState "s" and symbol "t" previously computed. If so, the old value is used. The problem is that the old state contains a list of ATNConfig that records the actions that need to be executed. Those actions contain offsets that need to be recomputed. The easiest fix is to just recompute the DFAState from scratch. That will create a new list of ATNConfig with the correct offsets.

The following code fixes the problem.

bool IsCandidate(DFAState target)
{
    var es = target?.configSet?.Elements;
    foreach (var e in es)
    {
        if (e != null && e is LexerATNConfig l)
        {
            var lae = l.getLexerActionExecutor();
            if (lae == null) return false;
            foreach (var x in lae.LexerActions)
            {
                if (x is Atn.LexerIndexedCustomAction) return true;
            }
        }
    }
    return false;
}

ExecATN() is now modified thus:

            // As we move src->trg, src->trg, we keep track of the previous trg to
            // avoid looking up the DFA state again, which is expensive.
            // If the previous target was already part of the DFA, we might
            // be able to avoid doing a reach operation upon t. If s!=null,
            // it means that semantic predicates didn't prevent us from
            // creating a DFA state. Once we know s!=null, we check to see if
            // the DFA state has an edge already for t. If so, we can just reuse
            // it's configuration set; there's no point in re-computing it.
            // This is kind of like doing DFA simulation within the ATN
            // simulation because DFA simulation is really just a way to avoid
            // computing reach/closure sets. Technically, once we know that
            // we have a previously added DFA state, we could jump over to
            // the DFA simulator. But, that would mean popping back and forth
            // a lot and making things more complicated algorithmically.
            // This optimization makes a lot of sense for loops within DFA.
            // A character will take us back to an existing DFA state
            // that already has lots of edges out of it. e.g., .* in comments.
            DFAState target = GetExistingTargetState(s, t);
            if (target == null || IsCandidate(target))
            {
                target = ComputeTargetState(input, s, t);
            }

With the call to IsCandidate(), the state is recomputed if the state contains ATNConfig that are index-sensitive actions. With the change, the actions for the problem are performed perfectly.

@kaby76
Copy link
Contributor

kaby76 commented Mar 30, 2022

@parrt This seems to be a problem that should be addressed at some point. Also, I worry about how actions work where someone is relying on an action to run prior to executing a semantic predicate in a lexer rule. It looks like it may not work as expected because actions are queued up whereas semantic predicate actions are executed on the fly.

@parrt
Copy link
Member

parrt commented Mar 30, 2022

wow! holy crap. thanks for all the analysis @kaby76 and @tianshuang !! Seems like easiest fix is to disable DFA caching in the presence of the actions. Would slow that rule down a bit but...

@parrt
Copy link
Member

parrt commented Apr 2, 2022

Hi. It appears to have been my intention, or at least a result of how I handle actually decided to match lexer rules. please see my comment here: #3611. I am updating the documentation to be more explicit in this case.

parrt added a commit to parrt/antlr4 that referenced this issue Apr 2, 2022
…tions and semantic predicates; Fixes antlr#3611. Fixes antlr#3606.

Signed-off-by: Terence Parr <[email protected]>
@kaby76
Copy link
Contributor

kaby76 commented Apr 4, 2022

In the meanwhile, I will add a check to Trash to warn when there is a mix of plain actions and semantic predicates in lexer rules. People do not read documentation. If they want to do that, they should rewrite actions into code that returns true and performs a side-effect.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 4, 2022

At least I suggest introducting a warning to the ANTLR tool about actions in parsing decisions. Something like actions should be placed at the end of the rule since they are always executed after rule recognition?

Stuff : ( 'a'+ {initA(); } | 'b') 'c' 'd' { count == 3 }? ;

The current behaviour of actions and predicates is not intuitive indeed.

@kaby76
Copy link
Contributor

kaby76 commented Apr 4, 2022

Indeed, a check should be implemented--it will save on a lot of headaches. The check should test mixing actions and predicates across all uses of the lexer symbols--not just within one lexer rule--as trying to do so still breaks (I tested).

@KvanTTT
Copy link
Member

KvanTTT commented Apr 4, 2022

Does it affect only lexer rules? And should predicates be presented for warning? If I understand correctly, actions are always executed after rule recognition. For example:

A: { action() } 'a'; // warning
B: 'b' { action() }; // no warning
C: 'c' { action() } { predicate() }?; // warning
D: 'd' { predicate() }? { action() }; // no warning

@kaby76
Copy link
Contributor

kaby76 commented Apr 4, 2022

Right, it only affects lexer rules. There's no queue data structure on the parser side, only on the lexer side. And, I've already tested parser rules--you can mix actions and predicates.

@kaby76
Copy link
Contributor

kaby76 commented Apr 4, 2022

I'll check to see if placing the action in a predicate fixes this getCharIndex() problem. If so, then we don't need to change ExecATN().

@KvanTTT
Copy link
Member

KvanTTT commented Apr 4, 2022

I suspect ACTION_SHOULD_BE_PLACED_TO_THE_END_OF_LEXER_RULE should be thrown in the following samples:

lexer grammar ACTION_SHOULD_BE_PLACED_TO_THE_END_OF_LEXER_RULE;
@members {String s = "";

// Action
void a(String x) {
    s += x;
}

// Predicate
boolean p() {
    return !s.equals("");
}}

// No warnings
OR: 'or1' {a("or1");} | 'or2' {a("or2");};
AND: 'and1' 'and2' {a("and1" + "and2");};
PREDICATE: 'predicate' {p()}? {a("predicate");};
ACTIONS: 'actions' {a("action1");} {a("action2");};
RECURSIVE: '/*' (RECURSIVE | .)*? '*/' {a("RECURSIVE");};
USE_FRAGMENT: 'f1_' FRAGMENT; // No warning since fragment action at the end and FRAGMENT at the end

// Warnings
AND_W: 'and_w1' {a("and_w1" + "and_w2");} 'and_w2';
CLOSURE_W: ('closure_w' {a("closure_w");})+;
PREDICATE_W: 'predicate_w' {a("predicate_w");} {p()}?;
OPTIONAL_W: 'optional_' {a("optional_w");} 'w'?;
USE_FRAGMENT_W: 'f2_' FRAGMENT '_f2'; // Warning because FRAGMENT action is anyway performed after `USE_FRAGMENT` recognition
USE_REF: 'r_' REF '_r';
REF: 'ref' {a("ref");};

fragment FRAGMENT: 'fragment' {a("fragment");};

WS: [ \r\n]+;

@kaby76
Copy link
Contributor

kaby76 commented Apr 4, 2022

Yes, that could work, too--forcing all actions to occur after all predicates. It corresponds to the order that ExecATN() would execute them (first predicates, then actions). But, I would think we'd have to make sure one doesn't interleave the actions and predicates across rules, e.g., A : B { a(); }; B : 'b' { b() }?;. In other words, someone might try to be cleaver and think that would work--it can't because ExecATN() works across all rules until a token is found, even if the sub-rules are non-fragment lexer rules.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 4, 2022

Yes, you are right. I've updated my sample with recursive (action is performed once) and fragment rule.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 4, 2022

Also, actually ACTION_SHOULD_BE_PLACED_TO_THE_END_OF_LEXER_RULE can be error instead of warning to get rid of potential misunderstanding.

@parrt
Copy link
Member

parrt commented Apr 4, 2022

Hi guys. We can't be randomly changing the behavior of the lexers. :) too much history. I agree with @kaby76 however that we really need a warning message. the issue is the interaction of predicates and actions not the fact that actions are in the middle of a rule. We definitely want those actions executed at different places, and we can't randomly decided to cut off that functionality. predicates in the lexer should really be about what column you are in or what the current text is. anything else is dangerous.

please be advised that even parsing actions and semantic predicates can be misleading. remember that the predicates are generally evaluated to make parsing decisions, which is outside the parsing process. Only those side effects from actions that have executed before a parsing decision is initiated can be tested by the predicate. I think something like this would not operate as you would expect:

a : {x=1;} {x==1?} A   //Ambiguous so we must consider the predicates during prediction
   | {x=2;} {x==2}? A  // however actions are not executed because we're not sure which alternative to execute
   ;

In summary, actions are allowed anywhere inside the lexer rule. semantic predicates are tested AFTER all viable tokens have been discovered and then the semantic predicates decide what to do. actions occur only after a token has been discovered, which is a function of semantic predicates. So, warning when someone uses predicates and actions, regardless of the order, might be the right idea because of looping and so on.

I think I'm remembering this correctly after eight years haha.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 4, 2022

I've check for parser. It works as expected:

grammar T;

@parser::members {boolean a = false; boolean b = false;}

x: y y end;
y: ({a = true;} {a}? A | {b = true;} {b}? B);
end: 'end' {System.out.println(a); System.out.println(b);};
A: 'a';
B: 'b';
END: 'end';

WS: [ \r\n]+ -> channel(HIDDEN);

It prints true true for input b a end.

If I change y to y: ({a = true;} {a}? A | {b = true;} {b}? A); it always choose the first alternative thus b is always false.

But there is another issue with parser semantic predicates. Parser on the following grammar:

grammar T;

x: y | z;
y: A {false}?;
z: A;
A: 'a';

is unable to recogize just simple a. It fails with line 1:1 rule y failed predicate: {false}?. But it works fine if I change the order of semantic predicate:

grammar T;

x: y | z;
y: {false}? A;
z: A;
A: 'a';

It is also quite unexpected behavior.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 4, 2022

But, I would think we'd have to make sure one doesn't interleave the actions and predicates across rules, e.g., A : B { a(); }; B : 'b' { b() }?;

Could you clarify what's unexpected in this sample? It looks like action is executed after predicate as it's described. I have another interesting sample:

lexer grammar L;
@members {String s = "";}
A: '(' B ')';
B: 'b' {s += "b";};
END: 'end' {System.out.println(s);};
WS: [ \r\n]+;

In this case B action is not executed at all. By the way we have FRAGMENT_ACTION_IGNORED warning for fragment rules but it's also actual for rules that have refs.

@kaby76
Copy link
Contributor

kaby76 commented Apr 4, 2022

Sorry, I meant A : { a(); } B; B : 'b' { b() }?;. Here the writer thought a() would be executed first, then b(), because that's how it would be executed in parser rules, but it's not that way with lexer rules. My point is that a check/error/warning message must look across lexer rules, not only within a single rule. That's all.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 5, 2022

A : { a(); } B; B : 'b' { b() }?;

In this case my first assumption works fine without subrules analysing: action should be placed to the end of lexer rule. I think it matters even without predicates since code within action may use internal lexer state (current position).

@kaby76
Copy link
Contributor

kaby76 commented Apr 5, 2022

Sorry. For some reason, I thought when debugging the lexer ExecATN() that actions were queued up from sub-rules. They're not. You are right, they're not even executed. Yes, a warning would be good here.

I'm not sure it's really necessary to force (or just warn??) users to change from this A : { a(); } 'a' { b(); } 'b' ; to this A : 'a' 'b' { a(); } { b(); }; or this A : 'a' 'b' { a(); b(); };. It only needs to happen if there's a semantic predicate somewhere in that rule, right?

@KvanTTT
Copy link
Member

KvanTTT commented Apr 5, 2022

It only needs to happen if there's a semantic predicate somewhere in that rule, right?

I don't think so. Consider the following grammar:

lexer grammar L;
A: {System.out.print(getCharPositionInLine());} 'A' {System.out.print(getCharPositionInLine());};

It prints 11 on input A. But it should print 01 if actions are executed in expected order. Actions might rely on lexer state and should be moved to the end regarless of semantic predicates using.

@kaby76
Copy link
Contributor

kaby76 commented Apr 5, 2022

Wonderful. For CSharp, I don't get "11" but "01".

lexer grammar LLexer;
A:
 {System.Console.WriteLine("first " + this.CharIndex);}
 'A'
 {System.Console.WriteLine("second " + this.CharIndex);}
 ;
parser grammar LParser;
options { tokenVocab = LLexer; }
file : A+ EOF;

But, yes, for Java, it's "11".

What a friggin' mess.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 5, 2022

Interesting. Does the mix of semantic predicates and actions work sequentially for C#?

@kaby76
Copy link
Contributor

kaby76 commented Apr 5, 2022

Well, that is for sequential actions only. I haven't tested it with sequential predicates. I use C#. I prefer to not use primitive language and environment Java (although they are starting to finally get with it, but they will never fix the JVM to "reposition the instruction pointer and go!", the main reason I hate Java with a passion). I get the same result with or without the modifications I gave above to LexerATNSimulator.ExecATN(). That's why I've been saying there are two different bugs here in C#. Now we have another.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 5, 2022

Sorry, it looks like everything works fine, getCharIndex should be used instead of getCharPositionInLine for Java and other runtimes:

A: {System.out.print(getCharIndex());} 'A' {System.out.print(getCharIndex());};

It prints 01 as C#. In this way you are right about mixing of actions and predicates.

@kaby76
Copy link
Contributor

kaby76 commented Apr 5, 2022

Phew. That's good. It's hard to remember what a method in one runtime maps to another method in another.

@ericvergnaud
Copy link
Contributor

Maybe it's a good time to unify these ? (getCharPositionInLine is definitely much better than getCharIndex)

@KvanTTT
Copy link
Member

KvanTTT commented Apr 5, 2022

Sorry, I started discussion about action/predicate order in wrong topic. Correct topic is #3611. The current topic is about incorrect working of getCharIndex (or action execution?). Semantic predicates are not required at all for repoducing:

lexer grammar L;

@lexer::members {private int startIndex = 0;}

LINE: {startIndex = getCharIndex();} LETTER {System.out.println(_input.getText(Interval.of(startIndex, getCharIndex() - 1)));} SLASH;
fragment LETTER: [a-z]+;
fragment SLASH: '-';

It prints aabb instead of aabbb.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 5, 2022

Maybe it's a good time to unify these ? (getCharPositionInLine is definitely much better than getCharIndex)

They are different, getCharIndex returns linear index starts from the beginning of file. getCharPositionInLine returns index from the beginning of the current line and it looks like it's being updated after matching of token unlike the first one.

@ericvergnaud
Copy link
Contributor

ericvergnaud commented Apr 5, 2022 via email

@KvanTTT
Copy link
Member

KvanTTT commented Apr 5, 2022

There are similar methods accross runtimes: CharIndex for C#, getCharIndex() for Java etc. But yes, maybe it makes sense to use CharPositionInLine instead of current Column for C#.

@kaby76
Copy link
Contributor

kaby76 commented Apr 6, 2022

I'm not sure how much can be done coordinating the names in the runtime fields/method/properties across targets. The least that can be done is to have a nice table to refer to or an IDE plugin that implements the table.

@parrt
Copy link
Member

parrt commented Apr 7, 2022

hi guys. Trying to dive into this conversation but there's a lot going on. first, let me clear up some spell misconceptions about the interaction of semantic predicates and actions. No semantic predicate is evaluated during prediction that is hidden by an action. further, semantic predicates are not evaluated during prediction if syntax on is sufficient to make decisions. for example,

y: ({a = true;} {a}? A | {b = true;} {b}? B);

No semantic predicates need to be evaluated to distinguish between A and B. Further, they can't be evaluated anyway because they are hidden by actions. any action is assumed to be a side effect that affects the semantic predicates.

The following example always matches the first alternative because these semantic predicates are hidden by the actions:

y: ({a = true;} {a}? A | {b = true;} {b}? A);

Here is a case where semantic predicates are evaluated to resolve a syntactic ambiguity:

y: ({a}? A | {b}? A);

This is not unexpected behavior either:

x: y | z;
y: A {false}?;
z: A;
A: 'a';

y and z are both 'a' which is a syntactic ambiguity for rule x. That means that the first alternative will be given precedence in x. The semantic predicate is not evaluated becauseIt is hidden by the token. See https://github.com/antlr/antlr4/blob/master/doc/predicates.md#finding-visible-predicates

which starts with "The parser will not evaluate predicates during prediction that occur after an action or token reference."

By moving the predicate, it now gets evaluated to resolve the syntactic ambiguity, forcing the second alternative to match.

x: y | z;
y: {false}? A;
z: A;
A: 'a';

@parrt
Copy link
Member

parrt commented Apr 7, 2022

When it comes to the evaluation of actions in the lexer, I can't remember the details but there is definitely only so much context that we can recreate. Remember that the actions are only executed after the entire rule has been matched, but we try to execute them in the right context.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 7, 2022

I implemented correct ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES warning here: #3626 I've checked for possible cases and added tests.

which starts with "The parser will not evaluate predicates during prediction that occur after an action or token reference."

It looks like we need a warning for this case as well. It's also confusing, I've encoutered the problem several times (for example, detect if identifier matches get or set string for JavaScript: https://github.com/antlr/grammars-v4/blob/master/javascript/javascript/JavaScriptParser.g4#L437-L443).

@parrt
Copy link
Member

parrt commented Apr 7, 2022

I'm not sure a warning is warranted there. The rule is clear and i might want the pred evaluated after prediction in that spot.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 7, 2022

The semantic predicate is not evaluated becauseIt is hidden by the token.

But it's evaluated anyway later and failed predicate exception is raised if it returns false: line 1:1 rule y failed predicate: {p()}?

Could you elaborate on an example where pred evaluation after prediction is useful? Moreover, I don't completely understand why FailedPredicateException was introduced and where it's useful. There are no examples in docs as well (only examples with predicates at the beginning of alternative).

@parrt
Copy link
Member

parrt commented Apr 7, 2022

Well, If you provide a predicate it should be evaluated at some point; it can't be ignored. the discovery of semantic predicates is complicated by the fact that it can look through multiple rules and sub rules when making a decision. So it might be hidden in one case and not hidden in another, depending on where that rule is refer to and what the parsing decision is.

So the idea is that if you give a predicate it's like an assert, but it can also be hoisted into the prediction decision to help disambiguate things. I can't think of a great example that isn't better done with a separate semantic phase, but you might do something like:

varRef : ID {isDefined($ID.text)}? ;

@parrt
Copy link
Member

parrt commented Apr 8, 2022

I just read My coauthors implementation of context sensitive lexer actions. it looks like only the index in the character stream is reset before the actions are executed; no other context is saved or restored. So it sounds like getCharIndex() is the only thing we can rely on in an action, other than user-defined code.

I suppose there is value in actions in loops like:

A : ('a' {n++;})+ ;

But, it really does confuse people... honestly I'm not sure if I have enough of this remembered in my head now to make an accurate decision about what to do here.

@parrt parrt closed this as completed in e1455c0 Apr 10, 2022
@KvanTTT
Copy link
Member

KvanTTT commented Apr 11, 2022

I don't understand why it was closed. It should be reopened.

@kaby76
Copy link
Contributor

kaby76 commented Apr 11, 2022

I don't understand why it was closed. It should be reopened.

Github seems to have a "feature" that automatically closes issues when it parses "fixes pound-sign-number" or "fixed pound-sign-number" in a git commit -m "message". Apparently, this is the "much better new way, not dumb old-school way" of software engineering, software development, specifying, designing, hacking--add "features" that we didn't ask for, add "features" we don't want, provide no tracking of the "features" added, add new bugs in the process, and do it "whenever we want".

@KvanTTT
Copy link
Member

KvanTTT commented Apr 11, 2022

Got it. I see the commit

 update doc to be more explicit about the interaction between lexer actions and semantic predicates

But it actually doesn't related to the current issue since the issue is about incorrect working of getCharIndex, it's reproducable even without semantic predicates. The issue should be reopened.

@parrt
Copy link
Member

parrt commented Apr 11, 2022

Sorry about that guys. It got interwined with preds. Can we start a fresh bug, copy over stuff from this one?

@KvanTTT
Copy link
Member

KvanTTT commented Apr 11, 2022

Can we start a fresh bug, copy over stuff from this one?

Sure, it's even better: #3645

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants