ATN cannot be deserialized in PHP-runtime when using example XML-language #4075
Comments
In the meantime I found the HTML parser, which of course is better than XML for parsing actual HTML, but it has the same problem.
FYI: Unicode currently supports roughly 150k characters, and each Unicode character has a unique value that fits in 32 bits. UTF-16 stores the first 65,536 of them in 16 bits by simply removing the leading 0s (characters beyond that range take two 16-bit units). This is what Windows uses. UTF-8 is an encoding of the full charset, where each character takes from 1 to 4 bytes (the original design allowed up to 6). This is what Java uses for string constants.
On 15 Jan 2023 at 09:47, Martin Molema wrote:
Hello, I found an issue with the PHP-target language using the supplied XML-example language. All info below.
Origin of the error seems to be the Unicode-characters in the language: (https://github.com/antlr/grammars-v4/blob/master/xml/XMLLexer.g4)
fragment
NameChar : NameStartChar
| '-' | '_' | '.' | DIGIT
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment
NameStartChar
: [:a-zA-Z]
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
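As a rough way to see which characters the `NameStartChar` fragment above accepts, the same ranges can be written as a PCRE character class. This is only a sketch: `isNameStartChar()` is a hypothetical helper, not something generated by ANTLR or shipped with the PHP runtime.

```php
<?php
// Hypothetical helper mirroring the NameStartChar fragment of XMLLexer.g4.
// The \A and \z anchors make sure exactly one character is tested,
// and the u modifier makes PCRE treat pattern and subject as UTF-8.
function isNameStartChar(string $char): bool
{
    $pattern = '/\A[:a-zA-Z'
        . '\x{2070}-\x{218F}'
        . '\x{2C00}-\x{2FEF}'
        . '\x{3001}-\x{D7FF}'
        . '\x{F900}-\x{FDCF}'
        . '\x{FDF0}-\x{FFFD}'
        . ']\z/u';

    return \preg_match($pattern, $char) === 1;
}

var_dump(isNameStartChar('a'));         // ASCII letter
var_dump(isNameStartChar("\u{3042}"));  // U+3042, inside '\u3001'..'\uD7FF'
var_dump(isNameStartChar("\u{00B7}"));  // U+00B7 is only listed under NameChar
```

All ranges in the fragment are ordinary code points that PHP itself can encode and match as UTF-8.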
ANTLR4 runtime using antlr-4.9.3-complete.jar
using ANTLR PHP Runtime version 0.5.0
I am stuck in a vendor lock-in with a Laravel/Lumen version that will not upgrade to PHP 8, so I am using PHP 7.4.
The error occurs in ATNDeserializer.php, line 175:
$characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);
This call returns false. The behaviour is described under the u modifier: https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
Effect
ATN cannot be deserialized and this yields the error in a completely different part of the code because there is no ATN data.
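The silent `false` from `preg_split` can be turned into an error at the point where it happens. The following is a minimal sketch, not the runtime's actual code: `splitUtf8Chars()` is a hypothetical wrapper, and the choice of `UnexpectedValueException` is an assumption made for illustration.

```php
<?php
// Hypothetical wrapper around the preg_split call quoted above.
// preg_split() returns false on failure; with the u modifier an invalid
// UTF-8 subject is one such failure (PREG_BAD_UTF8_ERROR).
function splitUtf8Chars(string $data): array
{
    $characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);

    if ($characters === false) {
        $reason = \preg_last_error() === \PREG_BAD_UTF8_ERROR
            ? 'subject is not valid UTF-8'
            : 'PCRE error code ' . \preg_last_error();

        throw new \UnexpectedValueException('Cannot split string into characters: ' . $reason);
    }

    return $characters;
}

// Valid UTF-8 is split into one array element per character.
var_dump(splitUtf8Chars("a\u{00B7}b"));

// A stray 0xFF byte is never valid in UTF-8, so this throws.
try {
    splitUtf8Chars("a\xFFb");
} catch (\UnexpectedValueException $e) {
    echo $e->getMessage(), "\n";
}
```

With a check like this the failure would surface where the data is read, rather than later in code that assumes the ATN was loaded.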
composer.json:
{ "require": { "antlr/antlr4-php-runtime": "0.5.0" } }
My Test.php:
`
Test';
$stream = InputStream::fromString($expression);
$lexer = new XMLLexer($stream);
$tokens = new CommonTokenStream($lexer);
$parser = new XMLParser($tokens);
$tree = $parser->document();
`
**Solution**
The simplest way is to remove the Unicode characters from the example, but that would be too simple. These characters probably represent valid characters. Instead, a proper warning or catchable exception with an indication of this problem would have saved me **a lot of** time.
The PHP manual says: "Five and six octet UTF-8 sequences are regarded as invalid." I can't quite understand what that means, but maybe there's a hint of a solution there.
In the meantime I removed these characters as I am only parsing HTML generated by CKEditor. Testing in progress....
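On the manual's remark about five and six octet sequences: current UTF-8 encodes every code point in at most four bytes, and the longer forms come from an older layout that is no longer valid. The sketch below shows the byte lengths of a few code points and how the `u` modifier reacts to a five-byte sequence. The sample bytes are chosen for illustration only; whether the serialized ATN from 4.9.3 contains such a sequence is not established in this thread.

```php
<?php
// Byte length of some code points when encoded as UTF-8.
// strlen() counts bytes, not characters.
$samples = [
    'U+0041'  => 'A',
    'U+00B7'  => "\u{00B7}",
    'U+2070'  => "\u{2070}",
    'U+FFFD'  => "\u{FFFD}",
    'U+1F600' => "\u{1F600}",
];

foreach ($samples as $label => $char) {
    printf("%s: %d byte(s)\n", $label, strlen($char));
}
// U+0041: 1 byte(s)
// U+00B7: 2 byte(s)
// U+2070: 3 byte(s)
// U+FFFD: 3 byte(s)
// U+1F600: 4 byte(s)

// A five-octet sequence in the obsolete layout (lead byte 0xF8).
$fiveOctets = "\xF8\x88\x80\x80\x80";

// Under the u modifier the subject must be valid UTF-8, so matching fails.
$result = preg_match('//u', $fiveOctets);
var_dump($result);                                     // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);   // bool(true)
```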
Can you try the latest ANTLR and runtime? It may indirectly address your issue by changing the ATN serialization format.
The XML grammar doesn't work directly with the ANTLR 4.11.1 PHP target on PHP 8. There seems to be a symbol conflict with "XMLParser", I guess in the PHP 8 runtime. But if the grammar is renamed in the pom.xml (XML => MyXML), along with the grammar files (XMLLexer.g4 => MyXMLLexer.g4, XMLParser.g4 => MyXMLParser.g4, with a few changes within the grammars), the generated parser appears to work. Yes, PHP 8 is required for the ANTLR 4.11.1 PHP target.
I think the Antlr PHP runtime uses XMLParser |
I need to set up a VM or something similar to use PHP 8 and then test this.
OK, I have updated my local dev machine to use PHP 8.1 and updated the project to use PHP Antlr 0.8.0. Current package-lock.json below.
I renamed XMLParser.g4 and XMLLexer.g4 to MyXMLParser.g4 and MyXMLLexer.g4 and updated my project includes. After that I generated the files using the bash file below:
PARSER=MyXMLParser
LEXER=MyXMLLexer
BASE_DIR=/mnt/ssd/Develop/crisisgame/newlangdef
ANTLR_DIR=${BASE_DIR}/ANTLR4
cd ${BASE_DIR}
OUTPUT_DIR=${BASE_DIR}/parser
NAMESPACE=parser
JAR_FILE=/usr/local/lib/antlr-4.11.1-complete.jar
LANGUAGE=PHP
# Clean Output Directory
rm $OUTPUT_DIR/${PARSER}*
rm $OUTPUT_DIR/${LEXER}*
export CLASSPATH=".:$JAR_FILE:$CLASSPATH"
java -jar $JAR_FILE -Dlanguage=$LANGUAGE -no-visitor -no-listener -package $NAMESPACE -o $OUTPUT_DIR -Xexact-output-dir ${ANTLR_DIR}/${LEXER}.g4
cp $OUTPUT_DIR/${LEXER}.tokens $ANTLR_DIR
java -jar $JAR_FILE -Dlanguage=$LANGUAGE -visitor -no-listener -package $NAMESPACE -o $OUTPUT_DIR -Xexact-output-dir ${ANTLR_DIR}/${PARSER}.g4
This way the lexer and parser run without problems. Is there any chance this new ATN serialisation can be incorporated in older versions so I can benefit from it using PHP 7.4?
It's very unlikely, but you can fork a PHP 7 compatible ANTLR runtime and backport the serialization bits. It's not a big deal.
OK, so I have been able to move from Laravel/Lumen 7 to Laravel/Lumen 9 in the meantime. After reading the upgrade guides, it seems I use very little that is impacted. So now my project is on PHP 8.2 and I can take full advantage of the new ANTLR4 stuff. The reason I avoided this upgrade is that some time ago I started a new Laravel project in version 9 and found it very different. I never thought to investigate the upgrade because these are often complex processes. Looking at the timestamps, the basic upgrade took me 35 minutes. Of course intensive testing still needs to be done, but the first steps look promising. I will close this ticket for now. In the future I will revisit the HTML parsing. For now I have reverted to the original language without the HTML bit in it because of performance issues. Perhaps these upgrades of Laravel/Lumen, ANTLR4 (to 4.11.1) and PHP 8 will bring significant performance improvements?
Note, I've been making significant modifications to the PHP runtime, but I'm still working out why the parser is still so slow in certain situations (e.g., when there's lots of ambiguity in the grammar). If and when these changes get merged, a 25% speed-up should be seen. antlr/antlr-php-runtime#34 https://github.com/antlr/antlr-php-runtime/issues antlr/antlr-php-runtime#36 |
Great! Looking forward to this version @kaby76 |