ATN cannot be deserialized in PHP-runtime when using example XML-language #4075

Open
martinmolema opened this issue Jan 15, 2023 · 10 comments
@martinmolema

martinmolema commented Jan 15, 2023

Hello, I found an issue with the PHP-target language using the supplied XML-example language. All info below.

The error seems to originate from the Unicode characters in the grammar (https://github.com/antlr/grammars-v4/blob/master/xml/XMLLexer.g4):

fragment
NameChar    :   NameStartChar
            |   '-' | '_' | '.' | DIGIT
            |   '\u00B7'
            |   '\u0300'..'\u036F'
            |   '\u203F'..'\u2040'
            ;

fragment
NameStartChar
            :   [:a-zA-Z]
            |   '\u2070'..'\u218F'
            |   '\u2C00'..'\u2FEF'
            |   '\u3001'..'\uD7FF'
            |   '\uF900'..'\uFDCF'
            |   '\uFDF0'..'\uFFFD'
            ;
  • Parser generated with antlr-4.9.3-complete.jar
  • ANTLR PHP runtime version 0.5.0
  • I am stuck in a vendor lock-in with a Laravel/Lumen version that will not upgrade to PHP 8, so I am using PHP 7.4.

The error occurs in ATNDeserializer.php, line 175:
$characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);
This call returns false. The behaviour is described under the u modifier at https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Effect
The ATN cannot be deserialized, and because no ATN data is available the error surfaces in a completely different part of the code.

composer.json:
{ "require": { "antlr/antlr4-php-runtime": "0.5.0" } }

My Test.php:

<?php

require_once 'vendor/autoload.php';
require_once './parser/XMLParserVisitor.php';
require_once './parser/XMLParserBaseVisitor.php';
require_once './parser/XMLLexer.php';
require_once './parser/XMLParser.php';

use Antlr\Antlr4\Runtime\CommonTokenStream;
use Antlr\Antlr4\Runtime\InputStream;

use parser\XMLLexer;
use parser\XMLParser;

$expression = '<html><p>Test</p></html>';

$stream = InputStream::fromString($expression);
$lexer  = new XMLLexer($stream);
$tokens = new CommonTokenStream($lexer);
$parser = new XMLParser($tokens);

$tree = $parser->document();

Solution
The simplest way is to remove the Unicode characters from the example, but that would be too simple: these characters probably represent valid input. Instead, a proper warning or catchable exception pointing at this problem would have saved me a lot of time.
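A minimal sketch of what such a catchable exception could look like. The helper name splitUtf8Chars and the exception message are hypothetical and not part of the ANTLR PHP runtime; the sketch only wraps the same preg_split call and uses preg_last_error(), which is available on PHP 7.4.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the ANTLR runtime): splits a string into
 * UTF-8 characters and throws instead of silently returning false.
 *
 * @return string[]
 */
function splitUtf8Chars(string $data): array
{
    $characters = \preg_split('//u', $data, -1, \PREG_SPLIT_NO_EMPTY);

    if ($characters === false) {
        // With the u modifier, an invalid UTF-8 subject sets PREG_BAD_UTF8_ERROR.
        $reason = \preg_last_error() === \PREG_BAD_UTF8_ERROR
            ? 'input is not valid UTF-8'
            : 'PCRE error code ' . \preg_last_error();

        throw new \InvalidArgumentException('Cannot split serialized data: ' . $reason);
    }

    return $characters;
}

// Usage: "\xC3\x28" is a deliberately broken two-octet sequence.
try {
    splitUtf8Chars("\xC3\x28");
} catch (\InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Failing at the split keeps the error next to its cause instead of letting a false value travel further into the deserializer.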

The PHP-manual says: "Five and six octet UTF-8 sequences are regarded as invalid. ". I can't quite understand what that means but maybe there's a hint of a solution there.
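To illustrate what the manual means by invalid sequences, here is a small sketch. The helper name isValidUtf8ForPcre and the sample byte strings are my own, chosen for illustration; they are not taken from the generated lexer, and this does not establish what exactly is wrong with the serialized ATN.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: true when PCRE accepts $bytes as UTF-8 under the u modifier.
 */
function isValidUtf8ForPcre(string $bytes): bool
{
    // An empty pattern matches any valid subject; invalid UTF-8 makes preg_match return false.
    return \preg_match('//u', $bytes) === 1;
}

$samples = [
    'plain ASCII'         => 'abc',
    'U+00B7 (two octets)' => "\xC2\xB7",
    'encoded surrogate'   => "\xED\xA0\x80",         // U+D800 is not a legal scalar value
    'five-octet sequence' => "\xF8\x88\x80\x80\x80", // allowed in early UTF-8 drafts, not in RFC 3629
];

foreach ($samples as $label => $bytes) {
    \printf("%-20s %s\n", $label, isValidUtf8ForPcre($bytes) ? 'valid' : 'invalid');
}
```

Any string for which this check fails will also make preg_split('//u', ...) return false.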

In the meantime I removed these characters as I am only parsing HTML generated by CKEditor. Testing in progress....

@martinmolema changed the title from "ATN cannot be deserialized in PHP-runtime when using wrong example XML-language" to "ATN cannot be deserialized in PHP-runtime when using example XML-language" on Jan 15, 2023
@martinmolema
Author

martinmolema commented Jan 15, 2023

In the meantime I found the HTML grammar, which of course is better than XML for parsing actual HTML, but it has the same problem.
After I got things working, it turned out to be horribly slow. A normal document can take more than 30 seconds to parse, which will not work in an HTTP call because of time-outs.
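For anyone who wants to put a number on the slowness, a minimal timing sketch follows. The helper timeCall is hypothetical, and the closure is only a stand-in; in a real script it would wrap the call to $parser->document(). It uses hrtime(), which exists since PHP 7.3.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: runs $task once and returns [result, elapsed seconds].
 */
function timeCall(callable $task): array
{
    $start = \hrtime(true);                      // monotonic clock, in nanoseconds
    $result = $task();
    $elapsed = (\hrtime(true) - $start) / 1e9;   // convert to seconds

    return [$result, $elapsed];
}

// Stand-in workload; replace the closure body with the actual parse call.
[$value, $seconds] = timeCall(static function (): int {
    return \array_sum(\range(1, 1000));
});

\printf("stand-in returned %d in %.6f s\n", $value, $seconds);
```

Timing the lexer and the parser separately would show which of the two dominates.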

@ericvergnaud
Contributor

ericvergnaud commented Jan 15, 2023 via email

@ericvergnaud
Contributor

ericvergnaud commented Jan 15, 2023 via email

@kaby76
Contributor

kaby76 commented Jan 16, 2023

The XML grammar doesn't work directly with the ANTLR 4.11.1 PHP target on PHP 8. There seems to be a symbol conflict with "XMLParser", I guess in the PHP 8 runtime.

But, if in the pom.xml the grammar is renamed (XML => MyXML), along with the grammar files (XMLLexer.g4 => MyXMLLexer.g4, XMLParser.g4 => MyXMLParser.g4, with a few changes within the grammars), the generated parser appears to work. Yes, PHP 8 is required for Antlr4.11.1 PHP.

01/15-19:40:38 ~/issues/issue-2988/grammars-v4/xml
$ trgen -t PHP
C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/
CSharp  XMLLexer.g4 success 0.0587564
CSharp  XMLParser.g4 success 0.0069608
Rendering template file from PHP/build.ps1 to Generated-PHP/build.ps1
Rendering template file from PHP/build.sh to Generated-PHP/build.sh
Rendering template file from PHP/clean.ps1 to Generated-PHP/clean.ps1
Rendering template file from PHP/clean.sh to Generated-PHP/clean.sh
Rendering template file from PHP/composer.json to Generated-PHP/composer.json
Rendering template file from PHP/makefile to Generated-PHP/makefile
Rendering template file from PHP/Test.php to Generated-PHP/Test.php
Rendering template file from PHP/test.ps1 to Generated-PHP/test.ps1
Rendering template file from PHP/test.sh to Generated-PHP/test.sh
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/XMLParser.g4 to Generated-PHP/XMLParser.g4
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/XMLLexer.g4 to Generated-PHP/XMLLexer.g4
01/15-19:40:45 ~/issues/issue-2988/grammars-v4/xml
$ cd Generated-PHP/
01/15-19:40:47 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make
bash build.sh
No composer.lock file present. Updating dependencies to latest instead of installing from lock file. See https://getcomposer.org/install for more information.
Loading composer repositories with package information
Info from https://repo.packagist.org: #StandWithUkraine
Updating dependencies
Lock file operations: 3 installs, 0 updates, 0 removals
  - Locking antlr/antlr4-php-runtime (0.8.0)
  - Locking phpunit/php-timer (5.0.3)
  - Locking psr/log (3.0.0)
Writing lock file
Installing dependencies from lock file (including require-dev)
Package operations: 3 installs, 0 updates, 0 removals
  - Installing antlr/antlr4-php-runtime (0.8.0): Extracting archive
  - Installing phpunit/php-timer (5.0.3): Extracting archive
  - Installing psr/log (3.0.0): Extracting archive
Generating autoload files
1 package you are using is looking for funding.
Use the `composer fund` command to find out more!
01/15-19:40:52 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make test
bash test.sh
PHP Fatal error:  Cannot declare class XMLParser, because the name is already in use in C:\msys64\home\Kenne\issues\issue-2988\grammars-v4\xml\Generated-PHP\XMLParser.php on line 24
PHP Stack trace:
PHP   1. {main}() C:\msys64\home\Kenne\issues\issue-2988\grammars-v4\xml\Generated-PHP\Test.php:0
PHP   2. require_once() C:\msys64\home\Kenne\issues\issue-2988\grammars-v4\xml\Generated-PHP\Test.php:6
Test failed.
mingw32-make: *** [makefile:10: test] Error 1
01/15-19:40:58 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ ls
build.ps1  clean.sh       makefile  test.sh      XMLLexer.interp  XMLParser.g4      XMLParser.tokens
build.sh   composer.json  Test.php  vendor/      XMLLexer.php     XMLParser.interp  XMLParserBaseListener.php
clean.ps1  composer.lock  test.ps1  XMLLexer.g4  XMLLexer.tokens  XMLParser.php     XMLParserListener.php
01/15-19:41:08 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ trgen -t PHP
C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/
CSharp  MyXMLLexer.g4 success 0.0483575
CSharp  MyXMLParser.g4 success 0.0066586
Rendering template file from PHP/build.ps1 to Generated-PHP/build.ps1
Rendering template file from PHP/build.sh to Generated-PHP/build.sh
Rendering template file from PHP/clean.ps1 to Generated-PHP/clean.ps1
Rendering template file from PHP/clean.sh to Generated-PHP/clean.sh
Rendering template file from PHP/composer.json to Generated-PHP/composer.json
Rendering template file from PHP/makefile to Generated-PHP/makefile
Rendering template file from PHP/Test.php to Generated-PHP/Test.php
Rendering template file from PHP/test.ps1 to Generated-PHP/test.ps1
Rendering template file from PHP/test.sh to Generated-PHP/test.sh
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/MyXMLParser.g4 to Generated-PHP/MyXMLParser.g4
Copying source file from C:/msys64/home/Kenne/issues/issue-2988/grammars-v4/xml/MyXMLLexer.g4 to Generated-PHP/MyXMLLexer.g4
01/15-19:48:31 ~/issues/issue-2988/grammars-v4/xml
$ cd Generated-PHP/
01/15-19:48:34 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ ls
build.ps1  build.sh  clean.ps1  clean.sh  composer.json  makefile  MyXMLLexer.g4  MyXMLParser.g4  Test.php  test.ps1  test.sh
01/15-19:48:35 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make
bash build.sh
No composer.lock file present. Updating dependencies to latest instead of installing from lock file. See https://getcomposer.org/install for more information.
Loading composer repositories with package information
Info from https://repo.packagist.org: #StandWithUkraine
Updating dependencies
Lock file operations: 3 installs, 0 updates, 0 removals
  - Locking antlr/antlr4-php-runtime (0.8.0)
  - Locking phpunit/php-timer (5.0.3)
  - Locking psr/log (3.0.0)
Writing lock file
Installing dependencies from lock file (including require-dev)
Package operations: 3 installs, 0 updates, 0 removals
  - Installing antlr/antlr4-php-runtime (0.8.0): Extracting archive
  - Installing phpunit/php-timer (5.0.3): Extracting archive
  - Installing psr/log (3.0.0): Extracting archive
Generating autoload files
1 package you are using is looking for funding.
Use the `composer fund` command to find out more!
01/15-19:48:39 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP
$ make test
bash test.sh
PHP 0 ../examples/books.xml success 0.4387135
PHP 1 ../examples/web.xml success 0.0351489
Total Time: 1.1430899
dos2unix: converting file ../examples/books.xml.errors to Unix format...
dos2unix: converting file ../examples/books.xml.tree to Unix format...
dos2unix: converting file ../examples/web.xml.errors to Unix format...
dos2unix: converting file ../examples/web.xml.tree to Unix format...
Test succeeded.
01/15-19:48:48 ~/issues/issue-2988/grammars-v4/xml/Generated-PHP

I think the Antlr PHP runtime uses XMLParser
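One possible explanation, stated as an assumption rather than something confirmed in this thread: since PHP 8.0 the xml extension itself declares a global class named XMLParser (the object returned by xml_parser_create()), so a generated XMLParser class without a namespace would collide with it. The sketch below uses a hypothetical namespace called parser, as the `-package` option of the ANTLR tool would produce, and an empty stand-in class instead of a real generated parser.

```php
<?php
declare(strict_types=1);

namespace parser; // hypothetical namespace, e.g. from `-package parser`

// Stand-in for a generated parser class; a real one would extend the ANTLR runtime's Parser.
final class XMLParser
{
}

// Class-name strings are always fully qualified, so this asks about the global \XMLParser.
$globalTaken = \class_exists('XMLParser', false);

\printf("global XMLParser already declared: %s\n", $globalTaken ? 'yes' : 'no');
\printf("generated class lives at: %s\n", XMLParser::class);
```

Under that assumption, either renaming the grammar (as done above) or generating into a namespace should avoid the fatal error.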

@martinmolema
Author

I need to set up a VM or something similar to use PHP 8 and then test this.

@martinmolema
Author

Ok, I have updated my local dev machine to use PHP 8.1 and updated the project to use the ANTLR PHP runtime 0.8.0. Current composer.lock below.

{
    "_readme": [
        "This file locks the dependencies of your project to a known state",
        "Read more about it at https://getcomposer.org/doc/01-basic-usage.md#installing-dependencies",
        "This file is @generated automatically"
    ],
    "content-hash": "cd9b019ee661e13d2c5e0c0fdd2f17d4",
    "packages": [
        {
            "name": "antlr/antlr4-php-runtime",
            "version": "0.8.0",
            "source": {
                "type": "git",
                "url": "https://github.com/antlr/antlr-php-runtime.git",
                "reference": "7de4181629faaa4f0b9399610689cd8338c52e2c"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/antlr/antlr-php-runtime/zipball/7de4181629faaa4f0b9399610689cd8338c52e2c",
                "reference": "7de4181629faaa4f0b9399610689cd8338c52e2c",
                "shasum": ""
            },
            "require": {
                "ext-mbstring": "*",
                "php": "^8.0"
            },
            "require-dev": {
                "ergebnis/composer-normalize": "^2.15",
                "phpstan/extension-installer": "^1.0",
                "phpstan/phpstan": "^1.4",
                "phpstan/phpstan-deprecation-rules": "^1.0",
                "phpstan/phpstan-strict-rules": "^1.1",
                "slevomat/coding-standard": "^7.0",
                "squizlabs/php_codesniffer": "^3.6"
            },
            "type": "library",
            "extra": {
                "branch-alias": {
                    "dev-master": "0.2-dev"
                }
            },
            "autoload": {
                "psr-4": {
                    "Antlr\\Antlr4\\Runtime\\": "src/"
                }
            },
            "notification-url": "https://packagist.org/downloads/",
            "license": [
                "BSD-3-Clause"
            ],
            "description": "PHP 8.0+ runtime for ANTLR 4",
            "keywords": [
                "antlr4",
                "php",
                "runtime"
            ],
            "support": {
                "issues": "https://github.com/antlr/antlr-php-runtime/issues",
                "source": "https://github.com/antlr/antlr-php-runtime/tree/0.8.0"
            },
            "time": "2022-09-04T21:10:52+00:00"
        }
    ],
    "packages-dev": [],
    "aliases": [],
    "minimum-stability": "stable",
    "stability-flags": [],
    "prefer-stable": false,
    "prefer-lowest": false,
    "platform": [],
    "platform-dev": [],
    "plugin-api-version": "2.3.0"
}

I renamed XMLParser.g4 and XMLLexer.g4 to MyXMLParser.g4 and MyXMLLexer.g4 and updated my project includes. After that I generated the files (see the bash file below) using antlr-4.11.1-complete.jar.

PARSER=MyXMLParser
LEXER=MyXMLLexer

BASE_DIR=/mnt/ssd/Develop/crisisgame/newlangdef
ANTLR_DIR=${BASE_DIR}/ANTLR4

cd ${BASE_DIR}

OUTPUT_DIR=${BASE_DIR}/parser
NAMESPACE=parser
JAR_FILE=/usr/local/lib/antlr-4.11.1-complete.jar
LANGUAGE=PHP

# Clean output directory (-f: do not fail when nothing has been generated yet)
rm -f "$OUTPUT_DIR/${PARSER}"*
rm -f "$OUTPUT_DIR/${LEXER}"*

export CLASSPATH=".:$JAR_FILE:$CLASSPATH"
# Generate the lexer first
java -jar $JAR_FILE -Dlanguage=$LANGUAGE -no-visitor -no-listener -package $NAMESPACE -o $OUTPUT_DIR  -Xexact-output-dir ${ANTLR_DIR}/${LEXER}.g4
# Copy the lexer's .tokens file next to the grammars before generating the parser
cp $OUTPUT_DIR/${LEXER}.tokens $ANTLR_DIR
java -jar $JAR_FILE -Dlanguage=$LANGUAGE -visitor    -no-listener -package $NAMESPACE -o $OUTPUT_DIR  -Xexact-output-dir ${ANTLR_DIR}/${PARSER}.g4

This way the lexer and parser will run without problems. Any chance this new ATN serialisation can be incorporated in older versions so I can benefit from it using PHP 7.4?

@ericvergnaud
Contributor

It's very unlikely, but you can fork a PHP 7-compatible ANTLR runtime and backport the serialization bits; it's not a big deal.

@martinmolema
Author

Ok, so I have been able to move from Laravel/Lumen 7 to Laravel/Lumen 9 in the meantime. After reading the upgrade guides it seems I use very little that is impacted. So now my project is on PHP8.2 and I can take full advantage of the new ANTLR4 stuff.

The reason I avoided this upgrade is that some time ago I started a new Laravel project in version 9 and found it very different. I never thought to investigate the upgrade because these are often complex processes. Looking at the timestamps, the basic upgrade took me 35 minutes. Of course intensive testing still needs to be done, but the first steps look promising.

I will close this ticket for now. In the future I will revisit the HTML parsing. For now I have reverted to the original language without the HTML bit in it because of performance issues. Perhaps these upgrades of Laravel/Lumen, ANTLR4 (to 4.11.1) and PHP8 will bring significant performance improvements?

@kaby76
Contributor

kaby76 commented Jan 20, 2023

Perhaps these upgrades of Laravel/Lumen, ANTLR4 (to 4.11.1) and PHP8 will bring significant performance improvements?

Note, I've been making significant modifications to the PHP runtime, but I'm still working out why the parser is still so slow in certain situations (e.g., when there's lots of ambiguity in the grammar). If and when these changes get merged, a 25% speed-up should be seen. See antlr/antlr-php-runtime#34 and antlr/antlr-php-runtime#36 (https://github.com/antlr/antlr-php-runtime/issues).

@martinmolema
Author

Great! Looking forward to this version @kaby76
