Binary parsing #828

Closed
stavytskyi opened this issue Mar 2, 2015 · 32 comments

Comments

@stavytskyi

You have written:

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

I don't see any example. As far as I understand, ANTLR supports binary parsing only at the interface level. I don't clearly understand how I should extend your code to support my binaries or how I should write the grammar. Could you give an example? In its current state ANTLR can work only with character streams.

@parrt
Member

parrt commented Mar 2, 2015

i'm pretty sure the book has an example. just pass in bytes of data not chars. easy.

@parrt closed this as completed Mar 2, 2015
@stavytskyi
Author

Could you please say in which part of the book? I understand that I can implement IntStream, but I don't understand how I should write a grammar for use with IntStream.

@parrt
Member

parrt commented Mar 4, 2015

Can't find it. hmm.. anyway, just use '\u0042' to match byte 0x42 in lexer.
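
One way to read this hint: if each input byte is mapped to the char with the same numeric value, a lexer literal such as '\u0042' lines up with byte 0x42. Below is a minimal sketch of that mapping in plain Java. It assumes ISO-8859-1 decoding is used to get the one-to-one mapping; the class and method names are illustrative and are not part of ANTLR.

```java
import java.nio.charset.StandardCharsets;

class BytesAsChars {
    /** Maps each byte to the char with the same numeric value (0x00..0xFF). */
    static String toCharsOneToOne(byte[] data) {
        // ISO-8859-1 decodes every byte value to the code point of the same number,
        // so no byte is merged, dropped, or replaced during decoding.
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = { 0x42, 0x00, (byte) 0xFF };
        String chars = toCharsOneToOne(data);
        for (int i = 0; i < chars.length(); i++) {
            // Print each char as a code point to show it equals the original byte.
            System.out.printf("U+%04X%n", (int) chars.charAt(i));
        }
    }
}
```

A string produced this way could then be handed to whatever character stream the lexer reads from; decoding with a multi-byte charset such as UTF-8 instead would not preserve the byte values.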

@stavytskyi
Author

Oh, I understand. I will try. Thank you.

@ams-tschoening

I don't see why this is invalid. I came across the same question and am also having a hard time finding any documentation about using antlr4 with binary files. Your hint about using "\u0042" is helpful of course, but it wasn't as obvious as your statement makes it sound. Additionally, the more interesting cases are still not covered: for example, how to deal with the found token in the callbacks, how to deal with the fact that I know I have fields of e.g. 1 byte length but don't know their actual value because that's what I'm interested in, what all this looks like in callbacks regarding data types, etc.

Asking for more documentation about that part of your quoted sentence seems perfectly valid to me.

@parrt
Member

parrt commented Nov 17, 2016

this is a question not an issue with the software. besides, I don't know what this means "for example how to deal with the found token in the callbacks, how to deal with the fact that I know that I have fields of e.g. 1 byte length, but don't know their actual value "

@kkbkris

kkbkris commented Nov 17, 2016

Dear Terence,

Using DOM, and my limited and outdated knowledge of programming in JavaScript, query, not a very good term to use, the .valueOf() attribute or .value of the DOM object. Reference the value of the DOM object. Also measure it.

Yours sincerely,
GitHub User

@ams-tschoening

Hello Terence Parr,
on Thursday, 17 November 2016 at 20:31 you wrote:

besides, I don't know what this means[...]

The homepage claims antlr supports parsing binary data and in most of
such data formats I know of, at least in the case I'm interested in
currently, one knows the defined structure of the data, but not the
actual values. Think of an IPv4 frame, an MPEG video file or such:

You know that some bytes are in there in some special order to tell
you "things", but you actually need to read the bytes themselves and
process them in some special manner to know which is the actual
destination IP of a packet or what's the length of the video in
seconds etc. From reading the documentation, I have no idea how antlr
could provide me that data if I told it the structure of my binary
file.

And that's why some examples how to work with different binary file
formats would be of great help. I have the feeling that simply no one
does this.

Or maybe the homepage has something completely different in mind when
talking of "binary files". Then explaining that term on the homepage
itself might be of help. I think of video files, archive formats like
Zip etc., which all need some kind of parser.

Best regards,

Thorsten Schöning

Thorsten Schöning E-Mail: [email protected]
AM-SoFT IT-Systeme http://www.AM-SoFT.de/

Telefon...........05151- 9468- 55
Fax...............05151- 9468- 88
Mobil..............0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Geschäftsführer: Andreas Muchow

@parrt
Member

parrt commented Nov 17, 2016

Ah. Well, right. parsing by definition means you know the type of thing but not its value like an ID or INT. Let me whip up an example.
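
To make the "known structure, unknown values" point concrete, here is a small hand-written sketch for the IPv4 case mentioned earlier in the thread. It is not ANTLR code and is not the example promised above; it only illustrates that the field position is fixed by the format while the value is read from the data. It assumes a raw IPv4 header of at least 20 bytes, where the destination address occupies bytes 16 to 19; the class and method names are made up for illustration.

```java
class Ipv4Header {
    /** Reads the destination address from a raw IPv4 header (bytes 16..19). */
    static String destinationAddress(byte[] header) {
        if (header.length < 20) {
            throw new IllegalArgumentException("an IPv4 header needs at least 20 bytes");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 16; i < 20; i++) {
            if (i > 16) {
                sb.append('.');
            }
            // Mask with 0xFF because Java bytes are signed.
            sb.append(header[i] & 0xFF);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] header = new byte[20];
        header[0] = 0x45;            // version 4, header length 5 words
        header[16] = (byte) 192;     // destination address bytes
        header[17] = (byte) 168;
        header[18] = 0;
        header[19] = 1;
        System.out.println(destinationAddress(header));
    }
}
```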

@parrt
Member

parrt commented Nov 17, 2016

@kkbkris i have no idea what you are saying.

@jimidle
Collaborator

jimidle commented Nov 17, 2016

I think here the issue is that the OP wants to ask for the 'next byte' at certain points. I think that such things really require another level of abstraction. Otherwise you need a Parser driven lexer. 
Jim

@parrt
Member

parrt commented Nov 18, 2016

@jimidle the getExpectedTokens() or whatever it is should give them the expected bytes no problem.

@parrt
Member

parrt commented Nov 18, 2016

"Fixed" by a6a7304 adding this documentation https://github.com/antlr/antlr4/blob/master/doc/parsing-binary-files.md. @tschoening does this answer your question? As you can see, there is no difference between parsing characters and binary from a recognition point of view. It's all about how you feed things to the lexer.

@ams-tschoening

Hello Terence Parr,
on Friday, 18 November 2016 at 18:10 you wrote:

"Fixed" by adding this documentation
https://github.com/antlr/antlr4/blob/master/doc/parsing-binary-files.md.
@tschoening does this answer your question?

Yes, thanks. This makes it much clearer for me to understand how
things are supposed to work.

Best regards,

Thorsten Schöning

@wbuchanan

Don't mean to beat a dead horse, but @parrt is there a way that ANTLR can handle byte swapping in the case where the file might have been written on a little endian machine? I'm interested in trying to build a parser with ANTLR to handle file formats that include a combination of ASCII text and binary data. In this case I would think storing the data from the <header>...</header> tags as a map object would be useful since it defines the number of columns/rows in the data structure, but not sure how cases like this where the endianness might differ from what the JVM is expecting would need to be handled. I started trying to write some stuff to parse these files manually in the past, but think it might be more efficient - and certainly more consistent - to use ANTLR to build the parser.

@parrt
Member

parrt commented Jan 11, 2017

@wbuchanan well, the minute you say "order", you imply language so it's up to the grammar author to create two rules, one for 4 byte little end and one for 4 byte big end let's say. ANTLR knows nothing of byte order, machine architecture etc...
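
Whichever grammar rule matches the four bytes, the value still has to be assembled in application code. A minimal sketch of that step using the standard `java.nio.ByteBuffer` and `ByteOrder` classes follows; the class and method names are hypothetical, and how the four bytes are collected from the parse is left open.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Int32Decoder {
    /** Interprets exactly four bytes as a signed 32-bit integer in the given byte order. */
    static int decode(byte[] four, ByteOrder order) {
        if (four.length != 4) {
            throw new IllegalArgumentException("expected exactly 4 bytes");
        }
        return ByteBuffer.wrap(four).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = { 0x01, 0x02, 0x03, 0x04 };
        // The same bytes give different values depending on the declared order.
        System.out.printf("big endian:    0x%08X%n", decode(raw, ByteOrder.BIG_ENDIAN));
        System.out.printf("little endian: 0x%08X%n", decode(raw, ByteOrder.LITTLE_ENDIAN));
    }
}
```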

@wbuchanan

@parrt There's a value in the header of the file spec that indicates the order of the bytes, but I'm not sure how something like that would be handled via ANTLR. There is a new/unorthodox rule for some of the larger string data as well that involves 6-byte integers. I'll try digging into the book a bit more in the meantime. Thanks again.

@parrt
Member

parrt commented Jan 11, 2017

@wbuchanan antlr is not designed to understand file formats. it is designed to let YOU specify file formats. :)

@wbuchanan

@parrt I understand, just don't have experience in this particular area.

@jimidle
Collaborator

jimidle commented Jan 12, 2017 via email

@wbuchanan

@jimidle I'm not sure how that would work with some of the data. For example, there are 6-byte integers that are used to reference binary objects which may/may not be string data (they could be long strings or binary values). The endianness of the file is indicated pretty early in the header so I'd assume there's a way I could pass the value so I could byte swap as needed. There are also stopping rules for the string data that depend on the location of binary zeros which could become a bit more challenging to handle.
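
A 6-byte integer has no matching Java primitive, so one option is to assemble it into a `long` by hand once the byte order from the header is known. The sketch below assumes the value is unsigned and that the header's byte-order indicator has already been turned into a boolean; the names are hypothetical and nothing here is specific to ANTLR.

```java
class SixByteInt {
    /**
     * Decodes an unsigned 6-byte integer starting at offset.
     * littleEndian would come from the byte-order field in the file header.
     */
    static long decode(byte[] buf, int offset, boolean littleEndian) {
        if (offset < 0 || offset + 6 > buf.length) {
            throw new IllegalArgumentException("need 6 bytes at offset " + offset);
        }
        long value = 0;
        for (int i = 0; i < 6; i++) {
            // Always consume the most significant byte first:
            // that is the last byte for little endian, the first for big endian.
            int idx = littleEndian ? offset + 5 - i : offset + i;
            value = (value << 8) | (buf[idx] & 0xFFL);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] raw = { 0x01, 0x00, 0x00, 0x00, 0x00, 0x00 };
        System.out.println(decode(raw, 0, true));
        System.out.println(decode(raw, 0, false));
    }
}
```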

@hcoona

hcoona commented Dec 12, 2017

I think a binary parser is quite different from a text parser. ANTLR is designed for parsing text rather than binary data formats. There is room for ANTLR to improve here.

For example, a very common structure in binary formats is a count prefix followed by that many repeated blocks:

+----------+---------------------------+---------------------------+------------------------+
|   0x08   |     <a structure>         |         <a structure>     |   repeat 0x08 times    |
+----------+---------------------------+---------------------------+------------------------+

Another basic structure in binary format is a piece of compressed data.

ANTLR cannot deal with these situations well.

Reference: https://pdos.csail.mit.edu/papers/nail:osdi14.pdf
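
For reference, the count-prefixed layout in the diagram is straightforward to read with hand-written code, which is part of why it feels awkward in a grammar. A minimal sketch, assuming a one-byte unsigned count and a fixed record size (`RECORD_SIZE` is a made-up value for illustration; real formats define their own structure layout):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class CountPrefixedReader {
    // Hypothetical fixed size of each repeated structure, in bytes.
    static final int RECORD_SIZE = 2;

    /** Reads one count byte, then that many fixed-size records. */
    static List<byte[]> readRecords(byte[] input) {
        ByteBuffer buf = ByteBuffer.wrap(input);
        int count = buf.get() & 0xFF; // unsigned one-byte count prefix
        List<byte[]> records = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            byte[] record = new byte[RECORD_SIZE];
            buf.get(record); // throws BufferUnderflowException if the input is truncated
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] input = { 0x02, 0x0A, 0x0B, 0x0C, 0x0D };
        System.out.println(readRecords(input).size() + " records read");
    }
}
```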

@wbuchanan

@hcoona
The use case I have in mind involves a file with a combination of XML style tags and binary data (see https://www.stata.com/help.cgi?dta for examples). In this case I need to parse the plain text to identify the tags in the file, but the contents between the tags will appear in binary formatted data. Depending on the version of the file format/software sometimes the types for these binary data elements will change so there also would need to be a way to handle that as well.

@hcoona

hcoona commented Dec 12, 2017

@wbuchanan
Text data can be seen as a restricted subset of binary data, but not vice versa. This is the problem.

@hcoona

hcoona commented Dec 12, 2017

@wbuchanan
A quick and dirty solution is to hand off to a custom parser whenever you see the XML begin tag, and return to ANTLR after your parser has processed the binary data.

@parrt
Member

parrt commented Dec 12, 2017

@hcoona use a semantic predicate to handle prefix values.

@hcoona

hcoona commented Dec 20, 2017

Surely there could be a workaround for it. I mention it here only to say that ANTLR's support for binary formats could be improved.

@Andrew-Fryer

@hcoona use a semantic predicate to handle prefix values.

@parrt
I don't think that semantic predicates can handle prefix values in non-trivial binary grammars.
Could you please let me know if I'm missing something here?

I've created an MWE. There is a choice between 2 parser rules, a and b, that both begin and end with a prefix length followed by that number of bytes.
The rules are distinguished by 2 possible values of a middle byte, which both appear in the input stream.
It seems that ANTLR can't use semantic predicates during prediction, so it arbitrarily picks the first rule and then fails if the second rule should have been chosen, because it can't backtrack.

The following error is produced on the input: 0x01 0x00 0x01 0x00

line 1:2 extraneous input '1' expecting BYTE_A
line 1:4 missing {BYTE_A, BYTE_B, LENGTH_BYTE} at '<EOF>'
Exception in thread "main" java.lang.NumberFormatException: For input string: ""
        at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
        at java.base/java.lang.Integer.parseInt(Integer.java:678)
        at java.base/java.lang.Integer.parseInt(Integer.java:784)
        at MWEParser.byte_(MWEParser.java:386)
        at MWEParser.string(MWEParser.java:275)
        at MWEParser.a(MWEParser.java:184)
        at MWEParser.mwe(MWEParser.java:125)
        at MWE.main(MWE.java:15)

(Parsing succeeds on the same input if I replace (a | b) with just b in the grammar.)

Here is my grammar:

grammar MWE;

mwe:
  (a | b)
  EOF
  { System.out.println("Success!"); }
  ;
a:
  string
  BYTE_A
  string
  ;

b:
  string
  BYTE_B
  string
  ;

string:
  length=byte
  data=blob[$length.val]
  ;

blob [int n]
locals [int i = 0]
    : ( {$i < $n}? byte {$i++;} ) * {$i == $n}?
    ;

byte returns [byte val]
  : data=allTerminals
  {
$val = (byte) Integer.parseInt($data.text);
  };

allTerminals
  : BYTE_A
  | BYTE_B
  | BYTE
  ;

BYTE_A: '\u0000'..'\u0000';
BYTE_B: '\u0001'..'\u0001';
BYTE: '\u0000'..'\u00ff';

(Note that this grammar works if it is refactored so that the string rule is moved from the beginning of a and b to the beginning of mwe.)

And here is my test code:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;


class MWE {
    public static void main(String args[]) throws IOException {
        String inputFile = args[0];
        ANTLRFileStream bytesAsChar = new BinaryANTLRFileStream(inputFile);
        MWELexer lexer = new MWELexer(bytesAsChar);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MWEParser parser = new MWEParser(tokens);
        ParseTree tree = parser.mwe();
    }
}

BinaryANTLRFileStream is taken from https://github.com/antlr/antlr4/blob/master/doc/parsing-binary-files.md
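
For comparison, the format this MWE describes can be recognized by a hand-written parser that chooses between a and b only after the first length-prefixed string has been consumed, which is the decision the generated parser cannot make during prediction. The sketch below is an assumption-laden illustration, not a fix for the grammar: it treats the length as a raw unsigned byte value, uses 0x00 for BYTE_A and 0x01 for BYTE_B as in the lexer rules, and all class and method names are invented.

```java
import java.util.Arrays;

class MweHandParser {
    private final byte[] in;
    private int pos = 0;

    MweHandParser(byte[] in) {
        this.in = in;
    }

    private byte next() {
        if (pos >= in.length) {
            throw new IllegalStateException("unexpected end of input");
        }
        return in[pos++];
    }

    /** string: one length byte followed by that many data bytes. */
    private byte[] string() {
        int n = next() & 0xFF;
        if (pos + n > in.length) {
            throw new IllegalStateException("truncated string at offset " + pos);
        }
        byte[] data = Arrays.copyOfRange(in, pos, pos + n);
        pos += n;
        return data;
    }

    /** mwe: string (BYTE_A | BYTE_B) string EOF. Returns the name of the matched rule. */
    String parse() {
        string();
        byte tag = next(); // the choice is made here, after the prefix has been read
        String rule;
        if (tag == 0x00) {
            rule = "a";
        } else if (tag == 0x01) {
            rule = "b";
        } else {
            throw new IllegalStateException("expected 0x00 or 0x01 at offset " + (pos - 1));
        }
        string();
        if (pos != in.length) {
            throw new IllegalStateException("trailing bytes at offset " + pos);
        }
        return rule;
    }

    static String recognize(byte[] input) {
        return new MweHandParser(input).parse();
    }

    public static void main(String[] args) {
        // The input from the report above: 0x01 0x00 0x01 0x00
        System.out.println(recognize(new byte[] { 0x01, 0x00, 0x01, 0x00 }));
    }
}
```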

@KvanTTT
Member

KvanTTT commented Sep 16, 2022

Honestly, in my opinion, ANTLR is not the best tool for binary parsing. In binary parsing, predicates are frequently needed to distinguish data types after a prefix. In text parsing, on the other hand, there are almost no such special prefixes. ANTLR is good for text parsing.

I recommend trying Kaitai Struct.

@parrt
Member

parrt commented Sep 16, 2022

Yeah, antlr can do binary parsing, but the minute it starts to be context sensitive, it can be ugly. In particular, things like blob [int n] cause problems when you then have a predicate based upon that parameter. Passing the argument is an action, and antlr does not execute actions during lookahead operations, so if it has to see through a rule invoked with an argument, it won't see that argument.

@KvanTTT
Member

KvanTTT commented Sep 16, 2022

@parrt it looks like the warnings from #3626 should be implemented in ANTLR ASAP since users encounter the same problem again and again.

In the following code:

blob [int n]
locals [int i = 0]
    : ( {$i < $n}? byte {$i++;} ) * {$i == $n}?
    ;

The warning ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES will be thrown.

@parrt
Member

parrt commented Sep 16, 2022

Well, most users don't encounter it. I've seen two cases recently.
