Binary parsing #828

stavytskyi · 2015-03-02T15:50:28Z

You have written:

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

I don't see any example, and as far as I understand, ANTLR supports binary parsing only at the interface level. I don't clearly understand how I should extend your code to support my binaries or how I should write the grammar. Could you give an example? In its current state, ANTLR can work only with character streams.

parrt · 2015-03-02T15:58:59Z

i'm pretty sure the book has an example. just pass in bytes of data not chars. easy.

stavytskyi · 2015-03-04T12:42:18Z

Could you please say in which part of the book? I understand that I can implement IntStream, but I don't understand how I should write a grammar for use with an IntStream.

parrt · 2015-03-04T15:44:54Z

Can't find it. hmm.. anyway, just use '\u0042' to match byte 0x42 in lexer.
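
A minimal sketch of what "pass in bytes, not chars" can look like in Java: each byte is widened to a char in the range 0x00..0xFF, so that a lexer literal such as '\u0042' lines up with the raw byte 0x42. The names BytesAsChars and toChars are hypothetical and only illustrate the mapping; how the resulting chars are handed to the lexer depends on the ANTLR version in use.

```java
final class BytesAsChars {
    // Widen each byte to a char in 0x0000..0x00FF.
    // Masking with 0xFF avoids sign extension for bytes >= 0x80.
    static char[] toChars(byte[] data) {
        char[] out = new char[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (char) (data[i] & 0xFF);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = {0x42, (byte) 0xCA, 0x00};
        char[] chars = toChars(raw);
        System.out.println(chars[0] == '\u0042'); // prints "true"
        System.out.println((int) chars[1]);       // prints "202"
    }
}
```

Without the mask, byte 0xCA would become char 0xFFCA and would no longer match a literal in the 0x00..0xFF range.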

stavytskyi · 2015-03-12T09:47:24Z

Oh, I understand. I will try. Thank you.

ams-tschoening · 2016-11-11T17:30:43Z

I don't see why this is invalid. I came across the same question and am having a hard time finding any documentation about using antlr4 with binary files. Your hint about using "\u0042" is helpful, of course, but it wasn't as obvious as your statement makes it sound. Additionally, the more interesting cases are still not covered: for example, how to deal with the found token in the callbacks; how to deal with the fact that I know I have fields of, e.g., 1 byte length but don't know their actual value, because that's what I'm interested in; what all this looks like in callbacks regarding data types; etc.

Asking for more documentation about that part of your quoted sentence seems perfectly valid to me.

parrt · 2016-11-17T19:31:19Z

this is a question not an issue with the software. besides, I don't know what this means "for example how to deal with the found token in the callbacks, how to deal with the fact that I know that I have fields of e.g. 1 byte length, but don't know their actual value "

kkbkris · 2016-11-17T19:39:55Z

Dear Terence,

Using DOM, and my limited and outdated knowledge of programming in JavaScript, query, not a very good term to use, the .valueOf() attribute or .value of the DOM object. Reference the value of the DOM object. Also measure it.

Yours sincerely,
GitHub User


ams-tschoening · 2016-11-17T20:11:58Z

Hello Terence Parr,
on Thursday, 17 November 2016 at 20:31 you wrote:

besides, I don't know what this means[...]

The homepage claims antlr supports parsing binary data and in most of
such data formats I know of, at least in the case I'm interested in
currently, one knows the defined structure of the data, but not the
actual values. Think of an IPv4 frame, an MPEG video file or such:

You know that some bytes are in there in some special order to tell
you "things", but you actually need to read the bytes themselves and
process them in some special manner to know which is the actual
destination IP of a packet or what's the length of the video in
seconds etc. From reading the documentation, I have no idea how antlr
could provide me that data if I told it the structure of my binary
file.

And that's why some examples of how to work with different binary file
formats would be of great help. I have the feeling that simply no one
does this.

Or maybe the homepage has something completely different in mind when
talking of "binary files". Then explaining that term on the homepage
itself might be of help. I think of video files, archive formats like
Zip etc., which all need some kind of parser.

Kind regards,

Thorsten Schöning

Thorsten Schöning E-Mail: [email protected]
AM-SoFT IT-Systeme http://www.AM-SoFT.de/

Telefon...........05151- 9468- 55
Fax...............05151- 9468- 88
Mobil..............0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Geschäftsführer: Andreas Muchow
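
To make the question above concrete, here is a minimal sketch of recovering a field's value once its structure has been recognized. It assumes the input was fed to the lexer as one char per byte (values 0..255), so that the matched text of a 4-byte field, such as an IPv4 destination address, can be converted back to bytes. The names FieldValues, toBytes and dottedQuad are hypothetical and not part of ANTLR.

```java
final class FieldValues {
    // Convert matched text (one char per byte, values 0..255) back to bytes.
    static byte[] toBytes(String matchedText) {
        byte[] out = new byte[matchedText.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) matchedText.charAt(i);
        }
        return out;
    }

    // Format exactly four bytes as a dotted-decimal IPv4 address.
    static String dottedQuad(byte[] b) {
        if (b.length != 4) {
            throw new IllegalArgumentException("expected 4 bytes, got " + b.length);
        }
        return (b[0] & 0xFF) + "." + (b[1] & 0xFF) + "."
             + (b[2] & 0xFF) + "." + (b[3] & 0xFF);
    }

    public static void main(String[] args) {
        // Stand-in for the text matched by a 4-byte field: C0 A8 01 14.
        String matched = new String(new char[] {0xC0, 0xA8, 0x01, 0x14});
        System.out.println(dottedQuad(toBytes(matched))); // prints "192.168.1.20"
    }
}
```

The grammar only says "four bytes go here"; interpreting those bytes as an address, a length, or a timestamp happens in ordinary code like this.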

parrt · 2016-11-17T21:30:40Z

Ah. Well, right. parsing by definition means you know the type of thing but not its value like an ID or INT. Let me whip up an example.

parrt · 2016-11-17T21:31:18Z

@kkbkris i have no idea what you are saying.

jimidle · 2016-11-17T23:47:46Z

I think here the issue is that the OP wants to ask for the 'next byte' at certain points. I think that such things really require another level of abstraction. Otherwise you need a Parser driven lexer.
Jim

parrt · 2016-11-18T16:25:55Z

@jimidle the getExpectedTokens() or whatever it is should give them the expected bytes no problem.

parrt · 2016-11-18T17:10:21Z

"Fixed" by a6a7304 adding this documentation https://github.com/antlr/antlr4/blob/master/doc/parsing-binary-files.md. @tschoening does this answer your question? As you can see, there is no difference between parsing characters and binary from a recognition point of view. It's all about how you feed things to the lexer.

ams-tschoening · 2016-11-18T18:09:58Z

Hello Terence Parr,
on Friday, 18 November 2016 at 18:10 you wrote:

"Fixed" by adding this documentation
https://github.com/antlr/antlr4/blob/master/doc/parsing-binary-files.md.
@tschoening does this answer your question?

Yes, thanks. This makes it much clearer for me to understand how
things are supposed to work.

Kind regards,

Thorsten Schöning


wbuchanan · 2017-01-11T10:45:00Z

Don't mean to beat a dead horse, but @parrt is there a way that ANTLR can handle byte swapping in the case where the file might have been written on a little endian machine? I'm interested in trying to build a parser with ANTLR to handle file formats that include a combination of ASCII text and binary data. In this case I would think storing the data from the <header>...</header> tags as a map object would be useful since it defines the number of columns/rows in the data structure, but not sure how cases like this where the endianness might differ from what the JVM is expecting would need to be handled. I started trying to write some stuff to parse these files manually in the past, but think it might be more efficient - and certainly more consistent - to use ANTLR to build the parser.

parrt · 2017-01-11T18:11:06Z

@wbuchanan well, the minute you say "order", you imply language, so it's up to the grammar author to create two rules: one for a 4-byte little-endian value and one for a 4-byte big-endian value, let's say. ANTLR knows nothing of byte order, machine architecture, etc.
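
Whichever rule matches, turning the four matched bytes into a number is left to ordinary code. A minimal sketch using the standard java.nio classes follows; the class name Endian and the method readInt32 are hypothetical, chosen only for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Endian {
    // Interpret exactly four bytes as a signed 32-bit integer in the given order.
    static int readInt32(byte[] four, ByteOrder order) {
        if (four.length != 4) {
            throw new IllegalArgumentException("expected 4 bytes, got " + four.length);
        }
        return ByteBuffer.wrap(four).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = {0x01, 0x00, 0x00, 0x00};
        System.out.println(readInt32(raw, ByteOrder.LITTLE_ENDIAN)); // prints "1"
        System.out.println(readInt32(raw, ByteOrder.BIG_ENDIAN));    // prints "16777216"
    }
}
```

The same four bytes yield different values depending on the order, which is why the choice has to come from the grammar author or the file header rather than from the parser generator.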

wbuchanan · 2017-01-11T18:14:46Z

@parrt There's a value in the header of the file spec that indicates the order of the bytes, but I'm not sure how something like that would be handled via ANTLR. There is a new/unorthodox rule for some of the larger string data as well that involves 6-byte integers. I'll try digging into the book a bit more in the meantime. Thanks again.

parrt · 2017-01-11T18:18:33Z

@wbuchanan antlr is not designed to understand file formats. it is designed to let YOU specify file formats. :)

wbuchanan · 2017-01-11T19:23:01Z

@parrt I understand, just don't have experience in this particular area.

jimidle · 2017-01-12T01:32:00Z

Maybe you should/could implement your own input stream that is coded with this knowledge and presents binary sections of either endianness in the same form to the lexer, regardless of what it looks like on disk?


wbuchanan · 2017-01-12T01:38:16Z

@jimidle I'm not sure how that would work with some of the data. For example, there are 6-byte integers that are used to reference binary objects which may/may not be string data (they could be long strings or binary values). The endianness of the file is indicated pretty early in the header so I'd assume there's a way I could pass the value so I could byte swap as needed. There are also stopping rules for the string data that depend on the location of binary zeros which could become a bit more challenging to handle.
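
As a rough sketch of the byte-swapping part only: Java has no built-in 6-byte integer type, so one option is to assemble the value into a long by hand, with the byte order passed in as a flag that would come from the file header. The class SixByteInt and its read method are hypothetical, and how the flag is obtained from the header is not shown.

```java
final class SixByteInt {
    // Read an unsigned 6-byte integer starting at offset.
    // littleEndian would be set from the byte-order marker in the file header.
    static long read(byte[] data, int offset, boolean littleEndian) {
        if (offset < 0 || offset + 6 > data.length) {
            throw new IllegalArgumentException("need 6 bytes at offset " + offset);
        }
        long value = 0;
        for (int i = 0; i < 6; i++) {
            // Always consume the most significant byte first.
            int idx = littleEndian ? offset + 5 - i : offset + i;
            value = (value << 8) | (data[idx] & 0xFFL);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] raw = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06};
        System.out.println(Long.toHexString(read(raw, 0, false))); // prints "10203040506"
        System.out.println(Long.toHexString(read(raw, 0, true)));  // prints "60504030201"
    }
}
```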

hcoona · 2017-12-12T12:37:23Z

I think a binary parser is quite different from a text parser. ANTLR is designed for parsing text rather than binary data formats. There is some room for ANTLR to improve here.

For example, a very common structure in binary formats is a length prefix followed by that many repeated blocks:

+----------+---------------------------+---------------------------+------------------------+
|   0x08   |     <a structure>         |         <a structure>     |   repeat 0x08 times    |
+----------+---------------------------+---------------------------+------------------------+

Another basic structure in binary format is a piece of compressed data.

ANTLR cannot deal with these situations well.

Reference: https://pdos.csail.mit.edu/papers/nail:osdi14.pdf
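
For comparison, the count-prefixed layout above is straightforward to read by hand. The following is a minimal sketch, assuming a one-byte count and fixed-size records; both are assumptions made for illustration, and the names CountPrefixed and readRecords are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CountPrefixed {
    // Read a one-byte count followed by that many records of recordSize bytes.
    static List<byte[]> readRecords(byte[] data, int recordSize) {
        if (data.length == 0) {
            throw new IllegalArgumentException("missing count byte");
        }
        int count = data[0] & 0xFF;
        int needed = 1 + count * recordSize;
        if (data.length < needed) {
            throw new IllegalArgumentException(
                "need " + needed + " bytes, got " + data.length);
        }
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int start = 1 + i * recordSize;
            records.add(Arrays.copyOfRange(data, start, start + recordSize));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] input = {0x02, 10, 11, 20, 21}; // count = 2, two 2-byte records
        for (byte[] record : readRecords(input, 2)) {
            System.out.println(Arrays.toString(record));
        }
    }
}
```

The difficulty for a grammar-based tool is that the number of repetitions is a value read from the input, which a context-free rule cannot express without predicates or actions.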

wbuchanan · 2017-12-12T13:00:55Z

@hcoona
The use case I have in mind involves a file with a combination of XML style tags and binary data (see https://www.stata.com/help.cgi?dta for examples). In this case I need to parse the plain text to identify the tags in the file, but the contents between the tags will appear in binary formatted data. Depending on the version of the file format/software sometimes the types for these binary data elements will change so there also would need to be a way to handle that as well.

hcoona · 2017-12-12T13:27:54Z

@wbuchanan
Text data can be treated as a restricted subset of binary data, but not vice versa. This is the problem.

hcoona · 2017-12-12T13:29:25Z

@wbuchanan
A quick-and-dirty solution is to invoke a custom parser whenever you see the XML begin tag, and return to ANTLR after your parser has processed the binary data.

parrt · 2017-12-12T15:58:20Z

@hcoona use a semantic predicate to handle prefix values.

hcoona · 2017-12-20T06:31:59Z

Surely there is a workaround for it. I mentioned it here only to point out that ANTLR's support for binary formats could be improved.

Andrew-Fryer · 2022-09-16T18:13:48Z

@hcoona use a semantic predicate to handle prefix values.

@parrt
I don't think that semantic predicates can handle prefix values in non-trivial binary grammars.
Could you please let me know if I'm missing something here?

I've created an MWE. There is a choice between 2 parser rules, a and b, that both begin and end with a prefix length followed by that number of bytes.
The rules are distinguished by 2 possible values of a middle byte, which both appear in the input stream.
It seems that ANTLR can't use semantic predicates during prediction, so it arbitrarily picks the first rule and then fails if the second rule should have been chosen, because it can't backtrack.

The following error is produced on the input: 0x01 0x00 0x01 0x00

line 1:2 extraneous input '1' expecting BYTE_A
line 1:4 missing {BYTE_A, BYTE_B, LENGTH_BYTE} at '<EOF>'
Exception in thread "main" java.lang.NumberFormatException: For input string: ""
        at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
        at java.base/java.lang.Integer.parseInt(Integer.java:678)
        at java.base/java.lang.Integer.parseInt(Integer.java:784)
        at MWEParser.byte_(MWEParser.java:386)
        at MWEParser.string(MWEParser.java:275)
        at MWEParser.a(MWEParser.java:184)
        at MWEParser.mwe(MWEParser.java:125)
        at MWE.main(MWE.java:15)

(Parsing succeeds on the same input if I replace (a | b) with just b in the grammar.)

Here is my grammar:

grammar MWE;

mwe:
  (a | b)
  EOF
  { System.out.println("Success!"); }
  ;
a:
  string
  BYTE_A
  string
  ;

b:
  string
  BYTE_B
  string
  ;

string:
  length=byte
  data=blob[$length.val]
  ;

blob [int n]
locals [int i = 0]
    : ( {$i < $n}? byte {$i++;} ) * {$i == $n}?
    ;

byte returns [byte val]
  : data=allTerminals
  {
$val = (byte) Integer.parseInt($data.text);
  };

allTerminals
  : BYTE_A
  | BYTE_B
  | BYTE
  ;

BYTE_A: '\u0000'..'\u0000';
BYTE_B: '\u0001'..'\u0001';
BYTE: '\u0000'..'\u00ff';

(Note that this grammar works if it is refactored so that the string rule is moved from the beginning of a and b to the beginning of mwe.)

And here is my test code:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;


class MWE {
    public static void main(String[] args) throws IOException {
        String inputFile = args[0];
        // Feed the raw bytes of the file to the lexer as chars 0..255.
        ANTLRFileStream bytesAsChar = new BinaryANTLRFileStream(inputFile);
        MWELexer lexer = new MWELexer(bytesAsChar);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MWEParser parser = new MWEParser(tokens);
        // Start parsing at the top-level rule.
        ParseTree tree = parser.mwe();
    }
}

BinaryANTLRFileStream is taken from https://github.com/antlr/antlr4/blob/master/doc/parsing-binary-files.md
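
For reference, a hand-written reader for the same format needs no backtracking, because it reads the length value before deciding how many bytes to consume and only then looks at the middle byte. This is a minimal sketch of that idea and not an ANTLR solution; the class MweByHand and its methods are hypothetical.

```java
final class MweByHand {
    private final byte[] in;
    private int pos;

    private MweByHand(byte[] in) {
        this.in = in;
    }

    // Consume one byte as an unsigned value.
    private int next() {
        if (pos >= in.length) {
            throw new IllegalStateException("unexpected end of input at " + pos);
        }
        return in[pos++] & 0xFF;
    }

    // string: a length byte followed by that many data bytes.
    private void string() {
        int length = next();
        for (int i = 0; i < length; i++) {
            next();
        }
    }

    // Returns 'a' or 'b' depending on the middle byte (0x00 or 0x01).
    static char parse(byte[] input) {
        MweByHand p = new MweByHand(input);
        p.string();
        int middle = p.next();
        char rule;
        if (middle == 0x00) {
            rule = 'a';
        } else if (middle == 0x01) {
            rule = 'b';
        } else {
            throw new IllegalStateException("unexpected middle byte " + middle);
        }
        p.string();
        if (p.pos != input.length) {
            throw new IllegalStateException("trailing bytes at " + p.pos);
        }
        return rule;
    }

    public static void main(String[] args) {
        System.out.println(parse(new byte[] {0x01, 0x00, 0x01, 0x00})); // prints "b"
    }
}
```

On the input 0x01 0x00 0x01 0x00 the first string consumes two bytes, the middle byte is 0x01, and the second string is empty, so alternative b is the one that applies.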

KvanTTT · 2022-09-16T18:47:59Z

Honestly, in my opinion, ANTLR is not the best tool for binary parsing. Predicates are used frequently in binary parsing to distinguish data types after a prefix. On the other hand, there are almost no special prefixes in text parsing. ANTLR is good for text parsing.

I recommend trying Kaitai Struct.

parrt · 2022-09-16T18:54:11Z

Yeah, antlr can do binary parsing, but the minute it starts to be context-sensitive, it can get ugly. In particular, things like blob [int n] cause problems when you then have a predicate based upon that parameter. That is an action, and antlr does not execute actions during lookahead operations, so if it has to see through a function called with an argument, it won't see that argument.

KvanTTT · 2022-09-16T18:59:05Z

@parrt it looks like the warnings from #3626 should be implemented in ANTLR ASAP since users encounter the same problem again and again.

In the following code:

blob [int n]
locals [int i = 0]
    : ( {$i < $n}? byte {$i++;} ) * {$i == $n}?
    ;

The warning ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES will be thrown.

parrt · 2022-09-16T19:40:18Z

Well, most users don't encounter it. I've seen two cases recently.

parrt closed this as completed Mar 2, 2015

parrt added the status:invalid label May 19, 2015

parrt added comp:doc type:improvement type:question and removed status:invalid labels Nov 18, 2016

parrt added this to the 4.6 milestone Nov 18, 2016
