Binary parsing #828
Comments
i'm pretty sure the book has an example. just pass in bytes of data not chars. easy. |
Could you please say in which part of the book? I understand that I can implement IntStream, but I don't understand how I should write a grammar to use with IntStream. |
Can't find it. hmm.. anyway, just use '\u0042' to match byte 0x42 in lexer. |
Oh, I understand. I will try. Thank you. |
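To make the '\u0042' hint concrete, here is a minimal Python sketch (not ANTLR code). It assumes the idea is to map each byte 0x00..0xFF onto the code point with the same number, which is what decoding as ISO-8859-1 does, so that a lexer literal such as '\u0042' lines up with byte 0x42. The helper name `bytes_as_chars` is made up for illustration.

```python
def bytes_as_chars(data: bytes) -> str:
    """Map every byte to the code point with the same numeric value.

    ISO-8859-1 (latin-1) is a one-to-one mapping between byte values
    0x00..0xFF and code points U+0000..U+00FF, so nothing is lost.
    """
    return data.decode("latin-1")


raw = bytes([0x42, 0x00, 0xFF])
chars = bytes_as_chars(raw)

# Each "character" carries exactly the original byte value.
print([hex(ord(c)) for c in chars])

# The mapping is reversible, so a token's text can be turned back into bytes.
print(chars.encode("latin-1") == raw)
```

Because the mapping is reversible, the text of a matched token can be encoded back to the original bytes when the actual field value is needed.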
I don't see why this is invalid. I came across the same question and have a hard time finding any documentation about using antlr4 with binary files as well. Your hint about using "\u0042" is helpful of course, but it wasn't as obvious as your statement makes it sound. Additionally, the more interesting cases are still not covered, for example how to deal with the found token in the callbacks, how to deal with the fact that I know that I have fields of e.g. 1 byte length, but don't know their actual value because that's what I'm interested in, what all this looks like in callbacks regarding data types etc. Asking for more documentation about that part of your quoted sentence seems perfectly valid to me. |
this is a question not an issue with the software. besides, I don't know what this means "for example how to deal with the found token in the callbacks, how to deal with the fact that I know that I have fields of e.g. 1 byte length, but don't know their actual value " |
Dear Terence, Using DOM, and my limited and outdated knowledge of programming in JavaScript, query, not a very good term to use, the .valueOf() attribute or .value of the DOM object. Reference the value of the DOM object. Also measure it. Yours sincerely, |
Good day Terence Parr,
The homepage claims antlr supports parsing binary data and in most of
You know that some bytes are in there in some special order to tell
And that's why some examples how to work with different binary file
Or maybe the homepage has something completely different in mind when
Best regards, Thorsten Schöning |
Ah. Well, right. parsing by definition means you know the type of thing but not its value like an ID or INT. Let me whip up an example. |
@kkbkris i have no idea what you are saying. |
I think here the issue is that the OP wants to ask for the 'next byte' at certain points. I think that such things really require another level of abstraction. Otherwise you need a Parser driven lexer. |
@jimidle the getExpectedTokens() or whatever it is should give them the expected bytes no problem. |
"Fixed" by a6a7304 adding this documentation https://github.com/antlr/antlr4/blob/master/doc/parsing-binary-files.md. @tschoening does this answer your question? As you can see, there is no difference between parsing characters and binary from a recognition point of view. It's all about how you feed things to the lexer. |
Good day Terence Parr,
Yes, thanks. This makes it much clearer for me to understand how
Best regards, Thorsten Schöning |
Don't mean to beat a dead horse, but @parrt is there a way that ANTLR can handle byte swapping in the case where the file might have been written on a little endian machine? I'm interested in trying to build a parser with ANTLR to handle file formats that include a combination of ASCII text and binary data. In this case I would think storing the data from the |
@wbuchanan well, the minute you say "order", you imply language, so it's up to the grammar author to create two rules: one for 4-byte little-endian and one for 4-byte big-endian, let's say. ANTLR knows nothing of byte order, machine architecture etc... |
@parrt There's a value in the header of the file spec that indicates the order of the bytes, but I'm not sure how something like that would be handled via ANTLR. There is a new/unorthodox rule for some of the larger string data as well that involves 6-byte integers. I'll try digging into the book a bit more in the meantime. Thanks again. |
@wbuchanan antlr is not designed to understand file formats. it is designed to let YOU specify file formats. :) |
@parrt I understand, just don't have experience in this particular area. |
Maybe you should/could implement your own input stream that is coded with this knowledge and presents binary sections of either endianness in the same form to the lexer, regardless of what it looks like on disk?
|
@jimidle I'm not sure how that would work with some of the data. For example, there are 6-byte integers that are used to reference binary objects which may/may not be string data (they could be long strings or binary values). The endianness of the file is indicated pretty early in the header so I'd assume there's a way I could pass the value so I could byte swap as needed. There are also stopping rules for the string data that depend on the location of binary zeros which could become a bit more challenging to handle. |
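As a rough illustration of the pieces discussed above (a byte-order indicator in the header, 6-byte integers, and strings that stop at a binary zero), here is a hand-written Python sketch. It is not ANTLR code and does not describe the actual file format in question: the flag values 0x01 and 0x02, the field layout, and all function names are assumptions made up for this example.

```python
from typing import Tuple


def read_byte_order(flag: int) -> str:
    """Translate a (hypothetical) header flag into a byte-order name."""
    orders = {0x01: "big", 0x02: "little"}  # assumed flag values
    if flag not in orders:
        raise ValueError(f"unknown byte-order flag: {flag:#04x}")
    return orders[flag]


def read_uint(data: bytes, pos: int, size: int, order: str) -> Tuple[int, int]:
    """Read an unsigned integer of `size` bytes; return (value, next position)."""
    end = pos + size
    if end > len(data):
        raise ValueError("truncated integer field")
    return int.from_bytes(data[pos:end], order), end


def read_zero_terminated(data: bytes, pos: int) -> Tuple[bytes, int]:
    """Read bytes up to (not including) the next 0x00; skip the terminator."""
    end = data.find(b"\x00", pos)
    if end == -1:
        raise ValueError("missing zero terminator")
    return data[pos:end], end + 1


# Little-endian flag, a 6-byte integer with value 1, then the string "hi".
sample = bytes([0x02]) + (1).to_bytes(6, "little") + b"hi\x00"
order = read_byte_order(sample[0])
ref, pos = read_uint(sample, 1, 6, order)
text, pos = read_zero_terminated(sample, pos)
print(order, ref, text)
```

The byte order is read once and passed along explicitly, so every later integer read, including the unusual 6-byte width, can be swapped as needed without the rest of the code caring how the file was written.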
I think a binary parser is quite different from a text parser. ANTLR is designed for parsing text rather than binary data formats. There is room for ANTLR to improve here. For example, a very common structure in binary formats is a length prefix followed by that many blocks:
Another basic structure in binary formats is a piece of compressed data. ANTLR cannot deal with these situations well. Reference: https://pdos.csail.mit.edu/papers/nail:osdi14.pdf |
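For readers unfamiliar with the length-prefix structure mentioned above, a minimal hand-written Python sketch follows. It is not the commenter's example and not ANTLR code; the 1-byte count, the fixed `BLOCK_SIZE` of 2, and the function name are assumptions chosen only to keep the example short.

```python
from typing import List, Tuple

BLOCK_SIZE = 2  # assumed fixed block size, chosen only for this example


def read_counted_blocks(data: bytes, pos: int = 0) -> Tuple[List[bytes], int]:
    """Read a 1-byte count N, then exactly N blocks of BLOCK_SIZE bytes."""
    if pos >= len(data):
        raise ValueError("missing count prefix")
    count = data[pos]
    pos += 1
    blocks = []
    for _ in range(count):
        end = pos + BLOCK_SIZE
        if end > len(data):
            raise ValueError("input ends before all blocks were read")
        blocks.append(data[pos:end])
        pos = end
    return blocks, pos


# Count of 2, two 2-byte blocks, then one trailing byte that is left unread.
blocks, next_pos = read_counted_blocks(bytes([2, 0xAA, 0xBB, 0xCC, 0xDD, 0x7F]))
print(blocks, next_pos)
```

The point of the sketch is that the number of repetitions comes from the data itself, which is the context-sensitive part that a plain context-free rule cannot express on its own.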
@hcoona |
@wbuchanan |
@wbuchanan |
@hcoona use a semantic predicate to handle prefix values. |
Sure, there could be a workaround for it. I mentioned it here only to say that ANTLR could be improved further for binary formats. |
@parrt I've created an MWE. There is a choice between 2 parser rules, The following error is produced on the input: 0x01 0x00 0x01 0x00
(Parsing succeeds on the same input if I replace Here is my grammar:
(Note that this grammar works if it is refactored so that the And here is my test code:
|
Honestly, in my opinion, ANTLR is not the best tool for binary parsing. Predicates are used frequently in binary parsing to distinguish data types after a prefix. On the other hand, there are almost no special prefixes in text parsing. ANTLR is good for text parsing. I recommend trying Kaitai Struct. |
Yeah, antlr can do binary parsing, but the minute it starts to be context sensitive, it can be ugly. In particular things like |
@parrt it looks like the warnings from #3626 should be implemented in ANTLR ASAP since users encounter the same problem again and again. In the following code:
blob [int n]
locals [int i = 0]
: ( {$i < $n}? byte {$i++;} ) * {$i == $n}?
;
The warning |
Well, most users don't encounter it. I've seen two cases recently.
You have written:
I don't see any example, and as I understand it, ANTLR supports binary parsing only at the interface level. I don't clearly understand how I should extend your code to support my binaries or how I should write the grammar. Could you give an example? In its current state, ANTLR can work only with character streams.