-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: compress/lzw: extend package for correctly handling tiff files and pdf LZWDecode filters #25409
Comments
cc @dsnet Note that due to the Go 1 compatibility promise, we cannot change the signature of any exported function or method. (At least for Go 1.) |
👍 |
Would it be possible to add a public Reader to compress/lzw and change the NewReader signature to return a *Reader? Does that still violate the Go1 promise? EDIT: I suppose anywhere a type switch or similar is done as the result of := NewReader would fail so technically not a drop in replacement and can violate. |
Yes, the Go 1 compatibility guarantee prevents us from changing the result type of |
My counter proposal is #25429, which is to actually freeze the |
I may not have been clear but There is even a hinting TODO comment about that at the beginning of There are 2 ways lzw compression is used depending on the filter parameter There is a test with an lzw encoded string in reader_test.go:
The way I see it this test is not representative and only passes because the chosen string is short. I think something needs to be done here. |
I'm not sure if I'm missing something but if the signature of NewReader can't be changed why not suggest adding a NewReaderEarlyChange() or NewReaderExt(oneOff bool)? (not necessarily advocating but it seems like that is the only issue with the change, the rest is internal?) |
The To avoid the explosion of configuration cross products, you could make the case the API should have been: // Format determines the variation of LZW used in certain file formats.
type Format int
const (
_ Format = 1 << iota
GIF
TIFF
PDF
)
func NewReader(r io.Reader, f Format) (*Reader, error)
func NewWriter(r io.Writer, f Format) (*Writer, error) However, given that the number of formats that use "lzw" is a limited and specific set of formats, I argue that this functionality should not have been a general purpose package in the standard library. If anything, we should document that the I understand the need for this functionality, but I don't understand why it has the be done in the standard library. Forking the |
👍 In addition to the The few PDF files I could get a hand on that use the LZWDecode filter with or w/o The overall too much subtleties of the PDF LZWDecode filter justify to be separated from the standard lib +1 also for adding a comment to |
I'd rather have one lzw package that works for the three major formats (GIF, PDF, TIFF) than have two different ones. It looks like the code changes here are small. Obviously the API change must be done in a compatible way. But assuming it is, what's the harm in merging the x/ lzw functionality into the standard one? Are there yet more variants in common use besides these two? |
Three. GIF, TIFF, and PDF are different in subtle ways.
Let me ask it the opposite way, wrong's wrong with pairing each As a format, LZW is not that complicated, being around 200 lines (per reader or writer) if you strip away the logic that already exists to vary based on the existing options. Specializing on each LZW variant has benefits:
|
I can only chip in on the PDF variant. I may be wrong but there are at least two reasons why I tend to agree with @dsnet it would be a good idea to keep the implementation of the PDF LZW variant separate from the standard library
It seems to be true that the LZWDecode filter has been abandoned for PDF writers in favour of the FlateDecode filter by the major PDF tool vendors for performance reasons. Still PDF readers want to be able to digest lzw encoded content for backwards compatibility and PDF processors want to be able to handle any filter that's part of the PDF spec for the same reason. Therefore it would be nice to make an implementation of the PDF lzw variant available to the community. I think that's undisputed. As described above and here PDFs published on the internet indicate not all PDF writers implement(ed) For this reason testing focused on a particular PDF filter is cumbersome. The basic encode/decode and compare against golden cycle is not enough when you want to make sure you get the same result coming from either direction. Unfortunately we are in this position with PDF LZWDecode because there is not much content available out there. You have to identify or generate a PDF file using this filter for compression, then dissect the file, identify a fitting object stream and extract its compressed and decompressed content for testing. Other than the testing issue I can see code adaptions coming up as the implementation gets exposure and people start testing against it. While bugfixes are part of the software development life cycle we do not want to have a fix for |
Just on a possible 2.0 API (or even an additional 1.x API if it wasn't called NewReader),
won't work for GIF, as a valid GIF's litWidth can vary between 2 and 8 (inclusive). See "LZW Code Size" in Appendix F of https://www.w3.org/Graphics/GIF/spec-gif89a.txt As for bikeshedding on freezing or growing the stdlib's compress/lzw, I'm OK with growing the stdlib version to accomodate PDF and TIFF. Sure, the new API will be awkward, due to 1.x compatibility, but so be it. But I don't have the spare time to work on it myself. If @dsnet is the de facto |
Does the following make any sense?
|
I'm still undecided whether that's actually the best API, as opposed to e.g. NewReaderExt, but it does make sense. Note that the Order args are no longer necessary if you're specifing GIF / TIFF / PDF. The TIFF reader and writer also doesn't need the litWidth arg. I think it's always 8 in TIFF, like PDF. |
OK.
So that means |
Yeah, you got it right. LitWidth is GIF only, and GIF only uses LSB. On balance though, I think I like the NewReaderExt or NewReader2 function, with an extra oneOff or earlyChange argument, better. |
I am not sure I understand what you are suggesting. You want to have a separate ReaderExt that uses Are we talking 1.11 or 2.0? Has it been decided who is going to take care of the implementation and the corresponding CL? |
Ditto for NewWriterExt. This can be for Go 1.x. |
I think I got you. Keep the original Reader/Writer in order to comply with the Go 1 compatibility promise Is that it? Then of course supply proper documentation on the parameter usage for each format. |
In #25409 (comment), @hhrutter mentioned that PDF's variant of LZW expects streams to start with a clear-table symbol. Is this true of the other LZW variants? I still suspect that it really is better to have |
Good catch! |
Maybe it would make sense to develop this in golang.org/x/image/lzw for now? Only GIF is in the standard library - TIFF and PDF are not. There's still the problem of who will review the code. |
For PDF There is also already a golang.org/x/image/tiff package in need of a |
To be clear, I didn't say that, if you're changing x/image/tiff/lzw, that you should make a backwards incompatible change by removing that package's litWidth adjustability (by modifying or removing NewReader). I said that, if you wanted a NewTIFFReader method, you don't need a litWidth arg. What I would recommend is to fork the stdlib's compress/lzw to a new location, e.g. github.com/username/lzw to make a package that's acceptable to (1) the stdlib's image/gif, (2) the extended lib's x/image/tiff, and (3) whatever pdf packge you're working on. I'd consider also making username/gif and username/tiff forks that used username/lzw. I'd also consider organizing it into 1 repo instead of 3: username/foo/lzw, username/foo/gif and username/foo/tiff for some good value of foo. The point of this is, nobody should be blocked on e.g. the Go 1.x release cadence, and once we have some concrete code, with some experience with how that API works in practice, then we can have a much more informed discussion of what to do with e.g. the stdlib's compress/lzw package (and its backwards compat guarantees), versus new packages under x. |
On top of these recommendations I'd like to follow the stdlib's hierarchy and keep Repo 1: Repo 2: Repo 3: As far as the reader/writer implementation in repo 1 goes I am undecided but I am thinking Hence I am leaning towards:
It would definitely make sense though to put the revised Please share your opinions. |
I'm repeating myself, but I still prefer
To me, it seems like a layering violation for lzw (which is a low layer) to have API named after things like GIF and PDF (which are high layers). |
I see where you are coming from. If
you have higher level packages like The same applies for PDF consumers only there is nothing PDF related in the stdlib. |
@dsnet Any thoughts on @nigeltao 's comment #25409 (comment) ? |
I still hold to my opinion in #25409 (comment) that I suspect specialized APIs are better. Per @nigeltao's comment in #25409 (comment), it seems that we both support development of this outside of standard library. Perhaps the proposal should be on hold until the API for LZW as used in PDF stabilizes and we can see what the subtle differences are with GIF and TIFF. |
In order to let the PDF version stabilize it would be good to put it somewhere it gets proper exposure. |
I don't think that's necessary. Wouldn't the repo that is developing an implementation of Go PDF be the perfect place to put that? |
Basically yes, but I can give you two reasons:
|
If you create
I'm not fond of adding more to |
I would also recommend |
Ok then. I will provide Still, somebody needs to update the comments in |
It seems like this discussion converged on having it outside the standard library (and outside golang.org/x) which sounds fine. |
We have both lzw reader and writer under compress/lzw.
There is also an lzw reader hidden under https://github.com/golang/image/blob/master/tiff/lzw/reader.go
that is supposed to be used for reading tiff files. (There is no writer.)
The difference being a slight modification of the algorithm as stated in https://github.com/golang/image/blob/master/tiff/lzw/reader.go
I am working on a pdf processor and have a use case for encoding/decoding LZW data using both variations. The PDF spec defines the LZWDecode filter with the parameter EarlyChange that decides about the variation of the algorithm to use:
if EarlyChange = 1 => use golang/image/tiff/lzw/reader.go from the experimental repos at golang.org/x and a correspondingwriter.
if EarlyChange = 0 => use compress.lzw.reader.go and compress/lzw/writer.go from the std lib.
The default value for EarlyChange is 1.
This proposal is about extending compress/lzw so it can be used by any package that needs to implement lzw compression in any way. While reading/writing tiff files may be an exotic usecase these days, upcoming PDF readers,writers,processors would greatly benefit from this. In addition this would be the first time to introduce tiff encoding and decoding within the stdlib.
These are the proposed code changes. The modifications of the algorithm are minor but
but for now I don't see any better way than extending the API:
The text was updated successfully, but these errors were encountered: