Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support reading and writing DICOM whole slide images #3742

Merged
merged 49 commits into from
Dec 3, 2021

Conversation

melissalinkert
Copy link
Member

Significantly reworks the DICOM reader and adds a new writer so that WSI data is supported.

Relevant test data is in inbox/dicom-wsi. There are a ton of datasets there, but in general the newer sets (LUNG..., pixelmed-converted, WG26/3DHISTECH, WG26/Leica_GT450) will be better for testing by hand. Most are copied from ftp://dicom.nema.org/MEDICAL/Dicom/DataSets/WG26, so publicly available.

Whole slide datasets will have one or more files per resolution, with optional additional files that contain the label, overview, etc. Multiple pyramids per dataset are possible if an extended depth of field image was created from a Z stack. Both brightfield and fluorescence slides are supported.

Converting non-DICOM slides to DICOM is supported by the new DicomWriter, using typical bfconvert options. The expected output is one file per resolution or extra (label/macro/etc.) image. Some changes were needed in ImageConverter to make the tile writing logic less TIFF-specific.

There is also a new bfconvert option to turn off the sequential-writing assumption (3208d92). This doesn't change the actual tile writing order, just the assumption that the writer uses. For DICOM, this is helpful for testing the two tile position storage modes: TILED_FULL and TILED_SPARSE. The former implicitly defines the positions; all tiles are stored, and left-to-right, top-to-bottom. The latter explicitly records positions, which results in a larger file (more metadata), but allows missing tiles and/or non-sequential storage.

There is obviously a lot to go through here, so happy to have a discussion if that's helpful.

This will definitely impact memo files so should not be considered for a patch release. Most test issues will be addressed in a forthcoming config PR, and are primarily updates to image names/descriptions and a few pixel hash changes due to differences in file sorting (by metadata now instead of by file name/pattern). There are a couple of specific files which will still cause failures, and a proposed solution will be noted on the config PR.

I have it on my to-do list to go through outstanding DICOM issues to verify whether or not this helps. In particular:

Very basic support so far, still needs quite a bit of work (see TODOs).
Basic tests similar to "bfconvert test.fake test.dcm" should result in a file that dcdump and ImageMagick can handle.
"bfconvert -tilex 512 -tiley 512 input output.dcm" should now produce 512x512 tiles as expected.
The pixel data should now be properly encapsulated as required by the transfer syntax.
See http://dicom.nema.org/dicom/2013/output/chtml/part05/sect_A.4.html
Still need to figure out a better way to set the Image Type for extra images though.
-no-sequential won't change the order in which planes/tiles are written,
just the writer's assumption of plane/tile order.
There are a few compromises here. One is that the DicomAttribute enum
does not contain the entire dictionary, but a subset that is most likely to
be explicitly referenced in the reader/writer. The very large size of the
dictionary combined with Java's limitations on bytecode size means that
a single enum with the whole dictionary is not possible.

The other compromise is that the original metadata table does not contain
every tag in every file of a dataset, as that can quickly lead to memory
exhaustion (especially once the table is included in OME-XML).
DicomReader now has a "List<DicomTag> getTags()" method that can be used
to access the tag hierarchy directly, should an application need to do so.
return 8;
}
default:
throw new IllegalArgumentException(vr.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a result of the DICOM reader priority increase in readers.txt, this exception is now being thrown across several file formats during the isThisType check.

See e.g. bd-pathway, deltavision for examples of affected filesets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect f75b8dc, 84e1a80, and fc13799 to resolve unexpected test failures, but will check the results tomorrow now that builds are running again.

@fedorov
Copy link

fedorov commented Nov 12, 2021

@melissalinkert would it be possible to acknowledge that development of this functionality was supported by NCI Imaging Data Commons, and include the wording for the funding source?

@melissalinkert
Copy link
Member Author

@fedorov : we can certainly add a funding acknowledgement. Typically we do this in the release notes via a pull request to https://github.com/ome/bio-formats-documentation. An example is the New file formats section of the 6.6.0 notes:

https://docs.openmicroscopy.org/bio-formats/6.7.0/about/whats-new.html#december

If you have specific acknowledgement text you prefer, or a line in the release notes is insufficient, just let us know.

@fedorov
Copy link

fedorov commented Nov 15, 2021

Melissa, thank you for the pointer, this is very helpful. Let me check with your project leads at NCI/Leidos what is their preference, and I will get back to you towards the end of the week.

@@ -65,78 +66,30 @@
import ome.units.quantity.Length;
import ome.units.UNITS;

import static loci.formats.in.DicomAttribute.*;
import static loci.formats.in.DicomVR.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned during the weekly Formats discussion, immediate question is whether this API handling the specifics of the DICOM specification should be managed in a dedicated package similarly to loci.formats.tiff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

625f8cc reorganizes into a new loci.formats.dicom package.

@snoopycrimecop
Copy link
Member

Conflicting PR. Removed from build BIOFORMATS-push#1041. See the console output for more details.
Possible conflicts:

--conflicts

@snoopycrimecop
Copy link
Member

Conflicting PR. Removed from build BIOFORMATS-push#1042. See the console output for more details.
Possible conflicts:

--conflicts

@dgault dgault added this to the 6.8.0 milestone Nov 24, 2021
@snoopycrimecop
Copy link
Member

snoopycrimecop commented Nov 25, 2021

Conflicting PR. Removed from build BIOFORMATS-push#1049. See the console output for more details.
Possible conflicts:

--conflicts Conflict resolved in build BIOFORMATS-push#1054. See the console output for more details.

@fedorov
Copy link

fedorov commented Nov 30, 2021

Melissa, I am not sure how documentation updates are handled - would that be done in a separate repo after this PR is merged, to document improvements to DICOM support?

@fedorov
Copy link

fedorov commented Nov 30, 2021

Melissa, regarding the acknowledgment, I checked with our federal leads, and they would like to have the following acknowledgment, if possible:

"This project has been funded in whole or in part with Federal funds from the National Cancer Institute,
National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN2612015000031."

cc: @ulrikew

My personal wish would also be to hyperlink "Imaging Data Commons", so I suggest the following wording:

"This functionality was implemented through collaboration with NCI Imaging Data Commons, and has been funded in whole or in part with Federal funds from the National Cancer Institute,
National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN2612015000031."

@dgault
Copy link
Member

dgault commented Dec 2, 2021

Builds and tests have been green with this PR included and the merge-ci repo tests are now green also with the associated config. Overall the PR looks to be in good shape and suitable for inclusion with the 6.8.0 release.

While testing the writer I noticed a few times that converted files would produce the below exception. This appears to have been due to files being grouped with others I was testing, when the files were moved to a separate directory they opened without exception. This is likely a similar issue that required some of the config files to be moved.

Exception in thread "main" java.lang.ArithmeticException: / by zero
	at loci.formats.MetadataTools.populatePixelsOnly(MetadataTools.java:315)
	at loci.formats.MetadataTools.populateMetadata(MetadataTools.java:262)
	at loci.formats.MetadataTools.populatePixels(MetadataTools.java:151)
	at loci.formats.MetadataTools.populatePixels(MetadataTools.java:116)
	at loci.formats.in.DicomReader.initFile(DicomReader.java:905)

I manually tested the reader across a subset of the new sample files:

CPTAC-LSCC_C3L-00965-26
Dataset opened and displayed without exception, containing 1 series with 8 resolutions.

CPTAC-LSCC_C3L-00965-26_PixelMedConvertedSVSSuppliedPyramid
Dataset opened and displayed without exception, containing 3 series, the first with 3 resolutions, then a thumbnail and overview image. However when attempting to open the file CPTAC-LSCC_C3L-00965-26_DICOMDIR an exception was thrown as below. Based on the config PR this file was set with test=false so I assume it is not expected to be able to open this file.

java.lang.ArithmeticException: / by zero
	at loci.formats.in.DicomReader.initFile(DicomReader.java:719)

LUNG-1-LN^ [LUNG-1-LN]
Each series could be opened and displayed without exceptions. Each had 5 resolutions.

WG26/3DHISTECH/FL_20191127/DICOM_TILED_FULL/056_ATM
The dataset was able to opened and displayed without exception. It contained 4 series. The first series has 10 resolutions then an overview image, another image with 10 resolutions and the final series is labelled REGIONLOCALIZER

WG26/3DHISTECH/FL_20191127/DICOM_TILED_SPARSE/056_ATM
The dataset was able to opened and displayed without exception matching that of the TILED_FULL dataset

WG26/3DHISTECH/FL_20191127/DICOM_TILED_FULL/FluorCell_4
The dataset was able to opened and displayed without exception. It contained 3 series. The first series has 10 resolutions then an image REGIONLOCALIZER followed by a label image.

WG26/3DHISTECH/FL_20191127/DICOM_TILED_SPARSE/FluorCell_4
The dataset was able to opened and displayed without exception matching that of the TILED_FULL dataset

WG26/3DHISTECH/FL_20191127/DICOM_TILED_FULL/Liver_v1
The dataset was able to opened and displayed without exception. It contained 4 series. The first series has 10 resolutions then an overview image, another image with 10 resolutions and the final series is labelled REGIONLOCALIZER

WG26/3DHISTECH/FL_20191127/DICOM_TILED_SPARSE/Liver_v1
The dataset was able to opened and displayed without exception matching that of the TILED_FULL dataset

WG26/3DHISTECH/FL_20191127/DICOM_TILED_FULL/17361-13-FISH
The dataset was able to opened and displayed without exception. It contained 4 series. The first series has 10 resolutions then an overview image, another image with 10 resolutions and the final series is labelled REGIONLOCALIZER

WG26/3DHISTECH/FL_20191127/DICOM_TILED_SPARSE/17361-13-FISH
The dataset was able to opened and displayed without exception matching that of the TILED_FULL dataset

For the writer I tested the bfconvert workflow against a range of our public data.

For the below datasets with multiple timepoints a format exception was thrown. This is as expected:

Exception in thread "main" loci.formats.FormatException: Multiple timepoints not supported
	at loci.formats.out.DicomWriter.setId(DicomWriter.java:335)

For other datasets the conversion was run both with and without the new option. Each of the resulting converted files opened and displayed as expected without any exceptions:

When running the converter using the -no-sequential across a range of the public sample data the resulting files matched those converted without the flag. However when running the converter with the flag and trying to generate resolutions I would run into the below exception (the index is 1 greater than the num planes):

bfconvert -no-sequential -pyramid-resolutions 2 -pyramid-scale 2

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 511 out of bounds for length 511
	at loci.formats.out.PyramidOMETiffWriter.close(PyramidOMETiffWriter.java:128)
	at loci.formats.ImageWriter.close(ImageWriter.java:469)
	at loci.formats.tools.ImageConverter.testConvert(ImageConverter.java:758)
However this error does not appear to be related to the changes in the ImageConverter code or the new option, it simply makes it easier to occur. You can achieve the same error using some of our example code and setting sequential writing to false. This is a bug that should probably be logged as a separate issue.

@dgault
Copy link
Member

dgault commented Dec 2, 2021

@fedorov, the addition to the version history (which will include the funding acknowledgement) mentioned by Melissa will be opened as a separate PR against https://github.com/ome/bio-formats-documentation. This usually happens once all of the PR's for a given release have been merged and the contents of that PR is then also used as the basis for the public announcements on our website and image.sc etc .

If there are any other areas of documentation you feel need updated please do let me know.

@melissalinkert
Copy link
Member Author

Thanks for the thorough review, @dgault.

While testing the writer I noticed a few times that converted files would produce the below exception. This appears to have been due to files being grouped with others I was testing, when the files were moved to a separate directory they opened without exception. This is likely a similar issue that required some of the config files to be moved.

I definitely wouldn't recommend trying to put multiple converted DICOM pyramids in the same directory, since the file grouping is logic is mostly best guess. Would it help to have the writer issue a warning if it will be writing to a non-empty directory?

CPTAC-LSCC_C3L-00965-26_PixelMedConvertedSVSSuppliedPyramid
Dataset opened and displayed without exception, containing 3 series, the first with 3 resolutions, then a thumbnail and overview image. However when attempting to open the file CPTAC-LSCC_C3L-00965-26_DICOMDIR an exception was thrown as below. Based on the config PR this file was set with test=false so I assume it is not expected to be able to open this file.

Correct, I would not expect using the *DICOMDIR file for this particular dataset to work.

@dgault
Copy link
Member

dgault commented Dec 3, 2021

Thanks Melissa, getting this merged for now for 6.8.0

@dclunie
Copy link

dclunie commented Sep 5, 2023

@melissalinkert does this include support for DICOM z-stack images? E.g., see discussion at news:comp.protocols.dicom.

@melissalinkert
Copy link
Member Author

@dclunie : yes, Z stacks are expected to be supported here. Several of the WG26 datasets referenced above include Z stacks, and those are included in nightly regression tests. If you (or anyone else following) have Z stacks that aren't being correctly read or written by Bio-Formats, we'll definitely want to know so that we can investigate the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants