Added column descriptions #129

Merged
merged 2 commits into from
Sep 28, 2024
11 changes: 11 additions & 0 deletions README.md
@@ -81,6 +81,17 @@ client.download_from_selection(
)
```

## The `indices` of `idc-index`

`idc-index` takes its name from the indices of IDC data that it wraps: tables
containing the most important metadata attributes describing the files available
in IDC. The main metadata index is available in the `index` variable (a pandas
`DataFrame`) of `IDCClient`. Additional index tables cover other kinds of data:
`clinical_index` contains non-DICOM clinical data, while the slide microscopy
tables (indicated by the prefix `sm`) contain metadata attributes specific to
slide microscopy images. A description of the attributes available in all
indices can be found [here](column_descriptions).

## Tutorial

Please check out
116 changes: 116 additions & 0 deletions docs/column_descriptions.md
@@ -0,0 +1,116 @@
# Metadata attributes in `idc-index`'s index tables

## `index`

The following is the list of the columns included in `index`. You can use them
to select cohorts and subset data. `index` is series-based, i.e., it has one
row per DICOM series.

- non-DICOM attributes assigned/curated by IDC:

- `collection_id`: short string with the identifier of the collection the
series belongs to
- `analysis_result_id`: this string is not empty if the specific series is
part of an analysis results collection; analysis results can be added to a
given collection over time
- `source_DOI`: Digital Object Identifier of the dataset that contains the
given series; note that a given collection can include one or more DOIs,
since analysis results added to the collection would typically have
independent DOI values!
- `instanceCount`: number of files in the series (typically, this matches the
number of slices in cross-sectional modalities)
- `license_short_name`: short name of the license that governs the use of the
files corresponding to the series
- `series_aws_url`: location of the series files in a public AWS bucket
- `series_size_MB`: total disk size needed to store the series

- DICOM attributes extracted from the files
- `PatientID`: identifier of the patient
- `PatientAge` and `PatientSex`: attributes containing patient age and sex
- `StudyInstanceUID`: unique identifier of the DICOM study
- `StudyDescription`: textual description of the study content
- `StudyDate`: date of the study (note that those dates are shifted, and are
not real dates when images were acquired, to protect patient privacy)
- `SeriesInstanceUID`: unique identifier of the DICOM series
- `SeriesDate`: date when the series was acquired
- `SeriesDescription`: textual description of the series content
- `SeriesNumber`: series number
- `BodyPartExamined`: body part imaged
- `Modality`: acquisition modality
- `Manufacturer`: manufacturer of the equipment that generated the series
- `ManufacturerModelName`: model name of the equipment
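
As a rough illustration of cohort selection with the columns listed above, the
sketch below filters a small, hand-made `DataFrame` that stands in for `index`.
The rows are invented for the example and only a few of the columns are
included; with the real package the same filter would be applied to the `index`
variable of `IDCClient` instead.

```python
import pandas as pd

# Hypothetical stand-in for the `index` DataFrame; all values are invented.
index = pd.DataFrame(
    {
        "collection_id": ["coll_a", "coll_a", "coll_b"],
        "Modality": ["CT", "MR", "CT"],
        "BodyPartExamined": ["CHEST", "BRAIN", "CHEST"],
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3"],
        "series_size_MB": [120.5, 80.0, 200.25],
    }
)

# Select a cohort: every CT series of the chest, one row per DICOM series.
is_chest_ct = (index["Modality"] == "CT") & (index["BodyPartExamined"] == "CHEST")
cohort = index[is_chest_ct]

# Summarise the cohort before fetching any files.
series_uids = cohort["SeriesInstanceUID"].tolist()
total_size_mb = cohort["series_size_MB"].sum()
print(f"{len(series_uids)} series, {total_size_mb} MB in total")
```

Because the table is series-based, the resulting list of `SeriesInstanceUID`
values identifies the selected series directly.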

## `sm_index`

The following is the list of the columns included in `sm_index`. `sm_index` is
series-based, i.e., it has one row per DICOM series, but only includes series
with slide microscopy data.

- DICOM attributes extracted from the files:
- `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series
= one slide
- `embeddingMedium`: describes in what medium the slide was embedded before
the image was obtained
- `tissueFixative`: describes tissue fixatives used before the image was
obtained
- `staining_usingSubstance`: describes staining steps the specimen underwent
before the image was obtained
- `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
- `max_TotalPixelMatrixRows`: height of the image at the maximum resolution
- `min_PixelSpacing_2sf`: pixel spacing in mm at the maximum resolution layer,
rounded to 2 significant figures
- `ObjectiveLensPower`: power of the objective lens of the equipment used to
digitize the slide
- `primaryAnatomicStructure`: anatomic location from where the imaged specimen
was collected
- `primaryAnatomicStructureModifier`: additional characteristics of the
specimen, such as whether it is a tumor or normal tissue
- `illuminationType`: specifies the type of illumination used when obtaining
the image

Each of the attributes `embeddingMedium`, `tissueFixative`,
`staining_usingSubstance`, `primaryAnatomicStructure`,
`primaryAnatomicStructureModifier` and `illuminationType` is provided as two
columns, one with the suffix `_code_designator_value_str` and one with the
suffix `_CodeMeaning`. The suffix indicates whether the column contains the
CodingSchemeDesignator and CodeValue, or the CodeMeaning. If this is new to you,
a brief explanation of the three-value coding scheme in DICOM can be found
at https://learn.canceridc.dev/dicom/coding-schemes.
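
The suffix convention described above can be spelled out with a small helper
that builds the two column names for a coded attribute. This is only a sketch:
`coded_columns` is a hypothetical helper, not part of `idc-index`, and the exact
formatting of the values stored in the `_code_designator_value_str` columns is
not covered here.

```python
# Base names of the coded attributes listed for `sm_index`.
CODED_ATTRIBUTES = [
    "embeddingMedium",
    "tissueFixative",
    "staining_usingSubstance",
    "primaryAnatomicStructure",
    "primaryAnatomicStructureModifier",
    "illuminationType",
]

# The two suffixes named in the text.
CODE_SUFFIX = "_code_designator_value_str"
MEANING_SUFFIX = "_CodeMeaning"


def coded_columns(attribute: str) -> tuple[str, str]:
    """Return the (code, meaning) column names for one coded attribute.

    Hypothetical helper for illustration only.
    """
    if attribute not in CODED_ATTRIBUTES:
        raise ValueError(f"{attribute!r} is not a coded attribute")
    return attribute + CODE_SUFFIX, attribute + MEANING_SUFFIX


code_col, meaning_col = coded_columns("tissueFixative")
print(code_col)     # column holding the coding scheme designator and code value
print(meaning_col)  # column holding the human-readable code meaning
```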

## `sm_instance_index`

The following is the list of the columns included in `sm_instance_index`.
`sm_instance_index` is instance-based, i.e., it has one row per DICOM instance
(pyramid level of a slide, plus potentially thumbnail or label images), but only
includes DICOM instances of the slide microscopy modality.

- DICOM attributes extracted from the files:

- `SOPInstanceUID`: unique identifier of the DICOM instance: one DICOM
instance = one level/label/thumbnail image of the slide
- `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series
= one slide
- `embeddingMedium`: describes in what medium the slide was embedded before
the image was obtained
- `tissueFixative`: describes tissue fixatives used before the image was
obtained
- `staining_usingSubstance`: describes staining steps the specimen underwent
before the image was obtained
- `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
- `max_TotalMatrixRows`: height of the image at the maximum resolution
- `PixelSpacing_0`: pixel spacing in mm
- `ImageType`: specifies further characteristics of the image in a list,
including as the third value whether it is a VOLUME, LABEL, OVERVIEW or
THUMBNAIL image.
- `TransferSyntaxUID`: specifies the encoding scheme used for the image data
- `instance_size`: specifies the DICOM instance's size in bytes

- non-DICOM attributes assigned/curated by IDC:
- `crdc_instance_uuid`: globally unique, versioned identifier of the DICOM
instance

Each of the attributes `embeddingMedium`, `tissueFixative`, and
`staining_usingSubstance` is provided as two columns, one with the suffix
`_code_designator_value_str` and one with the suffix `_CodeMeaning`. The suffix
indicates whether the column contains the CodingSchemeDesignator and CodeValue,
or the CodeMeaning. If this is new to you, a brief explanation of the
three-value coding scheme in DICOM can be found at
https://learn.canceridc.dev/dicom/coding-schemes.
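
To illustrate how the third value of `ImageType` separates pyramid levels from
label and thumbnail images, the sketch below filters a hand-made `DataFrame`
standing in for `sm_instance_index`. The rows are invented, and the example
assumes that `ImageType` is stored as a Python list in each row.

```python
import pandas as pd

# Hypothetical stand-in for `sm_instance_index`; all values are invented.
sm_instance_index = pd.DataFrame(
    {
        "SOPInstanceUID": ["1.1", "1.2", "1.3", "1.4"],
        "SeriesInstanceUID": ["1", "1", "1", "1"],
        "ImageType": [
            ["ORIGINAL", "PRIMARY", "VOLUME", "NONE"],
            ["DERIVED", "PRIMARY", "VOLUME", "RESAMPLED"],
            ["ORIGINAL", "PRIMARY", "LABEL", "NONE"],
            ["DERIVED", "PRIMARY", "THUMBNAIL", "RESAMPLED"],
        ],
        "PixelSpacing_0": [0.00025, 0.001, 0.002, 0.008],
    }
)

# The third ImageType value tells VOLUME, LABEL, OVERVIEW and THUMBNAIL apart.
image_flavor = sm_instance_index["ImageType"].map(lambda values: values[2])

# Keep only the pyramid levels of the slide.
volume_levels = sm_instance_index[image_flavor == "VOLUME"]

# The level with the smallest pixel spacing is the highest-resolution one.
base_level = volume_levels.sort_values("PixelSpacing_0").iloc[0]
print(base_level["SOPInstanceUID"])
```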
48 changes: 2 additions & 46 deletions docs/index.md
@@ -21,58 +21,14 @@ starting a discussion in [IDC User forum](https://discourse.canceridc.dev/).
:start-after: <!-- SPHINX-START -->
```

## The `index` of `idc-index`

`idc-index` is named this way because it wraps index of IDC data: a table
containing most important metadata attributes describing the files available in
IDC. This metadata index is available in the `index` variable (which is a pandas
`DataFrame`) of `IDCClient`.

The following is the list of the columns included in `index`. You can use those
to select cohorts and subsetting data. `idc-index` is series-based, i.e, it has
one row per DICOM series.

- non-DICOM attributes assigned/curated by IDC:

- `collection_id`: short string with the identifier of the collection the
series belongs to
- `analysis_result_id`: this string is not empty if the specific series is
part of an analysis results collection; analysis results can be added to a
given collection over time
- `source_DOI`: Digital Object Identifier of the dataset that contains the
given series; note that a given collection can include one or more DOIs,
since analysis results added to the collection would typically have
independent DOI values!
- `instanceCount`: number of files in the series (typically, this matches the
number of slices in cross-sectional modalities)
- `license_short_name`: short name of the license that governs the use of the
files corresponding to the series
- `series_aws_url`: location of the series files in a public AWS bucket
- `series_size_MB`: total disk size needed to store the series

- DICOM attributes extracted from the files
- `PatientID`: identifier of the patient
- `PatientAge` and `PatientSex`: attributes containing patient age and sex
- `StudyInstanceUID`: unique identifier of the DICOM study
- `StudyDescription`: textual description of the study content
- `StudyDate`: date of the study (note that those dates are shifted, and are
not real dates when images were acquired, to protect patient privacy)
- `SeriesInstanceUID`: unique identifier of the DICOM series
- `SeriesDate`: date when the series was acquired
- `SeriesDescription`: textual description of the series content
- `SeriesNumber`: series number
- `BodyPartExamined`: body part imaged
- `Modality`: acquisition modality
- `Manufacturer`: manufacturer of the equipment that generated the series
- `ManufacturerModelName`: model name of the equipment

## Contents

```{toctree}
:maxdepth: 1
:maxdepth: 2
:titlesonly:
:caption: API docs

column_descriptions
api/idc_index
```
