Commit f18ec0b

include column descriptions from separate file
minor text changes
updated ImageType description
check if include command works now
Moved section introducing indices of idc-index into README
Test to see if docs are build correctly
Docs: Added separate page for column description of indices and linked to it from README.
removed test
fixed link
removed doubled text
adapted toctree to generate column_descriptions.html
corrected toctree
another try
next try
a test
probably solved
added title and sections
1 parent 1361412 commit f18ec0b

3 files changed

+129
-89
lines changed

README.md

+11
@@ -81,6 +81,17 @@ client.download_from_selection(
 )
 ```
 
+## The `indices` of `idc-index`
+
+`idc-index` is named this way because it wraps indices of IDC data: tables
+containing the most important metadata attributes describing the files available
+in IDC. The main metadata index is available in the `index` variable (which is a
+pandas `DataFrame`) of `IDCClient`. Additional index tables are also available:
+`clinical_index` contains non-DICOM clinical data, and the slide microscopy
+tables (indicated by the prefix `sm`) include metadata attributes specific to
+slide microscopy images. A description of available attributes for all indices
+can be found [here](column_descriptions).
+
 ## Tutorial
 
 Please check out

docs/column_descriptions.md

+116
@@ -0,0 +1,116 @@
+# Metadata attributes in `idc-index`'s index tables
+
+## `index`
+
+The following is the list of the columns included in `index`. You can use these
+to select cohorts and subset data. `index` is series-based, i.e., it has one
+row per DICOM series.
+
+- non-DICOM attributes assigned/curated by IDC:
+
+  - `collection_id`: short string with the identifier of the collection the
+    series belongs to
+  - `analysis_result_id`: this string is not empty if the specific series is
+    part of an analysis results collection; analysis results can be added to a
+    given collection over time
+  - `source_DOI`: Digital Object Identifier of the dataset that contains the
+    given series; note that a given collection can include one or more DOIs,
+    since analysis results added to the collection would typically have
+    independent DOI values
+  - `instanceCount`: number of files in the series (typically, this matches the
+    number of slices in cross-sectional modalities)
+  - `license_short_name`: short name of the license that governs the use of the
+    files corresponding to the series
+  - `series_aws_url`: location of the series files in a public AWS bucket
+  - `series_size_MB`: total disk size, in MB, needed to store the series
+
+- DICOM attributes extracted from the files:
+  - `PatientID`: identifier of the patient
+  - `PatientAge` and `PatientSex`: attributes containing patient age and sex
+  - `StudyInstanceUID`: unique identifier of the DICOM study
+  - `StudyDescription`: textual description of the study content
+  - `StudyDate`: date of the study (note that these dates are shifted to protect
+    patient privacy and are not the real dates when images were acquired)
+  - `SeriesInstanceUID`: unique identifier of the DICOM series
+  - `SeriesDate`: date when the series was acquired
+  - `SeriesDescription`: textual description of the series content
+  - `SeriesNumber`: series number
+  - `BodyPartExamined`: body part imaged
+  - `Modality`: acquisition modality
+  - `Manufacturer`: manufacturer of the equipment that generated the series
+  - `ManufacturerModelName`: model name of the equipment
+
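
Because `index` has one row per DICOM series, cohort selection amounts to filtering rows on column values and then aggregating over the matching series. The following is a minimal, dependency-free sketch of that idea. It models rows as plain dictionaries rather than using the real pandas `DataFrame`, and all values, as well as the `select_cohort` helper, are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical rows mimicking a few `index` columns; one row per DICOM series.
# All identifiers and sizes below are made up for illustration.
rows = [
    {"collection_id": "collection_a", "PatientID": "P1", "Modality": "CT",
     "SeriesInstanceUID": "1.2.3.1", "series_size_MB": 120.5},
    {"collection_id": "collection_a", "PatientID": "P1", "Modality": "MR",
     "SeriesInstanceUID": "1.2.3.2", "series_size_MB": 80.0},
    {"collection_id": "collection_a", "PatientID": "P2", "Modality": "CT",
     "SeriesInstanceUID": "1.2.3.3", "series_size_MB": 99.5},
    {"collection_id": "collection_b", "PatientID": "P3", "Modality": "CT",
     "SeriesInstanceUID": "1.2.3.4", "series_size_MB": 10.0},
]


def select_cohort(rows, **criteria):
    """Return the rows whose columns equal every given criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


# Select all CT series of one collection.
cohort = select_cohort(rows, collection_id="collection_a", Modality="CT")

# The table is series-based, so summing `series_size_MB` over the selected
# rows gives the disk size of the whole cohort.
total_size_mb = sum(r["series_size_MB"] for r in cohort)

# Group the selected series by patient.
series_by_patient = defaultdict(list)
for r in cohort:
    series_by_patient[r["PatientID"]].append(r["SeriesInstanceUID"])
```

With the actual `DataFrame`, the same selection would typically be written as a boolean mask over the `collection_id` and `Modality` columns.
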
+## `sm_index`
+
+The following is the list of the columns included in `sm_index`. `sm_index` is
+series-based, i.e., it has one row per DICOM series, but only includes series
+with slide microscopy data.
+
+- DICOM attributes extracted from the files:
+  - `SeriesInstanceUID`: unique identifier of the DICOM series (one DICOM series
+    = one slide)
+  - `embeddingMedium`: describes in what medium the slide was embedded before
+    the image was obtained
+  - `tissueFixative`: describes tissue fixatives used before the image was
+    obtained
+  - `staining_usingSubstance`: describes staining steps the specimen underwent
+    before the image was obtained
+  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
+  - `max_TotalMatrixRows`: height of the image at the maximum resolution
+  - `min_PixelSpacing_2sf`: pixel spacing in mm at the maximum resolution layer,
+    rounded to 2 significant figures
+  - `ObjectiveLensPower`: power of the objective lens of the equipment used to
+    digitize the slide
+  - `primaryAnatomicStructure`: anatomic location from where the imaged specimen
+    was collected
+  - `primaryAnatomicStructureModifier`: additional characteristics of the
+    specimen, such as whether it is a tumor or normal tissue
+  - `illuminationType`: specifies the type of illumination used when obtaining
+    the image
+
+For `embeddingMedium`, `tissueFixative`, `staining_usingSubstance`,
+`primaryAnatomicStructure`, `primaryAnatomicStructureModifier` and
+`illuminationType`, each attribute exists as two columns, with the suffixes
+`_code_designator_value_str` and `_CodeMeaning`; the suffix indicates whether
+the column contains CodeSchemeDesignator and CodeValue, or CodeMeaning. If this
+is new to you, a brief explanation of the three-value coding scheme in DICOM
+can be found at https://learn.canceridc.dev/dicom/coding-schemes.
+
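
The suffix convention described above can be sketched as a pair of small helpers: one that derives the two column names for a coded attribute, and one that splits a combined designator/value string. This is only an illustration. The helper names are hypothetical, and the assumption that CodeSchemeDesignator and CodeValue are joined by a colon is not stated in the text and should be checked against the actual data.

```python
CODE_SUFFIX = "_code_designator_value_str"
MEANING_SUFFIX = "_CodeMeaning"


def coded_columns(attribute):
    """Return the two column names that exist for a coded attribute."""
    return attribute + CODE_SUFFIX, attribute + MEANING_SUFFIX


def split_code(code_str, sep=":"):
    """Split a combined string into (CodeSchemeDesignator, CodeValue).

    ASSUMPTION: the two parts are joined by `sep`; verify this against
    real values before relying on it.
    """
    designator, _, value = code_str.partition(sep)
    return designator, value


# Hypothetical row with made-up values, for illustration only.
row = {
    "tissueFixative_code_designator_value_str": "EXAMPLE:0001",
    "tissueFixative_CodeMeaning": "example fixative",
}

code_col, meaning_col = coded_columns("tissueFixative")
designator, value = split_code(row[code_col])
meaning = row[meaning_col]
```
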
+## `sm_instance_index`
+
+The following is the list of the columns included in `sm_instance_index`.
+`sm_instance_index` is instance-based, i.e., it has one row per DICOM instance
+(pyramid level of a slide, plus potentially thumbnail or label images), but only
+includes DICOM instances of the slide microscopy modality.
+
+- DICOM attributes extracted from the files:
+
+  - `SOPInstanceUID`: unique identifier of the DICOM instance (one DICOM
+    instance = one level/label/thumbnail image of the slide)
+  - `SeriesInstanceUID`: unique identifier of the DICOM series (one DICOM series
+    = one slide)
+  - `embeddingMedium`: describes in what medium the slide was embedded before
+    the image was obtained
+  - `tissueFixative`: describes tissue fixatives used before the image was
+    obtained
+  - `staining_usingSubstance`: describes staining steps the specimen underwent
+    before the image was obtained
+  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
+  - `max_TotalMatrixRows`: height of the image at the maximum resolution
+  - `PixelSpacing_0`: pixel spacing in mm
+  - `ImageType`: specifies further characteristics of the image as a list; the
+    third value indicates whether it is a VOLUME, LABEL, OVERVIEW or THUMBNAIL
+    image
+  - `TransferSyntaxUID`: specifies the encoding scheme used for the image data
+  - `instance_size`: specifies the DICOM instance's size in bytes
+
+- non-DICOM attributes assigned/curated by IDC:
+  - `crdc_instance_uuid`: globally unique, versioned identifier of the DICOM
+    instance
+
+For `embeddingMedium`, `tissueFixative`, and `staining_usingSubstance`, each
+attribute exists as two columns, with the suffixes `_code_designator_value_str`
+and `_CodeMeaning`; the suffix indicates whether the column contains
+CodeSchemeDesignator and CodeValue, or CodeMeaning. If this is new to you, a
+brief explanation of the three-value coding scheme in DICOM can be found at
+https://learn.canceridc.dev/dicom/coding-schemes.
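
Since `sm_instance_index` is instance-based, the third `ImageType` value can be used to separate the pyramid levels of a slide from its label, overview and thumbnail images. The sketch below shows one way to do this and to total `instance_size` over the VOLUME instances of each series. It assumes that `ImageType` is available as a Python list of strings; the rows, their values and the `image_kind` helper are hypothetical.

```python
from collections import defaultdict

# The four values named in the `ImageType` description.
KNOWN_KINDS = {"VOLUME", "LABEL", "OVERVIEW", "THUMBNAIL"}


def image_kind(image_type):
    """Return the third ImageType value, or None if missing or unexpected."""
    if len(image_type) < 3:
        return None
    kind = image_type[2]
    return kind if kind in KNOWN_KINDS else None


# Hypothetical rows, one per DICOM instance; all values are made up.
instances = [
    {"SOPInstanceUID": "1.1", "SeriesInstanceUID": "S1",
     "ImageType": ["ORIGINAL", "PRIMARY", "VOLUME", "NONE"],
     "instance_size": 5000},
    {"SOPInstanceUID": "1.2", "SeriesInstanceUID": "S1",
     "ImageType": ["DERIVED", "PRIMARY", "VOLUME", "RESAMPLED"],
     "instance_size": 1200},
    {"SOPInstanceUID": "1.3", "SeriesInstanceUID": "S1",
     "ImageType": ["ORIGINAL", "PRIMARY", "LABEL", "NONE"],
     "instance_size": 300},
    {"SOPInstanceUID": "2.1", "SeriesInstanceUID": "S2",
     "ImageType": ["ORIGINAL", "PRIMARY", "VOLUME", "NONE"],
     "instance_size": 7000},
]

# Total size in bytes of the pyramid levels (VOLUME instances) per slide.
volume_bytes = defaultdict(int)
for inst in instances:
    if image_kind(inst["ImageType"]) == "VOLUME":
        volume_bytes[inst["SeriesInstanceUID"]] += inst["instance_size"]
```
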

docs/index.md

+2 -89
@@ -21,101 +21,14 @@ starting a discussion in [IDC User forum](https://discourse.canceridc.dev/).
 :start-after: <!-- SPHINX-START -->
 ```
 
-## The `index` of `idc-index`
-
-`idc-index` is named this way because it wraps indices of IDC data: tables
-containing most important metadata attributes describing the files available in
-IDC. The main metadata index is available in the `index` variable (which is a pandas
-`DataFrame`) of `IDCClient`.
-Additional index tables such as the `clinical_index` contain non-DICOM clinical data or
-slide microscopy specific tables (indicated by the prefix `sm`) include metadata attributes
-specific to slide microscopy images.
-
-
-
-The following is the list of the columns included in `index`. You can use those
-to select cohorts and subsetting data. `idc-index` is series-based, i.e, it has
-one row per DICOM series.
-
-- non-DICOM attributes assigned/curated by IDC:
-  - `collection_id`: short string with the identifier of the collection the
-    series belongs to
-  - `analysis_result_id`: this string is not empty if the specific series is
-    part of an analysis results collection; analysis results can be added to a
-    given collection over time
-  - `source_DOI`: Digital Object Identifier of the dataset that contains the
-    given series; note that a given collection can include one or more DOIs,
-    since analysis results added to the collection would typically have
-    independent DOI values!
-  - `instanceCount`: number of files in the series (typically, this matches the
-    number of slices in cross-sectional modalities)
-  - `license_short_name`: short name of the license that governs the use of the
-    files corresponding to the series
-  - `series_aws_url`: location of the series files in a public AWS bucket
-  - `series_size_MB`: total disk size needed to store the series
-
-- DICOM attributes extracted from the files
-  - `PatientID`: identifier of the patient
-  - `PatientAge` and `PatientSex`: attributes containing patient age and sex
-  - `StudyInstanceUID`: unique identifier of the DICOM study
-  - `StudyDescription`: textual description of the study content
-  - `StudyDate`: date of the study (note that those dates are shifted, and are
-    not real dates when images were acquired, to protect patient privacy)
-  - `SeriesInstanceUID`: unique identifier of the DICOM series
-  - `SeriesDate`: date when the series was acquired
-  - `SeriesDescription`: textual description of the series content
-  - `SeriesNumber`: series number
-  - `BodyPartExamined`: body part imaged
-  - `Modality`: acquisition modality
-  - `Manufacturer`: manufacturer of the equipment that generated the series
-  - `ManufacturerModelName`: model name of the equipment
-
-The following is the list of the columns included in `sm_index`. `sm_index` is series-based, i.e, it has
-one row per DICOM series, but only includes series with slide microscopy data.
-
-- DICOM attributes extracted from the files:
-  - `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series = one slide
-  - `embeddingMedium`: describes in what medium the slide was embedded before the image was obtained
-  - `tissueFixative`: describes tissue fixatives used before the image was obtained
-  - `staining_usingSubstance`: describes staining steps the specimen underwent before the image was obtained
-  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
-  - `max_TotalMatrixRows`: height of the image at the maximum resolution
-  - `min_PixelSpacing_2sf`: pixel spacing in mm at the maximum resolution layer, rounded to 2 significant figures
-  - `ObjectiveLensPower`: power of the objective lens of the equipment used to digitize the slide
-  - `primaryAnatomicStructure`: anatomic location from where the imaged specimen was collected
-  - `primaryAnatomicStructureModifier`: additional characteristics of the specimen, such as whether it is a tumor or normal tissue
-  - `illuminationType`: specifies the type of illumination used when obtainig the image
-
-In case of `embeddingMedium`, `tissueFixative`, `staining_usingSubstance`, `primaryAnatomicStructure`, `primaryAnatomicStructureModifier` and `illuminationType` the attributes exist with suffix `_code_designator_value_str` and `_CodeMeaning`, which indicates whether the column contains CodeSchemeDesignator and CodeValue, or CodeMeaning. If this is new to you, a brief explanation on the three-value based coding scheme in DICOM can be found at https://learn.canceridc.dev/dicom/coding-schemes.
-
-The following is the list of the columns included in `sm_instance_index`. `sm_instance_index` is instance-based, i.e, it has
-one row per DICOM instance (pyramid level of a slide, plus potentially thumbnail or label images), but only includes DICOM instances of the slide microscopy modality.
-
-- DICOM attributes extracted from the files:
-  - `SOPInstanceUID`: unique identifier of the DICOM instance: one DICOM instance = one level/label/thumbnail image of the slide
-  - `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series = one slide
-  - `embeddingMedium`: describes in what medium the slide was embedded before the image was obtained
-  - `tissueFixative`: describes tissue fixatives used before the image was obtained
-  - `staining_usingSubstance`: describes staining steps the specimen underwent before the image was obtained
-  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
-  - `max_TotalMatrixRows`: height of the image at the maximum resolution
-  - `PixelSpacing_0`: pixel spacing in mm
-  - `ImageType`: specifies further characteristics of the image, including whether it is a VOLUME, LABEL, OVERVIEW or THUMBNAIL image.
-  - `TransferSyntaxUID`: specifies the encoding scheme used for the image data
-  - `instance_size`: specifies the DICOM instance's size in bytes
-
-- non-DICOM attributes assigned/curated by IDC:
-  - `crdc_instance_uuid`: globally unique, versioned identifier of the DICOM instance
-
-In case of `embeddingMedium`, `tissueFixative`, and `staining_usingSubstance` the attributes exist with suffix `_code_designator_value_str` and `_CodeMeaning`, which indicates whether the column contains CodeSchemeDesignator and CodeValue, or CodeMeaning. If this is new to you, a brief explanation on the three-value based coding scheme in DICOM can be found at https://learn.canceridc.dev/dicom/coding-schemes.
-
 ## Contents
 
 ```{toctree}
-:maxdepth: 1
+:maxdepth: 2
 :titlesonly:
 :caption: API docs
 
+column_descriptions
 api/idc_index
 ```
 

0 commit comments
