ImagingDataCommons · fedorov · Sep 28, 2024 · Sep 18, 2024 · Sep 18, 2024
diff --git a/README.md b/README.md
@@ -81,6 +81,17 @@ client.download_from_selection(
 )
 ```
 
+## The `indices` of `idc-index`
+
+`idc-index` is named this way because it wraps indices of IDC data: tables
+containing the most important metadata attributes describing the files available
+in IDC. The main metadata index is available in the `index` variable (which is a
+pandas `DataFrame`) of `IDCClient`. Additional index tables cover other data:
+`clinical_index` contains non-DICOM clinical data, while the slide microscopy
+specific tables (indicated by the prefix `sm`) include metadata attributes
+specific to slide microscopy images. A description of available attributes for
+all indices can be found [here](column_descriptions).
+
 ## Tutorial
 
 Please check out

diff --git a/docs/column_descriptions.md b/docs/column_descriptions.md
@@ -0,0 +1,116 @@
+# Metadata attributes in `idc-index`'s index tables
+
+## `index`
+
+The following is the list of the columns included in `index`. You can use them
+to select cohorts and subset data. `index` is series-based, i.e., it has one
+row per DICOM series.
+
+- non-DICOM attributes assigned/curated by IDC:
+
+  - `collection_id`: short string with the identifier of the collection the
+    series belongs to
+  - `analysis_result_id`: this string is not empty if the specific series is
+    part of an analysis results collection; analysis results can be added to a
+    given collection over time
+  - `source_DOI`: Digital Object Identifier of the dataset that contains the
+    given series; note that a given collection can include one or more DOIs,
+    since analysis results added to the collection would typically have
+    independent DOI values!
+  - `instanceCount`: number of files in the series (typically, this matches the
+    number of slices in cross-sectional modalities)
+  - `license_short_name`: short name of the license that governs the use of the
+    files corresponding to the series
+  - `series_aws_url`: location of the series files in a public AWS bucket
+  - `series_size_MB`: total disk size, in MB, needed to store the series
+
+- DICOM attributes extracted from the files:
+  - `PatientID`: identifier of the patient
+  - `PatientAge` and `PatientSex`: attributes containing patient age and sex
+  - `StudyInstanceUID`: unique identifier of the DICOM study
+  - `StudyDescription`: textual description of the study content
+  - `StudyDate`: date of the study (note that these dates are shifted to protect
+    patient privacy, and are not the real dates when images were acquired)
+  - `SeriesInstanceUID`: unique identifier of the DICOM series
+  - `SeriesDate`: date when the series was acquired
+  - `SeriesDescription`: textual description of the series content
+  - `SeriesNumber`: series number
+  - `BodyPartExamined`: body part imaged
+  - `Modality`: acquisition modality
+  - `Manufacturer`: manufacturer of the equipment that generated the series
+  - `ManufacturerModelName`: model name of the equipment
+
+## `sm_index`
+
+The following is the list of the columns included in `sm_index`. `sm_index` is
+series-based, i.e., it has one row per DICOM series, but only includes series
+with slide microscopy data.
+
+- DICOM attributes extracted from the files:
+  - `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series
+    = one slide
+  - `embeddingMedium`: describes in what medium the slide was embedded before
+    the image was obtained
+  - `tissueFixative`: describes tissue fixatives used before the image was
+    obtained
+  - `staining_usingSubstance`: describes staining steps the specimen underwent
+    before the image was obtained
+  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
+  - `max_TotalPixelMatrixRows`: height of the image at the maximum resolution
+  - `min_PixelSpacing_2sf`: pixel spacing in mm at the maximum resolution layer,
+    rounded to 2 significant figures
+  - `ObjectiveLensPower`: power of the objective lens of the equipment used to
+    digitize the slide
+  - `primaryAnatomicStructure`: anatomic location from where the imaged specimen
+    was collected
+  - `primaryAnatomicStructureModifier`: additional characteristics of the
+    specimen, such as whether it is a tumor or normal tissue
+  - `illuminationType`: specifies the type of illumination used when obtaining
+    the image
+
+The attributes `embeddingMedium`, `tissueFixative`, `staining_usingSubstance`,
+`primaryAnatomicStructure`, `primaryAnatomicStructureModifier` and
+`illuminationType` each exist as two columns: the suffix
+`_code_designator_value_str` marks the column containing CodeSchemeDesignator
+and CodeValue, the suffix `_CodeMeaning` the one containing CodeMeaning. If this
+is new to you, a brief explanation of the three-value coding scheme in DICOM can
+be found at https://learn.canceridc.dev/dicom/coding-schemes.
+
+## `sm_instance_index`
+
+The following is the list of the columns included in `sm_instance_index`.
+`sm_instance_index` is instance-based, i.e., it has one row per DICOM instance
+(pyramid level of a slide, plus potentially thumbnail or label images), but only
+includes DICOM instances of the slide microscopy modality.
+
+- DICOM attributes extracted from the files:
+
+  - `SOPInstanceUID`: unique identifier of the DICOM instance: one DICOM
+    instance = one level/label/thumbnail image of the slide
+  - `SeriesInstanceUID`: unique identifier of the DICOM series: one DICOM series
+    = one slide
+  - `embeddingMedium`: describes in what medium the slide was embedded before
+    the image was obtained
+  - `tissueFixative`: describes tissue fixatives used before the image was
+    obtained
+  - `staining_usingSubstance`: describes staining steps the specimen underwent
+    before the image was obtained
+  - `max_TotalPixelMatrixColumns`: width of the image at the maximum resolution
+  - `max_TotalPixelMatrixRows`: height of the image at the maximum resolution
+  - `PixelSpacing_0`: pixel spacing in mm
+  - `ImageType`: list of further characteristics of the image; its third value
+    indicates whether the image is a VOLUME, LABEL, OVERVIEW or THUMBNAIL
+    image
+  - `TransferSyntaxUID`: specifies the encoding scheme used for the image data
+  - `instance_size`: specifies the DICOM instance's size in bytes
+
+- non-DICOM attributes assigned/curated by IDC:
+  - `crdc_instance_uuid`: globally unique, versioned identifier of the DICOM
+    instance
+
+The attributes `embeddingMedium`, `tissueFixative` and `staining_usingSubstance`
+each exist as two columns: the suffix `_code_designator_value_str` marks the
+column containing CodeSchemeDesignator and CodeValue, the suffix `_CodeMeaning`
+the one containing CodeMeaning. If this is new to you, a brief explanation of
+the three-value coding scheme in DICOM can be found at
+https://learn.canceridc.dev/dicom/coding-schemes.
diff --git a/docs/index.md b/docs/index.md
@@ -21,58 +21,14 @@ starting a discussion in [IDC User forum](https://discourse.canceridc.dev/).
 :start-after: <!-- SPHINX-START -->
 ```
 
-## The `index` of `idc-index`
-
-`idc-index` is named this way because it wraps index of IDC data: a table
-containing most important metadata attributes describing the files available in
-IDC. This metadata index is available in the `index` variable (which is a pandas
-`DataFrame`) of `IDCClient`.
-
-The following is the list of the columns included in `index`. You can use those
-to select cohorts and subsetting data. `idc-index` is series-based, i.e, it has
-one row per DICOM series.
-
-- non-DICOM attributes assigned/curated by IDC:
-
-  - `collection_id`: short string with the identifier of the collection the
-    series belongs to
-  - `analysis_result_id`: this string is not empty if the specific series is
-    part of an analysis results collection; analysis results can be added to a
-    given collection over time
-  - `source_DOI`: Digital Object Identifier of the dataset that contains the
-    given series; note that a given collection can include one or more DOIs,
-    since analysis results added to the collection would typically have
-    independent DOI values!
-  - `instanceCount`: number of files in the series (typically, this matches the
-    number of slices in cross-sectional modalities)
-  - `license_short_name`: short name of the license that governs the use of the
-    files corresponding to the series
-  - `series_aws_url`: location of the series files in a public AWS bucket
-  - `series_size_MB`: total disk size needed to store the series
-
-- DICOM attributes extracted from the files
-  - `PatientID`: identifier of the patient
-  - `PatientAge` and `PatientSex`: attributes containing patient age and sex
-  - `StudyInstanceUID`: unique identifier of the DICOM study
-  - `StudyDescription`: textual description of the study content
-  - `StudyDate`: date of the study (note that those dates are shifted, and are
-    not real dates when images were acquired, to protect patient privacy)
-  - `SeriesInstanceUID`: unique identifier of the DICOM series
-  - `SeriesDate`: date when the series was acquired
-  - `SeriesDescription`: textual description of the series content
-  - `SeriesNumber`: series number
-  - `BodyPartExamined`: body part imaged
-  - `Modality`: acquisition modality
-  - `Manufacturer`: manufacturer of the equipment that generated the series
-  - `ManufacturerModelName`: model name of the equipment
-
 ## Contents
 
 ```{toctree}
-:maxdepth: 1
+:maxdepth: 2
 :titlesonly:
 :caption: API docs
 
+column_descriptions
 api/idc_index
 ```
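
Note on the coded slide microscopy attributes documented in `docs/column_descriptions.md`: the suffix convention can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The helper `coded_column_names` and the tuple `SM_INDEX_CODED` are hypothetical names and are not part of `idc-index`; the sketch also assumes that each suffix is appended directly to the attribute name.

```python
# Suffixes as stated in the column descriptions.
CODE_SUFFIX = "_code_designator_value_str"  # CodeSchemeDesignator and CodeValue
MEANING_SUFFIX = "_CodeMeaning"             # CodeMeaning

# Attributes of `sm_index` that the description lists as coded.
SM_INDEX_CODED = (
    "embeddingMedium",
    "tissueFixative",
    "staining_usingSubstance",
    "primaryAnatomicStructure",
    "primaryAnatomicStructureModifier",
    "illuminationType",
)


def coded_column_names(attribute):
    """Return the two column names assumed to exist for a coded attribute.

    Hypothetical helper: it simply appends each suffix to the attribute name.
    """
    return {
        "code": attribute + CODE_SUFFIX,
        "meaning": attribute + MEANING_SUFFIX,
    }


if __name__ == "__main__":
    for name in SM_INDEX_CODED:
        print(coded_column_names(name))
```

The resulting names could then be used to pick the matching pair of columns from the `sm_index` table, for example to show the code next to its CodeMeaning.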