Add support for Croissant RecordSet objects #341

amercader · 2025-02-20T11:53:15Z

From the new docs:

> for resources that have been imported to the CKAN DataStore, the resource will also expose Croissant's RecordSet objects with information about the data fields (e.g. column names and types).

DataStore-imported tables are the closest thing CKAN currently has to a source for identifying the structure of the hosted data files, and this information adds real value for integration with ML processes and pipelines.

To test this, the DataStore needs to be configured and some data imported:

- Set up the DataStore
- Get some data, either by enabling and running the Xloader extension or by pushing data directly with a script like the following:

```
curl -X POST \
  'http://CHANGE_ME/api/3/action/datastore_create' \
  -H "Authorization: CHANGE_ME" \
  -H 'Content-Type: application/json' \
  -d '{
    "resource": {
      "package_id": "CHANGE_ME",
      "name": "Air quality",
      "description": "Test data"
    },
    "fields": [
      {
        "id": "measurement_id",
        "type": "text",
        "primary_key": true
      },
      {
        "id": "timestamp",
        "type": "timestamp"
      },
      {
        "id": "location_name",
        "type": "text"
      },
      {
        "id": "temperature_celsius",
        "type": "float"
      },
      {
        "id": "humidity_percent",
        "type": "float"
      },
      {
        "id": "pm25_concentration",
        "type": "float"
      }
    ],
    "records": [
      {
        "measurement_id": "AQ001",
        "timestamp": "2025-02-20T09:30:00",
        "location_name": "Downtown Station",
        "temperature_celsius": 22.5,
        "humidity_percent": 45.3,
        "pm25_concentration": 12.8
      },
      {
        "measurement_id": "AQ002",
        "timestamp": "2025-02-20T10:15:00",
        "location_name": "Industrial Zone",
        "temperature_celsius": 24.1,
        "humidity_percent": 42.7,
        "pm25_concentration": 18.5
      },
      {
        "measurement_id": "AQ003",
        "timestamp": "2025-02-20T11:00:00",
        "location_name": "Residential Area",
        "temperature_celsius": 21.8,
        "humidity_percent": 48.2,
        "pm25_concentration": 8.3
      }
    ],
    "primary_key": ["measurement_id"],
    "indexes": ["timestamp", "location_name"],
    "force": true
}'
```

For resources that are on the DataStore, we can use `datastore_info` to get a list of the fields present in the data, and expose those as a `RecordSet`. Data types are also included (although these would all be `text` for standard Xloader imports), as well as primary keys.
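As a rough illustration of the idea (not the code in this PR), the sketch below turns a list of DataStore fields into a Croissant-style `RecordSet` dict. The function name `build_record_set`, the `EXAMPLE_TYPES` table, the `cr:`/`sc:` prefixes, the field `@id` pattern and the input shape (a list of dicts with `id` and `type`) are all assumptions made for the example. The `<resource id>/records` identifier follows the pattern visible in the validator output in this thread.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical mapping that only covers the types used in the example payload.
# "timestamp" is mapped to sc:Date, matching the commit in this PR that
# reverts SCHEMA.DateTime to SCHEMA.Date.
EXAMPLE_TYPES = {
    "text": "sc:Text",
    "float": "sc:Float",
    "float8": "sc:Float",
    "timestamp": "sc:Date",
}


def build_record_set(
    resource_id: str,
    fields: List[Dict[str, Any]],
    primary_key: Iterable[str] = (),
) -> Dict[str, Any]:
    """Build a RecordSet-like dict from DataStore field descriptions."""
    record_set_id = f"{resource_id}/records"
    croissant_fields = []
    for field in fields:
        # Assumption: columns starting with "_" (e.g. _id) are internal.
        if field["id"].startswith("_"):
            continue
        entry: Dict[str, Any] = {
            "@type": "cr:Field",
            "@id": f"{record_set_id}/{field['id']}",
            "source": {
                "fileObject": {"@id": resource_id},
                "extract": {"column": field["id"]},
            },
        }
        data_type = EXAMPLE_TYPES.get(field.get("type", ""))
        if data_type:
            entry["dataType"] = data_type
        croissant_fields.append(entry)

    record_set: Dict[str, Any] = {
        "@type": "cr:RecordSet",
        "@id": record_set_id,
        "field": croissant_fields,
    }
    keys = [{"@id": f"{record_set_id}/{name}"} for name in primary_key]
    if keys:
        record_set["key"] = keys
    return record_set
```

Called with the fields from the air quality example and `primary_key=["measurement_id"]`, this yields one `cr:Field` per column plus a `key` entry pointing at the `measurement_id` field.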

Currently failing because SCHEMA.DateTime is not an expected type??

It sounds really counterintuitive, but the validator doesn't like fields with a dataType of DateTime:

```
[Metadata(Test Croissant dataset) > RecordSet(568b8ac9-8c69-4475-b35e-d7f812a63c32/records) > Field()] The field does not specify a valid http://mlcommons.org/croissant/dataType, neither does any of its predecessor. Got: [rdflib.term.URIRef('https://schema.org/DateTime')]
```

Looks like the Date / DateTime situation in schema.org is a bit confusing: schemaorg/schemaorg#1748
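One way to work around the validator error is to map DataStore timestamp columns to `sc:Date` rather than `sc:DateTime`, and to return nothing for unrecognised types instead of raising. This is in the spirit of the commits "Revert SCHEMA.DateTime to SCHEMA.Date" and "Don't fail if field type unknown, handle extra field types". The sketch below is only an illustration: the table `DATASTORE_TO_CROISSANT`, the helper `croissant_data_type` and the exact set of DataStore type names are assumptions, not the mapping used by the extension.

```python
from typing import Optional

# Hypothetical mapping from DataStore / PostgreSQL type names to the
# schema.org types accepted as Croissant dataType values.
DATASTORE_TO_CROISSANT = {
    "text": "sc:Text",
    "int": "sc:Integer",
    "int4": "sc:Integer",
    "int8": "sc:Integer",
    "float": "sc:Float",
    "float8": "sc:Float",
    "numeric": "sc:Float",
    "bool": "sc:Boolean",
    "date": "sc:Date",
    # sc:DateTime is rejected by the validator, so fall back to sc:Date.
    "timestamp": "sc:Date",
}


def croissant_data_type(datastore_type: Optional[str]) -> Optional[str]:
    """Return a Croissant dataType, or None when the type is not recognised."""
    if not datastore_type:
        return None
    return DATASTORE_TO_CROISSANT.get(datastore_type.lower())
```

Returning `None` leaves it to the caller to decide whether to omit `dataType` or skip the field, so an unexpected column type does not break serialization of the whole dataset.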

amercader added 12 commits February 13, 2025 21:27

Merge branch 'croissant' into croissant-recordset

8b6ebed

Better is_live_dataset check

d0cffbd

Add serialize test for recordset logic

8bac82f

Add validation test for recordset

d0c9bfe

Currently failing because SCHEMA.DateTime is not an expected type??

Test float fields

5fef19c

Merge branch 'croissant' into croissant-recordset

4a7eb18

Don't fail if field type unknown, handle extra field types

0291008

Merge branch 'master' into croissant-recordset

8a964fc

Revert SCHEMA.DateTime to SCHEMA.Date

21e6b44

Document RecordSet support

84312ef

amercader mentioned this pull request Feb 20, 2025

Support for Croissant ML #339

Merged

4 tasks

amercader added 3 commits February 24, 2025 12:31

Update croissant profile per @Reikyo suggestions

9199d3e

Update serialization test after profile changes

86eff8c

Use mimetype for encodingFormat if present

7c1d810

amercader merged commit a624008 into master Feb 25, 2025
8 checks passed

amercader deleted the croissant-recordset branch February 25, 2025 08:45
