Add support for Croissant RecordSet objects #341

Merged: 15 commits, Feb 25, 2025

amercader
Member

From the new docs:

> for resources that have been imported to the CKAN DataStore, the resource will also expose Croissant's RecordSet objects with information about the data fields (e.g. column names and types).

DataStore-imported tables are the closest thing CKAN currently has to a source for identifying the structure of the hosted data files, and this information adds real value for integration with ML processes and pipelines.


To test this, the DataStore needs to be configured and some data imported:

  1. Set up the DataStore
  2. Get some data, either by enabling and running the Xloader extension or by pushing data directly with a script like the following:
```
curl -X POST \
  'http://CHANGE_ME/api/3/action/datastore_create' \
  -H "Authorization: CHANGE_ME" \
  -H 'Content-Type: application/json' \
  -d '{
    "resource": {
      "package_id": "CHANGE_ME",
      "name": "Air quality",
      "description": "Test data"
    },
    "fields": [
      {
        "id": "measurement_id",
        "type": "text",
        "primary_key": true
      },
      {
        "id": "timestamp",
        "type": "timestamp"
      },
      {
        "id": "location_name",
        "type": "text"
      },
      {
        "id": "temperature_celsius",
        "type": "float"
      },
      {
        "id": "humidity_percent",
        "type": "float"
      },
      {
        "id": "pm25_concentration",
        "type": "float"
      }
    ],
    "records": [
      {
        "measurement_id": "AQ001",
        "timestamp": "2025-02-20T09:30:00",
        "location_name": "Downtown Station",
        "temperature_celsius": 22.5,
        "humidity_percent": 45.3,
        "pm25_concentration": 12.8
      },
      {
        "measurement_id": "AQ002",
        "timestamp": "2025-02-20T10:15:00",
        "location_name": "Industrial Zone",
        "temperature_celsius": 24.1,
        "humidity_percent": 42.7,
        "pm25_concentration": 18.5
      },
      {
        "measurement_id": "AQ003",
        "timestamp": "2025-02-20T11:00:00",
        "location_name": "Residential Area",
        "temperature_celsius": 21.8,
        "humidity_percent": 48.2,
        "pm25_concentration": 8.3
      }
    ],
    "primary_key": ["measurement_id"],
    "indexes": ["timestamp", "location_name"],
    "force": true
}'
```
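
For anyone who prefers scripting the import, the same call can be sketched in Python using only the standard library. This is a trimmed illustration, not part of the PR: the helper name `build_datastore_create_request` is hypothetical, the payload keeps only a few of the fields and one record from the curl example, and the request is built but not sent so it can be inspected first.

```python
import json
import urllib.request


def build_datastore_create_request(ckan_url, api_token, package_id):
    """Build (but do not send) a POST request for the datastore_create action.

    Hypothetical helper: the payload is a shortened version of the curl
    example, with the same placeholder-style arguments.
    """
    payload = {
        "resource": {
            "package_id": package_id,
            "name": "Air quality",
            "description": "Test data",
        },
        "fields": [
            {"id": "measurement_id", "type": "text"},
            {"id": "timestamp", "type": "timestamp"},
            {"id": "temperature_celsius", "type": "float"},
        ],
        "records": [
            {
                "measurement_id": "AQ001",
                "timestamp": "2025-02-20T09:30:00",
                "temperature_celsius": 22.5,
            },
        ],
        "primary_key": ["measurement_id"],
        "force": True,
    }
    return urllib.request.Request(
        f"{ckan_url.rstrip('/')}/api/3/action/datastore_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )


# To actually send it against a running CKAN site (replace the placeholders):
# request = build_datastore_create_request("http://CHANGE_ME", "CHANGE_ME", "CHANGE_ME")
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["success"])
```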

For resources that are on the DataStore, we can use `datastore_info`
to get a list of the fields present in the data, and expose those as a
`RecordSet`.
Data types are also included (although these will all be `text` for
standard xloader imports), as well as primary keys.
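
As a rough illustration of the idea (not the code in this PR), a list of DataStore fields could be turned into a RecordSet-shaped dictionary along these lines. Several things here are assumptions: the function name, the input shape (a list of `{"id", "type"}` dicts, as one might extract from a `datastore_info` response), the `TYPE_MAP` type mapping, the field `@id` pattern and the way `key` is expressed. Only the `<resource id>/records` identifier pattern is taken from the validator output quoted in this thread. Types missing from the mapping, including `timestamp`, fall back to `sc:Text` in this sketch.

```python
# Hypothetical mapping from DataStore column types to Croissant data types.
TYPE_MAP = {
    "text": "sc:Text",
    "int": "sc:Integer",
    "float": "sc:Float",
    "bool": "sc:Boolean",
}


def record_set_from_fields(resource_id, fields, primary_key=()):
    """Return a RecordSet-like dict describing the given DataStore fields."""
    record_set_id = f"{resource_id}/records"
    record_set = {
        "@type": "cr:RecordSet",
        "@id": record_set_id,
        "field": [
            {
                "@type": "cr:Field",
                "@id": f"{record_set_id}/{field['id']}",
                # Unknown types fall back to text in this sketch.
                "dataType": TYPE_MAP.get(field["type"], "sc:Text"),
            }
            for field in fields
            # Skip columns assumed to be internal, such as _id.
            if not field["id"].startswith("_")
        ],
    }
    if primary_key:
        record_set["key"] = [
            {"@id": f"{record_set_id}/{name}"} for name in primary_key
        ]
    return record_set


example = record_set_from_fields(
    "568b8ac9-8c69-4475-b35e-d7f812a63c32",
    [
        {"id": "_id", "type": "int"},
        {"id": "measurement_id", "type": "text"},
        {"id": "pm25_concentration", "type": "float"},
    ],
    primary_key=["measurement_id"],
)
```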
Currently failing because SCHEMA.DateTime is not an expected type??
It sounds really counterintuitive, but the validator doesn't like fields
with a dataType of DateTime:

```
[Metadata(Test Croissant dataset) >
RecordSet(568b8ac9-8c69-4475-b35e-d7f812a63c32/records) > Field()] The
field does not specify a valid http://mlcommons.org/croissant/dataType,
neither does any of its predecessor. Got:
[rdflib.term.URIRef('https://schema.org/DateTime')]
```

Looks like the Date / DateTime situation in schema.org is a bit
confusing:

schemaorg/schemaorg#1748
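
One possible workaround, sketched here purely as an assumption (the thread does not say how the failure was resolved), is to fall back to a type the validator does accept whenever the preferred one is rejected. The function name and the contents of `accepted_by_validator` are illustrative only and do not reflect the validator's real list of supported types.

```python
SCHEMA = "https://schema.org/"


def choose_data_type(preferred, accepted, fallback=SCHEMA + "Text"):
    """Return `preferred` if it is in `accepted`, otherwise `fallback`."""
    return preferred if preferred in accepted else fallback


# Illustrative set only; the real validator defines its own accepted types.
accepted_by_validator = {SCHEMA + "Text", SCHEMA + "Float"}

timestamp_type = choose_data_type(SCHEMA + "DateTime", accepted_by_validator)
float_type = choose_data_type(SCHEMA + "Float", accepted_by_validator)
```

The trade-off is that a fallback loses type information for consumers, so it is best treated as a stopgap until the preferred type validates.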
amercader mentioned this pull request on Feb 20, 2025
amercader merged commit a624008 into master on Feb 25, 2025 (8 checks passed)
amercader deleted the croissant-recordset branch on February 25, 2025 08:45