
Add dataset creation guide #324

Merged: 8 commits into LAION-AI:main on Jan 4, 2023

Conversation

@lewtun (Collaborator) commented Jan 3, 2023

This PR adds a small guide on how to upload datasets to the OpenAssistant org on the Hugging Face Hub.

To simplify the process, I've created a Space (link) that creates a template loading script that people can edit. Note that although there are easier ways to add data to the Hub (e.g. by dumping JSON or CSV files), this approach guarantees the loading scripts can also run on S3 if needed later on.

For now, I haven't added any instructions about the column formats for specific tasks like instruction fine-tuning, but that can be done in a second step once there's some agreement.
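
For reference, a minimal loading script in the style the Space generates could look roughly like the sketch below (illustrative only; the class name, URL, and fields are placeholders, not the actual template):

```python
import json

import datasets

_DESCRIPTION = "Example dataset for the OpenAssistant org."
_URL = "https://example.com/data.jsonl"  # placeholder, not a real data source


class ExampleDataset(datasets.GeneratorBasedBuilder):
    """Template-style loading script; all names here are placeholders."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        # Declare the schema so downstream users know what columns to expect.
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "meta": {"source": datasets.Value("string")},
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # dl_manager handles downloading/caching, which is what keeps the
        # script portable to other storage backends later on.
        path = dl_manager.download_and_extract(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": path},
            )
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for key, line in enumerate(f):
                record = json.loads(line)
                yield key, {
                    "text": record["text"],
                    "meta": {"source": record.get("source", "")},
                }
```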

Closes #305 #165

@lewtun lewtun requested review from yk and andreaskoepf as code owners January 3, 2023 10:39
@yk (Collaborator) commented Jan 3, 2023

Looks really nice, thank you!

I see two potential workflows:
a) people do this largely by themselves
b) people submit code to scrape/parse/clean/upload and "we" run them

a) would certainly be easier, but I tend to favor b) because we could make sure things are uniform and work as intended when people submit their code via PRs. I don't think a hybrid "upload-yourself-but-also-commit-code" would work well, as we'll have no guarantee the code to upload actually matches the one in the repo (and since people tend to do lots of fixes, it probably will be very different quickly).

What do you think? If b) sounds good, maybe we'll adjust the guide a bit to have one section for "data admins" (i.e. the section that's already there) and one section for "I have a dataset that I want to submit" that details the steps to submit the DS upload code.

Also, please run pre-commit run --all-files.

@onegunsamurai (Contributor) commented:
What if we provide some kind of pre-commit hook that runs all the necessary checks, which users would need to run on their side before creating the PR with a dataset? Basically just like the one we have right now in the Open Assistant repo.

@lewtun (Collaborator, Author) commented Jan 3, 2023

OK, I like option (b) too. I think a clean way to do this is to have a folder structure like:

datasets
├── README.md
└── dataset_1
    ├── README.md
    ├── load.py
    ├── preprocess.py
    ├── push_to_hub.py
    ├── requirements.txt
    └── scrape.py

Here each dataset_i folder has the same set of scripts and we can provide a simple template for people to copy-paste. I'll amend the PR in this direction :)
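
For illustration, the push_to_hub.py in one of these folders could be as small as the following sketch (the data path and repo id are placeholders, not part of this PR):

```python
# push_to_hub.py -- hypothetical sketch; assumes preprocess.py has already
# written the cleaned data to a local JSONL file.
from datasets import load_dataset


def main():
    ds = load_dataset("json", data_files="data/train.jsonl", split="train")
    # Repo id is a placeholder for the target repo in the OpenAssistant org.
    ds.push_to_hub("OpenAssistant/dataset_1")


if __name__ == "__main__":
    main()
```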

@lewtun (Collaborator, Author) commented Jan 3, 2023

What if we provide some kind of pre-commit hook that runs all the necessary checks, which users would need to run on their side before creating the PR with a dataset? Basically just like the one we have right now in the Open Assistant repo.

We can certainly do this, although we probably won't be able to check the heavy stuff like scraping / downloading large corpora.
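
As a rough idea, a lightweight check along these lines could run in pre-commit without touching any remote data (purely hypothetical; file names follow the folder layout proposed above):

```python
# check_datasets.py -- hypothetical structural check for dataset folders;
# it only inspects the filesystem, so no scraping or downloads are needed.
import sys
from pathlib import Path

# Assumed minimum contents per dataset folder; adjust as the template evolves.
REQUIRED_FILES = {"README.md", "requirements.txt"}


def main() -> int:
    errors = []
    for folder in Path("datasets").iterdir():
        if not folder.is_dir():
            continue
        missing = REQUIRED_FILES - {p.name for p in folder.iterdir()}
        if missing:
            errors.append(f"{folder}: missing {sorted(missing)}")
    for err in errors:
        print(err, file=sys.stderr)
    return 1 if errors else 0


if __name__ == "__main__":
    raise SystemExit(main())
```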

@yk (Collaborator) commented Jan 3, 2023

sounds great, thank you for working this out 👍

@@ -0,0 +1,28 @@
---
license: mit
tags:
@lewtun (Collaborator, Author) commented:

Any other tags we should add here for discoverability / filtering?

@@ -0,0 +1,28 @@
---
license: mit
@lewtun (Collaborator, Author) commented Jan 4, 2023:

Just picked a permissive license at random

subset_id: str = None


lm_features = datasets.Features(
@lewtun (Collaborator, Author) commented:

These "features" are the schema associated with each dataset. I need to check out data_schemas.md to integrate them here as well.

For language modelling, I just picked the schema from The Pile (happy to change it)
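
For context, a Pile-style schema amounts to something along these lines (illustrative only; the exact field names in the template may differ):

```python
import datasets

# A text column plus a nested metadata dict for provenance information.
lm_features = datasets.Features(
    {
        "text": datasets.Value("string"),
        "meta": {"source": datasets.Value("string")},
    }
)
```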

@lewtun (Collaborator, Author) commented Jan 4, 2023

Alright, I revamped this PR significantly to allow us to have some standardisation going forward:

  • Created an openassistant Python "lib" where we can store all the dataset loading scripts + templates (I guess at some point it could also contain the modelling code?)
  • Expanded the guide for people to commit just the loading scripts
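
To make the first point concrete, the package could end up laid out something like this (names are placeholders, not necessarily what this PR contains):

```
openassistant
├── __init__.py
├── datasets
│   ├── __init__.py
│   └── dataset_1
│       ├── __init__.py
│       └── dataset_1.py      # loading script
└── templates
    └── loading_template.py
```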

I could take a stab at adding a simple dataset as an example if someone can point me to one :)

@yk (Collaborator) left a comment

Amazing, thank you so much. I think all your choices make total sense.

I'll merge this because lots of people are looking for it; we can adjust as we go :)

@yk yk merged commit 099d035 into LAION-AI:main Jan 4, 2023
@lewtun lewtun deleted the add-datasets-docs branch January 4, 2023 22:33

Successfully merging this pull request may close these issues.

Where to contribute datasets
3 participants