
Add dataset creation guide #324

Merged: 8 commits into LAION-AI:main on Jan 4, 2023

Conversation

@lewtun (Collaborator) commented Jan 3, 2023

This PR adds a small guide on how to upload datasets to the OpenAssistant org on the Hugging Face Hub.

To simplify the process, I've created a Space (link) that creates a template loading script that people can edit. Note that although there are easier ways to add data to the Hub (e.g. by dumping JSON or CSV files), this approach guarantees the loading scripts can also run on S3 if needed later on.

For now, I haven't added any instructions about the column formats for specific tasks like instruction fine-tuning, but that can be done in a second step once there's some agreement.
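
For reference, a minimal loading script in the style the Space generates could look roughly like the sketch below (illustrative only; the class name, URL, and fields are placeholders, not the actual template):

```python
import json

import datasets

_DESCRIPTION = "Example dataset for the OpenAssistant org."
_URL = "https://example.com/data.jsonl"  # placeholder, not a real data source


class ExampleDataset(datasets.GeneratorBasedBuilder):
    """Template-style loading script; all names here are placeholders."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        # Declare the schema so downstream users know what columns to expect.
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "meta": {"source": datasets.Value("string")},
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # dl_manager handles downloading/caching, which is what keeps the
        # script portable to other storage backends later on.
        path = dl_manager.download_and_extract(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": path},
            )
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for key, line in enumerate(f):
                record = json.loads(line)
                yield key, {
                    "text": record["text"],
                    "meta": {"source": record.get("source", "")},
                }
```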

Closes #305 #165

@lewtun lewtun requested review from yk and andreaskoepf as code owners January 3, 2023 10:39
@yk (Collaborator) commented Jan 3, 2023

Looks really nice, thank you!

I see two potential workflows:
a) people do this largely by themselves
b) people submit code to scrape/parse/clean/upload and "we" run them

a) would certainly be easier, but I tend to favor b) because we could make sure things are uniform and work as intended when people submit their code via PRs. I don't think a hybrid "upload-yourself-but-also-commit-code" would work well, as we'll have no guarantee the code to upload actually matches the one in the repo (and since people tend to do lots of fixes, it probably will be very different quickly).

What do you think? If b) sounds good, maybe we'll adjust the guide a bit to have one section for "data admins" (i.e. the section that's already there) and one section for "I have a dataset that I want to submit" that details the steps to submit the DS upload code.

Also, please run pre-commit run --all-files.

@onegunsamurai (Contributor) commented:
What if we provide some kind of pre-commit hook that runs all the necessary checks, which users would need to run on their side before creating the PR with a dataset? Basically just like the one we have right now in the Open Assistant repo.

@lewtun (Collaborator, Author) commented Jan 3, 2023

OK, I like option (b) too. I think a clean way to do this is to have a folder structure like:

datasets
├── README.md
└── dataset_1
    ├── README.md
    ├── load.py
    ├── preprocess.py
    ├── push_to_hub.py
    ├── requirements.txt
    └── scrape.py

Here each dataset_i folder has the same set of scripts and we can provide a simple template for people to copy-paste. I'll amend the PR in this direction :)
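
For illustration, the push_to_hub.py in one of these folders could be as small as the following sketch (the data path and repo id are placeholders, not part of this PR):

```python
# push_to_hub.py -- hypothetical sketch; assumes preprocess.py has already
# written the cleaned data to a local JSONL file.
from datasets import load_dataset


def main():
    ds = load_dataset("json", data_files="data/train.jsonl", split="train")
    # Repo id is a placeholder for the target repo in the OpenAssistant org.
    ds.push_to_hub("OpenAssistant/dataset_1")


if __name__ == "__main__":
    main()
```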

@lewtun (Collaborator, Author) commented Jan 3, 2023

What if we provide some kind of pre-commit hook that runs all the necessary checks, which users would need to run on their side before creating the PR with a dataset? Basically just like the one we have right now in the Open Assistant repo.

We can certainly do this, although we probably won't be able to check the heavy stuff like scraping / downloading large corpora.
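
As a rough idea, a lightweight check along these lines could run in pre-commit without touching any remote data (purely hypothetical; file names follow the folder layout proposed above):

```python
# check_datasets.py -- hypothetical structural check for dataset folders;
# it only inspects the filesystem, so no scraping or downloads are needed.
import sys
from pathlib import Path

# Assumed minimum contents per dataset folder; adjust as the template evolves.
REQUIRED_FILES = {"README.md", "requirements.txt"}


def main() -> int:
    errors = []
    for folder in Path("datasets").iterdir():
        if not folder.is_dir():
            continue
        missing = REQUIRED_FILES - {p.name for p in folder.iterdir()}
        if missing:
            errors.append(f"{folder}: missing {sorted(missing)}")
    for err in errors:
        print(err, file=sys.stderr)
    return 1 if errors else 0


if __name__ == "__main__":
    raise SystemExit(main())
```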

@yk (Collaborator) commented Jan 3, 2023

sounds great, thank you for working this out 👍

@@ -0,0 +1,28 @@
---
license: mit
tags:
@lewtun (Collaborator, Author) commented:

Any other tags we should add here for discoverability / filtering?

@@ -0,0 +1,28 @@
---
license: mit
@lewtun (Collaborator, Author) commented Jan 4, 2023:

Just picked a permissive license at random

subset_id: str = None


lm_features = datasets.Features(
@lewtun (Collaborator, Author) commented:

These "features" are the schema associated with each dataset. I need to check out data_schemas.md to integrate them here as well.

For language modelling, I just picked the schema from The Pile (happy to change it)
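
For context, a Pile-style schema amounts to something along these lines (illustrative only; the exact field names in the template may differ):

```python
import datasets

# A text column plus a nested metadata dict for provenance information.
lm_features = datasets.Features(
    {
        "text": datasets.Value("string"),
        "meta": {"source": datasets.Value("string")},
    }
)
```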

@lewtun (Collaborator, Author) commented Jan 4, 2023

Alright, I revamped this PR significantly to allow us to have some standardisation going forward:

  • Created an openassistant Python "lib" where we can store all the dataset loading scripts + templates (I guess at some point it could also contain the modelling code?)
  • Expanded the guide for people to commit just the loading scripts
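
To make the first point concrete, the package could end up laid out something like this (names are placeholders, not necessarily what this PR contains):

```
openassistant
├── __init__.py
├── datasets
│   ├── __init__.py
│   └── dataset_1
│       ├── __init__.py
│       └── dataset_1.py      # loading script
└── templates
    └── loading_template.py
```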

I could take a stab at adding a simple dataset as an example if someone can point me to one :)

@yk (Collaborator) left a comment

Amazing, thank you so much. I think all your choices make total sense.

I'll merge this because lots of people are looking for it; we can adjust as we go :)

@yk yk merged commit 099d035 into LAION-AI:main Jan 4, 2023
@lewtun lewtun deleted the add-datasets-docs branch January 4, 2023 22:33

Successfully merging this pull request may close these issues.

Where to contribute datasets
3 participants