Add dataset creation guide #324
Conversation
Looks really nice, thank you! I see two potential workflows: a) would certainly be easier, but I tend to favor b) because we could make sure things are uniform and work as intended when people submit their code via PRs. I don't think a hybrid "upload-yourself-but-also-commit-code" approach would work well, as we'd have no guarantee the uploaded code actually matches the one in the repo (and since people tend to do lots of fixes, the two would probably diverge quickly). What do you think? If b) sounds good, maybe we'll adjust the guide a bit to have one section for "data admins" (i.e. the section that's already there) and one section for "I have a dataset that I want to submit" that details the steps to submit the dataset upload code.
What if we provide some kind of pre-commit hook that runs all the necessary checks, which users would need to run on their side before creating the PR with a dataset? Basically just like the one we have right now in the Open Assistant repo.
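Something along these lines, for example: a minimal sanity check (purely a sketch, not an existing script in the repo) that just tries to load the dataset from its loading script and reports the splits and features:

```python
# check_dataset.py -- hypothetical pre-PR sanity check; names are illustrative
import sys

import datasets


def check(script_path: str) -> None:
    # Load the dataset from its loading script and inspect every split
    ds = datasets.load_dataset(script_path)
    for split_name, split in ds.items():
        assert len(split) > 0, f"split '{split_name}' is empty"
        print(f"{split_name}: {len(split)} examples, features: {split.features}")


if __name__ == "__main__":
    check(sys.argv[1])
```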
OK, I like option (b) too. I think a clean way to do this is to have a folder structure like:
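For instance (a purely illustrative sketch; folder and file names are placeholders):

```
openassistant/
└── datasets/
    ├── dataset_a/
    │   └── dataset_a.py   # loading / upload script for dataset_a
    └── dataset_b/
        └── dataset_b.py
```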
Here each folder would contain the loading script for the corresponding dataset.
We can certainly do this, although we probably won't be able to check the heavy stuff like scraping / downloading large corpora.
sounds great, thank you for working this out 👍
@@ -0,0 +1,28 @@
---
license: mit
tags:
Any other tags we should add here for discoverability / filtering?
@@ -0,0 +1,28 @@
---
license: mit
Just picked a permissive license at random
subset_id: str = None

lm_features = datasets.Features(
These "features" are the schema associated with each dataset. I need to check out data_schemas.md
to integrate them here as well.
For language modelling, I just picked the schema from The Pile (happy to change it)
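Presumably that definition continues with something along these lines, i.e. a Pile-style text-plus-metadata layout; the exact field names here are only a guess:

```python
import datasets

# Sketch of a Pile-style language-modelling schema (field names are assumptions)
lm_features = datasets.Features(
    {
        "text": datasets.Value("string"),              # raw document text
        "meta": {"source": datasets.Value("string")},  # provenance metadata
    }
)
```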
Alright, I revamped this PR significantly to allow us to have some standardisation going forward:
I could take a stab at adding a simple dataset as an example if someone can point me to one :)
Amazing, thank you so much. I think all your choices make total sense.
I'll merge this bc lots of people are looking for it, we can adjust as we go :)
This PR adds a small guide on how to upload datasets to the OpenAssistant org on the Hugging Face Hub.
To simplify the process, I've created a Space (link) that creates a template loading script that people can edit. Note that although there are easier ways to add data to the Hub (e.g. by dumping JSON or CSV files), this approach guarantees the loading scripts can also run on S3 if needed later on.
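For readers unfamiliar with Hub loading scripts, this is roughly the shape such a template takes; the class name, schema, and data path below are placeholders rather than what the Space actually generates:

```python
import json

import datasets


class ExampleDataset(datasets.GeneratorBasedBuilder):
    """Skeleton of a Hub loading script; all names here are placeholders."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            description="Example dataset for the OpenAssistant org.",
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "meta": {"source": datasets.Value("string")},
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # download_and_extract handles both local files and remote URLs
        path = dl_manager.download_and_extract("data/train.jsonl")  # placeholder path
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": path},
            )
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                record = json.loads(line)
                yield idx, {
                    "text": record["text"],
                    "meta": {"source": record.get("source", "")},
                }
```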
For now, I haven't added any instructions about the column formats for specific tasks like instruction-fine tuning, but that can be done in a second step once there's some agreement.
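Just to make the idea concrete, one possible shape for such columns (purely illustrative; nothing here is agreed on):

```python
import datasets

# Hypothetical instruction-fine-tuning schema -- column names are not settled
instruction_features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "response": datasets.Value("string"),
        "source": datasets.Value("string"),
    }
)
```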
Closes #305 #165