Hugging Face hub integration: push and load from the hub #87
Conversation
That's really cool – if any help is needed, @Wauplin, who's the maintainer of the `huggingface_hub` library, can assist.
Thanks for the support! I'll ask if needed, but hopefully that won't be necessary :)
Thanks @julien-c for the ping 🤗

@Natooz I quickly reviewed the draft PR and I think it can be drastically simplified by removing most of the custom Hub logic. TL;DR: you can inherit from `ModelHubMixin` directly:

```python
import json
from abc import ABC
from pathlib import Path
from typing import Dict, Optional, Union

from huggingface_hub import ModelHubMixin, hf_hub_download

from .constants import CURRENT_VERSION_PACKAGE


class MIDITokenizer(ABC, ModelHubMixin):
    ...  # keep all the existing logic from MIDITokenizer

    # and then add `_from_pretrained` and `_save_pretrained`
    @classmethod
    def _from_pretrained(
        cls,
        *,
        model_id: str,
        revision: Optional[str],
        cache_dir: Optional[Union[str, Path]],
        force_download: bool,
        proxies: Optional[Dict],
        resume_download: bool,
        local_files_only: bool,
        token: Optional[Union[str, bool]],
        **model_kwargs,
    ) -> "MIDITokenizer":
        params_path = hf_hub_download(
            repo_id=model_id,
            filename="tokenizer.conf",
            revision=revision,
            cache_dir=cache_dir,
            force_download=force_download,
            proxies=proxies,
            resume_download=resume_download,
            local_files_only=local_files_only,
            token=token,
            library_name="MidiTok",
            library_version=CURRENT_VERSION_PACKAGE,
        )
        # TODO: adapt this. I assumed from the code this is how to load a conf,
        # but can't tell for sure.
        with open(params_path, "r") as f:
            return cls(params=json.load(f))

    def _save_pretrained(self, save_directory: Path) -> None:
        """
        Overwrite this method in subclass to define how to save your model.
        Check out our [integration guide](../guides/integrations) for instructions.

        Args:
            save_directory (`str` or `Path`):
                Path to directory in which the model weights and configuration
                will be saved.
        """
        # TODO: would be nice to also save a Model Card (README.md file)
        # + any other useful info
        self.save_params(save_directory / "tokenizer.conf")
```

As a result, every subclass of `MIDITokenizer` gets `from_pretrained`, `save_pretrained`, and `push_to_hub` for free:

```python
>>> class REMI(MIDITokenizer):  # no changes to this class
>>>     ...

>>> # Load from Hub
>>> remi = REMI.from_pretrained("Natooz/MidiTok-tests")

>>> # (optional) Save on local machine
>>> remi.save_pretrained("path/to/local/directory")

>>> # (optional) Push to Hub
>>> remi.push_to_hub("Natooz/MidiTok-tests-v1")
```

And pretty cool library btw! Looking forward to seeing the final integration! 🚀
Hi @Wauplin, Indeed the …
I would advise to add `huggingface_hub` as a dependency. If you still decide not to require it, you could fall back to a dummy mixin that fails loudly:

```python
# I let you define a better error message :D
class DummyModelHubMixin:
    def from_pretrained(self, *args, **kwargs):
        raise RuntimeError("Please install huggingface_hub")

    def save_pretrained(self, *args, **kwargs):
        raise RuntimeError("Please install huggingface_hub")

    def push_to_hub(self, *args, **kwargs):
        raise RuntimeError("Please install huggingface_hub")
```

What do you think? :)
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main      #87      +/-   ##
==========================================
- Coverage   90.41%   90.34%   -0.07%
==========================================
  Files          31       33       +2
  Lines        4536     4577      +41
==========================================
+ Hits         4101     4135      +34
- Misses        435      442       +7
==========================================
```

☔ View full report in Codecov by Sentry.
Hi @Wauplin, I ended up directly subclassing `ModelHubMixin` …

Also after this, I subclassed …

Thank you again for everything! I'll leave the PR open a few days, feel free to review if you want to!
Awesome! Great job @Natooz, I love this integration and how well documented it is 👍 I left 2 minor comments, but otherwise it looks good to me.
To answer your comments/feedback:
At first I tried to use a dummy mixin in case the package is not installed, but the docstrings and type hints in my IDE (PyCharm) only showed those of the dummy class instead of `ModelHubMixin`, so I felt it would be preferable to add the dependency anyway and get the proper hints in any case.
Ah yep, I hadn't thought about type hints and the like in my dummy example. Better like this then! :)
`ModelHubMixin` is specifically for models; I don't know how third-party libraries use it, but maybe a "lighter" or more universal version, from which `ModelHubMixin` could inherit, would fit some usages better? But really no big deal, everything works fine in any case!
Yes, interesting to know! Actually, the reason why we look for a `config.json` file by default is that this file is automatically used by the server to count the number of downloads of a model. This means that if you have a `config.json` in your repo, the "download counter" on the Hub will work, even if your library is not officially integrated with the Hub (which is often the case for third-party libraries like MidiTok).
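To illustrate the idea (a hypothetical sketch, not MidiTok's actual code: the `save_with_config` helper and the duplicated file content are assumptions), writing a minimal `config.json` next to the tokenizer file is enough for the counter to work:

```python
import json
from pathlib import Path


def save_with_config(save_directory: Path, params: dict) -> None:
    """Save tokenizer params, plus a config.json for the Hub download counter."""
    save_directory.mkdir(parents=True, exist_ok=True)
    # The library's own configuration file
    (save_directory / "tokenizer.conf").write_text(json.dumps(params))
    # A minimal (here duplicated) config.json so the Hub counts downloads
    (save_directory / "config.json").write_text(json.dumps(params))
```

The `config.json` doesn't need to hold anything the library itself reads; its mere presence in the repo is what matters to the server.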
Now I tested it myself and found out about the `logger.warning(f"{CONFIG_NAME} not found in HuggingFace Hub.")` we are triggering when not finding the config file. This is a bit too much IMO and I will make an update to set it at INFO level instead of warning. (EDIT: I just opened huggingface/huggingface_hub#1776)
> Is there a huggingface_hub version that you would recommend to specify in setup.py / requirements? I set >=0.15 but there might be a better choice.

`>=0.16.4` would be the best (see comment below).
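For reference, the corresponding line in `requirements.txt` (or the `install_requires` entry in `setup.py`) would then simply be:

```
huggingface_hub>=0.16.4
```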
> I'll leave the PR open a few days, feel free to review if you want to!
I'll release the next version (v2.1.7) soon after (few days max).
🚀🚀🚀
Once released, what you can do is to open a PR to update this file: https://github.com/huggingface/hub-docs/blob/main/docs/hub/models-libraries.md and ping @osanseviero + me. This way, MidiTok will be listed as an integrated library in our docs as well.
Co-authored-by: Lucain <[email protected]>
Perfect, thank you for everything @Wauplin!
Thanks for the details, I get that it's important to keep it then :)
Great, thank you! It wasn't necessary, but it's a nice gesture.
I certainly will do! Maybe tonight or tomorrow, depending.
This PR adds the ability to push a tokenizer to the Hugging Face hub, and load it back from there. Sharing and loading MidiTok tokenizers will be much more convenient.

The two main new methods are called `push_to_hub` and `from_pretrained`, to be used similarly to those in the HF libraries.

As of now, there are two options for the implementation:

- …
- subclass `PushToHubMixin` from the transformers package for the required changes (less work, and it might be easier to maintain (no guarantee), but it requires transformers to be installed and loaded, which might not be desired for some users and will be slower).

In both cases, we still rely on the `huggingface_hub` package.