
Hugging Face hub integration: push and load from the hub #87

Merged: 10 commits into main from hf-hub-integration, Oct 24, 2023

Conversation

@Natooz (Owner) commented Oct 19, 2023

This PR adds the ability to push a tokenizer to the Hugging Face Hub and to load one from it. Sharing and loading MidiTok tokenizers will be much more convenient.

The two main new methods are push_to_hub and from_pretrained, to be used similarly to their counterparts in the HF libraries.

As of now, there are two options for the implementation:

  1. recreate the push and load methods from the transformers package and adapt them for MidiTok (more work, but faster results) --> preferred option;
  2. import PushToHubMixin from the transformers package and subclass it with the required changes (less work, and possibly easier to maintain (no guarantee), but it requires loading transformers, and having it installed, which some users may not want, and which will be slower).

Either way, we still rely on the huggingface_hub package.

@julien-c

That's really cool – if any help is needed, @Wauplin who's the maintainer of the huggingface_hub package can probably help!

@Natooz (Owner, Author) commented Oct 20, 2023

Thanks for the support! I'll ask if needed, but hopefully that won't be necessary :)

@Wauplin commented Oct 20, 2023

Thanks @julien-c for the ping 🤗

@Natooz I quickly reviewed the draft PR, and I think it can be drastically simplified by removing all the transformers logic. Actually, we already did much the same job as you did here, but integrated it directly into huggingface_hub so that other libraries like MidiTok can reuse it. I invite you to read the How to integrate a library guide, which explains the technical aspects :)

TL;DR: you can make MIDITokenizer inherit from huggingface_hub.ModelHubMixin directly in midi_tokenizer.py, which will automatically add 3 methods: from_pretrained, save_pretrained and push_to_hub. The only thing you have to implement is the logic to serialize/deserialize the tokenizer from/to a file. Here is an example of what it would look like:

import json
from abc import ABC
from pathlib import Path
from typing import Dict, Optional, Type, Union

from huggingface_hub import ModelHubMixin, hf_hub_download

from .constants import CURRENT_VERSION_PACKAGE


class MIDITokenizer(ABC, ModelHubMixin):
    ...  # keep all the existing logic from MIDITokenizer
    ...
    ...

    ... # and then add `_from_pretrained` and `_save_pretrained`

    @classmethod
    def _from_pretrained(
        cls,
        *,
        model_id: str,
        revision: Optional[str],
        cache_dir: Optional[Union[str, Path]],
        force_download: bool,
        proxies: Optional[Dict],
        resume_download: bool,
        local_files_only: bool,
        token: Optional[Union[str, bool]],
        **model_kwargs,
    ) -> "MIDITokenizer":
        params_path = hf_hub_download(
            repo_id=model_id,
            filename="tokenizer.conf",
            revision=revision,
            cache_dir=cache_dir,
            force_download=force_download,
            proxies=proxies,
            resume_download=resume_download,
            local_files_only=local_files_only,
            token=token,
            library_name="MidiTok",
            library_version=CURRENT_VERSION_PACKAGE,
        )

        # TODO: adapt this. I assumed from the code this is how to load a conf but can't tell for sure
        with open(params_path, "r") as f:
            return cls(params=json.load(f))

    def _save_pretrained(self, save_directory: Path) -> None:
        """
        Overwrite this method in subclass to define how to save your model.
        Check out our [integration guide](../guides/integrations) for instructions.

        Args:
            save_directory (`str` or `Path`):
                Path to directory in which the model weights and configuration will be saved.
        """
        # TODO: would be nice to also save a Model Card (README.md file) + any other useful info
        self.save_params(save_directory / "tokenizer.conf")

As a result, every subclass of MidiTokenizer can be loaded from the Hub or pushed to the Hub:

>>> class REMI(MIDITokenizer):  # no changes to this class
>>>     ...

>>> # Load from Hub
>>> remi = REMI.from_pretrained("Natooz/MidiTok-tests")

>>> # (optional) Save on local machine
>>> remi.save_pretrained("path/to/local/directory")

>>> # (optional) Push to Hub
>>> remi.push_to_hub("Natooz/MidiTok-tests-v1")

And pretty cool library btw! Looking forward to seeing the final integration! 🚀

@Natooz (Owner, Author) commented Oct 23, 2023

Hi @Wauplin,
Thanks for digging into it and for the suggestions! :)

Indeed, the ModelHubMixin seems best suited!
I'm looking into it; my main question is whether to subclass it directly (which would add a dependency that some users wouldn't use or may not want), or to subclass it only if huggingface_hub is importable (leaving users the choice not to install huggingface_hub).
I'll come to you if I have any questions; hopefully that won't be necessary, as everything is already well documented :)

@Wauplin commented Oct 23, 2023

my main question is whether to subclass it directly (which would add a dependency that some users wouldn't use or may not want)

I would advise adding huggingface_hub as a core dependency, given how seamless the integration would be (using from_pretrained instead of downloading files manually). In general we try to keep our dependencies as shallow as possible, and in particular no heavy ML library is required (unlike transformers, for example). In the end, huggingface_hub is simply a wrapper around requests plus a few cool features.

If you still decide not to require huggingface_hub (which I can understand), what you can do is define a dummy mixin at runtime that replaces the official one. Something like this:

# I let you define a better error message :D
class DummyModelHubMixin:
    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        raise RuntimeError("Please install huggingface_hub")

    def save_pretrained(self, *args, **kwargs):
        raise RuntimeError("Please install huggingface_hub")

    def push_to_hub(self, *args, **kwargs):
        raise RuntimeError("Please install huggingface_hub")
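For completeness, the import-time switch between the real mixin and the dummy is not shown above; here is a hypothetical sketch of how it could be wired (the alias name HubMixin and the error message are illustrative, not from the PR):

```python
# Hypothetical wiring, not from the PR: use the real mixin when
# huggingface_hub is installed, otherwise fall back to a dummy that
# fails loudly only when a Hub method is actually called.
try:
    from huggingface_hub import ModelHubMixin as HubMixin
except ImportError:
    class HubMixin:  # dummy stand-in exposing the same method names
        @classmethod
        def from_pretrained(cls, *args, **kwargs):
            raise RuntimeError("Please install huggingface_hub")

        def save_pretrained(self, *args, **kwargs):
            raise RuntimeError("Please install huggingface_hub")

        def push_to_hub(self, *args, **kwargs):
            raise RuntimeError("Please install huggingface_hub")


class MIDITokenizer(HubMixin):
    """Tokenizer base class: the Hub methods exist either way, but are
    only functional when huggingface_hub is installed."""
```

Imports elsewhere in the library stay unchanged; only the base class of MIDITokenizer is selected at import time.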

What do you think? :)

@codecov bot commented Oct 24, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (8cd6a67) 90.41% compared to head (1bd1c2a) 90.34%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #87      +/-   ##
==========================================
- Coverage   90.41%   90.34%   -0.07%     
==========================================
  Files          31       33       +2     
  Lines        4536     4577      +41     
==========================================
+ Hits         4101     4135      +34     
- Misses        435      442       +7     
Files Coverage Δ
miditok/__init__.py 68.42% <100.00%> (ø)
miditok/classes.py 84.50% <100.00%> (ø)
miditok/constants.py 100.00% <100.00%> (ø)
setup.py 0.00% <ø> (ø)
tests/conftest.py 100.00% <100.00%> (ø)
miditok/midi_tokenizer.py 90.53% <93.75%> (+0.04%) ⬆️
tests/test_hf_hub.py 68.42% <68.42%> (ø)

... and 1 file with indirect coverage changes


@Natooz (Owner, Author) commented Oct 24, 2023

Hi @Wauplin,

I ended up directly subclassing the huggingface_hub mixin, as indeed the dependencies are very light. 🤗
At first I tried to use a dummy mixin in case the package is not installed, but the docstrings and type hints in my IDE (PyCharm) only showed those of the dummy one instead of ModelHubMixin, so I felt it would be preferable to add the dependency anyway and get the proper hints in any case.

Also, I later subclassed ModelHubMixin to get rid of the config logic, as it doesn't apply here, but finally gave up the idea because of the maintenance cost.
ModelHubMixin is specifically for models; I don't know how other third-party libraries use it, but maybe a "light" or more universal version, from which ModelHubMixin could inherit, would better fit some usages? But really no big deal, everything works fine in any case!
(Edit: I assumed this kind of "user feedback" might interest you)

Thank you again for everything!
Is there a huggingface_hub version that you would recommend specifying in setup.py / requirements? I set >=0.15, but there might be a better choice.

I'll leave the PR open a few days, feel free to review if you want to!
I'll release the next version (v2.1.7) soon after (a few days max).

@Natooz Natooz marked this pull request as ready for review October 24, 2023 13:31
@Wauplin left a comment

Awesome! Great job @Natooz, I love this integration and how well documented it is 👍 I left 2 minor comments, but otherwise it looks good to me.

To answer your comments/feedback:

At first I tried to use a dummy mixin in case the package is not installed, but the docstrings and type hints in my IDE (PyCharm) only showed those of the dummy one instead of ModelHubMixin, so I felt it would be preferable to add the dependency anyway and get the proper hints in any case.

Ah yep, I hadn't thought about type hints and the like in my dummy example. Better like this then! :)

ModelHubMixin is specifically for models; I don't know how other third-party libraries use it, but maybe a "light" or more universal version, from which ModelHubMixin could inherit, would better fit some usages? But really no big deal, everything works fine in any case!

Yes, interesting to know! Actually, the reason we look for a config.json file by default is that this file is automatically used by the server to count a model's downloads. This means that if you have a config.json in your repo, the download counter on the Hub will work even if your library is not officially integrated with the Hub (which is often the case for third-party libraries like MidiTok).
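As an illustration of that point, here is a toy sketch (the class, its parameters, and the config.json contents are hypothetical, not MidiTok's actual API) of a _save_pretrained that writes the tokenizer parameters plus a minimal config.json so the Hub's download counter picks the repo up:

```python
import json
from pathlib import Path


class DemoTokenizer:
    """Toy stand-in for a tokenizer class, for illustration only."""

    def save_params(self, out_path: Path) -> None:
        # Placeholder for the real parameter serialization.
        out_path.write_text(json.dumps({"pitch_range": [21, 109]}))

    def _save_pretrained(self, save_directory: Path) -> None:
        save_directory.mkdir(parents=True, exist_ok=True)
        # Tokenizer parameters (file name taken from the thread above).
        self.save_params(save_directory / "tokenizer.conf")
        # An extra config.json enables the Hub's download counter even
        # without an official integration; its contents are illustrative.
        (save_directory / "config.json").write_text(
            json.dumps({"tokenizer_class": type(self).__name__})
        )
```

Any valid JSON file named config.json at the repo root is enough for the counter; the tokenizer itself does not need to read it back.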

Now, I tested it myself and found out about the logger.warning(f"{CONFIG_NAME} not found in HuggingFace Hub.") we trigger when the config file is not found. This is a bit too much IMO, and I will make an update to set it to INFO level instead of a warning. (EDIT: I just opened huggingface/huggingface_hub#1776)

Is there a huggingface_hub version that you would recommend to specify in setup.py / requirements? I set >=0.15 but there might be a better choice.

>=0.16.4 would be best (see comment below)

I'll leave the PR open a few days, feel free to review if you want to!
I'll release the next version (v2.1.7) soon after (few days max).

🚀🚀🚀

Once released, what you can do is to open a PR to update this file: https://github.com/huggingface/hub-docs/blob/main/docs/hub/models-libraries.md and ping @osanseviero + me. This way, MidiTok will be listed as an integrated library in our docs as well.

Co-authored-by: Lucain <[email protected]>
@Natooz (Owner, Author) commented Oct 24, 2023

Perfect, thank you for everything @Wauplin!

>=0.16.4 it is!

Yes, interesting to know! Actually, the reason we look for a config.json file by default is that this file is automatically used by the server to count a model's downloads. This means that if you have a config.json in your repo, the download counter on the Hub will work even if your library is not officially integrated with the Hub (which is often the case for third-party libraries like MidiTok).

Thanks for the details, I get that it's important to keep it then :)
In the case of MidiTok, the config.json should in theory be in the repo, as the tokenizer is meant to be shared with a model (pushed with its config). So I guess everything is good here.

Now, I tested it myself and found out about the logger.warning(f"{CONFIG_NAME} not found in HuggingFace Hub.") we trigger when the config file is not found. This is a bit too much IMO, and I will make an update to set it to INFO level instead of a warning. (EDIT: I just opened huggingface/huggingface_hub#1776)

Great, thank you! It wasn't necessary, but it's a nice gesture.

Once released, what you can do is to open a PR to update this file: https://github.com/huggingface/hub-docs/blob/main/docs/hub/models-libraries.md and ping @osanseviero + me. This way, MidiTok will be listed as an integrated library in our docs as well.

I certainly will! Maybe tonight or tomorrow.

@Natooz Natooz merged commit 8dddcca into main Oct 24, 2023
@Natooz Natooz deleted the hf-hub-integration branch October 31, 2023 12:57