
Hugging Face hub integration: push and load from the hub #87

Merged: 10 commits into main from hf-hub-integration, Oct 24, 2023

Conversation

@Natooz (Owner) commented Oct 19, 2023

This PR adds the ability to push a tokenizer to the Hugging Face Hub and to load one from it. Sharing and loading MidiTok tokenizers will be much more convenient.

The two main new methods are push_to_hub and from_pretrained, to be used similarly to their counterparts in the HF libraries.

As of now, there are two options for the implementation:

  1. recreate the push and load methods from the transformers package and adapt them for MidiTok (more work, but faster results) --> preferred option;
  2. import PushToHubMixin from the transformers package and subclass it with the required changes (less work, and possibly easier to maintain (no guarantee), but it requires loading transformers, and having it installed, which some users may not want, and which will be slower).

Either way, we still rely on the huggingface_hub package.

@julien-c

That's really cool – if any help is needed, @Wauplin who's the maintainer of the huggingface_hub package can probably help!

@Natooz (Owner, Author) commented Oct 20, 2023

Thanks for the support! I'll ask if needed, but hopefully that won't be necessary :)

@Wauplin commented Oct 20, 2023

Thanks @julien-c for the ping 🤗

@Natooz I quickly reviewed the draft PR, and I think it can be drastically simplified by removing all the transformers logic. Actually, we already did much the same job as you did here, but integrated it directly into huggingface_hub so that other libraries like MidiTok can reuse it. I invite you to read the How to integrate a library guide, which explains the technical aspects :)

TL;DR: you can make MIDITokenizer inherit from huggingface_hub.ModelHubMixin directly in midi_tokenizer.py, which will automatically add 3 methods: from_pretrained, save_pretrained and push_to_hub. The only thing you have to implement is the logic to serialize/deserialize the tokenizer from/to a file. Here is an example of what it would look like:

import json
from abc import ABC
from pathlib import Path
from typing import Dict, Optional, Type, Union

from huggingface_hub import ModelHubMixin, hf_hub_download

from .constants import CURRENT_VERSION_PACKAGE


class MIDITokenizer(ABC, ModelHubMixin):
    ...  # keep all the existing logic from MIDITokenizer
    ...
    ...

    ... # and then add `_from_pretrained` and `_save_pretrained`

    @classmethod
    def _from_pretrained(
        cls,
        *,
        model_id: str,
        revision: Optional[str],
        cache_dir: Optional[Union[str, Path]],
        force_download: bool,
        proxies: Optional[Dict],
        resume_download: bool,
        local_files_only: bool,
        token: Optional[Union[str, bool]],
        **model_kwargs,
    ) -> "MIDITokenizer":
        params_path = hf_hub_download(
            repo_id=model_id,
            filename="tokenizer.conf",
            revision=revision,
            cache_dir=cache_dir,
            force_download=force_download,
            proxies=proxies,
            resume_download=resume_download,
            local_files_only=local_files_only,
            token=token,
            library_name="MidiTok",
            library_version=CURRENT_VERSION_PACKAGE,
        )

        # TODO: adapt this. I assumed from the code this is how to load a conf but can't tell for sure
        with open(params_path, "r") as f:
            return cls(params=json.load(f))

    def _save_pretrained(self, save_directory: Path) -> None:
        """
        Overwrite this method in subclass to define how to save your model.
        Check out our [integration guide](../guides/integrations) for instructions.

        Args:
            save_directory (`str` or `Path`):
                Path to directory in which the model weights and configuration will be saved.
        """
        # TODO: would be nice to also save a Model Card (README.md file) + any other useful info
        self.save_params(save_directory / "tokenizer.conf")

As a result, every subclass of MidiTokenizer can be loaded from the Hub or pushed to the Hub:

>>> class REMI(MIDITokenizer):  # no changes to this class
>>>     ...

>>> # Load from Hub
>>> remi = REMI.from_pretrained("Natooz/MidiTok-tests")

>>> # (optional) Save on local machine
>>> remi.save_pretrained("path/to/local/directory")

>>> # (optional) Push to Hub
>>> remi.push_to_hub("Natooz/MidiTok-tests-v1")

And pretty cool library btw! Looking forward to seeing the final integration! 🚀

@Natooz (Owner, Author) commented Oct 23, 2023

Hi @Wauplin,
Thanks for digging into it and for the suggestions! :)

Indeed, the ModelHubMixin seems best suited!
I'm looking into it; my main question is whether to subclass it directly (which would add a dependency that some users wouldn't use or may not want), or to subclass it only if huggingface_hub is importable (leaving users the choice not to install huggingface_hub).
I'll come to you if I have any questions; hopefully that won't be necessary, as everything is already well documented :)

@Wauplin commented Oct 23, 2023

my main question is whether to subclass it directly (which would add a dependency that some users wouldn't use or may not want)

I would advise adding huggingface_hub as a core dependency, given how seamless the integration would be (using from_pretrained instead of downloading files manually). In general we try to keep our dependencies as shallow as possible, and in particular no heavy ML library is required (unlike transformers, for example). In the end, huggingface_hub is simply a wrapper around requests plus a few cool features.

If you still decide not to require huggingface_hub (which I can understand), what you can do is define a dummy mixin at runtime that replaces the official one. Something like this:

# I let you define a better error message :D
class DummyModelHubMixin:
    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        raise RuntimeError("Please install huggingface_hub")

    def save_pretrained(self, *args, **kwargs):
        raise RuntimeError("Please install huggingface_hub")

    def push_to_hub(self, *args, **kwargs):
        raise RuntimeError("Please install huggingface_hub")
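For completeness, the import-time switch between the real mixin and the dummy is not shown above; here is a hypothetical sketch of how it could be wired (the alias name HubMixin and the error message are illustrative, not from the PR):

```python
# Hypothetical wiring, not from the PR: use the real mixin when
# huggingface_hub is installed, otherwise fall back to a dummy that
# fails loudly only when a Hub method is actually called.
try:
    from huggingface_hub import ModelHubMixin as HubMixin
except ImportError:
    class HubMixin:  # dummy stand-in exposing the same method names
        @classmethod
        def from_pretrained(cls, *args, **kwargs):
            raise RuntimeError("Please install huggingface_hub")

        def save_pretrained(self, *args, **kwargs):
            raise RuntimeError("Please install huggingface_hub")

        def push_to_hub(self, *args, **kwargs):
            raise RuntimeError("Please install huggingface_hub")


class MIDITokenizer(HubMixin):
    """Tokenizer base class: the Hub methods exist either way, but are
    only functional when huggingface_hub is installed."""
```

Imports elsewhere in the library stay unchanged; only the base class of MIDITokenizer is selected at import time.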

What do you think? :)

@codecov bot commented Oct 24, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (8cd6a67) 90.41% compared to head (1bd1c2a) 90.34%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #87      +/-   ##
==========================================
- Coverage   90.41%   90.34%   -0.07%     
==========================================
  Files          31       33       +2     
  Lines        4536     4577      +41     
==========================================
+ Hits         4101     4135      +34     
- Misses        435      442       +7     
Files Coverage Δ
miditok/__init__.py 68.42% <100.00%> (ø)
miditok/classes.py 84.50% <100.00%> (ø)
miditok/constants.py 100.00% <100.00%> (ø)
setup.py 0.00% <ø> (ø)
tests/conftest.py 100.00% <100.00%> (ø)
miditok/midi_tokenizer.py 90.53% <93.75%> (+0.04%) ⬆️
tests/test_hf_hub.py 68.42% <68.42%> (ø)

... and 1 file with indirect coverage changes


@Natooz (Owner, Author) commented Oct 24, 2023

Hi @Wauplin,

I ended up directly subclassing the huggingface_hub mixin, as indeed the dependencies are very light. 🤗
At first I tried to use a dummy mixin in case the package is not installed, but the docstrings and type hints in my IDE (PyCharm) only showed those of the dummy one instead of ModelHubMixin, so I felt it would be preferable to add the dependency anyway and get the proper hints in any case.

Also, I later subclassed ModelHubMixin to get rid of the config logic, as it doesn't apply here, but finally gave up the idea because of the maintenance cost.
ModelHubMixin is specifically for models; I don't know how other third-party libraries use it, but maybe a "light" or more universal version, from which ModelHubMixin could inherit, would better fit some usages? But really no big deal, everything works fine in any case!
(Edit: I assumed this kind of "user feedback" might interest you)

Thank you again for everything!
Is there a huggingface_hub version that you would recommend specifying in setup.py / requirements? I set >=0.15, but there might be a better choice.

I'll leave the PR open a few days, feel free to review if you want to!
I'll release the next version (v2.1.7) soon after (a few days max).

@Natooz Natooz marked this pull request as ready for review October 24, 2023 13:31
@Wauplin left a comment

Awesome! Great job @Natooz, I love this integration and how well documented it is 👍 I left 2 minor comments, but otherwise it looks good to me.

To answer your comments/feedback:

At first I tried to use a dummy mixin in case the package is not installed, but the docstrings and type hints in my IDE (PyCharm) only showed those of the dummy one instead of ModelHubMixin, so I felt it would be preferable to add the dependency anyway and get the proper hints in any case.

Ah yep, I hadn't thought about type hints and the like in my dummy example. Better like this then! :)

ModelHubMixin is specifically for models; I don't know how other third-party libraries use it, but maybe a "light" or more universal version, from which ModelHubMixin could inherit, would better fit some usages? But really no big deal, everything works fine in any case!

Yes, interesting to know! Actually, the reason we look for a config.json file by default is that this file is automatically used by the server to count a model's downloads. This means that if you have a config.json in your repo, the download counter on the Hub will work even if your library is not officially integrated with the Hub (which is often the case for third-party libraries like MidiTok).
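As an illustration of that point, here is a toy sketch (the class, its parameters, and the config.json contents are hypothetical, not MidiTok's actual API) of a _save_pretrained that writes the tokenizer parameters plus a minimal config.json so the Hub's download counter picks the repo up:

```python
import json
from pathlib import Path


class DemoTokenizer:
    """Toy stand-in for a tokenizer class, for illustration only."""

    def save_params(self, out_path: Path) -> None:
        # Placeholder for the real parameter serialization.
        out_path.write_text(json.dumps({"pitch_range": [21, 109]}))

    def _save_pretrained(self, save_directory: Path) -> None:
        save_directory.mkdir(parents=True, exist_ok=True)
        # Tokenizer parameters (file name taken from the thread above).
        self.save_params(save_directory / "tokenizer.conf")
        # An extra config.json enables the Hub's download counter even
        # without an official integration; its contents are illustrative.
        (save_directory / "config.json").write_text(
            json.dumps({"tokenizer_class": type(self).__name__})
        )
```

Any valid JSON file named config.json at the repo root is enough for the counter; the tokenizer itself does not need to read it back.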

Now, I tested it myself and found out about the logger.warning(f"{CONFIG_NAME} not found in HuggingFace Hub.") we trigger when the config file is not found. This is a bit too much IMO, and I will make an update to set it to INFO level instead of a warning. (EDIT: I just opened huggingface/huggingface_hub#1776)

Is there a huggingface_hub version that you would recommend to specify in setup.py / requirements? I set >=0.15 but there might be a better choice.

>=0.16.4 would be best (see comment below)

I'll leave the PR open a few days, feel free to review if you want to!
I'll release the next version (v2.1.7) soon after (few days max).

🚀🚀🚀

Once released, what you can do is to open a PR to update this file: https://github.com/huggingface/hub-docs/blob/main/docs/hub/models-libraries.md and ping @osanseviero + me. This way, MidiTok will be listed as an integrated library in our docs as well.

Co-authored-by: Lucain <[email protected]>
@Natooz (Owner, Author) commented Oct 24, 2023

Perfect, thank you for everything @Wauplin!

>=0.16.4 it is!

Yes, interesting to know! Actually, the reason we look for a config.json file by default is that this file is automatically used by the server to count a model's downloads. This means that if you have a config.json in your repo, the download counter on the Hub will work even if your library is not officially integrated with the Hub (which is often the case for third-party libraries like MidiTok).

Thanks for the details, I get that it's important to keep it then :)
In the case of MidiTok, the config.json should in theory be in the repo, as the tokenizer is meant to be shared with a model (pushed with its config). So I guess everything is good here.

Now, I tested it myself and found out about the logger.warning(f"{CONFIG_NAME} not found in HuggingFace Hub.") we trigger when the config file is not found. This is a bit too much IMO, and I will make an update to set it to INFO level instead of a warning. (EDIT: I just opened huggingface/huggingface_hub#1776)

Great, thank you! It wasn't necessary, but it's a nice gesture.

Once released, what you can do is to open a PR to update this file: https://github.com/huggingface/hub-docs/blob/main/docs/hub/models-libraries.md and ping @osanseviero + me. This way, MidiTok will be listed as an integrated library in our docs as well.

I certainly will! Maybe tonight or tomorrow.

@Natooz Natooz merged commit 8dddcca into main Oct 24, 2023
@Natooz Natooz deleted the hf-hub-integration branch October 31, 2023 12:57