
Loading Flan-T5 tokenizer throwing UnboundLocalError for variable sentencepiece_model_pb2 #25667

Closed
PyroGenesis opened this issue Aug 22, 2023 · 4 comments · Fixed by #25684

Comments

@PyroGenesis

System Info

  • transformers version: 4.32.0
  • Platform: Windows-10-10.0.20348-SP0
  • Python version: 3.10.11
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.2
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (but haven't loaded model yet)
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code:

from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

Error:

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=True`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[1], line 2
      1 from transformers import T5Tokenizer, T5ForConditionalGeneration
----> 2 tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

File ~\Documents\flan-t5\lib\site-packages\transformers\tokenization_utils_base.py:1854, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1851     else:
   1852         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1854 return cls._from_pretrained(
   1855     resolved_vocab_files,
   1856     pretrained_model_name_or_path,
   1857     init_configuration,
   1858     *init_inputs,
   1859     token=token,
   1860     cache_dir=cache_dir,
   1861     local_files_only=local_files_only,
   1862     _commit_hash=commit_hash,
   1863     _is_local=is_local,
   1864     **kwargs,
   1865 )

File ~\Documents\flan-t5\lib\site-packages\transformers\tokenization_utils_base.py:2017, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2015 # Instantiate tokenizer.
   2016 try:
-> 2017     tokenizer = cls(*init_inputs, **init_kwargs)
   2018 except OSError:
   2019     raise OSError(
   2020         "Unable to load vocabulary from file. "
   2021         "Please check that the provided vocabulary is accessible and not corrupted."
   2022     )

File ~\Documents\flan-t5\lib\site-packages\transformers\models\t5\tokenization_t5.py:194, in T5Tokenizer.__init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, legacy, **kwargs)
    191 self.vocab_file = vocab_file
    192 self._extra_ids = extra_ids
--> 194 self.sp_model = self.get_spm_processor()

File ~\Documents\flan-t5\lib\site-packages\transformers\models\t5\tokenization_t5.py:200, in T5Tokenizer.get_spm_processor(self)
    198 with open(self.vocab_file, "rb") as f:
    199     sp_model = f.read()
--> 200     model_pb2 = import_protobuf()
    201     model = model_pb2.ModelProto.FromString(sp_model)
    202     if not self.legacy:

File ~\Documents\flan-t5\lib\site-packages\transformers\convert_slow_tokenizer.py:40, in import_protobuf()
     38     else:
     39         from transformers.utils import sentencepiece_model_pb2_new as sentencepiece_model_pb2
---> 40 return sentencepiece_model_pb2

UnboundLocalError: local variable 'sentencepiece_model_pb2' referenced before assignment

Expected behavior

It should simply load the default tokenizer without errors. This code was working fine earlier (on a different machine).

@PyroGenesis
Author

Update:

I ran `pip install protobuf` and the tokenizer works now.

Is this requirement listed anywhere? I don't recall doing this the last time I set up this tokenizer.
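For context, the `UnboundLocalError` in the traceback above comes from a conditional-import pattern: the module name is only assigned inside `if`/`else` branches that are all skipped when protobuf is missing, so the final `return` references a name that was never bound. A minimal sketch of that pattern (`import_protobuf_sketch` is hypothetical, not the actual transformers function):

```python
import types


def import_protobuf_sketch(protobuf_installed: bool):
    """Minimal sketch of the failure mode: the name is bound only inside
    branches that are skipped when protobuf is missing, so the final
    return references an unbound local variable."""
    if protobuf_installed:
        # Stand-in for `import ... as sentencepiece_model_pb2`.
        sentencepiece_model_pb2 = types.SimpleNamespace(name="pb2 module")
    # With protobuf absent, no branch assigned the name, so this raises
    # UnboundLocalError -- the same error shown in the traceback above.
    return sentencepiece_model_pb2


try:
    import_protobuf_sketch(protobuf_installed=False)
except UnboundLocalError as exc:
    print(type(exc).__name__)  # UnboundLocalError
```

This also explains why installing protobuf makes the error disappear: with the dependency present, one of the branches runs and the name gets bound.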

@ArthurZucker
Collaborator

ArthurZucker commented Aug 23, 2023

It is a dependency listed in `setup.py` (see here), but it is not a hard dependency. The error is indeed a bug on our side. Opening a PR to raise a clear error if protobuf is not installed and `legacy=False` is used!
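The fix direction described above amounts to checking for the optional dependency up front and raising an explicit `ImportError` instead of falling through to an `UnboundLocalError`. A hedged sketch of that idea (`require_module` is illustrative, not the code merged in the actual PR):

```python
import importlib.util


def require_module(name: str, hint: str) -> None:
    """Fail fast with a clear ImportError when an optional dependency is
    missing, rather than letting later code hit an UnboundLocalError."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(f"{name} is required for this code path; {hint}")


# A missing module now produces an actionable message:
try:
    require_module("definitely_not_a_real_module_xyz",
                   "install it with `pip install protobuf`.")
except ImportError as exc:
    print(exc)
```

The key design point is that the availability check happens before any branch-dependent assignment, so the error message names the missing package and how to install it.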

@AIDevMonster

thx

@AIDevMonster

everyone~
