
Loading Flan-T5 tokenizer throwing UnboundLocalError for variable sentencepiece_model_pb2 #25667

Closed
PyroGenesis opened this issue Aug 22, 2023 · 4 comments · Fixed by #25684

Comments

@PyroGenesis

System Info

  • transformers version: 4.32.0
  • Platform: Windows-10-10.0.20348-SP0
  • Python version: 3.10.11
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.2
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (but haven't loaded model yet)
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code:

from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

Error:

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=True`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[1], line 2
      1 from transformers import T5Tokenizer, T5ForConditionalGeneration
----> 2 tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

File ~\Documents\flan-t5\lib\site-packages\transformers\tokenization_utils_base.py:1854, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1851     else:
   1852         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1854 return cls._from_pretrained(
   1855     resolved_vocab_files,
   1856     pretrained_model_name_or_path,
   1857     init_configuration,
   1858     *init_inputs,
   1859     token=token,
   1860     cache_dir=cache_dir,
   1861     local_files_only=local_files_only,
   1862     _commit_hash=commit_hash,
   1863     _is_local=is_local,
   1864     **kwargs,
   1865 )

File ~\Documents\flan-t5\lib\site-packages\transformers\tokenization_utils_base.py:2017, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2015 # Instantiate tokenizer.
   2016 try:
-> 2017     tokenizer = cls(*init_inputs, **init_kwargs)
   2018 except OSError:
   2019     raise OSError(
   2020         "Unable to load vocabulary from file. "
   2021         "Please check that the provided vocabulary is accessible and not corrupted."
   2022     )

File ~\Documents\flan-t5\lib\site-packages\transformers\models\t5\tokenization_t5.py:194, in T5Tokenizer.__init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, legacy, **kwargs)
    191 self.vocab_file = vocab_file
    192 self._extra_ids = extra_ids
--> 194 self.sp_model = self.get_spm_processor()

File ~\Documents\flan-t5\lib\site-packages\transformers\models\t5\tokenization_t5.py:200, in T5Tokenizer.get_spm_processor(self)
    198 with open(self.vocab_file, "rb") as f:
    199     sp_model = f.read()
--> 200     model_pb2 = import_protobuf()
    201     model = model_pb2.ModelProto.FromString(sp_model)
    202     if not self.legacy:

File ~\Documents\flan-t5\lib\site-packages\transformers\convert_slow_tokenizer.py:40, in import_protobuf()
     38     else:
     39         from transformers.utils import sentencepiece_model_pb2_new as sentencepiece_model_pb2
---> 40 return sentencepiece_model_pb2

UnboundLocalError: local variable 'sentencepiece_model_pb2' referenced before assignment

Expected behavior

It should simply load the default tokenizer without errors. This code was working fine earlier (on a different machine).

@PyroGenesis
Author

Update:

I ran `pip install protobuf` and the tokenizer works now.

Is this requirement listed anywhere? I don't recall doing this the last time I set up this tokenizer.
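For context, the `UnboundLocalError` in the traceback above comes from a conditional-import pattern: the module name is only assigned inside `if`/`else` branches that are all skipped when protobuf is missing, so the final `return` references a name that was never bound. A minimal sketch of that pattern (`import_protobuf_sketch` is hypothetical, not the actual transformers function):

```python
import types


def import_protobuf_sketch(protobuf_installed: bool):
    """Minimal sketch of the failure mode: the name is bound only inside
    branches that are skipped when protobuf is missing, so the final
    return references an unbound local variable."""
    if protobuf_installed:
        # Stand-in for `import ... as sentencepiece_model_pb2`.
        sentencepiece_model_pb2 = types.SimpleNamespace(name="pb2 module")
    # With protobuf absent, no branch assigned the name, so this raises
    # UnboundLocalError -- the same error shown in the traceback above.
    return sentencepiece_model_pb2


try:
    import_protobuf_sketch(protobuf_installed=False)
except UnboundLocalError as exc:
    print(type(exc).__name__)  # UnboundLocalError
```

This also explains why installing protobuf makes the error disappear: with the dependency present, one of the branches runs and the name gets bound.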

@ArthurZucker
Collaborator

ArthurZucker commented Aug 23, 2023

It is a dependency listed in `setup.py` (see here), but it is not a hard dependency. The error is indeed a bug on our side. Opening a PR to raise a clear error if protobuf is not installed and `legacy=False` is used!
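The fix direction described above amounts to checking for the optional dependency up front and raising an explicit `ImportError` instead of falling through to an `UnboundLocalError`. A hedged sketch of that idea (`require_module` is illustrative, not the code merged in the actual PR):

```python
import importlib.util


def require_module(name: str, hint: str) -> None:
    """Fail fast with a clear ImportError when an optional dependency is
    missing, rather than letting later code hit an UnboundLocalError."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(f"{name} is required for this code path; {hint}")


# A missing module now produces an actionable message:
try:
    require_module("definitely_not_a_real_module_xyz",
                   "install it with `pip install protobuf`.")
except ImportError as exc:
    print(exc)
```

The key design point is that the availability check happens before any branch-dependent assignment, so the error message names the missing package and how to install it.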

@AIDevMonster

thx

@AIDevMonster

everyone~
