
T5 Tokenizer requires protobuf package #25753

Closed
2 of 4 tasks
sanchit-gandhi opened this issue Aug 25, 2023 · 1 comment · Fixed by #25684
Comments

@sanchit-gandhi
Contributor

sanchit-gandhi commented Aug 25, 2023

System Info

  • transformers version: 4.32.0.dev0
  • Platform: macOS-13.5.1-arm64-arm-64bit
  • Python version: 3.9.13
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@ArthurZucker @sanchit-gandhi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Ensure protobuf is uninstalled:
pip uninstall protobuf
  2. Import the T5Tokenizer:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

Traceback:

UnboundLocalError                         Traceback (most recent call last)
Cell In[2], line 1
----> 1 tokenizer = T5Tokenizer.from_pretrained("t5-base")

File ~/transformers/src/transformers/tokenization_utils_base.py:1854, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1851     else:
   1852         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1854 return cls._from_pretrained(
   1855     resolved_vocab_files,
   1856     pretrained_model_name_or_path,
   1857     init_configuration,
   1858     *init_inputs,
   1859     token=token,
   1860     cache_dir=cache_dir,
   1861     local_files_only=local_files_only,
   1862     _commit_hash=commit_hash,
   1863     _is_local=is_local,
   1864     **kwargs,
   1865 )

File ~/transformers/src/transformers/tokenization_utils_base.py:2017, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2015 # Instantiate tokenizer.
   2016 try:
-> 2017     tokenizer = cls(*init_inputs, **init_kwargs)
   2018 except OSError:
   2019     raise OSError(
   2020         "Unable to load vocabulary from file. "
   2021         "Please check that the provided vocabulary is accessible and not corrupted."
   2022     )

File ~/transformers/src/transformers/models/t5/tokenization_t5.py:194, in T5Tokenizer.__init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, legacy, **kwargs)
    191 self.vocab_file = vocab_file
    192 self._extra_ids = extra_ids
--> 194 self.sp_model = self.get_spm_processor()

File ~/transformers/src/transformers/models/t5/tokenization_t5.py:200, in T5Tokenizer.get_spm_processor(self)
    198 with open(self.vocab_file, "rb") as f:
    199     sp_model = f.read()
--> 200     model_pb2 = import_protobuf()
    201     model = model_pb2.ModelProto.FromString(sp_model)
    202     if not self.legacy:

File ~/transformers/src/transformers/convert_slow_tokenizer.py:40, in import_protobuf()
     38     else:
     39         from transformers.utils import sentencepiece_model_pb2_new as sentencepiece_model_pb2
---> 40 return sentencepiece_model_pb2

UnboundLocalError: local variable 'sentencepiece_model_pb2' referenced before assignment

This occurs because we call import_protobuf in the tokenizer's __init__:

model_pb2 = import_protobuf()

But import_protobuf is ill-defined when protobuf is not available:

def import_protobuf():
    if is_protobuf_available():
        import google.protobuf

        if version.parse(google.protobuf.__version__) < version.parse("4.0.0"):
            from transformers.utils import sentencepiece_model_pb2
        else:
            from transformers.utils import sentencepiece_model_pb2_new as sentencepiece_model_pb2
    return sentencepiece_model_pb2

=> if protobuf is not installed, sentencepiece_model_pb2 is never bound, so the return statement raises UnboundLocalError
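The failure mode is generic Python scoping rather than anything protobuf-specific; a minimal sketch (hypothetical function name, mirroring only the shape of import_protobuf) shows that a name bound solely inside an untaken if-branch raises UnboundLocalError at the return:

```python
def import_protobuf_shaped(protobuf_available: bool):
    # Mirrors the structure of transformers' import_protobuf():
    # the name is only bound when the if-branch is taken.
    if protobuf_available:
        sentencepiece_model_pb2 = object()  # stand-in for the real module
    # When the branch was skipped, this line raises UnboundLocalError,
    # matching the traceback above.
    return sentencepiece_model_pb2


try:
    import_protobuf_shaped(protobuf_available=False)
except UnboundLocalError as err:
    print(type(err).__name__)  # UnboundLocalError
```

Because the function body contains an assignment to sentencepiece_model_pb2, Python treats the name as local everywhere in the function, so the failure surfaces as UnboundLocalError rather than NameError.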

Has protobuf inadvertently become a required dependency for T5Tokenizer in #24622? Or can sentencepiece_model_pb2 be defined without protobuf?
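One possible shape for a guard (a hedged sketch only, not the actual patch in the linked PR) is to raise an actionable ImportError on the missing-protobuf path instead of falling through to an unbound name:

```python
def import_protobuf_guarded(protobuf_available: bool):
    # Sketch: fail loudly with install instructions rather than leaking
    # an UnboundLocalError when protobuf is unavailable.
    if not protobuf_available:
        raise ImportError(
            "This tokenizer requires the protobuf library; "
            "install it with `pip install protobuf`."
        )
    sentencepiece_model_pb2 = object()  # stand-in for the imported module
    return sentencepiece_model_pb2
```

An explicit ImportError at least tells the user what to install, whereas the current UnboundLocalError gives no hint that protobuf is the missing piece.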

Expected behavior

T5Tokenizer should be usable without the protobuf package installed.

@ArthurZucker
Collaborator

See the linked PR for a fix.
