[Patch-t5-tokenizer] Patches the changes on T5 to make sure previous behaviour is still valid for beginning of words #24622
Conversation
The documentation is not available anymore as the PR was closed or merged.
Ran all the tests with
Thanks for fixing! Fix looks OK to me. I'm not super familiar with tokenizers, so would be good to have a second approval before merging.
Do we have equivalent tests for other tokenizers in the library?
I don't feel good about this kind of change.
Adding ifs in tokenization methods is generally asking for trouble, as the number of different cases is astronomical (we have too much surface).
I wasn't a big fan of the original fix either.
IMO we need to strengthen the test suite much more.
LGTM.
`legacy` means we are not modifying existing tokenizers without knowing (and we need to manually update those core tokenizers). The test suite is more extensive, and if we make code modifications later I feel more confident we're not going to re-break things unknowingly (either way).
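For reference, a minimal sketch of how the `legacy` switch discussed here would be used from user code; it assumes the flag is exposed through `from_pretrained` (as this PR does for T5), and nothing is asserted about the exact outputs since they depend on the installed version.

from transformers import T5Tokenizer

# Keep the historical behaviour that existing checkpoints were trained with.
legacy_tok = T5Tokenizer.from_pretrained("t5-small", legacy=True)

# Opt in to the corrected handling of words that follow special tokens.
fixed_tok = T5Tokenizer.from_pretrained("t5-small", legacy=False)

text = "Hello <extra_id_0>. How are you?"
print(legacy_tok.tokenize(text))
print(fixed_tok.tokenize(text))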
Thanks for iterating on this!
Just to make sure I've understood - was this bug always part of the T5 tokenizer, or was it introduced in #24565?
I'm asking because of the default value of `legacy` - at the moment any new tokenizers created will default to the legacy behaviour. If the bug was recently introduced in #24565, then it wasn't part of the most recent release and so we can default to the "new" behaviour.
@@ -177,15 +178,6 @@
    )

if is_protobuf_available():
Why is this deleted?
Because it's messing up the import. By deleting it here, we can at least run the test individually
I am not sure what's messed up. With the fix in #24689, everything should be fine.
Also, moving this to another place might count as a breaking change, as `from transformers.utils import sentencepiece_model_pb2` won't work anymore (although such direct usage is probably low).
See here.
Could we bring it back quickly, and could you open an issue about the import error you got? Thanks!
Having a look
Also, `from transformers.utils import sentencepiece_model_pb2` works for me.
Comes from #24690
The idea is that we only import it once, where we actually use it.
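A minimal sketch of that pattern; the helper name below is illustrative, not the exact code in this PR.

from transformers.utils import is_protobuf_available

def load_sentencepiece_model(model_file):
    # Deferred import: the protobuf-generated module is only loaded (and its
    # descriptor registered) when the model file is actually parsed, not when
    # the tokenizer module is imported.
    if not is_protobuf_available():
        raise ImportError("protobuf is required to parse the sentencepiece model file")
    from transformers.utils import sentencepiece_model_pb2 as model_pb2
    model = model_pb2.ModelProto()
    with open(model_file, "rb") as f:
        model.ParseFromString(f.read())
    return model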
This does not fix the version error, but it fixes the issue with protobuf 3.20.3, where we cannot use seqio or anything else that imports protobuf:
! pip install protobuf==3.20.3
from transformers import AutoTokenizer
from seqio import SentencePieceVocabulary
In [3]: from seqio import SentencePieceVocabulary
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 from seqio import SentencePieceVocabulary
File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/__init__.py:18
15 """Import to top-level API."""
16 # pylint:disable=wildcard-import,g-bad-import-order
---> 18 from seqio.dataset_providers import *
19 from seqio import evaluation
20 from seqio import experimental
File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/dataset_providers.py:39
37 from packaging import version as version_lib
38 import pyglove as pg
---> 39 from seqio import metrics as metrics_lib
40 from seqio import preprocessors as seqio_preprocessors
41 from seqio import task_registry_provenance_tracking
File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/metrics.py:27
25 import jax.numpy as jnp
26 import numpy as np
---> 27 from seqio import utils
28 import tensorflow.compat.v2 as tf
31 @dataclasses.dataclass
32 class MetricValue:
File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/utils.py:29
27 from absl import logging
28 import numpy as np
---> 29 from seqio.vocabularies import Vocabulary
30 import tensorflow.compat.v2 as tf
31 import tensorflow_datasets as tfds
File /opt/conda/envs/py39/lib/python3.9/site-packages/seqio/vocabularies.py:27
24 import tensorflow.compat.v2 as tf
25 import tensorflow_text as tf_text
---> 27 from sentencepiece import sentencepiece_model_pb2
28 import sentencepiece as sentencepiece_processor
30 PAD_ID = 0
File /opt/conda/envs/py39/lib/python3.9/site-packages/sentencepiece/sentencepiece_model_pb2.py:16
9 # @@protoc_insertion_point(imports)
11 _sym_db = _symbol_database.Default()
---> 16 DESCRIPTOR = _descriptor.FileDescriptor(
17 name='sentencepiece_model.proto',
18 package='sentencepiece',
19 syntax='proto2',
20 serialized_options=b'H\003',
21 create_key=_descriptor._internal_create_key,
22 serialized_pb=b'\n\x19sentencepiece_model.proto\x12\rsentencepiece\"\x80\x0c\n\x0bTrainerSpec\x12\r\n\x05input\x18\x01 \x03(\t\x12\x14\n\x0cinput_format\x18\x07 \x01(\t\x12\x14\n\x0cmodel_prefix\x18\x02 \x01(\t\x12\x41\n\nmodel_type\x18\x03 \x01(\x0e\x32$.sentencepiece.TrainerSpec.ModelType:\x07UNIGRAM\x12\x18\n\nvocab_size\x18\x04 \x01(\x05:\x04\x38\x30\x30\x30\x12\x17\n\x0f\x61\x63\x63\x65pt_language\x18\x05 \x03(\t\x12 \n\x15self_test_sample_size\x18\x06 \x01(\x05:\x01\x30\x12*\n\x1b\x65nable_differential_privacy\x18\x32 \x01(\x08:\x05\x66\x61lse\x12+\n differential_privacy_noise_level\x18\x33 \x01(\x02:\x01\x30\x12\x32\n\'differential_privacy_clipping_threshold\x18\x34 \x01(\x04:\x01\x30\x12\"\n\x12\x63haracter_coverage\x18\n \x01(\x02:\x06\x30.9995\x12\x1e\n\x13input_sentence_size\x18\x0b \x01(\x04:\x01\x30\x12$\n\x16shuffle_input_sentence\x18\x13 \x01(\x08:\x04true\x12 \n\x14mining_sentence_size\x18\x0c \x01(\x05\x42\x02\x18\x01\x12\"\n\x16training_sentence_size\x18\r \x01(\x05\x42\x02\x18\x01\x12(\n\x17seed_sentencepiece_size\x18\x0e \x01(\x05:\x07\x31\x30\x30\x30\x30\x30\x30\x12\x1e\n\x10shrinking_factor\x18\x0f \x01(\x02:\x04\x30.75\x12!\n\x13max_sentence_length\x18\x12 \x01(\x05:\x04\x34\x31\x39\x32\x12\x17\n\x0bnum_threads\x18\x10 \x01(\x05:\x02\x31\x36\x12\x1d\n\x12num_sub_iterations\x18\x11 \x01(\x05:\x01\x32\x12$\n\x18max_sentencepiece_length\x18\x14 \x01(\x05:\x02\x31\x36\x12%\n\x17split_by_unicode_script\x18\x15 \x01(\x08:\x04true\x12\x1d\n\x0fsplit_by_number\x18\x17 \x01(\x08:\x04true\x12!\n\x13split_by_whitespace\x18\x16 \x01(\x08:\x04true\x12)\n\x1atreat_whitespace_as_suffix\x18\x18 \x01(\x08:\x05\x66\x61lse\x12+\n\x1c\x61llow_whitespace_only_pieces\x18\x1a \x01(\x08:\x05\x66\x61lse\x12\x1b\n\x0csplit_digits\x18\x19 \x01(\x08:\x05\x66\x61lse\x12#\n\x19pretokenization_delimiter\x18\x35 \x01(\t:\x00\x12\x17\n\x0f\x63ontrol_symbols\x18\x1e \x03(\t\x12\x1c\n\x14user_defined_symbols\x18\x1f \x03(\t\x12\x16\n\x0erequired_chars\x18$ \x01(\t\x12\x1c\n\rbyte_fallback\x18# \x01(\x08:\x05\x66\x61lse\x12+\n\x1dvocabulary_output_piece_score\x18 \x01(\x08:\x04true\x12\x1e\n\x10hard_vocab_limit\x18! \x01(\x08:\x04true\x12\x1c\n\ruse_all_vocab\x18\" \x01(\x08:\x05\x66\x61lse\x12\x11\n\x06unk_id\x18( \x01(\x05:\x01\x30\x12\x11\n\x06\x62os_id\x18) \x01(\x05:\x01\x31\x12\x11\n\x06\x65os_id\x18* \x01(\x05:\x01\x32\x12\x12\n\x06pad_id\x18+ \x01(\x05:\x02-1\x12\x18\n\tunk_piece\x18- \x01(\t:\x05<unk>\x12\x16\n\tbos_piece\x18. 
\x01(\t:\x03<s>\x12\x17\n\teos_piece\x18/ \x01(\t:\x04</s>\x12\x18\n\tpad_piece\x18\x30 \x01(\t:\x05<pad>\x12\x1a\n\x0bunk_surface\x18, \x01(\t:\x05 \xe2\x81\x87 \x12+\n\x1ctrain_extremely_large_corpus\x18\x31 \x01(\x08:\x05\x66\x61lse\"5\n\tModelType\x12\x0b\n\x07UNIGRAM\x10\x01\x12\x07\n\x03\x42PE\x10\x02\x12\x08\n\x04WORD\x10\x03\x12\x08\n\x04\x43HAR\x10\x04*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"\xd1\x01\n\x0eNormalizerSpec\x12\x0c\n\x04name\x18\x01 \x01(\t\x12\x1c\n\x14precompiled_charsmap\x18\x02 \x01(\x0c\x12\x1e\n\x10\x61\x64\x64_dummy_prefix\x18\x03 \x01(\x08:\x04true\x12&\n\x18remove_extra_whitespaces\x18\x04 \x01(\x08:\x04true\x12 \n\x12\x65scape_whitespaces\x18\x05 \x01(\x08:\x04true\x12\x1e\n\x16normalization_rule_tsv\x18\x06 \x01(\t*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"y\n\x0cSelfTestData\x12\x33\n\x07samples\x18\x01 \x03(\x0b\x32\".sentencepiece.SelfTestData.Sample\x1a)\n\x06Sample\x12\r\n\x05input\x18\x01 \x01(\t\x12\x10\n\x08\x65xpected\x18\x02 \x01(\t*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"\xfe\x03\n\nModelProto\x12\x37\n\x06pieces\x18\x01 \x03(\x0b\x32\'.sentencepiece.ModelProto.SentencePiece\x12\x30\n\x0ctrainer_spec\x18\x02 \x01(\x0b\x32\x1a.sentencepiece.TrainerSpec\x12\x36\n\x0fnormalizer_spec\x18\x03 \x01(\x0b\x32\x1d.sentencepiece.NormalizerSpec\x12\x33\n\x0eself_test_data\x18\x04 \x01(\x0b\x32\x1b.sentencepiece.SelfTestData\x12\x38\n\x11\x64\x65normalizer_spec\x18\x05 \x01(\x0b\x32\x1d.sentencepiece.NormalizerSpec\x1a\xd2\x01\n\rSentencePiece\x12\r\n\x05piece\x18\x01 \x01(\t\x12\r\n\x05score\x18\x02 \x01(\x02\x12\x42\n\x04type\x18\x03 \x01(\x0e\x32,.sentencepiece.ModelProto.SentencePiece.Type:\x06NORMAL\"T\n\x04Type\x12\n\n\x06NORMAL\x10\x01\x12\x0b\n\x07UNKNOWN\x10\x02\x12\x0b\n\x07\x43ONTROL\x10\x03\x12\x10\n\x0cUSER_DEFINED\x10\x04\x12\x08\n\x04\x42YTE\x10\x06\x12\n\n\x06UNUSED\x10\x05*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\x42\x02H\x03'
23 )
27 _TRAINERSPEC_MODELTYPE = _descriptor.EnumDescriptor(
28 name='ModelType',
29 full_name='sentencepiece.TrainerSpec.ModelType',
(...)
58 serialized_end=1570,
59 )
60 _sym_db.RegisterEnumDescriptor(_TRAINERSPEC_MODELTYPE)
File /opt/conda/envs/py39/lib/python3.9/site-packages/google/protobuf/descriptor.py:1024, in FileDescriptor.__new__(cls, name, package, options, serialized_options, serialized_pb, dependencies, public_dependencies, syntax, pool, create_key)
1022 raise RuntimeError('Please link in cpp generated lib for %s' % (name))
1023 elif serialized_pb:
-> 1024 return _message.default_pool.AddSerializedFile(serialized_pb)
1025 else:
1026 return super(FileDescriptor, cls).__new__(cls)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "sentencepiece_model.proto":
sentencepiece_model.proto: A file with this name is already in the pool.
Hi, thanks for the information. Could you provide the full trace 🙏? It might be useful.
Just added it!
self.assertEqual(tokens, ["▁", ".", "▁He", "ll", "o"])

# `'▁'` is also a whitespace
input_ids = self.tokenizer.encode("▁He is not")
What if we have two different whitespaces, e.g. `" ▁He is not"`?
Not sure I understand the test case: `▁` is always converted to `" "`. So `" "` and `▁` are the same!
Can we add a test to make sure it's treated that way?
There is one:
input_ids = self.tokenizer.encode("▁He is not ▁He")
self.assertEqual(input_ids, [156, 46, 44, 156, 2])
tokens = self.tokenizer.tokenize("▁He is not ▁He")
This is making sure that `" ▁"` is just encoded as a single `▁`.
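For completeness, an illustrative check of that claim (the checkpoint name here is an assumption, not the one used in the test suite):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
# Per the comment above, a leading "▁" and a leading space should encode identically.
assert tokenizer.encode("▁He is not") == tokenizer.encode(" He is not")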
The bug has always been in T5, but since some models were trained with the bugged T5, we will let the user decide whether or not they incorporate the change.
Thanks for fixing this and for all the work making it safe and backwards compatible!
What does this PR do?
There was a small typo in #24565 that modified the behaviour, and the tests were not able to catch it (#24569).
When a sentence does not start with a space, a space was added.
Before:
After:
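A rough reproduction sketch of the behaviour described above (illustrative only; the checkpoint is an assumption and the exact tokens depend on the installed transformers version, so nothing is asserted here):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Text immediately following a special token: the buggy version behaved as if
# an extra space had been inserted before "Hello".
print(tokenizer.tokenize("<extra_id_0>Hello"))

# A sentence that does not start with a space was also affected.
print(tokenizer.tokenize("Hello, world."))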
Big bug: 35 models involved.
Not only punctuation but anything after a special token is basically wrong... Let's ignore the fact that we also split when it's the beginning of a word, which is less important.

Tests were added; they were green before merging.