T5 Tokenizer Adds Space after Each Added (Extra) Token #24743
Comments
I think a fix is in
FYI: that PR is not merged yet into
Let's wait until we merge to close!
@ArthurZucker Hi, this issue still exists after updating transformers to the latest 4.31.0 with #24622
Hey!
In [6]: tokenizer = AutoTokenizer.from_pretrained("t5-base", legacy=False, use_fast=False)
/fsx/arthur/miniconda3/envs/py10/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5.py:199: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
warnings.warn(
In [7]: tokenizer.add_tokens(["asdfg"], special_tokens=False)
Out[7]: 1
In [8]: tokenizer.tokenize("asdfgwordtimeasdfgtime")
Out[8]: ['asdfg', 'word', 'time', 'asdfg', 'time']
the key is that you need to set
System Info
transformers version: 4.30.2
Who can help?
@arthu
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
The tokenizer should return:
['asdfg', 'word', 'time', 'asdfg', 'time']
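For anyone who wants to check this on their own install, here is a minimal sketch of the check, based on the session in the comment above (slow tokenizer, loaded with legacy=False). The helper names tokenize_with_added_token, boundary_pieces_after, and demo are hypothetical, not part of transformers. It also assumes the extra space would show up as the SentencePiece word-boundary marker "▁" on the piece right after the added token; the thread does not show the buggy output, so treat that as a guess.

```python
# SentencePiece marks a preceding space with this character (U+2581).
SPIECE_UNDERLINE = "\u2581"


def tokenize_with_added_token(tokenizer, token, text):
    """Register `token` as a regular (non-special) added token, then tokenize `text`."""
    tokenizer.add_tokens([token], special_tokens=False)
    return tokenizer.tokenize(text)


def boundary_pieces_after(pieces, token):
    """Return the pieces that directly follow `token` and start with the space marker.

    Assumption: an unwanted space after an added token appears as a leading
    "▁" on the next piece. An empty result means no such space was found.
    """
    return [
        nxt
        for cur, nxt in zip(pieces, pieces[1:])
        if cur == token and nxt.startswith(SPIECE_UNDERLINE)
    ]


def demo():
    """Run the check against t5-base (needs transformers, sentencepiece, and network access)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-base", legacy=False, use_fast=False)
    pieces = tokenize_with_added_token(tokenizer, "asdfg", "asdfgwordtimeasdfgtime")
    print(pieces)
    print("pieces with a leading space marker:", boundary_pieces_after(pieces, "asdfg"))
```

Calling demo() prints the token list so it can be compared with the expected one above. An empty second line would suggest no space was inserted after the added token.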