T5 Tokenizer Adds Space after Each Added (Extra) Token #24743

tshu-w · 2023-07-11T06:54:38Z

System Info

transformers version: 4.30.2
Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.35
Python version: 3.11.3
Huggingface_hub version: 0.15.1
Safetensors version: 0.3.1
PyTorch version (GPU?): 2.0.1+cu117 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: (NA)
Using distributed or parallel set-up in script?: (NA)

Who can help?

@arthu

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

In [1]: from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("./models/t5-base/")

In [3]: tokenizer.add_tokens(["asdfg"], special_tokens=False)
Out[3]: 1

In [4]: tokenizer.tokenize("asdfgwordtimeasdfgtime")
Out[4]: ['asdfg', '▁word', 'time', 'asdfg', '▁time']

Expected behavior

The tokenizer should return ['asdfg', 'word', 'time', 'asdfg', 'time'], with no '▁' inserted after the added token.
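The difference between the reported and expected outputs can be checked mechanically. The following is a minimal sketch, not part of transformers: `spurious_space_positions` is a hypothetical helper name, and it assumes the input text contained no whitespace, so any '▁'-prefixed token directly after an added token is treated as spurious.

```python
# U+2581 is the "▁" marker that SentencePiece uses for a preceding space.
SPIECE_UNDERLINE = "\u2581"


def spurious_space_positions(tokens, added_tokens):
    """Return indices of tokens that start with "▁" right after an added token.

    Assumption: the original text had no whitespace, so every such marker
    was introduced by the tokenizer rather than by the input.
    """
    added = set(added_tokens)
    return [
        i
        for i in range(1, len(tokens))
        if tokens[i - 1] in added and tokens[i].startswith(SPIECE_UNDERLINE)
    ]


# Outputs copied from the report above (Out[4] and the expected behavior).
reported = ["asdfg", SPIECE_UNDERLINE + "word", "time", "asdfg", SPIECE_UNDERLINE + "time"]
expected = ["asdfg", "word", "time", "asdfg", "time"]

print(spurious_space_positions(reported, ["asdfg"]))  # [1, 4]
print(spurious_space_positions(expected, ["asdfg"]))  # []
```

If the input text can contain real spaces after an added token, this check cannot tell them apart from the spurious ones, so it is only suitable for whitespace-free test strings like the one in the reproduction.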


ydshieh · 2023-07-11T09:15:15Z

I think a fix is in #24622.

ydshieh · 2023-07-11T09:44:49Z

FYI: that PR is not yet merged into the main branch.

ArthurZucker · 2023-07-11T10:03:43Z

Let's wait until we merge to close!

tshu-w · 2023-07-19T04:58:57Z

@ArthurZucker Hi, this issue still exists after updating transformers to the latest release, 4.31.0, which includes #24622.

ArthurZucker · 2023-07-20T10:45:30Z

Hey!
This is addressed for the slow tokenizers, which are part of transformers. The fast tokenizers will need to wait a bit: the fix there is also linked to the conversion script and the Metaspace handling, which need to be used in a way similar to Llama.

In [6]: tokenizer = AutoTokenizer.from_pretrained("t5-base", legacy = False, use_fast = False)
/fsx/arthur/miniconda3/envs/py10/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5.py:199: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(

In [7]: tokenizer.add_tokens(["asdfg"], special_tokens=False)
Out[7]: 1

In [8]: tokenizer.tokenize("asdfgwordtimeasdfgtime")
Out[8]: ['asdfg', 'word', 'time', 'asdfg', 'time']

The key is that you need to set legacy=False and use_fast=False, because the fast tokenizer is not fixed yet 😉
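For use outside an interactive session, the workaround above can be wrapped in a small function. This is a sketch under stated assumptions: `load_slow_t5_tokenizer` and `WORKAROUND_KWARGS` are hypothetical names chosen for illustration, while `AutoTokenizer.from_pretrained`, `add_tokens`, `legacy` and `use_fast` are the calls and arguments shown in the transcript. The slow T5 tokenizer also needs the `sentencepiece` package installed.

```python
# Keyword arguments taken from the comment above. Both are needed because,
# at the time of that comment, only the slow tokenizer carried the fix.
WORKAROUND_KWARGS = {"legacy": False, "use_fast": False}


def load_slow_t5_tokenizer(model_name="t5-base", extra_tokens=()):
    """Load a slow, non-legacy T5 tokenizer and register extra tokens.

    Hypothetical helper for illustration; not a transformers API.
    """
    # Imported lazily so that defining this helper does not require
    # transformers to be importable.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, **WORKAROUND_KWARGS)
    if extra_tokens:
        # special_tokens=False matches the call used in the reproduction.
        tokenizer.add_tokens(list(extra_tokens), special_tokens=False)
    return tokenizer
```

Calling `load_slow_t5_tokenizer("t5-base", ["asdfg"]).tokenize("asdfgwordtimeasdfgtime")` corresponds to steps In [6] to In [8] above, so on a version that includes the fix it should match Out[8].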

tshu-w closed this as completed Jul 11, 2023

ArthurZucker reopened this Jul 11, 2023

ArthurZucker linked a pull request Jul 11, 2023 that will close this issue

[Patch-t5-tokenizer] Patches the changes on T5 to make sure previous behaviour is still valide for beginning of words #24622

Merged

ArthurZucker closed this as completed in #24622 Jul 11, 2023
