T5 Tokenizer Adds Space after Each Added (Extra) Token #24743

Closed
2 of 4 tasks
tshu-w opened this issue Jul 11, 2023 · 5 comments · Fixed by #24622

Comments

@tshu-w
Contributor

tshu-w commented Jul 11, 2023

System Info

  • transformers version: 4.30.2
  • Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.35
  • Python version: 3.11.3
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: (NA)
  • Using distributed or parallel set-up in script?: (NA)

Who can help?

@arthu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In [1]: from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("./models/t5-base/")

In [3]: tokenizer.add_tokens(["asdfg"], special_tokens=False)
Out[3]: 1

In [4]: tokenizer.tokenize("asdfgwordtimeasdfgtime")
Out[4]: ['asdfg', '▁word', 'time', 'asdfg', '▁time']

Expected behavior

The tokenizer should return ['asdfg', 'word', 'time', 'asdfg', 'time'], with no '▁' (space marker) inserted after the added token.
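To make the mismatch easy to spot in longer inputs, here is a minimal sketch of a checker that works on the token list alone. The function name `find_spurious_spaces` is hypothetical (it is not part of transformers), and it assumes the tokens, with their "▁" markers stripped, concatenate back to the input text apart from single spaces.

```python
# "▁" (U+2581) is the marker SentencePiece-style tokenizers use for a preceding space.
SPIECE_UNDERLINE = "\u2581"


def find_spurious_spaces(text, tokens, added_tokens):
    """Return indices of tokens that start with "▁" directly after an added
    token although `text` has no space at that position.

    Hypothetical helper for illustration; assumes the tokens (minus markers)
    reproduce `text` in order, with at most one space between pieces.
    """
    added = set(added_tokens)
    flagged = []
    pos = 0  # current character offset into `text`
    for i, tok in enumerate(tokens):
        has_marker = tok.startswith(SPIECE_UNDERLINE)
        piece = tok[1:] if has_marker else tok
        # Consume a real space from the text if there is one here.
        real_space = text[pos:pos + 1] == " "
        if real_space:
            pos += 1
        # A marker without a real space, right after an added token, is the bug.
        if has_marker and not real_space and i > 0 and tokens[i - 1] in added:
            flagged.append(i)
        pos += len(piece)
    return flagged


reported = ["asdfg", "\u2581word", "time", "asdfg", "\u2581time"]
expected = ["asdfg", "word", "time", "asdfg", "time"]
text = "asdfgwordtimeasdfgtime"

print(find_spurious_spaces(text, reported, ["asdfg"]))  # flags the two '▁' tokens
print(find_spurious_spaces(text, expected, ["asdfg"]))  # nothing flagged
```

Passing the output of `tokenizer.tokenize(text)` as `tokens` would let the same check run against a real tokenizer.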

@ydshieh
Collaborator

ydshieh commented Jul 11, 2023

I think a fix is in #24622.

@tshu-w tshu-w closed this as completed Jul 11, 2023
@ydshieh
Collaborator

ydshieh commented Jul 11, 2023

FYI: that PR is not yet merged into the main branch.

@ArthurZucker ArthurZucker reopened this Jul 11, 2023
@ArthurZucker
Collaborator

Let's wait until we merge to close!

@tshu-w
Contributor Author

tshu-w commented Jul 19, 2023

@ArthurZucker Hi, this issue still exists after updating transformers to the latest release, 4.31.0, which includes #24622.

@ArthurZucker
Collaborator

Hey!
It is addressed for the slow tokenizers, which are part of transformers! Fast tokenizers will need to wait a bit. It is also linked to the conversion script and the metaspace handling, which need to be used similarly to Llama.

In [6]: tokenizer = AutoTokenizer.from_pretrained("t5-base", legacy = False, use_fast = False)
/fsx/arthur/miniconda3/envs/py10/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5.py:199: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(

In [7]: tokenizer.add_tokens(["asdfg"], special_tokens=False)
Out[7]: 1

In [8]: tokenizer.tokenize("asdfgwordtimeasdfgtime")
Out[8]: ['asdfg', 'word', 'time', 'asdfg', 'time']

The key is that you need to set legacy=False and use_fast=False, because the fast tokenizer is not fixed yet 😉
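For use in a script rather than an interactive session, the workaround can be wrapped in a small loader. This is only a sketch: the function name `load_fixed_t5_tokenizer` and the constant `FIXED_TOKENIZER_KWARGS` are hypothetical, the checkpoint name and added token are the ones used in this thread, and running it assumes transformers (4.31.0 or later, per the comments above) and sentencepiece are installed.

```python
# Keyword arguments from the comment above: the non-legacy behaviour is only
# available on the slow (SentencePiece-based) tokenizer at this point.
FIXED_TOKENIZER_KWARGS = {"legacy": False, "use_fast": False}


def load_fixed_t5_tokenizer(name="t5-base", extra_tokens=("asdfg",)):
    """Load the slow T5 tokenizer with legacy=False and register extra tokens.

    Hypothetical convenience wrapper; the import is done lazily so that
    defining the function does not require transformers to be installed.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name, **FIXED_TOKENIZER_KWARGS)
    tokenizer.add_tokens(list(extra_tokens), special_tokens=False)
    return tokenizer
```

Calling `load_fixed_t5_tokenizer().tokenize("asdfgwordtimeasdfgtime")` should then match the `Out[8]` result shown above; dropping either keyword argument is expected to bring back the extra '▁' tokens until the fast tokenizer is fixed.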
