
Adding custom tokens makes the T5Tokenizer always strip spaces #11531

Closed
suflaj opened this issue Apr 30, 2021 · 8 comments · Fixed by #23909
Labels
WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress

Comments

@suflaj

suflaj commented Apr 30, 2021

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Python version: 3.6.13
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

If it helps, here's also my pip-chill:

black==19.10b0
corrupt-text==0.0.1
en-core-web-sm==3.0.0
fairseq==1.0.0a0+f6f220e
flake8==3.9.0
pep8==1.7.1
pip-chill==1.0.1
rope==0.14.0
sentencepiece==0.1.95
torchtext==0.8.0
transformers==4.5.1
wikiextractor==3.0.5

Note that corrupt-text is a custom library; the problem persists even when it is uninstalled. It has nothing to do with the issue, as can be seen in the To reproduce section.

Who can help

Since it's a tokenizer issue, probably @LysandreJik.

Information

I'm using the T5Tokenizer. After adding custom tokens, whenever those tokens are found in the input, the spaces around them are stripped during tokenization, even if I explicitly give add_tokens and add_special_tokens a list of AddedToken objects with lstrip and rstrip set to False.

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

Check out the To reproduce section for an example of code that doesn't work.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

It's not really relevant to this problem, but the code is, once again, in the To reproduce section.

This is likely related to #7901.

To reproduce

Try running this code:

from transformers import T5Tokenizer
from tokenizers import AddedToken

text = "Bruh doits <do_not_touch>"

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer.add_tokens([AddedToken("doits", lstrip=False, rstrip=False)])
tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken("<do_not_touch>", lstrip=False, rstrip=False)
        ]
    }
)

tokens = tokenizer.tokenize(text)
ids = tokenizer(
    text,
    add_special_tokens=False,
    padding=False,
    truncation=False,
    return_attention_mask=False,
)["input_ids"]

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Text after: {tokenizer.convert_tokens_to_string(tokens)}")

You will get this:

Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', 'doits', '<do_not_touch>']
IDs: [3, 9465, 107, 32100, 32101]
Text after: Bruhdoits<do_not_touch>

Expected behavior

We should get:

Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', '▁', 'doits', '▁', '<do_not_touch>']
IDs: [3, 9465, 107, 3, 32100, 3, 32101]
Text after: Bruh doits <do_not_touch>

EDIT: Updated the code to use rstrip=False, since I originally made that mistake, but the behavior is still the same.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@suflaj
Author

suflaj commented May 31, 2021

The issue still persists, and tokenizers in general still behave inconsistently around special tokens and whitespace.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jul 4, 2021
@LysandreJik LysandreJik reopened this Jul 5, 2021
@LysandreJik LysandreJik added the WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress label Jul 5, 2021
@muelerma

Hello @LysandreJik,
is there any update on this? We are also facing issues with added tokens for both the Rust and Python tokenizers when using the default mt5 tokenizer.

Similar to the issues above, we experience inconsistent behavior with spaces in the immediate surroundings of added tokens.

tokenizer_fast = MT5TokenizerFast.from_pretrained("google/mt5-base")
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")

tokenizer_fast.add_tokens("<new_token>")
tokenizer.add_tokens("<new_token>")

text = "This is a test <new_token>."

tokens = tokenizer_fast.tokenize(text)
print(tokens)
tokenizer_fast.convert_tokens_to_string(tokens)

['▁This', '▁is', '▁', 'a', '▁test', '▁', '<new_token>', '▁', '.']
'This is a test <new_token> .'

For the fast tokenizer, a space is inserted after the added token.

For the slow one, spaces in front of added tokens are also removed:

tokens = tokenizer.tokenize(text)
print(tokens)
tokenizer.convert_tokens_to_string(tokens)

['▁This', '▁is', '▁', 'a', '▁test', '<new_token>', '▁', '.']
'This is a test<new_token> .'

At least for the Python tokenizer, I believe the problem lies in the way texts containing added tokens are passed to the underlying sentencepiece tokenizer. The text is basically split on the added tokens and the remaining parts are passed to sentencepiece individually. By default, the sentencepiece tokenizer adds a space at the start of each sequence and removes trailing spaces at the end:

tokenizer.sp_model.encode("A test ", out_type=str)

['▁A', '▁test']

When the tokens are converted back into a single string, only the space at the very first position is removed, but not if an added token comes right before it:

tokenizer.sp_model.decode_pieces(['▁This', '▁is', '▁', 'a', '▁test', '<new_token>', '▁', '.'])

'This is a test<new_token> .'
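
To make that concrete, here is a minimal simulation of the mechanism described above (a simplified sketch, not the actual transformers code path): split the text on the added token, encode each remaining chunk with sentencepiece separately, and concatenate. The trailing space of the chunk before <new_token> is dropped by sentencepiece, which is exactly where the space disappears.

import re
from transformers import MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
tokenizer.add_tokens("<new_token>")

text = "This is a test <new_token>."

# Split on the added token, keeping the token itself (capture group)
chunks = re.split(r"(<new_token>)", text)  # ['This is a test ', '<new_token>', '.']

tokens = []
for chunk in chunks:
    if chunk == "<new_token>":
        tokens.append(chunk)  # added tokens bypass sentencepiece entirely
    else:
        # sentencepiece prepends '▁' and drops the trailing space of each chunk
        tokens.extend(tokenizer.sp_model.encode(chunk, out_type=str))

print(tokens)
# ['▁This', '▁is', '▁', 'a', '▁test', '<new_token>', '▁', '.']
# -> same as tokenizer.tokenize(text): the space before '<new_token>' is lost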

For the slow tokenizer, we could modify the tokens manually, e.g. by taking the spaces in the original string into account. Unfortunately, we lack the Rust skills to do the same for the fast tokenizer.
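
As a rough illustration of that workaround (a sketch only; restore_spaces is a hypothetical helper, not a transformers API, and it assumes each added token actually occurs in the text): walk the original string and reinsert a standalone '▁' piece wherever a space preceded an added token. It reuses the slow tokenizer defined above.

def restore_spaces(text, tokens, added_tokens):
    """Reinsert a '▁' piece before each added token that was preceded by a
    space in the original text (rough illustration only)."""
    fixed, cursor = [], 0
    for tok in tokens:
        if tok in added_tokens:
            pos = text.find(tok, cursor)
            # sentencepiece dropped the space before the added token; put it back
            if pos > 0 and text[pos - 1] == " ":
                fixed.append("▁")
            cursor = pos + len(tok)
        fixed.append(tok)
    return fixed

text = "This is a test <new_token>."
tokens = tokenizer.tokenize(text)  # ['▁This', '▁is', '▁', 'a', '▁test', '<new_token>', '▁', '.']
print(restore_spaces(text, tokens, {"<new_token>"}))
# ['▁This', '▁is', '▁', 'a', '▁test', '▁', '<new_token>', '▁', '.']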

Are there any plans to adjust this in the near future (since this issue still has the WIP tag)?

@LysandreJik
Member

Pinging @SaulLu

@ArthurZucker
Collaborator

Hey! This is being discussed in the PR linked above! Sorry for the late reply.

@ArthurZucker
Collaborator

ArthurZucker commented Jun 29, 2023

Regarding the default MT5 problem with the addition of a space, this is being handled here: #24565. The problem is not caused by left/right stripping around punctuation; rather, rstrip and lstrip are indeed ignored.

@ArthurZucker
Collaborator

ArthurZucker commented Jun 29, 2023

Fixing the Rust tokenizer: it's a hack, so I might have to change the Rust code, but for now the following will strip anything on the right and left, giving the expected results.

class T5Converter(SpmConverter):
    def vocab(self, proto):
        num_extra_ids = self.original_tokenizer._extra_ids
        vocab = [(piece.piece, piece.score) for piece in proto.pieces]
        vocab += [(f"<extra_id_{i}>_", 0.0) for i in range(num_extra_ids - 1, -1, -1)]
        return vocab
    ..........

I tested:

>>> from transformers import AutoTokenizer
>>> tokenizer=AutoTokenizer.from_pretrained("google/mt5-small", from_slow = True)
>>> tokenizer.tokenize("Hello, <extra_id_0>, ")
['▁Hello', ',', '▁<extra_id_0>', ',', '▁']
