
Adding custom tokens makes the T5Tokenizer always strip spaces #11531

Closed
suflaj opened this issue Apr 30, 2021 · 8 comments · Fixed by #23909
Labels
WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress

Comments

@suflaj

suflaj commented Apr 30, 2021

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-3.10.0-957.5.1.el7.x86_64-x86_64-with-centos-7.6.1810-Core
  • Python version: 3.6.13
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

If it helps, here's also my pip-chill:

black==19.10b0
corrupt-text==0.0.1
en-core-web-sm==3.0.0
fairseq==1.0.0a0+f6f220e
flake8==3.9.0
pep8==1.7.1
pip-chill==1.0.1
rope==0.14.0
sentencepiece==0.1.95
torchtext==0.8.0
transformers==4.5.1
wikiextractor==3.0.5

Note that corrupt-text is a custom library; the problem persists even when it is uninstalled. It has nothing to do with the issue, as can be seen in the To reproduce section.

Who can help

Since it's a tokenizer issue, probably @LysandreJik.

Information

I'm using the T5Tokenizer. After adding custom tokens, whenever those tokens are found in the input, the spaces around them are stripped during tokenization, even if I explicitly give add_tokens and add_special_tokens a list of AddedToken objects with lstrip and rstrip set to False.

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

Check out the To reproduce section for an example of code that doesn't work.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

It's not really relevant to this problem, but the code is, once again, in the To reproduce section.

This is likely related to #7901.

To reproduce

Try running this code:

from transformers import T5Tokenizer
from tokenizers import AddedToken

text = "Bruh doits <do_not_touch>"

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer.add_tokens([AddedToken("doits", lstrip=False, rstrip=False)])
tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken("<do_not_touch>", lstrip=False, rstrip=False)
        ]
    }
)

tokens = tokenizer.tokenize(text)
ids = tokenizer(
    text,
    add_special_tokens=False,
    padding=False,
    truncation=False,
    return_attention_mask=False,
)["input_ids"]

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Text after: {tokenizer.convert_tokens_to_string(tokens)}")

You will get this:

Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', 'doits', '<do_not_touch>']
IDs: [3, 9465, 107, 32100, 32101]
Text after: Bruhdoits<do_not_touch>

Expected behavior

We should get:

Text: Bruh doits <do_not_touch>
Tokens: ['▁', 'Bru', 'h', '▁', 'doits', '▁', '<do_not_touch>']
IDs: [3, 9465, 107, 3, 32100, 3, 32101]
Text after: Bruh doits <do_not_touch>

EDIT: Updated the code to use rstrip=False, since I originally made that mistake, but the behavior is still the same.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@suflaj
Author

suflaj commented May 31, 2021

The issue still persists, and tokenizers in general still behave inconsistently around special tokens and whitespace.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jul 4, 2021
@LysandreJik LysandreJik reopened this Jul 5, 2021
@LysandreJik LysandreJik added the WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress label Jul 5, 2021
@muelerma

Hello @LysandreJik,
is there any update on this? We are also facing issues with added tokens for both the Rust and Python tokenizers when using the default mt5 tokenizer.

Similar to the issues above, we experience inconsistent behavior with spaces in the immediate surroundings of added tokens.

tokenizer_fast = MT5TokenizerFast.from_pretrained("google/mt5-base")
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")

tokenizer_fast.add_tokens("<new_token>")
tokenizer.add_tokens("<new_token>")

text = "This is a test <new_token>."

tokens = tokenizer_fast.tokenize(text)
print(tokens)
tokenizer_fast.convert_tokens_to_string(tokens)

['▁This', '▁is', '▁', 'a', '▁test', '▁', '<new_token>', '▁', '.']
'This is a test <new_token> .'

For the fast tokenizer, a space is inserted after the added token.

For the slow one, spaces in front of added tokens are also removed:

tokens = tokenizer.tokenize(text)
print(tokens)
tokenizer.convert_tokens_to_string(tokens)

['▁This', '▁is', '▁', 'a', '▁test', '<new_token>', '▁', '.']
'This is a test<new_token> .'

At least for the Python tokenizer, I believe the problem lies in the way texts containing added tokens are passed to the underlying sentencepiece tokenizer. The text is basically split on the added tokens and the remaining parts are passed to sentencepiece individually. By default, the sentencepiece tokenizer adds a space at the start of each sequence and removes trailing spaces at the end:

tokenizer.sp_model.encode("A test ", out_type=str)

['▁A', '▁test']

When the tokens are converted back into a single string, only the space at the very first position is removed, but not if an added token comes right before it:

tokenizer.sp_model.decode_pieces(['▁This', '▁is', '▁', 'a', '▁test', '<new_token>', '▁', '.'])

'This is a test<new_token> .'
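
To make that concrete, here is a minimal simulation of the mechanism described above (a simplified sketch, not the actual transformers code path): split the text on the added token, encode each remaining chunk with sentencepiece separately, and concatenate. The trailing space of the chunk before <new_token> is dropped by sentencepiece, which is exactly where the space disappears.

import re
from transformers import MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
tokenizer.add_tokens("<new_token>")

text = "This is a test <new_token>."

# Split on the added token, keeping the token itself (capture group)
chunks = re.split(r"(<new_token>)", text)  # ['This is a test ', '<new_token>', '.']

tokens = []
for chunk in chunks:
    if chunk == "<new_token>":
        tokens.append(chunk)  # added tokens bypass sentencepiece entirely
    else:
        # sentencepiece prepends '▁' and drops the trailing space of each chunk
        tokens.extend(tokenizer.sp_model.encode(chunk, out_type=str))

print(tokens)
# ['▁This', '▁is', '▁', 'a', '▁test', '<new_token>', '▁', '.']
# -> same as tokenizer.tokenize(text): the space before '<new_token>' is lost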

For the slow tokenizer, we could modify the tokens manually, e.g. by taking the spaces in the original string into account. Unfortunately, we lack the Rust skills to do the same for the fast tokenizer.
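
As a rough illustration of that workaround (a sketch only; restore_spaces is a hypothetical helper, not a transformers API, and it assumes each added token actually occurs in the text): walk the original string and reinsert a standalone '▁' piece wherever a space preceded an added token. It reuses the slow tokenizer defined above.

def restore_spaces(text, tokens, added_tokens):
    """Reinsert a '▁' piece before each added token that was preceded by a
    space in the original text (rough illustration only)."""
    fixed, cursor = [], 0
    for tok in tokens:
        if tok in added_tokens:
            pos = text.find(tok, cursor)
            # sentencepiece dropped the space before the added token; put it back
            if pos > 0 and text[pos - 1] == " ":
                fixed.append("▁")
            cursor = pos + len(tok)
        fixed.append(tok)
    return fixed

text = "This is a test <new_token>."
tokens = tokenizer.tokenize(text)  # ['▁This', '▁is', '▁', 'a', '▁test', '<new_token>', '▁', '.']
print(restore_spaces(text, tokens, {"<new_token>"}))
# ['▁This', '▁is', '▁', 'a', '▁test', '▁', '<new_token>', '▁', '.']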

Are there any plans to adjust this in the near future (since this issue still has the WIP tag)?

@LysandreJik
Member

Pinging @SaulLu

@ArthurZucker
Collaborator

Hey! This is being discussed in the PR linked above! Sorry for the late reply.

@ArthurZucker
Collaborator

ArthurZucker commented Jun 29, 2023

Regarding the default MT5 problem with the addition of a space, this is being handled here: #24565. The problem is not caused by left/right stripping around punctuation; rather, rstrip and lstrip are indeed ignored.

@ArthurZucker
Collaborator

ArthurZucker commented Jun 29, 2023

Fixing the Rust tokenizer: it's a hack, so I might have to change the Rust code, but for now the following will strip anything on the right and left, giving the expected results.

class T5Converter(SpmConverter):
    def vocab(self, proto):
        num_extra_ids = self.original_tokenizer._extra_ids
        vocab = [(piece.piece, piece.score) for piece in proto.pieces]
        vocab += [(f"<extra_id_{i}>_", 0.0) for i in range(num_extra_ids - 1, -1, -1)]
        return vocab
    ..........

I tested:

>>> from transformers import AutoTokenizer
>>> tokenizer=AutoTokenizer.from_pretrained("google/mt5-small", from_slow = True)
>>> tokenizer.tokenize("Hello, <extra_id_0>, ")
['▁Hello', ',', '▁<extra_id_0>', ',', '▁']
