Adding custom tokens makes the T5Tokenizer always strip spaces #11531
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The issue still persists, and tokenizers in general still behave inconsistently with special tokens and whitespace.
Hello @LysandreJik, similar to the issues above, we experience inconsistent behavior with spaces in the immediate surroundings of added tokens.
For the fast tokenizer, a space is inserted after the added token; for the slow one, spaces in front of added tokens are also removed:
At least for the Python tokenizer, I believe the problem lies in the way texts with added tokens are passed to the underlying SentencePiece tokenizer. The texts are essentially split by added tokens, and the remaining parts are individually passed to SentencePiece. By default, the SentencePiece tokenizer adds a space at the start of each sequence and removes spaces at the end:
When tokens are converted back into a single string, only the space at the very first position is removed, but not when an added token precedes it.
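The split-and-rejoin behaviour described above can be sketched in plain Python. This is not the transformers implementation — `sp_like_tokenize`, `slow_tokenize`, and `detokenize` are simplified stand-ins that only mimic the SentencePiece defaults (a `▁` space marker prepended to each sequence, trailing spaces dropped):

```python
import re

ADDED_TOKEN = "<extra_id_0>"

def sp_like_tokenize(text):
    """Mimic SentencePiece defaults: each word gets a leading '▁' space
    marker, so a trailing space in the input effectively disappears."""
    return ["▁" + w for w in text.split(" ") if w]

def slow_tokenize(text):
    """Split the input on the added token and tokenize each remaining part
    separately, as described for the slow tokenizer above."""
    tokens = []
    for part in re.split(f"({re.escape(ADDED_TOKEN)})", text):
        if part == ADDED_TOKEN:
            tokens.append(part)
        elif part:
            tokens.extend(sp_like_tokenize(part))
    return tokens

def detokenize(tokens):
    """Join tokens back to text, removing only the space marker at the very
    first position -- the step the comment above points to."""
    text = "".join(tokens).replace("▁", " ")
    return text[1:] if text.startswith(" ") else text

print(slow_tokenize("Hello <extra_id_0> world"))
# → ['▁Hello', '<extra_id_0>', '▁world']
print(detokenize(slow_tokenize("Hello <extra_id_0> world")))
# → 'Hello<extra_id_0> world'  (the space *before* the added token is lost)
```

Because the part before the added token ends in a space that SentencePiece drops, and the rejoin step only strips the marker at position zero, the space in front of the added token never comes back.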
For the slow tokenizer, we could modify the tokens manually, e.g. to take into account spaces in the original string. Unfortunately, we lack the Rust skills to do this for the fast tokenizer. Are there any plans to address this in the near future (since this issue still has the WIP tag)?
Pinging @SaulLu
Hey! This is being discussed in the PR linked above! Sorry for the late reply.
Regarding the default MT5 problem with the addition of a space, this is being handled in #24565. The problem is not because of stripping left/right for punctuation, but
Fixing the Rust tokenizer: it's a hack, so I might have to change the Rust code, but for now the following will strip anything on the right and left, giving the expected results.

```python
class T5Converter(SpmConverter):
    def vocab(self, proto):
        num_extra_ids = self.original_tokenizer._extra_ids
        vocab = [(piece.piece, piece.score) for piece in proto.pieces]
        vocab += [(f"<extra_id_{i}>_", 0.0) for i in range(num_extra_ids - 1, -1, -1)]
        return vocab
```
I tested:

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small", from_slow=True)
>>> tokenizer.tokenize("Hello, <extra_id_0>, ")
['▁Hello', ',', '▁<extra_id_0>', ',', '▁']
```
Environment info

`transformers` version: 4.5.1

If it helps, here's also my `pip-chill`:

Note that `corrupt-text` is a custom library, and the problem persists even when it's uninstalled. It has nothing to do with the problem, as can be seen in the "To reproduce" section.

Who can help

Since it's a tokenizer issue, probably @LysandreJik.
Information
I'm using the `T5Tokenizer`. After adding custom tokens, if the input is tokenized and they're found in the text, they will have stripped spaces around them, even if I explicitly give `add_tokens` and `add_special_tokens` a list of `AddedToken` objects with `lstrip` and `rstrip` explicitly set to `False`.

The problem arises when using:

Check out the "To reproduce" section for an example of code that doesn't work.
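To make the `lstrip`/`rstrip` semantics concrete: the flags are supposed to control whether whitespace adjacent to an added token is swallowed when the input is split around it. Here is a hypothetical pure-Python illustration (`split_on_token` is not the transformers implementation, just a sketch of the intended contract):

```python
import re

def split_on_token(text, token, lstrip=False, rstrip=False):
    """Split `text` around `token`. With lstrip/rstrip set to True, spaces
    before/after the token are absorbed into the match, i.e. stripped."""
    pattern = re.escape(token)
    if lstrip:
        pattern = r" *" + pattern
    if rstrip:
        pattern = pattern + r" *"
    pieces, last = [], 0
    for m in re.finditer(pattern, text):
        if text[last:m.start()]:
            pieces.append(text[last:m.start()])
        pieces.append(token)
        last = m.end()
    if text[last:]:
        pieces.append(text[last:])
    return pieces

print(split_on_token("Hello <tok> world", "<tok>"))
# → ['Hello ', '<tok>', ' world']   (flags False: surrounding spaces survive)
print(split_on_token("Hello <tok> world", "<tok>", lstrip=True, rstrip=True))
# → ['Hello', '<tok>', 'world']     (flags True: surrounding spaces stripped)
```

The bug reported here is that the `T5Tokenizer` behaves like the second call even when both flags are explicitly set to `False`.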
The tasks I am working on is:
It's not really relevant to this problem, but the code is, once again, in the "To reproduce" section.
This is likely related to #7901.
To reproduce
Try running this code:
You will get this:
Expected behavior
We should get:
EDIT: Updated the code to have `rstrip=False`, since I made that mistake originally, but it still acts the same.