Special tokens of UDOP aren't encoded/decoded correctly #29591
Comments
Ok I think I figured it out. We can sort the additional special tokens before adding them to the tokenizer.
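A minimal sketch of the idea (the checkpoint and the token name patterns below are placeholders for illustration, not the actual 1201 tokens):

```python
import re

from transformers import AddedToken, UdopTokenizer

# Placeholder checkpoint and token patterns; substitute the real 1201 tokens.
tokenizer = UdopTokenizer.from_pretrained("microsoft/udop-large")

extra_tokens = [f"<loc_{i}>" for i in range(501)]
extra_tokens += [f"<extra_l_id_{i}>" for i in range(100)]

# Sort by prefix and numeric suffix: add_tokens assigns consecutive IDs in
# insertion order, so a numerically sorted list yields incremental IDs.
extra_tokens = sorted(
    extra_tokens,
    key=lambda t: (re.sub(r"\d+", "", t), int(re.search(r"\d+", t).group())),
)

tokenizer.add_tokens(
    [AddedToken(t, lstrip=True, rstrip=True) for t in extra_tokens],
    special_tokens=True,
)
```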
This gives us the 1201 additional tokens, while making sure that their IDs are incremental.
Answered on the PR: yes, we need to add them in a particular order!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Transformers dev 4.38.2
Who can help?
Question for @ArthurZucker
Reproduction
UDOP has been added and the model works as expected. However, for the tokenizer, the authors defined 1201 special tokens (only used during pre-training) which need to be added manually to its vocabulary in order to get one-to-one matching results with the original implementation.
Based on the comment here: #22940 (comment), I tried doing this:
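(The exact snippet isn't reproduced here; the following is a rough sketch of that approach, with a placeholder checkpoint and placeholder token names.)

```python
from transformers import AddedToken, UdopTokenizer

# Placeholder checkpoint and token patterns; the real list contains 1201 tokens.
tokenizer = UdopTokenizer.from_pretrained("microsoft/udop-large")

special_tokens = [f"<extra_l_id_{i}>" for i in range(100)]
special_tokens += [f"<loc_{i}>" for i in range(501)]

# Register the extra tokens as additional special tokens, in list order.
tokenizer.add_special_tokens(
    {"additional_special_tokens": [AddedToken(t, lstrip=True, rstrip=True) for t in special_tokens]}
)
```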
This could avoid the regex defined by the authors when encoding. However, it doesn't result in the same behaviour. Let's say you have the tensor [0, 8986, 32942, 32966, 32554, 32551, 1]. The original tokenizer decodes this to (using the original notebook):
whereas with the tokenizer defined above we get:
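(The two decoded strings aren't reproduced here; the comparison itself is just decoding the same IDs with each tokenizer, e.g.:)

```python
ids = [0, 8986, 32942, 32966, 32554, 32551, 1]

# Same call on the original tokenizer and on the one defined above;
# the decoded strings differ because the extra tokens map to different IDs.
print(tokenizer.decode(ids))
```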
I assume this is due to this function, which causes the additional tokens to not get incremental IDs.
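A quick way to check the assigned IDs with the tokenizer sketched above:

```python
ids = tokenizer.convert_tokens_to_ids(special_tokens)
print(ids[:10])
# False when the assigned IDs are not contiguous
print(all(b - a == 1 for a, b in zip(ids, ids[1:])))
```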
Expected behavior
Equivalent encoding/decoding of the 1201 special tokens with the original implementation