
Special tokens of UDOP aren't encoded/decoded correctly #29591

Closed
NielsRogge opened this issue Mar 11, 2024 · 3 comments · Fixed by #29594

@NielsRogge
Contributor

NielsRogge commented Mar 11, 2024

System Info

Transformers dev 4.38.2

Who can help?

Question for @ArthurZucker

Reproduction

UDOP has been added and the model works as expected. However, for the tokenizer, the authors defined 1201 special tokens (only used during pre-training) which need to be added manually to the vocabulary of the tokenizer in order to get one-to-one matching results with the original implementation.

Based on the comment here: #22940 (comment), I tried doing this:

from transformers import UdopTokenizer

# Add extra_ids to the special token list
extra_ids = 100
loc_extra_ids = 501
other_extra_ids = 200

additional_special_tokens = []
if extra_ids > 0 and "<extra_id_0>" not in additional_special_tokens:
    additional_special_tokens = ["<extra_id_{}>".format(i) for i in range(extra_ids)]
    additional_special_tokens.extend(["<extra_l_id_{}>".format(i) for i in range(extra_ids)])
    additional_special_tokens.extend(["</extra_l_id_{}>".format(i) for i in range(extra_ids)])
    additional_special_tokens.extend(["<extra_t_id_{}>".format(i) for i in range(extra_ids)])
    additional_special_tokens.extend(["</extra_t_id_{}>".format(i) for i in range(extra_ids)])

if loc_extra_ids > 0 and "<loc_0>" not in additional_special_tokens:
    additional_special_tokens.extend(["<loc_{}>".format(i) for i in range(loc_extra_ids)])

if other_extra_ids > 0 and "<other_0>" not in additional_special_tokens:
    additional_special_tokens.extend(["<other_{}>".format(i) for i in range(other_extra_ids)])

hf_tokenizer = UdopTokenizer.from_pretrained("t5-base", legacy=True, additional_special_tokens=additional_special_tokens)

The idea was that this could avoid the regex the authors defined for encoding. However, it doesn't result in the same behaviour. Say you have the tensor [0, 8986, 32942, 32966, 32554, 32551, 1]. The original tokenizer decodes this to (using the original notebook):

from core.models import UdopTokenizer

original_tokenizer = UdopTokenizer.from_pretrained("path_to_original_files")

print(original_tokenizer.decode([0,  8986, 32942, 32966, 32554, 32551, 1]))
>>> paragraph<loc_58><loc_34><loc_446><loc_449>

whereas with the tokenizer defined above we get:

print(hf_tokenizer.decode([0,  8986, 32942, 32966, 32554, 32551, 1]))
>>> paragraph <loc_442> <loc_466> <loc_54> <loc_51>

This is due to this function, I assume, which causes the additional tokens to not have incremental IDs:

for token in original_tokenizer.additional_special_tokens:
  print(token, original_tokenizer.convert_tokens_to_ids([token]))
<extra_id_0> [32099]
<extra_id_1> [32098]
<extra_id_2> [32097]
<extra_id_3> [32096]
<extra_id_4> [32095]
<extra_id_5> [32094]
<extra_id_6> [32093]
<extra_id_7> [32092]
<extra_id_8> [32091]
<extra_id_9> [32090]
<extra_id_10> [32089]
<extra_id_11> [32088]
<extra_id_12> [32087]
<extra_id_13> [32086]
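
As background, here's a minimal sketch (using a plain T5Tokenizer on t5-base purely for illustration, not the UDOP setup) of why list order matters: tokens passed as additional special tokens that aren't already in the vocab get fresh IDs assigned in the order of the list:

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
# tokens that aren't in the vocab yet receive fresh, consecutive IDs
# in the order they appear in the list
tok.add_special_tokens({"additional_special_tokens": ["<loc_0>", "<loc_1>", "<loc_2>"]})
print(tok.convert_tokens_to_ids(["<loc_0>", "<loc_1>", "<loc_2>"]))
# on t5-base (vocab of 32100 incl. the 100 <extra_id_*> tokens) this should
# print something like [32100, 32101, 32102]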

Expected behavior

Encoding/decoding of the 1201 special tokens that is equivalent to the original implementation.
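
For concreteness, one way to check this would be a parity test over the token → ID mapping, sketched below (reusing original_tokenizer and hf_tokenizer from the snippets above):

mismatches = []
for token in original_tokenizer.additional_special_tokens:
    orig_id = original_tokenizer.convert_tokens_to_ids(token)
    hf_id = hf_tokenizer.convert_tokens_to_ids(token)
    if orig_id != hf_id:
        mismatches.append((token, orig_id, hf_id))

# ideally this reports 0 mismatches; with the tokenizer built above it doesn't
print(f"{len(mismatches)} of {len(original_tokenizer.additional_special_tokens)} special tokens map to different IDs")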

@NielsRogge
Contributor Author

Ok, I think I figured it out. We can sort the additional_special_tokens of the original tokenizer by their IDs (making sure the IDs are incremental), and then add those to the HF tokenizer, like so (thanks, ChatGPT!):

true_special_tokens = []

# pair each special token with its ID in the original tokenizer
for token in original_tokenizer.additional_special_tokens:
    true_special_tokens.append((original_tokenizer.convert_tokens_to_ids([token]), token))

# sort by ID so the list order matches the original ID order
sorted_data = sorted(true_special_tokens, key=lambda x: x[0][0])
sorted_tokens = [item[1] for item in sorted_data]

print(sorted_tokens)

This gives us the 1201 additional tokens while making sure that their IDs are incremental.
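
A sketch of how that sorted list would then be used (the decoded string may still differ in whitespace depending on decode kwargs such as spaces_between_special_tokens):

hf_tokenizer = UdopTokenizer.from_pretrained(
    "t5-base", legacy=True, additional_special_tokens=sorted_tokens
)

ids = [0, 8986, 32942, 32966, 32554, 32551, 1]
print(hf_tokenizer.decode(ids))
# should now resolve to the same tokens as the original:
# paragraph <loc_58> <loc_34> <loc_446> <loc_449>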

@ArthurZucker
Collaborator

Answered on the PR, yes we need to add them in a particular order!


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
