
Special tokens of UDOP aren't encoded/decoded correctly #29591

Closed
NielsRogge opened this issue Mar 11, 2024 · 3 comments · Fixed by #29594

@NielsRogge
Contributor

NielsRogge commented Mar 11, 2024

System Info

Transformers dev 4.38.2

Who can help?

Question for @ArthurZucker

Reproduction

UDOP has been added and the model works as expected. However, for the tokenizer, the authors defined 1201 special tokens (only used during pre-training) which need to be added manually to the vocabulary of the tokenizer in order to get one-to-one matching results with the original implementation.

Based on the comment here: #22940 (comment), I tried doing this:

from transformers import UdopTokenizer

# Add extra_ids to the special token list
extra_ids = 100
loc_extra_ids = 501
other_extra_ids = 200

additional_special_tokens = []
if extra_ids > 0 and "<extra_id_0>" not in additional_special_tokens:
    additional_special_tokens = ["<extra_id_{}>".format(i) for i in range(extra_ids)]
    additional_special_tokens.extend(["<extra_l_id_{}>".format(i) for i in range(extra_ids)])
    additional_special_tokens.extend(["</extra_l_id_{}>".format(i) for i in range(extra_ids)])
    additional_special_tokens.extend(["<extra_t_id_{}>".format(i) for i in range(extra_ids)])
    additional_special_tokens.extend(["</extra_t_id_{}>".format(i) for i in range(extra_ids)])

if loc_extra_ids > 0 and "<loc_0>" not in additional_special_tokens:
    additional_special_tokens.extend(["<loc_{}>".format(i) for i in range(loc_extra_ids)])

if other_extra_ids > 0 and "<other_0>" not in additional_special_tokens:
    additional_special_tokens.extend(["<other_{}>".format(i) for i in range(other_extra_ids)])

hf_tokenizer = UdopTokenizer.from_pretrained("t5-base", legacy=True, additional_special_tokens=additional_special_tokens)

The idea was that this could avoid the regex the authors defined for encoding. However, it doesn't result in the same behaviour. Say you have the tensor [0, 8986, 32942, 32966, 32554, 32551, 1]. The original tokenizer decodes this to (using the original notebook):

from core.models import UdopTokenizer

original_tokenizer = UdopTokenizer.from_pretrained("path_to_original_files")

print(original_tokenizer.decode([0,  8986, 32942, 32966, 32554, 32551, 1]))
>>> paragraph<loc_58><loc_34><loc_446><loc_449>

whereas with the tokenizer defined above we get:

print(hf_tokenizer.decode([0,  8986, 32942, 32966, 32554, 32551, 1]))
>>> paragraph <loc_442> <loc_466> <loc_54> <loc_51>

This is due to this function, I assume, which causes the additional tokens to not have incremental IDs:

for token in original_tokenizer.additional_special_tokens:
  print(token, original_tokenizer.convert_tokens_to_ids([token]))
<extra_id_0> [32099]
<extra_id_1> [32098]
<extra_id_2> [32097]
<extra_id_3> [32096]
<extra_id_4> [32095]
<extra_id_5> [32094]
<extra_id_6> [32093]
<extra_id_7> [32092]
<extra_id_8> [32091]
<extra_id_9> [32090]
<extra_id_10> [32089]
<extra_id_11> [32088]
<extra_id_12> [32087]
<extra_id_13> [32086]
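
As background, here's a minimal sketch (using a plain T5Tokenizer on t5-base purely for illustration, not the UDOP setup) of why list order matters: tokens passed as additional special tokens that aren't already in the vocab get fresh IDs assigned in the order of the list:

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
# tokens that aren't in the vocab yet receive fresh, consecutive IDs
# in the order they appear in the list
tok.add_special_tokens({"additional_special_tokens": ["<loc_0>", "<loc_1>", "<loc_2>"]})
print(tok.convert_tokens_to_ids(["<loc_0>", "<loc_1>", "<loc_2>"]))
# on t5-base (vocab of 32100 incl. the 100 <extra_id_*> tokens) this should
# print something like [32100, 32101, 32102]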

Expected behavior

Encoding/decoding of the 1201 special tokens that is equivalent to the original implementation.
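
For concreteness, one way to check this would be a parity test over the token → ID mapping, sketched below (reusing original_tokenizer and hf_tokenizer from the snippets above):

mismatches = []
for token in original_tokenizer.additional_special_tokens:
    orig_id = original_tokenizer.convert_tokens_to_ids(token)
    hf_id = hf_tokenizer.convert_tokens_to_ids(token)
    if orig_id != hf_id:
        mismatches.append((token, orig_id, hf_id))

# ideally this reports 0 mismatches; with the tokenizer built above it doesn't
print(f"{len(mismatches)} of {len(original_tokenizer.additional_special_tokens)} special tokens map to different IDs")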

@NielsRogge
Contributor Author

Ok, I think I figured it out. We can sort the additional_special_tokens of the original tokenizer by their IDs (making sure the IDs are incremental), and then add those to the HF tokenizer, like so (thanks, ChatGPT!):

true_special_tokens = []

# pair each special token with its ID in the original tokenizer
for token in original_tokenizer.additional_special_tokens:
    true_special_tokens.append((original_tokenizer.convert_tokens_to_ids([token]), token))

# sort by ID so the list order matches the original ID order
sorted_data = sorted(true_special_tokens, key=lambda x: x[0][0])
sorted_tokens = [item[1] for item in sorted_data]

print(sorted_tokens)

This gives us the 1201 additional tokens while making sure that their IDs are incremental.
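
A sketch of how that sorted list would then be used (the decoded string may still differ in whitespace depending on decode kwargs such as spaces_between_special_tokens):

hf_tokenizer = UdopTokenizer.from_pretrained(
    "t5-base", legacy=True, additional_special_tokens=sorted_tokens
)

ids = [0, 8986, 32942, 32966, 32554, 32551, 1]
print(hf_tokenizer.decode(ids))
# should now resolve to the same tokens as the original:
# paragraph <loc_58> <loc_34> <loc_446> <loc_449>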

@ArthurZucker
Collaborator

Answered on the PR, yes we need to add them in a particular order!


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
