Slow Tokenizer adds whitespace after special token #25073

Closed
2 of 4 tasks
g588928812 opened this issue Jul 25, 2023 · 13 comments

Comments

@g588928812

g588928812 commented Jul 25, 2023

System Info

Python 3.10.6
Transformers 4.31.0
<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
import transformers

tokenizer = AutoTokenizer.from_pretrained(
	"../models/llama-2-7b",
	use_fast=False,
)

txt="this is one sentence." + tokenizer.eos_token + "this is another sentence." + tokenizer.eos_token + "this is the third sentence." + tokenizer.eos_token

txt_encoded = tokenizer.encode(txt, add_special_tokens=False)
txt_encoded_decoded = tokenizer.decode(txt_encoded)
txt_encoded_decoded_spaces_false = tokenizer.decode(txt_encoded, spaces_between_special_tokens=False)

print(transformers.__version__)
print(tokenizer.__class__)

print(f"INPUT:\n{txt}\n")
print(f"ROUNDTRIP:\n{txt_encoded_decoded}\n")
print(f"ROUNDTRIP w/ spaces_between_special_tokens=F:\n{txt_encoded_decoded}\n")

Output:

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
4.31.0
<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>
INPUT:
this is one sentence.</s>this is another sentence.</s>this is the third sentence.</s>

ROUNDTRIP:
 this is one sentence.</s> this is another sentence.</s> this is the third sentence.</s>

ROUNDTRIP w/ spaces_between_special_tokens=F:
 this is one sentence.</s> this is another sentence.</s> this is the third sentence.</s>

Expected behavior

txt == txt_encoded_decoded

I expect the text to be the same as decode(encode(text)); however, a whitespace is added after each special token (</s>). From what I saw in previous issues, spaces_between_special_tokens=False should change that, but it does not: the whitespaces are still there.

What am I missing?

Thank you for your help, and apologies in advance: this issue seems to come up quite often, and I spent quite some time going through issues in this repo, but nothing solved it for me.

@ArthurZucker
Collaborator

Hey! My first suggestion would be not to use the legacy behaviour: set legacy=False when you initialize the tokenizer.
Second, the txt == txt_encoded_decoded assumption is not always true for all tokenizers. In this case, the decoding adds an extra space, maybe because it is based on the previous legacy behaviour. Will investigate.
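
For reference, a minimal sketch of that setup, reusing the reproduction above (the model path is the reporter's local checkout, so adjust it to yours):

from transformers import AutoTokenizer

# Same reproduction as above, but with the legacy behaviour disabled.
tokenizer = AutoTokenizer.from_pretrained(
	"../models/llama-2-7b",
	use_fast=False,
	legacy=False,
)

txt = "this is one sentence." + tokenizer.eos_token + "this is another sentence."
print(tokenizer.decode(tokenizer.encode(txt, add_special_tokens=False)))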

@g588928812
Author

My first suggestion would be to not use the legacy behaviour by setting legacy = False when you initialize the tokenizer.

Thanks! I tried that, though, and it did not change the output.

@ArthurZucker
Collaborator

OK, the same issue exists with the fast version, but the problem is with the encoding, which adds extra spaces between the special tokens... It's a mess, haha

@wlhgtc
Contributor

wlhgtc commented Jul 28, 2023

@ArthurZucker
Sorry, I can't understand when and why we need to set legacy=False. Could you explain?
I ran the code as follows:

    txt = "one more thing" + "<s>" + "traditionally" + "<s>"
    tokenizer1 = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=True, use_fast=False
    )
    tokenizer2 = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=False, use_fast=False
    )

    t1 = tokenizer1.tokenize(txt)
    t2 = tokenizer2.tokenize(txt)

Then I got:

t1:['▁one', '▁more', '▁thing', '<s>', '▁tradition', 'ally', '<s>']
t2:['▁one', '▁more', '▁thing', '<s>', 'tradition', 'ally', '<s>']

A word starting with ▁ usually means the start of a new word (compare ▁more and ally).
Even though we don't add a space before "traditionally", it is still considered a new word.
So, it seems tokenizer2 is the meaningful one?

@ArthurZucker
Collaborator

ArthurZucker commented Jul 28, 2023

No, a word starting with ▁ means that it has a space before it, and thus the token is ▁tradition, while tradition is a different token. If you read the documentation that points to the PR #24565, there is a similar example.
What's important to understand is the concept of added tokens.

Most often, sentencepiece tokenizers have a vocabulary, but some tokens are added afterwards; this happens with T5, for example. In transformers, we do not modify the underlying sentencepiece object, but we still support adding tokens.

Now imagine that thin is part of the sentencepiece vocab, but ▁thin is not. If thin appears at the start of a word like thinking, it will be tokenized as [▁, thin, king], not [▁, thin, ▁king]. The same applies for any tokens that are originally part of the sentencepiece model.

In transformers, all special tokens are essentially added to the vocabulary, so we want to reproduce that behaviour and not add an extra space.
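
To make that concrete, here is a small sketch based on the snippet earlier in this thread (same local model path, so adjust it to yours; the expected tokens are the ones reported above):

from transformers import LlamaTokenizer

txt = "one more thing" + "<s>" + "traditionally" + "<s>"

tokenizer_legacy = LlamaTokenizer.from_pretrained("./resources/models/llama-2-7b-hf", legacy=True, use_fast=False)
tokenizer_fixed = LlamaTokenizer.from_pretrained("./resources/models/llama-2-7b-hf", legacy=False, use_fast=False)

# legacy=True inserts a space after the added <s>, so the next word starts with ▁.
print(tokenizer_legacy.tokenize(txt))  # ['▁one', '▁more', '▁thing', '<s>', '▁tradition', 'ally', '<s>']
# legacy=False does not, so 'tradition' is encoded exactly as it appears after the token.
print(tokenizer_fixed.tokenize(txt))   # ['▁one', '▁more', '▁thing', '<s>', 'tradition', 'ally', '<s>']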

PS: please refrain from asking something pretty much unrelated. If you have a question (not a bug) feel free to post it on the discussion forum

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Sep 2, 2023
@xenova
Contributor

xenova commented Dec 17, 2023

@ArthurZucker this should be reopened, right? As stated in your previous response:

In transformers all special tokens are kind of added to the vocabulary, so we want to reproduce the behaviour and not add extra space.

So, basically, there should not be an added space after special tokens... However, I'm getting the opposite result, with legacy=False being the incorrect one.

from transformers import AutoTokenizer
text = "hello world"

# 1. Legacy tokenizer
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer", use_fast=False, legacy=True)
token_ids = tokenizer.encode(text, add_special_tokens=True)
print(f'{token_ids=}')                              # [1, 22172, 3186] (correct)
print(f'{tokenizer.decode(token_ids)=}')            # '<s>hello world' (correct)

# 2. Non-Legacy tokenizer
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer", use_fast=False, legacy=False)
token_ids = tokenizer.encode(text, add_special_tokens=True)
print(f'{token_ids=}')                              # [1, 22172, 3186]  (correct)
print(f'{tokenizer.decode(token_ids)=}')            # '<s> hello world' (incorrect)

(this is also different from the other related issues, since those deal with encoding and not decoding)

@xenova xenova reopened this Dec 17, 2023
@ArthurZucker
Collaborator

Yes, until #26678 is merged

@huggingface huggingface deleted a comment from github-actions bot Jan 12, 2024
@ArthurZucker
Collaborator

Just wait a bit!


github-actions bot commented Feb 6, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@xenova
Contributor

xenova commented Feb 6, 2024

Closed by #26678 👍

@xenova xenova closed this as completed Feb 6, 2024
@Butanium

Butanium commented Aug 14, 2024

Why is this closed when the behavior is not the same depending on whether you set use_fast to True or False, @ArthurZucker?

@ArthurZucker
Collaborator

No, the reason it's closed is that this is controlled by a flag, legacy, which can be set to True or False.
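
Setting it explicitly for both tokenizers looks roughly like this (a sketch using the public test tokenizer from the example above; whether the fast tokenizer accepts the legacy kwarg depends on your transformers version, so treat that part as an assumption):

from transformers import AutoTokenizer

repo = "hf-internal-testing/llama-tokenizer"
text = "hello world"

slow = AutoTokenizer.from_pretrained(repo, use_fast=False, legacy=False)
fast = AutoTokenizer.from_pretrained(repo, use_fast=True, legacy=False)

# With the fix merged, both should round-trip without a space after <s>.
print(slow.decode(slow.encode(text, add_special_tokens=True)))
print(fast.decode(fast.encode(text, add_special_tokens=True)))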
