Slow Tokenizer adds whitespace after special token #25073
Hey! My first suggestion would be to not use the legacy behaviour by setting `legacy=False`.
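A minimal sketch of that suggestion, assuming the public `hf-internal-testing/llama-tokenizer` checkpoint used later in this thread rather than a local path:

```python
from transformers import AutoTokenizer

# Load the slow LLaMA tokenizer without the legacy behaviour.
tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/llama-tokenizer", use_fast=False, legacy=False
)
```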
Thanks! I tried that, though, and it did not change the output.
OK, the same issue exists with the fast version, but the problem is with the encoding, which adds extra spaces between the special tokens... It's a mess haha
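A quick way to check the fast path yourself; the checkpoint name is an assumption, borrowed from a later snippet in this thread:

```python
from transformers import AutoTokenizer

# Inspect whether the fast tokenizer also inserts an extra '▁' (space) around '<s>'.
tok_fast = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer", use_fast=True)
print(tok_fast.tokenize("one<s>two"))
```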
@ArthurZucker

```python
from transformers import LlamaTokenizer

txt = "one more thing" + "<s>" + "traditionally" + "<s>"
tokenizer1 = LlamaTokenizer.from_pretrained(
    "./resources/models/llama-2-7b-hf", legacy=True, use_fast=False
)
tokenizer2 = LlamaTokenizer.from_pretrained(
    "./resources/models/llama-2-7b-hf", legacy=False, use_fast=False
)
t1 = tokenizer1.tokenize(txt)
t2 = tokenizer2.tokenize(txt)
```

Then I got:
The word starting with a […]
No, words starting with […] Most often, sentencepiece tokenizers have a vocabulary, but some tokens are added afterwards. This happens with T5, for example. In […] Now imagine if […] In […]

PS: please refrain from asking something pretty much unrelated. If you have a question (not a bug), feel free to post it on the discussion forum.
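To illustrate the vocabulary-vs-added-tokens distinction, a small sketch using the slow tokenizer's `added_tokens_encoder` map and its underlying `sp_model` (checkpoint assumed, as above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer", use_fast=False)

# Special tokens such as '<s>' live in the added-tokens map, grafted on
# top of the vocabulary the sentencepiece model was trained with:
print(tok.added_tokens_encoder)
# Size of the underlying sentencepiece vocabulary itself:
print(tok.sp_model.get_piece_size())
```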
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@ArthurZucker this should be reopened, right? As stated in your previous response:
So, basically, there should not be a space added after special tokens... However, I'm getting the opposite result, with:

```python
from transformers import AutoTokenizer

text = "hello world"

# 1. Legacy tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/llama-tokenizer", use_fast=False, legacy=True
)
token_ids = tokenizer.encode(text, add_special_tokens=True)
print(f'{token_ids=}')                    # [1, 22172, 3186] (correct)
print(f'{tokenizer.decode(token_ids)=}')  # '<s>hello world' (correct)

# 2. Non-legacy tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/llama-tokenizer", use_fast=False, legacy=False
)
token_ids = tokenizer.encode(text, add_special_tokens=True)
print(f'{token_ids=}')                    # [1, 22172, 3186] (correct)
print(f'{tokenizer.decode(token_ids)=}')  # '<s> hello world' (incorrect)
```

(This is also different from the other related issues, since those deal with encoding and not decoding.)
Yes, until #26678 is merged.
Just wait a bit!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closed by #26678 👍
Why is this closed, while depending on whether you set `legacy` you get different results?
No, the reason it's closed is because this has a flag: `legacy`.
System Info
Python 3.10.6
Transformers 4.31.0
<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>
Who can help?
@ArthurZucker @younesbelkada
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
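A minimal reproduction sketch consistent with the expected-behaviour section below; the checkpoint and the example string are assumptions, not the original reporter's script:

```python
from transformers import AutoTokenizer

# Slow tokenizer, as in the system info above.
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer", use_fast=False)

txt = "hello</s>world"  # assumed input containing the '</s>' special token
ids = tokenizer.encode(txt, add_special_tokens=False)
txt_encoded_decoded = tokenizer.decode(ids, spaces_between_special_tokens=False)

print(txt_encoded_decoded)         # reported: a whitespace appears after '</s>'
print(txt == txt_encoded_decoded)  # expected True, reported False
```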
Output:
Expected behavior
`txt == txt_encoded_decoded`

I expect `text` to be the same as `decode(encode(text))`; however, a whitespace is added after each special token (`</s>`). From what I saw in previous issues, `spaces_between_special_tokens=False` should change that, but it does not: the whitespaces are still there. What am I missing?
Thank you for your help, and apologies in advance: this issue seems to come up quite often, and I spent quite some time going through issues in this repo, but nothing solved it for me.