Contact Details
[email protected]
What happened?
I'm using LiteLLM to run inference against Llamafile in Lumigator (https://github.com/mozilla-ai/lumigator/blob/main/lumigator/jobs/inference/model_clients.py#L65), and Llamafile appears to ignore max_completion_tokens and only honors max_tokens.
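For reference, this is roughly how the call goes through LiteLLM (a minimal sketch, not the actual Lumigator code; the "openai/" model prefix, api_base, and prompt are placeholders):

```python
import litellm

# Minimal sketch of the LiteLLM call against a local llamafile server.
# The "openai/" prefix routes through LiteLLM's OpenAI-compatible provider;
# the model name, api_base, and prompt below are placeholders.
response = litellm.completion(
    model="openai/DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf",
    api_base="http://localhost:8080/v1",
    messages=[{"role": "user", "content": "Summarize this text ..."}],
    temperature=0.0,
    top_p=0.9,
    max_completion_tokens=512,  # llamafile verbose logs show n_remain=-1 (ignored?)
    # max_tokens=512,           # llamafile verbose logs show n_remain=511 (honored)
)
print(response.choices[0].message.content)
```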
If I pass in max_completion_tokens=512, the llamafile verbose logs show
process_token] next token | has_next_token=true n_remain=-1
But if I pass in max_tokens=512, I see what I expect, i.e.
process_token] next token | has_next_token=true n_remain=511
I'm not sure why one works but not the other, since llamafile/server/v1_completions.cpp line 267 (at commit 29b5f27) makes it look like max_completion_tokens is treated the same as max_tokens.
When I pass in the max_tokens param, the LiteLLM logs show:
http://localhost:8080/v1/ \..... -d '{'model': 'DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf', ... 'temperature': 0.0, 'top_p': 0.9, 'max_tokens': 512, 'frequency_penalty': 0.0, 'extra_body': {}}'
When I pass in max_completion_tokens, this is what LiteLLM sends:
http://localhost:8080/v1/ \..... -d '{'model': 'DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf','temperature': 0.0, 'top_p': 0.9, 'max_completion_tokens': 512, 'frequency_penalty': 0.0, 'extra_body': {}}'
So, as best I can tell, LiteLLM is doing the right thing and passing the params along.
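To take LiteLLM out of the picture, a direct request to the llamafile server should show the same behavior. A hedged sketch (the /v1/completions path and the prompt field are my assumptions based on the OpenAI-style API, not taken from the Lumigator code):

```python
import requests

# Direct request to the llamafile server, bypassing LiteLLM, to check whether
# the server itself ignores max_completion_tokens. The endpoint path and the
# prompt field are assumptions based on the OpenAI-style completions API.
url = "http://localhost:8080/v1/completions"
payload = {
    "model": "DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf",
    "prompt": "Hello",
    "temperature": 0.0,
    "top_p": 0.9,
    # Toggle between these two and compare n_remain in the verbose server logs:
    "max_completion_tokens": 512,  # expect n_remain=-1 if the param is ignored
    # "max_tokens": 512,           # expect n_remain=511 if the param is honored
}
resp = requests.post(url, json=payload, timeout=120)
print(resp.json())
```

If the direct request also shows n_remain=-1 with max_completion_tokens, that would confirm the issue is in the llamafile server rather than in LiteLLM.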
Version
Model creator: unsloth
Quantized GGUF files used: unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF
Commit message "Update README.md"
Commit hash 097680e4eed7a83b3df6b0bb5e5134099cadf1b0
LlamaFile version used: Mozilla-Ocho/llamafile
Commit message "Merge pull request #687 from Xydane/main Add Support for DeepSeek-R1 models"
Commit hash 29b5f27
What operating system are you seeing the problem on?
Linux, Mac
Relevant log output