
API SSE streaming too chunky #1411

Open
GlasslessPizza opened this issue Mar 7, 2025 · 2 comments


@GlasslessPizza

I'm calling the api like this:

import requests, time

headers = {"Content-Type": "application/json"}
data = {
    "prompt": "<|im_end|>\n<|im_start|>user\nDescribe in-depth how scissors work.\n<|im_end|>\n<|im_start|>assistant\n",
    "temperature": 0,
    "max_context_length": 4096,
    "max_length": 2048,
}
response = requests.post("http://127.0.0.1:5001/api/extra/generate/stream",
                         headers=headers, json=data, stream=True, verify=False)

# Print a timestamp alongside each SSE line as it arrives.
for line in response.iter_lines():
    print(time.time(), line)

The problem is that the stream isn't smooth: the tokens arrive in chunks roughly every half-second.

Version: 1.85.1
Windows 10

@LostRuins
Owner

The tokens should arrive as they are generated. Looks fine to me.

@GlasslessPizza
Author

After spending a morning on it, I found the most smoothbrain workaround imaginable.
It turns out that requests' iter_lines has a chunk_size parameter with a default of 512. Setting it to something small like 10 solved the problem.
However, if I run the same script against a llama.cpp server's /completion endpoint, I get smooth streaming without touching chunk_size, which is why I initially thought this was a koboldcpp issue. I don't know why that is, though.
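For anyone who hits the same thing, here is a minimal sketch of the workaround (same request as before; the prompt is shortened here, and 10 is just an arbitrary small value, not a tuned one):

import requests, time

headers = {"Content-Type": "application/json"}
data = {
    "prompt": "Describe in-depth how scissors work.",  # full chat template omitted
    "temperature": 0,
    "max_context_length": 4096,
    "max_length": 2048,
}
response = requests.post("http://127.0.0.1:5001/api/extra/generate/stream",
                         headers=headers, json=data, stream=True)

# iter_lines reads the body in chunk_size-byte pieces; with the default of
# 512, a read can block until 512 bytes of short SSE events have piled up,
# which shows up as half-second bursts. A small chunk_size yields each
# event almost as soon as it arrives.
for line in response.iter_lines(chunk_size=10):
    print(time.time(), line)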
