
API SSE streaming too chunky #1411

Open
GlasslessPizza opened this issue Mar 7, 2025 · 2 comments


@GlasslessPizza

I'm calling the api like this:

import requests, time

headers = {"Content-Type": "application/json"}
data = {
    "prompt": "<|im_end|>\n<|im_start|>user\nDescribe in-depth how scissors work.\n<|im_end|>\n<|im_start|>assistant\n",
    "temperature": 0,
    "max_context_length": 4096,
    "max_length": 2048,
}
response = requests.post("http://127.0.0.1:5001/api/extra/generate/stream",
                         headers=headers, json=data, stream=True, verify=False)

# Print a timestamp alongside each SSE line as it arrives.
for line in response.iter_lines():
    print(time.time(), line)

The problem is that the stream isn't smooth: the tokens arrive in chunks roughly every half-second.

Version: 1.85.1
Windows 10

@LostRuins
Owner

The tokens should arrive as they are generated. Looks fine to me.

@GlasslessPizza
Author

After spending a morning on it, I found the most smoothbrain workaround imaginable.
It turns out that requests' iter_lines has a chunk_size parameter with a default of 512. Setting it to something small like 10 solved the problem.
However, if I run the same script against a llama.cpp server's /completion endpoint, I get smooth streaming without touching chunk_size, which is why I initially thought this was a koboldcpp issue. I don't know why that is, though.
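For anyone who hits the same thing, here is a minimal sketch of the workaround (same request as before; the prompt is shortened here, and 10 is just an arbitrary small value, not a tuned one):

import requests, time

headers = {"Content-Type": "application/json"}
data = {
    "prompt": "Describe in-depth how scissors work.",  # full chat template omitted
    "temperature": 0,
    "max_context_length": 4096,
    "max_length": 2048,
}
response = requests.post("http://127.0.0.1:5001/api/extra/generate/stream",
                         headers=headers, json=data, stream=True)

# iter_lines reads the body in chunk_size-byte pieces; with the default of
# 512, a read can block until 512 bytes of short SSE events have piled up,
# which shows up as half-second bursts. A small chunk_size yields each
# event almost as soon as it arrives.
for line in response.iter_lines(chunk_size=10):
    print(time.time(), line)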
