[V1][Frontend] Improve Shutdown And Logs #11737
base: main
Conversation
Here is what the server logs look like:
...
INFO: 127.0.0.1:45354 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45360 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45368 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45372 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45388 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45394 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 01-04 17:21:02 core.py:247] RUNNING: 306 | WAITING: 628
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] WorkerProc hit an exception: %s
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] Traceback (most recent call last):
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 397, in worker_busy_loop
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] output = getattr(self.worker, method)(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return func(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/v1/worker/gpu_worker.py", line 204, in execute_model
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return func(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/v1/worker/gpu_model_runner.py", line 615, in execute_model
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] hidden_states = self.model(
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/model_executor/models/llama.py", line 571, in forward
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] raise RuntimeError("ERROR IN LLAMA!")
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] RuntimeError: ERROR IN LLAMA!
ERROR 01-04 17:21:04 core.py:200] EngineCore hit an exception: Traceback (most recent call last):
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 193, in run_engine_core
ERROR 01-04 17:21:04 core.py:200] engine_core.run_busy_loop()
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 231, in run_busy_loop
ERROR 01-04 17:21:04 core.py:200] outputs = self.step()
ERROR 01-04 17:21:04 core.py:200] ^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 124, in step
ERROR 01-04 17:21:04 core.py:200] output = self.model_executor.execute_model(scheduler_output)
ERROR 01-04 17:21:04 core.py:200] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 167, in execute_model
ERROR 01-04 17:21:04 core.py:200] model_output = self.collective_rpc("execute_model",
ERROR 01-04 17:21:04 core.py:200] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 161, in collective_rpc
ERROR 01-04 17:21:04 core.py:200] raise e
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 150, in collective_rpc
ERROR 01-04 17:21:04 core.py:200] raise result
ERROR 01-04 17:21:04 core.py:200] RuntimeError: ERROR IN LLAMA!
ERROR 01-04 17:21:04 core.py:200]
CRITICAL 01-04 17:21:04 async_llm.py:65] AsyncLLM got fatal signal from worker process, shutting down. See stack trace for root cause.
CRITICAL 01-04 17:21:05 launcher.py:91] Engine failed, terminating server.
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO:     Finished server process [1067793]
```diff
@@ -195,20 +209,12 @@ def signal_handler(signum, frame):
     signal.signal(signal.SIGTERM, signal_handler)
     signal.signal(signal.SIGINT, signal_handler)

     parent_process = psutil.Process().parent()
     engine_core = None
     engine_core = EngineCoreProc(*args, **kwargs)
```
Note for reviewer: `EngineCoreProc` sends a `FAILED` message over the `ready_pipe` if there is an issue during startup. This is why it is outside of the try ... catch block.
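A minimal sketch of the ordering described above; the names follow the PR, but the surrounding signatures and handler body are assumptions:

```python
def run_engine_core(*args, **kwargs):
    # EngineCoreProc.__init__ reports its own startup failures by sending
    # FAILED over ready_pipe, so construction sits OUTSIDE the generic
    # handler below, which only needs to cover failures after startup.
    engine_core = EngineCoreProc(*args, **kwargs)
    try:
        engine_core.run_busy_loop()
    except Exception:
        # Post-startup failures are surfaced to the parent process here.
        ...
```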
would be good to leave this in a comment in the code
Thanks for the improvement and writing these tests to verify this! I don't have issues with this PR as long as those tests pass.
I'll double-check the format of the error logs after the weekend.
just some test comments for now
```diff
@@ -187,6 +187,7 @@ steps:
   commands:
     # split the test to avoid interference
     - VLLM_USE_V1=1 pytest -v -s v1/core
+    - VLLM_USE_V1=1 pytest -v -s v1/shutdown
```
just sort of a side note, but it seems like updating these commands would be really easy to miss when adding new tests in a new directory.
```python
# Monkeypatch an error in the model.
monkeypatch.setattr(LlamaForCausalLM, "forward", evil_forward)
```
Since you created a `monkeypatch.context()`, did you mean to use `m.setattr(...)` here? From reading the docs, it sounds like both will result in the same behavior, though I would find `m.setattr(...)` a bit more clear that the change is limited to this context.
Suggested change:
```diff
 # Monkeypatch an error in the model.
-monkeypatch.setattr(LlamaForCausalLM, "forward", evil_forward)
+m.setattr(LlamaForCausalLM, "forward", evil_forward)
```
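For reference, a minimal self-contained sketch of the suggested pattern; `evil_forward` and the test body stand in for the test's own definitions:

```python
from vllm.model_executor.models.llama import LlamaForCausalLM

def evil_forward(self, *args, **kwargs):
    # Simulated model failure used to exercise the shutdown path.
    raise RuntimeError("ERROR IN LLAMA!")

def test_engine_crash_on_forward(monkeypatch):
    with monkeypatch.context() as m:
        # m.setattr makes it explicit that the patch only lives for the
        # duration of this `with` block; it is undone on exit.
        m.setattr(LlamaForCausalLM, "forward", evil_forward)
        ...  # run the engine and assert it shuts down cleanly
    # Here, LlamaForCausalLM.forward is restored.
```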
```python
# Confirm all the processes are cleaned up.
wait_for_gpu_memory_to_clear(
    devices=list(range(tensor_parallel_size)),
    threshold_bytes=2 * 2**30,
)
```
This number kind of looks like magic. It would be great to get it put in a constant with a comment explaining it somewhere.
We do have a `GiB_bytes` defined in `vllm.utils`; you can consider using that.
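Assuming `GiB_bytes` in `vllm.utils` is the usual `2**30` constant, the call site becomes self-documenting:

```python
from vllm.utils import GiB_bytes  # assumed to equal 2**30

# Consider the worker processes cleaned up once every device is back
# under 2 GiB of allocated GPU memory.
wait_for_gpu_memory_to_clear(
    devices=list(range(tensor_parallel_size)),
    threshold_bytes=2 * GiB_bytes,
)
```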
```python
MODELS = [
    "meta-llama/Llama-3.2-1B",  # Raises on first fwd pass.
    "mistralai/Mixtral-8x22B-Instruct-v0.1"  # Causes OOM.
]
```
Is this going to download the model and get a real OOM? From a quick look, it doesn't look like this is used elsewhere, so that'd be a net-new model to download during tests? If so, that doesn't seem worth the cost, especially given how unreliable HF has been in CI lately. Maybe I'm misunderstanding, though!
Good idea.
I do think that it is important to flex both cases here, since there is a subtle difference:
- `"meta-llama/Llama-3.2-1B"` (raises on first fwd pass): the failure happens during profiling, after the IPC mechanisms are created.
- `"mistralai/Mixtral-8x22B-Instruct-v0.1"` (causes OOM): the failure happens during weight loading, before the IPC mechanisms are set up.

I will instead do a monkeypatch to raise an error on `load_weights` for the second case.
monkeypatch sounds good if the error encountered is clear enough!
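A hedged sketch of what that `load_weights` monkeypatch could look like; the exact signature is an assumption:

```python
def evil_load_weights(self, *args, **kwargs):
    # Fails during weight loading, i.e. BEFORE the IPC mechanisms are
    # set up -- unlike the forward-pass failure, which happens during
    # profiling, after IPC is established.
    raise RuntimeError("ERROR IN LOAD WEIGHTS!")

with monkeypatch.context() as m:
    m.setattr(LlamaForCausalLM, "load_weights", evil_load_weights)
    ...  # start the engine and assert a clean shutdown
```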
```python
terminate_if_errored(
    server=server,
    engine=request.app.state.engine_client,
)

return Response(status_code=HTTPStatus.INTERNAL_SERVER_ERROR)
```
I just want to check an assumption here -- by responding with a 500 / Internal Server Error, we're saying that every case that reaches this code is an unexpected error from something being broken. It should never be something like the input from the user request was invalid, as that should get a different response.
Hopefully all of those cases are caught sooner, but I just wanted to check.
Correct.

- `EngineDeadError` is an unrecoverable error in `EngineCore` (the best example is an illegal memory access).
- `EngineGenerateError` is a recoverable error that is request specific (e.g. an invalid request).

Looking at this again, we should consider removing the `RuntimeError` from this for V1.
Thanks. Do you know where `RuntimeError` would come from? It does seem rather broad and generic compared to the other stuff we're handling here.
@russellb that was me: https://github.com/vllm-project/vllm/pull/6594/files

Back in the good old days of the V0 `AsyncLLMEngine`, the error handling was much less comprehensive and we could end up with `RuntimeError`s raised to the frontend. Often the engine was dead in this case, but there was no handler in the frontend to catch that and kill the server immediately.
I think the V0 engine is robust enough these days that we could consider removing the `RuntimeError` check entirely. I do like the safety of having a broader catch-and-log so we know what's going on when some error case slips through the cracks, though.
```python
except SystemExit:
    logger.debug("EngineCore interrupted.")
```
Is it intentional that you don't want to catch SystemExit anymore?
Originally when I wrote this, I thought that `SystemExit` would be caught by `Exception`, so I wanted different behavior. But in the intervening time I learned that the difference between `Exception` and `BaseException` is that `Exception` does not catch `SystemExit` or `asyncio.CancelledError`, so this is not needed here.
I'm happy to leave this in, though, if it makes it clearer that we have 2 cases.
I don't think I'm understanding clearly, sorry! You've pointed out that `SystemExit` does NOT derive from `Exception`, but the change seems to reflect the opposite, dropping it assuming `Exception` will cover it.
The proposed change makes sense if you intentionally do NOT want to catch and handle `SystemExit` here at all. Is that the intention?
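The class-hierarchy point is easy to verify: `SystemExit` derives from `BaseException`, not `Exception`, so a bare `except Exception` lets it propagate:

```python
try:
    raise SystemExit(0)
except Exception:
    # Never reached: SystemExit (like KeyboardInterrupt and
    # asyncio.CancelledError on Python 3.8+) subclasses BaseException
    # directly, so `except Exception` does not catch it.
    print("caught as Exception")
except SystemExit:
    print("caught as SystemExit")  # this branch runs
```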
```python
except Exception as e:
    raise self._format_exception(e) from None
```
You called this out in the PR description - I think it'd be worth filing an issue to help us remember to come back and look at zmq error handling. Depending on the error, our handling should differ. For example, a `ZMQError` exception with an `errno` of `EAGAIN` means we should try again as opposed to reporting an error.
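A sketch of the errno-sensitive handling the comment points at; the surrounding socket setup is assumed:

```python
import zmq

def recv_frame(socket: zmq.Socket):
    try:
        return socket.recv(flags=zmq.NOBLOCK)
    except zmq.ZMQError as e:
        if e.errno == zmq.EAGAIN:
            # Transient: no message available yet, so the caller should
            # retry rather than treat this as a dead engine.
            return None
        raise  # everything else really is an error
```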
```python
except SystemExit:
    logger.debug("Worker interrupted.")
```
are you intentionally not catching this anymore?
See the other comment about `SystemExit`.
```diff
@@ -129,7 +128,8 @@ def step(self) -> EngineCoreOutputs:
         return engine_core_outputs

     def shutdown(self):
-        self.model_executor.shutdown()
+        if self.model_executor:
+            model_executor.shutdown()
```
Suggested change:
```diff
-            model_executor.shutdown()
+            self.model_executor.shutdown()
```
"""Raised when the EngineCore dies. Unrecoverable.""" | ||
|
||
def __init__(self, *args, suppress_context: bool = False, **kwargs): | ||
ENGINE_DEAD_MESSAGE = "EngineCore encountered an issue. See stack trace for the root cause." # noqa: E501 |
Suggested change:
```diff
-ENGINE_DEAD_MESSAGE = "EngineCore encountered an issue. See stack trace for the root cause."  # noqa: E501
+ENGINE_DEAD_MESSAGE = "EngineCore encountered an issue. See stack trace (above) for the root cause."  # noqa: E501
```
I'm happy with the current error format. Only change I'd suggest is to explicitly tell users to look above the final error message for the full stack trace.
SUMMARY:
DESIGN:
- Wrap the `__init__` code with try...catch and push `FAILED` over the ready pipe. This works well since the parent processes are waiting for confirmation (see the parent-side sketch below).
- One weakness is that issues with the IPC mechanisms themselves are not handled explicitly.
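A minimal sketch of that handshake from the parent's side; the pipe API and message constants are illustrative assumptions, not the PR's actual code:

```python
import multiprocessing as mp

def launch_engine_core(*args, **kwargs):
    reader, writer = mp.Pipe(duplex=False)
    proc = mp.Process(target=run_engine_core, args=(writer, *args),
                      kwargs=kwargs)
    proc.start()
    # Blocks until the child reports exactly one of "READY" / "FAILED" --
    # which is why the child must push FAILED if __init__ raises, or the
    # parent would hang here forever.
    if reader.recv() == "FAILED":
        proc.join()
        raise RuntimeError("EngineCore failed during startup.")
    return proc
```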
TEST MATRIX:
Fixes: #12690