Upstream merge 25 01 20 #368

Merged
92 commits merged on Jan 20, 2025
Changes from all commits
289b519
[Doc] Fix build from source and installation link in README.md (#12013)
Yikun Jan 13, 2025
f35ec46
[Bugfix] Fix deepseekv3 gate bias error (#12002)
SunflowerAries Jan 13, 2025
1a40125
[Docs] Add Sky Computing Lab to project intro (#12019)
WoosukKwon Jan 14, 2025
078da31
[HPU][Bugfix] set_forward_context and CI test execution (#12014)
kzawora-intel Jan 14, 2025
8a1f938
[Doc] Update Quantization Hardware Support Documentation (#12025)
tjtanaa Jan 14, 2025
ff39141
[HPU][misc] add comments for explanation (#12034)
youkaichao Jan 14, 2025
bb354e6
[Bugfix] Fix various bugs in multi-modal processor (#12031)
DarkLight1337 Jan 14, 2025
1f18adb
[Kernel] Revert the API change of Attention.forward (#12038)
heheda12345 Jan 14, 2025
2e0e017
[Platform] Add output for Attention Backend (#11981)
wangxiyuan Jan 14, 2025
a2d2acb
[Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (#12040)
heheda12345 Jan 14, 2025
c9d6ff5
Explain where the engine args go when using Docker (#12041)
hmellor Jan 14, 2025
87054a5
[Doc]: Update the Json Example of the `Engine Arguments` document (#1…
maang-h Jan 14, 2025
a3a3ee4
[Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_…
jeejeelee Jan 14, 2025
42f5e7c
[Kernel] Support MulAndSilu (#11624)
jeejeelee Jan 15, 2025
1a51b9f
[HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in se…
kzawora-intel Jan 15, 2025
9ddac56
[Platform] move current_memory_usage() into platform (#11369)
shen-shanshan Jan 15, 2025
b7ee940
[V1][BugFix] Fix edge case in VLM scheduling (#12065)
WoosukKwon Jan 15, 2025
0794e74
[Misc] Add multipstep chunked-prefill support for FlashInfer (#10467)
elfiegg Jan 15, 2025
f218f9c
[core] Turn off GPU communication overlap for Ray executor (#12051)
ruisearch42 Jan 15, 2025
ad34c0d
[core] platform agnostic executor via collective_rpc (#11256)
youkaichao Jan 15, 2025
3f9b7ab
[Doc] Update examples to remove SparseAutoModelForCausalLM (#12062)
kylesayrs Jan 15, 2025
994fc65
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCache…
heheda12345 Jan 15, 2025
cbe9439
Fix: cases with empty sparsity config (#12057)
rahul-tuli Jan 15, 2025
ad388d2
Type-fix: make execute_model output type optional (#12020)
youngkent Jan 15, 2025
3adf0ff
[Platform] Do not raise error if _Backend is not found (#12023)
wangxiyuan Jan 15, 2025
97eb97b
[Model]: Support internlm3 (#12037)
RunningLeon Jan 15, 2025
5ecf3e0
Misc: allow to use proxy in `HTTPConnection` (#12042)
zhouyuan Jan 15, 2025
de0526f
[Misc][Quark] Upstream Quark format to VLLM (#10765)
kewang-xlnx Jan 15, 2025
57e729e
[Doc]: Update `OpenAI-Compatible Server` documents (#12082)
maang-h Jan 15, 2025
edce722
[Bugfix] use right truncation for non-generative tasks (#12050)
joerunde Jan 15, 2025
70755e8
[V1][Core] Autotune encoder cache budget (#11895)
ywang96 Jan 15, 2025
ebd8c66
[Bugfix] Fix _get_lora_device for HQQ marlin (#12090)
varun-sundar-rabindranath Jan 15, 2025
cd9d06f
Allow hip sources to be directly included when compiling for rocm. (#…
tvirolai-amd Jan 15, 2025
fa0050d
[Core] Default to using per_token quantization for fp8 when cutlass i…
elfiegg Jan 16, 2025
f8ef146
[Doc] Add documentation for specifying model architecture (#12105)
DarkLight1337 Jan 16, 2025
9aa1519
Various cosmetic/comment fixes (#12089)
mgoin Jan 16, 2025
dd7c9ad
[Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12…
Isotr0py Jan 16, 2025
bf53e0c
Support torchrun and SPMD-style offline inference (#12071)
youkaichao Jan 16, 2025
92e793d
[core] LLM.collective_rpc interface and RLHF example (#12084)
youkaichao Jan 16, 2025
874f7c2
[Bugfix] Fix max image feature size for Llava-one-vision (#12104)
ywang96 Jan 16, 2025
5fd24ec
[misc] Add LoRA kernel micro benchmarks (#11579)
varun-sundar-rabindranath Jan 16, 2025
62b06ba
[Model] Add support for deepseek-vl2-tiny model (#12068)
Isotr0py Jan 16, 2025
d06e824
[Bugfix] Set enforce_eager automatically for mllama (#12127)
heheda12345 Jan 16, 2025
ebc73f2
[Bugfix] Fix a path bug in disaggregated prefill example script. (#12…
KuntaiDu Jan 17, 2025
fead53b
[CI]add genai-perf benchmark in nightly benchmark (#10704)
jikunshang Jan 17, 2025
1475847
[Doc] Add instructions on using Podman when SELinux is active (#12136)
terrytangyuan Jan 17, 2025
b8bfa46
[Bugfix] Fix issues in CPU build Dockerfile (#12135)
terrytangyuan Jan 17, 2025
d1adb9b
[BugFix] add more `is not None` check in VllmConfig.__post_init__ (#1…
heheda12345 Jan 17, 2025
d75ab55
[Misc] Add deepseek_vl2 chat template (#12143)
Isotr0py Jan 17, 2025
8027a72
[ROCm][MoE] moe tuning support for rocm (#12049)
divakar-amd Jan 17, 2025
69d765f
[V1] Move more control of kv cache initialization from model_executor…
heheda12345 Jan 17, 2025
07934cc
[Misc][LoRA] Improve the readability of LoRA error messages (#12102)
jeejeelee Jan 17, 2025
d4e6194
[CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
bigPYJ1151 Jan 17, 2025
87a0c07
[core] allow callable in collective_rpc (#12151)
youkaichao Jan 17, 2025
58fd57f
[Bugfix] Fix score api for missing max_model_len validation (#12119)
wallashss Jan 17, 2025
54cacf0
[Bugfix] Mistral tokenizer encode accept list of str (#12149)
jikunshang Jan 17, 2025
b5b57e3
[AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134)
gshtras Jan 17, 2025
7b98a65
[torch.compile] disable logging when cache is disabled (#12043)
youkaichao Jan 17, 2025
2b83503
[misc] fix cross-node TP (#12166)
youkaichao Jan 18, 2025
c09503d
[AMD][CI/Build][Bugfix] use pytorch stale wheel (#12172)
hongxiayang Jan 18, 2025
da02cb4
[core] further polish memory profiling (#12126)
youkaichao Jan 18, 2025
813f249
[Docs] Fix broken link in SECURITY.md (#12175)
russellb Jan 18, 2025
02798ec
[Model] Port deepseek-vl2 processor, remove dependency (#12169)
Isotr0py Jan 18, 2025
6d0e3d3
[core] clean up executor class hierarchy between v1 and v0 (#12171)
youkaichao Jan 18, 2025
32eb0da
[Misc] Support register quantization method out-of-tree (#11969)
ice-tong Jan 19, 2025
7a8a48d
[V1] Collect env var for usage stats (#12115)
simon-mo Jan 19, 2025
4e94951
[BUGFIX] Move scores to float32 in case of running xgrammar on cpu (#…
madamczykhabana Jan 19, 2025
630eb5b
[Bugfix] Fix multi-modal processors for transformers 4.48 (#12187)
DarkLight1337 Jan 19, 2025
e66faf4
[torch.compile] store inductor compiled Python file (#12182)
youkaichao Jan 19, 2025
936db11
benchmark_serving support --served-model-name param (#12109)
gujingit Jan 19, 2025
edaae19
[Misc] Add BNB support to GLM4-V model (#12184)
Isotr0py Jan 19, 2025
81763c5
[V1] Add V1 support of Qwen2-VL (#12128)
ywang96 Jan 19, 2025
bbe5f9d
[Model] Support for fairseq2 Llama (#11442)
MartinGleize Jan 19, 2025
df450aa
[Bugfix] Fix num_heads value for simple connector when tp enabled (#1…
ShangmingCai Jan 20, 2025
51ef828
[torch.compile] fix sym_tensor_indices (#12191)
youkaichao Jan 20, 2025
3ea7b94
Move linting to `pre-commit` (#11975)
hmellor Jan 20, 2025
c5c0620
[DOC] Fix typo in docstring and assert message (#12194)
terrytangyuan Jan 20, 2025
d264312
[DOC] Add missing docstring in LLMEngine.add_request() (#12195)
terrytangyuan Jan 20, 2025
0974c9b
[Bugfix] Fix incorrect types in LayerwiseProfileResults (#12196)
terrytangyuan Jan 20, 2025
8360979
[Model] Add Qwen2 PRM model support (#12202)
Isotr0py Jan 20, 2025
59a0192
[Core] Interface for accessing model from `VllmRunner` (#10353)
DarkLight1337 Jan 20, 2025
5c89a29
[misc] add placeholder format.sh (#12206)
youkaichao Jan 20, 2025
4001ea1
[CI/Build] Remove dummy CI steps (#12208)
DarkLight1337 Jan 20, 2025
3127e97
[CI/Build] Make pre-commit faster (#12212)
DarkLight1337 Jan 20, 2025
b37d827
[Model] Upgrade Aria to transformers 4.48 (#12203)
DarkLight1337 Jan 20, 2025
170eb35
[misc] print a message to suggest how to bypass commit hooks (#12217)
youkaichao Jan 20, 2025
c222f47
[core][bugfix] configure env var during import vllm (#12209)
youkaichao Jan 20, 2025
5f0ec39
[V1] Remove `_get_cache_block_size` (#12214)
heheda12345 Jan 20, 2025
86bfb6d
[Misc] Pass `attention` to impl backend (#12218)
wangxiyuan Jan 20, 2025
18572e3
[Bugfix] Fix `HfExampleModels.find_hf_info` (#12223)
DarkLight1337 Jan 20, 2025
9666369
[CI] Pass local python version explicitly to pre-commit mypy.sh (#12224)
heheda12345 Jan 20, 2025
031e6eb
Merge remote-tracking branch 'upstream/main'
gshtras Jan 20, 2025
@@ -43,7 +43,7 @@ main() {



# The figures should be genereated by a separate process outside the CI/CD pipeline
# The figures should be generated by a separate process outside the CI/CD pipeline

# # generate figures
# python3 -m pip install tabulate pandas matplotlib
107 changes: 107 additions & 0 deletions .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
@@ -301,6 +301,104 @@ run_serving_tests() {
kill_gpu_processes
}

run_genai_perf_tests() {
# run genai-perf tests

# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
genai_perf_test_file=$1

# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')

# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi

# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}

# get common parameters
common_params=$(echo "$params" | jq -r '.common_parameters')
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
reuse_server=$(echo "$common_params" | jq -r '.reuse_server')

# get client and server arguments
server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
qps_list=$(echo "$params" | jq -r '.qps_list')
qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
echo "Running over qps list $qps_list"

# check if there is enough GPU to run the test
if [[ $gpu_count -lt $tp ]]; then
echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue
fi

if [[ $reuse_server == "true" ]]; then
echo "Reuse previous server for test case $test_name"
else
kill_gpu_processes
bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
"$server_params" "$common_params"
fi

if wait_for_server; then
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
else
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
break
fi

# iterate over different QPS
for qps in $qps_list; do
# remove the surrounding single quote from qps
if [[ "$qps" == *"inf"* ]]; then
echo "qps was $qps"
qps=$num_prompts
echo "now qps is $qps"
fi

new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE

if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
#TODO: add output dir.
client_command="genai-perf profile \
-m $model \
--service-kind openai \
--backend vllm \
--endpoint-type chat \
--streaming \
--url localhost:$port \
--request-rate $qps \
--num-prompts $num_prompts \
"

echo "Client command: $client_command"

eval "$client_command"

#TODO: process/record outputs
done
done

kill_gpu_processes

}

prepare_dataset() {

@@ -328,12 +426,17 @@ main() {

pip install -U transformers

pip install -r requirements-dev.txt
which genai-perf

# check storage
df -h

ensure_installed wget
ensure_installed curl
ensure_installed jq
# genai-perf dependency
ensure_installed libb64-0d

prepare_dataset

@@ -345,6 +448,10 @@ main() {
# run the test
run_serving_tests "$BENCHMARK_ROOT/tests/nightly-tests.json"

# run genai-perf tests
run_genai_perf_tests "$BENCHMARK_ROOT/tests/genai-perf-tests.json"
mv artifacts/ $RESULTS_FOLDER/

# upload benchmark results to buildkite
python3 -m pip install tabulate pandas
python3 "$BENCHMARK_ROOT/scripts/summary-nightly-results.py"
23 changes: 23 additions & 0 deletions .buildkite/nightly-benchmarks/tests/genai-perf-tests.json
@@ -0,0 +1,23 @@
[
{
"test_name": "llama8B_tp1_genai_perf",
"qps_list": [4,8,16,32],
"common_parameters": {
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"tp": 1,
"port": 8000,
"num_prompts": 500,
"reuse_server": false
},
"vllm_server_parameters": {
"disable_log_stats": "",
"disable_log_requests": "",
"gpu_memory_utilization": 0.9,
"num_scheduler_steps": 10,
"max_num_seqs": 512,
"dtype": "bfloat16"
},
"genai_perf_input_parameters": {
}
}
]
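
For context, a minimal sketch of how a test definition like the one above gets picked up. The variables below are the ones read by run_genai_perf_tests() and its helpers in the script earlier in this diff; invoking the benchmark outside the Buildkite environment is an assumption, and additional CI-provided variables (e.g. BENCHMARK_ROOT, RESULTS_FOLDER) may also be required.

# Hypothetical local invocation of the nightly genai-perf benchmarks (sketch only).
export CURRENT_LLM_SERVING_ENGINE=vllm        # prepended to each test_name by the script
export TEST_SELECTOR="llama8B_tp1"            # optional regex filter; unset to run all test cases
export VLLM_SOURCE_CODE_LOC="$(pwd)"          # used by the script to locate launch-server.sh

bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh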
4 changes: 2 additions & 2 deletions .buildkite/run-cpu-test.sh
@@ -83,6 +83,6 @@ function cpu_tests() {
tests/lora/test_qwen2vl.py"
}

# All of CPU tests are expected to be finished less than 25 mins.
# All of CPU tests are expected to be finished less than 40 mins.
export -f cpu_tests
timeout 30m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
timeout 40m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
12 changes: 10 additions & 2 deletions .buildkite/run-hpu-test.sh
@@ -8,9 +8,17 @@ set -ex
docker build -t hpu-test-env -f Dockerfile.hpu .

# Setup cleanup
# certain versions of HPU software stack have a bug that can
# override the exit code of the script, so we need to use
# separate remove_docker_container and remove_docker_container_and_exit
# functions, while other platforms only need one remove_docker_container
# function.
EXITCODE=1
remove_docker_container() { docker rm -f hpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container_and_exit() { remove_docker_container; exit $EXITCODE; }
trap remove_docker_container_and_exit EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
EXITCODE=$?
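
As an aside, the exit-code workaround described in the comment above reduces to the following generic pattern; the container, image, and script names here are placeholders for illustration, not values from this PR.

# Minimal sketch of the exit-code-preserving cleanup pattern (placeholder names).
EXITCODE=1
cleanup() { docker rm -f my-container || true; }
cleanup_and_exit() { cleanup; exit $EXITCODE; }
trap cleanup_and_exit EXIT
cleanup

docker run --name=my-container my-image python3 my_script.py
EXITCODE=$?   # captured before the EXIT trap fires, so the trap can re-raise the real status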
10 changes: 8 additions & 2 deletions .buildkite/test-pipeline.yaml
@@ -52,7 +52,6 @@ steps:
- tests/worker
- tests/standalone_tests/lazy_torch_compile.py
commands:
- pip install git+https://github.com/Isotr0py/DeepSeek-VL2.git # Used by multimoda processing test
- python3 standalone_tests/lazy_torch_compile.py
- pytest -v -s mq_llm_engine # MQLLMEngine
- pytest -v -s async_engine # AsyncLLMEngine
@@ -107,7 +106,7 @@ steps:
source_file_dependencies:
- vllm/
commands:
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
- pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
@@ -126,11 +125,15 @@ steps:
- tests/distributed
- tests/spec_decode/e2e/test_integration_dist_tp4
- tests/compile
- examples/offline_inference/rlhf.py
commands:
- pytest -v -s distributed/test_utils.py
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
# TODO: create a dedicated test section for multi-GPU example tests
# when we have multiple distributed example tests
- python3 ../examples/offline_inference/rlhf.py

- label: Metrics, Tracing Test # 10min
num_gpus: 2
@@ -462,7 +465,10 @@ steps:
- vllm/worker/worker_base.py
- vllm/worker/worker.py
- vllm/worker/model_runner.py
- entrypoints/llm/test_collective_rpc.py
commands:
- pytest -v -s entrypoints/llm/test_collective_rpc.py
- torchrun --nproc-per-node=2 distributed/test_torchrun_example.py
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
41 changes: 0 additions & 41 deletions .github/workflows/actionlint.yml

This file was deleted.

55 changes: 0 additions & 55 deletions .github/workflows/clang-format.yml

This file was deleted.

47 changes: 0 additions & 47 deletions .github/workflows/codespell.yml

This file was deleted.
