Merge remote-tracking branch 'origin/main' into granitemoe
njhill committed Sep 26, 2024
2 parents 0e0da7f + 4b377d6 commit 280b22b
Showing 22 changed files with 247 additions and 59 deletions.
5 changes: 3 additions & 2 deletions .buildkite/test-pipeline.yaml
@@ -95,7 +95,8 @@ steps:
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process
- pytest -v -s entrypoints/openai
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_oot_registration.py
- pytest -v -s entrypoints/openai/test_oot_registration.py # it needs a clean process
- pytest -v -s entrypoints/test_chat_utils.py
- pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests

@@ -459,7 +460,7 @@ steps:
# NOTE: don't test llama model here, it seems hf implementation is buggy
# see https://github.com/vllm-project/vllm/pull/5689 for details
- pytest -v -s distributed/test_custom_all_reduce.py
- TARGET_TEST_SUITE=A100 pytest -v -s distributed/test_basic_distributed_correctness.py
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m distributed_2_gpus
- pytest -v -s -x lora/test_mixtral.py

- label: LM Eval Large Models # optional
2 changes: 1 addition & 1 deletion .github/workflows/scripts/build.sh
@@ -8,7 +8,7 @@ PATH=${cuda_home}/bin:$PATH
LD_LIBRARY_PATH=${cuda_home}/lib64:$LD_LIBRARY_PATH

# Install requirements
$python_executable -m pip install wheel packaging
$python_executable -m pip install wheel packaging 'setuptools-scm>=8'
$python_executable -m pip install -r requirements-cuda.txt

# Limit the number of parallel jobs to avoid OOM
9 changes: 9 additions & 0 deletions Dockerfile
@@ -27,6 +27,14 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
&& curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION} \
&& python3 --version && python3 -m pip --version

# Upgrade to GCC 10 to avoid https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92519
# as it was causing spam when compiling the CUTLASS kernels
RUN apt-get install -y gcc-10 g++-10
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 110 --slave /usr/bin/g++ g++ /usr/bin/g++-10
RUN <<EOF
gcc --version
EOF

# Workaround for https://github.com/openai/triton/issues/2507 and
# https://github.com/pytorch/pytorch/issues/107960 -- hopefully
# this won't be needed for future versions of this docker image
@@ -67,6 +75,7 @@ COPY csrc csrc
COPY setup.py setup.py
COPY cmake cmake
COPY CMakeLists.txt CMakeLists.txt
COPY README.md README.md
COPY requirements-common.txt requirements-common.txt
COPY requirements-cuda.txt requirements-cuda.txt
COPY pyproject.toml pyproject.toml
4 changes: 2 additions & 2 deletions csrc/prepare_inputs/advance_step.cu
@@ -211,7 +211,7 @@ void advance_step_flashinfer(
printf(" num_seqs = %d\n", num_seqs);
printf(" num_queries = %d\n", num_queries);
printf(" block_size = %d\n", block_size);
printf(" block_tables.stride(0) = %d\n", block_tables.stride(0));
printf(" block_tables.stride(0) = %zu\n", block_tables.stride(0));
}
// Verify all tensors
verify_tensor("input_tokens", input_tokens, num_seqs, -1, at::kLong);
@@ -303,4 +303,4 @@ void advance_step_flashinfer(
num_seqs, num_queries, block_size, input_tokens, sampled_token_ids,
input_positions, seq_lens, slot_mapping, block_tables, paged_kv_indices,
paged_kv_indptr, paged_kv_last_page_len, block_table_bound);
}
}
34 changes: 31 additions & 3 deletions docs/source/getting_started/installation.rst
@@ -58,13 +58,41 @@ You can install vLLM using pip:
$ # export VLLM_COMMIT=...
$ # pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_COMMIT}/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
Build from source (without compilation)
---------------------------------------

If you want to develop vLLM, and you only need to change the Python code, you can build vLLM without compilation.

The first step is to follow the previous instructions to install the latest vLLM wheel:

.. code-block:: console

    $ export VLLM_VERSION=0.6.1.post1
    $ pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl

After verifying that the installation is successful, we have a script for you to copy and link directories, so that you can edit the Python code directly:

.. code-block:: console

    $ git clone https://github.com/vllm-project/vllm.git
    $ cd vllm
    $ python python_only_dev.py

It will:

- Find the installed vLLM in the current environment.
- Copy built files to the current directory.
- Rename the installed vLLM
- Symbolically link the current directory to the installed vLLM.

This way, you can edit the Python code in the current directory, and the changes will be reflected in the installed vLLM.

.. _build_from_source:

Build from source
-----------------
Build from source (with compilation)
------------------------------------

You can also build and install vLLM from source:
If you need to touch the C++ or CUDA code, you need to build vLLM from source:

.. code-block:: console
54 changes: 54 additions & 0 deletions python_only_dev.py
@@ -0,0 +1,54 @@
# enable python only development
# copy compiled files to the current directory directly

import os
import shutil
import subprocess
import sys

# cannot directly `import vllm` , because it will try to
# import from the current directory
output = subprocess.run([sys.executable, "-m", "pip", "show", "vllm"],
                        capture_output=True)

assert output.returncode == 0, "vllm is not installed"

text = output.stdout.decode("utf-8")

package_path = None
for line in text.split("\n"):
    if line.startswith("Location: "):
        package_path = line.split(": ")[1]
        break

assert package_path is not None, "could not find package path"

cwd = os.getcwd()

assert cwd != package_path, "should not import from the current directory"

files_to_copy = [
    "vllm/_C.abi3.so",
    "vllm/_core_C.abi3.so",
    "vllm/_moe_C.abi3.so",
    "vllm/vllm_flash_attn/vllm_flash_attn_c.abi3.so",
    "vllm/vllm_flash_attn/flash_attn_interface.py",
    "vllm/vllm_flash_attn/__init__.py",
    # "vllm/_version.py", # not available in nightly wheels yet
]

for file in files_to_copy:
    src = os.path.join(package_path, file)
    dst = file
    print(f"Copying {src} to {dst}")
    shutil.copyfile(src, dst)

pre_built_vllm_path = os.path.join(package_path, "vllm")
tmp_path = os.path.join(package_path, "vllm_pre_built")
current_vllm_path = os.path.join(cwd, "vllm")

print(f"Renaming {pre_built_vllm_path} to {tmp_path}")
os.rename(pre_built_vllm_path, tmp_path)

print(f"linking {current_vllm_path} to {pre_built_vllm_path}")
os.symlink(current_vllm_path, pre_built_vllm_path)
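
A quick sanity check after running the script (a minimal sketch, not part of the diff above; it assumes python_only_dev.py finished without errors) is to confirm that importing vllm now resolves, through the new symlink, to the git checkout rather than to the pre-built package:

import os

import vllm  # resolved via the symlink that python_only_dev.py creates in site-packages

# __file__ may still show the site-packages path; resolving the symlink
# should land inside the repository checkout either way.
print(vllm.__file__)
print(os.path.realpath(vllm.__file__))

If the second path points into the cloned vllm/ directory, edits to the Python sources take effect on the next import.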
1 change: 0 additions & 1 deletion tests/conftest.py
@@ -699,7 +699,6 @@ def generate_w_logprobs(
if videos is not None:
for i, video in enumerate(videos):
inputs[i]["multi_modal_data"] = {"video": video}
print(f"[INPUTS!!!!]: {inputs}, {sampling_params}")

req_outputs = self.model.generate(inputs,
sampling_params=sampling_params)
7 changes: 0 additions & 7 deletions tests/distributed/test_pipeline_parallel.py
@@ -8,8 +8,6 @@
import os

import pytest
from packaging import version
from transformers import __version__ as transformers_version

from vllm.logger import init_logger

@@ -49,11 +47,6 @@ def test_compare_tp(TP_SIZE, PP_SIZE, EAGER_MODE, CHUNKED_PREFILL,
pytest.skip("Skipping multi-node pipeline parallel test for "
"multiprocessing distributed backend")

# Skip tests that require transformers>=4.45.0
if "Qwen2-VL" in MODEL_NAME and version.parse(
transformers_version) < version.parse("4.45.0.dev0"):
pytest.skip("This test requires transformers>=4.45.0")

pp_args = [
# use half precision for speed and memory savings in CI environment
"--dtype",
8 changes: 4 additions & 4 deletions tests/engine/test_custom_executor.py
@@ -48,9 +48,9 @@ def test_custom_executor_type_checking(model):


@pytest.mark.parametrize("model", ["facebook/opt-125m"])
def test_custom_executor(model, tmpdir):
def test_custom_executor(model, tmp_path):
cwd = os.path.abspath(".")
os.chdir(tmpdir)
os.chdir(tmp_path)
try:
assert not os.path.exists(".marker")

@@ -68,9 +68,9 @@ def test_custom_executor(model, tmpdir):


@pytest.mark.parametrize("model", ["facebook/opt-125m"])
def test_custom_executor_async(model, tmpdir):
def test_custom_executor_async(model, tmp_path):
cwd = os.path.abspath(".")
os.chdir(tmpdir)
os.chdir(tmp_path)
try:
assert not os.path.exists(".marker")

6 changes: 6 additions & 0 deletions tests/entrypoints/openai/test_serving_chat.py
@@ -15,6 +15,11 @@
BASE_MODEL_PATHS = [BaseModelPath(name=MODEL_NAME, model_path=MODEL_NAME)]


@dataclass
class MockHFConfig:
model_type: str = "any"


@dataclass
class MockModelConfig:
tokenizer = MODEL_NAME
@@ -24,6 +29,7 @@ class MockModelConfig:
tokenizer_revision = None
embedding_mode = False
multimodal_config = MultiModalConfig()
hf_config = MockHFConfig()


@dataclass
4 changes: 2 additions & 2 deletions tests/lora/test_tokenizer_group.py
@@ -41,7 +41,7 @@ async def test_tokenizer_group_lora(sql_lora_files, tokenizer_group_type):
lora_request)


def test_get_lora_tokenizer(sql_lora_files, tmpdir):
def test_get_lora_tokenizer(sql_lora_files, tmp_path):
lora_request = None
tokenizer = get_lora_tokenizer(lora_request)
assert not tokenizer
@@ -50,6 +50,6 @@ def test_get_lora_tokenizer(sql_lora_files, tmpdir):
tokenizer = get_lora_tokenizer(lora_request)
assert tokenizer.get_added_vocab()

lora_request = LoRARequest("1", 1, str(tmpdir))
lora_request = LoRARequest("1", 1, str(tmp_path))
tokenizer = get_lora_tokenizer(lora_request)
assert not tokenizer
@@ -1,7 +1,6 @@
from typing import List, Optional, Tuple, Type, overload

import pytest
import transformers
from transformers import AutoConfig, AutoModelForVision2Seq, AutoTokenizer

from vllm.multimodal.utils import (rescale_video_size, resize_video,
@@ -158,8 +157,6 @@ def run_test(
)


@pytest.mark.skipif(transformers.__version__ < "4.45",
reason="Waiting for next transformers release")
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
"size_factors",
@@ -203,8 +200,6 @@ def test_models(hf_runner, vllm_runner, video_assets, model, size_factors,
)


@pytest.mark.skipif(transformers.__version__ < "4.45",
reason="Waiting for next transformers release")
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
"sizes",
@@ -1,7 +1,6 @@
from typing import List, Optional, Tuple, Type, overload

import pytest
import transformers
from transformers import (AutoConfig, AutoModelForVision2Seq, AutoTokenizer,
BatchEncoding)

@@ -166,8 +165,6 @@ def process(hf_inputs: BatchEncoding):
)


@pytest.mark.skipif(transformers.__version__ < "4.45",
reason="Waiting for next transformers release")
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
"size_factors",
@@ -211,8 +208,6 @@ def test_models(hf_runner, vllm_runner, video_assets, model, size_factors,
)


@pytest.mark.skipif(transformers.__version__ < "4.45",
reason="Waiting for next transformers release")
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
"sizes",
@@ -259,7 +254,9 @@ def run_image_test(
# max_model_len should be greater than image_feature_size
with vllm_runner(model,
dtype=dtype,
max_model_len=32768,
max_num_seqs=1,
max_model_len=16384,
gpu_memory_utilization=0.98,
tensor_parallel_size=tensor_parallel_size,
distributed_executor_backend=distributed_executor_backend,
enforce_eager=True,
@@ -305,8 +302,8 @@ def process(hf_inputs: BatchEncoding):
)


@pytest.mark.skipif(transformers.__version__ < "4.45",
reason="Waiting for next transformers release")
# FIXME: Swap to a smaller model for this architecture
@pytest.mark.skip(reason="Model OOMing on CI")
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [128])
6 changes: 0 additions & 6 deletions tests/models/test_registry.py
@@ -1,15 +1,9 @@
import pytest
import transformers

from vllm.model_executor.models import _MODELS, ModelRegistry


@pytest.mark.parametrize("model_cls", _MODELS)
def test_registry_imports(model_cls):
if (model_cls in ("LlavaOnevisionForConditionalGeneration",
"Qwen2VLForConditionalGeneration")
and transformers.__version__ < "4.45"):
pytest.skip("Waiting for next transformers release")

# Ensure all model classes can be imported successfully
ModelRegistry.resolve_model_cls([model_cls])
18 changes: 15 additions & 3 deletions tests/samplers/test_sampler.py
@@ -1,5 +1,6 @@
import itertools
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
from unittest.mock import Mock, patch

@@ -596,8 +597,19 @@ def test_sampler_top_k_top_p(seed: int, device: str):
generation_config = GenerationConfig(top_k=top_k,
top_p=top_p,
do_sample=True)
warpers = generation_model._get_logits_warper(generation_config, device)
assert len(warpers) == 2 # top_p and top_k

@dataclass
class MockConfig:
is_encoder_decoder: bool = False

generation_model.config = MockConfig() # needed by the following method
generation_model._prepare_special_tokens(generation_config, device=device)
processors = generation_model._get_logits_processor(generation_config,
None,
None,
None, [],
device=device)
assert len(processors) == 2 # top_p and top_k

seq_group_metadata_list: List[SequenceGroupMetadata] = []
seq_lens: List[int] = []
@@ -639,7 +651,7 @@ def mock_sample(probs, *args, **kwargs):

assert sample_probs is not None

hf_probs = warpers(torch.zeros_like(fake_logits), fake_logits.clone())
hf_probs = processors(torch.zeros_like(fake_logits), fake_logits.clone())
hf_probs = torch.softmax(hf_probs, dim=-1, dtype=torch.float)
torch.testing.assert_close(hf_probs, sample_probs, rtol=0.0, atol=1e-5)
assert torch.equal(hf_probs.eq(0), sample_probs.eq(0))
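
For context on the behavior this test pins down, the following is a small plain-PyTorch sketch of top-k/top-p filtering (an illustration only, not vLLM's or transformers' implementation; the helper name apply_top_k_top_p is made up):

import torch


def apply_top_k_top_p(logits: torch.Tensor, top_k: int, top_p: float) -> torch.Tensor:
    # Top-k: mask everything below the k-th largest logit in each row.
    if top_k > 0:
        kth_largest = torch.topk(logits, top_k, dim=-1).values[..., -1:]
        logits = logits.masked_fill(logits < kth_largest, float("-inf"))
    # Top-p (nucleus): keep the smallest prefix of the sorted distribution whose
    # cumulative probability reaches top_p; the most likely token is always kept.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        drop_sorted = cum_probs > top_p
        drop_sorted[..., 1:] = drop_sorted[..., :-1].clone()
        drop_sorted[..., 0] = False
        drop = torch.zeros_like(drop_sorted).scatter(-1, sorted_idx, drop_sorted)
        logits = logits.masked_fill(drop, float("-inf"))
    return logits


# Example: one row of fake logits filtered with top_k=2 and top_p=0.9.
print(apply_top_k_top_p(torch.tensor([[2.0, 1.0, 0.5, -1.0]]), top_k=2, top_p=0.9))

The updated test builds the equivalent processors through transformers' _get_logits_processor and asserts that vLLM's sampler zeroes out exactly the same token positions.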