Long Context usage README update (#836)

---------

Signed-off-by: Iryna Boiko <[email protected]>
iboiko-habana authored Mar 4, 2025
1 parent acf2cd5 commit f78aeb9

1 changed file: README_GAUDI.md (57 additions, 0 deletions)

> - `VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
> - `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.
# Long Context Support

The long context feature enables support for token context windows exceeding 32K tokens. It is supported by the following models:
- [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
- [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b)
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
- [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)

## Environment Variables Settings

Set the following environment variables to avoid OOM or functional issues. Additional environment variable settings depend on the context length (example export commands are shown after this list):

- `VLLM_ENGINE_ITERATION_TIMEOUT_S=3600`
- `VLLM_RPC_TIMEOUT=100000`
- `VLLM_PROMPT_USE_FUSEDSDPA=1`
- `PT_HPU_ENABLE_LAZY_COLLECTIVES=true`
- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1`
- `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`
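
A minimal sketch of applying these settings, exporting each variable with the value listed above before starting the server:

```bash
# Baseline settings for long-context serving (values from the list above).
export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
export VLLM_RPC_TIMEOUT=100000
export VLLM_PROMPT_USE_FUSEDSDPA=1
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
```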

**Example flags for a 32K context length** (a combined launch sketch follows this list):

- `VLLM_GRAPH_RESERVED_MEM` - The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
- `VLLM_PROMPT_BS_BUCKET_MIN=1` - Suggested value; depends on the model. You can increase it as long as no OOM error occurs, or decrease it if OOM occurs.
- `VLLM_PROMPT_BS_BUCKET_STEP=16` - Suggested value; depends on the model. Increasing the step value results in fewer buckets. Increase this value if an OOM error occurs.
- `VLLM_PROMPT_BS_BUCKET_MAX=16` - Suggested value; depends on the model. You can increase it as long as no OOM error occurs, or decrease it if OOM occurs.
- `VLLM_PROMPT_SEQ_BUCKET_MIN=24576` - Suggested value, depends on warmup results.
- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048` - Suggested value, depends on warmup results. Increasing it is recommended for faster warmup; for example, `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` is the suggested value for Intel Gaudi 3.
- `VLLM_PROMPT_SEQ_BUCKET_MAX=32768` - Value for context length of 32K. Use 16384 for 16K.
- `VLLM_DECODE_BLOCK_BUCKET_MIN=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_STEP=1024` - Suggested value, depends on warmup results.
- `VLLM_DECODE_BLOCK_BUCKET_MAX=33792` - Calculated as `max_num_seqs * max_decode_seq // block_size`, where `max_decode_seq` represents the sum of the input and output sequence lengths. For example:
  - `128 * ((32 + 1) * 1024) / 128 = 33792` with `max_num_seqs=128`
  - `32 * ((32 + 1) * 1024) / 128 = 8448` with `max_num_seqs=32`
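
As a rough sketch (not a definitive configuration), the 32K flags above might be combined into a single launch for Meta-Llama-3.1-8B-Instruct. The `vllm serve` arguments below (`--max-model-len`, `--max-num-seqs`, `--block-size`) are assumptions chosen to match the bucket calculations above and should be tuned per deployment:

```bash
# Illustrative 32K-context launch for Meta-Llama-3.1-8B-Instruct (assumed example).
export VLLM_GRAPH_RESERVED_MEM=0.02       # use 0.1 for Llama-3.1-70B
export VLLM_PROMPT_BS_BUCKET_MIN=1
export VLLM_PROMPT_BS_BUCKET_STEP=16
export VLLM_PROMPT_BS_BUCKET_MAX=16
export VLLM_PROMPT_SEQ_BUCKET_MIN=24576
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048   # 16384 is suggested for Intel Gaudi 3
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024
export VLLM_DECODE_BLOCK_BUCKET_MAX=33792

# max_num_seqs=128 and block_size=128 match 128 * ((32 + 1) * 1024) / 128 = 33792.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --block-size 128
```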

## Batch Size Settings

The default `batch_size=256` is not optimal for long contexts (8K+). Recompilations may occur if there is not enough KV cache space for some sequence groups.

If recompilation messages or recomputation warnings appear during inference, reduce `batch_size` to improve stability, as shown after the example messages below.

**Recompilation message example:**

```bash
Configuration: (prompt, 1, 36864) was not warmed-up!
```

**Warning message example:**

```bash
Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory.
```
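
If these messages appear, one way to reduce the batch size is shown below; this is a hedged sketch that assumes `batch_size` corresponds to vLLM's `--max-num-seqs` argument (default 256):

```bash
# Assumed example: lower the maximum number of concurrently batched sequences
# from the default of 256 to improve stability for long contexts.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --max-num-seqs 64
```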

## Multi-Step Scheduling

Enabling Multi-Step Scheduling is recommended for better decode performance. Refer to vllm-project#6854 for more details.
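
A minimal sketch of enabling it, assuming the `--num-scheduler-steps` server argument is the switch for this feature:

```bash
# Assumed example: enable multi-step scheduling for better decode throughput
# (flag name and step count are illustrative; see vllm-project#6854).
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --num-scheduler-steps 10
```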

# Troubleshooting

The following steps address Out-of-Memory (OOM) related errors:
