vLLM Offline Client
llm_annotator.clients.vllm_offline_client
vLLM offline provider implementation.
VLLMOfflineRuntimeOptions (dataclass)
    VLLMOfflineRuntimeOptions(
        max_tokens: int | None = None,
        json_schema: dict[str, Any] | None = None,
        top_k: int | None = None,
        repetition_penalty: float | None = None,
        chat_template_kwargs: dict[str, Any] | None = None,
        temperature: float | None = None,
        top_p: float | None = None,
        stop: list[str] | None = None,
        presence_penalty: float = 0.0,
        frequency_penalty: float = 0.0,
        seed: int | None = None,
        n: int = 1,
        whitespace_pattern: str | None = "[ ]?",
    )
Bases: VLLMBaseRuntimeOptions
Generation options for the vLLM offline client.
Extends VLLMBaseRuntimeOptions (which provides top_k,
repetition_penalty, and chat_template_kwargs) with
SamplingParams-compatible fields for in-process vLLM inference.
Attributes:
| Name | Type | Description |
|---|---|---|
| `max_tokens` | `int \| None` | Maximum number of output tokens. Inherited from a base class. |
| `json_schema` | `dict[str, Any] \| None` | Optional JSON schema dict for structured output via guided decoding. Inherited from a base class. |
| `top_k` | `int \| None` | Top-k sampling cutoff. Inherited from VLLMBaseRuntimeOptions. |
| `repetition_penalty` | `float \| None` | Multiplicative penalty for token repetition. Inherited from VLLMBaseRuntimeOptions. |
| `chat_template_kwargs` | `dict[str, Any] \| None` | Additional kwargs forwarded to the chat template. Inherited from VLLMBaseRuntimeOptions. |
| `temperature` | `float \| None` | Sampling temperature. |
| `top_p` | `float \| None` | Top-p nucleus sampling probability. |
| `stop` | `list[str] \| None` | Optional list of strings that halt generation when produced. |
| `presence_penalty` | `float` | Penalty applied to tokens already present in the output. Defaults to 0.0. |
| `frequency_penalty` | `float` | Penalty applied proportional to token frequency in the output. Defaults to 0.0. |
| `seed` | `int \| None` | Optional fixed random seed for reproducible generation. |
| `n` | `int` | Number of independent output sequences to generate per request. Defaults to 1. |
| `whitespace_pattern` | `str \| None` | Regex pattern inserted between JSON tokens during guided decoding. Only used when … |
to_payload
Build a SamplingParams-compatible payload dict.
The dict can be passed directly to vllm.SamplingParams(**payload).
chat_template_kwargs is intentionally excluded; it must be passed
separately to LLM.chat().
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | A dict of keyword arguments that can be passed to `vllm.SamplingParams`. |
Raises:
| Type | Description |
|---|---|
| `ImportError` | If vLLM is not installed. |
View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 153–192
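The payload-building step described for to_payload can be illustrated with a small stand-in dataclass. This is a sketch only, not the library's implementation: `SketchOptions` is a hypothetical class with a handful of the documented fields, and dropping unset (`None`) fields is an assumption that the source does not confirm. The one behaviour taken from the documentation is that chat_template_kwargs is excluded from the payload.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class SketchOptions:
    """Hypothetical stand-in for a few VLLMOfflineRuntimeOptions fields."""

    max_tokens: Optional[int] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    presence_penalty: float = 0.0
    n: int = 1
    chat_template_kwargs: Optional[dict[str, Any]] = None

    def to_payload(self) -> dict[str, Any]:
        payload = asdict(self)
        # Documented behaviour: chat_template_kwargs is not part of the
        # sampling payload and has to be passed separately.
        payload.pop("chat_template_kwargs")
        # Assumption: unset (None) fields are omitted so that the consumer's
        # own defaults apply.
        return {key: value for key, value in payload.items() if value is not None}


opts = SketchOptions(
    max_tokens=128,
    temperature=0.0,
    chat_template_kwargs={"enable_thinking": False},  # hypothetical template kwarg
)
payload = opts.to_payload()
print(payload)
```

With the real class, the resulting dict would be handed to vllm.SamplingParams(**payload), while chat_template_kwargs travels separately to LLM.chat().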
VLLMOfflineClient
    VLLMOfflineClient(
        model: str,
        *,
        tensor_parallel_size: int = 1,
        max_num_seqs: int = 256,
        gpu_memory_utilization: float = 0.9,
        enforce_eager: bool = False,
        quantization: str | None = None,
        max_model_len: int | None = None,
        max_num_batched_tokens: int | None = None,
        enable_prefix_caching: bool = True,
        enable_chunked_prefill: bool = True,
        language_model_only: bool = True,
        speculative_config: dict[str, Any] | None = None,
        extra_vllm_kwargs: dict[str, Any] | None = None,
        on_error: OnError = "warn",
        batch_size: int | None = None,
        min_batch_size: int = 1,
    )
Bases: Client[VLLMOfflineRuntimeOptions]
Offline vLLM client that runs inference in-process.
Loads the model into GPU memory on construction and uses vLLM's
LLM.chat API for batched generation. Supports structured output
via JSON schema guided decoding, automatic prefix caching, and chunked
prefill. Use as a context manager to ensure GPU resources are released
when done.
batch_generate automatically splits the message list into chunks of
batch_size and retries failing chunks with a halved size on CUDA
out-of-memory errors (see auto_reduce_batch_size). When
batch_size is None (the default), all messages are sent in a
single vLLM call, mirroring the original behaviour while still
recovering from OOM when possible.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | Hugging Face model identifier or local path. | required |
| `tensor_parallel_size` | `int` | Number of GPUs for tensor parallelism. | `1` |
| `max_num_seqs` | `int` | Maximum number of sequences processed in parallel. | `256` |
| `gpu_memory_utilization` | `float` | Target fraction of GPU memory to use. | `0.9` |
| `enforce_eager` | `bool` | Disable CUDA graphs and run in eager mode. | `False` |
| `quantization` | `str \| None` | Quantization method (e.g. …). | `None` |
| `max_model_len` | `int \| None` | Maximum total sequence length the model supports. | `None` |
| `max_num_batched_tokens` | `int \| None` | Maximum tokens per forward pass. | `None` |
| `enable_prefix_caching` | `bool` | Enable automatic KV-cache prefix reuse. | `True` |
| `enable_chunked_prefill` | `bool` | Process prefills in chunks to bound memory. | `True` |
| `language_model_only` | `bool` | If True, all non-text modalities are disabled, saving some memory. | `True` |
| `speculative_config` | `dict[str, Any] \| None` | Optional dict of vLLM speculative decoding config parameters. | `None` |
| `extra_vllm_kwargs` | `dict[str, Any] \| None` | Additional keyword arguments forwarded to … | `None` |
| `batch_size` | `int \| None` | Starting chunk size for batch_generate. | `None` |
| `min_batch_size` | `int` | Smallest permitted chunk size before an OOM error is re-raised. Must be >= 1. | `1` |
Examples:
Basic generation:
    client = VLLMOfflineClient(
        model="meta-llama/Llama-3.2-3B-Instruct",
        max_model_len=4096,
    )
    response = client.generate(
        messages=[{"role": "user", "content": "Hello!"}]
    )
    client.destroy()  # release GPU memory explicitly
Context manager (recommended):
    with VLLMOfflineClient(
        model="meta-llama/Llama-3.2-3B-Instruct",
        max_model_len=4096,
    ) as client:
        responses = client.batch_generate(
            messages=[
                [{"role": "user", "content": "Hello!"}],
                [{"role": "user", "content": "What is 2+2?"}],
            ]
        )
Structured output with JSON schema:
    schema = {
        "type": "object",
        "properties": {"label": {"type": "string"}},
        "required": ["label"],
    }
    opts = VLLMOfflineRuntimeOptions(
        max_tokens=128, json_schema=schema
    )
    with VLLMOfflineClient(
        model="meta-llama/Llama-3.2-3B-Instruct"
    ) as client:
        responses = client.batch_generate(
            messages=[
                [{"role": "user", "content": "Classify: great"}]
            ],
            options=opts,
        )
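Once a structured response comes back, its text can be parsed and checked against the schema's required keys. The sketch below uses only the standard library and a hard-coded string in place of real model output; how the generated text is read from a Response object is not shown in this reference, so `raw_text` and the `parse_label` helper are hypothetical.

```python
import json
from typing import Any

schema = {
    "type": "object",
    "properties": {"label": {"type": "string"}},
    "required": ["label"],
}


def parse_label(raw_text: str, schema: dict[str, Any]) -> dict[str, Any]:
    """Parse JSON text and check only the schema's `required` keys."""
    data = json.loads(raw_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


raw_text = '{"label": "positive"}'  # hypothetical model output
parsed = parse_label(raw_text, schema)
print(parsed["label"])
```

This checks required keys only; a full JSON Schema validator would be needed to verify types and nested structure.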
Initialize the offline vLLM client and load the model into memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | Hugging Face model identifier or local path. | required |
| `tensor_parallel_size` | `int` | Number of GPUs for tensor parallelism. | `1` |
| `max_num_seqs` | `int` | Maximum number of sequences processed in parallel. | `256` |
| `gpu_memory_utilization` | `float` | Target fraction of GPU memory to use. | `0.9` |
| `enforce_eager` | `bool` | Disable CUDA graphs and run in eager mode. | `False` |
| `quantization` | `str \| None` | Quantization method (e.g. …). | `None` |
| `max_model_len` | `int \| None` | Maximum total sequence length the model supports. | `None` |
| `max_num_batched_tokens` | `int \| None` | Maximum tokens per forward pass. | `None` |
| `enable_prefix_caching` | `bool` | Enable automatic KV-cache prefix reuse. Particularly beneficial when many prompts share a common prefix (e.g. a system message), since the shared prefix is only encoded once. | `True` |
| `enable_chunked_prefill` | `bool` | Process prefills in chunks to reduce peak memory usage and improve scheduling efficiency. | `True` |
| `language_model_only` | `bool` | If True, all non-text modalities are disabled, saving some memory. | `True` |
| `speculative_config` | `dict[str, Any] \| None` | Optional dict of vLLM speculative decoding config parameters. | `None` |
| `extra_vllm_kwargs` | `dict[str, Any] \| None` | Additional keyword arguments forwarded to … | `None` |
| `on_error` | `OnError` | Error behavior when generation fails. Defaults to 'warn'. | `'warn'` |
| `batch_size` | `int \| None` | Starting chunk size for batch_generate. | `None` |
| `min_batch_size` | `int` | Smallest permitted chunk size before an OOM is re-raised. Must be >= 1. | `1` |
Raises:
| Type | Description |
|---|---|
| `ImportError` | If vLLM is not installed (raised on first use). |
View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 283–354
warm_up
    warm_up(
        *,
        system_message: str | None = None,
        prompt_prefix: str | None = None,
        options: VLLMOfflineRuntimeOptions | None = None,
    ) -> None
Prime the KV-cache with a shared prefix before the main workload.
When many prompts share a common system message or prompt prefix, running a single cheap forward pass first ensures the shared tokens are cached before the first real batch, avoiding a cold-start latency spike on the initial batch.
This is a no-op if neither system_message nor prompt_prefix
is provided, or if the model has not been loaded yet.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `system_message` | `str \| None` | Optional system message used in every request. | `None` |
| `prompt_prefix` | `str \| None` | Optional fixed prefix that starts every user turn. | `None` |
| `options` | `VLLMOfflineRuntimeOptions \| None` | Optional generation options. Only used to derive a base … | `None` |
Raises:
| Type | Description |
|---|---|
| `ProviderError` | If the warm-up inference call fails. |
View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 400–469
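To make the warm-up idea concrete, the sketch below builds the kind of single conversation such a call might send. It is an illustration under assumptions, not the method's actual logic: `build_warm_up_messages` is a hypothetical helper, and placing the prefix in a user turn (empty when no prefix is given) is a guess. The documented no-op case, where neither argument is provided, is represented by returning `None`.

```python
from typing import Optional


def build_warm_up_messages(
    system_message: Optional[str] = None,
    prompt_prefix: Optional[str] = None,
) -> Optional[list[dict[str, str]]]:
    """Return one conversation containing the shared prefix, or None."""
    if system_message is None and prompt_prefix is None:
        return None  # nothing to cache, so the warm-up would be skipped
    messages: list[dict[str, str]] = []
    if system_message is not None:
        messages.append({"role": "system", "content": system_message})
    # Assumption: the shared prefix opens the user turn; an empty user turn
    # is used when only a system message is shared.
    messages.append({"role": "user", "content": prompt_prefix or ""})
    return messages


warm_up_conversation = build_warm_up_messages(
    system_message="You are a careful annotator.",
    prompt_prefix="Classify the following text:",
)
print(warm_up_conversation)
```

Because prefix caching reuses the encoded tokens of a shared prefix, sending this conversation once before the real workload is what lets later requests skip re-encoding it.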
generate
    generate(
        *,
        messages: list[dict[str, str]],
        options: VLLMOfflineRuntimeOptions | None = None,
        gen_kwargs: dict[str, Any] | None = None,
    ) -> Response
Generate a single response for a conversation.
Delegates to batch_generate with a single-item batch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `messages` | `list[dict[str, str]]` | Conversation as a list of role/content dicts. | required |
| `options` | `VLLMOfflineRuntimeOptions \| None` | Optional generation configuration. Pass a VLLMOfflineRuntimeOptions instance to use vLLM-specific settings. | `None` |
| `gen_kwargs` | `dict[str, Any] \| None` | Additional provider-specific generation kwargs that are not covered by … | `None` |
Returns:
| Type | Description |
|---|---|
| `Response` | A Response object containing the generated text and metadata. |
Raises:
| Type | Description |
|---|---|
| `ProviderError` | If the vLLM call fails or the stop reason is an error condition. |
View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 508–537
batch_generate
    batch_generate(
        *,
        messages: list[list[dict[str, str]]],
        options: VLLMOfflineRuntimeOptions | None = None,
        gen_kwargs: dict[str, Any] | None = None,
    ) -> list[Response]
Generate responses for a batch of conversations.
The full messages list is automatically split into chunks and each
chunk is dispatched to vLLM separately. On a CUDA out-of-memory error
the chunk size is halved and retried. Chunk size and minimum are
configured via the batch_size and min_batch_size constructor
arguments. Response order matches input order.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `messages` | `list[list[dict[str, str]]]` | List of conversations, where each conversation is a list of role/content dicts. | required |
| `options` | `VLLMOfflineRuntimeOptions \| None` | Optional generation configuration. Pass a VLLMOfflineRuntimeOptions instance to use vLLM-specific settings such as temperature, top-p, or a JSON schema. | `None` |
| `gen_kwargs` | `dict[str, Any] \| None` | Additional provider-specific generation kwargs that are not covered by … | `None` |
Returns:
| Type | Description |
|---|---|
| `list[Response]` | A list of Response objects, one per input conversation, in the same order as the input. |
Raises:
| Type | Description |
|---|---|
| `ProviderError` | If the model is not loaded or the vLLM call fails. |
View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 539–635
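The chunking described for batch_generate can be pictured with a small helper. This is a sketch of the general technique rather than the library's code: `split_into_chunks` is a hypothetical function that treats `batch_size=None` as "one chunk with everything", matching the documented default, and keeps the input order so results can be concatenated back in order.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def split_into_chunks(items: list[T], batch_size: Optional[int]) -> list[list[T]]:
    """Split items into consecutive chunks, preserving input order."""
    if batch_size is None:
        # Default: a single chunk holding every item (none if the input is empty).
        return [list(items)] if items else []
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


conversations = ["c1", "c2", "c3", "c4", "c5"]  # placeholders for message lists
chunks = split_into_chunks(conversations, batch_size=2)
single = split_into_chunks(conversations, batch_size=None)

# Concatenating per-chunk results in order restores the input order.
flattened = [item for chunk in chunks for item in chunk]
print(chunks, single, flattened)
```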
destroy
Free GPU memory and clean up all vLLM resources.
Safe to call multiple times; subsequent calls after the first are no-ops. Also invoked automatically when the client is used as a context manager.
View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 637–675
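The idempotent clean-up and context-manager behaviour described for destroy follows a common Python pattern, sketched below with a stand-in class. `SketchClient`, its `_engine` attribute, and the `release_count` counter are hypothetical and exist only to make the behaviour observable; the real client releases GPU memory instead.

```python
class SketchClient:
    """Hypothetical client illustrating idempotent clean-up."""

    def __init__(self) -> None:
        self._engine: object | None = object()  # stands in for the loaded model
        self.release_count = 0

    def destroy(self) -> None:
        if self._engine is None:
            return  # already released: later calls are no-ops
        self._engine = None
        self.release_count += 1

    def __enter__(self) -> "SketchClient":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on exceptions; returning None lets
        # any exception propagate.
        self.destroy()


with SketchClient() as sketch_client:
    pass  # generation calls would go here

sketch_client.destroy()  # second call does nothing
print(sketch_client.release_count)
```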
auto_reduce_batch_size
Decorate a batch_generate method to retry with halved chunk size on OOM.
Intended for use with VLLMOfflineClient. On each call the full
messages list is split into chunks and dispatched one at a time. When a
CUDA out-of-memory error is detected the current chunk size is halved and
the failing chunk is retried at the new size. This continues until the chunk
succeeds or the size would fall below the instance's _min_batch_size,
at which point the error is re-raised.
The chunk size and minimum are read from the instance's _batch_size and
_min_batch_size attributes on every call, so they can be adjusted after
construction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `method` | `Callable[..., list[Response]]` | Unbound batch_generate method. | required |
Returns:
| Type | Description |
|---|---|
| `Callable[..., list[Response]]` | The wrapped method with adaptive OOM-recovery logic applied. |
View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 42–103
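The retry-with-halving behaviour can be sketched as a small decorator. This is an illustration of the technique, not the library's source: `auto_reduce_sketch`, `FakeOutOfMemoryError`, and `FakeClient` are hypothetical names, the fake error stands in for torch.cuda.OutOfMemoryError, and keeping the reduced size for the remaining chunks of the same call is an assumption. Only the `_batch_size` and `_min_batch_size` attribute names come from the description above.

```python
import functools
from typing import Any, Callable


class FakeOutOfMemoryError(RuntimeError):
    """Hypothetical stand-in for torch.cuda.OutOfMemoryError."""


def auto_reduce_sketch(method: Callable[..., list[Any]]) -> Callable[..., list[Any]]:
    @functools.wraps(method)
    def wrapper(self, *, messages: list[Any], **kwargs: Any) -> list[Any]:
        # Read the sizes from the instance on every call.
        size = self._batch_size or max(len(messages), 1)
        results: list[Any] = []
        start = 0
        while start < len(messages):
            chunk = messages[start : start + size]
            try:
                results.extend(method(self, messages=chunk, **kwargs))
            except FakeOutOfMemoryError:
                new_size = min(size, len(chunk)) // 2
                if new_size < self._min_batch_size:
                    raise  # cannot shrink further: re-raise the OOM
                size = new_size
                continue  # retry the same chunk at the smaller size
            start += len(chunk)
        return results

    return wrapper


class FakeClient:
    """Hypothetical client that 'runs out of memory' above `capacity` items."""

    def __init__(self, batch_size, min_batch_size=1, capacity=2):
        self._batch_size = batch_size
        self._min_batch_size = min_batch_size
        self.capacity = capacity
        self.call_sizes: list[int] = []

    @auto_reduce_sketch
    def batch_generate(self, *, messages):
        self.call_sizes.append(len(messages))
        if len(messages) > self.capacity:
            raise FakeOutOfMemoryError("simulated OOM")
        return [message.upper() for message in messages]


fake_client = FakeClient(batch_size=8)
outputs = fake_client.batch_generate(messages=["a", "b", "c", "d", "e"])
print(outputs, fake_client.call_sizes)
```

The sketch catches only a directly raised OOM error; the documented implementation additionally inspects the exception chain.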
Adaptive batching on out-of-memory errors
When running inference on large datasets, the batch passed to batch_generate
can exceed available GPU memory. batch_generate is decorated with
auto_reduce_batch_size, which automatically splits the message list into
chunks and halves the chunk size whenever a CUDA out-of-memory error is
detected, retrying the failing chunk at the smaller size.
The chunk size is controlled by the batch_size and min_batch_size
constructor arguments:
    from llm_annotator.clients.vllm_offline_client import (
        VLLMOfflineClient,
        VLLMOfflineRuntimeOptions,
    )

    # my_texts is your own list of input strings.
    messages = [
        [{"role": "user", "content": text}]
        for text in my_texts  # (1)
    ]
    opts = VLLMOfflineRuntimeOptions(max_tokens=256, temperature=0.0)
    with VLLMOfflineClient(
        model="meta-llama/Llama-3.2-3B-Instruct",
        max_model_len=4096,
        batch_size=64,  # (2)
        min_batch_size=1,  # (3)
    ) as client:
        responses = client.batch_generate(messages=messages, options=opts)
1. Build one conversation per input text.
2. Start by processing 64 conversations per vLLM call. On OOM this halves to 32, then 16, and so on.
3. Re-raise the OOM error if the chunk size would drop below this value. Defaults to 1.
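The halving sequence from the annotations above can be worked out with a few lines of arithmetic. `halving_schedule` is a hypothetical helper written for this illustration; it lists the chunk sizes that would be tried if every attempt ran out of memory, assuming integer halving.

```python
def halving_schedule(batch_size: int, min_batch_size: int = 1) -> list[int]:
    """List the chunk sizes tried if every attempt hits an OOM error."""
    if batch_size < 1 or min_batch_size < 1:
        raise ValueError("batch_size and min_batch_size must be >= 1")
    sizes = []
    size = batch_size
    while size >= min_batch_size:
        sizes.append(size)
        size //= 2  # integer halving, as in 64 -> 32 -> 16
    return sizes


print(halving_schedule(64, 1))
print(halving_schedule(64, 8))
```

Raising min_batch_size shortens the schedule, so a persistent OOM surfaces after fewer retries.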
When batch_size is None (the default), all messages are sent in a single
call, mirroring the original behaviour while still recovering automatically
if that single call triggers an OOM.
OOM detection walks the full exception chain, so it works whether the raw
torch.cuda.OutOfMemoryError propagates directly or is wrapped inside a
ProviderError, as happens with on_error="raise".
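Walking an exception chain means following each exception's `__cause__` (set by `raise ... from ...`) or `__context__` (set implicitly) until an OOM error is found. The sketch below shows that general technique with stand-in classes; `FakeOutOfMemoryError`, `FakeProviderError`, and `is_oom` are hypothetical names and do not reproduce the library's detection code.

```python
from typing import Optional


class FakeOutOfMemoryError(RuntimeError):
    """Hypothetical stand-in for torch.cuda.OutOfMemoryError."""


class FakeProviderError(Exception):
    """Hypothetical stand-in for the library's ProviderError."""


def is_oom(exc: Optional[BaseException]) -> bool:
    """Return True if exc, or anything in its chain, is an OOM error."""
    seen: set[int] = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against cyclic chains
        if isinstance(exc, FakeOutOfMemoryError):
            return True
        # Prefer the explicit cause, fall back to the implicit context.
        exc = exc.__cause__ or exc.__context__
    return False


try:
    try:
        raise FakeOutOfMemoryError("CUDA out of memory")
    except FakeOutOfMemoryError as err:
        raise FakeProviderError("generation failed") from err
except FakeProviderError as wrapped:
    caught = wrapped  # keep a reference after the except block ends

print(is_oom(caught), is_oom(ValueError("unrelated")))
```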
Note
batch_size controls only the number of conversations sent in a single
Python call to vLLM. It is independent of the max_num_seqs constructor
argument, which governs the vLLM scheduler and requires reloading the
model to change.