vLLM Offline Client

llm_annotator.clients.vllm_offline_client

vLLM offline provider implementation.

VLLMOfflineRuntimeOptions dataclass

VLLMOfflineRuntimeOptions(
    max_tokens: int | None = None,
    json_schema: dict[str, Any] | None = None,
    top_k: int | None = None,
    repetition_penalty: float | None = None,
    chat_template_kwargs: dict[str, Any] | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    stop: list[str] | None = None,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    seed: int | None = None,
    n: int = 1,
    whitespace_pattern: str | None = "[ ]?",
)

Bases: VLLMBaseRuntimeOptions

Generation options for the vLLM offline client.

Extends VLLMBaseRuntimeOptions (which provides top_k, repetition_penalty, and chat_template_kwargs) with SamplingParams-compatible fields for in-process vLLM inference.

Attributes:

Name Type Description
max_tokens int | None

Maximum number of output tokens. Inherited from llm_annotator.clients.base.ProviderRuntimeOptions.

json_schema dict[str, Any] | None

Optional JSON schema dict for structured output via guided decoding. Inherited from llm_annotator.clients.base.ProviderRuntimeOptions. When provided, vLLM constrains generation to valid JSON matching the schema.

top_k int | None

Top-k sampling cutoff. Inherited from VLLMBaseRuntimeOptions. None uses the model default.

repetition_penalty float | None

Multiplicative penalty for token repetition. Inherited from VLLMBaseRuntimeOptions.

chat_template_kwargs dict[str, Any] | None

Additional kwargs forwarded to the chat template. Inherited from VLLMBaseRuntimeOptions. Pass {"enable_thinking": True} here to enable thinking mode.

temperature float | None

Sampling temperature. None uses the model default.

top_p float | None

Top-p nucleus sampling probability. None uses the model default.

stop list[str] | None

Optional list of strings that halt generation when produced.

presence_penalty float

Penalty applied to tokens already present in the output. Defaults to 0.0 (vLLM default).

frequency_penalty float

Penalty applied proportional to token frequency in the output. Defaults to 0.0 (vLLM default).

seed int | None

Optional fixed random seed for reproducible generation.

n int

Number of independent output sequences to generate per request. Defaults to 1.

whitespace_pattern str | None

Regex pattern inserted between JSON tokens during guided decoding. Only used when json_schema is set.
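As a rough illustration of what the default whitespace_pattern permits, the sketch below applies it with Python's re module. The object pattern is hand-written for this example and is only an assumption about the general shape; the real grammar is derived from json_schema by the guided decoding backend.

```python
import re

WHITESPACE = r"[ ]?"  # default whitespace_pattern: zero or one space

# Hand-written stand-in for a regex that could be derived from a schema
# with one string property "label". Illustrative only.
pattern = re.compile(
    r"\{" + WHITESPACE
    + r'"label"' + WHITESPACE + r":" + WHITESPACE
    + r'"[^"]*"' + WHITESPACE
    + r"\}"
)

compact = '{"label":"positive"}'
spaced = '{ "label": "positive" }'
padded = '{\n    "label": "positive"\n}'

compact_ok = pattern.fullmatch(compact) is not None  # no whitespace at all
spaced_ok = pattern.fullmatch(spaced) is not None    # single spaces only
padded_ok = pattern.fullmatch(padded) is not None    # newlines and indentation
```

Under this sketch the compact and single-space forms match while the indented form does not, which is why a tight pattern keeps a model from spending output tokens on padding.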

to_payload

to_payload() -> dict[str, Any]

Build a SamplingParams-compatible payload dict.

The dict can be passed directly to vllm.SamplingParams(**payload). chat_template_kwargs is intentionally excluded; it must be passed separately to LLM.chat().

Returns:

Type Description
dict[str, Any]

A dict of SamplingParams-compatible keyword arguments.

Raises:

Type Description
ImportError

If vLLM is not installed.
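The following is a minimal sketch of the shape such a payload could take, not the library's implementation. SketchOptions is a hypothetical stand-in with a subset of the documented fields, and dropping None-valued fields so that downstream defaults apply is an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class SketchOptions:
    """Hypothetical stand-in with a subset of the documented fields."""

    max_tokens: Optional[int] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    presence_penalty: float = 0.0
    n: int = 1
    chat_template_kwargs: Optional[dict] = None

    def to_payload(self) -> dict[str, Any]:
        payload = asdict(self)
        # Not a sampling parameter: it travels separately to the chat call.
        payload.pop("chat_template_kwargs")
        # Assumption: unset (None) fields are omitted so defaults apply.
        return {key: value for key, value in payload.items() if value is not None}


opts = SketchOptions(
    max_tokens=128,
    temperature=0.0,
    chat_template_kwargs={"enable_thinking": True},
)
payload = opts.to_payload()
# With vLLM installed, the dict would be used as vllm.SamplingParams(**payload).
```

Note that temperature=0.0 survives the filter because only None is treated as unset.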

VLLMOfflineClient

VLLMOfflineClient(
    model: str,
    *,
    tensor_parallel_size: int = 1,
    max_num_seqs: int = 256,
    gpu_memory_utilization: float = 0.9,
    enforce_eager: bool = False,
    quantization: str | None = None,
    max_model_len: int | None = None,
    max_num_batched_tokens: int | None = None,
    enable_prefix_caching: bool = True,
    enable_chunked_prefill: bool = True,
    language_model_only: bool = True,
    speculative_config: dict[str, Any] | None = None,
    extra_vllm_kwargs: dict[str, Any] | None = None,
    on_error: OnError = "warn",
    batch_size: int | None = None,
    min_batch_size: int = 1,
)

Bases: Client[VLLMOfflineRuntimeOptions]

Offline vLLM client that runs inference in-process.

Loads the model into GPU memory on construction and uses vLLM's LLM.chat API for batched generation. Supports structured output via JSON schema guided decoding, automatic prefix caching, and chunked prefill. Use as a context manager to ensure GPU resources are released when done.

batch_generate automatically splits the message list into chunks of batch_size and retries failing chunks with a halved size on CUDA out-of-memory errors (see auto_reduce_batch_size). When batch_size is None (the default), all messages are sent in a single vLLM call, mirroring the original behaviour while still recovering from OOM when possible.

Parameters:

Name Type Description Default
model str

Hugging Face model identifier or local path.

required
tensor_parallel_size int

Number of GPUs for tensor parallelism.

1
max_num_seqs int

Maximum number of sequences processed in parallel.

256
gpu_memory_utilization float

Target fraction of GPU memory to use.

0.9
enforce_eager bool

Disable CUDA graphs and run in eager mode.

False
quantization str | None

Quantization method (e.g. "fp8", "awq").

None
max_model_len int | None

Maximum total sequence length the model supports.

None
max_num_batched_tokens int | None

Maximum tokens per forward pass.

None
enable_prefix_caching bool

Enable automatic KV-cache prefix reuse.

True
enable_chunked_prefill bool

Process prefills in chunks to bound memory.

True
language_model_only bool

If True, all non-text modalities are disabled, saving some memory.

True
speculative_config dict[str, Any] | None

Optional dict of vLLM speculative decoding config parameters.

None
extra_vllm_kwargs dict[str, Any] | None

Additional keyword arguments forwarded to vllm.LLM. Explicit constructor arguments take precedence over any conflicting keys here.

None
batch_size int | None

Starting chunk size for batch_generate. Defaults to None, which sends all messages in one call. On OOM the chunk size is halved automatically until the call succeeds or the chunk size would fall below min_batch_size.

None
min_batch_size int

Smallest permitted chunk size before an OOM error is re-raised. Must be >= 1.

1

Examples:

Basic generation:

client = VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
)
response = client.generate(
    messages=[{"role": "user", "content": "Hello!"}]
)
client.destroy()

Context manager (recommended):

with VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
) as client:
    responses = client.batch_generate(
        messages=[
            [{"role": "user", "content": "Hello!"}],
            [{"role": "user", "content": "What is 2+2?"}],
        ]
    )

Structured output with JSON schema:

schema = {
    "type": "object",
    "properties": {"label": {"type": "string"}},
    "required": ["label"],
}
opts = VLLMOfflineRuntimeOptions(
    max_tokens=128, json_schema=schema
)
with VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct"
) as client:
    responses = client.batch_generate(
        messages=[
            [{"role": "user", "content": "Classify: great"}]
        ],
        options=opts,
    )

Initialize the offline vLLM client and load the model into memory.

Parameters:

Name Type Description Default
model str

Hugging Face model identifier or local path.

required
tensor_parallel_size int

Number of GPUs for tensor parallelism.

1
max_num_seqs int

Maximum number of sequences processed in parallel.

256
gpu_memory_utilization float

Target fraction of GPU memory to use.

0.9
enforce_eager bool

Disable CUDA graphs and run in eager mode.

False
quantization str | None

Quantization method (e.g. "fp8", "awq").

None
max_model_len int | None

Maximum total sequence length the model supports.

None
max_num_batched_tokens int | None

Maximum tokens per forward pass.

None
enable_prefix_caching bool

Enable automatic KV-cache prefix reuse. Particularly beneficial when many prompts share a common prefix (e.g. a system message), since the shared prefix is only encoded once.

True
enable_chunked_prefill bool

Process prefills in chunks to reduce peak memory usage and improve scheduling efficiency.

True
language_model_only bool

If True, all non-text modalities are disabled, saving some memory. Defaults to True.

True
speculative_config dict[str, Any] | None

Optional dict of vLLM speculative decoding config parameters.

None
extra_vllm_kwargs dict[str, Any] | None

Additional keyword arguments forwarded to vllm.LLM. Explicit constructor arguments take precedence over any conflicting keys here.

None
on_error OnError

Error behavior when generation fails. Defaults to "warn".

'warn'
batch_size int | None

Starting chunk size for batch_generate. When None (the default), all messages are sent in one call. On OOM the chunk size is halved until the call succeeds or the chunk size would fall below min_batch_size.

None
min_batch_size int

Smallest permitted chunk size before an OOM is re-raised. Must be >= 1.

1

Raises:

Type Description
ImportError

If vLLM is not installed (raised on first use).
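The precedence rule for extra_vllm_kwargs can be pictured as a plain dict merge. This is a sketch under the assumption that the constructor combines both sources into one keyword dict; merge_llm_kwargs and the key some_engine_option are hypothetical names used only for illustration.

```python
from typing import Any, Optional


def merge_llm_kwargs(
    explicit: dict[str, Any],
    extra: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Combine both sources; explicit constructor arguments win on conflict."""
    # In a dict merge the later mapping wins, so explicit is unpacked last.
    return {**(extra or {}), **explicit}


merged = merge_llm_kwargs(
    explicit={"gpu_memory_utilization": 0.9, "max_model_len": 4096},
    extra={"gpu_memory_utilization": 0.5, "some_engine_option": 8},
)
```

Here the conflicting gpu_memory_utilization from extra is discarded, while the non-conflicting key passes through unchanged.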

warm_up

warm_up(
    *,
    system_message: str | None = None,
    prompt_prefix: str | None = None,
    options: VLLMOfflineRuntimeOptions | None = None,
) -> None

Prime the KV-cache with a shared prefix before the main workload.

When many prompts share a common system message or prompt prefix, running a single cheap forward pass first ensures the shared tokens are cached before the first real batch, avoiding a cold-start latency spike on the initial batch.

This is a no-op if neither system_message nor prompt_prefix is provided, or if the model has not been loaded yet.

Parameters:

Name Type Description Default
system_message str | None

Optional system message used in every request.

None
prompt_prefix str | None

Optional fixed prefix that starts every user turn.

None
options VLLMOfflineRuntimeOptions | None

Optional generation options. Only used to derive a base SamplingParams; max_tokens is forced to 1 for the warm-up run regardless of the value set here.

None

Raises:

Type Description
ProviderError

If the warm-up inference call fails.
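One way to picture the warm-up request is sketched below. Both helpers are hypothetical, and the exact conversation layout (a system turn followed by a user turn holding the prefix) is an assumption; only the no-op rule and the forced max_tokens of 1 come from the description above.

```python
from typing import Any, Optional


def build_warm_up_messages(
    system_message: Optional[str] = None,
    prompt_prefix: Optional[str] = None,
) -> Optional[list[dict[str, str]]]:
    """Return a single warm-up conversation, or None when there is nothing to cache."""
    if system_message is None and prompt_prefix is None:
        return None  # mirrors the documented no-op
    messages: list[dict[str, str]] = []
    if system_message is not None:
        messages.append({"role": "system", "content": system_message})
    # Assumption: a user turn closes the request; it carries the prefix if given.
    messages.append({"role": "user", "content": prompt_prefix or ""})
    return messages


def warm_up_payload(base: dict[str, Any]) -> dict[str, Any]:
    """Copy the sampling payload and force a one-token generation."""
    return {**base, "max_tokens": 1}


conversation = build_warm_up_messages(system_message="You are a careful annotator.")
payload = warm_up_payload({"max_tokens": 256, "temperature": 0.0})
```

Generating a single token keeps the warm-up cheap, since the goal is the prefill of the shared prefix rather than the output.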

generate

generate(
    *,
    messages: list[dict[str, str]],
    options: VLLMOfflineRuntimeOptions | None = None,
    gen_kwargs: dict[str, Any] | None = None,
) -> Response

Generate a single response for a conversation.

Delegates to batch_generate with a single-item batch.

Parameters:

Name Type Description Default
messages list[dict[str, str]]

Conversation as a list of role/content dicts.

required
options VLLMOfflineRuntimeOptions | None

Optional generation configuration. Pass a VLLMOfflineRuntimeOptions instance to use vLLM-specific settings.

None
gen_kwargs dict[str, Any] | None

Additional provider-specific generation kwargs that are not covered by options. Takes precedence over options when both set the same key.

None

Returns:

Type Description
Response

A Response object containing the generated text and metadata.

Raises:

Type Description
ProviderError

If the vLLM call fails or the stop reason is an error condition.
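The delegation and the precedence of gen_kwargs over options can be sketched as follows. SketchClient is a hypothetical stand-in that echoes its inputs instead of calling vLLM, and a plain dict stands in for the options payload.

```python
from typing import Any, Optional


class SketchClient:
    """Hypothetical stand-in; it echoes requests instead of running a model."""

    def batch_generate(
        self,
        *,
        messages: list[list[dict[str, str]]],
        options: Optional[dict[str, Any]] = None,
        gen_kwargs: Optional[dict[str, Any]] = None,
    ) -> list[dict[str, Any]]:
        # gen_kwargs is unpacked last, so its keys override those from options.
        params = {**(options or {}), **(gen_kwargs or {})}
        return [
            {"last_message": conversation[-1]["content"], "params": params}
            for conversation in messages
        ]

    def generate(
        self,
        *,
        messages: list[dict[str, str]],
        options: Optional[dict[str, Any]] = None,
        gen_kwargs: Optional[dict[str, Any]] = None,
    ) -> dict[str, Any]:
        # Wrap the conversation in a one-item batch, then unwrap the result.
        return self.batch_generate(
            messages=[messages], options=options, gen_kwargs=gen_kwargs
        )[0]


result = SketchClient().generate(
    messages=[{"role": "user", "content": "Hello!"}],
    options={"max_tokens": 128, "temperature": 0.7},
    gen_kwargs={"temperature": 0.0},
)
```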

batch_generate

batch_generate(
    *,
    messages: list[list[dict[str, str]]],
    options: VLLMOfflineRuntimeOptions | None = None,
    gen_kwargs: dict[str, Any] | None = None,
) -> list[Response]

Generate responses for a batch of conversations.

The full messages list is automatically split into chunks and each chunk is dispatched to vLLM separately. On a CUDA out-of-memory error the chunk size is halved and retried. Chunk size and minimum are configured via the batch_size and min_batch_size constructor arguments. Response order matches input order.

Parameters:

Name Type Description Default
messages list[list[dict[str, str]]]

List of conversations, where each conversation is a list of role/content dicts.

required
options VLLMOfflineRuntimeOptions | None

Optional generation configuration. Pass a VLLMOfflineRuntimeOptions instance to use vLLM-specific settings such as temperature, top-p, or a JSON schema.

None
gen_kwargs dict[str, Any] | None

Additional provider-specific generation kwargs that are not covered by options. Takes precedence over options when both set the same key.

None

Returns:

Type Description
list[Response]

A list of Response objects, one per input conversation, in the same order as the input.

Raises:

Type Description
ProviderError

If the model is not loaded or the vLLM call fails.

destroy

destroy() -> None

Free GPU memory and clean up all vLLM resources.

Safe to call multiple times; subsequent calls after the first are no-ops. Also invoked automatically when the client is used as a context manager.
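The idempotent clean-up and context-manager behaviour described here follow a common pattern, sketched below. SketchResource is a hypothetical stand-in; it only tracks a placeholder object and does not reflect how the client actually releases GPU memory.

```python
class SketchResource:
    """Hypothetical stand-in for a client that owns a GPU-resident model."""

    def __init__(self) -> None:
        self.model = object()  # placeholder for the loaded engine
        self.release_count = 0

    def destroy(self) -> None:
        if self.model is None:
            return  # already released, so later calls are no-ops
        self.model = None
        self.release_count += 1

    def __enter__(self) -> "SketchResource":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.destroy()
        return False  # never swallow an exception raised in the with-body


with SketchResource() as resource:
    pass  # work with the resource here

resource.destroy()  # explicit second call after the with-block: no effect
```

When the client is not used as a context manager, wrapping the work in try/finally with destroy() in the finally clause gives the same guarantee if generation raises.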

auto_reduce_batch_size

auto_reduce_batch_size(
    method: Callable[..., list[Response]],
) -> Callable[..., list[Response]]

Decorate a batch_generate method to retry with halved chunk size on OOM.

Intended for use with VLLMOfflineClient. On each call the full messages list is split into chunks and dispatched one at a time. When a CUDA out-of-memory error is detected the current chunk size is halved and the failing chunk is retried at the new size. This continues until the chunk succeeds or the size would fall below the instance's _min_batch_size, at which point the error is re-raised.

The chunk size and minimum are read from the instance's _batch_size and _min_batch_size attributes on every call, so they can be adjusted after construction.

Parameters:

Name Type Description Default
method Callable[..., list[Response]]

Unbound batch_generate method to wrap.

required

Returns:

Type Description
Callable[..., list[Response]]

The wrapped method with adaptive OOM-recovery logic applied.
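A minimal sketch of this halving strategy follows; it is not the library's source. FakeOOM stands in for torch.cuda.OutOfMemoryError so the example runs without a GPU, SketchClient and its capacity attribute are hypothetical, and keeping the reduced size for the remaining chunks is an assumption.

```python
import functools


class FakeOOM(RuntimeError):
    """Stand-in for a CUDA out-of-memory error."""


def auto_reduce_sketch(method):
    @functools.wraps(method)
    def wrapper(self, *, messages, **kwargs):
        # Read on every call, so the values can be adjusted after construction.
        size = self._batch_size or len(messages) or 1
        results = []
        start = 0
        while start < len(messages):
            chunk = messages[start:start + size]
            try:
                results.extend(method(self, messages=chunk, **kwargs))
            except FakeOOM:
                new_size = size // 2
                if new_size < self._min_batch_size:
                    raise  # cannot shrink further: surface the error
                size = new_size
                continue  # retry from the same position with a smaller chunk
            start += len(chunk)
        return results

    return wrapper


class SketchClient:
    def __init__(self, batch_size=None, min_batch_size=1, capacity=3):
        self._batch_size = batch_size
        self._min_batch_size = min_batch_size
        self.capacity = capacity  # largest chunk the simulated GPU can hold
        self.calls = []           # chunk sizes actually attempted

    @auto_reduce_sketch
    def batch_generate(self, *, messages):
        self.calls.append(len(messages))
        if len(messages) > self.capacity:
            raise FakeOOM("simulated CUDA out of memory")
        return [message.upper() for message in messages]


client = SketchClient(batch_size=8, capacity=3)
outputs = client.batch_generate(messages=list("abcdefghij"))
```

In this run the attempted chunk sizes are 8 and 4 (both fail), then 2 five times, and the outputs stay in input order because a failed chunk is retried from the same starting position.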

Adaptive batching on out-of-memory errors

When running inference on large datasets the batch passed to batch_generate can exceed available GPU memory. batch_generate is decorated with auto_reduce_batch_size, which automatically splits the message list into chunks and halves the chunk size whenever a CUDA out-of-memory error is detected, retrying the failing chunk at the smaller size.

The chunk size is controlled by the batch_size and min_batch_size constructor arguments:

from llm_annotator import VLLMOfflineClient, VLLMRuntimeOptions

messages = [
    [{"role": "user", "content": text}]
    for text in my_texts  # (1)!
]

opts = VLLMRuntimeOptions(max_tokens=256, temperature=0.0)

with VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
    batch_size=64,     # (2)!
    min_batch_size=1,  # (3)!
) as client:
    responses = client.batch_generate(messages=messages, options=opts)
  1. Build one conversation per input text.
  2. Start by processing 64 conversations per vLLM call. On OOM this halves to 32, then 16, and so on.
  3. Re-raise the OOM error if the chunk size would drop below this value. Defaults to 1.

When batch_size is None (the default) all messages are sent in a single call, mirroring the original behaviour while still recovering automatically if that single call triggers an OOM.

OOM detection walks the full exception chain, so it works whether the raw torch.cuda.OutOfMemoryError propagates directly or is wrapped inside a ProviderError (as happens with on_error="raise").
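Walking an exception chain can be sketched with Python's standard __cause__ and __context__ attributes. FakeOOM and SketchProviderError are hypothetical stand-ins for torch.cuda.OutOfMemoryError and ProviderError, and is_oom is an illustrative helper rather than the library's function.

```python
class FakeOOM(RuntimeError):
    """Stand-in for a CUDA out-of-memory error."""


class SketchProviderError(Exception):
    """Stand-in for the wrapper raised by the client."""


def is_oom(exc):
    """Return True if exc or anything it was raised from is an OOM error."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against cycles in the chain
        if isinstance(exc, FakeOOM):
            return True
        # Prefer the explicit cause ("raise ... from ..."), else the implicit context.
        exc = exc.__cause__ or exc.__context__
    return False


try:
    try:
        raise FakeOOM("simulated CUDA out of memory")
    except FakeOOM as inner:
        raise SketchProviderError("vLLM call failed") from inner
except SketchProviderError as err:
    wrapped = err

direct_hit = is_oom(FakeOOM("simulated CUDA out of memory"))
wrapped_hit = is_oom(wrapped)
unrelated_hit = is_oom(ValueError("bad input"))
```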

Note

batch_size controls only the number of conversations sent in a single Python call to vLLM. It is independent of the max_num_seqs constructor argument, which governs the vLLM scheduler and requires reloading the model to change.