vLLM Offline Client¶

llm_annotator.clients.vllm_offline_client ¶

vLLM offline provider implementation.

VLLMOfflineRuntimeOptions `dataclass` ¶

VLLMOfflineRuntimeOptions(
    max_tokens: int | None = None,
    json_schema: dict[str, Any] | None = None,
    top_k: int | None = None,
    repetition_penalty: float | None = None,
    chat_template_kwargs: dict[str, Any] | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    stop: list[str] | None = None,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    seed: int | None = None,
    n: int = 1,
    whitespace_pattern: str | None = "[ ]?",
)

Bases: VLLMBaseRuntimeOptions

Generation options for the vLLM offline client.

Extends VLLMBaseRuntimeOptions (which provides top_k, repetition_penalty, and chat_template_kwargs) with SamplingParams-compatible fields for in-process vLLM inference.

Attributes:

Name	Type	Description
`max_tokens`	`int \| None`	Maximum number of output tokens. Inherited from `llm_annotator.clients.base.ProviderRuntimeOptions`.
`json_schema`	`dict[str, Any] \| None`	Optional JSON schema dict for structured output via guided decoding. Inherited from `llm_annotator.clients.base.ProviderRuntimeOptions`. When provided, vLLM constrains generation to valid JSON matching the schema.
`top_k`	`int \| None`	Top-k sampling cutoff. Inherited from `VLLMBaseRuntimeOptions`. `None` uses the model default.
`repetition_penalty`	`float \| None`	Multiplicative penalty for token repetition. Inherited from `VLLMBaseRuntimeOptions`.
`chat_template_kwargs`	`dict[str, Any] \| None`	Additional kwargs forwarded to the chat template. Inherited from `VLLMBaseRuntimeOptions`. Pass `{"enable_thinking": True}` here to enable thinking mode.
`temperature`	`float \| None`	Sampling temperature. `None` uses the model default.
`top_p`	`float \| None`	Top-p nucleus sampling probability. `None` uses the model default.
`stop`	`list[str] \| None`	Optional list of strings that halt generation when produced.
`presence_penalty`	`float`	Penalty applied to tokens already present in the output. Defaults to `0.0` (vLLM default).
`frequency_penalty`	`float`	Penalty applied proportional to token frequency in the output. Defaults to `0.0` (vLLM default).
`seed`	`int \| None`	Optional fixed random seed for reproducible generation.
`n`	`int`	Number of independent output sequences to generate per request. Defaults to `1`.
`whitespace_pattern`	`str \| None`	Regex pattern inserted between JSON tokens during guided decoding. Only used when `json_schema` is set.
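To make the effect of `whitespace_pattern` concrete, the sketch below hand-builds a regular expression for a toy schema with a single string field and places the default pattern `[ ]?` between the JSON tokens. The hand-built pattern is an assumption made for illustration only; how the guided-decoding backend actually derives its constraints from `json_schema` is not described here.

```python
import re

WHITESPACE = r"[ ]?"  # the documented default: at most one space between tokens

# Illustrative pattern for the toy schema {"label": <string>}.
# A real backend would derive its own constraints from the JSON schema.
pattern = re.compile(
    r"\{" + WHITESPACE + r'"label"' + WHITESPACE + ":" + WHITESPACE
    + r'"[^"]*"' + WHITESPACE + r"\}"
)

compact = '{"label": "positive"}'
indented = '{\n  "label": "positive"\n}'

print(bool(pattern.fullmatch(compact)))   # single spaces are allowed
print(bool(pattern.fullmatch(indented)))  # newlines and indentation are not
```

Under this reading, a tight pattern keeps the model from spending output tokens on indentation or newlines.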

to_payload ¶

to_payload() -> dict[str, Any]

Build a SamplingParams-compatible payload dict.

The dict can be passed directly to vllm.SamplingParams(**payload). chat_template_kwargs is intentionally excluded; it must be passed separately to LLM.chat().

Returns:

Type	Description
`dict[str, Any]`	A dict of `SamplingParams`-compatible keyword arguments.

Raises:

Type	Description
`ImportError`	If vLLM is not installed.

View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 153–192
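A minimal sketch of how the payload and `chat_template_kwargs` might be used together. The helper name `chat_with_options` is hypothetical. It assumes `llm` behaves like `vllm.LLM`, `sampling_params_cls` like `vllm.SamplingParams`, and that the installed vLLM version accepts a `chat_template_kwargs` keyword on `LLM.chat`. The objects are passed in as arguments so the helper itself does not import vLLM.

```python
from typing import Any


def chat_with_options(
    llm: Any,
    sampling_params_cls: Any,
    conversations: list,
    options: Any,
) -> Any:
    """Run llm.chat with sampling settings taken from options.

    Assumptions: llm is a vllm.LLM-like object, sampling_params_cls is
    vllm.SamplingParams, and options is a VLLMOfflineRuntimeOptions.
    """
    # The payload holds only SamplingParams-compatible keyword arguments.
    payload = options.to_payload()
    sampling_params = sampling_params_cls(**payload)
    return llm.chat(
        conversations,
        sampling_params=sampling_params,
        # Excluded from the payload, so it is forwarded separately.
        chat_template_kwargs=options.chat_template_kwargs,
    )
```

With vLLM installed, a call would look like `chat_with_options(llm, vllm.SamplingParams, conversations, opts)`.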

VLLMOfflineClient ¶

VLLMOfflineClient(
    model: str,
    *,
    tensor_parallel_size: int = 1,
    max_num_seqs: int = 256,
    gpu_memory_utilization: float = 0.9,
    enforce_eager: bool = False,
    quantization: str | None = None,
    max_model_len: int | None = None,
    max_num_batched_tokens: int | None = None,
    enable_prefix_caching: bool = True,
    enable_chunked_prefill: bool = True,
    language_model_only: bool = True,
    speculative_config: dict[str, Any] | None = None,
    extra_vllm_kwargs: dict[str, Any] | None = None,
    on_error: OnError = "warn",
    batch_size: int | None = None,
    min_batch_size: int = 1,
)

Bases: Client[VLLMOfflineRuntimeOptions]

Offline vLLM client that runs inference in-process.

Loads the model into GPU memory on construction and uses vLLM's LLM.chat API for batched generation. Supports structured output via JSON schema guided decoding, automatic prefix caching, and chunked prefill. Use as a context manager to ensure GPU resources are released when done.

batch_generate automatically splits the message list into chunks of batch_size and, on a CUDA out-of-memory error, retries the failing chunk at half the size (see auto_reduce_batch_size). When batch_size is None (the default), all messages are sent in a single vLLM call, as if no chunking were configured, while still recovering from OOM when possible.

Parameters:

Name	Type	Description	Default
`model`	`str`	Hugging Face model identifier or local path.	required
`tensor_parallel_size`	`int`	Number of GPUs for tensor parallelism.	`1`
`max_num_seqs`	`int`	Maximum number of sequences processed in parallel.	`256`
`gpu_memory_utilization`	`float`	Target fraction of GPU memory to use.	`0.9`
`enforce_eager`	`bool`	Disable CUDA graphs and run in eager mode.	`False`
`quantization`	`str \| None`	Quantization method (e.g. `"fp8"`, `"awq"`).	`None`
`max_model_len`	`int \| None`	Maximum total sequence length the model supports.	`None`
`max_num_batched_tokens`	`int \| None`	Maximum tokens per forward pass.	`None`
`enable_prefix_caching`	`bool`	Enable automatic KV-cache prefix reuse.	`True`
`enable_chunked_prefill`	`bool`	Process prefills in chunks to bound memory.	`True`
`language_model_only`	`bool`	If `True`, all non-text modalities are disabled, saving some memory.	`True`
`speculative_config`	`dict[str, Any] \| None`	Optional dict of vLLM speculative decoding config parameters.	`None`
`extra_vllm_kwargs`	`dict[str, Any] \| None`	Additional keyword arguments forwarded to `vllm.LLM`. Explicit constructor arguments take precedence over any conflicting keys here.	`None`
`on_error`	`OnError`	Error behavior when generation fails.	`'warn'`
`batch_size`	`int \| None`	Starting chunk size for `batch_generate`. When `None`, all messages are sent in one call. On OOM the chunk size is halved automatically until the chunk succeeds or the size would fall below `min_batch_size`.	`None`
`min_batch_size`	`int`	Smallest permitted chunk size before an OOM error is re-raised. Must be >= 1.	`1`

Examples:

Basic generation:

client = VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
)
response = client.generate(
    messages=[{"role": "user", "content": "Hello!"}]
)
client.destroy()

Context manager (recommended):

with VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
) as client:
    responses = client.batch_generate(
        messages=[
            [{"role": "user", "content": "Hello!"}],
            [{"role": "user", "content": "What is 2+2?"}],
        ]
    )

Structured output with JSON schema:

schema = {
    "type": "object",
    "properties": {"label": {"type": "string"}},
    "required": ["label"],
}
opts = VLLMOfflineRuntimeOptions(
    max_tokens=128, json_schema=schema
)
with VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct"
) as client:
    responses = client.batch_generate(
        messages=[
            [{"role": "user", "content": "Classify: great"}]
        ],
        options=opts,
    )

Initialize the offline vLLM client and load the model into memory.

Parameters:

Name	Type	Description	Default
`model`	`str`	Hugging Face model identifier or local path.	required
`tensor_parallel_size`	`int`	Number of GPUs for tensor parallelism.	`1`
`max_num_seqs`	`int`	Maximum number of sequences processed in parallel.	`256`
`gpu_memory_utilization`	`float`	Target fraction of GPU memory to use.	`0.9`
`enforce_eager`	`bool`	Disable CUDA graphs and run in eager mode.	`False`
`quantization`	`str \| None`	Quantization method (e.g. `"fp8"`, `"awq"`).	`None`
`max_model_len`	`int \| None`	Maximum total sequence length the model supports.	`None`
`max_num_batched_tokens`	`int \| None`	Maximum tokens per forward pass.	`None`
`enable_prefix_caching`	`bool`	Enable automatic KV-cache prefix reuse. Particularly beneficial when many prompts share a common prefix (e.g. a system message), since the shared prefix is only encoded once.	`True`
`enable_chunked_prefill`	`bool`	Process prefills in chunks to reduce peak memory usage and improve scheduling efficiency.	`True`
`language_model_only`	`bool`	If `True`, all non-text modalities are disabled, saving some memory. Defaults to `True`.	`True`
`speculative_config`	`dict[str, Any] \| None`	Optional dict of vLLM speculative decoding config parameters.	`None`
`extra_vllm_kwargs`	`dict[str, Any] \| None`	Additional keyword arguments forwarded to `vllm.LLM`. Explicit constructor arguments take precedence over any conflicting keys here.	`None`
`on_error`	`OnError`	Error behavior when generation fails. Defaults to `"warn"`.	`'warn'`
`batch_size`	`int \| None`	Starting chunk size for `batch_generate`. When `None` (the default) all messages are sent in one call. On OOM the chunk size is halved until the chunk succeeds or the size would fall below `min_batch_size`.	`None`
`min_batch_size`	`int`	Smallest permitted chunk size before an OOM is re-raised. Must be >= 1.	`1`

Raises:

Type	Description
`ImportError`	If vLLM is not installed (raised on first use).

View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 283–354

warm_up ¶

warm_up(
    *,
    system_message: str | None = None,
    prompt_prefix: str | None = None,
    options: VLLMOfflineRuntimeOptions | None = None,
) -> None

Prime the KV-cache with a shared prefix before the main workload.

When many prompts share a common system message or prompt prefix, running a single cheap forward pass first ensures the shared tokens are cached before the first real batch, avoiding a cold-start latency spike on the initial batch.

This is a no-op if neither system_message nor prompt_prefix is provided, or if the model has not been loaded yet.

Parameters:

Name	Type	Description	Default
`system_message`	`str \| None`	Optional system message used in every request.	`None`
`prompt_prefix`	`str \| None`	Optional fixed prefix that starts every user turn.	`None`
`options`	`VLLMOfflineRuntimeOptions \| None`	Optional generation options. Only used to derive a base `SamplingParams`; `max_tokens` is forced to 1 for the warm-up run regardless of the value set here.	`None`

Raises:

Type	Description
`ProviderError`	If the warm-up inference call fails.

View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 400–469
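A short usage sketch for the warm-up step, assuming every conversation starts with the same system message. The helper name `annotate_with_shared_system_message` is hypothetical; `client` is assumed to be a `VLLMOfflineClient` (or any object exposing the same `warm_up` and `batch_generate` methods).

```python
def annotate_with_shared_system_message(client, texts, system_message):
    """Warm the prefix cache once, then run the real batch."""
    # One cheap pass so the shared system message is cached up front.
    client.warm_up(system_message=system_message)

    # Every conversation begins with the same system message, which is
    # the case where prefix caching is expected to help.
    messages = [
        [
            {"role": "system", "content": system_message},
            {"role": "user", "content": text},
        ]
        for text in texts
    ]
    return client.batch_generate(messages=messages)
```

The warm-up only pays off when `enable_prefix_caching` is left on, since it relies on the cached prefix being reused by later requests.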

generate ¶

generate(
    *,
    messages: list[dict[str, str]],
    options: VLLMOfflineRuntimeOptions | None = None,
    gen_kwargs: dict[str, Any] | None = None,
) -> Response

Generate a single response for a conversation.

Delegates to batch_generate with a single-item batch.

Parameters:

Name	Type	Description	Default
`messages`	`list[dict[str, str]]`	Conversation as a list of role/content dicts.	required
`options`	`VLLMOfflineRuntimeOptions \| None`	Optional generation configuration. Pass a VLLMOfflineRuntimeOptions instance to use vLLM-specific settings.	`None`
`gen_kwargs`	`dict[str, Any] \| None`	Additional provider-specific generation kwargs that are not covered by `options`. Takes precedence over `options`.	`None`

Returns:

Type	Description
`Response`	A Response object containing the generated text and metadata.

Raises:

Type	Description
`ProviderError`	If the vLLM call fails or the stop reason is an error condition.

View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 508–537
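The precedence rule for `gen_kwargs` can be pictured as a dictionary overlay. This is a sketch of the rule as documented, not the client's actual code; the helper name `resolve_generation_kwargs` is hypothetical.

```python
from typing import Any, Optional


def resolve_generation_kwargs(
    options_payload: dict[str, Any],
    gen_kwargs: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Overlay gen_kwargs on the options payload; gen_kwargs wins on conflicts."""
    return {**options_payload, **(gen_kwargs or {})}


resolved = resolve_generation_kwargs(
    {"max_tokens": 128, "temperature": 0.0},  # e.g. from options.to_payload()
    {"temperature": 0.7, "min_tokens": 4},    # per-call overrides
)
print(resolved)  # {'max_tokens': 128, 'temperature': 0.7, 'min_tokens': 4}
```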

batch_generate ¶

batch_generate(
    *,
    messages: list[list[dict[str, str]]],
    options: VLLMOfflineRuntimeOptions | None = None,
    gen_kwargs: dict[str, Any] | None = None,
) -> list[Response]

Generate responses for a batch of conversations.

The full messages list is automatically split into chunks and each chunk is dispatched to vLLM separately. On a CUDA out-of-memory error the chunk size is halved and retried. Chunk size and minimum are configured via the batch_size and min_batch_size constructor arguments. Response order matches input order.

Parameters:

Name	Type	Description	Default
`messages`	`list[list[dict[str, str]]]`	List of conversations, where each conversation is a list of role/content dicts.	required
`options`	`VLLMOfflineRuntimeOptions \| None`	Optional generation configuration. Pass a VLLMOfflineRuntimeOptions instance to use vLLM-specific settings such as temperature, top-p, or a JSON schema.	`None`
`gen_kwargs`	`dict[str, Any] \| None`	Additional provider-specific generation kwargs that are not covered by `options`. Takes precedence over `options`.	`None`

Returns:

Type	Description
`list[Response]`	A list of Response objects, one per input conversation, in the same order as the input.

Raises:

Type	Description
`ProviderError`	If the model is not loaded or the vLLM call fails.

View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 539–635

destroy ¶

destroy() -> None

Free GPU memory and clean up all vLLM resources.

Safe to call multiple times; subsequent calls after the first are no-ops. Also invoked automatically when the client is used as a context manager.

View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 637–675
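The idempotent teardown described above follows a common pattern, sketched here with a hypothetical `ManagedResource` class. It stands in for the client and holds a placeholder object instead of a loaded model; the real cleanup steps are not shown.

```python
class ManagedResource:
    """Hypothetical stand-in for a client that owns an expensive resource."""

    def __init__(self) -> None:
        self._engine = object()  # placeholder for the loaded model
        self.release_count = 0

    def destroy(self) -> None:
        if self._engine is None:
            return  # already released, so later calls do nothing
        self._engine = None
        self.release_count += 1

    def __enter__(self) -> "ManagedResource":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.destroy()  # runs even if the with-body raised


with ManagedResource() as res:
    res.destroy()  # explicit early release

# Leaving the with-block called destroy() again, which was a no-op.
print(res.release_count)  # 1
```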

auto_reduce_batch_size ¶

auto_reduce_batch_size(
    method: Callable[..., list[Response]],
) -> Callable[..., list[Response]]

Decorate a batch_generate method to retry with halved chunk size on OOM.

Intended for use with VLLMOfflineClient. On each call the full messages list is split into chunks and dispatched one at a time. When a CUDA out-of-memory error is detected the current chunk size is halved and the failing chunk is retried at the new size. This continues until the chunk succeeds or the size would fall below the instance's _min_batch_size, at which point the error is re-raised.

The chunk size and minimum are read from the instance's _batch_size and _min_batch_size attributes on every call, so they can be adjusted after construction.

Parameters:

Name	Type	Description	Default
`method`	`Callable[..., list[Response]]`	Unbound `batch_generate` method to wrap.	required

Returns:

Type	Description
`Callable[..., list[Response]]`	The wrapped method with adaptive OOM-recovery logic applied.

View source on GitHub: src/llm_annotator/clients/vllm_offline_client.py lines 42–103
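The retry loop described above can be sketched as a small decorator. This is an illustration under stated assumptions rather than the library's implementation: `FakeOOM` stands in for `torch.cuda.OutOfMemoryError` so the sketch runs without a GPU, `reduce_on_oom` and `ToyClient` are hypothetical names, and halving is assumed to use integer division.

```python
import functools
from typing import Any, Callable


class FakeOOM(RuntimeError):
    """Stand-in for a CUDA out-of-memory error."""


def reduce_on_oom(method: Callable[..., list]) -> Callable[..., list]:
    """Split messages into chunks and halve the chunk size on FakeOOM."""

    @functools.wraps(method)
    def wrapper(self, *, messages: list, **kwargs: Any) -> list:
        # Read the limits on every call so they can change after construction.
        size = self._batch_size or max(len(messages), 1)
        results: list = []
        start = 0
        while start < len(messages):
            chunk = messages[start : start + size]
            try:
                results.extend(method(self, messages=chunk, **kwargs))
            except FakeOOM:
                new_size = size // 2
                if new_size < self._min_batch_size:
                    raise  # cannot shrink further, surface the error
                size = new_size
                continue  # retry the same chunk at the smaller size
            start += len(chunk)
        return results

    return wrapper


class ToyClient:
    """Hypothetical client whose backend fails on chunks above capacity."""

    def __init__(self, batch_size, min_batch_size=1, capacity=4):
        self._batch_size = batch_size
        self._min_batch_size = min_batch_size
        self._capacity = capacity
        self.chunk_sizes: list = []

    @reduce_on_oom
    def batch_generate(self, *, messages):
        if len(messages) > self._capacity:
            raise FakeOOM("simulated CUDA out of memory")
        self.chunk_sizes.append(len(messages))
        return [m.upper() for m in messages]


client = ToyClient(batch_size=16, capacity=4)
out = client.batch_generate(messages=[f"m{i}" for i in range(10)])
print(client.chunk_sizes)  # [4, 4, 2]
```

In this sketch the reduced size is kept for the remaining chunks of the same call, and results are appended in input order.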

Adaptive batching on out-of-memory errors¶

When running inference on large datasets the batch passed to batch_generate can exceed available GPU memory. batch_generate is decorated with auto_reduce_batch_size, which automatically splits the message list into chunks and halves the chunk size whenever a CUDA out-of-memory error is detected, retrying the failing chunk at the smaller size.

The chunk size is controlled by the batch_size and min_batch_size constructor arguments:

from llm_annotator.clients.vllm_offline_client import (
    VLLMOfflineClient,
    VLLMOfflineRuntimeOptions,
)

my_texts = ["great product", "terrible service"]  # placeholder inputs

messages = [
    [{"role": "user", "content": text}]
    for text in my_texts  # (1)
]

opts = VLLMOfflineRuntimeOptions(max_tokens=256, temperature=0.0)

with VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
    batch_size=64,     # (2)
    min_batch_size=1,  # (3)
) as client:
    responses = client.batch_generate(messages=messages, options=opts)

1. Build one conversation per input text.
2. Start by processing 64 conversations per vLLM call. On OOM this halves to 32, then 16, and so on.
3. Re-raise the OOM error if the chunk size would drop below this value. Defaults to 1.
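As a quick arithmetic check of the halving sequence, the hypothetical helper below lists the chunk sizes that could be attempted for a given configuration. It assumes integer (floor) halving, which the text above does not spell out for odd sizes.

```python
def chunk_sizes_tried(batch_size: int, min_batch_size: int = 1) -> list:
    """List chunk sizes from batch_size down to the smallest allowed size."""
    sizes = []
    size = batch_size
    # Stop once the next size would fall below the permitted minimum.
    while size >= min_batch_size and size > 0:
        sizes.append(size)
        size //= 2
    return sizes


print(chunk_sizes_tried(64, 1))  # [64, 32, 16, 8, 4, 2, 1]
print(chunk_sizes_tried(64, 8))  # [64, 32, 16, 8]
```

Starting at 64 with a minimum of 1, a persistently failing chunk would be retried at most six times before the error is re-raised.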

When batch_size is None (the default), all messages are sent in a single call, as if no chunking were configured, while still recovering automatically if that single call triggers an OOM.

OOM detection walks the full exception chain, so it works whether the raw torch.cuda.OutOfMemoryError propagates directly or is wrapped inside a ProviderError (as happens with on_error="raise").
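Walking an exception chain can be sketched as follows. The function name `is_cuda_oom` is hypothetical, and matching on the class name `OutOfMemoryError` is an assumption made so the sketch runs without importing torch; the library's actual detection logic may differ.

```python
def is_cuda_oom(exc: BaseException) -> bool:
    """Return True if exc or anything in its chain looks like a CUDA OOM."""
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        # Match by class name so torch does not need to be importable.
        if type(current).__name__ == "OutOfMemoryError":
            return True
        # Follow explicit causes first, then implicit context.
        current = current.__cause__ or current.__context__
    return False


# Stand-in exception classes for the demonstration below.
class OutOfMemoryError(RuntimeError):
    pass


class ProviderError(Exception):
    pass


try:
    try:
        raise OutOfMemoryError("CUDA out of memory")
    except OutOfMemoryError as inner:
        raise ProviderError("vLLM call failed") from inner
except ProviderError as wrapped:
    print(is_cuda_oom(wrapped))  # True
```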

Note

batch_size controls only the number of conversations sent in a single Python call to vLLM. It is independent of the max_num_seqs constructor argument, which governs the vLLM scheduler and requires reloading the model to change.
