vLLM Client (server)¶

llm_annotator.clients.vllm_client ¶

vLLM provider implementation.

VLLMBaseRuntimeOptions `dataclass` ¶

VLLMBaseRuntimeOptions(
    max_tokens: int | None = None,
    json_schema: dict[str, Any] | None = None,
    top_k: int | None = None,
    repetition_penalty: float | None = None,
    chat_template_kwargs: dict[str, Any] | None = None,
)

Bases: ProviderRuntimeOptions

Shared generation options for both vLLM server and offline clients.

Attributes:

Name	Type	Description
`top_k`	`int \| None`	Controls the number of top tokens to consider. Set to -1 to consider all tokens.
`repetition_penalty`	`float \| None`	Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens; values < 1 encourage repetition.
`chat_template_kwargs`	`dict[str, Any] \| None`	Additional kwargs forwarded to the chat template.

to_payload ¶

to_payload() -> dict[str, Any]

Build the shared vLLM request payload dict.

Returns:

Type	Description
`dict[str, Any]`	A dict containing the fields common to the vLLM server and offline clients.

View source on GitHub: src/llm_annotator/clients/vllm_client.py lines 37–49

VLLMRuntimeOptions `dataclass` ¶

VLLMRuntimeOptions(
    max_tokens: int | None = None,
    json_schema: dict[str, Any] | None = None,
    top_k: int | None = None,
    repetition_penalty: float | None = None,
    chat_template_kwargs: dict[str, Any] | None = None,
    add_generation_prompt: bool = True,
    chat_template: str | None = None,
    mm_processor_kwargs: dict[str, Any] | None = None,
)

Bases: VLLMBaseRuntimeOptions

Generation options for the vLLM OpenAI-compatible server.

Extends `VLLMBaseRuntimeOptions` with server-specific parameters from the `/v1/chat/completions` extra-params API. See https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#api-reference

Attributes:

Name	Type	Description
`add_generation_prompt`	`bool`	If `True`, adds the generation prompt when the chat template is rendered. Defaults to `True`.
`chat_template`	`str \| None`	Optional chat template string. When omitted the model’s default template is used.
`mm_processor_kwargs`	`dict[str, Any] \| None`	Arguments forwarded to the model’s multi-modal processor (e.g. `{"num_crops": 4}` for Phi-3-Vision).

to_payload ¶

to_payload() -> dict[str, Any]

Build the vLLM server request payload dict.

Returns:

Type	Description
`dict[str, Any]`	A dict of vLLM server-specific request parameters, including all shared base fields.

View source on GitHub: src/llm_annotator/clients/vllm_client.py lines 73–94

VLLMClient ¶

VLLMClient(
    model: str | None = None,
    base_url: str = "http://localhost:8000/v1",
    on_error: OnError = "warn",
)

Bases: OpenAIClient[VLLMRuntimeOptions]

Client wrapper for vLLM's OpenAI-compatible server.

Initialize the vLLM client.

Parameters:

Name	Type	Description	Default
`model`	`str \| None`	vLLM model identifier.	`None`
`base_url`	`str`	Base URL for the vLLM API endpoint.	`'http://localhost:8000/v1'`
`on_error`	`OnError`	Error behavior when generation fails.	`'warn'`

View source on GitHub: src/llm_annotator/clients/vllm_client.py lines 102–124
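
A minimal construction sketch, based only on the signature above. The helper `vllm_base_url` and the function `make_client` are hypothetical names introduced for illustration; the import is deferred so the snippet can be read and run without `llm_annotator` installed.

```python
def vllm_base_url(host: str = "localhost", port: int = 8000) -> str:
    """Hypothetical helper: build the base URL of a vLLM OpenAI-compatible server."""
    return f"http://{host}:{port}/v1"


def make_client(model: str, host: str = "localhost", port: int = 8000):
    """Construct a VLLMClient pointed at the given host and port."""
    # Imported lazily; this line requires the llm_annotator package.
    from llm_annotator.clients.vllm_client import VLLMClient

    return VLLMClient(
        model=model,
        base_url=vllm_base_url(host, port),
        on_error="warn",  # the documented default
    )


# With no arguments the helper reproduces the documented default base URL.
print(vllm_base_url())
```

Calling `make_client(...)` with the name of a model served by the target vLLM instance would then return a ready-to-use client.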

batch_generate ¶

batch_generate(
    *,
    messages: list[list[dict[str, str]]],
    options: VLLMRuntimeOptions | None = None,
    gen_kwargs: dict[str, Any] | None = None,
    use_batch_api: bool = False,
    poll_interval: float = 10.0,
) -> list[Response]

Generate responses for a batch of inputs using vLLM's native batch endpoint.

Sends all conversations in a single request to `/v1/chat/completions/batch`. The OpenAI Batch API is not supported; passing `use_batch_api=True` raises a `ConfigurationError`.

Parameters:

Name	Type	Description	Default
`messages`	`list[list[dict[str, str]]]`	List of message lists, where each list is a conversation.	required
`options`	`VLLMRuntimeOptions \| None`	Optional generation configuration.	`None`
`gen_kwargs`	`dict[str, Any] \| None`	Additional provider-specific generation kwargs that are not covered by the standard options. Has precedence over `options`.	`None`
`use_batch_api`	`bool`	Must be `False`. The OpenAI Batch API is not supported by the vLLM server client.	`False`
`poll_interval`	`float`	Accepted for interface compatibility with `OpenAIClient`. Ignored.	`10.0`

Returns:

Type	Description
`list[Response]`	A list of Response objects, one per input conversation, indexed in the same order as input.

Raises:

Type	Description
`ConfigurationError`	If `use_batch_api=True`.
`ProviderError`	If the batch request fails.

View source on GitHub: src/llm_annotator/clients/vllm_client.py lines 126–233
