Skip to content

vLLM Client (server)

llm_annotator.clients.vllm_client

VLLM provider implementation.

VLLMBaseRuntimeOptions dataclass

VLLMBaseRuntimeOptions(
    max_tokens: int | None = None,
    json_schema: dict[str, Any] | None = None,
    top_k: int | None = None,
    repetition_penalty: float | None = None,
    chat_template_kwargs: dict[str, Any] | None = None,
)

Bases: ProviderRuntimeOptions

Shared generation options for both vLLM server and offline clients.

Attributes:

Name Type Description
top_k int | None

Controls the number of top tokens to consider. Set to -1 to consider all tokens.

repetition_penalty float | None

Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens; values < 1 encourage repetition.

chat_template_kwargs dict[str, Any] | None

Additional kwargs forwarded to the chat template.

to_payload

to_payload() -> dict[str, Any]

Build the shared vLLM request payload dict.

Returns:

Type Description
dict[str, Any]

A dict containing the fields common to the vLLM server and offline

dict[str, Any]

clients.

VLLMRuntimeOptions dataclass

VLLMRuntimeOptions(
    max_tokens: int | None = None,
    json_schema: dict[str, Any] | None = None,
    top_k: int | None = None,
    repetition_penalty: float | None = None,
    chat_template_kwargs: dict[str, Any] | None = None,
    add_generation_prompt: bool = True,
    chat_template: str | None = None,
    mm_processor_kwargs: dict[str, Any] | None = None,
)

Bases: VLLMBaseRuntimeOptions

Generation options for the vLLM OpenAI-compatible server.

Extends :class:VLLMBaseRuntimeOptions with server-specific parameters from the /v1/chat/completions extra-params API. See https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#api-reference

Attributes:

Name Type Description
add_generation_prompt bool

If True, appends a generation prompt to each message. Defaults to True.

chat_template str | None

Optional chat template string. When omitted the model’s default template is used.

mm_processor_kwargs dict[str, Any] | None

Arguments forwarded to the model’s multi-modal processor (e.g. {"num_crops": 4} for Phi-3-Vision).

to_payload

to_payload() -> dict[str, Any]

Build the vLLM server request payload dict.

Returns:

Type Description
dict[str, Any]

A dict of vLLM server-specific request parameters, including all

dict[str, Any]

shared base fields.

VLLMClient

VLLMClient(
    model: str | None = None,
    base_url: str = "http://localhost:8000/v1",
    on_error: OnError = "warn",
)

Bases: OpenAIClient[VLLMRuntimeOptions]

Client wrapper for VLLM's OpenAI-compatible server/client.

Initialize the VLLM client.

Parameters:

Name Type Description Default
model str | None

VLLM model identifier.

None
base_url str

Base URL for the vLLM API endpoint.

'http://localhost:8000/v1'
on_error OnError

Error behavior when generation fails.

'warn'

batch_generate

batch_generate(
    *,
    messages: list[list[dict[str, str]]],
    options: VLLMRuntimeOptions | None = None,
    gen_kwargs: dict[str, Any] | None = None,
    use_batch_api: bool = False,
    poll_interval: float = 10.0,
) -> list[Response]

Generate responses for a batch of inputs using vLLM's native batch endpoint.

Sends all conversations in a single request to /v1/chat/completions/batch. The OpenAI Batch API is not supported; passing use_batch_api=True raises a :class:ConfigurationError.

Parameters:

Name Type Description Default
messages list[list[dict[str, str]]]

List of message lists, where each list is a conversation.

required
options VLLMRuntimeOptions | None

Optional generation configuration.

None
gen_kwargs dict[str, Any] | None

Additional provider-specific generation kwargs that are not covered by the standard options. Has precedence over options.

None
use_batch_api bool

Must be False. The OpenAI Batch API is not supported by the vLLM server client.

False
poll_interval float

Accepted for interface compatibility with :class:OpenAIClient. Ignored.

10.0

Returns:

Type Description
list[Response]

A list of Response objects, one per input conversation,

list[Response]

indexed in the same order as input.

Raises:

Type Description
ConfigurationError

If use_batch_api=True.

ProviderError

If the batch request fails.