vLLM Client (server)¶
llm_annotator.clients.vllm_client
¶
VLLM provider implementation.
VLLMBaseRuntimeOptions
dataclass
¶
VLLMBaseRuntimeOptions(
max_tokens: int | None = None,
json_schema: dict[str, Any] | None = None,
top_k: int | None = None,
repetition_penalty: float | None = None,
chat_template_kwargs: dict[str, Any] | None = None,
)
Bases: ProviderRuntimeOptions
Shared generation options for both vLLM server and offline clients.
Attributes:
| Name | Type | Description |
|---|---|---|
top_k |
int | None
|
Controls the number of top tokens to consider. Set to -1 to consider all tokens. |
repetition_penalty |
float | None
|
Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens; values < 1 encourage repetition. |
chat_template_kwargs |
dict[str, Any] | None
|
Additional kwargs forwarded to the chat template. |
to_payload
¶
Build the shared vLLM request payload dict.
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
A dict containing the fields common to the vLLM server and offline |
dict[str, Any]
|
clients. |
View source on GitHub: src/llm_annotator/clients/vllm_client.py lines 37–49
VLLMRuntimeOptions
dataclass
¶
VLLMRuntimeOptions(
max_tokens: int | None = None,
json_schema: dict[str, Any] | None = None,
top_k: int | None = None,
repetition_penalty: float | None = None,
chat_template_kwargs: dict[str, Any] | None = None,
add_generation_prompt: bool = True,
chat_template: str | None = None,
mm_processor_kwargs: dict[str, Any] | None = None,
)
Bases: VLLMBaseRuntimeOptions
Generation options for the vLLM OpenAI-compatible server.
Extends :class:VLLMBaseRuntimeOptions with server-specific parameters
from the /v1/chat/completions extra-params API.
See https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#api-reference
Attributes:
| Name | Type | Description |
|---|---|---|
add_generation_prompt |
bool
|
If |
chat_template |
str | None
|
Optional chat template string. When omitted the model’s default template is used. |
mm_processor_kwargs |
dict[str, Any] | None
|
Arguments forwarded to the model’s multi-modal
processor (e.g. |
to_payload
¶
Build the vLLM server request payload dict.
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
A dict of vLLM server-specific request parameters, including all |
dict[str, Any]
|
shared base fields. |
View source on GitHub: src/llm_annotator/clients/vllm_client.py lines 73–94
VLLMClient
¶
VLLMClient(
model: str | None = None,
base_url: str = "http://localhost:8000/v1",
on_error: OnError = "warn",
)
Bases: OpenAIClient[VLLMRuntimeOptions]
Client wrapper for VLLM's OpenAI-compatible server/client.
Initialize the VLLM client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model
|
str | None
|
VLLM model identifier. |
None
|
base_url
|
str
|
Base URL for the vLLM API endpoint. |
'http://localhost:8000/v1'
|
on_error
|
OnError
|
Error behavior when generation fails. |
'warn'
|
View source on GitHub: src/llm_annotator/clients/vllm_client.py lines 102–124
batch_generate
¶
batch_generate(
*,
messages: list[list[dict[str, str]]],
options: VLLMRuntimeOptions | None = None,
gen_kwargs: dict[str, Any] | None = None,
use_batch_api: bool = False,
poll_interval: float = 10.0,
) -> list[Response]
Generate responses for a batch of inputs using vLLM's native batch endpoint.
Sends all conversations in a single request to /v1/chat/completions/batch.
The OpenAI Batch API is not supported; passing use_batch_api=True raises
a :class:ConfigurationError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
messages
|
list[list[dict[str, str]]]
|
List of message lists, where each list is a conversation. |
required |
options
|
VLLMRuntimeOptions | None
|
Optional generation configuration. |
None
|
gen_kwargs
|
dict[str, Any] | None
|
Additional provider-specific generation kwargs that are
not covered by the standard options. Has precedence over
|
None
|
use_batch_api
|
bool
|
Must be |
False
|
poll_interval
|
float
|
Accepted for interface compatibility with
:class: |
10.0
|
Returns:
| Type | Description |
|---|---|
list[Response]
|
A list of Response objects, one per input conversation, |
list[Response]
|
indexed in the same order as input. |
Raises:
| Type | Description |
|---|---|
ConfigurationError
|
If |
ProviderError
|
If the batch request fails. |
View source on GitHub: src/llm_annotator/clients/vllm_client.py lines 126–233