Annotator
llm_annotator.annotator
Annotator
dataclass
Annotator(
    client: Client,
    batch_size: int = 256,
    num_proc: int | None = DEFAULT_CPU_COUNT,
    verbose: bool = False,
)
Sensible base class for LLM-based dataset annotation.
This class provides a framework for annotating datasets using large language
models via a pluggable :class:`~llm_annotator.clients.base.Client`. It handles
dataset loading, processing, and output generation, with support for batching
and uploading to Hugging Face Hub.
The Annotator class has four public entry points:
* :meth:`prepare_data`. Apply prompt templates, sorting, and caching
  without running inference. Backs prepared artifacts up to Hugging Face
  Hub if ``prepared_hub_id`` is provided.
* :meth:`run_annotation`. Run inference only, consuming data returned by
  :meth:`prepare_data` or loaded from a local path or Hub repo.
* :meth:`annotate_dataset`. Convenience wrapper that calls
  :meth:`prepare_data` and then :meth:`run_annotation` in one call.
* :meth:`generate_dataset`. Generate a new dataset from scratch by calling
  :meth:`annotate_dataset` over a synthetic prompt dataset.
The staged :meth:`prepare_data` + :meth:`run_annotation` pattern is
recommended for large-scale or cluster (SLURM) workflows. When
``prepared_hub_id`` is provided, prepared artifacts are stored on
Hugging Face Hub and restored automatically on the next call, so a
failed generation job can be restarted without repeating the
preparation step.
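The staged pattern can be sketched as a small helper that calls the two documented entry points in order. This is only an illustration: `staged_annotation` and `StubAnnotator` are hypothetical names introduced here, and the stub stands in for a real `Annotator` so the sketch runs without the library or a model. Only keyword arguments listed in the signatures on this page are used.

```python
from pathlib import Path
from typing import Any, Optional, Union


def staged_annotation(
    annotator: Any,
    output_dir: Union[str, Path],
    prompt_template: str,
    dataset_name: str,
    prepared_hub_id: Optional[str] = None,
) -> Any:
    """Run preparation and inference as two separate stages."""
    # Stage 1: build or restore prepared inputs; no inference happens here.
    prepared, _local_path, _hub_id = annotator.prepare_data(
        output_dir,
        prompt_template,
        dataset_name=dataset_name,
        prepared_hub_id=prepared_hub_id,
    )
    # Stage 2: inference only, consuming the dataset returned by stage 1.
    return annotator.run_annotation(
        output_dir, prompt_template, prepared_dataset=prepared
    )


class StubAnnotator:
    """Hypothetical stand-in that records which entry points were called."""

    def __init__(self) -> None:
        self.calls = []

    def prepare_data(self, output_dir, prompt_template, **kwargs):
        self.calls.append("prepare_data")
        return ["prepared-row"], None, kwargs.get("prepared_hub_id")

    def run_annotation(self, output_dir, prompt_template, **kwargs):
        self.calls.append("run_annotation")
        return kwargs["prepared_dataset"]


stub = StubAnnotator()
result = staged_annotation(stub, "outputs/data", "Process: {text}", "my-dataset")
```

In a cluster setting the two stages would typically run as separate jobs, with the second job pointing `run_annotation` at the prepared data instead of receiving it in memory.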
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `client` | `Client` | An initialised client instance. | required |
| `batch_size` | `int` | Number of samples per inference batch. | `256` |
| `num_proc` | `int \| None` | Number of processes for dataset preprocessing. | `DEFAULT_CPU_COUNT` |
| `verbose` | `bool` | Whether to print progress information. | `False` |
Examples:
Basic usage with an OpenAI client:
from llm_annotator import Annotator, OpenAIClient
client = OpenAIClient(model="gpt-4o-mini")
with Annotator(client=client) as anno:
    ds = anno.annotate_dataset(
        output_dir="outputs/data",
        prompt_template="Process: {text}",
        dataset_name="my-dataset",
    )
Usage with vLLM offline client:
from llm_annotator import Annotator, VLLMOfflineClient
client = VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
)
try:
    ds = Annotator(client=client).annotate_dataset(
        output_dir="outputs/data",
        prompt_template="Process: {text}",
        dataset_name="my-dataset",
    )
finally:
    client.destroy()
__post_init__
Initialize a package-scoped logger for annotator runtime messages.
View source on GitHub: src/llm_annotator/annotator.py lines 146–148
__enter__
Enter the context manager, returning the annotator instance.
View source on GitHub: src/llm_annotator/annotator.py lines 150–152
__exit__
Exit the context manager and free all client resources.
View source on GitHub: src/llm_annotator/annotator.py lines 154–156
destroy
Clean up all resources used by the underlying client.
View source on GitHub: src/llm_annotator/annotator.py lines 158–160
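The relationship between `__enter__`, `__exit__`, and `destroy` can be illustrated with a toy class. `ManagedResource` is a hypothetical stand-in, not the library's implementation, and it assumes that `__exit__` does not suppress exceptions.

```python
class ManagedResource:
    """Toy stand-in mirroring the enter/exit/destroy protocol described above."""

    def __init__(self) -> None:
        self.destroyed = False

    def __enter__(self) -> "ManagedResource":
        # The instance itself is what `with ... as name` binds.
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on exceptions alike.
        # Returning None lets any exception propagate to the caller.
        self.destroy()

    def destroy(self) -> None:
        self.destroyed = True


with ManagedResource() as res:
    inside = res.destroyed  # cleanup has not run yet

try:
    with ManagedResource() as failing:
        raise RuntimeError("inference failed")
except RuntimeError:
    pass  # the error still reaches the caller after cleanup
```

This is why the `with Annotator(...)` form in the examples needs no explicit `destroy()` call, while the `try`/`finally` form does.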
prepare_data
prepare_data(
    output_dir: str | Path,
    prompt_template: str,
    *,
    dataset_name: str | None = None,
    dataset: Dataset | None = None,
    dataset_config: str | None = None,
    data_dir: str | None = None,
    dataset_split: str | None = None,
    max_num_samples: int | None = None,
    shuffle_seed: int | None = None,
    preprocess_fn: Callable | None = None,
    prompt_field_swapper: dict[str, str] | None = None,
    idx_column: str = "idx",
    task_prefix: str = "",
    sort_by_length: bool | Literal["shortest_first", "longest_first"] = False,
    system_message: str | None = None,
    prepared_hub_id: str | None = None,
    force_data_preparation: bool = False,
) -> tuple[Dataset, Path | None, str | None]
Prepare input data for annotation without running generation.
The method reuses local prepared data first, then optionally restores prepared data from Hugging Face Hub, and finally falls back to building the prepared dataset from source.
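The resolution order described above (local first, then Hub, then a rebuild) can be sketched with plain callables. `resolve_prepared` and its loader arguments are hypothetical; the sketch also assumes that forcing preparation skips both caches, which matches the `force_data_preparation` description but is not confirmed in detail here.

```python
from typing import Callable, Optional, Tuple


def resolve_prepared(
    load_local: Callable[[], Optional[list]],
    load_hub: Optional[Callable[[], Optional[list]]],
    build: Callable[[], list],
    force: bool = False,
) -> Tuple[list, str]:
    """Return (data, source) following the local -> Hub -> build order."""
    if not force:
        local = load_local()
        if local is not None:
            return local, "local"
        # The Hub lookup is optional, like an unset prepared_hub_id.
        if load_hub is not None:
            remote = load_hub()
            if remote is not None:
                return remote, "hub"
    # Nothing cached (or a rebuild was forced): prepare from source.
    return build(), "built"


from_local = resolve_prepared(lambda: [1], None, lambda: [9])
from_hub = resolve_prepared(lambda: None, lambda: [2], lambda: [9])
rebuilt = resolve_prepared(lambda: [1], lambda: [2], lambda: [9], force=True)
```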
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `str \| Path` | Directory where prepared artifacts are stored. | required |
| `prompt_template` | `str` | Prompt template used to build chat messages. | required |
| `dataset_name` | `str \| None` | Name or path of the dataset to load. | `None` |
| `dataset` | `Dataset \| None` | Pre-loaded dataset to use instead of loading from name/path. | `None` |
| `dataset_config` | `str \| None` | Dataset configuration name (optional). | `None` |
| `data_dir` | `str \| None` | Data directory for local datasets (optional). | `None` |
| `dataset_split` | `str \| None` | Specific split to load (optional). | `None` |
| `max_num_samples` | `int \| None` | Maximum number of samples to prepare. | `None` |
| `shuffle_seed` | `int \| None` | Seed for dataset shuffling. | `None` |
| `preprocess_fn` | `Callable \| None` | Optional function to preprocess the dataset after loading and before applying the prompt template. | `None` |
| `prompt_field_swapper` | `dict[str, str] \| None` | Optional mapping to replace template fields. | `None` |
| `idx_column` | `str` | Column name used as unique identifier. Must not exist in the input dataset. | `'idx'` |
| `task_prefix` | `str` | Prefix for internal columns and artifact names. | `''` |
| `sort_by_length` | `bool \| Literal['shortest_first', 'longest_first']` | Whether to sort prompts by length. | `False` |
| `system_message` | `str \| None` | Optional system message for chat prompts. | `None` |
| `prepared_hub_id` | `str \| None` | Optional Hugging Face dataset ID for prepared-data backup and restore. If provided, the prepared data is stored in the "prepared_cache" configuration of the repo. | `None` |
| `force_data_preparation` | `bool` | Whether to rebuild prepared data even when local or Hub artifacts already exist. | `False` |
Returns:
| Type | Description |
|---|---|
| `tuple[Dataset, Path \| None, str \| None]` | Tuple of prepared dataset, local prepared-data path when available, and prepared Hugging Face dataset ID when available. |
View source on GitHub: src/llm_annotator/annotator.py lines 524–662
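To make the interplay of `prompt_template`, `idx_column`, and `sort_by_length` concrete, here is a rough sketch on plain dictionaries. `render_and_sort` is a hypothetical helper; it assumes templates are filled with `str.format`, that length means character count, that `True` behaves like `"shortest_first"`, and that a clashing index column raises `ValueError`. None of these details are confirmed by this page.

```python
from typing import Dict, List, Union


def render_and_sort(
    rows: List[Dict[str, str]],
    prompt_template: str,
    sort_by_length: Union[bool, str] = False,
    idx_column: str = "idx",
) -> List[Dict[str, object]]:
    """Fill the template per row, tag rows with an index, optionally sort."""
    if any(idx_column in row for row in rows):
        # Assumed behaviour: the identifier column must be new.
        raise ValueError(f"Column {idx_column!r} already exists in the input")
    prepared = [
        {**row, idx_column: i, "prompt": prompt_template.format(**row)}
        for i, row in enumerate(rows)
    ]
    if sort_by_length:
        # Assumption: True is treated like "shortest_first".
        reverse = sort_by_length == "longest_first"
        prepared.sort(key=lambda r: len(r["prompt"]), reverse=reverse)
    return prepared


rows = [{"text": "a much longer input text"}, {"text": "hi"}]
shortest = render_and_sort(rows, "Process: {text}", sort_by_length="shortest_first")
longest = render_and_sort(rows, "Process: {text}", sort_by_length="longest_first")
```

Keeping the index on every row is what allows results to be matched back to the original order after sorting.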
run_annotation
run_annotation(
    output_dir: str | Path,
    prompt_template: str,
    *,
    prepared_dataset: Dataset | None = None,
    prepared_data_path: str | Path | None = None,
    prepared_hub_id: str | None = None,
    resume_from_hub_id: str | None = None,
    new_hub_id: str | None = None,
    overwrite: bool = False,
    dataset_split: str | None = None,
    dataset_config: str | None = None,
    keep_columns: str | Iterable[str] | bool | None = None,
    options: ProviderRuntimeOptions | None = None,
    output_schema: str | dict[str, Any] | None = None,
    idx_column: str = "idx",
    upload_every_n_samples: int | None = 0,
    max_samples_per_output_file: int = 0,
    task_prefix: str = "",
    validate_fn: Callable | None = None,
    postprocess_fn: Callable | None = None,
    num_retries_invalid: int = 5,
    system_message: str | None = None,
    keep_idx_column: bool = False,
) -> Dataset
Run model generation on already prepared annotation inputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `str \| Path` | Directory where annotation output is written. | required |
| `prompt_template` | `str` | Prompt template used for warm-up metadata. | required |
| `prepared_dataset` | `Dataset \| None` | Pre-prepared dataset with messages column. | `None` |
| `prepared_data_path` | `str \| Path \| None` | Local path to prepared data on disk. | `None` |
| `prepared_hub_id` | `str \| None` | Hugging Face dataset ID with prepared data. | `None` |
| `resume_from_hub_id` | `str \| None` | Optional Hugging Face dataset ID to restore generation checkpoints from. | `None` |
| `new_hub_id` | `str \| None` | Optional Hugging Face dataset ID for generation uploads. | `None` |
| `overwrite` | `bool` | Whether to overwrite existing output directory. | `False` |
| `dataset_split` | `str \| None` | Dataset split used for skip filtering. | `None` |
| `dataset_config` | `str \| None` | Dataset config used for skip filtering. | `None` |
| `keep_columns` | `str \| Iterable[str] \| bool \| None` | Columns to keep in output. | `None` |
| `options` | `ProviderRuntimeOptions \| None` | Runtime options passed to the client. | `None` |
| `output_schema` | `str \| dict[str, Any] \| None` | Convenience JSON schema input. When provided, it is injected into | `None` |
| `idx_column` | `str` | Column name used as unique identifier. | `'idx'` |
| `upload_every_n_samples` | `int \| None` | Upload to Hub every N samples. | `0` |
| `max_samples_per_output_file` | `int` | Maximum samples per output file. | `0` |
| `task_prefix` | `str` | Prefix for internal columns and file names. | `''` |
| `validate_fn` | `Callable \| None` | Optional custom validation function. | `None` |
| `postprocess_fn` | `Callable \| None` | Optional postprocessing function that takes in a sample and must return a dict. | `None` |
| `num_retries_invalid` | `int` | Number of retries for invalid outputs. | `5` |
| `system_message` | `str \| None` | Optional system message for chat prompts. | `None` |
| `keep_idx_column` | `bool` | Whether to keep idx column in final dataset. | `False` |
Returns:
| Type | Description |
|---|---|
| `Dataset` | Final concatenated annotation dataset. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no prepared data source can be resolved. |
View source on GitHub: src/llm_annotator/annotator.py lines 664–1024
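The `validate_fn` and `num_retries_invalid` parameters suggest a validate-and-retry loop around generation. The sketch below is one plausible shape for such a loop, not the library's code: `generate_with_retries` is hypothetical, the validator is assumed to return a boolean, the retry count is assumed to exclude the first attempt, and exhausted retries are assumed to yield `None`.

```python
from typing import Callable, Optional


def generate_with_retries(
    generate: Callable[[], str],
    validate_fn: Callable[[str], bool],
    num_retries_invalid: int = 5,
) -> Optional[str]:
    """Call generate() until validate_fn accepts the output or retries run out."""
    # One initial attempt plus up to num_retries_invalid retries.
    for _ in range(1 + num_retries_invalid):
        output = generate()
        if validate_fn(output):
            return output
    return None  # every attempt was rejected


# Simulated model outputs: two invalid responses, then a valid one.
attempts = iter(["", "not json", '{"label": "ok"}'])
accepted = generate_with_retries(
    lambda: next(attempts), lambda text: text.startswith("{")
)
rejected = generate_with_retries(
    lambda: "bad", lambda text: False, num_retries_invalid=2
)
```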
annotate_dataset
annotate_dataset(
    output_dir: str | Path,
    prompt_template: str | None = None,
    *,
    full_prompt_template: str | None = None,
    dataset_name: str | None = None,
    dataset: Dataset | None = None,
    dataset_config: str | None = None,
    data_dir: str | None = None,
    dataset_split: str | None = None,
    max_num_samples: int | None = None,
    shuffle_seed: int | None = None,
    preprocess_fn: Callable | None = None,
    prompt_field_swapper: dict[str, str] | None = None,
    idx_column: str = "idx",
    task_prefix: str = "",
    sort_by_length: bool | Literal["shortest_first", "longest_first"] = False,
    system_message: str | None = None,
    prepared_hub_id: str | None = None,
    force_data_preparation: bool = False,
    new_hub_id: str | None = None,
    overwrite: bool = False,
    keep_columns: str | Iterable[str] | bool | None = None,
    options: ProviderRuntimeOptions | None = None,
    output_schema: str | dict[str, Any] | None = None,
    upload_every_n_samples: int | None = 0,
    max_samples_per_output_file: int = 0,
    validate_fn: Callable | None = None,
    postprocess_fn: Callable | None = None,
    num_retries_invalid: int = 5,
    keep_idx_column: bool = False,
) -> Dataset
Annotate an existing dataset in one call.
This is a convenience wrapper around :meth:`prepare_data` and
:meth:`run_annotation` for callers that prefer a single entry point.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `str \| Path` | Directory where annotation output is written. | required |
| `prompt_template` | `str \| None` | Prompt template with dataset fields. Defaults to | `None` |
| `full_prompt_template` | `str \| None` | Backwards-compatible alias for | `None` |
| `dataset_name` | `str \| None` | Name or path of the dataset to load. | `None` |
| `dataset` | `Dataset \| None` | Pre-loaded dataset to annotate instead of loading one. | `None` |
| `dataset_config` | `str \| None` | Dataset configuration name. | `None` |
| `data_dir` | `str \| None` | Data directory for local datasets. | `None` |
| `dataset_split` | `str \| None` | Dataset split to load. | `None` |
| `max_num_samples` | `int \| None` | Maximum number of samples to annotate. | `None` |
| `shuffle_seed` | `int \| None` | Seed for dataset shuffling. | `None` |
| `preprocess_fn` | `Callable \| None` | Optional preprocessing callback. | `None` |
| `prompt_field_swapper` | `dict[str, str] \| None` | Optional mapping that renames prompt fields. | `None` |
| `idx_column` | `str` | Column name used as the stable sample identifier. | `'idx'` |
| `task_prefix` | `str` | Prefix for internal column names and output files. | `''` |
| `sort_by_length` | `bool \| Literal['shortest_first', 'longest_first']` | Whether to sort prompts by length. | `False` |
| `system_message` | `str \| None` | Optional system message for the chat prompt. | `None` |
| `prepared_hub_id` | `str \| None` | Optional Hub dataset ID for prepared-data cache. | `None` |
| `force_data_preparation` | `bool` | Rebuild prepared data even if cached. | `False` |
| `new_hub_id` | `str \| None` | Optional Hub dataset ID for annotation outputs. | `None` |
| `overwrite` | `bool` | Whether to overwrite the output directory. | `False` |
| `keep_columns` | `str \| Iterable[str] \| bool \| None` | Columns to keep in the final dataset. | `None` |
| `options` | `ProviderRuntimeOptions \| None` | Runtime options passed to the client. | `None` |
| `output_schema` | `str \| dict[str, Any] \| None` | Optional JSON schema for structured output. | `None` |
| `upload_every_n_samples` | `int \| None` | Upload checkpoint cadence. | `0` |
| `max_samples_per_output_file` | `int` | Maximum samples per output file. | `0` |
| `validate_fn` | `Callable \| None` | Optional validation callback. | `None` |
| `postprocess_fn` | `Callable \| None` | Optional postprocessing callback. | `None` |
| `num_retries_invalid` | `int` | Number of retries for invalid outputs. | `5` |
| `keep_idx_column` | `bool` | Whether to keep the index column in the result. | `False` |
Returns:
| Type | Description |
|---|---|
| `Dataset` | The concatenated annotation dataset. |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If no prompt template is provided. |
View source on GitHub: src/llm_annotator/annotator.py lines 1026–1158
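Since `output_schema` accepts either a string or a dict, and `validate_fn` can reject outputs, the two might be combined roughly as follows. This is a sketch under assumptions: `load_schema` and `make_validate_fn` are hypothetical helpers, a string schema is assumed to be JSON text (it could equally be handled differently by the library), and the validator is assumed to receive the raw model output as a string and return a boolean. Only top-level `required` keys are checked, which is far less than full JSON Schema validation.

```python
import json
from typing import Any, Callable, Dict, Union


def load_schema(output_schema: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Accept a schema as JSON text or as an already-parsed dict."""
    if isinstance(output_schema, str):
        return json.loads(output_schema)
    return dict(output_schema)


def make_validate_fn(schema: Dict[str, Any]) -> Callable[[str], bool]:
    """Build a validator that checks the schema's top-level 'required' keys."""
    required = schema.get("required", [])

    def validate(raw_output: str) -> bool:
        try:
            parsed = json.loads(raw_output)
        except json.JSONDecodeError:
            return False  # not JSON at all
        return isinstance(parsed, dict) and all(key in parsed for key in required)

    return validate


schema = load_schema('{"type": "object", "required": ["label"]}')
validate = make_validate_fn(schema)
```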
generate_dataset
generate_dataset(
    output_dir: str | Path,
    prompts: str | Sequence[str],
    *,
    prompt_prefix: str | None = None,
    new_hub_id: str | None = None,
    overwrite: bool = False,
    options: ProviderRuntimeOptions | None = None,
    max_num_samples: int | None = None,
    output_schema: str | dict[str, Any] | None = None,
    idx_column: str = "idx",
    upload_every_n_samples: int | None = 0,
    max_samples_per_output_file: int = 0,
    task_prefix: str = "",
    validate_fn: Callable | None = None,
    postprocess_fn: Callable | None = None,
    num_retries_invalid: int = 5,
    keep_idx_column: bool = False,
) -> Dataset
Generate a new dataset from prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `str \| Path` | Directory where annotation output is written. | required |
| `prompts` | `str \| Sequence[str]` | A single prompt or a sequence of prompts. | required |
| `prompt_prefix` | `str \| None` | Optional shared prefix used for prefix caching. | `None` |
| `new_hub_id` | `str \| None` | Optional Hub dataset ID for annotation outputs. | `None` |
| `overwrite` | `bool` | Whether to overwrite the output directory. | `False` |
| `options` | `ProviderRuntimeOptions \| None` | Runtime options passed to the client. | `None` |
| `max_num_samples` | `int \| None` | Number of times to repeat a single prompt. | `None` |
| `output_schema` | `str \| dict[str, Any] \| None` | Optional JSON schema for structured output. | `None` |
| `idx_column` | `str` | Column name used as the stable sample identifier. | `'idx'` |
| `upload_every_n_samples` | `int \| None` | Upload checkpoint cadence. | `0` |
| `max_samples_per_output_file` | `int` | Maximum samples per output file. | `0` |
| `task_prefix` | `str` | Prefix for internal column names and output files. | `''` |
| `validate_fn` | `Callable \| None` | Optional validation callback. | `None` |
| `postprocess_fn` | `Callable \| None` | Optional postprocessing callback. | `None` |
| `num_retries_invalid` | `int` | Number of retries for invalid outputs. | `5` |
| `keep_idx_column` | `bool` | Whether to keep the index column in the result. | `False` |
Returns:
| Type | Description |
|---|---|
| `Dataset` | The concatenated annotation dataset. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no prompts are provided. |
View source on GitHub: src/llm_annotator/annotator.py lines 1160–1246
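How `prompts`, `max_num_samples`, and `prompt_prefix` might turn into a synthetic prompt list can be sketched as below. `expand_prompts` is a hypothetical helper. It assumes the repeat count only applies to a single string prompt, that the prefix is simply prepended to each prompt, and that an empty input raises `ValueError` as listed under Raises.

```python
from typing import List, Optional, Sequence, Union


def expand_prompts(
    prompts: Union[str, Sequence[str]],
    max_num_samples: Optional[int] = None,
    prompt_prefix: Optional[str] = None,
) -> List[str]:
    """Build the list of prompts to send to the model."""
    if isinstance(prompts, str):
        # A single prompt is repeated; an empty string counts as no prompt.
        items = [prompts] * (max_num_samples or 1) if prompts else []
    else:
        items = list(prompts)
    if not items:
        raise ValueError("No prompts were provided")
    prefix = prompt_prefix or ""
    # A shared leading prefix is what makes prefix caching effective.
    return [prefix + prompt for prompt in items]


repeated = expand_prompts("Write a haiku.", max_num_samples=3)
prefixed = expand_prompts(["about rain", "about snow"], prompt_prefix="Write a haiku ")

try:
    expand_prompts([])
    raised = False
except ValueError:
    raised = True
```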
get_pfout_name
get_pfout_name(
    *,
    pdout: Path | str,
    max_samples_per_output_file: int,
    processed_n_samples: int | None = None,
) -> Path
Generate the output file name based on configuration.
Creates appropriate file names for output files, handling both single-file and multi-file output modes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pdout` | `Path \| str` | The output directory path. | required |
| `max_samples_per_output_file` | `int` | Maximum samples per output file (0 for unlimited). | required |
| `processed_n_samples` | `int \| None` | The number of samples processed so far. | `None` |
Returns:
| Type | Description |
|---|---|
| `Path` | Path object for the output file name. |
View source on GitHub: src/llm_annotator/annotator.py lines 1296–1323
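The single-file versus multi-file distinction can be illustrated with a small naming helper. Everything about the naming here is invented for illustration: `output_file_for`, the `annotations` stem, the `.jsonl` extension, and the shard formula are assumptions, and the real file names produced by `get_pfout_name` may differ.

```python
from pathlib import Path
from typing import Optional, Union


def output_file_for(
    pdout: Union[Path, str],
    max_samples_per_output_file: int,
    processed_n_samples: Optional[int] = None,
    stem: str = "annotations",  # hypothetical file stem
) -> Path:
    """Pick an output file: one file when unlimited, numbered shards otherwise."""
    pdout = Path(pdout)
    if max_samples_per_output_file <= 0:
        # 0 means unlimited, so everything goes into a single file.
        return pdout / f"{stem}.jsonl"
    # Otherwise the shard index follows from how many samples are already done.
    shard = (processed_n_samples or 0) // max_samples_per_output_file
    return pdout / f"{stem}_{shard:05d}.jsonl"


single = output_file_for("outputs/data", 0)
third_shard = output_file_for("outputs/data", 100, processed_n_samples=250)
```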
push_dir_to_hub
push_dir_to_hub(
    dir_path: Path | str,
    new_hub_id: str | None = None,
    *,
    task_prefix: str = "",
    revision: str | None = None,
    allow_patterns: list[str] | None = None,
    ignore_patterns: list[str] | None = None,
) -> None
Upload the output directory to Hugging Face Hub.
Creates a dataset repository and uploads all annotation files, excluding cached input data. Uses a separate branch for uploads.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dir_path` | `Path \| str` | Path to the directory containing annotation files. | required |
| `new_hub_id` | `str \| None` | Optional Hugging Face dataset ID to override the instance's new_hub_id. | `None` |
| `task_prefix` | `str` | String prefix to use for branch naming. | `''` |
| `revision` | `str \| None` | Optional explicit branch name. Defaults to | `None` |
| `allow_patterns` | `list[str] \| None` | Optional include patterns for upload. | `None` |
| `ignore_patterns` | `list[str] \| None` | Optional ignore patterns for upload. | `None` |
View source on GitHub: src/llm_annotator/annotator.py lines 1325–1385
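The effect of `allow_patterns` and `ignore_patterns` can be previewed locally before uploading. The sketch below assumes glob-style matching via the standard library's `fnmatch`, with allow patterns applied first and ignore patterns second; `select_files` and the example file names are hypothetical, and the actual upload may apply its patterns differently.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Optional


def select_files(
    paths: Iterable[str],
    allow_patterns: Optional[List[str]] = None,
    ignore_patterns: Optional[List[str]] = None,
) -> List[str]:
    """Keep paths matching any allow pattern and no ignore pattern."""
    selected = []
    for path in paths:
        # With allow patterns set, a path must match at least one of them.
        if allow_patterns is not None and not any(
            fnmatch(path, pattern) for pattern in allow_patterns
        ):
            continue
        # Any match against an ignore pattern excludes the path.
        if ignore_patterns is not None and any(
            fnmatch(path, pattern) for pattern in ignore_patterns
        ):
            continue
        selected.append(path)
    return selected


files = ["out_0.jsonl", "out_1.jsonl", "cache/inputs.arrow", "notes.txt"]
only_jsonl = select_files(files, allow_patterns=["*.jsonl"])
without_cache = select_files(files, ignore_patterns=["cache/*"])
```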
destroy_on_error
Decorate an Annotator method to call :meth:`~Annotator.destroy` on any exception.
Catches ``BaseException`` (including ``KeyboardInterrupt`` and ``SystemExit``)
so resources are freed even on forced termination. The original exception
is always re-raised after the cleanup attempt.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `func` | `Callable[..., Any]` | The instance method to wrap. | required |
Returns:
| Type | Description |
|---|---|
| `Callable[..., Any]` | The wrapped callable with automatic cleanup on failure. |
View source on GitHub: src/llm_annotator/annotator.py lines 47–73
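A decorator with the behaviour described above could look roughly like this. It is a re-implementation sketch, not the code in `annotator.py`: `destroy_on_error_sketch` and `Worker` are hypothetical names, and swallowing errors raised by the cleanup itself is an assumption drawn from the phrase "cleanup attempt".

```python
import functools
from typing import Any, Callable


def destroy_on_error_sketch(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap an instance method so self.destroy() runs if the method raises."""

    @functools.wraps(func)
    def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        try:
            return func(self, *args, **kwargs)
        except BaseException:
            # BaseException also covers KeyboardInterrupt and SystemExit.
            try:
                self.destroy()
            except Exception:
                pass  # assumed: a failing cleanup must not mask the original error
            raise  # re-raise the original exception unchanged

    return wrapper


class Worker:
    """Toy object with a destroy() method, standing in for an annotator."""

    def __init__(self) -> None:
        self.destroyed = False

    def destroy(self) -> None:
        self.destroyed = True

    @destroy_on_error_sketch
    def run(self, fail: bool) -> str:
        if fail:
            raise KeyboardInterrupt
        return "done"


ok = Worker()
ok_result = ok.run(fail=False)

bad = Worker()
try:
    bad.run(fail=True)
    interrupted = False
except KeyboardInterrupt:
    interrupted = True
```

On the success path the wrapper adds no cleanup, so a long-lived client can be reused across several decorated calls.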