Utils¶

llm_annotator.utils ¶

get_hash ¶

get_hash(text: str) -> str

Compute a SHA256 hash for a given text string.

Parameters:

Name	Type	Description	Default
`text`	`str`	The input string to hash.	required

Returns:

Type	Description
`str`	A 64-character hexadecimal SHA256 digest.

Examples:

len(get_hash("hello"))
# 64
get_hash("hello") == get_hash("hello")
# True
get_hash("hello") == get_hash("world")
# False

View source on GitHub: src/llm_annotator/utils.py lines 20–37

convert_int_to_annotated_str ¶

convert_int_to_annotated_str(num: int) -> str

Convert an integer to a concise string approximating its magnitude.

Parameters:

Name	Type	Description	Default
`num`	`int`	Non-negative integer to format.	required

Returns:

Type	Description
`str`	A compact string representation such as `"1B"`, `"1.2M"`, or `"1.2K"`.

Examples:

convert_int_to_annotated_str(1_000_000_000)
# '1B'
convert_int_to_annotated_str(1_234_567)
# '1.2M'
convert_int_to_annotated_str(1_234)
# '1.2K'
convert_int_to_annotated_str(42)
# '42'
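
The rounding behaviour implied by the examples above can be reproduced with a few lines of plain Python. The function below is a hypothetical re-implementation written only to match those four documented outputs; it is not the library's source, and its behaviour near unit boundaries (for example 999_950) is an assumption.

```python
def convert_int_to_annotated_str_sketch(num: int) -> str:
    """Hypothetical stand-in that reproduces the documented examples."""
    # Check the largest unit first so 1_000_000_000 becomes "1B", not "1000M".
    for threshold, suffix in ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")):
        if num >= threshold:
            scaled = f"{num / threshold:.1f}"
            # Drop a trailing ".0" so round values render as "1B" rather than "1.0B".
            return scaled.removesuffix(".0") + suffix
    # Values below 1000 are returned unchanged.
    return str(num)


print(convert_int_to_annotated_str_sketch(1_234_567))  # 1.2M
```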

View source on GitHub: src/llm_annotator/utils.py lines 40–71

yield_jsonl_robust ¶

yield_jsonl_robust(
    pfiles: list[Path | str],
    keep_columns: list[str] | None = None,
    disable_tqdm: bool = False,
    deduplicate_on: str | None = None,
) -> Generator[dict, None, None]

Read a set of .jsonl files robustly, skipping corrupt lines, and yield one sample at a time.

Parameters:

Name	Type	Description	Default
`pfiles`	`list[Path \| str]`	List of `.jsonl` file paths to read.	required
`keep_columns`	`list[str] \| None`	Columns to retain in each yielded sample. `None` keeps all columns.	`None`
`disable_tqdm`	`bool`	Whether to suppress the file-level progress bar.	`False`
`deduplicate_on`	`str \| None`	Column name whose value is hashed for deduplication. When provided, only the first occurrence of each unique value is yielded.	`None`

Yields:

Type	Description
`dict`	One parsed JSON record (`dict`) per non-corrupt line across all files.

View source on GitHub: src/llm_annotator/utils.py lines 74–139

count_lines ¶

count_lines(fname: str | PathLike) -> int

Count the number of lines in a file.

Parameters:

Name	Type	Description	Default
`fname`	`str \| PathLike`	Path to the file to count lines in.	required

Returns:

Type	Description
`int`	The total number of lines in the file.

View source on GitHub: src/llm_annotator/utils.py lines 142–151

remove_empty_jsonl_files ¶

remove_empty_jsonl_files(pdout: Path) -> list[Path]

Remove any empty .jsonl files in the given directory.

Parameters:

Name	Type	Description	Default
`pdout`	`Path`	Output directory path to clean up.	required

Returns:

Type	Description
`list[Path]`	A list of removed files.
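
A minimal sketch of this kind of cleanup, assuming that "empty" means a size of zero bytes and that only files directly inside the directory are considered (no recursion). `remove_empty_jsonl_files_sketch` is a hypothetical stand-in, not the library's code.

```python
import tempfile
from pathlib import Path


def remove_empty_jsonl_files_sketch(pdout: Path) -> list[Path]:
    """Hypothetical stand-in for remove_empty_jsonl_files."""
    removed = []
    for pfile in sorted(pdout.glob("*.jsonl")):
        # Assumption: a file is "empty" when it contains zero bytes.
        if pfile.is_file() and pfile.stat().st_size == 0:
            pfile.unlink()
            removed.append(pfile)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    pdout = Path(tmp)
    (pdout / "empty.jsonl").touch()
    (pdout / "full.jsonl").write_text('{"id": 1}\n', encoding="utf-8")
    removed_names = [p.name for p in remove_empty_jsonl_files_sketch(pdout)]
    remaining_names = sorted(p.name for p in pdout.iterdir())

print(removed_names)    # ['empty.jsonl']
print(remaining_names)  # ['full.jsonl']
```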

View source on GitHub: src/llm_annotator/utils.py lines 154–169

ensure_returns_bool ¶

ensure_returns_bool(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> bool

Call func with the given arguments and ensure that it returns a boolean value.

Parameters:

Name	Type	Description	Default
`func`	`Callable[..., Any]`	Callable to invoke.	required
`*args`	`Any`	Positional arguments forwarded to `func`.	`()`
`**kwargs`	`Any`	Keyword arguments forwarded to `func`.	`{}`

Returns:

Type	Description
`bool`	The boolean result returned by `func`.

Raises:

Type	Description
`TypeError`	If `func` does not return a boolean.
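
The call-then-check pattern described above can be sketched as follows. `ensure_returns_bool_sketch` and `is_long` are hypothetical names, and the strict `isinstance` check (which rejects truthy non-boolean values such as `1`) is an assumption about how the check is performed.

```python
from typing import Any, Callable


def ensure_returns_bool_sketch(func: Callable[..., Any], *args: Any, **kwargs: Any) -> bool:
    """Hypothetical stand-in for ensure_returns_bool."""
    result = func(*args, **kwargs)
    # Assumption: only real bool instances pass; 1 or "yes" would be rejected.
    if not isinstance(result, bool):
        raise TypeError(f"Expected a bool, got {type(result).__name__}")
    return result


def is_long(text: str, min_chars: int = 5) -> bool:
    """Hypothetical user-defined filter."""
    return len(text) >= min_chars


ok = ensure_returns_bool_sketch(is_long, "hello world", min_chars=5)

try:
    ensure_returns_bool_sketch(len, "hello")  # len returns an int, not a bool
    raised = False
except TypeError:
    raised = True

print(ok, raised)  # True True
```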

View source on GitHub: src/llm_annotator/utils.py lines 172–193

ensure_returns_dict ¶

ensure_returns_dict(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> dict[str, Any]

Call func with the given arguments and ensure that it returns a dictionary.

Parameters:

Name	Type	Description	Default
`func`	`Callable[..., Any]`	Callable to invoke.	required
`*args`	`Any`	Positional arguments forwarded to `func`.	`()`
`**kwargs`	`Any`	Keyword arguments forwarded to `func`.	`{}`

Returns:

Type	Description
`dict[str, Any]`	The dictionary result returned by `func`.

Raises:

Type	Description
`TypeError`	If `func` does not return a dictionary.
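
A comparable sketch for the dictionary variant, under the same caveats: `ensure_returns_dict_sketch` and `annotate` are hypothetical names, and this sketch checks only that the result is a `dict`, not that its keys are strings.

```python
from typing import Any, Callable


def ensure_returns_dict_sketch(func: Callable[..., Any], *args: Any, **kwargs: Any) -> dict[str, Any]:
    """Hypothetical stand-in for ensure_returns_dict."""
    result = func(*args, **kwargs)
    # Assumption: the container type is checked, the key types are not.
    if not isinstance(result, dict):
        raise TypeError(f"Expected a dict, got {type(result).__name__}")
    return result


def annotate(text: str) -> dict[str, Any]:
    """Hypothetical user-defined post-processing step."""
    return {"text": text, "n_chars": len(text)}


record = ensure_returns_dict_sketch(annotate, "hello")

try:
    ensure_returns_dict_sketch(str.upper, "hello")  # returns a str
    dict_check_raised = False
except TypeError:
    dict_check_raised = True

print(record)  # {'text': 'hello', 'n_chars': 5}
```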

View source on GitHub: src/llm_annotator/utils.py lines 196–217

get_lib_versions ¶

get_lib_versions() -> dict[str, str]

Get the versions of key dependencies.
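
This page does not list which dependencies are reported. As a rough illustration of the general technique, installed versions can be looked up with the standard library's `importlib.metadata`. In the sketch below the `packages` parameter, the example package names, and the `"not installed"` fallback are all hypothetical; the documented function takes no arguments.

```python
from importlib.metadata import PackageNotFoundError, version


def get_lib_versions_sketch(packages: tuple[str, ...]) -> dict[str, str]:
    """Hypothetical stand-in; the real function takes no arguments."""
    versions: dict[str, str] = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            # Assumption: missing packages are reported rather than raising.
            versions[name] = "not installed"
    return versions


# Both package names are placeholders chosen for the demonstration.
lib_versions = get_lib_versions_sketch(("pip", "definitely-not-a-real-package-xyz"))
print(lib_versions["definitely-not-a-real-package-xyz"])  # not installed
```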

View source on GitHub: src/llm_annotator/utils.py lines 220–244

get_hf_username ¶

get_hf_username() -> str | None

Get the Hugging Face username of the current user, if logged in. Otherwise, return None.

Returns:

Type	Description
`str \| None`	The Hugging Face username, or None if not logged in.

View source on GitHub: src/llm_annotator/utils.py lines 247–260

extract_prompt_prefix ¶

extract_prompt_prefix(prompt: str) -> str

Extract the prefix of a prompt up to the first {placeholder}, or the entire prompt if none exists.

Returns an empty string when the prompt starts with a {placeholder}. This is expected when using generate_dataset with fully variable prompts.

Parameters:

Name	Type	Description	Default
`prompt`	`str`	The full prompt string, optionally containing `{field}` placeholders.	required

Returns:

Type	Description
`str`	The substring before the first `{placeholder}`, or the entire prompt when no placeholder is present.

Examples:

extract_prompt_prefix("Classify: {text}")
# 'Classify: '
extract_prompt_prefix("{text} is the input")
# ''
extract_prompt_prefix("No placeholders here")
# 'No placeholders here'
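
The documented behaviour can be reproduced with a single regular-expression search. The sketch below is a hypothetical re-implementation that matches the three examples above; the placeholder pattern (braces around a word-like field name) is an assumption, so inputs such as literal `{}` may be handled differently by the library.

```python
import re

# Assumption: a placeholder is a pair of braces around a word-like field name.
_PLACEHOLDER = re.compile(r"\{\w+\}")


def extract_prompt_prefix_sketch(prompt: str) -> str:
    """Hypothetical stand-in for extract_prompt_prefix."""
    match = _PLACEHOLDER.search(prompt)
    # No placeholder: the whole prompt is the prefix.
    return prompt if match is None else prompt[: match.start()]


print(repr(extract_prompt_prefix_sketch("Classify: {text}")))  # 'Classify: '
```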

View source on GitHub: src/llm_annotator/utils.py lines 266–287

add_schema_additional_properties_false ¶

add_schema_additional_properties_false(schema: Any) -> Any

Recursively set additionalProperties: false on all object schemas.

Claude requires this on every object type in the schema; without it the API returns a 400 error.

Parameters:

Name	Type	Description	Default
`schema`	`Any`	A JSON-schema dict (or any nested value).	required

Returns:

Type	Description
`Any`	A new schema dict with `additionalProperties` set to `False` on every sub-schema whose `type` is `"object"`.
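
The recursive rewrite described above can be sketched as a small tree walk. `add_additional_properties_false_sketch` is a hypothetical stand-in, and the example schema is invented for illustration; the sketch assumes that lists are traversed as well and that an existing `additionalProperties` value is overwritten.

```python
from typing import Any


def add_additional_properties_false_sketch(schema: Any) -> Any:
    """Hypothetical stand-in for add_schema_additional_properties_false."""
    if isinstance(schema, dict):
        # Rebuild the dict so the caller's schema is left untouched.
        updated = {
            key: add_additional_properties_false_sketch(value)
            for key, value in schema.items()
        }
        if updated.get("type") == "object":
            updated["additionalProperties"] = False
        return updated
    if isinstance(schema, list):
        return [add_additional_properties_false_sketch(item) for item in schema]
    # Scalars (strings, numbers, booleans, None) are returned as-is.
    return schema


schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "spans": {
            "type": "array",
            "items": {"type": "object", "properties": {"start": {"type": "integer"}}},
        },
    },
}
strict = add_additional_properties_false_sketch(schema)

print(strict["additionalProperties"])  # False
print(strict["properties"]["spans"]["items"]["additionalProperties"])  # False
```

Only the two object schemas gain the key; the string and array sub-schemas are copied unchanged, and the input `schema` is not mutated.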

View source on GitHub: src/llm_annotator/utils.py lines 290–310

is_in_range ¶

is_in_range(
    value: int | float,
    min_value: int | float | None,
    max_value: int | float | None,
) -> bool

Check if a numeric value falls within an optional range (inclusive). Utility function that models can use for validation.

Parameters:

Name	Type	Description	Default
`value`	`int \| float`	The numeric value to check.	required
`min_value`	`int \| float \| None`	The minimum allowed value (inclusive), or None for no minimum.	required
`max_value`	`int \| float \| None`	The maximum allowed value (inclusive), or None for no maximum.	required

Returns:

Type	Description
`bool`	True if the value is within the range, False otherwise.

View source on GitHub: src/llm_annotator/utils.py lines 313–333

is_length ¶

is_length(
    text: str,
    min_length: int | None,
    max_length: int | None,
) -> bool

Check if the length of a text string falls within an optional range (inclusive). Utility function that models can use for validation.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text string to check.	required
`min_length`	`int \| None`	The minimum allowed length (inclusive), or None for no minimum.	required
`max_length`	`int \| None`	The maximum allowed length (inclusive), or None for no maximum.	required

Returns:

Type	Description
`bool`	True if the text length is within the range, False otherwise.

View source on GitHub: src/llm_annotator/utils.py lines 336–350