Utils
llm_annotator.utils
get_hash
Compute a SHA256 hash for a given text string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input string to hash. | required |
Returns:
| Type | Description |
|---|---|
| `str` | A 64-character hexadecimal SHA256 digest. |
Examples:
len(get_hash("hello"))
# 64
get_hash("hello") == get_hash("hello")
# True
get_hash("hello") == get_hash("world")
# False
View source on GitHub: src/llm_annotator/utils.py lines 20–37
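A minimal sketch of how such a digest could be computed with the standard library. The name `sketch_get_hash` and the UTF-8 encoding step are assumptions made for illustration; the linked source may differ.

```python
import hashlib


def sketch_get_hash(text: str) -> str:
    """Hypothetical stand-in for get_hash (assumes UTF-8 encoding)."""
    # hexdigest() of SHA-256 is always 64 hexadecimal characters
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


digest = sketch_get_hash("hello")
print(len(digest))  # 64
print(digest == sketch_get_hash("hello"))  # True
```

Because the digest is deterministic, it can serve as a stable key for caching or deduplication.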
convert_int_to_annotated_str
Convert an integer to a concise string approximating its magnitude.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `num` | `int` | Non-negative integer to format. | required |
Returns:
| Type | Description |
|---|---|
| `str` | A compact string representation, such as `'1.2M'`. |
Examples:
convert_int_to_annotated_str(1_000_000_000)
# '1B'
convert_int_to_annotated_str(1_234_567)
# '1.2M'
convert_int_to_annotated_str(1_234)
# '1.2K'
convert_int_to_annotated_str(42)
# '42'
View source on GitHub: src/llm_annotator/utils.py lines 40–71
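One way the formatting above might work is sketched here. The function name `sketch_annotate_int`, the threshold table, and the rounding to one decimal place are assumptions chosen to reproduce the documented examples; values close to a threshold boundary (for example 999,950) are not specially handled in this sketch.

```python
def sketch_annotate_int(num: int) -> str:
    """Hypothetical stand-in for convert_int_to_annotated_str."""
    for threshold, suffix in ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")):
        if num >= threshold:
            scaled = f"{num / threshold:.1f}"
            # Drop a trailing ".0" so that 1_000_000_000 becomes "1B", not "1.0B"
            if scaled.endswith(".0"):
                scaled = scaled[:-2]
            return f"{scaled}{suffix}"
    # Below 1,000 the number is returned unchanged
    return str(num)


for n in (1_000_000_000, 1_234_567, 1_234, 42):
    print(n, "->", sketch_annotate_int(n))
```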
yield_jsonl_robust
yield_jsonl_robust(
    pfiles: list[Path | str],
    keep_columns: list[str] | None = None,
    disable_tqdm: bool = False,
    deduplicate_on: str | None = None,
) -> Generator[dict, None, None]
Read a set of .jsonl files robustly, skipping corrupt lines, and yield one sample at a time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pfiles` | `list[Path \| str]` | List of .jsonl file paths to read. | required |
| `keep_columns` | `list[str] \| None` | Columns to retain in each yielded sample. | `None` |
| `disable_tqdm` | `bool` | Whether to suppress the file-level progress bar. | `False` |
| `deduplicate_on` | `str \| None` | Column name whose value is hashed for deduplication. When provided, only the first occurrence of each unique value is yielded. | `None` |
Yields:
| Type | Description |
|---|---|
| `dict` | One parsed JSON record. |
View source on GitHub: src/llm_annotator/utils.py lines 74–139
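The behavior described above (skip corrupt lines, optionally filter columns, optionally deduplicate) can be sketched with the standard library alone. Everything here is an assumption for illustration: the name `sketch_yield_jsonl`, the use of a serialized value as the deduplication key, and the omission of the progress bar. It is not the library's implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, Optional, Sequence, Union


def sketch_yield_jsonl(
    pfiles: Sequence[Union[Path, str]],
    keep_columns: Optional[Sequence[str]] = None,
    deduplicate_on: Optional[str] = None,
) -> Iterator[dict]:
    """Hypothetical stand-in for yield_jsonl_robust (no progress bar)."""
    seen = set()
    for pfile in pfiles:
        with Path(pfile).open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    sample = json.loads(line)
                except json.JSONDecodeError:
                    continue  # corrupt line: skip it instead of failing
                if not isinstance(sample, dict):
                    continue
                if deduplicate_on is not None:
                    # Assumption: a serialized form of the value is the dedup key
                    key = json.dumps(sample.get(deduplicate_on), sort_keys=True)
                    if key in seen:
                        continue
                    seen.add(key)
                if keep_columns is not None:
                    sample = {k: v for k, v in sample.items() if k in keep_columns}
                yield sample


with tempfile.TemporaryDirectory() as tmp:
    pfile = Path(tmp) / "demo.jsonl"
    pfile.write_text(
        '{"id": 1, "text": "a"}\n'
        '{"id": 2, "text": "b"\n'  # corrupt: missing closing brace
        '{"id": 3, "text": "a"}\n'  # duplicate "text" value
        '{"id": 4, "text": "c"}\n',
        encoding="utf-8",
    )
    records = list(
        sketch_yield_jsonl([pfile], keep_columns=["id"], deduplicate_on="text")
    )

print(records)  # [{'id': 1}, {'id': 4}]
```

Deduplication runs before column filtering in this sketch, so the `deduplicate_on` column does not have to be listed in `keep_columns`.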
count_lines
Count the number of lines in a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fname` | `str \| PathLike` | Path to the file to count lines in. | required |
Returns: The total number of lines in the file.
View source on GitHub: src/llm_annotator/utils.py lines 142–151
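A line count can be obtained without loading the whole file into memory. The sketch below is one possible approach, not necessarily the one used in the linked source; the name `sketch_count_lines` is hypothetical.

```python
import os
import tempfile


def sketch_count_lines(fname) -> int:
    """Hypothetical stand-in for count_lines: stream the file and count."""
    # Binary mode avoids decoding errors on malformed text
    with open(fname, "rb") as fh:
        return sum(1 for _ in fh)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("a\nb\nc\n")
    n_lines = sketch_count_lines(path)

print(n_lines)  # 3
```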
remove_empty_jsonl_files
Remove any empty .jsonl files in the given directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pdout` | `Path` | Output directory path to clean up. | required |
Returns:
| Type | Description |
|---|---|
| `list[Path]` | A list of removed files. |
View source on GitHub: src/llm_annotator/utils.py lines 154–169
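A rough sketch of such a cleanup step follows. It assumes that "empty" means zero bytes and that only the top level of the directory is scanned; both are guesses, and `sketch_remove_empty_jsonl` is a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import List


def sketch_remove_empty_jsonl(pdout: Path) -> List[Path]:
    """Hypothetical stand-in for remove_empty_jsonl_files."""
    removed = []
    for pfile in sorted(pdout.glob("*.jsonl")):
        # Assumption: a file is "empty" when its size is zero bytes
        if pfile.is_file() and pfile.stat().st_size == 0:
            pfile.unlink()
            removed.append(pfile)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "empty.jsonl").touch()
    (out / "full.jsonl").write_text('{"id": 1}\n', encoding="utf-8")
    removed_names = [p.name for p in sketch_remove_empty_jsonl(out)]
    remaining = sorted(p.name for p in out.iterdir())

print(removed_names)  # ['empty.jsonl']
print(remaining)  # ['full.jsonl']
```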
ensure_returns_bool
Ensure that a callable returns a boolean value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `func` | `Callable[..., Any]` | Callable to invoke. | required |
| `*args` | `Any` | Positional arguments forwarded to `func`. | `()` |
| `**kwargs` | `Any` | Keyword arguments forwarded to `func`. | `{}` |
Returns:
| Type | Description |
|---|---|
| `bool` | The boolean result returned by `func`. |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If `func` does not return a `bool`. |
View source on GitHub: src/llm_annotator/utils.py lines 172–193
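The call-then-check pattern described above might look like the following. The name `sketch_ensure_returns_bool` and the error message are assumptions; the strict `isinstance(result, bool)` check is one plausible reading of "returns a boolean value" and rejects truthy non-booleans such as `1`.

```python
from typing import Any, Callable


def sketch_ensure_returns_bool(func: Callable[..., Any], *args: Any, **kwargs: Any) -> bool:
    """Hypothetical stand-in for ensure_returns_bool."""
    result = func(*args, **kwargs)
    if not isinstance(result, bool):
        raise TypeError(
            f"Expected a bool from {getattr(func, '__name__', func)!r}, "
            f"got {type(result).__name__}"
        )
    return result


print(sketch_ensure_returns_bool(str.startswith, "hello", "he"))  # True

try:
    sketch_ensure_returns_bool(len, "hello")  # len returns an int
except TypeError as err:
    print("rejected:", err)
```

Failing early with a `TypeError` keeps a user-supplied filter that accidentally returns a count or a string from being silently treated as truthy.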
ensure_returns_dict
Ensure that a callable returns a dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `func` | `Callable[..., Any]` | Callable to invoke. | required |
| `*args` | `Any` | Positional arguments forwarded to `func`. | `()` |
| `**kwargs` | `Any` | Keyword arguments forwarded to `func`. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | The dictionary result returned by `func`. |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If `func` does not return a `dict`. |
View source on GitHub: src/llm_annotator/utils.py lines 196–217
get_lib_versions
Get the versions of key dependencies.
View source on GitHub: src/llm_annotator/utils.py lines 220–244
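The page does not list which dependencies are reported or what the return type is. As a loose illustration only, installed package versions can be looked up with `importlib.metadata`; the function name, the package names passed in, and the dict return shape below are all assumptions.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def sketch_get_lib_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Hypothetical helper: map each package name to its version, or None."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None  # package is not installed
    return versions


# The package names are placeholders, not the library's actual dependency list
print(sketch_get_lib_versions(["pip", "surely-not-an-installed-package-xyz"]))
```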
get_hf_username
Get the Hugging Face username of the current user, if logged in. Otherwise, return None.
Returns:
| Type | Description |
|---|---|
| `str \| None` | The Hugging Face username, or None if not logged in. |
View source on GitHub: src/llm_annotator/utils.py lines 247–260
extract_prompt_prefix
Extract the prefix of a prompt up to the first {placeholder}, or the entire prompt if none exists.
Can return an empty string when the prompt starts with a {placeholder}.
This is expected when using generate_dataset with fully variable prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str` | The full prompt string, optionally containing `{placeholder}` fields. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The substring before the first `{placeholder}`, or the entire prompt if no placeholder is present. |
Examples:
extract_prompt_prefix("Classify: {text}")
# 'Classify: '
extract_prompt_prefix("{text} is the input")
# ''
extract_prompt_prefix("No placeholders here")
# 'No placeholders here'
View source on GitHub: src/llm_annotator/utils.py lines 266–287
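The prefix rule can be sketched with a single regular expression. The pattern used here (a brace pair with non-brace content) is an assumption about what counts as a placeholder, and `sketch_extract_prompt_prefix` is a hypothetical name; the linked source may detect placeholders differently.

```python
import re

# Assumption: a placeholder is "{...}" with at least one non-brace character inside
_PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def sketch_extract_prompt_prefix(prompt: str) -> str:
    """Hypothetical stand-in for extract_prompt_prefix."""
    match = _PLACEHOLDER.search(prompt)
    # No placeholder: the whole prompt is the prefix
    return prompt if match is None else prompt[: match.start()]


for prompt in ("Classify: {text}", "{text} is the input", "No placeholders here"):
    print(repr(sketch_extract_prompt_prefix(prompt)))
```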
add_schema_additional_properties_false
Recursively set additionalProperties: false on all object schemas.
Claude requires this on every object type in the schema; without it the API returns a 400 error.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `schema` | `Any` | A JSON-schema dict (or any nested value). | required |
Returns:
| Type | Description |
|---|---|
| `Any` | A new schema dict with `additionalProperties: false` set on every sub-schema whose `type` is `object`. |
View source on GitHub: src/llm_annotator/utils.py lines 290–310
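A recursive walk of this kind could be written as follows. This is a sketch under assumptions: the function name is hypothetical, an "object schema" is taken to be any dict whose `type` equals `"object"`, and the input is left unmodified by building new containers.

```python
from typing import Any


def sketch_add_additional_properties_false(schema: Any) -> Any:
    """Hypothetical stand-in for add_schema_additional_properties_false."""
    if isinstance(schema, dict):
        # Rebuild the dict so the caller's schema is not mutated
        result = {key: sketch_add_additional_properties_false(value) for key, value in schema.items()}
        if result.get("type") == "object":
            result["additionalProperties"] = False
        return result
    if isinstance(schema, list):
        return [sketch_add_additional_properties_false(item) for item in schema]
    return schema  # scalars pass through unchanged


schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "meta": {"type": "object", "properties": {"score": {"type": "number"}}},
    },
}
patched = sketch_add_additional_properties_false(schema)
print(patched["additionalProperties"])  # False
print(patched["properties"]["meta"]["additionalProperties"])  # False
print("additionalProperties" in schema)  # False
```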
is_in_range
is_in_range(
    value: int | float,
    min_value: int | float | None,
    max_value: int | float | None,
) -> bool
Check if a numeric value falls within an optional range (inclusive). Utility function that models can use for validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `value` | `int \| float` | The numeric value to check. | required |
| `min_value` | `int \| float \| None` | The minimum allowed value (inclusive), or None for no minimum. | required |
| `max_value` | `int \| float \| None` | The maximum allowed value (inclusive), or None for no maximum. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if the value is within the range, False otherwise. |
View source on GitHub: src/llm_annotator/utils.py lines 313–333
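The inclusive, optionally open-ended check described above can be sketched in a few lines. The name `sketch_is_in_range` is hypothetical and the body is an illustration of the documented behavior, not a copy of the linked source.

```python
from typing import Optional, Union

Number = Union[int, float]


def sketch_is_in_range(value: Number, min_value: Optional[Number], max_value: Optional[Number]) -> bool:
    """Hypothetical stand-in for is_in_range: both bounds are inclusive."""
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True


print(sketch_is_in_range(5, 1, 10))  # True
print(sketch_is_in_range(1, 1, 10))  # True (lower bound is inclusive)
print(sketch_is_in_range(11, None, 10))  # False
print(sketch_is_in_range(11, 1, None))  # True (no maximum)
```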
is_length
Check if the length of a text string falls within an optional range. Utility function that models can use for validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text string to check. | required |
| `min_length` | `int \| None` | The minimum allowed length (inclusive), or None for no minimum. | required |
| `max_length` | `int \| None` | The maximum allowed length (inclusive), or None for no maximum. | required |
Returns: True if the text length is within the range, False otherwise.
View source on GitHub: src/llm_annotator/utils.py lines 336–350