Utils

llm_annotator.utils

get_hash

get_hash(text: str) -> str

Compute a SHA256 hash for a given text string.

Parameters:

Name Type Description Default
text str

The input string to hash.

required

Returns:

Type Description
str

A 64-character hexadecimal SHA256 digest.

Examples:

len(get_hash("hello"))
# 64
get_hash("hello") == get_hash("hello")
# True
get_hash("hello") == get_hash("world")
# False
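The properties above (fixed 64-character length, deterministic output) follow directly from SHA256. A minimal sketch of how such a helper could be written with the standard library is shown below. The name `get_hash_sketch` is hypothetical, and the UTF-8 encoding is an assumption rather than something this page confirms about `get_hash`.

```python
import hashlib


def get_hash_sketch(text: str) -> str:
    """Hypothetical stand-in for get_hash, for illustration only."""
    # Assumption: the text is encoded as UTF-8 before hashing.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


digest = get_hash_sketch("hello")
print(len(digest))  # 64
print(digest == get_hash_sketch("hello"))  # True
print(digest == get_hash_sketch("world"))  # False
```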

convert_int_to_annotated_str

convert_int_to_annotated_str(num: int) -> str

Convert an integer to a concise string approximating its magnitude.

Parameters:

Name Type Description Default
num int

Non-negative integer to format.

required

Returns:

Type Description
str

A compact string representation such as "1B", "1.2M", or "1.2K".

Examples:

convert_int_to_annotated_str(1_000_000_000)
# '1B'
convert_int_to_annotated_str(1_234_567)
# '1.2M'
convert_int_to_annotated_str(1_234)
# '1.2K'
convert_int_to_annotated_str(42)
# '42'
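One way to reproduce the outputs listed above is to scale by the largest matching threshold and keep one decimal place. The sketch below is only an illustration: `annotate_int_sketch` is a hypothetical name, and the rounding rule (one decimal, trailing `.0` dropped) is inferred from the examples, not taken from the library's source.

```python
def annotate_int_sketch(num: int) -> str:
    """Hypothetical stand-in for convert_int_to_annotated_str."""
    # Thresholds are checked from largest to smallest.
    for threshold, suffix in ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")):
        if num >= threshold:
            # Assumption: one decimal place, with a trailing ".0" dropped.
            text = f"{num / threshold:.1f}"
            if text.endswith(".0"):
                text = text[:-2]
            return f"{text}{suffix}"
    # Values below 1,000 are returned unchanged.
    return str(num)


print(annotate_int_sketch(1_000_000_000))  # 1B
print(annotate_int_sketch(1_234_567))  # 1.2M
print(annotate_int_sketch(1_234))  # 1.2K
print(annotate_int_sketch(42))  # 42
```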

yield_jsonl_robust

yield_jsonl_robust(
    pfiles: list[Path | str],
    keep_columns: list[str] | None = None,
    disable_tqdm: bool = False,
    deduplicate_on: str | None = None,
) -> Generator[dict, None, None]

Read a set of .jsonl files robustly, skipping corrupt lines, and yield one sample at a time.

Parameters:

Name Type Description Default
pfiles list[Path | str]

List of .jsonl file paths to read.

required
keep_columns list[str] | None

Columns to retain in each yielded sample. None keeps all columns.

None
disable_tqdm bool

Whether to suppress the file-level progress bar.

False
deduplicate_on str | None

Column name whose value is hashed for deduplication. When provided, only the first occurrence of each unique value is yielded.

None

Yields:

Type Description
dict

One parsed JSON record (dict) per non-corrupt line across all files.
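The behavior described above (skip corrupt lines, optionally filter columns, optionally deduplicate on a hashed column value) can be sketched with the standard library alone. `iter_jsonl_sketch` is a hypothetical name and the progress bar is omitted. Several details are assumptions: deduplication runs before column filtering, the value is stringified and hashed with SHA256, and lines that parse to something other than a dict are skipped.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Generator, Optional


def iter_jsonl_sketch(
    pfiles: list,
    keep_columns: Optional[list] = None,
    deduplicate_on: Optional[str] = None,
) -> Generator[dict, None, None]:
    """Hypothetical stand-in for yield_jsonl_robust (no progress bar)."""
    seen_hashes = set()
    for pfile in pfiles:
        with Path(pfile).open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # Corrupt or blank line: skip it.
                if not isinstance(record, dict):
                    continue  # Assumption: only JSON objects are samples.
                if deduplicate_on is not None:
                    # Assumption: the value is stringified, then hashed.
                    value = str(record.get(deduplicate_on))
                    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
                    if digest in seen_hashes:
                        continue  # Only the first occurrence is yielded.
                    seen_hashes.add(digest)
                if keep_columns is not None:
                    record = {k: v for k, v in record.items() if k in keep_columns}
                yield record


# Usage: one corrupt line and one duplicate "text" value are dropped.
with tempfile.TemporaryDirectory() as tmp:
    pfile = Path(tmp) / "sample.jsonl"
    pfile.write_text(
        '{"id": 1, "text": "a"}\n'
        '{"id": 2, "text": \n'
        '{"id": 3, "text": "a"}\n'
        '{"id": 4, "text": "b"}\n',
        encoding="utf-8",
    )
    rows = list(iter_jsonl_sketch([pfile], keep_columns=["id"], deduplicate_on="text"))

print(rows)  # [{'id': 1}, {'id': 4}]
```

Storing hashes instead of raw values keeps the memory cost per seen value constant, which matters when the deduplication column holds long texts.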

count_lines

count_lines(fname: str | PathLike) -> int

Count the number of lines in a file.

Parameters:

Name Type Description Default
fname str | PathLike

Path to the file to count lines in.

required

Returns:

Type Description
int

The total number of lines in the file.
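A simple way to count lines without loading the whole file into memory is to iterate over it in binary mode. This is a sketch under assumptions: `count_lines_sketch` is a hypothetical name, and a final line without a trailing newline is counted as a line here, which may differ from `count_lines`.

```python
import tempfile
from os import PathLike
from pathlib import Path
from typing import Union


def count_lines_sketch(fname: Union[str, PathLike]) -> int:
    """Hypothetical stand-in for count_lines."""
    # Binary mode avoids decoding errors; iteration splits on b"\n".
    with open(fname, "rb") as fh:
        return sum(1 for _ in fh)


with tempfile.TemporaryDirectory() as tmp:
    pfile = Path(tmp) / "sample.txt"
    pfile.write_text("a\nb\nc\n", encoding="utf-8")
    n_lines = count_lines_sketch(pfile)

print(n_lines)  # 3
```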

remove_empty_jsonl_files

remove_empty_jsonl_files(pdout: Path) -> list[Path]

Remove any empty .jsonl files in the given directory.

Parameters:

Name Type Description Default
pdout Path

Output directory path to clean up.

required

Returns:

Type Description
list[Path]

A list of removed files.
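The cleanup step can be sketched with `pathlib`. `remove_empty_jsonl_sketch` is a hypothetical name, and two details are assumptions: "empty" is taken to mean zero bytes, and only files directly inside the directory are considered (no recursion).

```python
import tempfile
from pathlib import Path


def remove_empty_jsonl_sketch(pdout: Path) -> list:
    """Hypothetical stand-in for remove_empty_jsonl_files."""
    removed = []
    for pfile in sorted(pdout.glob("*.jsonl")):
        # Assumption: a file is "empty" when its size is zero bytes.
        if pfile.is_file() and pfile.stat().st_size == 0:
            pfile.unlink()
            removed.append(pfile)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    pdout = Path(tmp)
    (pdout / "empty.jsonl").touch()
    (pdout / "full.jsonl").write_text('{"id": 1}\n', encoding="utf-8")
    removed_names = [p.name for p in remove_empty_jsonl_sketch(pdout)]
    remaining_names = sorted(p.name for p in pdout.iterdir())

print(removed_names)  # ['empty.jsonl']
print(remaining_names)  # ['full.jsonl']
```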

ensure_returns_bool

ensure_returns_bool(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> bool

Ensure that a callable returns a boolean value.

Parameters:

Name Type Description Default
func Callable[..., Any]

Callable to invoke.

required
*args Any

Positional arguments forwarded to func.

()
**kwargs Any

Keyword arguments forwarded to func.

{}

Returns:

Type Description
bool

The boolean result returned by func.

Raises:

Type Description
TypeError

If func does not return a boolean.
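The call-then-check pattern described above might look like the following. This is a minimal sketch, not the library's implementation: `ensure_returns_bool_sketch` is a hypothetical name, and the use of `isinstance` with an exact `bool` check (so that `1` and `0` are rejected) is an assumption.

```python
from typing import Any, Callable


def ensure_returns_bool_sketch(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> bool:
    """Hypothetical stand-in for ensure_returns_bool."""
    result = func(*args, **kwargs)
    # Assumption: truthy non-bool values such as 1 or "yes" are rejected.
    if not isinstance(result, bool):
        raise TypeError(
            f"Expected bool from {getattr(func, '__name__', func)!r}, "
            f"got {type(result).__name__}"
        )
    return result


print(ensure_returns_bool_sketch(lambda x: x > 0, 3))  # True

try:
    ensure_returns_bool_sketch(lambda: 1)
except TypeError:
    print("TypeError raised")
```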

ensure_returns_dict

ensure_returns_dict(
    func: Callable[..., Any], *args: Any, **kwargs: Any
) -> dict[str, Any]

Ensure that a callable returns a dictionary.

Parameters:

Name Type Description Default
func Callable[..., Any]

Callable to invoke.

required
*args Any

Positional arguments forwarded to func.

()
**kwargs Any

Keyword arguments forwarded to func.

{}

Returns:

Type Description
dict[str, Any]

The dictionary result returned by func.

Raises:

Type Description
TypeError

If func does not return a dictionary.

get_lib_versions

get_lib_versions() -> dict[str, str]

get_hf_username

get_hf_username() -> str | None

Get the Hugging Face username of the current user, if logged in. Otherwise, return None.

Returns:

Type Description
str | None

The Hugging Face username, or None if not logged in.

extract_prompt_prefix

extract_prompt_prefix(prompt: str) -> str

Extract the prefix of a prompt up to the first {placeholder}, or the entire prompt if none exists.

Can return an empty string when the prompt starts with a {placeholder}. This is expected when using generate_dataset with fully variable prompts.

Parameters:

Name Type Description Default
prompt str

The full prompt string, optionally containing {field} placeholders.

required

Returns:

Type Description
str

The substring before the first {placeholder}, or the entire prompt when no placeholder is present.

Examples:

extract_prompt_prefix("Classify: {text}")
# 'Classify: '
extract_prompt_prefix("{text} is the input")
# ''
extract_prompt_prefix("No placeholders here")
# 'No placeholders here'
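The examples above are consistent with cutting the prompt at the first opening brace. The sketch below illustrates that idea only. `extract_prompt_prefix_sketch` is a hypothetical name, and treating any `{` as the start of a placeholder (with no special handling of escaped `{{`) is an assumption.

```python
def extract_prompt_prefix_sketch(prompt: str) -> str:
    """Hypothetical stand-in for extract_prompt_prefix."""
    # Assumption: the first "{" marks the start of the first placeholder.
    idx = prompt.find("{")
    return prompt if idx == -1 else prompt[:idx]


print(repr(extract_prompt_prefix_sketch("Classify: {text}")))  # 'Classify: '
print(repr(extract_prompt_prefix_sketch("{text} is the input")))  # ''
print(repr(extract_prompt_prefix_sketch("No placeholders here")))  # 'No placeholders here'
```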

add_schema_additional_properties_false

add_schema_additional_properties_false(schema: Any) -> Any

Recursively set additionalProperties: false on all object schemas.

Claude requires this on every object type in the schema; without it the API returns a 400 error.

Parameters:

Name Type Description Default
schema Any

A JSON-schema dict (or any nested value).

required

Returns:

Type Description
Any

A new schema dict with additionalProperties set to False on every sub-schema whose type is "object".
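A recursive walk over dicts and lists is one way to apply this to every nested object schema. The sketch below is an illustration: `add_additional_properties_false_sketch` is a hypothetical name, and overwriting an existing `additionalProperties` value is an assumption. It builds new containers, so the input schema is left unchanged.

```python
from typing import Any


def add_additional_properties_false_sketch(schema: Any) -> Any:
    """Hypothetical stand-in for add_schema_additional_properties_false."""
    if isinstance(schema, dict):
        # Recurse into every value first, building a new dict.
        patched = {
            key: add_additional_properties_false_sketch(value)
            for key, value in schema.items()
        }
        # Assumption: any existing additionalProperties value is overwritten.
        if patched.get("type") == "object":
            patched["additionalProperties"] = False
        return patched
    if isinstance(schema, list):
        return [add_additional_properties_false_sketch(item) for item in schema]
    return schema  # Scalars are returned as-is.


schema = {
    "type": "object",
    "properties": {
        "tags": {
            "type": "array",
            "items": {"type": "object", "properties": {"name": {"type": "string"}}},
        }
    },
}
patched = add_additional_properties_false_sketch(schema)
print(patched["additionalProperties"])  # False
print(patched["properties"]["tags"]["items"]["additionalProperties"])  # False
print("additionalProperties" in schema)  # False
```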

is_in_range

is_in_range(
    value: int | float,
    min_value: int | float | None,
    max_value: int | float | None,
) -> bool

Check if a numeric value falls within an optional range (inclusive). Utility function that models can use for validation.

Parameters:

Name Type Description Default
value int | float

The numeric value to check.

required
min_value int | float | None

The minimum allowed value (inclusive), or None for no minimum.

required
max_value int | float | None

The maximum allowed value (inclusive), or None for no maximum.

required

Returns:

Type Description
bool

True if the value is within the range, False otherwise.
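The inclusive, optionally open-ended check can be written as two guarded comparisons. This is a sketch under the semantics stated above, with `is_in_range_sketch` as a hypothetical name.

```python
from typing import Optional, Union

Number = Union[int, float]


def is_in_range_sketch(
    value: Number, min_value: Optional[Number], max_value: Optional[Number]
) -> bool:
    """Hypothetical stand-in for is_in_range."""
    # A bound of None means that side of the range is unbounded.
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True


print(is_in_range_sketch(5, 1, 5))  # True (bounds are inclusive)
print(is_in_range_sketch(0.5, 1, None))  # False
print(is_in_range_sketch(100, None, None))  # True
```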

is_length

is_length(
    text: str,
    min_length: int | None,
    max_length: int | None,
) -> bool

Check if the length of a text string falls within an optional range (inclusive). Utility function that models can use for validation.

Parameters:

Name Type Description Default
text str

The text string to check.

required
min_length int | None

The minimum allowed length (inclusive), or None for no minimum.

required
max_length int | None

The maximum allowed length (inclusive), or None for no maximum.

required

Returns:

Type Description
bool

True if the text length is within the range, False otherwise.