Validation¶

Semantic validation of a parsed BCQL AST against a corpus-specific CorpusSpec. See the tagset validation guide for an overview.

Spec¶

bcql_py.validation.spec ¶

Corpus-specific semantic specification used by :func:bcql_py.validation.validate.

A :class:CorpusSpec describes the surface vocabulary of a particular corpus: which annotations exist, which annotations are closed-class (with a fixed set of allowed values), which XML span tags and attributes are available, and whether alignment or dependency-relation queries are allowed at all. This is a semantic layer that can be used on top of the "syntactic" AST structure to validate a query against corpus-specific constraints.

The spec is a frozen Pydantic model; use :meth:CorpusSpec.extend or :meth:CorpusSpec.merge to compose specs (e.g. to add your own corpus on top of a preset).

CorpusSpec ¶

Bases: BaseModel

Immutable description of a corpus' semantic vocabulary.

All fields default to the most permissive setting ("anything goes") so that a bare CorpusSpec() is a no-op validator. Narrow the spec by listing the annotations, tags, and relations your corpus actually supports.

Attributes:

Name	Type	Description
`open_attributes`	`frozenset[str]`	Annotation names whose value space is unconstrained (e.g. `word`, `lemma`).
`closed_attributes`	`dict[str, frozenset[str]]`	Annotation names whose values are restricted to a fixed set (e.g. `pos` -> `{"NOUN", "VERB", ...}`).
`strict_attributes`	`bool`	When `True`, any annotation not listed in `open_attributes` or `closed_attributes` is an error. When `False` (default), unknown annotations are accepted.
`allowed_span_tags`	`frozenset[str] \| None`	Allowed XML span tag names (e.g. `s`, `p`, `ne`), or `None` to allow any tag.
`allowed_span_attributes`	`dict[str, frozenset[str]] \| None`	Per-tag allowed XML attribute names. Tags missing from the mapping are unconstrained. Use `None` to allow any attribute on any tag.
`allow_alignment`	`bool`	If `False`, any use of the alignment (`==>`) operator raises a validation error.
`allowed_alignment_fields`	`frozenset[str] \| None`	Allowed target field names for alignment queries, or `None` to allow any.
`allow_relations`	`bool`	If `False`, any relation operator (`-type->` or `^-type->`) raises a validation error.
`allowed_relations`	`frozenset[str] \| None`	Allowed relation type names, or `None` to allow any. An empty set allows no named relations at all; to disable relations entirely, set `allow_relations=False` instead.

Example::

spec = CorpusSpec(open_attributes={"word"}, closed_attributes={"pos": {"NOUN", "VERB"}})
"pos" in spec.closed_attributes
# True
sorted(spec.closed_attributes["pos"])
# ['NOUN', 'VERB']
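The interplay between `open_attributes`, `closed_attributes`, and `strict_attributes` described in the table above can be sketched in plain Python. This is a standalone illustration, not the library's implementation: the function `check_annotation` and the issue label `unknown_annotation` are hypothetical names chosen for this sketch, and only literal string values are considered.

```python
def check_annotation(name, value, open_attributes, closed_attributes,
                     strict_attributes=False):
    """Return an issue label, or None when the (name, value) pair is acceptable.

    Hypothetical helper: it mirrors the documented field semantics only.
    """
    if name in closed_attributes:
        # Closed-class annotation: the value must be one of the listed values.
        if value not in closed_attributes[name]:
            return "invalid_annotation_value"
        return None
    if name in open_attributes:
        # Open-class annotation: any value is accepted.
        return None
    # Unknown annotation: only an error when the spec is strict.
    # "unknown_annotation" is an assumed label, not taken from the library.
    return "unknown_annotation" if strict_attributes else None


open_attrs = frozenset({"word"})
closed_attrs = {"pos": frozenset({"NOUN", "VERB"})}

print(check_annotation("pos", "ADJ", open_attrs, closed_attrs))
print(check_annotation("lemma", "be", open_attrs, closed_attrs))
print(check_annotation("lemma", "be", open_attrs, closed_attrs,
                       strict_attributes=True))
```

With the default `strict_attributes=False`, the unknown `lemma` annotation is accepted; only the strict variant reports it.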

description `property` ¶

description: str

A human-readable description of this spec. Can be overridden in subclasses. Potentially useful for error messages, debugging, or as information to LLM agents.

extend ¶

extend(
    *,
    open_attributes: Iterable[str] | None = None,
    closed_attributes: Mapping[str, Iterable[str]]
    | None = None,
    allowed_span_tags: Iterable[str] | None = None,
    allowed_span_attributes: Mapping[str, Iterable[str]]
    | None = None,
    allowed_alignment_fields: Iterable[str] | None = None,
    allowed_relations: Iterable[str] | None = None,
    strict_attributes: bool | None = None,
    allow_alignment: bool | None = None,
    allow_relations: bool | None = None,
) -> CorpusSpec

Return a new spec with the given additions/overrides merged in. Similar to :meth:merge, but with a more granular API that allows adding specific entries without having to construct a full spec.

Parameters:

Name	Type	Description	Default
`open_attributes`	`Iterable[str] \| None`	Extra open-class annotation names to union in.	`None`
`closed_attributes`	`Mapping[str, Iterable[str]] \| None`	Extra closed-class attributes; per-key values union.	`None`
`allowed_span_tags`	`Iterable[str] \| None`	Extra allowed span tag names.	`None`
`allowed_span_attributes`	`Mapping[str, Iterable[str]] \| None`	Extra per-tag attribute names.	`None`
`allowed_alignment_fields`	`Iterable[str] \| None`	Extra alignment target fields.	`None`
`allowed_relations`	`Iterable[str] \| None`	Extra relation type names.	`None`
`strict_attributes`	`bool \| None`	Override the strict-attributes flag.	`None`
`allow_alignment`	`bool \| None`	Override the alignment allowed flag.	`None`
`allow_relations`	`bool \| None`	Override the relations allowed flag.	`None`

Returns:

Type	Description
`CorpusSpec`	A new :class:`CorpusSpec`; the receiver is not modified.

Example::

base = CorpusSpec(open_attributes={"word"})
extended = base.extend(open_attributes={"lemma"})
sorted(extended.open_attributes)
# ['lemma', 'word']

View source on GitHub: src/bcql_py/validation/spec.py lines 144–220
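The "per-key values union" behaviour documented for `closed_attributes` can be illustrated with a small standalone sketch. The helper `extend_closed` is a hypothetical name used only here; it models the documented rule on plain dictionaries and does not call the library.

```python
def extend_closed(base, extra):
    """Return a new mapping in which each key holds the union of both sides.

    Hypothetical helper modelling the documented per-key union; the input
    mappings are left untouched, matching the "receiver is not modified" rule.
    """
    merged = {name: frozenset(values) for name, values in base.items()}
    for name, values in extra.items():
        # Existing keys gain the extra values; new keys are simply added.
        merged[name] = merged.get(name, frozenset()) | frozenset(values)
    return merged


base = {"pos": frozenset({"NOUN"})}
extended = extend_closed(base, {"pos": ["VERB"], "deprel": ["nsubj"]})

print(sorted(extended["pos"]))
print(sorted(extended))
print(sorted(base["pos"]))
```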

merge ¶

merge(other: CorpusSpec) -> CorpusSpec

Return a new spec combining this spec with other. In case of conflict, other wins (except for boolean flags, see below).

Set-valued fields are unioned. For the nullable set-valued fields (allowed_span_tags, allowed_alignment_fields, allowed_relations, and the dict-shaped allowed_span_attributes), None means "no constraint". A concrete set/dict is treated as more restrictive than None, so when one side is None and the other lists entries, the result is the listed entries: None survives only when both sides are None. This mirrors the boolean rule below: a concrete restriction always beats "no constraint".

WARNING: For boolean flags, other wins only when it is more restrictive (False beats True) so that merging in a preset cannot silently re-enable something the caller disabled.

Parameters:

Name	Type	Description	Default
`other`	`CorpusSpec`	Another spec to merge into this one.	required

Returns:

Type	Description
`CorpusSpec`	A new :class:`CorpusSpec` representing the union.

Example::

spec1 = CorpusSpec(open_attributes={"word"}, allow_alignment=True)
spec2 = CorpusSpec(open_attributes={"lemma"}, closed_attributes={"pos": {"NOUN", "VERB"}}, allow_alignment=False)
merged = spec1.merge(spec2)
sorted(merged.open_attributes)
# ['lemma', 'word']
"pos" in merged.closed_attributes
# True
merged.allow_alignment
# False

View source on GitHub: src/bcql_py/validation/spec.py lines 222–303
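The two merge rules above (nullable sets and boolean flags) can be written down as a pair of small functions. This is a sketch of the documented semantics, not the library's code: `merge_optional_set` and `merge_flag` are hypothetical names, and the sketch assumes that two concrete sets are unioned like the other set-valued fields.

```python
from typing import FrozenSet, Optional


def merge_optional_set(mine: Optional[FrozenSet[str]],
                       other: Optional[FrozenSet[str]]) -> Optional[FrozenSet[str]]:
    """None means "no constraint"; a concrete set always beats None."""
    if mine is None:
        return other
    if other is None:
        return mine
    # Assumption: two concrete sets are unioned.
    return mine | other


def merge_flag(mine: bool, other: bool) -> bool:
    """False (more restrictive) beats True, whichever side it comes from."""
    return mine and other


print(merge_optional_set(None, frozenset({"s", "p"})) == frozenset({"s", "p"}))
print(merge_optional_set(None, None))
print(merge_flag(False, True))
```

Under this reading the boolean rule reduces to a logical AND, which is why merging a permissive preset cannot re-enable a flag the caller has switched off.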

has_annotation ¶

has_annotation(name: str) -> bool

Return whether name is a known annotation on this spec.

An annotation is considered known when it is listed in either :attr:open_attributes or :attr:closed_attributes. This method is independent of :attr:strict_attributes: it only reports membership, not whether an unknown annotation would raise during validation.

Parameters:

Name	Type	Description	Default
`name`	`str`	The annotation name to check.	required

Returns:

Type	Description
`bool`	`True` if name is either an open or closed attribute on this spec, `False` otherwise.

Example::

spec = CorpusSpec(
    open_attributes={"word"},
    closed_attributes={"pos": {"NOUN", "VERB"}},
)
spec.has_annotation("word")
# True
spec.has_annotation("pos")
# True
spec.has_annotation("lemma")
# False

View source on GitHub: src/bcql_py/validation/spec.py lines 305–333

Validator¶

bcql_py.validation.validator ¶

Visitor that walks a BCQL AST and checks it against a :class:CorpusSpec.

The traversal uses Pydantic's model_fields introspection to recurse into any field whose value is a :class:~bcql_py.models.base.BCQLNode, including nested lists and dict values (used by :class:~bcql_py.models.span.SpanQuery for attributes).

TODO: only literal string values are checked against closed attribute sets; regex values are skipped for now.
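The field-introspection traversal described above can be sketched without any dependencies. To keep the example self-contained, this sketch swaps Pydantic's `model_fields` for the standard library's `dataclasses.fields`; the classes `Node`, `Token`, and `Span` are hypothetical stand-ins, not the library's models.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Node:
    """Hypothetical stand-in for the AST base class."""


@dataclass
class Token(Node):
    annotation: str = ""
    value: str = ""


@dataclass
class Span(Node):
    tag: str = ""
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def iter_nodes(value):
    """Yield every Node reachable from value, depth first."""
    if isinstance(value, Node):
        yield value
        # Recurse into every declared field of the node.
        for f in fields(value):
            yield from iter_nodes(getattr(value, f.name))
    elif isinstance(value, (list, tuple)):
        # Lists may be nested, so recurse into each item.
        for item in value:
            yield from iter_nodes(item)
    elif isinstance(value, dict):
        # Only dict values can hold nodes; keys are plain strings here.
        for item in value.values():
            yield from iter_nodes(item)


tree = Span(
    tag="s",
    attributes={"type": Token("word", "x")},
    children=[Token("pos", "NOUN"), [Token("lemma", "be")]],
)
print([type(node).__name__ for node in iter_nodes(tree)])
```

A generic walk like this means new node types are visited automatically, at the cost of relying on field declaration order for the traversal order.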

validate ¶

validate(
    ast: BCQLNode,
    spec: CorpusSpec,
    *,
    fail_fast: bool = True,
)

Validate a parsed BCQL AST against spec, raising on any issue.

Parameters:

Name	Type	Description	Default
`ast`	`BCQLNode`	The root :class:`~bcql_py.models.base.BCQLNode` returned by :func:`bcql_py.parse`.	required
`spec`	`CorpusSpec`	The :class:`CorpusSpec` describing what the corpus allows.	required
`fail_fast`	`bool`	When `True` (default), raise as soon as the first issue is found. When `False`, collect every issue and raise once at the end so callers can report them all together.	`True`

Raises:

Type	Description
`BCQLValidationError`	If one or more validation issues are found. The raised exception's `issues` attribute holds the full list.

Example::

from bcql_py import CorpusSpec, parse, validate
spec = CorpusSpec(
    open_attributes={"word"},
    closed_attributes={"pos": {"NOUN", "VERB"}},
)
validate(parse('[pos="NOUN"]'), spec)  # passes silently
try:
    validate(parse('[pos="ADJ"]'), spec)
except Exception as exc:
    print(exc.issues[0].kind)
# invalid_annotation_value

View source on GitHub: src/bcql_py/validation/validator.py lines 420–452
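The difference between `fail_fast=True` and `fail_fast=False` is a common collect-or-raise pattern, sketched here in isolation. The names `SketchValidationError` and `run_checks` are hypothetical and exist only in this example; the real exception is the `BCQLValidationError` listed above.

```python
class SketchValidationError(Exception):
    """Hypothetical stand-in carrying an `issues` list."""

    def __init__(self, issues):
        super().__init__(f"{len(issues)} validation issue(s)")
        self.issues = list(issues)


def run_checks(values, allowed, fail_fast=True):
    """Raise on invalid values, either at the first one or after collecting all."""
    issues = []
    for value in values:
        if value not in allowed:
            issues.append(f"invalid value: {value}")
            if fail_fast:
                # Stop at the first problem.
                raise SketchValidationError(issues)
    if issues:
        # Report everything that was collected in one exception.
        raise SketchValidationError(issues)


allowed = {"NOUN", "VERB"}
for mode in (True, False):
    try:
        run_checks(["ADJ", "NOUN", "DET"], allowed, fail_fast=mode)
    except SketchValidationError as exc:
        print(mode, len(exc.issues))
```

Collecting every issue is usually the friendlier choice for interactive tools, since a user can fix all problems in one pass.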

Bundled presets¶

bcql_py.validation.presets.ud ¶

Full Universal Dependencies (UD v2) preset.

Universal POS tags (:data:UD_POS_TAGS, wired as closed values for the upos annotation).
Universal morphological features (:data:UD_FEATURE_VALUES), each one a closed attribute (Number, Case, PronType, ...).
Core universal dependency relation labels (:data:UD_RELATION_LABELS), wired as :class:CorpusSpec.allowed_relations; the relation label is also exposed as the closed deprel annotation for corpora that store it on the token.
Common CoNLL-U-style open annotations (:data:UD_OPEN_ATTRIBUTES): word, lemma, xpos, feats, misc, plus id, head.

References

POS: https://universaldependencies.org/u/pos/all.html
Features: https://universaldependencies.org/u/feat/all.html
Relations: https://universaldependencies.org/u/dep/all.html

Language-specific POS sub-types and relation subtypes (e.g. nsubj:pass) are intentionally not included. Extend the preset to add them::

spec = UD.extend(allowed_relations={"nsubj:pass", "acl:relcl", "obl:agent"})

bcql_py.validation.presets.lassy ¶

Lassy / Alpino preset derived from the alpino_ds DTD.

See the alpino guide, Figures 1.1 and 1.2 on pages 13-14.

Based on this, the preset describes:

The full Alpino relation inventory (rel), also exposed as :data:LASSY_RELATION_LABELS and integrated as :class:CorpusSpec.allowed_relations.
Phrasal categories (cat), part-of-speech tags (pt), and morphosyntactic features (ntype, getal, graad, ...).
Open-string annotations (word, lemma, postag, plus identifier / position fields).
The DTD element names as :class:CorpusSpec.allowed_span_tags, for corpora that expose alpino_ds / node as XML spans.

Note that "pos" and "root" are excluded, as per the documentation:

The attributes pos and root represent the POSTAG and ROOT values used by Alpino. They are not documented separately here and are not an official part of the annotation.

LASSY_FEATURE_VALUES `module-attribute` ¶

LASSY_FEATURE_VALUES: dict[str, frozenset[str]] = {
    "dial": frozenset({"dial"}),
    "ntype": frozenset({"soort", "eigen"}),
    "getal": frozenset({"getal", "ev", "mv"}),
    "graad": frozenset({"basis", "comp", "sup", "dim"}),
    "genus": frozenset(
        {"genus", "zijd", "masc", "fem", "onz"}
    ),
    "naamval": frozenset(
        {"stan", "nomin", "obl", "bijz", "gen", "dat"}
    ),
    "positie": frozenset(
        {"prenom", "nom", "postnom", "vrij"}
    ),
    "buiging": frozenset({"zonder", "met-e", "met-s"}),
    "getal-n": frozenset({"zonder-n", "mv-n"}),
    "wvorm": frozenset({"pv", "inf", "od", "vd"}),
    "pvtijd": frozenset({"tgw", "verl", "conj"}),
    "pvagr": frozenset({"ev", "mv", "met-t"}),
    "numtype": frozenset({"hoofd", "rang"}),
    "vwtype": frozenset(
        {
            "pr",
            "pers",
            "refl",
            "recip",
            "bez",
            "vb",
            "vrag",
            "betr",
            "excl",
            "aanw",
            "onbep",
        }
    ),
    "pdtype": frozenset(
        {"pron", "adv-pron", "det", "grad"}
    ),
    "persoon": frozenset(
        {
            "persoon",
            "1",
            "2",
            "2v",
            "2b",
            "3",
            "3p",
            "3m",
            "3v",
            "3o",
        }
    ),
    "status": frozenset({"vol", "red", "nadr"}),
    "npagr": frozenset(
        {
            "agr",
            "evon",
            "rest",
            "evz",
            "mv",
            "agr3",
            "evmo",
            "rest3",
            "evf",
        }
    ),
    "lwtype": frozenset({"bep", "onbep"}),
    "vztype": frozenset({"init", "versm", "fin"}),
    "conjtype": frozenset({"neven", "onder"}),
    "spectype": frozenset(
        {
            "afgebr",
            "onverst",
            "vreemd",
            "deeleigen",
            "meta",
            "comment",
            "achter",
            "afk",
            "symb",
            "enof",
        }
    ),
    "rel": LASSY_RELATION_LABELS,
    "cat": LASSY_CAT_LABELS,
    "pt": LASSY_PT_LABELS,
}

postag="VNW(aanw,det,stan,nom,met-e,mv-n)"¶

pt="vnw" vwtype="aanw" pdtype="det" naamval="stan" positie="nom" buiging="met-e" getal-n="mv-n"
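The decomposition shown above, from a compact `postag` string to individual feature attributes, can be reproduced with a value lookup. This is an illustrative sketch only: `split_postag` is a hypothetical helper, the table is a subset of `LASSY_FEATURE_VALUES` restricted to the attributes this example needs, and the lookup-by-value strategy is an assumption. In the full table some values (for example `mv` or `onbep`) belong to several attributes, so the sketch rejects anything ambiguous.

```python
import re

# Subset of the feature table above, limited to what this example needs.
FEATURES = {
    "vwtype": frozenset({"pr", "pers", "refl", "recip", "bez", "vb",
                         "vrag", "betr", "excl", "aanw", "onbep"}),
    "pdtype": frozenset({"pron", "adv-pron", "det", "grad"}),
    "naamval": frozenset({"stan", "nomin", "obl", "bijz", "gen", "dat"}),
    "positie": frozenset({"prenom", "nom", "postnom", "vrij"}),
    "buiging": frozenset({"zonder", "met-e", "met-s"}),
    "getal-n": frozenset({"zonder-n", "mv-n"}),
}


def split_postag(postag):
    """Split e.g. 'VNW(aanw,det)' into {'pt': 'vnw', 'vwtype': 'aanw', ...}."""
    match = re.fullmatch(r"([A-Z]+)\(([^()]*)\)", postag)
    if match is None:
        raise ValueError(f"unexpected postag format: {postag!r}")
    head, body = match.groups()
    result = {"pt": head.lower()}
    for value in filter(None, body.split(",")):
        # Find the attribute(s) whose closed value set contains this value.
        owners = [name for name, allowed in FEATURES.items() if value in allowed]
        if len(owners) != 1:
            raise ValueError(f"cannot place feature value {value!r}")
        result[owners[0]] = value
    return result


print(split_postag("VNW(aanw,det,stan,nom,met-e,mv-n)"))
```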
