Validation¶

Semantic validation of a parsed BCQL AST against a corpus-specific CorpusSpec. See the tagset validation guide for an overview.

Spec¶

bcql_py.validation.spec ¶

Corpus-specific semantic specification used by validate().

A CorpusSpec describes the surface vocabulary of a particular corpus: which annotations exist, which annotations are closed-class (with a fixed set of allowed values), which XML span tags and attributes are available, and whether alignment or dependency-relation queries are allowed at all. This is a semantic layer that can be used on top of the "syntactic" AST structure to validate a query against corpus-specific constraints.

The spec is a frozen Pydantic model; use CorpusSpec.extend() or CorpusSpec.merge() to compose specs (e.g. to add your own corpus on top of a preset).

CorpusSpec ¶

Bases: BaseModel

Immutable description of a corpus' semantic vocabulary.

All fields default to the most permissive setting ("anything goes") so that a bare CorpusSpec() is a no-op validator. Narrow the spec by listing the annotations, tags, and relations your corpus actually supports.

Attributes:

Name	Type	Description
`open_attributes`	`frozenset[str]`	Annotation names whose value space is unconstrained (e.g. `word`, `lemma`).
`closed_attributes`	`dict[str, frozenset[str]]`	Annotation names whose values are restricted to a fixed set (e.g. `pos` -> `{"NOUN", "VERB", ...}`).
`strict_attributes`	`bool`	When `True`, any annotation not listed in `open_attributes` or `closed_attributes` is an error. When `False` (default), unknown annotations are accepted.
`allowed_span_tags`	`frozenset[str] \| None`	Allowed XML span tag names (e.g. `s`, `p`, `ne`), or `None` to allow any tag.
`allowed_span_attributes`	`dict[str, frozenset[str]] \| None`	Per-tag allowed XML attribute values. Missing tags default to no constraint. Use `None` to allow any attribute.
`allow_alignment`	`bool`	If `False`, any use of the alignment (`==>`) operator raises a validation error.
`allowed_alignment_fields`	`frozenset[str] \| None`	Allowed target field names for alignment queries, or `None` to allow any.
`allow_relations`	`bool`	If `False`, any relation operator (`-type->` or `^-type->`) raises a validation error.
`allowed_relations`	`frozenset[str] \| None`	Allowed relation type names, or `None` to allow any. An empty set means that no named relation is allowed; prefer `allow_relations=False` to express that.

Example::

spec = CorpusSpec(open_attributes={"word"}, closed_attributes={"pos": {"NOUN", "VERB"}})
"pos" in spec.closed_attributes
# True
sorted(spec.closed_attributes["pos"])
# ['NOUN', 'VERB']
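
The effect of `strict_attributes` can be pictured with a small standalone sketch. It does not use bcql_py; `annotation_is_accepted` is a hypothetical helper that only mirrors the rule described in the attribute table above, and the library's actual implementation may differ.

```python
from __future__ import annotations


def annotation_is_accepted(
    name: str,
    open_attributes: frozenset[str],
    closed_attributes: dict[str, frozenset[str]],
    strict_attributes: bool = False,
) -> bool:
    """Hypothetical helper mirroring the documented strict_attributes rule."""
    known = name in open_attributes or name in closed_attributes
    # Unknown annotations are rejected only when strict mode is on.
    return known or not strict_attributes


open_attrs = frozenset({"word"})
closed_attrs = {"pos": frozenset({"NOUN", "VERB"})}

# "lemma" is not listed anywhere: accepted by default, rejected in strict mode.
lenient = annotation_is_accepted("lemma", open_attrs, closed_attrs)
strict = annotation_is_accepted("lemma", open_attrs, closed_attrs, strict_attributes=True)
# "pos" is listed as a closed attribute, so it is accepted in either mode.
known = annotation_is_accepted("pos", open_attrs, closed_attrs, strict_attributes=True)
```

This only covers whether an annotation name is accepted; checking a value against a closed attribute's set is a separate step.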

description `property` ¶

description: str

A human-readable description of this spec. Can be overridden in subclasses. Potentially useful for error messages, debugging, or as information to LLM agents.

extend ¶

extend(
    *,
    open_attributes: Iterable[str] | None = None,
    closed_attributes: Mapping[str, Iterable[str]]
    | None = None,
    allowed_span_tags: Iterable[str] | None = None,
    allowed_span_attributes: Mapping[str, Iterable[str]]
    | None = None,
    allowed_alignment_fields: Iterable[str] | None = None,
    allowed_relations: Iterable[str] | None = None,
    strict_attributes: bool | None = None,
    allow_alignment: bool | None = None,
    allow_relations: bool | None = None,
) -> CorpusSpec

Return a new spec with the given additions/overrides merged in. Similar to merge(), but with a more granular API that allows adding specific entries without having to construct a full spec.

Parameters:

Name	Type	Description	Default
`open_attributes`	`Iterable[str] \| None`	Extra open-class annotation names to union in.	`None`
`closed_attributes`	`Mapping[str, Iterable[str]] \| None`	Extra closed-class attributes; per-key values union.	`None`
`allowed_span_tags`	`Iterable[str] \| None`	Extra allowed span tag names.	`None`
`allowed_span_attributes`	`Mapping[str, Iterable[str]] \| None`	Extra per-tag attribute names.	`None`
`allowed_alignment_fields`	`Iterable[str] \| None`	Extra alignment target fields.	`None`
`allowed_relations`	`Iterable[str] \| None`	Extra relation type names.	`None`
`strict_attributes`	`bool \| None`	Override the strict-attributes flag.	`None`
`allow_alignment`	`bool \| None`	Override the alignment allowed flag.	`None`
`allow_relations`	`bool \| None`	Override the relations allowed flag.	`None`

Returns:

Type	Description
`CorpusSpec`	A new CorpusSpec; the receiver is not modified.

Example::

base = CorpusSpec(open_attributes={"word"})
extended = base.extend(open_attributes={"lemma"})
sorted(extended.open_attributes)
# ['lemma', 'word']
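
The "per-key values union" behaviour documented for `closed_attributes` could look roughly like the following. This is a standalone sketch under that reading of the parameter table; `union_closed_attributes` is a hypothetical name and not part of bcql_py.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping


def union_closed_attributes(
    base: Mapping[str, frozenset[str]],
    extra: Mapping[str, Iterable[str]],
) -> dict[str, frozenset[str]]:
    """Hypothetical helper: union values per key, keep untouched keys as-is."""
    result = dict(base)  # shallow copy, so the input mapping is not modified
    for name, values in extra.items():
        result[name] = result.get(name, frozenset()) | frozenset(values)
    return result


base = {"pos": frozenset({"NOUN", "VERB"})}
extended = union_closed_attributes(base, {"pos": ["ADJ"], "number": ["sg", "pl"]})
# extended["pos"] now also contains "ADJ"; "number" is a new closed attribute.
```

As with `extend()` itself, the input is left untouched and a new mapping is returned.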

View source on GitHub: src/bcql_py/validation/spec.py lines 145–221

merge ¶

merge(other: CorpusSpec) -> CorpusSpec

Return a new spec combining this spec with other. In case of conflict, other wins (except for boolean flags, see below).

Set-valued fields are unioned. For the nullable set-valued fields (allowed_span_tags, allowed_alignment_fields, allowed_relations, and the dict-shaped allowed_span_attributes), None means "no constraint". A concrete set/dict is treated as more restrictive than None, so when one side is None and the other lists entries, the result is the listed entries: None survives only when both sides are None. This mirrors the boolean rule below: a concrete restriction always beats "no constraint".

WARNING: For boolean flags, other wins only when it is more restrictive (False beats True) so that merging in a preset cannot silently re-enable something the caller disabled.

Parameters:

Name	Type	Description	Default
`other`	`CorpusSpec`	Another spec to merge into this one.	required

Returns:

Type	Description
`CorpusSpec`	A new CorpusSpec representing the union.

Example::

spec1 = CorpusSpec(open_attributes={"word"}, allow_alignment=True)
spec2 = CorpusSpec(open_attributes={"lemma"}, closed_attributes={"pos": {"NOUN", "VERB"}}, allow_alignment=False)
merged = spec1.merge(spec2)
sorted(merged.open_attributes)
# ['lemma', 'word']
"pos" in merged.closed_attributes
# True
merged.allow_alignment
# False
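
The two merge rules described above (nullable sets and boolean flags) can be written down as two tiny functions. This is a standalone sketch of the documented semantics, with hypothetical helper names; it is not the library's code.

```python
from __future__ import annotations


def merge_nullable_set(
    ours: frozenset[str] | None, theirs: frozenset[str] | None
) -> frozenset[str] | None:
    """None means "no constraint"; a concrete set always beats None."""
    if ours is None:
        return theirs  # also covers the case where both sides are None
    if theirs is None:
        return ours
    return ours | theirs  # both sides list entries: union them


def merge_flag(ours: bool, theirs: bool) -> bool:
    """False (more restrictive) beats True, whichever side it comes from."""
    return ours and theirs


tags = merge_nullable_set(None, frozenset({"s", "p"}))
both = merge_nullable_set(frozenset({"s"}), frozenset({"ne"}))
unconstrained = merge_nullable_set(None, None)
flag = merge_flag(False, True)
```

Under this reading, merging can never lift a restriction that either side already imposes.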

View source on GitHub: src/bcql_py/validation/spec.py lines 223–304

has_annotation ¶

has_annotation(name: str) -> bool

Return whether name is a known annotation on this spec.

An annotation is considered known when it is listed in either open_attributes or closed_attributes. This method is independent of strict_attributes: it only reports membership, not whether an unknown annotation would raise during validation.

Parameters:

Name	Type	Description	Default
`name`	`str`	The annotation name to check.	required

Returns:

Type	Description
`bool`	`True` if name is either an open or closed attribute on this spec, `False` otherwise.

Example::

spec = CorpusSpec(
    open_attributes={"word"},
    closed_attributes={"pos": {"NOUN", "VERB"}},
)
spec.has_annotation("word")
# True
spec.has_annotation("pos")
# True
spec.has_annotation("lemma")
# False

View source on GitHub: src/bcql_py/validation/spec.py lines 306–334

Validator¶

bcql_py.validation.validator ¶

Visitor that walks a BCQL AST and checks it against a CorpusSpec.

The traversal uses Pydantic's model_fields introspection to recurse into any field whose value is a BCQLNode, including nested lists and dict values (used by SpanQuery for attributes).

TODO: only literal string values are checked against closed attribute sets; regex values are skipped for now.
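
The recursion pattern described above can be illustrated without bcql_py or Pydantic. In the sketch below, dataclasses and `dataclasses.fields` stand in for Pydantic models and `model_fields`, and `Node`, `Token`, and `Span` are hypothetical stand-ins for the real AST classes; only the shape of the traversal is meant to carry over.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from typing import Any, Iterator


@dataclass
class Node:
    """Hypothetical stand-in for BCQLNode."""


@dataclass
class Token(Node):
    annotation: str = ""
    value: str = ""


@dataclass
class Span(Node):
    tag: str = ""
    attributes: dict[str, Node] = field(default_factory=dict)
    children: list[Node] = field(default_factory=list)


def iter_nodes(value: Any) -> Iterator[Node]:
    """Yield every Node reachable from value, parents before children."""
    if isinstance(value, Node):
        yield value
        # Introspect the declared fields instead of hard-coding attribute names.
        for f in fields(value):
            yield from iter_nodes(getattr(value, f.name))
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_nodes(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_nodes(item)
    # Plain strings, numbers, etc. are leaves and yield nothing.


tree = Span(
    tag="s",
    attributes={"type": Token("kind", "q")},
    children=[Token("word", "a"), Token("pos", "NOUN")],
)
visited = [type(node).__name__ for node in iter_nodes(tree)]
```

Because the walk is driven by field introspection, a new node type with node-valued fields would be covered without changing the traversal.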

validate ¶

validate(
    ast: BCQLNode,
    spec: CorpusSpec,
    *,
    fail_fast: bool = True,
)

Validate a parsed BCQL AST against spec, raising on any issue.

Parameters:

Name	Type	Description	Default
`ast`	`BCQLNode`	The root BCQLNode returned by parse().	required
`spec`	`CorpusSpec`	The CorpusSpec describing what the corpus allows.	required
`fail_fast`	`bool`	When `True` (default), raise as soon as the first issue is found. When `False`, collect every issue and raise once at the end so callers can report them all together.	`True`

Raises:

Type	Description
`BCQLValidationError`	If one or more validation issues are found. The raised exception's `issues` attribute holds the full list.

Example::

from bcql_py import CorpusSpec, parse, validate
spec = CorpusSpec(
    open_attributes={"word"},
    closed_attributes={"pos": {"NOUN", "VERB"}},
)
validate(parse('[pos="NOUN"]'), spec)  # passes silently
try:
    validate(parse('[pos="ADJ"]'), spec)
except Exception as exc:
    print(exc.issues[0].kind)
# invalid_annotation_value
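
The difference between `fail_fast=True` and `fail_fast=False` is easiest to see in a generic form. The sketch below is standalone: `ValidationFailed` is a hypothetical stand-in for BCQLValidationError, and `check_values` is a toy check rather than the real validator.

```python
from __future__ import annotations


class ValidationFailed(Exception):
    """Hypothetical stand-in for BCQLValidationError; carries an issues list."""

    def __init__(self, issues: list[str]) -> None:
        super().__init__(f"{len(issues)} validation issue(s)")
        self.issues = issues


def check_values(
    values: list[str], allowed: frozenset[str], *, fail_fast: bool = True
) -> None:
    """Raise ValidationFailed if any value falls outside the allowed set."""
    issues: list[str] = []
    for value in values:
        if value not in allowed:
            issues.append(f"invalid value: {value}")
            if fail_fast:
                break  # stop at the first problem
    if issues:
        raise ValidationFailed(issues)


def issues_for(values: list[str], *, fail_fast: bool) -> list[str]:
    """Run the check and return the reported issues (empty if it passes)."""
    try:
        check_values(values, frozenset({"NOUN", "VERB"}), fail_fast=fail_fast)
    except ValidationFailed as exc:
        return exc.issues
    return []


first_only = issues_for(["ADJ", "NOUN", "DET"], fail_fast=True)
everything = issues_for(["ADJ", "NOUN", "DET"], fail_fast=False)
```

In both modes a single exception is raised; only the length of its `issues` list differs.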

View source on GitHub: src/bcql_py/validation/validator.py lines 435–467

Bundled presets¶

bcql_py.validation.presets.ud ¶

Full Universal Dependencies (UD v2) preset.

Universal POS tags (UD_POS_TAGS), wired as closed values for the upos annotation.
Universal morphological features (UD_FEATURE_VALUES), each one a closed attribute (Number, Case, PronType, ...).
Core universal dependency relation labels (UD_RELATION_LABELS), wired as allowed relation values; the relation label is also exposed as the closed deprel annotation for corpora that store it on the token.
Common CoNLL-U-style open annotations (UD_OPEN_ATTRIBUTES): word, lemma, xpos, feats, misc, plus id, head.

Language-specific POS sub-types and relation subtypes (e.g. nsubj:pass) are intentionally not included. Extend the preset to add them::

spec = UD.extend(allowed_relations={"nsubj:pass", "acl:relcl", "obl:agent"})

UD_POS_TAGS `module-attribute` ¶

UD_POS_TAGS: frozenset[str] = frozenset(
    {
        "ADJ",
        "ADP",
        "ADV",
        "AUX",
        "CCONJ",
        "DET",
        "INTJ",
        "NOUN",
        "NUM",
        "PART",
        "PRON",
        "PROPN",
        "PUNCT",
        "SCONJ",
        "SYM",
        "VERB",
        "X",
    }
)

Universal Dependencies v2 universal POS tags.

Wired as closed attribute values for the upos annotation in the UD preset.

UD_RELATION_LABELS `module-attribute` ¶

UD_RELATION_LABELS: frozenset[str] = frozenset(
    {
        "acl",
        "advcl",
        "advmod",
        "amod",
        "appos",
        "aux",
        "case",
        "cc",
        "ccomp",
        "clf",
        "compound",
        "conj",
        "cop",
        "csubj",
        "dep",
        "det",
        "discourse",
        "dislocated",
        "expl",
        "fixed",
        "flat",
        "goeswith",
        "iobj",
        "list",
        "mark",
        "nmod",
        "nsubj",
        "nummod",
        "obj",
        "obl",
        "orphan",
        "parataxis",
        "punct",
        "reparandum",
        "vocative",
        "xcomp",
    }
)

Core Universal Dependencies v2 dependency relation labels.

Wired as allowed relation values in the UD preset. Language-specific subtypes (e.g., nsubj:pass, acl:relcl) are not included; extend the preset to add them.

UD_FEATURE_VALUES `module-attribute` ¶

UD_FEATURE_VALUES: dict[str, frozenset[str]] = {
    "PronType": frozenset(
        {
            "Art",
            "Dem",
            "Emp",
            "Exc",
            "Ind",
            "Int",
            "Neg",
            "Prs",
            "Rcp",
            "Rel",
            "Tot",
        }
    ),
    "NumType": frozenset(
        {
            "Card",
            "Dist",
            "Frac",
            "Mult",
            "Ord",
            "Range",
            "Sets",
        }
    ),
    "Poss": frozenset({"Yes"}),
    "Reflex": frozenset({"Yes"}),
    "Foreign": frozenset({"Yes"}),
    "Abbr": frozenset({"Yes"}),
    "Typo": frozenset({"Yes"}),
    "ExtPos": frozenset(
        {
            "ADJ",
            "ADP",
            "ADV",
            "AUX",
            "CCONJ",
            "DET",
            "INTJ",
            "PRON",
            "PROPN",
            "SCONJ",
        }
    ),
    "Gender": frozenset({"Com", "Fem", "Masc", "Neut"}),
    "Animacy": frozenset({"Anim", "Hum", "Inan", "Nhum"}),
    "NounClass": frozenset(
        {
            "Bantu1",
            "Bantu2",
            "Bantu3",
            "Bantu4",
            "Bantu5",
            "Bantu6",
            "Bantu7",
            "Bantu8",
            "Bantu9",
            "Bantu10",
            "Bantu11",
            "Bantu12",
            "Bantu13",
            "Bantu14",
            "Bantu15",
            "Bantu16",
            "Bantu17",
            "Bantu18",
            "Bantu19",
            "Bantu20",
            "Bantu21",
            "Bantu22",
            "Bantu23",
            "Wol1",
            "Wol2",
            "Wol3",
            "Wol4",
            "Wol5",
            "Wol6",
            "Wol7",
            "Wol8",
            "Wol9",
            "Wol10",
            "Wol11",
            "Wol12",
        }
    ),
    "Number": frozenset(
        {
            "Coll",
            "Count",
            "Dual",
            "Grpa",
            "Grpl",
            "Inv",
            "Pauc",
            "Plur",
            "Ptan",
            "Sing",
            "Tri",
        }
    ),
    "Case": frozenset({"Abs", "Acc", "Erg", "Nom"}),
    "Definite": frozenset(
        {"Com", "Cons", "Def", "Ind", "Spec"}
    ),
    "Deixis": frozenset(
        {"Abv", "Bel", "Even", "Med", "Nvis", "Prx", "Remt"}
    ),
    "DeixisRef": frozenset({"1", "2"}),
    "Degree": frozenset(
        {"Abs", "Aug", "Cmp", "Dim", "Equ", "Pos", "Sup"}
    ),
    "VerbForm": frozenset(
        {
            "Conv",
            "Fin",
            "Gdv",
            "Ger",
            "Inf",
            "Part",
            "Sup",
            "Vnoun",
        }
    ),
    "Mood": frozenset(
        {
            "Adm",
            "Cnd",
            "Des",
            "Imp",
            "Ind",
            "Int",
            "Irr",
            "Jus",
            "Nec",
            "Opt",
            "Pot",
            "Prp",
            "Qot",
            "Sub",
        }
    ),
    "Tense": frozenset(
        {"Fut", "Imp", "Past", "Pqp", "Pres"}
    ),
    "Aspect": frozenset(
        {"Hab", "Imp", "Iter", "Perf", "Prog", "Prosp"}
    ),
    "Voice": frozenset(
        {
            "Act",
            "Antip",
            "Bfoc",
            "Cau",
            "Dir",
            "Inv",
            "Lfoc",
            "Mid",
            "Pass",
            "Rcp",
        }
    ),
    "Evident": frozenset({"Fh", "Nfh"}),
    "Polarity": frozenset({"Neg", "Pos"}),
    "Person": frozenset({"0", "1", "2", "3", "4"}),
    "Polite": frozenset({"Elev", "Form", "Humb", "Infm"}),
    "Clusivity": frozenset({"Ex", "In"}),
}

Universal morphological features and their allowed values.

Wired as closed attributes in the UD preset (e.g., Number, Case, Tense, etc.).

UD_OPEN_ATTRIBUTES `module-attribute` ¶

UD_OPEN_ATTRIBUTES: frozenset[str] = frozenset(
    {"word", "lemma", "xpos", "feats", "misc", "id", "head"}
)

Common CoNLL-U open annotations in Universal Dependencies.

Open attributes are those whose values are not restricted to a fixed set. Includes the token form (word), lemma, language-specific POS tag (xpos), morphological feature string (feats), miscellaneous field (misc), token ID (id), and head index (head).

UD `module-attribute` ¶

UD = CorpusSpec(
    open_attributes=UD_OPEN_ATTRIBUTES,
    closed_attributes=_UD_CLOSED_ATTRIBUTES,
    allowed_relations=UD_RELATION_LABELS,
)

Universal Dependencies v2 corpus specification.

A ready-made CorpusSpec for validating BCQL queries against Universal Dependencies v2 corpora. Includes universal POS tags, morphological features, core dependency relations, and standard CoNLL-U annotations.

Language-specific subtypes and variations can be added via extend().
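
The definition of `_UD_CLOSED_ATTRIBUTES` is not shown on this page. Going by the module description (upos closed over UD_POS_TAGS, deprel closed over UD_RELATION_LABELS, and each morphological feature a closed attribute of its own), it may be assembled along these lines. The sketch uses heavily truncated stand-in sets and hypothetical variable names, so treat it as an assumption about the structure, not as the actual source.

```python
# Truncated stand-ins for the module-level constants documented above.
POS_TAGS = frozenset({"NOUN", "VERB"})
RELATION_LABELS = frozenset({"nsubj", "obj"})
FEATURE_VALUES = {
    "Number": frozenset({"Sing", "Plur"}),
    "Tense": frozenset({"Past", "Pres"}),
}

# Assumed composition: one closed attribute per feature, plus upos and deprel.
closed_attributes = {
    "upos": POS_TAGS,
    "deprel": RELATION_LABELS,
    **FEATURE_VALUES,
}

attribute_names = sorted(closed_attributes)
```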

bcql_py.validation.presets.lassy ¶

Lassy / Alpino preset derived from the alpino_ds DTD.

See the LASSY manual, Figures 1.1 and 1.2 on pages 13-14.

Based on this, the preset describes:

The full Alpino relation inventory (rel), also exposed as LASSY_RELATION_LABELS and integrated as allowed relation values.
Phrasal categories (cat), part-of-speech tags (pt), and morphosyntactic features (ntype, getal, graad, ...).
Open-string annotations (word, lemma, postag, plus identifier / position fields).
The DTD element names as allowed span tags, for corpora that expose alpino_ds / node as XML spans.

Note that "pos" and "root" are excluded, as per the documentation:

The attributes pos and root represent the POSTAG and ROOT values used by Alpino. They are not documented separately here and are not an official part of the annotation.

LASSY_RELATION_LABELS `module-attribute` ¶

LASSY_RELATION_LABELS: frozenset[str] = frozenset(
    {
        "--",
        "app",
        "body",
        "cmp",
        "cnj",
        "crd",
        "det",
        "dlink",
        "dp",
        "hd",
        "hdf",
        "ld",
        "me",
        "mod",
        "mwp",
        "nucl",
        "obcomp",
        "obj1",
        "obj2",
        "pc",
        "pobj1",
        "predc",
        "predm",
        "rhd",
        "sat",
        "se",
        "su",
        "sup",
        "svp",
        "tag",
        "top",
        "vc",
        "whd",
    }
)

Lassy/Alpino dependency relation labels (rel attribute).

Wired as allowed relation values in the LASSY preset.

LASSY_CAT_LABELS `module-attribute` ¶

LASSY_CAT_LABELS: frozenset[str] = frozenset(
    {
        "advp",
        "ahi",
        "ap",
        "conj",
        "cp",
        "detp",
        "du",
        "inf",
        "mwu",
        "np",
        "oti",
        "pp",
        "ppart",
        "rel",
        "smain",
        "ssub",
        "sv1",
        "svan",
        "ti",
        "top",
        "whq",
        "whrel",
        "whsub",
    }
)

Lassy phrasal category labels (cat attribute).

Wired as closed attribute values in the LASSY preset.

LASSY_PT_LABELS `module-attribute` ¶

LASSY_PT_LABELS: frozenset[str] = frozenset(
    {
        "adj",
        "bw",
        "let",
        "lid",
        "n",
        "spec",
        "tsw",
        "tw",
        "vg",
        "vnw",
        "vz",
        "ww",
    }
)

Lassy part-of-speech tags (pt attribute).

Wired as closed attribute values in the LASSY preset.

LASSY_FEATURE_VALUES `module-attribute` ¶

LASSY_FEATURE_VALUES: dict[str, frozenset[str]] = {
    "dial": frozenset({"dial"}),
    "ntype": frozenset({"soort", "eigen"}),
    "getal": frozenset({"getal", "ev", "mv"}),
    "graad": frozenset({"basis", "comp", "sup", "dim"}),
    "genus": frozenset(
        {"genus", "zijd", "masc", "fem", "onz"}
    ),
    "naamval": frozenset(
        {"stan", "nomin", "obl", "bijz", "gen", "dat"}
    ),
    "positie": frozenset(
        {"prenom", "nom", "postnom", "vrij"}
    ),
    "buiging": frozenset({"zonder", "met-e", "met-s"}),
    "getal-n": frozenset({"zonder-n", "mv-n"}),
    "wvorm": frozenset({"pv", "inf", "od", "vd"}),
    "pvtijd": frozenset({"tgw", "verl", "conj"}),
    "pvagr": frozenset({"ev", "mv", "met-t"}),
    "numtype": frozenset({"hoofd", "rang"}),
    "vwtype": frozenset(
        {
            "pr",
            "pers",
            "refl",
            "recip",
            "bez",
            "vb",
            "vrag",
            "betr",
            "excl",
            "aanw",
            "onbep",
        }
    ),
    "pdtype": frozenset(
        {"pron", "adv-pron", "det", "grad"}
    ),
    "persoon": frozenset(
        {
            "persoon",
            "1",
            "2",
            "2v",
            "2b",
            "3",
            "3p",
            "3m",
            "3v",
            "3o",
        }
    ),
    "status": frozenset({"vol", "red", "nadr"}),
    "npagr": frozenset(
        {
            "agr",
            "evon",
            "rest",
            "evz",
            "mv",
            "agr3",
            "evmo",
            "rest3",
            "evf",
        }
    ),
    "lwtype": frozenset({"bep", "onbep"}),
    "vztype": frozenset({"init", "versm", "fin"}),
    "conjtype": frozenset({"neven", "onder"}),
    "spectype": frozenset(
        {
            "afgebr",
            "onverst",
            "vreemd",
            "deeleigen",
            "meta",
            "comment",
            "achter",
            "afk",
            "symb",
            "enof",
        }
    ),
    "rel": LASSY_RELATION_LABELS,
    "cat": LASSY_CAT_LABELS,
    "pt": LASSY_PT_LABELS,
}

Lassy morphosyntactic features and their allowed values.

Wired as closed attributes in the LASSY preset (e.g., ntype, getal, graad, etc.).

LASSY_OPEN_ATTRIBUTES `module-attribute` ¶

LASSY_OPEN_ATTRIBUTES: frozenset[str] = frozenset(
    {
        "word",
        "lemma",
        "postag",
        "id",
        "index",
        "begin",
        "end",
    }
)

Open annotations in Lassy/Alpino corpora.

Open attributes are those whose values are not restricted to a fixed set.

LASSY_SPAN_TAGS `module-attribute` ¶

LASSY_SPAN_TAGS: frozenset[str] = frozenset(
    {"alpino_ds", "node", "sentence", "comments", "comment"}
)

Allowed XML span tag names in Lassy/Alpino corpora.

Wired as allowed span tags in the LASSY preset.

LASSY `module-attribute` ¶

LASSY = CorpusSpec(
    open_attributes=LASSY_OPEN_ATTRIBUTES,
    closed_attributes=LASSY_FEATURE_VALUES,
    allowed_span_tags=LASSY_SPAN_TAGS,
    allowed_relations=LASSY_RELATION_LABELS,
)

LASSY/Alpino corpus specification for Dutch.

A ready-made CorpusSpec for validating BCQL queries against Lassy/Alpino-annotated Dutch corpora. Includes Alpino POS tags, phrasal categories, morphosyntactic features, dependency relations, DTD span tags, and the open-string annotations (word, lemma, postag, plus identifier and position fields).
