Validation

Semantic validation of a parsed BCQL AST against a corpus-specific CorpusSpec. See the tagset validation guide for an overview.

Spec

bcql_py.validation.spec

Corpus-specific semantic specification used by :func:bcql_py.validation.validate.

A :class:CorpusSpec describes the surface vocabulary of a particular corpus: which annotations exist, which annotations are closed-class (with a fixed set of allowed values), which XML span tags and attributes are available, and whether alignment or dependency-relation queries are allowed at all. This is a semantic layer that can be used on top of the "syntactic" AST structure to validate a query against corpus-specific constraints.

The spec is a frozen Pydantic model; use :meth:CorpusSpec.extend or :meth:CorpusSpec.merge to compose specs (e.g. to add your own corpus on top of a preset).

CorpusSpec

Bases: BaseModel

Immutable description of a corpus' semantic vocabulary.

All fields default to the most permissive setting ("anything goes") so that a bare CorpusSpec() is a no-op validator. Narrow the spec by listing the annotations, tags, and relations your corpus actually supports.

Attributes:

Name Type Description
open_attributes frozenset[str]

Annotation names whose value space is unconstrained (e.g. word, lemma).

closed_attributes dict[str, frozenset[str]]

Annotation names whose values are restricted to a fixed set (e.g. pos -> {"NOUN", "VERB", ...}).

strict_attributes bool

When True, any annotation not listed in open_attributes or closed_attributes is an error. When False (default), unknown annotations are accepted.

allowed_span_tags frozenset[str] | None

Allowed XML span tag names (e.g. s, p, ne), or None to allow any tag.

allowed_span_attributes dict[str, frozenset[str]] | None

Per-tag allowed XML attribute names. Tags missing from the mapping are unconstrained. Use None to allow any attribute on any tag.

allow_alignment bool

If False, any use of the alignment (==>) operator raises a validation error.

allowed_alignment_fields frozenset[str] | None

Allowed target field names for alignment queries, or None to allow any.

allow_relations bool

If False, any relation operator (-type-> or ^-type->) raises a validation error.

allowed_relations frozenset[str] | None

Allowed relation type names, or None to allow any. An empty set means "no named relations allowed"; if that is the intent, set allow_relations=False instead.

Example::

spec = CorpusSpec(open_attributes={"word"}, closed_attributes={"pos": {"NOUN", "VERB"}})
"pos" in spec.closed_attributes
# True
sorted(spec.closed_attributes["pos"])
# ['NOUN', 'VERB']
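
The interplay between open_attributes, closed_attributes, and strict_attributes can be restated in plain Python. This is a minimal sketch of the rules described above, not the library's code: check_annotation is a hypothetical helper, and the "unknown_annotation" label is invented for illustration (only "invalid_annotation_value" appears in this documentation).

```python
from __future__ import annotations


def check_annotation(
    name: str,
    value: str,
    open_attributes: frozenset[str],
    closed_attributes: dict[str, frozenset[str]],
    strict_attributes: bool = False,
) -> str | None:
    """Return an issue label, or None when the name/value pair is acceptable."""
    if name in closed_attributes:
        # Closed class: the literal value must be one of the listed values.
        if value not in closed_attributes[name]:
            return "invalid_annotation_value"
        return None
    if name in open_attributes:
        # Open class: any value is accepted.
        return None
    # Unknown annotation: only an error in strict mode.
    # "unknown_annotation" is a hypothetical label, not taken from the library.
    return "unknown_annotation" if strict_attributes else None


open_attrs = frozenset({"word"})
closed_attrs = {"pos": frozenset({"NOUN", "VERB"})}

print(check_annotation("pos", "NOUN", open_attrs, closed_attrs))
# None
print(check_annotation("pos", "ADJ", open_attrs, closed_attrs))
# invalid_annotation_value
print(check_annotation("lemma", "cat", open_attrs, closed_attrs))
# None
print(check_annotation("lemma", "cat", open_attrs, closed_attrs, strict_attributes=True))
# unknown_annotation
```

With every collection empty and strict_attributes left at False, the helper accepts everything, which matches the "bare CorpusSpec() is a no-op validator" behaviour described above.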

description property

description: str

A human-readable description of this spec. Can be overridden in subclasses. Potentially useful for error messages, debugging, or as information to LLM agents.

extend

extend(
    *,
    open_attributes: Iterable[str] | None = None,
    closed_attributes: Mapping[str, Iterable[str]]
    | None = None,
    allowed_span_tags: Iterable[str] | None = None,
    allowed_span_attributes: Mapping[str, Iterable[str]]
    | None = None,
    allowed_alignment_fields: Iterable[str] | None = None,
    allowed_relations: Iterable[str] | None = None,
    strict_attributes: bool | None = None,
    allow_alignment: bool | None = None,
    allow_relations: bool | None = None,
) -> CorpusSpec

Return a new spec with the given additions/overrides merged in. Similar to :meth:merge, but with a more granular API that allows adding specific entries without having to construct a full spec.

Parameters:

Name Type Description Default
open_attributes Iterable[str] | None

Extra open-class annotation names to union in.

None
closed_attributes Mapping[str, Iterable[str]] | None

Extra closed-class attributes; per-key values union.

None
allowed_span_tags Iterable[str] | None

Extra allowed span tag names.

None
allowed_span_attributes Mapping[str, Iterable[str]] | None

Extra per-tag attribute names.

None
allowed_alignment_fields Iterable[str] | None

Extra alignment target fields.

None
allowed_relations Iterable[str] | None

Extra relation type names.

None
strict_attributes bool | None

Override the strict-attributes flag.

None
allow_alignment bool | None

Override the alignment allowed flag.

None
allow_relations bool | None

Override the relations allowed flag.

None

Returns:

Type Description
CorpusSpec

A new :class:CorpusSpec; the receiver is not modified.

Example::

base = CorpusSpec(open_attributes={"word"})
extended = base.extend(open_attributes={"lemma"})
sorted(extended.open_attributes)
# ['lemma', 'word']
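
For closed_attributes, the parameter table says that per-key values are unioned. A minimal sketch of that behaviour in plain Python follows; union_closed is a hypothetical helper used only for illustration and is not part of bcql_py.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping


def union_closed(
    base: Mapping[str, frozenset[str]],
    extra: Mapping[str, Iterable[str]],
) -> dict[str, frozenset[str]]:
    """Union two closed-attribute mappings key by key without mutating either."""
    merged = dict(base)
    for name, values in extra.items():
        # Existing keys gain the extra values; new keys are added as-is.
        merged[name] = merged.get(name, frozenset()) | frozenset(values)
    return merged


base = {"pos": frozenset({"NOUN", "VERB"})}
extended = union_closed(base, {"pos": ["ADJ"], "Number": ["Sing", "Plur"]})

print(sorted(extended["pos"]))
# ['ADJ', 'NOUN', 'VERB']
print(sorted(extended["Number"]))
# ['Plur', 'Sing']
print(sorted(base["pos"]))  # the input mapping is left untouched
# ['NOUN', 'VERB']
```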

merge

merge(other: CorpusSpec) -> CorpusSpec

Return a new spec combining this spec with other. In case of conflict, other wins (except for boolean flags, see below).

Set-valued fields are unioned. For the nullable set-valued fields (allowed_span_tags, allowed_alignment_fields, allowed_relations, and the dict-shaped allowed_span_attributes), None means "no constraint". A concrete set/dict is treated as more restrictive than None, so when one side is None and the other lists entries, the result is the listed entries: None survives only when both sides are None. This mirrors the boolean rule below: a concrete restriction always beats "no constraint".

WARNING: For boolean flags, other wins only when it is more restrictive (False beats True) so that merging in a preset cannot silently re-enable something the caller disabled.

Parameters:

Name Type Description Default
other CorpusSpec

Another spec to merge into this one.

required

Returns:

Type Description
CorpusSpec

A new :class:CorpusSpec representing the union.

Example::

spec1 = CorpusSpec(open_attributes={"word"}, allow_alignment=True)
spec2 = CorpusSpec(open_attributes={"lemma"}, closed_attributes={"pos": {"NOUN", "VERB"}}, allow_alignment=False)
merged = spec1.merge(spec2)
sorted(merged.open_attributes)
# ['lemma', 'word']
"pos" in merged.closed_attributes
# True
merged.allow_alignment
# False
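
The two merge rules described above (nullable sets and boolean flags) can be written down as small pure functions. This is a sketch, not the library's implementation: merge_nullable and merge_flag are hypothetical names, and it assumes that two concrete sets are unioned like the other set-valued fields.

```python
from __future__ import annotations


def merge_nullable(
    mine: frozenset[str] | None,
    other: frozenset[str] | None,
) -> frozenset[str] | None:
    """None means "no constraint"; a concrete set always beats None."""
    if mine is None:
        return other  # also covers the case where both sides are None
    if other is None:
        return mine
    return mine | other  # assumption: two concrete sets are unioned


def merge_flag(mine: bool, other: bool) -> bool:
    """The more restrictive value wins, so False beats True on either side."""
    return mine and other


print(merge_nullable(None, None))
# None
print(sorted(merge_nullable(None, frozenset({"s", "p"}))))
# ['p', 's']
print(sorted(merge_nullable(frozenset({"s"}), frozenset({"ne"}))))
# ['ne', 's']
print(merge_flag(True, False), merge_flag(False, True), merge_flag(True, True))
# False False True
```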

has_annotation

has_annotation(name: str) -> bool

Return whether name is a known annotation on this spec.

An annotation is considered known when it is listed in either :attr:open_attributes or :attr:closed_attributes. This method is independent of :attr:strict_attributes: it only reports membership, not whether an unknown annotation would raise during validation.

Parameters:

Name Type Description Default
name str

The annotation name to check.

required

Returns:

Type Description
bool

True if name is either an open or closed attribute on this spec, False otherwise.

Example::

spec = CorpusSpec(
    open_attributes={"word"},
    closed_attributes={"pos": {"NOUN", "VERB"}},
)
spec.has_annotation("word")
# True
spec.has_annotation("pos")
# True
spec.has_annotation("lemma")
# False

Validator

bcql_py.validation.validator

Visitor that walks a BCQL AST and checks it against a :class:CorpusSpec.

The traversal uses Pydantic's model_fields introspection to recurse into any field whose value is a :class:~bcql_py.models.base.BCQLNode, including nested lists and dict values (used by :class:~bcql_py.models.span.SpanQuery for attributes).
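
The traversal strategy described above can be illustrated without any dependencies. The sketch below uses dataclasses.fields as a standard-library stand-in for Pydantic's model_fields; Node, Token, Span, and iter_nodes are hypothetical names and do not mirror the real bcql_py models.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from typing import Any, Iterator


@dataclass
class Node:
    """Stand-in for an AST node base class (hypothetical)."""


@dataclass
class Token(Node):
    annotation: str
    value: str


@dataclass
class Span(Node):
    tag: str
    attributes: dict[str, Node] = field(default_factory=dict)
    body: list[Node] = field(default_factory=list)


def iter_nodes(value: Any) -> Iterator[Node]:
    """Yield every Node reachable from value, parents before children."""
    if isinstance(value, Node):
        yield value
        # Introspect the declared fields instead of hard-coding child names.
        for f in fields(value):
            yield from iter_nodes(getattr(value, f.name))
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_nodes(item)
    elif isinstance(value, dict):
        # Only dict values are searched; keys are plain strings here.
        for item in value.values():
            yield from iter_nodes(item)


tree = Span(
    tag="s",
    attributes={"type": Token("type", "q")},
    body=[Token("word", "the"), Token("pos", "NOUN")],
)
print([type(node).__name__ for node in iter_nodes(tree)])
# ['Span', 'Token', 'Token', 'Token']
```

Introspecting fields means a new node type is covered automatically, at the cost of visiting every field, including ones that can never hold a node.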

TODO: only literal string values are checked against closed attribute sets; regex values are skipped for now.

validate

validate(
    ast: BCQLNode,
    spec: CorpusSpec,
    *,
    fail_fast: bool = True,
)

Validate a parsed BCQL AST against spec, raising on any issue.

Parameters:

Name Type Description Default
ast BCQLNode

The root :class:~bcql_py.models.base.BCQLNode returned by :func:bcql_py.parse.

required
spec CorpusSpec

The :class:CorpusSpec describing what the corpus allows.

required
fail_fast bool

When True (default), raise as soon as the first issue is found. When False, collect every issue and raise once at the end so callers can report them all together.

True

Raises:

Type Description
BCQLValidationError

If one or more validation issues are found. The raised exception's issues attribute holds the full list.

Example::

from bcql_py import CorpusSpec, parse, validate
spec = CorpusSpec(
    open_attributes={"word"},
    closed_attributes={"pos": {"NOUN", "VERB"}},
)
validate(parse('[pos="NOUN"]'), spec)  # passes silently
try:
    validate(parse('[pos="ADJ"]'), spec)
except Exception as exc:
    print(exc.issues[0].kind)
# invalid_annotation_value
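
The difference between fail_fast=True and fail_fast=False is a general error-collection pattern. The following sketch shows it in isolation; SketchValidationError, run_checks, and check_pos are hypothetical names, and the issue strings are invented rather than taken from bcql_py.

```python
from __future__ import annotations

from typing import Callable, Iterable


class SketchValidationError(Exception):
    """Hypothetical stand-in for an error that carries a list of issues."""

    def __init__(self, issues: list[str]) -> None:
        super().__init__(f"{len(issues)} validation issue(s)")
        self.issues = issues


def run_checks(
    items: Iterable[str],
    check: Callable[[str], str | None],
    *,
    fail_fast: bool = True,
) -> None:
    """Apply check to each item; raise on the first issue or on all of them."""
    issues: list[str] = []
    for item in items:
        issue = check(item)
        if issue is None:
            continue
        issues.append(issue)
        if fail_fast:
            # Stop at the first problem; the list holds exactly one issue.
            raise SketchValidationError(issues)
    if issues:
        # Collect-all mode: report everything in one exception.
        raise SketchValidationError(issues)


allowed = {"NOUN", "VERB"}


def check_pos(tag: str) -> str | None:
    return None if tag in allowed else f"invalid value {tag!r}"


for mode in (True, False):
    try:
        run_checks(["NOUN", "ADJ", "ADV"], check_pos, fail_fast=mode)
    except SketchValidationError as exc:
        print(mode, exc.issues)
# True ["invalid value 'ADJ'"]
# False ["invalid value 'ADJ'", "invalid value 'ADV'"]
```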

Bundled presets

bcql_py.validation.presets.ud

Full Universal Dependencies (UD v2) preset.

  • Universal POS tags (:data:UD_POS_TAGS, wired as closed values for the upos annotation).
  • Universal morphological features (:data:UD_FEATURE_VALUES), each one a closed attribute (Number, Case, PronType, ...).
  • Core universal dependency relation labels (:data:UD_RELATION_LABELS), wired as :class:CorpusSpec.allowed_relations; the relation label is also exposed as the closed deprel annotation for corpora that store it on the token.
  • Common CoNLL-U-style open annotations (:data:UD_OPEN_ATTRIBUTES): word, lemma, xpos, feats, misc, plus id, head.

References

  • POS: https://universaldependencies.org/u/pos/all.html
  • Features: https://universaldependencies.org/u/feat/all.html
  • Relations: https://universaldependencies.org/u/dep/all.html

Language-specific POS sub-types and relation subtypes (e.g. nsubj:pass) are intentionally not included. Extend the preset to add them::

spec = UD.extend(allowed_relations={"nsubj:pass", "acl:relcl", "obl:agent"})

bcql_py.validation.presets.lassy

Lassy / Alpino preset derived from the alpino_ds DTD.

See the Alpino guide, Figures 1.1 and 1.2 on pages 13-14.

Based on these figures, this preset describes:

  • The full Alpino relation inventory (rel), also exposed as :data:LASSY_RELATION_LABELS and integrated as :class:CorpusSpec.allowed_relations.
  • Phrasal categories (cat), part-of-speech tags (pt), and morphosyntactic features (ntype, getal, graad, ...).
  • Open-string annotations (word, lemma, postag, plus identifier / position fields).
  • The DTD element names as :class:CorpusSpec.allowed_span_tags, for corpora that expose alpino_ds / node as XML spans.

Note that "pos" and "root" are excluded, as per the documentation:

The attributes pos and root represent the POSTAG and ROOT values used by Alpino. They are not documented separately here, and are not an official part of the annotation.

LASSY_FEATURE_VALUES module-attribute

LASSY_FEATURE_VALUES: dict[str, frozenset[str]] = {
    "dial": frozenset({"dial"}),
    "ntype": frozenset({"soort", "eigen"}),
    "getal": frozenset({"getal", "ev", "mv"}),
    "graad": frozenset({"basis", "comp", "sup", "dim"}),
    "genus": frozenset(
        {"genus", "zijd", "masc", "fem", "onz"}
    ),
    "naamval": frozenset(
        {"stan", "nomin", "obl", "bijz", "gen", "dat"}
    ),
    "positie": frozenset(
        {"prenom", "nom", "postnom", "vrij"}
    ),
    "buiging": frozenset({"zonder", "met-e", "met-s"}),
    "getal-n": frozenset({"zonder-n", "mv-n"}),
    "wvorm": frozenset({"pv", "inf", "od", "vd"}),
    "pvtijd": frozenset({"tgw", "verl", "conj"}),
    "pvagr": frozenset({"ev", "mv", "met-t"}),
    "numtype": frozenset({"hoofd", "rang"}),
    "vwtype": frozenset(
        {
            "pr",
            "pers",
            "refl",
            "recip",
            "bez",
            "vb",
            "vrag",
            "betr",
            "excl",
            "aanw",
            "onbep",
        }
    ),
    "pdtype": frozenset(
        {"pron", "adv-pron", "det", "grad"}
    ),
    "persoon": frozenset(
        {
            "persoon",
            "1",
            "2",
            "2v",
            "2b",
            "3",
            "3p",
            "3m",
            "3v",
            "3o",
        }
    ),
    "status": frozenset({"vol", "red", "nadr"}),
    "npagr": frozenset(
        {
            "agr",
            "evon",
            "rest",
            "evz",
            "mv",
            "agr3",
            "evmo",
            "rest3",
            "evf",
        }
    ),
    "lwtype": frozenset({"bep", "onbep"}),
    "vztype": frozenset({"init", "versm", "fin"}),
    "conjtype": frozenset({"neven", "onder"}),
    "spectype": frozenset(
        {
            "afgebr",
            "onverst",
            "vreemd",
            "deeleigen",
            "meta",
            "comment",
            "achter",
            "afk",
            "symb",
            "enof",
        }
    ),
    "rel": LASSY_RELATION_LABELS,
    "cat": LASSY_CAT_LABELS,
    "pt": LASSY_PT_LABELS,
}

postag="VNW(aanw,det,stan,nom,met-e,mv-n)"

pt="vnw" vwtype="aanw" pdtype="det" naamval="stan" positie="nom" buiging="met-e" getal-n="mv-n"
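
Reading the two lines above as the same token expressed two ways (a compact postag string and its per-feature form), the correspondence can be sketched with a value lookup against the feature table. This is an illustration under assumptions, not library code: FEATURES is a small excerpt of the table above, split_postag and candidate_features are hypothetical helpers, and lower-casing the tag head to obtain pt is inferred from this one example only.

```python
from __future__ import annotations

# Excerpt of the feature table above, just enough for this example.
FEATURES: dict[str, frozenset[str]] = {
    "vwtype": frozenset({"aanw", "onbep", "pers"}),
    "pdtype": frozenset({"pron", "adv-pron", "det", "grad"}),
    "naamval": frozenset({"stan", "nomin", "obl", "bijz", "gen", "dat"}),
    "positie": frozenset({"prenom", "nom", "postnom", "vrij"}),
    "buiging": frozenset({"zonder", "met-e", "met-s"}),
    "getal-n": frozenset({"zonder-n", "mv-n"}),
    "lwtype": frozenset({"bep", "onbep"}),
}


def split_postag(postag: str) -> tuple[str, list[str]]:
    """Split 'VNW(a,b,c)' into ('vnw', ['a', 'b', 'c'])."""
    head, _, rest = postag.partition("(")
    values = [v for v in rest.rstrip(")").split(",") if v]
    # Assumption: pt is the lower-cased head of the postag string.
    return head.lower(), values


def candidate_features(value: str) -> list[str]:
    """Return every feature whose closed value set contains value."""
    return sorted(name for name, allowed in FEATURES.items() if value in allowed)


pt, values = split_postag("VNW(aanw,det,stan,nom,met-e,mv-n)")
print(pt)
# vnw
for value in values:
    print(value, candidate_features(value))
# aanw ['vwtype']
# det ['pdtype']
# stan ['naamval']
# nom ['positie']
# met-e ['buiging']
# mv-n ['getal-n']
print(candidate_features("onbep"))
# ['lwtype', 'vwtype']
```

The last call shows why the helper returns candidates rather than a single feature: some values occur in more than one closed set, so a lookup by value alone is not always unambiguous.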