Validation
Semantic validation of a parsed BCQL AST against a corpus-specific
CorpusSpec. See the
tagset validation guide for an overview.
Spec
bcql_py.validation.spec
Corpus-specific semantic specification used by :func:bcql_py.validation.validate.
A :class:CorpusSpec describes the surface vocabulary of a particular corpus:
which annotations exist, which annotations are closed-class (with a fixed set of
allowed values), which XML span tags and attributes are available, and whether
alignment or dependency-relation queries are allowed at all. This is a semantic
layer that can be used on top of the "syntactic" AST structure to validate
a query against corpus-specific constraints.
The spec is a frozen Pydantic model; use :meth:CorpusSpec.extend or :meth:CorpusSpec.merge to
compose specs (e.g. to add your own corpus on top of a preset).
CorpusSpec
Bases: BaseModel
Immutable description of a corpus' semantic vocabulary.
All fields default to the most permissive setting ("anything goes") so that a
bare CorpusSpec() is a no-op validator. Narrow the spec by listing the
annotations, tags, and relations your corpus actually supports.
Attributes:
| Name | Type | Description |
|---|---|---|
| open_attributes | frozenset[str] | Annotation names whose value space is unconstrained (e.g. … |
| closed_attributes | dict[str, frozenset[str]] | Annotation names whose values are restricted to a fixed set (e.g. … |
| strict_attributes | bool | When … |
| allowed_span_tags | frozenset[str] \| None | Allowed XML span tag names (e.g. … |
| allowed_span_attributes | dict[str, frozenset[str]] \| None | Per-tag allowed XML attribute values. Missing tags default to no constraint. Use … |
| allow_alignment | bool | If … |
| allowed_alignment_fields | frozenset[str] \| None | Allowed target field names for alignment queries, or … |
| allow_relations | bool | If … |
| allowed_relations | frozenset[str] \| None | Allowed relation type names, or … |
Example::
    from bcql_py import CorpusSpec

    spec = CorpusSpec(open_attributes={"word"}, closed_attributes={"pos": {"NOUN", "VERB"}})
    "pos" in spec.closed_attributes
    # True
    sorted(spec.closed_attributes["pos"])
    # ['NOUN', 'VERB']
description
property
A human-readable description of this spec. Can be overridden in subclasses. Potentially useful for error messages, debugging, or as information to LLM agents.
extend
extend(
*,
open_attributes: Iterable[str] | None = None,
closed_attributes: Mapping[str, Iterable[str]]
| None = None,
allowed_span_tags: Iterable[str] | None = None,
allowed_span_attributes: Mapping[str, Iterable[str]]
| None = None,
allowed_alignment_fields: Iterable[str] | None = None,
allowed_relations: Iterable[str] | None = None,
strict_attributes: bool | None = None,
allow_alignment: bool | None = None,
allow_relations: bool | None = None,
) -> CorpusSpec
Return a new spec with the given additions/overrides merged in.
Similar to :meth:merge, but with a more granular API that allows adding
specific entries without having to construct a full spec.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| open_attributes | Iterable[str] \| None | Extra open-class annotation names to union in. | None |
| closed_attributes | Mapping[str, Iterable[str]] \| None | Extra closed-class attributes; per-key values union. | None |
| allowed_span_tags | Iterable[str] \| None | Extra allowed span tag names. | None |
| allowed_span_attributes | Mapping[str, Iterable[str]] \| None | Extra per-tag attribute names. | None |
| allowed_alignment_fields | Iterable[str] \| None | Extra alignment target fields. | None |
| allowed_relations | Iterable[str] \| None | Extra relation type names. | None |
| strict_attributes | bool \| None | Override the strict-attributes flag. | None |
| allow_alignment | bool \| None | Override the alignment allowed flag. | None |
| allow_relations | bool \| None | Override the relations allowed flag. | None |
Returns:
| Type | Description |
|---|---|
| CorpusSpec | A new :class: … |
Example::
    from bcql_py import CorpusSpec

    base = CorpusSpec(open_attributes={"word"})
    extended = base.extend(open_attributes={"lemma"})
    sorted(extended.open_attributes)
    # ['lemma', 'word']
View source on GitHub: src/bcql_py/validation/spec.py lines 144–220
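The "per-key values union" behaviour described for closed_attributes can be sketched with plain sets. This is a standard-library illustration only: union_closed is a hypothetical helper, not part of bcql_py, and the library's own implementation may differ.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping


def union_closed(
    base: Mapping[str, frozenset[str]],
    extra: Mapping[str, Iterable[str]],
) -> dict[str, frozenset[str]]:
    """Union closed-class values key by key (hypothetical helper)."""
    result = dict(base)
    for name, values in extra.items():
        # Existing keys gain the new values; unseen keys are added as-is.
        result[name] = result.get(name, frozenset()) | frozenset(values)
    return result


base = {"pos": frozenset({"NOUN", "VERB"})}
extended = union_closed(base, {"pos": ["ADJ"], "number": ["Sing", "Plur"]})

print(sorted(extended["pos"]))     # ['ADJ', 'NOUN', 'VERB']
print(sorted(extended["number"]))  # ['Plur', 'Sing']
print(sorted(base["pos"]))         # ['NOUN', 'VERB'] (the input is left untouched)
```

Returning a new mapping rather than mutating the input mirrors the frozen, immutable nature of the spec described earlier.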
merge
Return a new spec combining this spec with other. In case of conflict, other wins (except for boolean flags, see below).
Set-valued fields are unioned. For the nullable set-valued fields
(allowed_span_tags, allowed_alignment_fields, allowed_relations,
and the dict-shaped allowed_span_attributes), None means "no
constraint". A concrete set/dict is treated as more restrictive than
None, so when one side is None and the other lists entries, the
result is the listed entries: None survives only when both sides are
None. This mirrors the boolean rule below: a concrete restriction
always beats "no constraint".
WARNING: For boolean flags, other wins only when it is more restrictive
(False beats True) so that merging in a preset cannot silently
re-enable something the caller disabled.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | CorpusSpec | Another spec to merge into this one. | required |
Returns:
| Type | Description |
|---|---|
| CorpusSpec | A new :class: … |
Example::
    from bcql_py import CorpusSpec

    spec1 = CorpusSpec(open_attributes={"word"}, allow_alignment=True)
    spec2 = CorpusSpec(open_attributes={"lemma"}, closed_attributes={"pos": {"NOUN", "VERB"}}, allow_alignment=False)
    merged = spec1.merge(spec2)
    sorted(merged.open_attributes)
    # ['lemma', 'word']
    "pos" in merged.closed_attributes
    # True
    merged.allow_alignment
    # False
View source on GitHub: src/bcql_py/validation/spec.py lines 222–303
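The two merge rules described above (nullable sets and boolean flags) can be expressed as two small functions. This is a sketch of the documented semantics using only the standard library; merge_nullable and merge_flag are hypothetical names and are not taken from the bcql_py source.

```python
from __future__ import annotations

from typing import Optional


def merge_nullable(
    mine: Optional[frozenset[str]],
    other: Optional[frozenset[str]],
) -> Optional[frozenset[str]]:
    """None means "no constraint"; a concrete set always beats None."""
    if mine is None:
        return other  # also covers the case where both sides are None
    if other is None:
        return mine
    return mine | other  # both concrete: union the entries


def merge_flag(mine: bool, other: bool) -> bool:
    """The more restrictive value wins, so False beats True on either side."""
    return mine and other


print(merge_nullable(None, None))                                    # None
print(sorted(merge_nullable(None, frozenset({"s"}))))                # ['s']
print(sorted(merge_nullable(frozenset({"s"}), frozenset({"p"}))))    # ['p', 's']
print(merge_flag(True, False), merge_flag(False, True))              # False False
```

Under these rules a permissive preset cannot loosen a restriction that the caller has already set, which is the behaviour the warning above describes.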
has_annotation
Return whether name is a known annotation on this spec.
An annotation is considered known when it is listed in either
:attr:open_attributes or :attr:closed_attributes. This method is
independent of :attr:strict_attributes: it only reports membership,
not whether an unknown annotation would raise during validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The annotation name to check. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if name is a known annotation on this spec, False otherwise. |
Example::
    from bcql_py import CorpusSpec

    spec = CorpusSpec(
        open_attributes={"word"},
        closed_attributes={"pos": {"NOUN", "VERB"}},
    )
    spec.has_annotation("word")
    # True
    spec.has_annotation("pos")
    # True
    spec.has_annotation("lemma")
    # False
View source on GitHub: src/bcql_py/validation/spec.py lines 305–333
Validator
bcql_py.validation.validator
A visitor that walks a BCQL AST and checks it against a :class:CorpusSpec.
The traversal uses Pydantic's model_fields introspection to recurse into any
field whose value is a :class:~bcql_py.models.base.BCQLNode, including nested
lists and dict values (used by :class:~bcql_py.models.span.SpanQuery for
attributes).
TODO: only literal string values are checked against closed attribute sets; regex values are skipped for now.
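The field-driven traversal described above can be illustrated without Pydantic. The sketch below uses dataclasses.fields as a stand-in for model_fields; the Node, Token, and Span classes are hypothetical and much simpler than the real BCQL node types.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from typing import Any, Iterator


@dataclass
class Node:
    """Stand-in for the AST node base class (hypothetical)."""


@dataclass
class Token(Node):
    annotation: str
    value: str


@dataclass
class Span(Node):
    tag: str
    attributes: dict[str, Node] = field(default_factory=dict)
    children: list[Node] = field(default_factory=list)


def iter_children(value: Any) -> Iterator[Node]:
    """Yield nodes held by a field value, looking inside lists and dict values."""
    if isinstance(value, Node):
        yield value
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_children(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_children(item)


def walk(node: Node) -> Iterator[Node]:
    """Depth-first, pre-order traversal driven by field introspection."""
    yield node
    for f in fields(node):
        for child in iter_children(getattr(node, f.name)):
            yield from walk(child)


tree = Span(
    tag="s",
    attributes={"lang": Token("lang", "nl")},
    children=[Token("word", "de"), Token("pos", "NOUN")],
)
labels = [n.tag if isinstance(n, Span) else n.annotation for n in walk(tree)]
print(labels)  # ['s', 'lang', 'word', 'pos']
```

Because the walker discovers children through field introspection, a new node type is covered without registering it anywhere, as long as its children are stored in fields, lists, or dict values.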
validate
Validate a parsed BCQL AST against spec, raising on any issue.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ast | BCQLNode | The root :class: … | required |
| spec | CorpusSpec | The :class: … | required |
| fail_fast | bool | When … | True |
Raises:
| Type | Description |
|---|---|
| BCQLValidationError | If one or more validation issues are found. The raised exception's … |
Example::
    from bcql_py import CorpusSpec, parse, validate

    spec = CorpusSpec(
        open_attributes={"word"},
        closed_attributes={"pos": {"NOUN", "VERB"}},
    )
    validate(parse('[pos="NOUN"]'), spec)  # passes silently
    try:
        validate(parse('[pos="ADJ"]'), spec)
    except Exception as exc:
        print(exc.issues[0].kind)
    # invalid_annotation_value
View source on GitHub: src/bcql_py/validation/validator.py lines 420–452
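The pattern of collecting issues and raising a single exception that carries them can be sketched with the standard library alone. Everything here is hypothetical (Issue, SketchValidationError, check_values), and the sketch assumes that fail_fast=True means "stop at the first issue", which the truncated parameter description above does not confirm.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    kind: str
    message: str


class SketchValidationError(Exception):
    """Hypothetical error type that carries every issue found."""

    def __init__(self, issues: list[Issue]) -> None:
        super().__init__("; ".join(i.message for i in issues))
        self.issues = issues


def check_values(pairs, closed, fail_fast=True):
    """Check (annotation, value) pairs against closed value sets."""
    issues: list[Issue] = []
    for name, value in pairs:
        allowed = closed.get(name)
        if allowed is not None and value not in allowed:
            issues.append(
                Issue("invalid_annotation_value", f"{name}={value!r} is not allowed")
            )
            if fail_fast:  # assumption: stop after the first issue
                break
    if issues:
        raise SketchValidationError(issues)


closed = {"pos": frozenset({"NOUN", "VERB"})}


def count_issues(fail_fast: bool) -> int:
    try:
        check_values([("pos", "ADJ"), ("pos", "ADV")], closed, fail_fast=fail_fast)
    except SketchValidationError as exc:
        return len(exc.issues)
    return 0


print(count_issues(True))   # 1
print(count_issues(False))  # 2
```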
Bundled presets
bcql_py.validation.presets.ud
Full Universal Dependencies (UD v2) preset.
- Universal POS tags (:data:UD_POS_TAGS, wired as closed values for the upos annotations).
- Universal morphological features (:data:UD_FEATURE_VALUES), each one a closed attribute (Number, Case, PronType, ...).
- Core universal dependency relation labels (:data:UD_RELATION_LABELS), wired as :class:CorpusSpec.allowed_relations; the relation label is also exposed as the closed deprel annotation for corpora that store it on the token.
- Common CoNLL-U-style open annotations (:data:UD_OPEN_ATTRIBUTES): word, lemma, xpos, feats, misc, plus id, head.
References
- POS: https://universaldependencies.org/u/pos/all.html
- Features: https://universaldependencies.org/u/feat/all.html
- Relations: https://universaldependencies.org/u/dep/all.html
Language-specific POS sub-types and relation subtypes (e.g. nsubj:pass)
are intentionally not included. Extend the preset to add them::
    spec = UD.extend(allowed_relations={"nsubj:pass", "acl:relcl", "obl:agent"})
bcql_py.validation.presets.lassy
Lassy / Alpino preset derived from the alpino_ds DTD.
See the alpino guide, Figures 1.1 and 1.2 on pages 13-14.
Based on this, the preset describes:
- The full Alpino relation inventory (rel), also exposed as :data:LASSY_RELATION_LABELS and integrated as :class:CorpusSpec.allowed_relations.
- Phrasal categories (cat), part-of-speech tags (pt), and morphosyntactic features (ntype, getal, graad, ...).
- Open-string annotations (word, lemma, postag, plus identifier / position fields).
- The DTD element names as :class:CorpusSpec.allowed_span_tags, for corpora that expose alpino_ds/node as XML spans.
Note that "pos" and "root" are excluded, as per the documentation:
The attributes pos and root represent the POSTAG and ROOT values used by Alpino. They are not documented separately here and are not an official part of the annotation.
LASSY_FEATURE_VALUES
module-attribute
LASSY_FEATURE_VALUES: dict[str, frozenset[str]] = {
    "dial": frozenset({"dial"}),
    "ntype": frozenset({"soort", "eigen"}),
    "getal": frozenset({"getal", "ev", "mv"}),
    "graad": frozenset({"basis", "comp", "sup", "dim"}),
    "genus": frozenset(
        {"genus", "zijd", "masc", "fem", "onz"}
    ),
    "naamval": frozenset(
        {"stan", "nomin", "obl", "bijz", "gen", "dat"}
    ),
    "positie": frozenset(
        {"prenom", "nom", "postnom", "vrij"}
    ),
    "buiging": frozenset({"zonder", "met-e", "met-s"}),
    "getal-n": frozenset({"zonder-n", "mv-n"}),
    "wvorm": frozenset({"pv", "inf", "od", "vd"}),
    "pvtijd": frozenset({"tgw", "verl", "conj"}),
    "pvagr": frozenset({"ev", "mv", "met-t"}),
    "numtype": frozenset({"hoofd", "rang"}),
    "vwtype": frozenset(
        {
            "pr",
            "pers",
            "refl",
            "recip",
            "bez",
            "vb",
            "vrag",
            "betr",
            "excl",
            "aanw",
            "onbep",
        }
    ),
    "pdtype": frozenset(
        {"pron", "adv-pron", "det", "grad"}
    ),
    "persoon": frozenset(
        {
            "persoon",
            "1",
            "2",
            "2v",
            "2b",
            "3",
            "3p",
            "3m",
            "3v",
            "3o",
        }
    ),
    "status": frozenset({"vol", "red", "nadr"}),
    "npagr": frozenset(
        {
            "agr",
            "evon",
            "rest",
            "evz",
            "mv",
            "agr3",
            "evmo",
            "rest3",
            "evf",
        }
    ),
    "lwtype": frozenset({"bep", "onbep"}),
    "vztype": frozenset({"init", "versm", "fin"}),
    "conjtype": frozenset({"neven", "onder"}),
    "spectype": frozenset(
        {
            "afgebr",
            "onverst",
            "vreemd",
            "deeleigen",
            "meta",
            "comment",
            "achter",
            "afk",
            "symb",
            "enof",
        }
    ),
    "rel": LASSY_RELATION_LABELS,
    "cat": LASSY_CAT_LABELS,
    "pt": LASSY_PT_LABELS,
}
postag="VNW(aanw,det,stan,nom,met-e,mv-n)"
pt="vnw" vwtype="aanw" pdtype="det" naamval="stan" positie="nom" buiging="met-e" getal-n="mv-n"
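The relationship between the compound postag string and the individual feature values shown above can be illustrated with a small parser. This is a sketch only: split_postag is a hypothetical helper, not part of bcql_py, and it deliberately does not try to assign each value to an attribute name, since several values (for example mv or onbep) occur under more than one attribute in LASSY_FEATURE_VALUES.

```python
from __future__ import annotations


def split_postag(postag: str) -> tuple[str, list[str]]:
    """Split 'VNW(aanw,det,...)' into a lower-cased pt and its feature values."""
    head, sep, rest = postag.partition("(")
    if not sep:
        # No parenthesised feature list, e.g. a bare tag.
        return head.lower(), []
    body = rest.rstrip(")")
    features = [part for part in body.split(",") if part]
    return head.lower(), features


pt, values = split_postag("VNW(aanw,det,stan,nom,met-e,mv-n)")
print(pt)      # vnw
print(values)  # ['aanw', 'det', 'stan', 'nom', 'met-e', 'mv-n']
```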