Validation
Semantic validation of a parsed BCQL AST against a corpus-specific
CorpusSpec. See the
tagset validation guide for an overview.
Spec
bcql_py.validation.spec
Corpus-specific semantic specification used by validate().
A CorpusSpec describes the surface vocabulary of a particular corpus: which annotations exist, which annotations are closed-class (with a fixed set of allowed values), which XML span tags and attributes are available, and whether alignment or dependency-relation queries are allowed at all. This is a semantic layer that can be used on top of the "syntactic" AST structure to validate a query against corpus-specific constraints.
The spec is a frozen Pydantic model; use CorpusSpec.extend() or CorpusSpec.merge() to compose specs (e.g. to add your own corpus on top of a preset).
CorpusSpec
Bases: BaseModel
Immutable description of a corpus' semantic vocabulary.
All fields default to the most permissive setting ("anything goes") so that a
bare CorpusSpec() is a no-op validator. Narrow the spec by listing the
annotations, tags, and relations your corpus actually supports.
Attributes:
| Name | Type | Description |
|---|---|---|
| open_attributes | frozenset[str] | Annotation names whose value space is unconstrained (e.g. … |
| closed_attributes | dict[str, frozenset[str]] | Annotation names whose values are restricted to a fixed set (e.g. … |
| strict_attributes | bool | When … |
| allowed_span_tags | frozenset[str] \| None | Allowed XML span tag names (e.g. … |
| allowed_span_attributes | dict[str, frozenset[str]] \| None | Per-tag allowed XML attribute values. Missing tags default to no constraint. Use … |
| allow_alignment | bool | If … |
| allowed_alignment_fields | frozenset[str] \| None | Allowed target field names for alignment queries, or … |
| allow_relations | bool | If … |
| allowed_relations | frozenset[str] \| None | Allowed relation type names, or … |
Example::
spec = CorpusSpec(open_attributes={"word"}, closed_attributes={"pos": {"NOUN", "VERB"}})
"pos" in spec.closed_attributes
# True
sorted(spec.closed_attributes["pos"])
# ['NOUN', 'VERB']
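The open/closed distinction can be pictured with a small stand-alone sketch. It does not import bcql_py: `check_value` and its return strings are hypothetical names chosen for illustration, and the handling of unknown annotations in strict mode is an assumption based on the field name, not a statement about the library.

```python
from __future__ import annotations


def check_value(
    name: str,
    value: str,
    open_attributes: frozenset[str],
    closed_attributes: dict[str, frozenset[str]],
    strict: bool = False,
) -> str:
    """Classify one annotation/value pair (illustrative helper, not bcql_py API)."""
    if name in closed_attributes:
        # Closed-class: the value must be one of the listed values.
        return "ok" if value in closed_attributes[name] else "invalid_value"
    if name in open_attributes:
        # Open-class: any value is accepted.
        return "ok"
    # Unknown annotation: ASSUMED to be rejected only in strict mode.
    return "unknown_annotation" if strict else "ok"


open_attrs = frozenset({"word"})
closed_attrs = {"pos": frozenset({"NOUN", "VERB"})}

print(check_value("pos", "NOUN", open_attrs, closed_attrs))               # ok
print(check_value("pos", "ADJ", open_attrs, closed_attrs))                # invalid_value
print(check_value("word", "anything", open_attrs, closed_attrs))          # ok
print(check_value("lemma", "x", open_attrs, closed_attrs, strict=True))   # unknown_annotation
```

With both collections empty and `strict=False`, every pair comes back as "ok", which matches the idea of a bare spec acting as a no-op.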
description
property
A human-readable description of this spec. Can be overridden in subclasses. Potentially useful for error messages, debugging, or as information to LLM agents.
extend
extend(
    *,
    open_attributes: Iterable[str] | None = None,
    closed_attributes: Mapping[str, Iterable[str]] | None = None,
    allowed_span_tags: Iterable[str] | None = None,
    allowed_span_attributes: Mapping[str, Iterable[str]] | None = None,
    allowed_alignment_fields: Iterable[str] | None = None,
    allowed_relations: Iterable[str] | None = None,
    strict_attributes: bool | None = None,
    allow_alignment: bool | None = None,
    allow_relations: bool | None = None,
) -> CorpusSpec
Return a new spec with the given additions/overrides merged in. Similar to merge(), but with a more granular API that allows adding specific entries without having to construct a full spec.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| open_attributes | Iterable[str] \| None | Extra open-class annotation names to union in. | None |
| closed_attributes | Mapping[str, Iterable[str]] \| None | Extra closed-class attributes; per-key values union. | None |
| allowed_span_tags | Iterable[str] \| None | Extra allowed span tag names. | None |
| allowed_span_attributes | Mapping[str, Iterable[str]] \| None | Extra per-tag attribute names. | None |
| allowed_alignment_fields | Iterable[str] \| None | Extra alignment target fields. | None |
| allowed_relations | Iterable[str] \| None | Extra relation type names. | None |
| strict_attributes | bool \| None | Override the strict-attributes flag. | None |
| allow_alignment | bool \| None | Override the alignment allowed flag. | None |
| allow_relations | bool \| None | Override the relations allowed flag. | None |
Returns:
| Type | Description |
|---|---|
| CorpusSpec | A new CorpusSpec; the receiver is not modified. |
Example::
base = CorpusSpec(open_attributes={"word"})
extended = base.extend(open_attributes={"lemma"})
sorted(extended.open_attributes)
# ['lemma', 'word']
View source on GitHub: src/bcql_py/validation/spec.py lines 145–221
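The "per-key values union" behaviour described for closed_attributes can be sketched with plain dictionaries. This is a minimal stand-alone illustration, not the library's code; `union_closed` is a hypothetical helper name.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping


def union_closed(
    base: Mapping[str, frozenset[str]],
    extra: Mapping[str, Iterable[str]] | None,
) -> dict[str, frozenset[str]]:
    """Return a new mapping; values for keys present on both sides are unioned."""
    result = dict(base)  # shallow copy so the input mapping is left untouched
    for name, values in (extra or {}).items():
        result[name] = result.get(name, frozenset()) | frozenset(values)
    return result


base = {"pos": frozenset({"NOUN", "VERB"})}
extended = union_closed(base, {"pos": ["ADJ"], "Number": ["Sing", "Plur"]})

print(sorted(extended["pos"]))     # ['ADJ', 'NOUN', 'VERB']
print(sorted(extended["Number"]))  # ['Plur', 'Sing']
print(sorted(base["pos"]))         # ['NOUN', 'VERB']
```

Returning a fresh dictionary mirrors the documented contract that the receiver is not modified.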
merge
Return a new spec combining this spec with other. In case of conflict, other wins (except for boolean flags, see below).
Set-valued fields are unioned. For the nullable set-valued fields
(allowed_span_tags, allowed_alignment_fields, allowed_relations,
and the dict-shaped allowed_span_attributes), None means "no
constraint". A concrete set/dict is treated as more restrictive than
None, so when one side is None and the other lists entries, the
result is the listed entries: None survives only when both sides are
None. This mirrors the boolean rule below: a concrete restriction
always beats "no constraint".
WARNING: For boolean flags, other wins only when it is more restrictive
(False beats True) so that merging in a preset cannot silently
re-enable something the caller disabled.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | CorpusSpec | Another spec to merge into this one. | required |
Returns:
| Type | Description |
|---|---|
| CorpusSpec | A new CorpusSpec representing the union. |
Example::
spec1 = CorpusSpec(open_attributes={"word"}, allow_alignment=True)
spec2 = CorpusSpec(open_attributes={"lemma"}, closed_attributes={"pos": {"NOUN", "VERB"}}, allow_alignment=False)
merged = spec1.merge(spec2)
sorted(merged.open_attributes)
# ['lemma', 'word']
"pos" in merged.closed_attributes
# True
merged.allow_alignment
# False
View source on GitHub: src/bcql_py/validation/spec.py lines 223–304
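The merge rules for nullable sets and for the allow_* flags can be restated as two tiny functions. This is a sketch of the rules as described in this section, written without bcql_py; `merge_nullable` and `merge_allow_flag` are hypothetical names.

```python
from __future__ import annotations

from typing import Optional


def merge_nullable(
    mine: Optional[frozenset[str]],
    other: Optional[frozenset[str]],
) -> Optional[frozenset[str]]:
    """None means "no constraint"; a concrete set always beats None."""
    if mine is None:
        return other  # also covers the both-None case
    if other is None:
        return mine
    return mine | other  # both concrete: union


def merge_allow_flag(mine: bool, other: bool) -> bool:
    """False (more restrictive) wins, so a merge cannot re-enable a feature."""
    return mine and other


print(merge_nullable(None, None))                                       # None
print(sorted(merge_nullable(None, frozenset({"s"}))))                   # ['s']
print(sorted(merge_nullable(frozenset({"s"}), frozenset({"p"}))))       # ['p', 's']
print(merge_allow_flag(True, False))                                    # False
print(merge_allow_flag(False, True))                                    # False
```

Both functions are symmetric, so for these fields the result does not depend on which spec is the receiver and which is `other`.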
has_annotation
Return whether name is a known annotation on this spec.
An annotation is considered known when it is listed in either
open_attributes or closed_attributes. This method is
independent of strict_attributes: it only reports membership,
not whether an unknown annotation would raise during validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The annotation name to check. | required |
Returns:
| Type | Description |
|---|---|
| bool | Whether name is a known annotation on this spec. |
Example::
spec = CorpusSpec(
    open_attributes={"word"},
    closed_attributes={"pos": {"NOUN", "VERB"}},
)
spec.has_annotation("word")
# True
spec.has_annotation("pos")
# True
spec.has_annotation("lemma")
# False
View source on GitHub: src/bcql_py/validation/spec.py lines 306–334
Validator
bcql_py.validation.validator
Visitor that walks a BCQL AST and checks it against a CorpusSpec.
The traversal uses Pydantic's model_fields introspection to recurse into any
field whose value is a BCQLNode, including nested
lists and dict values (used by SpanQuery for
attributes).
TODO: only literal string values are checked against closed attribute sets; regex values are skipped for now.
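The traversal idea (recurse into every field that holds a node, including nodes nested in lists and dict values) can be sketched with the standard library. The example below uses dataclasses and `dataclasses.fields` as a stand-in for Pydantic's model_fields; `Node`, `Token`, `Span`, and `walk` are hypothetical names and do not reflect the real bcql_py classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from typing import Any, Iterator


@dataclass
class Node:
    """Stand-in for BCQLNode (hypothetical)."""


@dataclass
class Token(Node):
    annotation: str
    value: str


@dataclass
class Span(Node):
    tag: str
    attributes: dict[str, Node] = field(default_factory=dict)
    children: list[Node] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node itself, then every node reachable through its fields."""
    yield node
    for f in fields(node):
        yield from _walk_value(getattr(node, f.name))


def _walk_value(value: Any) -> Iterator[Node]:
    if isinstance(value, Node):
        yield from walk(value)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from _walk_value(item)
    elif isinstance(value, dict):
        for item in value.values():  # only dict values can hold nodes here
            yield from _walk_value(item)
    # Plain strings, numbers, etc. are ignored.


tree = Span(
    tag="s",
    attributes={"lang": Token("word", "nl")},
    children=[Token("pos", "NOUN"), Span(tag="ne", children=[Token("lemma", "x")])],
)
visited = [type(n).__name__ for n in walk(tree)]
print(visited)  # ['Span', 'Token', 'Token', 'Span', 'Token']
```

Because the walk is driven by field introspection rather than by per-class visit methods, a newly added node type is traversed without any change to the walker.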
validate
Validate a parsed BCQL AST against spec, raising on any issue.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ast | BCQLNode | | required |
| spec | CorpusSpec | The CorpusSpec describing what the corpus allows. | required |
| fail_fast | bool | When … | True |
Raises:
| Type | Description |
|---|---|
| BCQLValidationError | If one or more validation issues are found. The raised exception's … |
Example::
from bcql_py import CorpusSpec, parse, validate
spec = CorpusSpec(
    open_attributes={"word"},
    closed_attributes={"pos": {"NOUN", "VERB"}},
)
validate(parse('[pos="NOUN"]'), spec)  # passes silently
try:
    validate(parse('[pos="ADJ"]'), spec)
except Exception as exc:
    print(exc.issues[0].kind)
# invalid_annotation_value
View source on GitHub: src/bcql_py/validation/validator.py lines 435–467
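A common way to implement a fail_fast switch is to stop at the first issue when it is set and to collect every issue otherwise, raising a single exception that carries the list. The sketch below illustrates that pattern only; it assumes this is what fail_fast means here, and `Issue`, `ValidationError`, and `run_checks` are hypothetical names, not bcql_py classes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    kind: str
    message: str


class ValidationError(Exception):
    """Carries all collected issues on an `issues` attribute."""

    def __init__(self, issues: list[Issue]) -> None:
        super().__init__(f"{len(issues)} validation issue(s)")
        self.issues = issues


def run_checks(pairs, closed, fail_fast=True):
    """Check (name, value) pairs against closed value sets."""
    issues = []
    for name, value in pairs:
        if name in closed and value not in closed[name]:
            issues.append(Issue("invalid_annotation_value", f"{name}={value}"))
            if fail_fast:
                break  # ASSUMPTION: fail-fast means stop at the first issue
    if issues:
        raise ValidationError(issues)


def count_issues(pairs, closed, fail_fast):
    try:
        run_checks(pairs, closed, fail_fast=fail_fast)
    except ValidationError as exc:
        return len(exc.issues)
    return 0


closed = {"pos": frozenset({"NOUN", "VERB"})}
pairs = [("pos", "ADJ"), ("pos", "NOUN"), ("pos", "ADV")]

print(count_issues(pairs, closed, fail_fast=True))   # 1
print(count_issues(pairs, closed, fail_fast=False))  # 2
```

Collecting everything is usually more helpful for interactive query editors, while stopping early is cheaper when the caller only needs a yes/no answer.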
Bundled presets
bcql_py.validation.presets.ud
Full Universal Dependencies (UD v2) preset.
- Universal POS tags (UD_POS_TAGS, wired as closed values for the upos annotation).
- Universal morphological features (UD_FEATURE_VALUES), each one a closed attribute (Number, Case, PronType, ...).
- Core universal dependency relation labels (UD_RELATION_LABELS, wired as allowed relation values); the relation label is also exposed as the closed deprel annotation for corpora that store it on the token.
- Common CoNLL-U-style open annotations (UD_OPEN_ATTRIBUTES): word, lemma, xpos, feats, misc, plus id, head.
Language-specific POS sub-types and relation subtypes (e.g. nsubj:pass)
are intentionally not included. Extend the preset to add them::
spec = UD.extend(allowed_relations={"nsubj:pass", "acl:relcl", "obl:agent"})
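To decide which subtypes to pass to extend(), it can help to compare the labels that occur in a treebank with the core inventory. The stand-alone sketch below does this with plain sets; `CORE_EXCERPT` is a small excerpt of the core labels, and `seen_in_corpus` and `base_label` are hypothetical names used only for illustration.

```python
# Excerpt of the core UD relation labels (the full set is UD_RELATION_LABELS).
CORE_EXCERPT = frozenset({"nsubj", "acl", "obl", "obj"})

# Labels as they might occur in a treebank (made-up sample).
seen_in_corpus = {"nsubj", "nsubj:pass", "acl:relcl", "obl:agent", "obj"}


def base_label(label: str) -> str:
    """Return the universal part of a 'universal:subtype' label."""
    return label.split(":", 1)[0]


# Subtypes that the core set alone would reject.
missing = sorted(label for label in seen_in_corpus if label not in CORE_EXCERPT)
print(missing)  # ['acl:relcl', 'nsubj:pass', 'obl:agent']

# Every missing label still builds on a core relation.
print(all(base_label(label) in CORE_EXCERPT for label in missing))  # True

# The union is what an extended spec would need to allow.
allowed = CORE_EXCERPT | set(missing)
print("nsubj:pass" in allowed)  # True
```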
UD_POS_TAGS
module-attribute
UD_POS_TAGS: frozenset[str] = frozenset(
{
"ADJ",
"ADP",
"ADV",
"AUX",
"CCONJ",
"DET",
"INTJ",
"NOUN",
"NUM",
"PART",
"PRON",
"PROPN",
"PUNCT",
"SCONJ",
"SYM",
"VERB",
"X",
}
)
Universal Dependencies v2 universal POS tags.
Wired as closed attribute values for the upos annotation in the UD preset.
UD_RELATION_LABELS
module-attribute
UD_RELATION_LABELS: frozenset[str] = frozenset(
{
"acl",
"advcl",
"advmod",
"amod",
"appos",
"aux",
"case",
"cc",
"ccomp",
"clf",
"compound",
"conj",
"cop",
"csubj",
"dep",
"det",
"discourse",
"dislocated",
"expl",
"fixed",
"flat",
"goeswith",
"iobj",
"list",
"mark",
"nmod",
"nsubj",
"nummod",
"obj",
"obl",
"orphan",
"parataxis",
"punct",
"reparandum",
"vocative",
"xcomp",
}
)
Core Universal Dependencies v2 dependency relation labels.
Wired as allowed relation values in the UD preset.
Language-specific subtypes (e.g., nsubj:pass, acl:relcl) are not included; extend the preset to add them.
UD_FEATURE_VALUES
module-attribute
UD_FEATURE_VALUES: dict[str, frozenset[str]] = {
"PronType": frozenset(
{
"Art",
"Dem",
"Emp",
"Exc",
"Ind",
"Int",
"Neg",
"Prs",
"Rcp",
"Rel",
"Tot",
}
),
"NumType": frozenset(
{
"Card",
"Dist",
"Frac",
"Mult",
"Ord",
"Range",
"Sets",
}
),
"Poss": frozenset({"Yes"}),
"Reflex": frozenset({"Yes"}),
"Foreign": frozenset({"Yes"}),
"Abbr": frozenset({"Yes"}),
"Typo": frozenset({"Yes"}),
"ExtPos": frozenset(
{
"ADJ",
"ADP",
"ADV",
"AUX",
"CCONJ",
"DET",
"INTJ",
"PRON",
"PROPN",
"SCONJ",
}
),
"Gender": frozenset({"Com", "Fem", "Masc", "Neut"}),
"Animacy": frozenset({"Anim", "Hum", "Inan", "Nhum"}),
"NounClass": frozenset(
{
"Bantu1",
"Bantu2",
"Bantu3",
"Bantu4",
"Bantu5",
"Bantu6",
"Bantu7",
"Bantu8",
"Bantu9",
"Bantu10",
"Bantu11",
"Bantu12",
"Bantu13",
"Bantu14",
"Bantu15",
"Bantu16",
"Bantu17",
"Bantu18",
"Bantu19",
"Bantu20",
"Bantu21",
"Bantu22",
"Bantu23",
"Wol1",
"Wol2",
"Wol3",
"Wol4",
"Wol5",
"Wol6",
"Wol7",
"Wol8",
"Wol9",
"Wol10",
"Wol11",
"Wol12",
}
),
"Number": frozenset(
{
"Coll",
"Count",
"Dual",
"Grpa",
"Grpl",
"Inv",
"Pauc",
"Plur",
"Ptan",
"Sing",
"Tri",
}
),
"Case": frozenset({"Abs", "Acc", "Erg", "Nom"}),
"Definite": frozenset(
{"Com", "Cons", "Def", "Ind", "Spec"}
),
"Deixis": frozenset(
{"Abv", "Bel", "Even", "Med", "Nvis", "Prx", "Remt"}
),
"DeixisRef": frozenset({"1", "2"}),
"Degree": frozenset(
{"Abs", "Aug", "Cmp", "Dim", "Equ", "Pos", "Sup"}
),
"VerbForm": frozenset(
{
"Conv",
"Fin",
"Gdv",
"Ger",
"Inf",
"Part",
"Sup",
"Vnoun",
}
),
"Mood": frozenset(
{
"Adm",
"Cnd",
"Des",
"Imp",
"Ind",
"Int",
"Irr",
"Jus",
"Nec",
"Opt",
"Pot",
"Prp",
"Qot",
"Sub",
}
),
"Tense": frozenset(
{"Fut", "Imp", "Past", "Pqp", "Pres"}
),
"Aspect": frozenset(
{"Hab", "Imp", "Iter", "Perf", "Prog", "Prosp"}
),
"Voice": frozenset(
{
"Act",
"Antip",
"Bfoc",
"Cau",
"Dir",
"Inv",
"Lfoc",
"Mid",
"Pass",
"Rcp",
}
),
"Evident": frozenset({"Fh", "Nfh"}),
"Polarity": frozenset({"Neg", "Pos"}),
"Person": frozenset({"0", "1", "2", "3", "4"}),
"Polite": frozenset({"Elev", "Form", "Humb", "Infm"}),
"Clusivity": frozenset({"Ex", "In"}),
}
Universal morphological features and their allowed values.
Wired as closed attributes in the UD preset (e.g., Number, Case, Tense, etc.).
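In the preset each feature is its own closed attribute, while feats is listed as an open annotation, so a raw CoNLL-U FEATS string is not checked value by value. If such a check is wanted, a table shaped like UD_FEATURE_VALUES can be applied by hand. The sketch below is stand-alone: `FEATURES_EXCERPT` copies a few entries from the table above, `invalid_feats` is a hypothetical helper, and layered features such as `Number[psor]` are not handled.

```python
from __future__ import annotations

# Small excerpt of the feature table shown above.
FEATURES_EXCERPT = {
    "Number": frozenset({"Sing", "Plur", "Dual"}),
    "Tense": frozenset({"Fut", "Imp", "Past", "Pqp", "Pres"}),
}


def invalid_feats(feats: str, allowed: dict[str, frozenset[str]]) -> list[str]:
    """Return the problems found in a CoNLL-U FEATS string such as 'A=x|B=y,z'."""
    if feats == "_":  # CoNLL-U placeholder for "no features"
        return []
    problems = []
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        if name not in allowed:
            problems.append(f"unknown feature {name}")
            continue
        for value in values.split(","):  # a feature may carry several values
            if value not in allowed[name]:
                problems.append(f"{name}={value}")
    return problems


print(invalid_feats("Number=Sing|Tense=Past", FEATURES_EXCERPT))  # []
print(invalid_feats("Number=Many|Tense=Past", FEATURES_EXCERPT))  # ['Number=Many']
print(invalid_feats("Number=Sing,Plur", FEATURES_EXCERPT))        # []
print(invalid_feats("_", FEATURES_EXCERPT))                       # []
```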
UD_OPEN_ATTRIBUTES
module-attribute
UD_OPEN_ATTRIBUTES: frozenset[str] = frozenset(
{"word", "lemma", "xpos", "feats", "misc", "id", "head"}
)
Common CoNLL-U open annotations in Universal Dependencies.
Open attributes are those whose values are not restricted to a fixed set. Includes the token form (word), lemma, language-specific POS tag (xpos), raw feature string (feats), miscellaneous annotations (misc), token ID (id), and head index (head).
UD
module-attribute
UD = CorpusSpec(
open_attributes=UD_OPEN_ATTRIBUTES,
closed_attributes=_UD_CLOSED_ATTRIBUTES,
allowed_relations=UD_RELATION_LABELS,
)
Universal Dependencies v2 corpus specification.
A ready-made CorpusSpec for validating BCQL queries against Universal Dependencies v2 corpora. Includes universal POS tags, morphological features, core dependency relations, and standard CoNLL-U annotations.
Language-specific subtypes and variations can be added via extend().
bcql_py.validation.presets.lassy
Lassy / Alpino preset derived from the alpino_ds DTD.
See the LASSY manual, Figures 1.1 and 1.2 on pages 13-14.
Based on this, the preset describes:
- The full Alpino relation inventory (rel), also exposed as LASSY_RELATION_LABELS and integrated as allowed relation values.
- Phrasal categories (cat), part-of-speech tags (pt), and morphosyntactic features (ntype, getal, graad, ...).
- Open-string annotations (word, lemma, postag, plus identifier / position fields).
- The DTD element names as allowed span tags, for corpora that expose alpino_ds/node as XML spans.
Note that "pos" and "root" are excluded, as per the documentation:
The attributes pos and root represent the POSTAG and ROOT values used by Alpino. They are not documented separately here and are not an official part of the annotation.
LASSY_RELATION_LABELS
module-attribute
LASSY_RELATION_LABELS: frozenset[str] = frozenset(
{
"--",
"app",
"body",
"cmp",
"cnj",
"crd",
"det",
"dlink",
"dp",
"hd",
"hdf",
"ld",
"me",
"mod",
"mwp",
"nucl",
"obcomp",
"obj1",
"obj2",
"pc",
"pobj1",
"predc",
"predm",
"rhd",
"sat",
"se",
"su",
"sup",
"svp",
"tag",
"top",
"vc",
"whd",
}
)
Lassy/Alpino dependency relation labels (rel attribute).
Wired as allowed relation values in the LASSY preset.
LASSY_CAT_LABELS
module-attribute
LASSY_CAT_LABELS: frozenset[str] = frozenset(
{
"advp",
"ahi",
"ap",
"conj",
"cp",
"detp",
"du",
"inf",
"mwu",
"np",
"oti",
"pp",
"ppart",
"rel",
"smain",
"ssub",
"sv1",
"svan",
"ti",
"top",
"whq",
"whrel",
"whsub",
}
)
Lassy phrasal category labels (cat attribute).
Wired as closed attribute values in the LASSY preset.
LASSY_PT_LABELS
module-attribute
LASSY_PT_LABELS: frozenset[str] = frozenset(
{
"adj",
"bw",
"let",
"lid",
"n",
"spec",
"tsw",
"tw",
"vg",
"vnw",
"vz",
"ww",
}
)
Lassy part-of-speech tags (pt attribute).
Wired as closed attribute values in the LASSY preset.
LASSY_FEATURE_VALUES
module-attribute
LASSY_FEATURE_VALUES: dict[str, frozenset[str]] = {
"dial": frozenset({"dial"}),
"ntype": frozenset({"soort", "eigen"}),
"getal": frozenset({"getal", "ev", "mv"}),
"graad": frozenset({"basis", "comp", "sup", "dim"}),
"genus": frozenset(
{"genus", "zijd", "masc", "fem", "onz"}
),
"naamval": frozenset(
{"stan", "nomin", "obl", "bijz", "gen", "dat"}
),
"positie": frozenset(
{"prenom", "nom", "postnom", "vrij"}
),
"buiging": frozenset({"zonder", "met-e", "met-s"}),
"getal-n": frozenset({"zonder-n", "mv-n"}),
"wvorm": frozenset({"pv", "inf", "od", "vd"}),
"pvtijd": frozenset({"tgw", "verl", "conj"}),
"pvagr": frozenset({"ev", "mv", "met-t"}),
"numtype": frozenset({"hoofd", "rang"}),
"vwtype": frozenset(
{
"pr",
"pers",
"refl",
"recip",
"bez",
"vb",
"vrag",
"betr",
"excl",
"aanw",
"onbep",
}
),
"pdtype": frozenset(
{"pron", "adv-pron", "det", "grad"}
),
"persoon": frozenset(
{
"persoon",
"1",
"2",
"2v",
"2b",
"3",
"3p",
"3m",
"3v",
"3o",
}
),
"status": frozenset({"vol", "red", "nadr"}),
"npagr": frozenset(
{
"agr",
"evon",
"rest",
"evz",
"mv",
"agr3",
"evmo",
"rest3",
"evf",
}
),
"lwtype": frozenset({"bep", "onbep"}),
"vztype": frozenset({"init", "versm", "fin"}),
"conjtype": frozenset({"neven", "onder"}),
"spectype": frozenset(
{
"afgebr",
"onverst",
"vreemd",
"deeleigen",
"meta",
"comment",
"achter",
"afk",
"symb",
"enof",
}
),
"rel": LASSY_RELATION_LABELS,
"cat": LASSY_CAT_LABELS,
"pt": LASSY_PT_LABELS,
}
Lassy morphosyntactic features and their allowed values.
Wired as closed attributes in the LASSY preset (e.g., ntype, getal, graad, etc.).
LASSY_OPEN_ATTRIBUTES
module-attribute
LASSY_OPEN_ATTRIBUTES: frozenset[str] = frozenset(
{
"word",
"lemma",
"postag",
"id",
"index",
"begin",
"end",
}
)
Open annotations in Lassy/Alpino corpora.
Open attributes are those whose values are not restricted to a fixed set.
LASSY_SPAN_TAGS
module-attribute
LASSY_SPAN_TAGS: frozenset[str] = frozenset(
{"alpino_ds", "node", "sentence", "comments", "comment"}
)
Allowed XML span tag names in Lassy/Alpino corpora.
Wired as allowed span tags in the LASSY preset.
LASSY
module-attribute
LASSY = CorpusSpec(
open_attributes=LASSY_OPEN_ATTRIBUTES,
closed_attributes=LASSY_FEATURE_VALUES,
allowed_span_tags=LASSY_SPAN_TAGS,
allowed_relations=LASSY_RELATION_LABELS,
)
LASSY/Alpino corpus specification for Dutch.
A ready-made CorpusSpec for validating BCQL queries against Lassy/Alpino-annotated Dutch corpora. Includes phrasal categories, part-of-speech tags, morphosyntactic features, dependency relations, DTD span tags, and the open annotations listed in LASSY_OPEN_ATTRIBUTES (word, lemma, postag, and identifier / position fields).