AST and parser design¶
This guide explains how to reason about the internals of bcql_py without reading every model
and parser method.
Design goals¶
The implementation optimises for three goals:
- Full BCQL coverage: all token queries, sequences, repetitions, spans, lookaround, captures, global constraints, relations, and alignments supported by BlackLab.
- Functionally lossless round-trip:
parse(query).to_bcql()reproduces the original query, attempting to maximally retain quote style and shorthand forms. - Position-aware diagnostics:
BCQLSyntaxErrorcarries the original source and a character offset, so downstream tools (editors, LLMs) can point users directly at the problem.
The pipeline¶
BCQL string --> tokenize() --> tuple[Token, ...] --> BCQLParser --> BCQLNode (AST)
|
(optional) validate()
|
to_bcql() / model_dump()
tokenize()is implemented byBCQLLexer.parse_from_tokens()drivesBCQLParser.parse()is the convenience wrapper that combines both. It also accepts aspec=argument that, when provided, runsvalidate()against the resulting AST before returning it; see the tagset validation guide.
tokenize() and the cached parse step use functools.lru_cache, so repeat calls on the same
source return the same (immutable) token tuple / AST instance.
The three grammar layers¶
The parser has three expression domains with separate operator rules. Keeping them apart avoids ambiguities and makes precedence rules explicit.
| Layer | Where it applies | Representative nodes |
|---|---|---|
| Sequence-level | Everything outside [...] |
SequenceNode, RepetitionNode, CaptureNode, SpanQuery, RelationNode, AlignmentNode |
| Token-constraint | Inside [...] |
AnnotationConstraint, BoolConstraint, IntegerRangeConstraint |
| Capture-constraint | After :: |
AnnotationRef, ConstraintComparison, ConstraintBoolean |
The node hierarchy¶
All nodes inherit from BCQLNode, a frozen Pydantic
BaseModel. Every concrete subclass:
- has a unique
node_type: Literal[...]discriminator; - overrides
to_bcql()for lossless reconstruction; - is immutable (
model_config = ConfigDict(frozen=True)).
from bcql_py import parse
from bcql_py.models import TokenQuery, AnnotationConstraint, StringValue
ast = parse('[word="man"]')
assert isinstance(ast, TokenQuery)
assert isinstance(ast.constraint, AnnotationConstraint)
assert isinstance(ast.constraint.value, StringValue)
ast.constraint.annotation # 'word'
ast.constraint.value.value # 'man'
See the API reference for the exhaustive list.
Traversing the AST¶
Because every node is a Pydantic model, a generic walker can introspect fields and recurse into
any BCQLNode-valued attribute or list.
from bcql_py import parse
from bcql_py.models import BCQLNode
def walk(node):
"""Depth-first generator yielding every BCQLNode in a tree."""
yield node
for field_name in type(node).model_fields:
value = getattr(node, field_name)
if isinstance(value, BCQLNode):
yield from walk(value)
elif isinstance(value, (list, tuple)):
for item in value:
if isinstance(item, BCQLNode):
yield from walk(item)
elif isinstance(value, dict):
for item in value.values():
if isinstance(item, BCQLNode):
yield from walk(item)
ast = parse('A:[lemma="run"]+ "fast" :: A.word != "ran"')
types = [node.node_type for node in walk(ast)]
# types includes 'global_constraint', 'sequence', 'repetition',
# 'capture', 'token_query', 'annotation_constraint', 'string_value', etc.
Prefer isinstance(node, BCQLNode) over hasattr(node, "node_type"): Pydantic field values that
happen to have a node_type attribute by coincidence will not match.
Round-tripping¶
to_bcql() produces a string that re-parses to an ==-equivalent AST. Every node in the library
is covered by tests/test_roundtrip.py.
from bcql_py import parse
source = '"the" [pos="ADJ"]+ "man" within <s/>'
assert parse(parse(source).to_bcql()) == parse(source)
The output is not guaranteed to be byte-identical to the input: insignificant whitespace may be
normalised, and some brace quantifiers are canonicalised (e.g. {0,5} round-trips as {,5}).
Custom traversal with node_type¶
For pattern-matching without isinstance, use node_type:
from bcql_py import parse
ast = parse('[word="man"]')
match ast.node_type:
case "token_query":
print("a single token match")
case "sequence":
print("a sequence of tokens")
case _:
print("something else")
This is especially handy for serialised ASTs: the discriminator survives any JSON or dictionary representation.
Why we do not use ANTLR¶
BlackLab's reference parser is generated from a
Bcql.g4 ANTLR grammar.
bcql_py re-implements the same grammar as a hand-written recursive-descent parser for three
reasons:
- Clarity: the generated code is hard to read and tweak; a hand-written parser is directly
mapped to the BNF in
bnf.md. - Zero runtime dependency: no ANTLR runtime needs to ship with the library.
- Position-aware errors: we can attach precise offsets and context to every error without fighting the generated machinery.
When Bcql.g4 and this library diverge, Bcql.g4 wins. Open questions and deliberate
differences are collected in
questions.md.