A Python parser for BlackLab Corpus Query Language

A full-coverage Python parser for the BlackLab Corpus Query Language (BCQL) that converts query strings into a Pydantic v2 AST (Abstract Syntax Tree) with lossless round-trip reconstruction and structured error reporting.

Features

  • Complete BCQL coverage: token queries, sequences, repetitions, spans, lookarounds, captures, global constraints, relations, alignments, and built-in functions.
  • Immutable Pydantic v2 AST: every node is a frozen BaseModel subclass with a node_type discriminator, making inspection and pattern matching straightforward.
  • Lossless BCQL round-trip: to_bcql() reproduces the original query (preserving shorthand forms, quote characters, sensitivity flags, etc.).
  • Position-aware syntax errors: BCQLSyntaxError carries the original query, the 0-based offset, and a caret-annotated message, ready to forward to a user or an LLM.
  • Optional semantic validation: a CorpusSpec describes which annotations, span tags, alignment fields, and dependency relations your corpus supports. Pass it as parse(query, spec=spec) to catch typos and unsupported features before they reach the corpus. See the tagset validation guide.
  • Zero runtime dependencies beyond Pydantic.
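
To make the "caret-annotated message" idea concrete, here is a minimal standard-library sketch of how a query and a 0-based offset can be turned into such a message. The helper name caret_annotate and the message wording are hypothetical and only illustrate the concept; this is not the library's implementation, and the exact text produced by BCQLSyntaxError may differ.

```python
def caret_annotate(query: str, offset: int, message: str) -> str:
    """Build a three-line report: the message, the query, and a caret under `offset`.

    Hypothetical helper for illustration only; assumes a single-line query
    and a 0-based offset, where offset == len(query) means "end of input".
    """
    if not 0 <= offset <= len(query):
        raise ValueError(f"offset {offset} is outside the query (length {len(query)})")
    pointer = " " * offset + "^"  # pad with spaces so the caret sits under the offending column
    return f"{message} (offset {offset})\n{query}\n{pointer}"


# An unterminated token query: the closing bracket is missing at the end of input.
broken_query = '[word="man"'
report = caret_annotate(broken_query, len(broken_query), "expected ']'")
print(report)
```

Because the offset is 0-based, it can be used directly as the number of padding spaces, which is what keeps the caret aligned with the offending character when the report is shown in a monospaced font.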

Installation

pip install bcql_py

Or with uv:

uv add bcql_py

Try the demo

A small Gradio app under app/ lets you paste a BCQL query, pick or build a CorpusSpec, and inspect the parse and validation results. The hosted demo runs on Hugging Face Spaces at BramVanroy/bcql_py_validation.

To run it locally:

uv sync --group app
uv run python app/app.py

Supported BCQL constructs

Category                   Examples
Token queries              [word="man"], "man", [], [pos != "noun"]
Regex & literal strings    "(wo)?man", l"e.g.", "(?-i)Panama"
Boolean constraints        [lemma="search" & pos="noun"], [a="x" | b="y"]
Sequences                  "the" "tall" "man"
Repetitions                [pos="ADJ"]+, []{2,5}, "word"?
Spans                      <s/>, <s>, </s>, <ne type="PERS"/>
Position filters           "baker" within <person/>, <s/> containing "dog"
Captures                   A:[pos="ADJ"], A:[] "by" B:[] :: A.word = B.word
Relations                  _ -obj-> _, _ -subj-> _ ; -obj-> _, ^--> "have"
Alignments                 "cat" ==>nl _, "cat" ==>nl? _
Lookaround                 (?= "next"), (?<= "prev"), (?! "not")
Functions                  meet(...), rspan(...), rfield(...)

See the cheatsheet for a quick-reference table of every operator.
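
A few of the example queries above can be pushed through a small round-trip check. The sketch below relies on two assumptions that this page does not confirm: that parse can be imported from the top level of the bcql_py package, and that to_bcql() is a method on the object parse returns. The helper round_trip_report is hypothetical. If the import fails, the sketch reports that instead of crashing.

```python
# Assumption: `parse` is exported at the top level of the bcql_py package.
try:
    from bcql_py import parse
except ImportError:  # library not installed, or `parse` lives in another module
    parse = None

# Example queries copied from the table of supported constructs.
EXAMPLES = [
    '[word="man"]',
    '"the" "tall" "man"',
    '[pos="ADJ"]+',
    '"baker" within <person/>',
    'A:[] "by" B:[] :: A.word = B.word',
]


def round_trip_report(queries):
    """Map each query to True/False (round-trip equality) or to an error string."""
    report = {}
    for query in queries:
        if parse is None:
            report[query] = "bcql_py not available"
            continue
        try:
            # Assumption: the parsed AST exposes to_bcql() as a method.
            report[query] = parse(query).to_bcql() == query
        except Exception as exc:  # sketch only: record the failure and keep going
            report[query] = f"{type(exc).__name__}: {exc}"
    return report


for query, outcome in round_trip_report(EXAMPLES).items():
    print(f"{outcome!s:<24} {query}")
```

Recording failures per query instead of stopping at the first one keeps the output useful as a quick compatibility overview; in a real test suite you would normally let the exception propagate.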

Development

git clone https://github.com/BramVanroy/bcql_py.git
cd bcql_py
uv sync

# Run tests and doctests
uv run pytest

# Lint and format
make quality   # check only
make style     # auto-fix