TUSNELDA Query Syntax

Because TUSNELDA uses XMLQUERY as its search engine, parts of the following specification have been adopted from the XMLQUERY documentation.

Query language

A query is an expression built up from atomic constants and operators as specified below.

Atomic expressions

Type      Precedence  Operator          Meaning
BRACKET   0           .                 matches any XML element
BRACKET   0           #                 matches any piece of PCDATA
BRACKET   0           quoted string     a string-valued constant
BRACKET   0           numeric constant  a numeric constant
BRACKET   0           name              the name of an XML element or attribute

Operators allowed in the query language

The query language really consists of two sublanguages: one describes the tree structure of the elements you want to find, and the other expresses boolean conditions on the attribute values of these elements.

Tree structure

Type      Precedence  Operator  Meaning
BINOPR    200         /         A/B matches if a B is a child of an A
BINOPR    100         ,         A,B matches if a B is the right sibling of an A
BINOPR    150         &         A & B matches if there are an A and a B, in any order
MONPOST   400         *         A* matches multiple occurrences of A. The reading depends on context: before a /, it means multiple nested occurrences; otherwise, it means multiple sequential occurrences.
MONPOST   400         ?         A? matches an optional occurrence of A
BINOPR    250         |         A|B matches if there is an A or a B
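
The difference between the child operator / and the sibling operator , can be sketched in Python with the standard xml.etree.ElementTree module. This is only an illustration, not how XMLQUERY is implemented: the helper names child_pairs and sibling_pairs and the sample document are made up, and the sketch assumes that "right sibling" means the element immediately following under the same parent, ignoring any text in between.

```python
import xml.etree.ElementTree as ET

def child_pairs(root, parent_name, child_name):
    """(A, B) pairs where B is a direct child of A: the reading of A/B."""
    return [(a, b) for a in root.iter(parent_name) for b in a if b.tag == child_name]

def sibling_pairs(root, left_name, right_name):
    """(A, B) pairs where B directly follows A under one parent.

    Assumption: this is one possible reading of A,B; text between the
    two elements is ignored here.
    """
    pairs = []
    for parent in root.iter():
        children = list(parent)
        for left, right in zip(children, children[1:]):
            if left.tag == left_name and right.tag == right_name:
                pairs.append((left, right))
    return pairs

# Made-up sample document, not taken from any TUSNELDA corpus.
root = ET.fromstring("<text><sp><reg/><marked/></sp><sp><marked/><reg/></sp></text>")
print(len(child_pairs(root, "sp", "reg")))        # 2: both <sp> elements have a <reg> child
print(len(sibling_pairs(root, "reg", "marked")))  # 1: only the first <sp> has <reg> followed by <marked>
```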

Relational operators

Type      Precedence  Operator  Meaning
BINOPL    300         =         string equality
BINOPL    300         ==        numeric equality
BINOPL    300         >         greater than
BINOPL    300         <         less than
BINOPL    300         >=        greater than or equal
BINOPL    300         <=        less than or equal
BINOPL    300         ~         string regular expression match
BINOPL    300         !=        string inequality
BINOPL    300         !==       numeric inequality
BINOPL    300         !~        string regular expression non-match
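
Why the language has separate string and numeric operators can be illustrated with a small Python sketch. The function names are hypothetical, and the sketch assumes that numeric comparison converts both operands to numbers and that ~ succeeds when the pattern matches anywhere in the string (the example query #~"Idefiks" is described as matching text that contains "Idefiks"). How XMLQUERY converts values internally is not specified here.

```python
import re

def string_equal(a, b):
    """Analogue of  =  : compare the operands as strings."""
    return str(a) == str(b)

def numeric_equal(a, b):
    """Analogue of  ==  : compare the operands as numbers (assumed conversion)."""
    return float(a) == float(b)

def regex_match(value, pattern):
    """Analogue of  ~  : true if the pattern matches anywhere in the value."""
    return re.search(pattern, value) is not None

print(string_equal("07", "7"))                   # False: the strings differ
print(numeric_equal("07", "7"))                  # True: both denote the number 7
print(regex_match("Komm, Idefiks!", "Idefiks"))  # True: the pattern occurs in the text
```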

Arithmetic and string operators

Type      Precedence  Operator  Meaning
BINOPL    500         +         numeric plus
BINOPL    500         -         numeric minus

Other constants and operators

Type      Precedence  Operator  Meaning
MONPRE    400         %         introduces a variable
MONPRE    400         $         introduces a variable (same as %)
MONPRE    400         -         numeric negation
MONPOST   270         !         marks something to be saved
MONPOST   400         [ ]       A[B] matches if there is an A which satisfies the boolean condition B
BRACKET   0           ( )       general-purpose brackets
BRACKET   0           ^         marks the beginning (i.e. before the first child) of an element
BRACKET   0           $         marks the end (i.e. after the last child) of an element
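
Variables are easiest to understand through the last example query in this document, where ptr[target=%X] and figure[id=%X] share the variable %X. The Python sketch below assumes that a variable must take the same value everywhere it occurs in a query, so that the two conditions act like a join on that value. The function name and the sample document are made up, and the sketch ignores the positional constraints (the , and .* parts) of the original query.

```python
import xml.etree.ElementTree as ET

def pointer_figure_pairs(root):
    """Pair each <ptr target=X> with the <figure id=X> that has the same value X."""
    figures = {fig.get("id"): fig for fig in root.iter("figure") if fig.get("id")}
    return [(ptr, figures[ptr.get("target")])
            for ptr in root.iter("ptr")
            if ptr.get("target") in figures]

# Made-up sample document, not taken from any TUSNELDA corpus.
root = ET.fromstring(
    '<text>'
    '<sp><spokenpar><ptr target="f1"/></spokenpar></sp>'
    '<figure id="f1"/>'
    '<figure id="f2"/>'
    '</text>'
)
pairs = pointer_figure_pairs(root)
print([(ptr.get("target"), fig.get("id")) for ptr, fig in pairs])  # [('f1', 'f1')]
```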


Query language syntax

In more detail, the query language syntax is described below. Literal operators are shown in green and nonterminals of the grammar in italics. The grammar below is ambiguous; as usual, ambiguities are resolved using the operator precedences given above (operators with higher precedence bind more tightly).

The syntax of a tree_expression is
tree_expression :== tree_expression / tree_expression
                :== tree_expression & tree_expression
                :== tree_expression , tree_expression
                :== tree_expression | tree_expression
                :== ( tree_expression )
                :== tree_expression *
                :== tree_expression !
                :== element_expression
                :== ^
                :== $
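
How the precedences resolve the ambiguity of this grammar can be sketched with a small precedence-climbing parser in Python. This is a simplified illustration, not the XMLQUERY parser: it handles only element names, ., #, brackets, the binary operators / , & | and the postfix operators * ? !, and it assumes that the type BINOPR denotes a right-associative binary operator. The names tokenize, parse and parse_query are hypothetical.

```python
import re

# Precedences of the binary tree operators, taken from the operator table.
BINARY = {"/": 200, ",": 100, "&": 150, "|": 250}
# Postfix operators bind more tightly than every binary operator.
POSTFIX = {"*", "?", "!"}
TOKEN = re.compile(r"\s*([A-Za-z_][\w-]*|[/,&|()*?!.#])")

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_operand(tokens):
    """Parse a name, '.', '#' or a bracketed expression, plus trailing postfix operators."""
    if not tokens:
        raise ValueError("unexpected end of query")
    token = tokens.pop(0)
    if token == "(":
        node = parse(tokens, 0)
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing closing bracket")
    elif token in BINARY or token in POSTFIX or token == ")":
        raise ValueError(f"unexpected token {token!r}")
    else:
        node = token
    while tokens and tokens[0] in POSTFIX:
        node = (tokens.pop(0), node)
    return node

def parse(tokens, min_prec):
    """Precedence climbing: only consume operators binding at least as tightly as min_prec."""
    left = parse_operand(tokens)
    while tokens and tokens[0] in BINARY and BINARY[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Assumed right-associativity: the right operand may absorb the same operator.
        right = parse(tokens, BINARY[op])
        left = (op, left, right)
    return left

def parse_query(text):
    tokens = tokenize(text)
    tree = parse(tokens, 0)
    if tokens:
        raise ValueError(f"unexpected token {tokens[0]!r}")
    return tree

print(parse_query("a/b|c,d"))  # (',', ('/', 'a', ('|', 'b', 'c')), 'd')
```

In the printed tree, | (precedence 250) groups b and c first, / (200) then attaches a, and , (100) is applied last.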

The syntax of an element_expression is
element_expression      :== element_name_expression ( [ boolean_expression ] )?
                        :== # string_bool_op string_expression
element_name_expression :== .
                        :== #
                        :== ( element_name_expression )
                        :== $ variable_name
                        :== % variable_name
                        :== element_name
                        :== element_name_expression | element_name_expression
boolean_expression      :== boolean_expression , boolean_expression
                        :== boolean_expression | boolean_expression
                        :== string_expression string_bool_op string_expression
                        :== number_expression number_bool_op number_expression
                        :== $ variable_name
                        :== % variable_name
                        :== number_expression
number_expression       :== number_expression arith_op number_expression
                        :== number_literal
string_expression       :== number_literal
                        :== quoted_string_literal
                        :== attribute_name
                        :== #
                        :== $ variable_name
                        :== % variable_name
number_bool_op          :== == | !== | < | > | <= | >=
string_bool_op          :== = | != | ~ | !~
arith_op                :== + | -
To be continued.


Example queries

   .*/sp!/.*/#~"Idefiks"
matches every <sp> element which contains the text string "Idefiks" (matches in the B8 Comic Corpus).
   .*/sp!/.*/reg
matches every <sp> element which contains a <reg> element (matches in the B8 sub-corpora).
   .*/sp/.*/reg
same as above, but only reports the number of matches found; the matching <sp> elements are not displayed because they are not marked with the ! operator.
   .*/sp!/.*/reg!
same as above, but displays each <sp> element once for each <reg> element in it, highlighting the <reg> element.
   .*/sp!/.*/marked[type="deic-loc"]
matches every <sp> element which contains a local deictic (matches in the B8 and B9 sub-corpora).
   .*/.!/marked[type="deic-loc"]
matches any element which contains a local deictic as an immediate child.
   .*/figure!/.*/situation/keywords/term/#~"forefinger"
matches every <figure> element which contains a situational characterization with the keyword "forefinger" (matches in the B8 Comic Corpus).
   .*/figure!/.*/situation/keywords/(term/#~"forefinger" & term/#~"bent")
matches every <figure> element which contains a situational characterization with the keywords "forefinger" and "bent" (matches in the B8 Comic Corpus).
   .*/((sp/spokenpar!/ptr[target=%X]),.*,figure[id=%X]!)
matches every spoken paragraph which contains a pointer to a figure, together with the corresponding figure (matches in the B9 text "Brasil - soccer match (tv)").
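
For readers more familiar with general-purpose XML tools, some of the example queries above can be approximated in Python with xml.etree.ElementTree. The functions below are rough analogues under the assumption that .*/ reaches elements at any depth; they are not a reimplementation of XMLQUERY, their names are hypothetical, and the sample document is made up.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample document, not taken from any TUSNELDA corpus.
SAMPLE = (
    '<text>'
    '<sp><spokenpar>Komm her, <reg>Idefiks</reg>!</spokenpar></sp>'
    '<sp><spokenpar>Das ist <marked type="deic-loc">hier</marked>.</spokenpar></sp>'
    '<sp><spokenpar>Nein.</spokenpar></sp>'
    '</text>'
)

def attributes_match(element, attrs):
    return all(element.get(key) == value for key, value in attrs.items())

def sp_with_text(root, pattern):
    """Rough analogue of  .*/sp!/.*/#~"pattern"  (each piece of text is tested separately)."""
    regex = re.compile(pattern)
    return [sp for sp in root.iter("sp")
            if any(regex.search(piece) for piece in sp.itertext())]

def sp_with_descendant(root, name, **attrs):
    """Rough analogue of  .*/sp!/.*/name[attr="value"]  (match at any depth below <sp>)."""
    return [sp for sp in root.iter("sp")
            if any(attributes_match(el, attrs) for el in sp.iter(name) if el is not sp)]

def parents_of_child(root, name, **attrs):
    """Rough analogue of  .*/.!/name[attr="value"]  (match must be an immediate child)."""
    return [parent for parent in root.iter()
            if any(child.tag == name and attributes_match(child, attrs) for child in parent)]

root = ET.fromstring(SAMPLE)
print(len(sp_with_text(root, "Idefiks")))                          # 1
print(len(sp_with_descendant(root, "reg")))                        # 1
print(len(sp_with_descendant(root, "marked", type="deic-loc")))    # 1
print([el.tag for el in parents_of_child(root, "marked", type="deic-loc")])  # ['spokenpar']
```

The last two calls show the difference between the two deictic queries: the <sp> element contains the <marked> element only indirectly, whereas its immediate parent is the <spokenpar> element.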
Last modified 11 March 2009.