Markup Languages and Schema Languages for Linguistic, Textual, Documentary Resources

C. M. Sperberg-McQueen


This paper will consider design issues in the construction of schemas and schema languages for textual resources intended for linguistic computing, computational linguistics, and computer philology. The emphasis will be on SGML and XML vocabularies and schema languages for specifying them, with occasional reference to other systems.
Like any good metalanguage, a good schema language must support good design at the language level. Good language design practices should be encouraged, and bad practices discouraged or (if the metalanguage designer is ambitious) made impossible. (As Orwell writes, "The Revolution will be complete when the language is perfect.") And to be useful, the metalanguage must allow the language designer to express their design decisions, preferably clearly, preferably concisely.
Some design issues of importance for markup languages will be outlined.

Over- and under-generation

In the ideal case, the schema for a language provides a formal recognition criterion which recognizes every sequence we wish to accept as a sentence in our language, and does not recognize any other sequence. In less ideal cases, it may be necessary to live with some discrepancy between the language as we imagine it and the formal definition we work with. Is it better to under-generate? Then we can be sure that every sequence recognized by the schema is truly acceptable, at the cost of having some intuitively plausible utterances fail to be recognized by the schema. Or is it better to over-generate? Then every acceptable sequence will be recognized, as will some number of nonsensical, unacceptable sequences. Which is preferable depends on the purpose of the schema: schemas serving as a contract between data producers and data exchange partners have one role; schemas used primarily to provide automatic annotation of the data have another; schemas which express our understanding of a corpus, in the form of a document grammar, have yet another. The notions of descriptive and prescriptive grammar also play a role.
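
To make the trade-off concrete, here is a minimal sketch in XML Schema; the letter example and all element names are hypothetical, not drawn from any particular vocabulary.

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="letter">
      <xs:complexType>
        <!-- Under-generating model: exactly one salutation, body, and
             signature, in that order. Real letters with a postscript or
             a second signature are rejected. -->
        <xs:sequence>
          <xs:element name="salutation" type="xs:string"/>
          <xs:element name="body" type="xs:string"/>
          <xs:element name="signature" type="xs:string"/>
        </xs:sequence>
        <!-- Over-generating alternative: replacing the sequence with
             <xs:choice minOccurs="0" maxOccurs="unbounded"> over the same
             children accepts every letter in the corpus, but also a
             signature with no body, or three salutations in a row. -->
      </xs:complexType>
    </xs:element>
  </xs:schema>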

Concrete vs. abstract structures

The feel of a markup language depends, more than anything else, on the designer's choice of element types. Will there be chapter, section, and subsection elements, or a single generic 'div' element with an attribute to distinguish the kind of textual division involved? Some aspects of this choice are obvious. Will element types be chosen to reflect typographic distinctions? Rhetorical and compositional distinctions? Linguistic phenomena? Equally important - and far more difficult to resolve satisfactorily - is the desire to capture both concrete details of the document (leading often to fine-grained distinctions among element types) and regularities visible only at a more abstract level. If the markup language provides a wide variety of phrase-level element types (as conventional document-oriented languages often do), how can we capture generalizations true for all phrase-level types (e.g., in a stylesheet, or in a scholarly annotation)? If the markup language were to provide only a single phrase-level element (with an attribute, perhaps, to allow us to distinguish different kinds of phrases), then such generalizations would be easier to capture, but the details of the text would be somewhat more cumbersome to record. The choice of concrete or abstract structures has serious implications for validation of the data, at least with current validation technologies. Microformats, as currently used in some HTML, provide a useful concrete illustration both of the design issues involved and of the validation issues.
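
The contrast can be sketched in two instance fragments; the element and attribute names are illustrative only.

  <!-- Concrete structures: the schema can require that chapters
       contain sections, but a rule that applies to all divisions
       must enumerate chapter, section, and subsection. -->
  <chapter>
    <section>
      <subsection>...</subsection>
    </section>
  </chapter>

  <!-- Abstract structure: one stylesheet rule (e.g. the XSLT pattern
       match="div") covers every division, but the nesting constraint
       is now harder for the schema to enforce. -->
  <div type="chapter">
    <div type="section">
      <div type="subsection">...</div>
    </div>
  </div>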

Ontological commitments

One of the issues most keenly felt by some designers and users of markup languages is that of ontological commitment. Providing names for things can be, and usually is, interpreted as entailing a claim that the things named actually exist, or can exist. It is not always easy to reach agreement, within a design team, about the nature of the ontological commitment involved in defining a particular element type, or a particular attribute value. And vocabularies intended for wide use must reckon with the possibility that different members of the target user community will have different and conflicting ontological leanings; sometimes the ontological commitments of a vocabulary are left intentionally vague.

Variability in the material

When existing material is digitized, an interesting pattern of variability in the material is sometimes found. In a given dictionary, for example, or in a collection of dictionaries, most articles may follow a fairly simple pattern; some will be more complex; a few will be simply anomalous. What should the schema author do? We can write a document grammar that captures the regularities in the vast majority of cases, at the cost of declaring some small portion of the material invalid. We can write a more forgiving document grammar that accepts everything in the corpus, at the expense of failing to capture the regularities which dominate the material in practice; the problems of over- and under-generation recur here in different guise.
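
As an illustration, consider a hypothetical dictionary-entry grammar (the TEI-like names here are invented, not a proposal): the strict model below captures the dominant pattern, while the forgiving alternative described in the comment accepts the anomalies too.

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="entry">
      <xs:complexType>
        <!-- Strict model: headword form, optional grammatical block,
             one or more senses. The anomalous entries become invalid. -->
        <xs:sequence>
          <xs:element name="form" type="xs:string"/>
          <xs:element name="gramGrp" type="xs:string" minOccurs="0"/>
          <xs:element name="sense" type="xs:string" maxOccurs="unbounded"/>
        </xs:sequence>
        <!-- Forgiving model: a repeatable choice over the same children
             (plus, say, a note element) validates the whole corpus but
             no longer expresses the regularity that dominates it. -->
      </xs:complexType>
    </xs:element>
  </xs:schema>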

Data structures

SGML and XML are readily interpreted as describing trees; other markup systems are most conveniently understood as serializations of other data structures. What is to be done when the 'natural' data structure for our material doesn't seem to match the data structure of the markup system? Also - can we perform schema validation without trees? Is it possible for a schema to be incorrect? Is it desirable for it to be falsifiable in principle? Some errors of schema design are worth noting and warning against:

  1. the Waterloo error (extreme over-generation; may take the form of deciding not to define a schema at all and relying instead only on the material being well-formed)
  2. the tag-everything error (systematic overkill in the vocabulary, often proceeding from a desire to "tag everything that might be important", lest useful information be lost through not being marked up; reflects a failure to engage with the practical impossibility of marking everything)
  3. the insignificant-order error (a technical issue involving the interleave operator of some schema languages; see the sketch below)
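
On one reading of the third error, sketched here in RELAX NG with hypothetical element names: interleave licenses the children in any order, but it does not make the order found in the documents insignificant, and it silently accepts permutations whose order may in fact carry meaning.

  <element name="entry" xmlns="http://relaxng.org/ns/structure/1.0">
    <interleave>
      <!-- Each child appears exactly once, in any of six orders.
           If the corpus in fact always puts the form first, or if
           applications read meaning into the position of the
           etymology, the schema is making a claim that the data
           and its consumers do not support. -->
      <element name="form"><text/></element>
      <element name="etym"><text/></element>
      <element name="sense"><text/></element>
    </interleave>
  </element>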

Design issues at the language level are only half the problem, though. There are also design issues at the metalanguage level. Metalanguage designers continually trade off expressive power against tractability of validation and other processes. Convenience features for schema authors compete for attention with the simplicity and regularity that make a schema language easier to implement. Should the schema language (and by extension most schema-informed processes) be monolithic or modular? If modular, do the modules form a sequence of layers, or are the interactions more complex? How does one best serve the maintainability of the schema? What operations on schemas would it be useful to support? How should the schema language go about supporting openness and extensibility in schema-defined vocabularies? How do we support extensibility in the schema vocabulary itself? Examples will be drawn largely from the experience of the last decade in the design, implementation, and use of XML Schema 1.0 and 1.1.
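
For the question of openness, one familiar device from XML Schema 1.0 is the wildcard, which leaves a defined slot for material the schema author did not foresee; the element names in this sketch are invented.

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="annotation">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="note" type="xs:string" maxOccurs="unbounded"/>
          <!-- Open slot: elements from any other namespace are allowed
               here; with processContents="lax" they are validated only
               if a declaration for them happens to be available. -->
          <xs:any namespace="##other" processContents="lax"
                  minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>

XML Schema 1.1 extends this repertoire with, for example, open content and conditional type assignment.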
