Guidelines

The TUSNELDA corpus encoding standard was developed as a common annotation standard for corpora collected and annotated in SFB 441. These corpora will be part of the "Tübingen collection of reusable, empirical, linguistic data structures" (TUSNELDA).

A common annotation standard for TUSNELDA was necessary to make the corpora in SFB 441 comparable. The standard is also needed to guarantee that the corpora can be reused with standard corpus-linguistic tools. At the same time, the TUSNELDA standard is flexible enough to meet the different needs of the projects in SFB 441.

The TUSNELDA standard is based on the corpus encoding standard (CES) developed by EAGLES (Expert Advisory Group on Language Engineering Standards): the CES was extended and partly modified according to the needs of the corpus-linguistic work in SFB 441. In some cases where CES was not sufficiently structured for the needs of the corpus linguists in SFB 441, the original TEI solutions proved to be more suitable and were therefore adopted, with some modifications.

The main features of the TUSNELDA corpus encoding standard are also described in Kallmeyer & Wagner: The TUSNELDA annotation standard: An XML encoding standard for multilingual corpora supporting various aspects of linguistic research. To appear in Proceedings of the conference Digital Resources for the Humanities DRH 2000, Sheffield, September 2000.

0.2 Notation conventions for these guidelines

Element names are always set in boldface enclosed in brackets "<" and ">", e.g. <tusneldaHeader>. Attribute names are also set in boldface, e.g. type. For attributes with a default value, the default is marked with an asterisk in the list of values.

1. Some general remarks concerning the whole corpus

1.1 Upwards Compatibility of the DTD

All further modifications of the DTD for TUSNELDA must be such that any text or corpus satisfying a previous version of the DTD also satisfies the new version.

1.2 Meta Language

Several elements in the corpus contain natural language descriptions or notes that are not part of the corpus text itself. For example, as part of the encoding description, the header of a corpus contains a description of the project for which the corpus was encoded. In principle, these natural language meta texts can be written in any language. However, one should take into account the possible users of the corpus when choosing a meta language. Furthermore, if a language other than English is chosen, an additional abstract of the text in English is recommended.

1.3 TUSNELDA conformance

Whether several levels of conformance in the style of the CES levels of conformance are useful for TUSNELDA still needs to be considered. The original CES levels of conformance are not adequate for TUSNELDA, since the DTD for tusneldaDoc is not only an extension but also a modification of the original cesDoc DTD. It is no longer compatible with the cesDoc DTD, i.e. corpora encoded with the TUSNELDA DTDs do not satisfy the CES or XCES (the XML variant of CES) DTDs.

As a desirable level of conformance for the corpora and texts in TUSNELDA, for each publication unit, the following conditions should hold:

the text structure should be encoded at least down to the paragraph level,
for every tag that occurs in the encoded publication unit, the tagging should be complete in the sense that all elements corresponding to this tag are encoded.

A publication unit should be either a text or a corpus. The specific project encoding a corpus can decide which portions of texts or corpora constitute a publication unit.

1.4 The element tusneldaCorpus

The topmost element of the TUSNELDA DTD is the element <tusneldaCorpus>. This element contains a header followed either by one or more texts or by one or more subcorpora. The whole corpus has a header, and each of its texts or subcorpora also has a separate header. A text together with its preceding header constitutes an element <tusneldaDoc>.

<tusneldaCorpus>

contains the whole of a TUSNELDA encoded corpus, comprising a single corpus header and one or more <tusneldaDoc> elements, each containing a single text header and a text. Additionally, the <tusneldaCorpus> element can be recursively nested, and sequences of this element can appear at any nested level, in order to identify sub-corpora. In addition to the global attributes (see Section 2.1), it has the following attributes:

type: used to identify the type of a sub-corpus (by language, genre, etc.) when nested <tusneldaCorpus> elements are used.
version: provides the version of the tusneldaDoc DTD to which this corpus is compliant. If different parts of the corpus were created using different versions of the DTD (this is possible since any version is upward-compatible with its successor), then the value here reflects the highest version number used in the corpus--i.e., the version with which the corpus can be parsed. The attribute version is required.

For organizational reasons, the corpus TUSNELDA will be divided into subcorpora that themselves contain corpora. Corpora encoded by the same project form one subcorpus. Further decisions on how to divide corpora into subcorpora will be made later.

The attribute type is used to characterize subcorpora with respect to language, genre, etc. At a later point, recommendations for possible values of type will be given. These values depend of course on the plans for further subdivisions of the corpus into subcorpora.

Example:

<tusneldaCorpus type="multilingual, SFB 441">
  <tusneldaHeader> 
     ...
     (Header of the whole corpus TUSNELDA)
     ...
  </tusneldaHeader>
  <tusneldaCorpus type="romance languages, project B9">
     <tusneldaHeader>...</tusneldaHeader>
     <tusneldaCorpus type="Spanish, spoken texts"> 
           ...
     </tusneldaCorpus>
     <tusneldaCorpus type="Portuguese, historic texts"> 
         <tusneldaHeader>...</tusneldaHeader>
         <tusneldaCorpus type="Portuguese, historic texts, 15th century">
            ...
         </tusneldaCorpus>
         <tusneldaCorpus type="Portuguese, historic texts, 16th century">
            ...
         </tusneldaCorpus>
         ...           
     </tusneldaCorpus>
  </tusneldaCorpus>
</tusneldaCorpus>
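
Since the attribute version is required, a corpus element as it appears in a file also carries it. The following sketch assumes a hypothetical DTD version number 1.0 and shows a subcorpus that directly contains texts rather than further subcorpora; the content of the <tusneldaDoc> elements is elided:

<tusneldaCorpus type="Spanish, spoken texts" version="1.0">
  <tusneldaHeader>...</tusneldaHeader>
  <tusneldaDoc>
     <tusneldaHeader>...</tusneldaHeader>
     ...
     (first text)
     ...
  </tusneldaDoc>
  <tusneldaDoc>
     <tusneldaHeader>...</tusneldaHeader>
     ...
     (second text)
     ...
  </tusneldaDoc>
</tusneldaCorpus>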

2. The Header

2.1 Global attributes

Three global attributes are defined, which may appear on any element in the header:

id

a unique identifier for the element bearing the ID value.

n

a number or other label for the element, not necessarily unique within the corpus.

lang

indicates that the tag's content is in the specified language. The value of the lang attribute is composed of one of the following:

a two-letter code from ISO 639 (e.g., "en" for English);
a three-letter code from ISO 639-2 (e.g., "eng" for English);
a code from Ethnologue;
one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).

The global attributes for elements in the header are defined at the top of the header.elt and represented by an entity, %A.HEADER. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the DTD of the header.
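
As an illustration, the three global attributes might be used on a header element as follows. The identifier t1, the label 1 and the title text are hypothetical; the language code follows the pattern of an ISO 639 code extended by an ISO 3166 country code, as described above:

<h.title id="t1" n="1" lang="en.uk">
   Sample text - TUSNELDA electronic version
</h.title>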

2.2 The Element tusneldaHeader

Each text in the corpus (i.e. each <tusneldaDoc> element) has its own header, referred to as a text header. The whole corpus and each subcorpus also has a header, referred to as a corpus header, which contains information applicable to the corpus in question (possibly with some local overriding). Both corpus and text headers are represented by <tusneldaHeader> elements. The type attribute with values TEXT or CORPUS is used to distinguish the two.

<tusneldaHeader>

contains the descriptive and declarative information making up an "electronic title page" prefixed to every text, or to the corpus as a whole.

type

specifies the kind of document to which the header is attached. Possible values are:

CORPUS the header is attached to the corpus.

TEXT* the header is attached to a single text.

creator

specifies the agency responsible for creating the header. For TUSNELDA, in most cases this is SFB 441, and further, a specific project can be added.

version

specifies the version and revision of the tusneldaHeader.elt used to encode this header. This number is found near the top of the tusneldaHeader.elt itself.

date.created

specifies the date on which the header content was created.

date.updated

specifies the date on which the header content was last updated.

The attributes date.created, date.updated and version are required.

The <tusneldaHeader> element contains the following four elements:

<fileDesc>: contains a full bibliographic description of the corpus itself or of a text within it.
<encodingDesc>: documents the relationship between an electronic text and the source or sources from which it was derived.
<profileDesc>: provides further information about various aspects of a text, specifically the language used, the situation and date of its production, the participants and their setting, and a descriptive classification for it.
<revisionDesc>: summarizes the revision history for a file.

The subelements <fileDesc> and <profileDesc> are required.

Note that if the lang or wsd attributes are used on elements in the main text, it is required to include a <profileDesc> element containing <langUsage> (for use of lang) and/or <wsdUsage> (for use of wsd).
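
A skeleton of a text header might look as follows. All attribute values are hypothetical, and the date format shown (ISO 8601) is an assumption, since no format is prescribed here for date.created and date.updated. The subelements appear in the order in which they are listed above, and their content is elided:

<tusneldaHeader type="TEXT" creator="SFB 441, project C1" version="1.0"
                date.created="2000-05-02" date.updated="2000-07-14">
   <fileDesc>...</fileDesc>
   <encodingDesc>...</encodingDesc>
   <profileDesc>...</profileDesc>
   <revisionDesc>...</revisionDesc>
</tusneldaHeader>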

2.3 File Description

The file description is the first of the four main constituents of the header and is represented by the <fileDesc> element. It is a required element of the header. The file description documents the electronic file itself, i.e. (in the case of a corpus header) the corpus, or (in the case of a text header) the individual text to which the header applies. The element consists of a title statement, an edition statement, an optional element <extent>, a publication statement and one or more source descriptions:

<titleStmt>: groups information concerning the title of the corpus or the individual text and its constituent texts.
<editionStmt>: contains any additional information relating to a particular version of a text.
<extent>: provides the size of the electronic text as stored on some carrier medium.
<publicationStmt>: groups information concerning the publication or distribution of the corpus and its constituent texts.
<sourceDesc>: supplies a bibliographic description of the copy text(s) from which an electronic text was derived or generated. Further detail is given in the following subsections.

<titleStmt>, <editionStmt>, <publicationStmt>, and <sourceDesc> are required. Note that the <titleStmt> describes the machine-readable file, while the source text is specified in the <sourceDesc>. The title in the <titleStmt> should indicate that this is a machine-readable version and should not be identical to the title of the source text.
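
The resulting overall shape of a file description can be sketched as follows. The content of the subelements is elided, and the value of version is only a placeholder:

<fileDesc>
   <titleStmt>...</titleStmt>
   <editionStmt version="1">...</editionStmt>
   <extent>...</extent>
   <publicationStmt>...</publicationStmt>
   <sourceDesc>...</sourceDesc>
</fileDesc>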

2.3.1 Title Statement

This element consists of a <h.title> element followed by a <respStmt> element. These sub-elements are used throughout the header, wherever the title of a work or a statement of responsibility is required. The element <respStmt> allows one or more statements of responsibility to be listed.

The subelements of <titleStmt>:

<h.title>: the title of the electronic file, including alternative titles or subtitles.
In the case of the whole corpus, the title is TUSNELDA.
Electronic texts inside TUSNELDA are entitled: Title - TUSNELDA electronic version. In those cases where the electronic text has a source text, Title should be the original title of the source.
<respStmt>: supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription.

Both <h.title> and <respStmt> are required.

<respStmt> in turn consists of one or more pairs of elements <respType> and <respName>:


<respType>: contains a phrase describing the nature of a person's or institution's intellectual responsibility. Recommended contents of <respType> are Editor, Publisher, Project, Annotation, Correction.
<respName>: the publisher of the corpus or text expressed as the proper name of a person, place or institution.

Example of a possible element <titleStmt>:

<titleStmt>
   <h.title> 
          Süddeutsche Zeitung - TUSNELDA electronic version
   </h.title>
   <respStmt>
          <respType>Project</respType>
          <respName>C1 (Marga Reis)</respName>
          <respType>Annotation (POS-Tagging)</respType>
          <respName>Student A</respName>
          <respType>Correction</respType>
          <respName>Student B</respName>
   </respStmt>
</titleStmt>

2.3.2 Edition Statement

The element <editionStmt> is required in the file description. The important information contributed by <editionStmt> is the value of the attribute version. This attribute is required.

In corpus headers, the version attribute of the <editionStmt> element is used to indicate both a version number and a revision number, in the form "version.revision", where "version" changes if texts are added to or removed from the corpus, and "revision" changes if amendments are made within texts or the corpus header. In individual text headers, the version attribute carries only a revision number.

Sample edition statement:

<editionStmt version='2'>Second version, substantially extended and corrected.</editionStmt>

Note that with a modification of a document (text or corpus), the value of version in the <editionStmt> of the header changes, and consequently, the attribute date.updated of the header also gets a new value. In other words, because of the version attribute in the edition statement, the attribute date.updated of a header actually refers not only to the header but to the whole corpus.
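
For a corpus header, an edition statement using the "version.revision" form might read as follows. The number 3.2 and the description are hypothetical:

<editionStmt version='3.2'>Third version of the corpus, second revision: corrections in the corpus header.</editionStmt>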

2.3.3 Extent Statement

The element <extent> describes the approximate size of the electronic text as stored on some carrier medium, specified in words, tokens, characters and additionally in Kb.

<extent> is optional and can therefore be omitted, in particular at an early stage of collecting and encoding a text or corpus. However, once a text or corpus has reached a state where it is intended to be used for linguistic research by people other than those encoding the corpus, it is highly recommended to add the <extent> element to the file description.

The <extent> tag contains:

<wordCount>

contains the count of the linguistic words in the text without punctuation

<tokenCount>

count of tokens, i.e. <wordCount> + count of punctuation marks

<characterCount>

count of characters referring to everything counted in <tokenCount>. For <characterCount>, entities are treated as follows: character entities are counted as a single character, and other entities are resolved before counting, i.e. in this case the string the entity stands for is counted.

<byteCount>

contains the count of bytes in the file containing the text together with its markup.

units

gives the unit in which the byte count is measured.

BYTES bytes

KB* kilobytes

MB megabytes

GB gigabytes

The <byteCount> tag gives the size of the text including its tags, in its representation as a text file encoded in an 8-bit ISO character set, which is useful for calculating media requirements or file download times.

<extNote>

a descriptive note supplying additional information of any kind relating to the extent information provided within a corpus or text header. The extent note should at least contain a characterization of punctuation marks, i.e. information about those parts of the text that are counted in <tokenCount> but not in <wordCount>.

<wordCount>, <tokenCount> and <byteCount> are required.
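
A complete <extent> element might look as follows. All figures are invented for illustration, and the order of the subelements simply follows the order of the descriptions above, which is an assumption. Note that the token count exceeds the word count by the number of punctuation marks (here 1500):

<extent>
   <wordCount>10000</wordCount>
   <tokenCount>11500</tokenCount>
   <characterCount>62000</characterCount>
   <byteCount units="KB">95</byteCount>
   <extNote>
      Punctuation marks counted as tokens but not as words:
      full stops, commas, colons, semicolons, question marks,
      exclamation marks, quotation marks and brackets.
   </extNote>
</extent>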

2.3.4 Publication Statement

The publication statement contains the subelements <distributor>, <pubAddress>, the optional elements <telephone>, <fax>, <idno> and <eAddress>, one or more elements <availability> and a <pubDate>:

<distributor>

gives the name of the person or institution who distributes the text or corpus. In case of TUSNELDA, this is usually SFB 441.

<pubAddress>

contains a postal address of the distributor.

<telephone>

gives the telephone number of the person or institution who distributes the text or corpus, in a format conformant to ITU-T/CCITT Recommendation E.123.

<fax>

gives the fax number of the person or institution who distributes the text or corpus, in a format conformant to ITU-T/CCITT Recommendation E.123.

<eAddress>

gives an electronic address of the person or institution who distributes the text or corpus. Note that more than one occurrence of this tag can appear, so that multiple addresses (possibly of different types) can be included. Attribute:

type

gives the type of the electronic address (email address, web site, ftp site, etc.). Suggested values include:

EMAIL* the value is an electronic mail address.

WWW the value is a web site address.

FTP the value is an ftp address.

<availability>

supplies information about the availability of a text, for example, any restrictions on its use or distribution, its copyright status, etc.

region

specifies the territories within which rights in the electronic text apply. Suggested values include:

WORLD*

UNI TUEBINGEN

SFB 441

status

supplies a code identifying the current availability of the text. Values are:

RESTRICTED the text is not freely available.

UNKNOWN* the status of the text is unknown.

FREE the text is freely available.

<idno>

supplies a number (e.g., ISBN) used to identify a bibliographic item.

type

gives the type of the identification number. Suggested values:

ISBN* the number is an ISBN number.

ISSN the number is an ISSN number.

<pubDate>

the publication date expressed in any format

value: specifies standard value for this date in ISO 8601 (Representation of dates and times) format
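
A publication statement might be sketched as follows. The electronic addresses, the availability note and the date are hypothetical, the postal address is elided, and the order of the subelements is an assumption based on the order of the descriptions above:

<publicationStmt>
   <distributor>SFB 441</distributor>
   <pubAddress>...</pubAddress>
   <eAddress type="EMAIL">corpus@example.org</eAddress>
   <eAddress type="WWW">http://www.example.org/corpus</eAddress>
   <availability region="SFB 441" status="RESTRICTED">
      Available for research purposes within SFB 441 only.
   </availability>
   <pubDate value="2000-09-01">1 September 2000</pubDate>
</publicationStmt>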

2.3.5 Source Description

The element <sourceDesc> contains one or more of the following subelements:

<biblFull>: contains a bibliographic citation for a text which has been previously encoded in electronic form. This element is intended to include the header of the electronic text from which the current document is derived. It contains the same elements as the <fileDesc> element. But in contrast to <fileDesc>, the subelements <editionStmt> and <sourceDesc> are optional in <biblFull>.
<biblStruct>: contains a structured bibliographic citation, in which only bibliographic sub-elements appear in a specified order.
<recordingStmt>: this element is intended for recordings of spoken text. It characterizes the source either as a recording made by the corpus encoding agency itself, or as a recording made by some broadcasting agency.
<p>: in some cases, it is not possible to give a source description or the source(s) are already specified at some other place. The first holds for example for a text that is not a recording and that is collected by the encoding agency itself. In this case a corresponding notice in the form of a paragraph (a <p> element) is sufficient. The second holds for corpus headers (in contrast to text headers). In general, the sources of the texts in a corpus should be specified in the header of the texts themselves. The headers of subcorpora may contain an unstructured description of the sources of the texts (in a <p> element) or they may contain an empty <p> element, i.e. no source description.

The headers of individual texts will each contain at least one of the above elements to specify their source. When a particular text contains items derived from more than one bibliographic source or recording, all relevant sources for which information is available are listed in the text header, and individual <div> elements are associated with the correct citation or recording by means of the decls attribute.

If an electronic text has been derived from a previous electronic version of the text, then the source description will contain a <biblFull> element. If this version had itself been derived from another electronic version, then this <biblFull> element may contain yet another <biblFull> element, and so on for as many recursive levels as required. If an electronic text is derived from a print source, it contains a <biblStruct> element describing that source. If it is derived from a recording, it contains a <recordingStmt>.

For electronic texts derived from previous electronic texts, it is recommended to add at least one recursive level, i.e. at least the <sourceDesc> element for the electronic source of the text.
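
The recursion can be sketched as follows for an electronic text derived from an earlier electronic version that was itself derived from a print source. The content of the innermost elements is elided; which subelements of <biblFull> are actually filled in depends on the information available:

<sourceDesc>
   <biblFull>
      <titleStmt>...</titleStmt>
      <publicationStmt>...</publicationStmt>
      <sourceDesc>
         <biblStruct>
            <monogr>...</monogr>
         </biblStruct>
      </sourceDesc>
   </biblFull>
</sourceDesc>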

The <biblStruct> element has the following component sub-elements:

<analytic>: contains bibliographic elements describing an item (e.g. an article or poem) published within a monograph, journal, or periodical and not as an independent publication.
<monogr>: contains bibliographic elements describing an item (e.g. a book or journal) published as an independent item (i.e. as a separate physical object).

At least one <monogr> element must be present in a <biblStruct> element. It may contain the following elements:

<h.title>

the title of a work.

<h.author>

in a bibliographic reference, contains the name of an author (personal or corporate) of a work; names should be given in a canonical form, with surnames preceding forenames.

<respStmt>

supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription. (see the description of <respStmt>)

<edition>

provides bibliographic details for an edition of some text.

<imprint>

groups information relating to the publication or distribution of a bibliographic item.

<idno>

supplies a standard (e.g., ISBN) number used to identify a bibliographic item.

type

a name or abbreviation (e.g., ISBN) identifying what type of identifying number is given. Unless provided explicitly the default value is:

ISBN* the value is an ISBN number.

<biblScope>

defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work.

type

identifies the type of information conveyed by the element.

PP the element contains a page number or page range.

VOL the element contains a volume number.

ISSUE the element contains an issue number, or volume and issue numbers.

<biblNote>

a descriptive note supplying additional information of any kind relating to a bibliographic item described within a corpus or text header.

Published texts must contain at least one <imprint> element, which can contain the following elements:

<publisher>

proper name of a person, place or institution.

type

categorises the name. Legal values are:

PERSON name of a person

PLACE name of a place

ORG name of an organization

<pubDate>

a calendar date in any format.

value: specifies standard value for this date in ISO 8601 format

<pubPlace>

place of publication for a book, article, etc.

The <analytic> element is used when multiple monographic records are grouped together into single items. When the item described by a bibliographic citation forms a part of some other bibliographic item (as, for example, a newspaper article within a newspaper, or a journal article within a collection), a monographic description should be given for the newspaper or collection, prefixed by an analytic description for the individual component, enclosed within an <analytic> element. This contains a mixture of the elements <h.author>, <respStmt> and <h.title> in any order and repeated as necessary.

Sample element <biblStruct>:

<biblStruct>
  <monogr>
     <h.title>Effi Briest</h.title>
     <h.author>Fontane, Theodor</h.author>
     <imprint>
        <pubPlace>Stuttgart</pubPlace>
        <publisher>Manesse-Verlag</publisher>
        <pubDate>1991</pubDate>
     </imprint>
     <idno type="ISBN">3717511327</idno>
  </monogr>
</biblStruct>
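
For an item published within a larger work, such as a newspaper article, the monographic description is prefixed by an <analytic> element. In the following sketch the article title and author are placeholders, the date and page number are hypothetical, and the order of the subelements within <monogr> is an assumption:

<biblStruct>
  <analytic>
     <h.title>Title of the article</h.title>
     <h.author>Surname, Forename</h.author>
  </analytic>
  <monogr>
     <h.title>Süddeutsche Zeitung</h.title>
     <imprint>
        <pubPlace>München</pubPlace>
        <pubDate value="1999-10-12">12 October 1999</pubDate>
     </imprint>
     <biblScope type="PP">3</biblScope>
  </monogr>
</biblStruct>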

The element <recordingStmt> consists of one or more <recording> elements each of which characterizes one of the recordings occurring in the corpus either by giving a bibliographic description of the broadcast the recording comes from or (in case the recording was made by the corpus encoding agency itself) by giving details about the recording such as date, the persons involved in the recording, the equipment used etc.

<recording> consists of an arbitrary number of recording notes or structured descriptions of the recording in the form of elements <respStmt>, <equipment>, <broadcast> or <date>.

<recording> has the following attributes:

type

characterizes the kind of recording. Legal values are:

AUDIO*

VIDEO

dur

gives the duration of the recording

The subelements of <recording> are:

<recNote>: an unstructured description of aspects of the recording that are notable and that cannot be described in any of the other subelements of <recording>

<respStmt>: (see the description of <respStmt>). Possible contents of the subelement <respType> are Interviewer, Interviewee etc.

<equipment>: describes the technical equipment used to perform the recording.

<broadcast>: contains a single subelement <biblStruct> that gives details of the broadcast in a form analogous to bibliographic citations. The broadcasting agency responsible for a broadcast is regarded as its author, while other participants (for example interviewers, interviewees, directors, producers, etc.) should be specified using the <respStmt> inside <biblStruct> or the <editor> element.

<date>: contains the date of the recording.
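
A recording made by the encoding agency itself might be described as follows. All values are hypothetical; in particular, the format of the dur value and of the date is an assumption, since neither is prescribed here:

<recordingStmt>
   <recording type="AUDIO" dur="45 min">
      <respStmt>
         <respType>Interviewer</respType>
         <respName>Student A</respName>
         <respType>Interviewee</respType>
         <respName>Speaker 1</respName>
      </respStmt>
      <equipment>Portable cassette recorder with external microphone</equipment>
      <date>2000-03-15</date>
      <recNote>Recording made in the speaker's home; some background noise.</recNote>
   </recording>
</recordingStmt>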

2.4 Encoding Description

The second major component of the header, the encoding description, contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus.

The <encodingDesc> element has the following seven components:

<projectDesc>: describes in detail the purpose for which an electronic file was encoded.
<samplingDecl>: contains a prose description of the rationale and methods used in sampling texts in the creation of the corpus.
<editorialDecl>: provides details of editorial principles and practices applied during the encoding of a text.
<tagsDecl>: provides detailed information about the tagging applied to an SGML document.
<annotations>: groups information about existing annotation files associated with the text.
<refsDecl>: specifies how canonical references are constructed for this text.
<classDecl>: contains a series of <category> elements, defining the classification codes used for texts within the corpus.

The element <projectDesc> is required; all other subelements of the encoding description are optional.

2.4.1 Project Description

This element provides information about the project for and by which the text or corpus was created, together with any other relevant information concerning the process by which it was assembled or collected. The content of this element is an unstructured note. Example:

      <projectDesc>
           The MULTEXT project is assembling a corpus consisting of
           mono-lingual texts in seven Eastern and Western European
           languages, together with parallel translations in each of
           these languages. The original texts were acquired in various
           forms and marked up for conformance with the MULTEXT/EAGLES
           Corpus Encoding Standard, to test and validate that scheme.
           
           MULTEXT has also developed a suite of annotation tools which
           have been tested on the texts in the corpus. 
      </projectDesc>

A minimal encoding description can contain only the <projectDesc> element. In this case, a prose description of the encoding methods can be provided. If documentation of encoding principles exists in another location (a manual, etc. in printed form, at a given URL, in an ftp site, etc.) this information should be provided.
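
Such a minimal encoding description might be sketched as follows. The wording of the note is hypothetical, and the project name and the location of the documentation are deliberately left open:

<encodingDesc>
   <projectDesc>
      This corpus was collected and annotated by project ... of
      SFB 441. Paragraph and sentence boundaries were marked up
      automatically and corrected by hand. The encoding principles
      are documented in a separate manual, available at ...
   </projectDesc>
</encodingDesc>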

In principle, the <projectDesc> can be written in any language. However, potential users should be taken into account when choosing a language for the <projectDesc> (see also Section 1.2, Meta Language). If a language other than English is chosen, an additional short version of the project description in English is recommended.

2.4.2 Sampling Declaration

The <samplingDecl> element is also an unstructured note, which contains information about the methods for text sampling in the corpus. Concerning the language that is chosen for the sampling declaration, the same considerations hold as in the case of the project description.

The encoding description can contain any number of sampling declarations.

A <samplingDecl> element occurring in the header of a corpus gives information about the choice of the texts in the corpus whereas an element <samplingDecl> occurring in the header of a text provides details about the inclusion or exclusion of portions of the text. In both cases, the sampling declaration preferably includes information about the reason for this sampling, and the means by which this is noted in the encoding, if any.

For example (adapted from English-Norwegian Parallel Corpus Project manual):

      <samplingDecl>
           The texts of the core corpus are mostly extracts from books. 
           The extracts are between 10,000 and 15,000 words long (30 - 40  
           pages), and are taken from the beginning of the texts. The front  
           matter, prefaces, forewords, list of contents, etc., are not  
           included in the extracts. In some cases, introductions have been  
           left out as well, e.g. introductions by scholars to works of  
           fiction.
           
           Omission of passages in the text may be marked by an 
           <omit> tag. 
      </samplingDecl>

2.4.3 Editorial Declaration

The <editorialDecl> element contains elements that each specify a particular kind of editorial practice used for some portion of the corpus. Where the same principles apply across the whole corpus (e.g., for the <segmentation> element), they can be documented only once within the corpus header. If different parts of the corpus apply different practices (as for example with the <quotation> or <hyphenation> elements), all possible practices can be defined in the corpus header, and particular parts of the corpus can specify the editorial practices applicable to them by using the decls attribute. When this method is used, if a practice is not explicitly associated with a part of the corpus in this way, it is assumed not to apply to it.

The <editorialDecl> element contains the following elements:

<correction>

specifies a set of correction practices applied in creating one or more components of the corpus. For TUSNELDA in all cases of corrections, the original must be retained in an attribute. This is automatically controlled. Corresponding to this, the default value of the attribute method is TAGS.

method

indicates whether corrections are made without notation or made by including editorial tags.

TAGS* correction indicated with tags

SILENT correction made silently

<quotation>

specifies editorial practice adopted with respect to quotation marks in the original.

marks

indicates whether or not quotation marks are retained as tag content in the text.

NONE no quotation marks have been retained

SOME some quotation marks have been retained

ALL* all quotation marks have been retained

form

specifies how quotation marks are indicated within the text.

STD use of quotation marks has been standardized; open and close quote marks are distinct.

NONSTD open and close quote marks are represented indiscriminately.

UNKNOWN* use of quotation marks is unknown.

<hyphenation>

summarizes the way in which end-of-line hyphenation in a source text has been treated in an encoded version of it. For TUSNELDA, it is recommended to document each elimination of hyphenation and to retain the original form in a tag. <hyphenation> should contain a description of the method used for eliminating hyphenation marks, preferably with information about the script applied for this process.

<segmentation>

describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc. A detailed specification of the methods applied for segmentation is recommended. <segmentation> consists of arbitrarily many pairs of a tag and the corresponding segmentation method followed by at least one segmentation note.

<tag>: a specific tag
<segmMethod>: the segmentation method applied for this tag. All scripts used for segmentation with respect to a certain tag should be named in its segmentation method. If possible, these scripts are distributed together with the corpus.
<segmNote>: additional remarks concerning the segmentation of the text

<transduction>

describes the principles according to which the text has been transduced, either in transcribing it from audio tape to written form, or in converting from an electronic original. If possible, the scripts used for transduction should be named.

<normalization>

specifies a set of normalization practices applied in creating one or more components of the corpus. For <normalization>, as for <correction>, the original form must be retained in an attribute. Compliance with this rule is checked automatically.

method

indicates whether normalization is made without notation or made by including editorial tags.

TAGS* normalization indicated with tags

SILENT normalization made silently

Example of an element <segmentation>:

<segmentation>
 <tag>s</tag>
 <segmMethod>automatic tagging with the tool XYZ developed by 
    A.B. at the C.D. Institute. The tool works as follows: 
    If it encounters a question or exclamation mark, then the 
    end of an element s is recognized. If a full stop is 
    encountered, then the tool checks whether it is part of an 
    abbreviation or an ordering number. If this is not the case, 
    then the end of an element s is recognized. In 
    order to check for abbreviations, the tool makes use of an 
    abbreviation list.
 </segmMethod>
</segmentation>
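
The other editorial declarations can be documented in the same way. The following is a minimal sketch of an <editorialDecl>, not a normative example. The id values, the prose descriptions and the script name dehyph.pl are hypothetical; it is also an assumption that these elements take short prose as content and carry the global id attribute. Attribute values are quoted, as XML requires. <segmentation> and <transduction> are left out for brevity.

```xml
<editorialDecl>
  <!-- method defaults to TAGS; it is spelled out here for clarity -->
  <correction id="corr1" method="TAGS">Obvious typing errors are
    corrected; the original form is retained in an attribute.
  </correction>
  <quotation id="quot1" marks="ALL" form="STD">All quotation marks
    are retained; opening and closing marks are distinct.
  </quotation>
  <hyphenation id="hyph1">End-of-line hyphenation was removed with
    the (hypothetical) script dehyph.pl; each original form is
    retained in a tag.
  </hyphenation>
  <normalization id="norm1" method="TAGS">Historical spellings are
    normalized; the original form is retained in an attribute.
  </normalization>
</editorialDecl>
```

Under these assumptions, a part of the corpus that follows the quotation practice above would select it with decls="quot1".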

2.4.4 Tags Declaration

The <tagsDecl> element is used differently in corpus and in text headers. In a corpus header, it is used to list all the element names actually used within the corpus, together with a brief description of their functions. Furthermore, it specifies the number of SGML elements actually tagged within each corpus. In text headers, the same element is used only to count the number of SGML elements tagged within the text. In both cases the element consists of a number of <tagUsage> elements, defined as follows:

<tagUsage>

supplies information about the usage of a specific element within the corpus or text with which this header is associated.

gi

the name (generic identifier) of the element indicated by the tag. This attribute is required.

occurs

specifies the number of occurrences of this element within the text.

wsd

can be used on a <tagUsage> element to indicate that for every appearance of the described element in the text, the content defaults to the specified character set. Therefore the declaration

<tagUsage gi=term occurs=5 wsd="ISO 8859-5">

indicates that the content of all <term> elements is in the ISO 8859-5 character set.

Note that the global attribute lang can similarly be used in a <tagUsage> element to indicate that for every appearance of the described element in the text, the content defaults to the specified language.

In the corpus header, each <tagUsage> element contains a brief description of the element specified by its gi, and the occurs attribute is not supplied. In text headers, the <tagUsage> elements may be empty, but the occurs attribute is always supplied.

The header of TUSNELDA must contain (as part of the <tagsDecl> element) one <tagUsage> element for each tag used in TUSNELDA. This <tagUsage> specifies in its content the semantics of the tag in question. This guarantees a uniform semantics of the single tags used throughout the subcorpora of TUSNELDA.

A typical written text has a tag declaration like the following:


            <tagsDecl>
               <tagUsage gi=name occurs=256>
               <tagUsage gi=div occurs=7>
               <tagUsage gi=head occurs=7>
               <tagUsage gi=p occurs=705>
               <tagUsage gi=reg occurs=2>
               <tagUsage gi=sic occurs=1>
               <tagUsage gi=body occurs=1>
            </tagsDecl>
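
In a corpus header, by contrast, each <tagUsage> carries a description and no occurs attribute. The following sketch uses quoted attribute values and explicit end tags, as XML requires. The descriptions are illustrative wording only, not the official TUSNELDA definitions of these tags.

```xml
<tagsDecl>
  <!-- corpus header: description as content, no occurs attribute -->
  <tagUsage gi="div">any subdivision of a written text</tagUsage>
  <tagUsage gi="head">a heading, e.g. the title of a section</tagUsage>
  <tagUsage gi="p">a paragraph in a written text</tagUsage>
</tagsDecl>
```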

A Perl script that automatically generates <tagUsage> elements with appropriate values for the tags in any SGML text is available at

<URL: http://www.cs.vassar.edu/~priestdo/research/scripts/tagusage.txt>

2.4.5 Annotations in separate Files

The element <annotations> groups information about annotation documents associated with the text. The following elements are used for these purposes:

<annotation>

gives information about an annotation file associated with the text. Attributes:

type

indicates the type of annotation. Values include:

TOKEN annotation file contains segmentation into tokens.

MORPHSYN annotation file contains morpho-syntactic category information for the words in the text.

SYNTAX annotation file contains syntactic (i.e. structural) information for phrases in the text.

SEGMENT annotation file contains segmentation into sentences and words.

ALIGN annotation file contains alignment links to a parallel translation.

ann.loc

provides information (path/file name, URL, etc.) about the location of the annotation file.

trans.loc

for annotation files containing alignment information, trans.loc provides information (path/file name, URL, etc.) about the location of the file containing the aligned text.
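
As an illustration, an <annotations> element for a text with a morpho-syntactic annotation file and an alignment file might look as follows. This is only a sketch: all file names are hypothetical, and the <annotation> elements are written as empty elements because their content model is not described here.

```xml
<annotations>
  <!-- morpho-syntactic categories for the words of this text -->
  <annotation type="MORPHSYN" ann.loc="text01.morph.xml"/>
  <!-- alignment links; trans.loc names the file with the aligned text -->
  <annotation type="ALIGN" ann.loc="text01.align.xml"
              trans.loc="text01.fr.xml"/>
</annotations>
```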

2.4.6 Reference Declaration

The element <refsDecl> is useful for encoding corpora since it provides information about references which are often used in the alignment of parallel texts. In particular, it is common to use ID values on tags marking paragraphs and sentences as references in links associating two parallel texts. See for example, the English-Norwegian Parallel Corpus Project and The Lingua Parallel Concordancing Project.

     <refsDecl>
          A reference system is built up using the identifiers of the 
          following text units: text, division, paragraph, s-unit.
          Each nested division has an identifier which is built up by 
          successively adding to the identifier of the text. Each  
          paragraph has an identifier which adds yet another layer to the
          immediately superordinate identifier. S-units are numbered  
          within the nearest division, as shown above. After alignment,  
          each s-unit in the core corpus has a "corresp"  
          attribute containing a reference to the corresponding unit(s) in  
          the parallel text.   
      </refsDecl>

2.4.7 Class Declaration

The <classDecl> element provides means to define a set of text categories for classifying texts in the corpus. A standardized set of text categories is under development by the EAGLES Corpus Working Group on Text Typology, which may eventually eliminate the need to explicitly provide a descriptive taxonomy in the corpus header.

The <classDecl> element contains the descriptive taxonomy used to classify texts within the corpus. It occurs once, in the corpus header, and consists of one or more <taxonomy> elements. The <taxonomy> element in turn contains either a set of <category> elements, each representing a particular textual classification feature and a value for that feature; or one of the elements <h.bibl> or <biblStruct>, providing a bibliographic citation for documentation of a categorization scheme, followed optionally by a set of <category> elements. The <h.bibl> element contains only unstructured text and is used for cases where only a very simple citation is required.

<taxonomy>: defines a typology used to classify texts.

<category>: contains an individual descriptive category or feature-value pair.

The global id attribute is required for the <category> element, since it is used to associate a <catRef> within a text header with the descriptive category appropriate to it. The category element contains a set of <catDesc> elements:

<catDesc>: describes a category within the text typology, in the form of a brief prose description.

The <catDesc> element is used to contain the value for a feature within a <category>, unless that category is further subdivided, in which case a nested <category> element may be used.

Within the <textClass> element of the header for each text, a <catRef> element is provided, the target attribute of which lists the identifiers of all <category> elements applicable to that text.

When a standard set of text categories is developed, it is anticipated that an attribute on <textClass> will provide the category. Unless the standard categories are extended, no pointer to <category> elements in the corpus header will be required.

A taxonomy for the classification of texts inside TUSNELDA will be specified later.
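
Until that taxonomy is fixed, the mechanism can be illustrated with made-up categories. In the following sketch all id values and descriptions are hypothetical and do not anticipate the TUSNELDA taxonomy. It is also an assumption that a subdivided <category> may keep its own <catDesc> next to the nested <category> elements and that <taxonomy> carries an id.

```xml
<classDecl>
  <taxonomy id="tax1">
    <category id="news">
      <catDesc>newspaper text</catDesc>
      <!-- nested categories subdivide the category "news" -->
      <category id="news.rep">
        <catDesc>news report</catDesc>
      </category>
      <category id="news.ed">
        <catDesc>editorial</catDesc>
      </category>
    </category>
    <category id="lit">
      <catDesc>literary prose</catDesc>
    </category>
  </taxonomy>
</classDecl>
```

The header of a text classified as an editorial would then name the identifier news.ed in the target attribute of its <catRef>.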

2.5 Profile Description

The third component of the header is the profile description. The <profileDesc> element has the following components:

<creation>: groups information about the period and place of creation of a text.
<langUsage>: groups information describing the languages, sublanguages, registers, dialects etc. represented within a text.
<wsdUsage>: groups information describing the character set(s) used within a text.
<textClass>: groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.
<translations>: groups information about existing translations of the text.

TUSNELDA is a multilingual corpus. Therefore the specification of the languages in the corpus is indispensable. For this reason, <profileDesc> and its subelement <langUsage> are both required.

2.5.1 Creation

The element <creation> gives information about the period and the place of creation of a text. This element differs from <pubDate> (a subelement of <imprint>), which relates to the publication of a text, not to its creation. (For Middle High German texts published in the 20th century, for example, <pubDate> and the date specified in <creation> are different.) Since the exact creation date of historical texts often cannot be determined, <creation> has the attributes earliest and latest, which allow a creation period to be specified. If the exact creation date is known, it is given as the value of both attributes. Further, <creation> has an attribute place that is used to specify the place of creation of a text.

earliest: the earliest possible creation date, i.e. the latest date where one can be sure that the text was created later.
latest: the latest possible creation date, i.e. the earliest date where one can be sure that the text was created earlier.
place: the creation place.

Example:

<creation earliest=1066 latest=1127 place=England>...</creation>

2.5.2 Language Usage

The <langUsage> element contains one or more <language> elements, each identifying a language used in the text:

<language>

characterizes a language, sublanguage, register, dialect, etc., used within a single text.

type

indicates the type of language, e.g., sublanguage, dialect, etc.

ethnologue

gives the language code from Ethnologue.

iso639

gives the standard language code from ISO 639 in one of the following forms:

a two-letter code from ISO 639 (e.g., "en" for English);
a three-letter code from ISO 639-2 (e.g., "eng" for English);
one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).

This attribute is required.

In some cases, the language codes from ISO 639 or ISO 639-2 might be insufficient to characterize the languages used in a corpus. In these cases an alternative coding using the Ethnologue code might be possible, because that standard is more fine-grained than the ISO standard. However, since ISO 639 is an important and frequently used standard, a characterization using ISO 639 is required; a characterization with the Ethnologue code may be given in addition. If neither of them provides a code for the language in question, the content of the <language> element or the attribute type gives a description of the language.

Example of a <langUsage> element:

 <langUsage>
   <language id="fr" iso639="fr">French</language>
   <language id="en" iso639="en">English</language>
   <language id="la" iso639="la">Latin</language>
   <language id="swg" ethnologue="swg" iso639="de">Swabian</language>
 </langUsage>

The value of the id attribute on any <language> element should be given as a value for the global lang attribute when it is used on a tag in the text or header to refer to this language. For example,

  She ate <foreign lang=fr>croissants</foreign>

When more than one character set is used in a text, the wsd attribute should be used on each <language> tag to associate the language with a particular character set.

2.5.2 Writing System

The element <wsdUsage> contains one or more <writingSystem> elements, each identifying a character set used in the text:

<writingSystem>: characterizes a character set used within a single text.

Example:

      <wsdUsage>
          <writingSystem id="ISO 8859-1">ISO character set for western 
                   European languages</writingSystem>
          <writingSystem id="ISO 8859-5">ISO character set for 
                   Cyrillic</writingSystem>
      </wsdUsage>

The value of the id attribute on any <writingSystem> element should be given as a value for the global wsd attribute when it is used on a tag in the text or header to refer to this character set. For example,

       This is a patch of Cyrillic: 
       <foreign lang=bu wsd="ISO 8859-5">


       </foreign>

When a writing system declaration describing a transcription scheme is provided as an auxiliary document, the value of the wsd attribute on the <writingSystem> element must be an entity pointing to this document. Usually, the entity expands to be the name of the file in which the writing system declaration is stored. Note that for this reason, the type of the wsd attribute on the <writingSystem> element is ENTITY (indicating that its value must be an SGML entity). In all other instances, whether in the header or text, the type of the wsd attribute is CDATA.

2.5.3 Text Classification

The <textClass> element contains references to the text classification scheme and descriptive keywords which together describe the text concerned. The following elements are used for these purposes:

<catRef>

specifies one or more defined categories within the taxonomy or text typology given in <classDecl>. In TUSNELDA, all texts are classified using the same taxonomy.

target: identifies the text category or categories, by means of an IDREF pointing to one or more <category> elements defined in the corpus header.
scheme: adds information about the text classification scheme.

Endtag omission is allowed for <catRef> since the content of <catRef> is empty.

<h.keywords>

contains a list of key terms identifying the topic or nature of a text, each of which is tagged as a term.

<keyTerm>: a keyword or a phrase

Although EAGLES/PAROLE plans to provide a standard list of keywords, for TUSNELDA a standardization of the keywords is not recommended. Instead, it is rather an advantage of the keywords element that, in addition to a text classification in terms of a given taxonomy, <h.keywords> allows an unrestricted and flexible characterization of a text.
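
A <textClass> element combining both kinds of information might look like the following sketch. The category identifier news.ed and the key terms are hypothetical; the identifier stands for the id of some <category> element defined in the corpus header.

```xml
<textClass>
  <!-- target holds the id of a <category> in the corpus header -->
  <catRef target="news.ed"/>
  <!-- free keywords, not drawn from a fixed list -->
  <h.keywords>
    <keyTerm>election</keyTerm>
    <keyTerm>coalition talks</keyTerm>
  </h.keywords>
</textClass>
```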

2.5.4 Translations

The element <translations> groups information about translations of the text which exist, usually within the same corpus. The following elements are used for these purposes:

<translation>

gives information about a translation of the text. The global lang attribute and the wsd attribute are required on this tag. Additionally, this tag has the following optional attribute:

trans.loc: provides information (path/file name, URL, etc.) about the location of the translation.

<translator>: gives the name of the translator.

Note that endtag omission is allowed for the <translation> element, since in some cases all relevant information is supplied in attributes only. Thus, where appropriate, this element can function as an empty element, e.g.:

<translations>
     <translation trans.loc="1984.sl.ces" lang=sl wsd="ISO8859-1" n=1>
     <translation trans.loc="1984.es.ces" lang=es wsd="ISO8859-1" n=2>
     <translation trans.loc="1984.ro.ces" lang=ro wsd="ISO8859-1" n=3>
</translations>

2.6 Revision Description

The revision description is the fourth element in the header. It is used to record details of any significant change to the corpus. The <revisionDesc> element has the following component:

<change>: summarizes a particular change or correction made to a particular version of an electronic text which is shared between several researchers.

Multiple <change> elements are provided for; one should appear per change.

The <change> element contains the following subelements:

<changeDate>

gives the date of the change.

value: specifies standard value for this date in ISO 8601 format

<respName>

specifies the person responsible for the change.

<h.item>

specifies the nature of the change(s). One or more occurrences of this element may appear within each <change> element.

When any significant change is made to any component of the corpus, the following steps should be taken:

a <change> element is added to the <revisionDesc> of the text affected
the date.updated attributes of the text header and of any header above it are changed to the date of the change
the revision number specified on the version attribute of the <editionStmt> of the corpus header is incremented.

If possible, control and administration of the versions of a corpus should be done automatically, for example with RCS. The element <revisionDesc> then could be automatically generated and modified.
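
Whether written by hand or generated, a revision description with a single change might look like the following sketch. The date, the name and the descriptions are invented for illustration; the order of the subelements follows the order in which they are listed above, which is an assumption about the DTD.

```xml
<revisionDesc>
  <change>
    <!-- value gives the date in ISO 8601 format (YYYY-MM-DD) -->
    <changeDate value="2000-11-06">6 November 2000</changeDate>
    <respName>A. Encoder</respName>
    <h.item>Corrected sentence boundaries in division 3.</h.item>
    <h.item>Regenerated the tagUsage counts.</h.item>
  </change>
</revisionDesc>
```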

2.7 Summary

The minimal header has the following structure:


        <tusneldaHeader>
            <fileDesc>
                 <titleStmt>
                     <h.title></h.title>
                     <respStmt>
                         <respType></respType>
                         <respName></respName>
                     </respStmt>
                 </titleStmt>
                 <editionStmt></editionStmt>
                 <publicationStmt>
                     <distributor></distributor>
                     <pubAddress></pubAddress>
                     <availability></availability>
                     <pubDate></pubDate>
                 </publicationStmt>    
                 <sourceDesc>
                     <p></p>
                 </sourceDesc>
            </fileDesc>
            <profileDesc>
                 <langUsage>
                     <language></language>
                 </langUsage>
            </profileDesc>
        </tusneldaHeader>

3. Documents: The element <tusneldaDoc>

3.1 The element <tusneldaDoc>

The element <tusneldaDoc> contains a single document, either forming part of or derived from a corpus. The global attributes of the <tusneldaHeader> element, i.e. id, n and lang (see section 2.2), plus the following attributes are defined:

type: indicates the type of document (text, spoken data, etc.); the default is TEXT.
version: provides the version of the tusneldaDoc DTD to which this text is compliant. The version attribute is required.

A <tusneldaDoc> element consists of a <tusneldaHeader>, followed by a <text> element, which may in turn contain a <body> element or a <group> element.

<tusneldaHeader>

contains the header for the text. This element is fully described in section 2.2.

<text>

contains an individual text. Global attributes of the <tusneldaHeader> plus:

complete

specifies whether this text is complete or a sample.

Y* the full text of the original has been transcribed
N a sample of the original text has been taken

decls: specifies one or more IDs associated with elements in the text header that apply to this element.
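
Put together, the outer structure of a document can be sketched as follows. The id, the version number 1.0 and the decls value quot1 are hypothetical; quot1 stands for the id of some declaration in the header. The header content and any attributes of <tusneldaHeader> are omitted.

```xml
<tusneldaDoc id="doc1" type="TEXT" version="1.0">
  <tusneldaHeader>
    <!-- header content as described in section 2 -->
  </tusneldaHeader>
  <!-- complete="N": only a sample of the original was taken -->
  <text complete="N" decls="quot1">
    <body>
      <!-- paragraph-level elements and text divisions go here -->
    </body>
  </text>
</tusneldaDoc>
```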

3.2 Text contents

The <text> tag may contain one occurrence of one of the following:

<group>: groups together a sequence of distinct texts that are regarded as a unit, such as a sequence of prose essays, poems, etc. A <group> tag may contain an optional sequence of paragraph-level elements (as described in section 3.3.4), followed by one or more <body> elements.
<body>: contains the body of the text, excluding any front or back matter. Formally, it consists of an optional sequence of paragraph-level elements (cf. section 3.3.4), followed by an optional sequence of text divisions (as described in section 3.3).

For both the <group> element and the <body> element, the global attributes for the <tusneldaHeader> (cf. section 2.2) plus the following attributes are defined:

wsd

indicates that the element's content is encoded in the specified character set. The value of the attribute is the character set name (ISO-8859-1, etc.) which should be the same as that appearing on a <writingSystem> element in the header document which describes that character set.

rend

provides information about rendition in an original printed version. The rend attribute may e.g. take one of the following values (other values are also valid):

BO bold face

BX boxed

IT italic font

RO roman font

UL underlined

CA capital letters

These five attributes, namely id, n, lang, wsd, and rend, are also defined for any element embedded within a TUSNELDA <body> or <group> element. For simplicity, we will refer to them as "the textual attributes" from now on.

Note that there is no provision for the encoding of front matter such as cover page, table of contents, appendixes, etc., in the current TUSNELDA recommendations. For the most part, such material is unnecessary for corpus linguistics and should not be included.

3.3 Text divisions

Written texts exhibit a variety of different structural forms. Some have very little organization at levels higher than the paragraphs, while others have a complex hierarchy of parts, sections, chapters etc. Novels are divided into chapters, newspapers into sections, reference works into articles, etc.

The following element is used to represent textual divisions of all kinds:

<div>: any subdivision of a written text, e.g. chapter, section, sub-section, article, etc.

If a text has any structural subdivision, then at least those at the highest level should be identified.

The <div> element has the following attributes:

type

categorises the division in some respect, e.g. as a chapter, section etc. A set of precise values will be provided at a later stage. The attribute is required.

complete

specifies whether this division is complete or a sample.

Y* the full text of the original has been transcribed
N a sample of the original text has been taken

decls: specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.

The n global attribute can be used to carry an identifying name or number used within the text for a given division, for example, a chapter number, as in the following example:

<div type=CHAPTER n=5>

Furthermore, the global attributes id and lang (cf. section 2.1), as well as the attributes rend and wsd (cf. section 3.2), are defined for <div> elements.

The content of the <div> tag is defined to consist of one or more division head elements (optional) followed by a sequence of paragraph-level elements and/or <div> elements, followed by one or more division closing elements (optional).

3.3.1 Contents of text divisions

Below the level of text divisions, there are three general groups of elements which may appear:

Division head elements: information such as section titles, bylines, etc. that often appears at the beginning of text sections.
Paragraph-level elements: further division of the text, into paragraphs, etc.
Division closing elements: information such as datelines, bylines, etc. that can appear at the end of a text section, especially in newspapers, etc.

3.3.2 Division head elements

Division head elements include:

<opener>

groups together any opening material that is not a heading at the start of a division, including in particular <dateline> and <keywords>.

<head>

contains any heading, for example, the title of a section. This element can also appear inside the <list> and <poem> elements to mark the title of a list or poem. It can contain any phrase-level element.

type: gives the type of header, e.g., main, sub, unspecified, etc.

<byline>

contains the primary statement of responsibility given for a work on its title page or at the head or ending of the work, most often applicable to newspapers. Can contain any phrase-level element plus the tag <docAuthor> for the author's name.

Any (possibly empty) sequence of these forms a division opening element.

3.3.3 Division closing elements

Division closing elements include:

<closer>: groups together material appearing at the end of a division, including in particular <dateline> and <keywords>.
<byline>: same as above.

The <keywords> element can contain terms and lists of terms that may appear at the beginning or end of a text as identifying material.

The <dateline> element can contain untagged prose intermixed with markup for dates, times, names, addresses, abbreviations, and numbers.

Any (possibly empty) sequence of these forms a division closing element.
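
A division with head elements, two paragraphs and a closing element might be encoded as in the following sketch. The type value ARTICLE is hypothetical, since the precise set of values has not yet been fixed; the names, the date and the text are invented.

```xml
<div type="ARTICLE" n="1">
  <!-- division head elements -->
  <head type="main">Title of the article</head>
  <byline>by <docAuthor>A. Author</docAuthor></byline>
  <!-- paragraph-level elements -->
  <p>First paragraph of the article.</p>
  <p>Second paragraph of the article.</p>
  <!-- division closing element -->
  <closer>
    <dateline>Tübingen, 6 November 2000</dateline>
  </closer>
</div>
```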

3.3.4 Paragraph-level elements

A number of divisions of text occur at what is called the paragraph-level, since the most common division at this level is <p> (paragraph). There are in addition several other elements which may appear directly within structural divisions (that is, not nested within some other element). The first six elements have been taken over without any changes from the CES.

<p>: a paragraph in a written text.
<list>: a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit.
<poem>: a poem, or an extract from one, embedded or quoted within a text.
<caption>: (1) a heading, title etc. attached to a picture or diagram (2) a "pull quote" or other text about or extracted from a text and superimposed upon it to draw attention to it.
<bibl>: a loosely-structured bibliographic citation appearing within a corpus text.
<note>: any form of note, usually a footnote. This tag is used only for notes that are a part of the original data, not for notes which may be added by the encoder, etc.
<table>: contains text displayed in tabular form, in rows and columns.

The next four paragraph-level elements have been revised for the special needs of the TUSNELDA corpus annotation. The <linkGrp> element has been adopted from the cesAlign DTD.

<sp>: contains material marked as "written to be spoken" or "written as spoken", usually by the presence of a speaker prefix, for example in a play script or printed interview.
<quote>: a quotation from some author other than that of the surrounding text, usually either embedded or displayed.
<figure>: indicates the location of a graphic, illustration, or figure.
<linkGrp>: serves to bundle a group of links together.

The paragraph-level elements are discussed in more detail below.

3.3.4.1 Paragraphs

Paragraphs consist of any kind of sequence of phrase-level elements. All attributes appropriate for textual elements, i.e. the global attributes id, n, and lang, plus rend and wsd are defined (for details see sections 2.1 and 3.2).

3.3.4.2 Lists

Lists may occur between paragraphs (i.e., on the paragraph-level) or within paragraphs (i.e. on the phrase-level). A list consists of an optional <head> element, followed by one or more <item> elements, each of which may optionally be prefixed by a <label> element:

<item>: an item within a list.
<label>: an enumerator or other label attached to a list item. Lists may or may not be marked. Where marked, they may appear within or between paragraphs.

The <label> element is used to hold the identifier or tag sometimes attached to a list item, for example "(a)", or a word or phrase used for a similar purpose. An example:


<list>
   <label>Erstens</label>
   <item>sollte die DTD den Bedürfnissen von Korpuslinguisten, 
         speziell im Rahmen von TUSNELDA, entsprechen,
   </item>
   <label>zweitens</label>
   <item>sollte die DTD problemlos in XML konvertierbar sein, und
   </item> 
   <label>drittens</label>
   <item>sollte sie sich so weit wie möglich an bestehenden 
     Standards orientieren.
   </item>
</list>

Here the <label> element is part of the annotated text itself and must not be deleted. Note, however, that for the purposes of corpus-based work, it is preferable in many cases to regard list labels as rendition information and to encode them in the n attribute, rather than as part of the document content.

The <item> element may appear only inside lists. It contains the same elements as a paragraph, or a sequence of paragraphs, and may therefore contain one or more nested lists.
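
The alternative of recording list labels in the n attribute, together with a nested list, can be sketched as follows. The item texts are invented, and it is an assumption that the n attribute accepts label strings such as "(a)".

```xml
<list>
  <!-- the label "(a)" is recorded in n instead of a <label> element -->
  <item n="(a)">first item
    <list>
      <item n="(i)">nested item</item>
      <item n="(ii)">another nested item</item>
    </list>
  </item>
  <item n="(b)">second item</item>
</list>
```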

3.3.4.3 Poems

Poems or fragments of verse or song may only appear between paragraphs. They are marked using the <poem> element, which contains an optional series of <head> elements followed by one or more <lg> or <l> (for line) elements. The <l> element is used to mark metrical lines rather than typographic lines:

<lg>

groups verse lines (marked by <l>), most often into stanzas. Use the type attribute to identify the reason for the grouping.

<l>

a line of verse.

part

indicates whether the verse line is metrically complete.

U* metricality is not known or inapplicable

Y the line is metrically complete

N the line is metrically incomplete

Note that the <lg> element may be recursively nested, in order to provide for sub-groupings of lines. In this case, the n attribute should be used to indicate the nesting level (e.g., n=1 for the outer level, n=1.1 for a nested sub-level, etc.; see the section on Reference systems).

Here is an example of (part of) a TUSNELDA-annotated poem, which has been taken from the Gentle Introduction to SGML:


      <poem>
         <head>The Sick Rose</head>
         <lg type=stanza n=1>
            <l part=Y>O Rose thou art sick.</l>
            <l part=Y>The invisible worm,</l>
            <l part=Y>That flies in the night</l>
            <l part=Y>In the howling storm:</l>
         </lg>
         <lg type=stanza n=2>
            <l part=N>Has found thy bed</l>
            …
         </lg>
      </poem>

3.3.4.4 Captions

We distinguish between <head> elements, which can appear only at the start of a text division and are logically associated with it (for example, chapter titles, newspaper headlines etc.), and <caption> elements, which are logically independent of the position they may have within a textual division (e.g., captions attached to pictures or figures, "pull-quotes" embedded within the text, "by-lines" identifying authorship and provenance of a newspaper or periodical article).

The type attribute may be used to indicate the function of the caption:

type

categorizes the caption.

BYLINE caption containing authorship of an article

DISPLAY extra-textual caption (displayed box, etc.)

ATTACHED caption describing a figure, photograph, etc.

UNSPEC* not specified or unknown

A caption can be placed at a point other than where it appears, so as not to interrupt the normal flow of a text, by using it with the <ptr> tag. See the section on Pointing and reference.

3.3.4.5 Annotations (<note> and <bibl>)

Annotations and bibliographic citations or references are marked using the following elements:

<note>

any form of note, usually a footnote. This tag marks only notes that are a part of the original text, not notes that may be added by the encoder, etc. Possible attributes are:

place

for a written text, specifies the location of an original note in the source text.

FOOT note at foot of page.

END note at end of current division or text.

SIDE note in left or right margin.

UNSPEC* placement unknown or unspecified.

<bibl>

a loosely-structured bibliographic citation appearing within a corpus text.

Original notes may contain paragraphs, s-units, dialogue, and any other phrase-level element. The global n attribute can be used to indicate the value of a numbered note.

Like captions, notes are often moved from their original location in the original data and placed at another point so as not to interrupt the normal flow of a text, by using the <ptr> tag as follows (see the section on Pointing and reference):

      Here is a text, with a "1" at the end for a
      footnote. [1].
      <<Then, this note appears at
      this point in the original.>>
      But we would like to keep the text together.

This can be encoded as

      <p>Here is a text.
      <ptr target="N1" n="1" rend="bracketed"/>
      But we would like to keep the text together. </p>
      <note id="N1" place="foot">Then, this note appears at
      this point in the original.</note>

Bibliographic citations or references within running texts are marked using the <bibl> element, which can contain any phrase-level element plus the <author> element.

3.3.4.6 Spoken paragraphs

The <sp> element is used to mark parts of a written text which are intended to be spoken (for example the speeches in a dramatic text), or which comprise the transcription of a speech, interview, debate, etc., typically intended for publication (i.e., which have been transcribed to be read as text). Such parts are generally readily identifiable by the use of conventions such as speaker prefixes (the label supplying the name of the speaker) and stage directions. Within the TUSNELDA standard, this element has been substantially revised, for two main reasons:

1. CES provides a sub-element <stage> for the annotation of stage directions, which may occur on an arbitrary level of embedding within <sp> or one of its sub-elements. This kind of occurrence restriction (a so-called inclusion exception) is admissible in SGML, but not in XML. Therefore, it could neither be reproduced in XCES (CES in XML), nor in the TUSNELDA standard.

2. The transcription of comics, as undertaken in one of the SFB 441 projects, requires additional elements for the description of gesture, for the inclusion of "metatextual" remarks added by the author (e.g. on the edge of single pictures), and for situation descriptions added by the annotator.

For a more detailed description of the differences between the CES standard and the TUSNELDA standard cf. Wagner, A. & L. Kallmeyer (2001).

In the TUSNELDA standard, the <sp> element takes the following attributes (in addition to the global ones):

who: name of the speaker
what: name of the object displaying a non-spoken piece of text

The <sp> element may contain these sub-elements:

<speaker>

contains the information provided in the original source to identify the speaker of a passage written to be spoken

<display>

contains the information provided in the original source to identify the container of a displayed passage (e.g. a sign)

<stage>

contains any kind of stage direction within a dramatic text, with the attribute

type: indicating the kind of stage direction

<spokenPar>

is structured like a written paragraph (i.e. the element <p>), but may additionally contain <stage> elements for the annotation of stage directions

<displayedPar>

contains the text displayed by a non-speaking container, e.g. an inscription on a sign. Its internal structure is equal to that of a <p> element.

<situation>

specifies relevant parameters of the situation as specified by the annotator, either as informal text or in (a) <keywords> element(s).

All of the above may be further marked by the global attributes appropriate for textual elements, i.e. id, n, lang, rend, and wsd (see section 3.2).

Syntactically, an <sp> element contains an arbitrary sequence of <speaker> and/or <display> tags, followed by a sequence of one or more <spokenPar>, <displayedPar>, <stage>, or <situation> tags (in an arbitrary mixture). Thus, stage directions (inside the <stage> element) may be annotated at the <sp> or the <spokenPar> level, as demonstrated in the following example:


      <sp who="Lady Windermere">
         <speaker>Lady Windermere.</speaker>
         <spokenPar>That will do !</spokenPar>
      </sp>
      <sp><stage>Exit Parker C.</stage></sp>
      <sp who="Lady Windermere">
         <spokenPar><stage>Speaking to Lord Windermere</stage>
             Arthur , if that woman comes here - I warn you - 
         </spokenPar>
      </sp>

The stage direction pertaining to the scene as a whole is annotated as a separate <sp> element, whereas the directions for Lady Windermere are part of the <spokenPar> associated with her as a speaker.

Consider also the usage of the <spokenPar> and <situation> elements in this transcription of a comic picture (on the tags <figure>, <figTrans>, and <marked> see below):

   <figure id="s35b5" entity="belgiji/s35b5.bmp">
      <figTrans>
         <sp who="Obeliks">
            <spokenPar>
               Gde da nađem belu zastavu ?
               <marked type="deic-loc">Ovde</marked> je sve pusto !
            </spokenPar>
            <situation>
               <keywords>
                  <term>open hands</term>
                  <term>slightly bent</term>
               </keywords>
            </situation>
         </sp>
         <sp who="Asteriks">
            <spokenPar>
               <marked type="deic-loc">Tamo</marked> je neki mališan !
            </spokenPar>
            <situation>
               <keywords>
                  <term>forefinger</term>
                  <term>stretched out</term>
               </keywords>
            </situation>
         </sp>
      </figTrans>
   </figure>

Finally, here is an example of the use of <displayedPar>:

 
   <figure id="s15b8">
      <figTrans>
         <sp who="Metaloplastiks">
            <spokenPar>
                    Spavaj , idiote !
            </spokenPar>
         </sp>
         <sp what="scoreboard">
            <displayedPar>
                    Metalopastiks
            </displayedPar>
         </sp>
      </figTrans>
   </figure>

3.3.4.7 Quotations

A quotation is a (usually long) extract from a work other than the text itself, embedded within that text. It is set off typographically from the surrounding paragraphs, by spacing similar to that used for paragraphs (e.g., white space before and after). It may contain paragraphs, poems, s-units, dialogue (marked with <q>) or any other phrase-level element. In TUSNELDA, the content model of the <quote> element has been extended so that quotations may also include <table> elements. The <quote> tag is sharply distinguished from the <q> tag, which marks quoted material that appears inside a paragraph.

Quotations are often split up by pieces of the main text. Nevertheless, they form a single, more abstract entity rather than two separate ones. This ought to be expressed in the annotation, so that the parts may be treated as a whole if necessary. For this purpose, <quote> elements may be marked by the following attributes:

next: contains the ID of the next part of the same quotation
prev: contains the ID of the previous part of the same quotation
broken: YES broken quotation; NO* non-broken quotation

Additionally, the usual text-level attributes id, n, lang, rend, wsd, and the attribute type are defined.
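
A quotation interrupted by main text might then be encoded along the following lines. This is only a sketch: the ids Q1 and Q2 are invented for illustration, and lower-case attribute values are assumed, as in the other examples in these guidelines.

      <quote id="Q1" broken="yes" next="Q2">
         <p>It was the best of times,</p>
      </quote>
      <p>the narrator begins, and then continues:</p>
      <quote id="Q2" broken="yes" prev="Q1">
         <p>it was the worst of times.</p>
      </quote>

The next and prev attributes let a tool reassemble the two parts into one quotation, while the intervening <p> remains part of the main text.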

3.3.4.8 Figures

Figures are marked with the following tag, which enables a reference to a stored image in another file:

<figure>

indicates the location of a graphic, illustration, or figure.

entity: names the external entity within which the graphic image of the figure is stored.

The <figure> element contains an optional <head> element for the figure title or heading, followed by an optional sequence of paragraphs for commentary or caption, an optional <figTrans> element, an optional <figDesc> element, and an optional <text> element for including the graphic itself, where desired. The <figure> element can be empty, serving only to mark the presence of a figure in the text.

<figTrans>: contains a transcript of the text contained in a picture, e.g. in comics, or displayed in parts of a graphic. This element has been added in TUSNELDA for the annotation of comics, cf. the example above.
<figDesc>: contains a brief prose description of the appearance or content of a graphic figure, for use when documenting an image without displaying it.

Note that in many instances, figures will not be retained at all in the encoded version of the text. In this case, the <gap> element should be used to indicate the omission.
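
A minimal sketch of both cases is given below. The id, the entity name, and the wording are invented for illustration; writing <gap> as an empty element with a free-text reason value is an assumption based on its description in section 3.4.2.

      <figure id="F1" entity="fig1">
         <head>Map of the survey area</head>
         <figDesc>A black-and-white map showing the villages
         in which recordings were made.</figDesc>
      </figure>

      <gap desc="photograph" reason="figure not retained"/>

The first fragment keeps a reference to the stored image together with a prose description; the second only records that a figure was omitted.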

3.3.4.9 Tables

The <table> element is used to include tables in the text. It takes the attributes:

rows: indicates the number of rows in the table.
cols: indicates the number of columns in the table.

A <table> element optionally contains a <head> element, and at least one <row> element. A <row> consists of an arbitrary, non-empty sequence of <cell> elements and/or further tables. A <cell> may host any kind of phrase-level elements. The following attributes are defined for the element <cell>:

rows: indicates the number of rows spanned by the cell.
cols: indicates the number of columns spanned by the cell.

To mark row or column headers/titles explicitly, the role attribute is used on both <row> and <cell> elements.

Note that in many instances, tables will not be retained at all in the encoded version of the text. In this case, the <gap> element should be used to indicate the omission.
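
A small table might be encoded as sketched below. The content and figures are invented, and the role value "label" is borrowed from TEI practice; it is an assumption, since these guidelines do not list the permitted values of role.

      <table rows="3" cols="2">
         <head>Tokens per subcorpus</head>
         <row role="label">
            <cell>Subcorpus</cell>
            <cell>Tokens</cell>
         </row>
         <row>
            <cell role="label">Written</cell>
            <cell><num value="120000">120,000</num></cell>
         </row>
         <row>
            <cell role="label">Spoken</cell>
            <cell><num value="45000">45,000</num></cell>
         </row>
      </table>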

3.3.4.10 Link groups

In transcriptions of recordings (among others), it is necessary to align parallel and overlapping utterances and context descriptions. CES does not offer a mechanism for alignment within one document, whereas TEI offers several. In the TUSNELDA standard, we adopted the approach of placing reference points (<anchor> elements) within utterances and connecting corresponding reference points by <link> elements. (For an example, see the section on pointing and reference.) Several of these links can be bundled into a <linkGrp> element, which is a paragraph-level element. This is useful, for example, for keeping the links of one paragraph or dialogue together.

3.4 Sub-paragraph (phrase-level) elements

The tusneldaDoc DTD also includes tags for marking sub-paragraph-level elements. The phrase-level elements that are provided for in the tusneldaDoc DTD are selected on the basis of their relevance for corpus-based work. There are five main categories of phrase-level elements:

elements for identifying s-units (typically orthographic sentences) and quoted dialogue;
elements indicating editorial changes to the original text;
the <hi> element for marking typographically distinct words or phrases, especially when the purpose of the highlighting is not yet determined;
elements of linguistic interest;
elements for pointing and reference.

The tusneldaDoc DTD imposes a relatively strict structure on sub-paragraph elements, intended to disallow options and suit the needs of corpus-handling tools.

3.4.1 S-units and quoted dialogue

The segmentation of texts into s-units, or orthographic sentences, is usually accomplished by special tools. In some cases it is still desirable to mark s-units and/or quoted dialogue in the primary data. We therefore provide mechanisms for marking these elements.

In some cases quoted dialogue is not marked in the primary data, because the identification of quoted dialogue can be accomplished automatically (by detecting quotation marks etc.).

<s>

identifies an s-unit within a document, typically an orthographic sentence.

next: gives the id reference of a subsequent <s> element which contains a continuation of the current sentence.
prev: gives the id reference of a previous <s> element which contains the beginning fragment of the current sentence.
type: indicates the type of sentence.
broken: indicates whether this <s> element is broken between two or more <s> elements (linked using the next and prev attributes). The default is NO*.

Sometimes sentences occur as parts of other sentences (e.g. when quoted), which may be useful to know when processing the corpus. For this purpose, we introduced a new attribute into the TUSNELDA standard, namely

nested: NESTED; NOTNESTED*

Its usage is shown in the example at the end of this sub-section.

The element <q> contains quoted dialogue or other quoted material appearing inside a paragraph. As with <s>, the following attributes are defined:

next: gives the id reference of a subsequent <q> element which contains a continuation of the current quote.
prev: gives the id reference of a previous <q> element which contains the beginning fragment of the current quote.
type: indicates the type of quote.
broken: indicates whether this <q> element is broken between two or more <q> elements (linked using the next and prev attributes).

Furthermore,

who: indicates the speaker of the quote.
direct: shows whether a quote is given in direct or indirect speech: YES; NO; UNSPECIFIED*
nested: indicates whether a <q> element has a further <q> subelement.

When s-units are tagged, no split should be made between a colon or semi-colon and a following word that begins with a capital letter (unless there is an end-of-paragraph marker).

When both <s> and <q> are marked, the problem of overlapping hierarchies can arise. For this reason it has been necessary to allow for mutually recursive nesting of <s> and <q> tags, a practice which is otherwise avoided. This allows all the following encodings:

<s>
   <q>Indeed yes,</q>
   she replied.
</s>

<q rend="PRE lsquo POST rsquo">
   <s>I know precisely what you are feeling.</s>
   <s>I know all about your contempt, your hatred, your disgust.</s>
   <s>But don't worry, I am on your side!</s>
</q>
<s>And then the flash of intelligence was gone...</s>

However, it is recommended for the TUSNELDA standard that the <p> - <s> - <q> hierarchy be retained if possible -- that is, the hierarchy of <s> elements is treated as primary, and the hierarchy of <q> elements is treated as secondary. In a case such as the one above, this can be accomplished by breaking the quotes and using the next and prev attributes together with the global id attribute to associate the fragments, as follows:

<s>
   <q id="q1" type="part" next="q2">
      I know precisely what you are feeling.
   </q>
</s>
<s>
   <q id="q2" type="part" prev="q1" next="q3">
      I know all about your contempt, your hatred, your disgust.
   </q>
</s>
<s>
   <q id="q3" type="part" prev="q2">
      But don't worry, I am on your side!
   </q>
</s>
<s>
   And then the flash of intelligence was gone...
</s>

In the following case, this method solves the problem of overlapping hierarchies:

<p>
   <s>According to the visiting leader, the economy of the country is
      <q id="q1" type="part" next="q2">
         better than ever.
      </q>
   </s>
   <q id="q2" type="part" prev="q1">
      "
      <s>
         It is in fact in very good shape.
      </s>
      "
   </q>
</p>

Finally, consider a simple example that shows overlapping hierarchies in combination with the nested attribute for a broken-up quote:


<s nested="nested">
   <q id="q1" broken="yes" next="q2">
     <s id="s1" broken="yes" next="s2" nested="notnested">Draußen</s>
   </q>, sprach er, 
   <q id="q2" broken="yes" prev="q1">
     <s id="s2" broken="yes" prev="s1" nested="notnested">ist es kalt.</s>
   </q>
</s>

3.4.2 Editorial corrections

If any editorial changes are made to a corpus text, it must be guaranteed that they are recognizable and that the original form may be easily recovered. The following tags are used to mark editorial changes:

<corr>

contains the correct form of a passage apparently erroneous in the copy text.

sic: gives the original form. This attribute is obligatory in the TUSNELDA standard.
resp: gives the name of the responsible editor
cert: used to indicate the degree of certainty with which the change has been made. In the TUSNELDA standard, three possible values for this attribute have been adopted, in order to ensure comparability: SURE; PROBABLE; PRESUMABLE

<reg>

contains text which has been regularized or normalized in some sense.

orig: gives the original form. This attribute is obligatory in the TUSNELDA standard.
resp: gives the name of the responsible editor
cert: used to indicate the degree of certainty with which the change has been made. (Possible values: see above)

<gap>

indicates a point where material has been omitted in a transcription, whether for editorial sampling practice, or because the material is illegible.

desc: describes the omitted text
reason: gives the reason for the omission (sampling, illegible, etc.)
resp: gives the name of the responsible editor
cert: used to indicate the degree of certainty with which the change has been made.

<unclear>

This element has been taken over from TEI in order to mark unclear pieces of text. These may be, e.g., destroyed parts of old documents or stretches of speech recordings which are hard to understand and cannot be clearly transcribed. Its attributes are

reason: indicates the source of unclarity
resp: names the responsible editor
cert: marks the degree of certainty with which the change has been made.

Note that the <gap> element is useful for noting the omission of material which is often uninteresting for corpus-based language engineering applications, in particular, figures, tables, etc.
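
The following sketch shows the four elements in one paragraph. The text, the editor code "ed1", and the attribute values are invented for illustration; lower-case values are assumed, as in the other examples in these guidelines, and <gap> is written as an empty element on the assumption that it only marks a point in the text.

      <p>The <corr sic="recieved" resp="ed1" cert="sure">received</corr>
      view was that <reg orig="colour" resp="ed1" cert="probable">color</reg>
      terms are universal
      <gap desc="two lines of text" reason="illegible" resp="ed1"/>
      and <unclear reason="faded ink" resp="ed1" cert="presumable">rarely</unclear>
      borrowed.</p>

In each case the original form can be recovered: from the sic attribute of <corr>, from the orig attribute of <reg>, and, for omitted material, at least in outline from the desc attribute of <gap>.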

3.4.3 Rendition information

In general it is not desirable to mark typographic features of a given printing of a text in texts designated for use in corpus-based research. However, there are circumstances under which it is desirable to retain this information. In particular, certain items of linguistic interest may be marked by typography in the original; e.g., linguistic emphasis and foreign words are often rendered in italics. In addition, some applications (e.g., machine translation which attempts to reproduce the format of the original) demand retaining the rendition information.

In the process of up-translation from legacy data, a first step is often to translate relevant typographic information into SGML, with no attempt to interpret the significance of the rendering (e.g., that the italics signify a foreign word). Interpretation is often too costly because it is ambiguous (e.g., italics signify not only foreign words, but also emphasis, titles, etc.). In such cases the <hi> element can be used. Normally, in later phases of up-translation, <hi> tags are changed to more descriptive tags, such as <title>, <foreign>, <mentioned>, or <distinct>.

<hi>

marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made. The rend attribute should provide the original rendition information when its function has not yet been determined.

rend

describes the rendition or presentation of the highlighted item.

BO bold face

BX boxed

IT italic font

RO roman font

UL underlined

CA capital letters

Note: Several values from the list may be specified where appropriate, separated by spaces, e.g., "ro it".

When the <hi> tag is used, no claim about the reason is made. This may be the case in a low-level encoding, since determining the reasons for highlighting (e.g., presence of a foreign word, vs. emphasis, vs. a title, etc.) demands human intervention and is therefore too costly in the early stages of up-translation. Note that typographically highlighted phrases and the kind of highlighting used may be recorded in one of two ways:

using the global rend attribute
using the <hi> element with a rend attribute

The first method specifies an attribute on some element which contains all of and only the highlighted phrase. In this case, the function of the highlighting is clear (for example, to mark a heading), and the boundaries of the highlighted phrase therefore coincide with the boundaries of some other element. The rend attribute is given on the tag for that element, for example

<head rend="bo">The world beyond</head>

The second method inserts a new tag indicating that what it contains is highlighted. It is used

when the function of the highlighting is not clear;
where there is no tag identifying the feature concerned;
where the highlighted phrase is not co-terminous with some other element.

The rend attribute must be supplied on the <hi> element. The rend attribute is optional on all other elements.

Note that in cases where the <hi> element often appears with the same value for rend, a default value can be provided on the <tagUsage> element. When this mechanism is used, the rend attribute need be given only when the default does not apply to the given occurrence of the <hi> element.

Both the start and end tag for any SGML element must be contained within the start and end tag of any of its ancestors in the tree for that document. Since by definition <hi> elements can appear only within <p> elements, this means that where, for example, an italicized passage contains more than one paragraph or starts within a paragraph and spans one or more others, the <hi> element must be closed at the end of the enclosing element, and then re-opened within the next. For example, an italicized passage which crosses a <p> boundary must be tagged as follows:

<p>This is the start of a paragraph which 
<hi rend="it">switches to italics here 
and then goes on for several paragraphs.</hi>
</p>
<p>
<hi rend="it">This second paragraph is all in italics</hi>
</p>
<p>
<hi rend="it">This is the last bit of italics</hi> 
and the rest is in roman.</p>

That is, the <hi> element is closed before the end of the first paragraph and re-opened at the start of the next. Note that the following encoding is not acceptable:

<p>This is the start of a paragraph which 
<hi rend="it">switches to italics here and 
then goes on for several  paragraphs.</hi>
</p>
<p rend="it">This second paragraph is all in italics</p>
<p><hi rend="it">This is the last bit of italics</hi> 
and the rest is in roman.</p>

This second encoding mixes different styles of marking the same feature for a given span of text, which will cause problems for retrieval. In the CES standard, one <hi> element may not be included in another; that is, <hi> may not be used recursively. The mechanism by which CES excludes this is not XML-conformant, and the restriction itself is undesirable: for example, part of a bold headline may be set in italics, as in

<hi rend="bo">Eine <hi rend="it">teilweise</hi> kursive Überschrift</hi>

For the same reasons, recursion is allowed in the TUSNELDA standard also for the elements <foreign>, <distinct>, <mentioned>, and <title> (see next section).

3.4.4 Linguistic elements

There have been three main defining forces behind the choice of elements:

the needs of corpus-annotation tools, such as morpho-syntactic taggers, whose performance can often be improved by pre-identification of elements such as names, addresses, titles, dates, measures, foreign words and phrases, etc.
the need to identify objects which have intrinsic linguistic interest, or are often useful for the purposes of translation, text alignment, etc., such as abbreviations, names, terms, linguistically distinct words and phrases, etc.
the needs of the projects of SFB 441, which perform linguistic research based on specifically annotated corpora.

The phrase-level elements identifying linguistically relevant elements are:

<abbr>

contains an abbreviation of any sort. Consult Handling Punctuation for guidelines for encoding abbreviations.

expan: contains the expansion of the abbreviation

<date>

contains a date in any format.

ISO8601: ISO 8601 normalized form of the date

<list>

a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit. Note that <list> is the only phrase-level element which is also a paragraph-level element; its content model is exactly the same in both instances. For its full definition see section 3.3.4.2.

<measure>

contains a number, word, or phrase indicating a quantity.

type

the type of measure is handled very restrictively in CES. In the TUSNELDA standard, the CES values are recommended but not enforced: any text string is permitted. The CES values for type are:

WEIGHT

LENGTH

COUNT

AREA

VOLUME

CURRENCY

TEMPERATURE

value

contains the ISO 4217 code for currency representation when the type attribute specifies currency.

<name>

contains a proper noun or noun phrase.

type

indicates the type of proper noun. Suggested values include:

PERSON

PLACE

ORG

LANGUAGE

See Encoding Names.

<num>

contains a number, written in any form.

value: contains the normalized value of the number.

<term>

contains a single-word, multi-word or symbolic designation which is regarded as a technical term.

<time>

contains a phrase defining a time of day in any format.

ISO8601

ISO 8601 normalized form of the time.

type

the type attribute takes one of the following values:

24HOUR

DESCRIPTIVE

<distinct>

identifies a word or phrase regarded as linguistically distinct (e.g., archaic, technical, dialect, etc.). At the moment, there are no restrictions on this characterization. At a later stage, the type attribute may be used with a restricted set of linguistically useful values, depending on the demands of the corpus encoders.

<foreign>

identifies a word or phrase as belonging to some language other than that of the surrounding text. Use the global lang attribute to indicate the language.

<mentioned>

marks words or phrases mentioned, not used. This element should be used for the markup of e.g. linguistic examples (either in the main text or set off by paragraph breaks). However, corpora which consist only of a collection of linguistic examples should not be annotated using the <mentioned> element.

<title>

contains the title of a work, whether article, book, journal, or series, including any alternative titles or subtitles.

<marked>

contains any piece of text which is being marked for a specific, restricted research purpose. If an elaborate annotation (as e.g. POS-tagging) is available, this marking may be redundant; however, even elaborate tagging may not be sufficient for pragmatic aspects like e.g. courteous forms. The <marked> element has the following attributes:

type: indicates textually the category of the specially marked element. This attribute is required.
next: provides the ID of the next part of a marked string which is split up by other text
prev: provides the ID of the previous part of a marked string which is split up by other text
broken: indicates whether a marked string is split up by other text

The linguistic elements fall into two groups, which determine their content models:

elements which are, for many purposes of language engineering such as morpho-syntactic tagging, regarded as individual tokens, even when they may contain sub-constituents. In TUSNELDA, this group includes names, dates, times, measures, abbreviations, and terms. These elements therefore may contain PCDATA. They may also contain the <abbr> and <num> elements; abbreviations and numbers are frequently identified and tagged automatically, and therefore their placement must be relatively free. Note that to avoid unnecessary recursive nesting of elements, an <abbr> cannot contain another <abbr> tag, and a <num> cannot contain another <num>.
This group of elements, which makes up the element class M.TOKEN, includes:
- <abbr>
- <num>
- <name>
- <date>
- <measure>
- <time>
- <term>
elements which may contain sub-constituents which are treated by corpus-analytic tools as tokens, or may be regarded as tokens in themselves. Each of these elements can contain any other phrase-level element. It is assumed that tokenizing tools may further analyze the content of these elements in order to identify constituent tokens where they exist.
This group of elements includes the following elements:
- <title>
- <foreign>
- <mentioned>
- <distinct>
- <marked>
This latter group also includes another tag, the <hi> tag, which is used to mark information which is rendered specially in some original, but for which the function of the highlighting is either unknown or unspecified. In later phases of up-translation when the function of the highlighting is determined, <hi> tags are very often changed to one of the other more descriptive tags in this group. See section 3.4.3 for a full discussion of the use of the <hi> tag.
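
As an illustration of the token-like elements, a sentence might be annotated as sketched below. The sentence and all attribute values are invented; lower-case values are assumed, as in the other examples in these guidelines.

      <p><name type="person">Anna Müller</name> arrived in
      <name type="place">Tübingen</name> on
      <date ISO8601="2000-09-12">12 September 2000</date> at
      <time ISO8601="14:30" type="24hour">14.30</time>, carrying
      <measure type="weight">2 <abbr expan="kilogram">kg</abbr></measure>
      of books, <num value="3">three</num> of them dictionaries.</p>

Note that the <measure> element contains an <abbr>, which is permitted because elements of the first group may contain <abbr> and <num> in addition to plain text.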

3.4.5 Pointing and reference

References in the text which refer to another part of it can be tagged with

<ref>

a reference to another location in the current document, in terms of one or more identifiable elements, possibly modified by additional text or comment. Attributes include the global attributes plus the following:

corresp

points to elements that correspond to the current element in some way.

next

gives the id reference of an element which contains a continuation of the current element.

prev

gives the id reference of an element which contains the previous portion of the current element.

type

indicates the type of pointer, e.g., aggregating, aligning, etc.

resp

specifies the creator of the pointer.

crdate

specifies when the pointer was created.

targType

indicates the type of data being linked, e.g., paragraph, sentence, etc.

targOrder

specifies whether the order of the identifiers in the targets list is significant. Values:

Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.

N No: the order of the IDREFs specified as the value of the targets attribute has no significance.

U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.

evaluate

specifies the intended meaning when the target or targets are pointers themselves. Values:

ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.

ONE if the element pointed to is itself a pointer, then its pointer (whether a target or not) is taken as the target of this pointer.

NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.

target

provides the IDs of two or more <xptr> elements that point to the locations of the elements to be associated.
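
A cross-reference inside running text might be encoded as sketched below. The ids, the type value "crossref", and the sample text are invented; the sketch also assumes, as in the <ptr> examples later in this section, that target may hold the id of the element referred to.

      <p id="P12">The sampling procedure is described here.</p>
      <p>As noted above (<ref target="P12" type="crossref">see the
      paragraph on sampling</ref>), only complete texts were included.</p>

Unlike <ptr>, which is empty, <ref> encloses the text that expresses the reference in the source.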

In some cases it is desirable to move an element to another location in the encoded text. This is common for footnotes which occur in-line in the electronic text, but which appear as footnotes, endnotes, etc. in a printed version. It is also common for captions, figures, bibliographic citations, and stage directions.

<ptr>

a pointer to another location in the current document in terms of one or more identifiable elements. Attributes include the global attributes plus the following:

corresp

points to elements that correspond to the current element in some way.

next

gives the id reference of an element which contains a continuation of the current element.

prev

gives the id reference of an element which contains the previous portion of the current element.

type

indicates the type of pointer, e.g., aggregating, aligning, etc.

resp

specifies the creator of the pointer.

crdate

specifies when the pointer was created.

targType

indicates the type of data being linked, e.g., paragraph, sentence, etc.

targOrder

specifies whether the order of the identifiers in the targets list is significant. Values:

Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.

N No: the order of the IDREFs specified as the value of the targets attribute has no significance.

U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.

evaluate

specifies the intended meaning when the target or targets are pointers themselves. Values:

ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.

ONE if the element pointed to is itself a pointer, then its pointer (whether a target or not) is taken as the target of this pointer.

NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.

target

provides the IDs of two or more <xptr> elements that point to the locations of the elements to be associated.

Examples:

     Here is a text.
     This caption appears at this point.
     But we would like to keep the text together.

This can be encoded as

     <p>Here is a text.
     <ptr target=C1>
     But we would like to keep the text together.</p>
     <caption id=C1>This caption appears at this point.</caption>

The note in the following example originally appeared at the location of the <ptr> tag:

<p>The <name type=org>Ministry of Truth</name>, &mdash; <name type=org lang=ns>Minitrue</name>, in <name>Newspeak</name><ptr target=N1 rend=asterisk> &mdash; was startlingly different from any other object in sight...</p>
<note place=foot id=N1><name>Newspeak</name> was the official language of <name type=place>Oceania</name>. For an account of its structure and etymology see Appendix.</note>
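
The effect of the three values of the evaluate attribute can be illustrated with a short Python sketch. It is not part of the TUSNELDA standard: the function name resolve, the dictionary pointers, and the restriction to a single target ID per pointer are assumptions made purely for illustration.

```python
from typing import Dict


def resolve(start_target: str, pointers: Dict[str, str], evaluate: str = "NONE") -> str:
    """Return the ID finally addressed by a pointer whose target is start_target.

    `pointers` maps the ID of every pointer element to the ID given in its
    target attribute. Any ID missing from the map is an ordinary element.
    (Hypothetical data model: one target ID per pointer.)
    """
    if evaluate == "NONE":
        # No further evaluation: the named element itself is the result.
        return start_target
    if evaluate == "ONE":
        # Follow at most one further pointer, whatever it points to.
        return pointers.get(start_target, start_target)
    if evaluate == "ALL":
        # Follow the chain until a non-pointer element is reached.
        seen = set()
        current = start_target
        while current in pointers:
            if current in seen:
                raise ValueError(f"circular pointer chain at {current!r}")
            seen.add(current)
            current = pointers[current]
        return current
    raise ValueError(f"unknown evaluate value: {evaluate!r}")
```

With a pointer P1 that targets a pointer P2, which in turn targets a note N1, a pointer to P1 resolves to P1 under NONE, to P2 under ONE, and to N1 under ALL.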

Another mechanism, which has been adapted for TUSNELDA from TEI and CESAlign, is the use of anchors and links, much as in HTML documents. An <anchor> is a phrase-level element which designates a reference point; a <link> element associates two or more such anchors with one another by listing their identifiers. <link> elements may be bundled together using the <linkGrp> element. The use of these concepts is illustrated in the following example:


<sp who="S1">
  <spokenpar>Cinco , cinco veces<anchor id="S1.12a"> 
    tranquilo<anchor id="S1.12e"> .
  </spokenpar>
</sp>
<sp who="S3">
  <spokenpar>
    <anchor id="S3.37a">Ahí esa<anchor id="S3.37e">
    dis puesto el cuarto árbitro para enseñar 
    la cartulina para que todo el sepa .
  </spokenpar>
</sp>
<linkGrp>
  <link targets="S1.12a S3.37a">
  <link targets="S1.12e S3.37e">
</linkGrp>
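
A tool that processes such markup will usually want to verify that every identifier listed in a <link> refers to an existing <anchor>. The following Python sketch shows one way to do this. It is an illustration only: it uses regular expressions rather than a real SGML parser, it assumes quoted attribute values as in the example above, and the names check_links and SAMPLE are hypothetical.

```python
import re

# Simplified patterns: they assume id/targets is the first attribute and is quoted.
ANCHOR_RE = re.compile(r'<anchor\s+id="([^"]+)"\s*>')
LINK_RE = re.compile(r'<link\s+targets="([^"]+)"\s*>')

SAMPLE = """
<sp who="S1"><spokenpar>Cinco , cinco veces<anchor id="S1.12a">
  tranquilo<anchor id="S1.12e"> .</spokenpar></sp>
<sp who="S3"><spokenpar><anchor id="S3.37a">Ahí esa<anchor id="S3.37e">
  dis puesto el cuarto árbitro</spokenpar></sp>
<linkGrp>
  <link targets="S1.12a S3.37a">
  <link targets="S1.12e S3.37e">
</linkGrp>
"""


def check_links(markup):
    """Return (links, dangling): the ID tuples of all links and the
    sorted list of link targets that match no anchor id."""
    anchors = set(ANCHOR_RE.findall(markup))
    links = [tuple(targets.split()) for targets in LINK_RE.findall(markup)]
    dangling = sorted({t for link in links for t in link if t not in anchors})
    return links, dangling


links, dangling = check_links(SAMPLE)
```

For the sample, links holds the two pairs of anchor identifiers and dangling is empty, since all four anchors exist.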

3.5 Reference systems

For purposes of alignment or other reference to elements within a text, a reference system can be built up using the id attribute on appropriate elements.

We recommend the following strategy:

supply a unique identifying label in the id attribute of the <body> tag
for each nested division, give each unit an identifier which is built up by successively adding to the identifier of the text; for example

          <body id=ORW1>
            <div type=part id=ORW1.1>
              <div type=chapter id=ORW1.1.1>
                 <div type=section id=ORW1.1.1.1>
                 </div>
              </div>
            </div>
          </body>

for each paragraph, add another layer to the immediately superordinate identifier, as follows:

          <div type=section id=ORW1.1.1.1>
               <p id=ORW1.1.1.1.p1></p>
               <p id=ORW1.1.1.1.p2></p>
          </div>

for each s-unit, add another layer to the superordinate identifier on the enclosing <p> element:

          <div type=section id=ORW1.1.1.1>
               <p id=ORW1.1.1.1.p1>
                 <s id=ORW1.1.1.1.p1.s1></s>
               </p>
          </div>
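
The recommended strategy amounts to deriving each identifier from that of its parent. A minimal Python sketch of this scheme follows; the helper name child_id and its parameters are hypothetical and serve only to make the construction explicit.

```python
def child_id(parent_id: str, index: int, prefix: str = "") -> str:
    """Build the id of the index-th child by adding one layer to parent_id.

    `prefix` distinguishes element types, e.g. "p" for paragraphs and
    "s" for s-units; nested divisions use no prefix.
    """
    return f"{parent_id}.{prefix}{index}"


body = "ORW1"                          # unique label on <body>
part = child_id(body, 1)               # ORW1.1
chapter = child_id(part, 1)            # ORW1.1.1
section = child_id(chapter, 1)         # ORW1.1.1.1
paragraph = child_id(section, 1, "p")  # ORW1.1.1.1.p1
s_unit = child_id(paragraph, 1, "s")   # ORW1.1.1.1.p1.s1
```

Because every identifier contains the identifier of its parent, the position of an element in the document hierarchy can be read directly off its id.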

3.6 Encoding names

When a string of characters is tagged as a name, many corpus-handling tools treat the string as a single token (e.g. some morpho-syntactic taggers) and do not perform additional analysis.

Titles and roles

For English, we can state the following rules:

Titles such as "Mr." and role names such as "Secretary" are not considered part of a person name:

Mme. <name>Edith Cresson</name>
(or: <abbr>Mme.</abbr> <name>Edith Cresson</name>)
President <name>Boris Yeltsin</name>

Appositives such as "Jr." are considered part of a person name:

<name>Sammy Davis, Jr.</name>

Where these rules can be applied to other languages, they should be followed. Some languages, however, treat titles as part of the proper name; in such cases the title should be included within the <name> tag.

Possessives and inflected forms

In English the possessive is formed by the addition of "'s" which is tokenized separately, and should not be encoded as a part of the name:

<name>Winston</name>'s

Whereas this CES approach makes sense for English, it already causes difficulties for a language like German, where the possessive/genitive suffix and the name are mostly joined into one single word form. Depending on the language, one should therefore try to include the minimal word form expressing a name within the <name> tag. Cf.

<name>Tübingens</name> malerische Altstadt
<name>Hans</name>' neue Theorie

Adjectives derived from names should also be annotated using the <name> element (as opposed to the English rule mentioned in the CES guidelines), cf.

die malerische <name>Tübinger</name> Altstadt

Forms of names with punctuation

Punctuation is normally considered to be a separate token, and should be encoded outside the <name> tag. See the discussion in the next section.

Examples:

Jaguar is made in <name type=place>Britain</name>.

<name type=place>France</name>-based

<name type=place>U.S.</name>-<name type=place>Japan</name> trade negotiations

Forms not to be tagged as names

Laws, diseases, prizes, etc. named after people or saints, etc. should not be tagged with <name type=person>.
Street addresses, street names, adjectival forms of place names should not be tagged as <name type=place>.

3.7 Handling punctuation

Punctuation should be left as in the original text, except in the cases noted below.

Note that punctuation and special characters are treated by many corpus-handling tools as separate tokens. For example, a text such as

                  <q>Ignorance is strength.</q>

may be tokenized as

                      TOKEN   Ignorance 
                      TOKEN   is 
                      TOKEN   strength
                      TOKEN   .

Full stops and ellipses

The full stop should be kept as both a part of an abbreviation and as an end-of-sentence indicator. The disambiguation of the two uses is accomplished by the marking of abbreviations and/or s-units, when such markup is provided.

Ellipses should be regularized so that the three periods are contiguous, with no spaces in between.
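
The regularization of ellipses can be automated. The Python sketch below is one possible approach, not a prescribed procedure; the function name regularize_ellipses is hypothetical, and the pattern only covers three periods separated by spaces or tabs on a single line.

```python
import re

# Three periods, optionally separated by spaces or tabs.
ELLIPSIS_RE = re.compile(r"\.[ \t]*\.[ \t]*\.")


def regularize_ellipses(text: str) -> str:
    """Rewrite spaced-out ellipses such as '. . .' as three contiguous periods."""
    return ELLIPSIS_RE.sub("...", text)
```

Abbreviations such as "U.S." are left untouched, because their periods are separated by letters rather than by whitespace.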

Full stops appearing as a part of abbreviations should not be separated from the rest of the abbreviation string when the abbreviation is marked with the <abbr> tag, even though the full stop may serve a double function (i.e., also signal end-of-sentence).

Example:

I'm back in the U.S.

should be tagged as

I'm back in the <abbr>U.S.</abbr>

even though the period is both part of the abbreviation and a signal of end-of-sentence. On the other hand, where punctuation after a name is a clear indication of end-of-sentence, it should not be included in the <name> element (see also below on sentence punctuation):

Er besucht das malerische <name>Tübingen</name>.

Hyphens and dashes

Line-end (soft) hyphens should be removed where they are not part of the regular spelling of the word. In cases of doubt, guidance should be sought elsewhere in the same text or in dictionaries. If doubt still remains, a hyphen should be retained rather than removed. In any event, the original spelling has to be included in the value of the orig attribute, as in

Er besucht die malerische <reg orig="Alt-stadt">Altstadt</reg>.

Dashes are marked by an entity reference (&mdash;). No distinction should be made between different types of dashes.

Apostrophes

Apostrophes should be left as they are in the original text. Note that the apostrophe can be ambiguous with the single quotation mark (e.g., in English the possessive "Joneses'"). This may be disambiguated by the marking of quotations.

Punctuation and tokens identified by the encoder

There is a small class of tags which mark the presence of tokens that have been isolated and classified by the encoder. Among the elements included in the tusneldaDoc DTD, the following may be used to identify individual tokens:

                      <abbr>
                      <date>
                      <num>
                      <measure>
                      <name>
                      <term>
                      <time>

For many tools, when such an element is identified in the input stream, it is not desirable to further tokenize the string inside the tag; rather, the string inside the tag can be regarded as a single token (possibly with the type indicated by the tag name). In some languages, for instance, lexical lookup routines and morpho-syntactic taggers may be able to assume that an element with the tag <name> is a single token with the grammatical category PROPER NOUN (Np). Thus,

<name type=person>Big Brother</name>

can be tokenized as

TOKEN(name) Big Brother

Similarly, the string

<date>April 4th, 1984</date>

can be tokenized as

TOKEN(date) April 4th, 1984

Therefore, punctuation that is not a part of an identified token should not appear within the tag (except in abbreviations, as discussed above under full stops). For example, the text

The Ministry of Love, which maintained law and order.

should be encoded as

The <name type=org>Ministry of Love</name>, which maintained law and order.

Other examples:

<name type=org>Jaguar</name> company in <name type=place>Britain</name>.

...he had been born in <date>1944</date> or <date>1945</date>; but it...

...the three slogans of the <name type=org>Party</name>:...
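
The tokenization behaviour described in this section can be sketched in Python as follows. This is a simplified illustration, not the behaviour of any particular tool: the function name tokenize and the token labels "word" and "punct" are hypothetical, nested token-level elements are not handled, and keeping word-internal apostrophes inside a word is an assumption of this sketch.

```python
import re

# Elements whose content is passed on as a single token.
TOKEN_TAGS = ("abbr", "date", "num", "measure", "name", "term", "time")

TOKEN_RE = re.compile(
    r"<(?P<tag>" + "|".join(TOKEN_TAGS) + r")\b[^>]*>(?P<content>.*?)</(?P=tag)>"
    r"|(?P<markup><[^>]+>)"        # any other tag: produces no token
    r"|(?P<word>\w+(?:'\w+)*)"     # word, keeping internal apostrophes
    r"|(?P<punct>[^\w\s])",        # each punctuation mark is its own token
    re.DOTALL,
)


def tokenize(text):
    """Return a list of (type, string) pairs for the encoded text."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.group("tag"):
            tokens.append((m.group("tag"), m.group("content")))
        elif m.group("word"):
            tokens.append(("word", m.group("word")))
        elif m.group("punct"):
            tokens.append(("punct", m.group("punct")))
    return tokens
```

Applied to the Ministry of Love example above, the sketch yields the name as one token of type "name", followed by the comma as a separate punctuation token. This is why the comma has to stay outside the <name> tag.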

Punctuation and quotations

When the <q> or <quote> tag is used, any quotation marks or other typographical device for indicating quoted dialogue should be removed from the text. The rend attribute can be used to indicate the means by which the quotation was originally marked in the text (this is not required). In these cases, the value of the rend attribute should be one of the following, which are consistent with entity names in ISOpub and ISOnum:

                      laquo    angle quotation mark, left
                      raquo    angle quotation mark, right
                      lsquo    single quotation mark, left
                      rsquo    single quotation mark, right
                      ldquo    double quotation mark, left
                      rdquo    double quotation mark, right
                      lsquor   rising single quote, left (low)
                      ldquor   rising dbl quote, left (low)
                      rdquor   rising dbl quote, right (high)
                      rsquor   rising single quote, right (high)
                      mdash    dash the width of lowercase m

In principle, encode punctuation as inside or outside the <q> tag according to the position of the quotation marks in the original, as in these examples:

('dealing on the free market', it was called)
(<q rend="PRE lsquo POST rsquo">dealing on the free market</q>, it was called)
The dark-haired girl behind Winston had begun crying out 'Swine! Swine! Swine!'
The dark-haired girl behind <name type=person>Winston</name> had begun crying out <q rend="PRE lsquo POST rsquo">Swine! Swine! Swine!</q>
'I am with you,' O'Brien seemed to be saying to him.
<q rend="PRE lsquo POST rsquo">I am with you,</q> <name type=person>O'Brien</name> seemed to be saying to him.
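
Replacing quotation marks with a <q> element and recording them in the rend attribute can be sketched in Python as shown below. The sketch assumes that the source text uses typographic (Unicode) quotation marks; the names REND_NAMES and encode_quotation are hypothetical, and only a subset of the entity names listed above is covered.

```python
# Typographic quotation marks and the corresponding ISO entity names.
REND_NAMES = {
    "\u00ab": "laquo", "\u00bb": "raquo",    # angle quotation marks
    "\u2018": "lsquo", "\u2019": "rsquo",    # single quotation marks
    "\u201c": "ldquo", "\u201d": "rdquo",    # double quotation marks
    "\u201a": "lsquor", "\u201e": "ldquor",  # low (rising) left quotes
}


def encode_quotation(quoted: str) -> str:
    """Turn a string that still carries its quotation marks into a <q> element.

    The marks are removed from the text and recorded in the rend attribute.
    """
    if len(quoted) < 2 or quoted[0] not in REND_NAMES or quoted[-1] not in REND_NAMES:
        raise ValueError("text must start and end with a known quotation mark")
    pre, post = REND_NAMES[quoted[0]], REND_NAMES[quoted[-1]]
    return f'<q rend="PRE {pre} POST {post}">{quoted[1:-1]}</q>'
```

Punctuation that stood inside the quotation marks in the original, such as the comma after "I am with you", stays inside the <q> element because it is part of the string passed in.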

In cases where the <q> tag is used for text that is not enclosed in quotation marks in the original, leave punctuation that is not a part of the actual cited text outside the <q> tags:

BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
<q rend=ca type=slogan><name type=person>Big Brother</name> is watching you</q>, the caption beneath it ran.
Never mind, it doesn't matter, he thought. ["Never mind, it doesn't matter" in italics]
<q rend=it>Never mind, it doesn't matter</q>, he thought.
Eureka! he shouted. ["Eureka!" in italics]
<q rend=it>Eureka!</q> he shouted.

Note, however, that the tokenization of the text should not be affected by the position of the punctuation relative to the closing tag; the same set of tokens is ultimately generated in either case.

Punctuation in <s> tags

Sentence terminating punctuation should always appear within an enclosing set of <s> and </s> tags:

<s><q rend=it>Eureka!</q> he shouted.</s>
<s>The dark-haired girl behind <name type=person>Winston</name> had begun crying out <q rend="PRE lsquo POST rsquo">Swine! Swine! Swine!</q></s>

Punctuation in other tags

Tokenizers typically treat text within tags such as <hi> and <foreign> independently of punctuation, which can therefore appear either inside or outside the closing tag without effect. Given this text:

She ordered a croque monsieur. ["croque monsieur" in italics]

either of the two following encodings is in principle acceptable, although the first one is to be preferred and should be used in TUSNELDA corpora:

She ordered a <foreign rend="it">croque monsieur</foreign>.

She ordered a <foreign rend="it">croque monsieur.</foreign>

4. The TUSNELDA DTDs

4.1 The tusneldaHeader DTD

4.2 The tusneldaDoc DTD

4.3 The TUSNELDA DTDs in hypertext navigable format

5. Acknowledgments

For fruitful discussions, we would like to thank the participants of the TUSNELDA standardization workshop, Michael Betsch, Bernhard Brehmer, Hervé Dejean, Sam Featherston, Gabriela Fulir, Stefanie Herrmann, Sandra Kübler, Lothar Lemnitzer, Jürgen Mellinger, Detmar Meurers, Frank Henrik Müller, Reimar Müller, Slavica Stevanovic and Tylman Ule.

Furthermore, we want to emphasize that the TUSNELDA corpus annotation standard owes a lot both to the EAGLES corpus encoding standard (CES) and to the Text Encoding Initiative (TEI). Most parts of the TUSNELDA standard are adaptations of either CES or TEI to the needs of the projects in SFB 441.

Finally, we also want to mention that the CES guidelines were very helpful when writing the TUSNELDA guidelines. Since the structures of CES and TUSNELDA are very similar, many parts of our guidelines were taken over from the CES guidelines in modified form.

Laura Kallmeyer, Roland Meyer and Andreas Wagner, 09/11/2001. Reviewed 03/09/2009.