4.3 The TUSNELDA DTDs in hypertext navigable format
5. Acknowledgments
The TUSNELDA corpus encoding standard was developed as a common
annotation standard for corpora collected and annotated in SFB
441. These corpora will be part of the "Tübingen collection
of reusable, empirical, linguistic data structures" (TUSNELDA).
A common annotation standard for TUSNELDA was necessary in order to
provide comparability of the corpora in SFB 441. Further,
the standard is needed to guarantee reusability of the corpora
with standard corpus-linguistic tools. At the same time, the
TUSNELDA standard is flexible enough to meet the different needs
of the projects in SFB 441.
The TUSNELDA standard is based on the corpus encoding standard (CES)
developed by EAGLES (Expert Advisory Group on Language Engineering
Standards): depending on the needs of the corpus-linguistic work in
SFB 441, the CES was extended and partly modified. In some cases where
CES was not sufficiently structured for the needs of the corpus
linguists in SFB 441, the original TEI solutions proved to be more
suitable and were therefore adopted (with some modifications).
The main features of the TUSNELDA corpus encoding standard are also
described in Kallmeyer & Wagner: The TUSNELDA annotation standard: An XML encoding standard for multilingual corpora supporting various aspects of linguistic research. To appear in Proceedings of the conference Digital Resources for the Humanities DRH 2000, Sheffield, September 2000.
Element names are always set in boldface enclosed in brackets
"<" and ">", e.g. <tusneldaHeader>. Attribute
names are also set in boldface, e.g. type. For attributes
with a default value, the default is marked with an asterisk in
the list of values.
All further modifications of the DTD for TUSNELDA must be such that
any text or corpus satisfying a previous version of the DTD also
satisfies the new version.
Several elements in the corpus contain natural language
descriptions or notes that are not part of the corpus text
itself. For example, as part of the encoding description, the
header of a corpus contains a description of the project for which
the corpus was encoded. In principle, these natural language meta
texts can be written in any language. However, one should take
into account the possible users of the corpus when choosing a meta
language. Furthermore, if a language other than English is chosen,
an additional abstract of the text in English is recommended.
Whether several levels of conformance in the style of the CES levels of
conformance are useful for TUSNELDA still needs to be
considered.
The original CES levels of conformance are not adequate for TUSNELDA,
since the DTD for tusneldaDoc is not only an extension but also a
modification of the original cesDoc DTD. It is no longer
compatible with the cesDoc DTD, i.e. corpora encoded with the
TUSNELDA DTDs do not satisfy the CES or XCES (the XML variant of
CES) DTDs.
As a desirable level of conformance for the corpora and texts in
TUSNELDA, for each publication unit, the following conditions
should hold:
- the text structure should be encoded at least down to the paragraph level,
- for every tag that occurs in the encoded publication unit, the tagging
should be complete in the sense that all elements
corresponding to this tag are encoded.
A publication unit should be either a text or a corpus. The specific
project encoding a corpus can decide which portions of texts or
corpora constitute a publication unit.
The topmost element of the TUSNELDA DTD is the element
<tusneldaCorpus>. This element contains a header followed
either by one or more texts or by one or more subcorpora. The
whole corpus has a header, and each of its texts or subcorpora
also has a separate header. A text together with its preceding
header constitutes an element <tusneldaDoc>.
- <tusneldaCorpus>
- contains the whole of a TUSNELDA encoded corpus, comprising a
single corpus header and one or more
<tusneldaDoc> elements, each
containing a single text header and a text. Additionally,
the <tusneldaCorpus> element can be
recursively nested, and sequences of this element can
appear at any nested level, in order to identify
sub-corpora. In addition to the global attributes (see
Section 2.1), it has
the following attributes:
- type
- used to identify the type of a sub-corpus (by language, genre,
etc.) when nested <tusneldaCorpus>
elements are used.
- version
- provides the version of the tusneldaDoc DTD with which this corpus
complies. If different parts of the corpus were
created using different versions of the DTD (this is
possible since any version is upward-compatible with
its successor), then the value here reflects the
highest version number used in the corpus, i.e. the
version with which the whole corpus can be parsed. The
attribute version is required.
For organizational reasons, the corpus TUSNELDA will be divided into
subcorpora that themselves contain corpora. Corpora
encoded by the same project form one subcorpus. Further
decisions on how to divide corpora into subcorpora will be made later.
The attribute type is used to characterize subcorpora
with respect to language, genre, etc. Recommendations for
possible values of type will be given at a later point,
since these values depend on the plans for
further subdivisions of the corpus into subcorpora.
Example:
<tusneldaCorpus type="multilingual, SFB 441">
<tusneldaHeader>
...
(Header of the whole corpus TUSNELDA)
...
</tusneldaHeader>
<tusneldaCorpus type="romance languages, project B9">
<tusneldaHeader>...</tusneldaHeader>
<tusneldaCorpus type="Spanish, spoken texts">
...
</tusneldaCorpus>
<tusneldaCorpus type="Portuguese, historic texts">
<tusneldaHeader>...</tusneldaHeader>
<tusneldaCorpus type="Portuguese, historic texts, 15th century">
...
</tusneldaCorpus>
<tusneldaCorpus type="Portuguese, historic texts, 16th century">
...
</tusneldaCorpus>
...
</tusneldaCorpus>
</tusneldaCorpus>
</tusneldaCorpus>
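The example above concentrates on the nesting of subcorpora. As a complementary sketch, a corpus that directly contains two texts and carries the required version attribute might look as follows. The version number and the type value are invented for illustration; header attributes are omitted as in the example above, and the text itself is elided:
<tusneldaCorpus type="German, newspaper texts" version="1.0">
  <!-- version="1.0" is a hypothetical DTD version number -->
  <tusneldaHeader>...</tusneldaHeader>
  <tusneldaDoc>
    <tusneldaHeader>...</tusneldaHeader>
    ...
  </tusneldaDoc>
  <tusneldaDoc>
    <tusneldaHeader>...</tusneldaHeader>
    ...
  </tusneldaDoc>
</tusneldaCorpus>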
Three global attributes are defined, which may appear on any element
in the header:
- id
- a unique identifier for the element bearing the ID value.
- n
- a number or other label for the element, not necessarily unique within the corpus.
- lang
- indicates that the tag's content is in the
specified language. The value of the lang attribute
is one of the following:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- a code from Ethnologue;
- one of the above extended by a country code from ISO 3166 (e.g.,
"en.uk" or "eng.uk" for English as spoken in the
United Kingdom).
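As an illustration, the three global attributes might be used together on a header element as follows. The id value and the label are invented, and the element chosen is only an example:
<projectDesc id="pd1" n="1" lang="de">
  <!-- hypothetical identifier and label; "de" is the two-letter ISO 639 code for German -->
  ...
</projectDesc>
A text written in British English would, following the last item of the list, use an extended code such as lang="en.uk".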
The global attributes for elements in the header are defined at the
top of the header.elt and represented by an entity, %A.HEADER.
This entity is used
to represent the list of global attributes on the attribute
declarations for most elements in the DTD of the header.
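A parameter entity of this kind is typically declared once and then referenced in each attribute list. The following DTD fragment is a sketch of that mechanism only; the attribute types and the #IMPLIED defaults are assumptions, not a copy of the actual declarations in the header DTD:
<!-- assumed declaration of the global header attributes -->
<!ENTITY % A.HEADER
  'id    ID     #IMPLIED
   n     CDATA  #IMPLIED
   lang  CDATA  #IMPLIED'>
<!-- the entity is then referenced in the attribute list of a header element -->
<!ATTLIST distributor %A.HEADER;>
Declaring the list in one place means that a change to the global attributes has to be made only once.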
Each text in the corpus (i.e. each <tusneldaDoc> element)
has its own header, referred to as a text header.
The whole corpus and each subcorpus also have
a header, referred to as a corpus header, which contains
information applicable to the corpus in question (possibly with some local
overriding). Both corpus and text headers are represented by
<tusneldaHeader> elements. The type attribute with
values TEXT or CORPUS is used to
distinguish the two.
- <tusneldaHeader>
- contains the descriptive and declarative information making up an "electronic title page" prefixed to every text, or to the corpus as a whole.
- type
- specifies the kind of document to which the
header is attached. Possible values are:
- CORPUS the header is attached to the corpus.
- TEXT* the header is attached to a single text.
- creator
- specifies the agency responsible for creating
the header. For TUSNELDA, in most cases this is SFB
441, and further, a specific project can be added.
- version
- specifies the version and revision of the
tusneldaHeader.elt used to encode this header. This number is found
near the top of the tusneldaHeader.elt itself.
- date.created
- specifies the date on which the header
content was created.
- date.updated
- specifies the date on which the header
content was last updated.
The attributes date.created, date.updated and
version are required.
The <tusneldaHeader> element contains the following four
elements:
- <fileDesc>
- contains a full bibliographic description of the corpus itself or of a
text within it.
- <encodingDesc>
- documents the relationship between an electronic text and the source
or sources from which it was derived.
- <profileDesc>
- provides further information about various aspects of a text,
specifically the language used, the situation and date of its production, the
participants and their setting, and a descriptive classification for it.
- <revisionDesc>
- summarizes the revision history for a file.
The subelements <fileDesc> and <profileDesc>
are required.
Note that if the lang or wsd attributes are used on
elements in the main text, it is required to include a
<profileDesc> element containing
<langUsage> (for use of lang) and/or
<wsdUsage> (for use of wsd).
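Putting the attributes and subelements described above together, a text header might be laid out as in the following sketch. All attribute values are invented for illustration, the ISO-style date format is an assumption, and the content of the four subelements is elided:
<tusneldaHeader type="TEXT" creator="SFB 441, project C1" version="1.0"
    date.created="2000-05-10" date.updated="2000-09-01">
  <!-- hypothetical values throughout -->
  <fileDesc>...</fileDesc>
  <encodingDesc>...</encodingDesc>
  <profileDesc>...</profileDesc>
  <revisionDesc>...</revisionDesc>
</tusneldaHeader>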
The file description is the first of the four main constituents of the header
and is represented by the <fileDesc> element. It is a
required element of the header.
The file description documents the
electronic file itself, i.e. (in the case of a corpus header) the
corpus, or (in the case of a text header) the individual text to which the
header applies. The element consists of a title statement, an
edition statement, an optional element <extent>, a
publication statement and one or more source descriptions:
- <titleStmt>
- groups information concerning the title of the corpus or the
individual text and its
constituent texts.
- <editionStmt>
- contains any additional information relating to a particular version
of a text.
- <extent>
- provides the size of the electronic text as stored on
some carrier medium.
- <publicationStmt>
- groups information concerning the publication or distribution of the
corpus and its constituent texts.
- <sourceDesc>
- supplies a bibliographic description of the copy text(s) from which
an electronic text was derived or generated. Further detail is given in the
following subsections.
<titleStmt>, <editionStmt>,
<publicationStmt>, and <sourceDesc> are
required.
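The order of these constituents can be sketched as follows. The content of each subelement is elided, and the version value is invented for illustration:
<fileDesc>
  <titleStmt>...</titleStmt>
  <editionStmt version="1">...</editionStmt>  <!-- hypothetical version value -->
  <extent>...</extent>                        <!-- optional -->
  <publicationStmt>...</publicationStmt>
  <sourceDesc>...</sourceDesc>                <!-- may be repeated -->
</fileDesc>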
Note that the <titleStmt> describes the machine-readable file,
while the source text is specified in the <sourceDesc>. The title
in the <titleStmt> should indicate that this is a machine-readable
version and should not be identical to the title of the source
text.
The <titleStmt> element consists of a <h.title> element followed by
a <respStmt> element. These sub-elements are used throughout
the header, wherever the title of a work or a statement of responsibility is
required. The element <respStmt> can list one or
more statements of responsibility.
The subelements of <titleStmt>:
- <h.title>
- the title of the electronic file, including alternative titles or
subtitles.
For the whole corpus, the title is TUSNELDA;
electronic texts inside TUSNELDA are entitled: Title - TUSNELDA
electronic version. In those cases where the
electronic text has a source text, Title should be
the original title of the source.
- <respStmt>
- supplies information about any person or institution responsible for
the intellectual content of a text, edition, or electronic transcription.
Both <h.title> and <respStmt> are required.
<respStmt> in turn consists of one or more pairs of
elements <respType> and <respName>:
- <respType>
- contains a phrase describing the nature of a person's or institution's
intellectual responsibility. Recommended contents of
<respType> are
Editor, Publisher, Project,
Annotation, Correction.
- <respName>
- the name of the person or institution holding the responsibility named in the preceding <respType>, expressed as a proper name.
Example of a possible element <titleStmt>:
<titleStmt>
<h.title>
Süddeutsche Zeitung - TUSNELDA electronic version
</h.title>
<respStmt>
<respType>Project</respType>
<respName>C1 (Marga Reis)</respName>
<respType>Annotation (POS-Tagging)</respType>
<respName>Student A</respName>
<respType>Correction</respType>
<respName>Student B</respName>
</respStmt>
</titleStmt>
The element <editionStmt> is required in the file
description. The important information contributed by
<editionStmt> is the value of the attribute
version. This attribute is required.
In corpus headers, the version
attribute of the <editionStmt> element is used to indicate both a version number and a revision number, in
the form "version.revision", where "version" changes if texts are added to or
removed from the corpus, and "revision" changes if amendments are made within
texts or the corpus header.
In individual text headers, the version
attribute carries only a revision number.
Sample edition statement:
<editionStmt version='2'>Second version, substantially extended
and corrected.</editionStmt>
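For comparison, an edition statement in a corpus header carries both numbers. In the following sketch the value 2.3 is invented and stands for revision 3 of version 2:
<!-- hypothetical corpus-header edition statement in the form "version.revision" -->
<editionStmt version="2.3">Second version (further texts added), with
subsequent corrections in the corpus header.</editionStmt>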
Note that with a modification of a document (text or corpus), the
value of version in the <editionStmt> of the
header changes, and
consequently, the attribute date.updated of the header
also gets a new value. In other words, because of the version
attribute in the edition statement, the attribute date.updated
of a header actually refers not only to the header but to the
whole corpus.
The element <extent>
describes the approximate size of the electronic text as stored on
some carrier medium, specified in words, tokens,
characters and additionally in Kb.
<extent> is optional and can therefore be omitted, in particular at
an early stage of collecting and encoding a text or corpus. However, once a text or corpus has reached a state where
it is intended to be used for linguistic research by people
other than those encoding the corpus, it is highly recommended to add the
<extent> element to the file description.
The <extent> tag contains:
- <wordCount>
- contains the count of the linguistic words in the text without
punctuation
- <tokenCount>
- count of tokens, i.e. <wordCount> + count of
punctuation marks
- <characterCount>
- count of characters referring to everything counted in
<tokenCount>. For <characterCount>, entities
are treated as follows: character entities are counted as
a single character, and other entities are resolved before
counting, i.e. the string the entity stands
for is counted.
- <byteCount>
- contains the count of bytes in the file containing the text
together with its markup.
- units
- gives the unit in which the byte count is measured.
- BYTES bytes
- KB* kilobytes
- MB megabytes
- GB gigabytes
The <byteCount> tag gives the size of the text including
its tags, in its representation as a text file encoded in an 8-bit ISO
character set, which is useful for calculating media requirements or file
download times.
- <extNote>
- a descriptive note supplying additional information of any kind
relating to the extent information provided within a corpus or text
header. The extent note should at least contain a
characterization of punctuation marks, i.e. information
about those parts of the text that are counted in
<tokenCount> but not in <wordCount>.
<wordCount>, <tokenCount> and
<byteCount> are required.
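A filled-in <extent> element might look as follows. All counts are invented for illustration (the token count equals the word count plus 1500 punctuation marks), and the order of the subelements follows the list above, which is an assumption about the DTD:
<extent>
  <!-- hypothetical counts -->
  <wordCount>10000</wordCount>
  <tokenCount>11500</tokenCount>
  <characterCount>61000</characterCount>
  <byteCount units="KB">95</byteCount>
  <extNote>Tokens that are not counted as words: full stops, commas,
  colons, semicolons, question marks, exclamation marks, quotation
  marks and brackets.</extNote>
</extent>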
The publication statement contains the subelements
<distributor>, <pubAddress>, the
optional elements <telephone>, <fax>,
<idno> and <eAddress>, one or more
elements <availability> and a <pubDate>:
- <distributor>
- gives the name of the person or institution who distributes the
text or corpus. In case of TUSNELDA, this is usually SFB 441.
- <pubAddress>
- contains a postal address of the
distributor.
- <telephone>
- gives the telephone number of the person or institution who distributes the text or corpus, in a format conforming to ITU-T/CCITT Recommendation E.123.
- <fax>
- gives the fax number of the person or institution who distributes the text or corpus, in a format conforming to ITU-T/CCITT Recommendation E.123.
- <eAddress>
- gives an electronic address of the person or institution who distributes the text or corpus. Note that more than one occurrence of this tag can appear, so that multiple addresses (possibly of different types) can be included.
Attribute:
- type
- gives the type of the electronic address (email address, web site, ftp site, etc.). Suggested values include:
- EMAIL* the value is an electronic mail address.
- WWW the value is a web site address.
- FTP the value is an ftp address.
- <availability>
- supplies information about the availability of a text, for example,
any restrictions on its use or distribution, its copyright status, etc.
- region
- specifies the territories within which rights in the
electronic text apply. Suggested values include:
- WORLD*
- EU
- UNI TUEBINGEN
- SFB 441
- status
- supplies a code identifying the current availability of the
text. Values are:
- RESTRICTED the text is not freely available.
- UNKNOWN* the status of the text is unknown.
- FREE the text is freely available.
- <idno>
- supplies a number (e.g., ISBN) used to identify a bibliographic item.
- type
- gives the type of the identification number. Suggested
values:
- ISBN* the number is an ISBN number.
- ISSN the number is an ISSN number.
- <pubDate>
- the publication date expressed in any format
- value
- specifies the standard value for this date in ISO 8601 (Representation of dates and times) format
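The following sketch shows a possible publication statement. The electronic addresses are placeholders, the postal address is elided, the date is invented, and the order of the subelements follows the prose above rather than a check against the DTD:
<publicationStmt>
  <distributor>SFB 441</distributor>
  <pubAddress>...</pubAddress>
  <!-- placeholder addresses, not real ones -->
  <eAddress type="EMAIL">corpus@example.org</eAddress>
  <eAddress type="WWW">http://www.example.org/tusnelda</eAddress>
  <availability region="SFB 441" status="RESTRICTED">Available to
  members of SFB 441 only.</availability>
  <pubDate value="2000-09-01">1 September 2000</pubDate>
</publicationStmt>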
The element <sourceDesc> contains one or more of the
following subelements:
- <biblFull>
- contains a bibliographic citation for a text which has been
previously encoded in electronic form. This element is
intended to include the header of the electronic text from
which the current document is derived. It contains the same elements as the
<fileDesc> element. But in contrast to
<fileDesc>, the subelements
<editionStmt> and <sourceDesc>
are optional in <biblFull>.
- <biblStruct>
- contains a structured bibliographic citation, in which only
bibliographic sub-elements appear in a specified order.
- <recordingStmt>
- this element is intended for recordings of spoken text. It
characterizes the source either as a recording made by the
corpus encoding agency itself, or as a
recording made by some broadcasting agency.
- <p>
- in some cases, it is not possible to give a
source description or the source(s) are already specified at some
other place. The first holds for example for a text that is not a
recording and that is collected by the encoding agency itself. In this
case a corresponding notice in the form of a paragraph (a
<p> element) is sufficient. The second holds for corpus
headers (in contrast to text headers). In general, the sources of the
texts in a corpus should be specified in the header of the texts
themselves. The headers of subcorpora may contain an unstructured
description of the sources of the texts (in a <p> element)
or they may contain an empty <p> element, i.e. no source
description.
The headers of
individual texts will each contain at least one of the above elements to
specify their source. When a particular text contains items derived from more
than one bibliographic source or recording, all relevant sources for which
information is available are listed in the text header, and individual
<div> elements are associated with the correct citation
or recording by means of the decls
attribute.
If an electronic text has been derived from a previous electronic
version of the text, then the source description will contain a
<biblFull> element. If this version had itself been
derived from another electronic version, then this
<biblFull> element may contain yet another
<biblFull> element, and so on for as many recursive
levels as required. If an electronic text is derived from a print source,
it contains a <biblStruct> element describing that
source. If it is derived from a recording, it contains a
<recordingStmt>.
For electronic texts derived from previous electronic
texts, it is recommended to add at least one recursive level,
i.e. at least the <sourceDesc> element for the
electronic source of the text.
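One recursive level of this kind can be sketched as follows. The element contents are elided, and which optional subelements of <biblFull> are present is a choice made for illustration:
<sourceDesc>
  <biblFull>
    <!-- header information of the electronic version this text was derived from -->
    <titleStmt>...</titleStmt>
    <publicationStmt>...</publicationStmt>
    <sourceDesc>
      <!-- the print source of that earlier electronic version -->
      <biblStruct>...</biblStruct>
    </sourceDesc>
  </biblFull>
</sourceDesc>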
The <biblStruct> element has the following component
sub-elements:
- <analytic>
- contains bibliographic elements describing an item (e.g. an article or
poem) published within a monograph, journal, or periodical and not as an
independent publication.
- <monogr>
- contains bibliographic elements describing an item (e.g. a book or
journal) published as an independent item (i.e. as a separate physical
object).
At least one <monogr> element must be present in a
<biblStruct> element. It may contain the following
elements:
- <h.title>
- the title of a work.
- <h.author>
- in a bibliographic reference, contains the name of an author
(personal or corporate) of a work; names should be given in a canonical form,
with surnames preceding forenames.
- <respStmt>
- supplies information about any person or institution responsible for
the intellectual content of a text, edition, or electronic
transcription. (see the description of <respStmt>)
- <edition>
- provides bibliographic details for an edition of some text.
- <imprint>
- groups information relating to the publication or distribution of a
bibliographic item.
- <idno>
- supplies a standard (e.g., ISBN) number used to identify a
bibliographic item.
- type
- a name or abbreviation (e.g., ISBN) identifying what type of
identifying number is given. Unless provided
explicitly the default value is:
- ISBN* the value is an ISBN number.
- <biblScope>
- defines the scope of a bibliographic reference, for example as a list
of page numbers, or a named subdivision of a larger work.
- type
- identifies the type of information conveyed by the element.
- PP the element contains a page number or page range.
- VOL the element contains a volume number.
- ISSUE the element contains an issue number, or volume and issue numbers.
- <biblNote>
- a descriptive note supplying additional information of any kind
relating to a bibliographic item described within a corpus or text header.
Published texts must contain at least one <imprint> element,
which can contain the following elements:
- <publisher>
- proper name of a person, place or institution.
- type
- categorises the name. Legal values are:
- PERSON name of a person
- PLACE name of a place
- ORG name of an organization
- <pubDate>
- a calendar date in any format.
- value
- specifies the standard value for this date in ISO 8601 format
- <pubPlace>
- place of publication for a book, article, etc.
The <analytic> element is used when multiple monographic records are
grouped together into single items. When the item described by a bibliographic
citation forms a part of some other bibliographic item (as, for example, a
newspaper article within a newspaper, or a journal article within a
collection), a monographic description should be given for the newspaper or
collection, prefixed by an analytic description for the individual component,
enclosed within an <analytic> element. This contains a mixture of
the elements <h.author>, <respStmt> and
<h.title> in any order
and repeated as necessary.
Sample element <biblStruct>:
<biblStruct>
<monogr>
<h.title>Effi Briest</h.title>
<h.author>Theodor Fontane</h.author>
<imprint>
<pubPlace>Stuttgart</pubPlace>
<publisher>Manesse-Verlag</publisher>
<pubDate>1991</pubDate>
</imprint>
<idno type="ISBN">3717511327</idno>
</monogr>
</biblStruct>
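For a newspaper article, an analytic description precedes the monographic one. In the following sketch the article title, author, date and page number are placeholders invented for illustration:
<biblStruct>
  <analytic>
    <!-- hypothetical article within the newspaper -->
    <h.title>Title of the article</h.title>
    <h.author>Surname, Forename</h.author>
  </analytic>
  <monogr>
    <h.title>Süddeutsche Zeitung</h.title>
    <imprint>
      <pubPlace>München</pubPlace>
      <pubDate value="2000-03-15">15 March 2000</pubDate>
    </imprint>
    <biblScope type="PP">12</biblScope>
  </monogr>
</biblStruct>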
The element <recordingStmt> consists of one or more
<recording> elements each of which characterizes one
of the recordings occurring in the corpus either by giving a
bibliographic description of the broadcast the recording comes
from or (in case the recording was made by the corpus encoding
agency itself) by giving details about the recording such as date,
the persons involved in the recording, the equipment used etc.
<recording> consists of an arbitrary number of recording
notes or structured descriptions of the recording in the form of
elements <respStmt>, <equipment>,
<broadcast> or <date>.
<recording> has the following attributes:
- type
- characterizes the kind of recording. Legal values are:
- AUDIO*
- VIDEO
- dur
- gives the duration of the recording
The subelements of <recording> are:
- <recNote>
- an unstructured description of aspects of the recording that are
notable and that cannot be described in any of the other
subelements of <recording>
- <respStmt>
- (see the description of <respStmt>). Possible contents
of the subelement <respType> are
Interviewer, Interviewee etc.
- <equipment>
- describes the technical equipment used to perform the
recording.
- <broadcast>
- contains a single subelement <biblStruct> that gives
details of the broadcast in a form analogous
to bibliographic citations. The broadcasting
agency responsible for a broadcast is regarded
as its author, while other participants (for
example interviewers, interviewees, directors,
producers, etc.) should be specified using the
<respStmt> inside
<biblStruct> or the
<editor> element.
- <date>
- contains the date of the recording.
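A recording made by the encoding project itself might be described as in the following sketch. All names, the date and the equipment are invented, and the format of the dur value is an assumption, since no format is prescribed above:
<recordingStmt>
  <recording type="AUDIO" dur="45 min">
    <!-- hypothetical recording details -->
    <respStmt>
      <respType>Interviewer</respType>
      <respName>Student A</respName>
      <respType>Interviewee</respType>
      <respName>Speaker 1</respName>
    </respStmt>
    <equipment>Portable DAT recorder with external microphone</equipment>
    <date>2000-03-15</date>
    <recNote>Some background noise at the beginning of the recording.</recNote>
  </recording>
</recordingStmt>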
The second major component of the header, the encoding description, contains
information about the relationship between an encoded text and its original
source and describes the editorial and other principles employed throughout the
corpus.
The <encodingDesc> element has the following seven components:
- <projectDesc>
- describes in detail the purpose for which an electronic file
was encoded.
- <samplingDecl>
- contains a prose description of the rationale and methods used in
sampling texts in the creation of the corpus.
- <editorialDecl>
- provides details of editorial principles and practices applied
during the encoding of a text.
- <tagsDecl>
- provides detailed information about the tagging applied to an SGML
document.
- <annotations>
- groups information about existing annotation files associated with
the text.
- <refsDecl>
- specifies how canonical references are constructed for this text.
- <classDecl>
- contains a series of <category> elements, defining the
classification codes used for texts within the corpus.
The element <projectDesc> is required; all other
subelements of the encoding description are optional.
This element provides information about the project for and by which
the text or corpus was created, together with any other relevant
information concerning the process by which it was assembled or
collected. The content of this element is an unstructured
note. Example:
<projectDesc>
The MULTEXT project is assembling a corpus consisting of
mono-lingual texts in seven Eastern and Western European
languages, together with parallel translations in each of
these languages. The original texts were acquired in various
forms and marked up for conformance with the MULTEXT/EAGLES
Corpus Encoding Standard, to test and validate that scheme.
MULTEXT has also developed a suite of annotation tools which
have been tested on the texts in the corpus.
</projectDesc>
A minimal encoding description can contain only the
<projectDesc> element. In this case, a prose
description of the encoding methods can be provided. If
documentation of encoding principles exists in another location
(a manual, etc. in printed form, at a given URL, in an ftp site,
etc.) this information should be provided.
In principle, the <projectDesc> can be written in any
language. However, a consideration of potential users when
choosing a language for the <projectDesc> is
recommended (see also Section Meta language). If a language different from
English is chosen, an additional short version of the project
description in English is desired.
The <samplingDecl> element
is also an unstructured note, which contains information about the
methods for text sampling in the corpus.
Concerning the language that is chosen for the sampling declaration,
the same considerations hold as in the case of the project
description.
The encoding description can contain any number of sampling declarations.
A <samplingDecl> element occurring in the header of a corpus
gives information about the choice of the texts in the corpus
whereas an element <samplingDecl> occurring in the header of
a text provides details about the inclusion or
exclusion of portions of the text. In both cases, the sampling
declaration preferably includes information about the reason for
this sampling, and the means by which this is noted in the
encoding, if any.
For example (adapted from the English-Norwegian Parallel Corpus Project manual):
<samplingDecl>
The texts of the core corpus are mostly extracts from books.
The extracts are between 10,000 and 15,000 words long (30 - 40
pages), and are taken from the beginning of the texts. The front
matter, prefaces, forewords, list of contents, etc., are not
included in the extracts. In some cases, introductions have been
left out as well, e.g. introductions by scholars to works of
fiction.
Omission of passages in the text may be marked by an
&lt;omit&gt; tag.
</samplingDecl>
The <editorialDecl> element contains elements that
each specify a particular kind of editorial practice used for some
portion of the corpus. Where the same principles apply across the
whole corpus (e.g., for the <segmentation> element), they
can be documented only once within the corpus header. If different
parts of the corpus apply different practices (as for example with the
<quotation> or <hyphenation> elements), all
possible practices can be defined in the corpus header, and particular
parts of the corpus can specify the editorial practices applicable to
them by using the decls attribute. When
this method is used, if a practice is not explicitly associated with a
part of the corpus in this way, it is assumed not to apply to it.
The <editorialDecl> element contains the following
elements:
- <correction>
- specifies a set of correction practices applied in creating one or more
components of the corpus. For TUSNELDA in all cases of corrections,
the original must be retained in an attribute. This is
automatically controlled.
Corresponding to this, the default value of the attribute
method is TAGS.
- method
- indicates whether corrections are made without notation or made
by including editorial tags.
- TAGS* correction indicated with tags
- SILENT correction made silently
- <quotation>
- specifies editorial practice adopted with respect to quotation marks
in the original.
- marks
- indicates whether or not quotation marks are retained as tag
content in the text.
- NONE no quotation marks have been retained
- SOME some quotation marks have been retained
- ALL* all quotation marks have been
retained
- form
- specifies how quotation marks are indicated within the
text.
- STD use of quotation marks has
been standardized; open and close quote marks are distinct.
- NONSTD open and close quote marks are
represented indiscriminately.
- UNKNOWN* use of quotation marks is
unknown.
- <hyphenation>
- summarizes the way in which end-of-line hyphenation in a source text
has been treated in an encoded version of it. For TUSNELDA, it is
recommended to document each elimination of hyphenation
and to retain the original form in a
tag. <hyphenation> should contain a
description of the method used for eliminating hyphenation
marks, preferably with information about the script
applied for this process.
- <segmentation>
- describes the principles according to which the text has been
segmented, for example into sentences, tone-units, graphemic strata, etc.
A detailed specification of the methods applied for segmentation is
recommended. <segmentation> consists of
arbitrarily many pairs of a tag and the corresponding
segmentation method followed by at least one
segmentation note.
- <tag>
- a specific tag
- <segmMethod>
- the segmentation method applied for this tag. All scripts used for
segmentation with respect to a certain tag should be
named in its segmentation method. If possible, these
scripts are distributed together with the corpus.
- <segmNote>
- additional remarks concerning the segmentation of the text
- <transduction>
- describes the principles according to which the text has been
transduced, either in transcribing it from audio tape to written form, or in
converting from an electronic original. If possible, the scripts used
for transduction should be named.
- <normalization>
- specifies a set of normalization practices applied in creating one or more
components of the corpus. For <normalization>, similar to
<correction>, the original form must be
retained in an attribute. This is automatically
controlled.
- method
- indicates whether normalization was made without notation or
by including editorial tags.
- TAGS* normalization indicated with tags
- SILENT normalization made
silently
Example of an element <segmentation>:
<segmentation>
<tag>s</tag>
<segmMethod>automatic tagging with the tool XYZ developed by
A.B. at the C.D. Institute. The tool works as follows:
If it encounters a question or exclamation mark, then the
end of an element s is recognized. If a full stop is
encountered, then the tool checks whether it is part of an
abbreviation or an ordering number. If this is not the case,
then the end of an element s is recognized. In
order to check for abbreviations, the tool makes use of an
abbreviation list.
</segmMethod>
</segmentation>
The <tagsDecl> element is used differently in corpus and in
text headers. In a corpus
header, it is used to list all the element names actually used within the
corpus, together with a brief description of their
functions. Furthermore, it specifies the number of SGML elements
actually tagged
within each corpus. In text headers, the
same element is used only to count the number of SGML elements tagged
within the text. In both cases the element consists of a number of
<tagUsage> elements, defined as follows:
- <tagUsage>
- supplies information about the usage of a specific element within
the corpus or text with which this header is associated.
- gi
- the name (generic identifier) of the element
indicated by the tag. This attribute is required.
- occurs
- specifies the number of occurrences of this
element within the text.
- wsd
- can be used on a <tagUsage> element to indicate that for
every appearance of the described element in the text, the content
defaults to the specified character set. Therefore the declaration
<tagUsage gi=term occurs=5 wsd="ISO 8859-5">
indicates that the content of all <term> elements is in
the ISO 8859-5 character set.
Note that the global attribute lang can
similarly be used in a
<tagUsage> element to indicate that for
every appearance of the described element in the text, the content
defaults to the specified language.
In the corpus header, each <tagUsage> element
contains a brief description of the element specified by its gi
attribute, and the occurs attribute is not supplied. In text
headers, the <tagUsage> elements may be empty, but the
occurs attribute is always supplied.
The header of TUSNELDA must contain (as part of the
<tagsDecl> element) one <tagUsage>
element for each tag used in TUSNELDA. This
<tagUsage> specifies in its content the semantics
of the tag in question. This guarantees uniform semantics for
each individual tag used throughout the subcorpora of TUSNELDA.
A typical written text has a tag declaration like the following:
<tagsDecl>
<tagUsage gi=name occurs=256>
<tagUsage gi=div occurs=7>
<tagUsage gi=head occurs=7>
<tagUsage gi=p occurs=705>
<tagUsage gi=reg occurs=2>
<tagUsage gi=sic occurs=1>
<tagUsage gi=body occurs=1>
</tagsDecl>
A Perl script to automatically generate <tagUsage> elements
with appropriate values for tags in any SGML text is available at
<URL: http://www.cs.vassar.edu/~priestdo/research/scripts/tagusage.txt>
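As a rough illustration of what such a script does, the following Python sketch counts start tags with a regular expression and prints a text-header style <tagsDecl>. It is not the Perl script referenced above: the function names count_tags and tags_decl and the sample text are assumptions made for this example, and the approach ignores SGML features such as omitted start tags.

```python
import re
from collections import Counter

# Matches the name of a start tag such as <p>, <div type=CHAPTER> or
# <foreign lang="fr">. End tags (</p>) and declarations (<!DOCTYPE ...>)
# are skipped because the character after "<" must be a letter.
# Limitation: tags inside comments or CDATA sections are counted as well.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9._-]*)")


def count_tags(text):
    """Return a Counter mapping each element name to its number of start tags."""
    return Counter(START_TAG.findall(text))


def tags_decl(text):
    """Build a <tagsDecl> block with one <tagUsage> line per element name."""
    counts = count_tags(text)
    lines = ["<tagsDecl>"]
    for name in sorted(counts):
        lines.append(f"<tagUsage gi={name} occurs={counts[name]}>")
    lines.append("</tagsDecl>")
    return "\n".join(lines)


# Hypothetical sample text, used only to demonstrate the functions.
sample = ("<body><div type=CHAPTER n=1><head>Title</head>"
          "<p>One.</p><p>Two.</p></div></body>")
print(tags_decl(sample))
```

Counting start tags rather than end tags keeps the sketch usable for SGML texts in which end tags are omitted.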
The element <annotations> groups information about
annotation documents associated with the text. The following
elements are used for these purposes:
- <annotation>
- gives information about an annotation file associated with the text.
Attributes:
- type
- indicates the type of annotation. Values include:
- TOKEN annotation file contains
segmentation into tokens.
- MORPHSYN annotation file contains
morpho-syntactic category information for the
words in the text.
- SYNTAX annotation file contains
syntactic (i.e. structural) information for
phrases in the text.
- SEGMENT annotation file contains
segmentation into sentences and words.
- ALIGN annotation file contains alignment links to a parallel translation.
- ann.loc
- provides information (path/file name, URL, etc.) about the location of the annotation file.
- trans.loc
- for annotation files containing alignment
information, trans.loc provides information (path/file name, URL, etc.) about the location of the file containing the aligned text.
The element <refsDecl> is useful for encoding corpora since it
provides information about references which are often used in the
alignment of parallel texts. In particular, it is common to use ID
values on tags marking paragraphs and sentences as references in
links associating two parallel texts. See, for example, the
English-Norwegian Parallel Corpus Project
and
The Lingua Parallel Concordancing Project.
<refsDecl>
A reference system is built up using the identifiers of the
following text units: text, division, paragraph, s-unit.
Each nested division has an identifier which is built up by
successively adding to the identifier of the text. Each
paragraph has an identifier which adds yet another layer to the
immediately superordinate identifier. S-units are numbered
within the nearest division, as shown above. After alignment,
each s-unit in the core corpus has a "corresp"
attribute containing a reference to the corresponding unit(s) in
the parallel text.
</refsDecl>
The <classDecl> element provides means to define a set of text
categories for
classifying texts in the corpus. A standardized set of text categories is under
development by the EAGLES Corpus Working Group on Text Typology, which
may eventually eliminate the need to explicitly provide a
descriptive taxonomy in the corpus header.
The <classDecl> element contains the descriptive taxonomy used to
classify texts within the corpus. It occurs once, in the corpus
header, and consists of one or more <taxonomy>
elements. The <taxonomy> element in turn contains
either a set of <category> elements, each
representing a particular textual classification feature and a
value for that feature; or one of the elements <h.bibl> or
<biblStruct>, providing a bibliographic citation
for documentation of a categorization scheme, followed
optionally by a set of <category> elements. The
<h.bibl> element contains only unstructured text
and is used for cases
where only a very simple citation is required.
- <taxonomy>
- defines a typology used to classify texts.
- <category>
- contains an individual descriptive category or feature-value pair.
The global id attribute is required for the <category>
element, since it is used to associate a <catRef> within a text
header with the descriptive category appropriate to it. The category element
contains a set of <catDesc> elements:
- <catDesc>
- describes a category within the text typology, in the form of a brief
prose description.
The <catDesc> element is used to contain
the value for a feature within a <category>, unless that category
is further subdivided, in which case a nested <category> element
may be used.
Within the <textClass> element of the header for each text, a
<catRef> element is provided, the target attribute of which
lists the identifiers of all <category> elements applicable to
that text.
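The link between <catRef> and <category> can be checked mechanically. The following Python sketch resolves the identifiers listed in a target attribute against the categories of a corpus header. It assumes well-formed XML with quoted attribute values and a whitespace-separated target list; the category ids, the descriptions and the function name resolve_categories are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus header fragment; ids and descriptions are made up.
CORPUS_HEADER = """<classDecl>
  <taxonomy>
    <category id="news"><catDesc>newspaper text</catDesc></category>
    <category id="sci"><catDesc>scientific text</catDesc></category>
    <category id="fict"><catDesc>fiction</catDesc></category>
  </taxonomy>
</classDecl>"""

# Hypothetical text header fragment referring to two of the categories.
TEXT_HEADER = '<textClass><catRef target="news sci"/></textClass>'


def resolve_categories(corpus_header_xml, text_header_xml):
    """Map every id listed in a catRef target to its category description."""
    corpus = ET.fromstring(corpus_header_xml)
    text = ET.fromstring(text_header_xml)
    # Collect the prose description of every declared category.
    described = {
        cat.get("id"): (cat.findtext("catDesc") or "").strip()
        for cat in corpus.iter("category")
    }
    resolved = {}
    for cat_ref in text.iter("catRef"):
        for ident in cat_ref.get("target", "").split():
            if ident not in described:
                raise KeyError(f"catRef points to undeclared category: {ident}")
            resolved[ident] = described[ident]
    return resolved


print(resolve_categories(CORPUS_HEADER, TEXT_HEADER))
```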
When a standard set of text categories is developed, it is anticipated that an
attribute on <textClass> will provide the category. Unless the
standard categories are extended, no pointer to <category>
elements in the corpus header will be required.
A taxonomy for the classification of texts inside TUSNELDA will be
specified later.
The third component of the header is the profile description. The
<profileDesc> element has the following components:
- <creation>
- groups information about the period and place of creation of a text.
- <langUsage>
- groups information describing the languages, sublanguages, registers,
dialects etc. represented within a text.
- <wsdUsage>
- groups information describing the character set(s) used within a text.
- <textClass>
- groups information which describes the nature or topic of a text in
terms of a standard classification scheme, thesaurus, etc.
- <translations>
- groups information about existing translations of the text.
TUSNELDA is a multilingual corpus. Therefore the specification of the
languages in the corpus is indispensable. For this reason,
<profileDesc> and its subelement
<langUsage> are both required.
The element <creation> gives information about the period and the place of creation of a
text. This
element is different from <pubDate> (subelement of
<imprint>) that relates to the publication of a text
but not to its creation. (For Middle High German texts
published in the 20th century for example, <pubDate> and
the date specified in <creation> are different.)
Since in many cases, the exact
creation date of historic texts cannot be specified,
<creation> has the
attributes earliest and latest, which allow a creation
period to be specified. If the exact creation date is
known, it is the value of both attributes. Further,
<creation> has an attribute place that is used
to specify the place of creation of a text.
- earliest
- the earliest possible creation date, i.e. the latest date where
one can be sure that the text was created later.
- latest
- the latest possible creation date, i.e. the earliest date where
one can be sure that the text was created earlier.
- place
- the creation place.
Example:
<creation earliest=1066 latest=1127 place=England>...</creation>
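The relation between the two date attributes can be sketched in a few lines of Python. The sketch assumes that dates are plain years given as integers and writes quoted attribute values; the function names creation_period and creation_tag are not part of the standard.

```python
def creation_period(earliest, latest=None):
    """Return (earliest, latest); an exactly known date fills both attributes."""
    if latest is None:
        latest = earliest
    if earliest > latest:
        raise ValueError("earliest must not be later than latest")
    return earliest, latest


def creation_tag(earliest, latest=None, place=None):
    """Build the start tag of a <creation> element."""
    low, high = creation_period(earliest, latest)
    attrs = f'earliest="{low}" latest="{high}"'
    if place is not None:
        attrs += f' place="{place}"'
    return f"<creation {attrs}>"


print(creation_tag(1066, 1127, "England"))
print(creation_tag(1200))
```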
The <langUsage> element contains one or more
<language> elements, each identifying a language used in
the text:
- <language>
- characterizes a language, sublanguage, register, dialect,
etc., used within a single text.
- type
- indicates the type of language, e.g., sublanguage, dialect,
etc.
- ethnologue
- gives the language code from Ethnologue.
- iso639
- gives the standard language code from ISO 639 in one of the following forms:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
This attribute is required.
In some cases, the language codes from ISO 639 or ISO 639-2 might be
insufficient to characterize the languages used in a
corpus. In these cases an alternative coding using the Ethnologue code
might be possible because this standard is more fine-grained than the ISO
standard. However, since ISO 639 is an important standard that is
frequently used, a characterization using ISO 639 is
required. Additionally, a characterization with the
Ethnologue code is allowed. If neither of them provides a code
for the language in question, the content of the
<language> element or the attribute type gives
a description of the language.
Example of a <langUsage> element:
<langUsage>
<language id="fr" iso639="fr">French</language>
<language id="en" iso639="en">English</language>
<language id="la" iso639="la">Latin</language>
<language id="swg" ethnologue="swg" iso639="de">Swabian</language>
</langUsage>
The value of the id attribute on any <language>
element should be given as a value for the global lang
attribute when it is used on a tag in the text or header to
refer to this language.
For example,
She ate <foreign lang=fr>croissants</foreign>
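Whether every lang value used in a text has a matching <language> declaration can be checked automatically. The following Python sketch does this for well-formed XML with quoted attribute values, which is an assumption since the examples here use the unquoted SGML notation. The function name undeclared_languages and the sample text are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Header fragment modelled on the <langUsage> example above.
HEADER = """<langUsage>
  <language id="fr" iso639="fr">French</language>
  <language id="en" iso639="en">English</language>
</langUsage>"""

# Hypothetical text in which "it" is used but not declared.
TEXT = ('<p>She ate <foreign lang="fr">croissants</foreign> and '
        '<foreign lang="it">biscotti</foreign>.</p>')


def undeclared_languages(header_xml, text_xml):
    """Return the sorted lang values used in the text but missing in the header."""
    header = ET.fromstring(header_xml)
    text = ET.fromstring(text_xml)
    declared = {language.get("id") for language in header.iter("language")}
    used = {el.get("lang") for el in text.iter() if el.get("lang") is not None}
    return sorted(used - declared)


print(undeclared_languages(HEADER, TEXT))
```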
When more than one character set is used in a text, the wsd
attribute should be used on each <language> tag to
associate the language with a particular character set.
The element <wsdUsage> contains one or more
<writingSystem> elements, each identifying a character
set used in the text:
- <writingSystem>
- characterizes a character set used within a single text.
Example:
<wsdUsage>
<writingSystem id="ISO 8859-1">ISO character set for western
European languages</writingSystem>
<writingSystem id="ISO 8859-5">ISO character set for
Cyrillic</writingSystem>
</wsdUsage>
The value of the id attribute on any <writingSystem> element should be given as a value for the global wsd attribute when it is used on a tag in the text or header to refer to this character set.
For example,
This is a patch of Cyrillic:
<foreign lang=bu wsd="ISO 8859-5">

</foreign>
When a writing
system declaration describing a transcription scheme is provided
as an auxiliary document, the value of the wsd attribute on the
<writingSystem> element must be an entity pointing to
this document. Usually, the entity expands to be the name of the file
in which the writing system declaration is stored. Note that for this
reason, the type of the wsd attribute on the
<writingSystem> element is ENTITY (indicating that its
value must be an SGML entity). In all other instances, whether in the
header or text, the type of the wsd attribute is CDATA.
The <textClass> element contains references to
the text classification scheme and descriptive keywords which together describe
the text concerned. The following elements are used for these
purposes:
- <catRef>
- specifies one or more defined categories within the
taxonomy or text typology given in <classDecl>. In
TUSNELDA, all texts are classified using the same
taxonomy.
- target
- identifies the text category or categories, by
means of an IDREF pointing to one or more
<category> elements defined in the corpus
header.
- scheme
- adds information about the text classification
scheme.
- Endtag omission is allowed for <catRef> since the content
of <catRef> is empty.
- <h.keywords>
- contains a list of key terms identifying the topic or
nature of a text, each of which is tagged as a term.
- <keyTerm>
- a keyword or a phrase
Although EAGLES/PAROLE plans to provide a standard list of keywords,
for TUSNELDA a standardization of the keywords is not
recommended. Instead, it is rather an advantage of the keywords
element that, in addition to a text classification in terms of a
given taxonomy, <h.keywords> allows an unrestricted
and flexible characterization of a text.
The element <translations>
groups information about translations of the text which exist, usually
within the same corpus. The following elements are used for these
purposes:
- <translation>
- gives information about a translation of the text. The global
lang attribute and the wsd attribute are required on
this tag. Additionally, this tag has the following optional
attribute:
- trans.loc
- provides information
(path/file name, URL, etc.) about the location
of the translation.
- <translator>
- gives the name of the translator.
Note that endtag omission is allowed for the <translation>
element, since in some cases all relevant information is
supplied in attributes only. Thus, where appropriate, this
element can function as an empty element, e.g.:
<translations>
<translation trans.loc="1984.sl.ces" lang=sl wsd="ISO8859-1" n=1>
<translation trans.loc="1984.es.ces" lang=es wsd="ISO8859-1" n=2>
<translation trans.loc="1984.ro.ces" lang=ro wsd="ISO8859-1" n=3>
</translations>
The revision description is the fourth element in the header. It is used to
record details of any significant change to the corpus. The
<revisionDesc> element has the following component:
- <change>
- summarizes a particular change or correction made to a particular
version of an electronic text which is shared between several
researchers.
Multiple <change> elements are provided for; one should
appear per change.
The <change> element contains the following subelements:
- <changeDate>
- gives the date of the change.
- value
- specifies the standard value for this date in ISO 8601 format
- <respName>
- specifies the person responsible for
the change.
- <h.item>
- specifies the nature of the
change(s). One or more occurrences of this element may
appear within each <change> element.
When any significant change is made to any component of the corpus, the
following steps should be taken:
- a <change> element is added to the
<revisionDesc> of the text affected
- the date.updated attributes of the text header and of
any header above it are
changed to the date of the change
- the revision number specified on the version attribute
of the <editionStmt> of the corpus header is
incremented.
If possible, control and administration of the versions of a corpus
should be done automatically, for example with RCS. The element
<revisionDesc> then could be automatically
generated and modified.
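How such automatic generation might look is sketched below in Python. The sketch is independent of RCS and only shows how one <change> element could be appended to <revisionDesc>. The function name add_change, the order of the subelements and the sample values are assumptions made for this example.

```python
import datetime
import xml.etree.ElementTree as ET


def add_change(revision_desc, date, resp_name, items):
    """Append one <change> element describing a single change."""
    if not items:
        raise ValueError("a change needs at least one h.item")
    change = ET.SubElement(revision_desc, "change")
    # The value attribute carries the date in ISO 8601 format (YYYY-MM-DD).
    change_date = ET.SubElement(change, "changeDate", value=date.isoformat())
    change_date.text = date.isoformat()
    ET.SubElement(change, "respName").text = resp_name
    for item in items:
        ET.SubElement(change, "h.item").text = item
    return change


# Hypothetical change record.
revision_desc = ET.Element("revisionDesc")
add_change(revision_desc, datetime.date(2001, 3, 15), "A. B.",
           ["corrected sentence segmentation"])
print(ET.tostring(revision_desc, encoding="unicode"))
```

Updating the date.updated and version attributes mentioned above would be a separate step and is not covered by this sketch.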
The minimal header has the following structure:
<tusneldaHeader>
<fileDesc>
<titleStmt>
<h.title></h.title>
<respStmt>
<respType></respType>
<respName></respName>
</respStmt>
</titleStmt>
<editionStmt></editionStmt>
<publicationStmt>
<distributor></distributor>
<pubAddress></pubAddress>
<availability></availability>
<pubDate></pubDate>
</publicationStmt>
<sourceDesc>
<p></p>
</sourceDesc>
</fileDesc>
<profileDesc>
<langUsage>
<language></language>
</langUsage>
</profileDesc>
</tusneldaHeader>
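A header can be compared with this minimal structure automatically. The following Python sketch lists the element paths of the minimal header that are absent from a given header. It is no substitute for validation against the DTD, it follows only the first matching child at each level, and the names REQUIRED_PATHS, has_path and missing_elements as well as the sample header are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Element paths taken from the minimal header structure shown above.
REQUIRED_PATHS = [
    "fileDesc/titleStmt/h.title",
    "fileDesc/titleStmt/respStmt/respType",
    "fileDesc/titleStmt/respStmt/respName",
    "fileDesc/editionStmt",
    "fileDesc/publicationStmt/distributor",
    "fileDesc/publicationStmt/pubAddress",
    "fileDesc/publicationStmt/availability",
    "fileDesc/publicationStmt/pubDate",
    "fileDesc/sourceDesc",
    "profileDesc/langUsage/language",
]


def has_path(root, path):
    """Follow the first child with a matching name at every step of the path."""
    node = root
    for name in path.split("/"):
        node = next((child for child in node if child.tag == name), None)
        if node is None:
            return False
    return True


def missing_elements(header_xml):
    """Return the required paths that are not present in the header."""
    root = ET.fromstring(header_xml)
    if root.tag != "tusneldaHeader":
        raise ValueError("root element must be tusneldaHeader")
    return [path for path in REQUIRED_PATHS if not has_path(root, path)]


# Hypothetical, deliberately incomplete header.
header = """<tusneldaHeader>
  <fileDesc>
    <titleStmt><h.title>Sample</h.title></titleStmt>
    <editionStmt/>
    <sourceDesc><p>Born digital.</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <langUsage><language iso639="de">German</language></langUsage>
  </profileDesc>
</tusneldaHeader>"""
print(missing_elements(header))
```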
The element <tusneldaDoc> contains a single document, either forming part of or derived from a corpus. The global attributes of the <tusneldaHeader> element, i.e. id, n and lang (see section 2.2), plus the following attributes are defined:
- type
- indicates the type of document (text, spoken data, etc.); the default
is TEXT.
- version
- provides the version of the tusneldaDoc DTD to which this text is compliant. The version attribute is required.
A <tusneldaDoc> element consists of a <tusneldaHeader>, followed by a <text> element, which may in turn contain a <body> element or a <group> element.
- <tusneldaHeader>
- contains the header for the text. This element is fully
described in section 2.2.
- <text>
- contains an individual text. Global attributes of the <tusneldaHeader> plus:
- complete
- specifies whether or not this text is complete or a
sample.
- Y* the full text of the original has been
transcribed
- N a sample of the original text has been taken
- decls
- specifies one or more IDs associated with elements in the text header that apply to this element.
The <text> tag may contain one occurrence of one of the following:
- <group>
- groups together a sequence of distinct texts that are regarded as a unit,
such as a sequence of prose essays, poems, etc. A <group> tag may contain an optional sequence of paragraph-level elements (as described in section 3.3.4), followed by one or more <body> elements.
- <body>
- contains the body of the text, excluding any front or back matter. Formally, it consists of an optional sequence of paragraph-level elements (cf. section 3.3.4), followed by an optional sequence of text divisions (as described in section 3.3).
For both the <group> element and the <body> element, the global attributes for the <tusneldaHeader> (cf. section 2.2) plus the following attributes are defined:
- wsd
- indicates that the element's content is encoded in the
specified character set. The value of the attribute is the character set
name (ISO-8859-1, etc.) which should be the same as that appearing on a
<writingSystem> element in the header document which describes
that character set.
- rend
- provides information about rendition in an original
printed version. The rend attribute may e.g. take one of the
following values (other values are also valid):
- BO bold face
- BX boxed
- IT italic font
- RO roman font
- UL underlined
- CA capital letters
These five attributes, namely id, n, lang, wsd, and rend, are also defined for any element embedded within a TUSNELDA <body> or <group> element. For simplicity, we will refer to them as "the textual attributes" from now on.
Note that there is no provision for the encoding of front matter such as
cover page, table of contents, appendixes, etc., in the current TUSNELDA
recommendations. For the most part, such material is unnecessary for corpus
linguistics and should not be included.
Written texts exhibit a variety of different structural forms. Some have very
little organization at levels higher than the paragraphs, while others have a
complex hierarchy of parts, sections, chapters etc. Novels are divided into
chapters, newspapers into sections, reference works into articles, etc.
The following element is used to represent textual divisions of all kinds:
- <div>
- any subdivision of a written text, e.g. chapter, section, sub-section,
article, etc.
If a text has any structural subdivision, then at least those at the highest
level should be identified.
The <div> element has the following attributes:
- type
- categorises the division in some respect, e.g. as a chapter,
section etc. A set of precise values will be provided at a later stage. The attribute is required.
- complete
- specifies whether or not this division is complete or a
sample.
- Y* the full text of the original has been
transcribed
- N a sample of the original text has been taken
- decls
- specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.
The n global attribute can be used to carry an identifying name or
number used within the text for a given division, for example, a chapter
number, as in the following example:
<div type=CHAPTER n=5>
Furthermore, the global attributes id and lang (cf. section 2.1), as well as the attributes rend and wsd (cf. section 3.2), are defined for <div> elements.
The content of the <div> tag is defined to consist of one
or more division head elements (optional) followed by a sequence of
paragraph-level elements and/or <div> elements, followed by one or more division closing elements (optional).
Below the level of text divisions, there are three general groups of elements
which may appear:
- Division head elements
- information such as section titles, bylines, etc. that often appears at the
beginning of text sections.
- Paragraph-level elements
- further division of the text, into paragraphs, etc.
- Division closing elements
- information such as datelines, bylines, etc. that can appear at the end
of a text section, especially in newspapers, etc.
Division head elements include:
- <opener>
- groups together any opening material that is not a heading at the
start of a division, including in particular <dateline> and
<keywords>.
- <head>
- contains any heading, for example, the title of a section. This element
can also appear inside the <list> and <poem> elements
to mark the title of a list or poem. It can contain any
phrase-level element.
- type
- gives the type of header, e.g., main, sub, unspecified, etc.
- <byline>
- contains the primary statement of responsibility given for a work on
its title page or at the head or ending of the work, most often applicable
to newspapers.
Can contain any phrase-level element plus the tag <docAuthor> for
the author's name.
Any (possibly empty) sequence of these forms a division opening element.
Division closing elements include:
- <closer>
- groups together material appearing at the end of a division, including
in particular <dateline> and
<keywords>.
- <byline>
- same as above.
The <keywords> element can contain
terms and lists of terms that may appear at the beginning or end of a text as
identifying material.
The <dateline> element can contain untagged prose intermixed
with markup for dates, times, names, addresses, abbreviations, and numbers.
Any (possibly empty) sequence of these forms a division closing element.
A number of divisions of text occur at what is called the paragraph-level, since
the most common such division at this level is <p> (paragraph).
There are in addition several other elements which may appear directly within
structural divisions (that is, not nested within some other element). The first six elements have been taken over without any changes from the CES.
- <p>
- a paragraph in a written text.
- <list>
- a collection of distinct items flagged as such by special layout in
written texts, often functioning as a single syntactic unit.
- <poem>
- a poem, or an extract from one, embedded or quoted within a text.
- <caption>
- (1) a heading, title etc. attached to a picture or diagram (2) a "pull
quote" or other text about or extracted from a text and superimposed upon it
to draw attention to it.
- <bibl>
- a loosely-structured bibliographic citation appearing within a corpus
text.
- <note>
- any form of note, usually a footnote. This tag is used only for notes
that are a part of the original data only, not notes which may be added by the
encoder, etc.
- <table>
- contains text displayed in tabular form, in rows and columns.
The next four paragraph-level elements have been revised for the special needs of the TUSNELDA corpus annotation. The <linkGrp> element has been adopted from the cesAlign DTD.
- <sp>
- contains material marked as "written to be spoken" or "written as
spoken", usually by the presence of a speaker prefix, for example in a play
script or printed interview.
- <quote>
- a quotation from some author other than that of the surrounding text,
usually either embedded or displayed.
- <figure>
- indicates the location of a graphic, illustration, or figure.
- <linkGrp>
- serves to bundle a group of links together.
The paragraph-level elements are discussed in more detail below.
Paragraphs consist of any kind of sequence of phrase-level elements. All attributes appropriate for textual elements, i.e. the global attributes id, n, and lang, plus rend and wsd are defined (for details see sections 2.1 and 3.2).
Lists may occur between paragraphs (i.e., on the paragraph-level) or within paragraphs (i.e. on the phrase-level). A list consists of an optional <head> element, followed by one or
more <item> elements, each of which may optionally be prefixed by
a <label> element:
- <item>
- an item within a list.
- <label>
- an enumerator or other label attached to a list item. Lists may or
may not be marked. Where marked, they may appear within or between
paragraphs.
The <label> element is used to hold the identifier
or tag sometimes attached to a list item, for example "(a)", or a word or
phrase used for a similar purpose. An example:
<list>
<label>Erstens</label>
<item>sollte die DTD den Bedürfnissen von Korpuslinguisten,
speziell im Rahmen von TUSNELDA, entsprechen,
</item>
<label>zweitens</label>
<item>sollte die DTD problemlos in XML konvertierbar sein, und
</item>
<label>drittens</label>
<item>sollte sie sich so weit wie möglich an bestehenden
Standards orientieren.
</item>
</list>
Here the <label> element is part of the annotated text itself and must not be deleted. Note, however, that for the purposes of corpus-based work, it is preferable in many cases to regard list labels as rendition information and to encode them in the n attribute, rather than as part of the document content.
The <item> element may appear only inside lists. It contains the
same elements as a paragraph, or a sequence of paragraphs, and may therefore contain one or more nested
lists.
Poems or fragments of verse or song may only appear between paragraphs. They are marked using the
<poem> element, which contains an optional series of
<head>
elements followed by one or more <lg> or <l> (for
line) elements; <l> is used to mark metrical lines rather than typographic
lines:
- <lg>
- groups verse lines (marked by <l>), most often into stanzas.
Use the type attribute to identify the reason for the grouping.
- <l>
- a line of verse.
- part
- indicates whether the verse line is metrically complete.
- U* metricality is not known or inapplicable
- Y the line is metrically complete
- N the line is metrically incomplete
Note that the <lg> element may be recursively nested, in order
to provide for sub-groupings of lines. In this case, the n attribute
should be used to indicate the nesting level (e.g., n=1 for the outer level,
n=1.1 for a nested sub-level, etc.; see the section on Reference systems).
Here is an example of (part of) a TUSNELDA-annotated poem, which has been taken from the Gentle Introduction to SGML:
<poem>
<head>The Sick Rose</head>
<lg type=stanza n=1>
<l part=Y>O Rose thou art sick.</l>
<l part=Y>The invisible worm,</l>
<l part=Y>That flies in the night</l>
<l part=Y>In the howling storm:</l>
</lg>
<lg type=stanza n=2>
<l part=N>Has found thy bed</l>
…
</lg>
</poem>
We distinguish between <head> elements, which can appear only at
the start of a text division and are logically associated with it (for example,
chapter titles, newspaper headlines etc.) and <caption> elements,
which are logically independent of the position they may have within a textual
division (e.g., captions attached to pictures or figures, "pull-quotes"
embedded within the text, "by-lines" identifying authorship and provenance of
a newspaper or periodical article).
The type attribute may be used to indicate the function of the caption:
- type
- categorizes the caption.
- BYLINE caption containing authorship of an article
- DISPLAY extra-textual caption (displayed box,
etc.)
- ATTACHED caption describing a figure,
photograph, etc.
- UNSPEC* not specified or unknown
A caption can be placed at a point other than where it appears, so as not to
interrupt the normal flow of a text, by using it with the <ptr>
tag.
See the section on Pointing and reference.
Annotations and bibliographic citations or references are marked using the
following elements:
- <note>
- any form of note, usually a footnote. This tag marks only notes that
are a part of the original text, not notes that may be added by the encoder,
etc. Possible attributes are:
- place
- for a written text, specifies the location of an original
note in the source text.
- FOOT note at foot of page.
- END note at end of current division or
text.
- SIDE note in left or right margin.
- UNSPEC* placement unknown or
unspecified.
- <bibl>
- a loosely-structured bibliographic citation appearing within a corpus
text.
Original notes may contain paragraphs, s-units, dialogue, and any
other phrase-level element. The global n attribute can be used to
indicate the value of a numbered note.
Like captions, notes are often moved from their original location in the
original data and placed at another point so as not to
interrupt the normal flow of a text, by using the <ptr>
tag as follows
(see the section on Pointing and reference):
Here is a text, with a "1" at the end for a
footnote. [1].
<<Then, this note appears at
this point in the original.>>
But we would like to keep the text together.
This can be encoded as
<p>Here is a text.
<ptr target=N1 n=1 rend=bracketed>
But we would like to keep the text together. </p>
<note id=N1 place=foot>Then, this note appears at
this point in the original.</note>
Bibliographic citations or references within running texts are marked using the
<bibl> element, which can contain any phrase-level element plus
the <author> element.
The <sp> element is used to mark parts of a written text which are
intended to be spoken (for example the speeches in a dramatic text), or which
comprise the transcription of a speech, interview, debate, etc., typically
intended for publication (i.e., which have been transcribed to be read as
text). Such parts are generally readily identifiable by the use of conventions
such as speaker prefixes (the label supplying the name of the speaker) and
stage directions.
Within the TUSNELDA standard, this element has been substantially revised, for two main reasons:
- 1. CES provides a sub-element <stage> for the annotation of stage directions, which may occur on an arbitrary level of embedding within <sp> or one of its sub-elements. This kind of occurrence restriction (a so-called inclusion exception) is admissible in SGML, but not in XML. Therefore, it could neither be reproduced in XCES (CES in XML), nor in the TUSNELDA standard.
- 2. The transcription of comics, as undertaken in one of the SFB441 projects, requires additional elements for the description of gesture, for the inclusion of "metatextual" remarks added by the author (e.g. on the edge of single pictures), and for situation descriptions added by the annotator.
For a more detailed description of the differences between the CES standard and the TUSNELDA standard cf. Wagner, A. & L. Kallmeyer (2001).
In the TUSNELDA standard, the <sp> element takes the following attributes (in addition to the global ones):
- who
- name of the speaker
- what
- name of the object displaying a non-spoken piece of text
The <sp> element may contain these sub-elements:
- <speaker>
- contains the information provided in the original source to identify the
speaker of a passage written to be spoken
- <display>
- contains the information provided in the original source to identify the container
of a displayed passage (e.g. a sign)
- <stage>
- contains any kind of stage direction within a dramatic text, with the attribute
- type
- indicating the kind of stage direction
- <spokenPar>
- is structured like a written paragraph (i.e. the element <p>), but may additionally contain <stage> elements for the annotation of stage directions
- <displayedPar>
- contains the text displayed by a non-speaking container, e.g. an inscription on a sign. Its internal structure is identical to that of a <p> element.
- <situation>
- describes relevant parameters of the situation, as supplied by the annotator, either as informal text or in one or more <keywords> elements.
All of the above may be further marked by the global attributes appropriate for textual elements, i.e. id, n, lang, rend, and wsd (see section 3.2).
Syntactically, an <sp> element contains an arbitrary sequence of <speaker> and/or <display> elements, followed by one or more <spokenPar>, <displayedPar>, <stage>, or <situation> elements in any order. Thus, stage directions (inside the <stage> element) may be annotated at the <sp> or the <spokenPar> level, as demonstrated in the following example:
<sp who="Lady Windermere">
<speaker>Lady Windermere.</speaker>
<spokenPar>That will do !</spokenPar>
</sp>
<sp><stage>Exit Parker C.</stage></sp>
<sp who="Lady Windermere">
<spokenPar><stage>Speaking to Lord Windermere</stage>
Arthur , if that woman comes here - I warn you -
</spokenPar>
</sp>
The stage direction pertaining to the scene as a whole is annotated as a separate <sp> element, whereas the directions for Lady Windermere are part of the <spokenPar> associated with her as a speaker.
Consider also the usage of the <spokenPar> and <situation> elements in this transcription of a comic picture (on the tags <figure>, <figTrans>, and <marked> see below):
<figure id="s35b5" entity="belgiji/s35b5.bmp">
<figTrans>
<sp who="Obeliks">
<spokenPar>
Gde da nađem belu zastavu ?
<marked type="deic-loc">Ovde</marked> je sve pusto !
</spokenPar>
<situation>
<keywords>
<term>open hands</term>
<term>slightly bent</term>
</keywords>
</situation>
</sp>
<sp who="Asteriks">
<spokenPar>
<marked type="deic-loc">Tamo</marked> je neki mališan !
</spokenPar>
<situation>
<keywords>
<term>forefinger</term>
<term>stretched out</term>
</keywords>
</situation>
</sp>
</figTrans>
</figure>
Finally, here is an example of the use of <displayedPar>:
<figure id="s15b8">
<figTrans>
<sp who="Metaloplastiks">
<spokenPar>
Spavaj , idiote !
</spokenPar>
</sp>
<sp what="scoreboard">
<displayedPar>
Metalopastiks
</displayedPar>
</sp>
</figTrans>
</figure>
A quotation is a (usually long) extract from a work other than the text
itself, embedded within that text. It is set off typographically from the paragraphs
that surround it, by spacing similar to that for paragraphs
(e.g., white space before and after). It
may contain paragraphs, poems, s-units, dialogue (marked with <q>) or any
other phrase-level element. In TUSNELDA, the content model of the <quote> element has been extended so that quotations may also include <table> elements.
The use of the <quote> tag is sharply distinguished
from that of the <q> tag, which is used to mark quoted material
that appears inside a paragraph.
Quotations are often interrupted by pieces of the main text. Nevertheless, the parts form a single, more abstract entity rather than two separate ones. This ought to be expressed in the annotation, so that the parts can be treated as a whole if necessary. For this purpose, <quote> elements may carry the following attributes:
- next
- contains the ID of the next part of the same quotation
- prev
- contains the ID of the previous part of the same quotation
- broken
- YES broken quotation
- NO* non-broken quotation
Additionally, the usual text-level attributes id, n, lang, rend, wsd, and the attribute type are defined.
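The following sketch shows how the two parts of an interrupted quotation might be linked. The ids Q1 and Q2, the enclosing <div>, and the wording are invented for illustration; the attribute values follow the lists above.

```xml
<div>
  <quote id="Q1" broken="YES" next="Q2">
    <p>First part of the quoted passage,</p>
  </quote>
  <p>the author interrupts,</p>
  <quote id="Q2" broken="YES" prev="Q1">
    <p>and here the same quotation continues.</p>
  </quote>
</div>
```

A tool that needs the quotation as a whole can start at the part without a prev attribute and follow the next references.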
Figures are marked with the following tag, which enables a reference to a stored image in another file:
- <figure>
- indicates the location of a graphic, illustration, or figure.
- entity
- names the external entity within which the graphic image of
the figure is stored.
The
<figure> element contains an optional <head> element
for the figure title or heading, followed by an optional sequence of paragraphs
for commentary or caption, an optional <figTrans> element, an optional <figDesc> element, and an optional <text> element for including the graphic itself, where desired. The <figure> element can be empty, serving
only to mark the presence of a figure in the text.
- <figTrans>
- contains a transcript of the text contained in a picture, e.g. in comics, or displayed in parts of a graphic. This element has been added in TUSNELDA for the annotation of comics, cf. the example above.
- <figDesc>
- contains a brief prose description of the appearance or content of a graphic figure, for use when documenting an image without displaying
it.
Note that in many instances, figures will not be retained at all in the encoded version of the text. In this case, the <gap> element should be used to indicate the
omission.
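A minimal sketch of both cases follows. The id, the entity name fig1img (which would have to be declared as an external entity), and all text content are hypothetical. The attributes used on <gap> are those described in the section on editorial changes, and their values here are free-text assumptions.

```xml
<!-- Figure retained: heading, caption paragraph and prose description -->
<figure id="fig1" entity="fig1img">
  <head>Map of the old town</head>
  <p>The caption printed below the map.</p>
  <figDesc>A hand-drawn street map with the market square at the centre.</figDesc>
</figure>

<!-- Figure not retained: only the omission is recorded -->
<gap desc="figure: map of the old town" reason="figure not retained"/>
```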
The <table> element is used to include tables in the text. It
takes the attributes:
-
- rows
- indicates the number of rows in the table.
- cols
- indicates the number of columns in the table.
A <table> element optionally contains a <head> element, and at least one <row> element. A <row> consists of an arbitrary, non-empty sequence of <cell> elements and/or further tables. A <cell> may host any kind of phrase-level elements. The following attributes are defined for the element <cell>:
-
- rows
- indicates the number of rows occupied by the cell.
- cols
- indicates the number of columns occupied by the cell.
To mark row or column headers explicitly, the attribute role is used on both <row> and <cell> elements.
Note that in many instances, tables will not be retained at all in the encoded version of the text. In this case, the <gap> element should be used to indicate the
omission.
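As an illustration, a small table with a header row might be encoded as sketched below. The content is invented, and the value "label" for role is an assumption borrowed from TEI practice; this document does not list the admissible values.

```xml
<table rows="2" cols="2">
  <head>Sample lexicon entries</head>
  <row role="label">
    <cell>Word form</cell>
    <cell>Part of speech</cell>
  </row>
  <row>
    <cell><mentioned>Haus</mentioned></cell>
    <cell>noun</cell>
  </row>
</table>
```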
In recording transcriptions (among others), it is necessary to align parallel and overlapping
utterances / context descriptions. CES does not offer a mechanism for
alignment within one document. TEI offers several alignment mechanisms.
In the TUSNELDA standard, we adopted the approach of placing reference points (<anchor> elements) within utterances and connecting corresponding reference points by <link> elements. (For an example, see the section on pointing and reference.) Several of these links can be bundled into a <linkGrp> element, which is a paragraph-level element. This is useful, for example, for keeping the links of one paragraph or dialogue together.
The tusneldaDoc DTD also includes tags for marking sub-paragraph-level elements. The phrase-level elements that are provided for in the tusneldaDoc DTD are
selected on
the basis of their relevance for corpus-based work. There are five main
categories of phrase-level elements:
- elements for identifying s-units (typically orthographic sentences) and
quoted dialogue;
- elements indicating editorial changes to the original text;
- the <hi> element for marking typographically distinct words or
phrases, especially when the purpose of the highlighting is not yet
determined;
- elements of linguistic interest;
- elements for pointing and reference.
The tusneldaDoc DTD imposes a relatively strict structure on sub-paragraph elements,
intended to disallow options and suit the
needs of corpus-handling tools.
The segmentation of texts into s-units, or orthographic sentences, is usually accomplished by special tools. In some cases it is still desirable to mark s-units and/or quoted dialogue in the primary data. We
therefore provide mechanisms for marking these elements.
In some cases quoted dialogue is not marked in the primary data, because the identification of quoted dialogue can be accomplished automatically (by detecting quotation marks etc.).
-
- <s>
- identifies an s-unit within a document, typically an orthographic
sentence.
- next
- gives the id reference of a subsequent <s> element which contains a continuation of the current sentence.
- prev
- gives the id reference of a previous <s> element which contains the beginning fragment of the current sentence.
- type
- indicates the type of sentence.
- broken
- indicates whether this <s> element is broken between two or more <s> elements (linked using the next and prev attributes). The default is NO*.
Sometimes sentences occur as parts of other sentences (e.g. when quoted), which may be of interest when processing the corpus. For this purpose, we introduced a new attribute into the TUSNELDA standard, namely
- nested
- NESTED
- NOTNESTED*
Its usage is shown in the example at the end of this sub-section.
The element <q> contains quoted dialogue or other quoted material appearing inside a paragraph. As with <s>, the following attributes are defined:
- next
- gives the id reference of a subsequent <q> element which contains a continuation of the current quote.
- prev
- gives the id reference of a previous <q> element which contains the beginning fragment of the current quote.
- type
- indicates the type of quote.
- broken
- indicates whether this <q> element is broken between two or more <q> elements (linked using the next and prev attributes).
Furthermore,
- who
- indicates the speaker of the quote.
- direct
- shows whether a quote is given in direct or indirect speech:
- YES
- NO
- UNSPECIFIED*
- nested
- indicates whether a <q> element has a further <q> subelement.
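The who and direct attributes can be illustrated with a short sketch. The speaker name and the sentences are invented; the attribute values are taken from the list above.

```xml
<p>
  <s><q who="Anna" direct="YES">I will come tomorrow,</q> Anna said.</s>
  <s>Anna said <q who="Anna" direct="NO">that she would come the next day</q>.</s>
</p>
```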
When s-units are tagged, no
split should be made after a colon or semicolon, even if it is followed by a word beginning
with a capital letter (unless there is an end-of-paragraph marker).
When both <s> and <q> are marked, the problem of overlapping hierarchies can arise.
For this reason it has been necessary to allow for mutually recursive nesting of
<s> and <q> tags, a practice
which is otherwise avoided. This allows all the following encodings:
<s>
<q>Indeed yes,</q>
she replied.
</s>
<q rend="PRE lsquo POST rsquo">
<s>I know precisely what you are feeling.</s>
<s>I know all about your contempt, your hatred, your disgust.</s>
<s>But don't worry, I am on your side!</s>
</q>
<s>And then the flash of intelligence was gone...</s>
However, it is recommended for the TUSNELDA standard that the <p> - <s> - <q>
hierarchy be retained if possible -- that is, the hierarchy of
<s> elements is treated as primary, and the hierarchy of
<q> elements is treated as secondary. In a case such as the
one above, this can be accomplished by breaking the quotes and using the
next and prev attributes together
with the global id attribute to associate the fragments, as follows:
<s>
<q id=q1 type=part next=q2>
I know precisely what you are feeling.
</q>
</s>
<s>
<q id=q2 type=part prev=q1 next=q3>
I know all about your contempt, your hatred, your disgust.
</q>
</s>
<s>
<q id=q3 type=part prev=q2>
But don't worry, I am on your side!
</q>
</s>
<s>
And then the flash of intelligence was gone...
</s>
In the following case, this method solves the problem of overlapping
hierarchies:
<p>
<s>According to the visiting leader, the economy of the country is
<q id=q1 type=part next=q2>
better than ever.
</q>
</s>
<q id=q2 type=part prev=q1>
"
<s>
It is in fact in very good shape.
</s>
"
</q>
</p>
Finally, consider a simple example that shows overlapping hierarchies in combination with the nested attribute for a broken-up quote:
<s nested="nested">
<q id=q1 broken="yes" next=q2>
<s id=s1 broken="yes" next=s2 nested="notnested">Draußen</s>
</q>, sprach er,
<q id=q2 broken="yes" prev=q1>
<s id=s2 broken="yes" prev=s1 nested="notnested">ist es kalt.</s>
</q>
</s>
If any editorial changes are made to a corpus text, it must be guaranteed that they are recognizable and that the original form may be easily recovered. The following tags are used to mark editorial changes:
-
- <corr>
- contains the correct form of a passage apparently erroneous in the copy
text.
- sic
- gives the original form. This attribute is obligatory in the TUSNELDA standard.
- resp
- gives the name of the responsible editor
- cert
- used to indicate the degree of certainty with which the change has been
made. In the TUSNELDA standard, three possible values for this attribute have been adopted, in order to ensure comparability:
- SURE
- PROBABLE
- PRESUMABLE
- <reg>
- contains text which has been regularized or normalized in some sense.
- orig
- gives the original form. This attribute is obligatory in the TUSNELDA standard.
- resp
- gives the name of the responsible editor
- cert
- used to indicate the degree of certainty with which the change has been
made. (Possible values: see above)
- <gap>
- indicates a point where material has been omitted in a transcription,
whether for editorial sampling practice, or because the material is
illegible.
- desc
- describes the omitted text
- reason
- gives the reason for the omission (sampling, illegible, etc.)
- resp
- gives the name of the responsible editor
- cert
- used to indicate the degree of certainty with which the change has been made.
- <unclear>
- This element has been taken over from TEI in order to mark unclear pieces of text, e.g. damaged parts of old documents or stretches of speech recordings which are hard to understand and cannot be clearly transcribed. Its attributes are
- reason
- indicates the source of unclarity
- resp
- names the responsible editor
- cert
- marks the degree of certainty with which the change has been made.
Note that the <gap> element is useful for noting the omission
of material which is often uninteresting for corpus-based language
engineering applications, in particular, figures, tables, etc.
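The editorial elements might be combined as in the following sketch. The sentence, the editor initials "AW", and the free-text values of reason and desc are invented for illustration. The use of cert on <unclear> assumes that it carries the same certainty attribute as the other editorial elements.

```xml
<p>
  We <corr sic="recieved" resp="AW" cert="SURE">received</corr> the
  <reg orig="colour" resp="AW" cert="PROBABLE">color</reg> samples
  <unclear reason="ink stain" resp="AW" cert="PRESUMABLE">yesterday</unclear>.
  <gap desc="table of prices" reason="sampling" resp="AW"/>
</p>
```

Because sic and orig are obligatory, the original reading can be restored by replacing the content of each <corr> or <reg> element with the value of that attribute.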
In general it is not desirable to mark typographic features of a given
printing of a text in texts designated for use in corpus-based research.
However, there are circumstances under which it is desirable to retain this
information. In particular, certain items of linguistic interest may be
marked by typography in the original; e.g., linguistic emphasis and foreign
words are often rendered in italics. In addition, some applications (e.g.,
machine translation which attempts to reproduce the format of the original)
demand retaining the rendition information.
In the process of up-translation from legacy data, a first step is often to
translate relevant typographic information into SGML, with no attempt to
interpret the significance of the rendering (e.g., that the italics signify
a foreign word). Interpretation is often too costly because it is ambiguous
(e.g., italics signify not only foreign words, but also emphasis, titles,
etc.). In such cases the
<hi>
element can be used. Normally, in later phases of up-translation,
<hi> tags are changed to more descriptive tags, such as
<title>,
<foreign>,
<mentioned>, or
<distinct>.
-
- <hi>
- marks a word or phrase as graphically distinct from the surrounding
text, for reasons concerning which no claim is made. The rend attribute
should provide the original rendition information when its function has not yet
been determined.
- rend
- describes the rendition or presentation of the
highlighted item.
- BO bold face
- BX boxed
- IT italic font
- RO roman font
- UL underlined
- CA capital letters
Note: Several values from the list may be specified where appropriate,
separated by spaces, e.g., "ro it".
When the <hi> tag is used, no claim about the reason is made.
This may be the case in a low-level encoding, since determining the reasons
for highlighting (e.g., presence of a foreign word, vs. emphasis, vs. a
title, etc.) demands human intervention and is therefore too costly in the
early stages of up-translation. Note that typographically highlighted
phrases and the kind of highlighting used may be recorded in one of two
ways:
- using the global rend attribute
- using the <hi> element with a rend attribute
The first method specifies an attribute on some element which contains
all of and only the highlighted phrase. In this case, the function
of the highlighting is clear (for example, to mark a heading), and the
boundaries of the highlighted phrase therefore coincide with the boundaries
of some other element. The rend attribute is given on the tag for
that element, for example
<head rend=bo>The world beyond</head>
The second method inserts a new tag indicating that what it contains is
highlighted. It is used
- when the function of the highlighting is not clear;
- where there is no tag identifying the feature concerned;
- where the highlighted phrase is not co-terminous with some other
element.
The rend attribute must be supplied on the <hi>
element. The rend attribute is optional on all other elements.
Note that in cases where the <hi> element often appears with
the same value for rend, a default value can be provided on the <tagUsage> element. When this
mechanism is used, the rend attribute need be given only when the
default does not apply to the given occurrence of the <hi>
element.
Both the start and end tag for any SGML element must be contained within
the start and end tag of any of its ancestors in the tree for that
document. Since by definition <hi> elements can appear only
within <p> elements, this means that where, for example, an
italicized passage contains more than one paragraph or starts within a
paragraph and spans one or more others, the <hi> element must
be closed at the end of the enclosing element, and then re-opened within
the next. For example, an italicized passage which crosses a
<p> boundary must be tagged as follows:
-
<p>This is the start of a paragraph which
<hi rend=it>switches to italics here
and then goes on for several paragraphs.</hi>
</p>
<p>
<hi rend=it>This second paragraph is all in italics</hi>
</p>
<p>
<hi rend=it>This is the last bit of italics</hi>
and the rest is in roman.</p>
That is, the <hi> element is closed before the end of the
first paragraph and re-opened at the start of the next. Note that the
following encoding is not acceptable:
-
<p>This is the start of a paragraph which
<hi rend=it>switches to italics here and
then goes on for several paragraphs.</hi>
</p>
<p rend=it>This second paragraph is all in italics</p>
<p><hi rend=it>This is the last bit of italics</hi>
and the rest is in roman.</p>
This second encoding mixes different styles of marking the same feature for
a given span of text, which will cause problems for retrieval.
In the CES standard, one <hi> element may not be included in another, i.e. <hi> may not be used recursively. The mechanism by which CES excludes this is not XML-conformant, and the restriction is also undesirable in practice: for example, part of a bold headline may be set in italics, as in
<hi rend="bo">Eine <hi rend="it">teilweise</hi> kursive Überschrift</hi>
For the same reasons, recursion is allowed in the TUSNELDA standard also for the elements <foreign>, <distinct>, <mentioned>, and <title> (see next section).
There have been three main defining forces behind the choice of elements:
- the needs of corpus-annotation tools, such as morpho-syntactic taggers,
whose performance can often be improved by pre-identification of elements such
as names, addresses, titles, dates, measures, foreign words and phrases, etc.
- the need to identify objects which have intrinsic linguistic interest, or
are often useful for the purposes of translation, text alignment, etc., such as
abbreviations, names, terms, linguistically distinct words and phrases,
etc.
- the needs of the projects of SFB 441, which perform linguistic research based on specifically annotated corpora.
The phrase-level elements identifying linguistically relevant elements are:
- <abbr>
- contains an abbreviation of any sort. Consult Handling
Punctuation for guidelines for encoding abbreviations.
- expan
- contains the expansion of the abbreviation
- <date>
- contains a date in any format.
- ISO8601
- ISO 8601 normalized form of the date
- <list>
- a collection of distinct items flagged as such by special layout in
written texts, often functioning as a single syntactic unit.
Note that <list> is the only phrase-level element which is also a paragraph-level element; its content model is exactly the same in both instances. For its full definition see section 3.3.4.2.
- <measure>
- contains a number, word, or phrase indicating a quantity.
- type
- indicates the type of measure. CES restricts this attribute to a closed list of values. In the TUSNELDA standard, the CES values are recommended but not enforced: any textual string is admissible. The CES values for type are:
- WEIGHT
- LENGTH
- COUNT
- AREA
- VOLUME
- CURRENCY
- TEMPERATURE
- value
- contains the ISO 4217 code for currency representation when
the type attribute specifies currency.
- <name>
- contains a proper noun or noun phrase.
- type
- indicates the type of proper noun. Suggested values include:
- PERSON
- PLACE
- ORG
- LANGUAGE
See Encoding Names.
- <num>
- contains a number, written in any form.
- value
- contains the normalized value of the number.
- <term>
- contains a single-word, multi-word or symbolic designation which is
regarded as a technical term.
- <time>
- contains a phrase defining a time of day in any format.
- ISO8601
- ISO 8601 normalized form of the time.
- type
- the type attribute takes one of the following values:
- AM
- PM
- 24HOUR
- DESCRIPTIVE
- <distinct>
- identifies a word or phrase regarded as linguistically distinct (e.g., archaic, technical, dialect, etc.). At the moment, there are no restrictions on this characterization. At a later stage, the type attribute may be used with a restricted set of linguistically useful values, depending on the demands of the corpus encoders.
- <foreign>
- identifies a word or phrase as belonging to some language other than
that of the surrounding text. Use the global lang attribute to
indicate the language.
- <mentioned>
- marks words or phrases mentioned, not used. This element should be used for the markup of e.g. linguistic examples (either in the main text or set off by paragraph breaks). However, corpora which consist only of a collection of linguistic examples should not be annotated using the <mentioned> element.
- <title>
- contains the title of a work, whether article, book, journal, or
series, including any alternative titles or subtitles.
- <marked>
- contains any piece of text which is being marked for a specific, restricted research purpose. If an elaborate annotation (as e.g. POS-tagging) is available, this marking may be redundant; however, even elaborate tagging may not be sufficient for pragmatic aspects like e.g. courteous forms. The <marked> element has the following attributes:
- type
- indicates textually the category of the specially marked element. This attribute is required.
- next
- provides the ID of the next part of a marked string which is split up by other text
- prev
- provides the ID of the previous part of a marked string which is split up by other text
- broken
- indicates whether a marked string is split up by other text
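A short sketch combining several of these elements is given below. The sentence and all attribute values are invented for illustration. "DEM" is the ISO 4217 code for the Deutsche Mark, and the ISO8601 values assume the plain calendar-date and hour-minute forms.

```xml
<p>
  The <abbr expan="Sonderforschungsbereich">SFB</abbr> meeting on
  <date ISO8601="2000-09-15">15 September 2000</date> began at
  <time ISO8601="14:30" type="PM">2.30 pm</time>.
  <num value="3">Three</num> speakers came from
  <name type="PLACE">Tübingen</name>, and the fee was
  <measure type="CURRENCY" value="DEM">40 marks</measure>.
</p>
```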
The linguistic elements fall into two groups, which determine their content
models:
- elements which are, for many purposes of language engineering such as
morpho-syntactic tagging, regarded as individual tokens, even when they may
contain sub-constituents. In TUSNELDA, this group includes names, dates,
times, measures, abbreviations, and terms. These elements therefore may
contain PCDATA. They may also contain the <abbr> and
<num> elements; abbreviations and numbers are frequently
identified and tagged automatically, and therefore their placement must be
relatively free. Note that to avoid unnecessary recursive nesting of
elements, <abbr> cannot contain another
<abbr> element, and <num> cannot contain another
<num>.
This group of elements, which makes up the element class M.TOKEN, includes:
- <abbr>
- <num>
- <name>
- <date>
- <measure>
- <time>
- <term>
- elements which may contain sub-constituents which are treated by
corpus-analytic tools as tokens, or may be regarded as tokens in
themselves. Each of these elements can contain any other phrase-level
element. It is assumed that tokenizing tools may further analyze the
content of these elements in order to identify constituent tokens where
they exist.
This group of elements includes the following elements:
- <title>
- <foreign>
- <mentioned>
- <distinct>
- <marked>
This latter group also includes another tag, the <hi> tag,
which is used to mark information which is rendered specially in some
original, but for which the function of the highlighting is either unknown
or unspecified. In later phases of up-translation when the function of the
highlighting is determined, <hi> tags are very often changed
to one of the other more descriptive tags in this group. See section 3.4.3 for a full discussion of the use of
the <hi> tag.
References in the text which refer to another part of it can be tagged
with
-
- <ref>
- a reference to another location in the current document, in terms of
one or more identifiable elements, possibly modified by additional text or
comment.
Attributes include the global attributes plus the following:
- corresp
- points to elements that correspond to the current element in some way.
- next
- gives the id reference of an element which contains a continuation of the current element.
- prev
- gives the id reference of an element which contains the previous portion of the current element.
- type
- indicates the type of pointer, e.g., aggregating, aligning, etc.
- resp
- specifies the creator of the pointer.
- crdate
- specifies when the pointer was created.
- targType
- indicates the type of data being linked, e.g., paragraph, sentence, etc.
- targOrder
- specifies whether the order of the identifiers in the targets list is significant. Values:
- Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.
- N No: the order of the IDREFs specified as the value of the targets attribute has no significance.
- U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.
- evaluate
- specifies the intended meaning when the target or targets are pointers themselves. Values:
- ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.
- ONE if the element pointed to is itself a pointer, then its target (whether a pointer or not) is taken as the target of this pointer.
- NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.
- target
- provides the IDs of two or more <xptr> elements that point to the locations of the elements to be associated.
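A minimal sketch of a <ref> element is shown below. It assumes, as in the <ptr> examples of this document, that target holds the id of the element referred to; the id ORW1.1.1 and the sentence are hypothetical.

```xml
<p>The events of that morning are told in
<ref target="ORW1.1.1">the first chapter</ref>.</p>
```

Unlike <ptr>, which is empty, <ref> encloses the text of the reference itself.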
In some cases it is desirable to move an element to another
location in the encoded text. This is common for footnotes which occur in-line
in the electronic text, but which appear as footnotes, endnotes, etc. in a
printed version. It is also common for captions, figures, bibliographic
citations, and stage directions.
-
- <ptr>
- a pointer to another location in the current document in terms of one
or more identifiable elements.
Attributes include the global attributes plus the following:
- corresp
- points to elements that correspond to the current element in some way.
- next
- gives the id reference of an element which contains a continuation of the current element.
- prev
- gives the id reference of an element which contains the previous portion of the current element.
- type
- indicates the type of pointer, e.g., aggregating, aligning, etc.
- resp
- specifies the creator of the pointer.
- crdate
- specifies when the pointer was created.
- targType
- indicates the type of data being linked, e.g., paragraph, sentence, etc.
- targOrder
- specifies whether the order of the identifiers in the targets list is significant. Values:
- Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.
- N No: the order of the IDREFs specified as the value of the targets attribute has no significance.
- U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.
- evaluate
- specifies the intended meaning when the target or targets are pointers themselves. Values:
- ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.
- ONE if the element pointed to is itself a pointer, then its target (whether a pointer or not) is taken as the target of this pointer.
- NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.
- target
- provides the IDs of two or more <xptr> elements that point to the locations of the elements to be associated.
Examples:
-
Here is a text.
This caption appears at this point.
But we would like to keep the text together.
This can be encoded as
-
<p>Here is a text.
<ptr target=C1>
But we would like to keep the text together.</p>
<caption id=C1>This caption appears at this point.</caption>
The note in the following example originally appeared at the location of
the <ptr> tag:
-
<p>The <name type=org>Ministry of Truth</name>, —
<name type=org lang=ns>Minitrue</name>, in
<name>Newspeak</name><ptr target=N1 rend=asterisk>
— was startlingly different from any other object in
sight...</p>
<note place=foot id=N1><name>Newspeak</name> was the
official language of <name type=place>Oceania</name>. For an
account of its structure and etymology see Appendix.</note>
Another mechanism, which has been adapted for TUSNELDA from TEI and CESAlign, is the use of anchors and links, much in the same way as in HTML documents. An <anchor> is a phrase-level element which designates a reference point; a <link> element associates corresponding anchors with one another. <link> elements may be bundled together using the <linkGrp> element. The use of these concepts is illustrated in the following example:
<sp who="S1">
<spokenPar>Cinco , cinco veces<anchor id="S1.12a"/>
tranquilo<anchor id="S1.12e"/> .
</spokenPar>
</sp>
<sp who="S3">
<spokenPar>
<anchor id="S3.37a"/>Ahí esa<anchor id="S3.37e"/>
dis puesto el cuarto árbitro para enseñar
la cartulina para que todo el sepa .
</spokenPar>
</sp>
<linkGrp>
<link targets="S1.12a S3.37a"/>
<link targets="S1.12e S3.37e"/>
</linkGrp>
For purposes of alignment or other reference to elements within a text, a
reference system can be built up using the id attribute on appropriate
elements.
We recommend the following strategy:
- supply a unique identifying label in the id attribute of the
<body> tag
- for each nested division, give each unit an identifier which is built up
by successively adding to the identifier of the text; for example
<body id=ORW1>
<div type=part id=ORW1.1>
<div type=chapter id=ORW1.1.1>
<div type=section id=ORW1.1.1.1>
</div>
</div>
</div>
</body>
- for each paragraph, add another layer to the immediately superordinate
identifier, as follows:
<div type=section id=ORW1.1.1.1>
<p id=ORW1.1.1.1.p1></p>
<p id=ORW1.1.1.1.p2></p>
</div>
- for each s-unit, add another layer to the superordinate identifier on the
enclosing <p> element:
<div type=section id=ORW1.1.1.1>
<p id=ORW1.1.1.1.p1>
<s id=ORW1.1.1.1.p1.s1></s>
</p>
</div>
When a string of characters is tagged as a name, many corpus-handling tools treat the string as a single token (e.g. some morpho-syntactic
taggers) and do not perform additional analysis.
For English, we can state the following rules:
- Titles such as "Mr." and role names such as "Secretary" are not considered
part of a person name:
Mme. <name>Edith Cresson</name>
(or : <abbr>Mme.</abbr> <name>Edith
Cresson</name>)
President <name>Boris Yeltsin</name>
- Appositives such as "Jr." are considered part of a person name:
<name>Sammy Davis, Jr.</name>
Where these rules can be applied to other languages, they should be followed. Other languages, however, may treat titles as part of the proper name, in which case the title should be included within the <name> tag.
In English the possessive is formed by the addition of "'s" which is
tokenized separately, and should not be encoded as a part of the name:
<name>Winston</name>'s
Whereas this CES approach makes sense for English, it already causes difficulties for a language like German, where the possessive/genitive suffix and the name are mostly joined into one single word form. Depending on the language, one should therefore try to include the minimal word form expressing a name within the <name> tag. Cf.
- <name>Tübingens</name> malerische Altstadt
- <name>Hans</name>' neue Theorie
Adjectives derived from names should also be annotated using the <name> element (as opposed to the English rule mentioned in the CES guidelines), cf.
- die malerische <name>Tübinger</name> Altstadt
Punctuation is normally considered to be a separate token, and should be
encoded outside the <name> tag. See the discussion in the next
section.
Examples:
Jaguar is made in <name type=place>Britain</name>.
<name type=place>France</name>-based
<name type=place>U.S.</name>-<name
type=place>Japan</name> trade negotiations
- Laws, diseases, prizes, etc. named after people or saints, etc. should not
be tagged with <name type=person>.
- Street addresses, street names, adjectival forms of place names should not
be tagged as <name type=place>.
Punctuation should be left as in the original text, except in the cases
noted below.
Note that punctuation and special characters are treated by many corpus-handling
tools as separate tokens. For example, a text such as
<q>Ignorance is strength.</q>
may be tokenized as
TOKEN Ignorance
TOKEN is
TOKEN strength
TOKEN .
Full stops and ellipses
The full stop should be kept as both a part of an abbreviation and as an
end-of-sentence indicator. The disambiguation of the two uses is
accomplished by the marking of abbreviations and/or s-units, when such
markup is provided.
Ellipses should be regularized so that the three periods are contiguous,
with no spaces in between.
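The regularization of ellipses can be sketched in Python as follows. This is a minimal illustration only: the function name regularize_ellipses is hypothetical, and the pattern assumes that an ellipsis is written as exactly three periods separated by optional spaces or tabs on one line. A spaced ellipsis directly after an abbreviation (e.g. "U.S. . .") would need separate handling.

```python
import re

# Three periods, optionally separated by spaces or tabs.
SPACED_ELLIPSIS_RE = re.compile(r"\.[ \t]*\.[ \t]*\.")


def regularize_ellipses(text):
    """Make the three periods of an ellipsis contiguous."""
    return SPACED_ELLIPSIS_RE.sub("...", text)


print(regularize_ellipses("He paused . . . and left"))
```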
Full stops appearing as a part of abbreviations should not be separated from
the rest of the abbreviation string when the abbreviation is marked with
the <abbr> tag, even though the full stop may serve a double
function (i.e., also signal end-of-sentence).
Example:
I'm back in the U.S.
should be tagged as
I'm back in the <abbr>U.S.</abbr>
even though the period is both part of the abbreviation and a signal of
end-of-sentence. On the other hand, where punctuation after a name is a clear indication of end-of-sentence, it should not be included in the <name> element (see also below on sentence punctuation):
Er besucht das malerische <name>Tübingen</name>.
Hyphens and dashes
Line-end (soft) hyphens should be removed where they are not part of the
regular spelling of the word. In cases of doubt, guidance should be
sought elsewhere in the same text or in dictionaries. If doubt still
remains, a hyphen should be retained rather than removed. In any event, the original spelling has to be included in the value of the orig attribute, as in
Er besucht die malerische <reg orig="Alt-stadt">Altstadt</reg>.
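The treatment of line-end hyphens can be sketched in Python as below. The sketch rests on assumptions that are not part of the standard: the raw text still contains the line break after the hyphen, and a word list (the parameter lexicon, here a plain set of lower-cased word forms) stands in for the dictionary lookup mentioned above. The function name join_soft_hyphens is hypothetical.

```python
import re

# A word fragment, a hyphen at the end of the line, and the continuation.
LINE_END_HYPHEN_RE = re.compile(r"(\w+)-\n\s*(\w+)")


def join_soft_hyphens(text, lexicon):
    """Remove line-end hyphens for words found in the lexicon.

    The original spelling is recorded in the orig attribute of <reg>.
    If the joined form is unknown, the hyphen is retained.
    """
    def repl(match):
        first, second = match.group(1), match.group(2)
        joined = first + second
        hyphenated = f"{first}-{second}"
        if joined.lower() in lexicon:
            return f'<reg orig="{hyphenated}">{joined}</reg>'
        return hyphenated  # in case of doubt, keep the hyphen

    return LINE_END_HYPHEN_RE.sub(repl, text)


raw = "Er besucht die malerische Alt-\nstadt."
print(join_soft_hyphens(raw, {"altstadt"}))
```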
Dashes are marked by an entity reference (&mdash;). No
distinction should be made between different types of dashes.
Apostrophes
Apostrophes should be left as they are in the original text. Note that the
apostrophe can be ambiguous with the single quotation mark (e.g., in
English the possessive "Joneses'"). This may be disambiguated by the
marking of quotations.
Punctuation and tokens identified by the encoder
There is a small class of tags which mark the presence of tokens that have
been isolated and classified by the encoder. Among the elements included in the
tusneldaDoc DTD, the following may be used to identify individual tokens:
<abbr>
<date>
<num>
<measure>
<name>
<term>
<time>
For many tools, when such an element is identified in the input stream, it
is not desirable to further tokenize the string inside the tag; rather, the
string inside the tag can be regarded as a single token (possibly with the type
indicated by the tag name). For example, in some languages it may be
possible for lexical lookup routines and morpho-syntactic taggers to
assume that an element with the tag <name> is a single token with the
grammatical category PROPER NOUN (Np). Thus,
<name type=person>Big Brother</name>
can be tokenized as
TOKEN(name) Big Brother
Similarly, the string
<date>April 4th, 1984</date>
can be tokenized as
TOKEN(date) April 4th, 1984
Therefore, punctuation that is not a part of an identified token should not
appear within the tag (except in abbreviations; see above). For example, the text
The Ministry of Love, which maintained law and order.
should be encoded as
The <name type=org>Ministry of Love</name>, which
maintained law and order.
Other examples:
<name type=org>Jaguar</name> company in <name
type=place>Britain</name>.
...he had been born in <date>1944</date> or
<date>1945</date>; but it...
...the three slogans of the <name
type=org>Party</name>:...
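The tokenization behaviour described in this section can be made concrete with a small Python sketch. It is not the tokenizer of any particular tool: it assumes that the elements listed above do not nest within an element of the same name, it treats every other tag as transparent, and the names tokenize and show are hypothetical.

```python
import re

# Elements whose content is regarded as one token (list from the text above).
SINGLE_TOKEN_TAGS = ("abbr", "date", "num", "measure", "name", "term", "time")

TOKEN_RE = re.compile(
    r"<(?P<tag>" + "|".join(SINGLE_TOKEN_TAGS) + r")\b[^>]*>(?P<body>.*?)</(?P=tag)>"
    r"|(?P<other><[^>]+>)"        # any other tag: skipped
    r"|(?P<word>\w+(?:'\w+)?)"    # word, optionally with an apostrophe
    r"|(?P<punct>[^\w\s])",       # single punctuation mark
    re.DOTALL,
)


def tokenize(text):
    """Return a list of (kind, string) pairs; kind is a tag name or None."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.group("tag"):
            # Collapse line breaks inside the element to single spaces.
            tokens.append((m.group("tag"), " ".join(m.group("body").split())))
        elif m.group("word"):
            tokens.append((None, m.group("word")))
        elif m.group("punct"):
            tokens.append((None, m.group("punct")))
    return tokens


def show(tokens):
    """Format tokens in the TOKEN notation used in this section."""
    return [f"TOKEN({k}) {s}" if k else f"TOKEN {s}" for k, s in tokens]


text = "The <name type=org>Ministry of Love</name>, which maintained law and order."
for line in show(tokenize(text)):
    print(line)
```

With the comma placed outside the <name> element, it is emitted as a token of its own and the name remains a single token.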
When the
<q> or <quote> tag is used, any quotation marks
or other typographical device
for indicating quoted dialogue should be removed from the text. The
rend attribute can be used to indicate the means by which the
quotation was
originally marked in the text (this is not required). In these cases, the
value of the rend
attribute should be one of the following, which are consistent with entity
names in ISOpub and ISOnum:
laquo angle quotation mark, left
raquo angle quotation mark, right
lsquo single quotation mark, left
rsquo single quotation mark, right
ldquo double quotation mark, left
rdquo double quotation mark, right
lsquor rising single quote, left (low)
ldquor rising dbl quote, left (low)
rdquor rising dbl quote, right (high)
rsquor rising single quote, right (high)
mdash dash the width of lowercase m
In principle, encode punctuation as inside or outside the <q>
tag according to the position of the quotation marks in the original, as in
these examples:
- ('dealing on the free market', it was called)
(<q rend="PRE lsquo POST rsquo">dealing on the free
market</q>, it was called)
- The dark-haired girl behind Winston had begun crying out `Swine!
Swine! Swine!'
The dark-haired girl behind <name
type=person>Winston</name> had begun crying out <q rend="PRE lsquo
POST rsquo">Swine! Swine! Swine!</q>
- 'I am with you,' O'Brien seemed to be saying to him.
<q rend="PRE lsquo POST rsquo">I am with you,</q> <name
type=person>O'Brien</name> seemed to be saying to him.
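Since the quotation marks themselves are removed from the text, the rend attribute is the only place from which they can be recovered. The following Python sketch illustrates this. The "PRE ... POST ..." value format is inferred from the examples above, the function names parse_rend and restore_quotes are hypothetical, and the table covers only those entity names from the list whose Unicode equivalents we consider unambiguous.

```python
# Entity names from the list above mapped to Unicode characters.
QUOTE_CHARS = {
    "laquo": "\u00ab",  # angle quotation mark, left
    "raquo": "\u00bb",  # angle quotation mark, right
    "lsquo": "\u2018",  # single quotation mark, left
    "rsquo": "\u2019",  # single quotation mark, right
    "ldquo": "\u201c",  # double quotation mark, left
    "rdquo": "\u201d",  # double quotation mark, right
    "mdash": "\u2014",  # dash the width of lowercase m
}


def parse_rend(rend):
    """Split a value like 'PRE lsquo POST rsquo' into a dictionary."""
    parts = rend.split()
    if len(parts) % 2:
        raise ValueError(f"expected keyword/entity pairs, got {rend!r}")
    return dict(zip(parts[0::2], parts[1::2]))


def restore_quotes(content, rend):
    """Re-insert the original quotation marks around the content of <q>."""
    marks = parse_rend(rend)
    pre = QUOTE_CHARS.get(marks.get("PRE", ""), "")
    post = QUOTE_CHARS.get(marks.get("POST", ""), "")
    return f"{pre}{content}{post}"


print(restore_quotes("Swine! Swine! Swine!", "PRE lsquo POST rsquo"))
```

Values that do not follow the PRE/POST pattern are rejected by parse_rend and would have to be handled separately.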
In cases where the <q> tag is used for text that is not
enclosed in quotation marks in the original, leave punctuation that is not a
part of the actual cited text outside the <q> tags:
- BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
<q rend=ca type=slogan><name type=person>Big
Brother</name> is watching you</q>, the caption beneath it ran.
- Never mind, it doesn't matter, he thought. ["Never mind, it doesn't
matter" in italics]
<q rend=it>Never mind, it doesn't matter</q>, he
thought.
- Eureka! he shouted. ["Eureka!" in italics]
<q rend=it>Eureka!</q> he
shouted.
Note, however, that the tokenization of the text should not be affected by
the position of the punctuation relative to the closing tag; the same set of
tokens is ultimately generated in either case.
Sentence terminating punctuation should always appear within an enclosing
set of <s> and </s> tags:
- <s><q rend=it>Eureka!</q> he
shouted.</s>
- <s>The dark-haired girl behind <name
type=person>Winston</name> had begun crying out <q rend="PRE
lsquo POST rsquo">Swine! Swine!
Swine!</q></s>
Tokenizers typically treat text within tags such as
<hi> and <foreign> independently of punctuation, which can therefore appear
either inside or outside the closing tag without effect. Given this
text:
She ordered a croque monsieur. ["croque monsieur" in italics]
either of the two following encodings is in principle acceptable, although the first one is to be preferred and should be used in TUSNELDA corpora:
She ordered a <foreign rend="it">croque
monsieur</foreign>.
She ordered a <foreign rend="it">croque
monsieur.</foreign>
For fruitful discussions, we would like to thank the participants of
the TUSNELDA standardization workshop, Michael Betsch, Bernhard
Brehmer, Hervé Dejean, Sam Featherston, Gabriela Fulir,
Stefanie Herrmann, Sandra Kübler, Lothar Lemnitzer, Jürgen
Mellinger, Detmar Meurers, Frank Henrik Müller, Reimar
Müller, Slavica Stevanovic and Tylman Ule.
Furthermore, we want to emphasize that the TUSNELDA corpus
annotation standard owes a lot both to the EAGLES corpus encoding
standard (CES) and to the Text Encoding Initiative (TEI). Most parts of
the TUSNELDA standard are adaptations of either CES or TEI to the needs
of the projects in SFB 441.
Finally, we also want to mention that the CES guidelines were very
helpful when writing the TUSNELDA guidelines. Since the structures of
CES and TUSNELDA are very similar, many parts of our guidelines were
taken over in modified form from the CES guidelines.
Laura Kallmeyer, Roland Meyer and Andreas Wagner, 09/11/2001. Reviewed 03/09/2009.