Sustainable Data Formats

TUSNELDA

TUSNELDA (Tübingen collection of reusable, empirical, linguistic data structures; German: Tübinger Sammlung nutzbarer empirischer Datentypen) is the general annotation standard for the corpora that were created and annotated in the Collaborative Research Center 441 (SFB 441). It was developed in project C1. The annotation standard ensures compatibility among the corpora of the SFB 441 and guarantees that the data can be reused with standard corpus-linguistic tools. The TUSNELDA standard is flexible enough to meet the different needs of the projects in the SFB.

TUSNELDA consists of three main components:

The format was extended substantially during the last funding period of the SFB. It is now possible to express secondary relations between nodes in a treebank; this is used, for example, for the coreference annotation in the TüBa-D/Z corpus (project A1). Furthermore, two additional data types were added to the format: collections of sentences and lexicons. Collections of sentences were created in the projects A3, B3, B10, B17, and B18. Various kinds of lexicons resulted from the corpus-related work in the projects A2, A5, B6, and B11.
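The idea of secondary relations can be illustrated with a minimal sketch. It does not reproduce the TUSNELDA schema: the class names `Node` and `Treebank`, the node identifiers, the two-sentence example, and the relation label `coreferential` are all assumptions made for illustration. The sketch only shows the general pattern of keeping the primary tree edges separate from additional links between arbitrary nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of a syntax tree (hypothetical structure, not the TUSNELDA schema)."""
    node_id: str
    label: str                      # category or part-of-speech label
    word: Optional[str] = None      # set only on terminal nodes
    children: List["Node"] = field(default_factory=list)


@dataclass
class Treebank:
    """Trees plus secondary relations that may cross tree and sentence boundaries."""
    trees: List[Node] = field(default_factory=list)
    # Each entry is (relation type, source node id, target node id).
    secondary: List[Tuple[str, str, str]] = field(default_factory=list)

    def relate(self, rel_type: str, source: Node, target: Node) -> None:
        """Record a secondary relation from source to target."""
        self.secondary.append((rel_type, source.node_id, target.node_id))

    def targets(self, rel_type: str, source: Node) -> List[str]:
        """Return the ids of all nodes that source points to via rel_type."""
        return [t for (r, s, t) in self.secondary
                if r == rel_type and s == source.node_id]


def terminals(node: Node) -> List[str]:
    """Collect the words below a node, left to right."""
    if node.word is not None:
        return [node.word]
    words: List[str] = []
    for child in node.children:
        words.extend(terminals(child))
    return words


# Two invented sentences: "Anna schläft." and "Sie träumt."
np_anna = Node("s1_np", "NP", children=[Node("s1_t1", "NE", word="Anna")])
s1 = Node("s1", "S", children=[np_anna, Node("s1_t2", "VVFIN", word="schläft")])
np_sie = Node("s2_np", "NP", children=[Node("s2_t1", "PPER", word="Sie")])
s2 = Node("s2", "S", children=[np_sie, Node("s2_t2", "VVFIN", word="träumt")])

bank = Treebank(trees=[s1, s2])
# The pronoun phrase refers back to a phrase in the previous sentence.
bank.relate("coreferential", np_sie, np_anna)

print(terminals(s1), bank.targets("coreferential", np_sie))
```

Storing the secondary relations as identifier pairs outside the trees keeps each tree a proper tree, while still allowing links that no single tree could express.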

TUSNELDA provides a powerful query language, which is documented here.

GENAU

The Generalized Sustainability Architecture for Linguistic Data (German: GEneralisierte NAchhaltigkeitsarchitektUr für linguistische Daten, GENAU) is a new sustainable data format which was developed in project C2. It integrates three formats: the annotation formats TUSNELDA and EXMARaLDA (Extensible Markup Language for Discourse Annotation; SFB 538 "Multilingualism", University of Hamburg) and the exchange format Paula (SFB 632 "Information Structure", University of Potsdam/Humboldt University Berlin). The two latter formats support standoff annotation of multimedia data (such as audio or video data) as well as the annotation of overlapping units on multiple levels. To represent such data, GENAU employs multi-rooted trees.
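The following sketch illustrates, under stated assumptions, why overlapping units call for more than one hierarchy over the same base data. It is not GENAU's actual representation, which is described in the project C2 publications: the `Span` class, the layer names, the labels, and the token sequence are invented for illustration. Each annotation layer refers to the shared tokens by position (standoff), so units from different layers may overlap without either containing the other.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """A standoff annotation: it points at tokens instead of wrapping them."""
    layer: str
    label: str
    start: int   # index of the first token covered
    end: int     # index one past the last token covered


# Shared base data; all layers refer to these positions.
tokens = ["well", "I", "think", "so", "yes"]

annotations = [
    Span("syntax", "clause", 1, 4),
    Span("prosody", "intonation_unit", 0, 3),
    Span("prosody", "intonation_unit", 3, 5),
]


def covered(span: Span) -> List[str]:
    """Return the tokens a span refers to."""
    return tokens[span.start:span.end]


def overlaps(a: Span, b: Span) -> bool:
    """True if two spans share tokens but neither contains the other."""
    shared = a.start < b.end and b.start < a.end
    nested = ((a.start <= b.start and b.end <= a.end)
              or (b.start <= a.start and a.end <= b.end))
    return shared and not nested


def roots_by_layer(spans: List[Span]) -> Dict[str, List[Span]]:
    """Group spans by layer; each layer forms its own hierarchy over the tokens."""
    layers: Dict[str, List[Span]] = {}
    for span in spans:
        layers.setdefault(span.layer, []).append(span)
    return layers


clause, unit_1, unit_2 = annotations
print(covered(clause), overlaps(clause, unit_1), overlaps(unit_1, unit_2))
print(sorted(roots_by_layer(annotations)))
```

Because the clause and the first intonation unit overlap without nesting, they cannot both be inner nodes of a single tree; keeping one root per layer over the same tokens avoids the conflict.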

Descriptions of GENAU can be found on the publications page of project C2.


Last update 03/10/2009