TUSNELDA (Tübingen collection of reusable, empirical, linguistic data structures; German: Tübinger Sammlung nutzbarer empirischer Datentypen) is the general annotation standard for the corpora that were created and annotated in the Collaborative Research Center 441 (SFB 441). It was developed in project C1. The standard makes the corpora of the SFB 441 compatible with one another and ensures that the data can be reused with standard corpus-linguistic tools. It is also flexible enough to meet the differing needs of the individual projects in the SFB.

The format was extended substantially during the last funding period of the SFB. It is now possible to express secondary relations between nodes in a treebank; this is used, for example, for the coreference annotation in the TüBa-D/Z corpus (project A1). Furthermore, two additional data types, collections of sentences and lexicons, were added to the format. Collections of sentences were created in projects A3, B3, B10, B17, and B18. Various kinds of lexicons resulted from the corpus-related work in projects A2, A5, B6, and B11.
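
As a rough illustration of what a secondary relation in a treebank means, the sketch below keeps the ordinary tree structure in one place and stores a coreference link between two nodes separately. It is not the TUSNELDA format itself: the class names, field names, node ids, and the relation label are all hypothetical and chosen only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One treebank node; all field names here are illustrative assumptions."""
    node_id: str
    label: str
    children: list[Node] = field(default_factory=list)


@dataclass
class SecondaryEdge:
    """A labeled link between two nodes that is not part of the tree itself."""
    source_id: str
    target_id: str
    relation: str


def index_nodes(root: Node) -> dict[str, Node]:
    """Map every node id in the tree to its node (depth-first walk)."""
    index = {root.node_id: root}
    for child in root.children:
        index.update(index_nodes(child))
    return index


# Toy sentence: "Anna sleeps because she is tired."
anna = Node("n1", "NP")
she = Node("n2", "NP")
tree = Node("s1", "S", [anna, Node("v1", "V"), Node("s2", "S", [she])])

# Primary (dominance) relations live in `children`;
# the coreference link is kept apart as a secondary edge.
secondary = [SecondaryEdge(source_id="n2", target_id="n1", relation="coreferential")]

nodes = index_nodes(tree)
for edge in secondary:
    print(edge.source_id, "->", edge.target_id, edge.relation)
```

Keeping secondary edges outside the tree means the syntactic structure stays a proper tree even when links cross clause boundaries.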

The Generalized Sustainability Architecture for Linguistic Data (German: GEneralisierte NAchhaltigkeitsarchitektUr für linguistische Daten, GENAU) is a new sustainable data format that was developed in project C2. It integrates three formats: the annotation formats TUSNELDA and EXMARaLDA (Extensible Markup Language for Discourse Annotation; SFB 538 "Multilingualism", University of Hamburg) and the exchange format Paula (SFB 632 "Information Structure", University of Potsdam/Humboldt University Berlin). The latter two formats support standoff annotation of multimedia data (such as audio or video data) as well as the annotation of overlapping units on multiple levels. To represent such data, GENAU employs multi-rooted trees.
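
The idea of standoff annotation with overlapping units can be sketched as follows. This is a minimal illustration under stated assumptions, not the GENAU, EXMARaLDA, or Paula data model: the `Span` class, the layer names, and the token indices are invented for the example. Each annotation layer points into the same token sequence from the outside, so units on different layers may overlap, and each layer acts as a separate root over the shared tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A standoff annotation: it refers to tokens by index instead of wrapping them."""
    layer: str
    label: str
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)


# Shared primary data; the annotations below never modify it.
tokens = ["well", "I", "think", "so", "yes"]

# Hypothetical layers and labels, chosen only for illustration.
spans = [
    Span("syntax", "clause", 1, 4),
    Span("prosody", "intonation_unit", 0, 3),
    Span("prosody", "intonation_unit", 3, 5),
]


def overlaps(a: Span, b: Span) -> bool:
    """True if the spans share tokens but neither one contains the other."""
    shares_tokens = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return shares_tokens and not (a_in_b or b_in_a)


def roots_by_layer(all_spans: list[Span]) -> dict[str, list[Span]]:
    """Group spans by layer; each layer is one root over the same token leaves."""
    grouped: dict[str, list[Span]] = {}
    for span in all_spans:
        grouped.setdefault(span.layer, []).append(span)
    return grouped


for layer, layer_spans in roots_by_layer(spans).items():
    for span in layer_spans:
        print(layer, span.label, tokens[span.start:span.end])
```

In this toy data the clause and the first intonation unit overlap without either containing the other, which a single tree over the tokens could not express; separate roots per layer avoid the conflict.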

