List of Corpora


This list is a comprehensive overview of all corpora that were created in the Collaborative Research Centre 441.

Project | Resource | Availability | Format | Description
A1 TüBa-D/Z
(Corpus)
Homepage NEGRA-Export
Export-XML
TUSNELDA-XML
The Tübinger Baumbank des Deutschen / Zeitungssprache (TüBa-D/Z; Tübingen Treebank of Written German) is a syntactically annotated German newspaper corpus based on data taken from the daily issues of "die tageszeitung" (taz). The treebank currently comprises approximately 36000 sentences (640000 words). The annotation is performed manually and is ongoing.
A1 TüPP-D/Z
(Corpus)
Homepage DEREKO-XML TüPP-D/Z is a collection of articles from the taz newspaper ("die tageszeitung") which have been automatically annotated with clause structure, topological fields, and chunks, in addition to lower-level annotation such as parts of speech and morphological ambiguity classes. All texts have been processed automatically, starting from paragraph, sentence and token segmentation. Word forms include information about some regular types of named entities, including dates, telephone numbers, and number/unit combinations.
A2 Warao Database
(Lexicon)
Microsoft Access The Warao lexicon is a compilation of Warao words collected from native speakers, together with information on their dialect, context and morphology, as well as glosses and translations in English, German and Spanish. The lexicon is also linked to a list of the native speakers, with some basic information about each one.
A3 SINBAD
(Sentence Collection)
Homepage
TUSNELDA-XML The aim of the sentence collection (SINBAD) is to provide researchers with access to a large body of (suboptimal) example sentences and their grammaticality judgements from the literature and from Project A3 empirical work.
A5 CoDII
(Lexicon)
Homepage
XML The Collection of Distributionally Idiosyncratic Items (CoDII) is a linguistic resource on lexical items which have highly idiosyncratic occurrence patterns.
B1 Russian Interviews
(Corpus)
Homepage
TUSNELDA-XML  
B1 Uppsala Corpus
(Corpus)
Homepage
TUSNELDA-XML The Uppsala Corpus of modern Russian texts was developed at the Department of Slavic Studies at Uppsala University, Sweden, under the direction of Lennart Lönngren, who gave us permission to use the corpus in the SFB 441 project B1. All rights regarding the Uppsala Corpus belong to the author. Corpus data may be used for research purposes only; commercial use of the corpus is prohibited. The corpus (Upsal'skij korpus russkix tekstov) consists of some 600 Russian texts with a total of one million running words (word tokens), equally divided between informative and literary prose. The informative texts date from 1985 to 1989, while the literary texts, whose vocabulary does not date as quickly, cover a longer period, 1960-88. The corpus does not include poetry or drama.
B3 Götz von Berlichingen Corpus
(Corpus)
Online Query Form
TUSNELDA-XML The Early Modern High German text "Götz von Berlichingen" was digitised by the SFB 441 project B3. The original text was scanned, OCR-processed and manually corrected. The encoding follows the TUSNELDA standards. In order to preserve the line numbers of the source text, we used the TUSNELDA "poem" element. Pages starting with an n=0 paragraph refer to the preceding paragraphs.
B3 Alltagserzählungen Corpus
(Sentence Collection)
 
TUSNELDA-XML This annotated data collection is based on recordings of monologues and their subsequent transcription. Participants were asked to talk about current events that had influenced their lives in a good or bad way. The study investigates the processing and acquisition of German coordinate structures.
B6 LexiTypeSyn
(Lexicon)
 
MySQL  
B6 LexiTypeDia
(Lexicon)
 
Microsoft Access Cognitive patterns relating to body parts, specifically in the domain of the head.
B6 Motivational Partner
(Lexicon)
 
  A lexicon of analyzed answers to a semi-open questionnaire designed to obtain motivational partners of word+meaning stimulus units.
B6 Polysemy
(Lexicon)
 
A sentence collection built from a sentence generation and definition task targeted at word sense disambiguation, which the informants performed in a set environment.
B6 Semantic Relation
(Lexicon)
 
  A lexicon of analyzed answers to a questionnaire targeted at specifying the semantic relations between motivated and motivating stimulus units.
B8 Comic Corpus
(Corpus)
Online Query Form
TUSNELDA-XML The Comic Corpus belongs to the BKS corpus group. It consists of several Serbian Asteriks comic strips. Some of the Serbian comics were originally written in Cyrillic script; these texts were transcribed into Latin script so that all comic texts share the same encoding. The comic texts are well suited to any type of research on imitated spoken-language phenomena. The research interest of our project lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in combination with a pointing gesture in a typical demonstratio ad oculos situation, including all extralinguistic information given by the communication situation. For that purpose, the panels that include deictics and pointing gestures were digitised and added to the corpus.
B8 Bosnian Interviews
(Corpus)
Online Query Form
TUSNELDA-XML The Bosnian Interviews corpus belongs to the BKS corpus group. It contains 13 narrative interviews conducted with Bosnian refugees (Croats, Muslims and Serbs) in 1994. These texts are well suited to any type of research on Bosnian spoken-language phenomena. The research interest of our project lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in narrative conversational situations.
B8 Novosadski Corpus of Spoken Language
(Corpus)
Online Query Form
TUSNELDA-XML  
B9 TüPoDia
(Corpus)
Online Query Form
TUSNELDA-XML The TüPoDia corpus contains editions of Portuguese texts selected on a historical basis. The texts were digitised to enable automatic analyses of, for example, word frequencies.
B9 BraToLi
(Corpus)
Online Query Form
TUSNELDA-XML The BraToLi corpus contains transcriptions of soccer match commentaries (TV and radio) as well as conversations about steering wheel locks. Languages include Brazilian Portuguese, European Spanish (Toledo) and American Spanish (Lima).
B10 TüNeg
(Sentence Collection)
 
Filemaker The TüNeg database contains about 2700 entries, mostly from the same languages as the TüTeAM database and drawn from similar kinds of sources. Sentences are analysed according to the following criteria: licensing environment, different possible readings, negative polarity items involved, types of morphological negation and the general type of negation. Furthermore, where appropriate, contrasting examples have been recorded.
B10 TüTeAM
(Sentence Collection)
 
n.a. The TüTeAM corpus contains about 2800 entries from Ancient Greek, German, English, Italian, Hungarian, Latin, Swedish, Russian, Ukrainian and Bulgarian. The data come from various sources: linguistic literature (the "classics" on tense and aspect), fiction and documentary evidence. Examples appear in the original script, if necessary with transliteration, an English or German gloss and a translation. Each example also carries an indication of its source or a full bibliographic reference. Sentences are analysed according to various criteria: tense and aspect morphology, types of time adverbials, Aktionsarten. The analysis allows targeted searches for similar phenomena across a variety of languages and makes typological regularities easier to discover.
B11 OTC
(Corpus, syntactic annotation)
Homepage
TUSNELDA-XML

Documentation
Elements
Categories
Arguments
Argument structure
OTC = Fonds Pelliot Ms. 250 / Pelliot tibétain 1287, commonly known as the Old Tibetan Chronicle, ca. mid 9th century, found in Dunhuang (Central Asia); the original is kept in the Bibliothèque Nationale de France. Chapter I (l.1-62).
Language: Old Tibetan ca. mid 7th – mid 11th century.
Size: 6 divisions (paragraphs), 101 sentences, 232 clauses (225 verbs), 733 tokens ('words').
Metadata
B11 TVP
(Corpus, syntactic annotation)
Homepage
TUSNELDA-XML

Documentation
Elements
Categories
Arguments
Argument structure
TVP = Die tibetische Version des Papageienbuches, a 15th century adaptation of the Indian narrations of the parrot, styled as stories about previous reincarnations of Atisha and his disciples. Ca. 40% (fol. 261v1 - 268v5, pp. 43-48).
Language: Classical Tibetan ca. 11th – 19th century.
Size: 11 divisions plus 13 subdivisions, 415 sentences, 903 clauses (849 verbs), 2669 tokens ('words').
Metadata
B11 LLV
(Corpus, syntactic annotation)
Homepage
TUSNELDA-XML

Documentation
Elements
Categories
Arguments
Argument structure
LLV = Gshamyulna bshadpavi Kesargyi sgrungs bzhugs
A Lower Ladakhi version of the Kesar saga, collected around 1900. Chapter I (pp. 1-16).
Language: contemporary Ladakhi, West Tibetan.
Size: 12 divisions, 295 sentences, 589 clauses (585 verbs), 1926 tokens ('words').
Metadata
B11 VDLV
(Lexicon with example sentences)

extended TEI / TUSNELDA-XML

For documentation of semantic and syntactic categories see also:

Categories
Arguments
Argument structure
VDLV = Valency Dictionary of Ladakhi Verbs (work in progress).
Language: contemporary Ladakhi, West Tibetan. Basically, one reference dialect each of the two main dialect groups: Domkhar for Shamskat (speech of Lower Ladakh) and Gya for Kenkhat (speech of Upper Ladakh).
Size: 925 main entries (783 for Domkhar, 814 for Gya), 717 additional sub-entries (507 for Domkhar, 596 for Gya), 12980 example sentences (6252 for Domkhar, 4492 for Gya).
Overview
B14 Sentence connectors Old Spanish
(Corpus)
 
XML Discourse Tradition of Romance Languages and multi-dimensional Corpus Analysis
B14 Sentence connectors Surselvian
(Corpus)
 
XML Discourse Tradition of Romance Languages and multi-dimensional Corpus Analysis.
B16 Russian-German Language Acquisition
(Corpus)
 
CHILDES  
B17 Degree constructions
(Sentence Collection)
 
Database The database presents parallel sets of data on comparison constructions from 15 languages: Bulgarian, Guaraní (an Amerindian language spoken mostly in Paraguay), Hindi, Hungarian, Japanese, Mandarin Chinese, Mooré (a Gur language), Motu (from Papua New Guinea), Romanian, Russian, Samoan, Spanish, Thai, Turkish and Yorùbá (a Kwa language). The sentences were elicited from naive informants with the help of language-specific questionnaires. The goal was an in-depth study of these languages, with a view to working out how their grammars differ so as to yield the diverse empirical picture that comparisons present across languages. Each language set contains at most 19 examples, presented in the following order: 1) a descriptive part that exemplifies the basic types of degree constructions in the given language (predicative phrasal, adverbial and attributive comparative, comparative of quantity, clausal comparative, equative, less-comparative, positive, superlative, too/enough-constructions) and gives an impression of the systematicity of degree constructions in the syntax and semantics of the language; 2) data that pertain to different aspects of cross-linguistic variation in the semantics of degree (differential comparative, comparison with a degree, 'negative island effect' test, tests for scope interactions of the comparative with modals, degree question, measure phrase construction, subcomparative). Examples appear partly in the original script and are provided with a gloss, a translation, a grammaticality/felicity judgement and, where necessary, the context or reading. The judgement field contains felicity judgements for the scope interaction examples (supplied with the relevant contexts or readings) and grammaticality judgements for the rest. The following ranking is used in both cases: ok (grammatical/felicitous); ? (slightly marked/slightly odd); ?? (marked/odd); * (ungrammatical/infelicitous).
"n/c" in the judgement field indicates that the example cannot be constructed, and "n/a" that the test is not applicable. In the latter case, the comment field in the footer row contains a short explanation. "n/c" and "*" rows usually contain alternative examples (Alt) along with the literal ones (Lit). The alternative examples reflect other ways to express the relevant meaning, e.g. in the form of paraphrases.
B18 Stative Passive
(Sentence Collection)
 
Microsoft Excel The "Grammatik und Pragmatik des Zustandspassivs" is a sentence collection resulting from an investigation of the meaning of the German stative passive.
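
The judgement values listed for the B17 degree constructions database (ok, ?, ??, *, plus n/c and n/a) lend themselves to a simple ordinal encoding when the data are processed automatically. The following is a minimal Python sketch, assuming the judgement field is available as a plain string. The numeric ranks and the names `JUDGEMENT_RANK`, `NO_JUDGEMENT` and `rank_judgement` are illustrative assumptions and not part of the database itself.

```python
from typing import Optional

# Ordinal ranks for the four-step judgement scale (0 = fully acceptable,
# 3 = unacceptable). The numeric values are an assumption made for
# illustration; the source only defines the symbols and their order.
JUDGEMENT_RANK = {"ok": 0, "?": 1, "??": 2, "*": 3}

# Markers that signal the absence of a judgement rather than a rank.
NO_JUDGEMENT = {
    "n/c": "example cannot be constructed",
    "n/a": "test is not applicable",
}


def rank_judgement(value: str) -> Optional[int]:
    """Return the ordinal rank of a judgement symbol, or None for n/c and n/a."""
    key = value.strip()
    if key in NO_JUDGEMENT:
        return None
    if key not in JUDGEMENT_RANK:
        raise ValueError(f"unknown judgement symbol: {value!r}")
    return JUDGEMENT_RANK[key]
```

Keeping "n/c" and "n/a" outside the ordinal scale means that rows without a judgement are not mistaken for ungrammatical examples when judgements are compared across languages.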

Corpora of co-operating Collaborative Research Centres


Last update 03/20/2009