List of corpora created at SFB 441

Name Details
Total 24 projects
TüPP-D/Z Corpus
A01: Representation and Automatic Acquisition of Linguistic Data
Version: 3rd September, 2004
TüPP-D/Z is a collection of articles from the taz newspaper ("die tageszeitung") which have been automatically annotated with clause structure, topological fields, and chunks, in addition to lower-level annotation, including parts of speech and morphological ambiguity classes. All texts have been processed automatically, starting from paragraph, sentence and token segmentation. Word forms include information about some regular types of named entities, including dates, telephone numbers, and number/unit combinations.
Annotation layers: ALLLAYERSUNIFIED, Clause, Chunk, Field, Named Entities, Lexeme
Number of associated resource files: 30350
TüBa-D/Z Corpus
A01: Representation and Automatic Acquisition of Linguistic Data
Version: R3, 15th August, 2006
The Tübinger Baumbank des Deutschen / Zeitungssprache (TüBa-D/Z; Tübingen Treebank of Written German) is a syntactically annotated German newspaper corpus based on data taken from the daily issues of "die tageszeitung" (taz). The treebank currently comprises approximately 36000 sentences (640000 words). The annotation was performed manually and is currently ongoing.
Annotation layers: Clause, Lexeme, Field, Discourse, Named Entities, ALLLAYERSUNIFIED, Grammatical Function, Phrase
Number of associated resource files: 10314
Warao Lexicon Corpus
A02: Linguistic Theories as Data Types
Version: 16th December, 2008
The Warao lexicon is a compilation of Warao words collected from native speakers and bundled with information about their dialect, context and morphology, together with glosses and translations in English, German and Spanish. The lexicon is also linked to a list of native speakers containing some basic information about each one.
Annotation layers: n.a.
Number of associated resource files: 12
A03: Suboptimal Syntactic Structures
Version: 16th December, 2008
The aim of the sentence collection (SINBAD) is to provide researchers with access to a large body of (suboptimal) example sentences and their grammaticality judgements, taken from the literature and from the empirical work of Project A3.
Annotation layers: n.a.
Number of associated resource files: 11
CoDII Corpus
A05: Distributional Idiosyncrasies in Logical Form
Version: 16th December, 2008
The Collection of Distributionally Idiosyncratic Items (CoDII) is a linguistic resource on lexical items which have highly idiosyncratic occurrence patterns.
Annotation layers: n.a.
Number of associated resource files: 15
Uppsala Corpus Corpus
B01: Corpus Based Analysis of Forms of Address and Politeness in the Slavonic Languages
Version: 20th July, 2000
The Uppsala Corpus of modern Russian texts was developed at the Department of Slavic Studies at Uppsala University, Sweden, under the direction of Lennart Lönngren, from whom we obtained permission to use the Uppsala corpus for the SFB 441 project B01. All rights regarding the Uppsala corpus belong to the author. Corpus data may be used for research purposes only; commercial use of the corpus is prohibited. This corpus (Upsal'skij korpus russkix tekstov) consists of some 600 Russian texts with a total of one million running words (word tokens), equally divided between informative and literary prose. The informative texts are from between 1985 and 1989, while the literary texts, whose vocabulary does not date as quickly, cover a longer period, 1960-88. The corpus does not include poetry or drama.
Annotation layers: n.a.
Number of associated resource files: 9
Götz von Berlichingen Corpus
B03: Modal Verbs and Modality in German
Version: 13th September, 2000
The Early Modern High German text "Götz von Berlichingen" was digitised by the SFB 441 project B3. The original text was scanned, OCR processed and manually corrected. The encoding follows the TUSNELDA standards. In order to preserve the line numbers of the source text, we used the TUSNELDA "poem" element. Pages starting with an n=0 paragraph refer to the preceding paragraphs.
Annotation layers: n.a.
Number of associated resource files: 5
Alltagserzählungen Corpus
B03: Modal Verbs and Modality in German
Version: 1st January, 2000
This annotated data collection is based on recordings of monologues and their subsequent transcription. Participants were asked to talk about current events that influenced their lives in a good or bad way. The study investigates the processing and acquisition of German coordinate structures.
Annotation layers: n.a.
Number of associated resource files: 9
LexiTypeDia Corpus
B06: Lexical Motivation in French, Italian and German
Version: 1st January, 2000
Cognitive patterns about body parts, specifically in the domain of the head.
Annotation layers: n.a.
Number of associated resource files: 11
Motivational Partner Corpus
B06: Lexical Motivation in French, Italian and German
Version: 16th December, 2008
A lexicon of analyzed answers to a semi-open questionnaire designed to obtain motivational partners of word+meaning stimulus units.
Annotation layers: n.a.
Number of associated resource files: 13
Polysemy Lexicon
B06: Lexical Motivation in French, Italian and German
Version: 16th December, 2008
A sentence collection constructed from a sentence generation and definition task, targeted at word sense disambiguation, which the informants performed in a set environment.
Annotation layers: n.a.
Number of associated resource files: 13
Semantic Relation Corpus
B06: Lexical Motivation in French, Italian and German
Version: 16th December, 2008
A lexicon of analyzed answers to a questionnaire targeted at specifying the semantic relations between motivated and motivating stimulus units.
Annotation layers: n.a.
Number of associated resource files: 13
BKS-Korpus Super Corpus Group
B08: Corpusbased Analysis of Local and Temporal Deictics in (Spontaneously) Spoken and (Reflected) Written Language
Version: 14th September, 2001
The BKS Corpus consists of three subcorpora: (a) Comic Corpus, (b) Bosnian Interviews, (c) Novosadski Corpus of Spoken Language. The research interest of the SFB 441 project B8 lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in different text classes.
Annotation layers: n.a.
Number of associated resource files: 0
Bosnische Interviews Corpus
B08: Corpusbased Analysis of Local and Temporal Deictics in (Spontaneously) Spoken and (Reflected) Written Language
Version: 14th September, 2001
Part of the BKS-Korpus corpus group.
The subcorpus Bosnian Interviews contains 13 narrative interviews which were conducted with Bosnian refugees (Croats, Muslims and Serbs) in 1994. These texts are well suited to any type of research on phenomena of spoken Bosnian. The research interest of our project lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in narrative conversational situations.
Annotation layers: Editorial Notes, Deictics, Conversation
Number of associated resource files: 124
Comic Korpus Corpus
B08: Corpusbased Analysis of Local and Temporal Deictics in (Spontaneously) Spoken and (Reflected) Written Language
Version: 14th September, 2001
Part of the BKS-Korpus corpus group.
The Comic Corpus consists of several Serbian Asteriks comic strips. Some of the Serbian comics were originally written in Cyrillic; the texts were transcribed into Latin script in order to have all comic texts in the same encoding. The comic texts are well suited to any type of research on imitated spoken-language phenomena. The research interest of our project lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in combination with a pointing gesture in a typical demonstratio ad oculos situation, including all extralinguistic information given by the communication situation. For that purpose, the panels that include deictics and pointing gestures were digitised and added to the corpus.
Annotation layers: n.a.
Number of associated resource files: 8
BraToLi-Korpus Corpus
B09: Local and Temporal Deixis in the Romance Languages: History and Variation
Version: 7th June, 2000
The BraToLi corpus contains transcriptions of soccer match commentaries (TV and radio) as well as conversations about steering-wheel locks. Languages include Brazilian Portuguese, European Spanish (Toledo) and American Spanish (Lima).
Annotation layers: n.a.
Number of associated resource files: 8
TüPoDia-Korpus Corpus
B09: Local and Temporal Deixis in the Romance Languages: History and Variation
Version: 7th June, 2000
The TüPoDia corpus contains editions of Portuguese texts, specifically collected on a historical basis. The texts were digitized in order to enable automatic analyses with regard to, for example, word frequencies.
Annotation layers: Editorial Notes, Deictics, Text Structure
Number of associated resource files: 62
TüTeAM Corpus
B10: Typology and Logical Form of Sentential Negation
Version: 7th June, 2000
The TüTeAM corpus contains about 2800 entries from Ancient Greek, German, English, Italian, Hungarian, Latin, Swedish, Russian, Ukrainian and Bulgarian. The data come from various sources: linguistic literature (the "classics" on tense and aspect), fiction, documentary evidence. Examples appear in the original script, if necessary with transliteration, English or German gloss and translation. The examples also contain either a short indication of the source or a full bibliographic reference. Sentences are analysed according to various criteria: tense and aspect morphology, types of time adverbials, Aktionsarten. The analysis allows a specific search for similar phenomena in a variety of languages and makes the discovery of typological regularities easier.
Annotation layers: n.a.
Number of associated resource files: 16
TüNeg Corpus
B10: Typology and Logical Form of Sentential Negation
Version: 7th June, 2000
The TüNeg database contains about 2700 entries from mostly the same languages as the TüTeAM database, using sources similar in kind. Sentences are analysed according to the following criteria: licensing environment, different possible readings, negative polarity items involved, types of morphological negation and the general type of negation. Furthermore, where appropriate, contrasting examples have been recorded.
Annotation layers: n.a.
Number of associated resource files: 16
TVP (Tibetische Version Papagaienbuch) Corpus
B11: Semantic Roles, Case Relations, and Cross-Clausal Reference in Tibetan
Version: 7th June, 2000
Semantic roles, case relations, and cross-clausal reference in Tibetan.
Annotation layers: n.a.
Number of associated resource files: 15
Satzkonnektoren Altspanisch Corpus
B14: Discourse Traditions of Romance Languages and Multidimensional Analysis of Diachronic Corpora
Version: 7th June, 2000
Discourse Tradition of Romance Languages and multi-dimensional Corpus Analysis.
Annotation layers: n.a.
Number of associated resource files: 22
Satzkonnektoren Surselvisch Corpus
B14: Discourse Traditions of Romance Languages and Multidimensional Analysis of Diachronic Corpora
Version: 7th June, 2000
Discourse Tradition of Romance Languages and multi-dimensional Corpus Analysis.
Annotation layers: n.a.
Number of associated resource files: 18
Zustandpassiv Corpus
B18: Grammar and Pragmatics of the German Stative Passive
Version: 16th December, 2008
"Grammatik und Pragmatik des Zustandspassivs" is a sentence collection resulting from an investigation of the meaning of the German stative passive.
Annotation layers: n.a.
Number of associated resource files: 11
Gradkonstruktionen Corpus
B17: Comparative Constructions
Version: 1st March, 2008
The database presents parallel sets of data on comparison constructions from 15 languages: Bulgarian, Guaraní (an Amerindian language spoken mostly in Paraguay), Hindi, Hungarian, Japanese, Mandarin Chinese, Mooré (a Gur language), Motu (from Papua New Guinea), Romanian, Russian, Samoan, Spanish, Thai, Turkish and Yorùbá (a Kwa language). The sentences have been elicited from naive informants with the help of language-specific questionnaires. The goal has been an in-depth study of those languages, with a view to working out how their grammars differ in order to yield the diverse empirical picture that comparisons present across languages.
Each language set contains at most 19 examples presented in the following order:
1) a descriptive part that exemplifies the basic types of degree constructions in the given language (predicative phrasal, adverbial and attributive comparative, comparative of quantity, clausal comparative, equative, less-comparative, positive, superlative, too/enough-constructions) and gives an impression of the systematicity of degree constructions in the syntax and semantics of the language;
2) data that pertain to different aspects of cross-linguistic variation in the semantics of degree (differential comparative, comparison with a degree, 'negative island effect' test, tests for scope interactions of the comparative with the modals, degree question, measure phrase construction, subcomparative).
Examples appear partly in the original script and are provided with the gloss, the translation, the grammaticality/felicity judgement and the context/reading where necessary. The judgement field contains felicity judgements for the scope interaction examples (supplied with the relevant contexts or readings) and grammaticality judgements for the rest. The following ranking has been used in both cases: ok (grammatical/felicitous); ? (slightly marked/slightly odd); ?? (marked/odd); * (ungrammatical/infelicitous).
"n/c" and "n/a" in the judgement field indicate that the example cannot be constructed or the test is not applicable. In the latter case, the comment field in the footer row contains a short explanation. "n/c" and "*" rows usually contain alternative examples (Alt) along with the literal ones (Lit). The former reflect alternative ways to express the relevant meaning, e.g. in the form of paraphrases.
Annotation layers: n.a.
Number of associated resource files: 10

This list was compiled in collaboration with Project C1.