List of Corpora Created at SFB 441

Name

Details

Total

24 projects

TüPP-D/Z Corpus

A01: Representation and Automatic Acquisition of Linguistic Data

Version	3rd September, 2004	TüPP-D/Z is a collection of articles from the taz newspaper ("die tageszeitung") which have been automatically annotated with clause structure, topological fields, and chunks, in addition to lower-level annotation including parts of speech and morphological ambiguity classes. All texts have been processed automatically, starting from paragraph, sentence and token segmentation. Word forms include information about some regular types of named entities, including dates, telephone numbers, and number/unit combinations.
Annotation layers	ALLLAYERSUNIFIED, Clause, Chunk, Field, Named Entities, Lexeme
Number of associated resource files	30350

TüBa-D/Z Corpus

A01: Representation and Automatic Acquisition of Linguistic Data

Version	R3 15th August, 2006	The Tübinger Baumbank des Deutschen / Zeitungssprache (TüBa-D/Z; Tübingen Treebank of Written German) is a syntactically annotated German newspaper corpus based on data taken from the daily issues of "die tageszeitung" (taz). The treebank currently comprises approximately 36000 sentences (640000 words). The annotation is performed manually and is still ongoing.
Annotation layers	Clause, Lexeme, Field, Discourse, Named Entities, ALLLAYERSUNIFIED, Grammatical Function, Phrase
Number of associated resource files	10314

Warao Lexicon Corpus

A02: Linguistic Theories as Data Types

Version	16th December, 2008	The Warao lexicon is a compilation of Warao words collected from native speakers and bundled with information about their dialect, context and morphology, along with glosses and translations in English, German and Spanish. The lexicon is also linked to a list of native speakers containing some basic information about each one.
Annotation layers	n.a.
Number of associated resource files	12

SINBAD Corpus

A03: Suboptimal Syntactic Structures

Version	16th December, 2008	The aim of the sentence collection (SINBAD) is to provide researchers with access to a large body of (suboptimal) example sentences and their grammaticality judgements from the literature and from Project A3 empirical work.
Annotation layers	n.a.
Number of associated resource files	11

CoDII Corpus

A05: Distributional Idiosyncrasies in Logical Form

Version	16th December, 2008	The Collection of Distributionally Idiosyncratic Items (CoDII) is a linguistic resource on lexical items which have highly idiosyncratic occurrence patterns.
Annotation layers	n.a.
Number of associated resource files	15

Uppsala Corpus Corpus

B01: Corpus Based Analysis of Forms of Address and Politeness in the Slavonic Languages

Version	20th July, 2000	The Uppsala Corpus of modern Russian texts was developed at the Department of Slavic Studies at Uppsala University, Sweden, under the direction of Lennart Lönngren, from whom we obtained the permission to use the Uppsala corpus for the SFB 441 project B01. All rights regarding the Uppsala corpus belong to the author. Corpus data may be used for research purposes only; commercial use of the corpus is prohibited. This corpus (Upsal'skij korpus russkix tekstov) consists of some 600 Russian texts with a total of one million running words (word tokens), equally divided between informative and literary prose. The informative texts are from between 1985 and 1989, while the literary texts, whose vocabulary does not date as quickly, cover a longer period, 1960-88. The corpus does not include poetry or drama.
Annotation layers	n.a.
Number of associated resource files	9

Götz von Berlichingen Corpus

B03: Modal Verbs and Modality in German

Version	13th September, 2000	The Early Modern High German text "Götz von Berlichingen" was digitised by the SFB 441 project B3. The original text was scanned, OCR processed and manually corrected. The encoding follows the TUSNELDA standards. In order to preserve the line numbers of the source text, we used the TUSNELDA "poem" element. Pages starting with an n=0 paragraph refer to the preceding paragraphs.
Annotation layers	n.a.
Number of associated resource files	5

Alltagserzählungen Corpus

B03: Modal Verbs and Modality in German

Version	1st January, 2000	This annotated data collection is based on recordings of monologues and their subsequent transcription. Participants were asked to talk about current events that influenced their lives in a good or bad way. The study investigates the processing and acquisition of German coordinate structures.
Annotation layers	n.a.
Number of associated resource files	9

LexiTypeDia Corpus

B06: Lexical Motivation in French, Italian and German

Version	1st January, 2000	Cognitive patterns relating to body parts, specifically in the domain of the head.
Annotation layers	n.a.
Number of associated resource files	11

Motivational Partner Corpus

B06: Lexical Motivation in French, Italian and German

Version	16th December, 2008	A lexicon of analyzed answers to a semi-open questionnaire designed to obtain motivational partners of word+meaning stimulus units.
Annotation layers	n.a.
Number of associated resource files	13

Polysemy Lexicon

B06: Lexical Motivation in French, Italian and German

Version	16th December, 2008	A sentence collection constructed by conducting a sentence generation and definition task targeted at word sense disambiguation, performed by the informants in a set environment.
Annotation layers	n.a.
Number of associated resource files	13

Semantic Relation Corpus

B06: Lexical Motivation in French, Italian and German

Version	16th December, 2008	A lexicon of analyzed answers to a questionnaire targeted at specifying the semantic relations between motivated and motivating stimulus units.
Annotation layers	n.a.
Number of associated resource files	13

BKS-Korpus Super Corpus Group

B08: Corpus-Based Analysis of Local and Temporal Deictics in (Spontaneously) Spoken and (Reflected) Written Language

Version	14th September, 2001	The BKS Corpus consists of three subcorpora: (a) Comic Corpus, (b) Bosnian Interviews, (c) Novosadski Corpus of Spoken Language. The research interest of the SFB 441 project B8 lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in different text classes.
Annotation layers	n.a.
Number of associated resource files	0

Bosnische Interviews Corpus

B08: Corpus-Based Analysis of Local and Temporal Deictics in (Spontaneously) Spoken and (Reflected) Written Language

Version	14th September, 2001	Part of the BKS-Korpus corpus group. The subcorpus Bosnian Interviews contains 13 narrative interviews which were conducted with Bosnian refugees (Croats, Muslims and Serbs) in 1994. These texts are well suited to any type of research on Bosnian spoken-language phenomena. The research interest of our project lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in narrative conversational situations.
Annotation layers	Editorial Notes, Deictics, Conversation
Number of associated resource files	124

Comic Korpus Corpus

B08: Corpus-Based Analysis of Local and Temporal Deictics in (Spontaneously) Spoken and (Reflected) Written Language

Version	14th September, 2001	Part of the BKS-Korpus corpus group. The Comic Corpus consists of several Serbian Asteriks comic strips. Some of the Serbian comics were originally written in Cyrillic; the texts were transcribed into Latin script in order to have all comic texts in the same encoding. The comic texts are well suited to any type of research on imitated spoken-language phenomena. The research interest of our project lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in combination with a pointing gesture in a typical demonstratio ad oculos situation including all extralinguistic information given by the communication situation. For that purpose the panels that include deictics and pointing gestures were digitised and added to the corpus.
Annotation layers	n.a.
Number of associated resource files	8

BraToLi-Korpus Corpus

B09: Local and Temporal Deixis in the Romance Languages: History and Variation

Version	7th June, 2000	The BraToLi corpus contains transcriptions of soccer match commentaries (TV and radio) as well as conversations about steering-wheel locks. Languages include Brazilian Portuguese, European Spanish (Toledo) and American Spanish (Lima).
Annotation layers	n.a.
Number of associated resource files	8

TüPoDia-Korpus Corpus

B09: Local and Temporal Deixis in the Romance Languages: History and Variation

Version	7th June, 2000	The TüPoDia corpus contains editions of Portuguese texts, specifically collected on a historical basis. The texts were digitized in order to enable automatic analyses of, for example, word frequencies.
Annotation layers	Editorial Notes, Deictics, Text Structure
Number of associated resource files	62

TüTeAM Corpus

B10: Typology and Logical Form of Sentential Negation

Version	7th June, 2000	The TüTeAM corpus contains about 2800 entries from Ancient Greek, German, English, Italian, Hungarian, Latin, Swedish, Russian, Ukrainian and Bulgarian. The data come from various sources: linguistic literature (the "classics" on tense and aspect), fiction and documentary evidence. Examples appear in the original script, if necessary with transliteration, English or German gloss and translation. The examples also contain an indication of the source or a complete bibliographic reference. Sentences are analysed according to various criteria: tense and aspect morphology, types of time adverbials, Aktionsarten. The analysis allows a specific search for similar phenomena in a variety of languages and makes the discovery of typological regularities easier.
Annotation layers	n.a.
Number of associated resource files	16

TüNeg Corpus

B10: Typology and Logical Form of Sentential Negation

Version	7th June, 2000	The TüNeg database contains about 2700 entries, mostly from the same languages as the TüTeAM database and drawn from similar kinds of sources. Sentences are analysed according to the following criteria: licensing environment, different possible readings, negative polarity items involved, types of morphological negation and the general type of negation. Furthermore, where appropriate, contrasting examples have been recorded.
Annotation layers	n.a.
Number of associated resource files	16

TVP (Tibetische Version Papageienbuch) Corpus

B11: Semantic Roles, Case Relations, and Cross-Clausal Reference in Tibetan

Version	7th June, 2000	Semantic roles, case relations, and cross-clausal reference in Tibetan.
Annotation layers	n.a.
Number of associated resource files	15

Satzkonnektoren Altspanisch Corpus

B14: Discourse Traditions of Romance Languages and Multidimensional Analysis of Diachronic Corpora

Version	7th June, 2000	Discourse Tradition of Romance Languages and multi-dimensional Corpus Analysis.
Annotation layers	n.a.
Number of associated resource files	22

Satzkonnektoren Surselvisch Corpus

B14: Discourse Traditions of Romance Languages and Multidimensional Analysis of Diachronic Corpora

Version	7th June, 2000	Discourse Tradition of Romance Languages and multi-dimensional Corpus Analysis.
Annotation layers	n.a.
Number of associated resource files	18

Zustandpassiv Corpus

B18: Grammar and Pragmatics of the German Stative Passive

Version	16th December, 2008	The "Grammatik und Pragmatik des Zustandspassivs" is a sentence collection resulting from an investigation of the meaning of the German stative passive.
Annotation layers	n.a.
Number of associated resource files	11

Gradkonstruktionen Corpus

B17: Comparative Constructions

Version	1st March, 2008	The database presents parallel sets of data on comparison constructions from 15 languages: Bulgarian, Guaraní (an Amerindian language spoken mostly in Paraguay), Hindi, Hungarian, Japanese, Mandarin Chinese, Mooré (a Gur language), Motu (from Papua New Guinea), Romanian, Russian, Samoan, Spanish, Thai, Turkish and Yorùbá (a Kwa language). The sentences were elicited from naive informants with the help of language-specific questionnaires. The goal was an in-depth study of those languages, with a view to determining how their grammars differ in order to yield the diverse empirical picture that comparisons present across languages. Each language set contains at most 19 examples presented in the following order: 1) a descriptive part that exemplifies the basic types of degree constructions in the given language (predicative phrasal, adverbial and attributive comparative, comparative of quantity, clausal comparative, equative, less-comparative, positive, superlative, too/enough-constructions) and gives an impression of the systematicity of degree constructions in the syntax and semantics of the language; 2) data that pertain to different aspects of cross-linguistic variation in the semantics of degree (differential comparative, comparison with a degree, 'negative island effect' test, tests for scope interactions of the comparative with the modals, degree question, measure phrase construction, subcomparative). Examples appear partly in the original script and are provided with the gloss, the translation, the grammaticality/felicity judgement and the context/reading where necessary. The judgement field contains felicity judgements for the scope interaction examples (supplied with the relevant contexts or readings) and grammaticality judgements for the rest. The following ranking has been used in both cases: ok (grammatical/felicitous); ? (slightly marked/slightly odd); ?? (marked/odd); * (ungrammatical/infelicitous).
"n/c" and "n/a" in the judgement field indicate that the example cannot be constructed or the test is not applicable. In the latter case, the comment field in the footer row contains a short explanation. "n/c" and "*" rows usually contain alternative examples (Alt) along with the literal ones (Lit). The former reflect alternative ways to express the relevant meaning, e.g. in the form of paraphrases.
Annotation layers	n.a.
Number of associated resource files	10

This list was compiled in collaboration with Project C1.