List of Corpora


This list contains all corpora that were created in Collaborative Research Centre 441 (Sonderforschungsbereich 441, SFB 441).

Project | Resource | Availability | Format | Description
A1 TüBa-D/Z
(Corpus)
Homepage
NEGRA-Export
Export-XML
TUSNELDA-XML
The Tübinger Baumbank des Deutschen / Zeitungssprache (TüBa-D/Z; Tübingen Treebank of Written German) is a syntactically annotated German newspaper corpus based on data taken from the daily issues of "die tageszeitung" (taz). The treebank currently comprises approximately 36,000 sentences (640,000 words). The annotation is performed manually and is ongoing.
A1 TüPP-D/Z
(Corpus)
Homepage
DEREKO-XML
TüPP-D/Z is a collection of articles from the taz newspaper ("die tageszeitung") that have been automatically annotated with clause structure, topological fields, and chunks, in addition to lower-level annotation including parts of speech and morphological ambiguity classes. All texts have been processed automatically, starting from paragraph, sentence and token segmentation. Word forms include information about some regular types of named entities, including dates, telephone numbers, and number/unit combinations.
A2 Warao Database
(Lexicon)
Microsoft Access
The Warao lexicon is a compilation of Warao words collected from native speakers and bundled with information about their dialect, context and morphology, together with glosses and translations in English, German and Spanish. The lexicon is also linked to a list of native speakers containing some basic information about each one.
A3 SINBAD
(Sentence collection)
Homepage
TUSNELDA-XML
The aim of the sentence collection (SINBAD) is to provide researchers with access to a large body of (suboptimal) example sentences and their grammaticality judgements, drawn from the literature and from the empirical work of Project A3.
A5 CoDII
(Lexicon)
Homepage
XML
The Collection of Distributionally Idiosyncratic Items (CoDII) is a linguistic resource on lexical items that have highly idiosyncratic occurrence patterns.
B1 Russian Interviews
(Corpus)
Homepage
TUSNELDA-XML
B1 Uppsala Corpus
(Corpus)
Homepage
TUSNELDA-XML
The Uppsala Corpus of modern Russian texts (Upsal'skij korpus russkix tekstov) was developed at the Department of Slavic Studies at Uppsala University, Sweden, under the direction of Lennart Lönngren, from whom we obtained permission to use the corpus in SFB 441 project B1. All rights to the Uppsala Corpus belong to the author. Corpus data may be used for research purposes only; commercial use of the corpus is prohibited. The corpus consists of some 600 Russian texts with a total of one million running words (word tokens), equally divided between informative and literary prose. The informative texts date from 1985 to 1989, while the literary texts, whose vocabulary does not date as quickly, cover a longer period, 1960 to 1988. The corpus does not include poetry or drama.
B3 Götz von Berlichingen Corpus
(Corpus)
Online query
TUSNELDA-XML
The Early Modern High German text "Götz von Berlichingen" was digitised by SFB 441 project B3. The original text was scanned, processed with OCR, and manually corrected. The encoding follows the TUSNELDA standards. To preserve the line numbers of the source text, we used the TUSNELDA "poem" element. Pages starting with an n=0 paragraph refer to the preceding paragraphs.
B3 Alltagserzählungen Corpus (everyday narratives)
(Sentence collection)
TUSNELDA-XML
This annotated data collection is based on recordings of monologues and their subsequent transcription. Participants were asked to talk about current events that had influenced their lives in a good or bad way. The study investigates the processing and acquisition of German coordinate structures.
B6 LexiTypeSyn
(Lexicon)
MySQL
B6 LexiTypeDia
(Lexicon)
Microsoft Access
Cognitive patterns relating to body parts, specifically in the domain of the head.
B6 Motivational Partner
(Lexicon)
A lexicon of analyzed answers to a semi-open questionnaire designed to obtain motivational partners of word+meaning stimulus units.
B6 Polysemy
(Lexicon)
A sentence collection built from a sentence generation and definition task, targeted at word sense disambiguation, that the informants performed in a set environment.
B6 Semantic Relation
(Lexicon)
A lexicon of analyzed answers to a questionnaire targeted at specifying the semantic relations between motivated and motivating stimulus units.
B8 Comic Corpus
(Corpus)
Online query
TUSNELDA-XML
The Comic Corpus belongs to the BKS corpus group. It consists of several Serbian Asteriks comic strips. Some of the Serbian comics were originally written in Cyrillic; these texts were transcribed into Latin script so that all comic texts share the same encoding. The comic texts are well suited to any type of research on imitated spoken-language phenomena. Our project's research interest lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in combination with a pointing gesture in a typical demonstratio ad oculos situation, including all extralinguistic information given by the communicative situation. For that purpose, the panels that include deictics and pointing gestures were digitised and added to the corpus.
B8 Bosnian Interviews
(Corpus)
Online query
TUSNELDA-XML
The Bosnian Interviews corpus belongs to the BKS corpus group. It contains 13 narrative interviews conducted with Bosnian refugees (Croats, Muslims and Serbs) in 1994. These texts are well suited to any type of research on Bosnian spoken-language phenomena. Our project's research interest lies in the use of the Bosnian/Croatian/Serbian v/t/n-deictics in narrative conversational situations.
B8 Novosadski Corpus of Spoken Language
(Corpus)
Online query
TUSNELDA-XML
B9 TüPoDia
(Corpus)
Online query
TUSNELDA-XML
The TüPoDia corpus contains editions of Portuguese texts, collected specifically on a historical basis. The texts were digitized to enable automatic analyses of, for example, word frequencies.
B9 BraToLi
(Corpus)
Online query
TUSNELDA-XML
The BraToLi corpus contains transcriptions of soccer match commentaries (TV and radio) as well as conversations about steering-wheel locks. Languages include Brazilian Portuguese, European Spanish (Toledo) and American Spanish (Lima).
B10 TüNeg
(Sentence collection)
Filemaker
The TüNeg database contains about 2700 entries, mostly from the same languages as the TüTeAM database and drawn from similar kinds of sources. Sentences are analysed according to the following criteria: licensing environment, different possible readings, negative polarity items involved, types of morphological negation, and the general type of negation. Where appropriate, contrasting examples have also been recorded.
B10 TüTeAM
(Sentence collection)
n/a
The TüTeAM corpus contains about 2800 entries from Ancient Greek, German, English, Italian, Hungarian, Latin, Swedish, Russian, Ukrainian and Bulgarian. The data come from various sources: linguistic literature (the "classics" on tense and aspect), fiction, and documentary evidence. Examples appear in the original script, where necessary with a transliteration, an English or German gloss, and a translation. Each example also indicates its source or gives a full bibliographic reference. Sentences are analysed according to various criteria: tense and aspect morphology, types of time adverbials, and Aktionsarten. The analysis allows targeted searches for similar phenomena in a variety of languages and makes typological regularities easier to discover.
B11 OTC
(Corpus, syntactic annotation)
Homepage
TUSNELDA-XML

Documentation
Elements
Categories
Arguments
Argument structure
OTC = Fonds Pelliot Ms. 250 / Pelliot tibétain 1287, commonly known as the Old Tibetan Chronicle, ca. mid-9th century, found in Dunhuang (Central Asia); the original is kept in the Bibliothèque Nationale de France. Chapter I (l. 1-62).
Language: Old Tibetan, ca. mid 7th – mid 11th century.
Size: 6 divisions (paragraphs), 101 sentences, 232 clauses (225 verbs), 733 tokens ('words').
Metadata
B11 TVP
(Corpus, syntactic annotation)
Homepage
TUSNELDA-XML

Documentation
Elements
Categories
Arguments
Argument structure
TVP = Die tibetische Version des Papageienbuches (the Tibetan version of the Book of the Parrot), a 15th-century adaptation of the Indian narrations of the parrot, styled as stories about previous reincarnations of Atisha and his disciples. Ca. 40% (fol. 261v1 - 268v5, pp. 43-48).
Language: Classical Tibetan, ca. 11th – 19th century.
Size: 11 divisions plus 13 subdivisions, 415 sentences, 903 clauses (849 verbs), 2669 tokens ('words').
Metadata
B11 LLV
(Corpus, syntactic annotation)
Homepage
TUSNELDA-XML

Documentation
Elements
Categories
Arguments
Argument structure
LLV = Gshamyulna bshadpavi Kesargyi sgrungs bzhugs
A Lower Ladakhi version of the Kesar saga, collected around 1900. Chapter I (pp. 1-16).
Language: contemporary Ladakhi, West Tibetan.
Size: 12 divisions, 295 sentences, 589 clauses (585 verbs), 1926 tokens ('words').
Metadata
B11 VDLV
(Lexicon with example sentences)

Extended TEI / TUSNELDA-XML

For documentation of the semantic and syntactic categories, see also:

Categories
Arguments
Argument structure
VDLV = Valency Dictionary of Ladakhi Verbs (work in progress).
Language: contemporary Ladakhi, West Tibetan. Essentially, one reference dialect for each of the two main dialect groups: Domkhar for Shamskat (speech of Lower Ladakh) and Gya for Kenkhat (speech of Upper Ladakh).
Size: 925 main entries (783 for Domkhar, 814 for Gya), 717 additional sub-entries (507 for Domkhar, 596 for Gya), 12,980 example sentences (6,252 for Domkhar, 4,492 for Gya).
Overview
B14 Sentence Connectors in Old Spanish
(Corpus)
XML
Discourse Tradition of Romance Languages and Multi-dimensional Corpus Analysis.
B14 Sentence Connectors in Sursilvan
(Corpus)
XML
Discourse Tradition of Romance Languages and Multi-dimensional Corpus Analysis.
B16 Russian-German Language Acquisition
(Corpus)
CHILDES
B17 Degree Constructions
(Sentence collection)
Database
The database presents parallel sets of data on comparison constructions from 15 languages: Bulgarian, Guaraní (an Amerindian language spoken mostly in Paraguay), Hindi, Hungarian, Japanese, Mandarin Chinese, Mooré (a Gur language), Motu (from Papua New Guinea), Romanian, Russian, Samoan, Spanish, Thai, Turkish and Yorùbá (a Kwa language). The sentences were elicited from naive informants with the help of language-specific questionnaires. The goal was an in-depth study of those languages, with a view to determining how their grammars differ so as to yield the diverse empirical picture that comparisons present across languages.
Each language set contains at most 19 examples presented in the following order: 1) a descriptive part that exemplifies the basic types of degree constructions in the given language (predicative phrasal, adverbial and attributive comparative, comparative of quantity, clausal comparative, equative, less-comparative, positive, superlative, too/enough-constructions) and gives an impression of the systematicity of degree constructions in the syntax and semantics of the language; 2) data that pertain to different aspects of cross-linguistic variation in the semantics of degree (differential comparative, comparison with a degree, 'negative island effect' test, tests for scope interactions of the comparative with modals, degree question, measure phrase construction, subcomparative).
Examples appear partly in the original script and are provided with a gloss, a translation, a grammaticality/felicity judgement and, where necessary, the context/reading. The judgement field contains felicity judgements for the scope interaction examples (supplied with the relevant contexts or readings) and grammaticality judgements for the rest. The following ranking is used in both cases: ok (grammatical/felicitous); ? (slightly marked/slightly odd); ?? (marked/odd); * (ungrammatical/infelicitous).
"n/c" and "n/a" in the judgement field indicate, respectively, that the example cannot be constructed or that the test is not applicable. In the latter case, the comment field in the footer row contains a short explanation. "n/c" and "*" rows usually contain alternative examples (Alt) along with the literal ones (Lit). The alternative examples reflect other ways of expressing the relevant meaning, e.g. in the form of paraphrases.
B18 Stative Passive
(Sentence collection)
Microsoft Excel
"Grammatik und Pragmatik des Zustandspassivs" (Grammar and Pragmatics of the Stative Passive) is a sentence collection resulting from an investigation into the meaning of the German stative passive.

Corpora from partner SFBs


Last updated on 20 March 2009