Russian Corpora in Tübingen

One aim of project B1 in SFB 441 is to provide online search access to Russian text corpora. We provide both a simple query interface and a complex query interface. The latter offers more corpora and advanced search options. If you have other Russian texts available in electronic form, we would be very pleased if you placed them at our disposal so that we can make them accessible through this online query project (for research purposes only).

Encodings

You may choose between three encodings of the Russian texts for input and output: our Latin transliteration (cf. our transliteration table) and two Cyrillic encodings (KOI8 and Windows CP-1251). Even if you choose a Cyrillic encoding for output, you may still enter your search string in the Latin transliteration.
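To illustrate why the choice of Cyrillic encoding matters, the following minimal Python sketch converts the same word between the two encodings. It assumes that "KOI8" refers to KOI8-R (Python codec `koi8_r`); the helper name `recode` is hypothetical and is not part of the query interface.

```python
# Sketch only: shows that one Russian word has different byte
# representations in the two Cyrillic encodings offered for input/output.
# Assumption: "KOI8" means KOI8-R (Python codec name "koi8_r").

def recode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)

word = "язык"
koi8_bytes = word.encode("koi8_r")
cp1251_bytes = word.encode("cp1251")

# The byte sequences differ although both represent the same text,
# so a client has to know which encoding the server is using.
print(koi8_bytes != cp1251_bytes)

# Converting from one encoding to the other yields the expected bytes.
print(recode(koi8_bytes, "koi8_r", "cp1251") == cp1251_bytes)
```

Text saved in one encoding and read as the other appears as unreadable characters, which is why the encoding chosen in the interface has to match the one your browser or editor uses.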

Simple Query Interface

Via the Simple Query Interface, you can access the Uppsala Corpus of Modern Russian and a growing corpus of Russian interview texts.

Complex Query Interface

Via the Complex Query Interface, you can access the Uppsala Corpus and the Corpus of Interviews, as well as several other corpora. These corpora fall into three categories:

Contemporary Corpora: predominantly press texts. The Uppsala Corpus belongs to this category, and its literary and press contents can each be searched separately.
20th Century Literature
19th Century Literature

The Complex Query Interface uses the corpus processor CQP developed by the Institute for Natural Language Processing of the University of Stuttgart (further information).

Morphologically tagged corpora

Several corpora are available with morphological annotation (tagging). The tagging was performed automatically by a statistical tagger (TnT by Thorsten Brants). Tags can be displayed, and searches can target tags as well as word forms.
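As a rough illustration of how a search can combine a word form with a tag, the sketch below builds a single-token query in CQP's bracket syntax. The attribute name `pos`, the tag pattern `N.*` and the transliterated word pattern `knig.*` are assumptions made for the example; the attribute names and tagset actually used for these corpora are not specified on this page. The function `token_query` is hypothetical.

```python
# Sketch only: builds a CQP-style query string for one token.
# Assumptions: the word-form attribute is called "word" and the tag
# attribute is called "pos"; the real corpora may use other names.

def token_query(word=None, tag=None, word_attr="word", tag_attr="pos"):
    """Return a CQP token expression constraining word form and/or tag.

    Both values are treated as regular expressions by CQP.
    """
    for value in (word, tag):
        if value is not None and '"' in value:
            # Quoting rules are left out of this sketch on purpose.
            raise ValueError("double quotes are not supported in this sketch")

    conditions = []
    if word is not None:
        conditions.append(f'{word_attr}="{word}"')
    if tag is not None:
        conditions.append(f'{tag_attr}="{tag}"')

    if not conditions:
        return "[]"  # an empty token expression matches any token
    return "[" + " & ".join(conditions) + "]"

# Hypothetical example: word forms starting with "knig" that carry a tag
# starting with "N".
print(token_query(word="knig.*", tag="N.*"))
# prints [word="knig.*" & pos="N.*"]
```

Leaving out `word` gives a tag-only search, and leaving out `tag` gives a plain word-form search, which corresponds to the two search modes described above.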

The Uppsala Corpus

The Uppsala Corpus of modern Russian texts was developed at the Department of Slavic Studies at Uppsala University, Sweden, under the direction of Prof. Lennart Lönngren, who gave us permission to use the Uppsala Corpus for this online query project. All rights regarding the Uppsala Corpus belong to the author. Corpus data may be used for research purposes only; commercial use of the corpus is prohibited. A short description of the Uppsala Corpus can be found here. A frequency dictionary based on the Uppsala Corpus has been compiled (Lönngren, Lennart (ed.), 1993. Chastotnyj slovar' sovremennogo russkogo jazyka. Uppsala).

The Corpus of Interviews

This growing corpus of Russian interview texts is collected and annotated by staff from project B1 (Anja Gattnar, Sebastian Bücking and Jennifer Haberhauer). The interviews are taken from the following Russian newspapers, which are freely available online: Argumenty i Fakty, Argumenty i Fakty Vladivostok, Art Peterburga, Ogonek, Otdyxaj, Psixologicheskaja Gazeta, Pjat' Uglov, Ptchela, Segodnja, Strannik, Vasha Gazeta, Vedomosti, Vestnik. The corpus includes interviews from 1996 to the present. The topics covered are politics and society, economy, music, literature, lifestyle and sports.

Written by Anja Gattnar, last modified by Michael Betsch on 30 August 2004