Russian Corpora in Tübingen
One aim of project B1 in SFB 441 is to provide access
to Russian text corpora for online search.
We provide both a simple query interface and a complex query interface.
The latter offers more corpora and advanced search options.
If you have other texts available in electronic form, we would be very
pleased if you placed them at our disposal so that they can be made
accessible through this online query project (used for research purposes only).
Encodings
You may choose among three encodings of the Russian texts for input and output:
our Latin transliteration (cf. our transliteration table)
and two Cyrillic encodings (KOI8 and Windows CP 1251). If you choose
a Cyrillic encoding, you may nevertheless enter your search string
in the Latin transliteration.
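The relationship between the two Cyrillic encodings can be illustrated with a short Python sketch. It assumes that "KOI8" refers to the KOI8-R variant (Python codec `koi8_r`) and that Windows CP 1251 corresponds to the codec `cp1251`. The function name `recode` and the sample word are illustrative only; nothing here is taken from the project's own software.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode Cyrillic text bytes from one encoding to another."""
    # Decode to Unicode first, then encode in the target encoding.
    return data.decode(source).encode(target)


# Sample word (illustrative): "язык" ("language").
word = "язык"

koi8_bytes = word.encode("koi8_r")    # assumption: KOI8 means KOI8-R
cp1251_bytes = word.encode("cp1251")  # Windows code page 1251

# The same text has different byte sequences in the two encodings ...
assert koi8_bytes != cp1251_bytes
# ... but converting between them recovers the same text.
assert recode(koi8_bytes, "koi8_r", "cp1251") == cp1251_bytes
assert recode(cp1251_bytes, "cp1251", "koi8_r").decode("koi8_r") == word
```

Going through Unicode as the intermediate form avoids a hand-written byte mapping table between the two encodings.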
Simple Query Interface
Via the Simple Query Interface, you can access the Uppsala Corpus of
Modern Russian and a
growing corpus of Russian interview texts.
Complex Query Interface
Via the Complex Query Interface, you can access the Uppsala Corpus and
the Interview Corpus, as well as several other corpora. These
corpora fall into three categories:
- Contemporary Corpora - predominantly press
texts. The Uppsala Corpus belongs to this category, and
its literary and press contents can each be searched
separately.
- 20th Century Literature
- 19th Century Literature
The Complex Query Interface uses the corpus processor CQP
developed by the Institute for Natural Language Processing
of the University of Stuttgart
(further information).
Morphologically tagged corpora
Several corpora are available with morphological annotation
(tagging). Tagging was performed with a statistical
tagger (TnT by Thorsten Brants). Tags can be displayed,
and searches can target tags as well as word forms.
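A search that combines word forms and tags might be composed along the following lines. This is a sketch in Python that only builds a query string in CQP's token syntax (`[attribute="regex"]`, with conditions joined by `&`). The attribute names `word` and `pos` and the tag pattern `N.*` are assumptions, since the attribute names and tagset of these corpora are not given here, and the helper `token_query` is hypothetical.

```python
def token_query(**conditions: str) -> str:
    """Build a CQP token expression such as [word="..." & pos="..."].

    Keys are attribute names, values are regular expressions.
    """
    parts = []
    for attribute, pattern in conditions.items():
        if '"' in pattern:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError("pattern must not contain double quotes")
        parts.append(f'{attribute}="{pattern}"')
    return "[" + " & ".join(parts) + "]"


# Word form only: any form beginning with "jazyk" (Latin transliteration).
print(token_query(word="jazyk.*"))             # [word="jazyk.*"]

# Word form and tag combined; "pos" and "N.*" are assumed names, not the
# actual attribute or tagset of the corpora described here.
print(token_query(word="jazyk.*", pos="N.*"))  # [word="jazyk.*" & pos="N.*"]
```

The resulting strings would still have to be entered in the query interface; the sketch does not contact any server.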
The Uppsala Corpus of modern Russian texts was developed at the Department
of Slavic Studies at Uppsala University, Sweden, under the direction
of Prof. Lennart Lönngren, from whom we obtained permission to
use the Uppsala Corpus for this online query project. All rights regarding
the Uppsala Corpus belong to the author. Corpus data may be used for research
purposes only; commercial use of the corpus is prohibited. A short description
of the Uppsala Corpus can be found here.
A frequency dictionary was compiled on the basis of the Uppsala Corpus
(Lönngren, Lennart (ed.), 1993. Chastotnyj slovar' sovremennogo
russkogo jazyka. Uppsala).
The growing corpus of Russian interview texts is collected and annotated
by staff of project B1 (Anja Gattnar, Sebastian Bücking and
Jennifer Haberhauer). The interviews are taken from the following
Russian newspapers that are freely available online: Argumenty i Fakty,
Argumenty i Fakty Vladivostok, Art Peterburga, Ogonek, Otdyxaj,
Psixologicheskaja Gazeta, Pjat' Uglov, Ptchela, Segodnja, Strannik,
Vasha Gazeta, Vedomosti, Vestnik. The corpus includes interviews from
1996 to the present. The topics covered are politics and society,
economy, music, literature, lifestyle and sports.
Written by Anja Gattnar,
last modified by Michael Betsch
on 30 August 2004