Russian Corpora in Tübingen:

Regular expressions

  1. Alternation
  2. Quantifiers
  3. Wildcard
  4. Brackets
  5. Internal transformations
In regular expressions, the following metasymbols (special characters) are used:

Alternation

Several alternatives are separated by "|".

Examples

dom|izba=>"dom" or "izba"
dom|ulica|dvor=>"dom" or "ulica" or "dvor"
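
The alternation examples can be tried out with a short sketch. It uses Python's re module purely for illustration (an assumption; the corpus query engine itself may behave differently in details), and the helper name is_alternative is hypothetical.

```python
import re

# Illustration only: Python's re module stands in for the corpus engine.
ALTERNATIVES = re.compile(r"dom|ulica|dvor")

def is_alternative(word):
    """Return True if the whole word equals one of the alternatives."""
    return ALTERNATIVES.fullmatch(word) is not None

print(is_alternative("dom"))    # True
print(is_alternative("ulica"))  # True
print(is_alternative("izba"))   # False
```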

Quantifiers

The symbols ?, + and * are used as quantifiers. They modify the preceding symbol; if preceded by brackets, they apply to the whole bracketed text.
?=>0 or 1 occurrence.
+=>1 or more occurrences
*=>0 or more occurrences

Examples

doma?=>"dom" or "doma"
doma+=>"doma", "domaa" etc.
doma*=>"dom", "doma", "domaa" etc.
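
As a minimal sketch, the three quantifiers can be compared on the same word list. Python's re module is assumed here only as a stand-in for the corpus engine, and the helper whole_matches is a hypothetical name.

```python
import re

# Candidate word forms to test each pattern against.
WORDS = ["dom", "doma", "domaa"]

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

print(whole_matches(r"doma?", WORDS))  # ['dom', 'doma']
print(whole_matches(r"doma+", WORDS))  # ['doma', 'domaa']
print(whole_matches(r"doma*", WORDS))  # ['dom', 'doma', 'domaa']
```

In each case the quantifier applies only to the final "a", not to the whole word.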

Special quantifiers

The number of occurrences may be specified explicitly with the following symbols:
{n}=>Match n occurrences
{n,}=>Match at least n occurrences
{n,m}=>Match at least n, but no more than m occurrences
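
Examples

The counted quantifiers can be illustrated in the same way. This is a sketch using Python's re module (an assumption, not the corpus engine), with a hypothetical helper counted_matches.

```python
import re

# Word forms with zero to four trailing "a" characters.
FORMS = ["dom", "doma", "domaa", "domaaa", "domaaaa"]

def counted_matches(pattern):
    """Return the forms that the pattern matches in full."""
    return [w for w in FORMS if re.fullmatch(pattern, w)]

print(counted_matches(r"doma{2}"))    # ['domaa']
print(counted_matches(r"doma{2,}"))   # ['domaa', 'domaaa', 'domaaaa']
print(counted_matches(r"doma{1,2}"))  # ['doma', 'domaa']
```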

Wildcard

The dot "." matches any single character. Several characters in square brackets "[]" (a "character class") match any one of these characters. If the first character in the square brackets is "^", the class matches any one character except those listed. A range of consecutive characters may be written with "-"; thus "[a-z]" matches any lowercase letter.

Examples

[abc]=>"a", "b" or "c"
[^abc]=>any character except "a", "b" or "c"
With quantifiers:
[a-z]?=>0 or 1 lowercase letter
.*=>0 or more occurrences of any character
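
The character classes above can be checked with a small sketch. It assumes Python's re module as a stand-in for the corpus engine; note that the corpus interface rewrites the dot before searching (see "Internal transformations"), which this sketch does not model. The helper full is a hypothetical name.

```python
import re

def full(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(full(r"[abc]", "b"))     # True
print(full(r"[abc]", "d"))     # False
print(full(r"[^abc]", "d"))    # True
print(full(r"[^abc]", "a"))    # False
print(full(r"[a-z]?", ""))     # True: zero lowercase letters
print(full(r"[a-z]?", "x"))    # True: one lowercase letter
print(full(r"[a-z]?", "X"))    # False: uppercase is outside a-z
print(full(r".*", "domami"))   # True
```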

Brackets

The scope of quantifiers and alternatives is delimited by brackets "()". Alternatives inside brackets are confined to the bracketed text; a quantifier following brackets applies to the whole of the bracketed text.

Examples

dom(ami|om)=>"domami" or "domom"
domami|om=>"domami" or "om"
domami?=>"domami" or "domam"
dom(ami)?=>"dom" or "domami"
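
The effect of brackets on scope can be seen by running all four patterns against the same word forms. This is a sketch with Python's re module (an assumption; the corpus engine is not Python), and the helper scope_matches is a hypothetical name.

```python
import re

# Word forms used to compare bracketed and unbracketed patterns.
WORD_FORMS = ["dom", "domami", "domom", "om", "domam"]

def scope_matches(pattern):
    """Return the word forms that the pattern matches in full."""
    return [w for w in WORD_FORMS if re.fullmatch(pattern, w)]

print(scope_matches(r"dom(ami|om)"))  # ['domami', 'domom']
print(scope_matches(r"domami|om"))    # ['domami', 'om']
print(scope_matches(r"domami?"))      # ['domami', 'domam']
print(scope_matches(r"dom(ami)?"))    # ['dom', 'domami']
```

Without brackets, "|" splits the entire pattern and "?" affects only the single preceding character.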

Internal transformations

If a Cyrillic encoding has been selected for output, it can also be used in the search string, whether the string consists of simple words or regular expressions. However, before the actual search is performed, all search strings are passed to a transliteration routine and converted into our Latin transliteration (cf. Transliteration table). In addition, the dot "." is replaced by the string "[a-zA-Z]" so that it matches only parts of words. The transformed search string is displayed to keep the operation transparent. In some cases, the transformation may alter the effects of quantifiers or character classes.
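
The dot substitution can be approximated with a rough sketch. It assumes a plain textual replacement of every "." by "[a-zA-Z]"; the actual routine is not documented here and may work differently, the transliteration step is not modeled at all, and the function name restrict_dot is hypothetical.

```python
import re

def restrict_dot(query):
    """Hypothetical stand-in: replace every '.' in the query with '[a-zA-Z]'."""
    return query.replace(".", "[a-zA-Z]")

transformed = restrict_dot("d.m")
print(transformed)                             # d[a-zA-Z]m
print(bool(re.fullmatch(transformed, "dom")))  # True: "o" is a letter
print(bool(re.fullmatch(transformed, "d m")))  # False: a space is not a letter

# Under this assumption, a dot inside a character class is rewritten too,
# which changes what the class means.
print(restrict_dot("[a.]"))                    # [a[a-zA-Z]]
```

Under this assumption, a dot placed inside square brackets no longer stands for a literal dot after the transformation, which is one way the effect of a character class could change.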
Michael Betsch
Last modified: Mon Jul 26 09:34:59 MET DST