A FLEXEMIC TAGSET FOR POLISH

Adam Przepiórkowski
Polish Academy of Sciences

Morphosyntactic, or part of speech (POS), tagging is often considered an uninteresting aspect of corpus linguistics; after all, robust morphological analyzers and good-accuracy disambiguators exist for many languages, while the same cannot be said about, e.g., comprehensive computational grammars or dialogue models. Hence, morphological annotation is often considered a done deal, with much annotation work focusing on higher levels of linguistic representation (mainly syntax and, recently, semantics).

While there exist many useful and robust morphological analyzers for Polish and other Slavic languages, I argue that they are often linguistically naive, which has the practical consequence that such tools are hard to reuse.

At least the following features of many currently used tagsets for Slavic seem problematic from the point of view of linguistic theory and reusability:

- uncritical adoption of traditional and sometimes ill-defined POS classes, such as `pronoun', or vaguely delimited classes such as `verb' or `noun' (it is often not clear whether gerunds are `verbs' or `nouns' in such classifications);
- choosing POS classes and categories on the basis of a mix of morphological, syntactic and semantic criteria, e.g., gender in Slavic is sometimes defined on the basis of mixed morphosyntactic and semantic properties, and so are the classes `pronoun' and `numeral';
- mixing morphosyntactic annotation with what might be called dictionary annotation; e.g., tagsets often include tags for proper names or morphosyntactically transparent collocations, which, in our opinion, do not belong to the realm of POS annotation;
- ignoring the finer points of the morphosyntactic system of a given language, e.g., the multitude of genders in languages such as Polish, or categories such as depreciation and
accommodability;
- unclear segmentation rules (should so-called analytic tenses or reflexive verbs be treated as single units for the purpose of annotation?).

In this talk, I will present a tagset for Polish which has the following characteristics:

Segmentation:

- what is being tagged is a single orthographic word or, in some well-defined cases, a part thereof; multi-word constructions, even those sometimes considered to be morphological formations (so-called analytic forms) or dictionary entries (proper names), should be handled at a different level of processing.

Grammatical classes:

- the main criteria for delimiting grammatical classes (`POS classes') are morphological (esp., inflectional) and morphosyntactic (agreement);
- in particular, grammatical classes consist of flexemes (Bień 1991), understood as morphosyntactically homogeneous sets of forms (this notion will be explained during the talk);
- at the moment 27 grammatical classes are postulated for Polish, including:
  - classes corresponding to traditional POSs such as `noun' and `adjective',
  - more specific verbal classes such as `l-participle' (inflects for gender), `infinitive' (does not inflect) or `imperative' (inflects for number, but in a defective way), and
  - ephemeral classes such as `sobie' (the accented anaphoric pronoun; overtly inflects for case only) or `foreign' expressions.

Grammatical categories:

- a fine-grained repertoire of grammatical categories for Polish is adopted, mainly based on work by Zygmunt Saloni and his colleagues;
- apart from the relatively uncontroversial categories of number, case, person and degree, other traditional categories such as gender and aspect have been reanalysed on the basis of Saloni's work;
- some more ephemeral categories unknown to traditional linguistics are assumed, including depreciation (`profesorowie' vs. `profesory'), accentability (`go' vs. `jego'), postprepositionality (`jego' vs. `niego') and accommodability (`dwaj' vs. `dwóch').
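
As a rough illustration of the design described above, a tag can be thought of as a grammatical class plus a value for each category that the class overtly inflects for. The sketch below is only an assumption about how this might be encoded: the names CLASS_CATEGORIES, CONTRASTS and make_tag, and value labels such as "dat", are hypothetical and are not the notation of the tagset itself. The class inventory is restricted to the four classes whose inflection is stated above, and the contrast pairs are the examples given for the four ephemeral categories.

```python
# Illustrative sketch only: all names and value labels are hypothetical,
# not the actual notation of the tagset presented in the talk.

# Grammatical class -> categories its forms overtly inflect for,
# limited to the classes whose inflection is stated in the abstract.
CLASS_CATEGORIES = {
    "l-participle": ("gender",),
    "infinitive": (),
    "imperative": ("number",),
    "sobie": ("case",),
}

# Example word pairs contrasting in each of the less traditional categories.
CONTRASTS = {
    "depreciation": ("profesorowie", "profesory"),
    "accentability": ("go", "jego"),
    "postprepositionality": ("jego", "niego"),
    "accommodability": ("dwaj", "dwóch"),
}


def make_tag(grammatical_class, **values):
    """Build a tag tuple, requiring exactly the categories of the class."""
    expected = CLASS_CATEGORIES[grammatical_class]
    if set(values) != set(expected):
        raise ValueError(
            f"{grammatical_class!r} takes categories {expected}, "
            f"got {tuple(sorted(values))}"
        )
    # Keep category values in the fixed order declared for the class.
    return (grammatical_class,) + tuple(values[c] for c in expected)


if __name__ == "__main__":
    print(make_tag("infinitive"))
    print(make_tag("sobie", case="dat"))
    for category, (first, second) in CONTRASTS.items():
        print(f"{category}: {first} vs. {second}")
```

Tying the set of categories to the grammatical class, rather than allowing any category on any tag, mirrors the idea that a class is a morphosyntactically homogeneous set of forms: a tag that supplies a gender value for an infinitive is rejected outright.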

The work reported here was carried out jointly with other participants of the Polish Corpus project at the Institute of Computer Science, Polish Academy of Sciences, esp., with Marcin Woliński and Łukasz Dębowski.