Sonderforschungsbereich 441

International Conference on

LINGUISTIC DATA STRUCTURES

University of Tübingen, February 22-24, 2001


Abstracts 

Steven Abney
Bootstrapping

There is recent promising work on algorithms for "bootstrapping", in the sense of a learning task in which one has a small amount of labelled data and a large amount of unlabelled data. Bootstrapping represents an interesting hybrid between supervised and unsupervised learning. I will give a very brief introduction to bootstrapping algorithms; why they are particularly interesting for language; and one particular proposal of my own for applying AdaBoost (a successful supervised-learning algorithm) to bootstrapping.

Fabrizio Arosio & Graham Katz
Structure Variety in the Linguistic Representation of Temporal Information

In this talk we present a language-neutral method for annotating sentence-internal temporal relations that are encoded by tenses and other temporal expressions. We have applied this annotation method to sentences from a variety of languages, creating a searchable multi-language database. We expect that, when expanded, this database will be invaluable to the cross-linguistic theoretical investigation of temporal semantic phenomena. Traditionally there has been a tradeoff between the semantic sophistication of an analysis of temporal reference and its typological generality. In recent years, however, increasingly sophisticated semantic analyses have been proposed for a variety of languages. We will sketch recent treatments of tense semantics and indicate some of the issues that become relevant in a cross-linguistic setting. Of particular interest to us is the contrast between the temporal interpretation of complement and adjunct clauses. One of the difficulties we face, however, is that the semantic judgements upon which theories of temporal reference are based are quite subtle and, typically, can only be made by native speakers. How, then, can we engage in sophisticated and complex cross-linguistic semantics without constant access to native speakers of many languages? Our annotated database is a first step towards a solution to this problem.
We will show how our method of semantic annotation is applied to matrix and embedded tensed clauses. These annotations indicate the temporal information encoded in a sentence and, therefore, can be provided with well-defined model-theoretic interpretation.

Tilman Berger, Michael Betsch & Bernhard Brehmer
Address Systems and Politeness - Independent or Interdependent? A Study Based on Russian and Czech Data

This talk will deal with the relationship between address usage and politeness in Slavonic languages. Most research has focused on only one of these phenomena and thus has not addressed the issue of their possible interdependency. Some researchers have treated address as a part of politeness; others have treated the two phenomena as mutually independent (see Kasper's (1990) distinction between indexical (e.g. address) and strategic politeness). Corpus studies provide empirical evidence for some but not all of the issues involved. We will argue that corpus analyses are especially valuable for studies of less common or peripheral forms, but that the study of central forms also needs other methodologies; we will support our corpus analyses with questionnaire studies using magnitude estimation. We conclude that the degree of interdependency between politeness and address usage is specific to any given language. If we consider a scale of interdependency, then Russian occupies a rather extreme position (with a great interdependency between address and politeness), while Czech shows less interdependency but is not too far from Russian. (The other extreme, with address and politeness almost independent of each other, is represented e.g. by English.)

Angela Dorn, Boštjan Dvorák & Wiltrud Mihatsch
An Onomasiological Database for Lexical Diachronic Research

The study of lexical polygenesis requires an analysis of various complex and interacting data types. The aim of this talk is to elucidate the nature of these data types, their extraction, storage and interpretation. Our project studies the cross-linguistic paths of lexical change in the conceptual area of body parts and interprets them in the light of related observations in other linguistic and extralinguistic domains such as child language acquisition or perceptual psychology. The integration of such varied data types demands a good theoretical foundation. The second data problem is the genuinely lexical analysis of diachronic patterns. This search for (poly-)genetic paths alone is a highly intricate enterprise that includes the choice of a representative sample of languages, the extraction of semantic, morphological and syntactical properties of the source and target lexemes as well as the semantic associations between them from dictionaries. The extracted information then has to be structured and stored in an appropriate database. In conclusion, we will show how these elaborately filtered data can then contribute to a sound description and explanation of polygenesis as well as serve other kinds of lexicological research.

Hans-Bernhard Drubig
On the Syntactic Form of Epistemic Modality

Recent work on the semantics of modality (Westmoreland 1998) challenges commonly held views on the meaning of modals in English and other languages by showing that modals like must and may/might are not versions of the modal logical necessity and possibility operators, but must rather be analyzed as evidential markers labelling informative propositions as deductions and plausible assumptions, respectively. This paper addresses the syntactic aspects of the issue and attempts to demonstrate that syntactic evidence backs up the view that epistemically labelled propositions (must φ, may/might φ) are not propositions themselves and hence are inaccessible to propositional operators. The paper discusses polarity (negative/interrogative), tense (sequence of tense/past tense replacement effects) and focus phenomena ((de)accentuation/ellipsis) and offers support for an approach in which epistemic and other evidential modals have a distinct syntactic representation. A brief look at what is known about the typology of evidential systems yields additional supportive evidence on the basis of comparative data.

Sam Featherston
Object Co-reference in German: New Data Solves Old Problems But Requires New Theory

While object coreference is quite feasible in English, the equivalent German construction is marginal (1). Additionally, there is little difference in grammaticality between a pronoun and a reflexive here (1a vs. 1b), a problem for binding theory (e.g. Reis 1976, Grewendorf 1985).
(1) a. ?Der Friseur zeigte Paul_i sich_i im Spiegel
the barber showed Paul himself in.the mirror
b. ?Der Friseur zeigte Paul_i ihn_i im Spiegel
the barber showed Paul him in.the mirror
One factor which has obstructed progress on this question is the lack of agreement on the data: each author makes different claims about which structures are grammatical. Another is that some grammatical changes not obviously connected to binding seem to produce strong differences. A common suggestion has been that bound elements must be lower on an obliqueness hierarchy than their antecedents (Grewendorf 1985, Primus 1987).
In order to clarify this situation we applied the methodology of magnitude estimation (Bard et al 1996) to carefully matched sets of materials. This experimental approach captures very fine grammaticality differences and produces hard, replicable data. We tested in sixteen conditions, varying the four binary factors Case Order (Dat>Acc, Acc>Dat), Anaphor (reflexive, pronoun), Antecedent (full NP, pronoun), and Selbst (with or without reflexivity marker selbst). The results revealed clear effects for all of these factors: each had a preferred and a dispreferred value, and the sentences showed a continuum of grammaticality. Each dispreferred value was associated with a cumulative degradation in grammaticality. The data does not support the obliqueness hierarchy approach: dative binders were consistently better than accusatives. Binding Principles A and B did not appear adequate: reflexives and pronouns were not in complementary distribution. Neither will classic Optimality Theory (OT) capture these relationships without major revisions. We show that the experimental factors can be reduced to a range of grammatical constraints on structure and binding commonly assumed in the literature and argue that the most adequate way of allowing these constraints to interact is in a constraint weighting model. We finish with some remarks about the features of this data type and its relationship with syntactic theory (see also Keller 2000). In particular it strongly supports constraint violability within grammar, and the division of the constraint application and output selection functions in a grammar framework (contra OT).
Bard E., Robertson D. & Sorace A. 1996. Magnitude Estimation of Linguistic Acceptability. Language 72 (1), 32-68
Grewendorf G. 1985. Anaphern bei Objekt-Koreferenz im Deutschen: Ein Problem für die Rektions-Bindungs-Theorie. In: Abraham W. (ed) Erklärende Syntax des Deutschen, 137-71. Tübingen: Narr
Keller F. 2000. Gradience in Grammar: Experimental and Computational Aspects of Degrees of Grammaticality. PhD Thesis, University of Edinburgh
Primus B. 1987. Grammatische Hierarchien. Studien zur Theoretischen Linguistik 7. Munich: Wilhelm Fink
Reis M. 1976. Reflexivierung in deutschen A.c.I.-Konstruktionen. Ein transformations-grammatisches Dilemma. Papiere zur Linguistik 9, 5-82
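
The experimental design and the cumulative effect described in the abstract above can be illustrated with a small sketch. It enumerates the sixteen conditions produced by four binary factors and scores each one by adding up a penalty for every dispreferred value. The linear additive form and all numeric weights are assumptions made for illustration only; the abstract reports neither weights nor the exact form of the constraint weighting model, and apart from case order it does not say which value of each factor is the dispreferred one, so the sketch only records whether a value is dispreferred.

```python
from itertools import product

# The four binary factors named in the abstract.
FACTORS = ("case_order", "anaphor", "antecedent", "selbst")

# Hypothetical penalty weights (illustration only; not taken from the study).
WEIGHTS = {"case_order": 1.0, "anaphor": 0.5, "antecedent": 0.75, "selbst": 0.25}


def conditions():
    """Return all 2**4 = 16 conditions; True marks a factor's dispreferred value."""
    return [
        dict(zip(FACTORS, combo))
        for combo in product((False, True), repeat=len(FACTORS))
    ]


def penalty(condition, weights=WEIGHTS):
    """Cumulative penalty: the sum of the weights of all dispreferred values."""
    return sum(weights[f] for f, dispreferred in condition.items() if dispreferred)


def ranked():
    """Conditions ordered from least to most degraded under the assumed weights."""
    return sorted(conditions(), key=penalty)
```

Under these assumed weights, a condition with two dispreferred values is penalised by the sum of both weights, which yields a graded continuum rather than a binary grammatical/ungrammatical split.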

Stefanie Herrmann & Stephan Kepser
A Broader View on the Notion of a Linguistic Datum: Formal Framework and Case Study from Warao

In current linguistic discussions, linguistic data is very often presented without any context. Example sentences used to justify or refute certain claims stand almost naked. The problem may not be that clear when discussing Western European languages, because linguists can feel "at home" here. But it becomes apparent when the language to be analysed is typologically very different or stems from a remote culture. We argue that the notion of a linguistic datum should be broadened considerably to take into account the environmental, cultural, sociological and circumstantial context of a linguistic event. To do so we sketch a formal framework and exemplify its use with the analysis of ethno-linguistic data from Warao. Object-orientation provides the basis for formalising the multidimensionality and multimediality of data in a rather natural fashion. We also show how a correct linguistic interpretation of the system of deictics in Warao emerges only when non-linguistic information is respected while gathering the data.

Hanneke van Hoof
On the Meaning and Information Structure of the Rise-Fall Intonation Contour

In this talk, we will provide a phonological description of two variants of the Rise-Fall intonation contour. Following Buering (1997), we will demonstrate that both variants of the RF-contour cause the sentence carrier to have some specific semantico-pragmatic properties. However, we will argue against Buering (1997) and Jacobs (1997), who stated that the informational status of the constituent associated with the rise is always a topic. It will be shown that by far most RF-utterances in German and Dutch can (also) be analyzed in terms of a multiple focus structure. We will argue that the informational analysis proposed in the talk provides more interesting perspectives for a general theory of information structure.

Konstanze Jungbluth
Data and Deictics: Demonstratives in Spanish, Portuguese and Catalan

This paper has three aims:
1. Most data available in current corpora are underdetermined for investigations into language use in oral discourse, as they lack situational and other contextual information. We present here an outline of our corpus BraToLi (European and American Spanish and Brazilian Portuguese), which combines transcriptions with visual representations (photos, video clips, drawings etc.), thus documenting situated data for work on deixis.
2. We introduce the three-term system of Spanish demonstratives (este - ese - aquel). The use of the demonstratives reflects an autonomous conceptualization of space which cannot be reduced to the conceptualizations usually actualized for example in English. Going beyond the traditional person-oriented or distance-oriented systems, I propose a dyad-orientation. The dyad of conversation recognizes the hearer and the speaker as equally important participants in communication. Not only face-to-face situations but also face-to-back and side-by-side conversations are investigated.
3. The three different heirs of the former Latin three-term-system (Spanish, Portuguese & Catalan) show three different developments: use in all varieties in Spanish (este - ese - aquel), diatopical variation in Catalan (two different two-term-systems: aquest - aquell versus aqueix - aquell), variation between orality (two-term-system: esse - aquele) and literacy (three-term-system: este - esse - aquele) in Brazilian Portuguese.

Mary Kato
Generative Grammar and Variation Methodology: A Happy Marriage in the Description of Brazilian Portuguese

Most generative works dispense with empirical data from E-language, while most variationist research ignores formal theories of grammar, which deal with I-language. Using the Brazilian experience in diachronic analysis, spoken language description and comparative grammar, I will show the advantages of combining both approaches in analyzing real corpora. While the Principles and Parameters theory can provide us with predictions as to what correlations to find, the empirical analysis can reveal new, previously unaccounted-for correlations. Formal grammars help the researcher to constrain the universe of factors to be considered and to formulate more explicit questions. The variationist method, in its turn, helps us separate syntax from other modules of the language faculty and other domains of the mind, and also distinguish innate principles from socially established factors.

Anthony Kroch
Statistical Regularities in Grammatical Change: Evidence From Diachronic Corpora

When informed by grammatical theory, the statistical study of diachronic corpora yields stable and repeatable principles by which syntactic competence is reflected in performance data. One of these principles is the Constant Rate Effect, which by now has been replicated in several data sets. The CRE shows that the grammatical basis of parametric change can be detected in usage, even where the change, as recorded in texts, is gradual and extends over many generations. Another, less well-known, quantitative pattern is the statistical independence of grammatically independent processes. Such independence has repeatedly appeared in quantitative investigations of structurally ambiguous sentences, allowing us to follow the progress of grammatical changes even in surface linguistic contexts where we cannot specify the proper structural analysis of individual sentences. We will present evidence for this principle from several recent quantitative studies in the history of English and briefly discuss its implications for linguistic theory.

Sandra Kübler, Ilona Steiner & Erhard Hinrichs
Automatic Annotation and Querying of German Treebanks

We will present the robust parsing system TüSBL for the automatic annotation of a German treebank and a query tool for searching in treebanks for complex syntactic structures.
TüSBL (Tübingen Similarity Based Learning) is based on the finite-state parser CASS by Steven Abney. In order to extend the partial chunk structures into complete trees including function-argument structure, TüSBL uses a similarity-based learning approach. The instance base consists of the Tübingen German treebank, which contains ca. 38 000 syntactically annotated sentences.
TüSBL's output, as well as the original German treebank, can then be searched with the query tool VIQTORYA (a VIsual Query Tool fOr sYntactically Annotated corpora). VIQTORYA uses a query language that allows users to search for complex syntactic structures including tokens, syntactic categories, grammatical functions and binary relations of (immediate) dominance and linear precedence between nodes. To ensure efficient searching, VIQTORYA uses an indexing system based on a relational database. A visual user interface is provided for specifying the queries in a user-friendly way.
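
The kind of structural query described above can be sketched on a toy tree. This is only an illustration of the (immediate) dominance and linear precedence relations; it is not VIQTORYA's query language or its relational index, neither of which is specified in the abstract. The `Node` class, the helper functions, the example sentence and the category labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """A node in a toy syntax tree (labels are illustrative only)."""
    cat: str                      # syntactic category or part-of-speech tag
    token: Optional[str] = None   # set only on leaf nodes
    children: List["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node properly dominated by `node`, in pre-order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def dominates(a: Node, b: Node) -> bool:
    """True if `a` properly dominates `b` (identity, not equality)."""
    return any(d is b for d in descendants(a))


def leaves(node: Node) -> List[Node]:
    """Leaf nodes under `node`, left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def precedes(root: Node, a: Node, b: Node) -> bool:
    """True if every leaf under `a` comes before every leaf under `b`."""
    order = {id(leaf): i for i, leaf in enumerate(leaves(root))}
    last_a = max(order[id(leaf)] for leaf in leaves(a))
    first_b = min(order[id(leaf)] for leaf in leaves(b))
    return last_a < first_b


def build_example() -> Node:
    """Toy tree for 'der Friseur zeigte Paul'."""
    np_subj = Node("NP", children=[Node("ART", "der"), Node("NN", "Friseur")])
    verb = Node("VVFIN", "zeigte")
    np_obj = Node("NP", children=[Node("NE", "Paul")])
    return Node("S", children=[np_subj, verb, np_obj])


def nps_before_finite_verb(root: Node) -> List[str]:
    """NPs immediately dominated by an S that precede a finite verb in that S."""
    hits = []
    for s in [root, *descendants(root)]:
        if s.cat != "S":
            continue
        for np in s.children:  # children = immediate dominance
            if np.cat == "NP" and any(
                v.cat == "VVFIN" and precedes(root, np, v) for v in s.children
            ):
                hits.append(" ".join(leaf.token for leaf in leaves(np)))
    return hits
```

Calling `nps_before_finite_verb(build_example())` returns only the subject NP, since the object NP follows the verb. A real system would answer such queries from a precomputed index rather than by walking each tree.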

James A. Matisoff
Phonosemantic Problems in Sino-Tibetan/Tibeto-Burman Bodypart Etymologies

More attention has been paid to the area of bodypart nomenclature than to any other lexicosemantic field in Sino-Tibetan/Tibeto-Burman. The Sino-Tibetan Etymological Dictionary and Thesaurus project (STEDT) has concentrated on bodypart terms from the beginning (1987 to the present), and these comprise a large proportion of STEDT's 350,000-record database. The reconstructed bodypart lexicon is the subject matter of the first printed volume of STEDT, to be published this year. The basic parameters of TB bodypart semantics were laid out in my book "Variational Semantics in Tibeto-Burman" (Philadelphia, 1978), with the key notions being those of variation on both the phonological and semantic planes. This variation is not random, but operates according to a peculiar logic of its own.

Jochen Raecke
ovo-to-ono: Demonstratives and More

From a purely morphological point of view ovo, to, ono are the neuter forms of the Bosnian/Croatian/Serbian demonstratives ovaj, taj, onaj. It is therefore not surprising that they are used in just the same way as all the other inflected forms of ovaj, taj, onaj, though ovo, to, ono show peculiarities when used as endophorics which are not shared by other forms of ovaj, taj, onaj. Beyond that we find ovo, to, ono in a wide range of uses which have not yet been described in linguistic research and which can only partly be explained by treating them as forms of the demonstratives. Some of these various uses will only be demonstrated, while their functioning as particles will be discussed in more detail.

Marga Reis
Do German Modal Verbs Form a Syntactic Class? - Theoretical and Empirical Considerations

The prototypical modal verbs (MV) in German (können, müssen, dürfen, mögen, sollen, wollen, (brauchen, werden)) exhibit correlating syntactic-semantic properties: Syntactically, they always construe coherently, as a rule taking bare infinitival complements; semantically, they are 'polyfunctional', i.e. all of them may have not only circumstantial but also epistemic and/or evidential uses. If the (synchronic, diachronic, ontogenetic) rise of epistemic/evidential meanings is considered as central an explanandum of modal research as it usually is, then this correlation suggests an intriguing answer: coherent viz. bare infinitival construction of MVs may be the (only) necessary syntactic ingredient for giving rise to polyfunctionality. This is the central hypothesis investigated in our SFB-project on modality and MVs in German (B3) in synchronic, diachronic, ontogenetic (and some comparative) respects.
Now, traditional as well as modern accounts proceed quite differently: polyfunctionality is seen to have its basis in the modal items themselves, with the various uses corresponding to different degrees of 'auxiliarization'. In recent generative approaches this has almost invariably led to treating MVs as classes of specific functional elements that correspond to two functional MOD projections, a lower one hosting the MVs in circumstantial use, an appropriately high one hosting MVs in epistemic/evidential use. If so, MVs form syntactic classes, their polyfunctionality is a case of syntactic class ambiguity, and the major differences in behaviour between circumstantial and epistemic/evidential modals are claimed to reflect these syntactic class differences. In my talk I will try to show that there is no evidence whatever to support such an analysis for German MVs: as for bona fide syntactic regularities, MVs (in circumstantial as well as epistemic/evidential use) behave just like full verbs, whereas, on closer inspection, the major differences in behaviour between the various uses turn out to be nonsyntactic in nature. Thus a syntactic class account of polyfunctionality seems unfounded, at least for German. This leaves the properties peculiar to the infinitival construction of MV on which to base such an account. Since it can be shown that an account in terms of 'orientation' differences (control vs. raising, different forms of raising) also fails, we are practically left with the coherence hypothesis as outlined above. Bringing to bear various types of data and results achieved in our project work, I will refine this hypothesis and provide some positive evidence for its being correct, which suggests keeping it as our guideline in future work.

Andreas Wagner
The TUSNELDA Annotation Standard: An XML Encoding Standard for Multilingual Corpora Supporting Various Aspects of Linguistic Research

The talk presents the TUSNELDA annotation standard, a corpus encoding standard that was developed in SFB 441, and provides examples of its usage.
Several SFB projects are building corpora to empirically investigate various linguistic phenomena in various languages. These corpora will form the components of the "Tuebingen collection of reusable, empirical, linguistic data structures (TUSNELDA)".
The TUSNELDA annotation standard aims at providing a uniform encoding scheme for all subcorpora and texts of TUSNELDA such that they can be processed with uniform, standardized tools. To guarantee maximal reusability we use SGML / XML for encoding. Previous SGML standards for text encoding were provided by the Text Encoding Initiative (TEI) and the Expert Advisory Group on Language Engineering Standards (Corpus Encoding Standard, CES). The TUSNELDA standard is based on TEI and CES but takes into account the specific needs of the SFB projects, i.e. the peculiarities of the examined linguistic phenomena.


For questions or comments please write to Roland Meyer