Discourse Annotation for Citation Mapping, Information Retrieval and Summarisation
Simone Teufel
In this talk, I will motivate why discourse annotation in scientific corpora is attractive for
many applications requiring more text understanding than most current linguistic annotation
can support. Summarisation is such an application: if we knew more about the discourse
structure in a scientific text, it would be easier to produce a summary with a logical structure
(e.g., the first extracted sentence giving the motivation, the second the research goal, the third
a high-level description of the methodology used, and the fourth the results and conclusions).
Discourse information could also help in information retrieval - for instance, if we
could recognise sections where other people's ideas are described, these sections could be
excluded from indexing, or they could be treated differently from sections which
describe the author's ideas. After all, if the author describes somebody else's work which is later
criticised, that passage does not characterise the paper very well. And finally, citation indexers
and even bibliometric measurements could profit from recognition of the function of a citation
(is this paper criticised? does the author agree with it? is that work cited because it is used in
the current paper?). I will argue that there is a close connection between the overall scientific
argumentation in the paper and the functions of the citations used in the paper.
I will describe a simple model of argumentation in science, called Argumentative Zoning
(AZ; Teufel and Moens, 2002, Teufel, 1999), which is based on only seven categories:
Background - Generally accepted background knowledge
Other      - Neutral descriptions of specific other work
Own        - Own work: method, results, future work
Aim        - Specific research goal
Textual    - Hints about section structure
Contrast   - Contrast, comparison, weakness of other solution
Basis      - Statement that cited work provides basis for own work
These categories describe the rhetorical or argumentative status of entire sentences.
In previous work (Teufel et al., 1999), we showed that humans can reliably annotate naturally
occurring text with the seven categories, and that a supervised machine learning system
can be trained to automatically determine the categories reasonably accurately, on the basis of
shallow features. While some of these features capture simple properties of a sentence - its
location, its length, the presence of citations - there are linguistically
more interesting features - meta-discourse features, which detect statements such as "to our
knowledge, no method for . . . has ever . . . " or "in this paper, we will present not only a new
technique for . . . but also . . . ".
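The kinds of shallow features just described can be sketched as a simple extractor; the feature names, the citation pattern, and the two cue phrases below are illustrative assumptions, not the features actually used in the original system.

```python
import re

def shallow_features(sentence, position, total_sentences):
    """Illustrative shallow features for one sentence: relative location,
    length, citation presence, and a crude meta-discourse cue match."""
    has_citation = bool(
        re.search(r"\([A-Z][A-Za-z]+(?: et al\.)?,? \d{4}\)", sentence)
    )
    meta_cue = bool(
        re.search(r"\bin this paper\b|\bto our knowledge\b", sentence, re.IGNORECASE)
    )
    return {
        "rel_position": position / total_sentences,  # where in the paper
        "length": len(sentence.split()),             # sentence length in tokens
        "has_citation": has_citation,                # cites other work?
        "meta_cue": meta_cue,                        # meta-discourse signpost?
    }

feats = shallow_features("In this paper, we present a new technique.", 2, 100)
```

A supervised learner would then be trained on vectors of such features, with the seven AZ categories as target labels.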
Meta-discourse features are fascinating because understanding them brings us one step closer
to understanding those rather generic "signposts" in a sentence. What is hard about detecting
meta-discourse phrases is the variation in language: while many phrases are rather fixed,
there can be syntactic and lexical variation, a problem presently mainly addressed with long
lists of regular expressions. I will report on some recent work (Abdalla and Teufel, 2006) to
automatically detect variations of a known cue phrase in unseen text.
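To give a flavour of the regular-expression approach, the pattern below allows limited lexical variation around one cue phrase family; the particular alternatives in each slot are illustrative, not the lists actually used.

```python
import re

# One cue phrase family with lexical slots for limited variation.
CUE = re.compile(
    r"to (?:our|my) knowledge,? (?:no|there is no) "
    r"(?:method|approach|technique|system)",
    re.IGNORECASE,
)

def has_cue(sentence):
    """True if the sentence contains a variant of this novelty cue."""
    return CUE.search(sentence) is not None
```

Variants such as "To my knowledge, there is no technique ..." match, but genuine paraphrases ("as far as we are aware, no method ...") do not; this brittleness is exactly what motivates learning cue phrase variants automatically.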
How does one evaluate how well the automatic discourse annotation works? Firstly, by comparison
to the human-annotated text. The results show that the machine annotation is still
rather dissimilar from the human annotation, though much better than even ambitious baselines (e.g.,
Bayesian text classification on the basis of all words contained in the sentence). Secondly, one
can test how well people solve tasks with artefacts built on the basis of the annotation. We
built "AZ extracts" (annotated sentence extracts) on the basis of the discourse annotation, and
asked subjects to answer questions about citation function based on the information contained
in them. AZ extracts enabled subjects to do almost as well as a comparison group which had
access to the full paper, and significantly better than control groups with a comparable amount
of information in the form of keywords, random sentences, or even the abstracts themselves.
Surprisingly, when the AZ extracts were built on the basis of the human gold standard annotation
(not the system output), this did not significantly improve performance above AZ extracts
built on system output, though we know from the intrinsic evaluation that the system output
is rather dissimilar from human output. This raises the question of whether the intrinsic
evaluation method may underestimate system performance on real tasks.
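Intrinsic evaluation of this kind compares system labels against human labels over the same sentences; a chance-corrected agreement coefficient is the standard tool (Cohen's kappa is used here for illustration; the exact coefficient is my assumption, not a claim about the original study).

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each labeling's category distribution.
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Human vs. system labels over six sentences, AZ-style:
k = cohen_kappa(
    ["Own", "Own", "Other", "Aim", "Own", "Basis"],
    ["Own", "Other", "Other", "Aim", "Own", "Basis"],
)
```

Note that a high raw agreement can still yield a modest kappa when one category (such as Own) dominates, which is one reason chance correction matters for skewed schemes like AZ.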
Another of the recent research goals of our group is automatic citation classification (Teufel
et al., 2006a,b). This annotation scheme assigns labels to citations according to the relation of
that citation to the overall argumentative act of the citing paper. There are 12 categories, which
fall into four broad classes: Criticism, Comparison, Use, and Neutral. In annotation studies,
we also found this annotation scheme to be reproducible (like the rhetorical AZ scheme for
sentences), and we found that similar features to the ones employed for AZ also work well for
automatic, supervised machine learning of the citation labels (intrinsic evaluation shows higher
similarity to human annotation than was the case for the AZ task).
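The four broad classes named above can be illustrated with a minimal cue-based classifier; the cue patterns are illustrative stand-ins for the learned features, and the default to Neutral mirrors the scheme's neutral descriptions.

```python
import re

# Broad citation-function classes from the scheme; the cue words are
# illustrative assumptions, not the features of the actual system.
CUES = [
    ("Criticism", re.compile(r"\b(however|fails?|problem|unfortunately|weakness)\b", re.I)),
    ("Comparison", re.compile(r"\b(similar(ly)? to|in contrast to|compared? (to|with)|unlike)\b", re.I)),
    ("Use", re.compile(r"\b(we (use|follow|adopt|extend)|based on|building on)\b", re.I)),
]

def citation_function(context):
    """Assign a broad class to a citation from its sentence context."""
    for label, pattern in CUES:
        if pattern.search(context):
            return label
    return "Neutral"  # default: neutral description of the cited work
```

In the real task the 12 fine-grained categories are learned with supervised methods over features similar to those used for AZ, rather than matched by hand-written rules like these.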
All work up to now uses a corpus of conference articles in computational linguistics; these
display a lot of variation with respect to subdomains, structure, register, presentation traditions
and writing style. Our recent research has looked at discourse analysis on other domains,
namely genetics and chemistry papers in the subarea of organic synthesis. While the general
principles of AZ and citation function classification hold, life scientists' searches are specialised,
and in order to support them, different rhetorical moves in the papers need to be detected. For
instance, we found that chemists are particularly interested in failed problem-solving activities
in the description of the authors' methodology, as this information can help in troubleshooting
the searchers' own unsuccessful syntheses.
Bibliography
Abdalla, R. and Teufel, S. (2006): "A Bootstrapping Approach to Unsupervised Detection of Cue Phrase
Variants". In: Proceedings of ACL/COLING 2006.
Teufel, S. (1999): Argumentative Zoning: Information Extraction from Scientific Articles. Ph.D. thesis,
University of Edinburgh.
Teufel, S.; Carletta, J. and Moens, M. (1999): "An Annotation Scheme for Discourse-Level
Argumentation in Research Articles". In: Proceedings of EACL.
Teufel, S. and Moens, M. (2002): "Summarizing Scientific Articles - Experiments with Relevance and
Rhetorical Status". Computational Linguistics 28 (4): pp. 409-445.
Teufel, S.; Siddharthan, A. and Tidhar, D. (2006a): "An Annotation Scheme for Citation Function".
In: Proceedings of SIGdial-06. Sydney, Australia.
Teufel, S.; Siddharthan, A. and Tidhar, D. (2006b): "Automatic Classification of Citation Function".
In: Proceedings of EMNLP-06. Sydney, Australia.