Discourse Annotation for Citation Mapping, Information Retrieval and Summarisation
Simone Teufel
In this talk, I will motivate why discourse annotation in scientific corpora is attractive for
many applications requiring more text understanding than most current linguistic annotation
can support. Summarisation is such an application: if we knew more about the discourse
structure in a scientific text, it would be easier to produce a summary with a logical structure
(e.g., the first extracted sentence giving the motivation, the second the research goal, the third
a high-level description of the methodology used, and the fourth the results and conclusions).
Discourse information could also help in information retrieval - for instance, if we
could recognise sections where other people's ideas are described, these sections could be
excluded from indexing, or they could be treated differently from sections which
describe the author's ideas. After all, if the author describes somebody else's work which is later
criticised, that passage does not characterise the paper very well. And finally, citation indexers
and even bibliometric measurements could profit from recognition of the function of a citation
(is this paper criticised? does the author agree with it? is that work cited because it is used in
the current paper?). I will argue that there is a close connection between the overall scientific
argumentation in the paper and the functions of the citations used in the paper.
I will describe a simple model of argumentation in science, called Argumentative Zoning
(AZ; Teufel and Moens, 2002, Teufel, 1999), which is based on only seven categories:
Background - Generally accepted background knowledge
Other      - Neutral descriptions of specific other work
Own        - Own work: method, results, future work
Aim        - Specific research goal
Textual    - Hints about section structure
Contrast   - Contrast, comparison, weakness of other solution
Basis      - Statement that cited work provides basis for own work
These categories describe the rhetorical or argumentative status of entire sentences.
In previous work (Teufel et al., 1999), we showed that humans can reliably annotate naturally
occurring text with the seven categories, and that a supervised machine learning system
can be trained to automatically determine the categories reasonably accurately, on the basis of
shallow features. While some of these features capture simple properties of a sentence - its
location, its length, the presence of citations - there are linguistically
more interesting features - meta-discourse features, which detect statements such as "to our
knowledge, no method for . . . has ever . . . " or "in this paper, we will present not only a new
technique for . . . but also . . . ".
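The kinds of shallow features just described can be sketched as a simple extractor; the feature names, the citation pattern, and the two cue phrases below are illustrative assumptions, not the features actually used in the original system.

```python
import re

def shallow_features(sentence, position, total_sentences):
    """Illustrative shallow features for one sentence: relative location,
    length, citation presence, and a crude meta-discourse cue match."""
    has_citation = bool(
        re.search(r"\([A-Z][A-Za-z]+(?: et al\.)?,? \d{4}\)", sentence)
    )
    meta_cue = bool(
        re.search(r"\bin this paper\b|\bto our knowledge\b", sentence, re.IGNORECASE)
    )
    return {
        "rel_position": position / total_sentences,  # where in the paper
        "length": len(sentence.split()),             # sentence length in tokens
        "has_citation": has_citation,                # cites other work?
        "meta_cue": meta_cue,                        # meta-discourse signpost?
    }

feats = shallow_features("In this paper, we present a new technique.", 2, 100)
```

A supervised learner would then be trained on vectors of such features, with the seven AZ categories as target labels.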
Meta-discourse features are fascinating because understanding them brings us one step closer
to understanding those rather generic "signposts" in a sentence. What is hard about detecting
meta-discourse phrases is the variation in language: while many phrases are rather fixed,
there can be syntactic and lexical variation, a problem presently mainly addressed with long
lists of regular expressions. I will report on some recent work (Abdalla and Teufel, 2006) to
automatically detect variations of a known cue phrase in unseen text.
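To give a flavour of the regular-expression approach, the pattern below allows limited lexical variation around one cue phrase family; the particular alternatives in each slot are illustrative, not the lists actually used.

```python
import re

# One cue phrase family with lexical slots for limited variation.
CUE = re.compile(
    r"to (?:our|my) knowledge,? (?:no|there is no) "
    r"(?:method|approach|technique|system)",
    re.IGNORECASE,
)

def has_cue(sentence):
    """True if the sentence contains a variant of this novelty cue."""
    return CUE.search(sentence) is not None
```

Variants such as "To my knowledge, there is no technique ..." match, but genuine paraphrases ("as far as we are aware, no method ...") do not; this brittleness is exactly what motivates learning cue phrase variants automatically.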
How does one evaluate how well the automatic discourse annotation works? Firstly, by comparison
to the human-annotated text. The results show that the machine annotation is still
rather dissimilar from the human annotation, though much better than even ambitious baselines (e.g.,
Bayesian text classification on the basis of all words contained in the sentence). Secondly, one
can test how well people solve tasks with artefacts built on the basis of the annotation. We
built "AZ extracts" (annotated sentence extracts) on the basis of the discourse annotation, and
asked subjects to answer questions about citation function based on the information contained
in them. AZ extracts enabled subjects to do almost as well as a comparison group which had
access to the full paper, and significantly better than control groups with a comparable amount
of information in the form of keywords, random sentences, or even the abstracts themselves.
Surprisingly, when the AZ extracts were built on the basis of the human gold standard annotation
(not the system output), this did not significantly improve performance above AZ extracts
built on system output, though we know from the intrinsic evaluation that the system output
is rather dissimilar from human output. This raises the question of whether the intrinsic
evaluation method may underestimate system performance on real tasks.
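Intrinsic evaluation of this kind compares system labels against human labels over the same sentences; a chance-corrected agreement coefficient is the standard tool (Cohen's kappa is used here for illustration; the exact coefficient is my assumption, not a claim about the original study).

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each labeling's category distribution.
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Human vs. system labels over six sentences, AZ-style:
k = cohen_kappa(
    ["Own", "Own", "Other", "Aim", "Own", "Basis"],
    ["Own", "Other", "Other", "Aim", "Own", "Basis"],
)
```

Note that a high raw agreement can still yield a modest kappa when one category (such as Own) dominates, which is one reason chance correction matters for skewed schemes like AZ.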
Another of the recent research goals of our group is automatic citation classification (Teufel
et al., 2006a,b). This annotation scheme assigns labels to citations according to the relation of
that citation to the overall argumentative act of the citing paper. There are 12 categories, which
fall into four broad classes: Criticism, Comparison, Use, and Neutral. In annotation studies,
we also found this annotation scheme to be reproducible (like the rhetorical AZ scheme for
sentences), and we found that similar features to the ones employed for AZ also work well for
automatic, supervised machine learning of the citation labels (intrinsic evaluation shows higher
similarity to human annotation than was the case for the AZ task).
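The four broad classes named above can be illustrated with a minimal cue-based classifier; the cue patterns are illustrative stand-ins for the learned features, and the default to Neutral mirrors the scheme's neutral descriptions.

```python
import re

# Broad citation-function classes from the scheme; the cue words are
# illustrative assumptions, not the features of the actual system.
CUES = [
    ("Criticism", re.compile(r"\b(however|fails?|problem|unfortunately|weakness)\b", re.I)),
    ("Comparison", re.compile(r"\b(similar(ly)? to|in contrast to|compared? (to|with)|unlike)\b", re.I)),
    ("Use", re.compile(r"\b(we (use|follow|adopt|extend)|based on|building on)\b", re.I)),
]

def citation_function(context):
    """Assign a broad class to a citation from its sentence context."""
    for label, pattern in CUES:
        if pattern.search(context):
            return label
    return "Neutral"  # default: neutral description of the cited work
```

In the real task the 12 fine-grained categories are learned with supervised methods over features similar to those used for AZ, rather than matched by hand-written rules like these.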
All work up to now uses a corpus of conference articles in computational linguistics; these
display a lot of variation with respect to subdomains, structure, register, presentation traditions
and writing style. Our recent research has looked at discourse analysis on other domains,
namely genetics and chemistry papers in the subarea of organic synthesis. While the general
principles of AZ and citation function classification hold, life scientists' searches are specialised,
and in order to support them, different rhetorical moves in the papers need to be detected. For
instance, we found that chemists are particularly interested in failed problem-solving activities
in the description of the authors' methodology, as this information can help in troubleshooting
the searchers' own unsuccessful syntheses.
Bibliography
Abdalla, R. and Teufel, S. (2006): "A Bootstrapping Approach to Unsupervised Detection of Cue Phrase
Variants". In: Proceedings of ACL/COLING 2006.
Teufel, S. (1999): Argumentative Zoning: Information Extraction from Scientific Articles. Ph.D. thesis,
University of Edinburgh.
Teufel, S.; Carletta, J. and Moens, M. (1999): "An Annotation Scheme for Discourse-Level
Argumentation in Research Articles". In: Proceedings of EACL.
Teufel, S. and Moens, M. (2002): "Summarizing Scientific Articles - Experiments with Relevance and
Rhetorical Status". Computational Linguistics 28 (4): pp. 409-445.
Teufel, S.; Siddharthan, A. and Tidhar, D. (2006a): "An Annotation Scheme for Citation Function".
In: Proceedings of SIGdial-06. Sydney, Australia.
Teufel, S.; Siddharthan, A. and Tidhar, D. (2006b): "Automatic Classification of Citation Function".
In: Proceedings of EMNLP-06. Sydney, Australia.