Digital Humanities 2007 – Session Description:
Digital Text Resources for the Humanities – Legal Issues

Georg Rehm^1,2, Andreas Witt²
Tübingen University
SFB 441: Linguistic Data Structures
Project C1: Cross-Linguistic Annotation and Data Types¹
Project C2: Sustainability of Linguistic Data²
Nauklerstraße 35, 72074 Tübingen, Germany
`georg.rehm@uni-tuebingen.de`, `andreas.witt@uni-tuebingen.de`

Session Description

The session "Digital Text Resources for the Humanities – Legal Issues" consists of three papers that address the legal aspects connected to several crucial phases of handling text resources: collecting, compiling, curating, analysing, distributing, and archiving text resources such as corpora, are tasks carried out on a day-to-day basis by people involved in fields such as, for example, humanities computing, computational and corpus linguistics, information retrieval and text mining. Despite the ubiquity of document collections, the legal issues that are intrinsically tied to virtually all texts created and published by third parties (most importantly, their copyright, as well as privacy issues), do not typically attract a lot of interest. Though these issues are acknowledged, they are often regarded as rather insignificant for the research question at hand, or a project does not have any jurisprudential expertise to deal with legal issues in an adequate way. As a consequence, distributing a corpus (for example, to other interested researchers) whose provenance is unknown or questionable, or publishing excerpts from a document collection on a website, may become next to impossible from a legal point of view. This is why scholars often decide not to publish their collections (or parts thereof) online at all, in order to avoid any potential legal problems. The session aims to provide an overview of the following legal aspects:

The first contribution, "Language Corpora – Copyright – Data Protection: The Legal Point of View" (Timm Lehmberg, and Felix Zimmermann), highlights the legal requirements that hold with regard to the construction of digital text resources, special emphasis is given to the aspect of copyright and data protection (for example, potential reasons for the need to anonymise text corpora).
The second presentation, "Collecting Legally Relevant Metadata by Means of a Decision-Tree-Based Questionnaire System" (Timm Lehmberg, Christian Chiarcos, Erhard Hinrichs, Georg Rehm, and Andreas Witt), consists of two parts: first, a web-based questionnaire is introduced that was developed to capture the requirements research projects have with regard to the archiving and distribution of their corpora; second, initial results from a study that spans three large research centres and more than 60 individual research projects are reported.
The final paper, "Corpus Masking: Legally Bypassing Licensing Restrictions for the Free Distribution of Text Collections" (Georg Rehm, Andreas Witt, Heike Zinsmeister, and Johannes Dellert), introduces the idea of masking an annotated text corpus whose original source text collection is copyright-protected, so that the masked version can be distributed without any restrictions; furthermore, a fully working tool for masking an XML-annotated corpus is presented.

The authors of the three papers are associated with a joint project situated in three Collaborative Research Centres (SFB, Sonderforschungsbereich) that are sponsored by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft): SFB 441 (Linguistic Data Structures, Tübingen University), SFB 538 (Multilingualism, Hamburg University), and SFB 632 (Information Structure, Potsdam University). Each of these three research centres consists of about 15 to 20 research projects. Most projects work with digital text collections, in practically all cases these collections and corpora are constructed by the respective researchers themselves. A problem people involved in the fields of digital humanities or computational linguistics are often confronted with concerns the fact that the sustainability and reusability of corpora is not given too much attention – or that these aspects, in a worst case scenario, are completely ignored. Corpora are often created for an application or for a project that has a very specific research question, but when the project is finished it becomes next to impossible (especially for third parties) to gain access to the resource that took several months or maybe even years to create. The joint project Sustainability of Linguistic Data was therefore established to provide the conceptual, technical and infrastructural basis for a solution to the problem of sustainably archiving these digital text collections, addressing issues as diverse as, for example, annotation and metadata frameworks, best practice guidelines, legal issues of distributing text collections, and unifying diverse tag sets by means of an ontology.

Digital Humanities 2007 – Session Description: Digital Text Resources for the Humanities – Legal Issues

Georg Rehm1,2, Andreas Witt2 Tübingen University SFB 441: Linguistic Data Structures Project C1: Cross-Linguistic Annotation and Data Types1 Project C2: Sustainability of Linguistic Data2 Nauklerstraße 35, 72074 Tübingen, Germany georg.rehm@uni-tuebingen.de, andreas.witt@uni-tuebingen.de

Session Description

Digital Humanities 2007 – Session Description:
Digital Text Resources for the Humanities – Legal Issues

Georg Rehm^1,2, Andreas Witt²
Tübingen University
SFB 441: Linguistic Data Structures
Project C1: Cross-Linguistic Annotation and Data Types¹
Project C2: Sustainability of Linguistic Data²
Nauklerstraße 35, 72074 Tübingen, Germany
`georg.rehm@uni-tuebingen.de`, `andreas.witt@uni-tuebingen.de`