Hanne Martine Eckhoff, Silvia Luraghi & Marco Passarotti
University of Oxford | University of Pavia | Catholic University of the Sacred Heart, Milan
Ancient languages and digital sources
Over the last few decades, the widespread diffusion of digital technology and the growing ease of transferring information via the Internet have made an enormous amount of textual data available to scholars. The vastly increased availability of primary sources has radically changed the everyday life of scholars in the humanities, who are now able to access, query and process a wealth of empirical evidence in ways not possible before.
This development also encompasses ancient languages. The first aim in the eighties and the nineties was to digitize textual data and make them available on CD-ROM and online. Later, the need for linguistic annotation gave rise to projects aimed at building corpora enhanced with increasingly complex layers of metalinguistic information, such as part-of-speech (PoS) tagging and syntactic annotation, opening the field to precise queries for particular linguistic phenomena. We are now at a stage where several of these syntactically annotated corpora, or treebanks, have reached a mature state, providing representative selections of texts for several diachronic stages of a given language. These new resources allow for a new approach to diachronic studies of syntactic phenomena where scholars previously had to content themselves with empirical work on a much smaller scale.
This volume brings together a set of papers that report research on various diachronic matters supported by evidence from diachronic treebanks for different languages, i.e., treebanks that provide data for a language across several historical stages. We show that diachronic treebanks offer considerable methodological advances in terms of greater transparency and better ways of exploiting frequently problematic source material, thus allowing us to shed new light on vexed questions.
What is a treebank?
In linguistics and philology, the term corpus has traditionally been used simply to denote a set of texts used to explore some linguistic phenomenon. Many types of digital text resources are now also referred to as corpora. A much stricter definition has also been proposed: "a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research". However, not even the strictest definitions include linguistic annotation of any kind among the criteria. There is thus a great deal of variation both in the amount of work that has gone into building and processing corpora and in the usefulness of the resource for linguists researching particular phenomena in a given corpus. A corpus may be anything from a digitized, machine-readable text collection that only allows queries for text strings, to a sophisticated, multi-layered text resource with several types of linguistic markup, queryable by a dedicated query engine. In this volume, we concern ourselves with one of the most labor-intensive corpus types of all: the treebank.
A treebank is a text corpus with exhaustive syntactic annotation, typically applied on top of lemmatization, PoS tagging and morphological annotation. Each of these annotation layers adds to the precision of queries. Lemmatization allows for queries for all word forms subsumed under a single lemma, eliminating the need to use regular expressions. Part-of-speech and morphological tags allow for queries for specific combinations of linguistic features at the word level, without having to refer to the word form. Syntactic tagging makes it possible to search for groups of words that are syntactically related, regardless of whether they are adjacent to each other or not. Since syntactic queries are mostly multi-word queries, and are typically combined with features from other layers, they can quickly become quite complex and require either a good query engine or that users master a query language. However, given such facilities, a treebank allows queries of great precision: if the annotation is good enough, it is possible to make queries almost entirely free of noise in terms of false positives and false negatives. For example, in a given language one may find all infinitives with preverbal pronominal direct objects in a single query.
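The layered queries described above can be sketched in a few lines of code. The following is a minimal illustration, not a real query engine: the CoNLL-U-style token fields, the sample sentence, its tags, and the function name are all invented for the example, and real treebanks use far richer tagsets and dedicated query languages.

```python
# Sketch of how annotation layers (lemma, PoS, morphology, syntax) combine
# into one precise query: infinitives with a preverbal pronominal object.

from typing import NamedTuple

class Token(NamedTuple):
    idx: int      # 1-based position in the sentence (word order layer)
    form: str     # surface word form
    lemma: str    # lemmatization layer
    pos: str      # part-of-speech layer
    feats: str    # morphological layer, e.g. "VerbForm=Inf"
    head: int     # syntactic layer: index of the governing word (0 = root)
    rel: str      # dependency relation to the head

# Hypothetical hand-annotated sentence: "She wants to see him."
sentence = [
    Token(1, "She",   "she",  "PRON", "Case=Nom",     2, "nsubj"),
    Token(2, "wants", "want", "VERB", "Tense=Pres",   0, "root"),
    Token(3, "to",    "to",   "PART", "_",            4, "mark"),
    Token(4, "see",   "see",  "VERB", "VerbForm=Inf", 2, "xcomp"),
    Token(5, "him",   "he",   "PRON", "Case=Acc",     4, "obj"),
]

def infinitives_with_preverbal_pron_obj(sent):
    """Find infinitives whose direct object is a pronoun preceding them."""
    hits = []
    for t in sent:
        if t.pos == "VERB" and "VerbForm=Inf" in t.feats:
            for dep in sent:
                if (dep.head == t.idx and dep.rel == "obj"
                        and dep.pos == "PRON" and dep.idx < t.idx):
                    hits.append((dep.form, t.form))
    return hits

# In this English sample the pronoun follows the verb, so nothing matches;
# on, say, Latin or Old French data the same query would return all hits.
print(infinitives_with_preverbal_pron_obj(sentence))  # -> []
```

Note how each condition draws on a different annotation layer: the PoS tag, the morphological features, the dependency relation, and the word-order index. Without the syntactic layer, the verb–object pairing could only be approximated by adjacency heuristics.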
Although some treebanks are annotated in accordance with the formalism of a particular syntactic framework, most strive to be relatively theory-neutral. There are two major groups of annotation schemes: phrase-structure-based schemes and dependency-based schemes. The first major treebank to be released, the Penn Treebank (1989–1996), uses a phrase-structure scheme, whereas many more recent treebanks are dependency-based; the Universal Dependencies initiative has developed a universal consensus-based scheme and works to convert as many treebanks as possible into it.
The two main treebank styles are based on two different syntactic notions, both of which clearly have some psychological reality. Phrase-structure treebanks are based on the idea that words are organized into groups (constituents) with certain properties; for example, an entire constituent can be substituted by a pro-word and will normally move together. Dependency treebanks, on the other hand, are based on the idea that every word in a sentence has one and only one syntactic head. As a brief illustration of the differences between these two main treebank styles, consider the two syntactic trees in Figures 1 and 2 below. The tree in Figure 3 presents the same dependency analysis in linear order.
Figure 1. A Penn-style phrase structure tree
Figure 2. A Prague dependency treebank tree
Figure 3. A Prague dependency treebank tree in linear order
Here we see that the phrase-structure analysis in the Penn-style tree is fairly flat, which brings the two analyses closer than they might have been if the Penn scheme had been binary-branching. The most striking difference in these examples is that the Penn analysis cannot have crossing branches, and therefore it deals with split coordination (the topic of Taylor and Pintzuk's paper) with a trace (the *ICH*-1). The index of the trace is then picked up again in the CONJP-1, the second part of the coordination, which is represented in its linear place in the sentence. In the dependency analysis, the fact that the coordination is split is not represented at all and can only be retrieved by combining the dependency analysis with word order information stored in a different layer (visualized in Figure 3). However, this analysis is computationally simpler, since every node in the tree corresponds to a lexical item.
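The interplay of the two layers just described can be made concrete with a small sketch. Each word points to exactly one head, word order is kept as a separate layer (the positional indices), and split coordination surfaces as crossing arcs only when the two layers are combined. The sentence, its Prague-style-like analysis, and the function name below are invented for illustration; the root is treated as a virtual leftmost node so that root attachments participate in crossing detection.

```python
# Detect crossing dependency arcs (non-projectivity), the dependency-side
# counterpart of what a Penn-style tree must encode with a trace.

def crossing_arcs(heads):
    """Return pairs of arcs that cross in linear order.

    `heads` maps each 1-based word position to the position of its head;
    0 marks the root, represented here as a virtual node at position 0.
    """
    arcs = sorted((min(i, h), max(i, h)) for i, h in heads.items())
    crossings = []
    for a in arcs:
        for b in arcs:
            # Two arcs cross iff exactly one endpoint of b lies strictly
            # inside a's span: a[0] < b[0] < a[1] < b[1].
            if a < b and a[0] < b[0] < a[1] < b[1]:
                crossings.append((a, b))
    return crossings

# Hypothetical split coordination "Men came yesterday and women":
# 1 Men -> 4 and (coordination member), 2 came -> root,
# 3 yesterday -> 2 came, 4 and -> 2 came, 5 women -> 4 and.
split_coord = {1: 4, 2: 0, 3: 2, 4: 2, 5: 4}
print(crossing_arcs(split_coord))  # -> [((0, 2), (1, 4))]

# A simple projective sentence ("She sees him") has no crossings.
print(crossing_arcs({1: 2, 2: 0, 3: 2}))  # -> []
```

The single crossing found in the first example shows why the split is invisible in the pure dependency tree: it emerges only once the head assignments are projected onto the linear order, exactly as in the comparison of Figures 2 and 3.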
Historical corpora and treebanks
Historical linguistics necessarily relies on corpora. This observation is captured by the German term Korpussprachen ('corpus languages') for historical languages. Indeed, with most historical languages, all we have is a more or less extended corpus of written texts. This constitutes a limitation (one cannot tell whether something absent from the corpus is missing because it is ungrammatical or by mere accident), but it also enables linguists working on these languages to base their assumptions on all attested forms. Extended corpora, even if finite, often exceed the linguist's ability to check all occurrences: for this reason, the introduction of digitized corpora has been a welcome addition to historical linguistics, as it has been to research on spoken language. Parsed corpora have the further advantage of providing information at various levels of linguistic analysis through metadata. Among these, treebanks have become an increasingly useful resource for the data-driven study of linguistic structures at various levels.