Logic, Linguistics and Computer Science Set
coordinated by
Christian Retor
Volume 1
Application of Graph Rewriting to Natural Language Processing
Guillaume Bonfante
Bruno Guillaume
Guy Perrier
First published 2018 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St Georges Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
ISTE Ltd 2018
The rights of Guillaume Bonfante, Bruno Guillaume and Guy Perrier to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2018935039
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78630-096-6
Introduction
Our purpose in this book is to show how graph rewriting may be used as a tool in natural language processing. We shall not propose any new linguistic theories to replace the former ones; instead, our aim is to present graph rewriting as a programming language shared by several existing linguistic models, and show that it may be used to represent their concepts and to transform representations into each other in a simple and pragmatic manner. Our approach is intended to include a degree of universality in the way computations are performed, rather than in terms of the object of computation. Heterogeneity is omnipresent in natural languages, as reflected in the linguistic theories described in this book, and is something which must be taken into account in our computation model.
Graph rewriting presents certain characteristics that, in our opinion, makes it particularly suitable for use in natural language processing.
A first thing to note is that language follows rules, such as those commonly referred to as grammar rules, some learned from the earliest years of formal education (for example, use a singular verb with a singular subject), others that are implicit and generally considered to be obvious for a native speaker (for example in French we say une voiture rouge (a car red), but not une rouge voiture (a red car)). Each rule only concerns a small number of the elements in a sentence, directly linked by a relation (subject to verb, verb to preposition, complement to noun, etc.). These are said to be local. Note that these relations may be applied to words or syntagms at any distance from each other within a phrase: for example, a subject may be separated from its verb by a relative.
Note, however, that in everyday language, notably spoken, it is easy to find occurrences of text which only partially respect established rules, if at all. For practical applications, we therefore need to consider language in a variety of forms, and to develop the ability to manage both rules and their real-world application with potential exceptions.
A second important remark with regard to natural language is that it involves a number of forms of ambiguity. Unlike programming languages, which are designed to be unambiguous and carry precise semantics, natural language includes ambiguities on all levels. These may be lexical, as in the phrase Theres a bat in the attic, where the bat may be a small nocturnal mammal or an item of sports equipment. They may be syntactic, as in the example call me a cab: does the speaker wish for a cab to be hailed for them, or for us to say youre a cab? A further form of ambiguity is discursive: for example, in an anaphora, She sings songs, who is she?
In everyday usage by human speakers, ambiguities often pass unnoticed, as they are resolved by context or external knowledge. In the case of automatic processing, however, ambiguities are much more problematic. In our opinion, a good processing model should permit programmers to choose whether or not to resolve ambiguities, and at which point to do so; as in the case of constraint programming, all solutions should a priori be considered possible. The program, rather than the programmer, should be responsible for managing the coexistence of partial solutions.
The study of language, including the different aspects mentioned above, is the main purpose of linguistics. Our aim in this book is to propose automatic methods for handling formal representations of natural language and for carrying out transformations between different representations. We shall make systematic use of existing linguistic models to describe and justify the representations presented here. Detailed explanations and linguistic justifications for each formalism used will not be given here, but we shall provide a sufficiently precise presentation of each case to enable readers to follow our reasoning with no prior linguistic knowledge. References will be given for further study.
I.1. Levels of analysis
A variety of linguistic theories exist, offering relatively different visions of natural language. One point that all of these theories have in common is the use of multiple, complementary levels of analysis, from the simplest to the most complex: from the phoneme in speech or the letter in writing to the word, sentence, text or discourse. Our aim here is to provide a model which is sufficiently generic to be compatible with these different levels of analysis and with the different linguistic choices encountered in each theory.
Although graph structures may be used to represent different dimensions of linguistic analysis, in this book, we shall focus essentially on syntax and semantics at sentence level. These two dimensions are unavoidable in terms of language processing, and will allow us to illustrate several aspects of graph rewriting. Furthermore, high-quality annotated corpora are available for use in validating our proposed systems, comparing computed data with reference data.
The purpose of syntax is to represent the structure of a sentence. At this level, lexical units in practice, essentially what we refer to as words form the basic building-blocks, and we consider the ways in which these blocks are put together to construct a sentence. There is no canonical way of representing these structures and they may be represented in a number of ways, generally falling into one of two types: syntagmatic or dependency-based representations.
The aim of semantic representation is to transmit the meaning of a sentence. In the most basic terms, it serves to convey who did what, where, how, etc. Semantic structure does not, therefore, necessarily follow the linear form of a sentence. In particular, two phrases with very different syntax may have the same semantic representation: these are known as paraphrases. In reality, semantic modeling of language is very complex, due to the existence of ambiguities and non-explicit external references. For this reason, many of the formalisms found in published literature focus on a single area of semantics. This focus may relate to a particular domain (for example legal texts) or semantic phenomena (for example dependency minimal recursion semantics (DMRS) considers the scope of quantifiers, whilst abstract meaning representation (AMR) is devoted to highlighting predicates and their arguments).
Next page