ANALYTIC PATTERN MATCHING
How do you distinguish a cat from a dog by their DNA? Did Shakespeare really write all his plays? Pattern matching techniques can offer answers to these questions and to many others, in contexts from molecular biology to telecommunications to the classification of Twitter content.
This book, intended for researchers and graduate students, demonstrates the probabilistic approach to pattern matching, which predicts the performance of pattern matching algorithms with very high precision using analytic combinatorics and analytic information theory. and also introduce new methodology such as the Mellin transform and analytic depoissonization.
More than 100 end-of-chapter problems will help the reader to make the link between theory and practice.
ANALYTIC PATTERN MATCHING
From DNA to Twitter
PHILIPPE JACQUET
Institut National de Recherche en Informatique et en Automatique (INRIA), Rocquencourt
WOJCIECH SZPANKOWSKI
Purdue University, Indiana
University Printing House, Cambridge CB2 8BS, United Kingdom
Cambridge University Press is part of the University of Cambridge.
It furthers the Universitys mission by disseminating knowledge in the pursuit of education, learning and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9780521876087
Philippe Jacquet and Wojciech Szpankowski 2015
This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published 2015
Printed in the United Kingdom by Clays, St lves plc
A catalog record for this publication is available from the British Library
Library of Congress Cataloging in Publication data
Jacquet, Philippe, 1958
Analytic pattern matching from DNA to Twitter / Philippe Jacquet, Institut National de Recherche en Informatique et en Automatique (INRIA), Rocquencourt Wojciech Szpankowski, Purdue University, Indiana.
pagescm
Includes bibliographical references and index.
ISBN 978-0-521-87608-7 (Hardback)
1. Pattern recognition systems.I. Szpankowski, Wojciech, 1952II. Title.
TK7882.P3J332015
519.2dc23
2014017256
ISBN 978-0-521-87608-7 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Dedicated to PHILIPPE FLAJOLET,
our mentor and friend.
Contents
Foreword
Early computers replaced calculators and typewriters, and programmers focused on scientific computing (calculations involving numbers) and string processing (manipulating sequences of alphanumeric characters, or strings). Ironically, in modern applications, string processing is an integral part of scientific computing, as strings are an appropriate model of the natural world in a wide range of applications, notably computational biology and chemistry. Beyond scientific applications, strings are the lingua franca of modern computing, with billions of computers having immediate access to an almost unimaginable number of strings.
Decades of research have met the challenge of developing fundamental algorithms for string processing and mathematical models for strings and string processing that are suitable for scientific studies. Until now, much of this knowledge has been the province of specialists, requiring intimate familiarity with the research literature. The appearance of this new book is therefore a welcome development. It is a unique resource that provides a thorough coverage of the field and serves as a guide to the research literature. It is worthy of serious study by any scientist facing the daunting prospect of making sense of huge numbers of strings.
The development of an understanding of strings and string processing algorithms has paralleled the emergence of the field of analytic combinatorics, under the leadership of the late Philippe Flajolet, to whom this book is dedicated. Analytic combinatorics provides powerful tools that can synthesize and simplify classical derivations and new results in the analysis of strings and string processing algorithms. As disciples of Flajolet and leaders in the field nearly since its inception, Philippe Jacquet and Wojciech Szpankowski are well positioned to provide a cohesive modern treatment, and they have done a masterful job in this volume.
ROBERT SEDGEWICK
Princeton University
Preface
Repeated patterns and related phenomena in words are known to play a central role in many facets of computer science, telecommunications, coding, data compression, data mining, and molecular biology. One of the most fundamental questions arising in such studies is the frequency of pattern occurrences in a given string known as the text. Applications of these results include gene finding in biology, executing and analyzing tree-like protocols for multiaccess systems, discovering repeated strings in LempelZiv schemes and other data compression algorithms, evaluating string complexity and its randomness, synchronization codes, user searching in wireless communications, and detecting the signatures of an attacker in intrusion detection.
The basic pattern matching problem is to find for a given (or random) pattern w or set of patterns W and a text X how many times W occurs in the text X and how long it takes for W to occur in X for the first time. There are many variations of this basic pattern matching setting which is known as exact string matching. In approximate string matching, better known as generalized string matching, certain words from W are expected to occur in the text while other words are forbidden and cannot appear in the text. In some applications, especially in constrained coding and neural data spikes, one puts restrictions on the text (e.g., only text without the patterns 000 and 0000 is permissible), leading to constrained string matching. Finally, in the most general case, patterns from the set W do not need to occur as strings (i.e., consecutively) but rather as subsequences; that leads to subsequence pattern matching, also known as hidden pattern matching.
These various pattern matching problems find a myriad of applications. Molecular biology provides an important source of applications of pattern matching, be it exact or approximate or subsequence pattern matching. There are examples in abundance: finding signals in DNA; finding split genes where exons are interrupted by introns; searching for starting and stopping signals in genes; finding tandem repeats in DNA. In general, for gene searching, hidden pattern matching (perhaps with an exotic constraint set) is the right approach for finding meaningful information. The hidden pattern problem can also be viewed as a close relative of the longest common subsequence (LCS) problem, itself of immediate relevance to computational biology but whose probabilistic aspects are still surrounded by mystery.