Niladri Sekhar Dash
Language Corpora Annotation and Processing
1st ed. 2021
Niladri Sekhar Dash
Linguistic Research Unit, Indian Statistical Institute, Kolkata, West Bengal, India
ISBN 978-981-16-2959-4 e-ISBN 978-981-16-2960-0
https://doi.org/10.1007/978-981-16-2960-0
The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Dedicated to
Amma
Annamalai Murugan Temple
Kundah, Nilgiri Mountains, India
Preface
During the last few years, innovative techniques, strategies, and methods have been introduced to annotate and process natural language texts. This has been possible due to the widespread application of texts for various academic and commercial purposes. The value of texts is multiplied if they are normalized, annotated, and processed properly. Such texts are utilized for the development of various language-related systems and applications. These are indirect incentives for scholars engaged in designing new and innovative techniques for text annotation and processing. Some of the text annotation and processing techniques are discussed in this book. It gives more attention to text annotation, as text processing issues are addressed more widely in earlier literature.
People working in various domains of machine learning, artificial intelligence, information technology, language technology, and linguistics realize that they need to devise new methods and techniques to make texts more user- and system-friendly. Texts should be easily accessible so that one can utilize data and information from texts in research and development activities. This has been another factor behind the development of innovative ways of text annotation and processing. The application of innovative techniques to texts generates new kinds of results, which help us look at a language from a new perspective and address various language-related needs.
New text annotation techniques open new ways of looking at language. New annotation processes are applied to texts to make text data suitable for new applications. This generates new perspectives on a natural language and its properties. In essence, annotation contributes to looking at a language beyond the traditional frame of analysis and description. The introduction of new text processing techniques, on the other hand, makes us better equipped to gather offbeat information from texts, furnish rare and irregular evidence, and modify existing theories and models based on newly found examples and information. Moreover, analysis of annotated texts produces results that mark the limitations and deficiencies of existing theories and intuitions about a natural language and its properties.
A language-based application requires special tools, systems, and techniques to interact with raw and processed texts. The application strength of a tool increases when it works on annotated and processed texts. The proficiency and precision of a system also increase when it is able to access the data and information of an annotated text. Keeping these advantages in mind, we discuss some conventional and some non-conventional text annotation techniques in this book. We propose to apply these annotation techniques for language-specific research and applications. We have noted that text annotation techniques play a crucial role in devising new methods for language study, developing new skills for language analysis, and identifying new domains for language use. New text annotation techniques also help to collect new evidence from texts, process texts for new usages, and apply language data for new purposes. We collect information from annotated texts to explain non-canonical linguistic phenomena convincingly, rather than reshaping language data to support the existing theories.
The majority of text annotation tools and systems that are available at present are useful for advanced and resource-rich languages. The scope for applying these tools and systems to less-advanced and resource-poor languages is quite limited, even though the majority of world languages are resource-poor. In this context, the present book could have tried to demonstrate how tools and techniques developed for advanced languages could also work elegantly for less-advanced and resource-poor languages. Perhaps this would have been widely appreciated by the scholars of resource-rich languages. That, however, is not the goal of this book. It does not set out to show that annotation and processing techniques used in advanced languages work well for the less-advanced languages. Rather, it shows that less-advanced languages require similar tools and systems which are characteristically different and which can address the language-specific requirements of the less-resourced languages. Since this book focuses on a specific group of languages, the prospective readers are likely to be specialized. This book, however, can find a home in university and research libraries as well as become a text-cum-reference book for corpus linguists in universities and colleges. The importance of this book may be realized when we understand that it discusses some of the crucial issues and aspects of text annotation, keeping in view the requirements of less-advanced and resource-poor languages of the world.
Niladri Sekhar Dash
Kolkata, India
March 2021
Acknowledgments
I humbly thank my seniors, peers, and juniors who have helped me in different capacities to represent my ideas, concepts, and information in this book. I also thank those scholars who have given me insights and information to shape the new ideas and concepts presented here. I thank the anonymous reviewers who have suggested changes and modifications to the manuscript to improve the content and quality of the volume. I sincerely appreciate their constructive comments. Their suggestions have helped me to revise and upgrade the present book to a large extent.