T-Labs Series in Telecommunication Services
Series Editors
Sebastian Möller
Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany
Axel Küpper
Telekom Innovation Laboratories, Technische Universität Berlin, Berlin, Germany
Alexander Raake
Audiovisual Technology Group, Technische Universität Ilmenau, Ilmenau, Germany
More information about this series at http://www.springer.com/series/10013
Benjamin Weiss
Talker Quality in Human and Machine Interaction: Modeling the Listener's Perspective in Passive and Interactive Scenarios
Benjamin Weiss
Technische Universität Berlin, Berlin, Germany
ISSN 2192-2810 e-ISSN 2192-2829
ISBN 978-3-030-22768-5 e-ISBN 978-3-030-22769-2
https://doi.org/10.1007/978-3-030-22769-2
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Ulrike and Toni
Preface
When humans engage in spoken interaction, they assess the obvious features of the interaction partner, such as the verbal and nonverbal signals of the voice, as these contribute to the subjective evaluation of the interlocutor. This book presents the background, the state of research, and the author's own contributions to the assessment and prediction of talker quality as it is constituted in voice perception and in dialog. Starting from theories and empirical findings from human interaction, major results and approaches are transferred to the domain of human-computer interaction. The main aim of this book is to contribute to the evaluation of spoken interaction, both between humans and between human and computer, and in particular to the quality subsequently attributed to the speaking system or person, based on the listening and interactive experience.
The theories, methods, and results presented focus on the first impression formed by people engaged in such a vocal conversation. This means either hearing a voice for the first time (passive scenario, listening only) or interacting with a person or computer for the first time (interactive scenario). The term "first impression" sets the research focus on the human perception and evaluation of the voices and conversations experienced, which represents the beginning of attitude formation by the participating human towards the (other) speaker, be it a real person or a computer agent.
The main scientific contribution is not to psychological theory development; rather, it is an informed engineering approach for describing subjective quality as experienced by users. As mentioned in the subtitle, the major results of this book are quantitative models of user ratings that represent subjective quality. The most important part of this modeling is the identification of relevant parameters as predictors. Such predictors of talker quality can be acoustic features, in the case of the quality of voices, which is typically assessed in passive scenarios, or data describing interaction behavior over the course of a conversation. By modeling voice-stimulated user ratings, this book also contributes to the identification of the most important perceptual dimensions, as these factors provide valuable insight in the search for quantitative predictors. It also derives basic insights and principles for designing and evaluating modern spoken conversational systems, with the aim of increasing and ensuring quality.
This book is intended for advanced readers who already have a background in speech signal processing and some basic knowledge of social psychology. Results on the topic of talker appraisal are presented mostly in a rather condensed form, and not all basic terms are defined or explained. Several of my own contributions, in addition to other work and results presented here, are of course published elsewhere. Therefore, please use the references to obtain more details and additional information if the respective summarized section does not answer all your questions.
Benjamin Weiss
Berlin, Germany
February 2019
Acknowledgments
I want to thank first and foremost my mentor, Sebastian Möller, who has supported me for over 10 years, during which I have worked on my favorite topics. He has built up a great team at the Quality and Usability Lab. There are so many current and former colleagues, too many to name here, without whom this book would not exist. Many are listed in the references, so I mention just Christine, Klaus, Ina, Felix, Irene, Yasmin, Stefan, Tilo, Thilo, Matthias, and Laura. At many conferences and on other occasions, I received valuable comments from colleagues in the field, in anonymous reviews but also in person while enjoying lively discussions. In particular, I want to thank Jürgen and Timo. My gratitude also goes to Petra Wagner and Elmar Nöth for writing reports on this manuscript for the habilitation committee. And lastly, I want to thank my family for being there for me.
I would like to acknowledge the financial support of the Deutsche Forschungsgemeinschaft throughout my academic career, most recently through the projects "Sympathie von Stimme und Sprechweise: Analyse und Modellierung auditiver und akustischer Merkmale" and "Human Perception and Automatic Detection of Speaker Personality and Likability: Influence of Modern Telecommunication Channels".
Contents
1. Theory: Foundations of Quality in Natural and Synthesized Speech
Speech is one of the most important modes of communication and interaction in human-human interaction (HHI). It carries semantic and pragmatic meaning, often in an underspecified and indirect way, by referring to situational and world knowledge. Apart from that, however, each utterance also includes nonverbal information, simply due to the prosody inevitably produced by speaking. Prosody is of great importance, as its variation signals linguistic and non-linguistic information, such as emphasis or syntactic and semantic structure, turn organization, or affective states. It comprises the primary [], such as voice quality or nasality, and by vocal tract shape or articulatory precision, all with multiple effects on acoustics. In order to support an easy understanding of such terms throughout the book,