Project Details
Social Perceptions of Synthetic Speakers
Applicant
Professor Dr.-Ing. Sebastian Möller
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term
from 2019 to 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 423651352
Speech signals automatically induce social perceptions of the speaker in listeners. Through acoustic analysis and signal manipulation, a substantial body of knowledge has been accumulated on the acoustic correlates of social perceptions in natural speech, such as spectral and prosodic parameters, as well as on the underlying perceptual dimensions. However, despite the advent of modern speech synthesis paradigms that provide very high quality, it remains to be understood whether results obtained for natural speech also hold for synthesized speech. Hence, the major research question is: “Which acoustic features of synthesized speech affect subjective perceptions of social speaker characteristics?”

To answer this question, this project studies the social perception of the two basic social attributions, competence and benevolence, for text-to-speech (TTS) synthesizers in two potential application domains: healthcare and customer service. Results are compared to those obtained from natural speech in earlier projects. It is tested whether competence and benevolence also emerge as basic social attributions, or whether other dimensions are more relevant. Regarding the speech signal, similarities and differences in acoustic parameters and their systematics are identified. A mid-term result is an acoustic prediction model of the identified social dimensions for synthesized speech.

On a methodological level, utterances are created with state-of-the-art TTS systems and systematically modified at the signal level in order to produce stimuli for empirical testing with human listeners. Crowd-sourcing techniques are applied for the required listening and rating tests. The final goal is to examine how acoustic features and patterns can be incorporated directly into modern TTS methodologies (hidden Markov models, deep neural networks) instead of being imposed through post-processing signal manipulation.
This leads to the secondary research question: “Which alterations of the synthesis procedure lead to positive perceptions of speakers?” To this end, current approaches from speaker conversion are applied. Apart from the fundamental knowledge gained from this research, the results will be relevant for TTS system developers seeking to efficiently improve voices for particular service domains.
DFG Programme
Research Grants