Social Perceptions of Synthetic Speakers
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Summary of Project Results
In the research project “Social Perceptions of Synthetic Speakers”, we propose a workflow for generating socially acceptable synthetic voices. A series of subjective evaluations showed that social perceptions are underlying dimensions that can only be interpreted through a combination of adjectives. However, using a long list of adjectives to evaluate voice conversion (VC) or text-to-speech (TTS) systems is impractical. Therefore, either social perceptions need to be interpreted through a series of evaluations on various adjective combinations, or objective metrics for such evaluations need to be developed. Accordingly, we analyzed the acoustic features that contribute to various social perceptions and used them to predict these perceptions automatically from synthetic speech. However, the data for automatic prediction came from the subjective evaluations carried out in work package 1, and the number of samples was smaller than the number of acoustic features extracted per speech sample with the openSMILE toolkit. We applied several dimensionality reduction techniques to reduce the number of dimensions while also accounting for multicollinearity among the acoustic features. Nevertheless, owing to the limited data size, we could only explore linear regression and support vector regression in the current experiments. We therefore encourage the research community to explore different social perceptions of synthetic voices and to publish evaluation results that the community can use to build better evaluation metrics and models for social perceptions. We also show that social perceptions are separable and transferable from one speaker to another, as demonstrated by the voice conversion experiments in work package 3.
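The summary does not name the dimensionality reduction techniques that were used. As an illustration only, one simple way to account for multicollinearity is a greedy correlation filter that drops any feature that is highly correlated with a feature already kept. The sketch below assumes a hypothetical threshold of 0.9 and uses made-up feature names and values, not project data or actual openSMILE output.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_collinear(features, threshold=0.9):
    """Greedy filter: keep a feature only if its absolute correlation with
    every already-kept feature does not exceed `threshold`.

    `features` maps a feature name to its values across speech samples.
    The threshold value is an assumption for illustration.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy values for five speech samples (not project data).
features = {
    "f0_mean": [110.0, 125.0, 140.0, 160.0, 180.0],
    "f0_median": [111.0, 124.0, 141.0, 158.0, 181.0],  # near-duplicate of f0_mean
    "jitter": [0.020, 0.010, 0.030, 0.015, 0.025],
    "hnr": [12.0, 15.0, 18.0, 11.0, 16.0],
}

selected = drop_collinear(features)
print(selected)
```

With more features than samples, a filter like this would typically be only a first step before fitting a linear or support vector regressor on the reduced feature set.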
Furthermore, we propose using synthetic voices of high speech quality and naturalness as the source and target speakers in voice conversion experiments, because applying VC to TTS voices means the voices undergo speech generation twice (TTS and VC). Additionally, signal manipulation techniques can be applied to the TTS voices to manipulate their social perceptions. Since the acoustic features contributing to various social perceptions are known, specific acoustic features can be modified using tools such as Praat. Finally, we explored modifying the synthesis procedure by introducing acoustic correlates of warmth and competence into the training mechanism of a TTS system. The current experiments used a linear combination of the acoustic correlates of warmth and competence; other combinations can be explored in the future, depending on the coefficient values of the features. When computing weighted combinations, an acoustic feature with a negative coefficient could be assigned a lower weight, while a feature with a positive coefficient would receive a higher weight.
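The weighting idea in the last two sentences can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's training procedure: the softmax mapping from coefficients to weights is one possible choice made here, and the feature names, coefficient values, and normalized feature values are all hypothetical.

```python
import math


def coefficients_to_weights(coefs):
    """Map regression coefficients to positive weights that sum to 1.

    A softmax is used here as an assumption: it preserves the ordering of
    the coefficients, so a negative coefficient yields a lower weight than
    a positive one.
    """
    largest = max(coefs.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(c - largest) for name, c in coefs.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def weighted_combination(values, weights):
    """Weighted linear combination of per-feature values."""
    return sum(weights[name] * values[name] for name in weights)


# Hypothetical coefficients of acoustic correlates (illustrative only).
coefs = {"f0_mean": 0.8, "speech_rate": 0.3, "jitter": -0.5}
weights = coefficients_to_weights(coefs)

# Hypothetical feature values for one utterance, normalized to [0, 1].
sample = {"f0_mean": 0.6, "speech_rate": 0.4, "jitter": 0.9}
score = weighted_combination(sample, weights)
print(weights, score)
```

Because the weights are positive and sum to 1, the combined value stays within the range of the normalized inputs, which may be convenient if it is used as an auxiliary conditioning or loss term.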
Project-Related Publications (Selection)
- A framework to incorporate aspects of social perception in synthetic voices. In Proceedings of Interspeech, Doctoral Consortium.
  Rallabandi, S.S.
- Identifying the vocal cues of likeability, friendliness and skilfulness in synthetic speech. 11th ISCA Speech Synthesis Workshop (SSW 11), 1-6. ISCA.
  Rallabandi, Sai Sirisha; Naderi, Babak & Möller, Sebastian
- Investigating disentanglement of speaker identity and characteristics through user experience. In Proceedings of ITG Conference on Speech Communication.
  Rallabandi, S.S. & Möller, S.
- Towards understanding the perceptions of warmth and competence in synthetic speech. Ph.D. Thesis.
  Rallabandi, S.S.
