Project Details

Social Perceptions of Synthetic Speakers

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term from 2019 to 2023
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 423651352
 
Final Report Year 2023

Final Report Abstract

In this research project, “Social Perceptions of Synthetic Speakers”, we propose a workflow for the generation of socially acceptable synthetic voices. A series of subjective evaluations showed that social perceptions are latent dimensions that can only be interpreted through combinations of adjectives. However, using a long list of adjectives to evaluate voice conversion (VC) or text-to-speech (TTS) systems is impractical. Therefore, social perceptions either need to be interpreted through a series of evaluations on various adjective combinations, or objective metrics for such evaluations need to be developed.

Correspondingly, the acoustic features contributing to various social perceptions were analyzed further and used for the automatic prediction of these perceptions from synthetic speech. However, the data used for this automatic prediction came from the subjective evaluations carried out in work package 1, and the number of samples was smaller than the number of acoustic feature dimensions extracted per speech sample with the openSMILE toolkit. We used multiple dimensionality reduction techniques to reduce the number of dimensions while also accounting for the multicollinearity between the acoustic features. Nevertheless, due to the limited data size, we could only explore linear regression and support vector regression in the current experiments. We therefore encourage the research community to explore different social perceptions of synthetic voices and to publish the corresponding evaluation results, so that they can be used to build better evaluation metrics and models for social perceptions.

Further, we show that social perceptions are separable and transferable from one speaker to another, as demonstrated by the voice conversion experiments presented in work package 3. Furthermore, we propose using synthetic voices of high speech quality and naturalness as the source and target speakers in voice conversion experiments, since the voices pass through speech generation twice (TTS and then VC) when VC is applied to TTS output. Additionally, signal manipulation techniques can be applied to the TTS voices to alter their social perceptions: since we know which acoustic features contribute to the various perceptions, specific features can be modified with tools such as Praat. Finally, we explored modifying the synthesis procedure by introducing acoustic correlates of warmth and competence into the training mechanism of a TTS system. In the current experiments, a linear combination of these acoustic correlates was used; other combinations can be explored in the future depending on the coefficient values of the features. When computing such weighted combinations, an acoustic feature with a negative coefficient could be assigned a lower weight, while a feature with a positive coefficient receives a higher weight.
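As an illustration of the prediction step described above (not the project's actual implementation), the following Python sketch shows how per-utterance openSMILE functionals could be mapped to listener ratings using PCA-based dimensionality reduction and a support vector regressor; the function name is hypothetical, and the file paths and mean ratings are assumed to come from the work-package-1 listening tests.

    # Illustrative sketch: predicting a social-perception rating (e.g. warmth)
    # from openSMILE functionals with PCA and SVR; names are hypothetical and
    # wav_paths/ratings are assumed to come from the listening tests.
    import numpy as np
    import opensmile
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    def fit_perception_regressor(wav_paths, ratings, n_folds=5):
        # One fixed-length eGeMAPS functionals vector per synthetic utterance.
        smile = opensmile.Smile(
            feature_set=opensmile.FeatureSet.eGeMAPSv02,
            feature_level=opensmile.FeatureLevel.Functionals,
        )
        X = np.vstack([smile.process_file(p).to_numpy().ravel() for p in wav_paths])
        y = np.asarray(ratings, dtype=float)

        # Standardise, then let PCA both reduce the dimensionality and
        # decorrelate the highly collinear acoustic features.
        model = make_pipeline(
            StandardScaler(),
            PCA(n_components=0.95),  # keep components explaining 95% of variance
            SVR(kernel="rbf", C=1.0),
        )
        scores = cross_val_score(model, X, y, cv=n_folds, scoring="r2")
        return model.fit(X, y), scores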
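For the Praat-based manipulation route, one common way to script Praat from Python is the parselmouth package; the sketch below uses Praat's built-in "Change gender" resynthesis as a stand-in for the feature-specific manipulations mentioned above, with illustrative parameter values rather than values used in the project.

    # Minimal sketch: manipulating formants and pitch range of a TTS utterance
    # with Praat via parselmouth; parameter values are illustrative only.
    import parselmouth
    from parselmouth.praat import call

    def shift_formants_and_pitch_range(in_wav, out_wav,
                                       formant_ratio=1.05, range_factor=1.2):
        snd = parselmouth.Sound(in_wav)
        # Arguments: pitch floor (Hz), pitch ceiling (Hz), formant shift ratio,
        # new pitch median (0 = keep), pitch range factor, duration factor.
        modified = call(snd, "Change gender", 75, 600,
                        formant_ratio, 0.0, range_factor, 1.0)
        modified.save(out_wav, "WAV")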
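The weighted combination of warmth and competence correlates is only described here at the level of coefficient signs; one hypothetical way to turn regression coefficients into such weights is sketched below (the rescaling scheme and all names are assumptions, not the project's implementation).

    # Hypothetical sketch: combine standardised acoustic correlates of warmth
    # and competence into one scalar per utterance, giving features with
    # negative coefficients lower weights and positive coefficients higher ones.
    import numpy as np

    def weighted_correlate_target(features, coefficients):
        # features: (n_utterances, n_features) acoustic correlates;
        # coefficients: per-feature regression coefficients from the analysis.
        z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
        w = coefficients - coefficients.min()   # most negative -> weight 0
        w = w / max(w.sum(), 1e-8)              # normalise weights to sum to 1
        return z @ w                            # per-utterance target value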
