Project Details
Instrumental Quality Estimation for Synthesized Speech Signals
Subject Area
Electronic Semiconductors, Components and Circuits, Integrated Systems, Sensor Technology, Theoretical Electrical Engineering
Term
from 2010 to 2014
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 162351856
The project, planned for three years in the initial proposal [Möller, Heute, 2009] and granted so far for two years, aims at an instrumental measure of the auditory quality of text-to-speech (TTS) synthesis systems. Expensive auditory measurements (listening tests) are not to be replaced completely; they can, however, largely be avoided during system development or during the pre-selection of potential system candidates. The goal is a measure that estimates quality from the TTS-synthesized signal alone, without any reference signal. It is attribute-based, i.e., like a listener, it compiles the overall impression from different individual aspects. The perceptual quality space is thus spanned by a certain number of dimensions. For these dimensions, instrumental measurements are to be found, from which an integral quality measure is finally to be constructed. In addition, and in contrast to a direct integral quality estimation, detailed diagnostic information becomes available, which is valuable for the system developer.
In the initial proposal [Möller, Heute, 2009], the following questions were raised: Which aspects are relevant for the perception and quality judgment of synthetic speech in various applications? Which of these aspects are reflected in the speech signal? How can the individual quality aspects be estimated instrumentally? How can the overall quality be estimated from the synthesized speech signal alone? How can the estimated values be represented in a quality profile for different applications? For which synthesis methods and applications does such a profile hold?
After the first two years, partial answers have been found. For the continued investigations, the following questions are added: How does the measured quality depend on the TTS speaker's voice? How can supra-segmental quality aspects be taken into account? How do relevant features depend on time, in the sense of time variance on the one hand and of a reasonable estimation reliability on the other? Which features are useful for the quality analysis of short TTS signals, and which for longer sequences?
While the compilation of the integral quality from individual aspects and the question of validity were already planned mainly for the third year in the initial proposal [Möller, Heute, 2009], the extension towards the inclusion of voice quality and time dependence has newly emerged from the present investigations. Likewise, the central importance of supra-segmental quality measurement is a novel point of view that now has to be taken into account.
DFG Programme
Research Grants