Contributions to the design of automatic voice quality analysis systems using speech technologies

Gómez García, Jorge Andrés

Supervised by:

Juani Godino Director

Defence university: Universidad Politécnica de Madrid

Defence date: 07 February 2018

Committee:

Manuel Blanco Velasco Chair
José Luis Blanco Murillo Secretary
Athanasios Tsanas Committee member
Daniel Ramos Castro Committee member
Alfonso Ortega Giménez Committee member

Type: Thesis

Teseo: 529274

Abstract

Speech production relies on a complex process to generate audible output, most typically for communication purposes. Speech not only contains a message encoded in the form of language, but also conveys information about the sex, age, condition, and other aspects of the speaker. For this reason, there is great interest in designing systems that extract this non-linguistic information for automatic analysis. One interesting application, on which this thesis is centred, is the design of automatic systems capable of characterising the presence and severity of voice disorders. Such systems have potential applications as objective supplementary tools in clinical settings. Nevertheless, the design of automatic systems poses several problems, including the intrinsic variability of speech, the simultaneous presence of multiple phenomena characterising vocal pathology, the existence of spurious extralinguistic information, and the reliance on perceptual assessments, which are highly subjective. With this background in mind, this thesis evaluates the influence of extralinguistic information, different types of speech tasks, and diverse decision machines and characteristics on the design of automatic voice quality analysis systems whose objective is to generalise decisions about the presence and severity of pathologies in voice and/or speech. A novel methodology based on feature ranking algorithms, ordinal classification and Gaussian regression is also proposed to emulate the perceptual capabilities of a human evaluator. The regressor is used to convert the discrete perceptual scale to a continuum, which is more in keeping with the nature of the evaluations. Moreover, the robustness of the proposed systems is evaluated in several cross-database experiments. Results indicate that the sex of the speaker plays an important role in automatic voice quality analysis systems and that hierarchical designs should be considered.
It has also been found that the most consistent set of features for both pathology detection and assessment tasks comprises two perturbation measures and a descriptor of the dispersion in modulation spectra representations: glottal-to-noise excitation ratio, cepstral harmonics-to-noise ratio, and rate of points above linear average. The best automatic detector trained with the Saarbrücken voice disorders database achieves an AUC of 0.88 when the information provided by the different speech tasks is fused via logistic regression. In several cross-database scenarios, AUC varies between 0.75 and 0.94, which demonstrates the robustness of the system. These are among the best results reported in the literature for this database. In cross-database settings, the best assessment system incurs errors that differ on average by half a unit from the actual label when the G and B traits are considered. Moreover, the system has been assessed clinically by an expert, who certified its validity. For the clinically evaluated system, the error is about 0.3 units for the G trait.
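
As an illustration only, the fusion of per-task detector scores via logistic regression and the AUC figure of merit mentioned in the abstract can be sketched as follows. This is a minimal sketch on synthetic data, not the implementation used in the thesis: the three "speech task" scores, the function names (`auc`, `fit_fusion`, `fuse`) and the training settings are all assumptions made for the example.

```python
import math
import random


def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (pathological, normal) pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_fusion(X, y, lr=0.1, epochs=500):
    """Fit logistic-regression fusion weights by batch gradient descent.
    X holds one row per speaker and one column per speech task."""
    n_feat = len(X[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * n_feat, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(n_feat):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def fuse(X, w, b):
    """Map per-task scores to a single fused probability per speaker."""
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


# Synthetic data (assumption): 200 speakers, 3 hypothetical speech tasks,
# label 1 = pathological, 0 = normal; each task score is weakly informative.
random.seed(0)
labels = [i % 2 for i in range(200)]
task_scores = [[random.gauss(1.0 if y else 0.0, 1.0) for _ in range(3)]
               for y in labels]

train_X, train_y = task_scores[:100], labels[:100]
test_X, test_y = task_scores[100:], labels[100:]

w, b = fit_fusion(train_X, train_y)
fused_auc = auc(fuse(test_X, w, b), test_y)
single_task_auc = auc([x[0] for x in test_X], test_y)
print("single-task AUC:", round(single_task_auc, 3))
print("fused AUC:", round(fused_auc, 3))
```

The fused score is a weighted combination of the per-task scores, so the learned weights indicate how much each task contributes to the final decision. In a cross-database experiment, the weights would be fitted on one corpus and the AUC computed on another.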