Multi-channel speech separation in reverberant environments
- LLERENA AGUILAR, COSME
- Manuel Rosa Zurera (Director)
- Roberto Gil Pita (Co-director)
University of defense: Universidad de Alcalá
Date of defense: 10 March 2016
- María del Pilar Jarabo Amores (Chair)
- Manuel Utrilla Manso (Secretary)
- David Ayllón Álvarez (Committee member)
- María de Diego Antón (Committee member)
- Jorge Plata Chaves (Committee member)
Type: Thesis
Abstract
Humans are capable of following a single speaker and understanding what that speaker is saying in rooms where many people are talking at once. This ability is enormously complex, since many processes are involved. In the scientific community this problem is known as the cocktail party problem, and it has been a subject of study for decades. The development of technical solutions is challenging due to this complexity; indeed, many of the processes performed by the human auditory system are still unknown. In general terms, the cocktail party problem can be understood in two ways: as a human speech recognition problem or as a speech separation problem. This thesis focuses on the latter. Many separation methodologies can be found in the literature. The type of separation solution depends on the circumstances in which the separation problem takes place, that is, on factors such as the number of speakers and microphones and the levels of noise and reverberation, among others. Perhaps the main problem for speech separation algorithms is reverberation, which causes many of them to fail. Hence, the need for separation algorithms that are robust against reverberation is clear. To solve the separation problem in reverberant environments, three main groups of techniques can be distinguished. Computational auditory scene analysis (CASA) techniques form the first group, although this line of research has not been the most successful one. A second group comprises beamforming techniques, which face important problems in the presence of reverberation. Beamforming techniques are based on the principle of spatial filtering; that is, they are specifically designed to enhance or suppress signals coming from particular directions. Reverberation, however, can be regarded as interference arriving from any direction, so beamforming is not the most suitable solution for eliminating its effects.
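The spatial-filtering principle that beamforming relies on can be illustrated with a minimal delay-and-sum sketch (illustrative code, not part of the thesis; function and parameter names are assumptions). For a far-field source at a given angle, each microphone channel is time-shifted to compensate its propagation delay and the channels are averaged, reinforcing the steered direction:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, angle, fs, c=343.0):
    """Steer a linear microphone array towards `angle` (radians) by
    compensating each channel's far-field propagation delay and averaging.
    signals: (n_mics, n_samples) array of time-domain mixtures.
    mic_positions: 1-D array of microphone positions along the axis (m)."""
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    for m in range(n_mics):
        # Far-field delay of microphone m relative to the array origin
        tau = mic_positions[m] * np.sin(angle) / c
        # Advance the channel by tau in the frequency domain (fractional shift)
        spec = np.fft.rfft(signals[m]) * np.exp(2j * np.pi * freqs * tau)
        out += np.fft.irfft(spec, n=n_samples)
    return out / n_mics
```

Signals arriving from the steered direction add coherently, while those from other directions partially cancel; reverberant energy, arriving from all directions, is exactly what this mechanism handles poorly.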
Furthermore, broadly speaking, beamforming methods require sophisticated sensor arrays and entail a significant computational load. These drawbacks limit the use of beamforming in many applications. The third group of solutions comprises blind source separation (BSS) techniques, which are based on statistical and other signal properties. This PhD thesis focuses on this latter group. Many of these techniques are used jointly with different tools to perform signal separation; among these tools, sensor networks play a key role. Nowadays, the use of wireless sensor networks is becoming very popular, since they offer many advantages. However, these networks have particularities that cause problems for classical separation algorithms designed for wired networks. With this in mind, the thesis can be divided into two major parts. The first is devoted to the design of new speech separation techniques that are robust against reverberation. In addition, it is important to obtain computationally efficient separation methods that do not require complex microphone networks. The second part deals with perhaps the most important problem for classical speech separation algorithms when wireless acoustic sensor networks (WASNs) are used: the synchronization problem. Regarding the first part, a new separation procedure has been introduced. Its main requirements are that it must outperform classical BSS algorithms in reverberant environments, use the simplest (two-microphone) array, and be computationally efficient. To compare it with classical BSS methods, comparable algorithms had to be chosen. One of the most successful families of separation methods comprises those based on sparsity. Among these, the well-known DUET algorithm has been selected, since it works with only two microphones in echoic conditions and achieves acceptable results.
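As a rough illustration of the kind of sparsity-based, two-microphone time-frequency masking that DUET performs, the following simplified sketch assigns each time-frequency bin to the source whose inter-microphone delay best explains the observed phase difference. This is not the thesis (or original DUET) implementation: here the candidate delays `deltas` are assumed known, whereas the full algorithm extracts attenuation/delay pairs from the peaks of a 2-D histogram.

```python
import numpy as np
from scipy.signal import stft, istft

def duet_masks(x1, x2, fs, deltas, nperseg=1024):
    """DUET-style separation sketch for an anechoic two-microphone mixture.
    `deltas`: candidate inter-microphone delays (in samples), one per source;
    assumed known here for brevity."""
    f, t, X1 = stft(x1, fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs, nperseg=nperseg)
    omega = 2 * np.pi * f[:, None]            # angular frequency per bin
    # Relative phase between the two channels at every time-frequency point
    phase = np.angle(X2 / (X1 + 1e-12))
    # Distance between observed phase and the phase each delay would produce
    scores = [np.abs(np.angle(np.exp(1j * (phase + omega * d / fs))))
              for d in deltas]
    winner = np.argmin(np.stack(scores), axis=0)
    estimates = []
    for k in range(len(deltas)):
        mask = (winner == k)                  # binary mask for source k
        _, sk = istft(X1 * mask, fs, nperseg=nperseg)
        estimates.append(sk)
    return estimates
```

The key assumption is sparsity: at each time-frequency point, one source dominates, so a hard (binary) assignment recovers the sources reasonably well, at the cost of the musical noise discussed later.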
To make a valid comparison, a study was carried out to determine the best microphone array configurations and frame lengths for DUET in our separation problems. With these configurations, the DUET algorithm is ready to be compared with our proposal. Moreover, the separation stage of DUET, based on time-frequency binary masking, has also been compared with another very popular masking technique that relies on l1-norm minimization. This study shows that binary masking outperforms the latter. Our separation procedure estimates the mixing matrix through a geometric analysis of the separation scenario: from available information, such as the microphone separation, the mutual angle or the type of microphones, the mixing matrix is estimated. The mixing parameters have two components, time differences and level differences. With our mixing matrix estimation method, only time differences need to be calculated, since a relationship between the two kinds of differences has been established. As a result, our separation method is computationally less expensive. Furthermore, avoiding the estimation of level differences is important, since it is a very difficult task in the presence of reverberation. Time differences are estimated with time delay estimation (TDE) methods. In particular, one of the TDE methods most robust against reverberation has been used: the GCC-PHAT algorithm. A study has demonstrated its suitability in all our separation problems except when the microphone separation is small; in view of this, a new TDE method has been developed for small arrays, obtaining very good results. Finally, comparing our separation solution with DUET, it has been demonstrated that our proposal outperforms DUET in all the separation scenarios considered (levels of reverberation, numbers of talkers, etc.). It must also be mentioned that both DUET and our proposal have a separation stage based on binary masking, which introduces an important problem in acoustic applications: musical noise.
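The GCC-PHAT estimator mentioned above is standard and compact. A minimal sketch (illustrative code, not the thesis implementation) whitens the cross-spectrum so that only the phase, which carries the time-difference information, is kept; this is what makes the method comparatively robust to reverberation:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time difference of arrival of `sig` relative to `ref`
    using the generalized cross-correlation with phase transform (GCC-PHAT).
    Returns the delay in seconds (positive if `sig` lags `ref`)."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # PHAT weighting: discard magnitude, keep only the phase
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-15), n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    # Re-center so lags run from -max_shift to +max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

Restricting the search with `max_tau` to the physically possible range (microphone spacing divided by the speed of sound) is what becomes problematic for very small arrays, where the true delay spans only a few samples.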
To mitigate this problem, a musical noise reduction algorithm has been proposed, with good results. In the second part of the thesis, we have tackled the introduction into WASNs of classical BSS algorithms that rely on short-time analysis tools. Perhaps the main problem for those BSS algorithms in WASNs is the desynchronization of the signals received at the different nodes. To address it, a novel synchronization methodology based on signal processing has been proposed. The first aspect worth mentioning is the novelty of considering differences in propagation delays, whereas traditional synchronization solutions deal only with the clock problem. A theoretical analysis has been developed to establish the theoretical delay between speech mixtures. Moreover, two new TDE methods consistent with this theoretical delay have been implemented. These methods have the additional advantage of requiring only a reduced amount of information to be transmitted, and they demand few computational resources. A study reveals that, with our synchronization solution, classical BSS algorithms can be used in WASNs.
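The synchronization stage can be caricatured as estimating a total inter-node delay (clock offset plus acoustic propagation difference) and shifting one node's recording before separation. A minimal sketch under that assumption (the helper name and interface are hypothetical, and the delay is assumed to be estimated elsewhere, e.g. by a TDE method):

```python
import numpy as np

def align_node_signals(x_node, total_delay_samples):
    """Compensate the total delay (clock offset + propagation difference)
    of a WASN node's recording relative to a reference node, so that
    short-time BSS algorithms see synchronized frames.
    A positive delay means the node's signal lags the reference."""
    d = int(round(total_delay_samples))
    if d > 0:
        # Discard the leading lag and zero-pad the tail to keep the length
        return np.concatenate((x_node[d:], np.zeros(d)))
    if d < 0:
        return np.concatenate((np.zeros(-d), x_node[:d]))
    return x_node.copy()
```

Once the node signals are frame-aligned in this way, the short-time mixing model assumed by classical BSS algorithms holds again, which is the point the thesis's synchronization study makes.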