May 7 - Poster session n. 2
Hands-free Speech Recognition and Speech Communication
(chaired by Maurizio Omologo - FBK-irst, Italy)

1) Matthias Wölfel
"A Joint Particle Filter and Multi-step Linear Prediction Framework to Provide Enhanced Speech Features Prior to Automatic Recognition"
2) Itai Peer, Boaz Rafaely and Yaniv Zigel
"Room Acoustics Parameters Affecting Speaker Recognition Degradation under Reverberation"
3) Randy Gomez, Jani Even, Hiroshi Saruwatari, and Kiyohiro Shikano
"Fast Dereverberation for Hands-Free Speech Recognition"
4) Noam R. Shabtai, Yaniv Zigel and Boaz Rafaely
"The Effect of GMM Order and CMS on Speaker Recognition with Reverberant Speech"
5) Hyunsin Park, Tetsuya Takiguchi and Yasuo Ariki
"Integration of Phoneme-subspaces using ICA for Speech Feature Extraction and Recognition"
6) Min-Seok Choi and Hong-Goo Kang
"A Two-channel Minimum Mean-square Error Log-spectral Amplitude Estimator for Speech Enhancement"
7) Thomas Ploetz and Gernot A. Fink
"On the Use of Empirically Determined Impulse Responses for Improving Distant Talking Speech Recognition"
8) Yotam Peled and Boaz Rafaely
"Study of Speech Intelligibility in Noisy Enclosures using Spherical Microphones Arrays"
9) Yu Takahashi, Keiichi Osako, Hiroshi Saruwatari and Kiyohiro Shikano
"Blind Source Extraction For Hands-Free Speech Recognition based on Wiener Filtering and ICA-based Noise Estimation"
10) Armin Sehr and Walter Kellermann
"New Results for Feature-Domain Reverberation Modeling"
11) Simone Cifani, Emanuele Principi, Cesare Rocchi, Stefano Squartini and Francesco Piazza
"A Multichannel Noise Reduction Front-end based on Psychoacoustics for Robust Speech Recognition in Highly Noisy Environments"
12) Ji Hun Park, Jae Sam Yoon and Hong Kook Kim
"HMM-based Mask Estimation for a Speech Recognition Front-end Using Computational Auditory Scene Analysis"
13) Kenichi Kumatani, John McDonough, Dietrich Klakow, Philip Garner and Weifeng Li
"Adaptive Beamforming with a Maximum Negentropy Criterion"
14) Gustavo Esteves Coelho, Antonio Joaquim Serralheiro and Joao Paulo Neto
"Microphone Array Front-end Interface for Home Automation"


Poster session n. 2 - abstracts


Matthias Wölfel "A Joint Particle Filter and Multi-step Linear Prediction Framework to Provide Enhanced Speech Features Prior to Automatic Recognition"
Automatic speech recognition that works well on recordings captured with mid- or far-field microphones is essential for natural verbal communication between humans and machines. While a great deal of research effort has addressed one of the two distortions frequently encountered in mid- and far-field sound capture, namely non-stationary noise and reverberation, much less work has been undertaken to jointly combat both kinds of distortion. In our view, however, this joint approach is essential in order to further reduce the catastrophic effects of noise and reverberation that are encountered as soon as the microphone is more than a few centimeters from the speaker's mouth. We propose here to integrate an estimate of the reverberation obtained by multi-step linear prediction into a particle filter framework that tracks and removes non-stationary additive distortions. Evaluations on actual recordings with different speaker-to-microphone distances demonstrate that techniques combating either non-stationary noise or reverberation can be combined to good effect.
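A minimal sketch of the multi-step (delayed) linear prediction ingredient described above: samples at least `delay` steps in the past are used to predict the current sample, so the prediction captures the late-reverberant tail rather than short-term speech correlation. The order, delay and toy signal here are illustrative assumptions, not the paper's settings.

```python
# Multi-step (delayed) linear prediction: a rough late-reverberation
# estimate, to be removed (e.g. spectrally) before feature extraction.
# The order/delay values are illustrative assumptions.
import numpy as np

def multistep_lp(x, order=20, delay=100):
    N = len(x)
    n0 = delay + order - 1
    # Regressors for sample n are x[n-delay], ..., x[n-delay-order+1].
    X = np.column_stack([x[n0 - delay - k : N - delay - k] for k in range(order)])
    a, *_ = np.linalg.lstsq(X, x[n0:], rcond=None)
    late = np.zeros(N)
    late[n0:] = X @ a            # predicted late-reverberant component
    return late

# Toy usage on a synthetic reverberant signal.
rng = np.random.default_rng(0)
rir = np.exp(-np.arange(2000) / 400.0) * rng.standard_normal(2000)
x = np.convolve(rng.standard_normal(8000), rir)[:8000]
late = multistep_lp(x)           # e.g. subtract its power spectrum from x's
```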

Itai Peer, Boaz Rafaely and Yaniv Zigel "Room Acoustics Parameters Affecting Speaker Recognition Degradation Under Reverberation"
The performance of speaker recognition systems may degrade significantly when speech is recorded in reverberant environments by a microphone positioned far from the speaker. Most of the literature on speaker recognition uses the reverberation time to characterize reverberation effects. However, as described in this work, the reverberation time is mainly a room property and is little affected by the distance between the source and the microphone. This paper presents a comprehensive study of room acoustics parameters and their relationship with speaker recognition performance. The definition and central time, acoustic parameters that are affected by both room properties and source-to-microphone distance, were found to be better correlated with the degradation in speaker recognition performance.
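For readers unfamiliar with the two winning parameters, here is a minimal sketch of how definition (D50) and central time (Ts) are computed from a room impulse response; the toy exponential-decay response stands in for a measured one.

```python
# Definition (D50) and central time (Ts) from an impulse response h at
# sample rate fs. The toy RIR below is an illustrative stand-in.
import numpy as np

def definition_d50(h, fs):
    """Fraction of impulse-response energy arriving within the first 50 ms."""
    n50 = int(0.050 * fs)
    e = h ** 2
    return e[:n50].sum() / e.sum()

def central_time(h, fs):
    """Centroid of the squared impulse response, in seconds; it grows
    with source-microphone distance, unlike the reverberation time."""
    e = h ** 2
    t = np.arange(len(h)) / fs
    return (t * e).sum() / e.sum()

fs = 16000
h = np.random.randn(fs) * np.exp(-np.arange(fs) / (0.3 * fs))
print(definition_d50(h, fs), central_time(h, fs))
```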

Randy Gomez, Jani Even, Hiroshi Saruwatari, and Kiyohiro Shikano "Fast Dereverberation for Hands-Free Speech Recognition"
A robust dereverberation technique for real-time hands-free speech recognition is proposed. Real-time implementation is made possible by avoiding time-consuming blind estimation. Instead, we use the measured impulse response, effectively identifying its late-reflection components. Using this information, together with the concept of spectral subtraction (SS), we remove the effects of the late reflections from the reverberant signal. After dereverberation, only the effects of the early components remain, and the result is used as input to the recognizer. In this method, multi-band SS is used in order to compensate for the error arising from the approximation. We also introduce a training strategy that optimizes the multi-band coefficients so as to minimize this error.
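A minimal sketch of the late-reflection subtraction described above, under simplifying assumptions: the late reverberation is approximated by filtering the observed signal with the late part of the known impulse response and removed by power spectral subtraction. The split point, over-subtraction factor and single-band treatment are illustrative stand-ins for the paper's trained multi-band coefficients.

```python
# Spectral-subtraction dereverberation given a known impulse response h.
# split_ms and alpha are illustrative assumptions (the paper trains
# per-band coefficients instead of the single alpha used here).
import numpy as np
from scipy.signal import stft, istft, fftconvolve

def dereverb_ss(x, h, fs, split_ms=50.0, alpha=1.0):
    n_split = int(split_ms * 1e-3 * fs)
    h_late = np.zeros_like(h)
    h_late[n_split:] = h[n_split:]            # keep only late reflections
    late = fftconvolve(x, h_late)[:len(x)]    # late-reverberation estimate
    _, _, X = stft(x, fs, nperseg=512)
    _, _, L = stft(late, fs, nperseg=512)
    # Power subtraction with a spectral floor to avoid negative power.
    P = np.maximum(np.abs(X) ** 2 - alpha * np.abs(L) ** 2,
                   1e-3 * np.abs(X) ** 2)
    Y = np.sqrt(P) * np.exp(1j * np.angle(X))
    return istft(Y, fs, nperseg=512)[1]
```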

Noam R. Shabtai, Yaniv Zigel and Boaz Rafaely "The Effect of GMM Order and CMS on Speaker Recognition with Reverberant Speech"
Speaker recognition is used today in a wide range of applications. The presence of reverberation, in hands-free systems for example, results in performance degradation. The effect of reverberation on the feature vectors and its relation to the optimal GMM order are investigated. The optimal model order is calculated in terms of minimum BIC and KIC, and tested against the EER of a GMM-based speaker recognition system. Experimental results show that for high reverberation times, reducing the model order reduces the EER of speaker recognition. The effect of CMS on state-of-the-art GMM- and AGMM-based speaker recognition systems is investigated for reverberant speech. Results show that high reverberation times reduce the effectiveness of CMS.
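A minimal sketch of order selection by minimum BIC, the first criterion named above (KIC is analogous with a different penalty term); the random array stands in for real MFCC feature vectors from reverberant training speech.

```python
# GMM order selection by minimum BIC. The feature matrix X is an
# illustrative stand-in for real MFCC vectors.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.randn(2000, 13)             # placeholder for MFCC features

orders = [2, 4, 8, 16, 32, 64]
bics = []
for m in orders:
    gmm = GaussianMixture(n_components=m, covariance_type='diag',
                          random_state=0).fit(X)
    bics.append(gmm.bic(X))               # BIC = -2 log L + k log N
best = orders[int(np.argmin(bics))]
print('order with minimum BIC:', best)
```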

Hyunsin Park, Tetsuya Takiguchi and Yasuo Ariki "Integration of Phoneme-subspaces using ICA for Speech Feature Extraction and Recognition"
In our previous work, the use of PCA instead of DCT showed robustness in distorted speech recognition, because the main speech element is projected onto low-order features while the noise or distortion element is projected onto high-order features [1]. This paper introduces a new feature extraction technique that collects the correlation information among phoneme subspaces and makes their elements statistically mutually independent. The proposed speech feature vector is generated by projecting the observed vector onto an integrated space obtained by PCA and ICA. The performance evaluation shows that the proposed method provides higher isolated-word recognition accuracy than conventional methods in some reverberant conditions.
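A minimal sketch of the subspace-integration idea, under assumptions about data layout: a PCA basis is estimated per phoneme class, the bases are stacked into an integrated space, and ICA is applied so that the integrated components become statistically independent. Shapes, component counts and the helper name are illustrative.

```python
# Per-phoneme PCA subspaces integrated and decorrelated by ICA.
# spectra_by_phoneme: list of (frames, dims) arrays, one per phoneme class
# (an assumed data layout); n_per_phoneme is an illustrative choice.
import numpy as np
from sklearn.decomposition import PCA, FastICA

def integrated_projection(spectra_by_phoneme, n_per_phoneme=4):
    bases = []
    for spectra in spectra_by_phoneme:
        pca = PCA(n_components=n_per_phoneme).fit(spectra)
        bases.append(pca.components_)         # one phoneme subspace basis
    stacked = np.vstack(bases)                # integrated (correlated) space
    all_data = np.vstack(spectra_by_phoneme)
    proj = all_data @ stacked.T               # project onto stacked bases
    ica = FastICA(n_components=stacked.shape[0], random_state=0).fit(proj)
    return stacked, ica        # apply the same projection to test frames
```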

Min-Seok Choi and Hong-Goo Kang "A Two-channel Minimum Mean-square Error Log-spectral Amplitude Estimator for Speech Enhancement"
This paper proposes a novel two-channel speech enhancement structure using the minimum mean-square error log-spectral amplitude (MMSE-LSA) estimator. The proposed two-channel enhancement algorithm utilizes the spatial relationship between the two input signals to accurately estimate the noise power spectral density (PSD) needed by the MMSE-LSA algorithm. The proposed structure improves noise reduction capability with less speech distortion, while its complexity is much lower than that of simple cascade structures. The performance of the proposed algorithm is evaluated by automatic speech recognition tests in a car environment. Compared with a simple cascade of two- and single-channel algorithms, the proposed algorithm improves the relative recognition rate by 17.5% for high-speed conditions and 14.8% for low-speed conditions.
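For reference, a minimal sketch of the single-channel MMSE-LSA gain that the proposed two-channel structure builds on; the paper's two-channel contribution is a better noise PSD estimate feeding this gain. The a-priori SNR estimate below is a crude stand-in (a decision-directed estimator is common in practice), and the frame data is illustrative.

```python
# Ephraim-Malah MMSE log-spectral amplitude gain per frequency bin.
import numpy as np
from scipy.special import exp1

def mmse_lsa_gain(xi, gamma):
    """xi: a-priori SNR, gamma: a-posteriori SNR."""
    v = xi * gamma / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))

# Example on one noisy STFT frame X with noise PSD estimate lam
# (both illustrative placeholders).
X = np.random.randn(257) + 1j * np.random.randn(257)
lam = np.ones(257)
gamma = np.abs(X) ** 2 / lam
xi = np.maximum(gamma - 1.0, 1e-2)        # crude ML a-priori SNR estimate
S_hat = mmse_lsa_gain(xi, gamma) * X      # enhanced spectral amplitude
```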

Thomas Ploetz and Gernot A. Fink "On the Use of Empirically Determined Impulse Responses for Improving Distant Talking Speech Recognition"
Recognition rates of distant-talking speech recognition applications decrease substantially if the acoustic environment contains reverberation. Although standard approaches for compensating such distortions, e.g. cepstral mean subtraction (CMS), are quite effective, they are not appropriate for dynamic human-machine interaction: when only short portions of speech are uttered by speakers at different positions, compensation methods that require several seconds of speech fail. For this kind of application we present a dereverberation approach utilizing empirically determined impulse responses. Prior to speaking, users are asked to produce an impulse-like signal (clapping their hands or snapping their fingers), which is used for compensation. By means of an experimental evaluation on the German Verbmobil corpus we demonstrate the promising potential of the approach.
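A minimal sketch of the compensation idea, under the simplifying assumption that the log spectrum of the clap can be subtracted from the speech log spectra in the manner of CMS; frame handling and names are illustrative, not the paper's exact procedure.

```python
# Channel compensation from an impulse-like event (hand clap) taken as a
# rough empirical room impulse response. All signal names are assumptions.
import numpy as np

def log_spectrum(x, n_fft=512):
    return np.log(np.abs(np.fft.rfft(x, n_fft)) ** 2 + 1e-10)

def compensate(speech_frames, clap):
    """speech_frames: (n_frames, frame_len) array; clap: recorded clap."""
    h_log = log_spectrum(clap)                # empirical channel estimate
    return np.stack([log_spectrum(f) - h_log for f in speech_frames])
```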

Yotam Peled and Boaz Rafaely "Study of Speech Intelligibility in Noisy Enclosures using Spherical Microphone Arrays"
Detection of clear speech in highly reverberant and noisy enclosures is an extremely difficult problem. Recently, spherical microphone arrays suitable for noise reduction and dereverberation in three dimensions have been studied. This paper presents the development of a model for investigating speech intelligibility in noisy enclosures when speech is recorded and processed by spherical microphone arrays. The model uses the image method, diffuse sound fields, spherical array beamforming and speech intelligibility measures to predict the array order required to overcome noise and reverberation when detecting speech in noisy enclosures. With such a model, one can design a spherical array that overcomes given acoustic conditions, or assess whether a given problem can be solved by a practical array configuration.
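One concrete relationship hidden in such a model: an order-N spherical-harmonics beamformer has a maximum directivity of (N+1)^2, i.e. a directivity index of 20·log10(N+1) dB against a diffuse field, which lets one read off the smallest order meeting a required gain. The 12 dB target below is an illustrative assumption.

```python
# Smallest spherical-array order whose maximum directivity index
# (20*log10(N+1) dB for order N) meets a required diffuse-field gain.
import math

def min_array_order(required_gain_db):
    n = 0
    while 20.0 * math.log10(n + 1) < required_gain_db:
        n += 1
    return n

print(min_array_order(12.0))   # 12 dB of diffuse-field gain -> N = 3
```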

Yu Takahashi, Keiichi Osako, Hiroshi Saruwatari and Kiyohiro Shikano "Blind Source Extraction For Hands-Free Speech Recognition based on Wiener Filtering and ICA-based Noise Estimation"
In this paper, we propose a new blind speech extraction method consisting of Wiener filtering and noise estimation based on independent component analysis (ICA). First, we provide both theoretical and experimental investigations into the proficiency of ICA in noise estimation under a non-point-source noise condition. Next, computer simulations and an experiment in an actual railway-station environment are conducted, and their results also indicate that ICA is proficient in noise estimation under such a condition. Finally, we propose a blind speech extraction method based on Wiener filtering and ICA-based noise estimation, and demonstrate the effectiveness of the proposed method via speech recognition tests in an actual railway-station environment.
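A minimal sketch of the pipeline, with illustrative simplifications: ICA separates the two-channel mixture, the less super-Gaussian component (by kurtosis) is taken as the noise estimate, and a Wiener-type gain built from its average power is applied to one observed channel. Component selection and the gain floor are assumptions, not the paper's exact design.

```python
# Blind speech extraction: ICA-based noise estimation driving a
# Wiener-type spectral gain on the observed signal.
import numpy as np
from scipy.signal import stft, istft
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

def ica_wiener(x_stereo, fs, nperseg=512):
    """x_stereo: (samples, 2) two-channel recording."""
    y = FastICA(n_components=2, random_state=0).fit_transform(x_stereo)
    noise = y[:, np.argmin(kurtosis(y, axis=0))]   # less speech-like source
    _, _, X = stft(x_stereo[:, 0], fs, nperseg=nperseg)
    _, _, N = stft(noise, fs, nperseg=nperseg)
    lam_n = (np.abs(N) ** 2).mean(axis=1, keepdims=True)  # noise PSD estimate
    lam_x = np.abs(X) ** 2
    gain = np.maximum(1.0 - lam_n / np.maximum(lam_x, 1e-10), 0.05)
    return istft(gain * X, fs, nperseg=nperseg)[1]
```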

Armin Sehr and Walter Kellermann "New Results for Feature-Domain Reverberation Modeling"
To achieve robust distant-talking automatic speech recognition in reverberant environments, the effect of reverberation on the speech feature sequences has to be modeled as accurately as possible. A convolution in the feature domain has been proposed recently in [1, 2, 3, 4] to capture the dispersion of the feature vectors caused by reverberation. These publications use a fixed representation of the acoustic path between speaker and microphone or an elementary statistical reverberation model based on simplifying assumptions. In this contribution, we propose a Monte-Carlo approach that allows for an explicit determination of the joint probability density function of a feature-domain reverberation model.
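A minimal sketch of such a Monte-Carlo estimate, under stand-in assumptions: exponentially decaying noise plays the role of sampled room impulse responses, log frame energies play the role of the feature representation, and the sample mean and covariance summarize the resulting feature-domain model (the paper determines the full joint pdf).

```python
# Monte-Carlo estimation of feature-domain reverberation statistics.
# rir_sample is a crude stand-in; measured or image-method RIRs would
# be used in practice, and log mel spectra instead of frame energies.
import numpy as np

def rir_sample(rng, length=4000, decay=1500):
    return rng.standard_normal(length) * np.exp(-np.arange(length) / decay)

def feature_domain_rir(h, frame_len=400, n_frames=8):
    frames = h[: frame_len * n_frames].reshape(n_frames, frame_len)
    return np.log((frames ** 2).sum(axis=1) + 1e-12)  # log frame energies

rng = np.random.default_rng(0)
samples = np.stack([feature_domain_rir(rir_sample(rng)) for _ in range(500)])
mu, cov = samples.mean(axis=0), np.cov(samples.T)  # model statistics
```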

Simone Cifani, Emanuele Principi, Cesare Rocchi, Stefano Squartini and Francesco Piazza "A Multichannel Noise Reduction Front-end based on Psychoacoustics for Robust Speech Recognition in Highly Noisy Environments"
Due to their spatial filtering capability, microphone array systems usually outperform traditional single-channel approaches to noise reduction. Moreover, psychoacoustically motivated speech enhancement schemes typically achieve a good balance between noise reduction and speech distortion. This led some of the authors to merge the two advantageous aspects into a single solution, achieving good enhanced-speech quality over a wide range of operating conditions. In this paper, the objective is to assess the effectiveness of the approach when applied as a noise reduction front-end to an automatic speech recognition system operating in adverse acoustic environments. Computer simulations show that a significant improvement in recognition rate is obtained when this front-end is used, also with respect to the performance achievable with an alternative multichannel noise reduction architecture not based on psychoacoustic concepts.
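A minimal sketch of one common psychoacoustic ingredient, in the style of masking-threshold-constrained noise reduction and not necessarily the authors' exact scheme: the suppression gain is floored so that residual noise stays at or below an estimated masking threshold, buying less speech distortion at no audible cost.

```python
# Masking-threshold-constrained suppression gain. The threshold input is
# a stand-in assumption for a proper psychoacoustic model's output.
import numpy as np

def masked_gain(wiener_gain, noise_psd, mask_threshold):
    """Raise the gain wherever the residual noise would be masked anyway:
    residual power G^2 * noise_psd stays <= mask_threshold."""
    g_floor = np.sqrt(np.minimum(mask_threshold / np.maximum(noise_psd, 1e-12),
                                 1.0))
    return np.maximum(wiener_gain, g_floor)
```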

Ji Hun Park, Jae Sam Yoon and Hong Kook Kim "HMM-based Mask Estimation for a Speech Recognition Front-end Using Computational Auditory Scene Analysis"
In this paper, we propose a new mask estimation method for the computational auditory scene analysis (CASA) of speech using two microphones. The proposed method is based on a hidden Markov model (HMM), in order to incorporate the observation that mask information should be correlated over contiguous analysis frames. In other words, an HMM is used to estimate the mask information, represented as the interaural time difference (ITD) and the interaural level difference (ILD) of the two channel signals, and the estimated mask information is finally employed to separate the desired speech from noisy speech. To show the effectiveness of the proposed mask estimation, we compare the proposed method with a Gaussian kernel-based estimation method in terms of speech recognition performance. As a result, the proposed HMM-based mask estimation method provides an average word error rate reduction of 69.14% compared with the Gaussian kernel-based mask estimation method.
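A minimal sketch of the front half of this chain plus a stand-in for the HMM: ITD from the inter-channel phase, ILD from the level ratio, and a two-state Viterbi pass that enforces the frame-to-frame continuity the HMM is introduced for. Likelihoods and transition probabilities are illustrative assumptions.

```python
# ITD/ILD cues per time-frequency unit, plus a two-state Viterbi smoother
# (speech-dominant vs. noise-dominant) standing in for the paper's HMM.
import numpy as np

def itd_ild(X1, X2, freqs):
    """X1, X2: (freq, time) STFTs of the two channels; freqs in Hz."""
    phase = np.angle(X1 * np.conj(X2))
    itd = phase / (2 * np.pi * np.maximum(freqs[:, None], 1.0))
    ild = 20 * np.log10((np.abs(X1) + 1e-10) / (np.abs(X2) + 1e-10))
    return itd, ild

def viterbi_smooth(loglik0, loglik1, p_stay=0.9):
    """loglik{0,1}: (n_frames,) frame log-likelihoods of the two states."""
    T = len(loglik0)
    lt = np.log([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]])
    delta = np.zeros((T, 2))
    psi = np.zeros((T, 2), dtype=int)
    delta[0] = [loglik0[0], loglik1[0]]
    for t in range(1, T):
        for s, ll in enumerate((loglik0[t], loglik1[t])):
            cand = delta[t - 1] + lt[:, s]
            psi[t, s] = np.argmax(cand)
            delta[t, s] = cand[psi[t, s]] + ll
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return np.array(path[::-1])   # 1 = target-dominant frame
```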

Kenichi Kumatani, John McDonough, Dietrich Klakow, Philip Garner and Weifeng Li "Adaptive Beamforming with a Maximum Negentropy Criterion"
In this paper, we address an adaptive beamforming application in realistic acoustic conditions. After the position of a speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in a generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so as to obtain an output signal with maximum negentropy (MN). This implies the beamformer output should be as non-Gaussian as possible. For calculating negentropy, we consider the Γ and the generalized Gaussian (GG) pdfs. After MN beamforming, Zelinski post-filtering is performed to further enhance the speech by removing residual noise. Our beamforming algorithm can suppress noise and reverberation without the signal cancellation problems encountered in conventional adaptive beamforming algorithms. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV). On the MC-WSJ-AV evaluation data, the delay-and-sum beamformer with post-filtering achieved a word error rate (WER) of 16.5%. MN beamforming with the Γ pdf achieved a 15.8% WER, which was further reduced to 13.2% with the GG pdf, whereas the simple delay-and-sum beamformer provided a WER of 17.8%.
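A minimal sketch of an empirical negentropy score, using Hyvärinen's log-cosh contrast approximation as a simpler stand-in for the Γ/GG-pdf-based negentropy of the paper; a search over GSC active weight vectors would maximize such a score.

```python
# Empirical negentropy approximation (log-cosh contrast): larger values
# indicate a more non-Gaussian beamformer output.
import numpy as np

def negentropy_logcosh(y, rng=np.random.default_rng(0)):
    y = (y - y.mean()) / (y.std() + 1e-12)    # zero mean, unit variance
    g_y = np.mean(np.log(np.cosh(y)))
    # E[log cosh(nu)] for standard-normal nu, estimated once by sampling.
    g_gauss = np.mean(np.log(np.cosh(rng.standard_normal(200_000))))
    return (g_y - g_gauss) ** 2

# A beamformer search would pick the active weights maximizing this score.
```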

Gustavo Esteves Coelho, Antonio Joaquim Serralheiro and Joao Paulo Neto "Microphone Array Front-end Interface for Home Automation"
In this paper we present a Microphone Array (MA) interface to a Spoken Dialog System. Our goal is to create a hands-free home automation system with a vocal interface to control home devices. The user establishes a dialog with a virtual butler that is able to control a plethora of home devices, such as ceiling lights, air-conditioner, window shades, hi-fi and TV features. A MA is used for the speech acquisition front-end. The multi-channel audio acquisition is pre-processed in real-time, performing speech enhancement with a Delay-and-Sum Beamforming algorithm. The Direction of Arrival is estimated with the Generalized Cross Correlation with Phase Transform (GCC-PHAT) algorithm, enabling the system to track the user. The enhanced speech signal is then processed in order to recognize orally issued commands that will control the house appliances. This paper describes the complete system, emphasizing the MA and its implications on command recognition performance.
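A minimal sketch of the GCC-PHAT delay estimate named above and its mapping to an arrival angle; the microphone spacing and the far-field two-microphone geometry are illustrative assumptions.

```python
# GCC-PHAT time-delay estimation between two microphones, mapped to a
# direction of arrival. mic_dist and fs are illustrative assumptions.
import numpy as np

def gcc_phat_delay(x1, x2, fs, max_tau=None):
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    R = X1 * np.conj(X2)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)   # phase-transform weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # delay in seconds

def doa_degrees(tau, mic_dist=0.1, c=343.0):
    """Far-field angle from the broadside, for a two-microphone pair."""
    return np.degrees(np.arcsin(np.clip(tau * c / mic_dist, -1.0, 1.0)))
```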
