Speech Communication Laboratory



About SCL

Research at the Speech Communication Lab focuses on combining the principles of science with the innovation of engineering to solve problems in speech and related areas. The emphasis of research is on understanding the principles of speech production and perception, and applying these principles in the development of acoustic parameters that will enable machines to automatically identify speakers or recognize speech. Research is also aimed at enhancing the quality of speech for such applications. All the projects are headed by Dr. Carol Espy-Wilson, the director of the Speech Communication Lab.


Some of the ongoing projects at the lab are:


Speech Enhancement using the Modified Phase Opponency Model

Graduate Student : Om Deshmukh

In this work we address the problem of single-channel speech enhancement when the speech is corrupted by additive noise. The model presented here, called the Modified Phase Opponency (MPO) model, is based on the auditory Phase Opponency (PO) model proposed by Carney et al. for the detection of tones in noise. The PO model includes a physiologically realistic mechanism for processing the information in neural discharge times. It exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery by using cross-auditory-nerve-fiber coincidence detection to extract temporal cues. The MPO model extends the PO model by adding flexibility to the filter design procedure and by using knowledge of acoustic phonetics to suppress noise. Initial evaluation of the MPO model on speech corrupted by white noise at different SNRs shows that the model is able to enhance the spectral peaks while suppressing the noise-only regions. The MPO enhancement scheme outperforms many of the statistical and signal-theoretic speech enhancement techniques when evaluated using three different objective quality measures on a subset of the Aurora database. The scheme has also been shown to be superior in enhancing speech signals when the amplitude level and the spectral characteristics of the background noise are fluctuating.
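The evaluation condition mentioned above, speech corrupted by white noise at a chosen SNR, can be sketched as follows. This is only an illustration of how such test material might be prepared, not the lab's evaluation code; the function names add_white_noise and measured_snr_db are hypothetical, and the input is assumed to be a plain list of samples.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose the gain so that 10*log10(signal_power / scaled_noise_power) == snr_db.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from a clean signal and its noisy version."""
    noise = [y - x for x, y in zip(clean, noisy)]
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(signal_power / noise_power)
```

Because the noise is rescaled after it is drawn, the realized SNR matches the requested value for that particular signal rather than only on average.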


Analysis, Vocal-tract Modeling and Automatic Detection of Vowel Nasalization

Graduate Student : Tarun Pruthi

The aim of this work is to understand the salient features of nasalization and the sources of acoustic variability in nasalized vowels, and to use this knowledge to propose Acoustic Parameters (APs) for the automatic detection of vowel nasalization. Possible applications in automatic speech recognition, formant tracking, speech enhancement, speaker recognition and clinical assessment of nasal speech quality have made the detection of vowel nasalization an important problem to study. Although researchers have found a number of acoustical and perceptual correlates of nasality, automatically extractable APs that work well in a speaker-independent manner are yet to be found. The difficulty lies in the immense variation in the acoustic consequences of nasalization with changes in the speaker, the sound on which nasalization is superimposed, and the degree of nasal coupling. A detailed study based on Magnetic Resonance Imaging (MRI) data has been performed in this work to understand the acoustic consequences of vowel nasalization. Based on this understanding and an extensive survey of past literature, APs for detecting vowel nasalization have been proposed. Results on previously proposed APs have also been presented as a baseline against which the APs proposed in this study can be compared. Future work will include implementing the proposed APs and testing them on several databases with different sampling rates, recording conditions and languages.


Acoustics of vocal tract shape for American English liquids

Graduate Student : Xinhui Zhou

In American English, the liquid sounds /r/ and /l/ are the most articulatorily complex sounds. They can be produced by several distinct types of tongue configuration and are among the most troublesome for children and nonnative speakers. A better understanding of this many-to-one relationship between articulation and acoustics would help model these two sounds more accurately and may benefit the areas of speech motor control, speech pathology, speaker verification, speech recognition and speech synthesis. Using magnetic resonance imaging (MRI), we acquired a multispeaker volumetric database that includes a series of continuous vocal tract shapes producing /r/ and /l/. These volumetric data are used to obtain detailed three-dimensional (3D) geometry of the vocal tract and to develop comprehensive acoustic models of interspeaker differences in vocal tract configuration, through finite element analysis of the 3D model and through a computer vocal tract model. The ultimate goal of this research is to understand the acoustic and perceptual effects of the different tongue configurations that produce /r/ and /l/.

A MATLAB-based program, VTAR (Vocal Tract Acoustic Response), was developed in this project to perform acoustic analysis based on the area functions we collected. It is free to download.
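As a rough illustration of how an area function can be turned into an acoustic response, the sketch below treats the vocal tract as a chain of lossless cylindrical tube sections with an ideal open end at the lips and locates the resonances. This is a textbook simplification and is not VTAR's actual algorithm; the function names, the speed of sound (35000 cm/s) and the air density are assumptions made for the example.

```python
import math


def lossless_tube_d(areas_cm2, section_len_cm, freq_hz, c=35000.0, rho=0.00114):
    """D element of the glottis-to-lips chain matrix for equal-length lossless sections."""
    k = 2.0 * math.pi * freq_hz / c
    cs, sn = math.cos(k * section_len_cm), math.sin(k * section_len_cm)
    m = [[1 + 0j, 0j], [0j, 1 + 0j]]
    for area in areas_cm2:  # areas are listed from glottis to lips
        z = rho * c / area  # characteristic impedance of this section
        s = [[cs, 1j * z * sn], [1j * sn / z, cs]]
        m = [
            [m[0][0] * s[0][0] + m[0][1] * s[1][0], m[0][0] * s[0][1] + m[0][1] * s[1][1]],
            [m[1][0] * s[0][0] + m[1][1] * s[1][0], m[1][0] * s[0][1] + m[1][1] * s[1][1]],
        ]
    # With zero pressure at the lips, U_glottis = D * U_lips, so resonances are zeros of D.
    return m[1][1].real


def formants(areas_cm2, section_len_cm, f_max=4000.0, step=10.0):
    """Resonance frequencies (Hz) found as sign changes of D on a frequency grid."""
    found = []
    f_prev = step / 2.0
    d_prev = lossless_tube_d(areas_cm2, section_len_cm, f_prev)
    f = f_prev + step
    while f <= f_max:
        d = lossless_tube_d(areas_cm2, section_len_cm, f)
        if d_prev * d < 0.0:
            # Linear interpolation between the two grid points that bracket the zero.
            found.append(f_prev + step * d_prev / (d_prev - d))
        f_prev, d_prev = f, d
        f += step
    return found
```

For a uniform tube 17.5 cm long, this model gives resonances at odd multiples of c/(4L), that is, near 500, 1500, 2500 and 3500 Hz. A realistic analysis would also account for wall losses, lip radiation and side branches, which this sketch omits.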

Robust Voice Mining Techniques for Telephone Conversations

Graduate Student : Sandeep Manocha

Voice mining involves speaker detection in a set of multi-speaker files. In published work, training data is used to construct target speaker models. In this study, a new voice mining scenario was considered in which there is no demarcation between training and testing data and prior target speaker models are absent. Given a database of telephone conversations, the task is to identify conversations that have one or more speakers in common. Various approaches, including semi-automatic and fully automatic techniques, were explored and different scoring strategies were considered. Given the poor audio quality, automatic speaker segmentation is not very effective. A new technique was developed that avoids speaker segmentation by training a multi-speaker model on the entire conversation. This technique is more robust and outperforms the automatic speaker segmentation approach. On the ENRON database, the equal error rate (EER) is 15.98% and 6.25% for conversations with at least one and at least two speakers in common, respectively.
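The EER quoted above is the operating point at which the false-acceptance and false-rejection rates are equal. A minimal sketch of how such a figure can be estimated from two lists of detection scores is given below; the function name and the threshold sweep are assumptions for illustration and do not reproduce the scoring used in the study.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a decision threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for thr in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(1 for s in target_scores if s < thr) / len(target_scores)
        far = sum(1 for s in nontarget_scores if s >= thr) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the mean of the two error rates where they are closest.
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

Here target_scores would hold scores for conversation pairs that truly share a speaker and nontarget_scores the scores for pairs that do not. With few trials the two error curves may not cross exactly, which is why the sketch averages them at the closest point.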


A Set of Acoustic Parameters for Speaker Identification

Graduate Students : Sandeep Manocha, Daniel Garcia-Romero, Srikanth Vishnubhotla

This work uses knowledge of acoustic phonetics to develop a set of acoustic parameters that potentially capture speaker-specific information and help in speaker identification applications. The focus of the project was to compare the performance of a set of acoustic features against that of the standard Mel-Frequency Cepstral Coefficients (MFCCs) for text-independent speaker identification. The eight acoustic features used were the four formants F1 through F4, the spectral slope, the harmonic difference H1-H2, and the aperiodicity and periodicity content of the speech signal. The first four parameters capture vocal tract information, such as the dynamic range of configurations of the speaker's vocal tract. The latter four capture the source information of the speaker and help characterize the voice quality. The set of acoustic parameters gave performance comparable to the standard MFCCs on average, and performed better for female speakers in general. Further work is planned to include a parameter that can identify creakiness and other irregular phonation variations, thus adding an additional parameter to the feature set.
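One of the source features listed above, the harmonic difference H1-H2, is the level difference in dB between the first and second harmonics of the voice source. The sketch below shows one simple way to measure it on a single frame, assuming the fundamental frequency is already known; the function names are hypothetical and the project's own extraction method is not described here.

```python
import cmath
import math


def harmonic_magnitude(frame, sample_rate, freq_hz):
    """Magnitude of a Hann-windowed frame at one frequency, by direct DFT summation."""
    n_total = len(frame)
    acc = 0j
    for n, x in enumerate(frame):
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_total)  # periodic Hann
        acc += window * x * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate)
    return abs(acc)


def h1_minus_h2_db(frame, sample_rate, f0_hz):
    """Level difference in dB between the harmonics at f0 and 2*f0."""
    h1 = harmonic_magnitude(frame, sample_rate, f0_hz)
    h2 = harmonic_magnitude(frame, sample_rate, 2.0 * f0_hz)
    return 20.0 * math.log10(h1 / h2)
```

For a synthetic frame whose second harmonic has half the amplitude of the first, the result is about 6 dB. On real speech the harmonic peaks would normally be searched for in a small band around the estimated f0 and 2*f0 rather than read at exact frequencies.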


Automatic Detection of Irregular Phonation in Continuous Speech

Graduate Student : Srikanth Vishnubhotla

This project involves the extraction of acoustic features that can detect irregular phonation in a speech signal. It is part of a larger project that aims to distinguish between different voice qualities. In particular, the work uses the Aperiodicity, Periodicity and Pitch (APP) Detector to analyze creaky voice (and other instances of irregular phonation) for its characteristic APP profiles. Other knowledge-based constraints are then added to distinguish irregular phonation from other events with which it can be confused. The long-term aim of this project is to develop an acoustic parameter that can be added to a knowledge-based acoustic feature set for speaker recognition experiments and can also aid speech recognition.



