Next: About this document Up: Thrust Area V. Previous: D. Detection and

E. Speech Analysis and Recognition, and Speaker Identification

A natural application of auditory representations at all levels of analysis and complexity is in systems for automatic speech recognition (ASR) and speaker identification (SI). In principle, the advantages already discussed, namely noise robustness and the enhancement of perceptually significant features, transfer directly to these applications. This has been partially demonstrated in numerous past studies [.gopal 1986, ghitza 1988, shamma 1989 robust.], and it continues to be an active area of research. In fact, many auditory processing principles, such as the mel scale, critical bandwidths, and automatic gain control, have gradually become integral parts of the best-performing implementations. Much more, however, can be done to harness the advantages of auditory processing and representations, such as using the self-normalized auditory spectra at the earliest front-end stage, and exploiting the detection and feature classification afforded by the multiscale cortical representations (both described in Thrust Area I).

Auditory-based algorithms are particularly relevant for ASR and SI systems used in DoD applications because of the unique and extreme circumstances these systems face. Among them are the severe background noise and clutter in airplane cockpits or military communication channels, stressed speakers such as soldiers and pilots, and the hardware and software limitations imposed by battlefield or similar environments. Speech applications are also especially useful as a research tool because of the availability of extensive and varied speech databases, the wide use of speech in physiological, psychoacoustical, and computational experiments, and the deep knowledge gathered over the years about the characteristics of the speech signal and its vocal tract source. For all these reasons, it is important that we use speech signals as a vehicle and a catalyst in the development of robust processors and recognizers of acoustic signals.

We propose to use speech signals to accomplish two specific objectives: (1) to demonstrate unequivocally the advantages of auditory representations in extremely noisy conditions; (2) to use speech as a "diagnostic tool" to help develop and analyze the performance characteristics of various algorithms, such as those for multiscale clustering and recognition (Thrust Area II(A)) and for the detection and recognition of dynamic sequences (Thrust Area II(B)).

For our first objective, we plan to carry out a systematic evaluation of the gains in noise robustness obtained with auditory "front-end" representations in well-defined ASR and SI tasks. The auditory front-ends will be based on models of early auditory processing (e.g., the self-normalized auditory spectrum [.Shamma Wang 1994 normalize.] and the nonlinear cochlear filterbank model due to [.Carney 1993.]), of intermediate stages of the auditory system [.Singh Mountain 1996.], and of the cortical spectro-temporal multiscale representation [.Shamma Wang 1995.]. In each case, an ASR and/or SI system will first be trained with relatively clean speech in various forms (spectrograms, LPC, LPC-cepstra, and the auditory representations). Various types and levels of noise will then be added to the signals, and the resulting performance decrement will be measured for each representation. Less controlled and more realistic experiments will also be carried out, with training and testing on naturally noisy speech such as that found in telephone channels (the Switchboard database) or recorded with changing microphones. Some of these experiments will be carried out in collaboration with the providers of widely available ASR and SI systems (e.g., Apple Inc. and Bell Labs), using those systems.
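The noise-addition step of this protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code of the proposal: the function names (`add_noise_at_snr`, `measured_snr_db`, `performance_decrement`) are hypothetical, a sine tone stands in for clean speech, white Gaussian noise stands in for the "various forms" of noise, and the recognizer itself is left out. The decrement is assumed here to be a simple difference of accuracies.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so that the signal-to-noise ratio equals snr_db."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)   # average power of the clean signal
    p_noise = np.mean(noise ** 2)   # average power of the unscaled noise
    # Choose the gain so that p_clean / (gain**2 * p_noise) = 10**(snr_db / 10).
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def measured_snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the noise component."""
    residual = np.asarray(noisy, dtype=float) - np.asarray(clean, dtype=float)
    return 10.0 * np.log10(np.mean(np.square(clean)) / np.mean(residual ** 2))

def performance_decrement(acc_clean, acc_noisy):
    """Drop in recognition accuracy when moving from clean to noisy test data."""
    return acc_clean - acc_noisy

# Placeholder signals (assumptions): a 1 s, 440 Hz tone sampled at 16 kHz
# instead of speech, and white Gaussian noise instead of recorded noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 440.0 * t)
noise = rng.standard_normal(16000)

for snr_db in (20, 10, 0):
    noisy = add_noise_at_snr(clean, noise, snr_db)
    print(snr_db, round(measured_snr_db(clean, noisy), 2))
```

In an actual experiment, each front-end representation would be computed from `noisy` at every SNR level, passed to the recognizer trained on clean speech, and compared across representations via `performance_decrement`.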

For our second objective, the various clustering algorithms discussed in Thrust Area II (e.g., TSVQ, wavelet-PCA, and ICA) will be applied to multiscale representations of speech spectra. The resulting clusters and distinctive classifications at various scales may then be interpreted in terms of the numerous well-known vocal tract articulatory features, acoustic cues, or phonemic classifications and relations. Thus, using speech (and possibly musical sounds) allows us to gain insights that can be carried over to less structured and less understood signals, such as the acoustic signals encountered in manufacturing and machine monitoring or those from battlefield acoustic sensors.
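To illustrate how a tree-structured vector quantizer (TSVQ) yields classifications at several scales, here is a small sketch. It rests on several assumptions that are not taken from the proposal: each node is split in two by a plain 2-means (Lloyd) iteration, the data are synthetic 2-D points rather than multiscale speech spectra, and the names `two_means`, `build_tsvq`, and `encode` are hypothetical. The point is only that the first bits of a code give a coarse class and later bits refine it.

```python
import numpy as np

def two_means(x, n_iter=20):
    """Split the rows of x into two groups with a few Lloyd iterations."""
    # Deterministic initialisation: the point farthest from the centroid,
    # and the point farthest from that first point.
    c0 = x[np.argmax(np.sum((x - x.mean(axis=0)) ** 2, axis=1))]
    c1 = x[np.argmax(np.sum((x - c0) ** 2, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    for _ in range(n_iter):
        d = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d, axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    # Final assignment against the updated centers.
    d = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.argmin(d, axis=1)

def build_tsvq(x, depth):
    """Build a binary codebook tree; each node stores the centroid of its data."""
    node = {"centroid": x.mean(axis=0), "children": None}
    if depth == 0 or len(x) < 2:
        return node
    labels = two_means(x)
    if labels.min() == labels.max():   # degenerate split: keep node as a leaf
        return node
    node["children"] = [build_tsvq(x[labels == k], depth - 1) for k in (0, 1)]
    return node

def encode(node, v):
    """Descend the tree toward the nearer child; return the path as a bit tuple."""
    path = []
    while node["children"] is not None:
        dists = [np.sum((v - child["centroid"]) ** 2) for child in node["children"]]
        k = int(np.argmin(dists))
        path.append(k)
        node = node["children"][k]
    return tuple(path)

# Synthetic data (assumption): four tight clusters, widely separated along the
# first axis and narrowly separated along the second.
rng = np.random.default_rng(0)
cluster_means = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
data = np.concatenate([m + 0.05 * rng.standard_normal((50, 2)) for m in cluster_means])

tree = build_tsvq(data, depth=2)
for m in cluster_means:
    print(m, encode(tree, m))
```

With this data the first bit separates the two widely spaced groups and the second bit separates the closely spaced pairs within each group, which is the kind of coarse-to-fine structure one would then try to relate to articulatory or phonemic categories.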






Didier A. Depireux
Mon May 19 22:18:36 EDT 1997