School of Philosophy, Psychology and Language Sciences, University of Edinburgh
A fundamental question is: What are the basic building blocks of speech? To answer this question, I am working in a number of areas.
In speech recognition, I am looking at new acoustic models, such as Linear Dynamical Models, factorial HMMs and other graphical models that can represent speech not as ‘beads on a string’ but as streams of interacting factors. I’ve investigated ways to automatically find an inventory of suitable units to model, and have worked on other alternatives to phonetic units, such as graphemes. One long-standing interest is the use of phonological/acoustic/articulatory features and articulatory measurement data as a tool to develop models of speech.
In speech synthesis, I work on both unit selection methods and HMM-based speech synthesis. In both of these areas, the definition of the unit of speech is crucial. Both typically use context-dependent phonemes or diphones, so here we can gain some insight into the basic building blocks of speech by asking “What contextual features must we model?” In unit selection, this means learning the target cost; in HMM-based speech synthesis, it relates to the clustering of acoustically similar units. Neither of these processes is entirely satisfactory, but improving them requires a better understanding of how we can construct speech from basic units.
I am increasingly interested in perceptual measures in speech synthesis, not just for evaluation of the final output, but within the synthesis process itself. In unit selection, perceptual measures should be used to determine equivalent units or contexts, because acoustic similarity and perceptual interchangeability are not the same thing. In HMM-based speech synthesis, the training criterion should be perceptual: perhaps minimum generation error gives us a way to use such a criterion? How can the requirements of acoustic modelling fit with this idea of perceptual equivalence?
In both recognition and synthesis, I have recently started work on multilingual systems as an additional way to look at the basic units of speech. Is there a universal set of building blocks for speech, and can we build systems that use common models or unit inventories for multiple languages?