The Centre for Speech Technology Research, University of Edinburgh
A fundamental question is: What are the basic building blocks of speech? To answer this question, I have worked on a number of problems in speech technology. In recent years, I have concentrated mainly on speech synthesis, working in both unit selection and statistical parametric paradigms (HMM-based and DNN-based). I have been considering the "building blocks question" in text processing, acoustic modelling and waveform generation.
In text processing, building blocks co-exist at many different levels of representation. Some, such as phonemes or syllables, require reliable linguistic knowledge of the language. Other unit types, such as graphemes, can be read directly from the written text and can therefore be used in a wider range of practical situations.
In acoustic modelling, the definition of the unit of speech is crucial. Both unit selection and statistical parametric approaches typically use small, naively defined units such as phones or diphones, but then adorn them with many contextual features, leading to severe sparsity. This sparsity is addressed by finding units-in-context that are somehow “equivalent” or “perceptually interchangeable”.
In waveform generation, we still obtain the best quality by concatenating recordings, but parametric methods are improving rapidly. There is a gradual convergence of acoustic modelling and waveform generation, which may overcome the limitations of current systems that couple a statistical model with a vocoder. In this convergence, however, much of what we know about the building blocks of speech signals – such as source and filter – is now being questioned or simply ignored.