The Centre for Speech Technology Research, University of Edinburgh
In current approaches to speech synthesis, we generally obtain the best-sounding output by concatenating recordings of natural speech in the waveform domain. The competing method, in which a statistical model drives a vocoder, has until recently suffered from significant artefacts that reduced perceived naturalness.
Now, statistical parametric methods are rapidly improving, and are finally good enough for commercial deployment. The most recent boost to quality has come about through a convergence of acoustic modelling and waveform generation, in which the model directly generates a waveform, or a very closely related representation.
But in this rush to use statistical models to directly generate waveforms, much of what we know about speech signal processing – whether source-filter modelling, the cepstrum, or something as simple as perceptually-motivated frequency scale warping – is being questioned or simply forgotten. We see models generating exceedingly naive representations, such as 8-bit quantised waveform samples.
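As a concrete illustration of such a representation, here is a minimal sketch of 8-bit μ-law companding of the kind used by some direct waveform-generation models; the function names are illustrative, not taken from any particular system or library, and a real model would predict a distribution over these 256 classes rather than the samples themselves.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand a waveform in [-1, 1] into 256 discrete levels (8 bits)."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Map [-1, 1] onto integer bins 0..255
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(bins, mu=255):
    """Invert the companding: integer bins back to approximate samples."""
    compressed = 2 * bins.astype(np.float64) / mu - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

x = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
decoded = mu_law_decode(mu_law_encode(x))
# Only 256 levels survive the round trip; the quantisation error is
# audible as noise, especially at low signal amplitudes.
```

The companding gives finer resolution near zero amplitude than uniform quantisation would, but the representation still discards far more structure than, say, a source-filter decomposition retains.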
In my talk, I will ask whether this is just the inevitable march of Machine Learning, or whether there is a missed opportunity. It seems rather unlikely that an 8-bit quantised waveform is the best domain for an objective function aiming to maximise perceived naturalness. Surely, experts in signal processing (by which I mean YOU!) can do better?