15 January: FW: Talk on Friday 18th Jan at 11.30am
From: Naomi Harte
Hi All,
We have an excellent talk to add to your diary for next week. Prof. Simon King will be visiting from Edinburgh and will give a talk on deep learning, exploring whether there is still space for signal processing in speech processing. It promises to be a really interesting talk, and we have booked a large room as we hope to get a good crowd. See below for details.
Talk from Prof. Simon King on Friday 18th January at 11.30am
Location: Thomas Davis Lecture Theatre (Room 2043), Arts Block. (map - https://www.tcd.ie/Maps/map.php?b=58)
--
BIO
Simon King is Professor of Speech Processing at the University of Edinburgh, where he is director of the Centre for Speech Technology Research and of a long-running Masters programme in Speech and Language Processing. His research interests include speech synthesis, speech recognition, speaker verification spoofing, and signal processing, with around 200 publications. He co-authored the Festival speech synthesis toolkit and contributed to Merlin. He is a Fellow of the IEEE and of ISCA. He is currently an associate editor of Computer Speech and Language, and has previously served as an associate editor of IEEE Transactions on Audio, Speech and Language Processing, a member of the IEEE Spoken Language Technical Committee, a board member of the ISCA Speech Synthesis SIG, and coordinator of the Blizzard Challenges from 2007 to 2018.
TITLE
Does “end-to-end” speech synthesis mean we don’t need text processing or signal processing any more?
ABSTRACT
Almost every text-to-speech synthesiser contains three components. A front-end text processor normalises the input text and extracts useful features from it. An acoustic model performs regression from these features to an acoustic representation, such as a spectrogram. A waveform generator then creates the corresponding waveform.
In many commercially deployed speech synthesisers, the waveform generator still constructs the output signal by concatenating pre-recorded fragments of natural speech. But very soon we expect that to be replaced by a neural vocoder that directly outputs a waveform. Neural approaches are already the dominant choice for acoustic modelling, starting with simple deep neural networks guiding waveform concatenation, and progressing to sequence-to-sequence models driving a vocoder. Completely replacing the traditional front-end pipeline with an entirely neural approach is trickier, although there are some impressive so-called "end-to-end" systems.
In this rush to use end-to-end neural models to directly generate waveforms given raw text input, much of what we know about text and speech signal processing appears to have been cast aside. Maybe this is a good thing: the new methods are a long-overdue breath of fresh air. Or, perhaps there is still some value in the knowledge accumulated from 50+ years of speech processing. If there is, how do we decide what to keep and what to discard - for example, is source-filter modelling still a good idea?
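For anyone less familiar with the pipeline the abstract describes, here is a minimal, purely illustrative Python sketch of the three-stage structure (front-end, acoustic model, waveform generator). Every function, name and numeric choice below is a toy stand-in invented for illustration, not code from any real synthesiser.

# Toy sketch of the classic three-stage text-to-speech pipeline.
# All three stages are invented stand-ins, chosen only to show the
# interfaces between them; no real front-end, acoustic model or
# vocoder is used here.
import numpy as np

def text_frontend(text: str) -> np.ndarray:
    """Toy 'front-end': normalise the text and map each character to a
    one-hot vector, standing in for linguistic features."""
    normalised = text.lower().strip()
    codes = np.array([ord(c) % 64 for c in normalised])
    feats = np.zeros((len(codes), 64))
    feats[np.arange(len(codes)), codes] = 1.0
    return feats  # shape: (num_symbols, feature_dim)

def acoustic_model(feats: np.ndarray, n_mels: int = 80) -> np.ndarray:
    """Toy 'acoustic model': a fixed random linear regression from the
    features to a mel-spectrogram-like representation."""
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((feats.shape[1], n_mels)) * 0.1
    return feats @ projection  # shape: (num_frames, n_mels)

def waveform_generator(spectrogram: np.ndarray,
                       sample_rate: int = 16000,
                       frame_len: int = 200) -> np.ndarray:
    """Toy 'vocoder': render each frame as a short sinusoid whose
    frequency is set by the frame's largest bin (not a real vocoder)."""
    frames = []
    for frame in spectrogram:
        freq = 100.0 + 10.0 * float(np.argmax(frame))
        t = np.arange(frame_len) / sample_rate
        frames.append(np.sin(2 * np.pi * freq * t))
    return np.concatenate(frames)

if __name__ == "__main__":
    feats = text_frontend("Hello from the seminar!")
    spec = acoustic_model(feats)
    wav = waveform_generator(spec)
    print(feats.shape, spec.shape, wav.shape)

Running this only prints the shapes of the intermediate representations; the point is the interface between the three stages, which the "end-to-end" systems discussed in the abstract collapse into a single learned model.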
--
Associate Professor Naomi Harte
School of Engineering,
Trinity College Dublin
www.sigmedia.tv
+353 1 896 1861/1580
----- End forwarded message -----