1 |
Statistical parametric speech synthesis using conversational data and phenomena
Abstract:
Statistical parametric text-to-speech synthesis currently relies on predefined and highly controlled prompts read in a “neutral” voice. This thesis presents work on utilising recordings of free conversation for the purpose of filled pause synthesis and as an inspiration for improved general modelling of speech for text-to-speech synthesis. A corpus of both standard prompts and free conversation is presented, and the potential usefulness of conversational speech as the basis for text-to-speech voices is validated. Additionally, psycholinguistic experiments show that filled pauses can have subconscious benefits for the listener, but that current text-to-speech voices cannot replicate these effects. A method for pronunciation variant forced alignment is presented in order to obtain a more accurate automatic speech segmentation, which is particularly poor for spontaneously produced speech. This pronunciation variant alignment is utilised not only to create a more accurate underlying acoustic model, but also as the driving force behind more natural pronunciation prediction at synthesis time. While this improves both the standard and spontaneous voices, the naturalness of voices based on spontaneous speech still lags behind that of voices based on standard read prompts. Thus, the synthesis of filled pauses is investigated in relation to specific phonetic modelling of filled pauses and through techniques for mixing standard prompts with spontaneous utterances, in order to retain the higher quality of voices based on standard speech while still utilising the spontaneous speech for filled pause modelling. A method for predicting where to insert filled pauses in the speech stream is also developed and presented, relying on an analysis of human filled pause usage and a mix of language modelling methods. The method achieves an insertion accuracy in close agreement with human usage.
The various approaches are evaluated and their improvements documented throughout the thesis. Finally, the resulting filled pause quality is assessed through a repetition of the psycholinguistic experiments and an evaluation of all the developed methods combined.
Keywords:
filled pause synthesis; neutral voice; phonetic modelling; pronunciation variant alignment; psycholinguistic; text-to-speech synthesis
URL: http://hdl.handle.net/1842/29016
BASE
5 |
Speaker similarity evaluation of foreign-accented speech synthesis using HMM-based speaker adaptation
7 |
Speaker adaptation and the evaluation of speaker similarity in the EMIME speech-to-speech translation project
In: http://infoscience.epfl.ch/record/150620 (2010)
8 |
Speaker adaptation and the evaluation of speaker similarity in the EMIME speech-to-speech translation project
9 |
Speaker adaptation and the evaluation of speaker similarity in the EMIME speech-to-speech translation project
14 |
Speech production knowledge in automatic speech recognition
16 |
An elitist approach to automatic articulatory-acoustic feature classification for phonetic characterization of spoken language.
17 |
Asynchronous Articulatory Feature Recognition Using Dynamic Bayesian networks
18 |
On the Articulatory Representation of Speech within the Evolving Transformation System Formalism
20 |
Syllable Classification Using Articulatory-Acoustic Features
Wester, Mirjam. International Speech Communication Association, 2003