1 |
Breathing and Speech Planning in Spontaneous Speech Synthesis
|
|
|
|
Abstract:
Breathing and speech planning in spontaneous speech are coordinated processes, often exhibiting disfluent patterns. While synthetic speech is not subject to respiratory needs, integrating breath into synthesis has advantages for naturalness and recall. At the same time, a synthetic voice reproducing disfluent breathing patterns learned from the data can be problematic. To address this, we first propose training stochastic TTS on a corpus of overlapping breath-group bigrams, to take context into account. Next, we introduce an unsupervised automatic annotation of likely-disfluent breath events, through a product-of-experts model that combines the output of two breath-event predictors, each using complementary information and operating in opposite directions. This annotation enables creating an automatically-breathing spontaneous speech synthesiser with a more fluent breathing style. A subjective evaluation on two spoken genres (impromptu and rehearsed) found the proposed system to be preferred over the baseline approach treating all breath events the same.
|
|
Keywords:
breathing; ensemble method; General Language Studies and Linguistics; speech planning; speech synthesis; spontaneous speech
|
|
URL: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-283731 https://doi.org/10.1109/ICASSP40776.2020.9054107
|
|
BASE
|
|
|
|
2 |
Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
|
|
Kucherenko, Taras; Henter, Gustav Eje; Beskow, Jonas. - : KTH, Speech, Music and Hearing, TMH, 2020. : KTH, Robotics, Perception and Learning, RPL, 2020. : Wiley, 2020
|
|
BASE
|
|
|
|
3 |
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
|
|
Stefanov, Kalin; Beskow, Jonas; Salvi, Giampiero. - : KTH, Speech, Music and Hearing, TMH, 2020. : Institute for Creative Technologies, University of Southern California, Los Angeles, CA 90089, United States, 2020. : NTNU Norwegian University of Science and Technology, Trondheim, Norway, 2020. : Institute of Electrical and Electronics Engineers (IEEE), 2020
|
|
BASE
|
|
|
|
4 |
The speech synthesis phoneticians need is both realistic and controllable ...
|
|
|
|
BASE
|
|
|
|
5 |
The speech synthesis phoneticians need is both realistic and controllable ...
|
|
|
|
BASE
|
|
|
|
6 |
PROMIS: a statistical-parametric speech synthesis system with prominence control via a prominence network
|
|
Malisz, Zofia; Berthelsen, Harald; Beskow, Jonas. - : KTH, Speech, Music and Hearing, TMH, 2019. : KTH, Speech Communication, 2019. : STTS – Södermalms talteknologiservice AB, 2019. : Vienna, 2019
|
|
BASE
|
|
|
|
7 |
Modern speech synthesis for phonetic sciences : A discussion and an evaluation
|
|
|
|
BASE
|
|
|
|
8 |
Off the cuff: Exploring extemporaneous speech delivery with TTS
|
|
|
|
BASE
|
|
|
|
9 |
The speech synthesis phoneticians need is both realistic and controllable
|
|
Malisz, Zofia; Henter, Gustav Eje; Valentini-Botinhao, Cassia. - : KTH, Speech, Music and Hearing, TMH, 2019. : KTH, Speech Communication, 2019. : The Centre for Speech Technology, The University of Edinburgh, UK, 2019. : Stockholm, 2019
|
|
BASE
|
|
|
|
11 |
A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction
|
|
Kontogiorgos, Dimosthenis; Avramova, Vanya; Alexanderson, Simon. - : KTH, Speech, Music and Hearing, TMH, 2018. : KTH, 2018. : Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland, 2018. : Paris, 2018
|
|
BASE
|
|
|
|
12 |
The proceedings of the 14th International Conference on Auditory-Visual Speech Processing
|
|
|
|
In: The 14th International Conference on Auditory-Visual Speech Processing (AVSP2017) ; https://hal.inria.fr/hal-01596625 ; The 14th International Conference on Auditory-Visual Speech Processing (AVSP2017), Aug 2017, Stockholm, Sweden. 2017 ; http://avsp2017.loria.fr (2017)
|
|
BASE
|
|
|
|
13 |
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition ...
|
|
|
|
BASE
|
|
|
|
14 |
Using deep neural networks to estimate tongue movements from speech face motion
|
|
|
|
BASE
|
|
|
|
16 |
Tutoring Robots
|
|
|
|
In: IFIP Advances in Information and Communication Technology ; 9th International Summer Workshop on Multimodal Interfaces (eNTERFACE) ; https://hal.inria.fr/hal-01350740 ; 9th International Summer Workshop on Multimodal Interfaces (eNTERFACE), Jul 2013, Lisbon, Portugal. pp.80-113, ⟨10.1007/978-3-642-55143-7_4⟩ (2013)
|
|
BASE
|
|
|
|
18 |
Visual Recognition of Isolated Swedish Sign Language Signs ...
|
|
|
|
BASE
|
|
|
|
|
|