1 |
Unsupervised word-level prosody tagging for controllable speech synthesis ...
Abstract:
Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel two-stage approach for unsupervised word-level prosody tagging: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies within each type of word separately using a Gaussian mixture model (GMM). This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ... : 5 pages, 6 figures, accepted to ICASSP2022 ...
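The two-stage tagging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the function names (`word_type`, `fit_gmm_1d`, `tag_words`), the hand-written phoneme-count splits standing in for the learned decision tree, and the use of a single scalar prosody value per word instead of a learned prosody representation.

```python
import math
from collections import defaultdict

def word_type(phonemes):
    """Stage 1 stand-in: a fixed two-level tree over phoneme count.
    The split points are hypothetical; the paper's tree is not reproduced."""
    if len(phonemes) <= 2:
        return "short"
    if len(phonemes) <= 5:
        return "medium"
    return "long"

def gauss(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, k, iters=50):
    """Fit a k-component 1-D GMM with EM; returns (weights, means, variances)."""
    s, n = sorted(xs), len(xs)
    # Deterministic init: the middle element of each of k equal chunks.
    means = [s[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    mu = sum(xs) / n
    variances = [sum((x - mu) ** 2 for x in xs) / n + 1e-6] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, xs)) / nj + 1e-6
    return weights, means, variances

def tag_words(words, k=2):
    """words: list of (phonemes, prosody_value) pairs.
    Returns one (word_type, cluster_index) tag per word."""
    groups = defaultdict(list)
    for i, (phonemes, _) in enumerate(words):
        groups[word_type(phonemes)].append(i)
    tags = [None] * len(words)
    for wtype, idxs in groups.items():
        # Stage 2: cluster prosody values separately within each word type.
        xs = [words[i][1] for i in idxs]
        w, m, v = fit_gmm_1d(xs, min(k, len(xs)))
        for i in idxs:
            x = words[i][1]
            scores = [wj * gauss(x, mj, vj) for wj, mj, vj in zip(w, m, v)]
            tags[i] = (wtype, scores.index(max(scores)))
    return tags
```

Because clustering runs separately per word type, cluster index 0 for "short" words and cluster index 0 for "long" words are unrelated labels, which mirrors the abstract's assumption that different types of words get different label sets.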
Keyword:
Artificial Intelligence cs.AI; Audio and Speech Processing eess.AS; FOS Computer and information sciences; FOS Electrical engineering, electronic engineering, information engineering; Machine Learning cs.LG; Sound cs.SD
URL: https://dx.doi.org/10.48550/arxiv.2202.07200 https://arxiv.org/abs/2202.07200
BASE
2 |
LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching ...
BASE
3 |
Glyph Enhanced Chinese Character Pre-Training for Lexical Sememe Prediction ...
BASE
4 |
Bridging the Communication Gap between Radiographers and Patients to Improve Chest Radiography Image Acquisition: A Multilingual Solution in the COVID-19 Pandemic
In: Radiography (Lond) (2021)
BASE
6 |
Concept Transfer Learning for Adaptive Language Understanding ...
BASE
7 |
Differences in Oral Structure and Tissue Interactions during Mouse vs. Human Palatogenesis: Implications for the Translation of Findings from Mice
BASE
8 |
Text Flow: A Unified Text Detection System in Natural Scene Images ...
BASE
10 |
Women Leaders of Higher Education: Female Executives in Leading Universities in China
In: Cross-Cultural Communication; Vol 9, No 6 (2013); 40-45 ; 1923-6700 ; 1712-8358
BASE
13 |
Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis
In: Speech Communication, Elsevier : North-Holland, 2011, ⟨10.1016/j.specom.2011.03.003⟩ ; ISSN: 0167-6393 ; EISSN: 1872-7182 ; https://hal.archives-ouvertes.fr/hal-00746106
BASE
19 |
Recognition of Syllable-Contracted Words in Spontaneous Speech Using Word Expansion and Duration Information
In: http://isca-speech.org/archive_open/archive_papers/iscslp2008/225.pdf
BASE