1. Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition ...
   Source: BASE
2. Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training ...
4. Improving the fusion of acoustic and text representations in RNN-T ...
6. Automatic Depression Detection: An Emotional Audio-Textual Corpus and a GRU/BiLSTM-based Model ...
8. Separate What You Describe: Language-Queried Audio Source Separation ...
9. Chain-based Discriminative Autoencoders for Speech Recognition ...
10. Unsupervised word-level prosody tagging for controllable speech synthesis ...
11. gTLO: A Generalized and Non-linear Multi-Objective Deep Reinforcement Learning Approach ...
12. Cetacean Translation Initiative: a roadmap to deciphering the communication of sperm whales ...
13. Improving End-To-End Modeling for Mispronunciation Detection with Effective Augmentation Mechanisms ...
14. An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation ...
15. NVC-Net: End-to-End Adversarial Voice Conversion ...
    Abstract: Voice conversion has gained increasing popularity in many applications of speech synthesis. The idea is to change the voice identity from one speaker into another while keeping the linguistic content unchanged. Many voice conversion approaches rely on the use of a vocoder to reconstruct the speech from acoustic features, and as a consequence, the speech quality heavily depends on such a vocoder. In this paper, we propose NVC-Net, an end-to-end adversarial network, which performs voice conversion directly on the raw audio waveform of arbitrary length. By disentangling the speaker identity from the speech content, NVC-Net is able to perform non-parallel traditional many-to-many voice conversion as well as zero-shot voice conversion from a short utterance of an unseen target speaker. Importantly, NVC-Net is non-autoregressive and fully convolutional, achieving fast inference. Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than ...
    Keywords: Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); FOS: Computer and information sciences; FOS: Electrical engineering, electronic engineering, information engineering; Sound (cs.SD)
    URL: https://arxiv.org/abs/2106.00992 | https://dx.doi.org/10.48550/arxiv.2106.00992
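The abstract's central idea, factoring an utterance into a speaker-independent content code and a fixed-size speaker embedding and then decoding the pair back into a waveform (which is what enables zero-shot conversion from a short utterance of an unseen speaker), can be sketched in a few lines. This is a toy illustration only: the encoder and decoder functions below are illustrative stand-ins (frame envelope and global statistics), not the convolutional networks NVC-Net actually trains.

```python
import numpy as np

rng = np.random.default_rng(0)

def content_encoder(wave, frame=160):
    # Toy stand-in for a content encoder: a downsampled frame envelope,
    # intended to carry the "what was said" information.
    usable = len(wave) // frame * frame
    return wave[:usable].reshape(-1, frame).mean(axis=1)

def speaker_encoder(wave):
    # Toy stand-in for a speaker encoder: a fixed-size embedding of
    # global statistics, independent of the utterance length.
    return np.array([wave.mean(), wave.std()])

def decoder(content, spk_emb, frame=160):
    # Toy stand-in for a decoder: upsample the content code back to the
    # waveform rate and modulate it with the speaker embedding.
    up = np.repeat(content, frame)
    return up * spk_emb[1] + spk_emb[0]

src = rng.standard_normal(16000)  # 1 s of "source" audio at 16 kHz
tgt = rng.standard_normal(16000)  # short utterance from an unseen "target" speaker

# Zero-shot conversion: source content, target speaker identity.
converted = decoder(content_encoder(src), speaker_encoder(tgt))
print(converted.shape)
```

The point of the factorization is that only the speaker embedding changes between conversions, so any short reference utterance, even from a speaker never seen in training, can supply the target identity.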
16. Speech2Slot: An End-to-End Knowledge-based Slot Filling from Speech ...
17. NIST SRE CTS Superset: A large-scale dataset for telephony speaker recognition ...
18. Interpreting intermediate convolutional layers of CNNs trained on raw speech ...
19. A multispeaker dataset of raw and reconstructed speech production real-time MRI video and 3D volumetric images ...