1 |
Learning and controlling the source-filter representation of speech with a variational autoencoder
In: https://hal.archives-ouvertes.fr/hal-03650569 ; 2022 (2022)
Abstract:
17 pages, 4 figures, companion website: https://samsad35.github.io/site-sfvae/ ; Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency f0 and the formants are of primary importance. In this work, we show that the source-filter model of speech production naturally arises in the latent space of a variational autoencoder (VAE) trained in an unsupervised manner on a dataset of natural speech signals. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we experimentally illustrate that f0 and the formant frequencies are encoded in orthogonal subspaces of the VAE latent space, and we develop a weakly-supervised method to accurately and independently control these speech factors of variation within the learned latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on f0 and the formant frequencies, and which is applied to the transformation of speech signals.
Keyword:
[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]; Deep generative models; Representation learning; Source-filter model; Variational autoencoder
URL: https://hal.archives-ouvertes.fr/hal-03650569 https://hal.archives-ouvertes.fr/hal-03650569/document https://hal.archives-ouvertes.fr/hal-03650569/file/sadok2022learning.pdf
BASE
2 |
Learning and controlling the source-filter representation of speech with a variational autoencoder ...
3 |
Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation ...
4 |
High-resolution speaker counting in reverberant rooms using CRNN with Ambisonics features
In: EUSIPCO 2020 - 28th European Signal Processing Conference (EUSIPCO) ; https://hal.archives-ouvertes.fr/hal-03537323 ; EUSIPCO 2020 - 28th European Signal Processing Conference (EUSIPCO), Jan 2021, Amsterdam, Netherlands. pp.71-75, ⟨10.23919/Eusipco47968.2020.9287637⟩ (2021)
5 |
Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input
In: Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association ; https://hal.archives-ouvertes.fr/hal-03372802 ; Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic. pp.3865-3869, ⟨10.21437/Interspeech.2021-275⟩ (2021)
6 |
Learning robust speech representation with an articulatory-regularized variational autoencoder
In: Proceedings of Interspeech 2021 ; Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association ; https://hal.archives-ouvertes.fr/hal-03373252 ; Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic (2021)
7 |
Learning robust speech representation with an articulatory-regularized variational autoencoder ...
8 |
Towards an articulatory-driven neural vocoder for speech synthesis
In: ISSP 2020 - 12th International Seminar on Speech Production ; https://hal.archives-ouvertes.fr/hal-03184762 ; ISSP 2020 - 12th International Seminar on Speech Production, Dec 2020, Providence (virtual), United States (2020)
9 |
Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning
In: ISSN: 0899-7667 ; EISSN: 1530-888X ; Neural Computation ; https://hal.archives-ouvertes.fr/hal-03016083 ; Neural Computation, Massachusetts Institute of Technology Press (MIT Press), 2020, 32 (3), pp.596-625. ⟨10.1162/neco_a_01264⟩ (2020)
10 |
Deeppredspeech: Computational Models Of Predictive Speech Coding Based On Deep Learning ...
11 |
DeepPredSpeech: computational models of predictive speech coding based on deep learning ...
12 |
DeepPredSpeech: computational models of predictive speech coding based on deep learning ...
13 |
Extending the Cascaded Gaussian Mixture Regression Framework for Cross-Speaker Acoustic-Articulatory Mapping
In: ISSN: 2329-9290 ; EISSN: 2329-9304 ; IEEE/ACM Transactions on Audio, Speech and Language Processing ; https://hal.archives-ouvertes.fr/hal-01485540 ; IEEE/ACM Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2017, 25 (3), pp.662-673. ⟨10.1109/TASLP.2017.2651398⟩ (2017)
14 |
Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract
In: ISSN: 0167-6393 ; EISSN: 1872-7182 ; Speech Communication ; https://hal.archives-ouvertes.fr/hal-01578315 ; Speech Communication, Elsevier : North-Holland, 2017, 93, pp.63 - 75. ⟨10.1016/j.specom.2017.08.002⟩ (2017)
15 |
Voice Activity Detection Based on Statistical Likelihood Ratio With Adaptive Thresholding
In: IWAENC 2016 - International Workshop on Acoustic Signal Enhancement (IWAENC) ; https://hal.inria.fr/hal-01349776 ; IWAENC 2016 - International Workshop on Acoustic Signal Enhancement (IWAENC), Sep 2016, Xi'an, China. pp.1-5, ⟨10.1109/IWAENC.2016.7602911⟩ (2016)
16 |
Real-Time Control of an Articulatory-Based Speech Synthesizer for Brain Computer Interfaces
In: ISSN: 1553-734X ; EISSN: 1553-7358 ; PLoS Computational Biology ; https://hal.archives-ouvertes.fr/hal-01459706 ; PLoS Computational Biology, Public Library of Science, 2016, 12 (11), pp.e1005119. ⟨10.1371/journal.pcbi.1005119⟩ (2016)
18 |
Real-Time Control of an Articulatory-Based Speech Synthesizer for Brain Computer Interfaces
19 |
Characterizing and classifying Cued Speech vowels from labial parameters
In: 8th International Conference on Spoken Language Processing (ICSLP'04 or InterSpeech'04) ; https://hal.archives-ouvertes.fr/hal-00328134 ; 8th International Conference on Spoken Language Processing (ICSLP'04 or InterSpeech'04), 2004, Jeju, South Korea (2004)