1 |
Learning representations of speech from the raw waveform ; Apprentissage de représentations de la parole à partir du signal brut
In: https://tel.archives-ouvertes.fr/tel-02278616 ; Machine Learning [cs.LG]. Université Paris sciences et lettres, 2019. English. ⟨NNT : 2019PSLEE004⟩ (2019)
2 |
Learning to detect dysarthria from raw speech
In: https://hal.archives-ouvertes.fr/hal-02274504 ; ICASSP-2019 - IEEE International Conference on Acoustics, Speech and Signal Processing, May 2019, Brighton, United Kingdom (2019)
3 |
Learning Filterbanks from Raw Speech for Phoneme Recognition
In: https://hal.archives-ouvertes.fr/hal-01888737 ; ICASSP 2018 - IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2018, Calgary, Alberta, Canada (2018)
4 |
Sampling strategies in Siamese Networks for unsupervised speech representation learning
In: https://hal.archives-ouvertes.fr/hal-01888725 ; Interspeech 2018, Sep 2018, Hyderabad, India (2018)
5 |
End-to-End Speech Recognition From the Raw Waveform
In: https://hal.archives-ouvertes.fr/hal-01888739 ; Interspeech 2018, Sep 2018, Hyderabad, India. ⟨10.21437/Interspeech.2018-2414⟩ (2018)
Abstract:
Accepted for presentation at Interspeech 2018 ; International audience ; State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two trainable replacements for mel-filterbanks that use a convolutional architecture. The first is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second relates to the low-pass filter used in these approaches. These modifications consistently improve performance for both approaches and remove the need for careful initialization in scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks. This is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large-vocabulary task under clean recording conditions.
Keyword:
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]; [SCCO.LING]Cognitive science/Linguistics; [SCCO]Cognitive science; deep; end-to-end; gammatones; scattering; speech recognition; waveform
URL: https://doi.org/10.21437/Interspeech.2018-2414 https://hal.archives-ouvertes.fr/hal-01888739 https://hal.archives-ouvertes.fr/hal-01888739/file/Zeghidour_USCD_2018_End2end_from_wav.Interspeech.pdf https://hal.archives-ouvertes.fr/hal-01888739/document
6 |
SING: Symbol-to-Instrument Neural Generator
In: https://hal.archives-ouvertes.fr/hal-01899949 ; Conference on Neural Information Processing Systems (NIPS), Dec 2018, Montréal, Canada (2018)
8 |
Fader Networks: Manipulating Images by Sliding Attributes
In: https://hal.archives-ouvertes.fr/hal-02275215 ; 31st Conference on Neural Information Processing Systems (NIPS 2017), Dec 2017, Long Beach, CA, United States. pp.5969-5978 (2017)
9 |
Learning Weakly Supervised Multimodal Phoneme Embeddings
In: https://hal.inria.fr/hal-01687415 ; Interspeech 2017, 2017, Stockholm, Sweden. ⟨10.21437/Interspeech.2017-1689⟩ (2017)
10 |
Learning weakly supervised multimodal phoneme embeddings ...