81 |
Unsupervised word-level prosody tagging for controllable speech synthesis ...
BASE
82 |
Filter-based Discriminative Autoencoders for Children Speech Recognition ...
83 |
Transducer-based language embedding for spoken language identification ...
84 |
Multi-sequence Intermediate Conditioning for CTC-based ASR ...
Abstract:
End-to-end automatic speech recognition (ASR) directly maps input speech to a character sequence without using pronunciation lexica. However, in languages with thousands of characters, such as Japanese and Mandarin, modeling all these characters is problematic due to data scarcity. To alleviate the problem, we propose a multi-task learning model with explicit interaction between characters and syllables by utilizing the Self-conditioned connectionist temporal classification (CTC) technique. While the original Self-conditioned CTC estimates character-level intermediate predictions by applying auxiliary CTC losses to a set of intermediate layers, the proposed method additionally estimates syllable-level intermediate predictions in another set of intermediate layers. The character-level and syllable-level predictions are alternately used as conditioning features to deal with the mutual dependency between characters and syllables. Experimental results on Japanese and Mandarin datasets show that the proposed ... : This paper was submitted to INTERSPEECH 2022 ...
Keywords:
Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); FOS: Computer and information sciences; FOS: Electrical engineering, electronic engineering, information engineering; Sound (cs.SD)
URL: https://arxiv.org/abs/2204.00175 https://dx.doi.org/10.48550/arxiv.2204.00175
86 |
Multistream neural architectures for cued-speech recognition using a pre-trained visual feature extractor and constrained CTC decoding ...
87 |
Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech ...
88 |
Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling ...
89 |
CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations ...
90 |
Enhance Language Identification using Dual-mode Model with Knowledge Distillation ...
91 |
MAESTRO: Matched Speech Text Representations through Modality Matching ...
93 |
Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems ...
95 |
Wavebender GAN: An architecture for phonetically meaningful speech manipulation ...
97 |
Lombard Effect for Bilingual Speakers in Cantonese and English: importance of spectro-temporal features ...
98 |
MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data ...
99 |
DeepFry: Identifying Vocal Fry Using Deep Neural Networks ...
100 |
MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis ...