81. Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-trained DNN-HMM-Based Acoustic-Phonetic Model ...
Abstract:
In spoken language understanding (SLU), the user's utterance is mapped to an intent. Recent work on end-to-end SLU has shown that pre-training can improve accuracy. We revisit the ideas of Lugosch et al. on speech pre-training and three-module modeling; however, to ease construction of the end-to-end SLU model, we use as our phoneme module an open-source acoustic-phonetic model from a DNN-HMM hybrid automatic speech recognition (ASR) system instead of training one from scratch. We therefore fine-tune on speech only for the word module, and we apply multi-target learning (MTL) to the word and intent modules to jointly optimize SLU performance. MTL yields a 40% relative reduction in intent-classification error rate (from 1.0% to 0.6%). The three-module model is also a streaming method. Overall, the proposed approach reaches an intent accuracy of 99.4% on FluentSpeech, a 50% reduction in intent error rate relative to Lugosch et al. ... Published in INTERSPEECH 2021 ...
Keywords:
Audio and Speech Processing eess.AS; Computation and Language cs.CL; FOS Computer and information sciences; FOS Electrical engineering, electronic engineering, information engineering; Sound cs.SD
URL: https://arxiv.org/abs/2204.03315 https://dx.doi.org/10.48550/arxiv.2204.03315
84. Improving speaker de-identification with functional data analysis of f0 trajectories ...
85. Unsupervised word-level prosody tagging for controllable speech synthesis ...
86. Filter-based Discriminative Autoencoders for Children Speech Recognition ...
87. Transducer-based language embedding for spoken language identification ...
88. Effects of Spatial Speech Presentation on Listener Response Strategy for Talker-Identification ...
90. Multi-sequence Intermediate Conditioning for CTC-based ASR ...
92. Multistream neural architectures for cued-speech recognition using a pre-trained visual feature extractor and constrained CTC decoding ...
93. Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech ...
94. Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling ...
95. CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations ...
96. Enhance Language Identification using Dual-mode Model with Knowledge Distillation ...
97. MAESTRO: Matched Speech Text Representations through Modality Matching ...
99. Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems ...