
Search in the Catalogues and Directories







Hits 1 – 11 of 11

1	Measuring the Impact of Individual Domain Factors in Self-Supervised Pre-Training ...
	Sanabria, Ramon; Hsu, Wei-Ning; Baevski, Alexei. - : arXiv, 2022
	BASE

2	Simple and Effective Unsupervised Speech Synthesis ...
	Liu, Alexander H.; Lai, Cheng-I Jeff; Hsu, Wei-Ning. - : arXiv, 2022
	BASE

3	Textless Speech-to-Speech Translation on Real Data ...
	Lee, Ann; Gong, Hongyu; Duquenne, Paul-Ambroise. - : arXiv, 2021
	BASE

4	HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units ...
	Hsu, Wei-Ning; Bolte, Benjamin; Tsai, Yao-Hung Hubert. - : arXiv, 2021
	BASE

5	Textless Speech Emotion Conversion using Discrete and Decomposed Representations ...
	Kreuk, Felix; Polyak, Adam; Copet, Jade. - : arXiv, 2021
	BASE

6	A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning ...
	Khurana, Sameer; Laurent, Antoine; Hsu, Wei-Ning. - : arXiv, 2020
	BASE

7	Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech ...
	Harwath, David; Hsu, Wei-Ning; Glass, James. - : arXiv, 2019
	BASE

8	Transfer Learning from Audio-Visual Grounding to Speech Recognition ...
	Hsu, Wei-Ning; Harwath, David; Glass, James. - : arXiv, 2019
	Abstract: Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks. This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts. As semantics of speech are largely determined by its lexical content, grounding models learn to preserve phonetic information while disregarding uncorrelated factors, such as speaker and channel. To study the properties of features distilled from different layers, we use them as input separately to train multiple speech recognition models. Empirical results demonstrate that layers closer to input retain more phonetic information, while following layers exhibit greater invariance to domain shift. Moreover, while most previous studies include training data for speech recognition for feature extractor training, our ...
	Note: Accepted to Interspeech 2019. 4 pages, 2 figures ...
	Keyword: Audio and Speech Processing eess.AS; Computation and Language cs.CL; FOS Computer and information sciences; FOS Electrical engineering, electronic engineering, information engineering; Machine Learning cs.LG; Sound cs.SD
	URL: https://arxiv.org/abs/1907.04355 https://dx.doi.org/10.48550/arxiv.1907.04355
	BASE

9	Unsupervised Adaptation with Interpretable Disentangled Representations for Distant Conversational Speech Recognition ...
	Hsu, Wei-Ning; Tang, Hao; Glass, James. - : arXiv, 2018
	BASE

10	Unsupervised Representation Learning of Speech for Dialect Identification ...
	Shon, Suwon; Hsu, Wei-Ning; Glass, James. - : arXiv, 2018
	BASE

11	Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data ...
	Hsu, Wei-Ning; Zhang, Yu; Glass, James. - : arXiv, 2017
	BASE
