
Search in the Catalogues and Directories

Hits 1 – 3 of 3

1
A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning ...
Abstract: Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models' ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in low-resource languages, leading models to produce suboptimal representations. Additionally, the dependency on a fixed vocabulary limits the subword models' adaptability across languages and domains. In this work, we propose a vocabulary-free neural tokenizer by distilling segmentation information from heuristic-based subword tokenization. We pre-train our character-based tokenizer by processing unique words from multilingual corpus, thereby extensively increasing word diversity across languages. Unlike the predefined and fixed vocabularies in subword methods, our tokenizer allows end-to-end task learning, resulting in optimal task-specific tokenization. The experimental results show that replacing the subword tokenizer with our neural tokenizer consistently improves ...
Keywords: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); FOS: Computer and information sciences
URL: https://dx.doi.org/10.48550/arxiv.2204.10815
https://arxiv.org/abs/2204.10815
BASE
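The abstract above only sketches the approach. As a rough illustration of the distillation idea it describes (deriving character-level segmentation labels from a heuristic subword tokenizer and training a character-based model to predict them), here is a minimal PyTorch sketch. The model architecture, hyperparameters, and the toy "teacher" segmentations are hypothetical stand-ins, not taken from the paper.

```python
# Illustrative sketch (not the paper's code): distill segmentation boundaries
# from a heuristic subword tokenizer into a character-level neural predictor.
# All names, hyperparameters, and the toy teacher data are hypothetical.
import torch
import torch.nn as nn

# Toy "teacher": heuristic subword segmentations (stand-in for BPE/WordPiece).
teacher = {
    "unhappiness": ["un", "happi", "ness"],
    "tokenizers": ["token", "izer", "s"],
    "multilingual": ["multi", "lingual"],
}

chars = sorted({c for w in teacher for c in w})
char2id = {c: i + 1 for i, c in enumerate(chars)}  # 0 reserved for padding

def encode(word):
    return torch.tensor([[char2id[c] for c in word]])  # shape (1, len)

def boundary_labels(word, pieces):
    # Label 1 where a new subword starts (the distillation target), else 0.
    labels, pos = [0.0] * len(word), 0
    for piece in pieces:
        labels[pos] = 1.0
        pos += len(piece)
    return torch.tensor([labels])

class CharBoundaryTokenizer(nn.Module):
    """Character-level model predicting per-character segmentation boundaries."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, 1)

    def forward(self, char_ids):
        hidden, _ = self.rnn(self.emb(char_ids))
        return self.out(hidden).squeeze(-1)  # boundary logits, shape (1, len)

model = CharBoundaryTokenizer(len(char2id) + 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):  # distill the heuristic segmentations
    for word, pieces in teacher.items():
        opt.zero_grad()
        loss = loss_fn(model(encode(word)), boundary_labels(word, pieces))
        loss.backward()
        opt.step()

def segment(word):
    # Cut wherever the predicted boundary probability exceeds 0.5.
    probs = torch.sigmoid(model(encode(word)))[0]
    cuts = [i for i, p in enumerate(probs) if p > 0.5 and i > 0]
    return [word[a:b] for a, b in zip([0] + cuts, cuts + [len(word)])]

print(segment("unhappiness"))
```

In the setup the abstract describes, such a boundary predictor would presumably also be trained jointly with the downstream task (end-to-end) rather than only on distilled labels; the sketch above shows just the distillation step.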
2
Query Expansion and Entity Weighting for Query Reformulation Retrieval in Voice Assistant Systems ...
BASE
3
A regularized maximum figure-of-merit (rMFoM) approach to supervised and semi-supervised learning
In: IEEE Transactions on Audio, Speech, and Language Processing (New York, NY: IEEE) 19 (2011) 5, pp. 1316-1327
BLLDB
OLC Linguistik
