Hits 1 – 9 of 9

1	A large English–Thai parallel corpus from the web and machine-generated text [<Journal>]
	Lowphansirikul, Lalita [author]; Polpanumas, Charin [author]; Rutherford, Attapol T. [author].
	DNB Subject Category Language

2	Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation ...
	The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing 2021; Chuangsuwanich, Ekapol; Limkonchotiwat, Peerat. - : Underline Science Inc., 2021
	BASE

3	Robust Fragment-Based Framework for Cross-lingual Sentence Retrieval ...
	The 2021 Conference on Empirical Methods in Natural Language Processing 2021; Chuangsuwanich, Ekapol; Limkonchotiwat, Peerat. - : Underline Science Inc., 2021
	BASE

4	WangchanBERTa: Pretraining transformer-based Thai Language Models ...
	Lowphansirikul, Lalita; Polpanumas, Charin; Jantrakulchai, Nawat; Nutanong, Sarana. - : arXiv, 2021
	Abstract: Transformer-based language models, more specifically BERT-based architectures, have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the choices of models are limited to training a BERT-based model on a much smaller dataset or finetuning multilingual models, both of which yield suboptimal downstream performance. Moreover, large-scale multilingual pretraining does not take into account language-specific features for Thai. To overcome these limitations, we pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size), curated from diverse domains of social media posts, news articles and other publicly available datasets. We apply text processing rules that are specific to Thai, most importantly preserving spaces, which are important chunk and sentence boundaries in Thai, before subword tokenization. We also experiment with word-level, syllable-level and ... : 24 pages, edited the citation of the syllable-level tokenizer from [Chormai et al., 2020] to [Phatthiyaphaibun et al., 2020] as the authors used the syllable-level tokenizer from PyThaiNLP [Phatthiyaphaibun et al., 2020] in the experiments ...
	Keyword: Computation and Language cs.CL; FOS Computer and information sciences
	URL: https://dx.doi.org/10.48550/arxiv.2101.09635 https://arxiv.org/abs/2101.09635
	BASE

5	Handling cross and out-of-domain samples in Thai word segmentation
	Sarwar, Raheem; Phatthiyaphaibun, Wannaphong; Nutanong, Sarana...
	In: 1003 ; 1016 (2021)
	BASE

6	Robust fragment-based framework for cross-lingual sentence retrieval
	Nutanong, Sarana; Sarwar, Raheem; Phatthiyaphaibun, Wannaphong...
	In: Findings of the Association for Computational Linguistics: EMNLP 2021 ; 935 ; 944 (2021)
	BASE

7	Domain adaptation of Thai word segmentation models using stacked ensemble
	Limkonchotiwat, Peerat; Chuangsuwanich, Ekapol; Phatthiyaphaibun, Wannaphong...
	In: 3841 ; 3847 (2020)
	BASE

8	Native language identification of fluent and advanced non-native writers
	Sarwar, Raheem; Rutherford, Attapol T; Hassan, Saeed-Ul...
	In: 19 ; 4 ; 1 (2020)
	BASE

9	A scalable framework for stylometric analysis query processing
	Nutanong, Sarana; Yu, Chenyun; Sarwar, Raheem. - : IEEE, 2017
	BASE
