
Search in the Catalogues and Directories

Hits 1 – 9 of 9

1
A large English–Thai parallel corpus from the web and machine-generated text [Journal]
Lowphansirikul, Lalita [Author]; Polpanumas, Charin [Author]; Rutherford, Attapol T. [Author]
DNB Subject Category Language
2
Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation ...
BASE
3
Robust Fragment-Based Framework for Cross-lingual Sentence Retrieval ...
BASE
4
WangchanBERTa: Pretraining transformer-based Thai Language Models ...
Abstract: Transformer-based language models, more specifically BERT-based architectures, have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the choice of models is limited to training a BERT-based model on a much smaller dataset or finetuning multi-lingual models, both of which yield suboptimal downstream performance. Moreover, large-scale multi-lingual pretraining does not take into account language-specific features for Thai. To overcome these limitations, we pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size), curated from diverse domains of social media posts, news articles and other publicly available datasets. We apply text processing rules that are specific to Thai, most importantly preserving spaces, which are important chunk and sentence boundaries in Thai, before subword tokenization. We also experiment with word-level, syllable-level and ... : 24 pages, edited the citation of the syllable-level tokenizer from [Chormai et al., 2020] to [Phatthiyaphaibun et al., 2020] as the authors used the syllable-level tokenizer from PyThaiNLP [Phatthiyaphaibun et al., 2020] in the experiments ...
Keyword: Computation and Language cs.CL; FOS Computer and information sciences
URL: https://dx.doi.org/10.48550/arxiv.2101.09635
https://arxiv.org/abs/2101.09635
BASE
5
Handling cross and out-of-domain samples in Thai word segmentation
In: pp. 1003–1016 (2021)
BASE
6
Robust fragment-based framework for cross-lingual sentence retrieval
In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 935–944 (2021)
BASE
7
Domain adaptation of Thai word segmentation models using stacked ensemble
In: pp. 3841–3847 (2020)
BASE
8
Native language identification of fluent and advanced non-native writers
In: 19 ; 4 ; 1 (2020)
BASE
9
A scalable framework for stylometric analysis query processing
BASE
