1 |
Does Corpus Quality Really Matter for Low-Resource Languages? ...
Abstract:
The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is primarily constrained by the quantity rather than the quality of the data, prompting for ...
Keyword:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer and information sciences (FOS); Machine Learning (cs.LG)
URL: https://dx.doi.org/10.48550/arxiv.2203.08111 https://arxiv.org/abs/2203.08111
BASE
3 |
Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter ...
4 |
Benchmarking Meta-embeddings: What Works and What Does Not ...
5 |
Multilingual Stance Detection: The Catalonia Independence Corpus ...
6 |
A Common Semantic Space for Monolingual and Cross-Lingual Meta-Embeddings ...
7 |
Give your Text Representation Models some Love: the Case for Basque ...
9 |
Q-WordNet PPV: Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages ...
11 |
IXA pipes: Efficient and Ready to Use Multilingual NLP tools ...
12 |
Norms of Conversation in a Framework for Agent Communication Languages
Agerri, Rodrigo. Dagstuhl Seminar Proceedings 07122, Normative Multi-agent Systems, 2007
14 |
Multilingual event detection using the NewsReader pipelines
15 |
Robust multilingual Named Entity Recognition with shallow semi-supervised features