Search in the Catalogues and Directories

Hits 1 – 15 of 15

1
Does Corpus Quality Really Matter for Low-Resource Languages? ...
Abstract: The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is primarily constrained by the quantity rather than the quality of the data, prompting for ...
Keyword: Artificial Intelligence cs.AI; Computation and Language cs.CL; FOS Computer and information sciences; Machine Learning cs.LG
URL: https://dx.doi.org/10.48550/arxiv.2203.08111
https://arxiv.org/abs/2203.08111
BASE
2
Multilingual Counter Narrative Type Classification ...
BASE
3
Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter ...
BASE
4
Benchmarking Meta-embeddings: What Works and What Does Not ...
BASE
5
Multilingual Stance Detection: The Catalonia Independence Corpus ...
BASE
6
A Common Semantic Space for Monolingual and Cross-Lingual Meta-Embeddings ...
BASE
7
Give your Text Representation Models some Love: the Case for Basque ...
BASE
8
Multilingual and Cross-lingual Timeline Extraction ...
BASE
9
Q-WordNet PPV: Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages ...
BASE
10
EliXa: A Modular and Flexible ABSA Platform ...
BASE
11
IXA pipes: Efficient and Ready to Use Multilingual NLP tools ...
Agerri, Rodrigo; Bermudez, Josu; Rigau, German. - : Unpublished, 2014
BASE
12
Norms of Conversation in a Framework for Agent Communication Languages
Agerri, Rodrigo. - : Dagstuhl Seminar Proceedings. 07122 - Normative Multi-agent Systems, 2007
BASE
13
Fixing unsaid meanings
In: Trends in cognitive sciences. - Amsterdam [u.a.] : Elsevier Science 6 (2002) 4, 149-150
BLLDB
14
Multilingual event detection using the NewsReader pipelines
Agerri, Rodrigo; Aldabe, Itziar; Laparra, Egoitz. - : International Conference on Language Resources and Evaluation (LREC)
BASE
15
Robust multilingual Named Entity Recognition with shallow semi-supervised features
BASE

Catalogues: 0
Bibliographies: 1
Linked Open Data catalogues: 0
Online resources: 0
Open access documents: 14