1 |
Cross-Situational Learning Towards Robot Grounding
|
|
|
|
In: https://hal.archives-ouvertes.fr/hal-03628290 ; 2022 (2022)
|
|
BASE
|
|
|
|
3 |
A Neural Pairwise Ranking Model for Readability Assessment ...
|
|
|
|
BASE
|
|
|
|
4 |
pNLP-Mixer: an Efficient all-MLP Architecture for Language ...
|
|
|
|
BASE
|
|
|
|
5 |
Does Corpus Quality Really Matter for Low-Resource Languages? ...
|
|
|
|
Abstract:
The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is primarily constrained by the quantity rather than the quality of the data, prompting for ...
|
|
Keyword:
Artificial Intelligence cs.AI; Computation and Language cs.CL; FOS Computer and information sciences; Machine Learning cs.LG
|
|
URL: https://dx.doi.org/10.48550/arxiv.2203.08111 https://arxiv.org/abs/2203.08111
|
|
BASE
|
|
|
|
6 |
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers ...
|
|
|
|
BASE
|
|
|
|
7 |
Learning Bidirectional Translation between Descriptions and Actions with Small Paired Data ...
|
|
|
|
BASE
|
|
|
|
8 |
A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots ...
|
|
|
|
BASE
|
|
|
|
9 |
Improving Intrinsic Exploration with Language Abstractions ...
|
|
|
|
BASE
|
|
|
|
10 |
GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records ...
|
|
|
|
BASE
|
|
|
|
11 |
Chain-based Discriminative Autoencoders for Speech Recognition ...
|
|
|
|
BASE
|
|
|
|
12 |
Cross-Platform Difference in Facebook and Text Messages Language Use: Illustrated by Depression Diagnosis ...
|
|
|
|
BASE
|
|
|
|
13 |
Improving Word Translation via Two-Stage Contrastive Learning ...
|
|
|
|
BASE
|
|
|
|
14 |
Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction ...
|
|
|
|
BASE
|
|
|
|
15 |
COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics ...
|
|
|
|
BASE
|
|
|
|
16 |
EnCBP: A New Benchmark Dataset for Finer-Grained Cultural Background Prediction in English ...
|
|
|
|
BASE
|
|
|
|
17 |
Adversarial Robustness of Neural-Statistical Features in Detection of Generative Transformers ...
|
|
|
|
BASE
|
|
|
|
18 |
The Mapping of Deep Language Models on Brain Responses Primarily Depends on their Performance
|
|
|
|
In: https://hal.archives-ouvertes.fr/hal-03361439 ; 2021 (2021)
|
|
BASE
|
|
|
|
19 |
Recognizing lexical units in low-resource language contexts with supervised and unsupervised neural networks
|
|
|
|
In: https://hal.archives-ouvertes.fr/hal-03429051 ; [Research Report] LACITO (UMR 7107). 2021 (2021)
|
|
BASE
|
|
|
|
20 |
Privacy and utility of x-vector based speaker anonymization
|
|
|
|
In: https://hal.inria.fr/hal-03197376 ; 2021 (2021)
|
|
BASE
|
|
|
|
|
|