
Search in the Catalogues and Directories

Hits 1 – 20 of 22

1
XTREME-S: Evaluating Cross-lingual Speech Representations ...
BASE
2
Multilingual Mix: Example Interpolation Improves Multilingual Neural Machine Translation ...
BASE
3
Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning ...
BASE
4
mSLAM: Massively multilingual joint pre-training for speech and text ...
Bapna, Ankur; Cherry, Colin; Zhang, Yu. - : arXiv, 2022
BASE
5
Examining Scaling and Transfer of Language Model Architectures for Machine Translation ...
BASE
6
MAESTRO: Matched Speech Text Representations through Modality Matching ...
BASE
7
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
In: https://hal.inria.fr/hal-03177623 (2021)
Abstract: To appear in the proceedings of the AfricaNLP 2021 workshop. With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
Keyword: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
URL: https://hal.inria.fr/hal-03177623
BASE
8
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets ...
BASE
9
Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents ...
BASE
10
Joint Unsupervised and Supervised Training for Multilingual ASR ...
Bai, Junwen; Li, Bo; Zhang, Yu. - : arXiv, 2021
BASE
11
Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation ...
BASE
12
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference ...
BASE
13
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference ...
BASE
14
Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation
In: Zhang, Biao; Bapna, Ankur; Sennrich, Rico; Firat, Orhan (2021). Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation. In: International Conference on Learning Representations, Virtual, 3 May 2021 - 7 May 2021, ICLR. (2021)
BASE
15
Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation ...
BASE
16
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus ...
BASE
17
Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation ...
BASE
18
Investigating Multilingual NMT Representations at Scale ...
BASE
19
Simple, Scalable Adaptation for Neural Machine Translation ...
BASE
20
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges ...
BASE

