Hits 1 – 7 of 7

1	Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning ...
	Siddhant, Aditya; Bapna, Ankur; Firat, Orhan. - : arXiv, 2022
	BASE

2	Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
	Caswell, Isaac; Kreutzer, Julia; Wang, Lisa...
	In: https://hal.inria.fr/hal-03177623 ; 2021
	BASE

3	Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets ...
	Kreutzer, Julia; Caswell, Isaac; Wang, Lisa. - : arXiv, 2021
	BASE

4	Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus ...
	Caswell, Isaac; Breiner, Theresa; van Esch, Daan; Bapna, Ankur. - : arXiv, 2020
	Abstract: Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based ... : Accepted to COLING 2020. 9 pages with 8 page abstract ...
	Keyword: Computation and Language cs.CL; FOS Computer and information sciences; Machine Learning cs.LG
	URL: https://arxiv.org/abs/2010.14571 https://dx.doi.org/10.48550/arxiv.2010.14571
	BASE

5	BLEU might be Guilty but References are not Innocent ...
	Freitag, Markus; Grangier, David; Caswell, Isaac. - : arXiv, 2020
	BASE

6	Investigating Multilingual NMT Representations at Scale ...
	Kudugunta, Sneha Reddy; Bapna, Ankur; Caswell, Isaac. - : arXiv, 2019
	BASE

7	Translationese as a Language in "Multilingual" NMT ...
	Riley, Parker; Caswell, Isaac; Freitag, Markus. - : arXiv, 2019
	BASE

© 2013 - 2024 Lin|gu|is|tik