Catalogue search • Linguistik portal • Fachinformationsdienst (FID)

1	Towards a Cleaner Document-Oriented Multilingual Crawled Corpus ...
	Abadji, Julien; Suarez, Pedro Ortiz; Romary, Laurent. - : arXiv, 2022
	BASE

2	Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets ...
	Kreutzer, Julia; Caswell, Isaac; Wang, Lisa; Wahab, Ahsan; van Esch, Daan; Ulzii-Orshikh, Nasanbayar; Tapo, Allahsera; Subramani, Nishant; Sokolov, Artem; Sikasote, Claytone; Setyawan, Monang; Sarin, Supheakmungkol; Samb, Sokhar; Sagot, Benoît; Rivera, Clara; Rios, Annette; Papadimitriou, Isabel; Osei, Salomey; Suarez, Pedro Ortiz; Orife, Iroro; Ogueji, Kelechi; Rubungo, Andre Niyongabo; Nguyen, Toan Q.; Müller, Mathias; Müller, André; Muhammad, Shamsuddeen Hassan; Muhammad, Nanda; Mnyakeni, Ayanda; Mirzakhalov, Jamshidbek; Matangira, Tapiwanashe; Leong, Colin; Lawson, Nze; Kudugunta, Sneha; Jernite, Yacine; Jenny, Mathias; Firat, Orhan; Dossou, Bonaventure F. P.; Dlamini, Sakhile; de Silva, Nisansa; Ballı, Sakine Çabuk; Biderman, Stella; Battisti, Alessia; Baruwa, Ahmed; Bapna, Ankur; Baljekar, Pallavi; Azime, Israel Abebe; Awokoya, Ayodele; Ataman, Duygu; Ahia, Orevaoghene; Ahia, Oghenefego. - : arXiv, 2021
	Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases. ... : Accepted at TACL; pre-MIT Press publication version ...
	Keyword: Artificial Intelligence cs.AI; Computation and Language cs.CL; FOS Computer and information sciences
	URL: https://dx.doi.org/10.48550/arxiv.2103.12028 https://arxiv.org/abs/2103.12028
	BASE
