Hits 1 – 4 of 4

1	A dataset for automatic detection of places in (early) modern French texts ; Un jeu de données pour la détection automatique de lieux dans les textes français modernes
	Gabay, Simon; Ortiz Suárez, Pedro Javier
	In: NASSCFL 2021 - 50th Annual North American Society for Seventeenth-Century French Literature Conference, NASSCFL, May 2021, Iowa City / Virtual, United States. pp.5 ; https://hal.archives-ouvertes.fr/hal-03187097 (2021)

2	Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus
	Abadji, Julien; Ortiz Suárez, Pedro Javier; Romary, Laurent...
	In: CMLC 2021 - 9th Workshop on Challenges in the Management of Large Corpora, Jul 2021, Limerick / Virtual, Ireland. ⟨10.14618/ids-pub-10468⟩ ; https://hal.inria.fr/hal-03301590 ; https://www.cl2021.org/ (2021)

3	Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
	Caswell, Isaac; Kreutzer, Julia; Wang, Lisa; Wahab, Ahsan; van Esch, Daan; Ulzii-Orshikh, Nasanbayar; Tapo, Allahsera; Subramani, Nishant; Sokolov, Artem; Sikasote, Claytone; Setyawan, Monang; Sarin, Supheakmungkol; Samb, Sokhar; Sagot, Benoît; Rivera, Clara; Rios, Annette; Papadimitriou, Isabel; Osei, Salomey; Ortiz Suárez, Pedro Javier; Orife, Iroro; Ogueji, Kelechi; Niyongabo, Rubungo Andre; Nguyen, Toan; Müller, Mathias; Müller, André; Muhammad, Shamsuddeen Hassan; Muhammad, Nanda; Mnyakeni, Ayanda; Mirzakhalov, Jamshidbek; Matangira, Tapiwanashe; Leong, Colin; Lawson, Nze; Kudugunta, Sneha; Jernite, Yacine; Jenny, Mathias; Firat, Orhan; Dossou, Bonaventure; Dlamini, Sakhile; de Silva, Nisansa; Ballı, Sakine Çabuk; Biderman, Stella; Battisti, Alessia; Baruwa, Ahmed; Bapna, Ankur; Baljekar, Pallavi; Azime, Israel Abebe; Awokoya, Ayodele; Ataman, Duygu; Ahia, Orevaoghene; Ahia, Oghenefego; Agrawal, Sweta; Adeyemi, Mofetoluwa
	In: https://hal.inria.fr/hal-03177623 (2021)
	Abstract: To appear in the proceedings of the AfricaNLP 2021 workshop. ; With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
	Keyword: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
	URL: https://hal.inria.fr/hal-03177623

4	Expanding the content model of annotationBlock
	Bartz, Alexandre; Janes, Juliette; Romary, Laurent...
	In: Next Gen TEI, 2021 - TEI Conference and Members’ Meeting, Oct 2021, Virtual, United States ; https://hal.archives-ouvertes.fr/hal-03380805 (2021)