1 |
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
|
|
McMillan-Major, Angelina; Alyafeai, Zaid; Biderman, Stella; Chen, Kimbo; De Toni, Francesco; Dupont, Gérard; Elsahar, Hady; Emezue, Chris; Aji, Alham Fikri; Ilić, Suzana; Khamis, Nurulaqilla; Leong, Colin; Masoud, Maraim; Soroa, Aitor; Ortiz Suarez, Pedro; Talat, Zeerak; van Strien, Daniel; Jernite, Yacine
|
|
In: https://hal.inria.fr/hal-03550289 ; 2022 (2022)
|
|
Abstract:
In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor. (8 pages plus appendix and references.)
|
|
Keyword:
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; Applications; Collaborative Resource Construction & Crowdsourcing; LR Infrastructures and Architectures; Systems; Tools
|
|
URL: https://hal.inria.fr/hal-03550289
|
|
2 |
END-TO-END SPEECH RECOGNITION FROM FEDERATED ACOUSTIC MODELS
|
|
|
|
In: The International Conference on Acoustics, Speech, & Signal Processing (ICASSP), May 2022, Singapore ; https://hal.archives-ouvertes.fr/hal-03601224 (2022)
|
|
3 |
Space omics research in Europe: contributions, geographical distribution and ESA member state funding schemes
|
|
|
|
4 |
From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French
|
|
|
|
In: Proceedings of the 13th Language Resources and Evaluation Conference, European Language Resources Association, Jun 2022, Marseille, France ; https://hal.inria.fr/hal-03596653 (2022)
|
|
5 |
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
|
|
|
|
In: https://hal.inria.fr/hal-03536361 ; 2022 (2022)
|
|
10 |
Arguing About “COVID” ; Metalinguistic Arguments on What Counts as a “COVID-19 Death”
|
|
|
|
11 |
Fifty Definitions of English Learner: A Proposed Solution to Inconsistent State-by-State Systems in the United States for Classifying Students Who Speak English as a Second Language
|
|
|
|
In: Educational Considerations (2022)
|
|
12 |
Science and Heritage Language Integrated Learning (SHLIL): Evidence for the Effectiveness of an Innovative Science Outreach Program for Migrant Students ...
|
|
|
|
13 |
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus ...
|
|
|
|
14 |
An NLP Solution to Foster the Use of Information in Electronic Health Records for Efficiency in Decision-Making in Hospital Care ...
|
|
|
|
15 |
72 - A Corpus of Neutral Voice Speech in Brazilian Portuguese ...
|
|
|
|
18 |
MAESTRO: Matched Speech Text Representations through Modality Matching ...
|
|
|
|
19 |
Rare Disorders: Diagnosis and Therapeutic Planning for Patients Seeking Orthodontic Treatment
|
|
|
|
In: Journal of Clinical Medicine; Volume 11; Issue 6; Pages: 1527 (2022)
|
|
20 |
The Natural, Artificial, and Social Domains of Intelligence: A Triune Approach
|
|
|
|
In: Proceedings; Volume 81; Issue 1; Pages: 2 (2022)
|
|