1 |
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
|
|
McMillan-Major, Angelina; Alyafeai, Zaid; Biderman, Stella; Chen, Kimbo; De Toni, Francesco; Dupont, Gérard; Elsahar, Hady; Emezue, Chris; Aji, Alham Fikri; Ilić, Suzana; Khamis, Nurulaqilla; Leong, Colin; Masoud, Maraim; Soroa, Aitor; Ortiz Suarez, Pedro; Talat, Zeerak; van Strien, Daniel; Jernite, Yacine
|
|
In: https://hal.inria.fr/hal-03550289 ; 2022 (2022)
|
|
Abstract:
In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor. (8 pages plus appendix and references.)
|
|
Keyword:
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; Applications; Collaborative Resource Construction & Crowdsourcing; LR Infrastructures and Architectures; Systems; Tools
|
|
URL: https://hal.inria.fr/hal-03550289
|
|
BASE
|
|
2 |
Efficiency of Use of Internet Resources in Teaching a Foreign Language at Non-Linguistic Universities ...
|
|
|
|
BASE
|
|
3 |
A rapid review of anti-racism strategies and tools to inform the establishment of an anti-racism strategy across South Australian Government agencies Protocol ...
|
|
|
|
BASE
|
|
4 |
A rapid review of anti-racism strategies and tools to inform the establishment of an anti-racism strategy across South Australian Government agencies Protocol ...
|
|
|
|
BASE
|
|
5 |
The Korsakow platform and nonlinear narratives as a means to enhance foreign language learning in HE
|
|
|
|
BASE
|
|
6 |
Second Language Assessment Issues in Refugee and Migrant Children’s Integration and Education: Assessment Tools and Practices for Young Students with Refugee and Migrant Background in Greece
|
|
|
|
In: Languages; Volume 7; Issue 2; Pages: 82 (2022)
|
|
BASE
|
|
7 |
Language-in-Education Policy of Kazakhstan: Post-Pandemic Technology Enhances Language Learning
|
|
|
|
In: Education Sciences; Volume 12; Issue 5; Pages: 311 (2022)
|
|
BASE
|
|
8 |
Élaboration d’une liste pour l’enseignement du vocabulaire considérant la fréquence d’utilisation à l’oral et la polysémie ...
|
|
|
|
BASE
|
|
9 |
Corpus linguistics and translation tools for digital humanities: An introduction.
|
|
|
|
BASE
|
|
10 |
Corpus Linguistics and Translation Tools for Digital Humanities. Research Methods and Applications
|
|
|
|
BASE
|
|
11 |
Social Media and Intercultural Learning: An approach to EFL for Secondary Students
|
|
|
|
BASE
|
|
12 |
Development of a Remote, Course-Based Undergraduate Experience to Facilitate In Silico Study of Microbial Metabolic Pathways
|
|
|
|
In: J Microbiol Biol Educ (2022)
|
|
BASE
|
|
13 |
Особенности международного стиля: анализ текстов официально-делового стиля : магистерская диссертация ; Specific features of international etiquette: texts analysis of official business style
|
|
|
|
BASE
|
|
14 |
Ser tradutor num mundo globalizado e em constante evolução: experiência de estágio na SMARTIDIOM ; Being a translator in a globalised world in constant evolution: internship experience at SMARTIDIOM
|
|
|
|
BASE
|
|
15 |
Audacity and Praat as Pedagogical Tools : Analysing Fluency and Pronunciation Accuracy
|
|
|
|
BASE
|
|
16 |
El desarrollo de las cuatro destrezas lingüísticas en el aprendizaje del inglés a través de la gamificación y las herramientas digitales
|
|
|
|
BASE
|
|
17 |
Using ‘How To …’ Videos in Feedforward Practices to Support the Development of Academic Writing
|
|
|
|
In: Journal on Empowering Teaching Excellence (2022)
|
|
BASE
|
|
18 |
SonAmi: A Tangible Creativity Support Tool for Productive Procrastination
|
|
|
|
In: C&C ’21 - 13th ACM Conference on Creativity & Cognition ; https://hal.inria.fr/hal-03442565 ; C&C ’21 - 13th ACM Conference on Creativity & Cognition, Jun 2021, Virtual Event, Italy. pp.1-10, ⟨10.1145/3450741.3465250⟩ (2021)
|
|
BASE
|
|
19 |
Promoting Health via mHealth Applications Using a French Version of the Mobile App Rating Scale: Adaptation and Validation Study
|
|
|
|
In: ISSN: 2291-5222 ; JMIR mHealth and uHealth ; https://hal-univ-lyon1.archives-ouvertes.fr/hal-03331985 ; JMIR mHealth and uHealth, JMIR Publications, 2021, 9 (8), pp.e30480. ⟨10.2196/30480⟩ (2021)
|
|
BASE
|
|
20 |
Visualiser des textes en humanités numériques
|
|
|
|
In: Semaine Data SHS : Traiter et analyser des données en sciences humaines et sociales ; https://hal.archives-ouvertes.fr/hal-03479616 ; Semaine Data SHS : Traiter et analyser des données en sciences humaines et sociales, Plateforme universitaire de données de Nanterre - MSH-Mondes; Plateforme universitaire de données des Grands-Moulins, Dec 2021, Nanterre, France ; https://pudndatashs.sciencesconf.org/resource/page/id/17 (2021)
|
|
BASE