
Search in the Catalogues and Directories

Hits 1 – 5 of 5

1
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
In: https://hal.inria.fr/hal-03550289 (2022)
Abstract: 8 pages plus appendix and references ; In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.
Keyword: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; Applications; Collaborative Resource Construction & Crowdsourcing; LR Infrastructures and Architectures; Systems; Tools
URL: https://hal.inria.fr/hal-03550289
BASE
2
Masader: Metadata Sourcing for Arabic Text and Speech Data Resources ...
BASE
3
Aspects of Terminological and Named Entity Knowledge within Rule-Based Machine Translation Models for Under-Resourced Neural Machine Translation Scenarios ...
BASE
4
Back-translation approach for code-switching machine translation: A case study
BASE
5
Leveraging rule-based machine translation knowledge for under-resourced neural machine translation models
BASE
