1 |
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
|
|
|
|
In: https://hal.inria.fr/hal-03550289 (2022)
|
|
BASE
|
|
2 |
END-TO-END SPEECH RECOGNITION FROM FEDERATED ACOUSTIC MODELS
|
|
|
|
In: The International Conference on Acoustics, Speech, & Signal Processing (ICASSP), May 2022, Singapore ; https://hal.archives-ouvertes.fr/hal-03601224 (2022)
|
|
BASE
|
|
3 |
Space omics research in Europe: contributions, geographical distribution and ESA member state funding schemes
|
|
|
|
BASE
|
|
4 |
From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French
|
|
|
|
In: Proceedings of the 13th Language Resources and Evaluation Conference, European Language Resources Association, Jun 2022, Marseille, France ; https://hal.inria.fr/hal-03596653 (2022)
|
|
BASE
|
|
5 |
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
|
|
|
|
In: https://hal.inria.fr/hal-03536361 (2022)
|
|
Abstract:
12 pages, 6 figures, 2 tables ; The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. While there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still automatic web crawling. In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR. This version could prove more suitable for pre-training large generative language models, and potentially for other applications in Natural Language Processing and Digital Humanities.
|
|
Keyword:
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; Common Crawl; Language Modeling; Web corpus
|
|
URL: https://hal.inria.fr/hal-03536361
|
|
BASE
|
|
10 |
Arguing About “COVID”: Metalinguistic Arguments on What Counts as a “COVID-19 Death”
|
|
|
|
BASE
|
|
11 |
Fifty Definitions of English Learner: A Proposed Solution to Inconsistent State-by-State Systems in the United States for Classifying Students Who Speak English as a Second Language
|
|
|
|
In: Educational Considerations (2022)
|
|
BASE
|
|
12 |
Science and Heritage Language Integrated Learning (SHLIL): Evidence for the Effectiveness of an Innovative Science Outreach Program for Migrant Students ...
|
|
|
|
BASE
|
|
13 |
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus ...
|
|
|
|
BASE
|
|
14 |
An NLP Solution to Foster the Use of Information in Electronic Health Records for Efficiency in Decision-Making in Hospital Care ...
|
|
|
|
BASE
|
|
15 |
72 - A Corpus of Neutral Voice Speech in Brazilian Portuguese ...
|
|
|
|
BASE
|
|
18 |
MAESTRO: Matched Speech Text Representations through Modality Matching ...
|
|
|
|
BASE
|
|
19 |
Rare Disorders: Diagnosis and Therapeutic Planning for Patients Seeking Orthodontic Treatment
|
|
|
|
In: Journal of Clinical Medicine; Volume 11; Issue 6; Pages: 1527 (2022)
|
|
BASE
|
|
20 |
The Natural, Artificial, and Social Domains of Intelligence: A Triune Approach
|
|
|
|
In: Proceedings; Volume 81; Issue 1; Pages: 2 (2022)
|
|
BASE
|
|
|
|