
Search in the Catalogues and Directories

Hits 1 – 10 of 10

1
How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures
In: 19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) - What is text, really? TEI and beyond, Sep 2019, Graz, Austria (2019) ; https://hal.archives-ouvertes.fr/hal-02263276
BASE
2
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
In: 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Jul 2019, Cardiff, United Kingdom. ⟨10.14618/IDS-PUB-9021⟩ (2019) ; https://hal.inria.fr/hal-02148693
Abstract: Common Crawl is a considerably large, heterogeneous multilingual corpus composed of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files, each of which contains many documents written in a wide variety of languages. Even though each document has an associated metadata block, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.
Keyword: Computer Science [cs] / Computation and Language [cs.CL]
URL: https://hal.inria.fr/hal-02148693/file/Asynchronous_Pipeline_for_Processing_Huge_Corpora_on_Medium_to_Low_Resource_Infrastructures.pdf
https://doi.org/10.14618/IDS-PUB-9021
https://hal.inria.fr/hal-02148693
https://hal.inria.fr/hal-02148693/document
BASE
3
Nénufar: Modelling a Diachronic Collection of Dictionary Editions as a Computational Lexical Resource
In: ELEX 2019: smart lexicography, Oct 2019, Sintra, Portugal (2019) ; https://hal.inria.fr/hal-02272978
BASE
4
LMF Reloaded
In: AsiaLex 2019: Past, Present and Future, Jun 2019, Istanbul, Turkey (2019) ; https://hal.inria.fr/hal-02118319
BASE
5
TEI Encoding of a Classical Mixtec Dictionary Using GROBID-Dictionaries
In: ELEX 2019: Smart Lexicography, Oct 2019, Sintra, Portugal (2019) ; https://hal.inria.fr/hal-02264033 ; https://elex.link/elex2019/
BASE
6
CamemBERT: a Tasty French Language Model
In: https://hal.inria.fr/hal-02445946 (2019)
BASE
7
TEI and the Mixtepec-Mixtec corpus: data integration, annotation and normalization of heterogeneous data for an under-resourced language
In: 6th International Conference on Language Documentation and Conservation (ICLDC), Feb 2019, Honolulu, United States (2019) ; https://hal.inria.fr/hal-02075475
BASE
8
Preparing the Dictionnaire Universel for Automatic Enrichment
In: 10th International Conference on Historical Lexicography and Lexicology (ICHLL), Jun 2019, Leeuwarden, Netherlands (2019) ; https://hal.inria.fr/hal-02131598 ; https://easychair.org/smart-program/ICHLL-10/
BASE
9
Connecting the Humanities through Research Infrastructures
In: 4th Digital Humanities in the Nordic Countries (DHN 2019), Mar 2019, Copenhagen, Denmark (2019) ; https://hal.inria.fr/hal-02047512 ; https://cst.dk/DHN2019/DHN2019.html
BASE
10
The place of lexicography in (computer) science
In: The Future of Academic Lexicography: Linguistic Knowledge Codification in the Era of Big Data and AI, Frieda Steurs; Dirk Geeraerts; Niels Schiller; Marian Klamer; Iztok Kosem, Nov 2019, Leiden, Netherlands (2019) ; https://hal.inria.fr/hal-02358218 ; https://www.lorentzcenter.nl/lc/web/2019/1177/program.php3?wsid=1177&venue=Oort
BASE

© 2013 - 2024 Lin|gu|is|tik