63 |
Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures ...
|
|
|
|
Abstract:
Common Crawl is a very large, heterogeneous multilingual corpus composed of documents crawled from the internet. It surpasses 20TB of data and is distributed as a set of more than 50 thousand plain text files, each of which contains many documents written in a wide variety of languages. Even though each document has a metadata block associated with it, this metadata lacks any information about the language in which the document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, ...
|
|
Keyword:
400 Language, Linguistics
|
|
URL: https://ids-pub.bsz-bw.de/9021 https://dx.doi.org/10.14618/ids-pub-9021
|
|
BASE
|
|
65 |
Reference-less Quality Estimation of Text Simplification Systems
|
|
|
|
In: 1st Workshop on Automatic Text Adaptation (ATA) ; https://hal.inria.fr/hal-01959054 ; 1st Workshop on Automatic Text Adaptation (ATA), Nov 2018, Tilburg, Netherlands ; https://www.ida.liu.se/~evere22/ATA-18/ (2018)
|
|
BASE
|
|
66 |
ELMoLex: Connecting ELMo and Lexicon features for Dependency Parsing
|
|
|
|
In: CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies ; https://hal.inria.fr/hal-01959045 ; CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Oct 2018, Brussels, Belgium. ⟨10.18653/v1/K18-2023⟩ (2018)
|
|
BASE
|
|
67 |
A multilingual collection of CoNLL-U-compatible morphological lexicons
|
|
|
|
In: Eleventh International Conference on Language Resources and Evaluation (LREC 2018) ; https://hal.inria.fr/hal-01798798 ; Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 2018, Miyazaki, Japan ; http://lrec2018.lrec-conf.org/en/ (2018)
|
|
BASE
|
|
68 |
Cheating a Parser to Death: Data-driven Cross-Treebank Annotation Transfer
|
|
|
|
In: Eleventh International Conference on Language Resources and Evaluation (LREC 2018) ; https://hal.inria.fr/hal-01798801 ; Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 2018, Miyazaki, Japan (2018)
|
|
BASE
|
|
69 |
CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing
|
|
|
|
In: 11th Language Resources and Evaluation Conference ; https://hal.inria.fr/hal-01786125 ; 11th Language Resources and Evaluation Conference, May 2018, Miyazaki, Japan ; http://lrec2018.lrec-conf.org (2018)
|
|
BASE
|
|
70 |
Inferring inflection classes with description length
|
|
|
|
In: ISSN: 2299-856X ; EISSN: 2299-8470 ; Journal of Language Modelling ; https://hal.inria.fr/hal-01718879 ; Journal of Language Modelling, Institute of Computer Science, Polish Academy of Sciences, Poland, 2018, 5 (3), pp.465-525 (2018)
|
|
BASE
|
|
71 |
A new PIE root *h1er ‘(to be) dark red, dusk red’: drawing the line between inherited and borrowed words for ‘red(ish)’, ‘pea’, ‘ore’, ‘dusk’ and ‘love’ in daughter languages
|
|
|
|
In: International Colloquium on Loanwords and Substrata in Indo-European languages ; https://hal.inria.fr/hal-01798976 ; International Colloquium on Loanwords and Substrata in Indo-European languages, Jun 2018, Limoges, France ; http://www.loanwordsandsubstrata.com (2018)
|
|
BASE
|
|
72 |
New results on a centum substratum in Greek: the Lydian connection
|
|
|
|
In: International Colloquium on Loanwords and Substrata in Indo-European languages ; https://hal.inria.fr/hal-01798979 ; International Colloquium on Loanwords and Substrata in Indo-European languages, Jun 2018, Limoges, France ; http://www.loanwordsandsubstrata.com (2018)
|
|
BASE
|
|
77 |
The ParisNLP entry at the ConLL UD Shared Task 2017: A Tale of a #ParsingTragedy
|
|
|
|
In: Conference on Computational Natural Language Learning ; https://hal.inria.fr/hal-01584168 ; Conference on Computational Natural Language Learning, Aug 2017, Vancouver, Canada. pp.243-252, ⟨10.18653/v1/K17-3026⟩ ; http://universaldependencies.org/conll17/ (2017)
|
|
BASE
|
|
78 |
Construction automatique d'une base de données étymologiques à partir du wiktionary
|
|
|
|
In: Traitement Automatique des Langues Naturelles 2017 ; https://hal.inria.fr/hal-01584013 ; Traitement Automatique des Langues Naturelles 2017, Jun 2017, Orléans, France ; http://taln2017.cnrs.fr (2017)
|
|
BASE
|
|
79 |
Extracting an Etymological Database from Wiktionary
|
|
|
|
In: Electronic Lexicography in the 21st century (eLex 2017) ; https://hal.inria.fr/hal-01592061 ; Electronic Lexicography in the 21st century (eLex 2017), Sep 2017, Leiden, Netherlands. pp.716-728 ; https://elex.link/elex2017/ (2017)
|
|
BASE
|
|
80 |
Représentation de l’information sémantique lexicale : le modèle wordnet et son application au français
|
|
|
|
In: ISSN: 1386-1204 ; EISSN: 1875-368X ; Revue Française de Linguistique Appliquée ; https://hal.inria.fr/hal-01583995 ; Revue Française de Linguistique Appliquée, Paris : Publications linguistiques, 2017, XXII ; https://www.cairn.info/revue-francaise-de-linguistique-appliquee-2017-1-page-131.htm (2017)
|
|
BASE
|
|