1 |
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
|
|
|
|
In: https://hal.inria.fr/hal-03540069 ; 2022 (2022)
|
|
BASE
|
|
Show details
|
|
2 |
Automatic Normalisation of Early Modern French
|
|
|
|
In: https://hal.inria.fr/hal-03540226 ; 2022 (2022)
|
|
BASE
|
|
Show details
|
|
3 |
From FreEM to D'AlemBERT ; From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French
|
|
|
|
In: Proceedings of the 13th Language Resources and Evaluation Conference ; https://hal.inria.fr/hal-03596653 ; Proceedings of the 13th Language Resources and Evaluation Conference, European Language Resources Association, Jun 2022, Marseille, France (2022)
|
|
BASE
|
|
Show details
|
|
4 |
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
|
|
|
|
In: https://hal.inria.fr/hal-03536361 ; 2022 (2022)
|
|
BASE
|
|
Show details
|
|
5 |
Probing Multilingual Cognate Prediction Models
|
|
|
|
In: Findings of the Association for Computational Linguistics: ACL 2022 ; https://hal.inria.fr/hal-03614691 ; Findings of the Association for Computational Linguistics: ACL 2022, May 2022, Dublin, Ireland (2022)
|
|
BASE
|
|
Show details
|
|
6 |
Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?
|
|
|
|
In: Seventh Workshop on Noisy User-generated Text (W-NUT 2021, colocated with EMNLP 2021) ; https://hal.inria.fr/hal-03527328 ; Seventh Workshop on Noisy User-generated Text (W-NUT 2021, colocated with EMNLP 2021), Jan 2022, punta cana, Dominican Republic ; https://aclanthology.org/2021.wnut-1.47/ (2022)
|
|
Abstract:
International audience ; Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fined-tuned on a small treebank of this language leads to performance close to those obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results a on much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability set-tings.
|
|
Keyword:
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; [INFO.INFO-SI]Computer Science [cs]/Social and Information Networks [cs.SI]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing
|
|
URL: https://hal.inria.fr/hal-03527328
|
|
BASE
|
|
Hide details
|
|
7 |
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus ...
|
|
|
|
BASE
|
|
Show details
|
|
8 |
Rethinking Automatic Evaluation in Sentence Simplification
|
|
|
|
In: https://hal.inria.fr/hal-03199901 ; 2021 (2021)
|
|
BASE
|
|
Show details
|
|
9 |
Multilingual Unsupervised Sentence Simplification
|
|
|
|
In: https://hal.inria.fr/hal-03109299 ; 2021 (2021)
|
|
BASE
|
|
Show details
|
|
10 |
Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus
|
|
|
|
In: CMLC 2021 - 9th Workshop on Challenges in the Management of Large Corpora ; https://hal.inria.fr/hal-03301590 ; CMLC 2021 - 9th Workshop on Challenges in the Management of Large Corpora, Jul 2021, Limerick / Virtual, Ireland. ⟨10.14618/ids-pub-10468⟩ ; https://www.cl2021.org/ (2021)
|
|
BASE
|
|
Show details
|
|
11 |
First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
|
|
|
|
In: https://hal.inria.fr/hal-03161685 ; 2021 (2021)
|
|
BASE
|
|
Show details
|
|
12 |
Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi
|
|
|
|
In: https://hal.inria.fr/hal-03161677 ; 2021 (2021)
|
|
BASE
|
|
Show details
|
|
13 |
First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
|
|
|
|
In: EACL 2021 - The 16th Conference of the European Chapter of the Association for Computational Linguistics ; https://hal.inria.fr/hal-03239087 ; EACL 2021 - The 16th Conference of the European Chapter of the Association for Computational Linguistics, Apr 2021, Kyiv / Virtual, Ukraine ; https://2021.eacl.org/ (2021)
|
|
BASE
|
|
Show details
|
|
14 |
When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models
|
|
|
|
In: NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies ; https://hal.inria.fr/hal-03251105 ; NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jun 2021, Mexico City, Mexico (2021)
|
|
BASE
|
|
Show details
|
|
15 |
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
|
|
|
|
In: https://hal.inria.fr/hal-03177623 ; 2021 (2021)
|
|
BASE
|
|
Show details
|
|
16 |
Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task?
|
|
|
|
In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 ; https://hal.inria.fr/hal-03243380 ; Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Aug 2021, Bangkok, Thailand (2021)
|
|
BASE
|
|
Show details
|
|
17 |
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering
|
|
|
|
In: https://hal.inria.fr/hal-03109187 ; 2021 (2021)
|
|
BASE
|
|
Show details
|
|
18 |
Variation graphique dans les documents d'Ancien Régime : Nouvelles approches scriptométriques
|
|
|
|
In: Journée d’étude : « Pour une histoire de la langue ‘par en bas’: textes privés et variation des langues dans le passé » ; https://hal.inria.fr/hal-03357080 ; Journée d’étude : « Pour une histoire de la langue ‘par en bas’: textes privés et variation des langues dans le passé », Sep 2021, Paris, France (2021)
|
|
BASE
|
|
Show details
|
|
19 |
Expanding the content model of annotationBlock
|
|
|
|
In: Next Gen TEI, 2021 - TEI Conference and Members’ Meeting ; https://hal.archives-ouvertes.fr/hal-03380805 ; Next Gen TEI, 2021 - TEI Conference and Members’ Meeting, Oct 2021, Virtual, United States (2021)
|
|
BASE
|
|
Show details
|
|
|
|