1 |
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
In: https://hal.inria.fr/hal-03540069 ; 2022 (2022)
BASE
2 |
A fine-grained recognition of Named Entities in ELTeC collection using cascades
In: Final Action Event of COST Action Distant Reading for European Literary History ; https://hal.archives-ouvertes.fr/hal-03615219 ; Final Action Event of COST Action Distant Reading for European Literary History, Christof Schöch, Apr 2022, Krakow, Poland ; https://www.distant-reading.net/events/conference-programme/ (2022)
3 |
RETRIEVING SPEAKER INFORMATION FROM PERSONALIZED ACOUSTIC MODELS FOR SPEECH RECOGNITION
In: IEEE ICASSP 2022 ; https://hal.archives-ouvertes.fr/hal-03539741 ; IEEE ICASSP 2022, 2022, Singapore, Singapore (2022)
4 |
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
In: https://hal.inria.fr/hal-03550289 ; 2022 (2022)
5 |
Source or target first? Comparison of two post-editing strategies with translation students
In: https://hal.archives-ouvertes.fr/hal-03546151 ; 2022 (2022)
6 |
Automatic Normalisation of Early Modern French
In: https://hal.inria.fr/hal-03540226 ; 2022 (2022)
7 |
Offline Corpus Augmentation for English-Amharic Machine Translation
In: 2022 The 5th International Conference on Information and Computer Technologies ; https://hal.archives-ouvertes.fr/hal-03547539 ; 2022 The 5th International Conference on Information and Computer Technologies, Mar 2022, New York, United States (2022)
8 |
New Version of a Translater for a Natural Language Study
In: https://hal.archives-ouvertes.fr/hal-03551680 ; 2022 (2022)
9 |
A Translater from Latex Trees to Coq Trees for a Natural Language Study
In: https://hal.archives-ouvertes.fr/hal-03536652 ; 2022 (2022)
10 |
From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French
In: Proceedings of the 13th Language Resources and Evaluation Conference ; https://hal.inria.fr/hal-03596653 ; Proceedings of the 13th Language Resources and Evaluation Conference, European Language Resources Association, Jun 2022, Marseille, France (2022)
11 |
Integrating a Phrase Structure Corpus Grammar and a Lexical-Semantic Network: the HOLINET Knowledge Graph
In: Proceedings of LREC 2022 ; https://hal-amu.archives-ouvertes.fr/hal-03655636 ; Proceedings of LREC 2022, Jun 2022, Marseille, France (2022)
12 |
Linguistic resources for paraphrase generation in Portuguese: a Lexicon-Grammar approach
In: ISSN: 1574-020X ; EISSN: 1574-0218 ; Language Resources and Evaluation ; https://hal.archives-ouvertes.fr/hal-03548861 ; Language Resources and Evaluation, Springer Verlag, 2022, ⟨10.1007/s10579-021-09561-5⟩ ; https://link.springer.com/article/10.1007/s10579-021-09561-5 (2022)
13 |
Caveats of Measuring Semantic Change of Cognates and Borrowings using Multilingual Word Embeddings
In: LChange'22 - 3rd International Workshop on Computational Approaches to Historical Language Change 2022 ; https://hal.inria.fr/hal-03635005 ; LChange'22 - 3rd International Workshop on Computational Approaches to Historical Language Change 2022, May 2022, Dublin, Ireland (2022)
14 |
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
In: https://hal.inria.fr/hal-03536361 ; 2022 (2022)
15 |
Preprint Citation Praxis in PLOS
In: ISSN: 0138-9130 ; EISSN: 1588-2861 ; Scientometrics ; https://hal.archives-ouvertes.fr/hal-03506094 ; In press (2022)
16 |
Morphology in the Corsican Language Database (BDLC) : assessment and perspectives ; La morphologie dans la Banque de Données Langue Corse : bilan et perspectives
In: ISSN: 1638-9808 ; EISSN: 1765-3126 ; Corpus ; https://hal.archives-ouvertes.fr/hal-03591866 ; Corpus, Bases, Corpus, Langage - UMR 7320, 2022, Corpus et données en morphologie, ⟨10.4000/corpus.7115⟩ ; https://journals.openedition.org/corpus/7115 (2022)
17 |
Starting a new treebank? Go SUD! Theoretical and practical benefits of the Surface-Syntactic distributional approach
In: Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021) ; https://hal.inria.fr/hal-03509136 ; Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021), Mar 2022, Sofia, Bulgaria (2022)
18 |
Assessing the impact of OCR noise on multilingual event detection over digitised documents
In: ISSN: 1432-5012 ; EISSN: 1432-1300 ; International Journal on Digital Libraries ; https://hal.archives-ouvertes.fr/hal-03635985 ; International Journal on Digital Libraries, Springer Verlag, 2022, ⟨10.1007/s00799-022-00325-2⟩ (2022)
19 |
Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0
In: Proceedings of the International Workshop on Challenges & Perspectives in Creating Large Language Models 2022 (BigScience 2022) ; https://hal.inria.fr/hal-03639144 ; Proceedings of the International Workshop on Challenges & Perspectives in Creating Large Language Models 2022 (BigScience 2022), May 2022, Dublin, Ireland (2022)
20 |
Évaluation des propriétés multilingues d'un embedding contextualisé
In: EGC 2022 - Conférence francophone sur l'Extraction et la Gestion des Connaissances ; https://hal.archives-ouvertes.fr/hal-03578480 ; EGC 2022 - Conférence francophone sur l'Extraction et la Gestion des Connaissances, Jan 2022, Blois, France (2022)
Abstract:
Deep learning models like BERT, a stack of attention layers pretrained without supervision on large corpora, have become the norm in NLP. mBERT, a multilingual version of BERT pretrained on monolingual corpora in 104 languages, can learn a task in one language and generalize it to another. This ability opens the prospect of efficient models for languages with little annotated data, but it remains largely unexplained. We propose a new method based on in-context translated words, rather than translated sentences, to analyze more finely the similarity between contextualized representations across languages. We show that the representations mBERT learns for different languages are closer in deep layers, and that they outperform representations specifically trained to be aligned.
Keyword:
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
URL: https://hal.archives-ouvertes.fr/hal-03578480 https://hal.archives-ouvertes.fr/hal-03578480/file/submission_33.pdf https://hal.archives-ouvertes.fr/hal-03578480/document
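Illustration: the abstract above compares contextualized representations of translated word pairs layer by layer. The record does not state which similarity measure or score the authors use, so the sketch below assumes cosine similarity and nearest-neighbour retrieval accuracy. The function names and the two-dimensional toy vectors are hypothetical, hand-made for illustration, and are not mBERT outputs or results from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source words whose most similar target vector is their
    own translation. src_vecs[i] and tgt_vecs[i] form a translation pair."""
    hits = 0
    for i, u in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(u, tgt_vecs[j]))
        hits += best == i
    return hits / len(src_vecs)


# Hypothetical toy data: (source vectors, target vectors) for two layers.
# In a real study these would be contextual embeddings of word pairs.
layers = {
    "shallow": ([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.2]]),
    "deep": ([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 1.0]]),
}

# One score per layer: higher means translations sit closer together.
scores = {name: retrieval_accuracy(src, tgt) for name, (src, tgt) in layers.items()}
print(scores)
```

With real data, each layer's vectors would come from the model's hidden states for a word in context and for its translation in context; comparing the per-layer scores is one way to see at which depth the languages align best.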