1 |
Verarbeitung und mentale Repräsentation von Idiomen im Erwachsenen- und Kindesalter ...
|
|
|
|
BASE
|
|
|
|
2 |
Linguistic resources for paraphrase generation in Portuguese: a Lexicon-Grammar approach
|
|
|
|
In: ISSN: 1574-020X ; EISSN: 1574-0218 ; Language Resources and Evaluation ; https://hal.archives-ouvertes.fr/hal-03548861 ; Language Resources and Evaluation, Springer Verlag, 2022, ⟨10.1007/s10579-021-09561-5⟩ ; https://link.springer.com/article/10.1007/s10579-021-09561-5 (2022)
|
|
BASE
|
|
|
|
3 |
DeepL et Google Translate face à l'ambiguïté phraséologique
|
|
|
|
In: https://hal.archives-ouvertes.fr/hal-03583995 ; 2022 (2022)
|
|
BASE
|
|
|
|
4 |
Preprint Citation Praxis in PLOS
|
|
|
|
In: ISSN: 0138-9130 ; EISSN: 1588-2861 ; Scientometrics ; https://hal.archives-ouvertes.fr/hal-03506094 ; In press (2022)
|
|
BASE
|
|
|
|
5 |
An Overview of Indian Spoken Language Recognition from Machine Learning Perspective
|
|
|
|
In: ISSN: 2375-4699 ; EISSN: 2375-4702 ; ACM Transactions on Asian and Low-Resource Language Information Processing ; https://hal.inria.fr/hal-03616853 ; ACM Transactions on Asian and Low-Resource Language Information Processing, ACM, In press, ⟨10.1145/3523179⟩ (2022)
|
|
BASE
|
|
|
|
6 |
Morphology in the Corsican Language Database (BDLC) : assessment and perspectives ; La morphologie dans la Banque de Données Langue Corse : bilan et perspectives
|
|
|
|
In: ISSN: 1638-9808 ; EISSN: 1765-3126 ; Corpus ; https://hal.archives-ouvertes.fr/hal-03591866 ; Corpus, Bases, Corpus, Langage - UMR 7320, 2022, Corpus et données en morphologie, ⟨10.4000/corpus.7115⟩ ; https://journals.openedition.org/corpus/7115 (2022)
|
|
BASE
|
|
|
|
7 |
Islands and Bridges of Language: Bio-Inspired Structural Analysis of Language Embedding Data
|
|
|
|
BASE
|
|
|
|
8 |
VEREINDEUTIGUNG ZUR KLASSIFIZIERUNG LEXIKALISCHER OBJEKTE ; DISAMBIGUATION FOR THE CLASSIFICATION OF LEXICAL ITEMS ; DÉSAMBIGUÏSATION POUR LA CLASSIFICATION DE LEXÈMES
|
|
|
|
In: https://hal.archives-ouvertes.fr/hal-03598242 ; France, Patent n° : EP3937059A1. 2022 (2022)
|
|
BASE
|
|
|
|
9 |
Computational models of disfluencies : fillers and discourse markers in spoken language understanding ; Modèles computationnels des disfluences dans le traitement de la parole
|
|
|
|
In: https://tel.archives-ouvertes.fr/tel-03653211 ; Computer science. Institut Polytechnique de Paris, 2022. English. ⟨NNT : 2022IPPAT001⟩ (2022)
|
|
BASE
|
|
|
|
10 |
Assessing the impact of OCR noise on multilingual event detection over digitised documents
|
|
|
|
In: ISSN: 1432-5012 ; EISSN: 1432-1300 ; International Journal on Digital Libraries ; https://hal.archives-ouvertes.fr/hal-03635985 ; International Journal on Digital Libraries, Springer Verlag, 2022, ⟨10.1007/s00799-022-00325-2⟩ (2022)
|
|
Abstract:
Event detection (ED) is a crucial task in natural language processing (NLP): it involves identifying instances of specified types of events in text and classifying them into event types. The detection of events from digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the level of degradation of digitised documents and the quality of the optical character recognition (OCR) tools might hinder the performance of an event detection system. While several studies have addressed detecting events in historical documents, the transcribed documents needed to be hand-validated, which required considerable human expertise and labor-intensive manual work. Thus, in this study, we explore the robustness of two language-independent event detection models to OCR noise, over two datasets that cover different event types and multiple languages. We aim to analyse their ability to mitigate problems caused by the low quality of the digitised documents, and we simulate the existence of transcribed data, synthesised from clean annotated text, by injecting synthetic noise. To create the noisy synthetic data, we use four main types of noise that commonly occur after the digitisation process: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
|
|
Keyword:
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; Digitised Documents; Event Detection; Information Extraction
|
|
URL: https://hal.archives-ouvertes.fr/hal-03635985/file/IJDL2022-Assessing%20the%20Impact%20of%20OCR%20Noise%20on%20Multilingual%20Event%20Detection%20over%20Digitised%20Documents.pdf https://doi.org/10.1007/s00799-022-00325-2 https://hal.archives-ouvertes.fr/hal-03635985/document https://hal.archives-ouvertes.fr/hal-03635985
|
|
BASE
|
|
|
|
11 |
An Ontology based Smart Management of Linguistic Knowledge
|
|
|
|
In: EISSN: 2416-5999 ; Journal of Data Mining and Digital Humanities ; https://hal.archives-ouvertes.fr/hal-03618012 ; Journal of Data Mining and Digital Humanities, Episciences.org, In press (2022)
|
|
BASE
|
|
|
|
12 |
Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents
|
|
|
|
In: Advances in Information Retrieval. 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II ; https://hal.archives-ouvertes.fr/hal-03635971 ; Matthias Hagen; Suzan Verberne; Craig Macdonald; Christin Seifert; Krisztian Balog; Kjetil Nørvåg; Vinay Setty. Advances in Information Retrieval. 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, 13186, Springer International Publishing, pp.347-354, 2022, Lecture Notes in Computer Science, 978-3-030-99738-0. ⟨10.1007/978-3-030-99739-7_44⟩ (2022)
|
|
BASE
|
|
|
|
13 |
Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?
|
|
|
|
In: Seventh Workshop on Noisy User-generated Text (W-NUT 2021, colocated with EMNLP 2021) ; https://hal.inria.fr/hal-03527328 ; Seventh Workshop on Noisy User-generated Text (W-NUT 2021, colocated with EMNLP 2021), Jan 2022, Punta Cana, Dominican Republic ; https://aclanthology.org/2021.wnut-1.47/ (2022)
|
|
BASE
|
|
|
|
14 |
European Language Equality - Report on the French Language
|
|
|
|
In: https://hal.archives-ouvertes.fr/hal-03637776 ; [Research Report] CNRS - LISN. 2022 (2022)
|
|
BASE
|
|
|
|
15 |
PROTECT: A Pipeline for Propaganda Detection and Classification
|
|
|
|
In: CLiC-it 2021- Italian Conference on Computational Linguistics ; https://hal.archives-ouvertes.fr/hal-03417019 ; CLiC-it 2021- Italian Conference on Computational Linguistics, Jan 2022, Milan, Italy (2022)
|
|
BASE
|
|
|
|
16 |
Le modèle Transformer: un « couteau suisse » pour le traitement automatique des langues
|
|
|
|
In: Techniques de l'Ingenieur ; https://hal.archives-ouvertes.fr/hal-03619077 ; Techniques de l'Ingenieur, Techniques de l'ingénieur, 2022, ⟨10.51257/a-v1-in195⟩ ; https://www.techniques-ingenieur.fr/base-documentaire/innovation-th10/innovations-en-electronique-et-tic-42257210/transformer-des-reseaux-de-neurones-pour-le-traitement-automatique-des-langues-in195/ (2022)
|
|
BASE
|
|
|
|
17 |
Language identification, a tool for Corsican and for the evaluation of linguistic resources ; L'identification de langue, un outil au service du corse et de l'évaluation des ressources linguistiques
|
|
|
|
In: Traitement Automatique des Langues ; https://hal.archives-ouvertes.fr/hal-03633290 ; Traitement Automatique des Langues, 2022, Diversité Linguistique, 62 (3), pp.13-37 ; https://www.atala.org/content/diversité-linguistique-linguistic-diversity-natural-language-processing (2022)
|
|
BASE
|
|
|
|
18 |
Between History and Natural Language Processing: Study, Enrichment and Online Publication of French Parliamentary Debates of the Early Third Republic (1881-1899)
|
|
|
|
In: ParlaCLARIN III at LREC2022 - Workshop on Creating, Enriching and Using Parliamentary Corpora ; https://hal.archives-ouvertes.fr/hal-03623351 ; ParlaCLARIN III at LREC2022 - Workshop on Creating, Enriching and Using Parliamentary Corpora, Jun 2022, Marseille, France ; https://www.clarin.eu/ParlaCLARIN-III (2022)
|
|
BASE
|
|
|
|
19 |
Factives at hand: When presupposition mode affects motor response
|
|
|
|
In: ISSN: 0022-1015 ; Journal of Experimental Psychology ; https://hal.archives-ouvertes.fr/hal-03538732 ; Journal of Experimental Psychology, American Psychological Association, In press, ⟨10.1037/xge0001167⟩ (2022)
|
|
BASE
|
|
|
|
20 |
Annoter et prédire des représentations linguistiques de phrases
|
|
|
|
In: https://hal.archives-ouvertes.fr/tel-03544267 ; Informatique et langage [cs.CL]. Université de Paris, 2022 (2022)
|
|
BASE
|
|
|
|
|
|