1 |
Towards combined semantic and lexical scores based on a new representation of textual data to extract experimental data from scientific publications
In: ISSN: 1751-5858 ; EISSN: 1751-5866 ; International Journal of Intelligent Information and Database Systems ; https://hal.inrae.fr/hal-03616243 ; International Journal of Intelligent Information and Database Systems, Inderscience, 2022, 15 (1), pp.78. ⟨10.1504/IJIIDS.2022.120146⟩ (2022)
BASE
2 |
Assessing the impact of OCR noise on multilingual event detection over digitised documents
In: ISSN: 1432-5012 ; EISSN: 1432-1300 ; International Journal on Digital Libraries ; https://hal.archives-ouvertes.fr/hal-03635985 ; International Journal on Digital Libraries, Springer Verlag, 2022, ⟨10.1007/s00799-022-00325-2⟩ (2022)
Abstract:
Event detection (ED) is a crucial task for natural language processing (NLP) and it involves the identification of instances of specified types of events in text and their classification into event types. The detection of events from digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the level of degradation of digitised documents and the quality of the optical character recognition (OCR) tools might hinder the performance of an event detection system. While several studies have been performed in detecting events from historical documents, the transcribed documents needed to be hand-validated which implied a great effort of human expertise and manual labor-intensive work. Thus, in this study, we explore the robustness of two different event detection language-independent models to OCR noise, over two datasets that cover different event types and multiple languages. We aim at analysing their ability to mitigate problems caused by the low quality of the digitised documents and we simulate the existence of transcribed data, synthesised from clean annotated text, by injecting synthetic noise. For creating the noisy synthetic data, we chose to utilise four main types of noise that commonly occur after the digitisation process: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
Keyword:
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; Digitised Documents; Event Detection; Information Extraction
URL: https://hal.archives-ouvertes.fr/hal-03635985/file/IJDL2022-Assessing%20the%20Impact%20of%20OCR%20Noise%20on%20Multilingual%20Event%20Detection%20over%20Digitised%20Documents.pdf https://doi.org/10.1007/s00799-022-00325-2 https://hal.archives-ouvertes.fr/hal-03635985/document https://hal.archives-ouvertes.fr/hal-03635985
BASE
3 |
Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents
In: Advances in Information Retrieval. 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II ; https://hal.archives-ouvertes.fr/hal-03635971 ; Matthias Hagen; Suzan Verberne; Craig Macdonald; Christin Seifert; Krisztian Balog; Kjetil Nørvåg; Vinay Setty. Advances in Information Retrieval. 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, 13186, Springer International Publishing, pp.347-354, 2022, Lecture Notes in Computer Science, 978-3-030-99738-0. ⟨10.1007/978-3-030-99739-7_44⟩ (2022)
BASE
4 |
Text Mining from Free Unstructured Text: An Experiment of Time Series Retrieval for Volcano Monitoring
In: Applied Sciences; Volume 12; Issue 7; Pages: 3503 (2022)
BASE
5 |
Sentence Boundary Extraction from Scientific Literature of Electric Double Layer Capacitor Domain: Tools and Techniques
In: Applied Sciences; Volume 12; Issue 3; Pages: 1352 (2022)
BASE
6 |
Analysis of the Full-Size Russian Corpus of Internet Drug Reviews with Complex NER Labeling Using Deep Learning Neural Networks and Language Models
In: Applied Sciences; Volume 12; Issue 1; Pages: 491 (2022)
BASE
7 |
Experiences on the Improvement of Logic-Based Anaphora Resolution in English Texts
In: Electronics; Volume 11; Issue 3; Pages: 372 (2022)
BASE
8 |
Topic models do not model topics: epistemological remarks and steps towards best practices
In: EISSN: 2416-5999 ; Journal of Data Mining and Digital Humanities ; https://hal.archives-ouvertes.fr/hal-03261599 ; Journal of Data Mining and Digital Humanities, Episciences.org, 2021, 2021, ⟨10.46298/jdmdh.7595⟩ (2021)
BASE
9 |
Indirectly Named Entity Recognition ; Reconnaissance d'entités indirectement nommées
In: ISSN: 2530-9455 ; Journal of Computer-Assisted Linguistic Research (JCLR) ; https://hal.archives-ouvertes.fr/hal-03476411 ; Journal of Computer-Assisted Linguistic Research (JCLR), Universitat Politècnica de València, 2021, 5 (1), pp.27-46. ⟨10.4995/JCLR.2021.15922⟩ ; https://polipapers.upv.es/index.php/jclr/index (2021)
BASE
10 |
Atténuer les erreurs de numérisation dans la reconnaissance d'entités nommées pour les documents historiques [Mitigating digitisation errors in named entity recognition for historical documents]
In: Conférence en Recherche d'Informations et Applications (CORIA 2021) ; https://hal.archives-ouvertes.fr/hal-03320332 ; Conférence en Recherche d'Informations et Applications (CORIA 2021), ARIA : Association Francophone de Recherche d’Information (RI) et Applications, Apr 2021, Grenoble (virtuel), France. pp.1 - 7 ; http://coria.asso-aria.org/2021/articles/mini_24/main.pdf (2021)
BASE
11 |
WEIR-P: An Information Extraction Pipeline for the Wastewater Domain
In: RCIS 2021 - 5th International Conference on Research Challenges in Information Science ; https://hal.archives-ouvertes.fr/hal-03211461 ; RCIS 2021 - 5th International Conference on Research Challenges in Information Science, May 2021, Virtual, Cyprus (2021)
BASE
12 |
Mapping the evolution of topics published by Education for Information. Interdisciplinary Journal of Information Studies
In: ISSN: 0167-8329 ; Education for Information ; https://hal.archives-ouvertes.fr/hal-03392553 ; Education for Information, IOS Press, 2021 (2021)
BASE
13 |
LILLIE : information extraction and database integration using linguistics and learning-based algorithms ...
BASE
15 |
Impact Analysis of Document Digitization on Event Extraction ...
BASE
16 |
Impact Analysis of Document Digitization on Event Extraction ...
BASE
18 |
Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training ...
BASE
19 |
HittER: Hierarchical Transformers for Knowledge Graph Embeddings ...
BASE
20 |
AttentionRank: Unsupervised Keyphrase Extraction using Self and Cross Attentions ...
BASE