1 |
Assessing the impact of OCR noise on multilingual event detection over digitised documents
|
|
|
|
In: ISSN: 1432-5012 ; EISSN: 1432-1300 ; International Journal on Digital Libraries ; https://hal.archives-ouvertes.fr/hal-03635985 ; International Journal on Digital Libraries, Springer Verlag, 2022, ⟨10.1007/s00799-022-00325-2⟩ (2022)
|
|
Abstract:
Event detection (ED) is a crucial task in natural language processing (NLP): it involves identifying instances of specified types of events in text and classifying them into event types. Detecting events in digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the degradation of digitised documents and the quality of optical character recognition (OCR) tools may hinder the performance of an event detection system. Although several studies have detected events in historical documents, the transcribed documents had to be validated by hand, which required considerable human expertise and labour-intensive manual work. In this study, we therefore explore the robustness of two language-independent event detection models to OCR noise, over two datasets that cover different event types and multiple languages. We analyse their ability to mitigate problems caused by the low quality of digitised documents, and we simulate transcribed data by injecting synthetic noise into clean annotated text. To create the noisy synthetic data, we use four main types of noise that commonly occur after digitisation: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
|
|
Keyword:
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; Digitised Documents; Event Detection; Information Extraction
|
|
URL: https://hal.archives-ouvertes.fr/hal-03635985/file/IJDL2022-Assessing%20the%20Impact%20of%20OCR%20Noise%20on%20Multilingual%20Event%20Detection%20over%20Digitised%20Documents.pdf https://doi.org/10.1007/s00799-022-00325-2 https://hal.archives-ouvertes.fr/hal-03635985/document https://hal.archives-ouvertes.fr/hal-03635985
|
|
BASE
|
|
|
|
2 |
Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents
|
|
|
|
In: Advances in Information Retrieval. 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II ; https://hal.archives-ouvertes.fr/hal-03635971 ; Matthias Hagen; Suzan Verberne; Craig Macdonald; Christin Seifert; Krisztian Balog; Kjetil Nørvåg; Vinay Setty. Advances in Information Retrieval. 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, 13186, Springer International Publishing, pp.347-354, 2022, Lecture Notes in Computer Science, 978-3-030-99738-0. ⟨10.1007/978-3-030-99739-7_44⟩ (2022)
|
|
BASE
|
|
|
|
3 |
État de l'art du changement sémantique à partir de plongements contextualisés [State of the art in semantic change from contextualised embeddings]
|
|
|
|
In: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference ; https://hal.archives-ouvertes.fr/hal-03320337 ; COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, Apr 2021, Grenoble (virtual), France (2021)
|
|
BASE
|
|
|
|
4 |
Atténuer les erreurs de numérisation dans la reconnaissance d'entités nommées pour les documents historiques [Mitigating digitisation errors in named entity recognition for historical documents]
|
|
|
|
In: Conférence en Recherche d'Informations et Applications (CORIA 2021) ; https://hal.archives-ouvertes.fr/hal-03320332 ; Conférence en Recherche d'Informations et Applications (CORIA 2021), ARIA : Association Francophone de Recherche d’Information (RI) et Applications, Apr 2021, Grenoble (virtual), France. pp.1 - 7 ; http://coria.asso-aria.org/2021/articles/mini_24/main.pdf (2021)
|
|
BASE
|
|
|
|
5 |
Multilingual Epidemic Event Extraction
|
|
|
|
In: Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings ; https://hal.archives-ouvertes.fr/hal-03480551 ; Hao-Ren Ke; Chei Sian Lee; Kazunari Sugiyama. Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings, 13133, Springer, pp.139-156, 2021, Lecture Notes in Computer Science, 978-3-030-91668-8. ⟨10.1007/978-3-030-91669-5_12⟩ (2021)
|
|
BASE
|
|
|
|
6 |
Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie [Comparative study of multilingual classification methods applied to epidemiology]
|
|
|
|
In: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference ; https://hal.archives-ouvertes.fr/hal-03320343 ; COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, Apr 2021, Grenoble (virtual), France (2021)
|
|
BASE
|
|
|
|
7 |
A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers
|
|
|
|
In: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval ; https://hal.archives-ouvertes.fr/hal-03418387 ; SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Jul 2021, Virtual Event, Canada. pp.2328-2334, ⟨10.1145/3404835.3463255⟩ (2021)
|
|
BASE
|
|
|
|
8 |
Identification et gestion des données personnelles dans les textes: modèle sémantique et applications [Identification and management of personal data in texts: semantic model and applications]
|
|
|
|
In: CiDE.22 : 22ème édition du Colloque International sur le Document Electronique Données Documents Connaissances : Perspectives de recherche et d’enseignement ; https://hal.archives-ouvertes.fr/hal-03506075 ; CiDE.22 : 22ème édition du Colloque International sur le Document Electronique Données Documents Connaissances : Perspectives de recherche et d’enseignement, Dec 2021, Paris, France (2021)
|
|
BASE
|
|
|
|
9 |
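Data Papers et dissémination des données de la recherche : quelles pratiques en SHS ? [Data papers and the dissemination of research data: what practices in the humanities and social sciences?]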
|
|
|
|
In: Colloque DHNord2021 : Publier, partager, réutiliser les données de la recherche : les data papers et leurs enjeux ; https://hal.archives-ouvertes.fr/hal-03506077 ; Colloque DHNord2021 : Publier, partager, réutiliser les données de la recherche : les data papers et leurs enjeux, Nov 2021, virtual, France (2021)
|
|
BASE
|
|
|
|
10 |
Building An Automated Gesture Imitation Game For Teenagers with ASD
|
|
|
|
In: ISSN: 0973-7006 ; Far East Journal of Electronics and Communications ; https://hal-imt-atlantique.archives-ouvertes.fr/hal-02894314 ; Far East Journal of Electronics and Communications, 2020, 23 (1), pp.1 - 10. ⟨10.17654/EC023010001⟩ (2020)
|
|
BASE
|
|
|
|
11 |
Dataset for Temporal Analysis of English-French Cognates
|
|
|
|
In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) ; 12th Conference on Language Resources and Evaluation (LREC 2020) ; https://hal.archives-ouvertes.fr/hal-03026957 ; 12th Conference on Language Resources and Evaluation (LREC 2020), May 2020, Marseille, France. pp.855-859, ⟨10.5281/zenodo.3693650⟩ (2020)
|
|
BASE
|
|
|
|
12 |
A Dataset for Multi-lingual Epidemiological Event Extraction
|
|
|
|
In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) ; https://hal.archives-ouvertes.fr/hal-02732848 ; Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), May 2020, Marseille, France. pp.4139-4144 (2020)
|
|
BASE
|
|
|
|
13 |
Impact Analysis of Document Digitization on Event Extraction
|
|
|
|
In: CEUR Workshop Proceedings ; 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020) co-located with the 19th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2020) ; https://hal.archives-ouvertes.fr/hal-03026148 ; 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020) co-located with the 19th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2020), Nov 2020, Virtual, Italy. pp.17-28 ; http://sag.art.uniroma2.it/NL4AI/ (2020)
|
|
BASE
|
|
|
|
14 |
Entity Linking for Historical Documents: Challenges and Solutions
|
|
|
|
In: 22nd International Conference on Asia-Pacific Digital Libraries, ICADL 2020 ; https://hal.archives-ouvertes.fr/hal-03034492 ; 22nd International Conference on Asia-Pacific Digital Libraries, ICADL 2020, 12504, Springer, pp.215-231, 2020, Lecture Notes in Computer Science, 978-3-030-64452-9. ⟨10.1007/978-3-030-64452-9_19⟩ (2020)
|
|
BASE
|
|
|
|
15 |
Robust Named Entity Recognition and Linking on Historical Multilingual Documents
|
|
|
|
In: Conference and Labs of the Evaluation Forum (CLEF 2020) ; https://hal.archives-ouvertes.fr/hal-03026969 ; Conference and Labs of the Evaluation Forum (CLEF 2020), Sep 2020, Thessaloniki, Greece. pp.1-17, ⟨10.5281/zenodo.4068074⟩ ; http://ceur-ws.org/Vol-2696/paper_171.pdf (2020)
|
|
BASE
|
|
|
|
16 |
Linking Named Entities across Languages using Multilingual Word Embeddings
|
|
|
|
In: JCDL '20: The ACM/IEEE Joint Conference on Digital Libraries in 2020 ; ACM/IEEE Joint Conference on Digital Libraries - JCDL 2020 ; https://hal.archives-ouvertes.fr/hal-03026933 ; ACM/IEEE Joint Conference on Digital Libraries - JCDL 2020, Aug 2020, Wuhan, Hubei - Virtual event, China. pp.329-332, ⟨10.1145/3383583.3398597⟩ ; https://dl.acm.org/doi/10.1145/3383583.3398597 (2020)
|
|
BASE
|
|
|
|
17 |
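Concevoir un dispositif innovant pour professionnaliser la formation au référencement web: le projet SEO-ELP [Designing an innovative system to professionalise training in web search engine optimisation: the SEO-ELP project]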
|
|
|
|
In: ACFAS ; https://hal.archives-ouvertes.fr/hal-03506079 ; ACFAS, 2020, Sherbrooke, Canada (2020)
|
|
BASE
|
|
|
|
18 |
L'école est morte, vive l'école [School is dead, long live school]
|
|
|
|
In: Langues romanes : interaction entre littératures et cultures nationales ; https://hal.archives-ouvertes.fr/hal-03554956 ; Langues romanes : interaction entre littératures et cultures nationales, Université d’Etat de la Région de Moscou, Jun 2020, Moscow, Russia. pp.403-422 (2020)
|
|
BASE
|
|
|
|
19 |
Beyond Metadata: the New Challenges in Mining Scientific Papers
|
|
|
|
In: the Eighth workshop on Bibliometric-enhanced Information Retrieval (BIR 2019) co-located with the 41st European Conference on Information Retrieval (ECIR 2019) ; https://hal.archives-ouvertes.fr/hal-03506084 ; the Eighth workshop on Bibliometric-enhanced Information Retrieval (BIR 2019) co-located with the 41st European Conference on Information Retrieval (ECIR 2019), Apr 2019, Cologne, Germany (2019)
|
|
BASE
|
|
|
|
20 |
Semantically-driven Competitive Intelligence Information Extraction: Linguistic Model and Application
|
|
|
|
In: CONTENT ; https://hal.archives-ouvertes.fr/hal-03506081 ; CONTENT, 2019, Venice, Italy (2019)
|
|
BASE
|
|